Each MFS handles replication by 1) maintaining a list of gateways for each destination cluster and 2) distributing replication data among the gateways in a round-robin fashion. The gateway generally changes with each bucket. If all gateways are running, the MFS sends data to one gateway, then another, then another, and so on, indefinitely.
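As an illustration only, the round-robin distribution described above can be sketched in a few lines of Python. The function name `assign_buckets`, the bucket labels, and the gateway names are hypothetical and are not part of MapR; the sketch assumes all gateways are running.

```python
from itertools import cycle


def assign_buckets(buckets, gateways):
    """Pair each bucket with the next gateway in round-robin order."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    rotation = cycle(gateways)  # endlessly repeats gw1, gw2, gw3, gw1, ...
    return [(bucket, next(rotation)) for bucket in buckets]


# Hypothetical bucket and gateway names, for illustration only.
assignments = assign_buckets(["b0", "b1", "b2", "b3"], ["gw1", "gw2", "gw3"])
for bucket, gateway in assignments:
    print(bucket, "->", gateway)
```

With three gateways, the fourth bucket wraps around to the first gateway again.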
In the case of a communication error (such as a TCP timeout or hang), a gateway is blacklisted temporarily. The initial blacklist time is 3 seconds; during that time, the gateway is skipped when its turn comes up in the round-robin list. Once the blacklist time expires, the gateway is tried again at its next turn. If it fails again, the blacklist time is doubled: 3 seconds, then 6 seconds, then 12 seconds, and so on. If the gateway is still down after a retry and the blacklist interval exceeds one minute, the gateway is removed from the replication list and is not considered again until the list is updated.
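The blacklist behavior can be modeled with a small Python sketch. This is a simplified model, not MapR's implementation: the class `GatewayRotation`, its method names, and the gateway names are hypothetical. It assumes the interval is doubled first and then compared against the one-minute threshold, so a gateway is dropped on the failure that would push its interval from 48 to 96 seconds. What happens to the interval after a successful send is not described here, so the sketch leaves it out. Time is passed in as a plain number of seconds so that the logic does not depend on a real clock.

```python
INITIAL_BLACKLIST_SECS = 3.0   # initial blacklist time from the text
REMOVAL_THRESHOLD_SECS = 60.0  # "one minute" from the text


class GatewayRotation:
    """Round-robin gateway list with temporary blacklisting (illustrative)."""

    def __init__(self, gateways):
        self.gateways = list(gateways)
        self.next_index = 0
        self.blacklist_interval = {}  # gateway -> current interval (seconds)
        self.blacklisted_until = {}   # gateway -> time the blacklist expires

    def pick(self, now):
        """Return the next gateway that is not blacklisted, or None."""
        for _ in range(len(self.gateways)):
            gateway = self.gateways[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.gateways)
            if now >= self.blacklisted_until.get(gateway, 0.0):
                return gateway
        return None

    def report_failure(self, gateway, now):
        """Blacklist a gateway, doubling its interval on repeated failures."""
        previous = self.blacklist_interval.get(gateway)
        interval = INITIAL_BLACKLIST_SECS if previous is None else previous * 2
        if interval > REMOVAL_THRESHOLD_SECS:
            # Assumption: drop the gateway once the doubled interval
            # exceeds one minute; it returns only via a list update.
            index = self.gateways.index(gateway)
            self.gateways.pop(index)
            if index < self.next_index:
                self.next_index -= 1
            self.next_index %= max(len(self.gateways), 1)
            self.blacklist_interval.pop(gateway, None)
            self.blacklisted_until.pop(gateway, None)
            return
        self.blacklist_interval[gateway] = interval
        self.blacklisted_until[gateway] = now + interval


rotation = GatewayRotation(["gw1", "gw2"])
first = rotation.pick(now=0.0)            # gw1 gets the first turn
rotation.report_failure(first, now=0.0)   # gw1 is blacklisted for 3 seconds
during = rotation.pick(now=1.0)           # gw1 is skipped, gw2 is used
after = rotation.pick(now=3.0)            # blacklist expired, gw1 is retried
```

Under these assumptions, the intervals for a gateway that keeps failing are 3, 6, 12, 24, and 48 seconds, and the sixth consecutive failure removes it from the list.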
If the gateway is unreachable, MapR relies on a network-level timeout that depends on the OS configuration (typically 60 seconds) to detect the failure. If the gateway hangs, MapR relies on the MapR RPC timeout, which defaults to 300 seconds. Therefore, if a gateway node fails, replication is delayed for some time until the gateway is removed from the round-robin list.
For general information on MapR-DB gateways, see the MapR user documentation, Configuring MapR Gateways.