This is a very common alarm that most administrators will come across during the lifetime of a MapR cluster. What should you know about it?
1. Purpose of MapRFS heartbeats
When MFS processes start, they search for a master CLDB, register with it, and then begin to send regular heartbeats detailing various internal state. The CLDB processes the heartbeats and sends responses back to MFS that may contain important instructions for the MFS process, such as an instruction to create a new replica of an under-replicated container. Many important functions in MapRFS rely on heartbeats between the CLDB and MFS, and some of these functions are time sensitive.
When MFS and CLDB processes do not exchange heartbeats in a timely manner, the CLDB may need to discard the information sent by an MFS process because it is too old to consider, or the CLDB may time out instructions it sent to MFS and reassign the work (or other similar work) to other nodes. These are just some examples; the point is that timely heartbeating between MFS and CLDB processes is important.
2. Implementation of MapRFS heartbeats
To exchange a single heartbeat, the following steps are executed:
- The heartbeat thread in MFS prepares a heartbeat request to send to the CLDB and places it in the RPC TX queue, timestamp 1 is taken
- The RPC thread in MFS transmits the request through the TCP socket to the CLDB (after first transmitting any earlier RPCs that were ahead of it in the queue)
- The Linux kernel and NIC driver on the MFS node transmit the packets to the network
- The Linux kernel and NIC driver on the CLDB node receive the packets from the network and pass them to the TCP socket owned by the CLDB process
- The RPC thread in CLDB receives the heartbeat request and dispatches a heartbeat processing work unit to a work unit queue
- A worker thread from a thread pool in the CLDB retrieves the work unit (after other worker threads have retrieved all other work units that were ahead of it in the queue)
- The worker thread processes the heartbeat request, generates a heartbeat response, and places it in the RPC TX queue, timestamp 2 is taken
- The RPC thread in CLDB transmits the response through the TCP socket to the remote MFS process (after first transmitting any earlier RPCs that were ahead of it in the queue)
- The packets are passed again through Linux kernel/NIC driver/network
- The RPC thread in MFS receives the heartbeat response and passes it to the heartbeat thread
- The MFS heartbeat thread processes the response
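The timestamp capture points in the flow above can be sketched in simplified form as follows. This is an illustrative single-process Python model, not MapR's actual internals: the function names, the dict-based messages, and the queue structure are all hypothetical, and the network/kernel steps are elided.

```python
import queue
import time

def mfs_prepare_heartbeat(tx_queue):
    """MFS heartbeat thread: build the request and capture timestamp 1."""
    timestamp_1 = time.monotonic()          # timestamp 1: taken as the request is queued
    request = {"type": "heartbeat", "ts1": timestamp_1}
    tx_queue.put(request)                   # RPC TX queue; the RPC thread drains it
    return timestamp_1

def cldb_process_heartbeat(request, tx_queue):
    """CLDB worker thread: process the request and capture timestamp 2."""
    response = {"type": "heartbeat-ack", "echo_ts1": request["ts1"]}
    timestamp_2 = time.monotonic()          # timestamp 2: taken as the response is queued
    response["ts2"] = timestamp_2
    tx_queue.put(response)
    return timestamp_2

# Simulate one exchange (kernel, NIC, and network steps are elided here).
mfs_tx, cldb_tx = queue.Queue(), queue.Queue()
ts1 = mfs_prepare_heartbeat(mfs_tx)
ts2 = cldb_process_heartbeat(mfs_tx.get(), cldb_tx)
```

Any queuing delay, scheduling delay, or network delay between these two capture points stretches the gap between successive timestamps, which is what the alarm logic in the next section measures.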
3. Implementation of slow heartbeat processing alarm
If the elapsed time between the timestamps of sequential heartbeat exchanges (the timestamps mentioned in the previous section) exceeds a threshold (default 5 seconds), then the slow heartbeat processing alarm is raised for the node from which the heartbeat originated.
For instance, when MFS is preparing the heartbeat request, it will take timestamp 1 as mentioned in the previous section and include that timestamp in the heartbeat sent to the CLDB; let's call this timestamp 1a. When MFS prepares the subsequent heartbeat request, it will again take timestamp 1 and include it in the heartbeat request; let's call this timestamp 1b.
If the elapsed time between timestamp 1a and 1b exceeds the threshold (5 seconds), then the slow heartbeat processing alarm is raised.
The same behavior is applied to timestamp 2 in the heartbeat responses generated by the CLDB.
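The check itself amounts to comparing successive timestamps against the threshold. A minimal sketch follows; the 5-second default comes from the text above, while the function and constant names are hypothetical:

```python
SLOW_HB_THRESHOLD_SECS = 5.0   # default threshold described above

def check_slow_heartbeat(prev_ts, curr_ts, threshold=SLOW_HB_THRESHOLD_SECS):
    """Return the delay in seconds if two sequential timestamps (e.g. 1a and 1b)
    are further apart than the threshold, else None (no alarm)."""
    delay = curr_ts - prev_ts
    return delay if delay > threshold else None

# Timestamp 1a at t=100.0 and timestamp 1b at t=109.0: a 9-second gap -> alarm.
slow = check_slow_heartbeat(100.0, 109.0)
# Timestamp 1a at t=100.0 and timestamp 1b at t=101.0: a 1-second gap -> no alarm.
fast = check_slow_heartbeat(100.0, 101.0)
```

The same comparison applies independently to the timestamp-2 values taken on the CLDB side.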
Once the heartbeat alarm is raised for a node, it remains raised indefinitely until an administrator clears it. A raised alarm indicates that heartbeat processing was slow for a period of time leading up to the point at which the alarm was raised. If you are looking at a cluster and see the alarm raised, it indicates heartbeat processing *was* slow; it does not necessarily indicate that heartbeat processing is *still* slow.
Once the administrator clears the alarm manually, the CLDB will raise it again if sequential heartbeats show long elapsed times once more. Thus, if you clear the alarm and do not see it return for a significant period of time, the condition that delayed execution of the heartbeat code was transient.
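The latching behavior described above can be modeled as a simple sticky flag: the CLDB sets it when a delay exceeds the threshold, and only an explicit administrator action resets it. This is a hypothetical Python model for illustration, not MapR code:

```python
class SlowHeartbeatAlarm:
    """Sticky per-node alarm: raised by the CLDB, cleared only by an admin."""

    def __init__(self):
        self.raised = False

    def observe_delay(self, delay_secs, threshold=5.0):
        if delay_secs > threshold:
            self.raised = True      # latches; later fast heartbeats do not clear it

    def admin_clear(self):
        self.raised = False         # only explicit clearing resets the alarm

alarm = SlowHeartbeatAlarm()
alarm.observe_delay(9.0)    # slow period -> alarm raised
alarm.observe_delay(0.5)    # heartbeats fast again, but the alarm stays raised
```

After `admin_clear()`, any new delay above the threshold would raise the alarm again, matching the behavior described above.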
4. Causes of slow heartbeat processing alarms
As detailed in section 2, there are multiple steps in the heartbeat processing code in MapRFS. If any of these steps is unable to complete its work in a timely manner, the slow heartbeat processing alarm can be raised.
For instance, if a node where MFS runs begins to page memory aggressively to/from disk then the kernel may be unable to provide one or more of the threads in the MFS process with sufficient running time, resulting in a delay in execution of the heartbeat code.
As another example, if packet loss is occurring over the network and multiple TCP retransmissions are required to successfully deliver the heartbeat requests/responses between MFS and CLDB, or if the network is heavily loaded and packets cannot be delivered in a timely manner, then heartbeat alarms may be raised.
As another example, if the JVM running the CLDB process reaches its heap limit and needs to perform a full garbage collection over a large amount of dirty heap space, then the threads in the CLDB process may be paused by the JVM for an extended period of time, resulting in heartbeat alarms.
Essentially, any delay affecting execution of any part of the heartbeat code for > 5 seconds can result in the heartbeat alarm being raised.
In general, resource contention or hardware problems are the root cause in nearly all instances where the heartbeat alarm is raised in customer environments.
5. Implications of slow heartbeat processing alarms
In the event that heartbeats between MFS and CLDB processes are not exchanged in a timely manner, the severity of the resulting problems tends to depend upon:
- The number of MFS processes affected by the slow heartbeat processing
- The duration of the slow heartbeat processing
When the slow heartbeat alarm is raised for a single MFS process/node, it typically indicates a problem with that specific node (such as aggressive memory paging). When the slow heartbeat alarm is raised for multiple MFS processes/nodes at nearly the same time (e.g. within 5 seconds or so), it typically indicates a problem with the CLDB (such as CLDB JVM garbage collection).
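This rule of thumb can be expressed as a small triage helper: if several nodes raised the alarm within a short window, suspect the CLDB side; if only one node raised it, suspect that node. The sketch below is a hypothetical diagnostic aid (the function name and return strings are illustrative; the 5-second window follows the text above):

```python
def suspect_component(alarm_times, window_secs=5.0):
    """alarm_times: list of (node, epoch_secs) pairs for raised slow-heartbeat alarms.
    Returns a rough guess at where the problem lies."""
    if len(alarm_times) <= 1:
        return "node-local problem (e.g. memory paging on that node)"
    times = sorted(t for _, t in alarm_times)
    if times[-1] - times[0] <= window_secs:
        # Many nodes affected at nearly the same time -> shared dependency.
        return "CLDB-side problem (e.g. CLDB JVM garbage collection)"
    return "multiple independent node-local problems"

single = suspect_component([("node1", 1000.0)])
clustered = suspect_component([("node1", 1000.0), ("node2", 1002.0)])
```

This is only a first-pass heuristic; the per-node evidence (paging, GC logs, network errors) described elsewhere in this article is what confirms the diagnosis.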
In the message text alongside the slow heartbeat processing alarm, you are provided with a string like:
Heartbeat processing is slow. Delay of 9:1 seconds
This message indicates that the time between sequential MFS heartbeat requests was 9 seconds, and the time between sequential CLDB heartbeat responses was 1 second. Whichever of these numbers is higher indicates the total duration of time for which the heartbeat code did not run in a timely manner. In this case, it indicates that there was a 9 second period during which the heartbeat code had a problem.
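The two numbers can be pulled out of the message programmatically when triaging many alarms at once. A small parser sketch, assuming the message format shown above (the function name is illustrative):

```python
import re

def parse_slow_hb_message(message):
    """Extract (mfs_delay_secs, cldb_delay_secs) from the alarm message text."""
    m = re.search(r"Delay of (\d+):(\d+) seconds", message)
    if not m:
        raise ValueError("unrecognized alarm message: " + message)
    return int(m.group(1)), int(m.group(2))

mfs_delay, cldb_delay = parse_slow_hb_message(
    "Heartbeat processing is slow. Delay of 9:1 seconds")
# The larger value is the total duration of delayed heartbeat processing.
total_delay = max(mfs_delay, cldb_delay)
```

Here `total_delay` is 9, pointing at the MFS side of the exchange as the slow half.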
If no other nodes had the heartbeat alarm raised around the time at which this heartbeat alarm was raised, the problem is likely related to the MFS process and you may see, for instance, that during that 9 second period there was heavy paging to/from swap space on that MFS node. You might also be able to observe that memory usage by map/reduce tasks running on that node spiked in the moments leading up to that 9 second period and then subsided around the time the 9 second period ended.