How to diagnose slow performance of mirroring in MapR


Author: Jonathan Bubier

 

Original Publication Date: November 12, 2014

 

MapR provides the capability to mirror a volume from one MapR cluster to a remote cluster to provide a backup for disaster recovery purposes.  In the event of a total loss of function in the source cluster, the mirrored dataset can be recovered from the mirror volume in the remote cluster.  The operation to sync the remote volume with the active volume in the local cluster is asynchronous and is triggered on a user-defined schedule or initiated manually.  Depending on a number of factors, including but not limited to the number of nodes in each cluster, the hardware capabilities of each node, the network capacity between the clusters, and the amount of data to be replicated, a mirror operation may take longer than expected.  This article discusses areas to investigate when mirroring does not complete in the expected amount of time, to determine whether performance can be improved.
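For reference, a mirror can be initiated manually from the command line.  The following is a minimal sketch using maprcli, assuming a mirror volume named 'mirror' as in the log examples below:

# maprcli volume mirror start -name mirror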

 

1. Identify mirror timeframe

 

First, it is important to establish an accurate timeline of the mirror operation.  The CLDB log on the master CLDB node in the destination cluster contains detailed information about the mirror operation from the moment it is initiated.  The following snippet from /opt/mapr/logs/cldb.log on the destination cluster's master CLDB node shows the logging for a complete mirror operation.

 

2014-11-17 13:58:30,845 INFO VolumeMirror [RPC-7]: Starting mirror of volume mirror, cluster my.cluster.com volumeId 253062672, mirrortype MIRROR_TYPE_LIVE, rollforward post mirror true, snap expiry time 0, dest creds uid 0, src creds uid 0 binding 140585419956720 restoreSessionId 0

2014-11-17 13:58:30,872 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Read mirror status of volume mirror is STATE_MIRROR_COMPLETE

2014-11-17 13:58:30,899 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Got the list of containers in the destination volumemirror count 1 number of master nodes for volume 1

2014-11-17 13:58:30,901 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirrorStarting mirror op STATE_UPDATE

2014-11-17 13:58:30,929 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 13:58:30,929 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Update mirror status of volume mirror@my.cluster.com newstate STATE_MIRROR_STARTED mirrorid 1 nextMirrorId 2 completed successfully. srcsnapshot name destsnap name destsnap id 0 srcsnap id 0

2014-11-17 13:58:32,043 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Creating source snapshot mirrorsrcsnap.17-Nov-2014-13-58-32 of volume mirrorsrc@ubuntu311Cluster src vol id 123125888 src root cid 2315 for mirroring of mirror@my.cluster.comenabled featurelist - mfs.feature.devicefile.support, mfs.feature.tables, mfs.feature.db.spillv2.support, mfs.feature.db.ace.support, mfs.feature.db.regionmerge.support, mfs.feature.filecipherbit.support, mfs.feature.db.bulkload.support,

2014-11-17 13:58:36,335 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Created Source volume snapshotmirrorsrcsnap.17-Nov-2014-13-58-32 snapshotId 256000058 for src volume mirrorsrc@ubuntu311Cluster mirror volume mirror@my.cluster.com dataSrcSnapCreateTimeMillis 1416261512053

2014-11-17 13:58:37,435 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Creating destination snapshot mirrorsnap.17-Nov-2014-13-58-37 of volume mirror@my.cluster.com dest vol id 253062672 dest rw cid 11780 snap expiry time 0

2014-11-17 13:58:38,800 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Created destination volume snapshotmirrorsnap.17-Nov-2014-13-58-37 snapshotId 256000051 for src volume mirrorsrc@ubuntu311Cluster mirror volume mirror@my.cluster.com retry count 0

2014-11-17 13:58:38,814 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Got the list of containers in the source snapshot for mirorring of mirror@my.cluster.com src snapshot name mirrorsrcsnap.17-Nov-2014-13-58-32 count 6

2014-11-17 13:58:38,815 INFO CLDBServer [RPC-6]: VolumeUpdate: VolName: mirrorStarting mirror op STATE_UPDATE

2014-11-17 13:58:38,815 INFO CLDBServer [RPC-6]: Saving the src vol snapshot id 256000058 src vol snapshot name mirrorsrcsnap.17-Nov-2014-13-58-32

2014-11-17 13:58:38,842 INFO CLDBServer [RPC-6]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 13:58:38,842 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Update mirror status of volume mirror@my.cluster.com newstate STATE_MIRROR_RESYNC_INPROGRESS mirrorid 1 nextMirrorId 2 completed successfully. srcsnapshot name mirrorsrcsnap.17-Nov-2014-13-58-32 destsnap name mirrorsnap.17-Nov-2014-13-58-37 destsnap id 256000051 srcsnap id 256000058

2014-11-17 13:58:38,909 INFO CLDBServer [RPC-7]: ContainerCreate Container ID:11781 vol:253062672 Servers: 10.10.70.163 10.10.70.165 10.10.70.162

2014-11-17 13:58:38,963 INFO CLDBServer [RPC-3]: ContainerCreate Container ID:11782 vol:253062672 Servers: 10.10.70.164 10.10.70.163 10.10.70.165

2014-11-17 13:58:39,013 INFO CLDBServer [RPC-2]: ContainerCreate Container ID:11783 vol:253062672 Servers: 10.10.70.162 10.10.70.164 10.10.70.163

2014-11-17 13:58:39,076 INFO CLDBServer [RPC-10]: ContainerCreate Container ID:11784 vol:253062672 Servers: 10.10.70.165 10.10.70.162 10.10.70.164

2014-11-17 13:58:39,130 INFO CLDBServer [RPC-9]: ContainerCreate Container ID:11785 vol:253062672 Servers: 10.10.70.163 10.10.70.165 10.10.70.162

2014-11-17 13:58:39,172 INFO CLDBServer [RPC-8]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 13:58:39,239 INFO CLDBServer [RPC-8]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 13:59:10,244 INFO CLDBServer [RPC-7]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 13:59:10,278 INFO CLDBServer [RPC-7]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 13:59:41,280 INFO CLDBServer [RPC-4]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 13:59:41,347 INFO CLDBServer [RPC-4]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:00:12,349 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 14:00:12,384 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:00:43,386 INFO CLDBServer [RPC-5]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 14:00:43,421 INFO CLDBServer [RPC-5]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:01:14,422 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Completed resync of all containers for volume mirror@my.cluster.com

2014-11-17 14:01:14,423 INFO CLDBServer [RPC-8]: VolumeUpdate: VolName: mirrorStarting mirror op STATUS_UPDATE

2014-11-17 14:01:14,459 INFO CLDBServer [RPC-8]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:01:14,460 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirrorStarting mirror op STATE_UPDATE

2014-11-17 14:01:14,476 INFO CLDBServer [RPC-10]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:01:14,477 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Update mirror status of volume mirror@my.cluster.com newstate STATE_MIRROR_ROLLFORWARD_INPROGRESS mirrorid 1 nextMirrorId 2 completed successfully. srcsnapshot name mirrorsrcsnap.17-Nov-2014-13-58-32 destsnap name mirrorsnap.17-Nov-2014-13-58-37 destsnap id 256000051 srcsnap id 256000058

2014-11-17 14:01:14,764 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Completed rollforward of all containers for volume mirror@my.cluster.com

2014-11-17 14:01:14,814 INFO CLDBServer [RPC-6]: VolumeUpdate: VolName: mirrorStarting mirror op STATE_UPDATE

2014-11-17 14:01:14,844 INFO CLDBServer [RPC-6]: VolumeUpdate: VolName: mirror Volume Properties updated

2014-11-17 14:01:14,844 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Update mirror status of volume mirror@my.cluster.com newstate STATE_MIRROR_COMPLETE mirrorid 1 nextMirrorId 2 completed successfully. srcsnapshot name destsnap name mirrorsnap.17-Nov-2014-13-58-37 destsnap id 256000051 srcsnap id 0

2014-11-17 14:01:14,844 INFO VolumeMirrorInfo [VolumeMirrorThread0]: Mirroring successfully completed for volume mirror@my.cluster.com From mirrorsrc@ubuntu311Cluster

By comparing the timestamps of the first and last log messages above we can determine the timeframe of the mirror operation.  In this case the mirror took approximately 2 minutes and 44 seconds.
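To bracket this window quickly, the start and completion messages can be pulled from the CLDB log in one step.  A minimal sketch, assuming the message text and default log location shown above; substitute your mirror volume name as needed:

# grep -E "Starting mirror of volume|Mirroring successfully completed" /opt/mapr/logs/cldb.log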

 

2. Review MFS logs

 

To get additional detail about the activity during this time, first review the MFS logs on each node in the destination cluster.  The log messages below indicate when the individual container resync operations are initiated after the mirror operation is started.  These can be found in /opt/mapr/logs/mfs.log-3 on each node in the destination cluster.  This check is important to confirm that all containers start resyncing with their counterparts on the source cluster without delay after the mirror is started.  A significant delay in starting a container resync operation could indicate a software issue in MFS or a resource constraint that is impacting the performance of MFS.

 

[root@rhl63-n1 ~]# grep "Triggering resync srcnode" /opt/mapr/logs/mfs.log-3

2014-11-17 13:58:39,1709 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 1852088569723883781, 10.10.80.176:5660, container 11783 for mirror id 2 srccid 256000068

[root@rhl63-n2 ~]# grep "Triggering resync srcnode" /opt/mapr/logs/mfs.log-3

2014-11-17 13:58:39,1688 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 1852088569723883781, 10.10.80.176:5660, container 11781 for mirror id 2 srccid 256000069

2014-11-17 13:58:39,1720 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 1852088569723883781, 10.10.80.176:5660, container 11785 for mirror id 2 srccid 256000066

[root@rhl63-n3 ~]# grep "Triggering resync srcnode" /opt/mapr/logs/mfs.log-3

2014-11-17 13:58:39,1694 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 1852088569723883781, 10.10.80.176:5660, container 11782 for mirror id 2 srccid 256000067

[root@rhl63-n4 ~]# grep "Triggering resync srcnode" /opt/mapr/logs/mfs.log-3

2014-11-17 13:58:39,1648 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 3331203719309336710, 10.10.80.177:5660, container 11780 for mirror id 2 srccid 256000064

2014-11-17 13:58:39,1715 INFO containerrestore.cc:4823 Triggering resync srcnode FSID 1852088569723883781, 10.10.80.176:5660, container 11784 for mirror id 2 srccid 256000065

With a timeline established and the initiation of the container resync operations confirmed, it is important to review the duration of the individual container operations.  Using the output above and focusing on CID 11783, the following is from /opt/mapr/logs/mfs.log-3 on 10.10.80.176 in the source cluster:

 

# grep 11783 /opt/mapr/logs/mfs.log-3

2014-11-17 13:58:39,3844 INFO containerresync.cc:176 CONTAINER_RESYNC_START -- destnode FSID 757066042881826920, 10.10.70.162:5660, , srccid 256000068, replicacid 11783, wa 0x6b090000, resyncwacount 8 replica txnvn 0:0 writevn 0:0 snapvn 0:0 chainSeqNumber 0 destRWMirrorId 0 destCloneMirrorId 0, destVolumeId 253062672, metaDataOnly 0, replica compression type 3

2014-11-17 13:58:39,3844 INFO containerresync.cc:822 Resync Container srccid 256000068 replicacid 11783 destnode 10.10.70.162:5660undoneeded 0 undotxnvn 0 undowritevn 0 undosnapvn 0 redotxnvn 0 redowritevn 0 redosnapvn 0

2014-11-17 13:58:39,3844 INFO containerresync.cc:845 Converting the resync to full resync because VNs are all zero for srccid 256000068, replicacid 11783, destnode 10.10.70.162:5660

2014-11-17 13:58:39,3844 INFO containerresync.cc:2342 Redoing replica cid 11783 remotenode 10.10.70.162:5660 from txnvn:0 snapvn:0 writevn:0 master end txnVN 1049239 end snapVN 32 writeVN 1049239

2014-11-17 13:58:39,3844 INFO containerresync.cc:2688 Resync of cid 11783 snapcontainer 256000068 prevcid 0 prevresynccid 0

2014-11-17 13:58:39,3844 INFO containerresyncfromsnapshot.cc:112 CONTAINER_RESYNC_FROM_SNAPSHOT_START -- from cid 256000068 replica 11783 txnVN 0 snapVN 0 writeVN 0 isundo 0, rollforwardcontainer 0 dumpSnapshotInode 0, netowrkthrottleenabled 1, diskthrottleenabled 1

2014-11-17 13:59:48,1913 INFO containerresyncfromsnapshot.cc:721 ResyncFromSnapshot resyncing versioninfo srccid:256000068 replicacid:11783TxnVN 1049239:1049239 WriteVN 1049239:1049239 snapVN 32:32 maxUniq 131222, maxinum 255

2014-11-17 13:59:48,4882 INFO containerresyncfromsnapshot.cc:1019 CONTAINER_RESYNC_FROM_SNAPSHOT_END-- from cid 256000068, replica 11783, time 69104 ms, numInodes 15, numBytes 86904577, numMsgs 1338, err 0

2014-11-17 13:59:48,4882 INFO containerresync.cc:2823 Resync from snapshot completed srccid:256000068 replicacid:11783 resynccid:256000068 err 0 svderr 0

2014-11-17 13:59:48,4896 INFO containerresync.cc:3631 ResyncContainer complete srccid 256000068 replicacid 11783 err 0x0

2014-11-17 13:59:48,4896 INFO containerresync.cc:3013 CONTAINER_RESYNC_END -- destnode FSID 757066042881826920, 10.10.70.162:5660, srccid 256000068, replicacid 11783, time 69105 ms, numIterations 1, numInodes 15, numMsgs 1338, numBytes 86904577, err 0

In the 'CONTAINER_RESYNC_END' message, note the 'time' and 'numBytes' fields to get an approximation of how much data was transferred for the resync operation and how long it took.  In this example the resync transferred 86904577 bytes in 69105 ms, roughly 1.2 MB/s.  Searching for this message across all nodes in the source cluster helps identify any resync operations that took an extended period of time, along with the performance of each individual operation.  It is very common for at least one container resync operation identified this way to have a duration close to the entire duration of the mirror operation.
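To survey all resync durations at once, the 'CONTAINER_RESYNC_END' messages can be reduced to their key fields.  A rough sketch to run on each node in the source cluster; the field layout assumed here matches the example message above:

# grep "CONTAINER_RESYNC_END" /opt/mapr/logs/mfs.log-3 | sed 's/.*replicacid \([0-9]*\), time \([0-9]*\) ms.*numBytes \([0-9]*\).*/replicacid=\1 time_ms=\2 bytes=\3/'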

 

3. Network validation

 

Once the duration of the mirror operation is known and slower-than-expected container resync operations have been identified, one important factor to check is the network environment between the two clusters.  It is important to verify that the network has sufficient capacity for the mirror operation, specifically the individual container resync operations.  If the network resources are already constrained, it is unlikely that the mirror operation will see acceptable performance.

 

There are multiple ways to test the network capabilities between nodes in separate MapR clusters.  One such method is the iperf utility.  iperf is run in client mode on one node and server mode on another; the client transmits data to the server, and the measured performance is reported back to the client.  The following is an example of running iperf in client mode against the host host1.domain.prv, using UDP (-u) with a target bandwidth of 3000 Mbit/s (-b 3000M).

 

# iperf -c host1.domain.prv -u -b 3000M

------------------------------------------------------------

Client connecting to host1.domain.prv, UDP port 5001

Sending 1470 byte datagrams

UDP buffer size: 1.00 MByte (default)

------------------------------------------------------------

[ 3] local 10.26.201.101 port 60846 connected with 10.10.80.176 port 5001

[ ID] Interval Transfer Bandwidth

[ 3] 0.0-10.0 sec 4.56 GBytes 3.92 Gbits/sec

[ 3] Sent 3333070 datagrams

[ 3] Server Report:

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams

[ 3] 0.0-10.0 sec 4.56 GBytes 3.92 Gbits/sec 0.003 ms 0/3333070 (0%)

Note that mirroring in MapR transfers data between individual MFS nodes in the two clusters, not between CLDB nodes.  Accordingly, the network should be tested between MFS nodes in the separate clusters to validate the available network bandwidth.  For more information on iperf and its corresponding options, consult the iperf manpage.
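The client example above assumes a corresponding server is listening.  As a minimal sketch, start the UDP listener on an MFS node in the remote cluster before running the client command:

# iperf -s -u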

 

Another method for network testing is the rpctest utility that ships with MapR under /opt/mapr/server/tools/.  This utility is similar to iperf in that it runs in a client or server mode and transmits a defined amount of data from the client to the server.  The main difference between rpctest and iperf is that rpctest uses MapR's RPC layer for data transmission, so it can expose network issues specific to MapR that would not be seen using standard tools like iperf.  The following is an example of using rpctest in client mode to transmit 3000 MB of data to the host host1.domain.prv.

 

# /opt/mapr/server/tools/rpctest -client 3000 host1.domain.prv

15:19:41 rpcs 1794, mb 117.0

15:19:42 rpcs 1791, mb 117.4

15:19:43 rpcs 1795, mb 117.6

...

15:20:04 rpcs 1791, mb 117.4

15:20:05 rpcs 1791, mb 117.4

15:20:06 rpcs 1775, mb 116.3

Rate: 116.92 mb/s, time: 26.9089 sec, #rpcs 48007, rpcs/sec 1784.06
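Here too the server side must be started first on the target host.  A sketch, assuming the tool's server mode; run this on host1.domain.prv before the client command above:

# /opt/mapr/server/tools/rpctest -server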

If low transfer rates are seen with both iperf and rpctest, consult with your network infrastructure team to determine whether the network environment between the MapR clusters has sufficient available resources.  If low transfer rates are seen only with rpctest while iperf shows significantly better performance, consult MapR support to investigate further.

 

4. Kernel TCP settings

 

Depending on the characteristics of the network between the MapR clusters, it may be necessary to modify certain kernel parameters to use the network more efficiently.  For example, on high-latency WANs the default settings for the TCP read and write socket buffers are typically too low to maximize network throughput.  During installation of the mapr-core package from MapR these settings are set as follows:

 

net.core.wmem_max = 4194304

net.core.rmem_max = 4194304

net.core.wmem_default = 1048576

net.core.rmem_default = 1048576

net.ipv4.tcp_wmem = 4096 1048576 4194304

net.ipv4.tcp_rmem = 4096 1048576 4194304
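To read back the values currently in effect on a node, sysctl can query them directly:

# sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem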

Focusing on net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, these values indicate a low threshold of 4KB, a default of 1MB, and a high threshold of 4MB.  These settings work well on a local network and on a low-latency WAN but are not optimal for a high-latency WAN.  It is recommended to set these values higher, for example a low threshold of 128KB, a default of 4MB, and a high threshold of 16MB, as shown below.  The settings can be updated in /etc/sysctl.conf and refreshed using 'sysctl -p', or updated directly using 'sysctl -w'.  These settings should be put in place on all nodes in both clusters to ensure optimal utilization of the network when transmitting over a high-latency WAN.
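As a sketch, the example thresholds above translate into the following /etc/sysctl.conf entries (values in bytes; raising the net.core maximums to match is an assumption, so that explicitly sized socket buffers are not capped).  Apply with 'sysctl -p' on every node in both clusters:

net.core.wmem_max = 16777216

net.core.rmem_max = 16777216

net.ipv4.tcp_wmem = 131072 4194304 16777216

net.ipv4.tcp_rmem = 131072 4194304 16777216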

After updating the kernel settings on the source and destination cluster nodes, it is a good idea to retest with iperf and rpctest as described in #3 above to verify the result.  Note that due to a known issue in MapR v4.0.1 and earlier, the modified kernel settings will not improve the performance of rpctest, as its TCP buffer sizes are hardcoded.  This issue will be addressed in the MapR v4.0.2 release.

 

 

5. MFS network throttle

 

Once the network has been properly validated as having sufficient capacity for the resync operations, it may be necessary to reduce the effect of network throttling in MFS.  By default the network throttle in MFS takes into account the round-trip time between resync nodes and a defined factor, mfs.network.resynciothrottle.factor, which can be used to increase or decrease the effect of throttling.  This throttle takes effect on the source of the resync operation and is used to avoid degrading client access to MapR-FS when there is significant replication traffic between MFS nodes.

 

The effect of network throttling is controlled by the value of mfs.network.resynciothrottle.factor, which is set in /opt/mapr/conf/mfs.conf.  This parameter defaults to '20' and can be overridden by setting a new value in mfs.conf.  There is additional information on the disk throttle and network throttle in MFS at the following page: http://doc.mapr.com/display/MapR/mfs.conf.  The network throttle's effect is inversely proportional to the value of mfs.network.resynciothrottle.factor, so the effect can be decreased by increasing the value in mfs.conf and restarting MFS.

Consider setting this parameter to '100' in mfs.conf on all MFS nodes in the source cluster, restarting MFS on all of those nodes, and allowing the mirror to rerun on its normal schedule.  After the mirror completes, compare the execution time and the resync details using the information provided in #1 and #2 above.  If the slower-than-expected performance of resync operations was caused by network throttling on the source cluster, performance should be significantly better after increasing mfs.network.resynciothrottle.factor.
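In mfs.conf this is a single line.  A sketch of the override described above; edit or add the line on each source-cluster MFS node, then restart MFS for it to take effect:

mfs.network.resynciothrottle.factor=100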

If the performance of mirroring in MapR is still lower than expected after reviewing the above items, consult MapR support to investigate further.

 

 
