Question asked by dimamah on Jan 2, 2014
Latest reply on Jan 14, 2014
We are experiencing CLDB failures under moderate load in the cluster. 
This is a 9-node cluster running 

During the failure, tasks fail with errors such as:

    Could not create FileClient at com.mapr.fs.MapRFileSystem.lookupClient(
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    2014-01-02 23:48:53,7468 ERROR Client fs/client/fileclient/cc/ Thread: 140575603222272 Failed to initialize client for cluster minato, error Connection reset by peer(104)

(These are samples from different tasks)

In the ZooKeeper logs we mainly see:

    2014-01-02 23:17:33,269 - INFO  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x142a3f87f821837 type:delete cxid:0xf zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/datacenter/controlnodes/cldb/active/CLDBNodes/3029773089033243430 Error:KeeperErrorCode = NoNode for /datacenter/controlnodes/cldb/active/CLDBNodes/3029773089033243430
    2014-01-02 23:18:50,001 - INFO  [SessionTracker:ZooKeeperServer@316] - Expiring session 0x43494ec2cf01c0, timeout of 30000ms exceeded

[zookeeper1 log][1] 
[zookeeper2 log][2] 
[zookeeper9 log][3] 

In the CLDB [log][4] we see entries like:

    2014-01-02 23:12:13,461 WARN Topology [RPC-200]: FileSever on minato-03 reported an invalid topology . Ignoring reported topology
    2014-01-02 23:12:31,314 WARN Alarms [HB-12]: Alarm raised: NODE_ALARM_TIME_SKEW; Cluster:; Node: minato-04; Message: Clock skew of 29 seconds
    2014-01-02 23:12:36,151 INFO Containers [CLDB-1]: ContainerFailure reported by FileServer for container 1 on StoragePool a38deeaa94c426d700514988a3078d12 on failed fileserver
    2014-01-02 23:12:36,842 INFO ZooKeeperClient [CLDB-1]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Master: Servers: Inactive: Unused:  Epoch:51 SizeMB:0
    2014-01-02 23:12:45,164 FATAL CLDB [CLDB-2]: CLDBShutdown: FileServer reported failure of star replicated container 1 on StoragePool a38deeaa94c426d700514988a3078d12 on server This will cause the CLDB volume's replication (1) to go below the min replication factor (2).
    2014-01-02 23:12:45,164 INFO CLDBServer [CLDB-2]: Shutdown: Stopping CLDB

Here is the [log][5] from the MFS on the CLDB node.
Can you please help us identify the problem?
