Unexpected CLDB failure

Question asked by dimamah on Jan 2, 2014
Latest reply on Jan 14, 2014 by dimamah
We are experiencing CLDB failures under moderate load on our cluster. 
This is a 9-node cluster running 2.1.2.18401.GA. 

During the failure, tasks fail with errors such as: 

    java.io.IOException: Could not create FileClient at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:250)
    
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    
    2014-01-02 23:48:53,7468 ERROR Client fs/client/fileclient/cc/client.cc:304 Thread: 140575603222272 Failed to initialize client for cluster minato, error Connection reset by peer(104)

(These are samples from different tasks)

In the ZooKeeper logs we mainly see: 

    2014-01-02 23:17:33,269 - INFO  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x142a3f87f821837 type:delete cxid:0xf zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/datacenter/controlnodes/cldb/active/CLDBNodes/3029773089033243430 Error:KeeperErrorCode = NoNode for /datacenter/controlnodes/cldb/active/CLDBNodes/3029773089033243430
    2014-01-02 23:18:50,001 - INFO  [SessionTracker:ZooKeeperServer@316] - Expiring session 0x43494ec2cf01c0, timeout of 30000ms exceeded

[zookeeper1 log][1] 
[zookeeper2 log][2] 
[zookeeper9 log][3] 

In the CLDB [log][4] we see entries like: 

    2014-01-02 23:12:13,461 WARN Topology [RPC-200]: FileSever on minato-03 reported an invalid topology . Ignoring reported topology
    2014-01-02 23:12:31,314 WARN Alarms [HB-12]: Alarm raised: NODE_ALARM_TIME_SKEW; Cluster: my.cluster.com; Node: minato-04; Message: Clock skew of 29 seconds
    2014-01-02 23:12:36,151 INFO Containers [CLDB-1]: ContainerFailure reported by FileServer 10.20.40.181(2) for container 1 on StoragePool a38deeaa94c426d700514988a3078d12 on failed fileserver 10.20.40.186(2)
    2014-01-02 23:12:36,842 INFO ZooKeeperClient [CLDB-1]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Master:10.20.40.181(2)-51(3029773089033243430) Servers:  10.20.40.181(2)-51(3029773089033243430) 10.20.40.184(2)-51(5266422388232269765) 10.20.40.187(2)-51(4050572479563766579) 10.20.40.188(2)-51(3255864519558145688) 10.20.40.185(2)-51(5221447176002153173) Inactive:  10.20.40.186(2)-50(7878517535932714013) Unused:  Epoch:51 SizeMB:0
    2014-01-02 23:12:45,164 FATAL CLDB [CLDB-2]: CLDBShutdown: FileServer 10.20.40.181-10.20.41.181- reported failure of star replicated container 1 on StoragePool a38deeaa94c426d700514988a3078d12 on server 10.20.40.184-10.20.41.184-. This will cause the CLDB volume's replication (1) to go below the min replication factor (2).
    2014-01-02 23:12:45,164 INFO CLDBServer [CLDB-2]: Shutdown: Stopping CLDB
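
To make the sequence leading up to the shutdown easier to follow, here is a minimal sketch of how the WARN and FATAL entries could be pulled out of a CLDB log. It assumes every line follows the `timestamp LEVEL Component [thread]: message` shape seen in the excerpt above; the function name `filter_cldb_lines` and the regular expression are illustrative and not part of any MapR tooling, and the sample lines are shortened copies of the entries above.

```python
import re

# Assumed line shape, based only on the excerpt above:
# "2014-01-02 23:12:45,164 FATAL CLDB [CLDB-2]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+)\s+"
    r"\[(?P<thread>[^\]]+)\]:\s*(?P<msg>.*)$"
)

def filter_cldb_lines(lines, levels=("WARN", "ERROR", "FATAL")):
    """Return (timestamp, level, component, message) for lines at the given levels."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Lines that do not match the assumed shape are skipped silently.
        if m and m.group("level") in levels:
            hits.append((m.group("ts"), m.group("level"),
                         m.group("component"), m.group("msg")))
    return hits

# Shortened copies of three entries from the excerpt above.
sample = [
    "2014-01-02 23:12:31,314 WARN Alarms [HB-12]: Alarm raised: NODE_ALARM_TIME_SKEW; Node: minato-04; Message: Clock skew of 29 seconds",
    "2014-01-02 23:12:36,151 INFO Containers [CLDB-1]: ContainerFailure reported by FileServer 10.20.40.181(2) for container 1",
    "2014-01-02 23:12:45,164 FATAL CLDB [CLDB-2]: CLDBShutdown: FileServer 10.20.40.181-10.20.41.181- reported failure of star replicated container 1",
]

for ts, level, component, msg in filter_cldb_lines(sample):
    print(ts, level, component, msg[:60])
```

On a full log, the same function can be fed an open file object instead of the `sample` list.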

Here is the [log][5] from the MFS on the CLDB node.
 
Can you please help us identify the problem? 
Thank you, 
Dima.

  [1]: http://pastebin.com/9sjLJE53
  [2]: http://pastebin.com/9K6cwRrs
  [3]: http://pastebin.com/dDSiKdpv
  [4]: http://pastebin.com/PyhcbfWb
  [5]: http://pastebin.com/MxZVecMC