
CLDB restarted by ZooKeeper

Question asked by chriscurtin on Sep 10, 2012
Latest reply on Sep 11, 2012 by nabeel
Hi,

This morning our cluster restarted the CLDB process three times within a few hours. The system was running our nightly jobs, so it was at roughly 80% of capacity when this happened. Unfortunately, since the Hadoop side couldn't reach the CLDB process, the tasks failed and eventually the job did too.

There are no ERROR events in the logs pointing to what specifically happened, but it does look like ZooKeeper told CLDB to restart: CLDB logs a Disconnect from ZooKeeper right before it shuts down. I can't find any logs explaining why ZooKeeper would have done this. (I've been told the BindDn errors aren't a big deal since we aren't sending email alerts.) Operations is looking at the time skew, but they say we're syncing all the clocks using NTP, so they are not sure what is causing it.

Here is the excerpt from the cldb.log file for the most recent restart:


    2012-09-11 04:20:14,898 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-55]: NODE_ALARM_TIME_SKEW raised, for node mapr01.atlis1, Clock skew of 23 seconds
    2012-09-11 04:20:14,898 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:20:15,910 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-54]: NODE_ALARM_TIME_SKEW cleared, for node mapr01.atlis1
    2012-09-11 04:22:22,064 INFO  com.mapr.fs.cldb.dialhome.metrics.MetricsWriter [MetricsStorageTimer]: Writing dialhome data to /var/mapr/metrics/September.11.2012. Num metric snapshots : 16
    2012-09-11 04:22:48,864 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-56]: NODE_ALARM_TIME_SKEW raised, for node mapr01.atlis1, Clock skew of 21 seconds
    2012-09-11 04:22:48,864 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:22:48,865 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-53]: NODE_ALARM_TIME_SKEW raised, for node mapr03.atlis1, Clock skew of 21 seconds
    2012-09-11 04:22:48,865 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:22:48,865 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:22:48,865 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-60]: NODE_ALARM_TIME_SKEW raised, for node mapr02.atlis1, Clock skew of 21 seconds
    2012-09-11 04:22:48,866 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-61]: NODE_ALARM_TIME_SKEW raised, for node mapr05.atlis1, Clock skew of 21 seconds
    2012-09-11 04:22:48,866 WARN  com.mapr.fs.cldb.alarms.Alarms [pool-1-thread-59]: NODE_ALARM_TIME_SKEW raised, for node mapr04.atlis1, Clock skew of 21 seconds
    2012-09-11 04:22:48,866 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:22:48,868 ERROR com.mapr.util.LDAPUtil [EmailManager]: BindDn not defined
    2012-09-11 04:22:48,967 INFO  com.mapr.baseutils.zookeeper.ZKDataRetrieval [pool-1-thread-2-EventThread]: Process path: null. Event state: Disconnected. Event type: None
    2012-09-11 04:22:48,968 INFO  com.mapr.fs.cldb.CLDBServer [main-EventThread]: ZooKeeper event None on path null
    2012-09-11 04:22:48,968 FATAL com.mapr.fs.cldb.CLDB [main-EventThread]: CLDBShutdown: Received Disconnect from ZooKeeper. Shutting down CLDB
    2012-09-11 04:22:48,968 WARN  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [ReplicationManagerThread]: ZooKeeperClient : KvStoreContainerInfo read received connection loss exception. Sleeping for 30 Number of retry left 1
    2012-09-11 04:22:48,968 INFO  com.mapr.fs.cldb.CLDBServer [main-EventThread]: Shutdown: Stopping CLDB
    2012-09-11 04:22:48,969 INFO  com.mapr.fs.cldb.CLDB [Thread-9]: CLDB ShutDown Hook called
    2012-09-11 04:22:48,969 INFO  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [Thread-9]: Zookeeper Client: Closing client connection:
    2012-09-11 04:22:51,202 INFO  com.mapr.baseutils.zookeeper.ZKDataRetrieval [pool-1-thread-2-EventThread]: Process path: null. Event state: SyncConnected. Event type: None
    2012-09-11 04:22:51,311 INFO  com.mapr.baseutils.zookeeper.ZKDataRetrieval [pool-1-thread-2-EventThread]: Process path: /datacenter/controlnodes/cldb/active/CLDBMaster. Event state: SyncConnected. Event type: NodeDeleted
    2012-09-11 04:22:51,312 INFO  com.mapr.fs.cldb.CLDB [Thread-9]: CLDB shutdown
    Header: hostName: mapr01.atlis1, Time Zone: Eastern Standard Time, processName: cldb, processId: 1390, MapR Build Version: 1.2.3.12961.GA
    2012-09-11 04:26:54,231 INFO  com.mapr.fs.cldb.CLDB [main]: Initializing CLDB
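
For anyone triaging a similar log, here is a minimal Python sketch that pulls the time-skew alarms, the ZooKeeper disconnects, and the CLDB start-ups out of cldb.log so they can be lined up by timestamp. The line layout is inferred only from the excerpt above, and the names `parse_line` and `summarize` are made up for this example; this is not a MapR tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<timestamp> <LEVEL> <logger> [<thread>]: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(\w+)\s+(\S+)\s+\[([^\]]+)\]:\s?(.*)$"
)
SKEW_RE = re.compile(
    r"NODE_ALARM_TIME_SKEW raised, for node (\S+), Clock skew of (\d+) seconds"
)


def parse_line(line):
    """Return a dict for a timestamped log line, or None for anything else."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. the "Header:" line written at startup
    stamp, level, logger, thread, message = m.groups()
    return {
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"),
        "level": level,
        "logger": logger,
        "thread": thread,
        "message": message,
    }


def summarize(lines):
    """Collect skew alarms, ZooKeeper disconnects, and CLDB start-ups."""
    summary = {"skew": [], "disconnects": [], "inits": []}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        skew = SKEW_RE.search(rec["message"])
        if skew:
            # (time, node name, skew in seconds)
            summary["skew"].append((rec["time"], skew.group(1), int(skew.group(2))))
        elif rec["level"] == "FATAL" and "Received Disconnect from ZooKeeper" in rec["message"]:
            summary["disconnects"].append(rec["time"])
        elif rec["message"] == "Initializing CLDB":
            summary["inits"].append(rec["time"])
    return summary
```

Passing an open file handle for cldb.log to `summarize` gives three lists; comparing each entry in `disconnects` with the next entry in `inits` shows how long CLDB was down for each restart, and the `skew` list shows whether the alarms cluster around the same moments.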


(Side note: why does the cldb.log file never rotate? It still contains events from the first time we started this cluster.)

Thanks,

Chris
