AnsweredAssumed Answered

Urgent: CLDB in M3 Down SP Failure

Question asked by mandoskippy on Sep 1, 2015
Latest reply on Feb 5, 2016 by keysbotzum
Mapr Version: M3 4.0.2
Java Version: 7
CLDB Node: node21 (192.168.3.129)

I am helping a friend on his cluster, and they had a Service pool fail on our CLDB node.  It's ugly. The cluster is in a bad state. (nothing but Zookeepers are up).   As you can see, zookeeper "thinks" the only valid CLDB volume is on the CLDB where the other two show as Invalid.

     {
            "timestamp":1441125590075,
            "timeofday":"2015-09-01 12:39:50.075 GMT-0400",
            "status":"OK",
            "total":2,
            "data":[
                {
                    "valid":"192.168.3.129:5660-"
                },
                {
                    "invalid":[
                        "192.168.3.10:5660-",
                        "192.168.3.117:5660-"
                    ]
                }
            ]
        }

When we try to start on the valid CLDB, it won't start due to SP issues. (This is in a remote cluster, thus in a bad way, even if we could change out the drives, we'd be in a funky place.  What is the recommended approach here, something isn't kosher and it won't allow the cluster to start, we have a lot of data here, not sure how to approach next steps.  We tried waiting, that didn't help, we rebooted the CLDB, nothing.

        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:745)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: Thread-6 ID: 21
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.sleep(Native Method)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: com.mapr.fs.cldb.ActiveContainersMap$WriteBackAtimeUpdater$IdleFlusher.run(ActiveContainersMap.java:745)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: RPC-1 ID: 31
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: sun.misc.Unsafe.park(Native Method)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:745)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: WaitForLocalKvstore Thread ID: 26
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.dumpThreads(Native Method)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.getAllStackTraces(Thread.java:1640)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: com.mapr.fs.cldb.CLDB.shutdown(CLDB.java:177)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: com.mapr.fs.cldb.WaitForLocalKvstoreThread.run(WaitForLocalKvstoreThread.java:49)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: Timer-0 ID: 30
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Object.wait(Native Method)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.TimerThread.mainLoop(Timer.java:552)
        2015-09-01 15:19:14,808 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.TimerThread.run(Timer.java:505)
        2015-09-01 15:19:14,808 INFO CLDBServer [WaitForLocalKvstore Thread]: Shutdown: Stopping CLDB
        2015-09-01 15:19:14,809 INFO CLDB [Thread-13]: CLDB ShutDown Hook called
        2015-09-01 15:19:14,810 INFO ZooKeeperClient [Thread-13]: Setting the clean cldbshutdown flag to true
        2015-09-01 15:19:14,816 INFO ZooKeeperClient [Thread-13]: Zookeeper Client: Closing client connection:
        2015-09-01 15:19:14,820 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type NodeDeleted occurred on path /datacenter/controlnodes/cldb/active/CLDBMaster
        2015-09-01 15:19:14,820 INFO ZooKeeper [Thread-13]: Session: 0x14f89cad2ec0009 closed
        2015-09-01 15:19:14,820 INFO CLDB [Thread-13]: CLDB shutdown
        Header: hostName: node21.cluster.ourcompany.com, Time Zone: Eastern Standard Time, processName: cldb, processId: 30102, MapR Build Version: 4.0.2.29870.GA
        2015-09-01 15:30:26,818 INFO CLDB [main]: Loading properties file : /opt/mapr/conf/cldb.conf
        2015-09-01 15:30:27,035 INFO CLDBMetrics [main]: Initializing CLDB Metrics with serviceName: cldbServer
        2015-09-01 15:30:27,038 INFO CLDB [main]: CLDBInit: Using hostname file /opt/mapr/hostname and hostid file /opt/mapr/hostid
        2015-09-01 15:30:27,038 INFO CLDB [main]: CLDB Properties from configuration file: cldb.web.port=7221cldb.zookeeper.servers=node11:5181,node16:5181,node19:5181cldb.numthreads=10cldb.web.https.port=7443hadoop.version=2.5.1cldb.port=7222cldb.min.fileservers=1cldb.detect.dup.hostid.enabled=falsenum.volmirror.threads=1cldb.jmxremote.port=7220
        2015-09-01 15:30:27,038 INFO CLDB [main]: CLDB Command line args: /opt/mapr/conf/cldb.conf
        2015-09-01 15:30:27,038 INFO CLDB [main]: CLDBInit: Initializing CLDB
        2015-09-01 15:30:27,039 INFO CLDB [main]: MapR BuildVersion: 4.0.2.29870.GA
        2015-09-01 15:30:27,039 INFO CLDB [main]: CLDBInit: Start CLDBServer
        2015-09-01 15:30:27,085 INFO CLDBServer [main]: CLDBInit: HostName: node21.cluster.ourcompany.com ServerId: 5640304449658178893
        2015-09-01 15:30:27,085 INFO CLDBServer [main]: CLDBInit: Cluster name : cluster
        2015-09-01 15:30:27,091 INFO CLDBServer [main]: CLDB creds setting uid as 501
        2015-09-01 15:30:27,091 INFO CLDBServer [main]: CLDB creds setting adding gid 501
        2015-09-01 15:30:27,102 INFO CLDB [main]: CLDBState: CLDB State change : INITIAZING
        2015-09-01 15:30:27,142 INFO CLDBServer [main]: Initializing Cldb JNI with memory 810903142 bytes
        2015-09-01 15:30:27,152 INFO ZooKeeperClient [main]: ZooKeeperClient init: zk timeout = 30000 ms
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:zookeeper.version=3.4.5-mapr-1406--1, built on 01/19/2015 22:57 GMT
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:host.name=node21.cluster.ourcompany.com
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.version=1.7.0_75
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.vendor=Oracle Corporation
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75.x86_64/jre
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.class.path=/opt/mapr:/opt/mapr/conf:/opt/mapr/lib/adminuiapp-4.0.2-mapr.jar:/opt/mapr/lib/ant-1.7.1.jar:/opt/mapr/lib/antlr-2.7.7.jar:/opt/mapr/lib/baseutils-4.0.2-mapr.jar:/opt/mapr/lib/c3p0-0.9.1.2.jar:/opt/mapr/lib/central-logging-4.0.2-mapr.jar:/opt/mapr/lib/cldb-4.0.2-mapr.jar:/opt/mapr/lib/cliframework-4.0.2-mapr.jar:/opt/mapr/lib/commons-codec-1.5.jar:/opt/mapr/lib/commons-collections-3.2.1.jar:/opt/mapr/lib/commons-configuration-1.6.jar:/opt/mapr/lib/commons-el-1.0.jar:/opt/mapr/lib/commons-email-1.2.jar:/opt/mapr/lib/commons-lang-2.5.jar:/opt/mapr/lib/commons-logging-1.1.3-api.jar:/opt/mapr/lib/commons-logging-1.1.3.jar:/opt/mapr/lib/dom4j-1.6.1.jar:/opt/mapr/lib/eval-0.5.jar:/opt/mapr/lib/flexjson-2.1.jar:/opt/mapr/lib/globalfsck-4.0.2-mapr.jar:/opt/mapr/lib/gson-2.1.jar:/opt/mapr/lib/guava-13.0.1.jar:/opt/mapr/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/lib/hadoop-auth-2.5.1.jar:/opt/mapr/lib/hadoop-common-2.5.1.jar:/opt/mapr/lib/hadoop-yarn-api-2.5.1.jar:/opt/mapr/lib/hadoop-yarn-client-2.5.1.jar:/opt/mapr/lib/hadoop-yarn-common-2.5.1.jar:/opt/mapr/lib/hbase-utils-4.0.2-mapr.jar:/opt/mapr/lib/hibernate-c3p0-3.3.1.GA.jar:/opt/mapr/lib/hibernate-commons-annotations-3.2.0.Final.jar:/opt/mapr/lib/hibernate-core-3.6.8.Final.jar:/opt/mapr/lib/httpclient-4.2.jar:/opt/mapr/lib/httpclient-cache-4.2.jar:/opt/mapr/lib/httpcore-4.2.jar:/opt/mapr/lib/jasper-compiler-5.5.12.jar:/opt/mapr/lib/jasper-runtime-5.5.12.jar:/opt/mapr/lib/javassist-3.12.1.GA.jar:/opt/mapr/lib/jetty-6.1.26.jar:/opt/mapr/lib/jetty-plus-6.1.26.jar:/opt/mapr/lib/jetty-util-6.1.26.jar:/opt/mapr/lib/jobmngmnt-4.0.2-mapr.jar:/opt/mapr/lib/joda-time-2.0.jar:/opt/mapr/lib/JPam-1.1.jar:/opt/mapr/lib/json-20080701.jar:/opt/mapr/lib/jsp-2.1.jar:/opt/mapr/lib/jsp-api-2.1.jar:/opt/mapr/lib/jta-1.1.jar:/opt/mapr/lib/junit-3.8.1.jar:/opt/mapr/lib/junit-4.5.jar:/opt/mapr/lib/kvstore-4.0.2-mapr.jar:/opt/mapr/lib/libprotodefs-4.0.2-mapr.jar:/opt/mapr/lib/log4j-1.2.15.jar:/opt/mapr/lib/mail.jar:/opt/mapr/lib/maprbuildversion-4.0.2-mapr.jar:/opt/mapr/lib/maprcli-4.0.2-mapr.jar:/opt/mapr/lib/maprfs-4.0.2-mapr.jar:/opt/mapr/lib/maprfs-core-4.0.2-mapr.jar:/opt/mapr/lib/maprfs-core-4.0.2-mapr-tests.jar:/opt/mapr/lib/maprfs-diagnostic-tools-4.0.2-mapr.jar:/opt/mapr/lib/maprfs-jni-4.0.2-mapr.jar:/opt/mapr/lib/mapr-hbase-4.0.2-mapr.jar:/opt/mapr/lib/maprutil-4.0.2-mapr.jar:/opt/mapr/lib/maven-ant-tasks-2.1.3.jar:/opt/mapr/lib/mysql-connector-java-5.1.25-bin.jar:/opt/mapr/lib/persistence-api-1.0.jar:/opt/mapr/lib/protobuf-java-2.5.0.jar:/opt/mapr/lib/servlet-api-2.5-6.1.26.jar:/opt/mapr/lib/slf4j-api-1.7.5.jar:/opt/mapr/lib/slf4j-log4j12-1.7.5.jar:/opt/mapr/lib/volumemirror-4.0.2-mapr.jar:/opt/mapr/lib/warden-4.0.2-mapr.jar:/opt/mapr/lib/zookeeper-3.4.5-mapr-1406.jar:/opt/mapr/hadoop/hadoop-2.5.1/lib/hadoop-2.5.1-dev-core.jar:
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.library.path=/opt/mapr/lib
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.io.tmpdir=/tmp
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:java.compiler=
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:os.name=Linux
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:os.arch=amd64
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:os.version=2.6.32-504.12.2.el6.x86_64
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:user.name=mapr
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:user.home=/home/mapr
        2015-09-01 15:30:27,168 INFO ZooKeeper [main]: Client environment:user.dir=/opt/mapr/initscripts
        2015-09-01 15:30:27,169 INFO ZooKeeper [main]: Initiating client connection, connectString=node11:5181,node16:5181,node19:5181 sessionTimeout=30000 watcher=com.mapr.fs.cldb.CLDBServer@55dd9f37
        2015-09-01 15:30:27,187 INFO CLDBServer [main]: CLDB configured with ZooKeeper ensemble with connection string node11:5181,node16:5181,node19:5181
        2015-09-01 15:30:27,211 INFO Login [main-SendThread(node19.cluster.ourcompany.com:5181)]: successfully logged in.
        2015-09-01 15:30:27,213 INFO ZooKeeperSaslClient [main-SendThread(node19.cluster.ourcompany.com:5181)]: Client will use SIMPLE-SECURITY as SASL mechanism.
        2015-09-01 15:30:27,217 INFO ClientCnxn [main-SendThread(node19.cluster.ourcompany.com:5181)]: Opening socket connection to server node19.cluster.ourcompany.com/192.168.3.127:5181. Will attempt to SASL-authenticate using Login Context section 'Client_simple'
        2015-09-01 15:30:27,220 INFO ClientCnxn [main-SendThread(node19.cluster.ourcompany.com:5181)]: Socket connection established to node19.cluster.ourcompany.com/192.168.3.127:5181, initiating session
        2015-09-01 15:30:27,224 INFO ActiveContainersMap [main]: Caching a max of 3071602 containers in cache
        2015-09-01 15:30:27,227 INFO ActiveContainersMap [main]: WriteBackAtimeUpdater AtimeUpdateInterval = 360 mins
        2015-09-01 15:30:27,230 INFO ClientCnxn [main-SendThread(node19.cluster.ourcompany.com:5181)]: Session establishment complete on server node19.cluster.ourcompany.com/192.168.3.127:5181, sessionid = 0x24f89cadb88000c, negotiated timeout = 30000
        2015-09-01 15:30:27,232 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
        2015-09-01 15:30:27,239 INFO CLDBServer [main-EventThread]: onZKConnect: The CLDB has successfully connected to the ZooKeeper server State:CONNECTED Timeout:30000 sessionid:0x24f89cadb88000c local:/192.168.3.129:56392 remoteserver:node19.cluster.ourcompany.com/192.168.3.127:5181 lastZxid:0 xid:1 sent:1 recv:1 queuedpkts:0 pendingresp:0 queuedevents:0 in the ZooKeeper ensemble with connection string node11:5181,node16:5181,node19:5181
        2015-09-01 15:30:27,284 INFO VolumeMirror [main]: Initializing volume mirror thread ...
        2015-09-01 15:30:27,286 INFO VolumeMirror [main]: Spawned 1 VolumeMirror Threads
        2015-09-01 15:30:27,298 INFO HttpServer [main]: Creating http listener for 0.0.0.0
        2015-09-01 15:30:27,305 INFO CLDBServer [main-EventThread]: The CLDB received notification that a ZooKeeper event of type None occurred on path null
        2015-09-01 15:30:27,314 INFO log [main]: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
        2015-09-01 15:30:27,333 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: KvStore is of latest epoch CLDB trying to become Master
        2015-09-01 15:30:27,336 INFO ZooKeeperClient [ZK-Connect]: ZooKeeperClient: CLDB is current Master
        2015-09-01 15:30:27,336 INFO ZooKeeperClient [ZK-Connect]: CLDB became master. Initializing KvStoreContainer for cid: 1
        2015-09-01 15:30:27,340 INFO ZooKeeperClient [ZK-Connect]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Servers:  192.168.3.129-331(5640304449658178893) Inactive:  192.168.3.10-331(154753047820195968) 192.168.3.117-331(1898149662065153619) Unused:  Epoch:331 SizeMB:0
        2015-09-01 15:30:27,347 INFO CLDBServer [ZK-Connect]: Starting thread to monitor waiting for local kvstore to become master
        2015-09-01 15:30:27,362 INFO CLDB [main]: CLDBState: CLDB State change : WAIT_FOR_FILESERVERS
        2015-09-01 15:30:27,362 INFO CLDB [main]: CLDBInit: Starting RPCServer on port 7222 with num thread 10, heap size of 3866(MB) and with startup options -Xms2400m -Xmx4000m -XX:ErrorFile=/opt/cores/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/cores -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:ThreadStackSize=256
        2015-09-01 15:30:27,365 INFO CLDB [main]: CLDBInit: Starting HTTP Server
        2015-09-01 15:30:27,365 INFO HttpServer [main]: WebServer: Starting WebServer
        2015-09-01 15:30:27,366 INFO HttpServer [main]: Listener started on SelectChannelConnector@0.0.0.0:7221 port 7221
        2015-09-01 15:30:27,366 INFO HttpServer [main]: Starting Jetty WebServer
        2015-09-01 15:30:27,366 INFO log [main]: jetty-6.1.26
        2015-09-01 15:30:27,396 WARN log [main]: Can't reuse /tmp/Jetty_0_0_0_0_7221_cldb____qb58s0, using /tmp/Jetty_0_0_0_0_7221_cldb____qb58s0_6921456065670212084
        2015-09-01 15:30:27,484 INFO CLDBServer [RPC-1]: FSRegister: Request  FSID: 5640304449658178893 FSNetworkLocation:  FSHost:Port: 192.168.3.129- FSHostName: node21.cluster.ourcompany.com StoragePools 31b926425aeab7f400551be71e0ef3e5-5c6d806e0166b27100551be7200ef096-2d5afa783a2d31cd00551be7220d9d12- Capacity: 49336389 Available: 6105032 Used: 43231357 Role: 0 isDCA: false Received registration request
        2015-09-01 15:30:27,484 INFO CLDBServer [RPC-1]: Cluster uuid is -6088325864781372954-6167454770989910379
        2015-09-01 15:30:27,484 WARN Topology [RPC-1]: FileSever on node21.cluster.ourcompany.com reported an invalid topology . Ignoring reported topology
        2015-09-01 15:30:27,532 INFO CLDBServer [RPC-1]: FSRegister: Registered FileServer: 192.168.3.129- at topology /default-rack/node21.cluster.ourcompany.com
        2015-09-01 15:30:27,701 INFO log [main]: Started SelectChannelConnector@0.0.0.0:7221
        2015-09-01 15:30:29,559 INFO CLDBServer [Lookup-1]: Rejecting RPC 2345.5 from 192.168.3.18:50514 with status 3 as CLDB is waiting for local kvstore to become master.
        2015-09-01 15:30:30,417 INFO CLDBServer [RPC-8]: FSRegister: CLDB waiting for kvstore to register, requesting nfs server: node7.cluster.ourcompany.com to try again: returning ESRCH
        2015-09-01 15:31:29,981 INFO CLDBServer [Lookup-5]: Rejecting RPC 2345.5 from 192.168.3.26:49119 with status 3 as CLDB is waiting for local kvstore to become master.
        2015-09-01 15:31:30,446 INFO CLDBServer [RPC-10]: FSRegister: CLDB waiting for kvstore to register, requesting nfs server: node7.cluster.ourcompany.com to try again: returning ESRCH

Outcomes