AnsweredAssumed Answered

CLDB is not running and can not log into system control UI

Question asked by sand on Oct 20, 2012
Latest reply on Oct 20, 2012 by nabeel
I installed a 5-nodes cluster (MAPR M3), the following is the roles file(just for filesystem): <br>
node0  fileserver, cldb, nfs, zookeeper,webserver<br>
node1  fileserver, zookeeper<br>
node2  fileserver<br>
node3  fileserver<br>
node4  fileserver<br>
<br>
after I finished installing, I found webserver is not started automatically, and I started webserver manually, I can not log into system control UI, I check all log files, it seems the CLDb  is not running, the following is main information in log files:
<br><br>
wardeninit.log
<br>
<code>
Sat Oct 20 18:56:18 CST 2012
Start warden
/etc/rc2.d/S70mapr-warden: line 125: kill: (1017) - No such process
no jobtracker to stop
MAPR_USER=root, skip changes to TTLocalDir
Warden started
Sat Oct 20 19:32:04 CST 2012
Stop warden
/etc/rc6.d/K30mapr-warden: line 240: kill: (990) - No such process
/etc/rc6.d/K30mapr-warden: line 258: kill: (1015) - No such process
CLDB not running.
Waiting up to 30 seconds ...
Sat Oct 20 19:33:21 CST 2012
Start warden
/etc/rc2.d/S70mapr-warden: line 125: kill: (990) - No such process
no jobtracker to stop
MAPR_USER=root, skip changes to TTLocalDir
Warden started
</code>
<br><br>

cldb.log:
<br>
<code>
2012-10-20 18:56:50,972 INFO HttpServer [main]: Starting Jetty WebServer
2012-10-20 18:56:50.972::INFO:  jetty-6.1.14
2012-10-20 18:56:51.594::INFO:  Started SelectChannelConnector@0.0.0.0:7221
2012-10-20 18:56:51,595 INFO CLDBServer [main]: HTTP Server
2012-10-20 18:56:51,687 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.200:1111 Rejecting rpc with status 30 as CLDB is not yet initialized.
2012-10-20 18:56:51,709 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.200:1111 Rejecting rpc with status 30 as CLDB is not yet initialized.
2012-10-20 18:56:53,642 WARN ClientCnxn [main-SendThread(data-node1:5181)]: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2012-10-20 18:56:53,663 INFO CLDBServer [RPC-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 103 from 10.0.42.200:35668 Rejecting rpc with status 30 as CLDB is not yet initialized.
2012-10-20 18:56:53,689 INFO CLDBServer [RPC-thread-2]: RPC: PROGRAMID: 2345 PROCEDUREID: 40 from 10.0.42.200:35668 Rejecting rpc with status 30 as CLDB is not yet initialized.
2012-10-20 18:56:54,566 INFO ClientCnxn [main-SendThread(data-node1:5181)]: Opening socket connection to server data-node2/10.0.42.202:5181
2012-10-20 18:56:54,567 INFO ClientCnxn [main-SendThread(data-node2:5181)]: Socket connection established to data-node2/10.0.42.202:5181, initiating session
2012-10-20 18:56:54,644 INFO ClientCnxn [main-SendThread(data-node2:5181)]: Session establishment complete on server data-node2/10.0.42.202:5181, sessionid = 0x23a7dd1290e0005, negotiated timeout = 30000
2012-10-20 18:56:54,645 INFO CLDBServer [main-EventThread]: ZooKeeper event None on path null
2012-10-20 18:56:54,645 INFO CLDBServer [main-EventThread]: onZKConnect: Received connect from ZooKeeper State:CONNECTED Timeout:30000 sessionid:0x23a7dd1290e0005 local:/10.0.42.200:47557 remoteserver:data-node2/10.0.42.202:5181 lastZxid:0 xid:1 sent:1 recv:1 queuedpkts:0 pendingresp:0 queuedevents:0
2012-10-20 18:56:54,949 INFO CLDBServer [main-EventThread]: onZKConnect: Established connection to ZooKeeper
2012-10-20 18:56:54,970 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: ZooKeeperClient : No FileServers for KvStore container.  New Installation, becoming Master
2012-10-20 18:56:54,987 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: ZooKeeperClient: CLDB is current Master
2012-10-20 18:56:54,987 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: CLDB became master. Creating new KvStoreContainer for cid: 1
2012-10-20 18:56:54,987 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: No FileServers at this point
2012-10-20 18:56:54,989 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Servers:  Inactive Servers:  Unused Servers:  Latest epoch:3 SizeMB:0
2012-10-20 18:56:55,007 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: Adding KvStore ContainerInfo 1 to ZooKeeper  Container ID:1 VolumeId:1 Servers:  Inactive Servers:  Unused Servers:  Latest epoch:3
2012-10-20 18:56:55,008 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: CLDB became master. Initializing KvStoreContainer for cid: 1
2012-10-20 18:56:55,010 INFO ZooKeeperClient [Wait for ZooKeeper Connected thread]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Servers:  Inactive Servers:  Unused Servers:  Latest epoch:3 SizeMB:0
2012-10-20 18:56:55,027 INFO CLDBServer [Wait for ZooKeeper Connected thread]: Starting thread to monitor waiting for local kvstore to become master
2012-10-20 18:56:57,579 INFO CLDBServer [RPC-thread-3]: RPC: PROGRAMID: 2345 PROCEDUREID: 103 from 10.0.42.200:60871 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:56:57,753 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.200:1111 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:56:57,991 INFO CLDBServer [RPC-thread-4]: FSRegister: Request  FSID: 6800597116191256797 FSNetworkLocation: / FSHost:Port: 10.0.42.200:5660- FSHostName: data-node0 StoragePools  Capacity: 0 Available: 0 Used: 0 Role: 0 isDCA: false Received registration request
2012-10-20 18:56:57,991 INFO CLDBServer [RPC-thread-4]: Cluster uuid is -7224280450690225629--4547375276730596499
2012-10-20 18:56:58,015 INFO FileServerMetrics [RPC-thread-4]: Initializing File Server Metrics with hostName=data-node0
2012-10-20 18:56:58,015 INFO FileServer [RPC-thread-4]: Instantiating fileserver metrics with context:com.mapr.fs.cldb.counters.MapRGangliaContext31
2012-10-20 18:56:58,016 INFO CLDBServer [RPC-thread-4]: FSRegister: Registered FileServer: 10.0.42.200:5660-
2012-10-20 18:56:58,753 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.200:1111 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:56:59,058 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.202:1111 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:57:00,059 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.202:1111 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:57:00,585 INFO CLDBServer [RPC-thread-5]: RPC: PROGRAMID: 2345 PROCEDUREID: 103 from 10.0.42.200:60871 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.
2012-10-20 18:57:00,754 INFO CLDBServer [Lookup-thread-1]: RPC: PROGRAMID: 2345 PROCEDUREID: 5 from 10.0.42.200:1111 Rejecting rpc with status 3 as CLDB is waiting for local kvstore to become master.

...
<code>
<br><br>
maprcli-root-0.log<br>
Header: hostName: data-node0, Time Zone: China Standard Time, processName: MapRCLI, processId: 1022, MapR Build Version: 2.0.1.15869.GA
2012-10-20 18:56:20,992 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: [node, list, -columns, hostname, -noheader]
2012-10-20 18:56:21,074 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: init
2012-10-20 18:56:21,200 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [main]: No data returned in RPC: CLDB Host: 10.0.42.200, CLDB IP: 7222. Continue searching for correct CLDB
2012-10-20 18:56:21,202 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: Bad CLDB credentials removed: CLDB Host: 10.0.42.200, CLDB IP: 7222
2012-10-20 18:56:21,202 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getZkConnect [main]: No data is received from any cldb
2012-10-20 18:56:21,203 ERROR com.mapr.cli.ServerCommands getServicesInfo [main]: zkConnectString is null/empty. Cannot proceed further.
2012-10-20 18:56:21,219 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: init
2012-10-20 18:56:21,241 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [main]: No data returned in RPC: CLDB Host: 10.0.42.200, CLDB IP: 7222. Continue searching for correct CLDB
2012-10-20 18:56:21,241 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: Bad CLDB credentials removed: CLDB Host: 10.0.42.200, CLDB IP: 7222
2012-10-20 18:56:21,241 ERROR com.mapr.cli.ServerCommands sendRequest [main]: RPC Request to list Nodes failed. No data returned
2012-10-20 18:56:21,243 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: ERROR (10009) -  Couldn't connect to the CLDB service. Check if at least one CLDB is running.
<br><br>

mfs.log:<br>
2012-10-20 18:56:39,8525 ERROR  lun.cc:613 x.x.0.0:0 Disk /dev/loop0 size 0 too small
2012-10-20 18:56:39,8532 ERROR  lun.cc:786 x.x.0.0:0 OnlineDisk /dev/loop0 failed No space left on device, errno 28
2012-10-20 18:56:39,8545 ERROR  loadsp.cc:82 x.x.0.0:0 Onlinedisk(/dev/loop0) failed(28)
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : /opt/mapr/server/handle_disk_failure.sh
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : -d
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : /dev/loop0
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : -c
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : my.cluster.com
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : -e
2012-10-20 18:56:39,8546 INFO  iomgr.cc:947 x.x.0.0:0 execv of /opt/mapr/server/handle_disk_failure.sh with option : 28
2012-10-20 18:56:39,8727 INFO  loadsp.cc:166 x.x.0.0:0 Done loading disks from disktab
2012-10-20 18:56:39,8727 ERROR  iomgr.cc:2469 x.x.0.0:0 Missing disk for GUID 54CE8A8B-BC44-F41E-B775-00F85D825000
2012-10-20 18:56:39,8728 INFO  iomgr.cc:2483 x.x.0.0:0 Refresh disktab state: old state: 0 1, failed SPs: 0, failed disks: 1
2012-10-20 18:56:39,8739 INFO  containerinfo.cc:943 x.x.0.0:0 Starting thread to periodically mark containers clean
2012-10-20 18:56:39,8959 INFO  procqueue.cc:41 x.x.0.0:0 proc-queue limits : clientRpcs(163) mfsRpcs(163) bulkRpcs(81)
2012-10-20 18:56:39,9337 ERROR  cldbha.cc:405 x.x.0.0:0 Got error Connection reset by peer (104) while trying to register with CLDB 10.0.42.200:7222
2012-10-20 18:56:42,9344 INFO  cldbha.cc:211 x.x.0.0:0 No storage pools are ready on this node. Marking CLDB unreachable
2012-10-20 18:56:58,1020 INFO  cldbha.cc:475 x.x.0.0:0 Re-established communication link with CLDB master at 10.0.42.200:7222.
2012-10-20 18:56:58,1020 INFO  fileserver.cc:6734 x.x.0.0:0 Registered with cldb 10.0.42.200:7222
2012-10-20 18:57:01,1184 INFO  fileserver.cc:7173 x.x.0.0:0 recieved updated no-compress list from cldb: bz2,gz,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png,lzo
2012-10-20 18:57:05,1761 INFO  fileserver.cc:6751 x.x.0.0:0 Sending full container report to cldb.
2012-10-20 18:57:05,1762 INFO  fileserver.cc:6855 x.x.0.0:0 Sending vol list with 0 volumes.
2012-10-20 19:00:08,1515 INFO  fileserver.cc:6751 x.x.0.0:0 Sending full container report to cldb.
2012-10-20 19:00:08,1516 INFO  fileserver.cc:6855 x.x.0.0:0 Sending vol list with 0 volumes.
2012-10-20 19:03:11,0867 INFO  fileserver.cc:6751 x.x.0.0:0 Sending full container report to cldb.
2012-10-20 19:03:11,0868 INFO  fileserver.cc:6855 x.x.0.0:0 Sending vol list with 0 volumes.
2012-10-20 19:03:55,0624 ERROR  cldbha.cc:666 x.x.0.0:0 Failed to reach CLDB node due to error Read-only file system (30) for operation 2345.33 at 10.0.42.200:7222. Will retry after finding CLDB master.
</code>
<br><br>

mrdisk.log<br>
<code>
2012-10-20 18:56:38,6893 INFO IOMgr lun.cc:782 x.x.0.0:0 /dev/sda2 LoadDisk 0x17425d0 retry 0
2012-10-20 18:56:38,6893 ERROR IOMgr lun.cc:613 x.x.0.0:0 Disk /dev/sda2 size 0 too small
2012-10-20 18:56:38,6893 ERROR IOMgr lun.cc:786 x.x.0.0:0 OnlineDisk /dev/sda2 failed No space left on device, errno 28
2012-10-20 18:56:38,6894 ERROR Global init.cc:77 x.x.0.0:0 Onlinedisk /dev/sda2 failed 28
2012-10-20 18:56:38,6894 INFO IOMgr lun.cc:770 x.x.0.0:0 Loading disk:/dev/sda5
2012-10-20 18:56:38,6894 INFO IOMgr lun.cc:782 x.x.0.0:0 /dev/sda5 LoadDisk 0x17425d0 retry 0
2012-10-20 18:56:38,7363 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x1: exp 0x4e554c2d5250414d found 0x506b1801a1c85f69
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x1 0x3a385e4 ret: 1 0
2012-10-20 18:56:38,7364 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x2: exp 0x4e554c2d5250414d found 0xa03a1b4f4ffa55b0
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x2 0x3a385e5 ret: 1 0
2012-10-20 18:56:38,7364 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x3: exp 0x4e554c2d5250414d found 0x3c84e5d133ed050b
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x3 0x3a385e6 ret: 1 0
2012-10-20 18:56:38,7364 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x4: exp 0x4e554c2d5250414d found 0x15f6287d5d64431c
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x4 0x3a385e7 ret: 1 0
2012-10-20 18:56:38,7364 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x5: exp 0x4e554c2d5250414d found 0xd0e40656ddb0307c
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x5 0x3a385e8 ret: 1 0
2012-10-20 18:56:38,7364 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x6: exp 0x4e554c2d5250414d found 0xf5c67673703d1d5e
2012-10-20 18:56:38,7364 INFO IOMgr iomgr.cc:2817 x.x.0.0:0 CRC error in one of the DG meta blocks 0x6 0x3a385e9 ret: 1 0
2012-10-20 18:56:38,7377 ERROR IOMgr lun.cc:874 x.x.0.0:0 MapR disk private data not found on /dev/sda
2012-10-20 18:56:38,7377 ERROR IOMgr lun.cc:1045 x.x.0.0:0 /dev/sda Disk online failed 61
2012-10-20 18:56:38,7377 ERROR Global init.cc:77 x.x.0.0:0 Onlinedisk /dev/sda failed 61
2012-10-20 18:56:38,7983 ERROR IOMgr iomgr.cc:2617 x.x.0.0:0 bad magic on Disk Private block 0x1: exp 0x4e554c2d5250414d found 0x20005fe000400020
</code>
<br><br>
adminuiapp.log:<br>
<pre>
2012-10-20 19:17:16,623 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [833041663@qtp0-9]: Bad CLDB credentials removed: CLDB Host: 10.0.42.200, CLDB IP: 7222
2012-10-20 19:17:16,624 INFO  com.mapr.adminuiapp.commands.LoginCallable [pool-1-thread-3]: LoginCallable: Parameters = [login, -user, root]
2012-10-20 19:17:16,624 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [pool-1-thread-3]: init
2012-10-20 19:17:16,631 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [pool-1-thread-3]: CLDB Host: 10.0.42.200, CLDB IP: 7222 is attempting to become a master. Retrying !
2012-10-20 19:17:51,635 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [pool-1-thread-3]: Bad CLDB credentials removed: CLDB Host: 10.0.42.200, CLDB IP: 7222
</pre>

<br><br>
mfsinit.log:
<br>
<code>
Disabling disk cache and set max io size on mapr disks
Disabling the disk cache on mapr disk: /dev/loop0
 HDIO_DRIVE_CMD(identify) failed: Invalid argument
 HDIO_DRIVE_CMD(flushcache) failed: Invalid argument
 HDIO_DRIVE_CMD(setcache) failed: Invalid argument
 HDIO_DRIVE_CMD(identify) failed: Invalid argument
 HDIO_DRIVE_CMD(flushcache) failed: Invalid argument
 HDIO_DRIVE_CMD(identify) failed: Invalid argument

/dev/loop0:
 setting drive write-caching to 0 (off)
unable to access /dev/loop0, ATA disk?
Disabling the disk cache on mapr disk: /dev/loop0: Failed
Setting max_sectors_kb for mapr disk: loop0
Error: Set max_sectors_kb to 1024 failed on mapr disk: loop0
Sat Oct 20 19:33:44 CST 2012
</code>

<br><br>
hoststats.log:
<br><code>
Setting continuous mode
2012-10-20 18:56:49,6606 Program: hoststats on Host:  IP: 0.0.0.0, Port: 1111, PID: 1945
2012-10-20 18:56:51,7311 ERROR Hoststats hoststats.cc:213 Thread: 140616744883968 Oops! Failed to connect to maprfs!  with error count since last shown error: 0
2012-10-20 19:04:12,7701 ERROR Hoststats hoststats.cc:213 Thread: 140616744883968 Oops! Failed to connect to maprfs!  with error count since last shown error: 0
2012-10-20 19:12:17,8077 ERROR Hoststats hoststats.cc:213 Thread: 140616744883968 Oops! Failed to connect to maprfs!  with error count since last shown error: 10
2012-10-20 19:20:32,8823 ERROR Hoststats hoststats.cc:213 Thread: 140616794593088 Oops! Failed to connect to maprfs!  with error count since last shown error: 10
2012-10-20 19:25:33,1069 ERROR Hoststats hoststats.cc:213 Thread: 140616794593088 Oops! Failed to connect to maprfs!  with error count since last shown error: 43
2012-10-20 19:30:41,9174 ERROR Hoststats hoststats.cc:213 Thread: 140616744883968 Oops! Failed to connect to maprfs!  with error count since last shown error: 38
Hoststats is shutting down with signal: 15. No further requests will be served**** starting hoststats **** args: 5660 /opt/mapr/logs/TaskTracker.stats -S 1
isGatherStats=true
</code>
<br> <br>
P.S. I use a /dev/loop0 on node0, node3, node4, which is mounted from a file( dd ), is this OK for mapr fs?


<br>

<br>

The above information may be in chocas, but is all what i got,  would you kind enough to tell me how to resovle these problems?

<br>

<br>

Thanks very much!!
<br>

Dengshaodong

Outcomes