
node services not coming up

Question asked by elin on Mar 13, 2013
Latest reply on Mar 13, 2013 by Ted Dunning
We have a 2-node cluster: node 1 runs cldb + zookeeper + nfs + fileserver + tasktracker, and node 2 runs webserver + fileserver + jobtracker + tasktracker. There was a failure and both nodes went down uncleanly. I brought the cluster back up, and the processes on node 1 appear to be running, but none of the services on node 2 come up.

I've tried stopping everything, including zookeeper, and restarting. I also tried bringing up the webserver manually; it complains that node 2 can't connect to the cldb, as do any maprcli commands issued from node 2. It doesn't seem to be related to the failover delay either, since it has been trying for 19+ hours now. The warden log on node 2 has errors saying it can't connect to zookeeper, yet the zookeeper host and port in those messages look correct. Any help would be appreciated!
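For reference, the stop/restart steps amounted to roughly the following (standard MapR service commands; I'm reproducing them from memory, so take this as a sketch of what was run rather than an exact transcript):

    # on node 1 (zookeeper + cldb + fileserver): restart zookeeper, then warden
    service mapr-zookeeper restart
    service mapr-zookeeper qstatus      # with a single zookeeper it should report it is up (standalone)
    service mapr-warden restart

    # on node 2: restart warden, which should bring webserver/jobtracker/tasktracker back
    service mapr-warden restart

    # from node 2: any cldb-backed command just fails with a connection error
    /opt/mapr/bin/maprcli node list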

Snippets of the relevant logs are below.


cldb.log on node 1:

    2013-03-12 15:26:52,199 INFO ActiveContainersMap [Monitor Thread for Heartbeat/License]: ContainerUpdate  Container ID:2123 VolumeId:168252501 Master:10.160.24.170:5660--3-VALID Servers:  10.160.24.170:5660--3-VALID Inactive Servers:  Unused Servers:  10.160.24.229:5660--3 Latest epoch:3
    2013-03-12 15:26:52,203 INFO ActiveContainersMap [Monitor Thread for Heartbeat/License]: ContainerUpdate  Container ID:2049 VolumeId:170959866 Master:10.160.24.170:5660--6-VALID Servers:  10.160.24.170:5660--6-VALID Inactive Servers:  Unused Servers:  10.160.24.229:5660--4 Latest epoch:6
    2013-03-12 15:26:52,206 INFO ActiveContainersMap [Monitor Thread for Heartbeat/License]: ContainerUpdate  Container ID:2060 VolumeId:192297096 Servers:  Inactive Servers:  Unused Servers:  10.160.24.229:5660--4 Latest epoch:4
    2013-03-12 15:26:52,210 INFO ActiveContainersMap [Monitor Thread for Heartbeat/License]: ContainerUpdate  Container ID:2053 VolumeId:234088537 Master:10.160.24.170:5660--3-VALID Servers:  10.160.24.170:5660--3-VALID Inactive Servers:  Unused Servers:  10.160.24.229:5660--3 Latest epoch:3
    2013-03-12 15:26:52,211 INFO ZooKeeperClient [Monitor Thread for Heartbeat/License]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 Master:10.160.24.170:5660--9-VALID(8761517861414123294) Servers:  10.160.24.170:5660--9-VALID(8761517861414123294) Inactive Servers:  Unused Servers:  10.160.24.229:5660--7(3823136139902439350) Latest epoch:9 SizeMB:0
    2013-03-12 15:28:51,196 WARN Alarms [ReplicationManagerThread]: Alarm raised: VOLUME_ALARM_DATA_UNAVAILABLE; Cluster: Test_MapR_Cluster; Volume: mapr.qa-centos-amd64-22.sv.splunk.com.local.mapred; Message: Volume data unavailable
    2013-03-12 15:28:51,272 INFO ActiveVolumeMap [ReplicationManagerThread]: UpdateVolumeNeedsGfsckState: Volume 192297096 needsGfsck state set to true
    2013-03-12 15:31:36,988 INFO ActiveContainersMap [RPC-thread-1]: ContainerUpdate  Container ID:2062 VolumeId:121126150 Master:10.160.24.170:5660--5-VALID Servers:  10.160.24.170:5660--5-VALID Inactive Servers:  Unused Servers:  10.160.24.229:5660--4 Latest epoch:5
    2013-03-12 18:12:35,787 INFO ZKDataRetrieval [RPC-thread-7-EventThread]: Process path: /services_config/cldb/qa-centos-amd64-21.sv.splunk.com. Event state: SyncConnected. Event type: NodeDataChanged

warden.log on node 2:

    2013-03-13 10:12:48,102 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: Launching a separate process to execute /opt/mapr/server/pullcentralconfig
    2013-03-13 10:13:06,094 ERROR com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement connect [main]: Could not connect to ZK within: 30000 ms. Check if ZK connection defined correctly: qa-centos-amd64-21.sv.splunk.com:5181. No data from ZK will be returned.
    2013-03-13 10:13:36,095 ERROR com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement connect [main]: Could not connect to ZK within: 30000 ms. Check if ZK connection defined correctly: qa-centos-amd64-21.sv.splunk.com:5181. No data from ZK will be returned.
    2013-03-13 10:14:06,096 ERROR com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement connect [main]: Could not connect to ZK within: 30000 ms. Check if ZK connection defined correctly: qa-centos-amd64-21.sv.splunk.com:5181. No data from ZK will be returned.
    2013-03-13 10:14:19,997 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: /opt/mapr/server/pullcentralconfig process terminated with status: 0
    2013-03-13 10:14:36,097 ERROR com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement connect [main]: Could not connect to ZK within: 30000 ms. Check if ZK connection defined correctly: qa-centos-amd64-21.sv.splunk.com:5181. No data from ZK will be returned.
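The host and port in those errors (qa-centos-amd64-21.sv.splunk.com:5181, i.e. node 1) match what warden on node 2 is configured with. The obvious reachability checks from node 2 would be something like the lines below (generic checks, not MapR-specific; I believe the quorum is listed under zookeeper.servers in warden.conf):

    # what warden on node 2 thinks the zookeeper quorum is
    grep zookeeper.servers /opt/mapr/conf/warden.conf

    # can node 2 actually reach that host/port? a healthy zookeeper answers "imok"
    echo ruok | nc qa-centos-amd64-21.sv.splunk.com 5181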

mfs.log on node 2:

    2013-03-12 15:00:08,7747 ERROR  nodefailure.cc:1735 x.x.0.0:0 Get Container failed(19) on cid 2123 for marking stale.
    2013-03-12 15:00:08,7750 ERROR  nodefailure.cc:268 x.x.24.229:1111 CLDB responded with err No such device.(19) while reporting failure of 10.160.24.170:5660 to CLDB 10.160.24.170:7222 for 2080.
    2013-03-12 15:00:08,7750 ERROR  nodefailure.cc:364 x.x.24.229:1111 CLDB returned error No such device (19) while reporting failure of node 10.160.24.170:5660 as part of replicating ops for container 2080
    2013-03-12 15:00:08,7750 INFO  nodefailure.cc:1774 x.x.24.229:1111 Marking container 2080 stale
    2013-03-12 15:00:08,7750 INFO  nodefailure.cc:463 x.x.24.229:1111 Removing replica FSID 8761517861414123294 10.160.24.170:5660 for container (2080).
    2013-03-12 15:00:08,7750 ERROR  fileserver.cc:6695 x.x.0.0:0 heartbeat thread didn't get response for 91140 msec
    2013-03-12 15:00:08,7751 ERROR  fileserver.cc:6712 x.x.0.0:0 Heartbeat to cldb failed No such device (19). cldb:10.160.24.170:7222
    2013-03-12 15:00:08,7751 INFO  replicate.cc:293 x.x.24.229:1111 CLDB returned an error while processing op (39) with version (1077330)
    2013-03-12 15:00:08,7752 ERROR  versionnumbers.cc:500 x.x.24.229:1111 Not issuing a version number since this container has become stale. Failing the op with 2080
    2013-03-12 15:00:08,7752 ERROR  writev3.cc:557 x.x.24.229:1111 WriteFile: fid 2080.176.16842 offset 142936 count 401 Obtaining write VN failed 3
    2013-03-12 15:00:08,7753 INFO  spinit.cc:1136 x.x.0.0:0 SP SP1:/dev/sdb Containers Refs Dropped
    2013-03-12 15:00:08,8019 INFO  spinit.cc:1154 x.x.0.0:0 SP SP1:/dev/sdb Containers offlined
    2013-03-12 15:00:08,8023 INFO  spinit.cc:1197 x.x.0.0:0 SP SP1:/dev/sdb >Wait for BlockAliasPersist
    2013-03-12 15:00:08,8023 INFO  spinit.cc:1226 x.x.0.0:0 SP SP1:/dev/sdb >Stop Allocator
    2013-03-12 15:00:08,8023 INFO  allocator.cc:38 x.x.0.0:0 Alloc(0x2d74e30):Stop cptr(0x2d772c0) fidx(10) bmaps(200)
    2013-03-12 15:00:08,8024 INFO  spinit.cc:1261 x.x.0.0:0 SP SP1:/dev/sdb <allocator stopped
    2013-03-12 15:00:08,8027 INFO  spinit.cc:1317 x.x.0.0:0 SP SP1:/dev/sdb Log stopped
    2013-03-12 15:00:08,8622 INFO  spinit.cc:1356 x.x.0.0:0 SP SP1:/dev/sdb ICache Purged
    2013-03-12 15:00:08,8622 INFO  spinit.cc:1423 x.x.0.0:0 SP SP1:/dev/sdb offline Done
    2013-03-12 15:00:08,8622 INFO  iomgr.cc:1723 x.x.0.0:0 =========== All SPs offlined. =============
    2013-03-12 15:00:08,8622 INFO  spserver.cc:1078 x.x.0.0:0 < Shutdown exitMfs 1 ctx 0x7fdefc0036a0 err 0
    2013-03-12 15:00:08,8626 INFO  spserver.cc:1092 x.x.0.0:0 =========== *********************************** =============
    2013-03-12 15:00:08,8626 INFO  spserver.cc:1093 x.x.0.0:0 =========== All SPs offlined. Shutting down mfs =============
    2013-03-12 15:00:08,8626 INFO  spserver.cc:1094 x.x.0.0:0 =========== *********************************** =============
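That shutdown corresponds to the original failure (Mar 12, 15:00). After the restart I would expect the storage pool on node 2 to come back online; one way to check that (if I'm remembering the tool and paths correctly) would be:

    # on node 2: is mfs running, and is SP1 (/dev/sdb) back online?
    pgrep -lf /opt/mapr/server/mfs
    /opt/mapr/server/mrconfig sp list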

adminuiapp.log on node 2:

    2013-03-12 14:46:35,548 INFO  com.mapr.adminuiapp.commands.ConfigCallable [pool-1-thread-367]: ConfigCallable: Parameters = [config, load]
    2013-03-12 14:46:36,558 INFO  com.mapr.adminuiapp.commands.CLDBCallable [pool-1-thread-367]: CLDBCallable: Parameters = [node, heatmap, -view, status]
    2013-03-12 14:46:36,563 INFO  com.mapr.cli.common.ListCommand [pool-1-thread-367]: in hasMore:: Last fetched record count is 0. no more records to fetch
    2013-03-12 14:48:05,554 INFO  com.mapr.adminuiapp.commands.CLDBCallable [pool-1-thread-369]: CLDBCallable: Parameters = [dashboard, info]
    2013-03-12 14:48:05,554 INFO  com.mapr.adminuiapp.commands.ConfigCallable [pool-1-thread-370]: ConfigCallable: Parameters = [config, load]
    2013-03-12 14:48:05,613 INFO  com.mapr.adminuiapp.commands.CLDBCallable [pool-1-thread-369]: CLDBCallable: Parameters = [alarm, list, -summary, 1, -output, terse]
    2013-03-12 14:48:05,619 INFO  com.mapr.cli.common.ListCommand [pool-1-thread-369]: in hasMore:: Last fetched record count is 0. no more records to fetch
    2013-03-12 14:48:06,562 INFO  com.mapr.adminuiapp.commands.CLDBCallable [pool-1-thread-369]: CLDBCallable: Parameters = [node, heatmap, -view, status]
    2013-03-12 14:48:06,567 INFO  com.mapr.cli.common.ListCommand [pool-1-thread-369]: in hasMore:: Last fetched record count is 0. no more records to fetch


zookeeper log on node 1:

    2013-03-12 18:08:51,981 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn$Factory@251] - Accepted socket connection from /10.160.24.170:38711
    2013-03-12 18:08:51,993 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@777] - Client attempting to establish new session at /10.160.24.170:38711
    2013-03-12 18:08:52,000 - INFO  [SyncThread:0:NIOServerCnxn@1580] - Established session 0x13d60a792ac0014 with negotiated timeout 30000 for client /10.160.24.170:38711
    2013-03-12 18:08:52,427 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x13d60a792ac0014, likely client has closed socket
    2013-03-12 18:08:52,428 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /10.160.24.170:38711 which had sessionid 0x13d60a792ac0014
    2013-03-12 18:09:24,000 - INFO  [SessionTracker:ZooKeeperServer@316] - Expiring session 0x13d60a792ac0014, timeout of 30000ms exceeded
    2013-03-12 18:09:24,002 - INFO  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x13d60a792ac0014
    2013-03-12 18:12:14,786 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn$Factory@251] - Accepted socket connection from /10.160.24.170:38733
    2013-03-12 18:12:14,794 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@777] - Client attempting to establish new session at /10.160.24.170:38733
    2013-03-12 18:12:14,819 - INFO  [SyncThread:0:NIOServerCnxn@1580] - Established session 0x13d60a792ac0015 with negotiated timeout 30000 for client /10.160.24.170:38733
    2013-03-12 18:12:15,188 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x13d60a792ac0015, likely client has closed socket
    2013-03-12 18:12:15,190 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /10.160.24.170:38733 which had sessionid 0x13d60a792ac0015
    2013-03-12 18:12:35,563 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn$Factory@251] - Accepted socket connection from /10.160.24.170:38734
    2013-03-12 18:12:35,571 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@777] - Client attempting to establish new session at /10.160.24.170:38734
    2013-03-12 18:12:35,579 - INFO  [SyncThread:0:NIOServerCnxn@1580] - Established session 0x13d60a792ac0016 with negotiated timeout 30000 for client /10.160.24.170:38734
    2013-03-12 18:12:35,696 - INFO  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x13d60a792ac0016 type:create cxid:0x11 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/nodes Error:KeeperErrorCode = NodeExists for /nodes
    2013-03-12 18:12:35,721 - INFO  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x13d60a792ac0016 type:create cxid:0x12 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/nodes/qa-centos-amd64-21.sv.splunk.com Error:KeeperErrorCode = NodeExists for /nodes/qa-centos-amd64-21.sv.splunk.com
    2013-03-12 18:12:35,728 - INFO  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x13d60a792ac0016 type:create cxid:0x13 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/nodes/qa-centos-amd64-21.sv.splunk.com/services Error:KeeperErrorCode = NodeExists for /nodes/qa-centos-amd64-21.sv.splunk.com/services
    2013-03-12 18:12:36,095 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@639] - Exception causing close of session 0x13d60a792ac0016 due to java.io.IOException: Connection reset by peer
    2013-03-12 18:12:36,097 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /10.160.24.170:38734 which had sessionid 0x13d60a792ac0016
    2013-03-12 18:12:46,001 - INFO  [SessionTracker:ZooKeeperServer@316] - Expiring session 0x13d60a792ac0015, timeout of 30000ms exceeded
    2013-03-12 18:12:46,006 - INFO  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x13d60a792ac0015
    2013-03-12 18:13:06,001 - INFO  [SessionTracker:ZooKeeperServer@316] - Expiring session 0x13d60a792ac0016, timeout of 30000ms exceeded
    2013-03-12 18:13:06,007 - INFO  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x13d60a792ac0016
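One thing I notice: every session above comes from 10.160.24.170, which is node 1 itself; I don't see any connection attempt from node 2 (10.160.24.229) at all. To confirm which clients the zookeeper server currently has, the standard four-letter-word commands could be run on node 1 (just a suggestion):

    # list attached clients by ip:port and session id
    echo cons | nc localhost 5181

    # server stats, including a short client list
    echo stat | nc localhost 5181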





