AnsweredAssumed Answered

Exception in Rpc.initialize java.io.IOException: Error in RPC init, CLDBShutdown: CLDB had master lock and was waiting for its local mf

Question asked by suxingfate on Jan 20, 2015
Latest reply on Jan 22, 2015 by suxingfate
**First Problem**

I am using MapR3.1, with 4 nodes. and CLDB is running in Node2.

after mapr is installed,
Node1: maprcli node cldbmaster

Node2:maprcli node cldbmaster

Node3:maprcli node cldbmaster

Node4:maprcli node cldbmaster

When execute the command in node1 and node2, there is error

listen: error 137
Exception in Rpc.initialize java.io.IOException: Error in RPC init
listen: error 139
listen: error 142
listen: error 145
ERROR (10009) -  Couldn't connect to the CLDB service

However, there is no problem when executing in node3 and node4. It is very suprising that, cldb is running in node2, however node2 itself can't connect to cldb !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

hadoop fs -ls /    is not stable, espeically in node1 and node2, however perform well in node3 and node4


Node2:~ # hadoop fs ls /
Bad connection to FS. command aborted. exception: Failed to initialize config
listen: error 88
Failed to initialize Rpc, error Resource temporarily unavailable(-11)

 1. warden.log

2015-01-21 14:08:08,093 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: /opt/mapr/server/pullcentralconfig process terminated with status: 0
2015-01-21 14:10:13,320 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [tasktracker_monitor]: Monitor command: [/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh, status, tasktracker]can not determine if service: tasktracker is running. Retrying. Retrial #2. Total retries count is: 3
2015-01-21 14:10:13,321 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [tasktracker_monitor]: /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh: line 199: kill: (2229) - No such process
/opt/mapr/hadoop/hadoop-0.20.2/bin/../pids/hadoop-mapr-tasktracker.pid exists with pid 2229 but no tasktracker.

2015-01-21 14:10:13,321 INFO  com.mapr.warden.service.baseservice.Service$ServiceRun [tasktracker_monitor]: Command: [/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh, start, tasktracker], Directory: /opt/mapr/hadoop/hadoop-0.20.2/bin
2015-01-21 14:10:14,357 INFO  com.mapr.warden.service.baseservice.Service$ServiceRun [tasktracker_monitor]: starting tasktracker, logging to /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/hadoop-mapr-tasktracker-SC-2.out

2015-01-21 14:10:24,357 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: [e_SERV_RUN, hostName, ma_host, ma_process]
2015-01-21 14:10:24,357 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [tasktracker_monitor]: []
2015-01-21 14:13:07,319 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: Launching a separate process to execute /opt/mapr/server/pullcentralconfig
2015-01-21 14:13:08,087 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: /opt/mapr/server/pullcentralconfig process terminated with status: 0
2015-01-21 14:18:07,319 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: Launching a separate process to execute /opt/mapr/server/pullcentralconfig
2015-01-21 14:18:08,072 INFO  com.mapr.warden.centralconfig.PullCentralConfigTaskScheduler [PullCentralConfigTask]: /opt/mapr/server/pullcentralconfig process terminated with status: 0
2015-01-21 14:20:24,638 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [tasktracker_monitor]: Monitor command: [/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh, status, tasktracker]cannot determine if service: tasktracker is running. Number of retrials exceeded. Closing Zookeeper

 1. cldb.log

2015-01-21 13:13:55,861 INFO CLDBServer [RPC-2]: ContainerCreate  Container ID:2185 vol:2108831 Servers:  169.254.200.1(2) 169.254.200.2
2015-01-21 13:14:19,443 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDeleted
2015-01-21 13:14:19,443 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 13:14:19,445 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:14:19,446 ERROR ZKDataRetrieval [RPC-10-EventThread]: Can not get data from /services_config/tasktracker/SC-1with error: KeeperErrorCode = NoNode for /services/tasktracker/SC-1
2015-01-21 13:14:20,205 WARN Alarms [RPC-4]: Alarm raised: NODE_ALARM_SERVICE_TT_DOWN; Cluster: my.cluster.com; Node: SC-1; Message: Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs
2015-01-21 13:19:19,569 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDeleted
2015-01-21 13:19:19,569 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 13:19:19,572 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:19:19,574 ERROR ZKDataRetrieval [RPC-10-EventThread]: Can not get data from /services_config/tasktracker/SC-2with error: KeeperErrorCode = NoNode for /services/tasktracker/SC-2
2015-01-21 13:19:20,333 WARN Alarms [RPC-4]: Alarm raised: NODE_ALARM_SERVICE_TT_DOWN; Cluster: my.cluster.com; Node: SC-2; Message: Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs
2015-01-21 13:19:37,058 WARN CLDBServer [RPC-1]: ModifyACL request from 169.254.200.2:61463 by uid 0 on cluster my.cluster.com.Previous ACL mapr:login,ss,cv,a,fc,root:login,ss,cv,a,fc,New ACL mapr:login,ss,cv,fc,root:login,ss,cv,a,fc
2015-01-21 13:44:49,458 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 13:44:49,464 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:44:49,466 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:44:50,220 WARN Alarms [RPC-9]: NODE_ALARM_SERVICE_TT_DOWN cleared,
2015-01-21 13:45:01,612 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:45:20,398 WARN Alarms [HB-3]: Alarm raised: NODE_ALARM_CORE_PRESENT; Cluster: my.cluster.com; Node: SC-2; Message: Node has core file(s)
2015-01-21 13:49:49,587 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 13:49:49,595 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:49:49,597 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 13:49:50,382 WARN Alarms [RPC-10]: NODE_ALARM_SERVICE_TT_DOWN cleared,
2015-01-21 13:50:01,776 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 14:15:24,465 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDeleted
2015-01-21 14:15:24,465 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 14:15:24,468 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-1. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 14:15:24,468 ERROR ZKDataRetrieval [RPC-10-EventThread]: Can not get data from /services_config/tasktracker/SC-1with error: KeeperErrorCode = NoNode for /services/tasktracker/SC-1
2015-01-21 14:15:25,229 WARN Alarms [RPC-3]: Alarm raised: NODE_ALARM_SERVICE_TT_DOWN; Cluster: my.cluster.com; Node: SC-1; Message: Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs
2015-01-21 14:20:24,640 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDeleted
2015-01-21 14:20:24,641 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged
2015-01-21 14:20:24,644 INFO ZKDataRetrieval [RPC-10-EventThread]: Process path: /services_config/tasktracker/SC-2. Event state: SyncConnected. Event type: NodeDataChanged
2015-01-21 14:20:24,645 ERROR ZKDataRetrieval [RPC-10-EventThread]: Can not get data from /services_config/tasktracker/SC-2with error: KeeperErrorCode = NoNode for /services/tasktracker/SC-2
2015-01-21 14:20:25,404 WARN Alarms [RPC-4]: Alarm raised: NODE_ALARM_SERVICE_TT_DOWN; Cluster: my.cluster.com; Node: SC-2; Message: Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs

 1. maprcli.log

Header: hostName: SC-2, Time Zone: China Standard Time, processName: MapRCLI, processId: 553, MapR Build Version: 3.1.1.26113.GA
2015-01-21 14:10:05,166 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: [alarm, clear, -alarm, NODE_ALARM_TT_LOCALDIR_FULL, -entity, SC-2]
2015-01-21 14:10:05,195 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [main]: Exception while trying to bind on: CLDB Ips: 169.254.200.2-, Port: 7222
2015-01-21 14:10:05,197 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: Bad CLDB credentials removed: CLDB Ips: 169.254.200.2-, Port: 7222
2015-01-21 14:10:05,198 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: ERROR (10003) -  Failed to connect to CLDB

Header: hostName: SC-2, Time Zone: China Standard Time, processName: MapRCLI, processId: 31938, MapR Build Version: 3.1.1.26113.GA
2015-01-21 14:20:16,482 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: [alarm, clear, -alarm, NODE_ALARM_TT_LOCALDIR_FULL, -entity, SC-2]
2015-01-21 14:20:16,511 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [main]: Exception while trying to bind on: CLDB Ips: 169.254.200.2-, Port: 7222
2015-01-21 14:20:16,513 INFO  com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main]: Bad CLDB credentials removed: CLDB Ips: 169.254.200.2-, Port: 7222
2015-01-21 14:20:16,514 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: ERROR (10003) -  Failed to connect to CLDB

Header: hostName: SC-2, Time Zone: China Standard Time, processName: MapRCLI, processId: 32198, MapR Build Version: 3.1.1.26113.GA
2015-01-21 14:20:25,368 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]: [alarm, raise, -alarm, NODE_ALARM_SERVICE_TT_DOWN, -entity, SC-2, -description, Can not determine if service: tasktracker is running. Check logs at: /opt/mapr/hadoop/hadoop-0.20.2/logs]
2015-01-21 14:20:25,406 INFO  com.mapr.cliframework.driver.CLIMainDriver [main]:


Do you have any clue, what would be the cause ?

========================================================================================================================================================================================================


**Second Problem**















when I removed cldb from node2, and installed cldb in node3, it is even worse

 - warden.log

listen: error 149
2015-01-21 19:12:25,304 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [cldb_monitor]: Monitor command: [/etc/init.d/mapr-cldb, status]can not determine if service: cldb is running. Retrying. Retrial #1. Total retries count is: 3
2015-01-21 19:12:25,304 ERROR com.mapr.warden.service.baseservice.Service$ServiceMonitorRun run [cldb_monitor]: 0.20.2
/opt/mapr/pid/cldb.pid exists with pid 14253 but no CLDB.

2015-01-21 19:12:25,305 INFO  com.mapr.warden.service.baseservice.Service$ServiceRun [cldb_monitor]: Command: [/etc/init.d/mapr-cldb, start], Directory: /etc/init.d
2015-01-21 19:12:26,382 INFO  com.mapr.warden.service.baseservice.Service$ServiceRun [cldb_monitor]: 0.20.2
Starting CLDB, logging to /opt/mapr/logs/cldb.log

listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 149
2015-01-21 19:12:36,383 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [cldb_monitor]: [e_SERV_RUN, hostName, ma_host, ma_process]
2015-01-21 19:12:36,383 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [cldb_monitor]: []
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 149
2015-01-21 19:12:36,383 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [cldb_monitor]: [e_SERV_RUN, hostName, ma_host, ma_process]
2015-01-21 19:12:36,383 INFO  com.mapr.job.mngmnt.hadoop.metrics.WardenRequestBuilder [cldb_monitor]: []
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
listen: error 138
listen: error 140
listen: error 143
listen: error 146
listen: error 149
                                              

 - cldb.log



2015-01-21 19:12:26,253 INFO CLDB [main]: CLDBInit: Starting HTTP Server
2015-01-21 19:12:26,253 INFO HttpServer [main]: WebServer: Starting WebServer
2015-01-21 19:12:26,254 INFO HttpServer [main]: Listener started on SelectChannelConnector@0.0.0.0:7221 port 7221
2015-01-21 19:12:26,254 INFO HttpServer [main]: Starting Jetty WebServer
2015-01-21 19:12:26,254 INFO log [main]: jetty-6.1.26
2015-01-21 19:12:26,577 INFO log [main]: Started SelectChannelConnector@0.0.0.0:7221
2015-01-21 19:19:26,164 FATAL CLDB [WaitForLocalKvstore Thread]: CLDBShutdown: CLDB had master lock and was waiting for its local mfs to become Master.Waited for 7 (minutes) but mfs did not become Master. Shutting down CLDB to release master lock.
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: kvwriter ID: 23
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: sun.misc.Unsafe.park(Native Method)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: com.mapr.kvstore.KvStoreWriter$KvStoreWriterThread.run(KvStoreWriter.java:42)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:745)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: WaitForLocalKvstore Thread ID: 24
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.dumpThreads(Native Method)
2015-01-21 19:19:26,165 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.getAllStackTraces(Thread.java:1640)
                                                                                                                      


Outcomes