AnsweredAssumed Answered

after disk offline cluster not work

Question asked by seamus on Jul 9, 2017
Latest reply on Jul 11, 2017 by seamus

Hi all,

   Help!

    My cldb node encountered a problem, the disk is offline, it has been restored now , but the cldb node can not be restarted.

I get error on /opt/mapr/logs/warden.log

Header 1
no jobtracker to stop
Header: hostName: d2.infopower.com, Time Zone: China Standard Time, processName: warden, processId: 50127, MapR Build Version: 5.2.0.39122.GA
2017-07-10 11:50:30,833 INFO com.mapr.warden.WardenMain [main]: Log dir: /opt/mapr/hadoop/hadoop-0.20.2/logs
2017-07-10 11:50:30,834 INFO com.mapr.warden.WardenMain [main]: Log dir: /opt/mapr/hadoop/hadoop-2.7.0/logs
2017-07-10 11:50:30,888 INFO com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContext [main]: init MAPRContext
2017-07-10 11:50:30,889 INFO com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContext [main]: init MAPRContextHS
2017-07-10 11:50:30,901 WARN com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContext [main]: Error while trying to get correct Ticket with errorCode: 2
2017-07-10 11:50:30,905 WARN com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContext [Thread-5]: Error while trying to get correct Ticket with errorCode: 2
2017-07-10 11:50:30,924 INFO com.mapr.warden.WardenMain [main]: My pid: 50549
2017-07-10 11:50:30,925 INFO com.mapr.warden.WardenManager [maprUserTicketGetCheckThread]: maprUserTicketExpiration: 922337203685477
Warden started
Warden started
In sysVol
2017-07-10 11:50:31,215 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils getDataForParticularCLDB [main-SendThread(d2.infopower.com:5181)]: No data returned in
RPC: CLDB Ips: 192.168.0.22-, Port: 7222. Continue searching for correct CLDB
2017-07-10 11:50:31,219 INFO com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [main-SendThread(d2.infopower.com:5181)]: Bad CLDB credentials removed: CLDB Ips: 192.
168.0.22-, Port: 7222
2017-07-10 11:50:31,219 WARN com.mapr.security.maprsasl.MaprSecurityLoginModule [main-SendThread(d2.infopower.com:5181)]: Unable to generate the server key.
2017-07-10 11:50:31,247 INFO com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement [main]: Connected to ZK: d0:5181,d1:5181,d2:5181Wi
th State: State:CONNECTED Timeout:30000 sessionid:0x25d2a7815640048 local:/192.168.0.22:35982 remoteserver:d2.infopower.com/192.168.0.22:5181 lastZxid:0 xid:1 sent:
1 recv:1 queuedpkts:0 pendingresp:0 queuedevents:0
2017-07-10 11:50:31,249 INFO com.mapr.warden.service.baseservice.zksessionmgmnt.ZookeeperClientSessionManagement [main-EventThread]: ZK Connect state:State:CONNECT
ED Timeout:30000 sessionid:0x25d2a7815640048 local:/192.168.0.22:35982 remoteserver:d2.infopower.com/192.168.0.22:5181 lastZxid:0 xid:1 sent:1 recv:1 queuedpkts:0 p
endingresp:0 queuedevents:0

 

 

I get error on /opt/mapr/logs/cldb.log

Header 1

Loaded library '/opt/mapr/server/permissions/libmapr_roles_refimpl.so' to resolve role membershipRoleMap: found 0 users and 0 roles.
Header: hostName: d2.infopower.com, Time Zone: China Standard Time, processName: cldb, processId: 116951, MapR Build Version: 5.2.0.39122.GA
2017-07-10 10:26:43,046 INFO CLDB [main]: Loading properties file : /opt/mapr/conf/cldb.conf
2017-07-10 10:26:43,392 INFO CLDBMetrics [main]: Initializing CLDB Metrics with serviceName: cldbServer
2017-07-10 10:26:43,398 INFO CLDB [main]: CLDBInit: Using hostname file /opt/mapr/hostname and hostid file /opt/mapr/hostid
2017-07-10 10:26:43,399 INFO CLDB [main]: CLDB Properties from configuration file: cldb.web.port=7221cldb.zookeeper.servers=d0:5181,d1:5181,d2:5181cldb.numthreads=1
0cldb.web.https.port=7443enable.replicas.invariant.check=falsehadoop.version=2.7.0cldb.port=7222cldb.min.fileservers=1cldb.detect.dup.hostid.enabled=falsenum.volmir
ror.threads=1cldb.jmxremote.port=7220
2017-07-10 10:26:43,399 INFO CLDB [main]: CLDB Command line args: /opt/mapr/conf/cldb.conf
2017-07-10 10:26:43,399 INFO CLDB [main]: CLDBInit: Initializing CLDB
2017-07-10 10:26:43,401 INFO CLDB [main]: MapR BuildVersion: 5.2.0.39122.GA
2017-07-10 10:26:43,401 INFO CLDB [main]: CLDBInit: Start CLDBServer
2017-07-10 10:26:43,489 INFO CLDBServer [main]: CLDBInit: HostName: d2.infopower.com ServerId: 7680175452834743301
2017-07-10 10:26:43,523 INFO CLDBServer [main]: Set the cldb key file to /opt/mapr/conf/cldb.key
2017-07-10 10:26:43,542 INFO CLDBServer [main]: Successfully set the server key
2017-07-10 10:26:43,542 INFO CLDBServer [main]: Successfully set the cluster key
2017-07-10 10:26:43,542 INFO CLDB [main]: CLDBState: CLDB State change : INITIAZING
2017-07-10 10:26:43,553 INFO CLDBThreadPools [main]: Initializing cleanup pool with 2 threads
2017-07-10 10:26:43,614 INFO CLDBServer [main]: Initializing Cldb JNI with memory 810903142 bytes
2017-07-10 10:26:43,632 INFO ZooKeeperClient [main]: ZooKeeperClient init: zk timeout = 30000 ms

...

2017-07-10 10:26:44,412 INFO CLDB [main]: CLDBInit: Starting HTTP Server
2017-07-10 10:26:44,412 INFO HttpServer [main]: WebServer: Starting WebServer
2017-07-10 10:26:44,414 INFO HttpServer [main]: Listener started on SelectChannelConnector@0.0.0.0:7221 port 7221
2017-07-10 10:26:44,435 INFO SslSocketConnectorSecure [main]: Exclude protocols: SSLv3,TLSv1
2017-07-10 10:26:44,435 INFO SslSocketConnectorSecure [main]: Enabled protocols: [SSLv2Hello, TLSv1.1, TLSv1.2]
2017-07-10 10:26:44,435 INFO HttpServer [main]: Listener started on HttpServer$1@0.0.0.0:7443 port 7443
2017-07-10 10:26:44,435 INFO HttpServer [main]: Starting Jetty WebServer
2017-07-10 10:26:44,435 INFO log [main]: jetty-6.1.26
2017-07-10 10:26:44,498 INFO CLDBServer [RPC-1]: Rejecting RPC 2345.229 from 192.168.0.22:21702 with status 3 as CLDB is waiting for local kvstore to become master.
2017-07-10 10:26:44,498 WARN CLDBServer [RPC-1]: RPC 2345.229 from 192.168.0.22:21702 Procedure ID not found
2017-07-10 10:26:44,723 INFO log [main]: Started SelectChannelConnector@0.0.0.0:7221
2017-07-10 10:26:44,723 INFO log [main]: Started HttpServer$1@0.0.0.0:7443
2017-07-10 10:26:53,954 INFO FileServerHandler [RPC-7]: FSRegister: Request FSID: 7680175452834743301 Build: 5.2.0.39122.GA PatchVersion: $Id: mapr-version: 5.2.0.
39122.GA 39122:49f386ea0304 $ FSNetworkLocation: FSHost:Port: 192.168.0.22-192.168.1.22- FSHost: Secondary Ports 5692- FSHostName: d2.infopower.com StoragePools c4
ae81fe777fd3f90058ab07620872d5- Capacity: 0 Available: 0 Used: 0 Role: 0 isDCA: false uniq: 9dae663958aee81d-5962e5e10dec9a Received registration request
2017-07-10 10:26:53,996 INFO Topology [RPC-7]: fsid:7680175452834743301 became reachable, removing from persist-store
2017-07-10 10:26:53,996 ERROR Topology [RPC-7]: Unable to remove entry from unreachableFSIdTable for fsId 7680175452834743301 current CLDB mode:MASTER_REGISTER_READ
Y
2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: FSRegister: Registered FileServer: 192.168.0.22-192.168.1.22- at topology /default-rack/d2.infopower.com/566
0
2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: FileServer Registration Request: Node Configuration
2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: NumCpus: 32 Avail Memory: 90378 Num Sps: 2 Num Instances: 1
2017-07-10 10:27:23,099 WARN CLDBServer [RPC-2]: RPC 2345.229 from 192.168.0.22:39870 Procedure ID not found
2017-07-10 10:27:45,039 INFO CLDBServer [Lookup-4]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:28:01,637 WARN CLDBServer [RPC-3]: RPC 2345.229 from 192.168.0.22:47923 Procedure ID not found
2017-07-10 10:28:40,302 WARN CLDBServer [RPC-7]: RPC 2345.229 from 192.168.0.22:60325 Procedure ID not found

....

2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: FSRegister: Registered FileServer: 192.168.0.22-192.168.1.22- at topology /default-rack/d2.infopower.com/566
0
2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: FileServer Registration Request: Node Configuration
2017-07-10 10:26:54,001 INFO FileServerHandler [RPC-7]: NumCpus: 32 Avail Memory: 90378 Num Sps: 2 Num Instances: 1
2017-07-10 10:27:23,099 WARN CLDBServer [RPC-2]: RPC 2345.229 from 192.168.0.22:39870 Procedure ID not found
2017-07-10 10:27:45,039 INFO CLDBServer [Lookup-4]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:28:01,637 WARN CLDBServer [RPC-3]: RPC 2345.229 from 192.168.0.22:47923 Procedure ID not found
2017-07-10 10:28:40,302 WARN CLDBServer [RPC-7]: RPC 2345.229 from 192.168.0.22:60325 Procedure ID not found
2017-07-10 10:28:45,049 INFO CLDBServer [Lookup-6]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:29:45,059 INFO CLDBServer [Lookup-3]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:30:45,069 INFO CLDBServer [Lookup-7]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:31:45,078 INFO CLDBServer [Lookup-5]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:32:45,089 INFO CLDBServer [Lookup-1]: Rejecting RPC 2345.5 from 192.168.0.22:63786 with status 3 as CLDB is waiting for local kvstore to become master
.
2017-07-10 10:33:44,060 FATAL CLDB [WaitForLocalKvstore Thread]: CLDBShutdown: CLDB had master lock and was waiting for its local mfs to become Master.Waited for 7
(minutes) but mfs did not become Master. Shutting down CLDB to release master lock.
2017-07-10 10:33:44,065 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: RMI TCP Accept-0 ID: 23
2017-07-10 10:33:44,065 ERROR CLDB [WaitForLocalKvstore Thread]: java.net.PlainSocketImpl.socketAccept(Native Method)
2017-07-10 10:33:44,065 ERROR CLDB [WaitForLocalKvstore Thread]: java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
2017-07-10 10:33:44,065 ERROR CLDB [WaitForLocalKvstore Thread]: java.net.ServerSocket.implAccept(ServerSocket.java:545)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.net.ServerSocket.accept(ServerSocket.java:513)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:400)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:372)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.lang.Thread.run(Thread.java:745)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: Thread: ACR-4 ID: 78
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: sun.misc.Unsafe.park(Native Method)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchroni
zer.java:2039)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
2017-07-10 10:33:44,066 ERROR CLDB [WaitForLocalKvstore Thread]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

   

I get error on /opt/mapr/logs/mfs.log-3

Header 1
2017-07-10 10:26:41,9895 INFO IOMgr lun.cc:1127 /dev/sdd1 LoadDisk 0x43bb600 retry 0
2017-07-10 10:26:41,9895 INFO IOMgr lun.cc:834 disk /dev/sdd1 numaid 0
2017-07-10 10:26:41,9895 ERROR IOMgr lun.cc:855 Unable to open file error 2
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:1115 Loading disk:/dev/sde1
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:1127 /dev/sde1 LoadDisk 0x43bb968 retry 0
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:834 disk /dev/sde1 numaid 0
2017-07-10 10:26:41,9896 ERROR IOMgr lun.cc:855 Unable to open file error 2
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:1115 Loading disk:/dev/sdf1
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:1127 /dev/sdf1 LoadDisk 0x43bbcd0 retry 0
2017-07-10 10:26:41,9896 INFO IOMgr lun.cc:834 disk /dev/sdf1 numaid 0
2017-07-10 10:26:41,9896 ERROR IOMgr lun.cc:855 Unable to open file error 2
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:1115 Loading disk:/dev/sdb1
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:1127 /dev/sdb1 LoadDisk 0x43bc038 retry 0
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:834 disk /dev/sdb1 numaid 0
2017-07-10 10:26:41,9897 ERROR IOMgr lun.cc:855 Unable to open file error 2
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:1115 Loading disk:/dev/sdc1
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:1127 /dev/sdc1 LoadDisk 0x43bc3a0 retry 0
2017-07-10 10:26:41,9897 INFO IOMgr lun.cc:834 disk /dev/sdc1 numaid 0
2017-07-10 10:26:41,9897 ERROR IOMgr lun.cc:855 Unable to open file error 2
2017-07-10 10:26:41,9949 INFO IOMgr lun.cc:1423 /dev/sdc1 Disk Loaded
2017-07-10 10:26:41,9949 INFO IOMgr lun.cc:1424 Disk /dev/sdc1 loaded numRecords 2
2017-07-10 10:26:41,9956 INFO IOMgr lun.cc:1306 Disk /dev/sdb1, Loading concat DG Concat4-3 readystate(1)
2017-07-10 10:26:41,9956 INFO IOMgr iomgr.cc:1566 SP SP2 found on disk /dev/sdb1
2017-07-10 10:26:41,9956 INFO IOMgr lun.cc:1423 /dev/sdb1 Disk Loaded
2017-07-10 10:26:41,9956 INFO IOMgr lun.cc:1424 Disk /dev/sdb1 loaded numRecords 3
2017-07-10 10:26:41,9974 INFO IOMgr lun.cc:1423 /dev/sde1 Disk Loaded
2017-07-10 10:26:41,9974 INFO IOMgr lun.cc:1424 Disk /dev/sde1 loaded numRecords 2
2017-07-10 10:26:42,0104 INFO IOMgr lun.cc:1423 /dev/sdf1 Disk Loaded
2017-07-10 10:26:42,0104 INFO IOMgr lun.cc:1424 Disk /dev/sdf1 loaded numRecords 2
2017-07-10 10:26:42,0146 INFO IOMgr lun.cc:1306 Disk /dev/sdd1, Loading concat DG Concat1-3 readystate(1)
2017-07-10 10:26:42,0146 INFO IOMgr iomgr.cc:1566 SP SP1 found on disk /dev/sdd1
2017-07-10 10:26:42,0146 INFO IOMgr lun.cc:1423 /dev/sdd1 Disk Loaded
2017-07-10 10:26:42,0146 INFO IOMgr lun.cc:1424 Disk /dev/sdd1 loaded numRecords 3
2017-07-10 10:26:42,0146 INFO Global loadsp.cc:473 GetSPList 0x11b2b6a000
2017-07-10 10:26:42,0146 INFO Global loadsp.cc:485 GetSPList 0x11b2b6a000
2017-07-10 10:26:42,0146 INFO Global loadsp.cc:568 Split 0x11b2b6a000 spCount 0 Inst0: 2 0
2017-07-10 10:26:42,0147 INFO Global loadsp.cc:643 Split 2 total 2 spCount 0 Inst0: 2 0
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1477 Offline Disk /dev/sdd1 numRecords 3
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1488 DeleteFromLUNTable lun /dev/sdd1 0x416ae70 i 2
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1488 DeleteFromLUNTable lun /dev/sdd1 0x4163400 i 0
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1497 Closed fd in Disk /dev/sdd1 Offline. Status 0
2017-07-10 10:26:42,0147 INFO Global loadsp.cc:732 RemoveDisk /dev/sdd1 returned status 2
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1477 Offline Disk /dev/sde1 numRecords 2
2017-07-10 10:26:42,0147 INFO IOMgr lun.cc:1488 DeleteFromLUNTable lun /dev/sde1 0x4163220 i 0

Outcomes