
Input nodes do not match any of the cluster nodes

Question asked by tskyers on Sep 23, 2011
Latest reply on Sep 26, 2011 by yufeldman
I am having an issue where, no matter how thoroughly I remove MapR code and configuration from all the systems involved, I get the following error:

    Input nodes do not match any of the cluster nodes

when interacting with any node other than the CLDB node.

I've run zkdatacleaner.sh on all nodes with ZooKeeper present. I've deleted the entire MapR directory tree, deleted the directories ZooKeeper creates under /tmp and /var, and created new cluster keys on the website. The only things I haven't done are delete any Java caches (only because I can't find any) and reinstall the OS (which I don't think would solve the problem).
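For reference, the cleanup I ran on each node looked roughly like this. The paths are from memory and the ZooKeeper data directory may differ on your install, so treat this as a sketch rather than an exact transcript:

    # Stop MapR services before cleaning up (run as root on each node)
    service mapr-warden stop
    service mapr-zookeeper stop           # only on nodes that had ZooKeeper installed

    # Clear ZooKeeper state on the ZooKeeper nodes
    # (I'm assuming the script lives under /opt/mapr/server; adjust if yours differs)
    /opt/mapr/server/zkdatacleaner.sh

    # Remove the MapR tree and the scratch directories ZooKeeper created
    rm -rf /opt/mapr
    rm -rf /tmp/zookeeper /var/zookeeper  # wherever ZooKeeper put its data on your nodes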

With that said I have two questions:
1) Is there a document, or group of documents, with an exhaustive list of which MapR files are responsible for which MapR components? E.g. /opt/mapr/conf/warden.conf = CLDB startup, adminui startup, etc. I'd like to have some sort of map of where to look when things go wrong, rather than combing through every file in every directory every time. If this can be found on the Apache Hadoop site that is fine as well, but I've been coming up empty-handed on a number of searches on this particular topic.
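For example, the kind of map I'm after would look roughly like this (these are my own guesses at the associations, not something I've confirmed):

    # /opt/mapr/conf/warden.conf        -> warden: which services it starts and monitors (CLDB, adminui, JT/TT, ...)
    # /opt/mapr/conf/cldb.conf          -> CLDB server settings
    # /opt/mapr/conf/mapr-clusters.conf -> cluster name and the CLDB nodes this node talks to
    # /opt/mapr/conf/mfs.conf           -> MFS (fileserver) settings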

2) I've followed the documentation exactly on how to add nodes to a cluster and how to clean ZooKeeper data, and I still have not been able to fix the above error ("Input nodes do not match any of the cluster nodes"). Some snippets from the logs:

Control node:

    root@MAPR1:/opt/mapr/logs# jps
    3558 WardenMain
    4465 CommandServer
    14689 Jps
    4477 JobTracker
    572 CommandServer
    3911 CLDB
    3503 QuorumPeerMain
    root@MAPR1:/opt/mapr/logs#

**CLDB.LOG**

    2011-09-23 08:29:57,571 INFO  com.mapr.fs.cldb.ActiveContainersMap [pool-1-thread-262]: BatchUpdate containerUpdate CID: 2050 Container ID:2050 Master:192.168.106.67:5660--3-VALID(1733412610376772269) Servers:  192.168.106.67:5660--3-VALID(1733412610376772269) 192.168.106.65:5660--3-VALID(7735693932272923427) 192.168.106.66:5660--3-VALID(7451503201337471211) Inactive Servers:  Unused Servers:  Latest epoch:3 SizeMB:0
    2011-09-23 08:29:57,635 INFO  com.mapr.fs.cldb.zookeeper.ZooKeeperClient [pool-1-thread-262]: Storing KvStoreContainerInfo to ZooKeeper  Container ID:1 VolumeId:1 Master:192.168.106.65:5660--5-VALID Servers:  192.168.106.65:5660--5-VALID 192.168.106.66:5660--5-VALID 192.168.106.67:5660--5-VALID Inactive Servers:  Unused Servers:  Latest epoch:5
    2011-09-23 08:30:32,868 WARN  com.mapr.fs.cldb.alarms.Alarms [ReplicationManagerThread]: VOLUME_ALARM_DATA_UNDER_REPLICATED cleared, for volume mapr.cldb.internal
    2011-09-23 08:41:55,833 ERROR com.mapr.fs.cldb.CLDBServer [pool-1-thread-265]: VolumeLookup: VolName: mapr.MAPR2.lab.net.local.mapred Volume not found
    root@MAPR1:/opt/mapr/logs#

Please note, 192.168.106.66 and .67 no longer have CLDB or ZooKeeper installed. I've uninstalled those packages from those nodes and removed all related directories, etc. They are simply processing nodes now.
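For what it's worth, this is roughly how I've been checking which nodes the cluster actually knows about (output trimmed; the column selection is just what I happened to use):

    # On the CLDB node: which nodes does the cluster currently recognize?
    maprcli node list -columns hostname,ip

    # And which nodes does it believe are running CLDB and ZooKeeper?
    maprcli node listcldbzks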

Still from control node:
**WARDEN.LOG**

    Header: hostName: MAPR1.lab.net, Time Zone: Eastern Standard Time, processName: warden, processId: 3528
    2011-09-23 08:27:51,084 INFO  com.mapr.warden.service.baseservice.Service [Thread-8-EventThread]: Process path: /services/nfs/master. Event state: SyncConnected. Event type: NodeDeleted
    2011-09-23 08:27:51,092 INFO  com.mapr.warden.service.baseservice.Service [Thread-8-EventThread]: MasterNode is:/services/nfs/master am I master: false
    2011-09-23 08:27:51,098 INFO  com.mapr.warden.service.baseservice.Service [Thread-8-EventThread]: Thread: 110, MasterIP: MAPR7.lab.net up
    2011-09-23 08:27:51,099 INFO  com.mapr.warden.service.baseservice.Service [Thread-8-EventThread]: Process path: /services/nfs/master. Event state: SyncConnected. Event type: NodeDataChanged
    2011-09-23 08:28:18,002 INFO  com.mapr.warden.WardenServer [main-EventThread]: Process path: /servers. Event state: SyncConnected. Event type: NodeChildrenChanged
    2011-09-23 08:29:33,105 INFO  com.mapr.warden.WardenServer [main-EventThread]: Process path: /servers. Event state: SyncConnected. Event type: NodeChildrenChanged
    2011-09-23 08:29:33,280 INFO  com.mapr.warden.service.baseservice.Service [Thread-7-EventThread]: Process path: /services/tasktracker. Event state: SyncConnected. Event type: NodeChildrenChanged

**MFS.LOG**

    2011-09-23 08:29:55,6843 INFO Replication fs/server/replication/containerresyncfromsnapshot.cc:89 clnt x.x.0.0:0 req 0 seq 6929664 Resyncing from cid 1 replica 1 txnVN 1048991 snapVN 32 writeVN 1048576 iscow 1 isundo 0, rollforwardcontainer 0 dumpSnapshotInode 1
    2011-09-23 08:29:56,7075 INFO Replication fs/server/replication/containerresync.cc:2141 clnt x.x.0.0:0 req 0 seq 6929664 Resync from snapshot completed srccid:1 replicacid:1 resynccid:1 err 0 svderr 0
    2011-09-23 08:29:56,7287 INFO Replication fs/server/replication/containerresync.cc:2833 clnt x.x.0.0:0 req 0 seq 0 ResyncContainer complete srccid 1 replicacid 1 err 0x0
    
    2011-09-23 08:29:56,7287 INFO Replication fs/server/replication/replicate.cc:1287 clnt x.x.0.0:0 req 0 seq 0 Adding 192.168.106.67:5660 as replica for container (1) after completing resync.
    2011-09-23 08:29:56,7287 INFO Replication fs/server/replication/containerresync.cc:2422 clnt x.x.0.0:0 req 0 seq 0 Deleting snapshot 4063809596
    2011-09-23 08:29:56,7287 INFO Container fs/server/container/delete.cc:976 clnt x.x.0.0:0 req 0 seq 0 Container delete request for cid 4063809596 cb 0x7bba10
    2011-09-23 08:29:56,7289 INFO Container fs/server/container/container.cc:3012 clnt x.x.0.0:771 req 3 seq 11089408 update state for container 4063809596 : removing old orphanEntry with opcode 64908768
    2011-09-23 08:29:56,7290 INFO Replication fs/server/replication/containerresync.cc:2474 clnt x.x.0.0:771 req 0 seq 6738432 Deleting resync container WA 0x3de16e0 cid 1
    2011-09-23 08:29:56,7290 INFO KvStore fs/server/mapserver/kvstoremultiop.cc:1311 clnt x.x.0.0:0 req 0 seq 7647488 Multiop on cid 1 without logflush took 1651 msec
    2011-09-23 08:29:56,7477 INFO Container fs/server/container/delete.cc:3667 clnt x.x.0.0:0 req 0 seq 2 Deleted container with cid 4063809596
    root@MAPR1:/opt/mapr/logs#

Misc Logs:

    root@MAPR1:/opt/mapr/logs# tail createJTVolume.log
    stat: cannot stat `/var/mapr/cluster/mapred': No such file or directory
    stat: cannot stat `/var/mapr/cluster/mapred': No such file or directory
    stat: cannot stat `/var/mapr/cluster/mapred': No such file or directory
    stat: cannot stat `/var/mapr/cluster/mapred': No such file or directory
    2011-09-22 15:48:49
    ---- Thu Sep 22 11:48:51 EDT 2011 --- ALL OK

I've verified forward and reverse DNS; local resolution via gethostip resolves to 127.0.1.1, and my hosts file is correct. How I'm checking resolution is sketched after the log excerpt below.

    ERROR (22) -  Unable to map host: MAPR1.lab.net to non-local ipaddress while creating volume mapr.MAPR1.lab.net.local.logs
    2011-09-22 11:54:15.236 MAPR1 createsystemvolumes.sh(3614) Install CreateLocalVolumeDirectories:170 CreateLocalVolume: Retrying after 20 seconds. RetryCnt: 1
    3
    ERROR (22) -  Unable to map host: MAPR1.lab.net to non-local ipaddress while creating volume mapr.MAPR1.lab.net.local.logs
    2011-09-22 11:54:37.746 MAPR1 createsystemvolumes.sh(3614) Install CreateLocalVolumeDirectories:170 CreateLocalVolume: Retrying after 20 seconds. RetryCnt: 1
    4
    2011-09-22 11:54:37.757 MAPR1 createsystemvolumes.sh(3614) Install CreateLocalVolumeDirectories:170 'logs' volume could not be created after multiple retries
    stat: cannot stat `/var/mapr/local/MAPR1.lab.net/logs': No such file or directory
    stat: cannot stat `/var/mapr/local/MAPR1.lab.net/logs': No such file or directory
    stat: cannot stat `/var/mapr/local/MAPR1.lab.net/logs': No such file or directory
    stat: cannot stat `/var/mapr/local/MAPR1.lab.net/logs': No such file or directory
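For completeness, the resolution check I keep coming back to looks like this (the hostnames and IPs below are illustrative, not pasted from the box):

    # What does the hostname resolve to locally?
    hostname -f
    getent hosts MAPR1.lab.net

    # My reading of the "non-local ipaddress" error is that the hostname needs to
    # map to the real interface address rather than a loopback one, e.g.
    #   192.168.106.65   MAPR1.lab.net MAPR1
    # instead of the Debian/Ubuntu-style
    #   127.0.1.1        MAPR1.lab.net MAPR1
    cat /etc/hosts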






