I had an issue this weekend where my MapR cluster died and I had to rebuild the entire thing from scratch. It is a CE test bed, so nothing critical, but it still took time from the weekend.
The issue was around ZooKeeper. I had ZK spread across a 3-node cluster so that one node could be the leader and two could be followers; however, I could not get the cluster to come back up after some routine maintenance by following the MapR 6.0 documentation.
In some cases, the "myid" file went missing completely, as did the directory stored in /opt/mapr/zookeeper/zoo.../
The instructions were not clear on how to determine which machine was the leader, or in which order to bring up the mapr-zookeeper service. They seemed to imply that it didn't matter and to just "perform a rolling restart", but I think the order does matter. I suspect they may need to be brought up in the order leader -> follower 1 -> follower 2, but I couldn't figure out which node was the leader. I wanted to look in the myid files for a "0" (zero), but as I said, the file was missing in some cases, and I don't know why. Do the nodes need to be brought up in a certain order?
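For what it's worth, here is a sketch of how I understand you can identify the leader without relying on myid (leadership is elected at runtime, so it isn't recorded in that file). On each node you could run `service mapr-zookeeper qstatus`, or use ZooKeeper's four-letter `stat` command against MapR's ZK client port 5181. The simulated output below is just so the snippet runs without a live cluster:

```shell
# On a live node you would run one of:
#   service mapr-zookeeper qstatus
#   echo stat | nc <node> 5181 | grep Mode
# Exactly one node should report "Mode: leader"; the others "Mode: follower".
# Simulated `stat` output (hypothetical values) for a standalone demo:
sample_stat="Zookeeper version: 3.4.x
Latency min/avg/max: 0/1/10
Mode: leader
Node count: 512"
printf '%s\n' "$sample_stat" | grep '^Mode:'
```

Running the `grep` over real `stat` output on each of the three nodes should tell you who the current leader is.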
What are the critical config files that ZK reads from? It is more than just zoo.cfg; I believe it also reads from various other conf files?
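For reference, here is a minimal sketch of the two files a standard ZooKeeper quorum depends on; the exact paths and values below are assumptions for a MapR install (check your own zoo.cfg), but the structure is standard:

```ini
# zoo.cfg under /opt/mapr/zookeeper/zookeeper-<version>/conf/ (path varies by release)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/mapr/zkdata        ; hypothetical dataDir, check your install
clientPort=5181                 ; MapR uses 5181 rather than the stock 2181
; one server.N line per quorum member; N must match that node's myid file
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
```

The myid file lives inside dataDir and contains only the single number N matching that node's server.N line in zoo.cfg; it identifies the node, it does not mark the leader. Logging is typically configured separately in log4j.properties in the same conf directory.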
This may be better suited for a discussion thread, but what are the best practices for administering ZK on the MapR nodes? Is there a way to monitor ZK via the MCS, like there is in Cloudera Manager?
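Even without MCS integration, ZooKeeper's built-in four-letter words give basic health monitoring from the command line. A small sketch, again using simulated output so it runs standalone (real values come from your cluster):

```shell
# On a live node you would run:
#   echo ruok | nc localhost 5181   # replies "imok" if the server is up
#   echo mntr | nc localhost 5181   # one tab-separated metric per line
# Simulated `mntr` output (hypothetical values) for a standalone demo;
# the awk filter pulls out the server's current role:
mntr_output="zk_version	3.4.x
zk_avg_latency	0
zk_server_state	follower
zk_znode_count	120"
printf '%s\n' "$mntr_output" | awk -F'\t' '$1 == "zk_server_state" {print $2}'
```

Wrapping the `mntr` call in a cron job or a Nagios/Zabbix check is a common way to watch ZK on clusters whose management UI doesn't surface it.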