I couldn't understand why there were no alarms in MCS while the third node was unreachable via SSH and no services were running on it. I don't even know what caused that behaviour.
How should I do an RCA of such an issue?
Could you please first share more details about your environment and the MapR version you are on?
I am running a 3-node MapR v5.2 Community Edition cluster deployed using the AWS CloudFormation service available in the AWS Marketplace for MapR.
Sagar Sonawane Hello Sagar. You seem to be hitting quite a few quirky issues with MapR on AWS :-). The earlier thread, if I remember correctly, was surrounding the heart-beating issue, correct? And now, this.
That makes me suspect something at the connectivity level, either between the nodes in the cluster or between your client and the AWS instances themselves. Frankly, I am not sure where to start.
james sun Any thoughts on these kinds of behaviors on the AWS Marketplace?
Hello Mufeed Usman. Yes, you remember correctly. I actually found a couple more issues and was thinking of posting them separately, but since the context is already set here, I will post my experience in this thread.
This is my 3-node MapR cluster. Yes, there is no typo here; it actually is a 3-node cluster.
Here is the activity log from autoscaling group:
Now there is a catch too. We have a script that stops and starts the cluster every night and on weekends to save money, but this cluster was skipped based on tags we add while creating the instances. However, when the autoscaling service "terminated" that instance and launched a new one, the new instance did not have those tags, security groups, etc. See below for the reasons:
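For context, the tag-based skip logic could look something like the sketch below. The tag name `SkipScheduledStop` is an assumption (substitute whatever tag your scheduler actually checks), and the tag value is stubbed here so the logic can be dry-run; a real script would read it via `aws ec2 describe-tags`:

```shell
#!/bin/sh
# Hypothetical nightly stop-script fragment. "SkipScheduledStop" is an
# assumed tag name. In a real script the value would come from:
#   aws ec2 describe-tags --filters "Name=resource-id,Values=$INSTANCE_ID"
SKIP_TAG="true"   # stubbed so the decision logic can be dry-run offline

if [ "$SKIP_TAG" = "true" ]; then
    echo "instance tagged to skip scheduled stop; leaving it running"
else
    echo "stopping instance"   # aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
fi
```

A replacement instance launched by autoscaling would be missing this tag, so it would be stopped like any other node, which matches the behaviour described above.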
Well, when I saw this mess on Monday, I took the following steps:
1. Removed the terminated nodes from the cluster using the "Remove Node" action in MCS
2. Verified that the cluster was OK, with the data ingested up to that time, using the Volumes page in MCS
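For reference, both steps can also be done from the command line instead of MCS. The sketch below stubs out `maprcli` so the sequence can be dry-run without a live cluster, and the hostname is a placeholder:

```shell
#!/bin/sh
# Stub so this sketch can be dry-run without a live cluster; remove on a real node.
maprcli() { echo "would run: maprcli $*"; }

# 1. Remove the terminated node from the cluster (placeholder hostname)
maprcli node remove -nodes terminated-node-hostname

# 2. List volumes to verify the data ingested so far is intact
maprcli volume list -columns volumename,mountdir,used
```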
Then I saw something I no longer have a screenshot of, since the state has changed: the MapR Monitoring packages had vanished, along with my data in ES, since the default data path for ES is the local disk and not MapR-FS.
Then I reinstalled MapR Monitoring as per the documentation. I thought everything was OK, but when I looked at the Services section in MCS, the following services were in an incorrect state:
This clearly tells us that the post-autoscaling workflow did not bring the cluster node back to its previous stable state, or is simply not configured to do so. There is a gap in the bootstrapping scripts: ideally, they should query CLDB or ZooKeeper for the previous services/roles of that node and then install the packages accordingly before running configure.sh.
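A minimal sketch of what such a bootstrap step might do, assuming the replaced node's former service list can be obtained (here it is hard-coded so the mapping can be tried standalone; the `mapr-<role>` package naming matches the standard MapR role packages):

```shell
#!/bin/sh
# Sketch of a bootstrap step mapping a node's former roles to packages.
# The service list is hard-coded; a real script would obtain it from
# CLDB/ZooKeeper (e.g. the last known `maprcli node list -columns svc` output).
former_services="fileserver,nodemanager,resourcemanager,historyserver"

pkgs=""
for svc in $(echo "$former_services" | tr ',' ' '); do
    pkgs="$pkgs mapr-$svc"
done
echo "packages to reinstall before configure.sh:$pkgs"
```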
Then I tried starting the NodeManager service on all nodes, but it failed; the warden logs said it depends on the ResourceManager. I then used maprcli to start the ResourceManager on every node and got an error saying it was not configured on the node, from which I deduced that the ResourceManager service must have been running on the terminated node. So I installed "mapr-historyserver mapr-jobtracker mapr-resourcemanager" on another node. Then MCS looked like this:
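The deduction above (working out which node hosted a given role) can be scripted against the output of `maprcli node list -columns svc`. Below, a captured-style sample listing stands in for a live cluster; the hostnames and service sets are made up for illustration:

```shell
#!/bin/sh
# Sample in the style of `maprcli node list -columns svc` output, inlined so
# the parsing can be tried without a live cluster. Hostnames are hypothetical.
listing='hostname  service
node-a    cldb,fileserver,nfs,hoststats
node-b    fileserver,nodemanager,hoststats
node-c    fileserver,resourcemanager,historyserver,hoststats'

# Which node (if any) hosted the resourcemanager role?
echo "$listing" | awk '$2 ~ /resourcemanager/ {print $1}'
```

If the listing from before the termination is available (e.g. from a saved audit or monitoring snapshot), this immediately identifies the roles that must be reinstalled.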
Then I realized the mapr-jobtracker package was what caused the "Classic MapReduce" tab, which was not needed per my current requirements and earlier configuration, so I removed that package.
The current state of my cluster is as below:
As you can see, it is pretty much the expected state, except for the "1" failed count for the NFS Gateway. That is because I am using Community Edition, which allows only one NFS Gateway, and it is already running on 10-222-20-100. But when the autoscaling service launched the new instance, it configured that service on the node as well, hence the conflict.
In all, here are a few things to note:
1. I did not understand what caused the "instance reachability failure" that triggered autoscaling
2. The bootstrapping scripts in the Marketplace offering must be corrected to correctly identify a node's state, i.e. whether it is being freshly deployed via CloudFormation or redeployed via autoscaling
3. I was surprised that the data in MapR-FS was retained, and glad to learn how resilient it is
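On point 1, the RCA trail for an "instance reachability" failure usually starts in the autoscaling activity history and the EC2 status checks. The commands in the comments are real AWS CLI calls but need credentials; the sample activity text is made up so the filtering step can be tried offline:

```shell
#!/bin/sh
# Where to look for the RCA of an instance reachability failure (needs AWS creds):
#   aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg-name>
#   aws ec2 describe-instance-status --instance-ids <id> --include-all-instances
#   aws cloudwatch describe-alarm-history --alarm-name <alarm-name>

# Made-up sample of activity descriptions, to show the kind of line to look for:
activities='Launching a new EC2 instance
Terminating EC2 instance: taken out of service in response to an EC2 instance status checks failure'

echo "$activities" | grep -c 'status checks failure'
```

A failed system or instance status check (hardware issue, kernel hang, exhausted memory) is what autoscaling reports as unreachability, so the `describe-instance-status` history for the terminated instance ID is the place to pin down the root cause.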
Please let me know your thoughts on the whole story (I know it's lengthy, but I needed to convey the whole experience).
Thanks for your patience.
Thank you Sagar for the detailed description.
Community Manager We'll need someone with expertise in handling implementations on the AWS Marketplace. Not sure who I should reach out to here. Please route appropriately. Thanks.
Hi james sun,
Could you diagnose and share your AWS knowledge here?
Thank you very much!
Inviting AWS expert Krishna Chaitanya to join this discussion and share his knowledge.
Thank you and greatly appreciated.
Hi Sagar Sonawane,
Were you able to make any progress on your own? If so, please share what you learned. I have shared your thread with some members, and hopefully they will reply soon. If no one answers in a few days, I will mark it as "Assume answered" and have you or other members recreate the question in the future, in the hope that more people help with it then.
I have described in detail how I brought the cluster back to a working state (see my reply to Mufeed U. above). But my questions about how to do an RCA of such issues, and why that "instance reachability" alert got triggered in AWS, are still unanswered. I am hoping to get a reply about that.
It seems you are shutting down and restarting the cluster nodes. Please see the final notes on using the CFTs in AWS with MapR: 9 Steps to Deploying the MapR Converged Data Platform on AWS
It seems that you may want to assign Elastic IPs to the nodes and also disable the autoscaling policy for the cluster. Both are described in the post above.
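For anyone following along, suspending the autoscaling processes before a scheduled shutdown can be done with the AWS CLI. Below is a dry-run sketch: the ASG name is a placeholder and `aws` is stubbed so the sequence can be traced offline; drop the stub for real use:

```shell
#!/bin/sh
# Stub so the sequence can be traced without AWS credentials; remove for real use.
aws() { echo "would run: aws $*"; }

ASG_NAME="mapr-cluster-asg"   # placeholder ASG name

# Suspend ASG processes so a stopped node is not replaced...
aws autoscaling suspend-processes --auto-scaling-group-name "$ASG_NAME"
# ...stop the cluster nodes, and on restart resume the processes:
aws autoscaling resume-processes --auto-scaling-group-name "$ASG_NAME"
```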
I already have Elastic IPs and have disabled the autoscaling policy. As far as shutting down and restarting the cluster nodes is concerned, that is not the question. There are two important questions raised in this thread, as follows:
You need to disable autoscaling before properly shutting down the cluster. Please see the final notes at the end of this blog post on using the CFTs in AWS with MapR: 9 Steps to Deploying the MapR Converged Data Platform on AWS
Please contact MapR support if this issue persists.
That answer has been repeated, and it is irrelevant to my question. Reiterating my points:
I wanted to check again to see if you had found a solution to your issue that you can share with the community. Otherwise, the team has determined that a specific case must be created to further assist you as it would involve a dedicated engineer evaluating your particular environment. This option is available if you are a customer entitled to support. Let me know if you need more information.