I installed a 5-node MapR 5.1.0 cluster on RHEL 7.2. The installation succeeded, and I ran basic MapReduce jobs (teragen, terasort) to evaluate performance.
I then tried to run a benchmark (HiBench) to evaluate cluster performance under different types of load. One of my nodes suddenly crashed and became unreachable. It looked like the OS drives had been corrupted, and the server could not be booted. Since there was no data on the server, we decided to re-install the OS. We then re-installed MapR on the server, and the cluster was back to its original state.
We tried to run the HiBench test suite again, with exactly the same result! The server has crashed again and is unresponsive. I'm attaching a screenshot from the server console. The only available course of action seems to be a re-install.
We have run the HiBench tests on the same set of servers with Hortonworks and did not run into any issues, so we know HiBench alone can't be the cause.
If it had happened once, I'd have chalked it up to happenstance, but twice in a row indicates an underlying problem that is not going to go away by itself.
1. Is there anything I can do to find out more information on what could be causing these failures on the server?
2. Is it even possible for Hadoop MapReduce (or Spark) jobs to cause OS failures/corruption?
Thanks for your help,