Performance comparison of MapR with Cloudera: which configuration parameters should be tuned to compare results between them?

Question asked by prasanth on Aug 6, 2012
Latest reply on Aug 7, 2012 by prasanth

We are trying to select the best-suited Hadoop distro for our application. As part of benchmarking Cloudera's performance against MapR using simple arithmetic operations on a four-node cluster, I found that for smaller data sets, say a file containing 1 million to 10 million records, the performance of MapR is marginally better than Cloudera's. But with a file containing 100 million or 1 billion records, MapR's performance seems to dip by a huge margin, which goes against what is commonly claimed about MapR.

I am trying to make a like-for-like comparison between MapR and Cloudera. Surely there is something wrong in my configuration parameters. I am using a block size of 64 MB on both MapR and Cloudera. I have a single reduce job, and the number of map output records equals the number of reduce output records. I have seen this [link][1].


So my first question is: what other parameters, apart from block size, can we tune so that MapR performs better compared to Cloudera? My second question is: what explains the performance issue for MapR at 64 MB? Even with the default block size of 256 MB, I got only a marginal improvement in results with MapR.
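
To make the block-size part of the question concrete, here is a rough sketch of how block size translates into the number of input splits (and therefore map tasks) for one file. It assumes roughly one split per block; the 10 GB input size and the function name `estimated_splits` are hypothetical and are not figures from my benchmark, and real input formats may fold a small trailing remainder into the last split.

```python
import math

MB = 1024 * 1024

def estimated_splits(file_size_bytes, block_size_bytes):
    """Rough count of input splits for one file, assuming one split per block."""
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# Hypothetical 10 GB input file (an assumption for illustration only).
input_size = 10 * 1024 * MB

for block_mb in (64, 256):
    splits = estimated_splits(input_size, block_mb * MB)
    print(block_mb, "MB block size ->", splits, "splits")
```

Under these assumptions, going from 64 MB to 256 MB cuts the number of map tasks by a factor of four, but the job still funnels everything through one reducer, so block size alone may not explain the gap.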

Another important factor might be that I am running MapR on the cluster using a 50 GB flat file on each node for storage, and not raw drives as prescribed by MapR.

Any words of wisdom on this will be greatly appreciated. I have obviously not drawn any conclusion on which is better, but want to do so with the help of MapR community.

By the way, I love the MapR NFS feature. It rocks.

Thank you.