
Optimal Settings in Hive for Splits

Question asked by mandoskippy on Sep 25, 2012
I am looking for the ideal/optimal settings for Hive and how they interact with MapR. When I cat the .dfs_attributes of a partition-level directory (or any level, for that matter) from Hive, I get:

<pre>
# lines beginning with # are treated as comments
Compression=true
ChunkSize=268435456
</pre>

Makes sense.
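
For context, this is roughly the command I'm running from the Hive CLI; the warehouse path below is just a placeholder for my actual partition directory:
<pre>
hive> dfs -cat /user/hive/warehouse/mytable/dt=2012-09-25/.dfs_attributes;
</pre>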

When I run a job that hits a partition with two files of the following sizes:
<pre>
263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0
</pre>
I get three map tasks. And in the job XML I see that mapred.max.split.size is:
<pre>
mapred.max.split.size 256000000
</pre>
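(For what it's worth, my understanding is that I can also check the current value straight from the Hive CLI, since set with no value just prints the setting:)
<pre>
hive> set mapred.max.split.size;
</pre>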
Hmm. So I set the split size manually:
<pre>
set mapred.max.split.size=268435456;
</pre>

Still three mappers, which is odd to me. I can see why there are three mappers with a split size of 256000000. (That leads me to a separate question: doesn't it make sense to set the default split size to your block size, or in MapR's case the chunk size?) But why do I still get three mappers when I manually set the split size to the chunk size, where two mappers should be enough to handle the data?
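
To spell out my arithmetic (assuming, perhaps naively, that the number of splits per file is roughly the file size divided by the max split size, with a small overhang possibly allowed to ride along in the last split), plus the extra knob I'm guessing might matter:
<pre>
-- with mapred.max.split.size = 256000000:
--   000000_0: 270044140 / 256000000 -> 2 splits
--   000001_0: 161956948 / 256000000 -> 1 split
--   total: 3 map tasks (what I actually see)
--
-- with mapred.max.split.size = 268435456 (the 256 MB chunk size):
--   000000_0: 270044140 is only ~1.6 MB over one chunk, so I'd hope it lands in a single split
--   000001_0: 161956948 -> 1 split
--   total: 2 map tasks (what I expected, but I still get 3)
set mapred.max.split.size=268435456;
set mapred.min.split.size=268435456;  -- not sure whether Hive keys off the min, the max, or both
</pre>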

I don't know if this is the right place for this or not; I am trying to understand how file sizes and the number of mappers affect performance on MapR, and frankly the Hive documentation is a little light on the subject. I'm hoping some experts here on the MapR filesystem can help enlighten us on both MapR and Hive.

Thanks!
