Hive and File Sizes

Question asked by mandoskippy on Sep 20, 2012
Latest reply on Sep 22, 2012 by yufeldman
I am trying to find the most efficient way to query my data in Hive running on MapR. I understand that MapR doesn't have the same "small files" limitation that standard Hadoop does. That said, a standard "select * from table where column=value" query gets a ton of map tasks assigned to it, and these map tasks finish very quickly, i.e. in under 10 seconds!

So for example:

My query is: select column, count(1) from table group by column. Changing nothing, I get

1238 map tasks and 377 reduce tasks with the average task time being 8 seconds.

My first plan was to raise my max split size:
SET mapred.max.split.size = 1512000000;

This produced 244 map tasks and 377 reduce tasks.  This seems better!  But I stopped the job and hard-coded the number of reduce tasks to 50. Why did I do that? Not sure; it looked like a good ratio, and the split size didn't change the number of reduce tasks.

set mapred.reduce.tasks=50;

OK, so now I have 244 map tasks and 50 reduce tasks.

My average time so far is 32 seconds per map task, which works out to about 7808 task-seconds of mapping in total (244 x 32). That compares to 9904 task-seconds (1238 x 8) with no settings changed.  So I gained some... but could I gain more?

I changed my pre-query settings to:
SET mapred.max.split.size = 2048000000;
set mapred.reduce.tasks=40;

This produced 182 map tasks and 40 reduce tasks, at 39 seconds per task.  Still coming down!
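
A quick way to compare the three runs on equal footing is to multiply the number of map tasks by the average map time. Here is a minimal Python sketch of that arithmetic. The function name and the run labels are made up for illustration, and it assumes the 39-second average for the last run applies to the map tasks. Total task-seconds is only a rough proxy for cluster work, not wall-clock time.

```python
# Rough comparison of total map work across the three runs described above.
# total_map_seconds and the run labels are illustrative names, not Hive or MapR APIs.

def total_map_seconds(map_tasks, avg_seconds):
    """Total task-seconds spent mapping: number of tasks times average task time."""
    return map_tasks * avg_seconds

# (label, number of map tasks, average seconds per map task)
runs = [
    ("defaults", 1238, 8),
    ("max.split.size=1512000000", 244, 32),
    ("max.split.size=2048000000", 182, 39),  # assumes 39 s is the map-task average
]

baseline = total_map_seconds(runs[0][1], runs[0][2])

for label, tasks, avg in runs:
    total = total_map_seconds(tasks, avg)
    saved_pct = 100.0 * (baseline - total) / baseline
    print(f"{label}: {total} task-seconds ({saved_pct:.1f}% less than defaults)")
```

Fewer, longer map tasks cut the total here mostly because each task carries a fixed startup cost, so the same comparison could be repeated for any other split size I try.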

So I am going to let this one finish... but how do I tweak my Hive settings to be smarter about this stuff?  Should I have to set these things for each query? Is the optimum setting going to be per query, per cluster, or per data set for that matter?  I'd like to run through a process that will help me get the right amount of data to my mappers for optimum efficiency, but how do I do that (smartly)? Ideally, I'd also like to do this in a way that wouldn't force my users to think about the settings.  Any guides on how to do this?

I know some of my actual RCFile/Gzipped table files in Hive are quite small, but I didn't think this would matter with MapR... should I be combining files? Would that help me gain efficiency without having to use the set commands?  I guess those are the questions I am tossing out to the group :)


