
Weird behavior deciding on splits / number of mappers

Question asked by dimamah on Jun 9, 2013
Latest reply on Jul 3, 2013 by gera
I'm experiencing some weird behavior in how Hadoop chooses the number of mappers for a Hive query. 

I have a table with a single int column. 
The table is constructed from 6 files of 67108864 bytes (64 MB) each. 
Chunk size is the default 256 MB. 
Each of the files (F1-F6) resides in a different container. 
The containers are distributed as follows (from `hadoop mfs -lss`): 
F1 -> Servers : 1/7/4 
F2 -> Servers : 1/8/10 
F3 -> Servers : 1/5/3 
F4 -> Servers : 1/7/4 
F5 -> Servers : 1/7/4 
F6 -> Servers : 1/7/10 
The primary part of each of the files resides in the same container on servers 5/3/10.

mapred.max.split.size is 256000000 
mapred.min.split.size is 1

I'm running the query : 

    select count(*) from table

For this query I get 3 mappers: 

 1. M0 - Running on Server 4, Processing 3 files (F1,F4,F5)  
   *I see this in the mapper's log in the line : "HiveContextAwareRecordReader: Processing file..."*  
 2. M1 - Running on Server 8, Processing 2 files (F2,F6)
 3. M2 - Running on Server 5, Processing 1 file (F3)

**The questions :** 

 1. What is the "split size" in this case? 
As I understand it, the split size should be max(minimumSize, min(maximumSize, blockSize)), which in this case is 256000000.  
Is this right?
 2. If the above is right, why are there 3 mappers? The total size of the data is 384 MB; divided by 256 MB that gives 1.5 mappers, which should round up to 2, no? 
I'd expect to see 2 mappers, running on server 1 and server 7, each reading 3 files.  
Or any other distribution that is even.
 3. M1 is running on server 8 and processes file F6, which isn't found on server 8 at all! How is this possible?
 4. What is the algorithm behind how the number of mappers is chosen in MapR?
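To make the arithmetic behind the first two questions concrete, here is a small sketch of the standard FileInputFormat split-size formula applied to the numbers above. Note this is only my reading of the textbook formula; Hive's actual CombineHiveInputFormat groups input by locality, so the real mapper count may come out differently (as it apparently does here):

```python
import math

# Values from the setup described above
block_size = 256 * 1024 * 1024   # 256 MB chunk size = 268435456 bytes
max_split  = 256000000           # mapred.max.split.size
min_split  = 1                   # mapred.min.split.size

# Standard FileInputFormat formula: max(minSize, min(maxSize, blockSize))
split_size = max(min_split, min(max_split, block_size))
print(split_size)                # 256000000

# Six 64 MB files = 402653184 bytes (384 MB) of input in total
total_bytes = 6 * 67108864
expected_mappers = math.ceil(total_bytes / split_size)
print(expected_mappers)          # 2
```

By this calculation I'd expect 2 mappers, yet the job launches 3.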