ORC split size and the number of mappers launched

Question asked by maciek on Jan 28, 2015
How is the number of mappers to be launched calculated exactly?

Are the file format and compression taken into account? (256MB of compressed data expands to many more MB once the mapper decompresses it.)

I've created several ORC files (no compression, one file per table) with different stripe size settings:
256, 128, 64 and 16MB. Their sizes are, respectively:
327,814,200;    413,030,657;    413,030,290;    433,481,175

When I run a query in Hive (SELECT * FROM … ORDER BY) over those tables the number of map tasks launched is respectively:
1, 2, 2, 2.

I would expect it to be aligned with my chunk size (256MB), so always 2, as the chunk size is always a multiple of the stripe sizes I've chosen.
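To make that expectation concrete, here is a rough sketch of the arithmetic behind it. It assumes the file sizes listed above are in bytes, that "256MB" means 256 * 1024 * 1024 bytes, and that one mapper is launched per chunk-sized slice of a file. The name `expected_mappers` is only illustrative; this shows what I expected, not how Hive or Tez actually compute splits.

```python
import math

# Assumption: the "256MB" chunk size means 256 MiB.
CHUNK_BYTES = 256 * 1024 * 1024

# File sizes from the post (assumed to be bytes), keyed by declared stripe size in MB.
file_sizes = {
    256: 327_814_200,
    128: 413_030_657,
    64: 413_030_290,
    16: 433_481_175,
}


def expected_mappers(size_bytes, chunk_bytes=CHUNK_BYTES):
    """Naive estimate: one mapper per chunk-sized slice of the file."""
    return max(1, math.ceil(size_bytes / chunk_bytes))


expected = {stripe: expected_mappers(size) for stripe, size in file_sizes.items()}

for stripe, count in expected.items():
    print(f"stripe size {stripe}MB: expected {count} mapper(s)")
```

Under these assumptions every file is larger than one chunk and smaller than two, so the estimate is 2 mappers for each table, which matches the observed counts for all but the 256MB-stripe table.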
After I change the engine to Tez it gets even more interesting; the number of mappers is, respectively:
2, 2, 4, 13

Why is it different?

Also, when I examine the source table files using the orcdump utility, I can see that the number of stripes is not consistent with the declared stripe size. The stripe counts are, respectively:
8, 118, 118, 118.

Is the number of mappers based on the declared stripe size (the DDL, i.e. the Hive metastore) rather than on the file itself?