
Merge tab delimited files by key

Question asked by tolgap on Oct 26, 2014
I have three MapReduce jobs that operate on the same input files, and each one produces tab-delimited output. In the output of all three jobs, the first value on each line is the key.

What I want to do now is use MapReduce to "merge" these files together by key. What would be the best Mapper output and Reducer input? I tried using `ArrayWritable`, but because of the shuffle, for some records the `ArrayWritable` from one file ends up in the third position instead of the second.

I want this:

    Key \t Values-from-first-MR-job \t Values-from-second-MR-job \t Values-from-third-MR-job

And this should be the same for **all** records. But, as I said, because of the shuffle, sometimes this happens for a few records:

    Key \t Values-from-third-MR-job \t Values-from-first-MR-job \t Values-from-second-MR-job

How should I set up my `Mapper` and `Reducer` to fix this? Or can I use a custom `Combiner` to fix this?
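
For reference, one common way to make the column order independent of the shuffle is to tag each value with the job it came from and let the reducer place it in a fixed slot. The following is only a minimal sketch in plain Java, without the Hadoop API, to show the idea. The class `TaggedMerge`, the methods `tag` and `mergeForKey`, and the `|` separator are all hypothetical names chosen for illustration. In a real job the mapper would have to determine the source index itself, for example from the input file path or by using one `Mapper` class per input with `MultipleInputs`. A `Combiner` is probably not the right tool here, since it runs on map output and is not guaranteed to run at all.

```java
import java.util.Arrays;
import java.util.List;

public class TaggedMerge {
    // Assumption: exactly three upstream jobs, numbered 0, 1 and 2.
    static final int NUM_SOURCES = 3;

    // Map side (sketch): prefix the values with the index of the job that produced them.
    public static String tag(int sourceIndex, String values) {
        return sourceIndex + "|" + values;
    }

    // Reduce side (sketch): values for one key arrive in arbitrary order,
    // so each one is placed in the slot named by its tag.
    public static String mergeForKey(String key, Iterable<String> taggedValues) {
        String[] slots = new String[NUM_SOURCES];
        Arrays.fill(slots, ""); // a job with no record for this key leaves an empty column
        for (String tagged : taggedValues) {
            int sep = tagged.indexOf('|');
            int sourceIndex = Integer.parseInt(tagged.substring(0, sep));
            slots[sourceIndex] = tagged.substring(sep + 1);
        }
        return key + "\t" + String.join("\t", slots);
    }

    public static void main(String[] args) {
        // Simulate the shuffle delivering the third job's values first.
        List<String> shuffled = Arrays.asList(
                tag(2, "third-job-values"),
                tag(0, "first-job-values"),
                tag(1, "second-job-values"));
        System.out.println(mergeForKey("Key", shuffled));
    }
}
```

With this layout the mapper output value could stay a plain `Text`, and the reducer writes the columns in job order no matter how the values were delivered.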
