How to write a custom partitioner for a MapReduce job?
hadoop, mapreduce, bigdata, hadoop question
First, let's understand why we need partitioning in the MapReduce framework:
As we know, a map task takes an InputSplit as input and produces key-value pairs as output. These key-value pairs are then fed to the reduce tasks. But before the reduce phase, one more phase, known as the partitioning phase, runs. This phase partitions the map output based on the key and places records with the same key into the same partition.
Let's take an example of employee analysis: we want to find the highest paid female and male employee from the data set.

Data set:

Name  Age  Dept     Gender  Salary
A     23   IT       Male    35
B     35   Finance  Female  50
C     29   IT       Male    40
Considering two map tasks, they give the following <k,v> pairs as output:

Map1 o/p:

Key (Gender)  Value
Male          A 23 IT Male 35
Female        B 35 Finance Female 50

Map2 o/p:

Key (Gender)  Value
Male          C 29 IT Male 40
If you observe the output of the two map tasks, you will notice that the key 'Male' appears in the output of both. If those records were sent to two different reducers, the same key would be processed twice. This is where partitioning plays its role.
Before the framework sends the map outputs to the reducers, it partitions the intermediate key-value pairs based on the key, so that all pairs with the same key end up in the same partition.
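Since the question asks how to write a custom partitioner, here is a minimal sketch of the routing logic for the gender example above. This is a plain-Java sketch with no Hadoop dependency, and the class name `GenderPartitionerSketch` is a placeholder: in a real job, this method body would live in a class extending `org.apache.hadoop.mapreduce.Partitioner<Text, Text>`, overriding `getPartition(Text key, Text value, int numPartitions)`, and be registered with `job.setPartitionerClass(...)`.

```java
// Sketch of a custom partitioner's routing logic (plain Java, no Hadoop
// dependency, so it can be run standalone). In a real job, put this body
// inside a class that extends org.apache.hadoop.mapreduce.Partitioner<Text, Text>,
// override getPartition(Text key, Text value, int numPartitions), and
// register it with job.setPartitionerClass(...).
public class GenderPartitionerSketch {

    // Route all 'Male' records to partition 0 and everything else
    // ('Female' here) to partition 1, so each gender is handled by
    // exactly one reducer.
    public static int getPartition(String genderKey, int numPartitions) {
        // With fewer than two reducers there is only one partition.
        if (numPartitions < 2) {
            return 0;
        }
        if (genderKey.equalsIgnoreCase("Male")) {
            return 0;
        }
        return 1;
    }

    public static void main(String[] args) {
        System.out.println("Male   -> partition " + getPartition("Male", 2));
        System.out.println("Female -> partition " + getPartition("Female", 2));
    }
}
```

With this routing, one reducer sees only 'Male' records and the other only 'Female' records, so each can compute its highest salary independently.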
How is the number of partitions decided?
Hadoop decides the number of partitions when the MapReduce job starts; it is controlled by the JobConf.setNumReduceTasks() method. If we decide on 5 reduce tasks, there will be 5 partitions, and all of them must be filled.
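As a fragment (not a complete driver), setting the reduce-task count with the old mapred API mentioned above might look like this; `MyJob` is a hypothetical driver class:

```java
// Configuration fragment using the old org.apache.hadoop.mapred API
// referenced above. MyJob is a placeholder driver class.
JobConf conf = new JobConf(MyJob.class);
conf.setNumReduceTasks(5); // 5 reduce tasks => 5 partitions, numbered 0..4
```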
So, let's see how this happens by default.

The default partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key object, modulo the total number of partitions, to determine which partition a given (key, value) pair is sent to.
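The default behavior described above can be sketched in plain Java. Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the bitwise AND with `Integer.MAX_VALUE` clears the sign bit so the result stays non-negative even when `hashCode()` is negative. The class name below is a placeholder for illustration.

```java
// Plain-Java sketch of Hadoop's default HashPartitioner logic:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// The AND with Integer.MAX_VALUE clears the sign bit, so the result is
// non-negative even for keys whose hashCode() is negative.
public class HashPartitionerSketch {

    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition, which is why
        // all 'Male' records from both map tasks reach one reducer.
        System.out.println("Male   -> partition " + getPartition("Male", 5));
        System.out.println("Male   -> partition " + getPartition("Male", 5));
        System.out.println("Female -> partition " + getPartition("Female", 5));
    }
}
```

Because the partition depends only on the key's hash, both map tasks in the example send their 'Male' records to the same partition without any coordination between them.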