What is Shuffling and Sorting in Hadoop MapReduce? How does MapReduce sort and shuffle work? What is the purpose of the shuffle operation in Hadoop MapReduce?
When mappers finish their tasks, their output is a series of key-value pairs. Shuffling is simply the act of transferring the mapper output to the reducers.
Sorting is the process of sorting the mappers' output by key. For example, unsorted output from the mappers might look like this:
When it's sorted, it would look like this:
Sorting helps the framework determine when a new reduce call should start: because each reducer's input is ordered by key, a change in key marks the start of a new group of values. The example above has very little data, so it doesn't show much, but if you have millions of rows, you will likely have many duplicate keys. Combined with partitioning by key, sorting makes it easy to get all of the "bill" values to one reducer, the "joe" values to another reducer, and so on.
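To make the grouping concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the sample pairs and the helper name `sort_and_group` are made up for illustration. It only shows that once pairs are sorted by key, all values for one key sit next to each other.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output; names and numbers are invented for illustration.
mapper_output = [("joe", 3), ("bill", 7), ("joe", 1), ("amy", 4), ("bill", 2)]

def sort_and_group(pairs):
    """Sort pairs by key, then collect the values of each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # sort on the key only
    return [(key, [value for _, value in run])
            for key, run in groupby(ordered, key=itemgetter(0))]

grouped = sort_and_group(mapper_output)
for key, values in grouped:
    print(key, values)
```

Each printed line corresponds to what a single reduce call would receive: one key and all of its values.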
Data transfer from mapper to reducer is called shuffling. Shuffling can begin as soon as the first mappers finish, without waiting for every mapper to complete. The (key, value) pairs are sorted by key before the reducer executes.
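The transfer step can be sketched as a toy simulation in Python. This is an illustration under assumptions, not Hadoop's implementation: `partition` and `shuffle` are hypothetical helpers, and CRC32 stands in for whatever hash the real partitioner applies to the key. The point is that a key always maps to the same reducer, and each reducer sees its input sorted by key.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer index for a key (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route each pair from each mapper to its reducer, then sort by key."""
    reducer_inputs = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    # Each reducer's input is sorted by key before the reducer runs.
    return {r: sorted(pairs, key=lambda kv: kv[0])
            for r, pairs in reducer_inputs.items()}

# Hypothetical output of two mappers.
mapper_outputs = [
    [("joe", 3), ("bill", 7)],
    [("bill", 2), ("amy", 4), ("joe", 1)],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
for reducer, pairs in sorted(reducer_inputs.items()):
    print(reducer, pairs)
```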
Sorting the (key, value) pairs by key groups all values for a key together on the reducer that the key is assigned to. Note that shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers. Such a job usually runs faster than a full MapReduce job, and this type of processing is known as a map-only job.
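In Hadoop's Java API the number of reducers is set with `Job.setNumReduceTasks(0)`. The effect can be sketched with a toy runner in Python, where `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names and not part of any Hadoop API: with zero reducers the mapper output is returned as is, with no sort and no grouping.

```python
from itertools import groupby

def run_job(records, mapper, reducer=None, num_reducers=1):
    """Toy single-process job runner (hypothetical, not a Hadoop API)."""
    mapped = [pair for record in records for pair in mapper(record)]
    if num_reducers == 0:
        # Map-only job: skip shuffle and sort; mapper output is the final output.
        return mapped
    ordered = sorted(mapped, key=lambda kv: kv[0])
    return [reducer(key, [v for _, v in run])
            for key, run in groupby(ordered, key=lambda kv: kv[0])]

def word_mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    """Sum all counts for one key."""
    return (key, sum(values))

lines = ["joe bill", "bill amy"]
map_only = run_job(lines, word_mapper, num_reducers=0)
full = run_job(lines, word_mapper, sum_reducer, num_reducers=1)
print(map_only)
print(full)
```

The map-only result keeps the mappers' emission order and duplicate keys, while the full run returns one summed entry per key in sorted order.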