What do you mean by Data Locality?
Hadoop's major drawback was cross-switch network traffic caused by moving huge volumes of data. Data locality was introduced to overcome this: rather than moving large datasets to the computation, the computation is moved close to the node where the data actually resides. Data locality increases the overall throughput of the system. In Hadoop, HDFS stores datasets divided into blocks that are spread across the DataNodes of the cluster. When a user submits a MapReduce job, the scheduler consults the NameNode for block locations and tries to launch each map task on a DataNode that already holds the data it needs. Data locality has three categories:
• Data-local – The data is on the same node as the mapper working on it. This is the most preferred scenario, since the data is closest to the computation.
• Intra-rack – The mapper runs on a different node but within the same rack. This happens when resource constraints make it impossible to schedule the mapper on the DataNode holding the data.
• Inter-rack – The mapper runs on a different rack entirely, because no node in the rack holding the data has free resources.
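The three locality levels above form a preference order a scheduler can rank candidate nodes by. Here is a minimal sketch of that idea; the names (`locality_rank`, `pick_node`) and the node/rack tuples are illustrative assumptions, not actual Hadoop APIs:

```python
# Hypothetical sketch of locality-aware task placement: rank each candidate
# node by the best locality level it can achieve for a block's replicas.
# 0 = data-local, 1 = rack-local (intra-rack), 2 = off-rack (inter-rack).

def locality_rank(task_node, task_rack, replica_locations):
    """Best locality level if the task runs on task_node/task_rack."""
    best = 2  # assume off-rack until a closer replica is found
    for node, rack in replica_locations:
        if node == task_node:
            return 0              # replica on the same node: data-local
        if rack == task_rack:
            best = min(best, 1)   # same rack, different node: rack-local
    return best

def pick_node(candidates, replica_locations):
    """Choose the candidate (node, rack) with the best locality rank."""
    return min(candidates,
               key=lambda nr: locality_rank(nr[0], nr[1], replica_locations))

# Block replicated on dn3 (rack1) and dn7 (rack2); three nodes have free slots.
replicas = [("dn3", "rack1"), ("dn7", "rack2")]
candidates = [("dn1", "rack1"), ("dn3", "rack1"), ("dn9", "rack3")]
print(pick_node(candidates, replicas))  # ('dn3', 'rack1') -- data-local wins
```

A real scheduler (JobTracker in MRv1, the YARN ResourceManager/ApplicationMaster in MRv2) adds delays and relaxation rules on top of this basic preference, but the ranking itself is the same.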
Data locality means that the work being performed is local to where the data is being stored.
The whole point of Map/Reduce is that it's far cheaper to ship the map/reduce process to where the data is stored than to ship the data around the cluster.
Note the following:
The importance of data locality diminishes with network speed and with the amount of work performed within a single call to the Mapper.map() method. See: Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution.
There we found that when a single Mapper.map() call takes longer to process its data than it takes to transport that data from the storage node to the compute node, it is possible to spread the work across the cluster and complete the job faster.
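A back-of-envelope comparison makes the trade-off concrete. The figures below (block size, link speed, per-record cost) are illustrative assumptions, not measurements from the cited article:

```python
# Compare the time to pull one HDFS block over the network against the
# time a CPU-heavy mapper spends processing it. All numbers are assumed.

BLOCK_SIZE_MB = 128
LINK_MB_PER_S = 1000 / 8          # 1 Gb/s link ~= 125 MB/s
transfer_s = BLOCK_SIZE_MB / LINK_MB_PER_S   # ~1 s to fetch a remote block

records_per_block = 1_000_000
per_record_s = 500e-6             # assume 500 us of CPU work per record
process_s = records_per_block * per_record_s # ~500 s to process the block

print(f"transfer ~{transfer_s:.2f}s, processing ~{process_s:.0f}s")
if transfer_s < process_s:
    print("processing-bound: remote reads are cheap by comparison")
else:
    print("transfer-bound: data locality is critical")
```

When processing dominates by two orders of magnitude, as in this sketch, remote reads overlap with computation and strict locality buys little; when the mapper is trivial, the inequality flips and locality dominates job time.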
In more modern architecture models, you will start to see a compute/storage (or storage/compute) split with multiple clusters: one for storing the data (HDFS or MapRFS) and separate clusters, or single containers, for processing it. This is why containers and container management (Mesosphere, Kubernetes, etc.) are important to the future of Hadoop.