What is the difference between an input split and a block in Hadoop MapReduce?
An InputSplit is a logical reference to data, meaning it doesn't contain any data itself; it is only used by MapReduce during processing. An HDFS block, by contrast, is a physical location where the actual data is stored. Both are configurable, though through different mechanisms. All blocks of a file are the same size except the last one, which can be the same size or smaller. By default the split size is approximately equal to the block size, but the two don't have to line up exactly: an entire block of data may not fit into a single input split (for example, when a record crosses a block boundary).
What happens when you have a multi-block file that isn't splittable?
A "block" (also referred to as a "chunk") is a reference to how the data of a file is stored in the distributed file system (MapRFS). For instance, the default block size is 256MB, meaning a file that is 1024MB would be stored as 4 blocks, with the first block storing bytes 0 through (256*2^20)-1 (e.g. the first 256MB of the file), the second block storing bytes 256*2^20 through (256*2^21)-1 (e.g. the second 256MB of the file), etc. until 4 blocks are consumed.
Breaking a large file into multiple blocks allows it to be stored across many different nodes, and hence to be both read and written at a rate of throughput that a single node/disk couldn't otherwise achieve. It also allows a single file to be larger than any one drive, group of drives, or node could store.
An "input split" is a way of referencing some specific data in an abstract manner, in the context of a Map/Reduce job. One of the first steps to executing a map/reduce job is to define a list of input splits, which then maps one-to-one to map tasks (e.g. the number of input splits provided during job creation will dictate the number of map tasks that will be run, with each map tasks being assigned one unique input split). When a map task runs, it will interpret the input split and read/process whatever data it references. This may all sound very ambiguous, and that is because it is indeed just an abstract class that the developer of the map/reduce job code will need to define (or re-use one of the example classes in the open source code).
For a more specific example, you might want to run a map/reduce job wherein each map task processes exactly 1 chunk of each of the files under some directory path in MapRFS. When developing the map/reduce job code, you'd retrieve a list of files under the directory path, then retrieve the block locations of those files and generate an input split for each block location of each file. Thus, when the map tasks run, each one reads back some particular range of bytes of some particular file, and in this example that range of bytes correlates to a block of a file in MapRFS.
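A rough sketch of that approach, using the stock Hadoop FileSystem and FileSplit classes (error handling and edge cases omitted):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Build one input split per block of every file under a directory.
public class OneSplitPerBlock {
    public static List<InputSplit> splitsFor(Configuration conf, Path dir)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus file : fs.listStatus(dir)) {
            if (file.isDirectory()) continue;
            BlockLocation[] blocks =
                fs.getFileBlockLocations(file, 0, file.getLen());
            for (BlockLocation block : blocks) {
                // Each split references the byte range of one block, plus
                // the hosts that store it, so the scheduler can place the
                // map task close to the data.
                splits.add(new FileSplit(file.getPath(), block.getOffset(),
                        block.getLength(), block.getHosts()));
            }
        }
        return splits;
    }
}
```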
However, there is no requirement that input splits map directly to file blocks. You might decide you want each map task to process an entire file, regardless of whether the file is stored as a single block or many blocks. Or the input splits might not reference files at all. For instance, you might have a map/reduce job that counts the number of rows containing some value in an HBase (e.g. MapRDB) table. In that case, each input split would generally specify a start key and an end key; the associated map task scans that key range in the table and counts the results. So here the input split references row keys, which are not related to files or blocks of files at all.
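As a sketch of that idea, here's a hypothetical key-range split (the names are illustrative; in real HBase jobs the stock TableSplit plays this role). One wrinkle worth showing: custom splits must also implement Writable, because the framework serializes splits to ship them to the tasks:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split describing a key range rather than a byte range of a file.
public class KeyRangeSplit extends InputSplit implements Writable {
    private Text startKey = new Text();
    private Text endKey   = new Text();

    public KeyRangeSplit() {}                    // required: created by reflection
    public KeyRangeSplit(String start, String end) {
        startKey.set(start);
        endKey.set(end);
    }

    @Override public long getLength()        { return 0; } // size unknown up front
    @Override public String[] getLocations() { return new String[0]; }

    // The framework ships splits to tasks, so they must serialize themselves.
    @Override public void write(DataOutput out) throws IOException {
        startKey.write(out);
        endKey.write(out);
    }
    @Override public void readFields(DataInput in) throws IOException {
        startKey.readFields(in);
        endKey.readFields(in);
    }
}
```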
To summarize: a block relates to how a file's data is stored in the distributed file system, while an input split is a way of arbitrarily specifying some unique data that should be processed as the input to a map task in a map/reduce job. The concepts are not directly related, but in some cases developers of map/reduce job code may want to create input splits that correlate to file blocks.
Input splits don't always fall in line with one split per block. It depends on the underlying data and whether it's splittable.
So you can have a very large compressed file that spans several blocks, and it will still end up as a single map task. And of course, depending on the underlying data, you can control how the job splits the input and direct where each task should run. We did a write-up on this in InfoQ back in 2011 or 2012...
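The standard trick for forcing that whole-file, single-map-task behavior is to override isSplitable() in an input format (yes, the method name in Hadoop's API really is spelled with one "t"). A minimal sketch:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Declare every input file non-splittable, regardless of how many blocks
// it occupies, so each file becomes exactly one split and one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```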
Michael Segel, were you referring to this one: "Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution"?
Yeah, that's one of the articles.
We were limited at the time on some of the stuff we worked on.