What is the Small File Problem in Hadoop?
MapR-FS addresses this problem quite effectively by using a block size of 8 KB, rather than the 128 MB default block size of HDFS. You can learn more about it here: HDFS vs. MapR FS – 3 Numbers for a Superior Architecture | MapR
A file that is smaller than the HDFS block size is called a small file. The small file problem is two-fold: a small file problem in HDFS and a small file problem in MapReduce.
Small file problem in HDFS:- Small files are smaller than the block size, so HDFS cannot handle them efficiently, and reading many of them means hopping from datanode to datanode, which takes a lot of time. In the namenode's memory, every directory, file and block is represented as a separate object, and each object takes roughly 150 bytes. Every small file still occupies its own block, so with millions of small files the metadata alone can consume gigabytes of namenode memory. The more files there are, the more metadata the namenode has to hold in memory.

Small file problem in MapReduce:- The number of map tasks grows with the number of input files, because each map task processes one block of input at a time. With many small files, the job launches many map tasks that each do very little work, so processing becomes slow. For example, if a client needs to process 50,000 small files, the job needs 50,000 mappers, and it will run slowly.

Solutions to the small file problem are:
1. HAR (Hadoop Archive) files
2. Sequence files
3. HBase

A sketch of option 2 is shown below.
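To make option 2 concrete, here is a minimal sketch (not part of the original answer) of packing many small files into a single SequenceFile with Hadoop's Java API, using each file's name as the key and its contents as the value. The paths /data/small_files and /data/packed.seq are placeholders chosen for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small_files");  // directory full of small files (placeholder)
        Path outputSeq = new Path("/data/packed.seq");  // single SequenceFile holding them all (placeholder)

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputSeq),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    // Read the whole small file into memory (it is small by definition).
                    byte[] contents = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(0, contents);
                    }
                    // Key = original file name, value = raw bytes of the file.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            }
        }
    }
}
```

Option 1 works from the command line with something like `hadoop archive -archiveName files.har -p /data/small_files /data/archives` (paths again placeholders). A HAR file reduces the namenode memory pressure, but the archived files are still read as individual small inputs by MapReduce, which is why packing into sequence files or storing the data in HBase is often preferred when processing speed matters.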