What do you mean by the small file problem in Hadoop?
A small file, from a Hadoop perspective, is a file considerably smaller than the HDFS block size (64 MB or 128 MB). Since Hadoop is used to process huge amounts of data, storing that data as small files makes the number of files very large, while Hadoop is actually designed for the opposite: a small number of large files. The issues with small files are:

1. Each file, directory, and block in HDFS is represented as an object in the NameNode's memory (i.e., metadata), and each object occupies approximately 150 bytes. Scaling the NameNode's memory to hold all of these objects is not feasible; in short, as the number of files grows, so does the memory required for metadata (a worked estimate follows this list).

2. HDFS is not designed for efficient access to small files. Handling a large number of small files causes a lot of seeks and a lot of hopping from datanode to datanode to retrieve them, which is an inefficient data access pattern.

3. A map task usually takes one block of input at a time. If files are much smaller than the block size, the number of map tasks increases and each task processes very little input. This creates a lot of queued tasks with high per-task overhead, which lowers the overall speed and efficiency of map jobs.
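To make issue 1 concrete: at roughly 150 bytes per object, 10 million small files that each fit in one block mean about 10 million file objects plus 10 million block objects, i.e. around 20,000,000 x 150 bytes, or roughly 3 GB of NameNode heap for metadata alone. A standard mitigation is to pack many small files into one container file, for example a SequenceFile keyed by file name. The following is a minimal sketch against the Hadoop 2.x SequenceFile API; the input directory and output path are hypothetical:

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical output path: one large HDFS file instead of millions of small ones.
            Path output = new Path("/user/demo/packed.seq");

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // Append each local small file as one (filename, contents) record.
                for (File f : new File("/data/small-files").listFiles()) { // hypothetical input dir
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        }
    }

Once packed, the NameNode tracks a handful of blocks for the container file rather than one object per small file.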
You are correct that HDFS stores the block information in memory on the NameNode, so, depending on the amount of memory available, the number of blocks is finite and can easily be exceeded. The problem arises when you store lots of small files that are much smaller than a block and that could normally be joined together. (This is what HBase does with its minor and major compactions.)
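Once joined together, reading the packed records back is a single sequential scan of one HDFS file, which avoids the seek-and-hop access pattern described above. A hedged companion sketch to the writer, using the same hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Text key = new Text();
            BytesWritable value = new BytesWritable();

            // One sequential read replaces millions of open/seek/close cycles.
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(new Path("/user/demo/packed.seq")))) {
                while (reader.next(key, value)) {
                    System.out.println(key + ": " + value.getLength() + " bytes");
                }
            }
        }
    }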
With respect to MapR: it does not have a NameNode, which in Apache Hadoop can become a single point of failure (SPOF). (This has since been addressed with a standby NameNode and ZooKeeper.) MapR stores the block information differently, using a CLDB (Container Location Database) and the data nodes themselves.
MapR is more resilient here, but it too can have a small file issue. The first block of a file is written to a directory, and that directory can start to slow down as the number of small files grows. While this can be a problem, it's important to point out that it would take many, many small files (billions?), and the failure is contained: one volume fails while the rest of the cluster stays up. In Apache Hadoop, a fraction of what it would take to make MapR fail would take down the whole cluster, and that is an important takeaway.
I'm not sure what you meant by #3 in your response.