I'm trying to load terabytes of data from the local file system into HDFS. What should the loading strategy be in terms of performance?
There are most likely three bottlenecks:
1. Disk reading performance of local FS.
2. Network between this local machine and HDFS.
3. Write performance of HDFS itself.
First, measure the performance of these three.
The ideal solution saturates #3 before hitting either of the other bottlenecks.
In practice, #1 is usually the easiest bottleneck to hit.
You may need to split the large files into pieces spread across several local machines and then run multiple "hadoop fs -put" commands in parallel (see the sketch below).
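A rough sketch of the parallel-put idea, assuming a single local machine; the source path, chunk size, HDFS target directory, and degree of parallelism are placeholders you would tune to what your disk and network can actually sustain:

    #!/usr/bin/env bash
    # Sketch: split one large local file into chunks, then upload the chunks
    # to HDFS with several concurrent "hadoop fs -put" processes.

    SRC=/data/big.file          # placeholder: local file to upload
    DST=/user/me/ingest         # placeholder: HDFS target directory
    PARALLEL=4                  # placeholder: concurrent uploads

    # Split the source into ~10 GB pieces on the local disk.
    split -b 10G "$SRC" "${SRC}.part-"

    # Create the target directory, then upload the pieces a few at a time.
    hadoop fs -mkdir -p "$DST"
    ls "${SRC}".part-* | xargs -P "$PARALLEL" -I{} hadoop fs -put {} "$DST/"

Spreading the pieces over several machines and running this on each of them lets you push past a single node's disk-read and network limits.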
Hao makes a number of good points. I'll just add that by far the easiest way to load data into MapR is to expose the cluster via NFS and then write directly to it via standard file system commands. While writing to one NFS gateway may not scale well, if the existing system has limited capacity it may not matter. If it does matter, you could use multiple gateways.
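A rough sketch of the NFS approach; the gateway hostname, cluster volume path, and mount point below are placeholders, not your actual cluster layout:

    # Mount the cluster through one of its NFS gateways.
    sudo mkdir -p /mapr
    sudo mount -o hard,nolock nfs-gateway-1:/mapr /mapr

    # Copy data with ordinary file system tools; rsync can resume if interrupted.
    rsync -a --progress /data/ /mapr/my.cluster.com/user/me/ingest/

If one gateway becomes the limit, point different source machines (or different rsync jobs) at different gateways.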