Is client-side checksum verification needed for files moved out of Hadoop (HDFS/MaprFS)?

Question asked by volans on Nov 24, 2013
Do we need to verify checksums after moving files from Hadoop (HDFS/MaprFS) to a Linux server through a client?


We have M3 installed on our cluster, and on a daily basis we would like to archive HDFS files (dozens of terabytes in size) to an external Linux server with plenty of storage space.

Our current plan is to install a MapR client on the Linux server and archive the files by running a copyToLocal command there, as follows:

`hadoop fs -copyToLocal <hdfs folder to copy from> <Local Linux folder to copy to>`

Obviously, we would like to make sure the files on the Linux server are not corrupted after they are copied. But is verifying checksums ourselves necessary?

I came across some posts saying that checksums are already verified by the client:


*Ted: "the checksums are computed on the client side so that they protect against network errors as well as disk errors."*


*Ted: "Your application will never see corrupted data. Checksums are tested at the client level..."*

*MC: "On a CRC error, the client retries the RPC at the same server..."*

So it seems to me that verifying checksums separately is not necessary. Please let me know if that is wrong.
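
If an external check does turn out to be needed, here is a minimal sketch of what it might look like, not a tested procedure. It assumes that `hadoop fs -cat` is available through the MapR client on the Linux server; the function names (`digest_stream`, `digest_local`, `digest_cluster`, `copy_is_intact`) are hypothetical and chosen only for illustration. It streams the file from the cluster a second time, hashes it, and compares the result with a hash of the local copy. Since the second read goes through the same client, it would mostly catch problems on the local write side, and it doubles the read traffic, which matters at dozens of terabytes.

```python
import hashlib
import subprocess

CHUNK = 1024 * 1024  # read in 1 MiB pieces so large files never sit in memory


def digest_stream(stream, algo="sha256"):
    """Return the hex digest of everything readable from a binary stream."""
    h = hashlib.new(algo)
    for chunk in iter(lambda: stream.read(CHUNK), b""):
        h.update(chunk)
    return h.hexdigest()


def digest_local(path, algo="sha256"):
    """Hash a file on the Linux server's local file system."""
    with open(path, "rb") as f:
        return digest_stream(f, algo)


def digest_cluster(cluster_path, algo="sha256"):
    """Hash a cluster file by streaming it through `hadoop fs -cat`.

    Assumption: the `hadoop` command is on PATH on this machine.
    """
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", cluster_path], stdout=subprocess.PIPE
    )
    try:
        digest = digest_stream(proc.stdout, algo)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("hadoop fs -cat failed for %s" % cluster_path)
    return digest


def copy_is_intact(cluster_path, local_path):
    """True if the local copy hashes to the same value as the cluster file."""
    return digest_cluster(cluster_path) == digest_local(local_path)
```

Calling `copy_is_intact("<hdfs file>", "<local file>")` for each archived file would then report whether the two copies match.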

**copyToLocal vs. NFS mount**:

On a different but related topic: if checksum verification is already handled by the client and an external check is not necessary, is there a preference between the following two options:

 - Run the "hadoop fs -copyToLocal" command as above

 - Create an NFS mount on the Linux server using the client, then run a Linux "cp" command

We prefer the first option simply because we are more familiar with it, and because NFS in general (not MapR specifically) has traditionally been a little more error-prone. But we are open to both options. Thanks a lot. I appreciate any insight on this topic.
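
With the NFS option, both the source and the archive would be visible as ordinary directories on the Linux server, so a comparison could be sketched as below. This is only an illustration under that assumption; `file_digest`, `build_manifest`, and `diff_trees` are hypothetical names, and the two root paths would be whatever the mount point and the archive folder actually are.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of one file, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def diff_trees(source_root, archive_root):
    """Return (missing, mismatched) relative paths, each sorted.

    missing:    files under source_root that are absent from archive_root
    mismatched: files present in both whose digests differ
    """
    src = build_manifest(source_root)
    dst = build_manifest(archive_root)
    missing = sorted(set(src) - set(dst))
    mismatched = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, mismatched
```

An empty result from `diff_trees("<NFS-mounted folder>", "<local archive folder>")` would mean every source file has a byte-identical copy in the archive.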