
Is verifying checksums needed for files moved out of Hadoop (HDFS/MapR-FS) by a client?

Question asked by volans on Nov 24, 2013
Latest reply on Jul 8, 2015 by Ted Dunning

Do we need to verify checksums after moving files from Hadoop (HDFS/MapR-FS) to a Linux server through a client?


We have MapR M3 installed on our cluster, and on a daily basis we would like to archive HDFS files (dozens of terabytes) to an external Linux server with ample storage space.

Our current plan is to install the MapR client on the Linux server and archive files by running a `copyToLocal` command there, as follows:

`hadoop fs -copyToLocal <hdfs folder to copy from> <Local Linux folder to copy to>`

Obviously, we want to make sure the files on the Linux server are not corrupted after the copy. But is an explicit checksum verification necessary?

I came across posts saying that checksums are already verified by the client:


*Ted: "the checksums are computed on the client side so that they protect against network errors as well as disk errors."*


*Ted: "Your application will never see corrupted data. Checksums are tested at the client level..."*

*MC: "On a CRC error, the client retries the RPC at the same server..."*

So it seems to me that an extra checksum verification is not necessary. Kindly let me know otherwise.
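In case an external check does turn out to be worthwhile, here is a rough sketch of what I have in mind: build an md5 manifest of the source tree and verify the copied tree against it. The function name and all paths are hypothetical; in our setup the source side could be read over the MapR NFS mount.

```shell
# Sketch of an external integrity check between two directory trees.
# verify_tree is a hypothetical helper; SRC could be the NFS-mounted
# view of the archived MapR-FS folder and DST the local copy.
verify_tree() {
  src=$1
  dst=$2
  manifest=$(mktemp)
  # Build a manifest of relative paths and md5 digests from the source.
  ( cd "$src" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$manifest"
  # Verify the destination against it; md5sum -c exits non-zero if any
  # file differs or is missing.
  ( cd "$dst" && md5sum -c --quiet "$manifest" >/dev/null 2>&1 )
  status=$?
  rm -f "$manifest"
  return $status
}

# Example with made-up paths:
# verify_tree /mapr/my.cluster.com/archive/2013-11-24 /data/archive/2013-11-24 \
#   && echo "trees match" || echo "MISMATCH"
```

This is of course an extra full read of both sides, which is why I would rather rely on the client-side checksumming if that is sufficient.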

**copyToLocal vs. NFS mount**:

On a different but related topic: if checksum verification is already handled by the client and an external check is unnecessary, is either of the following two options preferable?

 - Run the `hadoop fs -copyToLocal` command as above.

 - Create an NFS mount on the Linux server, then run a Linux `cp` command.
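For the second option, I am assuming something along these lines (the mount command is from memory of the MapR docs, and the wrapper function and paths are hypothetical; the exact mount options depend on the release):

```shell
# One-time setup (hypothetical CLDB node name; check MapR docs for options):
#   mount -o hard,nolock my-cldb-node:/mapr /mapr

# Hypothetical helper: archive a directory tree from any mounted path
# (e.g. /mapr/my.cluster.com/archive/2013-11-24) to local storage.
archive_dir() {
  src=$1   # e.g. /mapr/my.cluster.com/archive/2013-11-24
  dst=$2   # e.g. /data/archive/2013-11-24
  # -p preserves timestamps and modes; -r copies the whole tree.
  cp -pr "$src" "$dst"
}
```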

We prefer the first option simply because we are more familiar with it, and traditionally NFS in general (not MapR specifically) has been a little more error-prone. But we are open to both. Thanks a lot; I appreciate any insight on this topic.