Does HDFS ensure Data Integrity of data blocks stored in HDFS? How?
HDFS? MapR-FS? or both?
The short answer is that it doesn't. ;-)
The longer answer...
HDFS / MapR-FS replicates data according to the replication factor you set. By default there are three copies of every block within a file. These block copies are spread across the cluster so that if a node / volume goes down, other copies elsewhere in the cluster can still be used. Each block also has a checksum (hash). This gives HDFS a way to compare copies of a block and spot a problem... e.g. a corrupted block.
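As a quick illustration, the Hadoop Java API lets you inspect and change a file's replication factor. This is a minimal sketch; the path and the new factor of 5 are just made-up examples, and the default of 3 comes from the dfs.replication setting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");     // hypothetical path

            // Current replication factor for this file (defaults to dfs.replication, normally 3)
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("replication = " + current);

            // Ask the NameNode to keep 5 copies of every block in this file instead
            fs.setReplication(file, (short) 5);

            fs.close();
        }
    }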
With 3 copies, if 2 of the 3 match, we know which copy is bad; it gets dropped and one of the good copies is re-replicated. With fewer copies you can end up with a bad block and no way to tell which one is good, which could mean data loss. Some will set replication to a higher value. The more replicas a block has, the less chance that corruption of a file/block leads to a loss of data. (Choose the copy whose checksum matches the majority, as sketched below.)
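The "majority of checksums" idea boils down to something like the toy sketch below. This is not actual HDFS code, just an illustration; the checksum strings are placeholders.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MajorityChecksum {
        // Return the checksum reported by the majority of replicas;
        // any replica whose checksum differs from this one is the corrupt copy.
        static String majority(List<String> replicaChecksums) {
            Map<String, Integer> counts = new HashMap<>();
            for (String c : replicaChecksums) {
                counts.merge(c, 1, Integer::sum);
            }
            return counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();
        }

        public static void main(String[] args) {
            // 3 replicas: two agree, one is corrupt
            List<String> checksums = Arrays.asList("ab12", "ab12", "ff00");
            System.out.println("good checksum = " + majority(checksums));
            // With only 2 replicas that disagree there is no majority,
            // so you cannot tell which copy is the bad one.
        }
    }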
Note: under Apache HDFS the blocks are immutable. MapR-FS, which also supports a POSIX interface, allows files / blocks to be mutable. The implication is that MapR recalculates a block's checksum after a write, and once the file is closed it replicates the block, which now has a new checksum.
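You can see that effect from the client side with something like the sketch below, assuming the target filesystem permits appends (e.g. MapR-FS, or HDFS with append enabled); the path is hypothetical. The file-level checksum changes after the write because the underlying block checksums are recalculated.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumAfterAppend {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/mutable-example.txt");   // hypothetical path

            // Checksum of the file as it currently exists
            System.out.println("before: " + fs.getFileChecksum(file));

            // Append more bytes (only works if the filesystem supports appends)
            try (FSDataOutputStream out = fs.append(file)) {
                out.writeBytes("more data\n");
            }

            // The file-level checksum is now different, because the checksums of
            // the modified/extended blocks were recalculated after the write.
            System.out.println("after:  " + fs.getFileChecksum(file));
        }
    }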
The reason I said that it doesn't is that while there's a default replication factor, it can be changed, and nothing actively prevents a block from getting corrupted in the first place. Every so often an fsck runs (or can be run manually, e.g. hdfs fsck /) that scans for bad blocks or under-replicated copies; that check-and-repair cycle is the only fix for the problem. (So I guess you can say that it does, but it's after the fact, so it doesn't ensure integrity so much as clean up the mess.) [Yeah, I'm splitting words here, but you get the idea. ;-) ]
Data integrity means making sure that no data is lost or corrupted during storage or processing. Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block. The checksum is computed when data is first written to disk and checked again when the data is read back. If the computed checksum matches the stored checksum, the data is considered intact; otherwise it is considered corrupted. It is possible that it is the checksum itself that is corrupt rather than the data, but this is very unlikely because the checksum is much smaller than the data. HDFS uses an efficient variant, CRC-32C, to calculate the checksums.

DataNodes are responsible for verifying the data they receive before storing it along with its checksum. A checksum is computed for the data they receive from clients and from other DataNodes during replication.

Hadoop can heal corrupted data by copying one of the good replicas to produce a new, uncorrupted replica. If a client detects an error while reading a block, it reports the bad block and the DataNode it was reading from to the NameNode before throwing a ChecksumException. The NameNode marks that block replica as corrupt so it no longer directs clients to it or uses it as a source for replication. It then schedules a copy of the block from a good replica onto another DataNode, so the replication factor returns to the expected level. Once this has happened, the corrupt replica is deleted.
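To illustrate the read-side check: the Hadoop client verifies checksums transparently whenever you read through the FileSystem API. Below is a minimal sketch (the path is hypothetical); getFileChecksum returns the server-side checksum for the whole file, an MD5 over the per-block CRCs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class VerifiedRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example.txt");   // hypothetical path

            // Checksum verification is on by default; this call just makes it explicit.
            fs.setVerifyChecksum(true);

            // Every block read here is checked against its stored CRC-32C checksums.
            // A mismatch surfaces as a ChecksumException, and the client reports the
            // bad replica to the NameNode before retrying another DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            // File-level checksum (MD5 of the per-block CRCs) as seen by the cluster.
            System.out.println("checksum: " + fs.getFileChecksum(file));
        }
    }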