1. HDFS: Data is distributed across nodes. HDFS splits files into blocks of 128 MB and distributes these blocks across different locations throughout the cluster. Files are automatically distributed as they are written.
   MapR-FS: MapR-FS builds on the HDFS strategy: it still distributes data, distributes computation, and tolerates faults, but several architectural components unique to the MapR File System overcome weaknesses of HDFS. When data is written to MapR-FS, it is sharded into chunks; the default chunk size is 256 MB. Chunks are striped in a series of blocks across the disks of a storage pool, into logical entities called containers. Striping the data across multiple disks allows data to be written faster, because a file is split across the three physical disks in a storage pool while remaining in one logical container.
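The chunking and striping arithmetic above can be sketched as follows. This is a toy illustration, not MapR code: the function names are invented, and the round-robin placement only mimics the idea of striping blocks over the three disks of a storage pool.

```python
# Sketch of how a file is carved into chunks and its blocks striped across
# the disks of a storage pool. Sizes mirror the defaults described above;
# the functions are illustrative, not MapR APIs.

CHUNK_SIZE = 256 * 1024 * 1024   # 256 MB default chunk size
DISKS_PER_POOL = 3               # physical disks in one storage pool

def chunk_offsets(file_size):
    """Return (offset, length) for each chunk of a file."""
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]

def stripe_disk(block_index):
    """Round-robin: which disk in the pool holds this block of a chunk."""
    return block_index % DISKS_PER_POOL

file_size = 600 * 1024 * 1024             # a 600 MB file
chunks = chunk_offsets(file_size)
print(len(chunks))                        # 3 chunks: 256 + 256 + 88 MB
print([stripe_disk(i) for i in range(5)]) # blocks rotate over disks 0, 1, 2, 0, 1
```

Because consecutive blocks land on different disks, writes to one logical container proceed on three spindles in parallel.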
2. HDFS: Metadata is managed by the NameNode. Before any operation can be performed on data stored in HDFS, an application must contact the NameNode. This single NameNode maintains metadata for all the physical data blocks that comprise the files, which can create performance bottlenecks.
   MapR-FS: MapR-FS distributes and replicates namespace information throughout the cluster, in the same way that data is replicated. Each volume has a name container, which holds the metadata for the files in that volume. The CLDB service, which typically runs on multiple nodes in the cluster, is used to locate the name container for a volume; the client then connects to the name container to access the files.
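The two-step lookup just described can be sketched with plain dictionaries standing in for the cluster services. Everything here is illustrative: the names, paths, and node labels are invented, not real MapR identifiers or APIs.

```python
# Toy sketch of the MapR-FS lookup flow: the client asks the CLDB which node
# holds a volume's name container, then asks that name container where the
# file's data lives. Dicts stand in for distributed services.

cldb = {"projects": "node2", "users": "node5"}             # volume -> name-container node
name_containers = {
    "node2": {"/projects/report.csv": ["node3", "node7"]}  # path -> data locations
}

def locate(volume, path):
    nc_node = cldb[volume]                  # step 1: CLDB lookup (runs on several nodes)
    return name_containers[nc_node][path]   # step 2: name container returns data nodes

print(locate("projects", "/projects/report.csv"))  # ['node3', 'node7']
```

The point of the structure: no single node has to hold the whole namespace, since each volume's metadata lives in its own replicated name container.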
3. HDFS: Data stored on any node is replicated multiple times across the cluster. These replicas prevent data loss: if one node fails, other nodes can continue serving the data.
   MapR-FS: Both HDFS and MapR-FS use replication for high availability and fault tolerance. Replication protects against hardware failures. In MapR-FS, file chunks, table regions, and metadata are automatically replicated, and there is generally at least one replica on a different rack.
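The "at least one replica on a different rack" rule can be sketched as a small placement function. This is a simplified illustration of rack awareness, not the actual placement algorithm of either file system.

```python
# Minimal rack-aware placement: keep the first replica on the writer's node,
# force the second onto a different rack, then fill remaining slots from any
# unused nodes. Illustrative policy only.

def place_replicas(writer_node, nodes_by_rack, n_replicas=3):
    """nodes_by_rack: {rack_name: [node, ...]}. Returns chosen nodes."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    replicas = [writer_node]                 # first copy stays local
    for rack, ns in nodes_by_rack.items():   # second copy on a remote rack
        if rack != local_rack:
            replicas.append(ns[0])
            break
    for ns in nodes_by_rack.values():        # fill the rest from unused nodes
        for n in ns:
            if len(replicas) == n_replicas:
                return replicas
            if n not in replicas:
                replicas.append(n)
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n2']
```

With one replica guaranteed off-rack, losing an entire rack (switch failure, power) still leaves a live copy of the data.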
4. HDFS: Data in HDFS is immutable. If the source data changes, the new data must be appended to the existing data, or else the files must be reloaded into the cluster. Because of the large I/O size, files in HDFS are append-only.
   MapR-FS: MapR-FS allows updates to files in increments as small as 8 KB. The smaller I/O size reduces overhead on the cluster, enables snapshots, and is one of the reasons that MapR-FS supports random reads and writes, even during ingestion.
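The difference is easy to see with an in-place update of the kind MapR-FS permits: seek into an existing file and overwrite a small region, without appending or rewriting. The sketch below runs against a local file using only the Python standard library; on HDFS this access pattern is simply unavailable.

```python
# Overwrite one 8 KB region of an existing file in place. The file size does
# not change and nothing is appended, i.e. a random write rather than the
# append-only pattern HDFS requires.

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")

with open(path, "wb") as f:              # create an initial 32 KB file of zeros
    f.write(b"\x00" * 32 * 1024)

with open(path, "r+b") as f:             # r+b: update in place, no truncation
    f.seek(8 * 1024)                     # jump to offset 8 KB
    f.write(b"\xff" * 8 * 1024)          # overwrite exactly one 8 KB region

with open(path, "rb") as f:
    data = f.read()
print(len(data))                         # 32768: same size, no append, no rewrite
print(data[8 * 1024], data[16 * 1024])   # 255 0: only the middle region changed
```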
5. HDFS: The same block size, while configurable, is used to define the I/O size, the sharding size, and the unit of replication. This one-size-fits-all approach does not exploit the inherent differences between these requirements.
   MapR-FS: Decouples these sizes: I/O happens in increments as small as 8 KB, while sharding uses 256 MB chunks.
6. HDFS: HDFS does not support standard POSIX file semantics. You must use the 'hadoop fs' command to read and write data, and users must learn this command and incorporate it into their data flows.
   MapR-FS: Supports standard POSIX commands, so files can be read and written with ordinary tools.
7. HDFS: HDFS has limited support for snapshots and remote mirrors, both of which enhance the availability and usability of a data set.
   MapR-FS: MapR-FS provides additional options for fault tolerance. Snapshots protect against user errors or application failures and allow point-in-time recovery; you can read files and tables directly from a snapshot.
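The snapshot semantics above can be illustrated conceptually: once a snapshot is taken, later writes to the live volume do not affect it, and it remains directly readable. A dict stands in for a volume here; this shows the behavior, not how MapR-FS implements snapshots internally.

```python
# Conceptual sketch of point-in-time recovery: the snapshot preserves the
# state at the moment it was taken, while the live "volume" keeps changing.

volume = {"/logs/day1": "v1", "/logs/day2": "v1"}

snapshot = dict(volume)          # freeze the current state (a copy, for illustration)

volume["/logs/day2"] = "v2"      # the live volume keeps taking writes
volume["/logs/day3"] = "v1"      # ...and new files

print(snapshot["/logs/day2"])    # v1: the snapshot is unchanged and readable directly
print(volume["/logs/day2"])      # v2: the live data has moved on
print("/logs/day3" in snapshot)  # False: files created later are not in the snapshot
```

If an application corrupts `/logs/day2`, recovery is a read from the snapshot rather than a restore from external backup.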
8. HDFS: Because HDFS is written in Java, it is slower than file systems written in languages compiled to machine code.
9. HDFS: No mirroring.
   MapR-FS: MapR-FS also provides the option of mirroring entire volumes. Mirroring provides remote disaster-recovery backups as well as local load balancing.
10. HDFS: The NameNode can be a single point of failure and a performance bottleneck.
   MapR-FS: MapR-FS avoids this problem by fully distributing the metadata for files across the cluster.
11. HDFS: The NameNode scales to roughly 100 million files per cluster.
   MapR-FS: MapR-FS has no such limitation; you can create files as long as there is disk space.
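A back-of-the-envelope calculation shows where the file-count ceiling comes from: all namespace metadata must fit in one NameNode's heap. A commonly cited rule of thumb is roughly 150 bytes of heap per namespace object (file or block); that figure is an estimate, not an exact HDFS constant, and the one-block-per-file assumption below is for illustration.

```python
# Rough sizing of NameNode heap for a 100-million-file namespace, using the
# widely quoted ~150 bytes per file/block object. Estimates only.

BYTES_PER_OBJECT = 150          # rule-of-thumb heap cost per namespace object

files = 100_000_000             # 100 million files
blocks_per_file = 1             # assume small files, one block each

objects = files * (1 + blocks_per_file)        # one file object + one block object
heap_gb = objects * BYTES_PER_OBJECT / 1024**3
print(round(heap_gb, 1))        # ~27.9 GB of heap just to hold the namespace
```

Since every object lives in a single JVM heap, the namespace size is bounded by one machine's memory, whereas metadata spread over many name containers scales with the cluster.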
12. HDFS: HDFS is written in Java.
   MapR-FS: MapR-FS is written in C. Being written in C means no JVM garbage-collection pauses, which translates to faster, more predictable performance.