Discussion created by foudroyantanalytics on Aug 23, 2016
1. HDFS: Data is distributed across nodes. HDFS splits data into blocks of 128 megabytes and distributes these blocks across different locations throughout your cluster. Files are automatically distributed as they are written.
   MapR-FS: Builds on the HDFS strategy. It still distributes data, distributes computation, and tolerates faults, but is more efficient than HDFS. Many architectural components are unique to the MapR File System and overcome some of the weaknesses of HDFS. When data is written to MapR-FS, it is sharded into chunks. The default chunk size is 256 megabytes. Chunks are striped across storage pools in a series of blocks, into logical entities called containers. Striping the data across multiple disks allows data to be written faster, because the file is split across the three physical disks in a storage pool but remains in one logical container.
2. HDFS: Metadata is managed by the NameNode. Before any operation can be performed on data stored in HDFS, an application must contact the NameNode. The single NameNode maintains metadata for all the physical data blocks that make up the files. This can create performance bottlenecks.
   MapR-FS: Distributes and replicates the namespace information throughout the cluster, in the same way that data is replicated. Each volume has a name container, which contains the metadata for the files in that volume. The CLDB service typically runs on multiple nodes in the cluster. CLDB is used to locate the name container for the volume, and the client connects to the name container to access the file.
3. HDFS: Data stored on any node is replicated multiple times across the cluster. These replicas prevent data loss. If one node fails, other nodes can continue processing the data.
   MapR-FS: Both HDFS and MapR-FS use replication for high availability and fault tolerance. Replication protects against hardware failures. File chunks, table regions, and metadata are automatically replicated. There is generally at least one replica on a different rack.
4. HDFS: Data is immutable. If the source data changes, the new data must be appended to the existing data, or else reloaded into the cluster.
   MapR-FS: Allows updates to files in increments as small as 8K. The smaller I/O size reduces overhead on the cluster, which allows for snapshots, and is one of the reasons that MapR-FS supports random reads and writes, even during ingestion. In HDFS, by contrast, files are append-only because of the large I/O size.
5. HDFS: The same block size, while configurable, is used to define the I/O size, the sharding size, and the replication unit. This one-size-fits-all approach does not exploit the inherent differences between these requirements.
6. HDFS: Does not support standard POSIX file semantics. You must use the ‘hadoop fs’ command to read and write the data. Users of HDFS must learn this command and incorporate it into their data flows.
   MapR-FS: Supports standard POSIX commands.
7. HDFS: Has limited support for snapshots or remote mirrors, both of which enhance the availability and usability of the data set.
   MapR-FS: Provides additional options for fault tolerance. Snapshots protect against user errors or application failures and allow point-in-time recovery. You can read files and tables directly from a snapshot.
8. HDFS: Because HDFS is written in Java, it is slower than file systems written in languages that compile to machine code.
9. HDFS: No mirroring.
   MapR-FS: Also provides the option of mirroring entire volumes. Mirroring provides additional remote disaster-recovery backups, as well as local load balancing.
10. HDFS: The NameNode can become a single point of failure and a performance bottleneck.
    MapR-FS: Avoids this problem by fully distributing the metadata for files and directories.
11. HDFS: The NameNode scales up to 100 million files per cluster.
    MapR-FS: There is no such limit. You can create files as long as there is disk space.
12. HDFS: Written in Java.
    MapR-FS: Written in C. C has no garbage collector, so there are no JVM garbage-collection pauses, which translates to faster performance.
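
To make the sizes in item 1 concrete, here is a rough sketch of how many pieces a file ends up in under each scheme. The two sizes are the ones quoted above; the function name `split_count` is just something made up for this example, and the sketch ignores everything else (replication, striping inside a storage pool, containers).

```python
MIB = 1024 * 1024
HDFS_BLOCK_SIZE = 128 * MIB   # block size quoted in item 1
MAPR_CHUNK_SIZE = 256 * MIB   # default chunk size quoted in item 1


def split_count(file_size: int, unit_size: int) -> int:
    """Return how many units of unit_size bytes are needed for file_size bytes.

    Hypothetical helper for illustration only; it is plain ceiling division.
    """
    if file_size < 0 or unit_size <= 0:
        raise ValueError("file_size must be >= 0 and unit_size must be > 0")
    return (file_size + unit_size - 1) // unit_size


if __name__ == "__main__":
    one_gib = 1024 * MIB
    print("HDFS blocks for 1 GiB:", split_count(one_gib, HDFS_BLOCK_SIZE))
    print("MapR-FS chunks for 1 GiB:", split_count(one_gib, MAPR_CHUNK_SIZE))
```

With the defaults, a file needs roughly half as many MapR-FS chunks as HDFS blocks, since the chunk is twice the size of the block.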
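
For item 4, a small sketch of what "increments as small as 8K" could mean for an in-place update. This assumes updates are applied in aligned 8 KiB units, which is a simplification for illustration and not a description of MapR-FS internals; `touched_units` is a hypothetical helper.

```python
KIB = 1024
UPDATE_UNIT = 8 * KIB  # the 8K update increment quoted in item 4


def touched_units(offset: int, length: int, unit: int = UPDATE_UNIT) -> range:
    """Return the indices of the aligned units overlapped by an update.

    Assumption: the update covers bytes [offset, offset + length) and the
    file is viewed as consecutive units of `unit` bytes starting at byte 0.
    """
    if offset < 0 or unit <= 0:
        raise ValueError("offset must be >= 0 and unit must be > 0")
    if length <= 0:
        return range(0)  # nothing is written
    first = offset // unit
    last = (offset + length - 1) // unit
    return range(first, last + 1)


if __name__ == "__main__":
    units = touched_units(offset=10_000, length=100)
    print("units touched:", list(units))
    print("bytes rewritten:", len(units) * UPDATE_UNIT)
```

Under this model a 100-byte change rewrites a single 8 KiB unit, whereas on an append-only file system the same change means appending or reloading the data.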
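
Item 6 is easier to see side by side. The sketch below builds the `hadoop fs -cat` command line an application would have to shell out to for HDFS, and reads the same kind of file with ordinary file I/O for a POSIX-accessible file system. The helper names and the example paths (including the `/mapr/my.cluster.com/...` mount point) are assumptions for illustration; adjust them to your own cluster.

```python
from pathlib import Path
from typing import List


def hadoop_cat_command(hdfs_path: str) -> List[str]:
    """Build the argument list for reading a file through the Hadoop CLI.

    Pass the result to subprocess.run(..., capture_output=True, check=True)
    on a machine where the `hadoop` client is installed.
    """
    return ["hadoop", "fs", "-cat", hdfs_path]


def read_posix(path: str) -> bytes:
    """Read a file with ordinary POSIX file I/O, no Hadoop client needed."""
    return Path(path).read_bytes()


if __name__ == "__main__":
    # Hypothetical HDFS path: only prints the command that would be run.
    print(" ".join(hadoop_cat_command("/user/alice/data.csv")))

    # Hypothetical NFS mount of a MapR cluster: read it only if it exists.
    mounted = "/mapr/my.cluster.com/user/alice/data.csv"
    if Path(mounted).exists():
        print(len(read_posix(mounted)), "bytes read via plain file I/O")
```

The point is that the second function is not Hadoop-specific at all, so existing tools and scripts that work on local files keep working.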