Discuss some of the advantages and disadvantages of various big data solutions, such as HDFS and MapR-FS. What are the limitations of each? What features do you need to solve your big data problems?
1) Distributes data and computation. Keeping computation local to the data prevents network overload.
2) Tasks are independent, so a failed task can simply be retried without affecting the rest of the job.
3) Linear scaling in the ideal case. It was designed to run on cheap, commodity hardware.
4) Simple programming model. The end-user programmer only writes map and reduce tasks (see the sketch after this list).
5) Flat scalability curve: a program written for a small cluster can run on a much larger cluster with little or no rework.
6) HDFS stores very large amounts of information.
7) HDFS has a simple and robust coherency model.
8) That is, it stores data reliably.
9) HDFS is scalable and provides fast access to this information; it can also serve a large number of clients simply by adding more machines to the cluster.
10) HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon locally whenever possible.
11) HDFS provides streaming read performance.
12) Data is typically written to HDFS once and then read many times.
13) The overhead of caching is avoided: data can simply be re-read from its HDFS source.
14) Fault tolerance by detecting faults and applying quick, automatic recovery
15) Processing logic close to the data, rather than the data close to the processing logic
16) Portability across heterogeneous commodity hardware and operating systems
17) Economy by distributing data and processing across clusters of commodity personal computers
18) Efficiency by distributing data and logic to process it in parallel on nodes where data is located
19) Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures
20) HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity.
21) Ability to write MapReduce programs in Java, a language which even many non-computer scientists can learn with sufficient capability to meet powerful data-processing needs
22) Ability to rapidly process large amounts of data in parallel
23) Can be deployed on large clusters of cheap commodity hardware as opposed to expensive, specialized parallel-processing hardware
24) Can be offered as an on-demand service, for example as part of Amazon’s EC2 cluster computing service
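To make points 4) and 21) concrete, that the end-user programmer only writes map and reduce tasks, in Java, here is a minimal word-count sketch against the Hadoop MapReduce Java API. It is essentially the classic tutorial example; the class names and input/output paths are illustrative only.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Everything else, splitting the input, scheduling tasks near the data blocks, retrying failed tasks, and shuffling the (word, count) pairs to the reducers, is handled by the framework.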
On the other hand, there are also disadvantages:
1) Rough edges:- Hadoop MapReduce and HDFS are still rough around the edges, because the software is under active development.
2) Programming model is very restrictive:- the lack of shared, central data can get in the way.
3) Joins of multiple datasets are tricky and slow:- there are no indices, and often the entire dataset gets copied in the process.
4) Cluster management is hard:- operations such as debugging, distributing software, and collecting logs are difficult.
5) Still a single master, which requires care and may limit scaling
6) Managing job flow isn’t trivial when intermediate data should be kept
7) Optimal configuration of nodes is not obvious, e.g. the number of mappers, the number of reducers, and memory limits
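To illustrate how non-obvious that last point is in practice, here is a small sketch using the Hadoop 2.x Job and Configuration APIs. The memory sizes and reducer count below are placeholders picked for illustration, not recommendations; sensible values depend entirely on the cluster and the job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TuningSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-task container sizes (illustrative values only).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);

        Job job = Job.getInstance(conf, "tuning-sketch");
        // The reducer count has to be picked by hand...
        job.setNumReduceTasks(8);
        // ...while the mapper count cannot be set directly at all: it falls out
        // of the number of input splits, which is influenced by settings such as
        // mapreduce.input.fileinputformat.split.minsize.
      }
    }

Getting these numbers wrong shows up as either wasted containers or tasks killed for exceeding their memory limits, which is exactly why this tuning takes experimentation.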
Wow! Very thorough answer, Aparna Sekar! I hope this answer helps other students!
Thank you. Your answer is thorough, but I think the issues you mentioned are not fixed ones, because our technology is evolving and things change all the time.
I am still learning, so I have no problems to discuss yet.
Hmmmm... I thought the QUESTION to be answered was "What features are important to YOUR Big Data Projects?", but the reply above seemed to think the question was "Compare & Contrast the Strengths & Weaknesses of HDFS vs MAPR-FS" ;-) ... That said, I'll try to answer the ACTUAL question - What Features are Important to YOUR Big Data Projects?
FEATURE: ECONOMY - in this weird age where EVERYONE is trying to get EVERYTHING into the CLOUD (whose VMs represent comparatively EXPENSIVE, NON-"commodity" servers), HOW do you create ECONOMICAL Hadoop clusters? I'm new to Hadoop, but the underlying philosophy ASSUMES several things: 1. that "commodity" servers are UBIQUITOUS and BRUTALLY AFFORDABLE; and 2. that the "norm" is still creating massive LOCAL data centers which HOUSE these commodity servers.
FACT: It is now 2017... and for at least 3 YEARS now the BIG push has been to create CLOUD-hosted servers! These are neither "commodity" NOR "brutally cheap"; in fact, quite the OPPOSITE. So I'll be looking for how MAPR addresses this challenge of computing in the CLOUD...
There are other, closed-source solutions that address some of HDFS's disadvantages, such as Splunk. I want to learn more about the whole Hadoop ecosystem to see what the real differences are.
I hope the scalability issue has been overcome in Hadoop 2.0.
Distributed processing is the gold at the end of my current rainbow. We currently still transfer terabytes of data nightly from around the world to a central data center in "Somewhere in CA" and into our DW environment. We then perform nightly transformations and analysis to produce the daily management reports and executive data-analytics reports 'just in time' every weekday. Moving the processing upstream to each location (or several triaged locations) and only transferring summarized, totaled (transformed) data, then running MapReduce jobs at the final central location, is the target for us.
Also, it still intrigues me to think that most or all problems can be reduced to chunking the data and algorithms down to output name-value pairs and then running a MapReduce function on the sets. This is just genius!
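What drives that point home for me is that MapReduce even has a built-in hook for the "summarize close to the data, ship only the totals" idea: a combiner runs the reduce logic on each mapper's local output before anything crosses the network. Assuming a driver like the word-count sketch earlier in this thread (the class names are illustrative only), it is one extra line:

    // Registering the reducer as a combiner pre-aggregates counts on each map
    // node, so only partial totals (not every raw name-value pair) are shuffled
    // across the network to the reducers.
    job.setCombinerClass(IntSumReducer.class);

This only works because summing is associative and commutative, so applying it early does not change the final result.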
I'm glad MapR and related big data technologies can be the gold at the end of your rainbow, Alex Zuniga! Happy learning!
In my point of view, beyond all these concepts, another important point is the governance of all ingested and transformed data.