Cluster of NFS local mounts for parallel processing?

Question asked by brenocon on Jul 8, 2012
Latest reply on Jul 8, 2012 by brenocon
Hi Ted & everyone --

I have lots of fairly embarrassingly-parallel processing jobs, and I find it's easiest to develop and run them on a local RAID disk with 20 or so parallel jobs via GNU Parallel.  I find this far, far easier than trying to get my processing scripts to run under Hadoop or Pig.  I split the data by files, different processes each take a subset of the files (this is how GNU Parallel works; or it can be done manually, with shell-script "wait", etc.), and each one writes to a suitably named output file.  Compared to Hadoop, this approach has two significant flaws: (1) no fault tolerance, and (2) it doesn't distribute.
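For concreteness, here's roughly the pattern I mean (the paths and process.sh are just placeholders for my actual scripts):

    # run up to 20 jobs at once, one input file per job,
    # each writing to a correspondingly named output file
    ls /data/raid/input/*.txt | \
      parallel -j 20 ./process.sh {} /data/raid/output/{/.}.out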

I'm wondering if MapR can help with the second -- do you think it's feasible to distribute this approach with a cluster running MapR's local NFS mounts?  Say, in the range of 10 to 50 worker nodes (perhaps 4-8 cores per node).  In the past I tried doing this with Lustre on a supercomputer architecture, but Lustre started creaking and groaning with funny errors.  I was reading and writing in the range of 1-4 TB per job with dozens of processes, and I got the impression that the quality of the distributed filesystem was the bottleneck.
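What I'm imagining is that the same one-liner would just point at the cluster filesystem through its NFS mount (the /mapr/... mount point below is my guess at the convention, and I haven't tried this), and then GNU Parallel's -S option could spread jobs across the worker nodes, since every node would see the same mounted paths:

    # same pattern, but input/output live in the distributed filesystem,
    # exposed on each node as an ordinary NFS mount
    ls /mapr/my.cluster.com/input/*.txt | \
      parallel -j 8 -S node1,node2,node3 ./process.sh {} /mapr/my.cluster.com/output/{/.}.out

(Here -j 8 is jobs per node, and process.sh would need to exist on each worker.)  Does that sound like a reasonable way to use the NFS layer?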

Compared to what Hadoop is supposed to do, this approach would be bad because you don't get data locality for your programs -- instead, the program tells the DFS which file it wants and the DFS has to get it there.  Ideally, Hadoop brings your mapper code to the data and runs it on the nodes where the data already is.  In practice, though, it was never clear to me how good Hadoop actually is at this; people seem to do arbitrary file access inside Hadoop jobs all the time and Hadoop still works.

Thoughts?
