
Reaping processes from transform scripts in Hive

Question asked by mandoskippy on Sep 24, 2012
Latest reply on Oct 18, 2012 by nabeel
We are experiencing a weird problem running MapR 2.0 and Hive 0.9.0. We have a somewhat complex process that loads data via a transform script. This transform script (a bash script) uses a generated external table that is basically a list of filenames, and it sends those filenames to another script (Python) for processing. It's a little bubble gum and duct tape, but based on how we wanted the files to be processed, it made sense at the time. (I have considered getting rid of the bash script and sending the results of the external table directly to the Python script, but I only want to rework things if that's actually the issue.)
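For concreteness, here is a minimal sketch of what the "no bash script" variant might look like, with the Python script acting as the transform itself. It relies on Hive's default TRANSFORM behavior of passing rows as tab-separated lines on STDIN and reading tab-separated lines from STDOUT. The names `read_rows` and `run_transform` are hypothetical, and the per-file processing shown is only a placeholder, not our real logic.

```python
def read_rows(path):
    """Hypothetical per-file processing: emit one row per line of the file."""
    with open(path) as fh:
        for line in fh:
            yield [path, line.rstrip("\n")]


def run_transform(rows_in, rows_out, process=read_rows):
    """Treat the first tab-separated column of each input row as a file name
    and write every row produced for that file as a tab-separated line."""
    for row in rows_in:
        path = row.rstrip("\n").split("\t")[0]
        if not path:
            continue  # skip blank rows
        for fields in process(path):
            rows_out.write("\t".join(fields) + "\n")
    rows_out.flush()
```

Used as a transform, the script would end with a call such as `run_transform(sys.stdin, sys.stdout)` after an `import sys`. Passing the streams in as arguments keeps the function easy to exercise outside Hive.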

Now, the Python script returns its data to STDOUT like a good transform script should, and the bash script passes that on to the Hive mapper like a good transform script should. Most of the time.

When doing a lot of loading, sometimes things misbehave. For lack of a better description, it seems the pipe connections between our scripts go away. The map task fails, gets spun up somewhere else where it succeeds, and life goes on. However, both the transform script and the Python script from the failed attempt hang around as processes. A lot of MapR-related processes also start to appear and never go away. They look as follows:

<pre>
bin/bash /opt/mapr//server/collectTaskDiagnostics.sh 26974 /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/userlogs/job_201209230944_0093/attempt_201209230944_0093_m_000162_0 mapr
</pre>
At this point, one or two is not a big deal, but they start to add up. Failures become more common, nodes get blacklisted, dogs and cats become friends, and the four horsemen start to saddle up.

So, the questions are these: Do I need a post-job process that reaps some of these processes? Are there settings that can make this more robust? Should I eliminate the shell script (and can you explain why I should, rather than having me just guess and check)? I know this is getting into a more esoteric discussion of Hive/Hadoop, but given the involvement of those collectTaskDiagnostics.sh processes, I figured I'd start here.
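To make the "more robust" option concrete, below is a rough sketch of a wrapper that could stand in for the bash script: it starts the processing command as a child, copies the child's output downstream, and makes sure the child is terminated if the downstream pipe goes away. This is only an illustration of the idea, not something we run today. The function name `pump` and the five-second grace period are assumptions, and it only notices a closed pipe when it next tries to write, so a child that has stopped producing output would not be caught this way.

```python
import subprocess


def pump(cmd, out):
    """Run cmd, copy its stdout to `out` line by line, and make sure the
    child process is gone before returning. Returns the child's exit code."""
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        for line in child.stdout:
            out.write(line)
        out.flush()
        child.wait()  # normal end: the child closed its stdout
    except BrokenPipeError:
        pass  # the downstream reader has gone away
    finally:
        child.stdout.close()
        if child.poll() is None:
            child.terminate()  # ask the child to exit
            try:
                child.wait(timeout=5)  # assumed grace period
            except subprocess.TimeoutExpired:
                child.kill()  # force it if it ignores the request
                child.wait()
    return child.returncode
```

In a real transform, `out` would be `sys.stdout.buffer` and `cmd` the command line of the Python processing script.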

Note: The nodes get into a crazy state too. I can't even run ps ax on them without that command hanging... lovely, eh? Any advice would be appreciated.
