AnsweredAssumed Answered

Reduce job hangs at 13%

Question asked by theignoredone on Aug 7, 2013
Latest reply on Aug 8, 2013 by gera
Hi,

I'm running a simple mapreduce job on a 3 node cluster of virtual machines. (Virtual box). I am able to ssh into each machine fine - no issue. The mapred.JobClient on the namenode shows that it has finished mapping and starts the reduce process but it hangs at 13%.

Here are the tasktracker logs on the various nodes

namenode tasktracker

2013-08-07 23:41:05,900 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201308072339_0001_m_000000_0  was 12337
2013-08-07 23:41:05,900 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
2013-08-07 23:41:06,912 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201308072339_0001_m_-1997119208 exited with exit code 0. Number of tasks it ran: 1
2013-08-07 23:41:07,474 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201308072339_0001_m_691788278 given task: attempt_201308072339_0001_m_000008_0
2013-08-07 23:41:09,402 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201308072339_0001_m_000008_0 1.0%
2013-08-07 23:41:09,408 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201308072339_0001_m_000008_0 is done.
2013-08-07 23:41:09,408 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201308072339_0001_m_000008_0  was 680
2013-08-07 23:41:09,408 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2013-08-07 23:41:09,752 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201308072339_0001_m_691788278 exited with exit code 0. Number of tasks it ran: 1

on slave node 1 it shows this message repeatedly

INFO org.apache.hadoop.mapred.TaskTracker: attempt_123 0.1333% reduce > copy (4 of 10 at 0.00 MB/s)

 * i abbreviated the attempt number

Finally on slave node 2 I get this

2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308072312_0001 for user-log deletion with retainTimeStamp:1376030371691
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308070104_0002 for user-log deletion with retainTimeStamp:1376030371691
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308071952_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308070026_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308071037_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308072003_0002 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308072113_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308072250_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308072312_0001 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:39:32,330 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201308070104_0002 for user-log deletion with retainTimeStamp:1376030371693
2013-08-07 23:41:10,250 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201308072339_0001_m_000004_0 task's state:UNASSIGNED
2013-08-07 23:41:10,254 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201308072339_0001_m_000004_0 which needs 1 slots
2013-08-07 23:41:10,254 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201308072339_0001_m_000004_0 which needs 1 slots
2013-08-07 23:41:10,257 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201308072339_0001_m_000005_0 task's state:UNASSIGNED
2013-08-07 23:41:10,257 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201308072339_0001_m_000005_0 which needs 1 slots
2013-08-07 23:41:10,257 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201308072339_0001_m_000005_0 which needs 1 slots
2013-08-07 23:41:10,965 INFO org.apache.hadoop.mapred.JobLocalizer: Initializing user hduser on this TT.
2013-08-07 23:41:11,364 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201308072339_0001_m_301152911
2013-08-07 23:41:11,373 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201308072339_0001_m_301152911 spawned.
2013-08-07 23:41:11,382 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201308072339_0001_m_41422609
2013-08-07 23:41:11,385 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201308072339_0001_m_41422609 spawned.
2013-08-07 23:41:11,394 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hduser/jobcache/job_201308072339_0001/attempt_201308072339_0001_
m_000005_0/taskjvm.sh
2013-08-07 23:41:11,397 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hduser/jobcache/job_201308072339_0001/attempt_201308072339_0001_
m_000004_0/taskjvm.sh
2013-08-07 23:41:14,797 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201308072339_0001_m_301152911 given task: attempt_201308072339_0001_m_000005_0
2013-08-07 23:41:14,800 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201308072339_0001_m_41422609 given task: attempt_201308072339_0001_m_000004_0
2013-08-07 23:41:18,199 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201308072339_0001_m_000004_0 1.0%
2013-08-07 23:41:18,203 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201308072339_0001_m_000004_0 is done.
2013-08-07 23:41:18,203 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201308072339_0001_m_000004_0  was 680
2013-08-07 23:41:18,220 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
2013-08-07 23:41:18,222 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201308072339_0001_m_000005_0 1.0%
2013-08-07 23:41:18,223 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201308072339_0001_m_000005_0 is done.
2013-08-07 23:41:18,223 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201308072339_0001_m_000005_0  was 680
2013-08-07 23:41:18,224 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2013-08-07 23:41:18,295 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201308072339_0001_m_000009_0 task's state:UNASSIGNED
2013-08-07 23:41:18,295 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201308072339_0001_m_000009_0 which needs 1 slots
2013-08-07 23:41:18,295 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201308072339_0001_m_000009_0 which needs 1 slots
2013-08-07 23:41:18,346 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM: jvm_201308072339_0001_m_301152911
2013-08-07 23:41:18,410 INFO org.apache.hadoop.util.ProcessTree: Killing process group17330 with signal TERM. Exit code 0
2013-08-07 23:41:18,411 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201308072339_0001_m_451250164
2013-08-07 23:41:18,411 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201308072339_0001_m_451250164 spawned.
2013-08-07 23:41:18,416 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hduser/jobcache/job_201308072339_0001/attempt_201308072339_0001_
m_000009_0/taskjvm.sh
2013-08-07 23:41:20,207 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143
2013-08-07 23:41:20,207 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
2013-08-07 23:41:20,207 INFO org.apache.hadoop.mapred.TaskController:
2013-08-07 23:41:20,207 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201308072339_0001_m_301152911 exited with exit code 143. Number of tasks it ran: 1
2013-08-07 23:41:20,272 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201308072339_0001_m_41422609 exited with exit code 0. Number of tasks it ran: 1
2013-08-07 23:41:20,723 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-08-07 23:41:20,737 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hduser for UID 1001 from the native implementation
2013-08-07 23:41:21,108 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201308072339_0001_m_451250164 given task: attempt_201308072339_0001_m_000009_0
2013-08-07 23:41:22,804 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201308072339_0001_m_000009_0 1.0%
2013-08-07 23:41:22,808 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201308072339_0001_m_000009_0 is done.
2013-08-07 23:41:22,808 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201308072339_0001_m_000009_0  was 680
2013-08-07 23:41:22,809 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2013-08-07 23:41:23,583 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201308072339_0001_m_451250164 exited with exit code 0. Number of tasks it ran: 1


It's interesting to note that the "exited" message shown on the tasktracker log of slavenode2 and namenode ends at 23:41:23 and 23:41:09 respectively and my mapred.JobClient halts at 23.41.27.


And the last line on the jobtracker shows this

2013-08-07 23:48:45,919 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for map task: attempt_201308072339_0001_m_000001_0 running on tracker: tracker_name-VirtualBox:localhost/127.0.0.1:41114 and reduce task: attempt_201308072339_0001_r_000000_0 running on tracker: tracker_name:localhost/127.0.0.1:32986

Thankyou for your help, I wait with bated breath for your response. I understand this may be quite a simple configuration issue. I do have my master and slaves defined in my /etc/hosts file.

I am running hadoop 1.2.1 by the way.


Outcomes