AnsweredAssumed Answered

failed to report status for 600 seconds. Killing!. Diagnostic information will be saved in userlogs

Question asked by tony on Dec 10, 2013
Latest reply on Dec 11, 2013 by tony
it failed when we run hive jobs. many tasktracker jobs were killed as over 600 secodes repoert. below is detail logs:


stdout logs
<pre>
2013-12-11 12:54:00
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.0-b56 mixed mode):

"SpillThread" daemon prio=10 tid=0x00007fb750865800 nid=0x47a waiting on condition [0x00007fb732ad8000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000000e4db2f48> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1473)

"communication thread" daemon prio=10 tid=0x00007fb75063b800 nid=0x44a in Object.wait() [0x00007fb732c12000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000e4db2fe8> (a java.util.concurrent.atomic.AtomicBoolean)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:664)
- locked <0x00000000e4db2fe8> (a java.util.concurrent.atomic.AtomicBoolean)
at java.lang.Thread.run(Thread.java:724)

"Thread for syncLogs" daemon prio=10 tid=0x00007fb7505fe800 nid=0x43e waiting on condition [0x00007fb735f6b000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.Child$3.run(Child.java:155)

"sendParams-0" daemon prio=10 tid=0x00007fb750616800 nid=0x43c waiting on condition [0x00007fb73606c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000000e4c89930> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

"IPC Client (47) connection to /127.0.0.1:35534 from job_201311212025_25810" daemon prio=10 tid=0x00007fb750615800 nid=0x43b in Object.wait() [0x00007fb73616d000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000e4daf3c8> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:681)
- locked <0x00000000e4daf3c8> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)

"Service Thread" daemon prio=10 tid=0x00007fb7500b8800 nid=0x439 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb7500ad800 nid=0x438 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb7500ab800 nid=0x437 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007fb7500a9800 nid=0x436 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007fb750089000 nid=0x435 in Object.wait() [0x00007fb741a0c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000e4c26408> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000000e4c26408> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

"Reference Handler" daemon prio=10 tid=0x00007fb750085000 nid=0x434 in Object.wait() [0x00007fb741b0d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000e4c25e90> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000000e4c25e90> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007fb75000a000 nid=0x41f runnable [0x00007fb754569000]
   java.lang.Thread.State: RUNNABLE
at com.mapr.fs.jni.MapRClient.writeRPC(Native Method)
at com.mapr.fs.Inode.flushPages(Inode.java:284)
at com.mapr.fs.Inode.releaseDirty(Inode.java:392)
at com.mapr.fs.MapRFsOutStream.dropCurrentPage(MapRFsOutStream.java:70)
at com.mapr.fs.MapRFsOutStream.write(MapRFsOutStream.java:81)
- locked <0x00000000f06952a0> (a com.mapr.fs.MapRFsOutStream)
at com.mapr.fs.MapRFsDataOutputStream.write(MapRFsDataOutputStream.java:46)
at org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:84)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
- locked <0x00000000f07b8058> (a org.apache.hadoop.fs.FSDataOutputStream)
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:270)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1585)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1427)
- locked <0x00000000e4db0958> (a org.apache.hadoop.mapred.MapTask$MapOutputBuffer)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:419)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

"VM Thread" prio=10 tid=0x00007fb750082800 nid=0x433 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fb75001b000 nid=0x420 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fb75001d000 nid=0x421 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fb75001f000 nid=0x422 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fb750020800 nid=0x423 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fb750022800 nid=0x424 runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fb750024800 nid=0x426 runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fb750026000 nid=0x427 runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fb750028000 nid=0x428 runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fb75002a000 nid=0x429 runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fb75002c000 nid=0x42a runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007fb75002d800 nid=0x42b runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007fb75002f800 nid=0x42c runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007fb750031800 nid=0x42d runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007fb750033000 nid=0x42e runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007fb750035000 nid=0x42f runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007fb750037000 nid=0x430 runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007fb750039000 nid=0x431 runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007fb75003a800 nid=0x432 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb7500ba800 nid=0x43a waiting on condition

JNI global references: 218

Heap
 PSYoungGen      total 92672K, used 52660K [0x00000000ef500000, 0x00000000f6100000, 0x0000000100000000)
  eden space 92160K, 56% used [0x00000000ef500000,0x00000000f283d238,0x00000000f4f00000)
  from space 512K, 37% used [0x00000000f6080000,0x00000000f60b0000,0x00000000f6100000)
  to   space 512K, 0% used [0x00000000f6000000,0x00000000f6000000,0x00000000f6080000)
 ParOldGen       total 545792K, used 433706K [0x00000000ce000000, 0x00000000ef500000, 0x00000000ef500000)
  object space 545792K, 79% used [0x00000000ce000000,0x00000000e878a8f0,0x00000000ef500000)
 PSPermGen       total 31744K, used 15675K [0x00000000c8e00000, 0x00000000cad00000, 0x00000000ce000000)
  object space 31744K, 49% used [0x00000000c8e00000,0x00000000c9d4ecf0,0x00000000cad00000)
</pre>
syslog logs

<pre>
2013-12-11 12:54:01,227 INFO org.apache.hadoop.mapred.Task: Communication exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201311212025_25810_m_000088_0, with JvmId: jvm_201311212025_25810_m_1093626998
at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:4716)
at org.apache.hadoop.mapred.TaskTracker.ping(TaskTracker.java:4860)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:993)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1326)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1322)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1320)

at org.apache.hadoop.ipc.Client.call(Client.java:1095)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:275)
at com.sun.proxy.$Proxy0.ping(Unknown Source)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:680)
at java.lang.Thread.run(Thread.java:724)
</pre>

Outcomes