AnsweredAssumed Answered

errors in mfs.err and slow reducers

Question asked by dimamah on Dec 1, 2014
Latest reply on Jun 28, 2016 by myacoub
lately when the cluster (V3.0.1 M3) is under heavy loads the mfs.err file on some servers get these errors :

Example from `server-05 (10.20.40.185)`

    2014-12-01 15:23:49,8478 xmit to 10.20.40.195:53695 for 0 28.122 took 3419 msec
    2014-12-01 15:23:49,8479 xmit to 10.20.40.195:53695 for 0 28.6 took 3395 msec
    2014-12-01 15:23:51,1942 xmit to 10.20.40.195:53695 for 0 28.122 took 3985 msec
    2014-12-01 15:23:51,1943 xmit to 10.20.40.195:53695 for 0 28.122 took 3973 msec
    2014-12-01 15:23:51,9452 xmit to 10.20.40.195:53695 for 0 28.122 took 4178 msec
    .....
    2014-12-01 15:24:25,9123 xmit to 10.20.41.194:5660 for 1 30.1 took 4197 msec
    2014-12-01 15:24:25,9123 xmit to 10.20.41.194:5660 for 1 30.1 took 3974 msec
    2014-12-01 15:24:25,9307 xmit to 10.20.41.194:5660 for 1 30.1 took 3981 msec
    2014-12-01 15:24:25,9308 xmit to 10.20.41.194:5660 for 1 30.1 took 3976 msec
    2014-12-01 15:24:25,9308 xmit to 10.20.41.194:5660 for 1 30.1 took 3971 msec
    2014-12-01 15:24:25,9523 xmit to 10.20.41.194:5660 for 1 30.1 took 3985 msec
    2014-12-01 15:24:25,9523 xmit to 10.20.41.194:5660 for 1 30.1 took 3524 msec
    2014-12-01 15:24:25,9717 xmit to 10.20.41.194:5660 for 1 30.1 took 3508 msec
    2014-12-01 15:24:25,9718 xmit to 10.20.41.194:5660 for 1 30.1 took 3473 msec
    2014-12-01 15:24:25,9719 xmit to 10.20.41.194:5660 for 1 30.1 took 3239 msec
    2014-12-01 15:24:26,0086 xmit to 10.20.41.194:5660 for 1 30.1 took 3250 msec
    2014-12-01 15:24:26,0087 xmit to 10.20.41.194:5660 for 1 30.1 took 3222 msec
    2014-12-01 15:24:26,0199 xmit to 10.20.41.194:5660 for 1 30.1 took 3206 msec
    2014-12-01 15:24:49,2884 xmit to 10.20.41.194:5660 for 1 30.1 took 3074 msec
    2014-12-01 15:24:49,2885 xmit to 10.20.41.194:5660 for 1 30.1 took 3059 msec
    2014-12-01 15:24:49,3206 xmit to 10.20.41.194:5660 for 1 30.1 took 3078 msec
    2014-12-01 15:24:49,3207 xmit to 10.20.41.194:5660 for 1 30.1 took 3066 msec
    2014-12-01 15:24:49,3207 xmit to 10.20.41.194:5660 for 1 30.1 took 3053 msec
    2014-12-01 15:24:49,3437 xmit to 10.20.41.194:5660 for 1 30.1 took 3054 msec
    2014-12-01 15:24:49,3438 xmit to 10.20.41.194:5660 for 1 30.1 took 3051 msec
    2014-12-01 15:24:49,3735 xmit to 10.20.41.194:5660 for 1 30.1 took 3064 msec
    2014-12-01 15:24:49,3736 xmit to 10.20.41.194:5660 for 1 30.1 took 3033 msec
    2014-12-01 15:24:49,3737 xmit to 10.20.41.194:5660 for 1 30.1 took 3018 msec
    2014-12-01 15:24:49,4003 xmit to 10.20.41.194:5660 for 1 30.1 took 3031 msec
    2014-12-01 15:24:49,4004 xmit to 10.20.41.194:5660 for 1 30.1 took 3018 msec
    2014-12-01 15:24:49,4331 xmit to 10.20.41.194:5660 for 1 30.1 took 3034 msec

during that time, the mfs.log-3 is mostly contains :

    2014-12-01 15:24:39,3838 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7490.26347.479494 offset 165806080 count 196608 vn 5637669 DHL WriteComplete localTm 0 elapsed
    Tm 5539
    2014-12-01 15:24:39,3843 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7375.26579.988362 offset 64421888 count 327680 vn 8988443 DHL WriteComplete localTm 1 elapsedT
    m 5543
    2014-12-01 15:24:39,3844 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7490.26347.479494 offset 165675008 count 131072 vn 5637668 DHL WriteComplete localTm 0 elapsed
    Tm 5540
    2014-12-01 15:24:39,4260 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7490.26347.479494 offset 166002688 count 131072 vn 5637670 DHL WriteComplete localTm 0 elapsed
    Tm 5567
    2014-12-01 15:24:39,4260 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26556.988316 offset 95944704 count 131072 vn 8988446 DHL WriteComplete localTm 0 elapsedT
    m 4504
    2014-12-01 15:24:39,4287 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7490.26347.479494 offset 166133760 count 131072 vn 5637671 DHL WriteComplete localTm 0 elapsed
    Tm 5534
    2014-12-01 15:24:39,4293 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7490.26383.479566 offset 65339392 count 327680 vn 5637672 DHL WriteComplete localTm 0 elapsedT
    m 4452
    2014-12-01 15:24:39,4494 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7490.26370.479540 offset 171507712 count 327680 vn 5637673 DHL WriteComplete localTm 0 elapsed
    Tm 4401
    2014-12-01 15:24:39,5047 INFO  writev3.cc:1107 x.x.0.0:4 WriteFile: fid 7490.26383.479566 offset 65667072 count 327680 vn 5637674 DHL WriteComplete localTm 0 elapsedT
    m 3979
    2014-12-01 15:24:39,5048 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26578.988360 offset 27918336 count 131072 vn 8988449 DHL WriteComplete localTm 0 elapsedT
    m 3825
    2014-12-01 15:24:39,5089 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7375.26578.988360 offset 27787264 count 131072 vn 8988447 DHL WriteComplete localTm 0 elapsedT
    m 4256
    2014-12-01 15:24:39,5089 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26545.988294 offset 133431296 count 131072 vn 8988448 DHL WriteComplete localTm 0 elapsed
    Tm 3945
    2014-12-01 15:24:39,5383 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7490.26347.479494 offset 166264832 count 131072 vn 5637675 DHL WriteComplete localTm 0 elapsed
    Tm 3590
    2014-12-01 15:24:39,5385 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26578.988360 offset 28180480 count 196608 vn 8988451 DHL WriteComplete localTm 0 elapsedT
    m 3571
    2014-12-01 15:24:39,5385 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26578.988360 offset 28377088 count 131072 vn 8988453 DHL WriteComplete localTm 0 elapsedT
    m 3546
    2014-12-01 15:24:39,5386 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7490.26347.479494 offset 166395904 count 131072 vn 5637676 DHL WriteComplete localTm 0 elapsed
    Tm 3512
    2014-12-01 15:24:39,5387 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7490.26370.479540 offset 171835392 count 327680 vn 5637678 DHL WriteComplete localTm 0 elapsed
    Tm 3434
    2014-12-01 15:24:39,5388 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26545.988294 offset 133693440 count 131072 vn 8988455 DHL WriteComplete localTm 0 elapsed
    Tm 3433
    2014-12-01 15:24:39,5388 INFO  writev3.cc:1107 x.x.0.0:2 WriteFile: fid 7375.26578.988360 offset 28508160 count 131072 vn 8988457 DHL WriteComplete localTm 0 elapsedT
    m 3408
    2014-12-01 15:24:39,5388 INFO  writev3.cc:1107 x.x.0.0:3 WriteFile: fid 7375.26545.988294 offset 133824512 count 131072 vn 8988458 DHL WriteComplete localTm 1 elapsed
    Tm 3378


During this time. We have many reducers stuck / work very very very slow.
in thier stderr we can see this :

Example from `server-11 (10.20.40.191)` (notice the reset by peer to that server-05 above)

    2014-12-01 14:36:44,6614 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694752950016 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**:5660, fid 11562.68439.1082228, upd 0
    2014-12-01 14:36:44,6615 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694754002688 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**:5660, fid 11623.38631.1213326, upd 0
    2014-12-01 14:36:44,6615 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694744528640 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**:5660, fid 11623.38622.1213308, upd 0
    2014-12-01 14:36:44,6616 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694748739328 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**:5660, fid 11555.70358.1086336, upd 0
    2014-12-01 14:36:44,6616 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694751897344 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**5660, fid 11557.68352.1089966, upd 0
    2014-12-01 14:36:44,6617 ERROR Client fs/client/fileclient/cc/client.cc:3468 Thread: 140694742423296 rpc err Connection reset by peer(104) 28.122 to **10.20.40.185**:5660, fid 11557.68401.1089804, upd 0

In their log we see this :

    2014-12-01 14:33:36,591 INFO org.apache.hadoop.mapred.Child: JVM: jvm_201411221244_9881_r_-1939851731 pid: 28300
    2014-12-01 14:33:36,910 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/distcache/1950914576643821857_-1151359923_97416786/maprfs/user/hadoop/tmp/hive/hive_2014-12-01_14-01-41_536_5637790490232049158-1704/-mr-10019/2bc88c61-8ff3-45a4-b172-49d93c50a082/map.xml <- /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/work/map.xml
    2014-12-01 14:33:36,911 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/distcache/8483267215955501469_-2041230535_97417161/maprfs/user/hadoop/tmp/hive/hive_2014-12-01_14-01-41_536_5637790490232049158-1704/-mr-10019/2bc88c61-8ff3-45a4-b172-49d93c50a082/reduce.xml <- /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/work/reduce.xml
    2014-12-01 14:33:36,917 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/.job.jar.crc <- /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/work/.job.jar.crc
    2014-12-01 14:33:36,917 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/job.jar <- /tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/work/job.jar
    2014-12-01 14:33:36,941 INFO org.apache.hadoop.mapred.Child: Starting task attempt_201411221244_9881_r_000000_0
    2014-12-01 14:33:36,942 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
    2014-12-01 14:33:37,075 INFO org.apache.hadoop.mapreduce.util.ProcessTree: setsid exited with exit code 0
    2014-12-01 14:33:37,086 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: /proc/<pid>/status does not have information about swap space used(VmSwap). Can not track swap usage of a task.
    2014-12-01 14:33:37,086 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@5ddc0e7a
    2014-12-01 14:38:48,265 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: The process 2509 may have finished in the interim.
    2014-12-01 14:38:48,265 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: The process 2515 may have finished in the interim.
    2014-12-01 14:39:00,393 INFO org.apache.hadoop.mapred.Merger: Merging 294 sorted segments
    2014-12-01 14:39:00,402 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 294 segments left of total size: 485117717 bytes
    2014-12-01 14:42:10,754 INFO org.apache.hadoop.mapred.Merger: Merging 380 sorted segments
    2014-12-01 14:42:10,755 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 380 segments left of total size: 485447829 bytes
    2014-12-01 14:43:20,534 INFO org.apache.hadoop.mapred.Merger: Merging 309 sorted segments
    2014-12-01 14:43:20,535 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 309 segments left of total size: 322186350 bytes
    2014-12-01 14:43:24,165 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
    2014-12-01 14:43:24,196 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 1292749936 bytes
    2014-12-01 14:43:24,219 INFO ExecReducer: maximum memory = 1468989440
    2014-12-01 14:43:24,219 INFO ExecReducer: conf classpath = [file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/classes, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/]
    2014-12-01 14:43:24,219 INFO ExecReducer: thread classpath = [file:/opt/mapr/hadoop/hadoop-0.20.2/conf/, file:/usr/lib/tools.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/, file:/opt/mapr/hadoop/hadoop-0.20.2/hadoop*core*.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/amazon-s3.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/asm-3.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/aspectjrt-1.6.5.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/aspectjtools-1.6.5.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/aws-java-sdk-1.3.26.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/baseutils-0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-cli-1.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-codec-1.5.jar, file:/opt/mapr/lib/commons-collections-3.2.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-configuration-1.8.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-daemon-1.0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-el-1.0.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-httpclient-3.0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-httpclient-3.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-lang-2.6.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.1.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-math-2.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-net-1.4.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-net-3.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/core-3.1.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/emr-metrics-1.0.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/eval-0.5.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/gson-1.4.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/guava-13.0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-capacity-scheduler.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-fairscheduler.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/hsqldb-1.8.0.10.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/httpclient-4.1.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/httpcore-4.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jackson-core-asl-1.5.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jackson-mapper-asl-1.5.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jasper-compiler-5.5.12.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jasper-runtime-5.5.12.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jersey-core-1.8.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jersey-json-1.8.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jersey-server-1.8.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jets3t-0.6.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jetty-6.1.26.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jetty-servlet-tester-6.1.26.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jetty-util-6.1.26.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/junit-4.5.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/kfs-0.2.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/libprotodefs.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/log4j-1.2.15.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/logging-0.1.jar, file:/opt/mapr/lib/maprfs-1.0.3-mapr-3.0.1.jar, file:/opt/mapr/lib/maprfs-diagnostic-tools-1.0.3-mapr-3.0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/mapr-hbase-1.0.3-mapr-3.0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/mapr-hbase-1.0.3-mapr-3.0.1-tests.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprsecurity-0.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/mockito-all-1.8.2.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/mockito-all-1.8.5.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/mysql-connector-java-5.0.8-bin.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/oro-2.0.8.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/protobuf-java-2.4.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/servlet-api-2.5-20081211.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/servlet-api-2.5-6.1.14.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/slf4j-api-1.4.3.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.4.3.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/xmlenc-0.52.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.6.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jsp-2.1/jsp-2.1.jar, file:/opt/mapr/hadoop/hadoop-0.20.2/lib/jsp-2.1/jsp-api-2.1.jar, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/classes, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/jars/job.jar, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/distcache/256527047944634094_-597973821_97417236/maprfs/var/mapr/cluster/mapred/jobTracker/staging/hadoop/.staging/job_201411221244_9881/libjars/brickhouse-0.7.0-SNAPSHOT.jar, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/distcache/3588296526634329107_1574491892_97417258/maprfs/var/mapr/cluster/mapred/jobTracker/staging/hadoop/.staging/job_201411221244_9881/libjars/commons-lang-2.5.jar, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/distcache/-1562862905390586139_-1907470893_97417287/maprfs/var/mapr/cluster/mapred/jobTracker/staging/hadoop/.staging/job_201411221244_9881/libjars/pursway_hive_udf.jar, file:/tmp/mapr-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201411221244_9881/attempt_201411221244_9881_r_000000_0/work/]
    2014-12-01 14:43:24,241 INFO org.apache.hadoop.hive.ql.exec.Utilities: <PERFLOG method=deserializePlan>
    2014-12-01 14:43:24,492 INFO org.apache.hadoop.hive.ql.exec.Utilities: </PERFLOG method=deserializePlan start=1417437804241 end=1417437804492 duration=251>
    2014-12-01 14:43:24,511 INFO ExecReducer:
    <EX>Id =0
      <Children>
        <SEL>Id =1
          <Children>
            <FS>Id =2
              <Parent>Id = 1 null<\Parent>
            <\FS>
          <\Children>
          <Parent>Id = 0 null<\Parent>
        <\SEL>
      <\Children>
    <\EX>
    2014-12-01 14:43:24,511 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing Self 0 EX
    2014-12-01 14:43:24,515 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Operator 0 EX initialized
    2014-12-01 14:43:24,515 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing children of 0 EX
    2014-12-01 14:43:24,516 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing child 1 SEL
    2014-12-01 14:43:24,516 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing Self 1 SEL
    2014-12-01 14:43:24,752 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:string,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:string,_col7:string,_col8:string,_col9:string,_col10:string,_col11:string,_col12:string,_col13:string,_col14:string,_col15:string,_col16:string,_col17:string,_col18:string,_col19:string,_col20:string,_col21:string,_col22:map<string,string>,_col23:int,_col24:int,_col25:double,_col26:double,_col27:double,_col28:string,_col29:string,_col30:string,_col31:string,_col32:string,_col33:double,_col34:double,_col35:double,_col36:double,_col37:double,_col38:double,_col39:double>
    2014-12-01 14:43:24,759 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Operator 1 SEL initialized
    2014-12-01 14:43:24,759 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing children of 1 SEL
    2014-12-01 14:43:24,759 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing child 2 FS
    2014-12-01 14:43:24,759 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing Self 2 FS
    2014-12-01 14:43:24,792 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Operator 2 FS initialized
    2014-12-01 14:43:24,792 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initialization Done 2 FS
    2014-12-01 14:43:24,792 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initialization Done 1 SEL
    2014-12-01 14:43:24,792 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initialization Done 0 EX
    2014-12-01 14:43:24,797 INFO ExecReducer: ExecReducer: processing 1 rows: used memory = 426940784
    2014-12-01 14:43:24,797 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 1 rows
    2014-12-01 14:43:24,798 INFO org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct: Missing fields! Expected 40 fields but only got 40! Ignoring similar problems.
    2014-12-01 14:43:24,799 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 1 rows
    2014-12-01 14:43:24,804 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2014-12-01_14-01-41_536_5637790490232049158-1697/_tmp.-ext-10000/kind=pairs_with_prob/000000_0
    2014-12-01 14:43:24,804 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS maprfs:/user/hadoop/tmp/hive/hive_2014-12-01_14-01-41_536_5637790490232049158-1697/_task_tmp.-ext-10000/kind=pairs_with_prob/_tmp.000000_0
    2014-12-01 14:43:24,810 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2014-12-01_14-01-41_536_5637790490232049158-1697/_tmp.-ext-10000/kind=pairs_with_prob/000000_0
    2014-12-01 14:43:24,986 INFO ExecReducer: ExecReducer: processing 10 rows: used memory = 427584136
    2014-12-01 14:43:24,986 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 10 rows
    2014-12-01 14:43:24,986 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 10 rows
    2014-12-01 14:43:25,041 INFO ExecReducer: ExecReducer: processing 100 rows: used memory = 428231616
    2014-12-01 14:43:25,041 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 100 rows
    2014-12-01 14:43:25,041 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 100 rows
    2014-12-01 14:43:26,804 INFO ExecReducer: ExecReducer: processing 1000 rows: used memory = 447458568
    2014-12-01 14:43:26,805 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 1000 rows
    2014-12-01 14:43:26,805 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 1000 rows
    2014-12-01 14:43:27,619 INFO ExecReducer: ExecReducer: processing 10000 rows: used memory = 280745720
    2014-12-01 14:43:27,619 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 10000 rows
    2014-12-01 14:43:27,619 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 10000 rows
    2014-12-01 14:43:45,326 INFO ExecReducer: ExecReducer: processing 100000 rows: used memory = 387215224
    2014-12-01 14:43:45,327 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 100000 rows
    2014-12-01 14:43:45,327 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 100000 rows
    2014-12-01 14:48:31,145 INFO ExecReducer: ExecReducer: processing 1000000 rows: used memory = 278681840
    2014-12-01 14:48:31,145 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 1000000 rows
    2014-12-01 14:48:31,145 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 1000000 rows
    2014-12-01 15:19:47,693 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: The process 29397 may have finished in the interim.
    2014-12-01 15:19:47,694 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: The process 29398 may have finished in the interim.
    2014-12-01 15:19:47,694 WARN org.apache.hadoop.mapreduce.util.ProcfsBasedProcessTree: The process 29399 may have finished in the interim.
    2014-12-01 15:26:30,101 INFO ExecReducer: ExecReducer: processing 2000000 rows: used memory = 272723552
    2014-12-01 15:26:30,101 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 0 forwarding 2000000 rows
    2014-12-01 15:26:30,101 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 2000000 rows

Other reducers on other server behave normally. 

This is a production cluster that behaves normally usually. This sort of things start to happen under heavy load.   
Memory seems to be sufficient and both memory and LA in the servers is fine. 
The systems is ~90% storage full so rebalancing takes place. 

Any clue on where might be the problem?



Outcomes