
Interrupted Exception while waiting for ZooKeeper service to send jobtracker address

Question asked by dimamah on Feb 12, 2014
Latest reply on Feb 26, 2014 by yufeldman
Hi, 
We recently moved to HiveServer2 (M3, v3.0.1, Hive 0.10). 
Since then, when many users share the same HiveServer2 instance (as HiveServer2 is designed for), we get the following error when running queries:

    Job Submission failed with exception 'java.io.IOException(Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com)'
        java.io.IOException: Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com

The full log is listed at the end of this post. 
This doesn't happen to every query every time, but it is reproducible on our site: we start a process running ~7 queries simultaneously and at least one of them fails.

In the HiveServer2 log we can see: `INFO zookeeper.ZooKeeper: Session: 0x1438b4623725224 closed` 
Looking at the ZooKeeper (3.3.6) log for this session, we see:

    2014-02-13 13:39:00,950 - INFO  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1438b4623725224
    2014-02-13 13:39:00,953 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@1435] - Closed socket connection for client /106.103.1.82:49201 which had sessionid 0x1438b4623725224

The odd part is that there is no session initialization anywhere near this termination. 
Searching further back in the log for this session id, we find:

    2014-02-13 11:29:38,825 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn$Factory@251] - Accepted socket connection from /106.103.1.82:49201
    2014-02-13 11:29:38,830 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:5181:NIOServerCnxn@777] - Client attempting to establish new session at /106.103.1.82:49201
    2014-02-13 11:29:38,832 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x1438b4623725224 with negotiated timeout 30000 for client /106.103.1.82:49201
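
For reference, this cross-log correlation can be reproduced by grepping both logs for the session id. A minimal self-contained sketch (the excerpt is embedded so the commands run anywhere; on a real cluster you would grep the actual ZooKeeper and HiveServer2 log files instead, whose paths depend on your installation):

```shell
# Trace one ZooKeeper session id across a log excerpt.
# On a real install, grep zookeeper.log and the HiveServer2 log directly.
SESSION=0x1438b4623725224
cat > /tmp/zk_excerpt.log <<'EOF'
2014-02-13 11:29:38,832 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x1438b4623725224 with negotiated timeout 30000 for client /106.103.1.82:49201
2014-02-13 13:39:00,950 - INFO  [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1438b4623725224
EOF
# Shows both the establishment and the termination lines for the session.
grep "$SESSION" /tmp/zk_excerpt.log
```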

So it appears this session was opened two hours earlier, from port 49201. 
Looking back at the HiveServer2 log around that time, we see:

    14/02/13 11:29:38 [Thread-62-SendThread(joya-02:5181)] INFO zookeeper.ClientCnxn: Session establishment complete on server joya-02/106.103.1.82:5181, sessionid = 0x1438b4623725224, negotiated timeout = 30000

Searching for all occurrences of "Thread-62" in the same log shows it belonged to a job (part of a SQL statement with many UNION ALLs, executed in parallel), with nothing unusual about it:

    14/02/13 11:29:35 [Thread-62] INFO exec.Task: Number of reduce tasks determined at compile time: 1
    14/02/13 11:29:35 [Thread-62] INFO exec.Task: In order to change the average load for a reducer (in bytes):
    14/02/13 11:29:35 [Thread-62] INFO exec.Task:   set hive.exec.reducers.bytes.per.reducer=<number>
    14/02/13 11:29:35 [Thread-62] INFO exec.Task: In order to limit the maximum number of reducers:
    14/02/13 11:29:35 [Thread-62] INFO exec.Task:   set hive.exec.reducers.max=<number>
    14/02/13 11:29:35 [Thread-62] INFO exec.Task: In order to set a constant number of reducers:
    14/02/13 11:29:35 [Thread-62] INFO exec.Task:   set mapred.reduce.tasks=<number>
    14/02/13 11:29:36 [Thread-62] INFO exec.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
    14/02/13 11:29:36 [Thread-62] INFO exec.ExecDriver: adding libjars: file:///home/hadoop/general/pursway_hive_udf.jar,file:///home/hadoop/general/commons-lang-2.5.jar,file:///opt/mapr/hive/hive-0.10/lib/hive-builtins-0.10-mapr.jar
    14/02/13 11:29:36 [Thread-62] INFO exec.ExecDriver: Processing alias null-subquery1-subquery1-subquery1-subquery1-subquery1-subquery1-subquery2:b-subquery1-subquery1-subquery1-subquery1-subquery1-subquery1-subquery2:pairs
    14/02/13 11:29:36 [Thread-62] INFO exec.ExecDriver: Adding input file maprfs:/user/hive/warehouse/db.db/table
    14/02/13 11:29:36 [Thread-62] INFO exec.Utilities: Content Summary not cached for maprfs:/user/hive/warehouse/db.db/table
    14/02/13 11:29:38 [Thread-62] INFO exec.ExecDriver: Making Temp Directory: maprfs:/tmp/hive-hadoop/hive_2014-02-13_11-28-59_133_1065374934094895699/-mr-10007
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.6--1, built on 09/07/2012 18:16 GMT
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:host.name=joya-02
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_26
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_26/jre
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.class.path={long class path removed...}
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/native/Linux-amd64-64:
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-131.0.15.el6.x86_64
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop
    14/02/13 11:29:38 [Thread-62] INFO zookeeper.ZooKeeper: Initiating client connection, connectString=106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 sessionTimeout=30000 watcher=com.mapr.fs.JobTrackerWatcher@e04819e
    14/02/13 11:29:38 [Thread-62-SendThread()] INFO zookeeper.ClientCnxn: Opening socket connection to server /106.103.1.82:5181
    14/02/13 11:29:38 [Thread-62-SendThread(joya-02:5181)] INFO zookeeper.ClientCnxn: Socket connection established to joya-02/106.103.1.82:5181, initiating session
    14/02/13 11:29:38 [Thread-62-SendThread(joya-02:5181)] INFO zookeeper.ClientCnxn: Session establishment complete on server joya-02/106.103.1.82:5181, sessionid = 0x1438b4623725224, negotiated timeout = 30000
    14/02/13 11:29:38 [Thread-62] INFO fs.JobTrackerWatcher: Current running JobTracker is: joya-02/106.103.1.82:9001
    14/02/13 11:29:38 [Thread-62] WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/02/13 11:29:39 [Thread-62] INFO io.CombineHiveInputFormat: CombineHiveInputSplit creating pool for maprfs:/user/hive/warehouse/db.db/table; using filter path maprfs:/user/hive/warehouse/db.db/table
    14/02/13 11:29:39 [Thread-62] INFO mapred.FileInputFormat: Total input paths to process : 7
    14/02/13 11:29:39 [Thread-62] INFO io.CombineHiveInputFormat: number of splits 6
    14/02/13 11:29:39 [Thread-62] INFO exec.Task: Starting Job = job_201401161632_20470, Tracking URL = http://joya-02:50030/jobdetails.jsp?jobid=job_201401161632_20470
    14/02/13 11:29:39 [Thread-62] INFO exec.Task: Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job  -kill job_201401161632_20470
    14/02/13 11:29:40 [Thread-62] INFO exec.Task: Hadoop job information for Stage-7: number of mappers: 6; number of reducers: 1
    14/02/13 11:29:53 [Thread-62] INFO exec.Task: 2014-02-13 11:29:40,418 Stage-7 map = 0%,  reduce = 0%
    14/02/13 11:30:07 [Thread-62] INFO exec.Task: 2014-02-13 11:29:54,070 Stage-7 map = 67%,  reduce = 0%, Cumulative CPU 36.73 sec
    14/02/13 11:30:07 [Thread-62] INFO exec.Task: MapReduce Total cumulative CPU time: 36 seconds 730 msec
    14/02/13 11:30:21 [Thread-62] INFO exec.Task: Ended Job = job_201401161632_20470
    14/02/13 11:30:21 [Thread-62] INFO exec.FileSinkOperator: Moving tmp dir: maprfs:/tmp/hive-hadoop/hive_2014-02-13_11-28-59_133_1065374934094895699/_tmp.-mr-10007 to: maprfs:/tmp/hive-hadoop/hive_2014-02-13_11-28-59_133_1065374934094895699/_tmp.-mr-10007.intermediate
    14/02/13 11:30:21 [Thread-62] INFO exec.FileSinkOperator: Moving tmp dir: maprfs:/tmp/hive-hadoop/hive_2014-02-13_11-28-59_133_1065374934094895699/_tmp.-mr-10007.intermediate to: maprfs:/tmp/hive-hadoop/hive_2014-02-13_11-28-59_133_1065374934094895699/-mr-10007
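
The timeline suggests the ZooKeeper client handle created under Thread-62 at 11:29 was cached and later reused, so a different thread ends up closing the same session two hours on. We don't know MapR's internals, so the following is a hypothetical model of that failure mode only (the class, the `Session` stand-in, and the `/services/jobtracker` path are all made up for illustration):

```java
// Hypothetical model only -- not MapR code. If the ZooKeeper client handle is
// cached and shared between query threads, one thread closing the session
// invalidates it for every other thread still holding the cached reference.
import java.util.concurrent.atomic.AtomicBoolean;

public class SharedSessionSketch {

    // Stand-in for a cached ZooKeeper session handle.
    static class Session {
        private final AtomicBoolean open = new AtomicBoolean(true);

        String exists(String path) {
            if (!open.get()) {
                throw new IllegalStateException("session closed");
            }
            return path;  // a real client would return the znode's Stat
        }

        void close() {
            open.set(false);
        }
    }

    // Shared by all query threads in this model.
    static final Session CACHED = new Session();

    public static void main(String[] args) {
        // One query thread uses the session successfully.
        System.out.println(CACHED.exists("/services/jobtracker"));
        // Another query thread's teardown closes the shared session.
        CACHED.close();
        // The next lookup through the cached handle now fails,
        // even though the calling thread did nothing wrong.
        try {
            CACHED.exists("/services/jobtracker");
        } catch (IllegalStateException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```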

**Full log from HiveServer2:**

    14/02/13 13:38:59 [Thread-689] INFO exec.Utilities: Cache Content Summary for maprfs:/user/hive/warehouse/db.db/table/type=standard length: 6436711528 file count: 23 directory count: 1
    14/02/13 13:38:59 [Thread-689] INFO exec.ExecDriver: BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6436711528
    14/02/13 13:38:59 [Thread-689] INFO exec.Task: Number of reduce tasks not specified. Estimated from input data size: 7
    14/02/13 13:38:59 [Thread-689] INFO exec.Task: In order to change the average load for a reducer (in bytes):
    14/02/13 13:38:59 [Thread-689] INFO exec.Task:   set hive.exec.reducers.bytes.per.reducer=<number>
    14/02/13 13:38:59 [Thread-689] INFO exec.Task: In order to limit the maximum number of reducers:
    14/02/13 13:38:59 [Thread-689] INFO exec.Task:   set hive.exec.reducers.max=<number>
    14/02/13 13:38:59 [Thread-689] INFO exec.Task: In order to set a constant number of reducers:
    14/02/13 13:38:59 [Thread-689] INFO exec.Task:   set mapred.reduce.tasks=<number>
    14/02/13 13:38:59 [Thread-689] INFO exec.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
    14/02/13 13:38:59 [Thread-689] INFO exec.ExecDriver: adding libjars: file:///home/hadoop/general/pursway_hive_udf.jar,file:///home/hadoop/general/commons-lang-2.5.jar,file:///opt/mapr/hive/hive-0.10/lib/hive-builtins-0.10-mapr.jar
    14/02/13 13:38:59 [Thread-689] INFO exec.ExecDriver: Processing alias table
    14/02/13 13:38:59 [Thread-689] INFO exec.ExecDriver: Adding input file maprfs:/user/hive/warehouse/db.db/table/type=standard
    14/02/13 13:38:59 [Thread-689] INFO exec.Utilities: Content Summary maprfs:/user/hive/warehouse/db.db/table/type=standardlength: 6436711528 num files: 23 num directories: 1
    14/02/13 13:39:00 [Thread-689] INFO exec.ExecDriver: Making Temp Directory: maprfs:/tmp/hive-hadoop/hive_2014-02-13_13-38-59_375_6060653035593302528/-ext-10000
    14/02/13 13:39:00 [Thread-689] ERROR fs.JobTrackerWatcher: Interrupted Exception while waiting for ZooKeeper service to send jobtracker address
    java.lang.InterruptedException
            at java.lang.Object.wait(Native Method)
            at java.lang.Object.wait(Object.java:485)
            at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1345)
            at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:810)
            at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
            at com.mapr.fs.JobTrackerWatcher.findJobTrackerAddr(JobTrackerWatcher.java:117)
            at com.mapr.fs.MapRClientImpl.getJobTrackerAddrs(MapRClientImpl.java:470)
            at com.mapr.fs.MapRFileSystem.getJobTrackerAddrs(MapRFileSystem.java:1467)
            at org.apache.hadoop.ipc.RPC$FailoverInvoker.invoke(RPC.java:526)
            at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
            at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1287)
            at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:891)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:885)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:885)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:859)
            at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
            at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
            at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
    14/02/13 13:39:00 [Thread-689] INFO zookeeper.ZooKeeper: Session: 0x1438b4623725224 closed
    14/02/13 13:39:00 [Thread-689] ERROR ipc.RPC: FailoverProxy: Failing this Call: getStagingAreaDir. Error while fetching JobTracker location java.io.IOException: Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com
    java.io.IOException: Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com
            at com.mapr.fs.MapRClientImpl.getJobTrackerAddrs(MapRClientImpl.java:504)
            at com.mapr.fs.MapRFileSystem.getJobTrackerAddrs(MapRFileSystem.java:1467)
            at org.apache.hadoop.ipc.RPC$FailoverInvoker.invoke(RPC.java:526)
            at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
            at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1287)
            at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:891)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:885)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:885)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:859)
            at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
            at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
            at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
    14/02/13 13:39:00 [Thread-689] ERROR exec.Task: Job Submission failed with exception 'java.io.IOException(Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com)'
    java.io.IOException: Failed to get jobtracker location from Zookeeper at 106.103.1.81:5181,106.103.1.82:5181,106.103.1.83:5181 for cluster my.cluster.com
            at com.mapr.fs.MapRClientImpl.getJobTrackerAddrs(MapRClientImpl.java:504)
            at com.mapr.fs.MapRFileSystem.getJobTrackerAddrs(MapRFileSystem.java:1467)
            at org.apache.hadoop.ipc.RPC$FailoverInvoker.invoke(RPC.java:526)
            at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
            at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1287)
            at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:891)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:885)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:885)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:859)
            at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
            at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
            at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
            at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
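
For what it's worth, the `java.lang.InterruptedException` at `Object.wait` in the trace above is exactly what a thread blocked inside `ZooKeeper.exists()` (via `ClientCnxn.submitRequest()`) sees when another thread interrupts it. A minimal standalone illustration, plain Java with no ZooKeeper involved (the class and method names are ours, chosen for the sketch):

```java
// A thread blocked in Object.wait() -- which is where ZooKeeper's
// ClientCnxn.submitRequest() sits in the stack trace above -- throws
// InterruptedException the moment another thread interrupts it.
public class InterruptDuringWait {

    // Returns how the waiting thread ended: "interrupted" or "woken".
    static String blockAndInterrupt() {
        final Object lock = new Object();
        final String[] outcome = new String[1];
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait();                 // blocks, like submitRequest()
                    outcome[0] = "woken";
                } catch (InterruptedException e) {
                    outcome[0] = "interrupted";  // the case seen in the log
                }
            }
        });
        waiter.start();
        try {
            Thread.sleep(200);    // give the waiter time to reach Object.wait()
            waiter.interrupt();   // deliver the interrupt
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return outcome[0];
    }

    public static void main(String[] args) {
        System.out.println("waiter ended: " + blockAndInterrupt());
    }
}
```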
