AnsweredAssumed Answered

Netty issues while deploying Flink with Yarn on MapR

Question asked by ani.desh1512 on Feb 2, 2017
Latest reply on Oct 23, 2017 by cathy
Branched to a new discussion

I am trying to run Flink using Yarn on MapR. My previous issue got resolved and I have updated the original post accordingly so.

 

Accordingly, I modified pom.xml to change the zookeeper version to mapr zookeeper jar version which in my case was: 3.4.5-mapr-1604
I then built flink (flink-1.3-SNAPSHOT) as follows:

 

mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.7.0-mapr-1607

 

The build is successfull. Then I try to run ./bin/yarn-session.sh -n 3 and get the following error:

 

 

 

2017-02-02 16:11:10,717 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Using values:

2017-02-02 16:11:10,718 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      TaskManager count = 3
2017-02-02 16:11:10,718 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      JobManager memory = 1024
2017-02-02 16:11:10,718 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      TaskManager memory = 1024
2017-02-02 16:11:10,928 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Process path: null. Event state: SyncConnected. Event type: None
2017-02-02 16:11:10,928 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Connected to ZK: ip-10-101-2-111.ec2.internal:5181,ip-10-101-2-112.ec2.internal:5181,ip-10-101-2-113.ec2.internal:5181
2017-02-02 16:11:10,929 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Getting serviceData for master node of resourcemanager
2017-02-02 16:11:10,935 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Process path: null. Event state: SaslAuthenticated. Event type: None
2017-02-02 16:11:10,948 INFO  org.apache.hadoop.yarn.client.MapRZKBasedRMFailoverProxyProvider  - Updated RM address to ip-10-101-2-111.ec2.internal/10.101.2.111:8032
2017-02-02 16:11:11,216 WARN  org.apache.flink.yarn.YarnClusterDescriptor                   - The configuration directory ('/home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2017-02-02 16:11:11,225 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/log4j.properties to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/log4j.properties
2017-02-02 16:11:11,249 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/lib
2017-02-02 16:11:11,680 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/logback.xml to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/logback.xml
2017-02-02 16:11:11,685 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib/flink-dist_2.10-1.3-SNAPSHOT.jar to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/flink-dist_2.10-1.3-SNAPSHOT.jar
2017-02-02 16:11:12,932 INFO  org.apache.flink.yarn.Utils                                   - Copying from /home/ubuntu/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/flink-conf.yaml to maprfs:/user/ubuntu/.flink/application_1485984594262_0007/flink-conf.yaml
2017-02-02 16:11:12,949 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Submitting application master application_1485984594262_0007
2017-02-02 16:11:12,977 INFO  org.apache.hadoop.yarn.security.ExternalTokenManagerFactory   - Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
2017-02-02 16:11:13,195 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Submitted application application_1485984594262_0007
2017-02-02 16:11:13,195 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Waiting for the cluster to be allocated
2017-02-02 16:11:13,196 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deploying cluster, current state ACCEPTED
Error while deploying YARN cluster: Couldn't deploy Yarn cluster
java.lang.RuntimeException: Couldn't deploy Yarn cluster
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:425)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:620)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473)
     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473)
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1485984594262_0007 failed 1 times due to AM Container for appattempt_1485984594262_0007_000001 exited with  exitCode: 255
For more detailed output, check application tracking page:http://ip-10-101-2-111.ec2.internal:8088/cluster/app/application_1485984594262_0007Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e05_1485984594262_0007_01_000001
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
     at org.apache.hadoop.util.Shell.run(Shell.java:456)
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:304)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:354)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:87)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : user is ubuntu
main : requested yarn user is ubuntu


Container exited with a non-zero exit code 255
Failing this attempt. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1485984594262_0007
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:888)
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:557)
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:423)
     ... 9 more
2017-02-02 16:11:18,231 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Cancelling deployment from Deployment Failure Hook
2017-02-02 16:11:18,231 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Killing YARN application
2017-02-02 16:11:18,235 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Killed application application_1485984594262_0007
2017-02-02 16:11:18,336 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deleting files in maprfs:/user/ubuntu/.flink/application_1485984594262_0007

 

So i went ahead and checked the yarn container logs and they have the following error:

 


Uncaught error from thread [flink-akka.remote.default-remote-dispatcher-5] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[flink]

java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker signature: (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;) Wrong return type in function
        at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:295)
        at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:251)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
        at scala.util.Try$.apply(Try.scala:161)
        at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
        at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
        at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
        at scala.util.Success.flatMap(Try.scala:200)
        at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
        at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:749)
        at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:741)
        at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
        at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:741)
        at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:500)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.EndpointManager.aroundReceive(Remoting.scala:404)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

 

So, I figured this might be because of clashing netty versions between flink and MapR’s zookeeper jar.

 

And indeed, MapR’s zookeeper jar has following version of netty

 

<groupId>org.jboss.netty</groupId>
<artifactId>netty</artifactId>
<version>3.2.2.Final</version>

 

And Flink has following:
<groupId>io.netty</groupId>                               

<artifactId>netty-all</artifactId>

 <version>4.0.27.Final</version>

 

 

 

So, I changed Flink’s pom.xml to exclude netty from zookeeper dependency.

 

<exclusion>

<groupId>org.jboss.netty</groupId>

<artifactId>netty</artifactId>
 </exclusion>

 

 

 

Then I again ran ./bin/yarn-session.sh -n 3 and got the following error:

 

 

 

2017-02-02 15:44:03,540 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Using values:

2017-02-02 15:44:03,541 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      TaskManager count = 3
2017-02-02 15:44:03,541 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      JobManager memory = 1024
2017-02-02 15:44:03,541 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   -      TaskManager memory = 1024
2017-02-02 15:44:03,728 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Process path: null. Event state: SyncConnected. Event type: None
2017-02-02 15:44:03,728 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Connected to ZK: ip-10-101-2-111.ec2.internal:5181,ip-10-101-2-112.ec2.internal:5181,ip-10-101-2-113.ec2.internal:5181
2017-02-02 15:44:03,729 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Getting serviceData for master node of resourcemanager
2017-02-02 15:44:03,733 INFO  com.mapr.util.zookeeper.ZKDataRetrieval                       - Process path: null. Event state: SaslAuthenticated. Event type: None
2017-02-02 15:44:03,745 INFO  org.apache.hadoop.yarn.client.MapRZKBasedRMFailoverProxyProvider  - Updated RM address to ip-10-101-2-111.ec2.internal/10.101.2.111:8032
2017-02-02 15:44:04,016 WARN  org.apache.flink.yarn.YarnClusterDescriptor                   - The configuration directory ('/home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2017-02-02 15:44:04,025 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/lib
2017-02-02 15:44:04,446 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/logback.xml to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/logback.xml
2017-02-02 15:44:04,452 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/log4j.properties to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/log4j.properties
2017-02-02 15:44:04,457 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:///home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/lib/flink-dist_2.10-1.3-SNAPSHOT.jar to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/flink-dist_2.10-1.3-SNAPSHOT.jar
2017-02-02 15:44:05,826 INFO  org.apache.flink.yarn.Utils                                   - Copying from /home/ubuntu/yarn_flink/flink/flink-dist/target/flink-1.3-SNAPSHOT-bin/flink-1.3-SNAPSHOT/conf/flink-conf.yaml to maprfs:/user/ubuntu/.flink/application_1485984594262_0006/flink-conf.yaml
2017-02-02 15:44:05,842 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Submitting application master application_1485984594262_0006
2017-02-02 15:44:05,870 INFO  org.apache.hadoop.yarn.security.ExternalTokenManagerFactory   - Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
2017-02-02 15:44:06,088 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Submitted application application_1485984594262_0006
2017-02-02 15:44:06,089 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Waiting for the cluster to be allocated
2017-02-02 15:44:06,090 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deploying cluster, current state ACCEPTED
Error while deploying YARN cluster: Couldn't deploy Yarn cluster
java.lang.RuntimeException: Couldn't deploy Yarn cluster
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:428)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:620)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473)
     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473)
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1485984594262_0006 failed 1 times due to AM Container for appattempt_1485984594262_0006_000001 exited with  exitCode: 31
For more detailed output, check application tracking page:http://ip-10-101-2-111.ec2.internal:8088/cluster/app/application_1485984594262_0006Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e05_1485984594262_0006_01_000001
Exit code: 31
Stack trace: ExitCodeException exitCode=31:
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
     at org.apache.hadoop.util.Shell.run(Shell.java:456)
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:304)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:354)
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:87)
     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : user is ubuntu
main : requested yarn user is ubuntu


Container exited with a non-zero exit code 31
Failing this attempt. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1485984594262_0006
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:891)
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:560)
     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:426)
     ... 9 more
2017-02-02 15:44:12,635 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Cancelling deployment from Deployment Failure Hook
2017-02-02 15:44:12,635 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Killing YARN application
2017-02-02 15:44:12,641 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Killed application application_1485984594262_0006
2017-02-02 15:44:12,742 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deleting files in maprfs:/user/ubuntu/.flink/application_1485984594262_0006

 

 

 

So i went ahead and checked the yarn container logs again and they have the following error:

 

2017-02-02 15:44:11,3521 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:580 Thread: 19306 Client initialization failed due to mismatch of libraries. Please make sure that the java library version matches the native build version 5.2.0.39122.GA and native patch version $Id: mapr-version: 5.2.0.39122.GA 40967:64c8e3c8ee67 $

 

 

 

Am I right in assuming here that this error is being caused by conflicting versions of netty? What steps can I take to resolve this error? I have also asked this question on Flink mailing list btw.

 

Thanks in advance

Outcomes