flink on yarn模式出现The main method caused an error: Could not deploy Yarn job cluster问题排查+解决

报错复现:

flink run -m yarn-cluster -p 2 -yjm 700m -ytm 1024m -c WordCount target/bbb-1.0-SNAPSHOT.jar

完整报错如下:

 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Could not deploy Yarn job cluster.
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
	at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:662)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:893)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy Yarn job cluster.
	at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:398)
	at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1733)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620)
	at WordCount.main(WordCount.java:47)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
	... 11 more
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 
Diagnostics from YARN: Application application_1591614969089_0002 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1591614969089_0002_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2020-06-08 19:18:12.457]Exception from container-launch.
Container id: container_1591614969089_0002_01_000001
Exit code: 1

[2020-06-08 19:18:12.466]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

[2020-06-08 19:18:12.467]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

For more detailed output, check the application tracking page: http://Desktop:8188/applicationhistory/app/application_1591614969089_0002 Then click on links to logs of each attempt.
. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1591614969089_0002
	at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:999)
	at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:488)
	at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:391)
	... 22 more
2020-06-08 19:18:12,659 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Cancelling deployment from Deployment Failure Hook
2020-06-08 19:18:12,660 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at Desktop/192.168.0.103:8032
2020-06-08 19:18:12,661 INFO  org.apache.hadoop.yarn.client.AHSProxy                        - Connecting to Application History server at Desktop/192.168.0.103:10201
2020-06-08 19:18:12,661 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Killing YARN application
2020-06-08 19:18:12,668 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Killed application application_1591614969089_0002
2020-06-08 19:18:12,769 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deleting files in hdfs://Desktop:9000/user/appleyuchi/.flink/application_1591614969089_0002.

比较难排查的一个报错,注意确保HADOOP的日志服务器打开,即确保jps中有:

JobHistoryServer,启动命令为:

"$HADOOP_HOME/bin/mapred --daemon start historyserver"

打开时间线服务器

yarn timelineserver

进行完上述操作后,yarn界面的各个端口应该都能打开了。
#######################################################################################

然后在yarn界面的log中看到如下报错:

2020-06-08 19:21:02,071 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Shutting YarnJobClusterEntrypoint down with application status FAILED. Diagnostics org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:261)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:518)
	at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:119)
Caused by: java.net.BindException: Could not start rest endpoint on any port in port range 8082
	at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:228)
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:165)
	... 9 more
.
2020-06-08 19:21:02,076 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:37633
2020-06-08 19:21:02,077 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2020-06-08 19:21:02,082 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2020-06-08 19:21:02,087 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2020-06-08 19:21:02,088 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2020-06-08 19:21:02,095 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2020-06-08 19:21:02,095 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2020-06-08 19:21:02,110 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2020-06-08 19:21:02,110 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2020-06-08 19:21:02,130 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.
2020-06-08 19:21:02,131 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.
2020-06-08 19:21:02,132 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Could not start cluster entrypoint YarnJobClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint.
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:518)
	at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:119)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:261)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
	... 2 more
Caused by: java.net.BindException: Could not start rest endpoint on any port in port range 8082
	at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:228)
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:165)
	... 9 more

##############################################################

端口问题,但是这个端口并没有占用啊,所以我也懵逼了一会儿。

犯错原因:

这两个文件中的端口要保持统一,我忘记修改masters文件了,从而导致了上述复杂的报错。

这里之所以默认的8081要改成8082是因为8081被spark给占用了,所以我当时修改完flink-conf.yaml就忘乎所以了。

 

最终解决方案:

flink-conf.yaml:rest.port: 8082

masters:Desktop:8082

然后别忘记这两个文件同步更新到集群中的其他节点。

关闭眼前的所有终端,重新开一个终端,因为配置文件只有在你开启新终端的情况下才会生效。


 

 

 

 

 

评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值