Troubleshooting Flink on YARN (a rest.port vs. rest.bind-port port conflict)

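This article describes a port conflict that caused job deployment to fail in YARN mode while using Flink 1.12.0 for driving-data analysis. It covers the flink-conf.yaml configuration, how to read the exception logs, and the fix of adjusting rest.bind-port.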

I. Background

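I have recently been working on real-time analysis of driving data. As groundwork for later unified batch and stream development, I am using Flink as the entry point for a unified batch/stream solution. Below is a port conflict I ran into when submitting a Flink SQL job in YARN mode.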

 

II. Reproducing the problem

1. I am currently using flink-1.12.0. The configuration files are as follows:

flink-conf.yaml


 

masters and workers configuration

vi masters

bj-pan.com-04:11057

vi workers
bj-pan.com-02
bj-pan.com-03
bj-pan.com-04

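2. Start the Flink cluster (standalone mode)

3. Submit the Flink SQL job in YARN mode

4. Exception log details

The exception is written to flink-1.12.0/log/flink-root-client-bj-pan.com-04.log. Taken at face value, this client-side log looks like a memory problem, and it does not show the real cause: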

2021-04-01 09:37:46,081 ERROR com.flink.streaming.core.JobApplication                      [] - 任务执行失败:
org.apache.flink.table.api.TableException: Failed to execute sql
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:696) ~[flink-table_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.table.api.internal.StatementSetImpl.execute(StatementSetImpl.java:97) ~[flink-table_2.11-1.12.0.jar:1.12.0]
	at com.flink.streaming.core.JobApplication.main(JobApplication.java:66) ~[flink-streaming-core-1.2.0.RELEASE.jar:1.2.0.RELEASE]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_141]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_141]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_141]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:316) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:743) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:242) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:971) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1047) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_141]
	at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_141]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) [hadoop-common-3.0.0-cdh6.2.0.jar:?]
	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) [flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1047) [flink-dist_2.11-1.12.0.jar:1.12.0]
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy Yarn job cluster.
	at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:460) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1940) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.table.planner.delegation.ExecutorBase.executeAsync(ExecutorBase.java:57) ~[flink-table-blink_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:680) ~[flink-table_2.11-1.12.0.jar:1.12.0]
	... 18 more
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 
Diagnostics from YARN: Application application_1616133445012_0210 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1616133445012_0210_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2021-04-01 09:37:45.847]Exception from container-launch.
Container id: container_1616133445012_0210_01_000001
Exit code: 1

[2021-04-01 09:37:45.848]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

[2021-04-01 09:37:45.849]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

For more detailed output, check the application tracking page: http://bj-pan.com-02:8088/cluster/app/application_1616133445012_0210 Then click on links to logs of each attempt.
. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1616133445012_0210
	at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1078) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:558) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:453) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1940) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.table.planner.delegation.ExecutorBase.executeAsync(ExecutorBase.java:57) ~[flink-table-blink_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:680) ~[flink-table_2.11-1.12.0.jar:1.12.0]
	... 18 more
2021-04-01 09:37:46,095 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Cancelling deployment from Deployment Failure Hook
2021-04-01 09:37:46,096 INFO  org.apache.hadoop.yarn.client.RMProxy                        [] - Connecting to ResourceManager at bj-pan.com-02/172.17.112.108:8032
2021-04-01 09:37:46,101 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Killing YARN application
2021-04-01 09:37:46,117 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Killed application application_1616133445012_0210
2021-04-01 09:37:46,218 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Deleting files in hdfs://bj-pan.com-01:8020/user/root/.flink/application_1616133445012_0210.

Use the following command to view the YARN container logs, which contain the real cause: yarn logs -applicationId application_1616133445012_0210

2021-04-01 09:37:45,490 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Could not start cluster entrypoint YarnJobClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint.
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:191) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:529) [flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:95) [flink-dist_2.11-1.12.0.jar:1.12.0]
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:257) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:220) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:173) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_141]
	at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_141]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) ~[hadoop-common-3.0.0-cdh6.2.0.jar:?]
	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:172) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	... 2 more
Caused by: java.net.BindException: Could not start rest endpoint on any port in port range 11057
	at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:222) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:162) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:220) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:173) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_141]
	at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_141]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) ~[hadoop-common-3.0.0-cdh6.2.0.jar:?]
	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:172) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
	... 2 more

End of LogType:jobmanager.log
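In nested stack traces like the ones above, the last "Caused by:" line is usually the one that matters. As a minimal sketch (the function name root_cause and the shortened sample text are illustrative, not part of Flink or YARN), the innermost cause can be pulled out of a saved log like this:

```python
def root_cause(log_text: str) -> str:
    """Return the message of the last 'Caused by:' line in the text, or '' if there is none."""
    cause = ""
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Caused by:"):
            # Later 'Caused by:' lines are deeper in the chain, so keep overwriting.
            cause = stripped[len("Caused by:"):].strip()
    return cause


# Shortened excerpt of the jobmanager.log shown above (stack frames omitted).
sample = """\
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint.
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
Caused by: java.net.BindException: Could not start rest endpoint on any port in port range 11057
"""

print(root_cause(sample))
```

Note that this simply returns the last "Caused by:" line in whatever text it is given, so for a log containing several unrelated exceptions it should be applied to one stack trace at a time.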

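Problem analysis:

The log reports that the port is already in use. A standalone cluster was already running on this node, so the port configured as rest.port (11057, according to the BindException above) was already taken when the job was then submitted in YARN mode. The comment in the configuration file states this explicitly: "The port to which the REST client connects to. If rest.bind-port has not been specified, then the server will bind to this port as well." Because I had not set rest.bind-port (the second red box in the first screenshot), the YARN job cluster tried to bind its REST endpoint to the same, already occupied port and could not start.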
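The failure mode can be reproduced in miniature without Flink. The sketch below is only an illustration using plain TCP sockets; try_bind is a hypothetical helper, the first socket stands in for the standalone cluster's REST server, and the second stands in for the YARN JobManager that is pointed at the same fixed port:

```python
import socket


def try_bind(port: int, host: str = "127.0.0.1"):
    """Try to open a listening TCP socket; return it, or None if the port cannot be bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "Address already in use", the same condition behind the BindException.
        sock.close()
        return None


# The first "server" takes a free port chosen by the OS (port 0 means "any free port").
first = try_bind(0)
port = first.getsockname()[1]

# The second "server" is configured with that same single port and cannot start.
second = try_bind(port)
print("second bind failed:", second is None)

first.close()
```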

 

III. Solution

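Set rest.bind-port to a port range (preferably a fairly wide one), so that each REST server started on the same machine can bind to a free port from that range instead of colliding on a single port.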
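To show why a range helps, here is a rough sketch of walking a port range until one port can be bound. It is not Flink's implementation; the helper names and the value 50100-50200 are only illustrative, and the spec format (single ports, ranges, or a comma-separated mix) follows what rest.bind-port accepts:

```python
import socket


def parse_port_range(spec: str):
    """Expand a spec such as '50100-50200' or '50100,50105-50107' into a list of ports."""
    ports = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(p) for p in part.split("-", 1))
            ports.extend(range(low, high + 1))
        elif part:
            ports.append(int(part))
    return ports


def first_free_port(spec: str, host: str = "127.0.0.1"):
    """Return the first port in the spec that can currently be bound, or None if all are taken."""
    for port in parse_port_range(spec):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
                return port
            except OSError:
                continue  # port is occupied, try the next one
    return None


# Illustrative value, e.g. "rest.bind-port: 50100-50200" in flink-conf.yaml.
print(first_free_port("50100-50200"))
```

With a single port in the spec the search has nowhere to go once that port is taken, which matches the "port range 11057" message in the log. The probe is only a check at one moment in time: another process could still grab the port before a real server binds it.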

 

Submit the job again. Running jps on the current node now shows YarnJobClusterEntrypoint processes running alongside the standalone cluster:

11952 YarnJobClusterEntrypoint
8449 StandaloneSessionClusterEntrypoint
12889 YarnJobClusterEntrypoint
8873 TaskManagerRunner
18126 NodeManager
29151 Jps

 
