org.apache.flink.util.FlinkException: JobMaster for job failed,Flink集群无法启动

org.apache.flink.util.FlinkException: JobMaster for job failed,Flink集群无法启动

某天集群Zookeeper因为OOM挂掉,引起基于此ZK作HA的Flink集群出现故障。
在Zookeeper恢复后,重启Flink集群(版本1.7.2)时出现如下报错,导致集群无法正常启动。

ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job 86f3b54aba572a495740abbcff328c81 failed.
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:759)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$startJobManagerRunner$6(Dispatcher.java:339)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not start the job manager.
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$verifyJobSchedulingStatusAndStartJobManager$2(JobManagerRunner.java:340)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_3#-582787431]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.UnfencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    ... 1 more

可以看到是Flink集群重启时,同时重启任务,但是这个任务启动对应的JobMaster时失败了。怀疑ZK挂掉后Flink停机,但是没法正常处理ZK上的元数据。Flink集群重启后,任务状态依然维持在停机之前,但是JobMaster的信息发生变化,任务无法启动引起整个集群故障。

最后手动修改问题任务的在ZK上的状态,将RUNNING状态改为DONE,让有问题的任务不重启。

[zk: zk01:2181(CONNECTED) 14] ls /flink/ha
[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter, running_job_registry]
 
 
# 修改有问题的任务状态
[zk: zk01:2181(CONNECTED) 15] set /flink/ha/running_job_registry/86f3b54aba572a495740abbcff328c81 DONE

再次重启,如果还有其他任务状态异常也需要进行相同操作。

zk上jobgraphs是任务的解析的jobgraph,leader下面是任务的JobMaster信息,leaderlatch是选举主JobManager的,checkpoint-counter是ck信息,running_job_registry是在运行任务的注册信息。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
引用: Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed. Caused by: java.lang.Exception: java.net.SocketException: Connection reset Caused by: java.net.SocketException: Connection reset。 引用: 3.当设置的分区数多于机器的CPU数会发生数据混乱的错误,导致计算不正确。本身机器的CPU为4核。 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501) at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:93) Caused by: java.net.ConnectException: Call From node2/192.168.40.62 to node1:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)。 引用: 原因:socket连接重置,可能是使用不同的方式或者是重复提交flink任务,导致socket端口占用导致。2.No new data sinks have been defined since。原因:未被定义的数据输出 flink的批处理不需要行动算子来触发,因此删除最后一行的 //启动流式处理,如果没有该行代码上面的程序不会运行 streamEnv.execute("wordcount")。 "Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed." 这个错误是由于flink作业执行失败所引发的异常。可能的原因是网络连接重置或未定义的数据输出。对于网络连接重置,可以检查是否使用了不同的连接方式或是否重复提交了flink任务。对于未定义的数据输出,可以检查是否没有定义数据的输出操作。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值