Flink cluster fails to start: Could not recover job with job id 47b781a95dcbf8fbd2e3130c41bf5859

Error reported in the standalone log when the Flink cluster starts

Could not recover job with job id 47b781a95dcbf8fbd2e3130c41bf5859.
File does not exist: /flink/recovery/default/submittedJobGraph3

The job graph (execution plan) cannot be found.
Details from flink-root-taskexecutor-0-Bigdata01.log

org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1440) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1425) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.Actor.aroundReceive(Actor.scala:517) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.Actor.aroundReceive$(Actor.scala:515) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.12-1.13.6.jar:1.13.6]
2022-08-19 17:20:14,537 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1440) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1425) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.Actor.aroundReceive(Actor.scala:517) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.Actor.aroundReceive$(Actor.scala:515) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.12-1.13.6.jar:1.13.6]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.12-1.13.6.jar:1.13.6]
2022-08-19 17:20:14,541 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Stopping TaskExecutor akka.tcp://flink@172.0.0.1:44048/user/rpc/taskmanager_0.
2022-08-19 17:20:14,544 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-08-19 17:20:14,544 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-08-19 17:20:14,544 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2022-08-19 17:20:14,544 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2022-08-19 17:20:14,548 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /tmp/flink-io-924c66f9-2fd1-4d7d-a61f-372cbb6bbdd5
2022-08-19 17:20:14,548 INFO  org.apache.flink.runtime.io.network.NettyShuffleEnvironment  [] - Shutting down the network environment and its components.
2022-08-19 17:20:14,551 INFO  org.apache.flink.runtime.io.network.netty.NettyClient        [] - Successful shutdown (took 2 ms).
2022-08-19 17:20:14,554 INFO  org.apache.flink.runtime.io.network.netty.NettyServer        [] - Successful shutdown (took 2 ms).
2022-08-19 17:20:14,556 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /tmp/flink-netty-shuffle-3a40d0d4-9d9e-4dc2-9b6c-9a9dc9ec07cb
2022-08-19 17:20:14,556 INFO  org.apache.flink.runtime.taskexecutor.KvStateService         [] - Shutting down the kvState service and its components.
2022-08-19 17:20:14,556 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-08-19 17:20:14,557 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /tmp/flink-dist-cache-ba2f5c0d-3f45-4acb-b1db-e51b12473faf
2022-08-19 17:20:14,559 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Stopped TaskExecutor akka.tcp://flink@172.0.0.1:44048/user/rpc/taskmanager_0.
2022-08-19 17:20:14,563 INFO  org.apache.flink.runtime.blob.PermanentBlobCache             [] - Shutting down BLOB cache
2022-08-19 17:20:14,563 INFO  org.apache.flink.runtime.blob.TransientBlobCache             [] - Shutting down BLOB cache
2022-08-19 17:20:14,563 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopping Akka RPC service.
2022-08-19 17:20:14,566 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - backgroundOperationsLoop exiting
2022-08-19 17:20:14,575 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper [] - Session: 0x182b539e00e0005 closed
2022-08-19 17:20:14,575 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - EventThread shut down for session: 0x182b539e00e0005
2022-08-19 17:20:14,575 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopping Akka RPC service.
2022-08-19 17:20:14,577 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-08-19 17:20:14,577 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-08-19 17:20:14,579 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-08-19 17:20:14,579 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-08-19 17:20:14,598 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut down.
2022-08-19 17:20:14,598 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut down.
2022-08-19 17:20:14,609 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopped Akka RPC service.
2022-08-19 17:20:14,612 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopped Akka RPC service.
2022-08-19 17:20:14,612 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Terminating TaskManagerRunner with exit code 1.

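Cause
When the Flink cluster starts, it tries to bring the job that was stopped last time back up and keep running it. This is expected behaviour: Flink persists job and checkpoint metadata so that its failover mechanism can recover jobs after a restart.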

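Approach 1: since the file path could not be found, I created the same directory on HDFS and restarted the Flink cluster. Startup still failed with the same error.

Approach 2: delete every Flink-related file on HDFS and under /tmp and /var/tmp on the machines, so that the cluster stops trying to recover the previously stopped job and restarts from a clean state.

I started the cluster again and it still reported that the job graph could not be found. So it had still found a reference to a leftover job graph and tried to run it.

That raises a question: where exactly does Flink keep this state? My colleague's configuration file clearly only configures HDFS and local machine directories.

A simple analysis: if I deleted the HDFS data and the tmp directories and Flink can still find a leftover job, then that record must live somewhere else.

That place is ZooKeeper.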

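So I looked at the paths stored in ZooKeeper, and that was indeed the case. Flink does not only write recovery metadata and checkpoint information to HDFS; it also writes records to ZooKeeper. In ZooKeeper high-availability mode, ZooKeeper holds pointers to the job graph files on HDFS, which is why deleting the HDFS files alone leads to "File does not exist": the pointer survives and Flink follows it. This also shows that the checkpoint and failover mechanisms coordinate through ZooKeeper rather than simply having parallel threads write to HDFS. It is a nice design, because it guards against losing data when a dependency fails unexpectedly.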
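To see what Flink has left behind in ZooKeeper, the tree under the HA root can be listed before anything is deleted. The following is a minimal read-only sketch, not the author's procedure. It assumes the third-party kazoo client is installed and that the HA root is /flink (the default for high-availability.zookeeper.path.root); the helper names walk_znodes and print_flink_ha_tree are made up for illustration.

```python
def walk_znodes(client, root="/flink"):
    """Yield root and every znode below it, depth-first.

    client is any object with a kazoo-style get_children(path) method
    that returns the child names of a znode.
    """
    yield root
    for child in sorted(client.get_children(root)):
        child_path = root.rstrip("/") + "/" + child
        yield from walk_znodes(client, child_path)


def print_flink_ha_tree(hosts, root="/flink"):
    """Connect to ZooKeeper with kazoo and print every path under root.

    hosts is a "host:port" string such as "IP:2181" (placeholder).
    kazoo is imported lazily because it is an optional dependency.
    """
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        for path in walk_znodes(zk, root):
            print(path)
    finally:
        zk.stop()
```

Calling print_flink_ha_tree("IP:2181") would typically show entries for job graphs, checkpoints, and leader election under the cluster id; the exact layout depends on the Flink version and configuration.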

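So the fix for this error is to clean up the data under the Flink path in ZooKeeper that relates to job graphs, checkpoints, and running jobs.

After cleaning up the ZooKeeper data, start the cluster once more.
Cleaning up ZooKeeper state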

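1. zkCli.sh -server IP:2181 -- open the zkCli client
2. ls / -- list the znodes at the ZooKeeper root
3. delete the jobgraph data under the /flink directory
4. deleteall /flink -- recursive delete; this depends on the ZooKeeper version, so if deleteall is not available use rmr instead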
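If wiping the whole /flink tree is too broad, the same cleanup can be limited to the one job named in the error. The sketch below is an illustration under assumptions, not a confirmed layout: the sub-paths jobgraphs, checkpoints, and running_job_registry and the cluster id default are taken from Flink's default ZooKeeper HA settings and may differ in your configuration, and the function names are hypothetical. The client argument is expected to behave like a kazoo KazooClient (exists and delete with recursive=True).

```python
def stale_job_paths(job_id, ha_root="/flink", cluster_id="default"):
    """Return the znodes assumed to hold per-job HA state for job_id.

    The three sub-path names are assumptions based on Flink defaults.
    """
    base = ha_root.rstrip("/") + "/" + cluster_id.strip("/")
    return [
        base + "/jobgraphs/" + job_id,
        base + "/checkpoints/" + job_id,
        base + "/running_job_registry/" + job_id,
    ]


def remove_job_state(client, job_id, dry_run=True, **layout):
    """Delete the per-job znodes that exist and return their paths.

    With dry_run=True (the default) nothing is deleted; the function
    only reports which paths would be removed.
    """
    affected = []
    for path in stale_job_paths(job_id, **layout):
        if client.exists(path) is None:
            continue  # nothing stored for this job at this path
        if not dry_run:
            client.delete(path, recursive=True)
        affected.append(path)
    return affected
```

A cautious way to use it is to stop the Flink cluster first, run remove_job_state(zk, "47b781a95dcbf8fbd2e3130c41bf5859") with a started KazooClient to review the paths, and only then repeat the call with dry_run=False.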

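Remove the Flink pid files under /tmp. These files temporarily hold the Flink process ids. They can be removed if the cluster failed to come up, but must not be removed while the cluster is running, otherwise the cluster may no longer stop and restart cleanly.
Remove the Flink files on HDFS, which hold the checkpoint data (clean these up with care).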

After the cleanup, restart Flink and check the logs: the cluster starts normally.

If you repost this article, please include a link to the original.
