记录:Flink checkpoint 过期导致失败(线上问题)

这篇博客主要分析了Flink作业在运行过程中遇到的checkpoint超时问题,原因是数据倾斜导致的barrier对齐时间过长。解决方案是针对数据倾斜进行优化,确保下游消费能力与上游数据量匹配。排查过程包括检查UI上的失败checkpoint详情,关注任务延迟信息,并定位到出现问题的TaskManager。
摘要由CSDN通过智能技术生成

报错信息:

2021-08-18 18:28:40,502 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629282520408 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:38:40,502 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 1 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:38:40,527 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629283120507 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:48:40,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 2 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:48:40,581 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629283720528 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:49:35,383 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ (dataPort=26682).
2021-08-18 18:49:41,220 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 6c425a7bf2cdb6f9eaff230e657e7b71 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @(dataPort=26329).
2021-08-18 18:51:23,142 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 4aca7ff37807d7c53dab4af4d7317a92 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000015 @ (dataPort=28244).
2021-08-18 18:51:43,449 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task f410376b8a4c4482529b47d615d0ef7d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000027 @ (dataPort=35241).
2021-08-18 18:52:14,054 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task b0b4e5bc960fd38f09ed136eabc92207 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000011 @  (dataPort=17858).
2021-08-18 18:53:15,172 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 254800e15801981e9a5fcaf8f8374f80 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000019 @ (dataPort=28753).
2021-08-18 18:54:23,649 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task bda7204e335e5818502024d14b9cf515 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000032 @ (dataPort=37476).
2021-08-18 18:54:53,507 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 9839eef45f9224ddbbd3383cf962ea75 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000026 @ (dataPort=31847).
2021-08-18 18:57:07,707 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 6b91c567ac5eb48d6d8c4c692327c95d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @  (dataPort=26329).
2021-08-18 18:58:40,581 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 3 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:58:40,604 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 4 (type=CHECKPOINT) @ 1629284320583 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:58:40,590 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.

jobmanager定时trigger checkpoint,给source处发送trigger信号,同时会启动一个异步线程,在checkpoint timeout时长之后停止本轮 checkpoint,cancel动作执行之后本轮的checkpoint就为超时,如果在超时之前收到了最后一个sink算子的ack信号,那么checkpoint就是成功的。

超时原因一般有两种

        1. barrier对齐超时

        2. 异步状态遍历和写HDFS超时(比如State太大)

解决:     

        这个job 的sink是写入ES中,由于ES集群扩容了一些机器,因为跨机房导致ES端有问题,从而导致消费端有问题从而出现反压,从而导致barrier对齐时间过长 而导致checkpoint失败。

一般的排查方式(网上找的方法):

查看ui上的失败checkpoint的detail,可以看到失败的是Pending -> EventEmit xxx这个算子的10个subtask。

checkpoint

接着去jobmanager上查看这个checkpoint的一些延迟信息:

(最上面报错信息内容)

可以根据这些失败的task的id去查询这些任务落在哪一个taskmanager上,经过排查发现,是同一台机器,通过ui看到该机器流入的数据明显比别的流入量大 因此是因为数据倾斜导致了这个问题,追根溯源还是下游消费能力不足的问题

根据日志可以去源码中查找相对应的信息,例如:

2021-08-18 18:49:35,383 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ xxxxxxxxx (dataPort=26682).

CheckpointCoordinator类中查找:

if (recentPendingCheckpoints.contains(checkpointId)) {
                    wasPendingCheckpoint = true;
                    LOG.warn(
                            "Received late message for now expired checkpoint attempt {} from task "
                                    + "{} of job {} at {}.",
                            checkpointId,
                            message.getTaskExecutionId(),
                            message.getJob(),
                            taskManagerLocationInfo);

代码中就可以知道 每个大括号{} 中代表的是什么。

(问题排查:主要是通过 JM的日志和TM日志,以及Flink WebUI 的 每个算子反压状况以及checkpoint的状况)

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值