Does stopping a Flink job with the stop command automatically delete the checkpoint directory?
Cleanup policies for the Flink checkpoint directory
Application code
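import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;

// streamEnv is the job's StreamExecutionEnvironment.
CheckpointConfig checkPointConfig = streamEnv.getCheckpointConfig();
checkPointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
checkPointConfig.setCheckpointTimeout(1 * 60 * 1000);          // 1 minute, in milliseconds
checkPointConfig.setMinPauseBetweenCheckpoints(1 * 30 * 1000); // 30 seconds, in milliseconds
checkPointConfig.setMaxConcurrentCheckpoints(1);
checkPointConfig.setTolerableCheckpointFailureNumber(3);
// Keep externalized checkpoints when the job is cancelled.
checkPointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);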
Flink source code
org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
defines two policies:
/**
* Delete externalized checkpoints on job cancellation.
*
* <p>All checkpoint state will be deleted when you cancel the owning
* job, both the meta data and actual program state. Therefore, you
* cannot resume from externalized checkpoints after the job has been
* cancelled.
*
* <p>Note that checkpoint state is always kept if the job terminates
* with state {@link JobStatus#FAILED}.
*/
DELETE_ON_CANCELLATION(true),
/**
* Retain externalized checkpoints on job cancellation.
*
* <p>All checkpoint state is kept when you cancel the owning job. You
* have to manually delete both the checkpoint meta data and actual
* program state after cancelling the job.
*
* <p>Note that checkpoint state is always kept if the job terminates
* with state {@link JobStatus#FAILED}.
*/
RETAIN_ON_CANCELLATION(false);
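DELETE_ON_CANCELLATION: The job's checkpoints are kept for recovery only when the job fails. When the job is cancelled, the checkpoint state is deleted, so the job cannot be resumed from a checkpoint after cancellation.
RETAIN_ON_CANCELLATION: When the job is cancelled manually, its checkpoint state is kept. Note that in this case you must delete the retained checkpoint state yourself; otherwise it stays in the external persistent storage indefinitely.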
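The two policies can be read as a small decision table. The sketch below is a simplified model in plain Java, not Flink code: the class CleanupPolicyModel, the Outcome enum, and the keepsCheckpoints method are hypothetical names introduced here for illustration, and only the two outcomes mentioned in the Javadoc above (cancelled and failed) are modeled.

```java
// Simplified model of the two cleanup policies described above.
// All names are illustrative assumptions; none of them are Flink classes.
class CleanupPolicyModel {

    // The two job outcomes that the quoted Javadoc talks about.
    enum Outcome { CANCELLED, FAILED }

    enum Policy {
        DELETE_ON_CANCELLATION,
        RETAIN_ON_CANCELLATION;

        // Returns true if externalized checkpoints are kept for the given outcome.
        boolean keepsCheckpoints(Outcome outcome) {
            if (outcome == Outcome.FAILED) {
                return true; // kept on failure under either policy
            }
            // Outcome is CANCELLED: only RETAIN_ON_CANCELLATION keeps the state.
            return this == RETAIN_ON_CANCELLATION;
        }
    }

    public static void main(String[] args) {
        // Print the full decision table for both policies and both outcomes.
        for (Policy policy : Policy.values()) {
            for (Outcome outcome : Outcome.values()) {
                System.out.println(policy + " / " + outcome
                        + " -> kept: " + policy.keepsCheckpoints(outcome));
            }
        }
    }
}
```

The model deliberately leaves out any outcome the Javadoc does not describe, so it says nothing about what happens when a job is stopped rather than cancelled.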
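Note
Even with RETAIN_ON_CANCELLATION set, stopping the job with the flink stop command still deletes the checkpoint directory, because this policy applies to jobs terminated with the cancel command, not to jobs stopped with stop.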
The following test was run by a community member:
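Each action below was performed after the job had started and completed several checkpoints. Each entry lists the action, whether the checkpoints on the file system were retained, and a note.
- Cancel the job by clicking Cancel in the Web UI
  Checkpoints retained: yes. Expected, because RETAIN_ON_CANCELLATION is set.
- Trigger a savepoint from the command line:
  flink savepoint ${jobId} ${savepointDir}
  Checkpoints retained: yes. OK.
- Cancel the job from the command line:
  flink cancel ${jobId}
  Checkpoints retained: yes. OK.
- Cancel the job from the command line and take a savepoint:
  flink cancel -s ${savepointDir} ${jobId}
  Checkpoints retained: yes. OK.
- Stop the job from the command line (using the default savepoint directory):
  flink stop ${jobId}
  Checkpoints retained: no. Watch out for this behavior.
- Stop the job from the command line and take a savepoint:
  flink stop -p ${savepointDir} ${jobId}
  Checkpoints retained: no. Watch out for this behavior.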
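The reason is that since Flink 1.7.0 a savepoint is also treated as a retained checkpoint [1]. When stop with savepoint succeeds, the new savepoint is created and the old checkpoint is subsumed, that is, deleted, because the default number of retained checkpoints is 1.
If you still want to keep one earlier checkpoint, set the number of retained checkpoints to 2 [2].
As an aside, even with the deprecated cancel with savepoint usage, the old checkpoint should by default also be deleted once the new savepoint has been created successfully, unless the number of retained checkpoints is increased.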
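The subsumption behavior described above can be illustrated with a toy model. This is a minimal sketch in plain Java, assuming a first-in, first-out store with a fixed retention limit; it is not Flink's implementation, and the class RetainedCheckpointModel and the names chk-4, chk-5, and savepoint-1 are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a bounded store of retained checkpoints.
// Illustrative assumption only: this is not how Flink is implemented.
class RetainedCheckpointModel {
    private final int maxRetained;
    private final Deque<String> retained = new ArrayDeque<>();
    private final List<String> deleted = new ArrayList<>();

    RetainedCheckpointModel(int maxRetained) {
        if (maxRetained < 1) {
            throw new IllegalArgumentException("maxRetained must be at least 1");
        }
        this.maxRetained = maxRetained;
    }

    // Registers a completed checkpoint or savepoint, then subsumes
    // (deletes) the oldest entries that exceed the retention limit.
    void add(String name) {
        retained.addLast(name);
        while (retained.size() > maxRetained) {
            deleted.add(retained.removeFirst());
        }
    }

    List<String> retained() { return new ArrayList<>(retained); }

    List<String> deleted() { return new ArrayList<>(deleted); }

    public static void main(String[] args) {
        // Compare a retention limit of 1 (the default cited above) with 2.
        for (int limit = 1; limit <= 2; limit++) {
            RetainedCheckpointModel store = new RetainedCheckpointModel(limit);
            store.add("chk-4");
            store.add("chk-5");
            store.add("savepoint-1"); // stands in for the savepoint taken on stop
            System.out.println("limit=" + limit
                    + " retained=" + store.retained()
                    + " deleted=" + store.deleted());
        }
    }
}
```

With a limit of 1 the savepoint pushes out the last regular checkpoint, which matches the explanation above; with a limit of 2 the most recent checkpoint survives alongside the savepoint.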
[1] https://issues.apache.org/jira/browse/FLINK-10354
[2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/checkpointing.html#state-checkpoints-num-retained