1. checkpoint
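1.1 Flink Checkpoint vs. Savepoint:
Concept: a Checkpoint is an automatic fault-tolerance mechanism; a Savepoint is an image of the program's global state.
Purpose: a Checkpoint lets the program recover quickly and automatically after a failure. A Savepoint lets a program resume from saved state after it has been modified, for example during an upgrade.
User interaction: Checkpoints are taken by Flink itself. Savepoints are triggered by the user.
State file retention: Checkpoint files are deleted by Flink by default; they can be retained by setting parameters in CheckpointConfig. Savepoint files are kept until the user deletes them.
Mode: Checkpoints are generally full snapshots, although the RocksDB state backend supports incremental ones. Savepoints are always full.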
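The difference between a full and an incremental snapshot can be pictured with a small sketch. It assumes state is stored as a set of immutable files, so an incremental snapshot only needs to upload files that were not part of the previous one. The class name, method name, and file names are hypothetical and only illustrate the idea; this is not Flink's implementation.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of full vs. incremental snapshots (not Flink API). */
final class IncrementalUploadSketch {
    /**
     * Returns the files that would be uploaded for a snapshot:
     * every current file for a full snapshot, only the new files for an incremental one.
     */
    static Set<String> filesToUpload(Set<String> previouslyUploaded,
                                     Set<String> currentFiles,
                                     boolean incremental) {
        Set<String> result = new TreeSet<>(currentFiles);
        if (incremental) {
            // Files already uploaded by the previous snapshot are reused, not re-sent.
            result.removeAll(previouslyUploaded);
        }
        return result;
    }
}
```

In this model a savepoint always corresponds to incremental = false.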
1.2 Checkpoint settings
1.2.1 Number of retained checkpoints (default: 1)
In conf/flink-conf.yaml, add the following setting to specify the maximum number of checkpoints to retain:
state.checkpoints.num-retained: 20
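As a rough illustration of what a retention limit implies, the sketch below models a store that keeps only the newest N completed checkpoints and reports which ones fall out of retention. RetainedCheckpointsSketch and its methods are hypothetical names used for illustration; this is not how Flink implements the setting internally.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a bounded store of completed checkpoints (not Flink API). */
final class RetainedCheckpointsSketch {
    private final int maxRetained;
    private final Deque<Long> completed = new ArrayDeque<>();

    RetainedCheckpointsSketch(int maxRetained) {
        if (maxRetained < 1) {
            throw new IllegalArgumentException("maxRetained must be >= 1");
        }
        this.maxRetained = maxRetained;
    }

    /** Records a completed checkpoint and returns the IDs that are no longer retained. */
    List<Long> add(long checkpointId) {
        completed.addLast(checkpointId);
        List<Long> discarded = new ArrayList<>();
        while (completed.size() > maxRetained) {
            // The oldest checkpoint is dropped first.
            discarded.add(completed.removeFirst());
        }
        return discarded;
    }

    /** Returns the retained checkpoint IDs, oldest first. */
    List<Long> retained() {
        return new ArrayList<>(completed);
    }
}
```

With the default of 1, every newly completed checkpoint replaces the previous one in this model; with 20, up to twenty remain available as restore points.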
1.2.2 Listing checkpoints
hdfs dfs -ls /flink-1.5.3/flink-checkpoints/582e17d2cc343e6c56255d111bae0191/
Tip: 582e17d2cc343e6c56255d111bae0191 is the job ID.
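The path above is simply the checkpoint root followed by the job ID. A minimal sketch of that convention follows, assuming a job ID is written as 32 lowercase hexadecimal characters as in the example; CheckpointPathSketch and its method names are hypothetical.

```java
/** Hypothetical helper for composing per-job checkpoint directories (not Flink API). */
final class CheckpointPathSketch {
    /** Joins the checkpoint root and the job ID into the per-job directory path. */
    static String jobDirectory(String checkpointRoot, String jobId) {
        String root = checkpointRoot.endsWith("/")
                ? checkpointRoot.substring(0, checkpointRoot.length() - 1)
                : checkpointRoot;
        return root + "/" + jobId + "/";
    }

    /** Checks that a string has the shape of the job ID shown above: 32 hex characters. */
    static boolean looksLikeJobId(String candidate) {
        return candidate != null && candidate.matches("[0-9a-f]{32}");
    }
}
```

The result of jobDirectory can be passed to hdfs dfs -ls as in the command above.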
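1.2.3 Checkpoint recovery
Flink performs the recovery automatically.
1.2.4 What triggers recovery: a failure (exception) in the Flink program
1.2.5 Configuration in code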
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// Set the state backend, i.e. where checkpoints are stored
env.setStateBackend(new FsStateBackend("hdfs://localhost:9000/jerome/flink/checkpoint"));
// Get the checkpoint configuration
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
// Leave at least 500 ms between checkpoints (the minimum pause)
checkpointConfig.setMinPauseBetweenCheckpoints(500);
// Checkpoint timeout: a checkpoint that takes longer than 60000 ms is discarded
checkpointConfig.setCheckpointTimeout(60000);
// Allow only one checkpoint to be in progress at a time
checkpointConfig.setMaxConcurrentCheckpoints(1);
// ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: checkpoint data is kept when the job is cancelled, so the job can later be restored from a chosen checkpoint
// ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: checkpoint data is deleted when the job is cancelled; checkpoints are kept only if the job fails
checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Use exactly-once checkpointing semantics
checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
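How the interval, the minimum pause, and the timeout interact can be sketched with a simplified timing model. It assumes at most one checkpoint runs at a time, that the interval is measured from the start of the previous checkpoint, and that the minimum pause is measured from its end. CheckpointTimingSketch and its methods are hypothetical names, and the real scheduler in Flink has more cases than this.

```java
/** Simplified timing model for periodic checkpoints (hypothetical, not Flink API). */
final class CheckpointTimingSketch {
    /**
     * Earliest time in ms at which the next checkpoint may start:
     * no sooner than one interval after the last start,
     * and no sooner than the minimum pause after the last end.
     */
    static long earliestNextStart(long lastStartMs, long lastEndMs,
                                  long intervalMs, long minPauseMs) {
        long byInterval = lastStartMs + intervalMs;
        long byMinPause = lastEndMs + minPauseMs;
        return Math.max(byInterval, byMinPause);
    }

    /** True if a checkpoint that has run for durationMs exceeds the timeout and is discarded. */
    static boolean timedOut(long durationMs, long timeoutMs) {
        return durationMs > timeoutMs;
    }
}
```

With an interval of 1000 ms and a minimum pause of 500 ms, a checkpoint that starts at t=0 and ends at t=800 pushes the next start to t=1300 in this model, while one that ends at t=200 leaves it at t=1000.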
2. savepoint
2.1 In conf/flink-conf.yaml, add the following setting to specify the Savepoint storage directory, for example:
state.savepoints.dir: hdfs://namenode01.td.com/flink-1.5.3/flink-savepoints
2.2 Triggering a savepoint:
bin/flink savepoint <jobId> [savepointDirectory] -yid <yarnAppId>
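When the savepoint is triggered from a script or tool rather than by hand, the argument list can be assembled programmatically. The sketch below only builds the arguments in the order shown above and omits the optional target directory when none is given. SavepointCommandSketch is a hypothetical helper, and the sketch does not execute anything.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the savepoint command-line arguments. */
final class SavepointCommandSketch {
    /**
     * Builds: bin/flink savepoint <jobId> [savepointDirectory] -yid <yarnAppId>.
     * Pass null or an empty string to omit the optional savepoint directory.
     */
    static List<String> build(String jobId, String savepointDirectory, String yarnAppId) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/flink");
        cmd.add("savepoint");
        cmd.add(jobId);
        if (savepointDirectory != null && !savepointDirectory.isEmpty()) {
            cmd.add(savepointDirectory);
        }
        cmd.add("-yid");
        cmd.add(yarnAppId);
        return cmd;
    }
}
```

The returned list can be handed to new ProcessBuilder(cmd) if the command should actually be run from the Flink installation directory.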
2.3 Restoring from a savepoint:
bin/flink run-application -t yarn-application -p 4 -Djobmanager.memory.process.size=1024m -Dtaskmanager.memory.process.size=2048m -s hdfs://xxx/xxx-savepoints/savepoint-xxx-xxxx /xxxx/xxxxx/xxxx/xxxx-xxxx-1.0-SNAPSHOT.jar --hostname xxxxx --port xxxxxx
2.4 Listing a savepoint
hdfs dfs -ls /xxxxx/xxxx-savepoints/savepoint-xxxx-xxxxx