Flink Learning - 9. How to Use Checkpoints
Enabling checkpoints
Checkpointing is disabled by default; it must be enabled before use.
How to enable it:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
Checkpoint mode
The default checkpointing mode is EXACTLY_ONCE; it can be changed to AT_LEAST_ONCE. These are the two available modes.
EXACTLY_ONCE is the best fit for most applications. AT_LEAST_ONCE is for certain ultra-low-latency applications where strict data accuracy is not required.
checkpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
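Alternatively, the mode can be passed directly when enabling checkpointing. A minimal sketch (env as above; the 1000 ms interval here is just for illustration):
// besides setting it on the CheckpointConfig as above, the mode can also be
// passed together with the interval when enabling checkpointing:
env.enableCheckpointing(1000, CheckpointingMode.AT_LEAST_ONCE);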
The modes as defined in Flink's source code:
/**
 * The checkpointing mode defines what consistency guarantees the system gives in the presence of
 * failures.
 *
 * <p>When checkpointing is activated, the data streams are replayed such that lost parts of the
 * processing are repeated. For stateful operations and functions, the checkpointing mode defines
 * whether the system draws checkpoints such that a recovery behaves as if the operators/functions
 * see each record "exactly once" ({@link #EXACTLY_ONCE}), or whether the checkpoints are drawn
 * in a simpler fashion that typically encounters some duplicates upon recovery
 * ({@link #AT_LEAST_ONCE})</p>
 */
@Public
public enum CheckpointingMode {

    /**
     * Sets the checkpointing mode to "exactly once". This mode means that the system will
     * checkpoint the operator and user function state in such a way that, upon recovery,
     * every record will be reflected exactly once in the operator state.
     *
     * <p>For example, if a user function counts the number of elements in a stream,
     * this number will consistently be equal to the number of actual elements in the stream,
     * regardless of failures and recovery.</p>
     *
     * <p>Note that this does not mean that each record flows through the streaming data flow
     * only once. It means that upon recovery, the state of operators/functions is restored such
     * that the resumed data streams pick up exactly at after the last modification to the state.</p>
     *
     * <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
     * external systems (only state in Flink's operators and user functions). The reason for that
     * is that a certain level of "collaboration" is required between two systems to achieve
     * exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
     * this collaboration.</p>
     *
     * <p>This mode sustains high throughput. Depending on the data flow graph and operations,
     * this mode may increase the record latency, because operators need to align their input
     * streams, in order to create a consistent snapshot point. The latency increase for simple
     * dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
     * latency remains small, but the slowest records typically have an increased latency.</p>
     */
    EXACTLY_ONCE,

    /**
     * Sets the checkpointing mode to "at least once". This mode means that the system will
     * checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
     * some records may be reflected multiple times in the operator state.
     *
     * <p>For example, if a user function counts the number of elements in a stream,
     * this number will equal to, or larger, than the actual number of elements in the stream,
     * in the presence of failure and recovery.</p>
     *
     * <p>This mode has minimal impact on latency and may be preferable in very-low latency
     * scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
     * and where occasional duplicate messages (on recovery) do not matter.</p>
     */
    AT_LEAST_ONCE
}
Checkpoint storage location (state backend)
- FsStateBackend
- MemoryStateBackend
- RocksDBStateBackend
Per-job setting:
// set the state backend, i.e. where checkpoints are stored
env.setStateBackend(new FsStateBackend("hdfs://localhost:9000/jerome/flink/checkpoint"));
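The other two backends are set the same way. A minimal sketch, assuming illustrative paths (env as above; RocksDBStateBackend additionally requires the flink-statebackend-rocksdb dependency, and its constructor declares IOException):
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;

// MemoryStateBackend: working state on the TaskManager heap,
// checkpoints held in the JobManager's memory (mainly for local testing)
env.setStateBackend(new MemoryStateBackend());

// RocksDBStateBackend: working state in embedded RocksDB on local disk,
// checkpoints written to a filesystem such as HDFS
env.setStateBackend(new RocksDBStateBackend("hdfs://localhost:9000/jerome/flink/checkpoint"));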
Global setting:
Edit flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
Note: state.backend accepts the following values: jobmanager (MemoryStateBackend), filesystem (FsStateBackend), rocksdb (RocksDBStateBackend).
Example code
package com.jerome.flink.checkpoint;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * Checkpoint example: a socket word count with checkpointing enabled.
 *
 * @author Jerome丶子木
 * @date 2020/01/10
 */
public class CheckpointTest {

    public static void main(String[] args) throws Exception {
        String hostname;
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            hostname = parameterTool.get("hostname");
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            e.printStackTrace();
            hostname = "localhost";
            port = 9010;
        }

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // start a checkpoint every 1000 ms
        env.enableCheckpointing(1000);
        // set the state backend, i.e. where checkpoints are stored
        env.setStateBackend(new FsStateBackend("hdfs://localhost:9000/jerome/flink/checkpoint"));
        // get the checkpoint configuration
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // make sure at least 500 ms pass between two checkpoints, i.e. the minimum pause between them
        checkpointConfig.setMinPauseBetweenCheckpoints(500);
        // set the checkpoint timeout; a checkpoint that takes longer is discarded
        checkpointConfig.setCheckpointTimeout(60000);
        // allow only one checkpoint at a time
        checkpointConfig.setMaxConcurrentCheckpoints(1);
        // keep checkpoint data after the job is cancelled:
        // ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: checkpoint data is retained after
        //   cancellation, so the job can later be restored from a chosen checkpoint
        // ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: checkpoint data is deleted on
        //   cancellation and only kept if the job fails
        checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        String delimiter = "\n";
        DataStreamSource<String> stringDataStreamSource = env.socketTextStream(hostname, port, delimiter);
        SingleOutputStreamOperator<WordWithCount> windowCount = stringDataStreamSource
                .flatMap(new RichFlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                        String[] splits = value.split("\\s");
                        for (String word : splits) {
                            out.collect(new WordWithCount(word, 1));
                        }
                    }
                })
                .keyBy("word")
                // a 2-second sliding window that slides every 1 second
                .timeWindow(Time.seconds(2), Time.seconds(1))
                .sum("count");
                // alternative to sum():
                /*.reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount value1, WordWithCount value2) throws Exception {
                        return new WordWithCount(value1.word, value1.count + value2.count);
                    }
                })*/

        // print the result to the console with a parallelism of 1
        windowCount.print().setParallelism(1);

        env.execute("Socket window count");
    }

    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{ word = '" + word + "' , count= " + count + "} ;";
        }
    }
}
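To try the example locally, open a socket source first and then start the job with matching arguments (port 9010 matches the fallback default in the code above):
nc -lk 9010
and pass the program arguments: --hostname localhost --port 9010
Because externalized checkpoints are retained on cancellation, a cancelled job can later be resumed from a retained checkpoint, along the lines of (the <job-id> and <n> placeholders must be read from the actual checkpoint directory):
bin/flink run -s hdfs://localhost:9000/jerome/flink/checkpoint/<job-id>/chk-<n> ...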
Problems encountered when starting the program
When debugging locally in IDEA, running the program directly fails with the following error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Could not retrieve JobResult.
at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:622)
at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:117)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
at com.jerome.flink.checkpoint.CheckpointTest.main(CheckpointTest.java:84)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$2(Dispatcher.java:333)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
... 6 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:152)
at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:375)
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
... 7 more
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to create checkpoint storage at checkpoint coordinator side.
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:255)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.enableCheckpointing(ExecutionGraph.java:594)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:340)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:106)
at org.apache.flink.runtime.scheduler.LegacyScheduler.createExecutionGraph(LegacyScheduler.java:207)
at org.apache.flink.runtime.scheduler.LegacyScheduler.createAndRestoreExecutionGraph(LegacyScheduler.java:184)
at org.apache.flink.runtime.scheduler.LegacyScheduler.<init>(LegacyScheduler.java:176)
at org.apache.flink.runtime.scheduler.LegacySchedulerFactory.createInstance(LegacySchedulerFactory.java:70)
at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:275)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:265)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:146)
... 10 more
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'hdfs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:447)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:359)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.<init>(FsCheckpointStorage.java:61)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createCheckpointStorage(FsStateBackend.java:490)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:253)
... 22 more
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop is not in the classpath/dependencies.
at org.apache.flink.core.fs.UnsupportedSchemeFactory.create(UnsupportedSchemeFactory.java:58)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:443)
... 27 more
The root cause (see the last "Caused by" above) is the missing dependency for connecting to Hadoop; add it to the pom:
<!-- dependency for Flink's Hadoop filesystem support -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-shaded-hadoop2</artifactId>
<version>1.2.0</version>
</dependency>
The above fixes local testing. If the same error appears when the packaged job runs on a cluster, download the matching pre-bundled Hadoop jar from the Flink website, put it into Flink's dependency directory (lib), and run the job again.
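After the job has run for a while, you can check that checkpoint data is actually being written, for example with (assuming the HDFS path configured above):
hdfs dfs -ls /jerome/flink/checkpoint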