Flink: Errors Encountered with Checkpoints

Seniscz

Article link: https://blog.csdn.net/CZ_yjsy_data/article/details/87821050

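Errors encountered while running a Flink job

I. Detours taken while maintaining state in Flink:

1. Flink offers three state backends for maintaining state:

1>. MemoryStateBackend
2>. FsStateBackend
3>. RocksDBStateBackend

<1>. MemoryStateBackend

Keeps state in memory; it is not discussed further here.

<2>. FsStateBackend

This backend keeps in-flight data in the TaskManager's memory. When a checkpoint is taken, it writes a snapshot of the state to the configured file system, which can be a distributed file system such as HDFS. By default, FsStateBackend uses asynchronous snapshots.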

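  import org.apache.flink.core.fs.Path
  import org.apache.flink.runtime.state.StateBackend
  import org.apache.flink.runtime.state.filesystem.FsStateBackend
  val checkPointPath = new Path("hdfs:///flink/checkpoints")
  val fsStateBackend: StateBackend = new FsStateBackend(checkPointPath)
  env.setStateBackend(fsStateBackend)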

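<3>. RocksDBStateBackend

The RocksDBStateBackend keeps in-flight data in a RocksDB database, which (by default) is stored in the TaskManager's data directories. It also needs a remote file system URI (usually HDFS). When a checkpoint is taken, the local data is copied directly to that file system, and on failover it is restored from the file system to local storage. Minimal metadata is stored in the JobManager's memory (or, in high-availability mode, in the metadata checkpoint). The RocksDBStateBackend always performs asynchronous snapshots.

Open question: RocksDB is disk-based while Redis is memory-based, so why was RocksDB chosen? (Unresolved.)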

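 import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
 // the second argument (true) enables incremental checkpoints
 val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
 // No compression setting is needed for the state data: the compression option has no effect
 // on incremental snapshots, because they use RocksDB's internal format
 // MyOptions is the author's own OptionsFactory implementation (not shown in this post)
 rocksdbBackend.setOptions(new MyOptions())
 env.setStateBackend(rocksdbBackend)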
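
The post never shows `MyOptions`. For orientation only: in Flink 1.7, `setOptions` expects an `OptionsFactory`. The sketch below is a hypothetical `ExampleOptions`, not the author's class. The tuning values (parallelism 4, 256 MB block cache, 128 KB block size) are placeholders, and the method signatures are those of the Flink 1.7 interface, which may differ in later releases.

```scala
import org.apache.flink.contrib.streaming.state.OptionsFactory
import org.rocksdb.{BlockBasedTableConfig, ColumnFamilyOptions, DBOptions}

// Hypothetical options factory (an assumption, not the MyOptions used in the post).
// OptionsFactory is Serializable, so the class must not capture non-serializable fields.
class ExampleOptions extends OptionsFactory {

  // Adjust database-wide options on top of whatever Flink has already configured.
  override def createDBOptions(currentOptions: DBOptions): DBOptions =
    currentOptions
      .setIncreaseParallelism(4) // placeholder value
      .setUseFsync(false)

  // Adjust per-column-family options (one column family per registered state).
  override def createColumnOptions(currentOptions: ColumnFamilyOptions): ColumnFamilyOptions =
    currentOptions.setTableFormatConfig(
      new BlockBasedTableConfig()
        .setBlockCacheSize(256L * 1024 * 1024) // 256 MB, placeholder value
        .setBlockSize(128L * 1024))            // 128 KB, placeholder value
}
```

Such a class would be passed as `rocksdbBackend.setOptions(new ExampleOptions())`. It needs the `flink-statebackend-rocksdb` dependency on the classpath.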

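2. Errors encountered:

<1>: With FsStateBackend, the error looked like the figure below:

[Figure 1547198994759]

For the meaning of the Overview tab in the figure, see:

https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/checkpoint_monitoring.html

The checkpoint parameters were set as follows: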

// start a checkpoint every 20 s (the interval is in milliseconds)
env.enableCheckpointing(10000 * 2)
// make sure at least 2 s of progress happen between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)
// checkpoints have to complete within 5 minutes, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000 * 5)
// allow up to 4 checkpoints to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(4)

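Why State Size keeps growing: with FsStateBackend the state is held in the TaskManager's memory, so as it keeps growing it eventually causes an OOM.

The error is as follows (the application's log message 数据插入MySQL失败 means "failed to insert data into MySQL", and 的值为 means "has the value"):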

2019-01-11 02:07:30,781 ERROR com.miaoke.flink.classnet.MySQLSink$                          - 数据插入MySQL失败 :    java.lang.OutOfMemoryError: GC overhead limit exceeded userId 的值为：796346time 的值为: 2019-01-11 01:28:23

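Why End to End Duration keeps increasing: unresolved at this point.

It eventually leads to: Checkpoint expired before completing

<2>: Errors when using RocksDBStateBackend:

State Size no longer keeps increasing (see the explanation above).

End to End Duration still keeps increasing, the same behavior as with FsStateBackend, as shown in the figure below:

[Figure 1547436814226]

Still unresolved at this point:

Suspecting my own configuration, I consulted the Tuning RocksDB section of the official docs and configured the job as follows: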

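import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.contrib.streaming.state.{PredefinedOptions, RocksDBStateBackend}
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setRestartStrategy(RestartStrategies.noRestart())
// start a checkpoint every 20 s (the interval is in milliseconds)
env.enableCheckpointing(10000 * 2)
// make sure at least 2 s of progress happen between checkpoints (default: 0)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)
// checkpoints have to complete within 2 minutes, or are discarded (default: 10 minutes)
env.getCheckpointConfig.setCheckpointTimeout(60000 * 2)
// allow up to 4 checkpoints to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(4)
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)  // The default is true.
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// the second argument (true) enables incremental checkpoints
val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
// No compression setting is needed for the state data: the compression option has no effect
// on incremental snapshots, because they use RocksDB's internal format
rocksdbBackend.setOptions(new MyOptions())
rocksdbBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)
env.setStateBackend(rocksdbBackend)
// keep the externalized checkpoint when the job is cancelled
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)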

With this configuration, submitting the job failed with the following error:

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9ebe0758cd287126758d57b15fb5e5a3)
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:260)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at com.miaoke.flink.classnet.KafkaDataInsertMySQL$.main(KafkaDataInsertMySQL.scala:20)
	at com.miaoke.flink.classnet.KafkaDataInsertMySQL.main(KafkaDataInsertMySQL.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:426)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
	... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
	at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
	at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
	... 5 more

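After removing the following setting and resubmitting the job:

rocksdbBackend.setOptions(new MyOptions())

the job was submitted successfully.

For whether new MyOptions() is needed at all, see:

https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/state/large_state_tuning.html

Why this error occurred: unresolved.

The job was then submitted to both the offline (test) cluster and the online (production) cluster:

Offline cluster:

[Figure 1547454238486]

Online cluster:

[Figure 1547455415258]

When the checkpoint duration approached the configured timeout, the checkpoint failed, as shown:

[Figure 1547455525859]

The earlier exception appeared again and the job stopped.

Question: why do the online and offline clusters behave differently? (Reason: the offline data volume is small.)

Suspected a network buffer problem:

Tuning Network Buffers

Added the following configuration: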

#To support, for example, a cluster of 20 8-slot machines, you should use roughly 5000 network buffers for optimal throughput.
taskmanager.network.numberOfBuffers: 2500

Because changing this setting would affect other jobs on the cluster, this approach was abandoned.

Unresolved.
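
The "roughly 5000 network buffers" in the quoted comment follows from the rule of thumb in the Flink configuration docs, `#slots-per-TM^2 * #TMs * 4`. A small sketch of that arithmetic in plain Scala; the name `recommendedNetworkBuffers` is made up for illustration and is not a Flink API:

```scala
object NetworkBuffers {
  /** Rule of thumb from the Flink docs: slotsPerTm^2 * numTaskManagers * 4. */
  def recommendedNetworkBuffers(slotsPerTm: Int, numTaskManagers: Int): Long =
    slotsPerTm.toLong * slotsPerTm * numTaskManagers * 4

  def main(args: Array[String]): Unit = {
    // 20 machines with 8 slots each, the example from the quoted comment
    println(recommendedNetworkBuffers(8, 20)) // 5120, i.e. "roughly 5000"
  }
}
```

Plugging in the slot count and TaskManager count of the actual cluster gives a starting value for `taskmanager.network.numberOfBuffers`.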

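Changed the configuration again:

Removed this setting:

// minimum pause before the next checkpoint is triggered
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)

Resubmitted the job:

[Figure 1547464186283]

Following FLINK-10615, the configuration was changed once more:

https://issues.apache.org/jira/browse/FLINK-10615

https://issues.apache.org/jira/browse/FLINK-10930 (this one suggests separating the directories; I do not know how to do that. Unresolved.)

https://issues.apache.org/jira/browse/FLINK-10855

env.getCheckpointConfig.setFailOnCheckpointingErrors(false)  // The default is true.

When End to End Duration reached 1m 59s, the following error appeared: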


2019-01-16 15:21:44,438 INFO  org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator  - Cancelling sample 0
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@bigdata05:35756/user/taskmanager_0#528141446]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2019-01-16 15:21:45,111 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Received late message for now expired checkpoint attempt 166 from 3739f018527af86e160aa509e421ad39 of job 7e325f3227fa7b6ce9679acc633eede6.
2019-01-16 15:21:52,867 INFO  org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator  - Cancelling sample 1
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@bigdata05:35756/user/taskmanager_0#528141446]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2019-01-16 15:22:03,739 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Received late message for now expired checkpoint attempt 167 from 3739f018527af86e160aa509e421ad39 of job 7e325f3227fa7b6ce9679acc633eede6.

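However, the job kept running and did not exit. Data was still inserted into MySQL normally, but the checkpoints failed to complete.

Two questions to think about:

1. When the job restarts, can EXACTLY_ONCE semantics still be guaranteed, and how should the state be restored? (Unresolved.)

2. Is the state still effective at this point?

Eventually the following exception appeared (the log message 数据插入MySQL失败 means "failed to insert data into MySQL"):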

2019-01-16 18:13:24,234 ERROR com.miaoke.flink.classnet.MySQLSink$                          - 数据插入MySQL失败:  java.lang.OutOfMemoryError: GC overhead limit exceeded

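Why the OOM occurred: unresolved.

Idea: configure the JVM to write a heap dump when an OOM occurs, then analyze the dump file (not tried yet).

Configuration: -Xmx10M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=d://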
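
The flags above are generic JVM options. On a Flink cluster, one way to apply them is the `env.java.opts` key in `flink-conf.yaml`, assuming the cluster reads that file. The fragment below is only a sketch: the dump directory `/tmp/flink-heapdumps` is a placeholder, and `-Xmx10M` is left out because heap size is set through Flink's own memory settings.

```yaml
# flink-conf.yaml (sketch; the dump path is a placeholder and must exist on every node)
env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/flink-heapdumps
```

Like the network buffer setting discussed earlier, this is a cluster-wide option, so it also applies to other jobs started with the same configuration.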

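Logs: yarn logs -applicationId application_1545305009142_0290 | less

Code changes

I suspected that data was being consumed faster than it could be inserted into MySQL, so records piled up and End to End Duration kept increasing.

This put the Flink job under back pressure, as the Back Pressure view shows:

[Figure 1547705967826]

Increasing the parallelism of the sink and resubmitting the job solved the problem, as shown below:

[Figure 1547708152002]

The steadily increasing End to End Duration described above is resolved as well, as shown below:

[Figure 1547709263777]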
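
The reasoning behind this fix can be put into numbers. The sketch below is plain Scala with made-up rates (5000 records/s arriving, 1200 records/s per sink subtask); neither figure comes from the job in this post. It shows that a backlog keeps growing while the combined sink throughput is below the input rate, and it computes the smallest sink parallelism that keeps up.

```scala
object SinkSizing {
  /** Records still unwritten after `seconds`, given the input rate and the per-subtask sink rate (records/s). */
  def backlogAfter(seconds: Long, inputRate: Long, perSinkRate: Long, parallelism: Int): Long =
    math.max(0L, (inputRate - perSinkRate * parallelism) * seconds)

  /** Smallest parallelism whose combined throughput is at least the input rate. */
  def minSinkParallelism(inputRate: Long, perSinkRate: Long): Int = {
    require(inputRate >= 0 && perSinkRate > 0, "inputRate must be >= 0 and perSinkRate > 0")
    ((inputRate + perSinkRate - 1) / perSinkRate).toInt // integer ceiling division
  }

  def main(args: Array[String]): Unit = {
    val in = 5000L      // hypothetical input rate, records/s
    val perSink = 1200L // hypothetical throughput of one sink subtask, records/s
    println(backlogAfter(60, in, perSink, parallelism = 1)) // 228000
    println(minSinkParallelism(in, perSink))                // 5
    println(backlogAfter(60, in, perSink, parallelism = 5)) // 0
  }
}
```

In the job itself the result would be applied on the sink, for example `stream.addSink(sink).setParallelism(5)`, with `stream` and `sink` standing in for the job's own DataStream and MySQL sink. Under back pressure, checkpoint barriers queue behind the buffered records, which is consistent with the growing End to End Duration seen earlier.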

The final configuration is as follows:

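import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.contrib.streaming.state.{PredefinedOptions, RocksDBStateBackend}
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Open question: after a restart, can the job recover its previous state values?
//env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.of(10,TimeUnit.SECONDS)))
env.setRestartStrategy(RestartStrategies.noRestart())  // do not restart the job after a failure
// start a checkpoint every 60 s (the interval is in milliseconds)
env.enableCheckpointing(10000 * 6)
// minimum pause before the next checkpoint is triggered: 30 s (default: 0)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(10000*3)
// a checkpoint that has not completed within 10 minutes is aborted (default: 10 minutes)
env.getCheckpointConfig.setCheckpointTimeout(60000 * 10)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
// if true, the task fails when a checkpoint error occurs (the default is true)
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// the second argument (true) enables incremental checkpoints
val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
// No compression setting is needed for the state data: the compression option has no effect
// on incremental snapshots, because they use RocksDB's internal format
//rocksdbBackend.setOptions(new MyOptions())
rocksdbBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)
env.setStateBackend(rocksdbBackend)
// keep the externalized checkpoint when the job is cancelled
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)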

Such a simple problem, and yet I went the long way round. Ignorance, ignorance, ignorance!!!
