flink 大状态下做savepoint失败问题的排查

39 篇文章 1 订阅

1、线上有一个任务状态比较大,做checkpoint的时候大约有100G左右,任务在做到10G左右的时候会报错

下面是做成功的状态

2、报错日志如下:


第二个错误日志:
2022-04-16 00:05:23
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 55780195e63c343e4a320329203bbb8a#13@83443caed0bbd8a73684a77ae7f82b55 not reachable.
  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:183)
  at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:322)
  at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:291)
  at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94)
  at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
  at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
  at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:359)
  at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:323)
  at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/10.151.213.20:12115' has failed. This might indicate that the remote task manager has been lost.
  at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:145)
  at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:114)
  at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:81)
  at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:70)
  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:179)
  ... 15 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.151.213.20:12115
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
  at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
  at org.apache.flink.shaded.netty4.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
  at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at java.lang.Thread.run(Thread.java:748)


第一个错误日志:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '10.151.213.20/10.151.213.20:12115'. This might indicate that the remote task manager was lost.
  at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
  at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
  at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
  at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
  at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
  at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
  at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at java.lang.Thread.run(Thread.java:748)

3、任务的配置信息如下:

table.dynamic-table-options.enabled=true;
state.backend.rocksdb.compaction.style=level;
table.exec.mini-batch.enabled=true;
state.backend.rocksdb.thread.num=8;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
state.backend.rocksdb.block.blocksize=32 kb;
state.backend.rocksdb.writebuffer.number-to-merge=2;
table.exec.mini-batch.allow-latency=15 s;

通过报错日志发现也有人遇到这个情况,判断是否是因为容器的磁盘或者内存太小小,导致出现问题,通过相关日志信息去tm上查看错误信息,错误代码如下

flink任务重启原因分析_L13763338360的博客-CSDN博客_flink运行一段时间就重启

4、

2022-04-16 00:05:23,933 WARN  org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - Exception while restoring keyed state backend for KeyedMapBundleOperator_7cf1b594451647e7fb445b612e152cd7_(1/40) from alternative (1/1), will retry while more alternatives are available.
org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:394) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:465) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:90) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
Caused by: java.lang.InterruptedException
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:347) ~[?:1.8.0_262]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_262]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:84) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.transferAllStateDataToDirectory(RocksDBStateDownloader.java:63) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.transferRemoteStateToLocalDirectory(RocksDBIncrementalRestoreOperation.java:253) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:221) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:187) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:167) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:314) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	... 17 more
2022-04-16 00:05:23,935 INFO  org.apache.hadoop.hdfs.client.impl.BlockReaderRemote         [] - Reader block: 10085135602 

发现内存不是太太,调整一下内存资源。

5、配置信息如下:

调整管理内存大小,同步cache大小也调整
state.backend.incremental=true;
taskmanager.memory.managed.fraction =0.3;
state.backend.rocksdb.block.blocksize=64 kb;
state.backend.rocksdb.block.cache-size=128 mb;
state.backend.rocksdb.files.open = -1;
state.backend.rocksdb.writebuffer.size =128 mb;
state.backend.rocksdb.writebuffer.count=4;
state.backend.rocksdb.writebuffer.number-to-merge=2;

state.backend.rocksdb.compaction.style=level;
state.backend.rocksdb.thread.num=4;
state.backend.rocksdb.metrics.block-cache-usage=true;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;


table.dynamic-table-options.enabled=true;
table.exec.mini-batch.enabled=true;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
table.exec.mini-batch.allow-latency=15 s;

参考文档:
c​​​​​​​​​​​​​​​​​​​​​​​​​​​​Flink on RocksDB 参数调优指南 - 云+社区 - 腾讯云

nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/

Configuration | Apache Flink

  • 2
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值