Flink Kafka Producer error: "The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId"

Background

While developing an APM project, we needed Flink to consume data from Alibaba Cloud SLS and write it to Kafka. The sink used here is the FlinkKafkaProducer provided by Flink's official connector library. After integration, the following exception appeared fairly frequently at runtime:

2021-07-07 11:25:56.080 [ERROR] [APM_ANRProcess -> Sink: APM_ANRSink (1/1)] [org.apache.flink.streaming.runtime.tasks.StreamTask][732] - Error during disposal of stream operator.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1282)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:920)
	at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:729)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:645)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:549)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId
2021-07-07 11:25:56.082 [WARN ] [APM_ANRProcess -> Sink: APM_ANRSink (1/1)] [org.apache.flink.runtime.taskmanager.Task][970] - APM_ANRProcess -> Sink: APM_ANRSink (1/1) (2b3de5183e98607918b2976d46795740) switched from RUNNING to FAILED.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1282)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:816)
	at com.shizhuang.apm.flinkwork.sink.ApmKafkaIngestEventsProducer.invoke(ApmKafkaIngestEventsProducer.java:58)
	at com.shizhuang.apm.flinkwork.sink.ApmKafkaIngestEventsProducer.invoke(ApmKafkaIngestEventsProducer.java:18)
	at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:235)
	at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
	at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
	at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
	at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:53)
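For context, the ApmKafkaIngestEventsProducer in the trace above sits on top of the connector's FlinkKafkaProducer, which was wired up roughly as follows. This is a minimal sketch rather than our actual job: the topic name, broker address, source, and serialization are placeholders, and EXACTLY_ONCE semantics is what brings Kafka transactions (and with them transactional.id / epoch fencing) into play.

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SinkSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5 * 60 * 1000L); // 5-minute checkpoints, as in our job at the time

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka-broker:9092"); // placeholder address

        // EXACTLY_ONCE wraps every write in a Kafka transaction that is
        // committed only when a checkpoint completes.
        FlinkKafkaProducer<String> sink = new FlinkKafkaProducer<>(
                "apm-events", // placeholder topic
                (KafkaSerializationSchema<String>) (element, timestamp) ->
                        new ProducerRecord<>("apm-events", element.getBytes(StandardCharsets.UTF_8)),
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

        env.fromElements("event-1", "event-2") // placeholder source instead of the SLS consumer
           .addSink(sink)
           .name("APM_ANRSink");

        env.execute("APM_ANRProcess");
    }
}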

Problem analysis

Looking at the log:

The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId

Taken literally, it says the producer used stale epoch information.
Every transactional Kafka producer carries a transactional.id; this unique identifier exists to solve the zombie-instance problem. Alongside each transactional.id, the broker tracks an epoch, which is bumped whenever a new producer registers under that id. On how an epoch goes stale, a web search turned up the following explanation:

Once the epoch is bumped, any producers with same transactional.id and an older epoch are considered zombies and are fenced off, ie. future transactional writes from those producers are rejected. [emphasis added]

So the first guess: the producer's transaction commit timed out, the broker concluded the producer was gone and bumped the epoch for its transactional.id, and when the producer later tried to commit under the now-stale epoch, it was rejected.
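The fencing mechanism itself is easy to reproduce with the plain Kafka client, independent of Flink. In the hypothetical sketch below (broker address, topic, and transactional.id are placeholders), a second producer registers under the same transactional.id, the broker bumps the epoch, and the first producer's commit is rejected with exactly the ProducerFencedException seen in our log.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;

public class FencingDemo {
    private static Properties config() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("transactional.id", "demo-txn-id"); // same id for both producers
        return props;
    }

    public static void main(String[] args) {
        KafkaProducer<String, String> first = new KafkaProducer<>(config());
        first.initTransactions();            // broker assigns epoch N to this id
        first.beginTransaction();
        first.send(new ProducerRecord<>("demo-topic", "from-first"));

        // A second producer registers under the SAME transactional.id:
        // the broker bumps the epoch to N+1 and aborts the open transaction.
        KafkaProducer<String, String> second = new KafkaProducer<>(config());
        second.initTransactions();

        try {
            first.commitTransaction();       // epoch N is now stale
        } catch (ProducerFencedException e) {
            // "Producer attempted an operation with an old epoch. ..."
            System.out.println("first producer fenced: " + e.getMessage());
        }
        second.close();
        first.close();
    }
}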

Fix

A code review showed that our Flink job has checkpointing enabled. The official Flink documentation points out that with checkpointing enabled, the Kafka sink commits its transaction as part of each checkpoint, so a transaction stays open for roughly one checkpoint interval.
Our job's checkpoint interval was 5 minutes. After changing it to 1 minute, the problem went away.
Alternatively, if you don't want to touch the checkpoint interval, you can raise transaction.timeout.ms on the producer, e.g. up to the broker's default transaction.max.timeout.ms of 15 minutes; see the sketch below.
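Both knobs look roughly like this in code (a sketch under the assumptions above; the values are examples, not recommendations):

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeoutTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 1: shorten the checkpoint interval from 5 minutes to 1 minute,
        // so transactions are committed long before they can time out.
        env.enableCheckpointing(60 * 1000L);

        // Option 2: raise the producer-side transaction timeout instead.
        // It must not exceed the broker's transaction.max.timeout.ms
        // (15 minutes by default).
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));
        // ... pass props into the FlinkKafkaProducer constructor as usual.
    }
}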

Conclusion

  • Controlling Flink's checkpoint interval resolved the Kafka producer transaction timeouts.
  • If you'd rather not change the checkpoint interval, you can raise transaction.timeout.ms instead (subject to the broker's transaction.max.timeout.ms cap).