Kafka consumer stops consuming after an abnormal rebalance

cumtmonster

Tags: kafka, distributed systems

Article link: https://blog.csdn.net/cumtmonster/article/details/133877934

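Full GC causes the consumer's heartbeat to time out and the consumer to leave the group

Kafka monitoring and error messages
Restarting the consumer exposed a new problem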

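Yesterday we noticed that certain actions in production, which are triggered by a particular event, were clearly trending downward. Investigation confirmed that the messages were being sent to Kafka successfully, but the consumer-side log lines were never printed. We suspected that something had caused consumption to get stuck.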

Kafka monitoring and error messages

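We first checked the Kafka monitoring. Our company uses Alibaba Cloud Kafka, and the monitoring is done through an Alibaba monitoring console rather than Eagle, although the information it shows is much the same.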

[Image: screenshot of the Kafka monitoring console]

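The monitoring showed that a rebalance took place at the point in time marked in the figure above. We immediately searched the logs around that time and found the following three error log entries.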

1
logger: org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer
message: Stopping container due to an Error
stack_trace: java.lang.OutOfMemoryError: Java heap space
2
Attempt to heartbeat failed since group is rebalancing
3
Member consumer-1-68a0fdb7-cfaf-4eaa-9dfe-4f0240a95dfe sending LeaveGroup request to coordinator ****.

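The heap memory monitoring for that period showed that the Tenured Gen was full and that a Full GC had been triggered. Here we focus on only one consequence: the stop-the-world (STW) pause caused by the Full GC made the consumer's heartbeat check fail, after which the consumer automatically left the group.
[Image: screenshot of the heap memory monitoring]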

In other words, a memory problem inside the consumer application caused the consumer to take itself offline.
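To make the timing concrete, here is a minimal sketch (not the code from this incident) of how a stop-the-world pause relates to the consumer's `session.timeout.ms` and `heartbeat.interval.ms`. The class and method names are hypothetical, and the timeout values in `main` are assumptions, since the values configured at the time are not recorded in this post. According to the Kafka documentation, the client default for `session.timeout.ms` is 10 s before Kafka 3.0 and 45 s from 3.0 on, and `heartbeat.interval.ms` defaults to 3 s.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, for illustration only.
class GcPauseCheck {

    // No heartbeat can be sent during a stop-the-world pause, so a pause
    // longer than session.timeout.ms always expires the member's session.
    static boolean definitelyExpires(long pauseMs, long sessionTimeoutMs) {
        return pauseMs > sessionTimeoutMs;
    }

    // The last heartbeat may have been sent up to one heartbeat interval
    // before the pause began, so shorter pauses can expire the session too.
    static boolean mayExpire(long pauseMs, long heartbeatIntervalMs, long sessionTimeoutMs) {
        return pauseMs + heartbeatIntervalMs > sessionTimeoutMs;
    }

    // Accumulated GC time reported by the JVM since startup, in milliseconds.
    // For concurrent collectors this is not purely stop-the-world time, so
    // treat it as a rough signal rather than an exact pause measurement.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 10_000L;   // assumed: pre-3.0 client default
        long heartbeatIntervalMs = 3_000L; // assumed: client default
        System.out.println("12 s pause expires session: "
                + definitelyExpires(12_000L, sessionTimeoutMs));
        System.out.println("8 s pause may expire session: "
                + mayExpire(8_000L, heartbeatIntervalMs, sessionTimeoutMs));
        System.out.println("GC time so far (ms): " + totalGcTimeMs());
    }
}
```

Raising `session.timeout.ms` only buys tolerance for longer pauses. In this incident the underlying problem was the heap filling up, so the memory issue itself still has to be fixed.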

Restarting the consumer exposed a new problem

stack_trace: org.apache.kafka.clients.consumer.CommitFailedException: 
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. 
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, 
which typically implies that the poll loop is spending too much time message processing. 
You can address this either by increasing max.poll.interval.ms or by 
reducing the maximum size of batches returned in poll() with max.poll.records.

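What happened here is that we restarted the consumer after a backlog had built up in the queue. By default the consumer fetches up to 500 records per poll (`max.poll.records`) and must finish processing them within 5 minutes (`max.poll.interval.ms`); if it cannot, another rebalance is triggered. We hit this limit because our consumer logic is fairly complex.
As a temporary fix we lowered the number of records fetched per poll, after which consumption returned to normal.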
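The arithmetic behind this fix can be sketched as follows. With the defaults, 500 records in 300,000 ms leaves an average budget of 600 ms per record, so any handler slower than that on average will miss the deadline when a full batch is returned. The class and method names below are hypothetical, and the 2,000 ms per-record time and the 0.5 safety factor are assumed values for illustration, not measurements from our service. Only the two configuration keys are real Kafka consumer settings.

```java
import java.util.Properties;

// Hypothetical helper, for illustration only.
class PollBudget {
    // Kafka consumer defaults, per the Kafka documentation.
    static final int DEFAULT_MAX_POLL_RECORDS = 500;
    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L; // 5 minutes

    // Average time available per record before the poll interval is exceeded.
    static long budgetPerRecordMs(int maxPollRecords, long maxPollIntervalMs) {
        return maxPollIntervalMs / maxPollRecords;
    }

    // Largest max.poll.records that fits into the poll interval, given an
    // estimated worst-case per-record processing time. The safety factor
    // (0 < factor <= 1) reserves headroom for GC pauses and slow downstreams.
    static int safeMaxPollRecords(long perRecordMs, long maxPollIntervalMs, double safetyFactor) {
        if (perRecordMs <= 0 || safetyFactor <= 0 || safetyFactor > 1) {
            throw new IllegalArgumentException("perRecordMs must be > 0 and safetyFactor in (0, 1]");
        }
        long records = (long) (maxPollIntervalMs * safetyFactor) / perRecordMs;
        return (int) Math.max(1, Math.min(records, Integer.MAX_VALUE));
    }

    // Consumer properties with max.poll.records sized for the given handler time.
    static Properties tunedConsumerProps(long perRecordMs) {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", Long.toString(DEFAULT_MAX_POLL_INTERVAL_MS));
        props.setProperty("max.poll.records", Integer.toString(
                safeMaxPollRecords(perRecordMs, DEFAULT_MAX_POLL_INTERVAL_MS, 0.5)));
        return props;
    }

    public static void main(String[] args) {
        // Default budget: 300000 / 500 = 600 ms per record.
        System.out.println(budgetPerRecordMs(
                DEFAULT_MAX_POLL_RECORDS, DEFAULT_MAX_POLL_INTERVAL_MS));
        // Assumed handler time of 2000 ms per record gives max.poll.records = 75.
        System.out.println(tunedConsumerProps(2_000L).getProperty("max.poll.records"));
    }
}
```

These properties would be merged into the rest of the consumer configuration. Lowering `max.poll.records` keeps rebalance detection fast, whereas raising `max.poll.interval.ms` is the alternative the exception message mentions, at the cost of slower detection of a truly stuck consumer.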