Troubleshooting Kafka consumer lag: a short summary

This article analyzes a problem where a Kafka consumer's long message-processing time caused heartbeat timeouts, which in turn led to consumer lag and CommitFailedException. The root cause is that the KafkaConsumer does not send heartbeats asynchronously; they are executed inside poll(). The fix is to disable auto commit, catch CommitFailedException and retry, and keep business processing time within session.timeout.ms.

Note:

This article targets the kafka client version declared in the Maven dependencies below; it does not necessarily apply to other kafka client versions.

Symptoms:

Kafka Manager showed a large lag for the topic. Checking the consumer log revealed the following error:

Auto offset commit failed for group test1: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
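The error message itself points at two tuning knobs: raise session.timeout.ms or lower max.poll.records. A minimal sketch of that approach in the consumer configuration (the values below are illustrative assumptions, not the settings used in this environment):

# give each poll-loop iteration more time before the coordinator declares the consumer dead
session.timeout.ms=30000
# fetch fewer records per poll() so that one batch can be processed within the session timeout
max.poll.records=50

In this case, however, a different fix was chosen (disabling auto commit and retrying the failed commit), as described below.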

Root cause analysis:

1. After subscribing to a topic, the kafka client registers itself with the kafka server via sendJoinGroupRequest.

2. The kafka server periodically checks each client's most recent heartbeat time; if no new heartbeat request arrives within the session timeout (session.timeout.ms, 6 s in this setup), the server assumes the client may have died and removes it from the group.

3. In the kafka client version declared below, the KafkaConsumer

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.0.1</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.0.1</version>
    <exclusions>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>

does not send heartbeat requests on a dedicated background thread (this is true whether auto commit is enabled or disabled). Instead, heartbeats are scheduled as DelayedTasks that run when consumer.poll() is executed, as the code below shows:

private Map<TopicPartition, List<ConsumerRecord<K, V>>> pollOnce(long timeout) {
        // TODO: Sub-requests should take into account the poll timeout (KAFKA-1894)
        coordinator.ensureCoordinatorReady();

        // ensure we have partitions assigned if we expect to
        if (subscriptions.partitionsAutoAssigned())
            coordinator.ensurePartitionAssignment();

        // fetch positions if we have partitions we're subscribed to that we
        // don't know the offset for
        if (!subscriptions.hasAllFetchPositions())
            updateFetchPositions(this.subscriptions.missingFetchPositions());

        long now = time.milliseconds();

        // execute delayed tasks (e.g. autocommits and heartbeats) prior to fetching records
        client.executeDelayedTasks(now);

        // init any new fetches (won't resend pending fetches)
        Map<TopicPartition, List<ConsumerRecord<K, V>>> records = fetcher.fetchedRecords();

        // if data is available already, e.g. from a previous network client poll() call to commit,
        // then just return it immediately
        if (!records.isEmpty())
            return records;

        fetcher.sendFetches();
        client.poll(timeout, now);
        return fetcher.fetchedRecords();
}

4. Therefore, if the business logic after kafkaconsumer.poll() takes too long, the client cannot send heartbeat requests to the server, and the server concludes that the client has died.

5. Once the server has decided the client is dead and removed it from its GroupMetaData, a client that sends an OffsetCommitRequest without first re-sending a JoinGroupRequest triggers the exception above, and the offset is not recorded by the server.

Possible symptoms:

1. With automatic offset commit enabled (the default), the kafka consumer keeps re-consuming the same records.

2. With automatic commit disabled, manually committing offsets throws CommitFailedException.

Solution:

1. First, I disabled auto commit (to prevent repeated consumption of the same data).

2. Second, I catch the exception thrown by the manual offset commit and retry poll. The reason: in the poll that precedes the failed commit, the rejoin-request logic runs before the heartbeat logic, so at that point the rejoin flag is still false (it has not yet been updated by the heartbeat response), and the consumer does not trigger a rejoin-group operation (this check is performed in the code fragment below):

// ensure we have partitions assigned if we expect to
if (subscriptions.partitionsAutoAssigned())
    coordinator.ensurePartitionAssignment();

The heartbeat request is then triggered later in the same poll via a DelayedTask. Its response reveals that the consumer has been dropped from the group and needs to rejoin, so the rejoin flag is updated to true (this happens in the code fragment below):

// execute delayed tasks (e.g. autocommits and heartbeats) prior to fetching records
client.executeDelayedTasks(now);

Therefore, catching CommitFailedException and calling poll again lets the consumer rejoin the group and complete the commit.

Example code:

Note: if you use the consumption logic below, you must guarantee that the business logic after the offset commit does not fail; otherwise the records being processed are lost (assuming no additional offset-handling logic is added).

import org.apache.kafka.clients.consumer.*;

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Properties;

public class TestKafkaClient {

    public static void main(String[] args) {
        for (int i = 0; i < 1; i++) {
            Thread t = new Thread(new Client());
            t.setName("kafka-client-test" + i);
            t.start();
        }
    }

}

class Client implements Runnable {

    @Override
    public void run() {
        Properties properties = new Properties();
        try {
            properties.load(Client.class.getClassLoader().getResourceAsStream("consumer.props"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        consumer.subscribe(Arrays.asList("test1"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(10000);
            try {
                // commit before processing: if the consumer was removed from the group
                // during the previous (slow) iteration, CommitFailedException surfaces here
                consumer.commitSync();
            } catch (CommitFailedException e) {
                e.printStackTrace();
                // skip this batch and poll again so the consumer rejoins the group;
                // the uncommitted records will be fetched again after the rebalance
                continue;
            }
            Iterator<ConsumerRecord<String, String>> itr = records.iterator();
            while (itr.hasNext()) {
                ConsumerRecord<String, String> next = itr.next();
                System.out.println(Thread.currentThread().getName() + " key:" + next.key() + " value:" + next.value());
            }
            try {
                // simulate business logic that takes longer than the session timeout
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
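The contents of consumer.props are not shown in the article. A minimal sketch with assumed values (group.id and the manual-commit setting follow from the text above, the rest are placeholders):

bootstrap.servers=localhost:9092
group.id=test1
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
# manual commit, per step 1 of the solution
enable.auto.commit=false
# matches the 6 s heartbeat window mentioned in the analysis; must stay within the broker's allowed session-timeout range
session.timeout.ms=6000

With a session timeout shorter than the 10 s Thread.sleep in the loop, the example runs into exactly the situation described above: the consumer is removed from the group while "processing", the next commitSync fails with CommitFailedException, and the following poll rejoins the group.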

 
