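Preface
The consumer is one of Kafka's clients (its source is implemented in Java). One important reason for Kafka's high performance is that load balancing is implemented on the client side. The figure below shows the typical relationship between a Kafka cluster and its consumers. The key concepts are: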
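- Broker: a Kafka server node in the broker role. It does the actual work of storing and delivering messages. Readers unfamiliar with it can refer to Kafka 3.0 源码笔记(1)-Kafka 服务端的网络通信架构
- Topic: a publish/subscribe message topic. It is the first-level structure for storing messages and a logical construct in Kafka. Each topic contains multiple partitions, which are usually distributed across different Broker nodes and together form the physical basis of the topic
- Partition: a partition of a topic. Each partition is an ordered message log. Partitions are the second-level structure for storing messages in Kafka and the actual physical storage structure
- Consumer: a consumer instance, one of Kafka's clients. It does the actual work of pulling messages from partitions and processing them
- Consumer Group: multiple consumer instances form a consumer group to consume a topic. Each partition of the topic is consumed by exactly one consumer instance in the group; the other instances in the group cannot consume it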
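The rule that each partition has exactly one owner within a group can be illustrated with a small sketch. The class and method names are hypothetical, and the round-robin strategy is chosen only for simplicity; this is not Kafka's assignor code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: spread the partitions of one topic over the
// consumers of one group so that every partition has exactly one owner.
class GroupAssignmentSketch {

    // Assigns partitions 0..partitionCount-1 to the consumers in round-robin order.
    // Assumes the consumer ids are distinct.
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

With two consumers and five partitions, one consumer ends up with three partitions and the other with two, and no partition appears in both lists.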
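1. Implementing consumer load balancing
A consumer group usually contains multiple consumer instances. To achieve the consumption relationship shown in the figure above, there must be a mechanism that tells each consumer which partitions it should pull messages from. The figure below shows how this mechanism works:
KafkaConsumer contains two key components: ConsumerCoordinator, which coordinates partition consumption among the consumers in a group, and Fetcher, which pulls messages.
- ConsumerCoordinator interacts with the Kafka server. It first determines the group coordinator responsible for the current consumer group, then talks to that coordinator to join the group and complete the partition assignment for the whole group
- Fetcher, once the partition assignment is complete, sends fetch requests to the Broker nodes that host the partitions assigned to this consumer. This is what actually spreads consumer requests across the servers, that is, load balancing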
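The core flow of a consumer starting up and fetching messages can be divided into the following steps:
- The consumer determines the address of the coordinator for its consumer group and connects to it. If the consumer needs to update the Kafka cluster metadata during this step, it does so as well
- The consumer asks the server hosting the coordinator to let it join the consumer group. During this step, the leader designated by the coordinator combines the cluster metadata with the membership of the whole group to assign partitions, then sends the assignment to the coordinator
- The coordinator returns the partition assignment to each consumer. Each consumer then sends fetch requests to the Brokers hosting its partitions and consumes the messages
- The consumer starts a heartbeat thread to stay connected to the coordinator. If the coordinator reports a change in the group state, the consumer rejoins the group, which is the rebalance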
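2. Source code analysis
2.1 KafkaConsumer initialization
- The KafkaConsumer constructor initializes many components. The most important ones are:
  - metadata: a ConsumerMetadata object that stores the Kafka cluster metadata. The later this.metadata.bootstrap() call initializes the cluster nodes from the bootstrap.servers configuration
  - client: a ConsumerNetworkClient object, the upper-level network client. It wraps a NetworkClient object, which does the actual low-level network reads and writes
  - assignors: the list of partition assignors. If this consumer is elected leader of the consumer group, it uses these assignors to assign partitions
  - coordinator: a ConsumerCoordinator object, the consumer-side coordinator component that talks to the group coordinator on the server
  - fetcher: a Fetcher object, the component that actually pulls messages from the server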
KafkaConsumer(ConsumerConfig config, Deserializer<K> keyDeserializer, Deserializer<V> valueDeserializer) { try { GroupRebalanceConfig groupRebalanceConfig = new GroupRebalanceConfig(config, GroupRebalanceConfig.ProtocolType.CONSUMER); this.groupId = Optional.ofNullable(groupRebalanceConfig.groupId); this.clientId = config.getString(CommonClientConfigs.CLIENT_ID_CONFIG); LogContext logContext; // If group.instance.id is set, we will append it to the log context. if (groupRebalanceConfig.groupInstanceId.isPresent()) { logContext = new LogContext("[Consumer instanceId=" + groupRebalanceConfig.groupInstanceId.get() + ", clientId=" + clientId + ", groupId=" + groupId.orElse("null") + "] "); } else { logContext = new LogContext("[Consumer clientId=" + clientId + ", groupId=" + groupId.orElse("null") + "] "); } this.log = logContext.logger(getClass()); boolean enableAutoCommit = config.maybeOverrideEnableAutoCommit(); groupId.ifPresent(groupIdStr -> { if (groupIdStr.isEmpty()) { log.warn("Support for using the empty group id by consumers is deprecated and will be removed in the next major release."); } }); log.debug("Initializing the Kafka consumer"); this.requestTimeoutMs = config.getInt(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG); this.defaultApiTimeoutMs = config.getInt(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG); this.time = Time.SYSTEM; this.metrics = buildMetrics(config, time, clientId); this.retryBackoffMs = config.getLong(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG); List<ConsumerInterceptor<K, V>> interceptorList = (List) config.getConfiguredInstances( ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG, ConsumerInterceptor.class, Collections.singletonMap(ConsumerConfig.CLIENT_ID_CONFIG, clientId)); this.interceptors = new ConsumerInterceptors<>(interceptorList); if (keyDeserializer == null) { this.keyDeserializer = config.getConfiguredInstance(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Deserializer.class); 
this.keyDeserializer.configure(config.originals(Collections.singletonMap(ConsumerConfig.CLIENT_ID_CONFIG, clientId)), true); } else { config.ignore(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG); this.keyDeserializer = keyDeserializer; } if (valueDeserializer == null) { this.valueDeserializer = config.getConfiguredInstance(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Deserializer.class); this.valueDeserializer.configure(config.originals(Collections.singletonMap(ConsumerConfig.CLIENT_ID_CONFIG, clientId)), false); } else { config.ignore(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG); this.valueDeserializer = valueDeserializer; } OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(config.getString(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toUpperCase(Locale.ROOT)); this.subscriptions = new SubscriptionState(logContext, offsetResetStrategy); ClusterResourceListeners clusterResourceListeners = configureClusterResourceListeners(keyDeserializer, valueDeserializer, metrics.reporters(), interceptorList); this.metadata = new ConsumerMetadata(retryBackoffMs, config.getLong(ConsumerConfig.METADATA_MAX_AGE_CONFIG), !config.getBoolean(ConsumerConfig.EXCLUDE_INTERNAL_TOPICS_CONFIG), config.getBoolean(ConsumerConfig.ALLOW_AUTO_CREATE_TOPICS_CONFIG), subscriptions, logContext, clusterResourceListeners); List<InetSocketAddress> addresses = ClientUtils.parseAndValidateAddresses( config.getList(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG), config.getString(ConsumerConfig.CLIENT_DNS_LOOKUP_CONFIG)); this.metadata.bootstrap(addresses); String metricGrpPrefix = "consumer"; FetcherMetricsRegistry metricsRegistry = new FetcherMetricsRegistry(Collections.singleton(CLIENT_ID_METRIC_TAG), metricGrpPrefix); ChannelBuilder channelBuilder = ClientUtils.createChannelBuilder(config, time, logContext); this.isolationLevel = IsolationLevel.valueOf( config.getString(ConsumerConfig.ISOLATION_LEVEL_CONFIG).toUpperCase(Locale.ROOT)); Sensor throttleTimeSensor = 
Fetcher.throttleTimeSensor(metrics, metricsRegistry); int heartbeatIntervalMs = config.getInt(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG); ApiVersions apiVersions = new ApiVersions(); NetworkClient netClient = new NetworkClient( new Selector(config.getLong(ConsumerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG), metrics, time, metricGrpPrefix, channelBuilder, logContext), this.metadata, clientId, 100, // a fixed large enough value will suffice for max in-flight requests config.getLong(ConsumerConfig.RECONNECT_BACKOFF_MS_CONFIG), config.getLong(ConsumerConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG), config.getInt(ConsumerConfig.SEND_BUFFER_CONFIG), config.getInt(ConsumerConfig.RECEIVE_BUFFER_CONFIG), config.getInt(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), config.getLong(ConsumerConfig.SOCKET_CONNECTION_SETUP_TIMEOUT_MS_CONFIG), config.getLong(ConsumerConfig.SOCKET_CONNECTION_SETUP_TIMEOUT_MAX_MS_CONFIG), time, true, apiVersions, throttleTimeSensor, logContext); this.client = new ConsumerNetworkClient( logContext, netClient, metadata, time, retryBackoffMs, config.getInt(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), heartbeatIntervalMs); //Will avoid blocking an extended period of time to prevent heartbeat thread starvation this.assignors = ConsumerPartitionAssignor.getAssignorInstances( config.getList(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG), config.originals(Collections.singletonMap(ConsumerConfig.CLIENT_ID_CONFIG, clientId)) ); // no coordinator will be constructed for the default (null) group id this.coordinator = !groupId.isPresent() ? 
null : new ConsumerCoordinator(groupRebalanceConfig, logContext, this.client, assignors, this.metadata, this.subscriptions, metrics, metricGrpPrefix, this.time, enableAutoCommit, config.getInt(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG), this.interceptors, config.getBoolean(ConsumerConfig.THROW_ON_FETCH_STABLE_OFFSET_UNSUPPORTED)); this.fetcher = new Fetcher<>( logContext, this.client, config.getInt(ConsumerConfig.FETCH_MIN_BYTES_CONFIG), config.getInt(ConsumerConfig.FETCH_MAX_BYTES_CONFIG), config.getInt(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG), config.getInt(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG), config.getInt(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), config.getBoolean(ConsumerConfig.CHECK_CRCS_CONFIG), config.getString(ConsumerConfig.CLIENT_RACK_CONFIG), this.keyDeserializer, this.valueDeserializer, this.metadata, this.subscriptions, metrics, metricsRegistry, this.time, this.retryBackoffMs, this.requestTimeoutMs, isolationLevel, apiVersions); this.kafkaConsumerMetrics = new KafkaConsumerMetrics(metrics, metricGrpPrefix); config.logUnused(); AppInfoParser.registerAppInfo(JMX_PREFIX, clientId, metrics, time.milliseconds()); log.debug("Kafka consumer initialized"); } catch (Throwable t) { // call close methods if internal objects are already constructed; this is to prevent resource leak. see KAFKA-2121 // we do not need to call `close` at all when `log` is null, which means no internal objects were initialized. if (this.log != null) { close(0, true); } // now propagate the exception throw new KafkaException("Failed to construct kafka consumer", t); } }
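A minimal sketch of the configuration that feeds this constructor, written with only java.util.Properties so that it runs without the Kafka client on the classpath. The keys are standard consumer configuration names; the broker address, the group id, and the class name ConsumerConfigSketch are placeholder assumptions.

```java
import java.util.Properties;

// Hypothetical helper that builds a typical consumer configuration.
class ConsumerConfigSketch {

    static Properties consumerProps() {
        Properties props = new Properties();
        // Seed nodes handed to metadata.bootstrap(); placeholder address
        props.setProperty("bootstrap.servers", "localhost:9092");
        // The constructor only creates a ConsumerCoordinator when a group id is present
        props.setProperty("group.id", "demo-group");
        // Used when no deserializer instance is passed to the constructor
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Read by the constructor as enableAutoCommit and offsetResetStrategy
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.offset.reset", "earliest");
        return props;
    }
}
```

With the Kafka client library available, these properties would be passed to new KafkaConsumer<>(props).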
- ConsumerMetadata#bootstrap() uses MetadataCache#bootstrap() to create a MetadataCache object, which is what actually holds the Kafka cluster metadata

    public synchronized void bootstrap(List<InetSocketAddress> addresses) {
        this.needFullUpdate = true;
        this.updateVersion += 1;
        this.cache = MetadataCache.bootstrap(addresses);
    }
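The address list passed to bootstrap() comes from the bootstrap.servers string. The following is a simplified, hypothetical stand-in for that parsing step, not the ClientUtils implementation: it only splits host:port pairs, skips DNS resolution, and does not handle bracketed IPv6 literals.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning "host1:port1,host2:port2" into socket addresses.
class BootstrapAddressSketch {

    static List<InetSocketAddress> parse(String bootstrapServers) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("Invalid host:port entry: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // createUnresolved avoids a DNS lookup, so the sketch works offline
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }
}
```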
- The fields of MetadataCache are listed below. Their names are self-explanatory, so they are not described further

    public class MetadataCache {
        private final String clusterId;
        private final Map<Integer, Node> nodes;
        private final Set<String> unauthorizedTopics;
        private final Set<String> invalidTopics;
        private final Set<String> internalTopics;
        private final Node controller;
        private final Map<TopicPartition, PartitionMetadata> metadataByPartition;
        private final Map<String, Uuid> topicIds;

        private Cluster clusterInstance;

        .......
    }
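2.2 KafkaConsumer message fetching
2.2.1 Preparation and entry point for fetching
- Before fetching messages, the consumer must declare the topics it subscribes to by calling KafkaConsumer#subscribe(). This method mainly sets a few fields, in roughly the following steps:
  - Because successive subscriptions may name different topics, call Fetcher#clearBufferedDataForUnassignedTopics() to discard data that has already been received for topics not in the new subscription
  - Call SubscriptionState#subscribe() to reset the list of subscribed topics
  - If the subscription changed, call Metadata#requestUpdateForNewTopics() to set the metadata update flag needPartialUpdate to true, so that the consumer later sends a metadata update request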
@Override public void subscribe(Collection<String> topics) { subscribe(topics, new NoOpConsumerRebalanceListener()); } public void subscribe(Collection<String> topics, ConsumerRebalanceListener listener) { acquireAndEnsureOpen(); try { maybeThrowInvalidGroupIdException(); if (topics == null) throw new IllegalArgumentException("Topic collection to subscribe to cannot be null"); if (topics.isEmpty()) { // treat subscribing to empty topic list as the same as unsubscribing this.unsubscribe(); } else { for (String topic : topics) { if (Utils.isBlank(topic)) throw new IllegalArgumentException("Topic collection to subscribe to cannot contain null or empty topic"); } throwIfNoAssignorsConfigured(); fetcher.clearBufferedDataForUnassignedTopics(topics); log.info("Subscribed to topic(s): {}", Utils.join(topics, ", ")); if (this.subscriptions.subscribe(new HashSet<>(topics), listener)) metadata.requestUpdateForNewTopics(); } } finally { release(); } }
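The interplay between the last two calls (a metadata refresh is requested only when the topic set actually changes) can be sketched as follows. SubscriptionSketch and its members are hypothetical names that merely mirror the roles of SubscriptionState and the needPartialUpdate flag.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of "subscribe, then request a metadata update if needed".
class SubscriptionSketch {
    private Set<String> subscription = new HashSet<>();
    private boolean needPartialUpdate = false;

    // Replaces the subscribed topics; returns true only when the set changed.
    boolean subscribe(Collection<String> topics) {
        Set<String> requested = new HashSet<>(topics);
        if (requested.equals(subscription)) {
            return false;
        }
        subscription = requested;
        return true;
    }

    // Mirrors the if-statement at the end of KafkaConsumer#subscribe().
    void subscribeAndRequestUpdate(Collection<String> topics) {
        if (subscribe(topics)) {
            needPartialUpdate = true;
        }
    }

    boolean needPartialUpdate() {
        return needPartialUpdate;
    }
}
```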
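- After subscribing, call KafkaConsumer#poll() to enter the fetch flow. This is the entry point of message consumption. The key steps are:
  - Call KafkaConsumer#updateAssignmentMetadataIfNeeded() to enter the partition assignment flow
  - Once that completes, call KafkaConsumer#pollForFetches() to fetch messages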
@Override public ConsumerRecords<K, V> poll(final Duration timeout) { return poll(time.timer(timeout), true); } /** * @throws KafkaException if the rebalance callback throws exception */ private ConsumerRecords<K, V> poll(final Timer timer, final boolean includeMetadataInTimeout) { acquireAndEnsureOpen(); try { this.kafkaConsumerMetrics.recordPollStart(timer.currentTimeMs()); if (this.subscriptions.hasNoSubscriptionOrUserAssignment()) { throw new IllegalStateException("Consumer is not subscribed to any topics or assigned any partitions"); } do { client.maybeTriggerWakeup(); if (includeMetadataInTimeout) { // try to update assignment metadata BUT do not need to block on the timer for join group updateAssignmentMetadataIfNeeded(timer, false); } else { while (!updateAssignmentMetadataIfNeeded(time.timer(Long.MAX_VALUE), true)) { log.warn("Still waiting for metadata"); } } final Map<TopicPartition, List<ConsumerRecord<K, V>>> records = pollForFetches(timer); if (!records.isEmpty()) { // before returning the fetched records, we can send off the next round of fetches // and avoid block waiting for their responses to enable pipelining while the user // is handling the fetched records. // // NOTE: since the consumed position has already been updated, we must not allow // wakeups or any other errors to be triggered prior to returning the fetched records. if (fetcher.sendFetches() > 0 || client.hasPendingRequests()) { client.transmitSends(); } return this.interceptors.onConsume(new ConsumerRecords<>(records)); } } while (timer.notExpired()); return ConsumerRecords.empty(); } finally { release(); this.kafkaConsumerMetrics.recordPollEnd(timer.currentTimeMs()); } }
- The core of KafkaConsumer#updateAssignmentMetadataIfNeeded() is the call to ConsumerCoordinator#poll(), which talks to the server-side coordinator to determine the partitions this consumer is responsible for. This part is fairly complex and is analyzed in detail in section 2.2.2 below

    boolean updateAssignmentMetadataIfNeeded(final Timer timer, final boolean waitForJoinGroup) {
        if (coordinator != null && !coordinator.poll(timer, waitForJoinGroup)) {
            return false;
        }

        return updateFetchPositions(timer);
    }
- KafkaConsumer#pollForFetches() is fairly straightforward. Its core is the call to Fetcher#sendFetches(), which sends requests to the server to obtain messages. Section 2.2.3 below analyzes it in detail

    private Map<TopicPartition, List<ConsumerRecord<K, V>>> pollForFetches(Timer timer) {
        long pollTimeout = coordinator == null ? timer.remainingMs() :
                Math.min(coordinator.timeToNextPoll(timer.currentTimeMs()), timer.remainingMs());

        // if data is available already, return it immediately
        final Map<TopicPartition, List<ConsumerRecord<K, V>>> records = fetcher.fetchedRecords();
        if (!records.isEmpty()) {
            return records;
        }

        // send any new fetches (won't resend pending fetches)
        fetcher.sendFetches();

        // We do not want to be stuck blocking in poll if we are missing some positions
        // since the offset lookup may be backing off after a failure

        // NOTE: the use of cachedSubscriptionHashAllFetchPositions means we MUST call
        // updateAssignmentMetadataIfNeeded before this method.
        if (!cachedSubscriptionHashAllFetchPositions && pollTimeout > retryBackoffMs) {
            pollTimeout = retryBackoffMs;
        }

        log.trace("Polling for fetches with timeout {}", pollTimeout);

        Timer pollTimer = time.timer(pollTimeout);
        client.poll(pollTimer, () -> {
            // since a fetch might be completed by the background thread, we need this poll condition
            // to ensure that we do not block unnecessarily in poll()
            return !fetcher.hasAvailableFetches();
        });
        timer.update(pollTimer.currentTimeMs());

        return fetcher.fetchedRecords();
    }
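The timeout arithmetic at the top of pollForFetches() can be isolated into a pure function to make its two rules easier to see. This is a sketch with hypothetical names (PollTimeoutSketch, pollTimeout) and plain parameters in place of the consumer's fields.

```java
// Hypothetical extraction of the poll timeout calculation in pollForFetches().
class PollTimeoutSketch {

    static long pollTimeout(boolean hasCoordinator,
                            long timeToNextPollMs,
                            long remainingMs,
                            boolean hasAllFetchPositions,
                            long retryBackoffMs) {
        // Rule 1: never block past the coordinator's next required poll
        long timeout = hasCoordinator ? Math.min(timeToNextPollMs, remainingMs) : remainingMs;
        // Rule 2: if some fetch positions are missing, wake up after the retry backoff
        if (!hasAllFetchPositions && timeout > retryBackoffMs) {
            timeout = retryBackoffMs;
        }
        return timeout;
    }
}
```

For example, with 1000 ms left on the caller's timer, 3000 ms until the coordinator's next poll, and a 100 ms retry backoff, the consumer blocks for 1000 ms when all positions are known and for only 100 ms when some are missing.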
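2.2.2 Interaction between the consumer and the coordinator
2.2.2.1 Locating the coordinator
- The source of ConsumerCoordinator#poll() is shown below. The key steps are:
  - If the group coordinator for this consumer has not been found yet, call the parent-class method AbstractCoordinator#ensureCoordinatorReady() to make sure a connection to the coordinator is established
  - If the consumer has not joined the group yet, or a heartbeat has revealed a change in the group state, call the parent-class method AbstractCoordinator#ensureActiveGroup() to wait for the group's partition assignment to complete
  - Finally, unless an earlier step has already returned false, call ConsumerCoordinator#maybeAutoCommitOffsetsAsync() to check whether auto-commit is enabled. If it is, commit the offsets of the messages consumed from each partition in the previous fetch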
public boolean poll(Timer timer, boolean waitForJoinGroup) { maybeUpdateSubscriptionMetadata(); invokeCompletedOffsetCommitCallbacks(); if (subscriptions.hasAutoAssignedPartitions()) { if (protocol == null) { throw new IllegalStateException("User configured " + ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG + " to empty while trying to subscribe for group protocol to auto assign partitions"); } // Always update the heartbeat last poll time so that the heartbeat thread does not leave the // group proactively due to application inactivity even if (say) the coordinator cannot be found. pollHeartbeat(timer.currentTimeMs()); if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) { return false; } if (rejoinNeededOrPending()) { // due to a race condition between the initial metadata fetch and the initial rebalance, // we need to ensure that the metadata is fresh before joining initially. This ensures // that we have matched the pattern against the cluster's topics at least once before joining. if (subscriptions.hasPatternSubscription()) { // For consumer group that uses pattern-based subscription, after a topic is created, // any consumer that discovers the topic after metadata refresh can trigger rebalance // across the entire consumer group. Multiple rebalances can be triggered after one topic // creation if consumers refresh metadata at vastly different times. We can significantly // reduce the number of rebalances caused by single topic creation by asking consumer to // refresh metadata before re-joining the group as long as the refresh backoff time has // passed. if (this.metadata.timeToAllowUpdate(timer.currentTimeMs()) == 0) { this.metadata.requestUpdate(); } if (!client.ensureFreshMetadata(timer)) { return false; } maybeUpdateSubscriptionMetadata(); } // if not wait for join group, we would just use a timer of 0 if (!ensureActiveGroup(waitForJoinGroup ? 
timer : time.timer(0L))) { // since we may use a different timer in the callee, we'd still need // to update the original timer's current time after the call timer.update(time.milliseconds()); return false; } } } else { // For manually assigned partitions, if there are no ready nodes, await metadata. // If connections to all nodes fail, wakeups triggered while attempting to send fetch // requests result in polls returning immediately, causing a tight loop of polls. Without // the wakeup, poll() with no channels would block for the timeout, delaying re-connection. // awaitMetadataUpdate() initiates new connections with configured backoff and avoids the busy loop. // When group management is used, metadata wait is already performed for this scenario as // coordinator is unknown, hence this check is not required. if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) { client.awaitMetadataUpdate(timer); } } maybeAutoCommitOffsetsAsync(timer.currentTimeMs()); return true; }
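- AbstractCoordinator#ensureCoordinatorReady() keeps sending requests to the Kafka server in a do-while loop until the group coordinator is known or the timer expires. Its core steps are:
  - Call AbstractCoordinator#lookupCoordinator() to create an asynchronous request and put it in the request queue
  - Call ConsumerNetworkClient#poll() to actually send the network request and wait for the server's response. The loop exits once the asynchronous request has succeeded and the coordinator is known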
protected synchronized boolean ensureCoordinatorReady(final Timer timer) { if (!coordinatorUnknown()) return true; do { if (fatalFindCoordinatorException != null) { final RuntimeException fatalException = fatalFindCoordinatorException; fatalFindCoordinatorException = null; throw fatalException; } final RequestFuture<Void> future = lookupCoordinator(); client.poll(future, timer); if (!future.isDone()) { // ran out of time break; } RuntimeException fatalException = null; if (future.failed()) { if (future.isRetriable()) { log.debug("Coordinator discovery failed, refreshing metadata", future.exception()); client.awaitMetadataUpdate(timer); } else { fatalException = future.exception(); log.info("FindCoordinator request hit fatal exception", fatalException); } } else if (coordinator != null && client.isUnavailable(coordinator)) { // we found the coordinator, but the connection has failed, so mark // it dead and backoff before retrying discovery markCoordinatorUnknown("coordinator unavailable"); timer.sleep(rebalanceConfig.retryBackoffMs); } clearFindCoordinatorFuture(); if (fatalException != null) throw fatalException; } while (coordinatorUnknown() && timer.notExpired()); return !coordinatorUnknown(); }
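Stripped of error handling, the loop above is a retry-until-deadline pattern. The sketch below shows that pattern in isolation. All names are hypothetical, the lookup is modeled as a Supplier, and the clock is injected so the behavior can be checked without sleeping.

```java
import java.util.Optional;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical reduction of ensureCoordinatorReady() to its retry loop.
class CoordinatorRetrySketch {
    private String coordinator; // null plays the role of coordinatorUnknown()

    // Repeats the lookup until a coordinator is found or the deadline passes.
    boolean ensureCoordinatorReady(Supplier<Optional<String>> lookup,
                                   LongSupplier clockMs,
                                   long timeoutMs) {
        if (coordinator != null) {
            return true;
        }
        long deadline = clockMs.getAsLong() + timeoutMs;
        do {
            // One round of "send FindCoordinator and poll for the response"
            lookup.get().ifPresent(found -> coordinator = found);
        } while (coordinator == null && clockMs.getAsLong() < deadline);
        return coordinator != null;
    }

    String coordinator() {
        return coordinator;
    }
}
```

Like the original, the method reports failure by returning false rather than throwing when time runs out, which lets the caller decide whether to call poll() again.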
- The core of AbstractCoordinator#lookupCoordinator() is:
  - First call ConsumerNetworkClient#leastLoadedNode() to pick the least-loaded node from the cluster metadata, that is, the node with the fewest outstanding requests
  - Call AbstractCoordinator#sendFindCoordinatorRequest() to create the asynchronous request addressed to that node

    protected synchronized RequestFuture<Void> lookupCoordinator() {
        if (findCoordinatorFuture == null) {
            // find a node to ask about the coordinator
            Node node = this.client.leastLoadedNode();
            if (node == null) {
                log.debug("No broker available to send FindCoordinator request");
                return RequestFuture.noBrokersAvailable();
            } else {
                findCoordinatorFuture = sendFindCoordinatorRequest(node);
            }
        }
        return findCoordinatorFuture;
    }
- The source of AbstractCoordinator#sendFindCoordinatorRequest() is shown below. Its core is two steps:
  - Call ConsumerNetworkClient#send() to create an asynchronous FindCoordinator request and enqueue it
  - Call RequestFuture#compose() on that asynchronous request to attach a listener, with the server response handler FindCoordinatorResponseHandler wrapped inside the listener

    private RequestFuture<Void> sendFindCoordinatorRequest(Node node) {
        // initiate the group metadata request
        log.debug("Sending FindCoordinator request to broker {}", node);
        FindCoordinatorRequestData data = new FindCoordinatorRequestData()
                .setKeyType(CoordinatorType.GROUP.id())
                .setKey(this.rebalanceConfig.groupId);
        FindCoordinatorRequest.Builder requestBuilder = new FindCoordinatorRequest.Builder(data);
        return client.send(node, requestBuilder)
                .compose(new FindCoordinatorResponseHandler());
    }
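The compose step turns a future of the raw response into a future of a processed result. The sketch below shows the same idea with the JDK's CompletableFuture swapped in for Kafka's RequestFuture; it is an analogy, not the RequestFuture API. LookupResponse and its fields are hypothetical stand-ins for the FindCoordinator response.

```java
import java.util.concurrent.CompletableFuture;

// Analogy for RequestFuture#compose() using CompletableFuture#thenApply().
class ComposeSketch {

    // Hypothetical stand-in for the server's FindCoordinator response.
    static final class LookupResponse {
        final int errorCode;
        final String host;
        final int port;

        LookupResponse(int errorCode, String host, int port) {
            this.errorCode = errorCode;
            this.host = host;
            this.port = port;
        }
    }

    // Adapts the raw response future: succeeds with "host:port", fails on an error code.
    static CompletableFuture<String> compose(CompletableFuture<LookupResponse> raw) {
        return raw.thenApply(resp -> {
            if (resp.errorCode != 0) {
                throw new IllegalStateException("Lookup failed with error code " + resp.errorCode);
            }
            return resp.host + ":" + resp.port;
        });
    }
}
```

The caller holds only the adapted future, so it never needs to know how the raw response is parsed. This is the same separation the FindCoordinatorResponseHandler provides.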
- The core logic of ConsumerNetworkClient#send() is:
  - Create a RequestFutureCompletionHandler object to act as the handler invoked when the request completes
  - Call NetworkClient#newClientRequest() to create a network ClientRequest object and store it in the buffer queue unsent

    public RequestFuture<ClientResponse> send(Node node, AbstractRequest.Builder<?> requestBuilder) {
        return send(node, requestBuilder, requestTimeoutMs);
    }

    public RequestFuture<ClientResponse> send(Node node,
                                              AbstractRequest.Builder<?> requestBuilder,
                                              int requestTimeoutMs) {
        long now = time.milliseconds();
        RequestFutureCompletionHandler completionHandler = new RequestFutureCompletionHandler();
        ClientRequest clientRequest = client.newClientRequest(node.idString(), requestBuilder, now, true,
                requestTimeoutMs, completionHandler);
        unsent.put(node, clientRequest);

        // wakeup the client in case it is blocking in poll so that we can send the queued request
        client.wakeup();
        return completionHandler.future;
    }
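- Now return to the ConsumerNetworkClient#poll() call from step 2 of this section. Its entry point runs inside a do-while loop, and the core of the implementation comes down to these points:
  - Call ConsumerNetworkClient#trySend() to take the requests out of unsent and establish connections to their target nodes
  - Call NetworkClient#poll() to monitor the underlying network connections and handle network reads and writes
  - Call ConsumerNetworkClient#firePendingCompletedRequests() to invoke the callbacks of the upper-level requests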
public boolean poll(RequestFuture<?> future, Timer timer) { do { poll(timer, future); } while (!future.isDone() && timer.notExpired()); return future.isDone(); } public void poll(Timer timer, PollCondition pollCondition) { poll(timer, pollCondition, false); } public void poll(Timer timer, PollCondition pollCondition, boolean disableWakeup) { // there may be handlers which need to be invoked if we woke up the previous call to poll firePendingCompletedRequests(); lock.lock(); try { // Handle async disconnects prior to attempting any sends handlePendingDisconnects(); // send all the requests we can send now long pollDelayMs = trySend(timer.currentTimeMs()); // check whether the poll is still needed by the caller. Note that if the expected completion // condition becomes satisfied after the call to shouldBlock() (because of a fired completion // handler), the client will be woken up. if (pendingCompletion.isEmpty() && (pollCondition == null || pollCondition.shouldBlock())) { // if there are no requests in flight, do not block longer than the retry backoff long pollTimeout = Math.min(timer.remainingMs(), pollDelayMs); if (client.inFlightRequestCount() == 0) pollTimeout = Math.min(pollTimeout, retryBackoffMs); client.poll(pollTimeout, timer.currentTimeMs()); } else { client.poll(0, timer.currentTimeMs()); } timer.update(); // handle any disconnects by failing the active requests. 
note that disconnects must // be checked immediately following poll since any subsequent call to client.ready() // will reset the disconnect status checkDisconnects(timer.currentTimeMs()); if (!disableWakeup) { // trigger wakeups after checking for disconnects so that the callbacks will be ready // to be fired on the next call to poll() maybeTriggerWakeup(); } // throw InterruptException if this thread is interrupted maybeThrowInterruptException(); // try again to send requests since buffer space may have been // cleared or a connect finished in the poll trySend(timer.currentTimeMs()); // fail requests that couldn't be sent if they have expired failExpiredRequests(timer.currentTimeMs()); // clean unsent requests collection to keep the map from growing indefinitely unsent.clean(); } finally { lock.unlock(); } // called without the lock to avoid deadlock potential if handlers need to acquire locks firePendingCompletedRequests(); metadata.maybeThrowAnyException(); }
- The core of ConsumerNetworkClient#trySend() is sending the requests in unsent to their corresponding nodes. The two key steps are:
  - First call NetworkClient#ready() to check that a connection to the target node is established, initiating one if it is not
  - Then call NetworkClient#send() to store the request in the connection's buffer, where it waits to be sent once the connection is writable

    // Visible for testing
    long trySend(long now) {
        long pollDelayMs = maxPollTimeoutMs;

        // send any requests that can be sent now
        for (Node node : unsent.nodes()) {
            Iterator<ClientRequest> iterator = unsent.requestIterator(node);
            if (iterator.hasNext())
                pollDelayMs = Math.min(pollDelayMs, client.pollDelayMs(node, now));

            while (iterator.hasNext()) {
                ClientRequest request = iterator.next();
                if (client.ready(node, now)) {
                    client.send(request, now);
                    iterator.remove();
                } else {
                    // try next node when current node is not ready
                    break;
                }
            }
        }
        return pollDelayMs;
    }
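The unsent structure and the send-what-is-ready loop can be modeled with plain collections. In this hypothetical sketch, nodes and requests are strings, and node readiness is passed in as a set rather than derived from connection state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the per-node unsent queue and the trySend() loop.
class UnsentQueueSketch {
    private final Map<String, Deque<String>> unsent = new LinkedHashMap<>();
    final List<String> sent = new ArrayList<>();

    // Queues a request for a node, like unsent.put(node, clientRequest).
    void put(String node, String request) {
        unsent.computeIfAbsent(node, n -> new ArrayDeque<>()).add(request);
    }

    // Sends queued requests for ready nodes; requests for other nodes stay queued.
    void trySend(Set<String> readyNodes) {
        for (Map.Entry<String, Deque<String>> entry : unsent.entrySet()) {
            Iterator<String> iterator = entry.getValue().iterator();
            while (iterator.hasNext()) {
                String request = iterator.next();
                if (!readyNodes.contains(entry.getKey())) {
                    break; // try the next node when the current one is not ready
                }
                sent.add(entry.getKey() + ":" + request);
                iterator.remove();
            }
        }
    }

    int pending(String node) {
        Deque<String> queue = unsent.get(node);
        return queue == null ? 0 : queue.size();
    }
}
```

Requests for a node that is not yet connected simply wait in the queue, which is why poll() calls trySend() a second time after the network poll: a connection may have completed in between.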
- The core of NetworkClient#ready() is the call to NetworkClient#initiateConnect() to establish a connection, and the core of establishing a connection is in turn Selector#connect(). Going deeper, Selector#connect() is basic Java NIO. For reasons of space this article does not go further; readers can follow along with the source flow chart

    @Override
    public boolean ready(Node node, long now) {
        if (node.isEmpty())
            throw new IllegalArgumentException("Cannot connect to empty node " + node);

        if (isReady(node, now))
            return true;

        if (connectionStates.canConnect(node.idString(), now))
            // if we are interested in sending to a node and we don't have a connection to it, initiate one
            initiateConnect(node, now);

        return false;
    }

    private void initiateConnect(Node node, long now) {
        String nodeConnectionId = node.idString();
        try {
            connectionStates.connecting(nodeConnectionId, now, node.host());
            InetAddress address = connectionStates.currentAddress(nodeConnectionId);
            log.debug("Initiating connection to node {} using address {}", node, address);
            selector.connect(nodeConnectionId,
                    new InetSocketAddress(address, node.port()),
                    this.socketSendBuffer,
                    this.socketReceiveBuffer);
        } catch (IOException e) {
            log.warn("Error connecting to node {}", node, e);
            // Attempt failed, we'll try again after the backoff
            connectionStates.disconnected(nodeConnectionId, now);
            // Notify metadata updater of the connection failure
            metadataUpdater.handleServerDisconnect(now, nodeConnectionId, Optional.empty());
        }
    }
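- Now return to the NetworkClient#send() call in point 2 of step 7 of this section. From the source below, the key points of what it ultimately does are:
  - Create a low-level InFlightRequest object that wraps the ClientRequest, and with it the RequestFutureCompletionHandler callback, then add it to the in-flight queue
  - Call Selector#send() to store the request in the connection's send buffer; this is not explored further here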
@Override public void send(ClientRequest request, long now) { doSend(request, false, now); } private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now) { ensureActive(); String nodeId = clientRequest.destination(); if (!isInternalRequest) { // If this request came from outside the NetworkClient, validate // that we can send data. If the request is internal, we trust // that internal code has done this validation. Validation // will be slightly different for some internal requests (for // example, ApiVersionsRequests can be sent prior to being in // READY state.) if (!canSendRequest(nodeId, now)) throw new IllegalStateException("Attempt to send a request to node " + nodeId + " which is not ready."); } AbstractRequest.Builder<?> builder = clientRequest.requestBuilder(); try { NodeApiVersions versionInfo = apiVersions.get(nodeId); short version; // Note: if versionInfo is null, we have no server version information. This would be // the case when sending the initial ApiVersionRequest which fetches the version // information itself. It is also the case when discoverBrokerVersions is set to false. if (versionInfo == null) { version = builder.latestAllowedVersion(); if (discoverBrokerVersions && log.isTraceEnabled()) log.trace("No version information found when sending {} with correlation id {} to node {}. " + "Assuming version {}.", clientRequest.apiKey(), clientRequest.correlationId(), nodeId, version); } else { version = versionInfo.latestUsableVersion(clientRequest.apiKey(), builder.oldestAllowedVersion(), builder.latestAllowedVersion()); } // The call to build may also throw UnsupportedVersionException, if there are essential // fields that cannot be represented in the chosen version. doSend(clientRequest, isInternalRequest, now, builder.build(version)); } catch (UnsupportedVersionException unsupportedVersionException) { // If the version is not supported, skip sending the request over the wire. 
// Instead, simply add it to the local queue of aborted requests. log.debug("Version mismatch when attempting to send {} with correlation id {} to {}", builder, clientRequest.correlationId(), clientRequest.destination(), unsupportedVersionException); ClientResponse clientResponse = new ClientResponse(clientRequest.makeHeader(builder.latestAllowedVersion()), clientRequest.callback(), clientRequest.destination(), now, now, false, unsupportedVersionException, null, null); if (!isInternalRequest) abortedSends.add(clientResponse); else if (clientRequest.apiKey() == ApiKeys.METADATA) metadataUpdater.handleFailedRequest(now, Optional.of(unsupportedVersionException)); } } private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now, AbstractRequest request) { String destination = clientRequest.destination(); RequestHeader header = clientRequest.makeHeader(request.version()); if (log.isDebugEnabled()) { log.debug("Sending {} request with header {} and timeout {} to node {}: {}", clientRequest.apiKey(), header, clientRequest.requestTimeoutMs(), destination, request); } Send send = request.toSend(header); InFlightRequest inFlightRequest = new InFlightRequest( clientRequest, header, isInternalRequest, request, send, now); this.inFlightRequests.add(inFlightRequest); selector.send(new NetworkSend(clientRequest.destination(), send)); }
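- Now return to NetworkClient#poll() in point 2 of step 6 of this section. The main points of its processing are:
  - Call MetadataUpdater#maybeUpdate() to check whether the cluster metadata held by the consumer needs updating. If it does, a Metadata request is sent; this article does not go into it
  - Call Selector#poll() to handle network reads and writes on the underlying connections. This part is intricate; interested readers can study it on their own
  - Call NetworkClient#handleCompletedReceives() to convert the network data received on the underlying connections into upper-level ClientResponse objects
  - Call NetworkClient#completeResponses() to pass the received network data back to the upper layer through callbacks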
@Override public List<ClientResponse> poll(long timeout, long now) { ensureActive(); if (!abortedSends.isEmpty()) { // If there are aborted sends because of unsupported version exceptions or disconnects, // handle them immediately without waiting for Selector#poll. List<ClientResponse> responses = new ArrayList<>(); handleAbortedSends(responses); completeResponses(responses); return responses; } long metadataTimeout = metadataUpdater.maybeUpdate(now); try { this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs)); } catch (IOException e) { log.error("Unexpected error during I/O", e); } // process completed actions long updatedNow = this.time.milliseconds(); List<ClientResponse> responses = new ArrayList<>(); handleCompletedSends(responses, updatedNow); handleCompletedReceives(responses, updatedNow); handleDisconnections(responses, updatedNow); handleConnections(); handleInitiateApiVersionRequests(updatedNow); handleTimedOutConnections(responses, updatedNow); handleTimedOutRequests(responses, updatedNow); completeResponses(responses); return responses; }
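- NetworkClient#handleCompletedReceives() is shown below. The core logic is clear:
  - Call Selector#completedReceives() to obtain the response data received from the network, and iterate over it
  - Take the InFlightRequest object that corresponds to the current network response and call NetworkClient#parseResponse() to parse the response packet into an upper-level response object
  - Choose the handling based on the response type. For an internal request, a MetadataResponse is passed to MetadataUpdater#handleSuccessfulResponse() to update the metadata, and an ApiVersionsResponse is passed to handleApiVersionsResponse(). Everything else goes through InFlightRequest#completed(), which produces a ClientResponse object with the upper-level callback handler set inside it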
private void handleCompletedReceives(List<ClientResponse> responses, long now) { for (NetworkReceive receive : this.selector.completedReceives()) { String source = receive.source(); InFlightRequest req = inFlightRequests.completeNext(source); AbstractResponse response = parseResponse(receive.payload(), req.header); if (throttleTimeSensor != null) throttleTimeSensor.record(response.throttleTimeMs(), now); if (log.isDebugEnabled()) { log.debug("Received {} response from node {} for request with header {}: {}", req.header.apiKey(), req.destination, req.header, response); } // If the received response includes a throttle delay, throttle the connection. maybeThrottle(response, req.header.apiVersion(), req.destination, now); if (req.isInternalRequest && response instanceof MetadataResponse) metadataUpdater.handleSuccessfulResponse(req.header, now, (MetadataResponse) response); else if (req.isInternalRequest && response instanceof ApiVersionsResponse) handleApiVersionsResponse(responses, req, now, (ApiVersionsResponse) response); else responses.add(req.completed(response, now)); } }
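The call inFlightRequests.completeNext(source) works because responses on one connection come back in the order the requests were sent, so the oldest outstanding request to a node is the one a new response belongs to. A minimal model of that bookkeeping follows. The names are hypothetical and requests are reduced to correlation ids.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-node in-flight request tracking.
class InFlightSketch {
    private final Map<String, Deque<Integer>> inFlight = new HashMap<>();

    // Records a request that has been handed to the network layer.
    void add(String node, int correlationId) {
        inFlight.computeIfAbsent(node, n -> new ArrayDeque<>()).addLast(correlationId);
    }

    // A response from the node completes the oldest outstanding request to it.
    int completeNext(String node) {
        Deque<Integer> queue = inFlight.get(node);
        if (queue == null || queue.isEmpty()) {
            throw new IllegalStateException("No in-flight request for node " + node);
        }
        return queue.pollFirst();
    }
}
```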
- Back at point 4 of step 10 in this section, NetworkClient#completeResponses() invokes the callback held by each ClientResponse object through ClientResponse#onComplete(), which puts the network data into a queue held by the upper layer

```java
private void completeResponses(List<ClientResponse> responses) {
    for (ClientResponse response : responses) {
        try {
            response.onComplete();
        } catch (Exception e) {
            log.error("Uncaught error in request completion:", e);
        }
    }
}
```
- Now return to point 3 of step 6 in this section. ConsumerNetworkClient#firePendingCompletedRequests() fires the completion handler of each pending response in turn. The callback chain from the low-level data up to the caller is rather convoluted, but tracing it through shows that it ends in FindCoordinatorResponseHandler#onSuccess()

```java
private void firePendingCompletedRequests() {
    boolean completedRequestsFired = false;
    for (;;) {
        RequestFutureCompletionHandler completionHandler = pendingCompletion.poll();
        if (completionHandler == null)
            break;

        completionHandler.fireCompletion();
        completedRequestsFired = true;
    }

    // wakeup the client in case it is blocking in poll for this future's completion
    if (completedRequestsFired)
        client.wakeup();
}
```
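The hand-off described in the last two steps (the network layer only enqueues a completed handler, and the polling thread fires it later) can be sketched in isolation. This is a simplified illustration, not Kafka's code: `PendingCompletionSketch`, `Handler`, and the use of a plain `String` as the response are all hypothetical stand-ins for `ConsumerNetworkClient`, `RequestFutureCompletionHandler`, and `ClientResponse`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical, simplified model of the "enqueue on completion, fire on poll" pattern.
class PendingCompletionSketch {

    // Stand-in for RequestFutureCompletionHandler: remembers the response and the user callback.
    static final class Handler {
        private final Consumer<String> callback;
        private String response;

        Handler(Consumer<String> callback) {
            this.callback = callback;
        }

        void fireCompletion() {
            callback.accept(response);
        }
    }

    // Thread-safe queue, so the network side and the polling side may run on different threads.
    private final ConcurrentLinkedQueue<Handler> pendingCompletion = new ConcurrentLinkedQueue<>();

    // Network side: store the response and enqueue the handler; no user callback runs here.
    void onComplete(Handler handler, String response) {
        handler.response = response;
        pendingCompletion.add(handler);
    }

    // Polling side: drain the queue and run each callback; returns how many were fired.
    int firePendingCompletedRequests() {
        int fired = 0;
        Handler handler = pendingCompletion.poll();
        while (handler != null) {
            handler.fireCompletion();
            fired++;
            handler = pendingCompletion.poll();
        }
        return fired;
    }
}
```

Deferring the callback this way keeps user-facing logic on the thread that calls poll, which is presumably why the real client separates the two stages.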
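- FindCoordinatorResponseHandler#onSuccess() is simple and clear, and it completes the coordinator lookup:
  - First build the coordinator node from the data in the server's response
  - Call ConsumerNetworkClient#tryConnect() to connect to the coordinator node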
```java
private class FindCoordinatorResponseHandler extends RequestFutureAdapter<ClientResponse, Void> {

    @Override
    public void onSuccess(ClientResponse resp, RequestFuture<Void> future) {
        log.debug("Received FindCoordinator response {}", resp);

        List<Coordinator> coordinators = ((FindCoordinatorResponse) resp.responseBody()).coordinators();
        if (coordinators.size() != 1) {
            log.error("Group coordinator lookup failed: Invalid response containing more than a single coordinator");
            future.raise(new IllegalStateException("Group coordinator lookup failed: Invalid response containing more than a single coordinator"));
        }
        Coordinator coordinatorData = coordinators.get(0);
        Errors error = Errors.forCode(coordinatorData.errorCode());
        if (error == Errors.NONE) {
            synchronized (AbstractCoordinator.this) {
                // use MAX_VALUE - node.id as the coordinator id to allow separate connections
                // for the coordinator in the underlying network client layer
                int coordinatorConnectionId = Integer.MAX_VALUE - coordinatorData.nodeId();

                AbstractCoordinator.this.coordinator = new Node(
                        coordinatorConnectionId,
                        coordinatorData.host(),
                        coordinatorData.port());
                log.info("Discovered group coordinator {}", coordinator);
                client.tryConnect(coordinator);
                heartbeat.resetSessionTimeout();
            }
            future.complete(null);
        } else if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
            future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId));
        } else {
            log.debug("Group coordinator lookup failed: {}", coordinatorData.errorMessage());
            future.raise(error);
        }
    }

    @Override
    public void onFailure(RuntimeException e, RequestFuture<Void> future) {
        log.debug("FindCoordinator request failed due to {}", e.toString());

        if (!(e instanceof RetriableException)) {
            // Remember the exception if fatal so we can ensure it gets thrown by the main thread
            fatalFindCoordinatorException = e;
        }

        super.onFailure(e, future);
    }
}
```
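One detail in the handler above is the connection id, `Integer.MAX_VALUE - nodeId`. Per the source comment, this gives the coordinator its own connection, separate from the ordinary connection to the same broker. A minimal worked example of the arithmetic follows; the class and method names are hypothetical and exist only for illustration.

```java
// Hypothetical helper that reproduces only the id arithmetic from onSuccess().
class CoordinatorConnectionIdSketch {

    // Maps a broker node id to the id used for the dedicated coordinator connection.
    static int coordinatorConnectionId(int brokerNodeId) {
        return Integer.MAX_VALUE - brokerNodeId;
    }

    public static void main(String[] args) {
        int brokerNodeId = 2;
        int connectionId = coordinatorConnectionId(brokerNodeId);
        // The two ids differ, so a client that keys connections by node id keeps them apart.
        System.out.println("broker id = " + brokerNodeId + ", coordinator connection id = " + connectionId);
    }
}
```

For broker id 2 the coordinator connection id is 2147483645, and the mapping is reversible, since subtracting that value from `Integer.MAX_VALUE` yields the broker id again.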
2.2.2.2 How a consumer joins the consumer group
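- After the steps above, the consumer has located the coordinator of its consumer group and connected to it. Next, return to point 2 of step 1 in section 2.2.2.1, the call to AbstractCoordinator#ensureActiveGroup(). The key operations here are:
  - Call AbstractCoordinator#startHeartbeatThreadIfNeeded() to start the heartbeat thread if it is not running yet, which maintains a heartbeat with the group coordinator. The heartbeat thread's logic is analyzed in detail in section 2.2.2.3 below
  - Call AbstractCoordinator#joinGroupIfNeeded() to enter the flow in which the consumer joins the consumer group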
```java
boolean ensureActiveGroup(final Timer timer) {
    // always ensure that the coordinator is ready because we may have been disconnected
    // when sending heartbeats and does not necessarily require us to rejoin the group.
    if (!ensureCoordinatorReady(timer)) {
        return false;
    }

    startHeartbeatThreadIfNeeded();
    return joinGroupIfNeeded(timer);
}

private synchronized void startHeartbeatThreadIfNeeded() {
    if (heartbeatThread == null) {
        heartbeatThread = new HeartbeatThread();
        heartbeatThread.start();
    }
}
```
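- AbstractCoordinator#joinGroupIfNeeded() keeps trying to join the consumer group in a while loop until it succeeds, the timer expires, or a non-retriable error is thrown. Judging from the source, the key points are:
  - Call AbstractCoordinator#initiateJoinGroup() to create the asynchronous JoinGroup request
  - Call ConsumerNetworkClient#poll() to actually drive the sending and receiving of the request. This part was analyzed in the previous section and is not repeated here
  - If the asynchronous join request completes successfully, the coordinator has sent down the partitions this consumer is responsible for, and the consumer stores them by calling ConsumerCoordinator#onJoinComplete()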
```java
boolean joinGroupIfNeeded(final Timer timer) {
    while (rejoinNeededOrPending()) {
        if (!ensureCoordinatorReady(timer)) {
            return false;
        }

        // call onJoinPrepare if needed. We set a flag to make sure that we do not call it a second
        // time if the client is woken up before a pending rebalance completes. This must be called
        // on each iteration of the loop because an event requiring a rebalance (such as a metadata
        // refresh which changes the matched subscription set) can occur while another rebalance is
        // still in progress.
        if (needsJoinPrepare) {
            // need to set the flag before calling onJoinPrepare since the user callback may throw
            // exception, in which case upon retry we should not retry onJoinPrepare either.
            needsJoinPrepare = false;
            onJoinPrepare(generation.generationId, generation.memberId);
        }

        final RequestFuture<ByteBuffer> future = initiateJoinGroup();
        client.poll(future, timer);
        if (!future.isDone()) {
            // we ran out of time
            return false;
        }

        if (future.succeeded()) {
            Generation generationSnapshot;
            MemberState stateSnapshot;

            // Generation data maybe concurrently cleared by Heartbeat thread.
            // Can't use synchronized for {@code onJoinComplete}, because it can be long enough
            // and shouldn't block heartbeat thread.
            // See {@link PlaintextConsumerTest#testMaxPollIntervalMsDelayInAssignment}
            synchronized (AbstractCoordinator.this) {
                generationSnapshot = this.generation;
                stateSnapshot = this.state;
            }

            if (!generationSnapshot.equals(Generation.NO_GENERATION) && stateSnapshot == MemberState.STABLE) {
                // Duplicate the buffer in case `onJoinComplete` does not complete and needs to be retried.
                ByteBuffer memberAssignment = future.value().duplicate();

                onJoinComplete(generationSnapshot.generationId, generationSnapshot.memberId, generationSnapshot.protocolName, memberAssignment);

                // Generally speaking we should always resetJoinGroupFuture once the future is done, but here
                // we can only reset the join group future after the completion callback returns. This ensures
                // that if the callback is woken up, we will retry it on the next joinGroupIfNeeded.
                // And because of that we should explicitly trigger resetJoinGroupFuture in other conditions below.
                resetJoinGroupFuture();
                needsJoinPrepare = true;
            } else {
                final String reason = String.format("rebalance failed since the generation/state was " +
                        "modified by heartbeat thread to %s/%s before the rebalance callback triggered",
                        generationSnapshot, stateSnapshot);

                resetStateAndRejoin(reason);
                resetJoinGroupFuture();
            }
        } else {
            final RuntimeException exception = future.exception();

            resetJoinGroupFuture();
            rejoinNeeded = true;

            if (exception instanceof UnknownMemberIdException ||
                exception instanceof IllegalGenerationException ||
                exception instanceof RebalanceInProgressException ||
                exception instanceof MemberIdRequiredException)
                continue;
            else if (!future.isRetriable())
                throw exception;

            timer.sleep(rebalanceConfig.retryBackoffMs);
        }
    }
    return true;
}
```
- AbstractCoordinator#initiateJoinGroup() is an entry method. Its core is the call to AbstractCoordinator#sendJoinGroupRequest(), which builds the asynchronous request

```java
private synchronized RequestFuture<ByteBuffer> initiateJoinGroup() {
    // we store the join future in case we are woken up by the user after beginning the
    // rebalance in the call to poll below. This ensures that we do not mistakenly attempt
    // to rejoin before the pending rebalance has completed.
    if (joinFuture == null) {
        state = MemberState.PREPARING_REBALANCE;
        // a rebalance can be triggered consecutively if the previous one failed,
        // in this case we would not update the start time.
        if (lastRebalanceStartMs == -1L)
            lastRebalanceStartMs = time.milliseconds();
        joinFuture = sendJoinGroupRequest();
        joinFuture.addListener(new RequestFutureListener<ByteBuffer>() {
            @Override
            public void onSuccess(ByteBuffer value) {
                // do nothing since all the handler logic are in SyncGroupResponseHandler already
            }

            @Override
            public void onFailure(RuntimeException e) {
                // we handle failures below after the request finishes. if the join completes
                // after having been woken up, the exception is ignored and we will rejoin;
                // this can be triggered when either join or sync request failed
                synchronized (AbstractCoordinator.this) {
                    sensors.failedRebalanceSensor.record();
                }
            }
        });
    }
    return joinFuture;
}
```
- AbstractCoordinator#sendJoinGroupRequest() is implemented as follows. The flow is essentially the same as the coordinator lookup analyzed in section 2.2.2.1: build an asynchronous request, attach a callback handler, and let that handler run the business logic once the matching response arrives. The callback handler for the JoinGroup request is JoinGroupResponseHandler, so the response is ultimately processed by JoinGroupResponseHandler#handle()

```java
RequestFuture<ByteBuffer> sendJoinGroupRequest() {
    if (coordinatorUnknown())
        return RequestFuture.coordinatorNotAvailable();

    // send a join group request to the coordinator
    log.info("(Re-)joining group");
    JoinGroupRequest.Builder requestBuilder = new JoinGroupRequest.Builder(
            new JoinGroupRequestData()
                    .setGroupId(rebalanceConfig.groupId)
                    .setSessionTimeoutMs(this.rebalanceConfig.sessionTimeoutMs)
                    .setMemberId(this.generation.memberId)
                    .setGroupInstanceId(this.rebalanceConfig.groupInstanceId.orElse(null))
                    .setProtocolType(protocolType())
                    .setProtocols(metadata())
                    .setRebalanceTimeoutMs(this.rebalanceConfig.rebalanceTimeoutMs)
    );

    log.debug("Sending JoinGroup ({}) to coordinator {}", requestBuilder, this.coordinator);

    // Note that we override the request timeout using the rebalance timeout since that is the
    // maximum time that it may block on the coordinator. We add an extra 5 seconds for small delays.
    int joinGroupTimeoutMs = Math.max(client.defaultRequestTimeoutMs(),
        rebalanceConfig.rebalanceTimeoutMs + JOIN_GROUP_TIMEOUT_LAPSE);
    return client.send(coordinator, requestBuilder, joinGroupTimeoutMs)
            .compose(new JoinGroupResponseHandler(generation));
}
```
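The timeout override at the end of sendJoinGroupRequest() is easier to see with numbers. The sketch below reproduces only that one expression. The value 5000 for the lapse is taken from the "extra 5 seconds" in the source comment, and the sample inputs (a 30 s request timeout and a 300 s rebalance timeout) are assumptions chosen for illustration, not values read from any real configuration.

```java
// Hypothetical helper that isolates the JoinGroup timeout arithmetic.
class JoinGroupTimeoutSketch {

    // Assumed from the source comment: "We add an extra 5 seconds for small delays."
    static final int JOIN_GROUP_TIMEOUT_LAPSE = 5000;

    // The JoinGroup request may block on the coordinator for up to the rebalance timeout,
    // so the request timeout is raised to at least that value plus the lapse.
    static int joinGroupTimeoutMs(int defaultRequestTimeoutMs, int rebalanceTimeoutMs) {
        return Math.max(defaultRequestTimeoutMs, rebalanceTimeoutMs + JOIN_GROUP_TIMEOUT_LAPSE);
    }

    public static void main(String[] args) {
        // Rebalance timeout dominates: max(30000, 300000 + 5000)
        System.out.println(joinGroupTimeoutMs(30_000, 300_000));
        // Request timeout dominates: max(30000, 10000 + 5000)
        System.out.println(joinGroupTimeoutMs(30_000, 10_000));
    }
}
```

With these inputs the first call yields 305000 ms and the second 30000 ms, so the override only matters when the rebalance timeout is close to or larger than the ordinary request timeout.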
- JoinGroupResponseHandler#handle() deals with many error cases. Its core is to decide, from the group leader id returned by the coordinator, whether this consumer is the leader of the consumer group. If it is, the handler calls AbstractCoordinator#onJoinLeader() to perform the partition assignment; otherwise it calls AbstractCoordinator#onJoinFollower()

```java
private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> {
    private JoinGroupResponseHandler(final Generation generation) {
        super(generation);
    }

    @Override
    public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) {
        Errors error = joinResponse.error();
        if (error == Errors.NONE) {
            if (isProtocolTypeInconsistent(joinResponse.data().protocolType())) {
                log.error("JoinGroup failed: Inconsistent Protocol Type, received {} but expected {}",
                    joinResponse.data().protocolType(), protocolType());
                future.raise(Errors.INCONSISTENT_GROUP_PROTOCOL);
            } else {
                log.debug("Received successful JoinGroup response: {}", joinResponse);
                sensors.joinSensor.record(response.requestLatencyMs());

                synchronized (AbstractCoordinator.this) {
                    if (state != MemberState.PREPARING_REBALANCE) {
                        // if the consumer was woken up before a rebalance completes, we may have already left
                        // the group. In this case, we do not want to continue with the sync group.
                        future.raise(new UnjoinedGroupException());
                    } else {
                        state = MemberState.COMPLETING_REBALANCE;

                        // we only need to enable heartbeat thread whenever we transit to
                        // COMPLETING_REBALANCE state since we always transit from this state to STABLE
                        if (heartbeatThread != null)
                            heartbeatThread.enable();

                        AbstractCoordinator.this.generation = new Generation(
                            joinResponse.data().generationId(),
                            joinResponse.data().memberId(), joinResponse.data().protocolName());

                        log.info("Successfully joined group with generation {}", AbstractCoordinator.this.generation);

                        if (joinResponse.isLeader()) {
                            onJoinLeader(joinResponse).chain(future);
                        } else {
                            onJoinFollower().chain(future);
                        }
                    }
                }
            }
        } else if (error == Errors.COORDINATOR_LOAD_IN_PROGRESS) {
            log.info("JoinGroup failed: Coordinator {} is loading the group.", coordinator());
            // backoff and retry
            future.raise(error);
        } else if (error == Errors.UNKNOWN_MEMBER_ID) {
            log.info("JoinGroup failed: {} Need to re-join the group. Sent generation was {}",
                     error.message(), sentGeneration);
            // only need to reset the member id if generation has not been changed,
            // then retry immediately
            if (generationUnchanged())
                resetGenerationOnResponseError(ApiKeys.JOIN_GROUP, error);

            future.raise(error);
        } else if (error == Errors.COORDINATOR_NOT_AVAILABLE
                || error == Errors.NOT_COORDINATOR) {
            // re-discover the coordinator and retry with backoff
            markCoordinatorUnknown(error);
            log.info("JoinGroup failed: {} Marking coordinator unknown. Sent generation was {}",
                      error.message(), sentGeneration);
            future.raise(error);
        } else if (error == Errors.FENCED_INSTANCE_ID) {
            // for join-group request, even if the generation has changed we would not expect the instance id
            // gets fenced, and hence we always treat this as a fatal error
            log.error("JoinGroup failed: The group instance id {} has been fenced by another instance. " +
                          "Sent generation was {}", rebalanceConfig.groupInstanceId, sentGeneration);
            future.raise(error);
        } else if (error == Errors.INCONSISTENT_GROUP_PROTOCOL
                || error == Errors.INVALID_SESSION_TIMEOUT
                || error == Errors.INVALID_GROUP_ID
                || error == Errors.GROUP_AUTHORIZATION_FAILED
                || error == Errors.GROUP_MAX_SIZE_REACHED) {
            // log the error and re-throw the exception
            log.error("JoinGroup failed due to fatal error: {}", error.message());
            if (error == Errors.GROUP_MAX_SIZE_REACHED) {
                future.raise(new GroupMaxSizeReachedException("Consumer group " + rebalanceConfig.groupId +
                        " already has the configured maximum number of members."));
            } else if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
                future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId));
            } else {
                future.raise(error);
            }
        } else if (error == Errors.UNSUPPORTED_VERSION) {
            log.error("JoinGroup failed due to unsupported version error. Please unset field group.instance.id " +
                      "and retry to see if the problem resolves");
            future.raise(error);
        } else if (error == Errors.MEMBER_ID_REQUIRED) {
            // Broker requires a concrete member id to be allowed to join the group. Update member id
            // and send another join group request in next cycle.
            String memberId = joinResponse.data().memberId();
            log.debug("JoinGroup failed due to non-fatal error: {} Will set the member id as {} and then rejoin. " +
                          "Sent generation was {}", error, memberId, sentGeneration);
            synchronized (AbstractCoordinator.this) {
                AbstractCoordinator.this.generation = new Generation(OffsetCommitRequest.DEFAULT_GENERATION_ID, memberId, null);
            }
            requestRejoin("need to re-join with the given member-id");

            future.raise(error);
        } else if (error == Errors.REBALANCE_IN_PROGRESS) {
            log.info("JoinGroup failed due to non-fatal error: REBALANCE_IN_PROGRESS, " +
                "which could indicate a replication timeout on the broker. Will retry.");
            future.raise(error);
        } else {
            // unexpected error, throw the exception
            log.error("JoinGroup failed due to unexpected error: {}", error.message());
            future.raise(new KafkaException("Unexpected error in join group response: " + error.message()));
        }
    }
}
```
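- AbstractCoordinator#onJoinLeader() is fairly concise. Its key actions are:
  - Call ConsumerCoordinator#performAssignment() to assign the partitions
  - Build the asynchronous SyncGroup request and call AbstractCoordinator#sendSyncGroupRequest() to send the assignment to the coordinator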
```java
private RequestFuture<ByteBuffer> onJoinLeader(JoinGroupResponse joinResponse) {
    try {
        // perform the leader synchronization and send back the assignment for the group
        Map<String, ByteBuffer> groupAssignment = performAssignment(joinResponse.data().leader(), joinResponse.data().protocolName(),
                joinResponse.data().members());

        List<SyncGroupRequestData.SyncGroupRequestAssignment> groupAssignmentList = new ArrayList<>();
        for (Map.Entry<String, ByteBuffer> assignment : groupAssignment.entrySet()) {
            groupAssignmentList.add(new SyncGroupRequestData.SyncGroupRequestAssignment()
                    .setMemberId(assignment.getKey())
                    .setAssignment(Utils.toArray(assignment.getValue()))
            );
        }

        SyncGroupRequest.Builder requestBuilder =
                new SyncGroupRequest.Builder(
                        new SyncGroupRequestData()
                                .setGroupId(rebalanceConfig.groupId)
                                .setMemberId(generation.memberId)
                                .setProtocolType(protocolType())
                                .setProtocolName(generation.protocolName)
                                .setGroupInstanceId(this.rebalanceConfig.groupInstanceId.orElse(null))
                                .setGenerationId(generation.generationId)
                                .setAssignments(groupAssignmentList)
                );
        log.debug("Sending leader SyncGroup to coordinator {} at generation {}: {}", this.coordinator, this.generation, requestBuilder);
        return sendSyncGroupRequest(requestBuilder);
    } catch (RuntimeException e) {
        return RequestFuture.failure(e);
    }
}
```
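- The partition-assignment code in ConsumerCoordinator#performAssignment() is not complicated. In short, it has the following steps:
  - First call ConsumerCoordinator#lookupAssignor() to find the assignor for the selected assignment strategy
  - Call the assignor's ConsumerPartitionAssignor#assign() method, which combines the topics' partition information with the subscriptions of the group's members to compute the assignment. Interested readers can study this part on their own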
```java
@Override
protected Map<String, ByteBuffer> performAssignment(String leaderId,
                                                    String assignmentStrategy,
                                                    List<JoinGroupResponseData.JoinGroupResponseMember> allSubscriptions) {
    ConsumerPartitionAssignor assignor = lookupAssignor(assignmentStrategy);
    if (assignor == null)
        throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);
    String assignorName = assignor.name();

    Set<String> allSubscribedTopics = new HashSet<>();
    Map<String, Subscription> subscriptions = new HashMap<>();

    // collect all the owned partitions
    Map<String, List<TopicPartition>> ownedPartitions = new HashMap<>();

    for (JoinGroupResponseData.JoinGroupResponseMember memberSubscription : allSubscriptions) {
        Subscription subscription = ConsumerProtocol.deserializeSubscription(ByteBuffer.wrap(memberSubscription.metadata()));
        subscription.setGroupInstanceId(Optional.ofNullable(memberSubscription.groupInstanceId()));
        subscriptions.put(memberSubscription.memberId(), subscription);
        allSubscribedTopics.addAll(subscription.topics());
        ownedPartitions.put(memberSubscription.memberId(), subscription.ownedPartitions());
    }

    // the leader will begin watching for changes to any of the topics the group is interested in,
    // which ensures that all metadata changes will eventually be seen
    updateGroupSubscription(allSubscribedTopics);

    isLeader = true;

    log.debug("Performing assignment using strategy {} with subscriptions {}", assignorName, subscriptions);

    Map<String, Assignment> assignments = assignor.assign(metadata.fetch(), new GroupSubscription(subscriptions)).groupAssignment();

    // skip the validation for built-in cooperative sticky assignor since we've considered
    // the "generation" of ownedPartition inside the assignor
    if (protocol == RebalanceProtocol.COOPERATIVE && !assignorName.equals(COOPERATIVE_STICKY_ASSIGNOR_NAME)) {
        validateCooperativeAssignment(ownedPartitions, assignments);
    }

    maybeUpdateGroupSubscription(assignorName, assignments, allSubscribedTopics);

    assignmentSnapshot = metadataSnapshot;

    log.info("Finished assignment for group at generation {}: {}", generation().generationId, assignments);

    Map<String, ByteBuffer> groupAssignment = new HashMap<>();
    for (Map.Entry<String, Assignment> assignmentEntry : assignments.entrySet()) {
        ByteBuffer buffer = ConsumerProtocol.serializeAssignment(assignmentEntry.getValue());
        groupAssignment.put(assignmentEntry.getKey(), buffer);
    }

    return groupAssignment;
}
```
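The article leaves the assignor itself as an exercise. To give a rough feel for what "combine the partition list with the member list" means, here is a deliberately simplified round-robin sketch for a single topic. It is not the algorithm of any built-in Kafka assignor: `RoundRobinAssignSketch`, its `assign` signature, and the use of bare partition numbers are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-topic illustration of leader-side partition assignment.
class RoundRobinAssignSketch {

    // Deals partitions 0..partitionCount-1 to the members in sorted member-id order.
    static Map<String, List<Integer>> assign(List<String> memberIds, int partitionCount) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        if (memberIds.isEmpty()) {
            return assignment; // nobody to assign to
        }

        // Sorting makes the result independent of the order in which members are listed.
        List<String> sorted = new ArrayList<>(memberIds);
        Collections.sort(sorted);
        for (String memberId : sorted) {
            assignment.put(memberId, new ArrayList<>());
        }

        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = sorted.get(partition % sorted.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

For two members `c1` and `c2` and five partitions, `c1` receives partitions 0, 2, 4 and `c2` receives 1, 3. Each partition has exactly one owner within the group, which is the invariant the real assignors also have to maintain.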
- AbstractCoordinator#sendSyncGroupRequest() enqueues the asynchronous request. The core handling happens in the callback handler SyncGroupResponseHandler after the coordinator responds. As seen earlier, SyncGroupResponseHandler#handle() is triggered when the response arrives

```java
private RequestFuture<ByteBuffer> sendSyncGroupRequest(SyncGroupRequest.Builder requestBuilder) {
    if (coordinatorUnknown())
        return RequestFuture.coordinatorNotAvailable();
    return client.send(coordinator, requestBuilder)
            .compose(new SyncGroupResponseHandler(generation));
}
```
- The core of SyncGroupResponseHandler#handle() is to pass the partition assignment contained in the coordinator's response up to the caller through RequestFuture#complete(), which leads back to point 3 of step 2 in this section

```java
private class SyncGroupResponseHandler extends CoordinatorResponseHandler<SyncGroupResponse, ByteBuffer> {
    private SyncGroupResponseHandler(final Generation generation) {
        super(generation);
    }

    @Override
    public void handle(SyncGroupResponse syncResponse,
                       RequestFuture<ByteBuffer> future) {
        Errors error = syncResponse.error();
        if (error == Errors.NONE) {
            if (isProtocolTypeInconsistent(syncResponse.data().protocolType())) {
                log.error("SyncGroup failed due to inconsistent Protocol Type, received {} but expected {}",
                    syncResponse.data().protocolType(), protocolType());
                future.raise(Errors.INCONSISTENT_GROUP_PROTOCOL);
            } else {
                log.debug("Received successful SyncGroup response: {}", syncResponse);
                sensors.syncSensor.record(response.requestLatencyMs());

                synchronized (AbstractCoordinator.this) {
                    if (!generation.equals(Generation.NO_GENERATION) && state == MemberState.COMPLETING_REBALANCE) {
                        // check protocol name only if the generation is not reset
                        final String protocolName = syncResponse.data().protocolName();
                        final boolean protocolNameInconsistent = protocolName != null &&
                            !protocolName.equals(generation.protocolName);

                        if (protocolNameInconsistent) {
                            log.error("SyncGroup failed due to inconsistent Protocol Name, received {} but expected {}",
                                protocolName, generation.protocolName);

                            future.raise(Errors.INCONSISTENT_GROUP_PROTOCOL);
                        } else {
                            log.info("Successfully synced group in generation {}", generation);
                            state = MemberState.STABLE;
                            rejoinNeeded = false;
                            // record rebalance latency
                            lastRebalanceEndMs = time.milliseconds();
                            sensors.successfulRebalanceSensor.record(lastRebalanceEndMs - lastRebalanceStartMs);
                            lastRebalanceStartMs = -1L;

                            future.complete(ByteBuffer.wrap(syncResponse.data().assignment()));
                        }
                    } else {
                        log.info("Generation data was cleared by heartbeat thread to {} and state is now {} before " +
                            "receiving SyncGroup response, marking this rebalance as failed and retry",
                            generation, state);
                        // use ILLEGAL_GENERATION error code to let it retry immediately
                        future.raise(Errors.ILLEGAL_GENERATION);
                    }
                }
            }
        } else {
            if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
                future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId));
            } else if (error == Errors.REBALANCE_IN_PROGRESS) {
                log.info("SyncGroup failed: The group began another rebalance. Need to re-join the group. " +
                             "Sent generation was {}", sentGeneration);
                future.raise(error);
            } else if (error == Errors.FENCED_INSTANCE_ID) {
                // for sync-group request, even if the generation has changed we would not expect the instance id
                // gets fenced, and hence we always treat this as a fatal error
                log.error("SyncGroup failed: The group instance id {} has been fenced by another instance. " +
                    "Sent generation was {}", rebalanceConfig.groupInstanceId, sentGeneration);
                future.raise(error);
            } else if (error == Errors.UNKNOWN_MEMBER_ID
                    || error == Errors.ILLEGAL_GENERATION) {
                log.info("SyncGroup failed: {} Need to re-join the group. Sent generation was {}",
                        error.message(), sentGeneration);
                if (generationUnchanged())
                    resetGenerationOnResponseError(ApiKeys.SYNC_GROUP, error);

                future.raise(error);
            } else if (error == Errors.COORDINATOR_NOT_AVAILABLE
                    || error == Errors.NOT_COORDINATOR) {
                log.info("SyncGroup failed: {} Marking coordinator unknown. Sent generation was {}",
                         error.message(), sentGeneration);
                markCoordinatorUnknown(error);
                future.raise(error);
            } else {
                future.raise(new KafkaException("Unexpected error from SyncGroup: " + error.message()));
            }
        }
    }
}
```
- With the request handled, return to point 3 of step 2 in this section. The source of ConsumerCoordinator#onJoinComplete() is shown below. Its main job is to update the local subscription state with the partitions sent down by the coordinator, so that later fetches can go directly to the broker nodes hosting those partitions. This completes the flow in which the consumer joins the consumer group

```java
protected void onJoinComplete(int generation,
                              String memberId,
                              String assignmentStrategy,
                              ByteBuffer assignmentBuffer) {
    log.debug("Executing onJoinComplete with generation {} and memberId {}", generation, memberId);

    // Only the leader is responsible for monitoring for metadata changes (i.e. partition changes)
    if (!isLeader)
        assignmentSnapshot = null;

    ConsumerPartitionAssignor assignor = lookupAssignor(assignmentStrategy);
    if (assignor == null)
        throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);

    // Give the assignor a chance to update internal state based on the received assignment
    groupMetadata = new ConsumerGroupMetadata(rebalanceConfig.groupId, generation, memberId, rebalanceConfig.groupInstanceId);

    Set<TopicPartition> ownedPartitions = new HashSet<>(subscriptions.assignedPartitions());

    // should at least encode the short version
    if (assignmentBuffer.remaining() < 2)
        throw new IllegalStateException("There are insufficient bytes available to read assignment from the sync-group response (" +
            "actual byte size " + assignmentBuffer.remaining() + ") , this is not expected; " +
            "it is possible that the leader's assign function is buggy and did not return any assignment for this member, " +
            "or because static member is configured and the protocol is buggy hence did not get the assignment for this member");

    Assignment assignment = ConsumerProtocol.deserializeAssignment(assignmentBuffer);

    Set<TopicPartition> assignedPartitions = new HashSet<>(assignment.partitions());

    if (!subscriptions.checkAssignmentMatchedSubscription(assignedPartitions)) {
        final String reason = String.format("received assignment %s does not match the current subscription %s; " +
                "it is likely that the subscription has changed since we joined the group, will re-join with current subscription",
                assignment.partitions(), subscriptions.prettyString());
        requestRejoin(reason);

        return;
    }

    final AtomicReference<Exception> firstException = new AtomicReference<>(null);
    Set<TopicPartition> addedPartitions = new HashSet<>(assignedPartitions);
    addedPartitions.removeAll(ownedPartitions);

    if (protocol == RebalanceProtocol.COOPERATIVE) {
        Set<TopicPartition> revokedPartitions = new HashSet<>(ownedPartitions);
        revokedPartitions.removeAll(assignedPartitions);

        log.info("Updating assignment with\n" +
                "\tAssigned partitions: {}\n" +
                "\tCurrent owned partitions: {}\n" +
                "\tAdded partitions (assigned - owned): {}\n" +
                "\tRevoked partitions (owned - assigned): {}\n",
            assignedPartitions,
            ownedPartitions,
            addedPartitions,
            revokedPartitions
        );

        if (!revokedPartitions.isEmpty()) {
            // Revoke partitions that were previously owned but no longer assigned;
            // note that we should only change the assignment (or update the assignor's state)
            // AFTER we've triggered the revoke callback
            firstException.compareAndSet(null, invokePartitionsRevoked(revokedPartitions));

            // If revoked any partitions, need to re-join the group afterwards
            final String reason = String.format("need to revoke partitions %s as indicated " +
                    "by the current assignment and re-join", revokedPartitions);
            requestRejoin(reason);
        }
    }

    // The leader may have assigned partitions which match our subscription pattern, but which
    // were not explicitly requested, so we update the joined subscription here.
    maybeUpdateJoinedSubscription(assignedPartitions);

    // Catch any exception here to make sure we could complete the user callback.
    firstException.compareAndSet(null, invokeOnAssignment(assignor, assignment));

    // Reschedule the auto commit starting from now
    if (autoCommitEnabled)
        this.nextAutoCommitTimer.updateAndReset(autoCommitIntervalMs);

    subscriptions.assignFromSubscribed(assignedPartitions);

    // Add partitions that were not previously owned but are now assigned
    firstException.compareAndSet(null, invokePartitionsAssigned(addedPartitions));

    if (firstException.get() != null) {
        if (firstException.get() instanceof KafkaException) {
            throw (KafkaException) firstException.get();
        } else {
            throw new KafkaException("User rebalance callback throws an error", firstException.get());
        }
    }
}
```
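The two set differences in onJoinComplete(), "added = assigned - owned" and "revoked = owned - assigned", are easy to mix up. The sketch below replays them on plain strings. The class name and the partition labels such as `orders-0` are made up for illustration, and note that in the source the revoked set is only computed under the COOPERATIVE rebalance protocol.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper that mirrors the set arithmetic in onJoinComplete().
class AssignmentDiffSketch {

    // Returns the elements of left that are not in right, without modifying either input.
    private static Set<String> minus(Set<String> left, Set<String> right) {
        Set<String> result = new HashSet<>(left);
        result.removeAll(right);
        return result;
    }

    // Partitions newly given to this member: assigned - owned.
    static Set<String> added(Set<String> owned, Set<String> assigned) {
        return minus(assigned, owned);
    }

    // Partitions this member must give up: owned - assigned.
    static Set<String> revoked(Set<String> owned, Set<String> assigned) {
        return minus(owned, assigned);
    }
}
```

If a member owned `orders-0` and `orders-1` and is now assigned `orders-1` and `orders-2`, then `orders-2` is added and `orders-0` is revoked, while `orders-1` is untouched.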
2.2.2.3 How consumer heartbeats are handled
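- Step 1 of section 2.2.2.2 mentioned starting the consumer's heartbeat thread, which causes HeartbeatThread#run() to execute. The source below shows that the heartbeat thread does quite a lot:
  - If the connection to the group coordinator is lost, call AbstractCoordinator#lookupCoordinator() to try to find and reconnect to it
  - If the poll timeout expires, meaning the interval between calls to poll() exceeded max.poll.interval.ms, call AbstractCoordinator#maybeLeaveGroup() to send a LeaveGroup request to the coordinator and leave the consumer group
  - In the normal case, send the heartbeat request through AbstractCoordinator#sendHeartbeatRequest() and handle the result in its callback handler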
```java
public void run() {
    try {
        log.debug("Heartbeat thread started");
        while (true) {
            synchronized (AbstractCoordinator.this) {
                if (closed)
                    return;

                if (!enabled) {
                    AbstractCoordinator.this.wait();
                    continue;
                }

                // we do not need to heartbeat we are not part of a group yet;
                // also if we already have fatal error, the client will be
                // crashed soon, hence we do not need to continue heartbeating either
                if (state.hasNotJoinedGroup() || hasFailed()) {
                    disable();
                    continue;
                }

                client.pollNoWakeup();
                long now = time.milliseconds();

                if (coordinatorUnknown()) {
                    if (findCoordinatorFuture != null) {
                        // clear the future so that after the backoff, if the hb still sees coordinator unknown in
                        // the next iteration it will try to re-discover the coordinator in case the main thread cannot
                        clearFindCoordinatorFuture();

                        // backoff properly
                        AbstractCoordinator.this.wait(rebalanceConfig.retryBackoffMs);
                    } else {
                        lookupCoordinator();
                    }
                } else if (heartbeat.sessionTimeoutExpired(now)) {
                    // the session timeout has expired without seeing a successful heartbeat, so we should
                    // probably make sure the coordinator is still healthy.
                    markCoordinatorUnknown("session timed out without receiving a "
                            + "heartbeat response");
                } else if (heartbeat.pollTimeoutExpired(now)) {
                    // the poll timeout has expired, which means that the foreground thread has stalled
                    // in between calls to poll().
                    log.warn("consumer poll timeout has expired. This means the time between subsequent calls to poll() " +
                        "was longer than the configured max.poll.interval.ms, which typically implies that " +
                        "the poll loop is spending too much time processing messages. You can address this " +
                        "either by increasing max.poll.interval.ms or by reducing the maximum size of batches " +
                        "returned in poll() with max.poll.records.");

                    maybeLeaveGroup("consumer poll timeout has expired.");
                } else if (!heartbeat.shouldHeartbeat(now)) {
                    // poll again after waiting for the retry backoff in case the heartbeat failed or the
                    // coordinator disconnected
                    AbstractCoordinator.this.wait(rebalanceConfig.retryBackoffMs);
                } else {
                    heartbeat.sentHeartbeat(now);
                    final RequestFuture<Void> heartbeatFuture = sendHeartbeatRequest();
                    heartbeatFuture.addListener(new RequestFutureListener<Void>() {
                        @Override
                        public void onSuccess(Void value) {
                            synchronized (AbstractCoordinator.this) {
                                heartbeat.receiveHeartbeat();
                            }
                        }

                        @Override
                        public void onFailure(RuntimeException e) {
                            synchronized (AbstractCoordinator.this) {
                                if (e instanceof RebalanceInProgressException) {
                                    // it is valid to continue heartbeating while the group is rebalancing. This
                                    // ensures that the coordinator keeps the member in the group for as long
                                    // as the duration of the rebalance timeout. If we stop sending heartbeats,
                                    // however, then the session timeout may expire before we can rejoin.
                                    heartbeat.receiveHeartbeat();
                                } else if (e instanceof FencedInstanceIdException) {
                                    log.error("Caught fenced group.instance.id {} error in heartbeat thread", rebalanceConfig.groupInstanceId);
                                    heartbeatThread.failed.set(e);
                                } else {
                                    heartbeat.failHeartbeat();
                                    // wake up the thread if it's sleeping to reschedule the heartbeat
                                    AbstractCoordinator.this.notify();
                                }
                            }
                        }
                    });
                }
            }
        }
    } catch (AuthenticationException e) {
        log.error("An authentication error occurred in the heartbeat thread", e);
        this.failed.set(e);
    } catch (GroupAuthorizationException e) {
        log.error("A group authorization error occurred in the heartbeat thread", e);
        this.failed.set(e);
    } catch (InterruptedException | InterruptException e) {
        Thread.interrupted();
        log.error("Unexpected interrupt received in heartbeat thread", e);
        this.failed.set(new RuntimeException(e));
    } catch (Throwable e) {
        log.error("Heartbeat thread failed due to unexpected error", e);
        if (e instanceof RuntimeException)
            this.failed.set((RuntimeException) e);
        else
            this.failed.set(new RuntimeException(e));
    } finally {
        log.debug("Heartbeat thread has closed");
    }
}
```
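The order of the branches inside the loop above matters, because an earlier condition hides the later ones. The sketch below condenses that if/else chain into a pure function. It is a simplification under stated assumptions: the four booleans stand in for `coordinatorUnknown()`, `heartbeat.sessionTimeoutExpired(now)`, `heartbeat.pollTimeoutExpired(now)` and `heartbeat.shouldHeartbeat(now)`, the enum and class names are hypothetical, and the inner back-off branch taken when a find-coordinator request is already pending is folded into a single action.

```java
// Hypothetical condensation of the branch order in HeartbeatThread#run().
class HeartbeatDecisionSketch {

    enum Action {
        REDISCOVER_COORDINATOR,   // coordinator unknown: look it up (or back off) first
        MARK_COORDINATOR_UNKNOWN, // session timed out without a heartbeat response
        LEAVE_GROUP,              // poll() was not called within max.poll.interval.ms
        WAIT_BACKOFF,             // not yet time for the next heartbeat
        SEND_HEARTBEAT            // normal case
    }

    // Evaluates the conditions in the same order as the source; the first match wins.
    static Action decide(boolean coordinatorUnknown,
                         boolean sessionTimeoutExpired,
                         boolean pollTimeoutExpired,
                         boolean shouldHeartbeat) {
        if (coordinatorUnknown) {
            return Action.REDISCOVER_COORDINATOR;
        } else if (sessionTimeoutExpired) {
            return Action.MARK_COORDINATOR_UNKNOWN;
        } else if (pollTimeoutExpired) {
            return Action.LEAVE_GROUP;
        } else if (!shouldHeartbeat) {
            return Action.WAIT_BACKOFF;
        } else {
            return Action.SEND_HEARTBEAT;
        }
    }
}
```

For example, when both the session timeout and the poll timeout have expired, the thread marks the coordinator unknown first and only considers leaving the group on a later iteration.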
- AbstractCoordinator#sendHeartbeatRequest() registers HeartbeatResponseHandler as the handler for the heartbeat response, so HeartbeatResponseHandler#handle() runs when the coordinator answers the heartbeat request

```java
synchronized RequestFuture<Void> sendHeartbeatRequest() {
    log.debug("Sending Heartbeat request with generation {} and member id {} to coordinator {}",
        generation.generationId, generation.memberId, coordinator);
    HeartbeatRequest.Builder requestBuilder =
            new HeartbeatRequest.Builder(new HeartbeatRequestData()
                    .setGroupId(rebalanceConfig.groupId)
                    .setMemberId(this.generation.memberId)
                    .setGroupInstanceId(this.rebalanceConfig.groupInstanceId.orElse(null))
                    .setGenerationId(this.generation.generationId));
    return client.send(coordinator, requestBuilder)
            .compose(new HeartbeatResponseHandler(generation));
}
```
- HeartbeatResponseHandler#handle() handles the various error codes in the heartbeat response. For Errors.REBALANCE_IN_PROGRESS, when the member is in the STABLE state, it calls AbstractCoordinator#requestRejoin(), which sets the rejoinNeeded flag to true. The next time the consumer polls for messages it therefore re-enters the join-group flow, which completes the rebalance of the consumer group.

    public void handle(HeartbeatResponse heartbeatResponse, RequestFuture<Void> future) {
        sensors.heartbeatSensor.record(response.requestLatencyMs());
        Errors error = heartbeatResponse.error();

        if (error == Errors.NONE) {
            log.debug("Received successful Heartbeat response");
            future.complete(null);
        } else if (error == Errors.COORDINATOR_NOT_AVAILABLE
                || error == Errors.NOT_COORDINATOR) {
            log.info("Attempt to heartbeat failed since coordinator {} is either not started or not valid",
                    coordinator());
            markCoordinatorUnknown(error);
            future.raise(error);
        } else if (error == Errors.REBALANCE_IN_PROGRESS) {
            // since we may be sending the request during rebalance, we should check
            // this case and ignore the REBALANCE_IN_PROGRESS error
            synchronized (AbstractCoordinator.this) {
                if (state == MemberState.STABLE) {
                    requestRejoin("group is already rebalancing");
                    future.raise(error);
                } else {
                    log.debug("Ignoring heartbeat response with error {} during {} state", error, state);
                    future.complete(null);
                }
            }
        } else if (error == Errors.ILLEGAL_GENERATION
                || error == Errors.UNKNOWN_MEMBER_ID
                || error == Errors.FENCED_INSTANCE_ID) {
            if (generationUnchanged()) {
                log.info("Attempt to heartbeat with {} and group instance id {} failed due to {}, resetting generation",
                        sentGeneration, rebalanceConfig.groupInstanceId, error);
                resetGenerationOnResponseError(ApiKeys.HEARTBEAT, error);
                future.raise(error);
            } else {
                // if the generation has changed, then ignore this error
                log.info("Attempt to heartbeat with stale {} and group instance id {} failed due to {}, ignoring the error",
                        sentGeneration, rebalanceConfig.groupInstanceId, error);
                future.complete(null);
            }
        } else if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
            future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId));
        } else {
            future.raise(new KafkaException("Unexpected error in heartbeat response: " + error.message()));
        }
    }
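The effect of the rejoinNeeded flag can be illustrated with a small stand-alone sketch. It is not Kafka code: the class HeartbeatSketch, its ErrorCode and MemberState enums, and the poll() method are hypothetical simplifications that only mirror the idea that a REBALANCE_IN_PROGRESS heartbeat response marks the member for rejoin, and that the rejoin itself happens on the next poll.

```java
/** Hypothetical, heavily simplified model of the heartbeat handling described above. */
final class HeartbeatSketch {
    enum ErrorCode { NONE, REBALANCE_IN_PROGRESS, UNKNOWN_MEMBER_ID }
    enum MemberState { UNJOINED, PREPARING_REBALANCE, STABLE }

    private MemberState state = MemberState.STABLE;
    private boolean rejoinNeeded = false;
    private int generation = 1;

    /** Called from the (simulated) heartbeat thread. Returns true if the heartbeat counts as successful. */
    synchronized boolean onHeartbeatResponse(ErrorCode error) {
        switch (error) {
            case NONE:
                return true;
            case REBALANCE_IN_PROGRESS:
                if (state == MemberState.STABLE) {
                    rejoinNeeded = true;   // only flag the member; the rejoin happens in poll()
                    return false;
                }
                return true;               // already rejoining, so the error is ignored
            default:
                // Assumption of this sketch: every other error means the membership is lost.
                state = MemberState.UNJOINED;
                rejoinNeeded = true;
                return false;
        }
    }

    /** Called from the (simulated) application thread. Returns true if a rejoin was performed. */
    synchronized boolean poll() {
        if (!rejoinNeeded) {
            return false;
        }
        state = MemberState.PREPARING_REBALANCE;
        generation++;                      // stands in for the JoinGroup/SyncGroup round trip
        state = MemberState.STABLE;
        rejoinNeeded = false;
        return true;
    }

    synchronized boolean rejoinNeeded() { return rejoinNeeded; }
    synchronized int generation() { return generation; }
}
```

The split between the two methods reflects the design described above: the heartbeat thread never rejoins the group itself, it only records that a rejoin is required and leaves the work to the thread that calls poll.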
2.2.3 Fetching and consuming messages
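- After the steps above, the consumer knows which topic partitions it is responsible for. Back at the second sub-step of step 2 in section 2.2.1, it calls KafkaConsumer#pollForFetches() to pull messages. The core of this method is:
  - First call Fetcher#fetchedRecords() to read the records already buffered in the queue; if the result is not empty, return it immediately
  - Otherwise call Fetcher#sendFetches() to create new asynchronous Fetch requests
  - Call ConsumerNetworkClient#poll() to send the requests; the low-level responses are passed up to the upper layer through callback handlers. This flow was analyzed earlier and is not repeated here
  - Call Fetcher#fetchedRecords() again and return the buffered records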
    private Map<TopicPartition, List<ConsumerRecord<K, V>>> pollForFetches(Timer timer) {
        long pollTimeout = coordinator == null ? timer.remainingMs() :
                Math.min(coordinator.timeToNextPoll(timer.currentTimeMs()), timer.remainingMs());

        // if data is available already, return it immediately
        final Map<TopicPartition, List<ConsumerRecord<K, V>>> records = fetcher.fetchedRecords();
        if (!records.isEmpty()) {
            return records;
        }

        // send any new fetches (won't resend pending fetches)
        fetcher.sendFetches();

        // We do not want to be stuck blocking in poll if we are missing some positions
        // since the offset lookup may be backing off after a failure

        // NOTE: the use of cachedSubscriptionHashAllFetchPositions means we MUST call
        // updateAssignmentMetadataIfNeeded before this method.
        if (!cachedSubscriptionHashAllFetchPositions && pollTimeout > retryBackoffMs) {
            pollTimeout = retryBackoffMs;
        }

        log.trace("Polling for fetches with timeout {}", pollTimeout);

        Timer pollTimer = time.timer(pollTimeout);
        client.poll(pollTimer, () -> {
            // since a fetch might be completed by the background thread, we need this poll condition
            // to ensure that we do not block unnecessarily in poll()
            return !fetcher.hasAvailableFetches();
        });
        timer.update(pollTimer.currentTimeMs());

        return fetcher.fetchedRecords();
    }
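The four steps of pollForFetches() can be sketched without any Kafka dependency. Everything in the block below is a hypothetical stand-in: PollFlowSketch, addBrokerBatch() and networkPoll() are invented for illustration, records are plain strings, and the broker is just an in-memory queue. The sketch only shows the ordering: return buffered data first, otherwise send a fetch, drive the network, then read the buffer again.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of the pollForFetches() control flow; not the Kafka implementation. */
final class PollFlowSketch {
    private final Queue<List<String>> broker = new ArrayDeque<>();    // data waiting on the "broker"
    private final Queue<List<String>> inFlight = new ArrayDeque<>();  // fetches sent, not yet answered
    private final Queue<String> completedFetches = new ArrayDeque<>(); // buffered records
    int fetchesSent = 0;

    void addBrokerBatch(String... records) {
        broker.add(Arrays.asList(records));
    }

    /** Step 1 and step 4: drain whatever is buffered. */
    List<String> fetchedRecords() {
        List<String> out = new ArrayList<>(completedFetches);
        completedFetches.clear();
        return out;
    }

    /** Step 2: issue a new fetch unless one is already pending. */
    void sendFetches() {
        if (inFlight.isEmpty() && !broker.isEmpty()) {
            inFlight.add(broker.poll());
            fetchesSent++;
        }
    }

    /** Step 3: "network I/O"; the response callback moves data into the buffer. */
    void networkPoll() {
        List<String> response = inFlight.poll();
        if (response != null) {
            completedFetches.addAll(response);
        }
    }

    List<String> pollForFetches() {
        List<String> records = fetchedRecords();
        if (!records.isEmpty()) {
            return records;          // data already buffered: no new request is sent
        }
        sendFetches();
        networkPoll();
        return fetchedRecords();
    }
}
```

If another caller has already completed a fetch (in Kafka this can happen on the heartbeat thread), the first branch returns the buffered records and fetchesSent does not change for that call.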
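- Fetcher#sendFetches() works in two main steps:
  - Call Fetcher#prepareFetchRequests() to select, based on the partitions assigned to this consumer, the Kafka nodes that requests must be sent to
  - Iterate over the resulting map and call ConsumerNetworkClient#send() to enqueue the Fetch request for each target node, registering a response callback that buffers the data returned by the broker in a queue via completedFetches.add()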
    public synchronized int sendFetches() {
        // Update metrics in case there was an assignment change
        sensors.maybeUpdateAssignment(subscriptions);

        Map<Node, FetchSessionHandler.FetchRequestData> fetchRequestMap = prepareFetchRequests();
        for (Map.Entry<Node, FetchSessionHandler.FetchRequestData> entry : fetchRequestMap.entrySet()) {
            final Node fetchTarget = entry.getKey();
            final FetchSessionHandler.FetchRequestData data = entry.getValue();
            final FetchRequest.Builder request = FetchRequest.Builder
                    .forConsumer(this.maxWaitMs, this.minBytes, data.toSend())
                    .isolationLevel(isolationLevel)
                    .setMaxBytes(this.maxBytes)
                    .metadata(data.metadata())
                    .toForget(data.toForget())
                    .rackId(clientRackId);

            if (log.isDebugEnabled()) {
                log.debug("Sending {} {} to broker {}", isolationLevel, data.toString(), fetchTarget);
            }
            RequestFuture<ClientResponse> future = client.send(fetchTarget, request);
            // We add the node to the set of nodes with pending fetch requests before adding the
            // listener because the future may have been fulfilled on another thread (e.g. during a
            // disconnection being handled by the heartbeat thread) which will mean the listener
            // will be invoked synchronously.
            this.nodesWithPendingFetchRequests.add(entry.getKey().id());
            future.addListener(new RequestFutureListener<ClientResponse>() {
                @Override
                public void onSuccess(ClientResponse resp) {
                    synchronized (Fetcher.this) {
                        try {
                            FetchResponse response = (FetchResponse) resp.responseBody();
                            FetchSessionHandler handler = sessionHandler(fetchTarget.id());
                            if (handler == null) {
                                log.error("Unable to find FetchSessionHandler for node {}. Ignoring fetch response.",
                                        fetchTarget.id());
                                return;
                            }
                            if (!handler.handleResponse(response)) {
                                return;
                            }

                            Set<TopicPartition> partitions = new HashSet<>(response.responseData().keySet());
                            FetchResponseMetricAggregator metricAggregator =
                                    new FetchResponseMetricAggregator(sensors, partitions);

                            for (Map.Entry<TopicPartition, FetchResponseData.PartitionData> entry :
                                    response.responseData().entrySet()) {
                                TopicPartition partition = entry.getKey();
                                FetchRequest.PartitionData requestData = data.sessionPartitions().get(partition);
                                if (requestData == null) {
                                    String message;
                                    if (data.metadata().isFull()) {
                                        message = MessageFormatter.arrayFormat(
                                                "Response for missing full request partition: partition={}; metadata={}",
                                                new Object[]{partition, data.metadata()}).getMessage();
                                    } else {
                                        message = MessageFormatter.arrayFormat(
                                                "Response for missing session request partition: partition={}; metadata={}; toSend={}; toForget={}",
                                                new Object[]{partition, data.metadata(), data.toSend(), data.toForget()}).getMessage();
                                    }

                                    // Received fetch response for missing session partition
                                    throw new IllegalStateException(message);
                                } else {
                                    long fetchOffset = requestData.fetchOffset;
                                    FetchResponseData.PartitionData partitionData = entry.getValue();

                                    log.debug("Fetch {} at offset {} for partition {} returned fetch data {}",
                                            isolationLevel, fetchOffset, partition, partitionData);

                                    Iterator<? extends RecordBatch> batches =
                                            FetchResponse.recordsOrFail(partitionData).batches().iterator();
                                    short responseVersion = resp.requestHeader().apiVersion();

                                    completedFetches.add(new CompletedFetch(partition, partitionData,
                                            metricAggregator, batches, fetchOffset, responseVersion));
                                }
                            }

                            sensors.fetchLatency.record(resp.requestLatencyMs());
                        } finally {
                            nodesWithPendingFetchRequests.remove(fetchTarget.id());
                        }
                    }
                }

                @Override
                public void onFailure(RuntimeException e) {
                    synchronized (Fetcher.this) {
                        try {
                            FetchSessionHandler handler = sessionHandler(fetchTarget.id());
                            if (handler != null) {
                                handler.handleError(e);
                            }
                        } finally {
                            nodesWithPendingFetchRequests.remove(fetchTarget.id());
                        }
                    }
                }
            });
        }
        return fetchRequestMap.size();
    }
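The callback pattern used by sendFetches() can be reduced to a few lines. The sketch below is an assumption-laden illustration, not the Kafka API: java.util.concurrent.CompletableFuture stands in for Kafka's RequestFuture, node ids are plain integers, responses are strings, and the class name SendFetchesSketch is made up. It keeps two details from the original: the node is marked as pending before the listener is attached, and the pending mark is cleared in a finally block on both success and failure.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical reduction of the sendFetches() callback wiring; not the Kafka implementation. */
final class SendFetchesSketch {
    final Queue<String> completedFetches = new ConcurrentLinkedQueue<>();
    final Set<Integer> nodesWithPendingFetchRequests = new HashSet<>();

    /** requests maps a node id to the future that an already-issued send would return. */
    synchronized int sendFetches(Map<Integer, CompletableFuture<String>> requests) {
        for (Map.Entry<Integer, CompletableFuture<String>> entry : requests.entrySet()) {
            final Integer nodeId = entry.getKey();
            // Mark the node as pending BEFORE attaching the listener: if the future is
            // already complete, the listener runs synchronously and must find the mark.
            nodesWithPendingFetchRequests.add(nodeId);
            entry.getValue().whenComplete((response, error) -> {
                synchronized (SendFetchesSketch.this) {
                    try {
                        if (error == null) {
                            completedFetches.add(response);   // buffer data for fetchedRecords()
                        }
                    } finally {
                        nodesWithPendingFetchRequests.remove(nodeId);
                    }
                }
            });
        }
        return requests.size();
    }

    synchronized boolean isPending(int nodeId) {
        return nodesWithPendingFetchRequests.contains(nodeId);
    }
}
```

Completing one future normally and another exceptionally leaves exactly one entry in completedFetches and no pending nodes, so a later round is free to send new requests to both nodes.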
- Fetcher#prepareFetchRequests() combines the partitions assigned to this consumer in section 2.2.2 with the cluster metadata to determine the target nodes the consumer should send requests to.

    private Map<Node, FetchSessionHandler.FetchRequestData> prepareFetchRequests() {
        Map<Node, FetchSessionHandler.Builder> fetchable = new LinkedHashMap<>();

        validatePositionsOnMetadataChange();

        long currentTimeMs = time.milliseconds();

        for (TopicPartition partition : fetchablePartitions()) {
            FetchPosition position = this.subscriptions.position(partition);
            if (position == null) {
                throw new IllegalStateException("Missing position for fetchable partition " + partition);
            }

            Optional<Node> leaderOpt = position.currentLeader.leader;
            if (!leaderOpt.isPresent()) {
                log.debug("Requesting metadata update for partition {} since the position {} is missing the current leader node",
                        partition, position);
                metadata.requestUpdate();
                continue;
            }

            // Use the preferred read replica if set, otherwise the position's leader
            Node node = selectReadReplica(partition, leaderOpt.get(), currentTimeMs);
            if (client.isUnavailable(node)) {
                client.maybeThrowAuthFailure(node);

                // If we try to send during the reconnect backoff window, then the request is just
                // going to be failed anyway before being sent, so skip the send for now
                log.trace("Skipping fetch for partition {} because node {} is awaiting reconnect backoff",
                        partition, node);
            } else if (this.nodesWithPendingFetchRequests.contains(node.id())) {
                log.trace("Skipping fetch for partition {} because previous request to {} has not been processed",
                        partition, node);
            } else {
                // if there is a leader and no in-flight requests, issue a new fetch
                FetchSessionHandler.Builder builder = fetchable.get(node);
                if (builder == null) {
                    int id = node.id();
                    FetchSessionHandler handler = sessionHandler(id);
                    if (handler == null) {
                        handler = new FetchSessionHandler(logContext, id);
                        sessionHandlers.put(id, handler);
                    }
                    builder = handler.newBuilder();
                    fetchable.put(node, builder);
                }

                builder.add(partition, new FetchRequest.PartitionData(position.offset,
                        FetchRequest.INVALID_LOG_START_OFFSET, this.fetchSize,
                        position.currentLeader.epoch, Optional.empty()));

                log.debug("Added {} fetch request for partition {} at position {} to node {}",
                        isolationLevel, partition, position, node);
            }
        }

        Map<Node, FetchSessionHandler.FetchRequestData> reqs = new LinkedHashMap<>();
        for (Map.Entry<Node, FetchSessionHandler.Builder> entry : fetchable.entrySet()) {
            reqs.put(entry.getKey(), entry.getValue().build());
        }
        return reqs;
    }
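The grouping performed by prepareFetchRequests() is where the client-side load balancing becomes visible: partitions are bucketed by the node that should serve them, so each node receives a single request covering all of its partitions. The following sketch shows only that bucketing. PrepareFetchSketch, its prepare() method, and the use of strings and integers for partitions and node ids are assumptions made for illustration; fetch sessions, read replicas and reconnect backoff are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical reduction of the grouping step in prepareFetchRequests(); not the Kafka implementation. */
final class PrepareFetchSketch {
    boolean metadataUpdateRequested = false;

    /**
     * leaderByPartition maps a partition name to its leader node id, or to null when the leader is unknown.
     * pendingNodes holds nodes that still have an unanswered fetch.
     * Returns node id -> partitions to include in that node's next fetch request.
     */
    Map<Integer, List<String>> prepare(Map<String, Integer> leaderByPartition, Set<Integer> pendingNodes) {
        Map<Integer, List<String>> fetchable = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : leaderByPartition.entrySet()) {
            Integer leader = entry.getValue();
            if (leader == null) {
                metadataUpdateRequested = true;   // unknown leader: refresh metadata, skip for now
                continue;
            }
            if (pendingNodes.contains(leader)) {
                continue;                         // previous request to this node not processed yet
            }
            fetchable.computeIfAbsent(leader, id -> new ArrayList<>()).add(entry.getKey());
        }
        return fetchable;
    }
}
```

With partitions t-0 and t-2 led by node 1, t-1 led by node 2, t-3 without a known leader and t-4 led by a node that still has a pending request, the result contains two entries: node 1 with t-0 and t-2, and node 2 with t-1.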
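- Back at sub-step 4 of step 1 in this section, the core logic of Fetcher#fetchedRecords() is as follows. With this, the core flow of the Kafka consumer is essentially complete:
  - First call Fetcher#initializeCompletedFetch() to initialize the data returned by the broker as a CompletedFetch object and to update the topic partition offset information cached locally
  - Then call Fetcher#fetchRecords() to parse the CompletedFetch object into the ConsumerRecord objects handed to the application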
    public Map<TopicPartition, List<ConsumerRecord<K, V>>> fetchedRecords() {
        Map<TopicPartition, List<ConsumerRecord<K, V>>> fetched = new HashMap<>();
        Queue<CompletedFetch> pausedCompletedFetches = new ArrayDeque<>();
        int recordsRemaining = maxPollRecords;

        try {
            while (recordsRemaining > 0) {
                if (nextInLineFetch == null || nextInLineFetch.isConsumed) {
                    CompletedFetch records = completedFetches.peek();
                    if (records == null) break;

                    if (records.notInitialized()) {
                        try {
                            nextInLineFetch = initializeCompletedFetch(records);
                        } catch (Exception e) {
                            // Remove a completedFetch upon a parse with exception if (1) it contains no records, and
                            // (2) there are no fetched records with actual content preceding this exception.
                            // The first condition ensures that the completedFetches is not stuck with the same completedFetch
                            // in cases such as the TopicAuthorizationException, and the second condition ensures that no
                            // potential data loss due to an exception in a following record.
                            FetchResponseData.PartitionData partition = records.partitionData;
                            if (fetched.isEmpty() && FetchResponse.recordsOrFail(partition).sizeInBytes() == 0) {
                                completedFetches.poll();
                            }
                            throw e;
                        }
                    } else {
                        nextInLineFetch = records;
                    }
                    completedFetches.poll();
                } else if (subscriptions.isPaused(nextInLineFetch.partition)) {
                    // when the partition is paused we add the records back to the completedFetches queue instead of draining
                    // them so that they can be returned on a subsequent poll if the partition is resumed at that time
                    log.debug("Skipping fetching records for assigned partition {} because it is paused",
                            nextInLineFetch.partition);
                    pausedCompletedFetches.add(nextInLineFetch);
                    nextInLineFetch = null;
                } else {
                    List<ConsumerRecord<K, V>> records = fetchRecords(nextInLineFetch, recordsRemaining);

                    if (!records.isEmpty()) {
                        TopicPartition partition = nextInLineFetch.partition;
                        List<ConsumerRecord<K, V>> currentRecords = fetched.get(partition);
                        if (currentRecords == null) {
                            fetched.put(partition, records);
                        } else {
                            // this case shouldn't usually happen because we only send one fetch at a time per partition,
                            // but it might conceivably happen in some rare cases (such as partition leader changes).
                            // we have to copy to a new list because the old one may be immutable
                            List<ConsumerRecord<K, V>> newRecords =
                                    new ArrayList<>(records.size() + currentRecords.size());
                            newRecords.addAll(currentRecords);
                            newRecords.addAll(records);
                            fetched.put(partition, newRecords);
                        }
                        recordsRemaining -= records.size();
                    }
                }
            }
        } catch (KafkaException e) {
            if (fetched.isEmpty())
                throw e;
        } finally {
            // add any polled completed fetches for paused partitions back to the completed fetches queue to be
            // re-evaluated in the next poll
            completedFetches.addAll(pausedCompletedFetches);
        }

        return fetched;
    }
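The draining loop in fetchedRecords() is bounded by maxPollRecords, and a fetch that is only partly consumed stays in nextInLineFetch for the next call. The sketch below isolates that behaviour under simplifying assumptions: FetchedRecordsSketch and addCompletedFetch() are hypothetical, records are strings, each completed fetch is a plain deque, and paused partitions, deserialization and error handling are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Queue;

/** Hypothetical reduction of the fetchedRecords() draining loop; not the Kafka implementation. */
final class FetchedRecordsSketch {
    private final Queue<Deque<String>> completedFetches = new ArrayDeque<>();
    private Deque<String> nextInLineFetch = null;
    private final int maxPollRecords;

    FetchedRecordsSketch(int maxPollRecords) {
        this.maxPollRecords = maxPollRecords;
    }

    /** Stands in for the response callback that buffers one completed fetch. */
    void addCompletedFetch(String... records) {
        completedFetches.add(new ArrayDeque<>(Arrays.asList(records)));
    }

    List<String> fetchedRecords() {
        List<String> fetched = new ArrayList<>();
        int recordsRemaining = maxPollRecords;
        while (recordsRemaining > 0) {
            if (nextInLineFetch == null || nextInLineFetch.isEmpty()) {
                // current fetch is used up: move on to the next buffered one
                nextInLineFetch = completedFetches.poll();
                if (nextInLineFetch == null) {
                    break;
                }
            } else {
                // take at most recordsRemaining records from the current fetch
                while (recordsRemaining > 0 && !nextInLineFetch.isEmpty()) {
                    fetched.add(nextInLineFetch.poll());
                    recordsRemaining--;
                }
            }
        }
        return fetched;
    }
}
```

With a limit of 3 and two buffered fetches holding a, b and c, d, e, the first call returns a, b, c and the second returns d, e. This is the mechanism behind the max.poll.records setting mentioned in the heartbeat thread's log message: a smaller limit means each poll returns sooner, without changing what is fetched from the broker.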