4. Sender#run source code analysis

The previous section walked through how the RecordAccumulator collects messages. Here is a quick recap:

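Messages sent by the producer are first cached on the client in RecordAccumulator.batches, and at the right moment the Sender thread writes them to the Kafka cluster in batches. Every time the producer produces a message, it is appended to batches. The return value of the append method tells us whether the batch is full: if it is, the batch is ready to be sent; if it is not, the batch keeps waiting until enough messages have been collected (or until one of the other send conditions covered in section 2.1 is met).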

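To append a message, we first get the queue that belongs to the partition and then take the last batch in that queue. If the queue holds no batch, or the last batch is already full, a new batch (ProducerBatch) is created and added to the tail of the queue (an ArrayDeque). Think of each batch as one element of the queue: batches created earlier are filled first with older messages, batches created later hold the most recent messages, and an append always goes to the most recent batch.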

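Steps:

  1. If the queue holds no batch, go to step 5. Deque<ProducerBatch> dq = getOrCreateDeque(tp);
  2. If an earlier batch exists, try to append the current record to it and check whether the append succeeded. FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
  3. If the append succeeded, the existing batch had room for this message; return the result.
  4. If the append failed, an earlier batch exists but cannot hold this message; go to step 5.
  5. Create a new batch and add the current message to it. A new batch can always hold the current message.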
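
The steps above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the Kafka source: AppendSketch, Batch and batchCount are hypothetical names, records are plain byte arrays, and a partition is identified by a string. The return value mimics the "batch is full" signal described in the recap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AppendSketch {
    /** Hypothetical stand-in for ProducerBatch: a fixed-capacity container of records. */
    static final class Batch {
        private final int capacityBytes;
        private int usedBytes = 0;
        private int recordCount = 0;

        Batch(int capacityBytes) { this.capacityBytes = capacityBytes; }

        /** Returns true if the record fit and was appended; an empty batch accepts any record. */
        boolean tryAppend(byte[] value) {
            if (recordCount > 0 && usedBytes + value.length > capacityBytes) return false;
            usedBytes += value.length;
            recordCount++;
            return true;
        }

        boolean isFull() { return usedBytes >= capacityBytes; }
    }

    private final Map<String, Deque<Batch>> batches = new ConcurrentHashMap<>();
    private final int batchSize;

    AppendSketch(int batchSize) { this.batchSize = batchSize; }

    /** Appends one record; returns true when a batch of this partition is full and could be sent. */
    boolean append(String partition, byte[] value) {
        // Step 1: get (or create) the queue of the partition.
        Deque<Batch> dq = batches.computeIfAbsent(partition, p -> new ArrayDeque<>());
        synchronized (dq) {
            // Steps 2-4: try the most recent batch first.
            Batch last = dq.peekLast();
            if (last == null || !last.tryAppend(value)) {
                // Step 5: a fresh batch always has room for the current record.
                last = new Batch(Math.max(batchSize, value.length));
                last.tryAppend(value);
                dq.addLast(last);
            }
            // More than one batch in the queue means the older one is full.
            return dq.size() > 1 || last.isFull();
        }
    }

    int batchCount(String partition) {
        Deque<Batch> dq = batches.get(partition);
        return dq == null ? 0 : dq.size();
    }

    public static void main(String[] args) {
        AppendSketch acc = new AppendSketch(10);
        System.out.println(acc.append("t-0", new byte[4])); // prints false
        System.out.println(acc.append("t-0", new byte[4])); // prints false
        System.out.println(acc.append("t-0", new byte[4])); // prints true
    }
}
```

In the usage above the third record does not fit into the first 10-byte batch, so a second batch is created and the method reports that something is ready to send.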

That completes the recap. This time we look at how the Sender takes batches out of the queues and sends them to the cluster.

1. Fetching data from the RecordAccumulator

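Messages sent by the producer are first stored on the client in the RecordAccumulator, and when the Sender thread needs to send messages it simply takes them from there. The RecordAccumulator does more than buffer messages, though. To let the Sender thread work efficiently, appended messages are organized by partition, and the sender is woken up when there is something to send. Let's look at the Sender.run method: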

public void run() {
        // main loop, runs until close is called
        while (running) {
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }
        // okay we stopped accepting requests but there may still be
        // requests in the transaction manager, accumulator or waiting for acknowledgment,
        // wait until these are completed.
        while (!forceClose && ((this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0) || hasPendingTransactionalRequests())) {
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        // Abort the transaction if any commit or abort didn't go through the transaction manager's queue
        while (!forceClose && transactionManager != null && transactionManager.hasOngoingTransaction()) {
            if (!transactionManager.isCompleting()) {
                log.info("Aborting incomplete transaction due to shutdown");
                transactionManager.beginAbort();
            }
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }

        if (forceClose) {
            // We need to fail all the incomplete transactional requests and batches and wake up the threads waiting on
            // the futures.
            if (transactionManager != null) {
                transactionManager.close();
            }
            log.debug("Aborting incomplete batches due to forced shutdown");
            this.accumulator.abortIncompleteBatches();
        }
        try {
            this.client.close();
        } catch (Exception e) {
            log.error("Failed to close network client", e);
        }
        log.debug("Shutdown of Kafka producer I/O thread has completed.");
    }

1.1 The runOnce method

The main loop relies on runOnce() to do the sending, and inside runOnce() it is sendProducerData() that sends the data.

 void runOnce() {
        if (transactionManager != null) {
            try {
                transactionManager.maybeResolveSequences();

                // do not continue sending if the transaction manager is in a failed state
                if (transactionManager.hasFatalError()) {
                    RuntimeException lastError = transactionManager.lastError();
                    if (lastError != null)
                        maybeAbortBatches(lastError);
                    // poll the network client (this also drives metadata updates)
                    client.poll(retryBackoffMs, time.milliseconds());
                    return;
                }

                // Check whether we need a new producerId. If so, we will enqueue an InitProducerId
                // request which will be sent below
                transactionManager.bumpIdempotentEpochAndResetIdIfNeeded();

                if (maybeSendAndPollTransactionalRequest()) {
                    return;
                }
            } catch (AuthenticationException e) {
                // This is already logged as error, but propagated here to perform any clean ups.
                log.trace("Authentication exception while processing transactional request", e);
                transactionManager.authenticationFailed(e);
            }
        }

        long currentTimeMs = time.milliseconds();
        long pollTimeout = sendProducerData(currentTimeMs);
        client.poll(pollTimeout, currentTimeMs);
    }

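Previously, metadata such as node information was obtained in a scenario-driven way; newer versions go through runOnce(), which calls sendProducerData():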

private long sendProducerData(long now) {
    // 1. Work out which nodes we need to, and are able to, send requests to
    Cluster cluster = metadata.fetch();
    // get the list of partitions with data ready to send
    // Work out which nodes need to receive requests
    // A partition is ready to send when: full || expired || exhausted || closed || flushInProgress()
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    // if there are any partitions whose leaders are not known yet, force metadata update
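    // 2. If the node of some leader replica is unknown (the topic partition is going through leader election, or the topic has expired),
    // mark the cached cluster metadata as needing an update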
    if (!result.unknownLeaderTopics.isEmpty()) {
        // The set of topics with unknown leader contains topics with leader election pending as well as
        // topics which may have expired. Add the topic again to metadata to ensure it is included
        // and request metadata update, since there are messages to send to the topic.
        // No partition info was found for a pending message, so the leader node info has to be fetched from the broker
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic, now);

        log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
            result.unknownLeaderTopics);
        this.metadata.requestUpdate();
    }

    // remove any nodes we aren't ready to send to
    // 3. Iterate over the target nodes and check at the network I/O level whether each one is usable; drop the ones that are not.
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        // Check whether the target node is ready to receive requests; if it is not ready but a connection may be created, initiate one
        if (!this.client.ready(node, now)) {
            // Remove nodes that are not ready from the ready set
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
        }
    }

    // create produce requests
    // 4. Collect the batches to send for each node; the key is the ID of the node hosting the target leader replica
    Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
    addToInflightBatches(batches);
    if (guaranteeMessageOrder) {
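        // 5. If message ordering must be guaranteed, record the drained topic partitions so that no partition has more than one unacknowledged batch in flight
        // Mute all the partitions drained
        // Add the topic partition of every drained ProducerBatch to the muted set,
        // so that several unacknowledged batches are never sent to the same partition at the same time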
        for (List<ProducerBatch> batchList : batches.values()) {
            for (ProducerBatch batch : batchList)
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }

    // 6. Handle batches that expired locally: fail them with a TimeoutException and release their buffer space
    accumulator.resetNextBatchExpiryTime();
    List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
    List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
    expiredBatches.addAll(expiredInflightBatches);

    // Reset the producer id if an expired batch has previously been sent to the broker. Also update the metrics
    // for expired batches. see the documentation of @TransactionState.resetIdempotentProducerId to understand why
    // we need to reset the producer id here.
    if (!expiredBatches.isEmpty())
        log.trace("Expired {} batches in accumulator", expiredBatches.size());
    for (ProducerBatch expiredBatch : expiredBatches) {
        String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
            + ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
        failBatch(expiredBatch, -1, NO_TIMESTAMP, new TimeoutException(errorMessage), false);
        if (transactionManager != null && expiredBatch.inRetry()) {
            // This ensures that no new batches are drained until the current in flight batches are fully resolved.
            transactionManager.markSequenceUnresolved(expiredBatch);
        }
    }
    sensors.updateProduceRequestMetrics(batches);

    // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
    // loop and try sending more data. Otherwise, the timeout will be the smaller value between next batch expiry
    // time, and the delay time for checking data availability. Note that the nodes may have data that isn't yet
    // sendable due to lingering, backing off, etc. This specifically does not include nodes with sendable data
    // that aren't ready to send since they would cause busy looping.
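    // If some node is ready and has data to send, pollTimeout is set to 0 so the loop can send again immediately, which shortens the time the remaining messages stay buffered and avoids a backlog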
    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
    pollTimeout = Math.max(pollTimeout, 0);
    if (!result.readyNodes.isEmpty()) {
        log.trace("Nodes with data ready to send: {}", result.readyNodes);
        // if some partitions are already ready to be sent, the select time would be 0;
        // otherwise if some partition already has some data accumulated but not ready yet,
        // the select time will be the time difference between now and its linger expiry time;
        // otherwise the select time will be the time difference between now and the metadata expiry time;
        pollTimeout = 0;
    }
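    // 7. Build the produce requests and hand them to the network client; the actual I/O and the response handling happen in client.poll()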
    sendProduceRequests(batches, now);
    return pollTimeout;
}
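
The timeout arithmetic at the end of sendProducerData is easy to misread, so here is a minimal sketch that isolates it. PollTimeoutSketch and its parameter names are made up for illustration; the logic simply restates the Math.min / Math.max chain from the listing, under the assumption that the inputs have already been computed.

```java
class PollTimeoutSketch {
    /**
     * Restates the pollTimeout computation from the listing:
     * 0 when any node is ready, otherwise the smallest of the three waits, never negative.
     */
    static long pollTimeout(boolean anyNodeReady,
                            long nextReadyCheckDelayMs,
                            long notReadyTimeoutMs,
                            long nextExpiryTimeMs,
                            long nowMs) {
        if (anyNodeReady) {
            return 0L; // loop again right away and try to send more data
        }
        long timeout = Math.min(nextReadyCheckDelayMs, notReadyTimeoutMs);
        timeout = Math.min(timeout, nextExpiryTimeMs - nowMs);
        return Math.max(timeout, 0L);
    }

    public static void main(String[] args) {
        // No node ready, next linger check in 50 ms, next batch expiry in 1000 ms.
        System.out.println(pollTimeout(false, 50L, Long.MAX_VALUE, 2000L, 1000L)); // prints 50
        // A batch has already expired, so the timeout is clamped to 0.
        System.out.println(pollTimeout(false, 500L, 300L, 900L, 1000L)); // prints 0
    }
}
```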

1.2 Overall flow chart

[Figure: overall flow chart; image not available]

2. Selected implementation details

2.1 Determining which batches are ready to send

Sender#sendProducerData calls this.accumulator.ready(cluster, now):

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    Set<String> unknownLeaderTopics = new HashSet<>();

    boolean exhausted = this.free.queued() > 0;
    // Iterate over the producer's batches cache and pick out the ProducerBatch entries that are ready
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
        Deque<ProducerBatch> deque = entry.getValue();

        synchronized (deque) {
            // When producing to a large number of partitions, this path is hot and deques are often empty.
            // We check whether a batch exists first to avoid the more expensive checks whenever possible.
            ProducerBatch batch = deque.peekFirst();
            if (batch != null) {
                TopicPartition part = entry.getKey();
                // Look up the partition leader in the producer's cached metadata; if it is missing, add the topic to unknownLeaderTopics so that a metadata request is sent to the broker later
                Node leader = cluster.leaderFor(part);
                if (leader == null) {
                    // This is a partition for which leader is not known, but messages are available to send.
                    // Note that entries are currently not removed from batches when deque is empty.
                    unknownLeaderTopics.add(part.topic());

                }
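                // If the leader is not in readyNodes yet, check whether the batch meets the send conditions (isMuted relates to ordering)
                else if (!readyNodes.contains(leader) && !isMuted(part)) {
                    // waitedTimeMs: how long this batch has waited, i.e. the current timestamp minus ProducerBatch.lastAttemptMs; lastAttemptMs is set to the current timestamp when the ProducerBatch is created or needs a retry
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    // If this is a retry and the wait is still shorter than retryBackoffMs, backingOff=true, meaning the batch is not ready yet
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    // While backing off the batch has to wait retryBackoffMs; otherwise it waits lingerMs
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    // full: deque.size() > 1 means at least one ProducerBatch in this deque is already full,
                    // batch.isFull() means this batch itself is full
                    boolean full = deque.size() > 1 || batch.isFull();
                    // expired=true means the batch has waited at least as long as required, so it is due to be sent
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    // exhausted=true means buffer memory is running out (threads are blocked waiting for buffer space while creating a new ProducerBatch), so buffered messages should be sent to the broker right away
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();

                    // A batch is sendable when any of the following holds:
                    // 1. the batch is full
                    // 2. it has waited for the required time
                    // 3. the producer's buffer space is exhausted, so data must be sent immediately
                    // 4. the producer has been closed and the buffered messages must be flushed out
                    // 5. the producer's flush method has been called
                    // Decide here whether the batch is ready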
                    if (sendable && !backingOff) {
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        // Note that this results in a conservative estimate since an un-sendable partition may have
                        // a leader that will later be found to have sendable data. However, this is good enough
                        // since we'll just wake up and then sleep again for the remaining time.
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                    }
                }
            }
        }
    }
    // Return the collected readyNodes
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
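
To make the interplay between linger, retry backoff and the "sendable" flag easier to experiment with, the decision for a single partition can be pulled out into a pure function. The sketch below is illustrative only: ReadyCheckSketch and Decision are hypothetical names, and the inputs that ready() reads from the batch and the accumulator are passed in as plain parameters.

```java
class ReadyCheckSketch {
    /** Outcome for one partition: ready now, or how long to wait before checking again. */
    static final class Decision {
        final boolean ready;
        final long timeLeftMs;

        Decision(boolean ready, long timeLeftMs) {
            this.ready = ready;
            this.timeLeftMs = timeLeftMs;
        }
    }

    static Decision check(int attempts, long waitedTimeMs, int dequeSize, boolean batchFull,
                          boolean exhausted, boolean closed, boolean flushInProgress,
                          long lingerMs, long retryBackoffMs) {
        // A retried batch must first sit out the retry backoff.
        boolean backingOff = attempts > 0 && waitedTimeMs < retryBackoffMs;
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
        boolean full = dequeSize > 1 || batchFull;
        boolean expired = waitedTimeMs >= timeToWaitMs;
        boolean sendable = full || expired || exhausted || closed || flushInProgress;
        if (sendable && !backingOff) {
            return new Decision(true, 0L);
        }
        return new Decision(false, Math.max(timeToWaitMs - waitedTimeMs, 0L));
    }

    public static void main(String[] args) {
        // Fresh batch, 3 ms old, linger.ms = 10: not ready, check again in 7 ms.
        Decision d = check(0, 3L, 1, false, false, false, false, 10L, 100L);
        System.out.println(d.ready + " " + d.timeLeftMs); // prints false 7
        // A full batch that is still backing off after a retry stays unsent.
        d = check(1, 20L, 2, true, false, false, false, 10L, 100L);
        System.out.println(d.ready + " " + d.timeLeftMs); // prints false 80
    }
}
```

The second call shows the point that is easy to miss in the listing: backingOff overrides every other condition, including a full batch.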

2.2 When a node is not ready: this.client.ready(node, now)

// 3. Iterate over the target nodes and check at the network I/O level whether each one is usable; drop the ones that are not.
        Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
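            // Check whether the target node is ready to receive requests; if it is not ready but a connection may be created, initiate one
            // The readiness checks can be summarized as follows:
            // 1. if host is "" or port < 0, the node address is unusable
            // 2. if a metadata update is due, the node counts as not ready and the metadata must be refreshed first
            // 3. the producer has an established TCP connection to the broker (node state is READY && the selector's channel is ready)
            // 4. the number of requests sent but not yet answered is below max.in.flight.requests.per.connection (default 5)
            // 5. if mechanisms such as SSL or SASL are enabled, the handshake/authentication has completed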
            
            if (!this.client.ready(node, now)) {
                // Remove nodes that are not ready from the ready set
                iter.remove();
                // this.client.pollDelayMs estimates how long the node will stay not ready, which avoids pointless network polling
                notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
            }
        }

NetworkClient#ready, together with the isReady and canSendRequest checks it relies on:

@Override
    public boolean ready(Node node, long now) {
        if (node.isEmpty())
            throw new IllegalArgumentException("Cannot connect to empty node " + node);

        if (isReady(node, now))
            return true;

        if (connectionStates.canConnect(node.idString(), now))
            // if we are interested in sending to a node and we don't have a connection to it, initiate one
            initiateConnect(node, now);

        return false;
    }
@Override
    public boolean isReady(Node node, long now) {
        // if we need to update our metadata now declare all requests unready to make metadata requests first
        // priority
        // Ready only if (1) no metadata update is due and (2) the TCP connection to the broker is established and the channel is ready
        return !metadataUpdater.isUpdateDue(now) && canSendRequest(node.idString(), now);
    }
private boolean canSendRequest(String node, long now) {
        return connectionStates.isReady(node, now) && selector.isChannelReady(node) &&
            // By default the producer may have up to 5 requests in flight before a response arrives; canSendMore checks whether that limit is reached
            inFlightRequests.canSendMore(node);
    }
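
The in-flight limit behind canSendMore can be illustrated with a small per-node queue. This is a sketch under assumptions, not the NetworkClient implementation: InFlightSketch, readyToSend, sent and completed are hypothetical names, requests are represented by string IDs, and the extra check Kafka performs on whether the most recent request has been fully written to the socket is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class InFlightSketch {
    private final int maxInFlightPerConnection;
    // Per node: newest request at the head, oldest at the tail.
    private final Map<String, Deque<String>> inFlight = new HashMap<>();

    InFlightSketch(int maxInFlightPerConnection) {
        this.maxInFlightPerConnection = maxInFlightPerConnection;
    }

    /** True while the node has fewer unanswered requests than the configured limit. */
    boolean canSendMore(String nodeId) {
        Deque<String> queue = inFlight.get(nodeId);
        return queue == null || queue.size() < maxInFlightPerConnection;
    }

    /** Combines the conditions from the listing: metadata, connection, channel, in-flight limit. */
    boolean readyToSend(String nodeId, boolean metadataUpdateDue,
                        boolean connectionReady, boolean channelReady) {
        return !metadataUpdateDue && connectionReady && channelReady && canSendMore(nodeId);
    }

    /** Records a request that was sent and not yet answered. */
    void sent(String nodeId, String requestId) {
        inFlight.computeIfAbsent(nodeId, n -> new ArrayDeque<>()).addFirst(requestId);
    }

    /** Completes the oldest unanswered request of the node, or returns null if there is none. */
    String completed(String nodeId) {
        Deque<String> queue = inFlight.get(nodeId);
        return queue == null ? null : queue.pollLast();
    }

    public static void main(String[] args) {
        InFlightSketch client = new InFlightSketch(2);
        client.sent("1", "r1");
        client.sent("1", "r2");
        System.out.println(client.canSendMore("1")); // prints false
        System.out.println(client.completed("1"));   // prints r1
        System.out.println(client.canSendMore("1")); // prints true
    }
}
```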

2.3 Draining the batches bound for the same broker, based on readyNodes: this.accumulator.drain()

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
        if (nodes.isEmpty())
            return Collections.emptyMap();

        Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
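        // Iterate over the nodes and, for each one, collect the data in RecordAccumulator#batches bound for that node so it can be sent together (fewer network requests); maxSize is the maximum request size (configured via max.request.size; raising it can improve throughput when there is a lot of data)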
        for (Node node : nodes) {
            List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
            batches.put(node.id(), ready);
        }
        return batches;
    }

RecordAccumulator#drainBatchesForOneNode() gathers the batches that are bound for the same node:

   private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
        int size = 0;
        // Use the broker ID to get all partitions whose leader lives on that broker
        List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
        List<ProducerBatch> ready = new ArrayList<>();
        /* to make starvation less likely this loop doesn't start at 0 */
        // start: the partition index at which this traversal begins
        // drainIndex: the index of the queue to drain next
        int start = drainIndex = drainIndex % parts.size();
        do {
            // Loop over the partitions and drain their data from the buffer (getDeque)
            PartitionInfo part = parts.get(drainIndex);
            TopicPartition tp = new TopicPartition(part.topic(), part.partition());
            this.drainIndex = (this.drainIndex + 1) % parts.size();

            // Only proceed if the partition has no in-flight batches.
            if (isMuted(tp))
                continue;

            // Why can a freshly created tp be used to look up the queue in the batches buffer?
            // Because TopicPartition overrides equals and hashCode, so the map lookup finds the entry
            Deque<ProducerBatch> deque = getDeque(tp);
            if (deque == null)
                continue;

            synchronized (deque) {
                // invariant: !isMuted(tp,now) && deque != null
                ProducerBatch first = deque.peekFirst();
                if (first == null)
                    continue;

                // first != null
                // Skip the batch if it is a retry and has waited less than the retry backoff (retry.backoff.ms)
                boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;
                // Only drain the batch if it is not during backoff period.
                if (backoff)
                    continue;
                // Stop draining if size plus the batch about to be drained would exceed maxSize
                if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                    // there is a rare case that a single batch size is larger than the request size due to
                    // compression; in this case we will still eventually send this batch in a single request
                    break;
                } else {
                    if (shouldStopDrainBatchesForPartition(first, tp))
                        break;

                    boolean isTransactional = transactionManager != null && transactionManager.isTransactional();
                    ProducerIdAndEpoch producerIdAndEpoch =
                        transactionManager != null ? transactionManager.producerIdAndEpoch() : null;
                    ProducerBatch batch = deque.pollFirst();
                    if (producerIdAndEpoch != null && !batch.hasSequence()) {
//                         If the batch already has an assigned sequence, then we should not change the producer id and
//                         sequence number, since this may introduce duplicates. In particular, the previous attempt
//                         may actually have been accepted, and if we change the producer id and sequence here, this
//                         attempt will also be accepted, causing a duplicate.
//
//                         Additionally, we update the next sequence number bound for the partition, and also have
//                         the transaction manager track the batch so as to ensure that sequence ordering is maintained
//                         even if we receive out of order responses.
                        batch.setProducerState(producerIdAndEpoch, transactionManager.sequenceNumber(batch.topicPartition), isTransactional);
                        transactionManager.incrementSequenceNumber(batch.topicPartition, batch.recordCount);
                        log.debug("Assigned producerId {} and producerEpoch {} to batch with base sequence " +
                                "{} being sent to partition {}", producerIdAndEpoch.producerId,
                            producerIdAndEpoch.epoch, batch.baseSequence(), tp);

                        transactionManager.addInFlightBatch(batch);
                    }
                    batch.close();
                    // The batch is closed (above) so no more records can be appended to it, then it is added to the ready list
                    size += batch.records().sizeInBytes();
                    ready.add(batch);

                    batch.drained(now);
                }
            }
        } while (start != drainIndex);
        return ready;
    }
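
The comment "to make starvation less likely this loop doesn't start at 0" is worth a closer look. The sketch below keeps only the rotating index and the size cap from the listing and drops everything else (muting, backoff, transactions). DrainSketch and partitionsOf are hypothetical names, a batch is reduced to its size in bytes, and the method returns the indexes of the partitions it drained from.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DrainSketch {
    // Survives across calls, so each drain resumes where the previous one stopped.
    private int drainIndex = 0;

    /** Drains at most one batch per partition until maxSize is reached. */
    List<Integer> drain(List<Deque<Integer>> partitions, int maxSize) {
        List<Integer> drained = new ArrayList<>();
        if (partitions.isEmpty()) {
            return drained;
        }
        int size = 0;
        int start = drainIndex = drainIndex % partitions.size();
        do {
            int idx = drainIndex;
            Deque<Integer> deque = partitions.get(idx);
            // Advance first, exactly as in the listing, so a break still moves the cursor.
            drainIndex = (drainIndex + 1) % partitions.size();
            Integer first = deque.peekFirst();
            if (first == null) {
                continue; // nothing buffered for this partition
            }
            if (size + first > maxSize && !drained.isEmpty()) {
                break; // request is full; an oversized single batch is still allowed through
            }
            size += deque.pollFirst();
            drained.add(idx);
        } while (start != drainIndex);
        return drained;
    }

    /** Helper for the demo: `count` partitions, each holding `batchesEach` batches of `batchSize` bytes. */
    static List<Deque<Integer>> partitionsOf(int count, int batchesEach, int batchSize) {
        List<Deque<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < count; p++) {
            Deque<Integer> deque = new ArrayDeque<>();
            for (int b = 0; b < batchesEach; b++) {
                deque.addLast(batchSize);
            }
            partitions.add(deque);
        }
        return partitions;
    }

    public static void main(String[] args) {
        DrainSketch sketch = new DrainSketch();
        List<Deque<Integer>> partitions = partitionsOf(3, 2, 6);
        System.out.println(sketch.drain(partitions, 6)); // prints [0]
        System.out.println(sketch.drain(partitions, 6)); // prints [2]
        System.out.println(sketch.drain(partitions, 6)); // prints [1]
    }
}
```

With room for only one batch per request, successive calls serve different partitions instead of always starting from partition 0. Because the index is advanced before the size check, the partition that triggered the break is skipped as the next starting point, which is why the order in this run is 0, 2, 1.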