Kafka由浅入深（5）RecordAccumulator的工作原理-CSDN博客

本文链接：https://blog.csdn.net/dreamcatcher1314/article/details/127524994

一、`RecordAccumulator的作用`

RecordAccumlator是记录收集器，收集记录到MemoryRecords 的队列中。RecordAccumlator主要功能作用是缓存消息以便Sender线程可以批量发送，从而减少网络传输的资源消耗，提升性能。

二、RecordAccumulator内存模型

RecordAccumlator缓存的默认大小事32M，也可以通过生产者客户端参数“buffer.memroy”配置修改缓存大小。如果收集器RecordAccumlator的内存耗尽，调用append()追加新的消息将被阻塞，除非显示的设置消除限制。

RecordAccumlator的内部还有一个BytePool缓冲池，主要实现对ByteBuffer的复用，但只是针对特定大小的ByteBuffer进行管理。ByteBuffer的特定大小，是通过参数“batch.size”指定，默认值为16K。

基于如下的源码，我们可以看到RecordAccumulator的内存模型接口，已经如何定义追加消息的队列

public class RecordAccumulator {
		...
    // topicInfoMap是主题下的消息集合，key是topic，value是TopicInfo信息
		private final ConcurrentMap<String /*topic*/, TopicInfo> topicInfoMap = new CopyOnWriteMap<>();

   /**
     * 每一个主题的存储消息
     */
    private static class TopicInfo {
        // batches ConcurrentMap集合中，key 是partition分区号，value是队列 Deque<ProducerBatch> 
        public final ConcurrentMap<Integer /*partition*/, Deque<ProducerBatch>> batches = new CopyOnWriteMap<>();
        public final BuiltInPartitioner builtInPartitioner;

        public TopicInfo(LogContext logContext, String topic, int stickyBatchSize) {
            builtInPartitioner = new BuiltInPartitioner(logContext, topic, stickyBatchSize);
        }
    }
   ...
}

public final class ProducerBatch {
		...
		private final MemoryRecordsBuilder recordsBuilder;

    /**
     * Append the record to the current record set and return the relative offset within that record set
     * 将记录追加到当前记录集中，并返回在记录集中的相对偏移量
     *
     * @return The RecordSend corresponding to this record or null if there isn't sufficient room.
     */
    public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
        // 如果没有空间容纳，则返回null
        if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
            return null;
        } else {
           //在下一个顺序偏移的位置追加新记录
            this.recordsBuilder.append(timestamp, key, value, headers);
            this.maxRecordSize = Math.max(this.maxRecordSize, AbstractRecords.estimateSizeInBytesUpperBound(magic(),
                    recordsBuilder.compressionType(), key, value, headers));
            this.lastAppendTime = now;
            FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, this.recordCount,
                                                                   timestamp,
                                                                   key == null ? -1 : key.length,
                                                                   value == null ? -1 : value.length,
                                                                   Time.SYSTEM);
            // we have to keep every future returned to the users in case the batch needs to be
            // split to several new batches and resent.
            thunks.add(new Thunk(callback, future));
            this.recordCount++;
            return future;
        }
    }
}

public class MemoryRecordsBuilder implements AutoCloseable {

    // Used to hold a reference to the underlying ByteBuffer so that we can write the record batch header and access
    // the written bytes. ByteBufferOutputStream allocates a new ByteBuffer if the existing one is not large enough,
    // so it's not safe to hold a direct reference to the underlying ByteBuffer.
    private final ByteBufferOutputStream bufferStream;
	

    // Used to append records, may compress data on the fly
    private DataOutputStream appendStream;
}

三、RecordAccumlator内部执行流程

3.1 消息append()流程图

3.2 RecordAccumulator类append() 源码解析

public class RecordAccumulator {
		/**
     * 新增一条消息记录到accumulator，返回RecordAppendResult消息追加结果。
     *  追加的结果将包含未来的元数据，并标记追加的批次是已经满的还是创建了一个新批
     */
    public RecordAppendResult append(String topic,
                                     int partition,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Header[] headers,
                                     AppendCallbacks callbacks,
                                     long maxTimeToBlock,
                                     boolean abortOnNewBatch,
                                     long nowMs,
                                     Cluster cluster) throws InterruptedException {
        // 获取当前主题的TopicInfo主题信息
        TopicInfo topicInfo = topicInfoMap.computeIfAbsent(topic, k -> new TopicInfo(logContext, k, batchSize));

        // 记录追加记录线程的数量，确保在调用abortIncompleteBatches时不会丢失这些批数据
        appendsInProgress.incrementAndGet();
        ByteBuffer buffer = null;
        if (headers == null) headers = Record.EMPTY_HEADERS;
        try {
            // 循环重试，以防遇到分区器的竞争条件
            while (true) {
                //如果前面没有获取到分区号，那么我们将根据broker的可用性和性能计算出一个分区。
                // 注意： 当前的计算的分区号，会保留在阻塞队列中，以防止并发修改的出现
                final BuiltInPartitioner.StickyPartitionInfo partitionInfo;
                final int effectivePartition;
                if (partition == RecordMetadata.UNKNOWN_PARTITION) {
                    // 通过粘性策略进行分区分配-通过负载量进行分配
                    partitionInfo = topicInfo.builtInPartitioner.peekCurrentPartitionInfo(cluster);
                    effectivePartition = partitionInfo.partition();
                } else {
                    partitionInfo = null;
                    effectivePartition = partition;
                }

                // 获取到高效的分区号后，通过AppendCallback 进行回调返回实际的分区号
                setPartition(callbacks, effectivePartition);

                // 先根据分区获取该消息所属的队列中， 如果有已经存在的队列，那么我们就使用存在队列； 如果队列不存在，那么我们新创建一个队列
                Deque<ProducerBatch> dq = topicInfo.batches.computeIfAbsent(effectivePartition, k -> new ArrayDeque<>());
                synchronized (dq) {
                    //获取锁后，验证分区是否没有改变;如果分区发生改变，则进行重试
                    if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
                        continue;
                    // 尝试向队列里面的批次中添加数据
                    // * 这里需要说明一下：目前我们只是有了队列，数据是存储在批次中（批次对象是需要分配内存的）
                    // * 目前还没有分配到内存，所以添加数据会失败
                    RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callbacks, dq, nowMs);
                    if (appendResult != null) {
                        // 检查队列中的所有批次是否都满了
                        boolean enableSwitch = allBatchesFull(dq);
                        // 可能会更新分区操作
                        topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
                        return appendResult;
                    }
                }

                //没有正在进行（in-progress）的记录批次，试图分配一个新批次
                if (abortOnNewBatch) {
                    // Return a result that will cause another call to append.
                    return new RecordAppendResult(null, false, false, true, 0);
                }

                if (buffer == null) {
                    byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
                    int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
                    log.trace("Allocating a new {} byte message buffer for topic {} partition {} with remaining timeout {}ms", size, topic, partition, maxTimeToBlock);
                    
                    // 如果我们耗尽了缓冲区空间，此调用可能会阻塞
                    buffer = free.allocate(size, maxTimeToBlock);
                    
                    //更新当前时间，以防上述缓冲区分配被阻止。
                    //注意：获取时间可能很昂贵，因此应避免在锁下调用它。
                    nowMs = time.milliseconds();
                }

                synchronized (dq) {
                    //获取锁后，如果验证分区发生更改，就进行重试。
                    if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
                        continue;
                    // 添加新创建的ProducerBatch到队列中
                    RecordAppendResult appendResult = appendNewBatch(topic, effectivePartition, dq, timestamp, key, value, headers, callbacks, buffer, nowMs);

                    // 将缓冲区设置为null，因为数据在批处理中使用，所以讲ByteBuffer释放掉内存
                    if (appendResult.newBatchCreated)
                        buffer = null;

                    // 检查当前队列中的所有批次是否都满了,若队列中的批次已经满了，则需要进行切换分区（请参阅updatePartitionInfo中的注释）
                    boolean enableSwitch = allBatchesFull(dq);
                    topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
                    return appendResult;
                }
            }
        } finally {
            free.deallocate(buffer);
            appendsInProgress.decrementAndGet();
        }
    }
}

RecordAccumulator类append()的主体逻辑：

1、通过当前主题获取主题消息信息（TopicInfo）

2、获取分区号

a、如果partition参数分区号（大于等于0），则将分区号设置为有效的分区effectivePartition；

b、如果partition参数值为 RecordMetadata.UNKNOWN_PARTITION（-1），需要通过分区负载压力进行粘性分区，计算到分区号并复制给effectivePartition

3、基于有效分区号effectivePartition，从topicInfo.batches中获取到ProduceBatch队列Deque<ProducerBatch> ，如果不存在Deque<ProducerBatch> 队列，则进行新建队列

4、调用tryAppend()尝试想队列中添加数据，如果返回RecordAppendResult，则数据添加成功并，append()方法返回并结束；如果tryAppend()方法中队列已满或ProducerBatch没有容纳，则返回null,则需要进行后续新建的处理逻辑

5、给ByteBuffer分配空间

6、校验分区是否发生改变（当分区队列已满的时候，可能会触发重新分配分区），如果分区发生改变，则重新到2~4进行处理；如果分区没有发生改变，在新创建ProducerBatch，并追加到Deque<ProducerBatch> 队列中

3.3 RecordAccumulator类tryAppend() 源码解析

/**
 *  追加消息到ProducerBatch中
 * 
 *  如果ProducerBatch队列已满，将返回null并新创建一个batch。
 *  关闭记录追加的批处理，以释放压缩缓冲区等资源。
 *   在以下情况之一(以先发生的为准)，批处理将被完全关闭(记录批处理头将被写入并构建内存记录): 如果消息过期了，或者生产者关闭了。
 */
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                     Callback callback, Deque<ProducerBatch> deque, long nowMs) {

    // 消息追加将关闭，如果消息过期，或者生产者关闭
    if (closed)
        throw new KafkaException("Producer closed while send in progress");
    // 从Deque<ProducerBatch>队列中获取到last ProducerBatch
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        //获取写入底层缓冲区的字节数的估计值。如果记录集没有被压缩，或者构建器已经关闭，则返回值是完全正确的。
        int initialBytes = last.estimatedSizeInBytes();
        //将消息记录追加到当前ProducerBatch中，并返回在ProducerBatch的相对偏移量
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null) {
            // 没有空间容纳，则关闭追加记录
            last.closeForRecordAppends();
        } else {
            // 将追加记录进行返回
            int appendedBytes = last.estimatedSizeInBytes() - initialBytes;
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false, appendedBytes);
        }
    }
    return null;
}

1、如果消息过期了，或者生产者关闭，则抛出KafkaException异常

2、从Deque<ProducerBatch>队列中获取到last ProducerBatch，如果ProducerBatch 为空则返回null

3、如果ProducerBatch不为空，则将通过调用当前ProducerBatch的tryAppend()方法，将记录追加到当前ProducerBatch中，并返回在ProducerBatch的相对偏移量

3.4 ProducerBatch类的tryAppend()方法

public final class ProducerBatch {
		//MemoryRecordsBuilder是在内存中写入新的日志数据
		private final MemoryRecordsBuilder recordsBuilder;
		/**
     * 将记录追加到当前记录集中，并返回在记录集中的相对偏移量
     */
    public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
        // 如果没有空间容纳，则返回null
        if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
            return null;
        } else {
            // 通过MemoryRecordsBuilder计算下一个顺序的相对偏移处追加新记录
            this.recordsBuilder.append(timestamp, key, value, headers);
            this.maxRecordSize = Math.max(this.maxRecordSize, AbstractRecords.estimateSizeInBytesUpperBound(magic(),
                    recordsBuilder.compressionType(), key, value, headers));
            this.lastAppendTime = now;
						// 封装返回基本的元数据
            FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, this.recordCount,
                                                                   timestamp,
                                                                   key == null ? -1 : key.length,
                                                                   value == null ? -1 : value.length,
                                                                   Time.SYSTEM);
            // we have to keep every future returned to the users in case the batch needs to be
            // split to several new batches and resent.
            thunks.add(new Thunk(callback, future));
            this.recordCount++;
            return future;
        }
    }
}

1、校验MemoryRecordsBuilder 是否有容纳空间，如果没有空间则直接返回null

2、如果MemoryRecordsBuilder有空间，则通过通过MemoryRecordsBuilder计算下一个顺序的相对偏移处追加新记录

3、封装返回基本的元数据到FutureRecordMetadata中并返回

3.5 MemoryRecordsBuilder类的tryAppend()方法

MemoryRecordsBuilder的作用是在内存中写入新的消息日志数据。append()对外的重载方法，最终调用的是appendWithOffset()方法，而在appendWithOffset()方法中，判断message格式将消息写入到数据流中

public class MemoryRecordsBuilder implements AutoCloseable {

    public void append(long timestamp, byte[] key, byte[] value, Header[] headers) {
        // 将key 和value转换为ByteBuffer
		append(timestamp, wrapNullable(key), wrapNullable(value), headers);
    }

	public void append(long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
        appendWithOffset(nextSequentialOffset(), timestamp, key, value, headers);
    }

	public void appendWithOffset(long offset, long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
        appendWithOffset(offset, false, timestamp, key, value, headers);
    }

	/**
     * Append a new record at the given offset.
     */
    private void appendWithOffset(long offset, boolean isControlRecord, long timestamp, ByteBuffer key,
                                  ByteBuffer value, Header[] headers) {
        try {
						// 校验数据的合法性
            if (isControlRecord != isControlBatch)
                throw new IllegalArgumentException("Control records can only be appended to control batches");

            if (lastOffset != null && offset <= lastOffset)
                throw new IllegalArgumentException(String.format("Illegal offset %s following previous offset %s " +
                        "(Offsets must increase monotonically).", offset, lastOffset));

            if (timestamp < 0 && timestamp != RecordBatch.NO_TIMESTAMP)
                throw new IllegalArgumentException("Invalid negative timestamp " + timestamp);

            if (magic < RecordBatch.MAGIC_VALUE_V2 && headers != null && headers.length > 0)
                throw new IllegalArgumentException("Magic v" + magic + " does not support record headers");

            if (baseTimestamp == null)
                baseTimestamp = timestamp;

			// 消息格式类型 V0,V1,V2
            if (magic > RecordBatch.MAGIC_VALUE_V1) {
                // 使用V2 的消息处理格式
                appendDefaultRecord(offset, timestamp, key, value, headers);
            } else {
                // 使用V0、V1 的消息处理格式
                appendLegacyRecord(offset, timestamp, key, value, magic);
            }
        } catch (IOException e) {
            throw new KafkaException("I/O exception when writing to the append stream, closing", e);
        }
    }
}

四、Kafka 的Message格式类型

kafka 目前的 message 的格式，也成magic魔术有三个版本

V0：kafka0.10 版本之前

V1：kafka 0.10 ~ 0.11 版本

V2：kafka 0.11.0 之后的版本

4.1 V0、V1格式版本

V1 版本与 V0 版本的格式基本类似，V1版本新增了 timestamp 字段

V1 版本 message 中的 timestamp 类型由 attributes 中的第4个比特位标识决定

timestamp 类型分为CreateTime 和 LogAppendTime 两种类型：

CreateTime：timestamp 字段中记录的是 producer 生产这条 message 的时间戳 LogAppendTime：timestamp 字段中记录的是 broker 将该 message 写入 segment 文件的时间戳。

字段说明：

offset：Record的消息的偏移量

message size：消息体大小

magic: 魔数类型，也就是message格式类型 V0或V1

crc32: crc校验码

timestamp: V1版本新增了 timestamp 时间戳字段

attributes: 属性标记压缩类型和时间戳类型

key length: key的长度,使用可变字段, 如果没有key,该值为-1。

key: 消息的key

value length：消息体的长度

value：消息体的信息

4.2 V2格式版本

从 kafka 0.11 版本推出之后，开始使用目前主流message格式 V2 版本。Kafak同时也兼容 V0、V1 版本的 message格式。但是如果继续使用旧版本V0和V1的 message格式， kafka 中的一些新特性将无法生效，例如事务、幂等等功能将不起作用。

V2 版本的 message 格式参考了 Protocol Buffer 的一些特性，引入了Varints（变长整型）和 ZigZag 编码。其中，Varints 是使用一个或多个字节来序列化整数的一种方法，数值越小，占用的字节数就越少，可以减少 message 的消息容量，减少数据传输。ZigZag 编码是为了解决 Varints 对负数编码效率低的问题，ZigZag 会将有符号整数映射为无符号整数，从而提高 Varints 对绝对值较小的负数的编码效率除了基础的 Record 格式之外，V2 版本中还定义了一个 Record Batch 的结构。与V1 版本格式相比较，Record 是内层结构，Record Batch 是外层结构

RecordBatch字段说明：

1、 first offset：表示当前RecordBatch的起始偏移量。
2、length：计算从`partition leader epoch`到末尾的长度。
3、 partition leader epoch：分区leader纪元，可以看做是分区leader的版本号或者更新次数。
4、magic：消息格式版本号，V2版本是2。
5、crc32：crc32校验值。
6、attributes：消息属性，这里占用2个字节。低三位表示压缩格式，第4位表示时间戳类型，第5位表示此RecordBatch是否在事务中，第6位表示是否为控制消息。
7、last offset delta：RecordBatch中最后一个Record的offset与first offset的差值。主要用于broker确保RecordBatch中Recoord组装的正确性。
8、first timestamp：RecordBatch中第一条Record的时间戳。
9、 max timestamp：RecordBatch中最大的时间戳。一般情况是最后一条Record的时间戳。
10、producer id：PID，用来支持事务和幂等。暂不解释。
11、producer epoch：用来支持事务和幂等。暂不解释。
12、first sequeue：用来支持事务和幂等。暂不解释。
13、records count：RecordBatch中record的个数。
14、records：消息记录集合

Record字段说明：

1、length：消息总长度

2、attributes：弃用。这里仍然占用了1B大小，供未来扩展。

3、timestamp delta：时间戳增量。

4、offset delta：偏移量增量。保存与RecordBatch起始偏移量的差值。

5、key length：消息key长度。

6、key value：消息key的值。

7、value length：消息体的长度。

8、value：消息体的值。

9、headers：消息头。用来支持应用级别的扩展。

Header字段说明：