3. Producer Design Analysis: RecordAccumulator
The previous article covered most of Kafka's high-level design. From here on we will gradually dig into the Kafka source code and look at the implementation details; if you spot any mistakes in the analysis, feel free to discuss them in the comments.
[Figure: class diagram]
[Figure: Kafka message sending flow]
We saw in the previous article that Kafka sends messages in either a synchronous or an asynchronous style. The key player on the sending path is the Sender thread, together with the shared RecordAccumulator: the Sender thread continuously pulls batches out of the accumulator and sends them to the brokers.
// In KafkaProducer
private final RecordAccumulator accumulator;
private final Sender sender;
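For reference, the producer starts the Sender on a dedicated I/O thread in the KafkaProducer constructor. The snippet below is paraphrased from that constructor (thread naming and field names may differ slightly across versions):

// The Sender is a Runnable; it loops, pulling ready batches out of the shared
// RecordAccumulator and sending them to the brokers over the network client.
String ioThreadName = "kafka-producer-network-thread" + " | " + clientId;
this.ioThread = new KafkaThread(ioThreadName, this.sender, true); // true = daemon thread
this.ioThread.start();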
RecordAccumulator overview:
It works like a buffering queue: messages are grouped by TopicPartition, each TopicPartition maps to its own deque, and a ProducerBatch represents one batch of messages. When the producer appends a message it always goes to the tail of the deque, while the Sender takes batches from the head, as shown in the flow chart above.
// The cache: a ConcurrentMap keyed by TopicPartition
private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
// Compression type: gzip/snappy/lz4/zstd
private final CompressionType compression;
// Buffer pool built on NIO ByteBuffers
private final BufferPool free;
// Batches that have not yet completed (sent but not acked, plus not yet sent); effectively a Set
private final IncompleteBatches incomplete;
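To see the consuming side of these deques, here is a deliberately simplified sketch (not the actual Kafka code) of what the Sender does: it takes batches from the head of each partition's deque, while append() adds to the tail. The real RecordAccumulator.ready()/drain() pair additionally groups batches by target broker and honors linger.ms and max.request.size.

// Simplified sketch only: drain at most one sendable batch per partition.
List<ProducerBatch> drainOneBatchPerPartition(Map<TopicPartition, Deque<ProducerBatch>> batches) {
    List<ProducerBatch> ready = new ArrayList<>();
    for (Deque<ProducerBatch> deque : batches.values()) {
        synchronized (deque) {                        // same lock used by append()
            ProducerBatch first = deque.peekFirst();  // the Sender reads from the HEAD
            if (first != null && first.isFull())      // (or the batch has lingered long enough)
                ready.add(deque.pollFirst());
        }
    }
    return ready;
}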
Let's start with the append() flow of RecordAccumulator:
public RecordAppendResult append(TopicPartition tp,
long timestamp,
byte[] key,
byte[] value,
Header[] headers,
Callback callback,
long maxTimeToBlock,
boolean abortOnNewBatch,
long nowMs) throws InterruptedException {
    // appendsInProgress tracks how many threads are currently appending, so that when the client calls KafkaProducer.close() and forcibly stops sending, the unfinished append requests can be abandoned and their resources released
appendsInProgress.incrementAndGet();
ByteBuffer buffer = null;
if (headers == null) headers = Record.EMPTY_HEADERS;
try {
        // Check whether a deque for this partition already exists in batches; reuse it if so, otherwise create one.
Deque<ProducerBatch> dq = getOrCreateDeque(tp);
        // Synchronize on the deque so the order in which messages are stored cannot change
synchronized (dq) {
            // Try to append the message: peek the ProducerBatch at the tail of the deque. Return null if there is no batch, or if the batch does not have enough room; otherwise write the message into the batch's buffer as byte[].
RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
if (appendResult != null)
return appendResult;
}
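        // Estimate an upper bound for this record's size, take the larger of that and batch.size,
        // and allocate a buffer from the BufferPool. allocate() may block for up to maxTimeToBlock
        // (the remaining portion of max.block.ms) when buffer.memory is exhausted.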
byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
buffer = free.allocate(size, maxTimeToBlock);
// Update the current time in case the buffer allocation blocked above.
nowMs = time.milliseconds();
synchronized (dq) {
            // Same check as above. Why try a second time?
            // While we were allocating the buffer, another thread producing to the same TopicPartition may have created a new batch, or the Sender may have drained messages and freed up space, so an existing batch may now have room for this message. If the retry succeeds, the buffer we just allocated from free is released in the finally block.
RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
if (appendResult != null) {
return appendResult;
}
            // Builder pattern: construct a MemoryRecords object. MemoryRecordsBuilder wraps the buffer operations and writes through NIO.
MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
            // At this point tryAppend() on an existing batch has definitely failed, so append the message into the freshly created ProducerBatch
FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
callback, nowMs));
dq.addLast(batch);
            // Add it to the set of incomplete batches so delivery can be tracked until it is acknowledged.
incomplete.add(batch);
            // Why set buffer to null here instead of releasing it uniformly in finally? The buffer now belongs to the new ProducerBatch (via its MemoryRecordsBuilder) and must not be returned to the pool; nulling the local variable makes the finally block skip deallocation.
buffer = null;
return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
}
} finally {
if (buffer != null)
            // Return the buffer to the pool
free.deallocate(buffer);
        // Decrement the count of threads currently appending
appendsInProgress.decrementAndGet();
}
}
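For completeness, the private tryAppend() helper called twice above looks roughly like the following (paraphrased from the Kafka 2.x source). It peeks at the batch at the tail of the deque and returns null when there is no batch or the batch has no room left, which is what pushes the caller down the buffer-allocation path:

private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                     Callback callback, Deque<ProducerBatch> deque, long nowMs) {
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        // Delegates to ProducerBatch.tryAppend(), which returns null when the batch has no room.
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null)
            last.closeForRecordAppends(); // no room: seal the batch so no further appends happen
        else
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
    }
    return null; // no usable batch at the tail; the caller will allocate a buffer and create one
}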
The walkthrough above shows what RecordAccumulator.append() does: it wraps the message, stores it in the deque for the corresponding TopicPartition, and writes the message bytes into a pooled buffer.
After RecordAccumulator.append() returns, batchIsFull and newBatchCreated decide whether the Sender thread should be woken up to send:
- If batchIsFull is true: a RecordBatch in the deque is full, so the sender thread can be woken up to send it.
- If newBatchCreated is true: the old RecordBatch was full or could not hold the new message and a new one was created, so the sender can also be woken up.
A few fields of RecordAppendResult are worth explaining:
public final static class RecordAppendResult {
    // Future used to support both the synchronous and the asynchronous handling of this message
public final FutureRecordMetadata future;
    // Whether the RecordBatch is full
public final boolean batchIsFull;
    // Whether a new RecordBatch was created for this append
public final boolean newBatchCreated;
    // Whether the append was aborted because a new RecordBatch would be needed (so the caller should retry; see abortOnNewBatch in doSend below)
public final boolean abortForNewBatch;
public RecordAppendResult(FutureRecordMetadata future, boolean batchIsFull, boolean newBatchCreated, boolean abortForNewBatch) {
this.future = future;
this.batchIsFull = batchIsFull;
this.newBatchCreated = newBatchCreated;
this.abortForNewBatch = abortForNewBatch;
}
}
Now let's look at the core of the send path, doSend():
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
throwIfProducerClosed();
// first make sure the metadata for the topic is available
long nowMs = time.milliseconds();
ClusterAndWaitTime clusterAndWaitTime;
try {
            // Wait for the cluster metadata for this topic (and partition) to become available
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
} catch (KafkaException e) {
if (metadata.isClosed())
throw new KafkaException("Producer closed while send in progress", e);
throw e;
}
nowMs += clusterAndWaitTime.waitedOnMetadataMs;
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
Cluster cluster = clusterAndWaitTime.cluster;
byte[] serializedKey;
try {
            // Serialize the key
serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
} catch (ClassCastException cce) {
            // the SerializationException thrown here is omitted (same pattern as the value-serializer catch below)
}
byte[] serializedValue;
try {
serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer", cce);
}
        // Determine the target partition
int partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
setReadOnly(record.headers());
Header[] headers = record.headers().toArray();
int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
compressionType, serializedKey, serializedValue, headers);
ensureValidRecordSize(serializedSize);
long timestamp = record.timestamp() == null ? nowMs : record.timestamp();
if (log.isTraceEnabled()) {
log.trace("Attempting to append record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
}
        // Wrap the callback with the interceptors; as covered earlier, interceptors can be configured on the send path
Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
if (transactionManager != null && transactionManager.isTransactional()) {
transactionManager.failIfNotReadyForSend();
}
        // The core of sending: hand the message over to the record accumulator
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, true, nowMs);
        // The append was aborted because a new batch would have to be created (abortOnNewBatch was true)
if (result.abortForNewBatch) {
int prevPartition = partition;
            // Notify the partitioner that a new batch is about to be created by calling onNewBatch()
            // With a partitioner backed by StickyPartitionCache (e.g. DefaultPartitioner), this is where nextPartition() is typically invoked to pick a new sticky partition
partitioner.onNewBatch(record.topic(), cluster, prevPartition);
            // Re-compute the partition
partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
if (log.isTraceEnabled()) {
log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
}
// producer callback will make sure to call both 'callback' and interceptor callback
interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
            // Retry the append
result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, false, nowMs);
}
if (transactionManager != null && transactionManager.isTransactional())
transactionManager.maybeAddPartitionToTransaction(tp);
        // After appending a record to the accumulator, if a batch is now full or a new batch was created, wake up the Sender to send
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
this.sender.wakeup();
}
        // Return the future; the caller decides whether to block on it (synchronous) or not (asynchronous)
return result.future;
// handling exceptions and record the errors;
// for API exceptions return them in the future,
// for other exceptions throw directly
    } catch (ApiException e) {
        // exception handling omitted
    } catch (InterruptedException e) {
        // exception handling omitted
    } catch (KafkaException e) {
        // exception handling omitted
    } catch (Exception e) {
        // exception handling omitted
    }
}
doSend() is fairly long, but the logic is straightforward:
- Make sure the producer is still running:
throwIfProducerClosed();
- Make sure metadata for the topic is available (fetching/refreshing it if necessary);
- Serialize the key and the value;
- Compute the target topic partition;
- Wrap the callback (with the interceptors);
- Hand the message over to the accumulator;
- Handle exceptions and interruption.
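The future returned at the end of doSend() is what gives callers the choice between the synchronous and the asynchronous style. A minimal usage sketch (imports omitted; the broker address and topic name are placeholders):

public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "key", "value");

        // Asynchronous: send() returns as soon as the record is handed to the accumulator;
        // the callback fires once the broker responds.
        producer.send(record, (metadata, exception) -> {
            if (exception != null)
                exception.printStackTrace();
            else
                System.out.printf("sent to %s-%d@%d%n", metadata.topic(), metadata.partition(), metadata.offset());
        });

        // Synchronous: blocking on the returned future waits for the broker's acknowledgement.
        RecordMetadata metadata = producer.send(record).get();
        System.out.println("sync send acked at offset " + metadata.offset());
    }
}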
A few details are worth a closer look.
Kafka's built-in serializers:

Serializer | Description
---|---
ByteArraySerializer | byte[] serialization (passed through as-is)
ByteBufferSerializer | ByteBuffer (NIO) serialization
BytesSerializer | Bytes serialization
DoubleSerializer | Double serialization
ExtendedSerializer | removed; the old extended serializer interface, now superseded by Serializer
FloatSerializer | Float serialization
IntegerSerializer | Integer serialization
LongSerializer | Long serialization
ShortSerializer | Short serialization
StringSerializer | String serialization
UUIDSerializer | UUID serialization; same approach as StringSerializer (serializes the string form)
VoidSerializer | always returns null
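Besides the built-in ones, a custom serializer only needs to implement org.apache.kafka.common.serialization.Serializer (configure() and close() have default implementations in recent client versions). A small sketch; the User class here is made up for illustration:

// Hypothetical example: serialize a simple User object as UTF-8 "id:name".
public class UserSerializer implements Serializer<User> {
    @Override
    public byte[] serialize(String topic, User user) {
        if (user == null)
            return null;
        return (user.getId() + ":" + user.getName()).getBytes(StandardCharsets.UTF_8);
    }
}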
Computing the partition value
Default implementation: DefaultPartitioner
private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
Integer partition = record.partition();
return partition != null ?
partition :
partitioner.partition(
record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
}
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster,
int numPartitions) {
if (keyBytes == null) {
return stickyPartitionCache.partition(topic, cluster);
}
// hash the keyBytes to choose a partition
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
The algorithm is:
- If the record explicitly specifies a partition, that partition is used.
- If no partition is specified:
  - but a key is present, the partition is chosen by hashing the key: murmur2(keyBytes) % numPartitions (see the sketch below);
  - and there is no key either, StickyPartitionCache.nextPartition() picks a partition (initially at random, then "sticking" to it until a new batch is created).
- These rules are ordered by priority: if both partition and key are specified, the explicit partition wins.
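To make the key-hashing branch concrete, the same computation can be reproduced with Kafka's Utils class; the key and the partition count below are made up:

byte[] keyBytes = "user-42".getBytes(StandardCharsets.UTF_8); // hypothetical key
int numPartitions = 6;                                        // hypothetical partition count of the topic
// Same formula as DefaultPartitioner: murmur2 hash, forced positive, modulo partition count.
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
System.out.println("key 'user-42' -> partition " + partition);

The useful property here is that the same key always lands on the same partition, which preserves per-key ordering.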
Design for high concurrency
First look at how the batches map is created: this.batches = new CopyOnWriteMap<>();. CopyOnWriteMap is not a class from java.util.concurrent but a utility class that ships with Kafka (org.apache.kafka.common.utils.CopyOnWriteMap), built for read-heavy, write-light workloads. Why? Look at its put() and get() methods:
private volatile Map<K, V> map;
public synchronized V put(K k, V v) {
Map<K, V> copy = new HashMap<K, V>(this.map);
V prev = copy.put(k, v);
this.map = Collections.unmodifiableMap(copy);
return prev;
}
public V get(Object k) {
return map.get(k);
}
volatile should be familiar: it guarantees visibility of the map reference across threads. On put(), the current map is first copied, the new entry is put into the copy, and the copy then replaces the old map by swapping the reference; get() simply reads the current map without any locking. It is a map that purely trades space for time, which suits the accumulator well: the batches map is read on every single send, but a new entry is written only the first time a partition is seen.
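That read-mostly access pattern shows up clearly in getOrCreateDeque(), which append() calls on every record: it only falls into the synchronized, copying write path the first time a partition is seen (paraphrased from the Kafka source):

private Deque<ProducerBatch> getOrCreateDeque(TopicPartition tp) {
    Deque<ProducerBatch> d = this.batches.get(tp); // lock-free read; this path is hit almost every time
    if (d != null)
        return d;
    d = new ArrayDeque<>();
    // putIfAbsent is synchronized and copies the underlying map,
    // but it runs only once per new TopicPartition.
    Deque<ProducerBatch> previous = this.batches.putIfAbsent(tp, d);
    if (previous == null)
        return d;
    else
        return previous;
}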