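This article covers the Sender and NetworkClient in the Kafka producer. When a producer sends messages, RecordAccumulator buffers them and groups them into batches, and the Sender thread sends what RecordAccumulator has buffered. The Sender thread first prepares the messages for sending, then calls NetworkClient to perform the network operations. NetworkClient sends the requests and handles the responses it receives. Internally it wraps a Selector, which performs the actual network I/O.
1. Sender Thread
The Sender thread's run() method calls runOnce() in a loop to send messages.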
@Override
public void run() {
log.debug("Starting Kafka producer I/O thread.");
if (transactionManager != null)
transactionManager.setPoisonStateOnInvalidTransition(true);
// main loop, runs until close is called
while (running) {
try {
// **Call runOnce() in a loop**
runOnce();
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");
// Stop accepting new send requests, but keep working through the messages already queued
while (!forceClose && ((this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0) || hasPendingTransactionalRequests())) {
try {
runOnce();
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
// Abort the transaction if any commit or abort didn't go through the transaction manager's queue
while (!forceClose && transactionManager != null && transactionManager.hasOngoingTransaction()) {
if (!transactionManager.isCompleting()) {
log.info("Aborting incomplete transaction due to shutdown");
try {
// It is possible for the transaction manager to throw errors when aborting. Catch these
// so as not to interfere with the rest of the shutdown logic.
transactionManager.beginAbort();
} catch (Exception e) {
log.error("Error in kafka producer I/O thread while aborting transaction when during closing: ", e);
// Force close in case the transactionManager is in error states.
forceClose = true;
}
}
try {
runOnce();
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
// Forced close: abandon all queued requests
if (forceClose) {
// We need to fail all the incomplete transactional requests and batches and wake up the threads waiting on
// the futures.
if (transactionManager != null) {
log.debug("Aborting incomplete transactional requests due to forced shutdown");
transactionManager.close();
}
log.debug("Aborting incomplete batches due to forced shutdown");
this.accumulator.abortIncompleteBatches();
}
try {
// Close the NetworkClient
this.client.close();
} catch (Exception e) {
log.error("Failed to close network client", e);
}
log.debug("Shutdown of Kafka producer I/O thread has completed.");
}
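The shutdown phases of run() can be sketched with a much smaller model. The class below is a hypothetical stand-in (DrainingWorker, initiateClose() and forceClose() are illustrative names, not Kafka's API) and it leaves out transactions entirely. It only shows the difference between a graceful close, which drains queued work, and a forced close, which abandons it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of the Sender shutdown phases; not Kafka code.
class DrainingWorker {
    private final Queue<String> pending = new ArrayDeque<>();
    final List<String> sent = new ArrayList<>();
    final List<String> aborted = new ArrayList<>();
    private volatile boolean running = true;
    private volatile boolean forceClose = false;

    void enqueue(String record) {
        pending.add(record);
    }

    // Graceful close: leave the main loop but still drain what is queued.
    void initiateClose() {
        running = false;
    }

    // Forced close: leave the main loop and give up on queued work.
    void forceClose() {
        forceClose = true;
        running = false;
    }

    // One iteration: send at most one queued record.
    private void runOnce() {
        String record = pending.poll();
        if (record != null)
            sent.add(record);
    }

    void run() {
        // Phase 1: main loop, runs until a close is requested.
        while (running)
            runOnce();
        // Phase 2: graceful drain of whatever is still queued.
        while (!forceClose && !pending.isEmpty())
            runOnce();
        // Phase 3: on forced close, fail everything that is left.
        if (forceClose) {
            aborted.addAll(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) {
        DrainingWorker worker = new DrainingWorker();
        worker.enqueue("a");
        worker.enqueue("b");
        worker.initiateClose(); // close requested before the loop starts
        worker.run();
        System.out.println(worker.sent); // prints [a, b]
    }
}
```

In this sketch the close is requested before run() is called so that the example terminates on a single thread. In the real producer the close request arrives from another thread while the loop is running.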
The runOnce() method does the actual sending. It also contains the transaction handling for messages; the part that sends messages is shown below:
void runOnce() {
long currentTimeMs = time.milliseconds();
// sendProducerData() assembles message batches into ClientRequests,
// **and internally calls NetworkClient.send() to stage them for sending**
long pollTimeout = sendProducerData(currentTimeMs);
// **Call NetworkClient.poll() to send the messages and handle the responses**
client.poll(pollTimeout, currentTimeMs);
}
Through runOnce(), the Sender thread repeatedly calls sendProducerData() to send the messages buffered in RecordAccumulator. sendProducerData() is implemented as follows:
private long sendProducerData(long now) {
// Fetch the metadata snapshot
MetadataSnapshot metadataSnapshot = metadata.fetchMetadataSnapshot();
// Find the partitions that are ready to send
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(metadataSnapshot, now);
// If any partition's leader is unknown, request a metadata update
if (!result.unknownLeaderTopics.isEmpty()) {
// The set of topics with unknown leader contains topics with leader election pending as well as
// topics which may have expired. Add the topic again to metadata to ensure it is included
// and request metadata update, since there are messages to send to the topic.
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic, now);
log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
result.unknownLeaderTopics);
this.metadata.requestUpdate(false);
}
// Filter the result further: check the connection to each node and remove the nodes that are not ready
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
// Update just the readyTimeMs of the latency stats, so that it moves forward
// every time the batch is ready (then the difference between readyTimeMs and
// drainTimeMs would represent how long data is waiting for the node).
this.accumulator.updateNodeLatencyStats(node.id(), now, false);
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
} else {
// Update both readyTimeMs and drainTimeMs, this would "reset" the node
// latency.
this.accumulator.updateNodeLatencyStats(node.id(), now, true);
}
}
// Drain data from RecordAccumulator into a map from node id to ProducerBatch list, to be handed to the network layer for delivery to each node
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(metadataSnapshot, result.readyNodes, this.maxRequestSize, now);
addToInflightBatches(batches);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<ProducerBatch> batchList : batches.values()) {
for (ProducerBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
// Handle batches that have expired
accumulator.resetNextBatchExpiryTime();
List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
expiredBatches.addAll(expiredInflightBatches);
if (!expiredBatches.isEmpty())
log.trace("Expired {} batches in accumulator", expiredBatches.size());
for (ProducerBatch expiredBatch : expiredBatches) {
String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
+ ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
failBatch(expiredBatch, new TimeoutException(errorMessage), false);
if (transactionManager != null && expiredBatch.inRetry()) {
// This ensures that no new batches are drained until the current in flight batches are fully resolved.
transactionManager.markSequenceUnresolved(expiredBatch);
}
}
// Update metrics
sensors.updateProduceRequestMetrics(batches);
// Set pollTimeout to the smallest of: the delay until the next ready check, the poll delay of the not-ready nodes, and the time until the next batch expires
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
pollTimeout = Math.max(pollTimeout, 0);
// If any node is ready, set pollTimeout to 0 so the messages are sent immediately
if (!result.readyNodes.isEmpty()) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
// if some partitions are already ready to be sent, the select time would be 0;
// otherwise if some partition already has some data accumulated but not ready yet,
// the select time will be the time difference between now and its linger expiry time;
// otherwise the select time will be the time difference between now and the metadata expiry time;
pollTimeout = 0;
}
// Call sendProduceRequests() to stage the requests through NetworkClient
sendProduceRequests(batches, now);
return pollTimeout;
}
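The timeout arithmetic at the end of sendProducerData() is easier to follow in isolation. The helper below is a hypothetical extraction (PollTimeoutSketch and its parameter names are assumptions for illustration). It takes the same inputs as plain numbers and returns the value that would be passed to client.poll().

```java
// Hypothetical helper mirroring the pollTimeout arithmetic; not Kafka code.
class PollTimeoutSketch {
    static long pollTimeout(long nextReadyCheckDelayMs, long notReadyTimeoutMs,
                            long nextExpiryTimeMs, long nowMs, boolean anyNodeReady) {
        // Data is ready for at least one node: do not block in poll() at all.
        if (anyNodeReady)
            return 0L;
        // Otherwise wake up for whichever comes first: the next ready check,
        // a not-ready node becoming usable, or the next batch expiry.
        long timeout = Math.min(nextReadyCheckDelayMs, notReadyTimeoutMs);
        timeout = Math.min(timeout, nextExpiryTimeMs - nowMs);
        // An expiry time in the past must not produce a negative timeout.
        return Math.max(timeout, 0L);
    }

    public static void main(String[] args) {
        // Next ready check in 50 ms, no not-ready nodes, next expiry in 30 ms.
        System.out.println(pollTimeout(50, Long.MAX_VALUE, 1_030, 1_000, false)); // prints 30
        // A ready node overrides everything else.
        System.out.println(pollTimeout(50, Long.MAX_VALUE, 1_030, 1_000, true)); // prints 0
    }
}
```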
sendProduceRequests() walks the map from node id to ProducerBatch list and prepares one request for each destination node.
private void sendProduceRequests(Map<Integer, List<ProducerBatch>> collated, long now) {
for (Map.Entry<Integer, List<ProducerBatch>> entry : collated.entrySet())
sendProduceRequest(now, entry.getKey(), acks, requestTimeoutMs, entry.getValue());
}
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
if (batches.isEmpty())
return;
final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());
// Find the lowest message format version (magic) in use
byte minUsedMagic = apiVersions.maxUsableProduceMagic();
for (ProducerBatch batch : batches) {
if (batch.magic() < minUsedMagic)
minUsedMagic = batch.magic();
}
ProduceRequestData.TopicProduceDataCollection tpd = new ProduceRequestData.TopicProduceDataCollection();
for (ProducerBatch batch : batches) {
TopicPartition tp = batch.topicPartition;
MemoryRecords records = batch.records();
// For backward compatibility, down-convert the records to the lowest message format version
if (!records.hasMatchingMagic(minUsedMagic))
records = batch.records().downConvert(minUsedMagic, 0, time).records();
// Add the records to the TopicProduceDataCollection, grouped by topic and partition
ProduceRequestData.TopicProduceData tpData = tpd.find(tp.topic());
if (tpData == null) {
tpData = new ProduceRequestData.TopicProduceData().setName(tp.topic());
tpd.add(tpData);
}
tpData.partitionData().add(new ProduceRequestData.PartitionProduceData()
.setIndex(tp.partition())
.setRecords(records));
recordsByPartition.put(tp, batch);
}
String transactionalId = null;
if (transactionManager != null && transactionManager.isTransactional()) {
transactionalId = transactionManager.transactionalId();
}
// Build the ProduceRequest.Builder, which carries the TopicData
ProduceRequest.Builder requestBuilder = ProduceRequest.forMagic(minUsedMagic,
new ProduceRequestData()
.setAcks(acks)
.setTimeoutMs(timeout)
.setTransactionalId(transactionalId)
.setTopicData(tpd));
// Set the callback, which invokes handleProduceResponse()
RequestCompletionHandler callback = response -> handleProduceResponse(response, recordsByPartition, time.milliseconds());
String nodeId = Integer.toString(destination);
// Build the ClientRequest, which holds the ProduceRequest.Builder
// acks != 0 means a response from the broker is expected
ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
requestTimeoutMs, callback);
// **Call client.send() to stage the request for sending**
client.send(clientRequest, now);
log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}
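Two steps inside sendProduceRequest() are plain data manipulation: picking the lowest magic among the batches, and grouping partitions under their topic with a find-or-create lookup. The sketch below reproduces both with ordinary collections. Batch is a hypothetical stand-in for ProducerBatch, and a LinkedHashMap takes the place of TopicProduceDataCollection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the grouping done in sendProduceRequest(); not Kafka code.
class ProduceGroupingSketch {
    // Stand-in for ProducerBatch: only the fields this sketch needs.
    static final class Batch {
        final String topic;
        final int partition;
        final byte magic;

        Batch(String topic, int partition, byte magic) {
            this.topic = topic;
            this.partition = partition;
            this.magic = magic;
        }
    }

    // Start from the highest usable magic and lower it to the smallest one seen.
    static byte minMagic(byte maxUsableMagic, List<Batch> batches) {
        byte min = maxUsableMagic;
        for (Batch batch : batches) {
            if (batch.magic < min)
                min = batch.magic;
        }
        return min;
    }

    // Find-or-create the per-topic entry, then append the partition to it.
    static Map<String, List<Integer>> partitionsByTopic(List<Batch> batches) {
        Map<String, List<Integer>> byTopic = new LinkedHashMap<>();
        for (Batch batch : batches)
            byTopic.computeIfAbsent(batch.topic, t -> new ArrayList<>()).add(batch.partition);
        return byTopic;
    }

    public static void main(String[] args) {
        List<Batch> batches = new ArrayList<>();
        batches.add(new Batch("orders", 0, (byte) 2));
        batches.add(new Batch("orders", 1, (byte) 1));
        batches.add(new Batch("payments", 0, (byte) 2));
        System.out.println(minMagic((byte) 2, batches)); // prints 1
        System.out.println(partitionsByTopic(batches)); // prints {orders=[0, 1], payments=[0]}
    }
}
```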
2. NetworkClient
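NetworkClient handles the network operations. sendProducerData() in the Sender thread calls NetworkClient.send() (through sendProduceRequest()), which in turn calls doSend(). doSend() adds the request to the inFlightRequests queue and calls the underlying Selector.send() to stage it for sending.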
public class NetworkClient implements KafkaClient {
// NetworkClient state
private enum State {
ACTIVE,
CLOSING,
CLOSED
}
private final Logger log;
// The wrapped selector, used to perform network I/O
private final Selectable selector;
// Metadata updater
private final MetadataUpdater metadataUpdater;
private final Random randOffset;
// Connection state of every node in the cluster
private final ClusterConnectionStates connectionStates;
// InFlightRequests: requests that are being sent or are awaiting a response
private final InFlightRequests inFlightRequests;
private final int socketSendBuffer;
private final int socketReceiveBuffer;
// Client id
private final String clientId;
private int correlation;
// Default timeout for a single request
private final int defaultRequestTimeoutMs;
// Backoff time before reconnecting to a node
private final long reconnectBackoffMs;
private final MetadataRecoveryStrategy metadataRecoveryStrategy;
private final Time time;
// When true, an ApiVersionsRequest is sent on first connecting to a node to discover its API versions
private final boolean discoverBrokerVersions;
// API versions of the nodes
private final ApiVersions apiVersions;
// ApiVersions requests that still need to be sent, keyed by node id
private final Map<String, ApiVersionsRequest.Builder> nodesNeedingApiVersionsFetch = new HashMap<>();
// Responses for requests whose send was aborted
private final List<ClientResponse> abortedSends = new LinkedList<>();
private final Sensor throttleTimeSensor;
private final AtomicReference<State> state;
private final TelemetrySender telemetrySender;
}
The send() method calls doSend() to send the ClientRequest.
@Override
public void send(ClientRequest request, long now) {
// **Call doSend() to send the ClientRequest**
doSend(request, false, now);
}
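This doSend() overload handles preparation before sending, such as checking the connection state of the destination node and resolving the API version, and then calls a second doSend() overload.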
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now) {
// Ensure the NetworkClient is in the ACTIVE state
ensureActive();
// Get the nodeId of the destination node
String nodeId = clientRequest.destination();
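// For an external request, check whether it can be sent:
// 1. whether the connectionState of the destination node is ready,
// 2. whether the Selector's channel is ready,
// 3. whether the inFlightRequests queue has reached its limit.
// Internal requests are already checked elsewhere in the code, so only external requests are checked here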
if (!isInternalRequest) {
if (!canSendRequest(nodeId, now))
throw new IllegalStateException("Attempt to send a request to node " + nodeId + " which is not ready.");
}
// The ProduceRequest.Builder held by the ClientRequest
AbstractRequest.Builder<?> builder = clientRequest.requestBuilder();
// Look up the destination node's API versions
try {
NodeApiVersions versionInfo = apiVersions.get(nodeId);
short version;
// Note: if versionInfo is null, we have no server version information. This would be
// the case when sending the initial ApiVersionRequest which fetches the version
// information itself. It is also the case when discoverBrokerVersions is set to false.
if (versionInfo == null) {
version = builder.latestAllowedVersion();
if (discoverBrokerVersions && log.isTraceEnabled())
log.trace("No version information found when sending {} with correlation id {} to node {}. " +
"Assuming version {}.", clientRequest.apiKey(), clientRequest.correlationId(), nodeId, version);
} else {
version = versionInfo.latestUsableVersion(clientRequest.apiKey(), builder.oldestAllowedVersion(),
builder.latestAllowedVersion());
}
// **Call the other doSend() overload to send the request**; this may throw UnsupportedVersionException
// builder.build() builds the ProduceRequest
doSend(clientRequest, isInternalRequest, now, builder.build(version));
} catch (UnsupportedVersionException unsupportedVersionException) {
// The version is not supported: produce a failure response and handle it according to the request type
log.debug("Version mismatch when attempting to send {} with correlation id {} to {}", builder,
clientRequest.correlationId(), clientRequest.destination(), unsupportedVersionException);
ClientResponse clientResponse = new ClientResponse(clientRequest.makeHeader(builder.latestAllowedVersion()),
clientRequest.callback(), clientRequest.destination(), now, now,
false, unsupportedVersionException, null, null);
if (!isInternalRequest)
abortedSends.add(clientResponse);
else if (clientRequest.apiKey() == ApiKeys.METADATA)
metadataUpdater.handleFailedRequest(now, Optional.of(unsupportedVersionException));
else if (isTelemetryApi(clientRequest.apiKey()) && telemetrySender != null)
telemetrySender.handleFailedRequest(clientRequest.apiKey(), unsupportedVersionException);
}
}
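The version choice in doSend() comes down to intersecting two ranges: the versions the request builder allows and the versions the broker advertises. The sketch below shows that intersection under simplifying assumptions. VersionNegotiationSketch is a hypothetical class, versions are plain ints instead of shorts, and IllegalArgumentException stands in for Kafka's UnsupportedVersionException.

```java
// Hypothetical sketch of picking a request version; not Kafka code.
class VersionNegotiationSketch {
    static int latestUsableVersion(int clientOldest, int clientLatest,
                                   int brokerMin, int brokerMax) {
        // The usable range is the overlap of what each side supports.
        int low = Math.max(clientOldest, brokerMin);
        int high = Math.min(clientLatest, brokerMax);
        if (low > high) {
            // No overlap: the request cannot be sent to this broker.
            throw new IllegalArgumentException("No common version: client ["
                    + clientOldest + ", " + clientLatest + "], broker ["
                    + brokerMin + ", " + brokerMax + "]");
        }
        // Prefer the newest version both sides understand.
        return high;
    }

    public static void main(String[] args) {
        System.out.println(latestUsableVersion(0, 9, 3, 7)); // prints 7
        System.out.println(latestUsableVersion(0, 5, 3, 7)); // prints 5
    }
}
```

When the ranges do not overlap, the real client does not propagate the exception to the caller of send(). As the catch block above shows, it turns the failure into a ClientResponse that is delivered on the next poll().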
The second doSend() overload wraps the request in an InFlightRequest, adds it to the inFlightRequests queue, and then calls selector.send() to stage it for sending.
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now, AbstractRequest request) {
// Get the destination node id
String destination = clientRequest.destination();
// Build the request header
RequestHeader header = clientRequest.makeHeader(request.version());
if (log.isDebugEnabled()) {
log.debug("Sending {} request with header {} and timeout {} to node {}: {}",
clientRequest.apiKey(), header, clientRequest.requestTimeoutMs(), destination, request);
}
// Build the Send object from the request
// request is declared as AbstractRequest; here the concrete type is ProduceRequest
// The Send object contains the request header and the data of the ProduceRequest
Send send = request.toSend(header);
// Build the InFlightRequest
InFlightRequest inFlightRequest = new InFlightRequest(
clientRequest,
header,
isInternalRequest,
request,
send,
now);
// Add the request to the inFlightRequests queue
this.inFlightRequests.add(inFlightRequest);
// Call selector.send() to stage the request for sending
selector.send(new NetworkSend(clientRequest.destination(), send));
}
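The role of the inFlightRequests queue can be illustrated with a per-node deque: new requests go in at the front, the oldest request completes first, and a size limit decides whether another request may be sent. The class below is a simplified, hypothetical model. InFlightSketch and its method names are assumptions, requests are reduced to correlation ids, and the readiness check is reduced to the size limit alone.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-node in-flight request queue; not Kafka code.
class InFlightSketch {
    private final int maxPerConnection;
    private final Map<String, Deque<Integer>> requests = new HashMap<>();

    InFlightSketch(int maxPerConnection) {
        this.maxPerConnection = maxPerConnection;
    }

    // A node can take another request while its queue is below the limit.
    boolean canSendMore(String nodeId) {
        Deque<Integer> queue = requests.get(nodeId);
        return queue == null || queue.size() < maxPerConnection;
    }

    // New requests are added at the front of the node's queue.
    void add(String nodeId, int correlationId) {
        requests.computeIfAbsent(nodeId, id -> new ArrayDeque<>()).addFirst(correlationId);
    }

    // Responses arrive in send order, so the oldest request (at the back) completes first.
    int completeNext(String nodeId) {
        Deque<Integer> queue = requests.get(nodeId);
        if (queue == null || queue.isEmpty())
            throw new IllegalStateException("No in-flight request for node " + nodeId);
        return queue.pollLast();
    }

    int count(String nodeId) {
        Deque<Integer> queue = requests.get(nodeId);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) {
        InFlightSketch inFlight = new InFlightSketch(2);
        inFlight.add("1", 10);
        inFlight.add("1", 11);
        System.out.println(inFlight.canSendMore("1")); // prints false
        System.out.println(inFlight.completeNext("1")); // prints 10
        System.out.println(inFlight.canSendMore("1")); // prints true
    }
}
```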
NetworkClient.send() above only stages a request for sending. NetworkClient.poll() performs the actual send and handles responses as they arrive.
@Override
public List<ClientResponse> poll(long timeout, long now) {
// Ensure the NetworkClient is in the ACTIVE state
ensureActive();
// If any requests were aborted because of an unsupported version or a failed connection, return their responses right away
if (!abortedSends.isEmpty()) {
List<ClientResponse> responses = new ArrayList<>();
// Add the aborted sends to responses
handleAbortedSends(responses);
// Complete the responses and invoke their callbacks
completeResponses(responses);
return responses;
}
long metadataTimeout = metadataUpdater.maybeUpdate(now);
long telemetryTimeout = telemetrySender != null ? telemetrySender.maybeUpdate(now) : Integer.MAX_VALUE;
try {
// **Call selector.poll() to perform the network I/O**
this.selector.poll(Utils.min(timeout, metadataTimeout, telemetryTimeout, defaultRequestTimeoutMs));
} catch (IOException e) {
log.error("Unexpected error during I/O", e);
}
// process completed actions
long updatedNow = this.time.milliseconds();
List<ClientResponse> responses = new ArrayList<>();
// Handle requests whose send has completed
// 1. Call selector.completedSends()
// 2. If no response is expected, remove the request from the inFlightRequests queue and return a success response
handleCompletedSends(responses, updatedNow);
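// Handle responses that were received successfully
// 1. Call selector.completedReceives()
// 2. Remove the request from the inFlightRequests queue
// 3. Parse the response and handle it by type (metadata response, ApiVersions response, produce response, ...)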
handleCompletedReceives(responses, updatedNow);
// Handle nodes that have disconnected
handleDisconnections(responses, updatedNow);
// Handle newly established connections
handleConnections();
// Send any pending ApiVersions requests
handleInitiateApiVersionRequests(updatedNow);
// Handle nodes whose connection attempt has timed out
handleTimedOutConnections(responses, updatedNow);
// Handle requests in the inFlightRequests queue that have timed out,
// close the connection to the corresponding node, and treat that node as disconnected
handleTimedOutRequests(responses, updatedNow);
// Complete the responses and invoke their callbacks
completeResponses(responses);
return responses;
}
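The split between send() and poll() is a two-phase pattern: send() only queues work, and nothing reaches the caller's callback until poll() runs. The sketch below shows that contract with no networking at all. SendThenPollSketch is a hypothetical class, and the "ack:" prefix is an invented placeholder for a broker response.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical two-phase client: send() stages, poll() completes; not Kafka code.
class SendThenPollSketch {
    private static final class Pending {
        final String payload;
        final Consumer<String> callback;

        Pending(String payload, Consumer<String> callback) {
            this.payload = payload;
            this.callback = callback;
        }
    }

    private final Queue<Pending> staged = new ArrayDeque<>();

    // Phase 1: stage the request. No I/O happens and no callback fires here.
    void send(String payload, Consumer<String> callback) {
        staged.add(new Pending(payload, callback));
    }

    // Phase 2: "perform the I/O", collect all responses, then fire the callbacks.
    List<String> poll() {
        List<String> responses = new ArrayList<>();
        List<Pending> completed = new ArrayList<>();
        Pending pending;
        while ((pending = staged.poll()) != null) {
            responses.add("ack:" + pending.payload); // placeholder response
            completed.add(pending);
        }
        // Callbacks run only after every response has been gathered.
        for (int i = 0; i < completed.size(); i++)
            completed.get(i).callback.accept(responses.get(i));
        return responses;
    }

    public static void main(String[] args) {
        SendThenPollSketch client = new SendThenPollSketch();
        List<String> seen = new ArrayList<>();
        client.send("a", seen::add);
        System.out.println(seen); // prints []
        System.out.println(client.poll()); // prints [ack:a]
        System.out.println(seen); // prints [ack:a]
    }
}
```

Gathering all responses before invoking any callback mirrors the way poll() above fills the responses list through the handle*() methods and only then calls completeResponses().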