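Kafka-Consumer Source Code Analysis -- The Rebalance Process and Partition Assignment
References:
Reference 1: https://www.cnblogs.com/benfly/p/9605976.html
Preface
After the listeners are registered and started, each KafkaListener starts several consumer threads to pull data. These consumers first join the corresponding Kafka consumer group, which triggers a rebalance. The consumer client then determines the partition assignment for each consumer, and finally consumption begins.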
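1. Rebalance process analysis
A rebalance is essentially a protocol that specifies how all consumers in a consumer group reach agreement on dividing up the partitions of the topics they subscribe to. For example, suppose a group has 20 consumers and subscribes to a topic with 100 partitions. Under normal conditions Kafka assigns 5 partitions to each consumer on average. This assignment process is called a rebalance.
A rebalance is triggered when:
- Group membership changes: a new consumer joins the group, an existing consumer leaves the group voluntarily, or an existing consumer crashes.
- The set of subscribed topics changes. This can happen when you subscribe with a regular expression: creating a new topic that matches the pattern triggers a rebalance.
- The number of partitions of a subscribed topic changes.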
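The second trigger can be illustrated with a small sketch in plain Java, without the Kafka client. The class name, method names, and topic names are hypothetical; the point is only that a pattern subscription resolves to a different topic set once a matching topic is created, and that difference is what makes a rebalance necessary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration, not Kafka client code.
final class PatternSubscriptionSketch {

    /** Returns the topics whose names fully match the subscription pattern. */
    static Set<String> matchingTopics(Pattern pattern, List<String> clusterTopics) {
        Set<String> matched = new TreeSet<>();
        for (String topic : clusterTopics) {
            if (pattern.matcher(topic).matches()) {
                matched.add(topic);
            }
        }
        return matched;
    }

    /** A rebalance is needed when the matched set differs from the set the group joined with. */
    static boolean rebalanceNeeded(Set<String> joinedTopics, Set<String> currentTopics) {
        return !joinedTopics.equals(currentTopics);
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("orders-.*");
        Set<String> before = matchingTopics(pattern, Arrays.asList("orders-eu", "payments"));
        // A new topic "orders-us" is created and matches the pattern.
        Set<String> after = matchingTopics(pattern, Arrays.asList("orders-eu", "orders-us", "payments"));
        System.out.println(before + " -> " + after + ", rebalance needed: " + rebalanceNeeded(before, after));
    }
}
```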
This article analyzes the rebalance that is triggered when a new consumer joins the group.
1.1 Process summary
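The rebalance process:
- Kafka detects that a new consumer has joined and triggers a rebalance.
- Kafka answers the consumers' heartbeat requests with REBALANCE_IN_PROGRESS.
- When a consumer's heartbeat receives REBALANCE_IN_PROGRESS, the consumer updates its state and signals the main consuming thread to rejoin the group.
- The main consuming thread performs the join. At this point every consumer that has to rejoin sends a JoinGroup request.
- Until Kafka has collected the requests from all member consumers, it parks the requests it has already received in a structure called the purgatory.
- Once all consumers have sent their requests, one consumer is chosen as the group leader, and this leader will carry out the partition assignment. Kafka then adds the generation ID, the selected assignment protocol, the leader ID, and the member information to the JoinGroup responses. Partition metadata is not part of the response; the leader reads it from its own cluster metadata, as performAssignment shows.
- When a consumer client receives the JoinGroup response, it checks whether it is the leader. If it is, it performs the partition assignment and sends the result to Kafka. If it is not, it still sends a request to Kafka, but with empty data.
- After the consumers have sent these requests, Kafka puts each consumer's part of the assignment into that consumer's response. Each consumer stores its assignment for the fetches that follow.
- The rebalance ends.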
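The two phases above (JoinGroup, then SyncGroup) can be sketched as a toy, single-threaded simulation. Everything here is hypothetical: the `Coordinator` class, the "first joiner becomes leader" rule, and the simple dealing of partitions are stand-ins chosen for brevity, not the broker's or the client's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the JoinGroup / SyncGroup exchange.
final class RebalanceProtocolSketch {

    /** Toy stand-in for the broker-side group coordinator. */
    static final class Coordinator {
        private final List<String> pendingJoins = new ArrayList<>(); // parked JoinGroup requests
        private Map<String, List<Integer>> leaderAssignment = Collections.emptyMap();
        private int generationId = 0;

        void joinGroup(String memberId) {
            pendingJoins.add(memberId);
        }

        /** All members have joined: start a new generation and pick a leader (assumed: first joiner). */
        String completeJoin() {
            generationId++;
            return pendingJoins.get(0);
        }

        List<String> members() {
            return new ArrayList<>(pendingJoins);
        }

        /** Only the leader's SyncGroup request carries a non-empty assignment. */
        void syncGroup(String memberId, Map<String, List<Integer>> assignment) {
            if (!assignment.isEmpty()) {
                leaderAssignment = assignment;
            }
        }

        /** The SyncGroup response for one member: just its own share. */
        List<Integer> assignmentFor(String memberId) {
            return leaderAssignment.getOrDefault(memberId, Collections.<Integer>emptyList());
        }
    }

    /** Leader-side assignment: deal partitions out one by one (stand-in for a real assignor). */
    static Map<String, List<Integer>> assign(List<String> members, int partitionCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String member : members) {
            result.put(member, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            result.get(members.get(p % members.size())).add(p);
        }
        return result;
    }

    /** Runs one full rebalance and returns what each member ends up with. */
    static Map<String, List<Integer>> rebalance(List<String> memberIds, int partitionCount) {
        Coordinator coordinator = new Coordinator();
        for (String member : memberIds) {
            coordinator.joinGroup(member);                 // JoinGroup phase
        }
        String leader = coordinator.completeJoin();
        for (String member : memberIds) {                  // SyncGroup phase
            if (member.equals(leader)) {
                coordinator.syncGroup(member, assign(coordinator.members(), partitionCount));
            } else {
                coordinator.syncGroup(member, Collections.<String, List<Integer>>emptyMap());
            }
        }
        Map<String, List<Integer>> view = new LinkedHashMap<>();
        for (String member : memberIds) {
            view.put(member, coordinator.assignmentFor(member));
        }
        return view;
    }

    public static void main(String[] args) {
        System.out.println(rebalance(Arrays.asList("c1", "c2", "c3"), 6));
    }
}
```

The sketch collects the per-member results only after every SyncGroup request has been handled, which mirrors the idea that no member can learn its partitions before the leader has submitted the assignment.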
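1.2 Code analysis
The previous article, "Kafka-Consumer Source Code Analysis -- Listener Registration and Startup", covered how consumers are registered and started. After startup, a consumer actively joins the group, which triggers a rebalance.
Once the rebalance has been triggered, the heartbeat responses of all consumers return REBALANCE_IN_PROGRESS, and the clients start the rebalance.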
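The run method of the heartbeat thread, HeartbeatThread, calls sendHeartbeatRequest() to send heartbeats. In sendHeartbeatRequest, the response is processed by HeartbeatResponseHandler. When REBALANCE_IN_PROGRESS is returned, the handler calls requestRejoin, which only marks the current consumer as needing to rejoin the group. It does not perform the join itself.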
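The consumer's main thread decides whether it has to rejoin the group: the poll method of ConsumerCoordinator calls rejoinNeededOrPending to check. If a rejoin is needed, it calls ensureActiveGroup, which calls joinGroupIfNeeded to join the group when necessary. That method calls initiateJoinGroup to start the join, and initiateJoinGroup calls sendJoinGroupRequest to send the JoinGroup request to Kafka. The response is handled by JoinGroupResponseHandler.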
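The division of labour between the two threads (the heartbeat thread only records that a rejoin is required, the polling thread does the actual join) can be sketched as follows. This is a simplified illustration with hypothetical names; it uses an `AtomicBoolean` for brevity, whereas the coordinator code shown in this article guards its state with `synchronized` blocks.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of "mark in one thread, act in another".
final class RejoinFlagSketch {

    enum HeartbeatResult { OK, REBALANCE_IN_PROGRESS }

    private final AtomicBoolean rejoinNeeded = new AtomicBoolean(false);
    private int joinCount = 0;

    /** Heartbeat thread: only records that a rejoin is required. */
    void onHeartbeatResponse(HeartbeatResult result) {
        if (result == HeartbeatResult.REBALANCE_IN_PROGRESS) {
            rejoinNeeded.set(true);
        }
    }

    /** Polling thread: performs the join if, and only if, the flag is set. */
    boolean pollOnce() {
        if (rejoinNeeded.compareAndSet(true, false)) {
            joinCount++; // stand-in for sending JoinGroup and SyncGroup
            return true;
        }
        return false;
    }

    int joinCount() {
        return joinCount;
    }

    public static void main(String[] args) {
        RejoinFlagSketch consumer = new RejoinFlagSketch();
        consumer.onHeartbeatResponse(HeartbeatResult.OK);
        System.out.println("joined after OK heartbeat: " + consumer.pollOnce());
        consumer.onHeartbeatResponse(HeartbeatResult.REBALANCE_IN_PROGRESS);
        System.out.println("joined after REBALANCE_IN_PROGRESS: " + consumer.pollOnce());
    }
}
```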
Implementation of JoinGroupResponseHandler:
private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> {
@Override
public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) {
Errors error = joinResponse.error();
if (error == Errors.NONE) {
log.debug("Received successful JoinGroup response: {}", joinResponse);
sensors.joinLatency.record(response.requestLatencyMs());
synchronized (AbstractCoordinator.this) {
if (state != MemberState.REBALANCING) {
// if the consumer was woken up before a rebalance completes, we may have already left
// the group. In this case, we do not want to continue with the sync group.
future.raise(new UnjoinedGroupException());
} else {
AbstractCoordinator.this.generation = new Generation(joinResponse.data().generationId(),
joinResponse.data().memberId(), joinResponse.data().protocolName());
// check whether this consumer is the leader
if (joinResponse.isLeader()) {
onJoinLeader(joinResponse).chain(future);
} else {
onJoinFollower().chain(future);
}
}
}
} else if (error == Errors.COORDINATOR_LOAD_IN_PROGRESS) {
log.debug("Attempt to join group rejected since coordinator {} is loading the group.", coordinator());
// backoff and retry
future.raise(error);
} else if (error == Errors.UNKNOWN_MEMBER_ID) {
// reset the member id and retry immediately
resetGeneration();
log.debug("Attempt to join group failed due to unknown member id.");
future.raise(Errors.UNKNOWN_MEMBER_ID);
} else if (error == Errors.COORDINATOR_NOT_AVAILABLE
|| error == Errors.NOT_COORDINATOR) {
// re-discover the coordinator and retry with backoff
markCoordinatorUnknown();
log.debug("Attempt to join group failed due to obsolete coordinator information: {}", error.message());
future.raise(error);
} else if (error == Errors.FENCED_INSTANCE_ID) {
log.error("Received fatal exception: group.instance.id gets fenced");
future.raise(error);
} else if (error == Errors.INCONSISTENT_GROUP_PROTOCOL
|| error == Errors.INVALID_SESSION_TIMEOUT
|| error == Errors.INVALID_GROUP_ID
|| error == Errors.GROUP_AUTHORIZATION_FAILED
|| error == Errors.GROUP_MAX_SIZE_REACHED) {
// log the error and re-throw the exception
log.error("Attempt to join group failed due to fatal error: {}", error.message());
if (error == Errors.GROUP_MAX_SIZE_REACHED) {
future.raise(new GroupMaxSizeReachedException(groupId));
} else if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
future.raise(new GroupAuthorizationException(groupId));
} else {
future.raise(error);
}
} else if (error == Errors.UNSUPPORTED_VERSION) {
log.error("Attempt to join group failed due to unsupported version error. Please unset field group.instance.id and retry" +
"to see if the problem resolves");
future.raise(error);
} else if (error == Errors.MEMBER_ID_REQUIRED) {
// Broker requires a concrete member id to be allowed to join the group. Update member id
// and send another join group request in next cycle.
synchronized (AbstractCoordinator.this) {
AbstractCoordinator.this.generation = new Generation(OffsetCommitRequest.DEFAULT_GENERATION_ID,
joinResponse.data().memberId(), null);
AbstractCoordinator.this.rejoinNeeded = true;
AbstractCoordinator.this.state = MemberState.UNJOINED;
}
future.raise(Errors.MEMBER_ID_REQUIRED);
} else {
// unexpected error, throw the exception
log.error("Attempt to join group failed due to unexpected error: {}", error.message());
future.raise(new KafkaException("Unexpected error in join group response: " + error.message()));
}
}
}
The key part is:
// check whether this consumer is the leader
if (joinResponse.isLeader()) {
    onJoinLeader(joinResponse).chain(future);
} else {
    onJoinFollower().chain(future);
}
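This branch inspects the response to determine whether this consumer is the leader. If it is the leader, it performs the partition assignment and then calls sendSyncGroupRequest to send the result to Kafka. If it is not the leader, it calls sendSyncGroupRequest directly and sends empty data.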
Implementation of onJoinLeader:
private RequestFuture<ByteBuffer> onJoinLeader(JoinGroupResponse joinResponse) {
try {
// perform the leader synchronization and send back the assignment for the group
// perform the partition assignment
Map<String, ByteBuffer> groupAssignment = performAssignment(joinResponse.data().leader(), joinResponse.data().protocolName(),
joinResponse.data().members());
// convert the assignment result into the request format
List<SyncGroupRequestData.SyncGroupRequestAssignment> groupAssignmentList = new ArrayList<>();
for (Map.Entry<String, ByteBuffer> assignment : groupAssignment.entrySet()) {
groupAssignmentList.add(new SyncGroupRequestData.SyncGroupRequestAssignment()
.setMemberId(assignment.getKey())
.setAssignment(Utils.toArray(assignment.getValue()))
);
}
// wrap the converted assignment in a SyncGroupRequest
SyncGroupRequest.Builder requestBuilder =
new SyncGroupRequest.Builder(
new SyncGroupRequestData()
.setGroupId(groupId)
.setMemberId(generation.memberId)
.setGroupInstanceId(this.groupInstanceId.orElse(null))
.setGenerationId(generation.generationId)
.setAssignments(groupAssignmentList)
);
log.debug("Sending leader SyncGroup to coordinator {}: {}", this.coordinator, requestBuilder);
// send the assignment to Kafka via SyncGroup
return sendSyncGroupRequest(requestBuilder);
} catch (RuntimeException e) {
return RequestFuture.failure(e);
}
}
Implementation of onJoinFollower():
private RequestFuture<ByteBuffer> onJoinFollower() {
    // send follower's sync group with an empty assignment
    SyncGroupRequest.Builder requestBuilder =
            new SyncGroupRequest.Builder(
                    new SyncGroupRequestData()
                            .setGroupId(groupId)
                            .setMemberId(generation.memberId)
                            .setGroupInstanceId(this.groupInstanceId.orElse(null))
                            .setGenerationId(generation.generationId)
                            .setAssignments(Collections.emptyList())
            );
    log.debug("Sending follower SyncGroup to coordinator {}: {}", this.coordinator, requestBuilder);
    return sendSyncGroupRequest(requestBuilder);
}
In onJoinLeader, the call performAssignment(joinResponse.data().leader(), joinResponse.data().protocolName(), joinResponse.data().members()) carries out the partition assignment, which the next section explains.
2. Determining each consumer's partitions
In performAssignment(joinResponse.data().leader(), joinResponse.data().protocolName(), joinResponse.data().members()), the value joinResponse.data().protocolName() is the name of the partition assignment strategy, which is chosen by Kafka.
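The partition assignment strategy, PartitionAssignor, has three built-in implementations:
- RangeAssignor: works topic by topic. It divides the number of partitions in a topic by the number of consumers subscribed to that topic to get a span (partitions in the topic / consumers subscribed to the topic), then gives each consumer a contiguous range of that size. When the division leaves a remainder, the first consumers each receive one extra partition.
- RoundRobinAssignor: sorts all consumers in the group and all partitions of the topics they subscribe to in lexicographic order, then hands the partitions to the consumers one at a time in turn. In other words, every consumer receives one partition per round, and the rounds repeat until all partitions have been assigned.
- StickyAssignor: keeps the assignment as balanced as possible, meaning that either the numbers of topic partitions assigned to consumers differ by at most one, or each consumer that has 2 or more fewer topic partitions than some other consumer cannot have any of those topic partitions transferred to it. When a reassignment occurs, it preserves as many existing assignments as possible, which saves some processing overhead when topic partitions move from one consumer to another.
The purpose of partition assignment is to spread a topic's partitions more evenly across the consumers, so that consumption load is better balanced.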
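To make the difference between the first two strategies concrete, here is a minimal sketch for a single topic. It assumes every consumer subscribes to that one topic and represents partitions by their index; the class and method names are hypothetical and this is not the code of RangeAssignor or RoundRobinAssignor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-topic illustration of range vs. round-robin assignment.
final class AssignorSketch {

    /** Range: contiguous blocks of size N / C; the first N % C consumers get one extra partition. */
    static Map<String, List<Integer>> rangeAssign(List<String> consumers, int partitionCount) {
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);
        int span = partitionCount / sorted.size();
        int extra = partitionCount % sorted.size();
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int count = span + (i < extra ? 1 : 0);
            List<Integer> partitions = new ArrayList<>();
            for (int j = 0; j < count; j++) {
                partitions.add(next++);
            }
            result.put(sorted.get(i), partitions);
        }
        return result;
    }

    /** Round robin: partitions are dealt to the sorted consumers one at a time, round after round. */
    static Map<String, List<Integer>> roundRobinAssign(List<String> consumers, int partitionCount) {
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : sorted) {
            result.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            result.get(sorted.get(p % sorted.size())).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> consumers = Arrays.asList("c1", "c2", "c3");
        System.out.println("range:       " + rangeAssign(consumers, 7));
        System.out.println("round robin: " + roundRobinAssign(consumers, 7));
    }
}
```

With 3 consumers and 7 partitions, the range variant gives the first consumer the block 0 to 2, while the round-robin variant gives it partitions 0, 3 and 6. Both end up with the same per-consumer counts for a single topic; the strategies diverge more visibly when several topics are involved, because range assignment repeats the "first consumers get the extra partition" rule for every topic.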
Implementation of performAssignment:
protected Map<String, ByteBuffer> performAssignment(String leaderId,
String assignmentStrategy,
List<JoinGroupResponseData.JoinGroupResponseMember> allSubscriptions) {
// look up the PartitionAssignor implementation by the name of the assignment strategy
PartitionAssignor assignor = lookupAssignor(assignmentStrategy);
if (assignor == null)
throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);
Set<String> allSubscribedTopics = new HashSet<>();
Map<String, Subscription> subscriptions = new HashMap<>();
// deserialize the subscription metadata of every consumer member
for (JoinGroupResponseData.JoinGroupResponseMember memberSubScription : allSubscriptions) {
Subscription subscription = ConsumerProtocol.deserializeSubscription(ByteBuffer.wrap(memberSubScription.metadata()));
subscriptions.put(memberSubScription.memberId(), subscription);
allSubscribedTopics.addAll(subscription.topics());
}
// the leader will begin watching for changes to any of the topics the group is interested in,
// which ensures that all metadata changes will eventually be seen
updateGroupSubscription(allSubscribedTopics);
isLeader = true;
// call assignor.assign(metadata.fetch(), subscriptions) to perform the partition assignment
Map<String, Assignment> assignment = assignor.assign(metadata.fetch(), subscriptions);
// the code below validates the assignment result, serializes it, and returns it
Set<String> assignedTopics = new HashSet<>();
for (Assignment assigned : assignment.values()) {
for (TopicPartition tp : assigned.partitions())
assignedTopics.add(tp.topic());
}
if (!assignedTopics.containsAll(allSubscribedTopics)) {
Set<String> notAssignedTopics = new HashSet<>(allSubscribedTopics);
notAssignedTopics.removeAll(assignedTopics);
}
if (!allSubscribedTopics.containsAll(assignedTopics)) {
Set<String> newlyAddedTopics = new HashSet<>(assignedTopics);
newlyAddedTopics.removeAll(allSubscribedTopics);
allSubscribedTopics.addAll(assignedTopics);
updateGroupSubscription(allSubscribedTopics);
}
assignmentSnapshot = metadataSnapshot;
Map<String, ByteBuffer> groupAssignment = new HashMap<>();
for (Map.Entry<String, Assignment> assignmentEntry : assignment.entrySet()) {
ByteBuffer buffer = ConsumerProtocol.serializeAssignment(assignmentEntry.getValue());
groupAssignment.put(assignmentEntry.getKey(), buffer);
}
// return the partition assignment result
return groupAssignment;
}
Once the leader has the partition assignment result, it calls sendSyncGroupRequest to send the result to Kafka.
Implementation of sendSyncGroupRequest:
private RequestFuture<ByteBuffer> sendSyncGroupRequest(SyncGroupRequest.Builder requestBuilder) {
    if (coordinatorUnknown())
        return RequestFuture.coordinatorNotAvailable();
    return client.send(coordinator, requestBuilder)
            .compose(new SyncGroupResponseHandler());
}
Implementation of SyncGroupResponseHandler, which handles the SyncGroup response:
private class SyncGroupResponseHandler extends CoordinatorResponseHandler<SyncGroupResponse, ByteBuffer> {
@Override
public void handle(SyncGroupResponse syncResponse,
RequestFuture<ByteBuffer> future) {
Errors error = syncResponse.error();
if (error == Errors.NONE) {
// the assignment was synchronized successfully
sensors.syncLatency.record(response.requestLatencyMs());
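// hand the partitions that Kafka assigned to this consumer to the future's success path
// this future is returned to initiateJoinGroup in AbstractCoordinator, which stores it as joinFuture and attaches a listener to it
// the same future is also checked for success in joinGroupIfNeeded in AbstractCoordinator; on success, onJoinComplete is called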
future.complete(ByteBuffer.wrap(syncResponse.data.assignment()));
} else {
// if the sync failed, mark that a rejoin is needed
requestRejoin();
if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
future.raise(new GroupAuthorizationException(groupId));
} else if (error == Errors.REBALANCE_IN_PROGRESS) {
log.debug("SyncGroup failed because the group began another rebalance");
future.raise(error);
} else if (error == Errors.FENCED_INSTANCE_ID) {
log.error("Received fatal exception: group.instance.id gets fenced");
future.raise(error);
} else if (error == Errors.UNKNOWN_MEMBER_ID
|| error == Errors.ILLEGAL_GENERATION) {
log.debug("SyncGroup failed: {}", error.message());
resetGeneration();
future.raise(error);
} else if (error == Errors.COORDINATOR_NOT_AVAILABLE
|| error == Errors.NOT_COORDINATOR) {
log.debug("SyncGroup failed: {}", error.message());
markCoordinatorUnknown();
future.raise(error);
} else {
future.raise(new KafkaException("Unexpected error from SyncGroup: " + error.message()));
}
}
}
}
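What happens after sendSyncGroupRequest receives a successful response:
- The success listener that initiateJoinGroup in AbstractCoordinator attached to joinFuture runs. It updates the rejoinNeeded state and enables the heartbeat thread.
- joinGroupIfNeeded in AbstractCoordinator checks the future for success and, if it succeeded, runs: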
if (future.succeeded()) {
    // get the partition assignment that Kafka returned for this consumer member
    ByteBuffer memberAssignment = future.value().duplicate();
    // run the post-join steps
    onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment);
    // We reset the join group future only after the completion callback returns. This ensures
    // that if the callback is woken up, we will retry it on the next joinGroupIfNeeded.
    resetJoinGroupFuture();
    needsJoinPrepare = true;
}
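onJoinComplete applies the assignment returned by Kafka to the local consumer. It deserializes the assignment and checks that it is consistent with the consumer's subscription. If it is, the consumer's assigned partitions are updated and the user's rebalance listener is invoked. On the leader, the metadata snapshot taken at assignment time is kept so that later metadata changes can be detected. Implementation: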
protected void onJoinComplete(int generation,
String memberId,
String assignmentStrategy,
ByteBuffer assignmentBuffer) {
// only the leader is responsible for monitoring for metadata changes (i.e. partition changes)
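// followers drop the snapshot here; the leader keeps the metadata snapshot taken at assignment time so that a mismatch with the current metadata can flag another round of assignment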
if (!isLeader)
assignmentSnapshot = null;
PartitionAssignor assignor = lookupAssignor(assignmentStrategy);
if (assignor == null)
throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);
Assignment assignment = ConsumerProtocol.deserializeAssignment(assignmentBuffer);
if (!subscriptions.assignFromSubscribed(assignment.partitions())) {
handleAssignmentMismatch(assignment);
return;
}
Set<TopicPartition> assignedPartitions = subscriptions.assignedPartitions();
// The leader may have assigned partitions which match our subscription pattern, but which
// were not explicitly requested, so we update the joined subscription here.
maybeUpdateJoinedSubscription(assignedPartitions);
// give the assignor a chance to update internal state based on the received assignment
assignor.onAssignment(assignment, generation);
// reschedule the auto commit starting from now
if (autoCommitEnabled)
this.nextAutoCommitTimer.updateAndReset(autoCommitIntervalMs);
// execute the user's callback after rebalance
ConsumerRebalanceListener listener = subscriptions.rebalanceListener();
log.info("Setting newly assigned partitions: {}", Utils.join(assignedPartitions, ", "));
try {
listener.onPartitionsAssigned(assignedPartitions);
} catch (WakeupException | InterruptException e) {
throw e;
} catch (Exception e) {
log.error("User provided listener {} failed on partition assignment", listener.getClass().getName(), e);
}
}
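The role of assignmentSnapshot for the leader can be sketched in isolation. The excerpt above only shows the snapshot being stored in performAssignment and cleared for non-leaders in onJoinComplete; the comparison in `rejoinNeeded()` below is an assumption about how the snapshot is used afterwards, and the class, its methods, and the topic-to-partition-count map are hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: the leader remembers the metadata it assigned against.
final class MetadataSnapshotSketch {

    private Map<String, Integer> metadataSnapshot;   // topic -> partition count, as currently known
    private Map<String, Integer> assignmentSnapshot; // metadata used for the last assignment; null on followers

    MetadataSnapshotSketch(Map<String, Integer> currentMetadata) {
        this.metadataSnapshot = new HashMap<>(currentMetadata);
    }

    /** Leader only: remember which metadata the assignment was computed from. */
    void onAssignmentPerformed() {
        assignmentSnapshot = metadataSnapshot;
    }

    /** Mirrors the first lines of onJoinComplete: followers do not monitor metadata. */
    void onJoinComplete(boolean isLeader) {
        if (!isLeader) {
            assignmentSnapshot = null;
        }
    }

    /** A metadata refresh replaces the current snapshot with a fresh copy. */
    void onMetadataUpdate(Map<String, Integer> latestMetadata) {
        metadataSnapshot = new HashMap<>(latestMetadata);
    }

    /** Assumed check: the leader asks for a new rebalance when the metadata has changed since assignment. */
    boolean rejoinNeeded() {
        return assignmentSnapshot != null && !assignmentSnapshot.equals(metadataSnapshot);
    }

    public static void main(String[] args) {
        Map<String, Integer> metadata = new HashMap<>();
        metadata.put("orders", 3);
        MetadataSnapshotSketch leader = new MetadataSnapshotSketch(metadata);
        leader.onAssignmentPerformed();
        leader.onJoinComplete(true);
        System.out.println("rejoin needed right after join: " + leader.rejoinNeeded());
        metadata.put("orders", 4); // the topic gains a partition
        leader.onMetadataUpdate(metadata);
        System.out.println("rejoin needed after partition change: " + leader.rejoinNeeded());
    }
}
```

This ties back to the third rebalance trigger listed at the start of the article: a change in the partition count of a subscribed topic is noticed by the leader, which then starts another round.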