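A Rebalance broadly consists of two steps: joining the group (JoinGroup) and syncing the group (SyncGroup).
Joining the group is the process in which each member of a consumer group sends a JoinGroupRequest to the Coordinator and asks to be admitted. This step has a timeout: a member that cannot complete the join within that time is excluded from this round of Rebalance.
Let's go straight to the handleJoinGroup method of GroupCoordinator, which handles the JoinGroup requests sent by consumer group members.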
def handleJoinGroup(groupId: String, // consumer group name
                    memberId: String, // consumer group member ID
                    groupInstanceId: Option[String], // group instance ID, identifies a static member
                    requireKnownMemberId: Boolean, // whether the member must present a non-empty member ID
                    clientId: String, // client.id value
                    clientHost: String, // host name of the consumer program
                    rebalanceTimeoutMs: Int, // Rebalance timeout, by default the max.poll.interval.ms value
                    sessionTimeoutMs: Int, // session timeout
                    protocolType: String, // protocol type
                    protocols: List[(String, Array[Byte])], // subscription metadata keyed by assignment strategy name
                    responseCallback: JoinCallback): Unit = { // callback
  // Validate the consumer group status
  validateGroupStatus(groupId, ApiKeys.JOIN_GROUP).foreach { error =>
    responseCallback(joinError(memberId, error))
    return
  }
  // Make sure sessionTimeoutMs lies within
  // [group.min.session.timeout.ms, group.max.session.timeout.ms];
  // otherwise respond with INVALID_SESSION_TIMEOUT to signal an invalid timeout setting
  if (sessionTimeoutMs < groupConfig.groupMinSessionTimeoutMs ||
      sessionTimeoutMs > groupConfig.groupMaxSessionTimeoutMs) {
    responseCallback(joinError(memberId, Errors.INVALID_SESSION_TIMEOUT))
  } else {
    // Whether the member ID is empty (unknown)
    val isUnknownMember = memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID
    // Look up the consumer group; if it does not exist, a new one may be created
    groupManager.getGroup(groupId) match {
      case None =>
        // The group has no metadata yet. If the member ID is empty, create an empty
        // metadata object for the group and go on to the join step. If the member ID
        // is not empty, reject the request with the "unknown member ID" error
        // through the callback.
        // only try to create the group if the group is UNKNOWN AND
        // the member id is UNKNOWN, if member is specified but group does not
        // exist we should reject the request.
        if (isUnknownMember) {
          val group = groupManager.addGroup(new GroupMetadata(groupId, Empty, time))
          // Run the join logic for a member with an empty ID
          doUnknownJoinGroup(group, groupInstanceId, requireKnownMemberId, clientId, clientHost,
            rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols, responseCallback)
        } else {
          responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
        }
      case Some(group) =>
        group.inLock {
          // If the consumer group is already full
          if ((groupIsOverCapacity(group)
                && group.has(memberId) && !group.get(memberId).isAwaitingJoin) // oversized group, need to shed members that haven't joined yet
              || (isUnknownMember && group.size >= groupConfig.groupMaxSize)) { // a new member arrives, but the group has already reached the broker-side group.max.size
            // Remove this member from the group
            group.remove(memberId)
            group.removeStaticMember(groupInstanceId)
            // Respond with an error saying the group is full
            responseCallback(joinError(JoinGroupRequest.UNKNOWN_MEMBER_ID, Errors.GROUP_MAX_SIZE_REACHED))
          } else if (isUnknownMember) {
            // Run the join logic for a member with an empty ID
            doUnknownJoinGroup(group, groupInstanceId, requireKnownMemberId, clientId, clientHost,
              rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols, responseCallback)
          } else {
            // Run the join logic for a member with a non-empty ID
            doJoinGroup(group, memberId, groupInstanceId, clientId, clientHost,
              rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols, responseCallback)
          }
          // attempt to complete JoinGroup
          // If the consumer group is in the PreparingRebalance state
          if (group.is(PreparingRebalance)) {
            // Check whether the delayed join operation held in the Purgatory can be completed now
            joinPurgatory.checkAndComplete(GroupKey(group.groupId))
          }
        }
    }
  }
}
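The routing that handleJoinGroup performs can be condensed into a small pure function. This is only a minimal sketch under stated assumptions: the `Route` type, the `route` function and its parameters are hypothetical names introduced for illustration, the existing group is reduced to its size, and the branch that sheds not-yet-joined members from an oversized group is left out.

```scala
object JoinRoutingSketch {
  // Hypothetical outcome type. The real method calls responseCallback or a helper instead.
  sealed trait Route
  case object InvalidSessionTimeout extends Route
  case object UnknownMemberId extends Route
  case object GroupMaxSizeReached extends Route
  case object CreateGroupThenUnknownJoin extends Route
  case object UnknownJoin extends Route
  case object KnownJoin extends Route

  // An empty string plays the role of JoinGroupRequest.UNKNOWN_MEMBER_ID.
  val UnknownMemberIdValue = ""

  def route(memberId: String,
            sessionTimeoutMs: Int,
            minSessionTimeoutMs: Int,
            maxSessionTimeoutMs: Int,
            existingGroupSize: Option[Int], // None means the group does not exist yet
            groupMaxSize: Int): Route = {
    val isUnknownMember = memberId == UnknownMemberIdValue
    if (sessionTimeoutMs < minSessionTimeoutMs || sessionTimeoutMs > maxSessionTimeoutMs)
      InvalidSessionTimeout
    else existingGroupSize match {
      // A group is only created for a member that has no ID yet.
      case None if isUnknownMember => CreateGroupThenUnknownJoin
      case None => UnknownMemberId
      // A new member cannot enter a group that is already at capacity.
      case Some(size) if isUnknownMember && size >= groupMaxSize => GroupMaxSizeReached
      case Some(_) if isUnknownMember => UnknownJoin
      case Some(_) => KnownJoin
    }
  }
}
```

The timeout bounds and sizes passed in are arbitrary illustration values, not broker defaults.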
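The actual join logic is carried out by two methods, doUnknownJoinGroup and doJoinGroup.
The doUnknownJoinGroup method
A brand-new member joining the group goes through doUnknownJoinGroup, because at this point its Member ID has not been generated yet.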
private def doUnknownJoinGroup(group: GroupMetadata,
                               groupInstanceId: Option[String],
                               requireKnownMemberId: Boolean,
                               clientId: String,
                               clientHost: String,
                               rebalanceTimeoutMs: Int,
                               sessionTimeoutMs: Int,
                               protocolType: String,
                               protocols: List[(String, Array[Byte])],
                               responseCallback: JoinCallback): Unit = {
  group.inLock {
    // The group is in the Dead state
    if (group.is(Dead)) {
      // if the group is marked as dead, it means some other thread has just removed the group
      // from the coordinator metadata; it is likely that the group has migrated to some other
      // coordinator OR the group is in a transient unstable phase. Let the member retry
      // finding the correct coordinator and rejoin.
      // Respond with an error through the callback
      responseCallback(joinError(JoinGroupRequest.UNKNOWN_MEMBER_ID, Errors.COORDINATOR_NOT_AVAILABLE))
    } else if (!group.supportsProtocols(protocolType, MemberMetadata.plainProtocolSet(protocols))) {
      // The member's protocol type or partition assignment strategies do not match those of the group
      responseCallback(joinError(JoinGroupRequest.UNKNOWN_MEMBER_ID, Errors.INCONSISTENT_GROUP_PROTOCOL))
    } else {
      // Generate a member ID for this member. The format is clientId-UUID
      // (for a static member the prefix is the group instance ID instead of the client ID).
      val newMemberId = group.generateMemberId(clientId, groupInstanceId)
      // If the group already knows a static member with this group instance ID
      if (group.hasStaticMember(groupInstanceId)) {
        val oldMemberId = group.getStaticMemberId(groupInstanceId)
        info(s"Static member $groupInstanceId with unknown member id rejoins, assigning new member id $newMemberId, while " +
          s"old member $oldMemberId will be removed.")
        val currentLeader = group.leaderOrNull
        val member = group.replaceGroupInstance(oldMemberId, newMemberId, groupInstanceId)
        // Heartbeat of old member id will expire without effect since the group no longer contains that member id.
        // New heartbeat shall be scheduled with new member id.
        completeAndScheduleNextHeartbeatExpiration(group, member)
        val knownStaticMember = group.get(newMemberId)
        group.updateMember(knownStaticMember, protocols, responseCallback)
        group.currentState match {
          case Stable | CompletingRebalance =>
            info(s"Static member joins during ${group.currentState} stage will not trigger rebalance.")
            group.maybeInvokeJoinCallback(member, JoinGroupResult(
              members = List.empty,
              memberId = newMemberId,
              generationId = group.generationId,
              subProtocol = group.protocolOrNull,
              // We want to avoid current leader performing trivial assignment while the group
              // is in stable/awaiting sync stage, because the new assignment in leader's next sync call
              // won't be broadcast by a stable/awaiting sync group. This could be guaranteed by
              // always returning the old leader id so that the current leader won't assume itself
              // as a leader based on the returned message, since the new member.id won't match
              // returned leader id, therefore no assignment will be performed.
              leaderId = currentLeader,
              error = Errors.NONE))
          case Empty | Dead =>
            throw new IllegalStateException(s"Group ${group.groupId} was not supposed to be " +
              s"in the state ${group.currentState} when the unknown static member $groupInstanceId rejoins.")
          case PreparingRebalance =>
        }
      } else if (requireKnownMemberId) {
        // The member is required to present a non-empty member ID
        // If member id required (dynamic membership), register the member in the pending member list
        // and send back a response to call for another join group request with allocated member id.
        debug(s"Dynamic member with unknown member id rejoins group ${group.groupId} in " +
          s"${group.currentState} state. Created a new member id $newMemberId and request the member to rejoin with this id.")
        // Add the member to the Pending Member List, then respond with an error that carries
        // the generated member ID. This bounces the join request back, so the member
        // re-applies once it holds the assigned member ID.
        group.addPendingMember(newMemberId)
        addPendingMemberExpiration(group, newMemberId, sessionTimeoutMs)
        responseCallback(joinError(newMemberId, Errors.MEMBER_ID_REQUIRED))
      } else {
        debug(s"Dynamic member with unknown member id rejoins group ${group.groupId} in " +
          s"${group.currentState} state. Created a new member id $newMemberId for this member and add to the group.")
        // Add the member to the consumer group
        addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, newMemberId, groupInstanceId,
          clientId, clientHost, protocolType, protocols, group, responseCallback)
      }
    }
  }
}
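The two-round handshake for a dynamic member (first request bounced with MEMBER_ID_REQUIRED, second request accepted) can be sketched with a toy in-memory group. This is a simplified illustration, not the Coordinator's implementation: `TinyGroup`, `JoinOutcome` and their methods are hypothetical names, and timeouts, locking, static members and the rebalance itself are omitted.

```scala
import java.util.UUID
import scala.collection.mutable

object PendingMemberSketch {
  // Hypothetical outcomes standing in for the callback results.
  sealed trait JoinOutcome
  final case class MemberIdRequired(assignedId: String) extends JoinOutcome
  final case class Joined(memberId: String) extends JoinOutcome
  case object UnknownMember extends JoinOutcome

  class TinyGroup {
    private val pending = mutable.Set.empty[String]
    private val members = mutable.Set.empty[String]

    // ID format: <prefix>-<UUID>, where the prefix is the group instance ID
    // for a static member and the client ID otherwise.
    def generateMemberId(clientId: String, groupInstanceId: Option[String]): String =
      groupInstanceId.getOrElse(clientId) + "-" + UUID.randomUUID().toString

    def join(memberId: String, clientId: String, requireKnownMemberId: Boolean): JoinOutcome =
      if (memberId.isEmpty) {
        val newId = generateMemberId(clientId, None)
        if (requireKnownMemberId) {
          // Round 1: remember the ID as pending and bounce the request back.
          pending += newId
          MemberIdRequired(newId)
        } else {
          members += newId
          Joined(newId)
        }
      } else if (pending.remove(memberId)) {
        // Round 2: a pending member returns with its assigned ID and is admitted.
        members += memberId
        Joined(memberId)
      } else if (members.contains(memberId)) Joined(memberId)
      else UnknownMember

    def isPending(id: String): Boolean = pending.contains(id)
    def isMember(id: String): Boolean = members.contains(id)
  }
}
```

A client would call `join("", ...)` first, read the assigned ID out of `MemberIdRequired`, and then call `join` again with that ID.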
The doJoinGroup method
This method runs the join logic for members that already carry a member ID.
private def doJoinGroup(group: GroupMetadata,
                        memberId: String,
                        groupInstanceId: Option[String],
                        clientId: String,
                        clientHost: String,
                        rebalanceTimeoutMs: Int,
                        sessionTimeoutMs: Int,
                        protocolType: String,
                        protocols: List[(String, Array[Byte])],
                        responseCallback: JoinCallback): Unit = {
  group.inLock {
    // Part 1: validation and precondition checks.
    // If the group is Dead, respond with COORDINATOR_NOT_AVAILABLE through the callback
    if (group.is(Dead)) {
      // if the group is marked as dead, it means some other thread has just removed the group
      // from the coordinator metadata; this is likely that the group has migrated to some other
      // coordinator OR the group is in a transient unstable phase. Let the member retry
      // finding the correct coordinator and rejoin.
      responseCallback(joinError(memberId, Errors.COORDINATOR_NOT_AVAILABLE))
    } else if (!group.supportsProtocols(protocolType, MemberMetadata.plainProtocolSet(protocols))) {
      // If the protocol type or partition assignment strategies do not match those of the group,
      // respond with INCONSISTENT_GROUP_PROTOCOL through the callback
      responseCallback(joinError(memberId, Errors.INCONSISTENT_GROUP_PROTOCOL))
    } else if (group.isPendingMember(memberId)) {
      // A pending member is one that was handed a member ID earlier but has not formally joined yet.
      // Since it now presents that ID, it is allowed in: call addMemberAndRebalance directly.
      // A rejoining pending member will be accepted. Note that pending member will never be a static member.
      if (groupInstanceId.isDefined) {
        throw new IllegalStateException(s"the static member $groupInstanceId was not expected to be assigned " +
          s"into pending member bucket with member id $memberId")
      } else {
        // Let it join the group
        addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, memberId, groupInstanceId,
          clientId, clientHost, protocolType, protocols, group, responseCallback)
      }
    } else {
      // Part 2: handle a join request from a non-pending member.
      val groupInstanceIdNotFound = groupInstanceId.isDefined && !group.hasStaticMember(groupInstanceId)
      if (group.isStaticMemberFenced(memberId, groupInstanceId)) {
        // given member id doesn't match with the groupInstanceId. Inform duplicate instance to shut down immediately.
        responseCallback(joinError(memberId, Errors.FENCED_INSTANCE_ID))
      } else if (!group.has(memberId) || groupInstanceIdNotFound) {
        // If the dynamic member trying to register with an unrecognized id, or
        // the static member joins with unknown group instance id, send the response to let
        // it reset its member id and retry.
        responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
      } else {
        // Fetch the metadata of this member
        val member = group.get(memberId)
        group.currentState match {
          case PreparingRebalance =>
            // PreparingRebalance means the group is about to start a Rebalance,
            // so update the member information and start preparing the Rebalance.
            updateMemberAndRebalance(group, member, protocols, responseCallback)
          case CompletingRebalance =>
            // If the member rejoins with the same metadata it sent before
            if (member.matches(protocols)) {
              // member is joining with the same metadata (which could be because it failed to
              // receive the initial JoinGroup response), so just return current group information
              // for the current generation.
              // The member's assignment strategies and subscription match the stored record,
              // so it has most likely already sent a JoinGroup request that the Coordinator
              // accepted, but never received the response. In that case, build a
              // JoinGroupResult and return the current group information to the member.
              responseCallback(JoinGroupResult(
                members = if (group.isLeader(memberId)) {
                  group.currentMemberMetadata
                } else {
                  List.empty
                },
                memberId = memberId,
                generationId = group.generationId,
                subProtocol = group.protocolOrNull,
                leaderId = group.leaderOrNull,
                error = Errors.NONE))
            } else {
              // member has changed metadata, so force a rebalance
              // Otherwise the member changed its subscription or assignment strategies:
              // update the member information and start preparing a Rebalance
              updateMemberAndRebalance(group, member, protocols, responseCallback)
            }
          // The group is in the Stable state
          case Stable =>
            val member = group.get(memberId)
            // If the member is the Leader, or it changed its assignment strategies or subscription metadata,
            // call updateMemberAndRebalance to force a new Rebalance.
            if (group.isLeader(memberId) || !member.matches(protocols)) {
              // force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
              // The latter allows the leader to trigger rebalances for changes affecting assignment
              // which do not affect the member metadata (such as topic metadata changes for the consumer)
              // Update the member information and start preparing a Rebalance
              updateMemberAndRebalance(group, member, protocols, responseCallback)
            } else { // Otherwise just return the current group information, so the member can move on to the next Rebalance step.
              // for followers with no actual change to their metadata, just return group information
              // for the current generation which will allow them to issue SyncGroup
              responseCallback(JoinGroupResult(
                members = List.empty,
                memberId = memberId,
                generationId = group.generationId,
                subProtocol = group.protocolOrNull,
                leaderId = group.leaderOrNull,
                error = Errors.NONE))
            }
          // In the Empty or Dead state, respond with an error through the callback
          case Empty | Dead =>
            // Group reaches unexpected state. Let the joining member reset their generation and rejoin.
            warn(s"Attempt to add rejoining member $memberId of group ${group.groupId} in " +
              s"unexpected group state ${group.currentState}")
            responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
        }
      }
    }
  }
}
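Part 2 of doJoinGroup is essentially a decision table over the group state, whether the caller is the leader, and whether its metadata is unchanged. The following sketch restates that table as a pure function. The `GroupState` objects here are local stand-ins for the Coordinator's states, and `Action` and `decide` are hypothetical names used only for illustration.

```scala
object KnownMemberJoinSketch {
  // Local stand-ins for the group states used in the match above.
  sealed trait GroupState
  case object PreparingRebalance extends GroupState
  case object CompletingRebalance extends GroupState
  case object Stable extends GroupState
  case object Empty extends GroupState
  case object Dead extends GroupState

  // Hypothetical summary of what the real code does in each branch.
  sealed trait Action
  case object UpdateAndRebalance extends Action      // updateMemberAndRebalance
  case object ReturnCurrentGeneration extends Action // reply with a JoinGroupResult
  case object RejectUnknownMemberId extends Action   // reply with UNKNOWN_MEMBER_ID

  def decide(state: GroupState, isLeader: Boolean, metadataMatches: Boolean): Action =
    state match {
      case PreparingRebalance => UpdateAndRebalance
      case CompletingRebalance =>
        if (metadataMatches) ReturnCurrentGeneration else UpdateAndRebalance
      case Stable =>
        // A leader rejoining always forces a rebalance, even with unchanged metadata.
        if (isLeader || !metadataMatches) UpdateAndRebalance else ReturnCurrentGeneration
      case Empty | Dead => RejectUnknownMemberId
    }
}
```

Note the asymmetry: in CompletingRebalance the leader flag plays no role in the decision, while in Stable it alone is enough to trigger a rebalance.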
This code calls updateMemberAndRebalance frequently. The method does just two things.
private def updateMemberAndRebalance(group: GroupMetadata,
                                     member: MemberMetadata,
                                     protocols: List[(String, Array[Byte])],
                                     callback: JoinCallback): Unit = {
  // Update the member information by calling GroupMetadata's updateMember method
  group.updateMember(member, protocols, callback)
  // The core idea of this step: move the group to the PreparingRebalance state, then create a
  // DelayedJoin object and hand it to the Purgatory, where the join waits for delayed processing.
  maybePrepareRebalance(group, s"Updating metadata for member ${member.memberId}")
}
The addMemberAndRebalance method
addMemberAndRebalance adds a member to the consumer group and then prepares a Rebalance.
private def addMemberAndRebalance(rebalanceTimeoutMs: Int,
                                  sessionTimeoutMs: Int,
                                  memberId: String,
                                  groupInstanceId: Option[String],
                                  clientId: String,
                                  clientHost: String,
                                  protocolType: String,
                                  protocols: List[(String, Array[Byte])],
                                  group: GroupMetadata,
                                  callback: JoinCallback): Unit = {
  // Create the MemberMetadata instance
  val member = new MemberMetadata(memberId, group.groupId, groupInstanceId,
    clientId, clientHost, rebalanceTimeoutMs,
    sessionTimeoutMs, protocolType, protocols)
  // Mark the member as new; the isNew field is tied to heartbeat handling
  member.isNew = true
  // update the newMemberAdded flag to indicate that the join group can be further delayed
  // If the group is preparing its first Rebalance, set newMemberAdded to true
  if (group.is(PreparingRebalance) && group.generationId == 0)
    group.newMemberAdded = true
  // Add the member to the consumer group through GroupMetadata's add method
  group.add(member, callback)
  // The session timeout does not affect new members since they do not have their memberId and
  // cannot send heartbeats. Furthermore, we cannot detect disconnects because sockets are muted
  // while the JoinGroup is in purgatory. If the client does disconnect (e.g. because of a request
  // timeout during a long rebalance), they may simply retry which will lead to a lot of defunct
  // members in the rebalance. To prevent this going on indefinitely, we timeout JoinGroup requests
  // for new members. If the new member is still there, we expect it to retry.
  // Schedule the next heartbeat expiration, using NewMemberJoinTimeoutMs as the timeout
  completeAndScheduleNextExpiration(group, member, NewMemberJoinTimeoutMs)
  if (member.isStaticMember)
    group.addStaticMember(groupInstanceId, memberId)
  else
    // Remove the member from the pending member list: it has formally joined the group,
    // so it no longer needs to stay there.
    group.removePendingMember(memberId)
  // Prepare a Rebalance.
  maybePrepareRebalance(group, s"Adding new member $memberId with group instanceid $groupInstanceId")
}
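The bookkeeping in addMemberAndRebalance can be reproduced on a toy model to make the side effects visible. This is a rough sketch under simplifying assumptions: `TinyGroup`, `Member` and the `rebalanceRequested` flag are hypothetical, the flag merely stands in for the call to maybePrepareRebalance, and heartbeat scheduling and callbacks are left out entirely.

```scala
import scala.collection.mutable

object AddMemberSketch {
  final case class Member(memberId: String, groupInstanceId: Option[String], var isNew: Boolean = false)

  // Toy group: only the fields that addMemberAndRebalance touches.
  class TinyGroup(var preparingRebalance: Boolean, var generationId: Int) {
    val members = mutable.Map.empty[String, Member]
    val staticMembers = mutable.Map.empty[String, String] // group instance ID -> member ID
    val pendingMembers = mutable.Set.empty[String]
    var newMemberAdded = false
    var rebalanceRequested = false // stand-in for maybePrepareRebalance
  }

  def addMemberAndRebalance(group: TinyGroup, memberId: String, groupInstanceId: Option[String]): Unit = {
    val member = Member(memberId, groupInstanceId)
    member.isNew = true
    // Only a member arriving during the very first rebalance sets the flag.
    if (group.preparingRebalance && group.generationId == 0)
      group.newMemberAdded = true
    group.members(memberId) = member
    groupInstanceId match {
      case Some(instanceId) => group.staticMembers(instanceId) = memberId // static member
      case None => group.pendingMembers -= memberId                       // dynamic member
    }
    group.rebalanceRequested = true
  }
}
```

Running it for a dynamic member that sits in `pendingMembers` shows the member moving from the pending set into `members`, which matches the comment in the original code about no longer needing to stay in the pending list.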