GroupCoordinator Source Code: FIND_COORDINATOR, JOIN_GROUP, SYNC_GROUP
The GroupCoordinator Class Comment
The class comment of GroupCoordinator already tells us a lot: it handles general group membership and offset management, and those two responsibilities are what make it important. It also notes that some delayed operations are synchronized on the group lock.
/**
* GroupCoordinator handles general group membership and offset management.
*
* Each Kafka server instantiates a coordinator which is responsible for a set of
* groups. Groups are assigned to coordinators based on their group names.
* <p>
* <b>Delayed operation locking notes:</b>
* Delayed operations in GroupCoordinator use `group` as the delayed operation
* lock. ReplicaManager.appendRecords may be invoked while holding the group lock
* used by its callback. The delayed callback may acquire the group lock
* since the delayed operation is completed only if the group lock can be acquired.
*/
With that in mind, a few questions to carry through the rest of the article:
- How does GroupCoordinator manage group membership?
- How does GroupCoordinator manage offsets?
- What does GroupCoordinator do when it starts up?
- How does GroupCoordinator handle delayed operations?
From KafkaServer's startup path:
- the GroupCoordinator is instantiated
- the GroupCoordinator is started
/** kafka.server.KafkaServer#startup **/
...
groupCoordinator = GroupCoordinator(config, zkClient, replicaManager, Time.SYSTEM)
groupCoordinator.startup()
...
/** kafka.server.KafkaServer#startup **/
During instantiation you can see that:
- two DelayedOperationPurgatory instances are created (see the timing-wheel article); they back the delayed operations for Heartbeat and Rebalance
- the offset and group configuration is loaded: offsetConfig and groupConfig
- the GroupMetadataManager (the group metadata manager) is initialized
- a GroupCoordinator instance is returned
/** kafka.coordinator.group.GroupCoordinator (companion object) **/
object GroupCoordinator {
val NoState = ""
val NoProtocolType = ""
val NoProtocol = ""
val NoLeader = ""
val NoGeneration = -1
val NoMemberId = ""
val NoMembers = List[MemberSummary]()
val EmptyGroup = GroupSummary(NoState, NoProtocolType, NoProtocol, NoMembers)
val DeadGroup = GroupSummary(Dead.toString, NoProtocolType, NoProtocol, NoMembers)
def apply(config: KafkaConfig,
zkClient: KafkaZkClient,
replicaManager: ReplicaManager,
time: Time): GroupCoordinator = {
// create the two DelayedOperationPurgatory instances that back the delayed Heartbeat and Rebalance operations
val heartbeatPurgatory = DelayedOperationPurgatory[DelayedHeartbeat]("Heartbeat", config.brokerId)
val joinPurgatory = DelayedOperationPurgatory[DelayedJoin]("Rebalance", config.brokerId)
apply(config, zkClient, replicaManager, heartbeatPurgatory, joinPurgatory, time)
}
private[group] def offsetConfig(config: KafkaConfig) = OffsetConfig(
maxMetadataSize = config.offsetMetadataMaxSize,
loadBufferSize = config.offsetsLoadBufferSize,
offsetsRetentionMs = config.offsetsRetentionMinutes * 60L * 1000L,
offsetsRetentionCheckIntervalMs = config.offsetsRetentionCheckIntervalMs,
offsetsTopicNumPartitions = config.offsetsTopicPartitions,
offsetsTopicSegmentBytes = config.offsetsTopicSegmentBytes,
offsetsTopicReplicationFactor = config.offsetsTopicReplicationFactor,
offsetsTopicCompressionCodec = config.offsetsTopicCompressionCodec,
offsetCommitTimeoutMs = config.offsetCommitTimeoutMs,
offsetCommitRequiredAcks = config.offsetCommitRequiredAcks
)
def apply(config: KafkaConfig,
zkClient: KafkaZkClient,
replicaManager: ReplicaManager,
heartbeatPurgatory: DelayedOperationPurgatory[DelayedHeartbeat],
joinPurgatory: DelayedOperationPurgatory[DelayedJoin],
time: Time): GroupCoordinator = {
// load the offset- and group-related configuration
val offsetConfig = this.offsetConfig(config)
val groupConfig = GroupConfig(groupMinSessionTimeoutMs = config.groupMinSessionTimeoutMs,
groupMaxSessionTimeoutMs = config.groupMaxSessionTimeoutMs,
groupInitialRebalanceDelayMs = config.groupInitialRebalanceDelay)
// the GroupMetadataManager does the heavy lifting: it owns the group metadata cache and reads/writes the internal offsets topic
val groupMetadataManager = new GroupMetadataManager(config.brokerId, config.interBrokerProtocolVersion,
offsetConfig, replicaManager, zkClient, time)
new GroupCoordinator(config.brokerId, groupConfig, offsetConfig, groupMetadataManager, heartbeatPurgatory, joinPurgatory, time)
}
}
/** kafka.coordinator.group.GroupCoordinator (companion object) **/
After instantiation, the GroupCoordinator instance is started. From the code so far it is apparent that most of GroupCoordinator's work is delegated to GroupMetadataManager, so what does GroupMetadataManager contain that lets it carry that load?
Looking at class GroupMetadataManager and object GroupMetadataManager, the main members are (paraphrased declarations follow the list):
- scheduler: a KafkaScheduler, i.e. a background thread pool
- loadingPartitions: a Set of the __consumer_offsets partitions whose groups are currently being loaded
- ownedPartitions: a Set of the __consumer_offsets partitions that this coordinator has finished loading and now owns
- openGroupsForProducer: a HashMap[Long, Set[String]] that maps a transactional producerId to the groups with pending transactional offset commits
- groupMetadataTopicPartitionCount: the number of partitions of the consumer metadata topic (__consumer_offsets)
- groupMetadataCache: a Pool[String, GroupMetadata], a thin wrapper around a ConcurrentHashMap, holding the GroupMetadata entries
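To make those members concrete, here is roughly how they are declared; this is a paraphrase of the Kafka 2.x source, so treat the exact types and the thread-name prefix as approximations rather than verbatim code:
// paraphrased field declarations of GroupMetadataManager (approximate, not verbatim)
private val groupMetadataCache = new Pool[String, GroupMetadata]                   // groupId -> GroupMetadata
private val loadingPartitions: mutable.Set[Int] = mutable.Set()                    // __consumer_offsets partitions being loaded
private val ownedPartitions: mutable.Set[Int] = mutable.Set()                      // __consumer_offsets partitions this broker owns
private val openGroupsForProducer = mutable.HashMap[Long, mutable.Set[String]]()   // producerId -> groups with pending txn offsets
private val scheduler = new KafkaScheduler(threads = 1, threadNamePrefix = "group-metadata-manager-")
private val groupMetadataTopicPartitionCount = getGroupMetadataTopicPartitionCount // partition count of __consumer_offsets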
What Is GroupMetadata?
GroupMetadata was mentioned above; what exactly does it hold?
In the code, the constructor is GroupMetadata(val groupId: String, initialState: GroupState), so it takes a groupId and an initial state. From the call sites we can tell the initial state is Empty.
Looking further at all the GroupState values (a transition sketch follows the list):
- Empty: the group currently has no members, but it lingers until all of its offsets have expired
- PreparingRebalance: the group is preparing to rebalance
- CompletingRebalance: the group is waiting for the leader to hand out the assignment
- Stable: the group is in a stable state
- Dead: the group has no members left and its metadata is being removed
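A quick sketch of the legal transitions between those states, reconstructed from the validPreviousStates declared on each GroupState; read it as a summary rather than verbatim source:
// which states each state may be entered from (summary of GroupState.validPreviousStates)
val validPreviousStates: Map[String, Set[String]] = Map(
  "Empty"               -> Set("PreparingRebalance"),
  "PreparingRebalance"  -> Set("Empty", "Stable", "CompletingRebalance"),
  "CompletingRebalance" -> Set("PreparingRebalance"),
  "Stable"              -> Set("CompletingRebalance"),
  "Dead"                -> Set("Empty", "PreparingRebalance", "CompletingRebalance", "Stable", "Dead")
)
So the normal life cycle of a healthy group is Empty -> PreparingRebalance -> CompletingRebalance -> Stable, and any state can fall into Dead.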
With the states covered, the important fields to watch are: state, members, offsets, pendingOffsetCommits and pendingTransactionalOffsetCommits.
private var state: GroupState = initialState
// the group's members, keyed by memberId
private val members = new mutable.HashMap[String, MemberMetadata]
// per TopicPartition, the CommitRecordMetadataAndOffset (which wraps the committed offset value and an OffsetAndMetadata carrying the commit and expiry timestamps)
private val offsets = new mutable.HashMap[TopicPartition, CommitRecordMetadataAndOffset]
private val pendingOffsetCommits = new mutable.HashMap[TopicPartition, OffsetAndMetadata]
private val pendingTransactionalOffsetCommits = new mutable.HashMap[Long, mutable.Map[TopicPartition, CommitRecordMetadataAndOffset]]()
By now we can see that the metadata layer carries several important pieces of information: the brokerId (on the manager side), the TopicPartitions, the offset information and the GroupState.
The GroupCoordinator Startup Process
With that background, what does GroupMetadataManager touch when the coordinator starts up? (A minimal startup sketch follows this list.)
- the scheduler thread pool is started
- a recurring task named delete-expired-group-metadata is added to the scheduler to clean up expired group metadata
- that task fetches every GroupMetadata from groupMetadataCache and calls group.removeExpiredOffsets() on each, which:
  - removes the expired entries from offsets
  - returns the expired entries as a Map
- the core function is cleanupGroupMetadata(groups: Iterable[GroupMetadata], selector: GroupMetadata => Map[TopicPartition, OffsetAndMetadata]):
  - it iterates over groups, pulls each group's expired OffsetAndMetadata via selector, and transitions the group to Dead when both of the following hold:
    - group.is(Empty)
    - !(offsets.nonEmpty || pendingOffsetCommits.nonEmpty || pendingTransactionalOffsetCommits.nonEmpty)
Let's pause here. Following the scenario-driven approach: the coordinator has only just started and no member has joined yet, so the rest of this cleanup path is not exercised.
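A minimal sketch of what that startup wiring looks like, assuming the names used above (scheduler, cleanupGroupMetadata, the OffsetConfig fields); the real signatures differ only slightly:
// sketch only: GroupMetadataManager startup schedules the metadata-expiration task
def startup(enableMetadataExpiration: Boolean): Unit = {
  scheduler.startup()                                    // start the background thread pool
  if (enableMetadataExpiration) {
    scheduler.schedule(name = "delete-expired-group-metadata",
      fun = () => cleanupGroupMetadata(),                // walk groupMetadataCache, expire offsets, mark empty groups Dead
      period = config.offsetsRetentionCheckIntervalMs,   // offsets.retention.check.interval.ms
      unit = TimeUnit.MILLISECONDS)
  }
}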
Suppose a broker has started and the server-side GroupCoordinator is now running. What happens next? Judging from the GroupMetadata state machine, a group starts out as Empty because nothing has happened yet, and there are only two ways for that to change: a new member joins, or something expires. Clearly the interesting path here is a new member showing up.
A guess: with the server-side GroupCoordinator running, a brand-new consumer group appears. It first has to find its coordinator, and then send a JOIN_GROUP request to join, i.e.:
- FIND_COORDINATOR
- JOIN_GROUP
FIND_COORDINATOR: Who Is the Coordinator?
So who is the coordinator? With a single broker the answer is obviously that broker itself, but Kafka normally runs as a cluster of brokers, so which one gets to be the coordinator for a given group? Stepping through the FIND_COORDINATOR request, the answer turns out to be very direct. First, groupMetadataTopicPartitionCount is the partition count of Kafka's internal topic __consumer_offsets, 50 by default. The rule is simply hash(groupId) % partitions(__consumer_offsets) = partition id, and the broker hosting that partition (its leader replica) acts as the group's coordinator; it is not a matter of picking the least-loaded broker. The guess is easy to confirm in the code, as shown below.
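The routing rule maps directly onto this small helper in GroupMetadataManager (paraphrased from the source):
// hash the groupId onto one of the __consumer_offsets partitions (50 by default);
// the broker leading that partition is the group's coordinator
def partitionFor(groupId: String): Int =
  Utils.abs(groupId.hashCode) % groupMetadataTopicPartitionCount
For example, with the default 50 partitions a group lands on partition Utils.abs(groupId.hashCode) % 50, and the FIND_COORDINATOR response points the client at the leader of that __consumer_offsets partition.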
JOIN_GROUP: A New Consumer Group Shows Up and Asks to Join
Once the coordinator has been located, the consumer talks to it directly. The first thing it sends is a JOIN_GROUP request, asking to join the group. You can trace this from
case ApiKeys.JOIN_GROUP => handleJoinGroupRequest(request) downwards. The rough call chain:
==>kafka.server.KafkaApis#handleJoinGroupRequest
==>kafka.coordinator.group.GroupCoordinator#handleJoinGroup
/**
* public static final Field.Str MEMBER_ID = new Field.Str("member_id",
* "The member id assigned by the group coordinator or null if joining for the first time.");
*/
// the very first request from a new group takes this path
// a GroupMetadata is created and added to groupMetadataCache; the freshly added group's state is Empty
// the join request itself is then handled by doJoinGroup
val group = groupManager.addGroup(new GroupMetadata(groupId, initialState = Empty))
doJoinGroup(group, memberId, clientId, clientHost, rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols, responseCallback)
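As an aside, groupManager.addGroup is essentially a put-if-absent into the cache; a sketch paraphrasing the source:
// GroupMetadataManager#addGroup (paraphrased): register the group in groupMetadataCache,
// returning the existing instance if another thread registered the same groupId first
def addGroup(group: GroupMetadata): GroupMetadata = {
  val currentGroup = groupMetadataCache.putIfNotExists(group.groupId, group)
  if (currentGroup != null) currentGroup else group
}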
==>kafka.coordinator.group.GroupCoordinator#doJoinGroup
// branch on group.currentState
// the first time through, the group's state is Empty
case Empty | Stable =>
if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
// if the member id is unknown, register the member to the group
// a brand-new member joins with an unknown memberId, so this is the branch taken
/**
 * Adding a new member:
 * 1. allocate a member id: memberId = clientId + "-" + group.generateMemberIdSuffix
 * 2. create a MemberMetadata object
 * 3. group.add(member) ==> if (leaderId.isEmpty) leaderId = Some(member.memberId)
 * 4. transition the group to PreparingRebalance
 * 5. add a delayed task to the purgatory: joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
 */
addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
protocols, group, responseCallback)
} else {
val member = group.get(memberId)
// this branch handles an existing member sending JoinGroup again
if (group.isLeader(memberId) || !member.matches(protocols)) {
// force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
// The latter allows the leader to trigger rebalances for changes affecting assignment
// which do not affect the member metadata (such as topic metadata changes for the consumer)
// force a rebalance if the member changed its metadata or if it is the leader sending JoinGroup
/**
 * 1. transition the group to PreparingRebalance
 * 2. add a delayed task to the purgatory: joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
 */
updateMemberAndRebalance(group, member, protocols, responseCallback)
} else {
// for followers with no actual change to their metadata, just return group information
// for the current generation which will allow them to issue SyncGroup
responseCallback(JoinGroupResult(
members = Map.empty,
memberId = memberId,
generationId = group.generationId,
subProtocol = group.protocolOrNull,
leaderId = group.leaderOrNull,
error = Errors.NONE))
}
}
// look up the matching task in the purgatory and try to complete it
if (group.is(PreparingRebalance))
joinPurgatory.checkAndComplete(GroupKey(group.groupId))
}
==>kafka.coordinator.group.GroupCoordinator#addMemberAndRebalance
==> kafka.coordinator.group.GroupCoordinator#maybePrepareRebalance
// Scenario-driven walkthrough:
// first time through, the group state is Empty:
//   the group is empty, so a new InitialDelayedJoin is created and added to the DelayedOperationPurgatory
// on later calls: if the group state is CompletingRebalance, some members are awaiting sync; since a new member
//   has joined, interrupt their wait and make them rejoin
// when the group state is PreparingRebalance:
//   joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
// once the join phase completes:
==>kafka.coordinator.group.GroupCoordinator#onCompleteJoin
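onCompleteJoin is what the DelayedJoin / InitialDelayedJoin runs once the rebalance window closes. A simplified sketch of its effect, using only names already seen above (the real method additionally removes members that failed to rejoin and persists empty groups):
// simplified sketch: bump the generation, move to CompletingRebalance, and answer every
// member's pending JOIN_GROUP request (only the leader gets the full member metadata)
def onCompleteJoin(group: GroupMetadata): Unit = group.inLock {
  if (!group.is(Dead)) {
    group.initNextGeneration()   // generationId += 1; state -> CompletingRebalance (or Empty if no members remain)
    for (member <- group.allMemberMetadata) {
      val result = JoinGroupResult(
        members = if (group.isLeader(member.memberId)) group.currentMemberMetadata else Map.empty,
        memberId = member.memberId,
        generationId = group.generationId,
        subProtocol = group.protocolOrNull,
        leaderId = group.leaderOrNull,
        error = Errors.NONE)
      member.awaitingJoinCallback(result)   // complete this member's JOIN_GROUP response
      member.awaitingJoinCallback = null
      completeAndScheduleNextHeartbeatExpiration(group, member)
    }
  }
}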
The code is getting long, so here is a diagram summarizing the JOIN_GROUP flow:
After the first consumer has sent its JOIN_GROUP request:
- the group's leaderId has been chosen: it is that consumer's memberId
- the group's state ends up as CompletingRebalance
If another consumer (same groupId) joins at this point, it will:
- go through doJoinGroup()
- since the current state is CompletingRebalance, call addMemberAndRebalance() (see the sketch after this list):
  - add the new member and generate a memberId for it
  - call maybePrepareRebalance(), which creates a DelayedJoin task and moves the group to PreparingRebalance
- the DelayedJoin task sits in the purgatory; when it completes, the state moves back to CompletingRebalance
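Putting the memberId generation and the leader election together, a condensed sketch of addMemberAndRebalance (simplified; the parameter list follows the comments above rather than being verbatim source):
// condensed sketch: build the MemberMetadata, park its callback, add it to the group
// (the first member added becomes the leader), then kick off a rebalance if needed
private def addMemberAndRebalance(rebalanceTimeoutMs: Int, sessionTimeoutMs: Int,
                                  clientId: String, clientHost: String,
                                  protocolType: String, protocols: List[(String, Array[Byte])],
                                  group: GroupMetadata, callback: JoinCallback): MemberMetadata = {
  val memberId = clientId + "-" + group.generateMemberIdSuffix    // e.g. "consumer-1-<uuid>"
  val member = new MemberMetadata(memberId, group.groupId, clientId, clientHost,
    rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols)
  member.awaitingJoinCallback = callback
  group.add(member)             // if (leaderId.isEmpty) leaderId = Some(member.memberId)
  maybePrepareRebalance(group)  // Empty / Stable / CompletingRebalance -> PreparingRebalance + DelayedJoin
  member
}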
SYNC_GROUP: Distributing the Partition Assignment
A consumer group usually contains several consumers, and as described above the first one to join becomes the leader. Kafka requires that within a group each partition is consumed by exactly one consumer, and it is the leader that computes the assignment and has the coordinator distribute it to the other consumers. Entry point:
case ApiKeys.SYNC_GROUP => handleSyncGroupRequest(request)
...
groupCoordinator.handleSyncGroup(
syncGroupRequest.groupId,
syncGroupRequest.generationId,
syncGroupRequest.memberId,
syncGroupRequest.groupAssignment().asScala.mapValues(Utils.toArray),
sendResponseCallback
)
...
kafka.coordinator.group.GroupCoordinator#doSyncGroup(group: GroupMetadata,
generationId: Int,
memberId: String,
groupAssignment: Map[String, Array[Byte]],
responseCallback: SyncCallback)
==> branch on the group's current GroupState:
group.currentState match {
.....................
// at the end of the previous section the group's state was CompletingRebalance
case CompletingRebalance =>
group.get(memberId).awaitingSyncCallback = responseCallback
// only the SYNC_GROUP request sent by the leader carries the assignment; followers just wait on their callback
if (group.isLeader(memberId)) {
info(s"Assignment received from leader for group ${group.groupId} for generation ${group.generationId}")
// fill any missing members with an empty assignment
// any member that received no assignment gets an empty one, which can happen when there are more consumers than partitions
val missing = group.allMembers -- groupAssignment.keySet
val assignment = groupAssignment ++ missing.map(_ -> Array.empty[Byte]).toMap
// store the assignment in the group metadata and write it to the internal offsets topic
groupManager.storeGroup(group, assignment, (error: Errors) => {
group.inLock {
// another member may have joined the group while we were awaiting this callback,
// so we must ensure we are still in the CompletingRebalance state and the same generation
// when it gets invoked. if we have transitioned to another state, then do nothing
if (group.is(CompletingRebalance) && generationId == group.generationId) {
if (error != Errors.NONE) {
// reset the assignment: propagate an empty assignment (plus the error) to all members
resetAndPropagateAssignmentError(group, error)
// and move the group back to PreparingRebalance
maybePrepareRebalance(group)
} else {
// propagate the newly computed assignment to all members
setAndPropagateAssignment(group, assignment)
// and transition the group to Stable
group.transitionTo(Stable)
}
}
}
})
}
case Stable =>
// if the group is stable, we just return the current assignment
val memberMetadata = group.get(memberId)
// the group is Stable, so just return the member's current assignment through the callback
responseCallback(memberMetadata.assignment, Errors.NONE)
// complete the pending heartbeat and schedule the next heartbeat expiration
completeAndScheduleNextHeartbeatExpiration(group, group.get(memberId))
}
..........
kafka.coordinator.group.GroupMetadataManager#storeGroup:
// look up the __consumer_offsets partition that this group maps to
// (the internal topic __consumer_offsets has 50 partitions by default)
getMagic(partitionFor(group.groupId))
case Some(magicValue) =>
val groupMetadataValueVersion = {
if (interBrokerProtocolVersion < KAFKA_0_10_1_IV0)
0.toShort
else
GroupMetadataManager.CURRENT_GROUP_VALUE_SCHEMA_VERSION
}
// We always use CREATE_TIME, like the producer. The conversion to LOG_APPEND_TIME (if necessary) happens automatically.
val timestampType = TimestampType.CREATE_TIME
val timestamp = time.milliseconds()
val key = GroupMetadataManager.groupMetadataKey(group.groupId)
val value = GroupMetadataManager.groupMetadataValue(group, groupAssignment, version = groupMetadataValueVersion)
val records = {
val buffer = ByteBuffer.allocate(AbstractRecords.estimateSizeInBytes(magicValue, compressionType,
Seq(new SimpleRecord(timestamp, key, value)).asJava))
val builder = MemoryRecords.builder(buffer, magicValue, compressionType, timestampType, 0L)
builder.append(timestamp, key, value)
builder.build()
}
val groupMetadataPartition = new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, partitionFor(group.groupId))
val groupMetadataRecords = Map(groupMetadataPartition -> records)
val generationId = group.generationId
// set the callback function to insert the created group into cache after log append completed
def putCacheCallback(responseStatus: Map[TopicPartition, PartitionResponse]) {
....
}
kafka.coordinator.group.GroupMetadataManager#appendForGroup:
// append the group metadata records through the ReplicaManager
replicaManager.appendRecords(
timeout = config.offsetCommitTimeoutMs.toLong,
requiredAcks = config.offsetCommitRequiredAcks,
internalTopicsAllowed = true,
isFromClient = false,
entriesPerPartition = records,
delayedProduceLock = Some(group.lock),
responseCallback = callback)
The code is long, so here is a summary of the main steps:
- the SYNC_GROUP request is received, and the GroupCoordinator extracts the request parameters and handles it
- GroupCoordinator#doSyncGroup() is called, branching on the group's current state:
  - case Empty | Dead: respond with an error
  - case PreparingRebalance: respond with a rebalance-in-progress error
  - case CompletingRebalance:
    - only the leader's request is acted on; a follower's request just parks its callback
    - any member missing from the assignment gets an empty assignment (possible when there are more consumers than partitions)
    - groupManager.storeGroup(group, assignment) is called:
      - the current assignment is written to Kafka's internal topic __consumer_offsets
      - the metadata record is appended via replicaManager.appendRecords()
      - any error during the append is captured
    - the callback then checks that the group is still in CompletingRebalance and that the generationId still matches, to guard against stale state or a stale generation
    - depending on the captured error:
      - no error: walk the members, set and propagate the assignment (see the sketch after this list), and move the group to Stable
      - error: propagate an empty assignment to every member and move the group back to PreparingRebalance
  - case Stable: the current assignment is handed back to the member through its callback
    - the member's pending DelayedHeartbeat in heartbeatPurgatory is completed and the next one is scheduled
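For the "propagate the assignment" steps above, a rough sketch of what happens per member (simplified; the helper names follow the comments in the code rather than being verbatim source):
// rough sketch: record each member's assignment, then answer every pending SYNC_GROUP
// callback and reset that member's heartbeat deadline
private def setAndPropagateAssignment(group: GroupMetadata, assignment: Map[String, Array[Byte]]): Unit = {
  group.allMemberMetadata.foreach(member => member.assignment = assignment(member.memberId))
  propagateAssignment(group, Errors.NONE)
}

private def propagateAssignment(group: GroupMetadata, error: Errors): Unit = {
  for (member <- group.allMemberMetadata) {
    if (member.awaitingSyncCallback != null) {
      member.awaitingSyncCallback(member.assignment, error)  // followers finally receive their assignment here
      member.awaitingSyncCallback = null
      // reset the session timer so the member is not expired while the response is in flight
      completeAndScheduleNextHeartbeatExpiration(group, member)
    }
  }
}
The error path (resetAndPropagateAssignmentError) does the same walk, but with an empty assignment and the error code.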
Continuing the diagram from above:
At this point some of the opening questions have been answered; the progress so far:
This is a personal summary; corrections for any errors or omissions are very welcome.
WeChat official account: 大数据下挣扎 — feel free to follow if you're interested.