GroupCoordinator source code: FIND_COORDINATOR, JOIN_GROUP, SYNC_GROUP

aLivable_Dedode

Category: Kafka 2.0 source code analysis. Tags: kafka, distributed systems, source code

Article link: https://blog.csdn.net/aLivable_Dedode/article/details/123165033

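GroupCoordinator class comment

As the class comment says, GroupCoordinator handles group membership and offset management, and those two responsibilities are what make it important. The comment also notes that delayed operations use the group object as their lock.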

/**
 * GroupCoordinator handles general group membership and offset management.
 *
 * Each Kafka server instantiates a coordinator which is responsible for a set of
 * groups. Groups are assigned to coordinators based on their group names.
 * <p>
 * <b>Delayed operation locking notes:</b>
 * Delayed operations in GroupCoordinator use `group` as the delayed operation
 * lock. ReplicaManager.appendRecords may be invoked while holding the group lock
 * used by its callback.  The delayed callback may acquire the group lock
 * since the delayed operation is completed only if the group lock can be acquired.
 */

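With that in mind, a few questions:

How does GroupCoordinator manage group membership?
How does GroupCoordinator manage offsets?
What does GroupCoordinator do when it starts up?
How does GroupCoordinator handle delayed operations?

In KafkaServer#startup:

GroupCoordinator is instantiated
GroupCoordinator is started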

/** kafka.server.KafkaServer#startup **/
...
groupCoordinator = GroupCoordinator(config, zkClient, replicaManager, Time.SYSTEM)
groupCoordinator.startup()
...
/** kafka.server.KafkaServer#startup **/

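During instantiation we can see:

Two DelayedOperationPurgatory instances are created (see: timing wheel), one for delayed Heartbeat operations and one for delayed Rebalance operations
The offset and group configuration is loaded: offsetConfig, groupConfig
GroupMetadataManager, the group metadata manager, is initialized
The GroupCoordinator instance is returned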

/** kafka.coordinator.group.GroupCoordinator (companion object) **/
object GroupCoordinator {

  val NoState = ""
  val NoProtocolType = ""
  val NoProtocol = ""
  val NoLeader = ""
  val NoGeneration = -1
  val NoMemberId = ""
  val NoMembers = List[MemberSummary]()
  val EmptyGroup = GroupSummary(NoState, NoProtocolType, NoProtocol, NoMembers)
  val DeadGroup = GroupSummary(Dead.toString, NoProtocolType, NoProtocol, NoMembers)

  def apply(config: KafkaConfig,
            zkClient: KafkaZkClient,
            replicaManager: ReplicaManager,
            time: Time): GroupCoordinator = {
    // Create two DelayedOperationPurgatory instances, used for delayed operations
    val heartbeatPurgatory = DelayedOperationPurgatory[DelayedHeartbeat]("Heartbeat", config.brokerId)
    val joinPurgatory = DelayedOperationPurgatory[DelayedJoin]("Rebalance", config.brokerId)
    apply(config, zkClient, replicaManager, heartbeatPurgatory, joinPurgatory, time)
  }
  
  private[group] def offsetConfig(config: KafkaConfig) = OffsetConfig(
      maxMetadataSize = config.offsetMetadataMaxSize,
      loadBufferSize = config.offsetsLoadBufferSize,
      offsetsRetentionMs = config.offsetsRetentionMinutes * 60L * 1000L,
      offsetsRetentionCheckIntervalMs = config.offsetsRetentionCheckIntervalMs,
      offsetsTopicNumPartitions = config.offsetsTopicPartitions,
      offsetsTopicSegmentBytes = config.offsetsTopicSegmentBytes,
      offsetsTopicReplicationFactor = config.offsetsTopicReplicationFactor,
      offsetsTopicCompressionCodec = config.offsetsTopicCompressionCodec,
      offsetCommitTimeoutMs = config.offsetCommitTimeoutMs,
      offsetCommitRequiredAcks = config.offsetCommitRequiredAcks
    )
  
  def apply(config: KafkaConfig,
            zkClient: KafkaZkClient,
            replicaManager: ReplicaManager,
            heartbeatPurgatory: DelayedOperationPurgatory[DelayedHeartbeat],
            joinPurgatory: DelayedOperationPurgatory[DelayedJoin],
            time: Time): GroupCoordinator = {
    // Load the offset and group configuration
    val offsetConfig = this.offsetConfig(config)
    val groupConfig = GroupConfig(groupMinSessionTimeoutMs = config.groupMinSessionTimeoutMs,
      groupMaxSessionTimeoutMs = config.groupMaxSessionTimeoutMs,
      groupInitialRebalanceDelayMs = config.groupInitialRebalanceDelay)
      
    // Create the group metadata manager
    val groupMetadataManager = new GroupMetadataManager(config.brokerId, config.interBrokerProtocolVersion,
      offsetConfig, replicaManager, zkClient, time)
    new GroupCoordinator(config.brokerId, groupConfig, offsetConfig, groupMetadataManager, heartbeatPurgatory, joinPurgatory, time)
  }

}
/** kafka.coordinator.group.GroupCoordinator (companion object) **/

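After instantiation, the GroupCoordinator instance is started. From the code above, much of GroupCoordinator's functionality appears to rely on GroupMetadataManager. So what does GroupMetadataManager contain that lets it do this job?

Looking at class GroupMetadataManager and object GroupMetadataManager, the main parts are:

scheduler (a KafkaScheduler): a thread pool for background tasks
loadingPartitions: a Set of the __consumer_offsets partitions whose consumer groups are currently being loaded
ownedPartitions: a Set of the __consumer_offsets partitions whose consumer groups have been assigned to this coordinator
openGroupsForProducer: a HashMap[Long, Set[String]] that maps a transactional producer's id to the groups with open transactional offset commits from that producer
groupMetadataTopicPartitionCount: number of partitions for the consumer metadata topic
groupMetadataCache: a Pool[String, GroupMetadata], which is a wrapper around ConcurrentHashMap that holds the GroupMetadata of each group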
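
To make the role of groupMetadataCache more concrete, here is a minimal sketch of a cache that registers a group only if none exists for that id. GroupStub and GroupCache are hypothetical names invented for this illustration; they are not Kafka classes, and the real Pool and GroupMetadata carry far more state.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, stripped-down stand-in for GroupMetadata: a group id and a state label.
final case class GroupStub(groupId: String, state: String)

// Hypothetical cache modelled on the description above: a thin wrapper around ConcurrentHashMap.
final class GroupCache {
  private val groups = new ConcurrentHashMap[String, GroupStub]()

  // Register the group unless one with the same id is already cached.
  // Either way, return the instance that ends up in the cache.
  def addGroup(group: GroupStub): GroupStub = {
    val existing = groups.putIfAbsent(group.groupId, group)
    if (existing != null) existing else group
  }

  // Look up a group by id; None when the group is unknown.
  def getGroup(groupId: String): Option[GroupStub] = Option(groups.get(groupId))
}
```

With this shape, two concurrent first-time joins for the same groupId end up sharing one cached entry rather than overwriting each other.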

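What is GroupMetadata?

GroupMetadata came up above. What exactly does it contain?

In the code, GroupMetadata(val groupId: String, initialState: GroupState) takes only a groupId and an initial state as constructor parameters. The call site for a newly created group shows that the initial state is Empty.

Next, all the GroupState values: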

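Empty: the group currently has no members, but it stays around until all of its offsets have expired
PreparingRebalance: the group is preparing to rebalance
CompletingRebalance: the group is waiting for the leader's assignment
Stable: the group is stable
Dead: the group has no members left and its metadata is being removed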
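
The states can be read as a small state machine. The sketch below encodes only the transitions that this article walks through; it is a simplification, and the real GroupMetadata permits additional transitions (for example into Dead from other states). The object name GroupStates and the function canTransition are made up for the example.

```scala
object GroupStates {
  sealed trait State
  case object Empty extends State
  case object PreparingRebalance extends State
  case object CompletingRebalance extends State
  case object Stable extends State
  case object Dead extends State

  // Only the transitions discussed in this article; not the complete set Kafka allows.
  def canTransition(from: State, to: State): Boolean = (from, to) match {
    case (Empty, PreparingRebalance)               => true // first member joins
    case (Empty, Dead)                             => true // no members and no offsets left
    case (PreparingRebalance, CompletingRebalance) => true // delayed join completed
    case (CompletingRebalance, Stable)             => true // leader's assignment stored
    case (CompletingRebalance, PreparingRebalance) => true // another member joined meanwhile
    case (Stable, PreparingRebalance)              => true // member or leader rejoined
    case _                                         => false
  }
}
```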

With the states covered, note some important member variables: state, members, offsets, pendingOffsetCommits, pendingTransactionalOffsetCommits

private var state: GroupState = initialState
// The set of members
private val members = new mutable.HashMap[String, MemberMetadata]
// For each TopicPartition, a CommitRecordMetadataAndOffset (it holds an offset as a long value and an OffsetAndMetadata with the commit timestamp and the expiry timestamp)
private val offsets = new mutable.HashMap[TopicPartition, CommitRecordMetadataAndOffset]
private val pendingOffsetCommits = new mutable.HashMap[TopicPartition, OffsetAndMetadata]
private val pendingTransactionalOffsetCommits = new mutable.HashMap[Long, mutable.Map[TopicPartition, CommitRecordMetadataAndOffset]]()

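At this point we can see that the metadata holds several important pieces of information: groupId, the members, TopicPartition, offset information, and GroupState

GroupCoordinator startup

With this in mind, what does GroupMetadataManager do when it starts up?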

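Start the scheduler thread pool
Add a periodic task [delete-expired-group-metadata] to the scheduler to purge expired group metadata
The task takes every GroupMetadata from groupMetadataCache and drops the expired entries through group.removeExpiredOffsets():
1. Remove the expired entries from offsets
2. Return the removed entries as a Map
The main function is cleanupGroupMetadata(groups: Iterable[GroupMetadata], selector: GroupMetadata => Map[TopicPartition, OffsetAndMetadata]):
1. Iterate over groups; for each group (a GroupMetadata), apply selector to obtain the corresponding OffsetAndMetadata entries
2. Move the group to Dead if both of the following hold:
   1. group.is(Empty)
   2. !(offsets.nonEmpty || pendingOffsetCommits.nonEmpty || pendingTransactionalOffsetCommits.nonEmpty)
3. Stop here for now. Following a scenario-driven approach, the broker has only just started and no member has joined, so the rest of the function is not reached yet (TODO: come back to this later)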
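
The cleanup pass above can be sketched as follows. This is a toy model under stated assumptions: StoredOffset, Group, and cleanup are hypothetical names, partitions are keyed by a plain string, and pending (in-flight) commits are ignored entirely, which the real code does take into account.

```scala
object OffsetCleanup {
  // Hypothetical simplification: an offset plus the timestamp at which it expires.
  final case class StoredOffset(offset: Long, expireTimestampMs: Long)

  final class Group(val groupId: String, var memberCount: Int) {
    var offsets: Map[String, StoredOffset] = Map.empty // keyed by "topic-partition"
    var dead: Boolean = false

    // Drop the expired entries from offsets and return what was removed.
    def removeExpiredOffsets(nowMs: Long): Map[String, StoredOffset] = {
      val (expired, live) = offsets.partition { case (_, o) => o.expireTimestampMs < nowMs }
      offsets = live
      expired
    }
  }

  // One pass of the periodic task: expire offsets, then mark member-less,
  // offset-less groups as dead. Returns the number of offsets removed.
  def cleanup(groups: Iterable[Group], nowMs: Long): Int = {
    var removed = 0
    groups.foreach { g =>
      removed += g.removeExpiredOffsets(nowMs).size
      if (g.memberCount == 0 && g.offsets.isEmpty) g.dead = true
    }
    removed
  }
}
```

A group with no members therefore survives as long as it still has at least one unexpired offset, which matches the description of the Empty state.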

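Suppose a broker starts, and its server-side GroupCoordinator starts with it. What happens next? Judging from GroupMetadata's state transitions, a group begins as Empty because nothing exists yet. From there the state can change in only two ways: a new member joins, or the group expires. At this point it is clearly the new member's turn.

Guess: with the server-side GroupCoordinator running, a new consumer group arrives. It first has to find its coordinator and then send a JOIN_GROUP request, that is: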

FIND_COORDINATOR
JOIN_GROUP

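FIND_COORDINATOR: who is the coordinator?

So who is the coordinator? With a single broker it is obviously that broker, but Kafka usually runs as a cluster of several brokers, so which one becomes the coordinator? Following the FIND_COORDINATOR request step by step shows that the answer is quite direct. First, groupMetadataTopicPartitionCount is the partition count of Kafka's internal topic __consumer_offsets, 50 by default. The rule is then: abs(groupId.hashCode) % partitions(__consumer_offsets) = partition id. The broker that hosts the leader replica of that partition acts as the group coordinator for the consumer group. Having made the guess, find the corresponding code to verify it. The coordinator is not chosen by picking the least loaded broker.
(Related topic: how the coordinator is chosen.)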
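
The partition rule is simple enough to try out directly. The following is a minimal sketch, assuming the default of 50 partitions; CoordinatorLookup, nonNegative, and partitionFor are names chosen for this example rather than the Kafka implementation, and the step from partition id to broker (finding the partition leader) is left out.

```scala
object CoordinatorLookup {
  // Non-negative version of a hash code. Int.MinValue has no positive
  // counterpart, so it is mapped to 0 instead of being passed to math.abs.
  private def nonNegative(n: Int): Int = if (n == Int.MinValue) 0 else math.abs(n)

  // Partition of __consumer_offsets that stores the metadata of the given group.
  def partitionFor(groupId: String, offsetsTopicPartitions: Int = 50): Int =
    nonNegative(groupId.hashCode) % offsetsTopicPartitions
}
```

Because the result depends only on the group id and the partition count, every broker computes the same answer for the same group without any coordination.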

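JOIN_GROUP: a new consumer group shows up and asks to join

Once the group coordinator has been found, the consumer talks to it directly. It first sends a JOIN_GROUP request to join the group. The process can be traced step by step starting from

case ApiKeys.JOIN_GROUP => handleJoinGroupRequest(request). Roughly: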

==>kafka.server.KafkaApis#handleJoinGroupRequest
==>kafka.coordinator.group.GroupCoordinator#handleJoinGroup
/**
 * public static final Field.Str MEMBER_ID = new Field.Str("member_id", 
 *  "The member id assigned by the group coordinator or null if joining for the first time.");
 */
// A first-time join always takes this path
// A GroupMetadata is created and added to groupMetadataCache; the newly added group is in state [Empty]
// The join request is then handled by doJoinGroup
val group = groupManager.addGroup(new GroupMetadata(groupId, initialState = Empty))
doJoinGroup(group, memberId, clientId, clientHost, rebalanceTimeoutMs, sessionTimeoutMs, protocolType, protocols, responseCallback)

==>kafka.coordinator.group.GroupCoordinator#doJoinGroup
// Handle the request according to group.currentState
  // On the first join the group state is Empty
  case Empty | Stable =>
    if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
      // if the member id is unknown, register the member to the group
      // A new member is joining and its memberId is empty, so this is the branch taken

      /**
       * Steps for adding a new member:
       *   1. Assign a member id: memberId = clientId + "-" + group.generateMemberIdSuffix
       *   2. Create the MemberMetadata object
       *   3. group.add(member) ==> if (leaderId.isEmpty) leaderId = Some(member.memberId)
       *   4. Move the group to PreparingRebalance
       *   5. Add a delayed task to the purgatory: joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
       */
      addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
        protocols, group, responseCallback)
    } else {
      val member = group.get(memberId)
      // This is a JoinGroup request re-sent by an existing member
      
      if (group.isLeader(memberId) || !member.matches(protocols)) {
        // force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
        // The latter allows the leader to trigger rebalances for changes affecting assignment
        // which do not affect the member metadata (such as topic metadata changes for the consumer)
        // In that case:

        /**
         *    1. Move the group to PreparingRebalance
         *    2. Add a delayed task to the purgatory: joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
         */
        updateMemberAndRebalance(group, member, protocols, responseCallback)
      } else {
        // for followers with no actual change to their metadata, just return group information
        // for the current generation which will allow them to issue SyncGroup
        responseCallback(JoinGroupResult(
          members = Map.empty,
          memberId = memberId,
          generationId = group.generationId,
          subProtocol = group.protocolOrNull,
          leaderId = group.leaderOrNull,
          error = Errors.NONE))
      }
    }
    
    // Look up the matching task in the purgatory and try to complete it
    if (group.is(PreparingRebalance))
      joinPurgatory.checkAndComplete(GroupKey(group.groupId))
}    
    
==>kafka.coordinator.group.GroupCoordinator#addMemberAndRebalance
    ==> kafka.coordinator.group.GroupCoordinator#maybePrepareRebalance
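    // Scenario walk-through:
        // First join: the group state is Empty,
            // the group has no members, so a new InitialDelayedJoin is added to the DelayedOperationPurgatory

        // Later joins: if the group state is CompletingRebalance, some members are awaiting sync; because a new member has joined, their wait is interrupted and they are told to rejoin

    // With the group in state PreparingRebalance:
    // joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
// Once the join phase completes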
==>kafka.coordinator.group.GroupCoordinator#onCompleteJoin

The code is fairly long, so a diagram summarizes the JOIN_GROUP request flow:
[Figure: JOIN_GROUP request flow]

After the first consumer sends its JOIN_GROUP request:

The group's leaderId is chosen: it is that consumer's memberId
Once the delayed join completes, the group state becomes CompletingRebalance

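If another consumer with the same groupId arrives at this point:

doJoinGroup() is called
The current state is CompletingRebalance, so addMemberAndRebalance() is called:
1. The new member is added and a memberId is generated for it
2. maybePrepareRebalance() is called, which creates a DelayedJoin task and moves the group to PreparingRebalance
3. The DelayedJoin task is added to the purgatory; when the purgatory completes the task, the state changes to CompletingRebalance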
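
The two-consumer scenario above can be replayed with a toy model. This is a sketch, not the coordinator: JoinSketch, Group, and completeJoin are hypothetical names, states are plain strings, and the purgatory is replaced by an explicit completeJoin() call. Using a random UUID as the member id suffix is an assumption about what generateMemberIdSuffix produces.

```scala
import java.util.UUID

object JoinSketch {
  final class Group(val groupId: String) {
    var state: String = "Empty"
    var leaderId: Option[String] = None
    var generationId: Int = 0
    private var members = Vector.empty[String]

    def allMembers: Vector[String] = members

    // Follows the steps listed in the article: assign an id, register the member,
    // pick it as leader if there is none yet, then prepare a rebalance.
    def addMemberAndRebalance(clientId: String): String = {
      val memberId = clientId + "-" + UUID.randomUUID().toString
      members = members :+ memberId
      if (leaderId.isEmpty) leaderId = Some(memberId)
      state = "PreparingRebalance"
      memberId
    }

    // Stand-in for the delayed join task completing.
    def completeJoin(): Unit = {
      generationId += 1
      state = "CompletingRebalance"
    }
  }
}
```

Running it for two consumers shows that the first member stays leader while the second join pushes the group back to PreparingRebalance until the next completeJoin().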

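SYNC_GROUP: distributing the partition assignment

A consumer group contains multiple consumers, and as described earlier the first one to join becomes the leader. Kafka requires that, within a group, each partition is consumed by only one consumer. The leader sends the partition assignment to the coordinator in its SYNC_GROUP request, and the coordinator passes it on to the other consumers. Entry point: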

case ApiKeys.SYNC_GROUP => handleSyncGroupRequest(request)
...
groupCoordinator.handleSyncGroup(
        syncGroupRequest.groupId,
        syncGroupRequest.generationId,
        syncGroupRequest.memberId,
        syncGroupRequest.groupAssignment().asScala.mapValues(Utils.toArray),
        sendResponseCallback
      )

...
kafka.coordinator.group.GroupCoordinator#doSyncGroup(group: GroupMetadata,
                          generationId: Int,
                          memberId: String,
                          groupAssignment: Map[String, Array[Byte]],
                          responseCallback: SyncCallback)

==> Branch on the group's current GroupState:
group.currentState match {
          .....................
          // At the end of the previous section the group was in state CompletingRebalance
          case CompletingRebalance =>
            group.get(memberId).awaitingSyncCallback = responseCallback
            // Only a SYNC_GROUP request sent by the leader takes effect here
            if (group.isLeader(memberId)) {
              info(s"Assignment received from leader for group ${group.groupId} for generation ${group.generationId}")
              // fill any missing members with an empty assignment
              // Members without an assignment get an empty one; this can happen when there are more consumers than partitions
              val missing = group.allMembers -- groupAssignment.keySet
              val assignment = groupAssignment ++ missing.map(_ -> Array.empty[Byte]).toMap
              // Save the group information in the group metadata and write it to the internal offsets topic
              groupManager.storeGroup(group, assignment, (error: Errors) => {
                group.inLock {
                  // another member may have joined the group while we were awaiting this callback,
                  // so we must ensure we are still in the CompletingRebalance state and the same generation
                  // when it gets invoked. if we have transitioned to another state, then do nothing
                  if (group.is(CompletingRebalance) && generationId == group.generationId) {
                    if (error != Errors.NONE) {
                        // Reset the assignment and send an empty assignment to all members
                      resetAndPropagateAssignmentError(group, error)
                        // Move the group to PreparingRebalance
                      maybePrepareRebalance(group)
                    } else {
                        // Propagate the new assignment
                      setAndPropagateAssignment(group, assignment)
                        // Update the state to Stable
                      group.transitionTo(Stable)
                    }
                  }
                }
              })
            }

          case Stable =>
            // if the group is stable, we just return the current assignment
            val memberMetadata = group.get(memberId)
            // In state Stable, return the current assignment through the callback
            responseCallback(memberMetadata.assignment, Errors.NONE)
            // Complete the pending heartbeat and schedule the next expiration
            completeAndScheduleNextHeartbeatExpiration(group, group.get(memberId))
        }

..........
kafka.coordinator.group.GroupMetadataManager#storeGroup:
// Find the partition of Kafka's internal topic __consumer_offsets that belongs to this group
// __consumer_offsets has 50 partitions by default
getMagic(partitionFor(group.groupId))
case Some(magicValue) =>
        val groupMetadataValueVersion = {
          if (interBrokerProtocolVersion < KAFKA_0_10_1_IV0)
            0.toShort
          else
            GroupMetadataManager.CURRENT_GROUP_VALUE_SCHEMA_VERSION
        }

        // We always use CREATE_TIME, like the producer. The conversion to LOG_APPEND_TIME (if necessary) happens automatically.
        val timestampType = TimestampType.CREATE_TIME
        val timestamp = time.milliseconds()
        val key = GroupMetadataManager.groupMetadataKey(group.groupId)
        val value = GroupMetadataManager.groupMetadataValue(group, groupAssignment, version = groupMetadataValueVersion)

        val records = {
          val buffer = ByteBuffer.allocate(AbstractRecords.estimateSizeInBytes(magicValue, compressionType,
            Seq(new SimpleRecord(timestamp, key, value)).asJava))
          val builder = MemoryRecords.builder(buffer, magicValue, compressionType, timestampType, 0L)
          builder.append(timestamp, key, value)
          builder.build()
        }

        val groupMetadataPartition = new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, partitionFor(group.groupId))
        val groupMetadataRecords = Map(groupMetadataPartition -> records)
        val generationId = group.generationId

        // set the callback function to insert the created group into cache after log append completed
        def putCacheCallback(responseStatus: Map[TopicPartition, PartitionResponse]) {
			....
        }
kafka.coordinator.group.GroupMetadataManager#appendForGroup:
// Append the records to the log
    replicaManager.appendRecords(
      timeout = config.offsetCommitTimeoutMs.toLong,
      requiredAcks = config.offsetCommitRequiredAcks,
      internalTopicsAllowed = true,
      isFromClient = false,
      entriesPerPartition = records,
      delayedProduceLock = Some(group.lock),
      responseCallback = callback)

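The code is fairly long, so here is a summary of the main steps:

The SYNC_GROUP request is received, its parameters are extracted, and groupCoordinator handles it
GroupCoordinator#doSyncGroup() is called and branches on the group's current state:
- case Empty | Dead: respond with Errors.UNKNOWN_MEMBER_ID
- case PreparingRebalance: respond with Errors.REBALANCE_IN_PROGRESS
- case CompletingRebalance
  - Check that the request was sent by the leader; otherwise only the callback is stored and nothing else is done
  - Find the members that received no assignment and give each an empty one; this can happen when there are more consumers than partitions
  - Call groupManager.storeGroup(group, assignment, callback)
    - Write the current assignment to Kafka's internal topic __consumer_offsets
    - The metadata is written through replicaManager.appendRecords()
    - Any error from the write is recorded as Errors
  - Check that the group is still in CompletingRebalance and that generationId still matches, to guard against a stale state or generation
  - Then check Errors:
    - Errors.NONE: set the assignment on each member and propagate it, then move the group to Stable
    - Otherwise: send an empty assignment to all members and move the group to PreparingRebalance
- case Stable
  - Return the current assignment to the requesting member through the callback
  - Complete the pending task in the heartbeatPurgatory and add the next DelayedHeartbeat task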
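
The "fill missing members with an empty assignment" step from the CompletingRebalance branch can be isolated as a small function. The sketch below reuses the two expressions from the excerpt above; the wrapper SyncSketch.fillMissing is a name invented for this example, and member ids are plain strings.

```scala
object SyncSketch {
  // Every known member must end up with an assignment. Members the leader
  // left out (for example because there are more consumers than partitions)
  // receive an empty byte array.
  def fillMissing(allMembers: Set[String],
                  groupAssignment: Map[String, Array[Byte]]): Map[String, Array[Byte]] = {
    val missing = allMembers -- groupAssignment.keySet
    groupAssignment ++ missing.map(_ -> Array.empty[Byte]).toMap
  }
}
```

For example, with members m1 and m2 and a leader assignment that only covers m1, the result contains both members, and m2 maps to an empty array.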