PartitionStateMachine is the Kafka Controller's partition state machine; it manages the state of every partition in the cluster. Each Kafka Controller instance creates its own partition state machine, but the state machine only starts running once that instance is elected as the active Controller.
Its startup logic is as follows:
def startup(): Unit = {
  info("Initializing partition state")
  // Initialize the locally tracked state of every partition
  initializePartitionState()
  info("Triggering online partition state changes")
  // Try to move every eligible partition currently in OfflinePartition or NewPartition state to OnlinePartition
  triggerOnlinePartitionStateChange()
  debug(s"Started partition state machine with initial state -> ${controllerContext.partitionStates}")
}
private def initializePartitionState(): Unit = {
  // Iterate over every partition in the cluster
  for (topicPartition <- controllerContext.allPartitions) {
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    // Look up the partition's leader brokerId, ISR, and controller epoch
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      // Leader and ISR information exists
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        // The broker hosting the leader replica is alive: initialize the partition as OnlinePartition
        if (controllerContext.isReplicaOnline(currentLeaderIsrAndEpoch.leaderAndIsr.leader, topicPartition))
          // leader is alive
          controllerContext.putPartitionState(topicPartition, OnlinePartition)
        else
          // The broker hosting the leader replica is down: initialize it as OfflinePartition
          controllerContext.putPartitionState(topicPartition, OfflinePartition)
      case None =>
        // No leadership info means this is a newly created partition: set its state to NewPartition
        controllerContext.putPartitionState(topicPartition, NewPartition)
    }
  }
}
Kafka defines four partition states (their definitions are sketched after this list):
- NewPartition: the state a partition is given right after creation, marking it as a brand-new partition object. Kafka considers a partition in this state "uninitialized", so it cannot yet elect a leader.
- OnlinePartition: the state of a partition that is serving normally.
- OfflinePartition: the state of a partition that has gone offline.
- NonExistentPartition: the state of a partition that has been deleted and removed from the state machine.
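Each state also carries the set of states it may legally be entered from; checkValidPartitionStateChange (used later in doHandleStateChanges) validates transitions against these sets. The definitions below are abridged from Kafka's PartitionStateMachine.scala (the 2.x line); check your version's source if it differs:
sealed trait PartitionState {
  def state: Byte
  def validPreviousStates: Set[PartitionState]
}

case object NewPartition extends PartitionState {
  val state: Byte = 0
  val validPreviousStates: Set[PartitionState] = Set(NonExistentPartition)
}

case object OnlinePartition extends PartitionState {
  val state: Byte = 1
  val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
}

case object OfflinePartition extends PartitionState {
  val state: Byte = 2
  val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
}

case object NonExistentPartition extends PartitionState {
  val state: Byte = 3
  val validPreviousStates: Set[PartitionState] = Set(OfflinePartition)
}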
Partition leader election scenarios and methods: four scenarios trigger a partition leader election.
- OfflinePartitionLeaderElectionStrategy: election triggered because the partition's leader replica went offline.
- ReassignPartitionLeaderElectionStrategy: election triggered by a partition replica reassignment.
- PreferredReplicaPartitionLeaderElectionStrategy: election triggered by a preferred-replica leader election.
- ControlledShutdownPartitionLeaderElectionStrategy: election triggered by a controlled (graceful) broker shutdown.
For these four scenarios, the state machine's PartitionLeaderElectionAlgorithms object defines four corresponding methods, each responsible for electing a leader in one scenario: offlinePartitionLeaderElection, reassignPartitionLeaderElection, preferredReplicaPartitionLeaderElection, and controlledShutdownPartitionLeaderElection.
// Partition leader election strategy interface
sealed trait PartitionLeaderElectionStrategy
// Election strategy for offline partitions
final case class OfflinePartitionLeaderElectionStrategy(
  allowUnclean: Boolean) extends PartitionLeaderElectionStrategy
// Election strategy for reassigned partitions
final case object ReassignPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
// Election strategy for preferred-replica leader election
final case object PreferredReplicaPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
// Election strategy for controlled broker shutdown
final case object ControlledShutdownPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
offlinePartitionLeaderElection has the most complex logic of the four:
def offlinePartitionLeaderElection(assignment: Seq[Int], // the partition's replica assignment list
                                   isr: Seq[Int],
                                   liveReplicas: Set[Int], // all live replicas of this partition
                                   uncleanLeaderElectionEnabled: Boolean, // whether unclean leader election is allowed
                                   controllerContext: ControllerContext
                                  ): Option[Int] = {
  // Find the first replica in the assignment that is both alive and in the ISR
  assignment.find(id => liveReplicas.contains(id) && isr.contains(id))
    .orElse {
      // If no such replica exists, check whether unclean leader election is allowed,
      // i.e. whether the broker-side parameter unclean.leader.election.enable is true
      if (uncleanLeaderElectionEnabled) {
        // Pick the first live replica in the assignment as leader
        val leaderOpt = assignment.find(liveReplicas.contains)
        if (leaderOpt.isDefined)
          controllerContext.stats.uncleanLeaderElectionRate.mark()
        leaderOpt
      } else {
        // Unclean leader election is not allowed: return None, meaning no leader can be elected
        None
      }
    }
}
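To make the selection order concrete, here is a small standalone sketch (made-up broker ids; ControllerContext and the metrics call are omitted) that mirrors the two-step lookup above:
object OfflineElectionDemo extends App {
  // Same two-step lookup as offlinePartitionLeaderElection, minus the metrics call
  def electOffline(assignment: Seq[Int], isr: Seq[Int], live: Set[Int], unclean: Boolean): Option[Int] =
    assignment.find(id => live.contains(id) && isr.contains(id))
      .orElse(if (unclean) assignment.find(live.contains) else None)

  val assignment = Seq(1, 2, 3) // replica list in assignment order
  // Broker 1 (the old leader) is down; broker 2 is alive and still in the ISR
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(2, 3), unclean = false)) // Some(2)
  // Only broker 3 (not in the ISR) survives: clean election fails...
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(3), unclean = false))    // None
  // ...but unclean election picks it, at the risk of losing committed data
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(3), unclean = true))     // Some(3)
}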
The other three methods are simpler:
def reassignPartitionLeaderElection(reassignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  reassignment.find(id => liveReplicas.contains(id) && isr.contains(id))
}
def preferredReplicaPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}
def controlledShutdownPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int], shuttingDownBrokers: Set[Int]): Option[Int] = {
  assignment.find(id => liveReplicas.contains(id) && isr.contains(id) && !shuttingDownBrokers.contains(id))
}
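A quick illustration of the latter two (again a standalone sketch with made-up broker ids):
object SimpleElectionsDemo extends App {
  val assignment = Seq(1, 2, 3)
  val isr = Seq(1, 2)
  val live = Set(1, 2, 3)
  // Preferred-replica election only ever considers the head of the assignment
  println(assignment.headOption.filter(id => live(id) && isr.contains(id))) // Some(1)
  // Controlled shutdown additionally excludes brokers that are shutting down,
  // so the next live ISR member wins instead
  val shuttingDown = Set(1)
  println(assignment.find(id => live(id) && isr.contains(id) && !shuttingDown(id))) // Some(2)
}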
Handling partition state transitions
handleStateChanges implements the partition state transition logic. In one sentence: handleStateChanges moves the given partitions to targetState, possibly electing new leaders for them with leaderElectionStrategy along the way, and finally returns the partitions' leader information.
override def handleStateChanges(
  partitions: Seq[TopicPartition],
  targetState: PartitionState,
  partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  if (partitions.nonEmpty) {
    try {
      // Reset the Controller's pending-request batch for this round of sends;
      // this also verifies that all requests from the previous batch were sent
      controllerBrokerRequestBatch.newBatch()
      // Delegate the actual state-change logic to doHandleStateChanges
      val result = doHandleStateChanges(
        partitions,
        targetState,
        partitionLeaderElectionStrategyOpt
      )
      // Have the Controller notify the affected brokers of the state changes
      controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
      result
    } catch {
      // If the controllership moved to another broker, log an error and rethrow;
      // the caller catches this exception and runs maybeResign to step down
      case e: ControllerMovedException =>
        error(s"Controller moved to another broker when moving some partitions to $targetState state", e)
        throw e
      // For any other exception, log an error and return it wrapped per partition
      case e: Throwable =>
        error(s"Error while moving some partitions to $targetState state", e)
        partitions.iterator.map(_ -> Left(e)).toMap
    }
  } else {
    // Nothing to do if partitions is empty
    Map.empty
  }
}
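For context, the snippet below sketches a hypothetical caller sequence modeled on what KafkaController does when a broker fails; partitionStateMachine and partitionsWithoutLeader are stand-ins for the controller's state machine instance and the affected partitions, and exact call sites vary by version:
// 1. Mark the partitions that lost their leader as offline; no election is needed yet
partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition, None)
// 2. Try to bring them back online, electing new leaders with the offline strategy
//    (unclean election disabled here)
partitionStateMachine.handleStateChanges(
  partitionsWithoutLeader,
  OnlinePartition,
  Some(OfflinePartitionLeaderElectionStrategy(allowUnclean = false))
)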
The doHandleStateChanges implementation:
private def doHandleStateChanges(
  partitions: Seq[TopicPartition],
  targetState: PartitionState,
  partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
  // Register any partition not yet tracked with an initial state of NonExistentPartition
  partitions.foreach(partition => controllerContext.putPartitionStateIfNotExists(partition, NonExistentPartition))
  // Find the partitions whose requested transition is illegal and log an error for each
  val (validPartitions, invalidPartitions) = controllerContext.checkValidPartitionStateChange(partitions, targetState)
  invalidPartitions.foreach(partition => logInvalidTransition(partition, targetState))
  // Branch on targetState
  targetState match {
    case NewPartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState with " +
          s"assigned replicas ${controllerContext.partitionReplicaAssignment(partition).mkString(",")}")
        // Set the target state to NewPartition
        controllerContext.putPartitionState(partition, NewPartition)
      }
      Map.empty
    case OnlinePartition =>
      // Collect the uninitialized partitions, i.e. all partitions in NewPartition state
      val uninitializedPartitions = validPartitions.filter(partition => partitionState(partition) == NewPartition)
      // Collect the partitions eligible for leader election:
      // only partitions in OnlinePartition or OfflinePartition state may elect a leader
      val partitionsToElectLeader = validPartitions.filter(partition => partitionState(partition) == OfflinePartition || partitionState(partition) == OnlinePartition)
      if (uninitializedPartitions.nonEmpty) {
        // Initialize the NewPartition partitions by writing their leader and ISR data to ZooKeeper
        val successfulInitializations = initializeLeaderAndIsrForPartitions(uninitializedPartitions)
        successfulInitializations.foreach { partition =>
          stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to $targetState with state " +
            s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
          controllerContext.putPartitionState(partition, OnlinePartition)
        }
      }
      // Elect leaders for the eligible partitions
      if (partitionsToElectLeader.nonEmpty) {
        val electionResults = electLeaderForPartitions(
          partitionsToElectLeader,
          partitionLeaderElectionStrategyOpt.getOrElse(
            throw new IllegalArgumentException("Election strategy is a required field when the target state is OnlinePartition")
          )
        )
        electionResults.foreach {
          case (partition, Right(leaderAndIsr)) =>
            stateChangeLog.trace(
              s"Changed partition $partition from ${partitionState(partition)} to $targetState with state $leaderAndIsr"
            )
            // Move partitions whose election succeeded to OnlinePartition
            controllerContext.putPartitionState(partition, OnlinePartition)
          case (_, Left(_)) => // Ignore; no need to update partition state on election error
        }
        // Return the election results
        electionResults
      } else {
        Map.empty
      }
    case OfflinePartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
        controllerContext.putPartitionState(partition, OfflinePartition)
      }
      Map.empty
    case NonExistentPartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
        controllerContext.putPartitionState(partition, NonExistentPartition)
      }
      Map.empty
  }
}
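The validity check at the top of the method follows directly from the validPreviousStates sets sketched earlier; a minimal illustration (reusing those sketch definitions):
// A transition is legal iff the current state appears in the target's validPreviousStates
def isValidTransition(current: PartitionState, target: PartitionState): Boolean =
  target.validPreviousStates.contains(current)

assert(isValidTransition(NonExistentPartition, NewPartition))     // first-time creation
assert(isValidTransition(NewPartition, OnlinePartition))          // initialization succeeded
assert(!isValidTransition(NonExistentPartition, OnlinePartition)) // must pass through NewPartition first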
Step 1 is initializeLeaderAndIsrForPartitions, which initializes the partitions in NewPartition state by writing their leader and ISR data to ZooKeeper.
private def initializeLeaderAndIsrForPartitions(partitions: Seq[TopicPartition]): Seq[TopicPartition] = {
  val successfulInitializations = mutable.Buffer.empty[TopicPartition]
  // Get each partition's replica assignment
  val replicasPerPartition = partitions.map(partition => partition -> controllerContext.partitionReplicaAssignment(partition))
  // Get each partition's live replicas
  val liveReplicasPerPartition = replicasPerPartition.map { case (partition, replicas) =>
    val liveReplicasForPartition = replicas.filter(replica => controllerContext.isReplicaOnline(replica, partition))
    partition -> liveReplicasForPartition
  }
  // Split the partitions into two groups by whether they have any live replica:
  // partitions with at least one live replica, and partitions with none
  val (partitionsWithoutLiveReplicas, partitionsWithLiveReplicas) = liveReplicasPerPartition.partition { case (_, liveReplicas) => liveReplicas.isEmpty }
  partitionsWithoutLiveReplicas.foreach { case (partition, replicas) =>
    val failMsg = s"Controller $controllerId epoch ${controllerContext.epoch} encountered error during state change of " +
      s"partition $partition from New to Online, assigned replicas are " +
      s"[${replicas.mkString(",")}], live brokers are [${controllerContext.liveBrokerIds}]. No assigned " +
      "replica is alive."
    logFailedStateChange(partition, NewPartition, OnlinePartition, new StateChangeFailedException(failMsg))
  }
  // Determine the leader and ISR for the partitions that have live replicas:
  // the first live replica becomes the leader, and the live replica list becomes the ISR
  val leaderIsrAndControllerEpochs = partitionsWithLiveReplicas.map { case (partition, liveReplicas) =>
    val leaderAndIsr = LeaderAndIsr(liveReplicas.head, liveReplicas.toList)
    val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
    partition -> leaderIsrAndControllerEpoch
  }.toMap
  val createResponses = try {
    // Create each partition's znode in ZooKeeper and write its leader/ISR data to
    // /brokers/topics/<topic>/partitions/<partitionId>/state
    zkClient.createTopicPartitionStatesRaw(leaderIsrAndControllerEpochs, controllerContext.epochZkVersion)
  } catch {
    case e: ControllerMovedException =>
      error("Controller moved to another broker when trying to create the topic partition state znode", e)
      throw e
    case e: Exception =>
      partitionsWithLiveReplicas.foreach { case (partition, _) => logFailedStateChange(partition, partitionState(partition), NewPartition, e) }
      Seq.empty
  }
  createResponses.foreach { createResponse =>
    val code = createResponse.resultCode
    val partition = createResponse.ctx.get.asInstanceOf[TopicPartition]
    val leaderIsrAndControllerEpoch = leaderIsrAndControllerEpochs(partition)
    if (code == Code.OK) {
      controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
      // Queue the data needed for the pending LeaderAndIsrRequest in leaderAndIsrRequestMap
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(leaderIsrAndControllerEpoch.leaderAndIsr.isr,
        partition, leaderIsrAndControllerEpoch, controllerContext.partitionFullReplicaAssignment(partition), isNew = true)
      successfulInitializations += partition
    } else {
      logFailedStateChange(partition, NewPartition, OnlinePartition, code)
    }
  }
  successfulInitializations
}
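The leader/ISR derivation rule for brand-new partitions is worth calling out; a standalone sketch with made-up broker ids:
object NewPartitionInitDemo extends App {
  // Rule: the first live replica in assignment order becomes leader,
  // and the live replica list becomes the initial ISR
  val assignment = Seq(4, 5, 6)
  val liveBrokers = Set(5, 6) // broker 4 happens to be down
  val liveReplicas = assignment.filter(liveBrokers)
  assert(liveReplicas.nonEmpty, "with no live replica, the state change fails instead")
  val (leader, isr) = (liveReplicas.head, liveReplicas.toList)
  println(s"leader=$leader isr=$isr") // leader=5 isr=List(5, 6)
}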
Step 2 elects leaders for the eligible partitions, via electLeaderForPartitions. This method repeatedly calls doElectLeaderForPartitions, retrying any partitions whose ZooKeeper updates need to be redone (typically because of a znode version conflict), until no partitions remain to retry.
private def electLeaderForPartitions(
  partitions: Seq[TopicPartition],
  partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  var remaining = partitions
  val finishedElections = mutable.Map.empty[TopicPartition, Either[Throwable, LeaderAndIsr]]
  while (remaining.nonEmpty) {
    val (finished, updatesToRetry) = doElectLeaderForPartitions(remaining, partitionLeaderElectionStrategy)
    remaining = updatesToRetry
    finished.foreach {
      case (partition, Left(e)) =>
        logFailedStateChange(partition, partitionState(partition), OnlinePartition, e)
      case (_, Right(_)) => // Ignore; success so no need to log failed state change
    }
    finishedElections ++= finished
    if (remaining.nonEmpty)
      logger.info(s"Retrying leader election with strategy $partitionLeaderElectionStrategy for partitions $remaining")
  }
  finishedElections.toMap
}
The core election logic lives in doElectLeaderForPartitions:
private def doElectLeaderForPartitions(
  partitions: Seq[TopicPartition],
  partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy
): (Map[TopicPartition, Either[Exception, LeaderAndIsr]], Seq[TopicPartition]) = {
  val getDataResponses = try {
    // Batch-fetch the znode data of the given partitions from ZooKeeper
    zkClient.getTopicPartitionStatesRaw(partitions)
  } catch {
    case e: Exception =>
      return (partitions.iterator.map(_ -> Left(e)).toMap, Seq.empty)
  }
  // Partitions whose election has failed
  val failedElections = mutable.Map.empty[TopicPartition, Either[Exception, LeaderAndIsr]]
  // Partitions eligible for leader election
  val validLeaderAndIsrs = mutable.Buffer.empty[(TopicPartition, LeaderAndIsr)]
  // Walk through each partition's znode data
  getDataResponses.foreach { getDataResponse =>
    val partition = getDataResponse.ctx.get.asInstanceOf[TopicPartition]
    val currState = partitionState(partition)
    // The znode data was fetched successfully
    if (getDataResponse.resultCode == Code.OK) {
      TopicPartitionStateZNode.decode(getDataResponse.data, getDataResponse.stat) match {
        // The znode data contains leader and ISR information
        case Some(leaderIsrAndControllerEpoch) =>
          // The znode's controller epoch is newer than this Controller's epoch
          if (leaderIsrAndControllerEpoch.controllerEpoch > controllerContext.epoch) {
            val failMsg = s"Aborted leader election for partition $partition since the LeaderAndIsr path was " +
              s"already written by another controller. This probably means that the current controller $controllerId went through " +
              s"a soft failure and another controller was elected with epoch ${leaderIsrAndControllerEpoch.controllerEpoch}."
            // Add the partition to the failed-election list
            failedElections.put(partition, Left(new StateChangeFailedException(failMsg)))
          } else {
            // Add the partition to the eligible list
            validLeaderAndIsrs += partition -> leaderIsrAndControllerEpoch.leaderAndIsr
          }
        case None =>
          val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
          // Add the partition to the failed-election list
          failedElections.put(partition, Left(exception))
      }
    // The znode data could not be fetched: add the partition to the failed-election list
    } else if (getDataResponse.resultCode == Code.NONODE) {
      val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
      failedElections.put(partition, Left(exception))
    } else {
      failedElections.put(partition, Left(getDataResponse.resultException.get))
    }
  }
  // If no partition is eligible for election, return early
  if (validLeaderAndIsrs.isEmpty) {
    return (failedElections.toMap, Seq.empty)
  }
  // Run leader election, then split the results by whether a leader was found.
  // Each case dispatches to the PartitionLeaderElectionAlgorithms method matching
  // the given PartitionLeaderElectionStrategy
  val (partitionsWithoutLeaders, partitionsWithLeaders) = partitionLeaderElectionStrategy match {
    case OfflinePartitionLeaderElectionStrategy(allowUnclean) =>
      val partitionsWithUncleanLeaderElectionState = collectUncleanLeaderElectionState(
        validLeaderAndIsrs,
        allowUnclean
      )
      // Elect leaders for OfflinePartition partitions
      leaderForOffline(controllerContext, partitionsWithUncleanLeaderElectionState).partition(_.leaderAndIsr.isEmpty)
    case ReassignPartitionLeaderElectionStrategy =>
      // Elect leaders for partitions undergoing replica reassignment
      leaderForReassign(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
    case PreferredReplicaPartitionLeaderElectionStrategy =>
      // Run preferred-replica leader election for the partitions
      leaderForPreferredReplica(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
    case ControlledShutdownPartitionLeaderElectionStrategy =>
      // Elect leaders for partitions affected by a controlled broker shutdown
      leaderForControlledShutdown(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
  }
  // Final step: update the ZooKeeper znodes.
  // First, move every partition that failed to elect a leader into the failed-election list
  partitionsWithoutLeaders.foreach { electionResult =>
    val partition = electionResult.topicPartition
    val failMsg = s"Failed to elect leader for partition $partition under strategy $partitionLeaderElectionStrategy"
    failedElections.put(partition, Left(new StateChangeFailedException(failMsg)))
  }
  val recipientsPerPartition = partitionsWithLeaders.map(result => result.topicPartition -> result.liveReplicas).toMap
  val adjustedLeaderAndIsrs = partitionsWithLeaders.map(result => result.topicPartition -> result.leaderAndIsr.get).toMap
  // Update the partitions' znodes in ZooKeeper with the newly elected leader and ISR
  val UpdateLeaderAndIsrResult(finishedUpdates, updatesToRetry) = zkClient.updateLeaderAndIsr(
    adjustedLeaderAndIsrs, controllerContext.epoch, controllerContext.epochZkVersion)
  // For each partition whose znode update succeeded, record its new leader and ISR,
  // build the LeaderAndIsr request data, and queue it in the Controller's pending
  // request batch to be sent later
  finishedUpdates.foreach { case (partition, result) =>
    result.right.foreach { leaderAndIsr =>
      val replicaAssignment = controllerContext.partitionFullReplicaAssignment(partition)
      val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
      controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
      // Queue the data needed for the pending LeaderAndIsrRequest in leaderAndIsrRequestMap
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipientsPerPartition(partition), partition,
        leaderIsrAndControllerEpoch, replicaAssignment, isNew = false)
    }
  }
  // Return the election results: partitions elected and written to ZooKeeper successfully,
  // partitions that failed election, and partitions whose ZooKeeper update must be retried
  (finishedUpdates ++ failedElections, updatesToRetry)
}