PartitionStateMachine Analysis

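PartitionStateMachine is the state machine that the Controller Leader uses to track partition states. A partition's state is a PartitionState, which has four subclasses: NonExistentPartition, NewPartition, OnlinePartition, and OfflinePartition.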



1 Partition State Transitions



# NonExistentPartition -> NewPartition

Loads the partition's AR (assigned replicas) set from ZooKeeper into partitionReplicaAssignment of ControllerContext.

 

# NewPartition -> OnlinePartition

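The broker hosting the first live replica becomes the leader, and all live replicas are placed in the ISR. The leader and ISR information is then written to ZooKeeper.

For this partition, a LeaderAndIsr request is sent to the broker of every live replica, and an UpdateMetadata request is sent to every live broker.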

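# OnlinePartition/OfflinePartition -> OnlinePartition

Selects a new leader replica and ISR set for the partition and writes the result to ZooKeeper. A LeaderAndIsrRequest is then sent to the replicas that must change roles, instructing them to do so, and an UpdateMetadata request is sent to all live brokers to refresh the MetadataCache on each of them.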

 

# NewPartition/OnlinePartition -> OfflinePartition

The partition is simply marked as OfflinePartition in KafkaController.

 

# OfflinePartition -> NonExistentPartition

Only the state is switched; no other action is taken.
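The transitions above can be summarized as a lookup from a target state to the states it may be entered from. The following is a minimal sketch rather than Kafka's own code: the object name PartitionTransitions and the helper isValidTransition are hypothetical, and the allowed sets follow the lists passed to assertValidPreviousStates in handleStateChange (section 3.2).

```scala
object PartitionTransitions {
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // For a target state, the set of states it may legally be entered from.
  def validPreviousStates(target: PartitionState): Set[PartitionState] = target match {
    case NewPartition         => Set(NonExistentPartition)
    case OnlinePartition      => Set(NewPartition, OnlinePartition, OfflinePartition)
    case OfflinePartition     => Set(NewPartition, OnlinePartition, OfflinePartition)
    case NonExistentPartition => Set(OfflinePartition)
  }

  // Hypothetical helper: true when moving from `from` to `to` is allowed.
  def isValidTransition(from: PartitionState, to: PartitionState): Boolean =
    validPreviousStates(to).contains(from)
}
```

Note that a partition cannot go straight from NonExistentPartition to OnlinePartition; it has to pass through NewPartition first.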

 

2 Core Fields

controllerContext: ControllerContext holds the KafkaController's context information

partitionState: Map[TopicAndPartition, PartitionState] stores the state of each partition

brokerRequestBatch: ControllerBrokerRequestBatch sends requests in batches to the specified brokers

noOpPartitionLeaderSelector: NoOpLeaderSelector the default leader selector; it performs no real election and simply returns the current leader replica, the ISR set, and the AR set

topicChangeListener: TopicChangeListener a ZooKeeper listener that watches for topic changes

deleteTopicsListener: DeleteTopicsListener a ZooKeeper listener that watches for topic deletions

partitionModificationsListeners: Map[String, PartitionModificationsListener] listens for partition modifications
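To make the description of noOpPartitionLeaderSelector concrete, here is a small sketch of what a selector that performs no election might look like. The types are simplified stand-ins and not Kafka's actual classes: LeaderAndIsr is reduced to two fields, and the selectLeader signature here takes the AR directly, which is an assumption made to keep the example self-contained.

```scala
object LeaderSelectorSketch {
  // Simplified stand-in for Kafka's LeaderAndIsr (assumption for illustration).
  final case class LeaderAndIsr(leader: Int, isr: List[Int])

  trait PartitionLeaderSelector {
    // Returns the chosen leader and ISR, plus the replicas that should be notified.
    def selectLeader(current: LeaderAndIsr, assignedReplicas: Seq[Int]): (LeaderAndIsr, Seq[Int])
  }

  // Performs no election: the current leader and ISR come back unchanged, together with the AR.
  object NoOpLeaderSelector extends PartitionLeaderSelector {
    def selectLeader(current: LeaderAndIsr, assignedReplicas: Seq[Int]): (LeaderAndIsr, Seq[Int]) =
      (current, assignedReplicas)
  }
}
```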

 

3 Core Methods

3.1 The startup method

When PartitionStateMachine starts, it initializes the state of every partition and then tries to move partitions in the NewPartition or OfflinePartition state to OnlinePartition.

def startup() {
  // Initialize the partition states
  initializePartitionState()
  // set started flag
  hasStarted.set(true)
  // Try to move partitions to the online state
  triggerOnlinePartitionStateChange()
}

 

# Initialize the state of each partition; the initial state is decided by partitionLeadershipInfo of ControllerContext

private def initializePartitionState() {
  // Iterate over the partition-to-replica assignment held by ControllerContext
  for((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) {
    // Check whether leader and ISR information exists for this partition
    // (that is, whether its leader/ISR path exists in ZooKeeper).
    // If it exists the partition is not new; if it does not, this is a new partition
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        // Check whether the partition's leader is alive
        if (controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader))
          // Leader alive: initialize as OnlinePartition
          partitionState.put(topicPartition, OnlinePartition)
        else
          // Leader not alive: initialize as OfflinePartition
          partitionState.put(topicPartition, OfflinePartition)
      case None =>
        // No leadership information: the partition is new, so its state is NewPartition
        partitionState.put(topicPartition, NewPartition)
    }
  }
}
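The decision made for each partition depends on only two inputs: whether a leader is recorded, and whether that leader's broker is alive. A standalone sketch of that rule, with the hypothetical names InitialStateSketch and initialState and plain Int broker ids standing in for Kafka's types, could look like this:

```scala
object InitialStateSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // recordedLeader: broker id of the leader found in the leadership info, if any.
  def initialState(recordedLeader: Option[Int], liveBrokerIds: Set[Int]): PartitionState =
    recordedLeader match {
      case Some(leader) if liveBrokerIds.contains(leader) => OnlinePartition  // leader is alive
      case Some(_)                                        => OfflinePartition // leader is down
      case None                                           => NewPartition     // never had a leader
    }
}
```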

 

# Try to move every partition in the NewPartition or OfflinePartition state to the OnlinePartition state

def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // Try to move every partition in the NewPartition or OfflinePartition state to OnlinePartition
    // Iterate over the partition-to-state map
    for((topicAndPartition, partitionState) <- partitionState
        // Skip partitions whose topic is queued for deletion
        if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
      // Attempt to move OfflinePartition and NewPartition partitions to OnlinePartition
      if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                          (new CallbackBuilder).build)
    }
    // Send the batched requests to the target brokers
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
  }
}
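The selection logic in the loop above amounts to a filter over the state map. As a rough illustration, the sketch below uses a plain immutable Map and a set of topic names in place of the controller's deleteTopicManager; the names OnlineCandidatesSketch and onlineCandidates are hypothetical.

```scala
object OnlineCandidatesSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  final case class TopicAndPartition(topic: String, partition: Int)

  // Partitions that should be moved to OnlinePartition: those in the New or Offline
  // state whose topic is not queued for deletion.
  def onlineCandidates(states: Map[TopicAndPartition, PartitionState],
                       topicsQueuedForDeletion: Set[String]): Set[TopicAndPartition] =
    states.iterator.collect {
      case (tp, state)
          if !topicsQueuedForDeletion.contains(tp.topic) &&
            (state == OfflinePartition || state == NewPartition) => tp
    }.toSet
}
```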

 

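3.2 The handleStateChange method: the core method for partition state transitions. Where an election is needed, it elects a leader with the given leader-selection strategy, and before every transition it checks that the partition's current state is a valid previous state for the target.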

private def handleStateChange(topic: String, partition: Int, targetState: PartitionState, leaderSelector: PartitionLeaderSelector,
    callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  if (!hasStarted.get)
    throw new StateChangeFailedException(("Controller %d epoch %d initiated state change for partition %s to %s failed because " +
      "the partition state machine has not started").format(controllerId, controller.epoch, topicAndPartition, targetState))
  // Look up the partition's current state, defaulting to NonExistentPartition if it has none
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition)
  try {
    targetState match {
      // Target state: NewPartition
      case NewPartition =>
        // Check that the current state is a valid previous state
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        // Update the partition state
        partitionState.put(topicAndPartition, NewPartition)
        // Get the partition's AR set
        val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s with assigned replicas %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, assignedReplicas))
      // Target state: OnlinePartition
      case OnlinePartition =>
        // Check that the current state is a valid previous state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          // Current state is NewPartition
          case NewPartition =>
            // Initialize the leader and ISR of the new partition
            initializeLeaderAndIsrForPartition(topicAndPartition)
          // Current state is OfflinePartition
          case OfflinePartition =>
            // Perform the OfflinePartition -> OnlinePartition transition
            electLeaderForPartition(topic, partition, leaderSelector)
          // Already OnlinePartition, but a leader re-election is required for some reason
          case OnlinePartition => // invoked when the leader needs to be re-elected
            // Perform the OnlinePartition -> OnlinePartition transition
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        // Set the partition state to OnlinePartition
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
      // Target state: OfflinePartition
      case OfflinePartition =>
        // Check that the current state is a valid previous state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        // Set the partition state to OfflinePartition
        partitionState.put(topicAndPartition, OfflinePartition)
      // Target state: NonExistentPartition
      case NonExistentPartition =>
        // Check that the current state is a valid previous state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        // Set the partition state to NonExistentPartition
        partitionState.put(topicAndPartition, NonExistentPartition)
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}

 

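3.3 initializeLeaderAndIsrForPartition: when a partition moves from NewPartition to OnlinePartition, this method initializes its leader and ISR list

# Get the partition's AR set and filter it down to the replicas that are currently live

# If no replica is live, the transition fails

# Otherwise create a LeaderIsrAndControllerEpoch object, which wraps the leader, the ISR, and the controller epoch

# Serialize the LeaderIsrAndControllerEpoch object and store it in ZooKeeper under:

/brokers/topics/[topic_name]/partitions/[partition_id]/state

# Update the partition's leader information in partitionLeadershipInfo of ControllerContext

# Wrap the leader replica, the ISR list, the AR, and related information in a LeaderAndIsrRequest and add it to the pending queue to be sent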

 

private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
  // Get the partition's AR set
  val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
  // Keep only the replicas in the AR set that are on live brokers
  val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r))
  liveAssignedReplicas.size match {
    // No live replica in the AR set: throw a state-change-failed exception
    case 0 =>
      // ......
    case _ =>
      debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
      // Use the first live replica in the AR set as the leader
      val leader = liveAssignedReplicas.head
      // Create the LeaderIsrAndControllerEpoch object
      val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList),
        controller.epoch)
      debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
      try {
        // Create /brokers/topics/[topic_name]/partitions/[partition_id]/state in ZooKeeper from leaderIsrAndControllerEpoch
        zkUtils.createPersistentPath(
          getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
          zkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
        // Update the partition's leader information in partitionLeadershipInfo of ControllerContext
        controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch)
        // Queue a LeaderAndIsr request to be sent to the target brokers
        brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic,
          topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
      } catch {
        //......
      }
  }
}
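The choice of the initial leader and ISR is a small pure computation, sketched here in isolation. LeaderAndIsr is a simplified stand-in and initialLeaderAndIsr is a hypothetical helper; where the real method throws when no replica is live, this sketch returns None instead.

```scala
object InitialLeaderSketch {
  // Simplified stand-in for Kafka's LeaderAndIsr (assumption for illustration).
  final case class LeaderAndIsr(leader: Int, isr: List[Int])

  // Leader = first live replica in AR order; ISR = all live replicas.
  // Returns None when no assigned replica is on a live broker.
  def initialLeaderAndIsr(assignedReplicas: Seq[Int], liveBrokerIds: Set[Int]): Option[LeaderAndIsr] = {
    val live = assignedReplicas.filter(liveBrokerIds.contains)
    live.headOption.map(leader => LeaderAndIsr(leader, live.toList))
  }
}
```

For example, with AR [3, 1, 2] and live brokers {1, 2}, replica 3 is skipped, broker 1 becomes the leader, and the ISR is [1, 2].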

 

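3.4 electLeaderForPartition: used when a partition moves from OfflinePartition or OnlinePartition to OnlinePartition

# Elect a new leader replica for the partition using the given election strategy

# Write the leader and ISR information to the corresponding path in ZooKeeper

# Update the partition's leader information in partitionLeadershipInfo of ControllerContext

# Wrap the leader replica, the ISR list, the AR, and related information in a LeaderAndIsrRequest and add it to the pending queue to be sent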

 

def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  // handle leader election for the partitions whose leader is no longer alive
  stateChangeLogger.trace("Controller %d epoch %d started leader election for partition %s"
                            .format(controllerId, controller.epoch, topicAndPartition))
  try {
    var zookeeperPathUpdateSucceeded: Boolean = false
    var newLeaderAndIsr: LeaderAndIsr = null
    var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
    while(!zookeeperPathUpdateSucceeded) {
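      // Read the partition's current leader replica, ISR set, zkVersion, and so on from ZooKeeper; throw if they do not exist
      val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition)
      val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
      val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
      // If the controller epoch stored in ZooKeeper is greater than this controller's epoch,
      // a newer controller has already written the path, so abort with an exception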
      if (controllerEpoch > controller.epoch) {
        val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
                       "already written by another controller. This probably means that the current controller %d went through " +
                       "a soft failure and another controller was elected with epoch %d.")
                         .format(topic, partition, controllerId, controllerEpoch)
        stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
        throw new StateChangeFailedException(failMsg)
      }
      // Use leaderSelector to elect the new leader replica and ISR list
      val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr)
      // Save the new LeaderAndIsr information to ZooKeeper
      val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partition,
        leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
      newLeaderAndIsr = leaderAndIsr
      newLeaderAndIsr.zkVersion = newVersion
      zookeeperPathUpdateSucceeded = updateSucceeded
      replicasForThisPartition = replicas
    }
    val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
    // Update the partition's leader information in partitionLeadershipInfo of ControllerContext
    controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
    stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
      .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
    // Get the partition's AR set
    val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
    // Queue a LeaderAndIsrRequest to be sent to the target brokers
    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
      newLeaderIsrAndControllerEpoch, replicas)
  } catch {
    // ......
  }
  debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}
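The while loop in electLeaderForPartition is an optimistic-concurrency retry: read the current value together with its version, compute the new value, and write it back only if the version has not changed, repeating otherwise. The sketch below reproduces that pattern against an in-memory store instead of ZooKeeper. Versioned, VersionedStore, and updateWithRetry are hypothetical names, and the (succeeded, newVersion) return shape mirrors how the result of ReplicationUtils.updateLeaderAndIsr is destructured in the listing above.

```scala
object ConditionalUpdateSketch {
  final case class Versioned[A](value: A, version: Int)

  // In-memory stand-in for a ZooKeeper node that supports version-checked writes.
  final class VersionedStore[A](initial: A) {
    private var current: Versioned[A] = Versioned(initial, 0)

    def read(): Versioned[A] = synchronized(current)

    // Writes only if expectedVersion matches the stored version.
    // Returns (succeeded, newVersion); newVersion is -1 on failure.
    def conditionalUpdate(newValue: A, expectedVersion: Int): (Boolean, Int) = synchronized {
      if (current.version == expectedVersion) {
        current = Versioned(newValue, expectedVersion + 1)
        (true, current.version)
      } else (false, -1)
    }
  }

  // Re-read, recompute, and retry until the version-checked write succeeds.
  @annotation.tailrec
  def updateWithRetry[A](store: VersionedStore[A])(compute: A => A): Versioned[A] = {
    val seen = store.read()
    val next = compute(seen.value)
    val (succeeded, newVersion) = store.conditionalUpdate(next, seen.version)
    if (succeeded) Versioned(next, newVersion) else updateWithRetry(store)(compute)
  }
}
```

Recomputing the value on every attempt matters: if another writer changed the node between the read and the write, the election has to be redone against the fresh leader and ISR rather than reusing a stale result.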
