PartitionStateMachine是Controller Leader用于维护分区状态的状态机,分区状态时PartitionState,它有四个子类:
一 分区的状态转换
# NonExistentPartition -> NewPartition
从zookeeper中加载partition的AR 集合到ControllerContext的partitionReplicaAssignment
# NewPartition -> OnlinePartition
首先将第一个可用的副本所在的broker作为leader,再把所有可用的副本对象都装入ISR,然后写leader和ISR信息到zookeeper中保存
对于这个分区而言,发送LeaderAndIsr请求到每个可用的副本broker,以及UpdateMetadata请求到每个可用的broker上
# OnlinePartition/OfflinePartition ->OnlinePartition
为分区选择新的Leader副本和ISR集合,并将结果写入zookeeper,然后向需要进行角色切换的副本发LeaderAndIsrReqeust,指导这些副本进行角色切换,并向所有可用broker发送UpdateMetadataCache请求,更新该broker上的MetadataCache
# NewPartition/OnlinePartition ->OfflinePartition
仅仅是在kafkaController中标记该状态为OfflinePartition
# OfflinePartition -> NonExistentPartition
只是进行状态切换,没有其他操作
二 核心字段
controllerContext: ControllerContext 用于维护KafkaController中上下文信息
partitionState:Map[TopicAndPartition, PartitionState] 用于保存分区对应的状态
brokerRequestBatch:ControllerBrokerRequestBatch 用于向指定的Broker批量发送请请求
noOpPartitionLeaderSelector:NoOpLeaderSelector 默认的副本选举器,并没有真正进行副本选举,只是返回当前的Leader副本,ISR集合和AR集合
topicChangeListener:TopicChangeListener zookeeper的监听器,监听topic的变化
deleteTopicsListener:DeleteTopicsListener zookeeper的监听器,监听topic的删除
partitionModificationsListeners:Map[String, PartitionModifications
Listener] 用于监听分区修改
三 核心方法
3.1 startup方法
在PartitionStateMachine初始化的时候,会初始化partition的状态,并且会将NewPartition、OfflinePartition状态的分区试图转换成Online
Partition状态
def startup() {
// 初始化partition状态
initializePartitionState()
// set started flag
hasStarted.set(true)
// 试图移动partition到online状态
triggerOnlinePartitionStateChange()
}
# 初始化各个partition状态,初始化是根据ControllerContext的
partitionLeadershipinfo来决定的
private def initializePartitionState() {
// 遍历ControllerContext获取的分区和副本映射集合
for((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) {
// 检测ControllerContext保存的leader信息的leader和isr的路径在zookeeper是否存在
// 如果存在表示不是新建的分区,如果不存在则表示这是新分区
controllerContext.partitionLeadershipInfo.get(topicPartition) match {
case Some(currentLeaderIsrAndEpoch) =>
// 检测该分区leader是否可用
if (controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader))
// 如果可用初始化为OnlinePartition
partitionState.put(topicPartition, OnlinePartition)
else
// 如果不可用初始化为OfflinePartition
partitionState.put(topicPartition, OfflinePartition)
case None =>
// 如果没有,则表示是新建的,状态为NewPartition
partitionState.put(topicPartition, NewPartition)
}
}
}
# 试图移动所有NewPartition或者OfflinePartition状态的partition到OnlinePartition状态
def triggerOnlinePartitionStateChange() {
try {
brokerRequestBatch.newBatch()
// 试图移动所有NewPartition或者OfflinePartition状态的partition到OnlinePartition状态
// 遍历每一个分区和对应的状态的映射集合
for((topicAndPartition, partitionState) <- partitionState
// 如果没有开启topic物理删除机制且没有在topic删除队列
if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
// 将OfflinePartition和NewPartition 试图转换成NewPartition
if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
(new CallbackBuilder).build)
}
// 批量发送请求到指定的broker
brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
} catch {
case e: Throwable => error("Error while moving some partitions to the online state", e)
}
}
3.2 handleStateChange方法 进行分区状态切换的核心方法,它会根据指定的leader 选举策略进行选举,每一次在转换前都会检测分区的前置状态是否合法
private def handleStateChange(topic: String, partition: Int, targetState: PartitionState, leaderSelector: PartitionLeaderSelector,
callbacks: Callbacks) {
val topicAndPartition = TopicAndPartition(topic, partition)
if (!hasStarted.get)
throw new StateChangeFailedException(("Controller %d epoch %d initiated state change for partition %s to %s failed because " +
"the partition state machine has not started").format(controllerId, controller.epoch, topicAndPartition, targetState))
// 根据指定的分区,获取分区状态,如果没有则为NonExistentPartition
val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition)
try {
targetState match {
// 如果要转换成NewPartition
case NewPartition =>
// 检查该分区的前置状态
assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
// 修改partition状态
partitionState.put(topicAndPartition, NewPartition)
// 获取分区AR集合
val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")
stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s with assigned replicas %s"
.format(controllerId, controller.epoch, topicAndPartition, currState, targetState, assignedReplicas))
// 如果要转成OnLinePartition
case OnlinePartition =>
// 检查该分区的前置状态
assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
partitionState(topicAndPartition) match {
// 当前分区状态是NewPartition
case NewPartition =>
// 实例化新分区的Leader 和 ISR
initializeLeaderAndIsrForPartition(topicAndPartition)
// 当前分区状态是OfflinePartition
case OfflinePartition =>
// 调用OfflinePartition->OnlinePartition状态转换方法
electLeaderForPartition(topic, partition, leaderSelector)
// 如果本身就是OnlinePartition,然后因为某种原因重新选举
case OnlinePartition => // invoked when the leader needs to be re-elected
// 调用OnlinePartition->OnlinePartition状态转换方法
electLeaderForPartition(topic, partition, leaderSelector)
case _ => // should never come here since illegal previous states are checked above
}
// 修改partition状态为OnlinePartition
partitionState.put(topicAndPartition, OnlinePartition)
val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
.format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
// 如果要转成OfflinePartition
case OfflinePartition =>
// 检查前置状态
assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
.format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
// 修改partition状态为OnlinePartition
partitionState.put(topicAndPartition, OfflinePartition)
// 如果要转成NonExistentPartition
case NonExistentPartition =>
// 检查前置状态
assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
.format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
// 修改partition状态为NonExistentPartition
partitionState.put(topicAndPartition, NonExistentPartition)
}
} catch {
case t: Throwable =>
stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
.format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
}
}
3.3 initializeLeaderAndIsrForPartition如果NewPartition要切换成OnlinePartition状态时,会初始化该分区的Leader和ISR列表
# 获取该分区AR副本集,并且过滤出现在可用的有哪些副本
# 如果没有可用副本,表示转换失败
# 如果有则创建LeaderIsrAndControllerEpoch对象,它封装了Leader,
ISR以及controller epoch相关的信息
# 将LeaderIsrAndControllerEpoch对象进行转换后,保存到zookeeper对应的路径下:
/brokers/topics/[topic_name]/partitions/[partition_id]/state
# 更新ControllerContext的 partitionLeadershipInfo分区的leader信息
# 将获取的Leader副本和ISR列表以及AR等信息,封装成LeaderAndIsrRequest,添加到待发送队列,等待被发送
private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
// 获取该分区AR副本集
val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
// 获取该分区AR副本集中所有可用的副本
val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r))
liveAssignedReplicas.size match {
// 如果AR中没有存活的副本集,抛出状态转换失败的异常
case 0 =>
// ......
case _ =>
debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
// 获取AR中可用副本集中的第一个副本作为Leader
val leader = liveAssignedReplicas.head
// 创建LeaderIsrAndControllerEpoch对象
val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList),
controller.epoch)
debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
try {
// 根据leaderIsrAndControllerEpoch信息在zookeeper创建/brokers/topics/[topic_name]/partitions/[partition_id]/state
zkUtils.createPersistentPath(
getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
zkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
// 更新ControllerContext的partitionLeadershipInfo分区leader相关的信息
controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch)
// 添加LeaderAndIsr请求到队列,等待发送到指定的broker
brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic,
topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
} catch {
//......
}
}
}
3.4 electLeaderForPartition 当OfflinePartition、OnlinePartition 要切换成OnlinePartition状态时
# 根据指定的选举策略为分区选举新的Leader副本
# 将Leader和ISR信息更新到zookeeper对应的路径下
# 更新ControllerContext的 partitionLeadershipInfo分区的leader信息
# 将获取的Leader副本和ISR列表以及AR等信息,封装成LeaderAndIsrRequest,添加到待发送队列,等待被发送
def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
val topicAndPartition = TopicAndPartition(topic, partition)
// handle leader election for the partitions whose leader is no longer alive
stateChangeLogger.trace("Controller %d epoch %d started leader election for partition %s"
.format(controllerId, controller.epoch, topicAndPartition))
try {
var zookeeperPathUpdateSucceeded: Boolean = false
var newLeaderAndIsr: LeaderAndIsr = null
var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
while(!zookeeperPathUpdateSucceeded) {
// 从zk中获取分区当前的leader副本,ISR集合,zkversion等信息,如果不存在则抛出异常
val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition)
val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
// 判断是否小于已有的controller epoch值,如果小于抛出异常
if (controllerEpoch > controller.epoch) {
val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
"already written by another controller. This probably means that the current controller %d went through " +
"a soft failure and another controller was elected with epoch %d.")
.format(topic, partition, controllerId, controllerEpoch)
stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
throw new StateChangeFailedException(failMsg)
}
//根据leaderSelector选举出新的Leader副本和ISR列表
val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr)
// 将新的LeaderAndIsr信息保存到zookeeper
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partition,
leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
newLeaderAndIsr = leaderAndIsr
newLeaderAndIsr.zkVersion = newVersion
zookeeperPathUpdateSucceeded = updateSucceeded
replicasForThisPartition = replicas
}
val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
// 更新ControllerContext的partitionLeadershipInfo分区leader信息
controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
.format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
// 获取该分区AR副本集
val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
// 向队列添加LeaderAndIsrRequest,等待被发送到指定的broker
brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
newLeaderIsrAndControllerEpoch, replicas)
} catch {
}
debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}