1. The Main Functions of PartitionStateMachine
In a Kafka cluster, topic partition states are managed by the PartitionStateMachine module. It registers listeners on the ZooKeeper paths /brokers/topics and /admin/delete_topics to watch for topic creation and deletion events, which in turn trigger partition state transitions.
2. Partition State Transitions
The partitionState variable inside PartitionStateMachine holds the current state of every topic partition, as shown below:
class PartitionStateMachine(controller: KafkaController) extends Logging {
......
private val partitionState: mutable.Map[TopicAndPartition, PartitionState] = mutable.Map.empty
......
}
A partition can be in one of four states: NonExistentPartition, NewPartition, OnlinePartition, and OfflinePartition. Their lifecycles are as follows:
Specifically,
NonExistentPartition: the partition has never been created, or was created and then deleted.
NewPartition: the partition has just been created and has an AR (assigned replicas) list, but no Leader or ISR has been created yet.
OnlinePartition: the partition's Leader has been elected and the corresponding ISR has been generated.
OfflinePartition: the partition's Leader has gone offline for some reason, leaving the partition temporarily unavailable.
The partition state transition rules are as follows:
Target state | Valid previous state(s) | Transition scenario |
NewPartition | NonExistentPartition | A user creates a topic and its metadata is written to ZooKeeper; the KafkaController observes the data change under /brokers/topics and loads the new topic's information, including the partition count and AR lists |
OnlinePartition | NewPartition, OnlinePartition, OfflinePartition | 1) For a newly created partition, elect a Leader and generate its ISR list. 2) For an existing partition, re-elect the Leader and generate a new ISR list |
OfflinePartition | NewPartition, OnlinePartition, OfflinePartition | No broker in the partition's AR list is online |
NonExistentPartition | OfflinePartition | The partition has been deleted |
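The transition rules above can be captured as a simple validity table. The following is a minimal, illustrative sketch (the object and method names are hypothetical, not Kafka's actual implementation; only the state names come from the source):

```scala
// Hypothetical sketch of the partition state transition rules described above.
object PartitionStateRules {
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // For each target state, the set of valid previous states (per the table above)
  val validPreviousStates: Map[PartitionState, Set[PartitionState]] = Map(
    NewPartition         -> Set(NonExistentPartition),
    OnlinePartition      -> Set(NewPartition, OnlinePartition, OfflinePartition),
    OfflinePartition     -> Set(NewPartition, OnlinePartition, OfflinePartition),
    NonExistentPartition -> Set(OfflinePartition)
  )

  // Returns true if a partition may move from state `from` to state `to`
  def isValidTransition(from: PartitionState, to: PartitionState): Boolean =
    validPreviousStates(to).contains(from)
}
```

This mirrors what assertValidPreviousStates enforces in the real code: each target state only accepts a fixed set of previous states.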
3. Starting the PartitionStateMachine Module
During the KafkaController election process, the broker that is elected leader enters the onControllerFailover function to perform initialization (see the post "KafkaController的初始化流程源码解析" on CSDN). Part of that initialization is starting the PartitionStateMachine. The startup process initializes the state of every partition: a partition that already has a Leader Replica is initialized to OnlinePartition or OfflinePartition depending on whether that Leader Replica is online, while a partition that has not yet been assigned a Leader Replica is initialized to NewPartition. The state machine then tries to move every partition in the OfflinePartition or NewPartition state to OnlinePartition, and finally propagates the partition states to the remaining brokers through the ControllerChannelManager.
The detailed startup flow of PartitionStateMachine is as follows:
def startup() {
// Initialize the partition states
initializePartitionState()
// Set the started flag
hasStarted.set(true)
// Trigger transitions to the OnlinePartition state
triggerOnlinePartitionStateChange()
info("Started partition state machine with initial state -> " + partitionState.toString())
}
The initializePartitionState() function sorts partitions into three states: NewPartition, OnlinePartition, and OfflinePartition. Its implementation is as follows:
private def initializePartitionState() {
for((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) {
controllerContext.partitionLeadershipInfo.get(topicPartition) match {
case Some(currentLeaderIsrAndEpoch) =>
// The partition already has a Leader and ISR
controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
case true => // the leader is online, so the state is OnlinePartition
partitionState.put(topicPartition, OnlinePartition)
case false => // the leader is offline, so the state is OfflinePartition
partitionState.put(topicPartition, OfflinePartition)
}
// No Leader or ISR has been assigned yet
case None =>
partitionState.put(topicPartition, NewPartition)
}
}
}
Once partitions have been sorted into the NewPartition, OnlinePartition, and OfflinePartition states, the state machine tries to move those in NewPartition and OfflinePartition to OnlinePartition. The implementation is as follows:
def triggerOnlinePartitionStateChange() {
try {
brokerRequestBatch.newBatch()
// Skip topics that are queued for deletion
for((topicAndPartition, partitionState) <- partitionState
if(!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
/* Pick out partitions in the OfflinePartition or NewPartition state and try to move them to OnlinePartition */
if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
(new CallbackBuilder).build)
}
// Propagate the partition information to the other brokers
brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
} catch {
......
}
}
4. Partition State Transition Scenarios
The handleStateChange function inside PartitionStateMachine implements the actual transition logic. Its parameters are as follows:
private def handleStateChange(topic: String,
partition: Int,
targetState: PartitionState,
leaderSelector: PartitionLeaderSelector,
callbacks: Callbacks) {
......
}
Here topic is the topic the partition belongs to, partition is the partition index, and targetState is the target state.
Several transition scenarios are described below.
1) NonExistentPartition -> NewPartition
The transition code is as follows:
// The target state is NewPartition
case NewPartition =>
// Make sure the previous state is NonExistentPartition
assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
/* Read the topic's AR lists from the ZooKeeper path /brokers/topics/<topic> and cache them in the KafkaController's memory */
assignReplicasToPartitions(topic, partition)
// Switch the state to NewPartition
partitionState.put(topicAndPartition, NewPartition)
val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")
As can be seen, the transition from NonExistentPartition to NewPartition is simple: read the topic's per-partition AR lists from ZooKeeper into the KafkaController's memory, then set the partition state to NewPartition.
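For reference, the data that assignReplicasToPartitions reads from /brokers/topics/&lt;topic&gt; is a JSON blob mapping each partition index to its AR list, roughly of the following shape (the partition indices and broker ids here are made-up example values):

```json
{"version":1,"partitions":{"0":[1,2,3],"1":[2,3,1]}}
```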
2) NewPartition, OnlinePartition, OfflinePartition -> OnlinePartition
The transition code is as follows:
// The target state is OnlinePartition
case OnlinePartition =>
// Make sure the previous state is NewPartition, OnlinePartition, or OfflinePartition
assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
partitionState(topicAndPartition) match {
case NewPartition =>
// Initialize the Leader and ISR from the partition's AR list
initializeLeaderAndIsrForPartition(topicAndPartition)
case OfflinePartition =>
// Use the Leader Replica selector to elect a new Leader and ISR
electLeaderForPartition(topic, partition, leaderSelector)
case OnlinePartition =>
// Use the Leader Replica selector to elect a new Leader and ISR
electLeaderForPartition(topic, partition, leaderSelector)
case _ => // should never reach this branch
}
// Set the in-memory state to OnlinePartition
partitionState.put(topicAndPartition, OnlinePartition)
val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
As can be seen, when a partition moves from NewPartition to OnlinePartition, no Leader Replica selector is needed: the first live broker in the AR list simply becomes the Leader, and all live replicas in the AR list form the ISR.
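The NewPartition branch just described (first live replica in AR becomes the Leader, all live replicas in AR form the ISR) can be sketched as follows. This is a simplified illustration; the object name, method signature, and return type are assumptions, not Kafka's actual initializeLeaderAndIsrForPartition:

```scala
// Hypothetical sketch: pick the initial Leader and ISR from the AR list.
object LeaderInit {
  // Returns Some((leader, isr)), or None if no replica in AR is on a live broker
  def initialLeaderAndIsr(assignedReplicas: Seq[Int],
                          liveBrokerIds: Set[Int]): Option[(Int, Seq[Int])] = {
    // Keep AR order: the ISR is the live replicas in assignment order
    val liveReplicas = assignedReplicas.filter(liveBrokerIds.contains)
    liveReplicas.headOption.map { leader =>
      // Leader = first live replica in AR; ISR = all live replicas in AR
      (leader, liveReplicas)
    }
  }
}
```

Note that preserving AR order matters: the "preferred" leader is by convention the first replica in the assignment, so filtering rather than re-sorting keeps that preference intact.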
When a partition moves from OfflinePartition or OnlinePartition to OnlinePartition, however, a leader selector is used for the election. The process is as follows:
def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
// Assemble the TopicAndPartition
val topicAndPartition = TopicAndPartition(topic, partition)
try {
var zookeeperPathUpdateSucceeded: Boolean = false
var newLeaderAndIsr: LeaderAndIsr = null
var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
while(!zookeeperPathUpdateSucceeded) {
// Read the LeaderIsrAndControllerEpoch from ZooKeeper
val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition)
val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
if (controllerEpoch > controller.epoch) {
/*
* Only the KafkaController currently holding the leader role may trigger partition leader election.
* If the controllerEpoch recorded in ZooKeeper is greater than the current epoch, this KafkaController is stale */
throw new StateChangeFailedException(failMsg)
}
// Elect a new LeaderAndIsr based on the TopicAndPartition and the current LeaderAndIsr
val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr)
// Persist it to ZooKeeper
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition,
leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
newLeaderAndIsr = leaderAndIsr
newLeaderAndIsr.zkVersion = newVersion
zookeeperPathUpdateSucceeded = updateSucceeded
replicasForThisPartition = replicas
}
val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
// Update the KafkaController's in-memory state
controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
// Assemble the metadata request to propagate this node's information to the rest of the brokers
brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
newLeaderIsrAndControllerEpoch, replicas)
} catch {
......
}
}
3) NewPartition, OnlinePartition, OfflinePartition -> OfflinePartition
The transition code is as follows:
// The target state is OfflinePartition
case OfflinePartition =>
// Make sure the previous state is NewPartition, OnlinePartition, or OfflinePartition
assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
// Switch the state to OfflinePartition
partitionState.put(topicAndPartition, OfflinePartition)
This process is simple: it only changes the in-memory state to OfflinePartition.
5. Summary
As the partition state management module of a Kafka cluster, PartitionStateMachine performs partition state transitions by listening for changes under ZooKeeper paths. It covers partition-changing scenarios such as topic creation, partition reassignment, and topic deletion, making it a crucial module in Kafka cluster metadata management.