PartitionStateMachine is the Kafka Controller's partition state machine; it manages the state of every partition in the cluster. Each Kafka Controller instance creates its own partition state machine, but the state machine only starts running once that instance is elected as the active Controller.
Its startup logic is as follows:
def startup(): Unit = {
  info("Initializing partition state")
  // Initialize the locally tracked state of every partition
  initializePartitionState()
  info("Triggering online partition state changes")
  // Try to move every eligible partition currently in OfflinePartition or NewPartition state to OnlinePartition
  triggerOnlinePartitionStateChange()
  debug(s"Started partition state machine with initial state -> ${controllerContext.partitionStates}")
}
private def initializePartitionState(): Unit = {
  // Iterate over every partition in the cluster
  for (topicPartition <- controllerContext.allPartitions) {
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    // Look up the partition's leader brokerId, ISR, and controller epoch
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      // Leader and ISR information exists
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        // The broker hosting the leader replica is alive: initialize the partition as OnlinePartition
        if (controllerContext.isReplicaOnline(currentLeaderIsrAndEpoch.leaderAndIsr.leader, topicPartition))
          // leader is alive
          controllerContext.putPartitionState(topicPartition, OnlinePartition)
        else
          // The broker hosting the leader replica is down: initialize it as OfflinePartition
          controllerContext.putPartitionState(topicPartition, OfflinePartition)
      case None =>
        // No leadership info means this is a newly created partition: set its state to NewPartition
        controllerContext.putPartitionState(topicPartition, NewPartition)
    }
  }
}
Kafka defines four partition states (their definitions are sketched after this list):
- NewPartition: the state a partition is given right after creation, marking it as a brand-new partition object. Kafka considers a partition in this state "uninitialized", so it cannot yet elect a leader.
- OnlinePartition: the state of a partition that is serving normally.
- OfflinePartition: the state of a partition that has gone offline.
- NonExistentPartition: the state of a partition that has been deleted and removed from the state machine.
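Each state also carries the set of states it may legally be entered from; checkValidPartitionStateChange (used later in doHandleStateChanges) validates transitions against these sets. The definitions below are abridged from Kafka's PartitionStateMachine.scala (the 2.x line); check your version's source if it differs:
sealed trait PartitionState {
  def state: Byte
  def validPreviousStates: Set[PartitionState]
}

case object NewPartition extends PartitionState {
  val state: Byte = 0
  val validPreviousStates: Set[PartitionState] = Set(NonExistentPartition)
}

case object OnlinePartition extends PartitionState {
  val state: Byte = 1
  val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
}

case object OfflinePartition extends PartitionState {
  val state: Byte = 2
  val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
}

case object NonExistentPartition extends PartitionState {
  val state: Byte = 3
  val validPreviousStates: Set[PartitionState] = Set(OfflinePartition)
}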
Partition leader election scenarios and methods: four scenarios trigger a partition leader election.
- OfflinePartitionLeaderElectionStrategy: election triggered because the partition's leader replica went offline.
- ReassignPartitionLeaderElectionStrategy: election triggered by a partition replica reassignment.
- PreferredReplicaPartitionLeaderElectionStrategy: election triggered by a preferred-replica leader election.
- ControlledShutdownPartitionLeaderElectionStrategy: election triggered by a controlled (graceful) broker shutdown.
For these four scenarios, the state machine's PartitionLeaderElectionAlgorithms object defines four corresponding methods, each responsible for electing a leader in one scenario: offlinePartitionLeaderElection, reassignPartitionLeaderElection, preferredReplicaPartitionLeaderElection, and controlledShutdownPartitionLeaderElection.
// Partition leader election strategy interface
sealed trait PartitionLeaderElectionStrategy
// Election strategy for offline partitions
final case class OfflinePartitionLeaderElectionStrategy(
  allowUnclean: Boolean) extends PartitionLeaderElectionStrategy
// Election strategy for reassigned partitions
final case object ReassignPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
// Election strategy for preferred-replica leader election
final case object PreferredReplicaPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
// Election strategy for controlled broker shutdown
final case object ControlledShutdownPartitionLeaderElectionStrategy
  extends PartitionLeaderElectionStrategy
offlinePartitionLeaderElection has the most complex logic of the four:
def offlinePartitionLeaderElection(assignment: Seq[Int], // the partition's replica assignment list
                                   isr: Seq[Int],
                                   liveReplicas: Set[Int], // all live replicas of this partition
                                   uncleanLeaderElectionEnabled: Boolean, // whether unclean leader election is allowed
                                   controllerContext: ControllerContext
                                  ): Option[Int] = {
  // Find the first replica in the assignment that is both alive and in the ISR
  assignment.find(id => liveReplicas.contains(id) && isr.contains(id))
    .orElse {
      // If no such replica exists, check whether unclean leader election is allowed,
      // i.e. whether the broker-side parameter unclean.leader.election.enable is true
      if (uncleanLeaderElectionEnabled) {
        // Pick the first live replica in the assignment as leader
        val leaderOpt = assignment.find(liveReplicas.contains)
        if (leaderOpt.isDefined)
          controllerContext.stats.uncleanLeaderElectionRate.mark()
        leaderOpt
      } else {
        // Unclean leader election is not allowed: return None, meaning no leader can be elected
        None
      }
    }
}
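To make the selection order concrete, here is a small standalone sketch (made-up broker ids; ControllerContext and the metrics call are omitted) that mirrors the two-step lookup above:
object OfflineElectionDemo extends App {
  // Same two-step lookup as offlinePartitionLeaderElection, minus the metrics call
  def electOffline(assignment: Seq[Int], isr: Seq[Int], live: Set[Int], unclean: Boolean): Option[Int] =
    assignment.find(id => live.contains(id) && isr.contains(id))
      .orElse(if (unclean) assignment.find(live.contains) else None)

  val assignment = Seq(1, 2, 3) // replica list in assignment order
  // Broker 1 (the old leader) is down; broker 2 is alive and still in the ISR
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(2, 3), unclean = false)) // Some(2)
  // Only broker 3 (not in the ISR) survives: clean election fails...
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(3), unclean = false))    // None
  // ...but unclean election picks it, at the risk of losing committed data
  println(electOffline(assignment, isr = Seq(1, 2), live = Set(3), unclean = true))     // Some(3)
}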
The other three methods are simpler:
def reassignPartitionLeaderElection(reassignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  reassignment.find(id => liveReplicas.contains(id) && isr.contains(id))
}
def preferredReplicaPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}
def controlledShutdownPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int], shuttingDownBrokers: Set[Int]): Option[Int] = {
  assignment.find(id => liveReplicas.contains(id) && isr.contains(id) && !shuttingDownBrokers.contains(id))
}
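A quick illustration of the latter two (again a standalone sketch with made-up broker ids):
object SimpleElectionsDemo extends App {
  val assignment = Seq(1, 2, 3)
  val isr = Seq(1, 2)
  val live = Set(1, 2, 3)
  // Preferred-replica election only ever considers the head of the assignment
  println(assignment.headOption.filter(id => live(id) && isr.contains(id))) // Some(1)
  // Controlled shutdown additionally excludes brokers that are shutting down,
  // so the next live ISR member wins instead
  val shuttingDown = Set(1)
  println(assignment.find(id => live(id) && isr.contains(id) && !shuttingDown(id))) // Some(2)
}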
Handling partition state transitions
handleStateChanges implements the partition state transition logic. In one sentence: handleStateChanges moves the given partitions to targetState, possibly electing new leaders for them with leaderElectionStrategy along the way, and finally returns the partitions' leader information.
override def handleStateChanges(
  partitions: Seq[TopicPartition],
  targetState: PartitionState,
  partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  if (partitions.nonEmpty) {
    try {
      // Reset the Controller's pending-request batch for this round of sends;
      // this also verifies that all requests from the previous batch were sent
      controllerBrokerRequestBatch.newBatch()
      // Delegate the actual state-change logic to doHandleStateChanges
      val result = doHandleStateChanges(
        partitions,
        targetState,
        partitionLeaderElectionStrategyOpt
      )
      // Have the Controller notify the affected brokers of the state changes
      controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
      result
    } catch {
      // If the controllership moved to another broker, log an error and rethrow;
      // the caller catches this exception and runs maybeResign to step down
      case e: ControllerMovedException =>
        error(s"Controller moved to another broker when moving some partitions to $targetState state", e)
        throw e
      // For any other exception, log an error and return it wrapped per partition
      case e: Throwable =>
        error(s"Error while moving some partitions to $targetState state", e)
        partitions.iterator.map(_ -> Left(e)).toMap
    }
  } else {
    // Nothing to do if partitions is empty
    Map.empty
  }
}
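For context, the snippet below sketches a hypothetical caller sequence modeled on what KafkaController does when a broker fails; partitionStateMachine and partitionsWithoutLeader are stand-ins for the controller's state machine instance and the affected partitions, and exact call sites vary by version:
// 1. Mark the partitions that lost their leader as offline; no election is needed yet
partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition, None)
// 2. Try to bring them back online, electing new leaders with the offline strategy
//    (unclean election disabled here)
partitionStateMachine.handleStateChanges(
  partitionsWithoutLeader,
  OnlinePartition,
  Some(OfflinePartitionLeaderElectionStrategy(allowUnclean = false))
)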
The doHandleStateChanges implementation:
private def doHandleStateChanges(
  partitions: Seq[TopicPartition],
  targetState: PartitionState,
  partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
  // Register any partition not yet tracked with an initial state of NonExistentPartition
  partitions.foreach(partition => controllerContext.putPartitionStateIfNotExists(partition, NonExistentPartition))
  // Find the partitions whose requested transition is illegal and log an error for each
  val (validPartitions, invalidPartitions) = controllerContext.checkValidPartitionStateChange(partitions, targetState)
  invalidPartitions.foreach(partition => logInvalidTransition(partition, targetState))
  // Branch on targetState
  targetState match {
    case NewPartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState with " +
          s"assigned replicas ${controllerContext.partitionReplicaAssignment(partition).mkString(",")}")
        // Set the target state to NewPartition
        controllerContext.putPartitionState(partition, NewPartition)
      }
      Map.empty
    case OnlinePartition =>
      // Collect the uninitialized partitions, i.e. all partitions in NewPartition state
      val uninitializedPartitions = validPartitions.filter(partition => partitionState(partition) == NewPartition)
      // Collect the partitions eligible for leader election:
      // only partitions in OnlinePartition or OfflinePartition state may elect a leader
      val partitionsToElectLeader = validPartitions.filter(partition => partitionState(partition) == OfflinePartition || partitionState(partition) == OnlinePartition)
      if (uninitializedPartitions.nonEmpty) {
        // Initialize the NewPartition partitions by writing their leader and ISR data to ZooKeeper
        val successfulInitializations = initializeLeaderAndIsrForPartitions(uninitializedPartitions)
        successfulInitializations.foreach { partition =>
          stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to $targetState with state " +
            s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
          controllerContext.putPartitionState(partition, OnlinePartition)
        }
      }
      // Elect leaders for the eligible partitions
      if (partitionsToElectLeader.nonEmpty) {
        val electionResults = electLeaderForPartitions(
          partitionsToElectLeader,
          partitionLeaderElectionStrategyOpt.getOrElse(
            throw new IllegalArgumentException("Election strategy is a required field when the target state is OnlinePartition")
          )
        )
        electionResults.foreach {
          case (partition, Right(leaderAndIsr)) =>
            stateChangeLog.trace(
              s"Changed partition $partition from ${partitionState(partition)} to $targetState with state $leaderAndIsr"
            )
            // Move partitions whose election succeeded to OnlinePartition
            controllerContext.putPartitionState(partition, OnlinePartition)
          case (_, Left(_)) => // Ignore; no need to update partition state on election error
        }
        // Return the election results
        electionResults
      } else {
        Map.empty
      }
    case OfflinePartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
        controllerContext.putPartitionState(partition, OfflinePartition)
      }
      Map.empty
    case NonExistentPartition =>
      validPartitions.foreach { partition =>
        stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
        controllerContext.putPartitionState(partition, NonExistentPartition)
      }
      Map.empty
  }
}
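The validity check at the top of the method follows directly from the validPreviousStates sets sketched earlier; a minimal illustration (reusing those sketch definitions):
// A transition is legal iff the current state appears in the target's validPreviousStates
def isValidTransition(current: PartitionState, target: PartitionState): Boolean =
  target.validPreviousStates.contains(current)

assert(isValidTransition(NonExistentPartition, NewPartition))     // first-time creation
assert(isValidTransition(NewPartition, OnlinePartition))          // initialization succeeded
assert(!isValidTransition(NonExistentPartition, OnlinePartition)) // must pass through NewPartition first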
Step 1 is initializeLeaderAndIsrForPartitions, which initializes the partitions in NewPartition state by writing their leader and ISR data to ZooKeeper.
private def initializeLeaderAndIsrForPartitions(partitions: Seq[TopicPartition]): Seq[TopicPartition] = {
  val successfulInitializations = mutable.Buffer.empty[TopicPartition]
  // Get each partition's replica assignment
  val replicasPerPartition = partitions.map(partition => partition -> controllerContext.partitionReplicaAssignment(partition))
  // Get each partition's live replicas
  val liveReplicasPerPartition = replicasPerPartition.map { case (partition, replicas) =>
    val liveReplicasForPartition = replicas.filter(replica => controllerContext.isReplicaOnline(replica, partition))
    partition -> liveReplicasForPartition
  }
  // Split the partitions into two groups by whether they have any live replica:
  // partitions with at least one live replica, and partitions with none
  val (partitionsWithoutLiveReplicas, partitionsWithLiveReplicas) = liveReplicasPerPartition.partition { case (_, liveReplicas) => liveReplicas.isEmpty }
  partitionsWithoutLiveReplicas.foreach { case (partition, replicas) =>
    val failMsg = s"Controller $controllerId epoch ${controllerContext.epoch} encountered error during state change of " +
      s"partition $partition from New to Online, assigned replicas are " +
      s"[${replicas.mkString(",")}], live brokers are [${controllerContext.liveBrokerIds}]. No assigned " +
      "replica is alive."
    logFailedStateChange(partition, NewPartition, OnlinePartition, new StateChangeFailedException(failMsg))
  }
  // Determine the leader and ISR for the partitions that have live replicas:
  // the first live replica becomes the leader, and the live replica list becomes the ISR
  val leaderIsrAndControllerEpochs = partitionsWithLiveReplicas.map { case (partition, liveReplicas) =>
    val leaderAndIsr = LeaderAndIsr(liveReplicas.head, liveReplicas.toList)
    val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
    partition -> leaderIsrAndControllerEpoch
  }.toMap
  val createResponses = try {
    // Create each partition's znode in ZooKeeper and write its leader/ISR data to
    // /brokers/topics/<topic>/partitions/<partitionId>/state
    zkClient.createTopicPartitionStatesRaw(leaderIsrAndControllerEpochs, controllerContext.epochZkVersion)
  } catch {
    case e: ControllerMovedException =>
      error("Controller moved to another broker when trying to create the topic partition state znode", e)
      throw e
    case e: Exception =>
      partitionsWithLiveReplicas.foreach { case (partition, _) => logFailedStateChange(partition, partitionState(partition), NewPartition, e) }
      Seq.empty
  }
  createResponses.foreach { createResponse =>
    val code = createResponse.resultCode
    val partition = createResponse.ctx.get.asInstanceOf[TopicPartition]
    val leaderIsrAndControllerEpoch = leaderIsrAndControllerEpochs(partition)
    if (code == Code.OK) {
      controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
      // Queue the data needed for the pending LeaderAndIsrRequest in leaderAndIsrRequestMap
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(leaderIsrAndControllerEpoch.leaderAndIsr.isr,
        partition, leaderIsrAndControllerEpoch, controllerContext.partitionFullReplicaAssignment(partition), isNew = true)
      successfulInitializations += partition
    } else {
      logFailedStateChange(partition, NewPartition, OnlinePartition, code)
    }
  }
  successfulInitializations
}
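The leader/ISR derivation rule for brand-new partitions is worth calling out; a standalone sketch with made-up broker ids:
object NewPartitionInitDemo extends App {
  // Rule: the first live replica in assignment order becomes leader,
  // and the live replica list becomes the initial ISR
  val assignment = Seq(4, 5, 6)
  val liveBrokers = Set(5, 6) // broker 4 happens to be down
  val liveReplicas = assignment.filter(liveBrokers)
  assert(liveReplicas.nonEmpty, "with no live replica, the state change fails instead")
  val (leader, isr) = (liveReplicas.head, liveReplicas.toList)
  println(s"leader=$leader isr=$isr") // leader=5 isr=List(5, 6)
}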
Step 2 elects leaders for the eligible partitions, via electLeaderForPartitions. This method repeatedly calls doElectLeaderForPartitions, retrying any partitions whose ZooKeeper updates need to be redone (typically because of a znode version conflict), until no partitions remain to retry.
private def electLeaderForPartitions(
  partitions: Seq[TopicPartition],
  partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  var remaining = partitions
  val finishedElections = mutable.Map.empty[TopicPartition, Either[Throwable, LeaderAndIsr]]
  while (remaining.nonEmpty) {
    val (finished, updatesToRetry) = doElectLeaderForPartitions(remaining, partitionLeaderElectionStrategy)
    remaining = updatesToRetry
    finished.foreach {
      case (partition, Left(e)) =>
        logFailedStateChange(partition, partitionState(partition), OnlinePartition, e)
      case (_, Right(_)) => // Ignore; success so no need to log failed state change
    }
    finishedElections ++= finished
    if (remaining.nonEmpty)
      logger.info(s"Retrying leader election with strategy $partitionLeaderElectionStrategy for partitions $remaining")
  }
  finishedElections.toMap
}
The core election logic lives in doElectLeaderForPartitions:
private def doElectLeaderForPartitions(
  partitions: Seq[TopicPartition],
  partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy
): (Map[TopicPartition, Either[Exception, LeaderAndIsr]], Seq[TopicPartition]) = {
  val getDataResponses = try {
    // Batch-fetch the znode data of the given partitions from ZooKeeper
    zkClient.getTopicPartitionStatesRaw(partitions)
  } catch {
    case e: Exception =>
      return (partitions.iterator.map(_ -> Left(e)).toMap, Seq.empty)
  }
  // Partitions whose election has failed
  val failedElections = mutable.Map.empty[TopicPartition, Either[Exception, LeaderAndIsr]]
  // Partitions eligible for leader election
  val validLeaderAndIsrs = mutable.Buffer.empty[(TopicPartition, LeaderAndIsr)]
  // Walk through each partition's znode data
  getDataResponses.foreach { getDataResponse =>
    val partition = getDataResponse.ctx.get.asInstanceOf[TopicPartition]
    val currState = partitionState(partition)
    // The znode data was fetched successfully
    if (getDataResponse.resultCode == Code.OK) {
      TopicPartitionStateZNode.decode(getDataResponse.data, getDataResponse.stat) match {
        // The znode data contains leader and ISR information
        case Some(leaderIsrAndControllerEpoch) =>
          // The znode's controller epoch is newer than this Controller's epoch
          if (leaderIsrAndControllerEpoch.controllerEpoch > controllerContext.epoch) {
            val failMsg = s"Aborted leader election for partition $partition since the LeaderAndIsr path was " +
              s"already written by another controller. This probably means that the current controller $controllerId went through " +
              s"a soft failure and another controller was elected with epoch ${leaderIsrAndControllerEpoch.controllerEpoch}."
            // Add the partition to the failed-election list
            failedElections.put(partition, Left(new StateChangeFailedException(failMsg)))
          } else {
            // Add the partition to the eligible list
            validLeaderAndIsrs += partition -> leaderIsrAndControllerEpoch.leaderAndIsr
          }
        case None =>
          val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
          // Add the partition to the failed-election list
          failedElections.put(partition, Left(exception))
      }
    // The znode data could not be fetched: add the partition to the failed-election list
    } else if (getDataResponse.resultCode == Code.NONODE) {
      val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
      failedElections.put(partition, Left(exception))
    } else {
      failedElections.put(partition, Left(getDataResponse.resultException.get))
    }
  }
  // If no partition is eligible for election, return early
  if (validLeaderAndIsrs.isEmpty) {
    return (failedElections.toMap, Seq.empty)
  }
  // Run leader election, then split the results by whether a leader was found.
  // Each case dispatches to the PartitionLeaderElectionAlgorithms method matching
  // the given PartitionLeaderElectionStrategy
  val (partitionsWithoutLeaders, partitionsWithLeaders) = partitionLeaderElectionStrategy match {
    case OfflinePartitionLeaderElectionStrategy(allowUnclean) =>
      val partitionsWithUncleanLeaderElectionState = collectUncleanLeaderElectionState(
        validLeaderAndIsrs,
        allowUnclean
      )
      // Elect leaders for OfflinePartition partitions
      leaderForOffline(controllerContext, partitionsWithUncleanLeaderElectionState).partition(_.leaderAndIsr.isEmpty)
    case ReassignPartitionLeaderElectionStrategy =>
      // Elect leaders for partitions undergoing replica reassignment
      leaderForReassign(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
    case PreferredReplicaPartitionLeaderElectionStrategy =>
      // Run preferred-replica leader election for the partitions
      leaderForPreferredReplica(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
    case ControlledShutdownPartitionLeaderElectionStrategy =>
      // Elect leaders for partitions affected by a controlled broker shutdown
      leaderForControlledShutdown(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
  }
  // Final step: update the ZooKeeper znodes.
  // First, move every partition that failed to elect a leader into the failed-election list
  partitionsWithoutLeaders.foreach { electionResult =>
    val partition = electionResult.topicPartition
    val failMsg = s"Failed to elect leader for partition $partition under strategy $partitionLeaderElectionStrategy"
    failedElections.put(partition, Left(new StateChangeFailedException(failMsg)))
  }
  val recipientsPerPartition = partitionsWithLeaders.map(result => result.topicPartition -> result.liveReplicas).toMap
  val adjustedLeaderAndIsrs = partitionsWithLeaders.map(result => result.topicPartition -> result.leaderAndIsr.get).toMap
  // Update the partitions' znodes in ZooKeeper with the newly elected leader and ISR
  val UpdateLeaderAndIsrResult(finishedUpdates, updatesToRetry) = zkClient.updateLeaderAndIsr(
    adjustedLeaderAndIsrs, controllerContext.epoch, controllerContext.epochZkVersion)
  // For each partition whose znode update succeeded, record its new leader and ISR,
  // build the LeaderAndIsr request data, and queue it in the Controller's pending
  // request batch to be sent later
  finishedUpdates.foreach { case (partition, result) =>
    result.right.foreach { leaderAndIsr =>
      val replicaAssignment = controllerContext.partitionFullReplicaAssignment(partition)
      val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
      controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
      // Queue the data needed for the pending LeaderAndIsrRequest in leaderAndIsrRequestMap
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipientsPerPartition(partition), partition,
        leaderIsrAndControllerEpoch, replicaAssignment, isNew = false)
    }
  }
  // Return the election results: partitions elected and written to ZooKeeper successfully,
  // partitions that failed election, and partitions whose ZooKeeper update must be retried
  (finishedUpdates ++ failedElections, updatesToRetry)
}