Preface: today we look at how to make a specified replica of a partition the leader.
Can partition reassignment achieve this?
If you are not familiar with the partition reassignment flow, read the earlier material on partition reassignment first; here we only revisit the reassignment code with this question in mind.
Key code of partition reassignment
kafka.controller.KafkaController#updateCurrentReassignment
This method writes the partition reassignment data to ZooKeeper and to controllerContext. A word about controllerContext: as the name suggests, this class serves the controller and plays the role of a container, caching much of the data the controller needs so that it does not have to query ZooKeeper constantly. controllerContext and metadata are two different kinds of data; metadata is maintained on every broker, so the two should not be confused.
The ZooKeeper node written here is /brokers/topics/{topic}, with content in the form {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}. The partitions entry is what we usually call the AR; if we only change the replica order here, without adding or removing replicas, the write is still accepted.
private def updateCurrentReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
  val currentAssignment = controllerContext.partitionFullReplicaAssignment(topicPartition)

  if (currentAssignment != reassignment) {
    info(s"Updating assignment of partition $topicPartition from $currentAssignment to $reassignment")

    // write the new assignment to ZooKeeper
    updateReplicaAssignmentForPartition(topicPartition, reassignment)
    // keep a copy in the in-memory controllerContext as well
    controllerContext.updatePartitionFullReplicaAssignment(topicPartition, reassignment)

    val unneededReplicas = currentAssignment.replicas.diff(reassignment.replicas)
    if (unneededReplicas.nonEmpty)
      stopRemovedReplicasOfReassignedPartition(topicPartition, unneededReplicas)
  }

  // watch ISR changes for the partition being reassigned and mark the reassignment as in progress
  val reassignIsrChangeHandler = new PartitionReassignmentIsrChangeHandler(eventManager, topicPartition)
  zkClient.registerZNodeChangeHandler(reassignIsrChangeHandler)
  controllerContext.partitionsBeingReassigned.add(topicPartition)
}
kafka.controller.KafkaController#moveReassignedPartitionLeaderIfRequired
This runs after the replicas newly added by the reassignment have caught up; it decides whether the leader needs to be re-elected. Two cases require a new election:
1. the partition's original leader is not in the TRS (target replica set)
2. the original leader is in the TRS but is already offline
Both cases use the reassignment leader election strategy (ReassignPartitionLeaderElectionStrategy). From the leader re-election code in the partition state machine we know that this strategy picks as the new leader the first replica in the AR that is in the ISR and alive (a minimal sketch of that selection logic follows the method below).
private def moveReassignedPartitionLeaderIfRequired(topicPartition: TopicPartition,
                                                    newAssignment: ReplicaAssignment): Unit = {
  val reassignedReplicas = newAssignment.replicas
  val currentLeader = controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader

  if (!reassignedReplicas.contains(currentLeader)) {
    // case 1: the current leader is not in the TRS -- let the partition state machine re-elect the leader
    partitionStateMachine.handleStateChanges(Seq(topicPartition), OnlinePartition, Some(ReassignPartitionLeaderElectionStrategy))
  } else if (controllerContext.isReplicaOnline(currentLeader, topicPartition)) {
    // the leader is in the TRS and alive: just update
    // /brokers/topics/{topic}/partitions/{partition}/state in ZK (bump the leader epoch) and notify the brokers
    updateLeaderEpochAndSendRequest(topicPartition, newAssignment)
  } else {
    // case 2: the leader is in the TRS but offline -- a new leader must be elected
    partitionStateMachine.handleStateChanges(Seq(topicPartition), OnlinePartition, Some(ReassignPartitionLeaderElectionStrategy))
  }
}
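For reference, the selection done by ReassignPartitionLeaderElectionStrategy boils down to the small algorithm below. This is a minimal sketch of the logic in kafka.controller.PartitionLeaderElectionAlgorithms, assuming the target assignment, the ISR and the set of live replicas are passed in as plain collections.

// Walk the target assignment in order and return the first replica that is both alive
// and in the ISR; None means no eligible leader was found.
def reassignPartitionLeaderElection(reassignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  reassignment.find(id => liveReplicas.contains(id) && isr.contains(id))
}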
Flow of designating a partition leader via reassignment
Summary of the partition reassignment approach
Partition reassignment can make a specified replica the leader, but only under preconditions: the current leader must either not be in the TRS or already be offline. That makes this approach less than ideal.
Implementing it via partition leader election
Flow of partition leader election
Key code of partition leader election
kafka.controller.KafkaController#onReplicaElection
This is the logic handled in KafkaController after the leader election script has been executed. It mainly sets the election strategy and calls the partition state machine to handle the partition state; the transition here is OnlinePartition -> OnlinePartition.
private[this] def onReplicaElection(
  partitions: Set[TopicPartition],
  electionType: ElectionType,
  electionTrigger: ElectionTrigger
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
  info(s"Starting replica leader election ($electionType) for partitions ${partitions.mkString(",")} triggered by $electionTrigger")
  try {
    // pick the strategy from the election type passed in: PREFERRED uses the preferred-replica
    // strategy, UNCLEAN uses the offline-partition strategy
    val strategy = electionType match {
      case ElectionType.PREFERRED => PreferredReplicaPartitionLeaderElectionStrategy
      case ElectionType.UNCLEAN =>
        /* Let's be conservative and only trigger unclean election if the election type is unclean and it was
         * triggered by the admin client
         */
        OfflinePartitionLeaderElectionStrategy(allowUnclean = electionTrigger == AdminClientTriggered)
    }
    // hand the state change straight to the partition state machine
    val results = partitionStateMachine.handleStateChanges(
      partitions.toSeq,
      OnlinePartition,
      Some(strategy)
    )
    // ... some code omitted
    results
  }
}
kafka.controller.Election#leaderForPreferredReplica
This is the preferred-replica handling in the OnlinePartition -> OnlinePartition transition of the partition state machine. It is straightforward: call the corresponding algorithm to elect the leader, then assemble the result object.
private def leaderForPreferredReplica(partition: TopicPartition,
                                      leaderAndIsr: LeaderAndIsr,
                                      controllerContext: ControllerContext): ElectionResult = {
  val assignment = controllerContext.partitionReplicaAssignment(partition)
  val liveReplicas = assignment.filter(replica => controllerContext.isReplicaOnline(replica, partition))
  val isr = leaderAndIsr.isr
  println("preferredReplicaPartitionLeaderElection:" + assignment + "--" + liveReplicas + "--" + isr)
  val leaderOpt = PartitionLeaderElectionAlgorithms.preferredReplicaPartitionLeaderElection(assignment, isr, liveReplicas.toSet)
  val newLeaderAndIsrOpt = leaderOpt.map(leader => leaderAndIsr.newLeader(leader))
  ElectionResult(partition, newLeaderAndIsrOpt, assignment)
}
kafka.controller.PartitionLeaderElectionAlgorithms#preferredReplicaPartitionLeaderElection
The preferred-replica election code is worth calling out. The logic is to take the first element of assignment and match it against liveReplicas and isr; in other words, take the first replica in the AR (the preferred replica), check whether it is alive and in the ISR, return it if so, and return None otherwise.
def preferredReplicaPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}
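To make the None case concrete, here is a small worked example with illustrative values taken from the test below: the AR is [3,1,2] and broker 3, the preferred replica, is offline.

import kafka.controller.PartitionLeaderElectionAlgorithms

// Broker 3 is neither alive nor in the ISR, so the head of the AR fails the filter
// and the election returns None.
val leader = PartitionLeaderElectionAlgorithms.preferredReplicaPartitionLeaderElection(
  assignment = Seq(3, 1, 2),
  isr = Seq(1, 2),
  liveReplicas = Set(1, 2)
)
// leader == None, which the caller surfaces as PreferredLeaderNotAvailableException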
What happens if the preferred replica is offline and we run the leader election script? Here is a test case for exactly that scenario.
- Prepare topic_1; the AR of partition 0 is [3,1,2], and the ZK data for this topicPartition is
{"controller_epoch":263,"leader":3,"version":1,"leader_epoch":0,"isr":[3,1,2]}
- Take broker 3 offline; the ZK data for the topicPartition then becomes
{"controller_epoch":263,"leader":1,"version":1,"leader_epoch":1,"isr":[1,2]}
- Run the partition leader election script with the following input (a programmatic equivalent via the Admin API is sketched after the log below)
{
  "partitions": [
    {
      "topic": "topic_1",
      "partition": 0
    }
  ]
}
- The console then prints the log below. The exception is propagated from the KafkaController back to the LeaderElectionCommand client. The reason it is thrown is again the preferred-replica election strategy: the election only matches the first replica in the AR against the live replicas and the ISR; if it matches, the preferred replica becomes leader, otherwise None is returned. A None result surfaces as a PreferredLeaderNotAvailableException, and the command fails with AdminCommandFailedException. This is something to keep in mind when using replica election.
Error completing leader election (PREFERRED) for partition: topic_1-0: org.apache.kafka.common.errors.PreferredLeaderNotAvailableException: Failed to elect leader for partition topic_1-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
Exception in thread "main" kafka.common.AdminCommandFailedException: 1 replica(s) could not be elected
at kafka.admin.LeaderElectionCommand$.electLeaders(LeaderElectionCommand.scala:173)
at kafka.admin.LeaderElectionCommand$.run(LeaderElectionCommand.scala:89)
at kafka.admin.LeaderElectionCommand$.main(LeaderElectionCommand.scala:41)
at kafka.admin.LeaderElectionCommand.main(LeaderElectionCommand.scala)
Suppressed: org.apache.kafka.common.errors.PreferredLeaderNotAvailableException: Failed to elect leader for partition topic_1-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
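The same election can also be triggered programmatically via the Java AdminClient instead of the script. The sketch below is a minimal example for this scenario; the bootstrap address is a placeholder, and with broker 3 still offline the result for topic_1-0 carries the same PreferredLeaderNotAvailableException.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.common.{ElectionType, TopicPartition}

object ElectPreferredLeaderExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder address

    val admin = AdminClient.create(props)
    try {
      // PREFERRED election for topic_1-0, equivalent to the JSON fed to the script above
      val result = admin.electLeaders(ElectionType.PREFERRED,
        Collections.singleton(new TopicPartition("topic_1", 0)))
      // per-partition result: an empty Optional means success; with broker 3 offline,
      // topic_1-0 maps to a PreferredLeaderNotAvailableException instead
      println(result.partitions().get())
    } finally {
      admin.close()
    }
  }
}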
Summary of the partition leader election approach
From the code analysis above we can see that the partition leader election script does not support reordering the AR; it directly uses the partitionReplicaAssignment cached in controllerContext as the AR to match against. If we want to make a specified replica the leader, three steps are needed:
1. change the order of the AR for the topicPartition in ZK
2. delete the /controller node in ZK to trigger a new controller election, which reloads the ZK data into controllerContext
3. run the partition leader election script
This approach does make the specified replica the leader, but having to switch the controller is not very friendly (a rough sketch of these three steps follows).
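Purely for illustration, the three steps could be driven roughly as below with the plain ZooKeeper client plus the AdminClient. The topic znode JSON follows the format quoted earlier; the connection strings, session timeout and hard-coded assignment are placeholders, and in practice you would wait for the controller re-election to finish and use conditional writes rather than version -1.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.common.{ElectionType, TopicPartition}
import org.apache.zookeeper.ZooKeeper

object ForcePreferredLeaderSketch {
  def main(args: Array[String]): Unit = {
    val zk = new ZooKeeper("localhost:2181", 30000, null) // placeholder ZK address

    // Step 1: rewrite the AR order in /brokers/topics/topic_1 so that replica 1 comes first
    val newAssignment =
      """{"version":2,"partitions":{"0":[1,3,2]},"adding_replicas":{},"removing_replicas":{}}"""
    zk.setData("/brokers/topics/topic_1", newAssignment.getBytes("UTF-8"), -1)

    // Step 2: delete /controller so a new controller is elected and controllerContext is rebuilt from ZK
    zk.delete("/controller", -1)
    zk.close()

    // Step 3: trigger a preferred-replica election (the new head of the AR becomes the candidate)
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)
    try {
      println(admin.electLeaders(ElectionType.PREFERRED,
        Collections.singleton(new TopicPartition("topic_1", 0))).partitions().get())
    } finally {
      admin.close()
    }
  }
}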
Comparison of the two approaches
Comparing the two approaches above, partition reassignment is not viable, and partition leader election requires deleting the controller node to force a controllerContext refresh. Could we implement this capability inside Kafka itself?
Source modification proposals
Triggering the controllerContext update by modifying ZK data
The controller's listener on the ZK node /brokers/topics/{topic} is kafka.controller.KafkaController#processPartitionModifications. As the name suggests, it watches for partition changes; the code below mainly serves to detect newly added partitions. Adding a watch for AR reordering here would not fit the design of this code very well.
private def processPartitionModifications(topic: String): Unit = {
  def restorePartitionReplicaAssignment(
    topic: String,
    newPartitionReplicaAssignment: Map[TopicPartition, ReplicaAssignment]
  ): Unit = {
    info("Restoring the partition replica assignment for topic %s".format(topic))

    val existingPartitions = zkClient.getChildren(TopicPartitionsZNode.path(topic))
    val existingPartitionReplicaAssignment = newPartitionReplicaAssignment
      .filter(p => existingPartitions.contains(p._1.partition.toString))
      .map { case (tp, _) =>
        tp -> controllerContext.partitionFullReplicaAssignment(tp)
      }.toMap

    zkClient.setTopicAssignment(topic,
      existingPartitionReplicaAssignment,
      controllerContext.epochZkVersion)
  }

  if (!isActive) return
  val partitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(immutable.Set(topic))
  val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
    controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
  }

  if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
    if (partitionsToBeAdded.nonEmpty) {
      warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
        .format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))

      restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
    } else {
      // This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
      info("Ignoring partition change during topic deletion as no new partitions are added")
    }
  } else if (partitionsToBeAdded.nonEmpty) {
    info(s"New partitions to be added $partitionsToBeAdded")
    partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
      controllerContext.updatePartitionFullReplicaAssignment(topicPartition, assignedReplicas)
    }
    onNewPartitionCreation(partitionsToBeAdded.keySet)
  }
}
Implementing the feature in partition leader election
Change the input format: if replicas is supplied, compare it against the AR in ZK; if only the order has changed, write the new order into controllerContext, otherwise ignore it. If replicas is not supplied, run the usual preferred-replica election on the existing AR. The implementation flow is shown in figure three (a rough sketch of this idea follows).
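Purely as an illustration of this proposal (not existing Kafka code), the controller-side handling might look roughly like the sketch below. maybeReorderAssignment is a hypothetical helper; it reuses methods already seen in this article (partitionFullReplicaAssignment, updateReplicaAssignmentForPartition, updatePartitionFullReplicaAssignment), and the exact wiring into onReplicaElection is left out.

// Hypothetical helper: before running the preferred-replica election, accept an optional
// replica list from the election request and, if it is just a permutation of the current
// AR, persist the new order so that its head becomes the preferred replica.
def maybeReorderAssignment(topicPartition: TopicPartition,
                           requestedReplicas: Option[Seq[Int]]): Unit = {
  requestedReplicas.foreach { replicas =>
    val currentAssignment = controllerContext.partitionFullReplicaAssignment(topicPartition)
    // only accept a pure reordering: same replica set, nothing added or removed
    if (replicas.sorted == currentAssignment.replicas.sorted && replicas != currentAssignment.replicas) {
      // assuming the ReplicaAssignment(replicas) factory that builds a plain assignment
      // with no adding/removing replicas
      val reordered = ReplicaAssignment(replicas)
      // keep ZK and the controller cache consistent, mirroring updateCurrentReassignment
      updateReplicaAssignmentForPartition(topicPartition, reordered)
      controllerContext.updatePartitionFullReplicaAssignment(topicPartition, reordered)
    }
    // anything else (replicas added or removed) is ignored and the election runs on the old AR
  }
}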
Summary
As we saw, the controllerContext update triggered by modifying ZK data mainly exists to support adding partitions; bolting a replica-reorder feature onto it would, in my view, go against the original design. I lean towards implementing the feature directly in partition leader election. What do you think?