7. How to make a specified replica the leader of a partition

Preface: this section analyzes how to make a specified replica the leader of a partition.

Can partition reassignment do this?

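If you are not familiar with the partition reassignment flow, read the earlier material on partition reassignment first; here we only revisit the reassignment code with this question in mind.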

Key partition reassignment code

kafka.controller.KafkaController#updateCurrentReassignment

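This method writes the partition reassignment data to ZooKeeper and to controllerContext. controllerContext deserves a special note: as the name suggests, it is used by the controller and acts as a container that caches much of the data the controller needs, so the controller does not have to query ZooKeeper frequently. controllerContext and metadata are two different kinds of data: metadata is kept per broker, and the two should not be confused.
The data is written to the ZooKeeper node /brokers/topics/{topic} in the format {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}. What we usually call the AR (assigned replicas) is the replica list stored for each partition under "partitions". If we only change the order here, without adding or removing replicas, the data can still be written.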

private def updateCurrentReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
    val currentAssignment = controllerContext.partitionFullReplicaAssignment(topicPartition)

    if (currentAssignment != reassignment) {
      info(s"Updating assignment of partition $topicPartition from $currentAssignment to $reassignment")

      // Write the data to ZooKeeper
      updateReplicaAssignmentForPartition(topicPartition, reassignment)

      // Keep a copy in memory (controllerContext) as well
      controllerContext.updatePartitionFullReplicaAssignment(topicPartition, reassignment)

      val unneededReplicas = currentAssignment.replicas.diff(reassignment.replicas)
      if (unneededReplicas.nonEmpty)
        stopRemovedReplicasOfReassignedPartition(topicPartition, unneededReplicas)
    }

    // Partition reassignment: register the ISR change handler for this partition
    val reassignIsrChangeHandler = new PartitionReassignmentIsrChangeHandler(eventManager, topicPartition)
    zkClient.registerZNodeChangeHandler(reassignIsrChangeHandler)

    controllerContext.partitionsBeingReassigned.add(topicPartition)
  }

kafka.controller.KafkaController#moveReassignedPartitionLeaderIfRequired

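This runs after the replicas added by the reassignment are in place. It decides whether the leader needs to be re-elected, which is required in two cases:
1. The partition's current leader is not in the TRS (target replica set).
2. The partition's current leader is in the TRS but is offline.
Both cases use the reassignment election strategy (ReassignPartitionLeaderElectionStrategy). From the leader re-election code in the partition state machine, we know that this strategy picks the first replica in the AR that is in the ISR and alive as the leader.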

 private def moveReassignedPartitionLeaderIfRequired(topicPartition: TopicPartition,
                                                      newAssignment: ReplicaAssignment): Unit = {
    val reassignedReplicas = newAssignment.replicas
    val currentLeader = controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader

    // Current leader is not in the TRS: the partition state machine handles the state change and re-elects a leader
    if (!reassignedReplicas.contains(currentLeader)) {
      
      partitionStateMachine.handleStateChanges(Seq(topicPartition), OnlinePartition, Some(ReassignPartitionLeaderElectionStrategy))
    } else if (controllerContext.isReplicaOnline(currentLeader, topicPartition)) {
    
      // Leader is in the TRS and online: update the data in ZooKeeper under /brokers/topics/{topic}/partitions/{partition}/state
      updateLeaderEpochAndSendRequest(topicPartition, newAssignment)
    } else {
      // Leader is in the TRS but offline: a new leader must be elected
      partitionStateMachine.handleStateChanges(Seq(topicPartition), OnlinePartition, Some(ReassignPartitionLeaderElectionStrategy))
    }
  }
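
The selection rule described above (first replica in the target list that is both alive and in the ISR) can be tried out in isolation. The following is a minimal standalone sketch, not the Kafka source: the object name ReassignElectionSketch and the function name reassignLeader are made up for illustration, and broker ids are plain Ints.

```scala
object ReassignElectionSketch {
  // Hypothetical helper: pick the first replica of the target assignment
  // that is both alive and a member of the ISR; None if no replica qualifies.
  def reassignLeader(targetReplicas: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] =
    targetReplicas.find(id => liveReplicas.contains(id) && isr.contains(id))

  def main(args: Array[String]): Unit = {
    // Target list [2,1,3], every broker alive and in sync: broker 2 is chosen.
    println(reassignLeader(Seq(2, 1, 3), Seq(1, 2, 3), Set(1, 2, 3)))
    // Broker 2 offline and out of the ISR: the next candidate, broker 1, is chosen.
    println(reassignLeader(Seq(2, 1, 3), Seq(1, 3), Set(1, 3)))
  }
}
```

Unlike the preferred replica strategy discussed later, this rule keeps walking the list until it finds a usable replica, so it only fails when no replica in the target list is both alive and in sync.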

Flow of assigning a partition leader through partition reassignment

Figure 1

Summary of the partition reassignment approach

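Partition reassignment can make a specified replica the leader, but only under a precondition: the current leader must be outside the TRS or offline. That makes this approach a poor choice in practice.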

Using partition leader election

Partition leader election flow

Figure 2

Key partition leader election code

kafka.controller.KafkaController#onReplicaElection

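This is the logic KafkaController runs after the leader election script has been executed. It mainly chooses the election strategy and calls the partition state machine to handle the partition state; the state transition here is OnlinePartition -> OnlinePartition.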

private[this] def onReplicaElection(
    partitions: Set[TopicPartition],
    electionType: ElectionType,
    electionTrigger: ElectionTrigger
  ): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
    info(s"Starting replica leader election ($electionType) for partitions ${partitions.mkString(",")} triggered by $electionTrigger")
    try {
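      // Choose the election strategy from the electionType passed in: PREFERRED maps to the preferred replica
      // leader election strategy, UNCLEAN maps to the offline partition leader election strategy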
      val strategy = electionType match {
        case ElectionType.PREFERRED => PreferredReplicaPartitionLeaderElectionStrategy
        case ElectionType.UNCLEAN =>
          /* Let's be conservative and only trigger unclean election if the election type is unclean and it was
           * triggered by the admin client
           */
          OfflinePartitionLeaderElectionStrategy(allowUnclean = electionTrigger == AdminClientTriggered)
      }
      // Call the partition state machine directly to change the partition state
      val results = partitionStateMachine.handleStateChanges(
        partitions.toSeq,
        OnlinePartition,
        Some(strategy)
      )
      // …… part of the code omitted
      results
    }
  }

kafka.controller.Election#leaderForPreferredReplica

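This is the logic that handles the preferred replica strategy for the OnlinePartition -> OnlinePartition transition in the partition state machine. It is fairly simple: call the corresponding method to choose the leader, then assemble the return object.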

 private def leaderForPreferredReplica(partition: TopicPartition,
                                        leaderAndIsr: LeaderAndIsr,
                                        controllerContext: ControllerContext): ElectionResult = {
    val assignment = controllerContext.partitionReplicaAssignment(partition)
    val liveReplicas = assignment.filter(replica => controllerContext.isReplicaOnline(replica, partition))
    val isr = leaderAndIsr.isr
    // Debug print used for this analysis (not in the upstream source)
    println("preferredReplicaPartitionLeaderElection:" + assignment + "--" + liveReplicas + "--" + isr)
    val leaderOpt = PartitionLeaderElectionAlgorithms.preferredReplicaPartitionLeaderElection(assignment, isr, liveReplicas.toSet)
    val newLeaderAndIsrOpt = leaderOpt.map(leader => leaderAndIsr.newLeader(leader))
    ElectionResult(partition, newLeaderAndIsrOpt, assignment)
  }

kafka.controller.PartitionLeaderElectionAlgorithms#preferredReplicaPartitionLeaderElection

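The preferred replica election code is worth calling out separately. It takes the first value in assignment and matches it against liveReplicas and isr. In other words, it takes the first replica in the AR (the preferred replica) and checks whether it is in the ISR and alive. If so, that replica is returned; otherwise None is returned.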

  def preferredReplicaPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
    assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
  }

What happens if the leader election script is run while the preferred replica is offline? The following test case illustrates this scenario.

  • Prepare topic_1, whose partition 0 has the AR [3,1,2]; the data for this topicPartition in ZooKeeper is
{"controller_epoch":263,"leader":3,"version":1,"leader_epoch":0,"isr":[3,1,2]}
  • Take broker 3 offline; the data for this topicPartition in ZooKeeper then becomes
{"controller_epoch":263,"leader":1,"version":1,"leader_epoch":1,"isr":[1,2]}
  • Run the partition leader election script with the following input
{
  "partitions": [
    {
      "topic": "topic_1",
      "partition": 0
    }
  ]
}
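  • The console prints the log below. The exception is returned from KafkaController to the LeaderElectionCommand client. It is thrown because of the preferred replica election strategy: the election only checks the first replica in the AR against the live ISR. If it matches, the preferred replica is returned as the leader; if not, None is returned, and a None result leads to AdminCommandFailedException. This is a point to watch for when using replica election.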
Error completing leader election (PREFERRED) for partition: topic_1-0: org.apache.kafka.common.errors.PreferredLeaderNotAvailableException: Failed to elect leader for partition topic_1-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
Exception in thread "main" kafka.common.AdminCommandFailedException: 1 replica(s) could not be elected
	at kafka.admin.LeaderElectionCommand$.electLeaders(LeaderElectionCommand.scala:173)
	at kafka.admin.LeaderElectionCommand$.run(LeaderElectionCommand.scala:89)
	at kafka.admin.LeaderElectionCommand$.main(LeaderElectionCommand.scala:41)
	at kafka.admin.LeaderElectionCommand.main(LeaderElectionCommand.scala)
	Suppressed: org.apache.kafka.common.errors.PreferredLeaderNotAvailableException: Failed to elect leader for partition topic_1-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
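
The failure above can be reproduced without a cluster. The sketch below copies the one-line selection rule shown earlier into a standalone object and feeds it the values from this test case; the object name PreferredElectionSketch and the function name preferredLeader are made up for illustration.

```scala
object PreferredElectionSketch {
  // Same rule as preferredReplicaPartitionLeaderElection above: only the head of the AR is considered.
  def preferredLeader(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] =
    assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))

  def main(args: Array[String]): Unit = {
    val ar = Seq(3, 1, 2)
    // Before broker 3 goes offline: the preferred replica 3 is alive and in the ISR.
    println(preferredLeader(ar, Seq(3, 1, 2), Set(1, 2, 3)))
    // After broker 3 goes offline: the ISR is [1,2], so the head of the AR fails the check.
    println(preferredLeader(ar, Seq(1, 2), Set(1, 2)))
  }
}
```

The second call yields None even though brokers 1 and 2 are healthy, because the rule never looks past the first entry of the AR. That empty result is what surfaces as PreferredLeaderNotAvailableException in the log.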

Summary of the partition leader election approach

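From the code analysis above, the partition leader election script does not support changing the AR order; it matches directly against the partitionReplicaAssignment data cached in controllerContext as the AR. To make a specified replica the leader, three steps are needed:
1. Change the order of the AR for the topicPartition in ZooKeeper.
2. Delete the controller node in ZooKeeper to force a controller re-election, which reloads the ZooKeeper data into controllerContext.
3. Run the partition leader election script.
This approach does make the specified replica the leader, but having to switch the controller is not very friendly.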
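
For step 1, the new AR is simply the old AR with the desired replica moved to the front. The sketch below computes that order and renders it in the /brokers/topics/{topic} format quoted earlier. It is only an illustration under stated assumptions: the names PreferredReplicaReorderSketch, moveToFront and assignmentJson are made up, the JSON covers a single partition only (a real topic node lists every partition), and writing the result to ZooKeeper is left out.

```scala
object PreferredReplicaReorderSketch {
  // Hypothetical helper: move desiredLeader to the head of the AR, keeping the
  // relative order of the other replicas. Returns None if the broker is not in the AR.
  def moveToFront(ar: Seq[Int], desiredLeader: Int): Option[Seq[Int]] =
    if (ar.contains(desiredLeader)) Some(desiredLeader +: ar.filterNot(_ == desiredLeader))
    else None

  // Hypothetical helper: render one partition in the version-2 assignment format.
  def assignmentJson(partition: Int, replicas: Seq[Int]): String = {
    val list = replicas.mkString(",")
    s"""{"version":2,"partitions":{"$partition":[$list]},"adding_replicas":{},"removing_replicas":{}}"""
  }

  def main(args: Array[String]): Unit = {
    // AR [3,1,2] with broker 1 as the desired leader.
    moveToFront(Seq(3, 1, 2), 1) match {
      case Some(newAr) => println(assignmentJson(0, newAr))
      case None        => println("desired leader is not part of the AR")
    }
  }
}
```

Refusing to reorder when the broker is not already in the AR keeps this a pure reordering; adding or removing replicas is a job for partition reassignment.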

Comparing the two approaches

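Comparing the two approaches, partition reassignment is not practical, while partition leader election requires deleting the controller node to trigger a controllerContext refresh. Could this be implemented inside Kafka itself?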

Source code modification options

Triggering a controllerContext update by modifying ZooKeeper data

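The controller handles changes to the ZooKeeper node /brokers/topics/{topic} in kafka.controller.KafkaController#processPartitionModifications. As the name suggests, it mainly watches for partition changes; the code below mostly deals with newly added partitions. Adding handling for changes to the replica AR here would not fit the design of this code.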

 private def processPartitionModifications(topic: String): Unit = {
    def restorePartitionReplicaAssignment(
      topic: String,
      newPartitionReplicaAssignment: Map[TopicPartition, ReplicaAssignment]
    ): Unit = {
      info("Restoring the partition replica assignment for topic %s".format(topic))

      val existingPartitions = zkClient.getChildren(TopicPartitionsZNode.path(topic))
      val existingPartitionReplicaAssignment = newPartitionReplicaAssignment
        .filter(p => existingPartitions.contains(p._1.partition.toString))
        .map { case (tp, _) =>
          tp -> controllerContext.partitionFullReplicaAssignment(tp)
      }.toMap

      zkClient.setTopicAssignment(topic,
        existingPartitionReplicaAssignment,
        controllerContext.epochZkVersion)
    }
    if (!isActive) return
    val partitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(immutable.Set(topic))
    val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
      controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
    }

    if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
      if (partitionsToBeAdded.nonEmpty) {
        warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
          .format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))

        restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
      } else {
        // This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
        info("Ignoring partition change during topic deletion as no new partitions are added")
      }
    } else if (partitionsToBeAdded.nonEmpty) {
      info(s"New partitions to be added $partitionsToBeAdded")
      partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
        controllerContext.updatePartitionFullReplicaAssignment(topicPartition, assignedReplicas)
      }
      onNewPartitionCreation(partitionsToBeAdded.keySet)
    }
  }

Implementing this in partition leader election

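Change the input format slightly: if replicas is supplied, compare it with the AR in ZooKeeper; if only the order differs, write it into controllerContext, otherwise ignore it. If replicas is not supplied, run the preferred replica election against the original AR as before. The flow is shown in Figure 3.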

Figure 3
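
The acceptance rule proposed here can be sketched as a pure function. This is only an illustration of the idea, not a patch to Kafka: the names ElectionWithReplicasSketch, isReorderOnly and effectiveAssignment are hypothetical, and the requested replicas are modeled as an Option so that a missing field falls back to the existing AR.

```scala
object ElectionWithReplicasSketch {
  // True only when the requested list holds exactly the same brokers as the
  // current AR but in a different order.
  def isReorderOnly(current: Seq[Int], requested: Seq[Int]): Boolean =
    current != requested && current.sorted == requested.sorted

  // The AR that the preferred replica election would then run against.
  def effectiveAssignment(current: Seq[Int], requested: Option[Seq[Int]]): Seq[Int] =
    requested match {
      case Some(r) if isReorderOnly(current, r) => r       // accept a pure reordering
      case _                                    => current // no replicas given, or replicas differ: keep the AR
    }

  def main(args: Array[String]): Unit = {
    val currentAr = Seq(3, 1, 2)
    println(effectiveAssignment(currentAr, Some(Seq(1, 3, 2)))) // reordering is accepted
    println(effectiveAssignment(currentAr, Some(Seq(1, 2, 4)))) // broker 4 is not in the AR, so it is ignored
    println(effectiveAssignment(currentAr, None))               // nothing requested, original AR is used
  }
}
```

Comparing the sorted lists rejects any request that adds, drops, or duplicates a replica, so this path can never change replica membership; it only changes which replica counts as preferred.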

Summary

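As we have seen, the path that updates controllerContext when ZooKeeper data changes is mainly geared toward adding partitions. Adding replica reordering there would, in my view, go against the original design. I personally prefer implementing this directly in partition leader election. What do you think?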
