1. Kafka Partition Replica Reassignment: The Main Flow

This article takes a deep dive into Kafka partition replica reassignment: how the client script kafka-reassign-partitions.sh is used, its parameters, and how a reassignment is executed. On the server side, it covers how configuration changes are propagated through ZooKeeper, how partition replica state is updated, and how the reassignment task is processed. The whole flow involves client-server interaction, dynamic configuration management, and replica state machine transitions.

The script for partition replica reassignment is kafka-reassign-partitions.sh.
Its parameters are listed below; a generate usage example follows the table.

No.  Parameter                        Description
1    bootstrap-server                 Kafka cluster broker address list
2    command-config                   Client config file for the reassignment tool
3    zookeeper                        ZooKeeper cluster address
4    generate                         Generate a candidate reassignment plan
5    execute                          Execute the reassignment according to the plan JSON
6    verify                           Verify whether the reassignment has completed
7    reassignment-json-file           Path of the reassignment plan file
8    topics-to-move-json-file         JSON file listing the topics to move; used with generate
9    broker-list                      Target broker list; used with generate
10   disable-rack-aware               Ignore rack awareness
11   throttle                         Throttle the inter-broker reassignment traffic
12   replica-alter-log-dirs-throttle  Throttle cross-log-dir (intra-broker) moves
13   timeout
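For reference, a typical generate run might look like the following; the topic name, broker ids, ZooKeeper address and file name are placeholders rather than values from this article:

# topics-to-move.json (illustrative)
{"version": 1, "topics": [{"topic": "test"}]}

# Generate a candidate plan that spreads the topic's partitions over brokers 1,2,3
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3" \
  --generate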

Parameters that can be used when executing a reassignment (an execute/verify example follows the table):

No.  Parameter                        Description
1    bootstrap-server                 Kafka cluster broker address list
2    command-config                   Client config file for the reassignment tool, e.g. --command-config config/producer.properties
3    zookeeper                        ZooKeeper cluster address
4    execute                          Execute the reassignment according to the plan JSON
5    reassignment-json-file           Path of the reassignment plan file
6    disable-rack-aware               Ignore rack awareness
7    throttle                         Throttle the inter-broker reassignment traffic
8    replica-alter-log-dirs-throttle  Throttle cross-log-dir (intra-broker) moves
9    timeout
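Executing and then verifying a plan with a throttle might look like this (file name, address and rate are placeholders):

# Execute the plan with a 10 MB/s inter-broker throttle
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment.json \
  --throttle 10485760 \
  --execute

# Re-run verify until it reports completion; verify also removes the throttle configs
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment.json \
  --verify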

Main flow diagram of partition reassignment:

[figure: partition replica reassignment main flow]

Partition replica reassignment client

kafka.admin.ReassignPartitionsCommand

def main(args: Array[String]): Unit = {
    val opts = validateAndParseArgs(args)
    val zkConnect = opts.options.valueOf(opts.zkConnectOpt)
    val time = Time.SYSTEM
    val zkClient = KafkaZkClient(zkConnect, JaasUtils.isZkSaslEnabled, 30000, 30000, Int.MaxValue, time)
    // Connect to the cluster (creates the AdminClient when --bootstrap-server is provided)
    val adminClientOpt = createAdminClient(opts)

    try {
      if(opts.options.has(opts.verifyOpt))
        verifyAssignment(zkClient, adminClientOpt, opts)
      else if(opts.options.has(opts.generateOpt))
        generateAssignment(zkClient, opts)
      else if (opts.options.has(opts.executeOpt))
        executeAssignment(zkClient, adminClientOpt, opts)
    } catch {
      case e: Throwable =>
        println("Partitions reassignment failed due to " + e.getMessage)
        println(Utils.stackTrace(e))
    } finally zkClient.close()
  }
  • Entering executeAssignment: it mainly validates the reassignment JSON file and then calls the method below; the real work is done by reassignPartitionsCommand.reassignPartitions.
def executeAssignment(zkClient: KafkaZkClient, adminClientOpt: Option[Admin], reassignmentJsonString: String, throttle: Throttle, timeoutMs: Long = 10000L): Unit = {
    // partitionAssignment: the partition assignment rules parsed from the JSON
    val (partitionAssignment, replicaAssignment) = parseAndValidate(zkClient, reassignmentJsonString)
    val adminZkClient = new AdminZkClient(zkClient)
    val reassignPartitionsCommand = new ReassignPartitionsCommand(zkClient, adminClientOpt, partitionAssignment.toMap, replicaAssignment, adminZkClient)

    // If there is an existing rebalance running, attempt to change its throttle
    // 2.1 If a reassignment is already in progress, only the throttle can be changed
    if (zkClient.reassignPartitionsInProgress()) {
      println("There is an existing assignment running.")
      //
      reassignPartitionsCommand.maybeLimit(throttle)
    } else {
      printCurrentAssignment(zkClient, partitionAssignment.map(_._1.topic))
      if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
        println(String.format("Warning: You must run Verify periodically, until the reassignment completes, to ensure the throttle is removed. You can also alter the throttle by rerunning the Execute command passing a new value."))
      // 2.2 The call that actually kicks off the reassignment
      if (reassignPartitionsCommand.reassignPartitions(throttle, timeoutMs)) {
        println("Successfully started reassignment of partitions.")
      } else
        println("Failed to reassign partitions %s".format(partitionAssignment))
    }
  }

Entering the reassignPartitionsCommand.maybeLimit method.
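maybeLimit applies the broker-level throttle rates. A condensed sketch of what it does (my own simplification, not the verbatim Kafka source; the string keys are the real dynamic-config property names):

private def maybeLimit(throttle: Throttle): Unit = {
    if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0) {
      // Every broker that currently holds, or will hold, a replica of the reassigned partitions
      val brokers = (existingAssignment().values.flatten ++ proposedPartitionAssignment.values.flatten).toSeq.distinct
      for (id <- brokers) {
        // Merge the throttle rates into the broker's dynamic config under /config/brokers/<id>
        val configs = adminZkClient.fetchEntityConfig(ConfigType.Broker, id.toString)
        if (throttle.interBrokerLimit >= 0) {
          configs.put("leader.replication.throttled.rate", throttle.interBrokerLimit.toString)
          configs.put("follower.replication.throttled.rate", throttle.interBrokerLimit.toString)
        }
        if (throttle.replicaAlterLogDirsLimit >= 0)
          configs.put("replica.alter.log.dirs.io.max.bytes.per.second", throttle.replicaAlterLogDirsLimit.toString)
        adminZkClient.changeBrokerConfig(Seq(id), configs)
      }
    }
  }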

Entering the reassignPartitionsCommand.reassignPartitions method:

def reassignPartitions(throttle: Throttle = NoThrottle, timeoutMs: Long = 10000L): Boolean = {
    // 2.2.1 Mainly writes the throttle configs to ZooKeeper
    maybeThrottle(throttle)
    try {
      // Validate the partitions to be reassigned
      val validPartitions = proposedPartitionAssignment.groupBy(_._1.topic())
        .flatMap { case (topic, topicPartitionReplicas) =>
          validatePartition(zkClient, topic, topicPartitionReplicas)
        }
      if (validPartitions.isEmpty) false
      else {
          // proposedReplicaAssignment holds the log-dir (path) reassignments
        if (proposedReplicaAssignment.nonEmpty && adminClientOpt.isEmpty)
          throw new AdminCommandFailedException("bootstrap-server needs to be provided in order to reassign replica to the specified log directory")
        val startTimeMs = System.currentTimeMillis()

        // Send AlterReplicaLogDirsRequest to allow broker to create replica in the right log dir later if the replica has not been created yet.
        if (proposedReplicaAssignment.nonEmpty)
          alterReplicaLogDirsIgnoreReplicaNotAvailable(proposedReplicaAssignment, adminClientOpt.get, timeoutMs)

        // Create reassignment znode so that controller will send LeaderAndIsrRequest to create replica in the broker
        // 2.2.2 Write the reassignment task to ZooKeeper
        zkClient.createPartitionReassignment(validPartitions.map({case (key, value) => (new TopicPartition(key.topic, key.partition), value)}).toMap)

        // Send AlterReplicaLogDirsRequest again to make sure broker will start to move replica to the specified log directory.
        // It may take some time for controller to create replica in the broker. Retry if the replica has not been created.
        var remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
        val replicasAssignedToFutureDir = mutable.Set.empty[TopicPartitionReplica]
        while (remainingTimeMs > 0 && replicasAssignedToFutureDir.size < proposedReplicaAssignment.size) {
          replicasAssignedToFutureDir ++= alterReplicaLogDirsIgnoreReplicaNotAvailable(
            proposedReplicaAssignment.filter { case (replica, _) => !replicasAssignedToFutureDir.contains(replica) },
            adminClientOpt.get, remainingTimeMs)
          Thread.sleep(100)
          remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
        }
        replicasAssignedToFutureDir.size == proposedReplicaAssignment.size
      }
    } catch {
      case _: NodeExistsException =>
        val partitionsBeingReassigned = zkClient.getPartitionReassignment()
        throw new AdminCommandFailedException("Partition reassignment currently in " +
          "progress for %s. Aborting operation".format(partitionsBeingReassigned))
    }
  }
Entering the maybeThrottle method:
 private def maybeThrottle(throttle: Throttle): Unit = {
    if (throttle.interBrokerLimit >= 0)
      // 2.2.1.1 Write the per-topic leader/follower throttled-replica lists to ZooKeeper
      assignThrottledReplicas(existingAssignment(), proposedPartitionAssignment, adminZkClient)
    // 2.2.1.2 Write the per-broker throttle-rate configs to ZooKeeper
    maybeLimit(throttle)
    if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
      throttle.postUpdateAction()
    if (throttle.interBrokerLimit >= 0)
      println(s"The inter-broker throttle limit was set to ${throttle.interBrokerLimit} B/s")
    if (throttle.replicaAlterLogDirsLimit >= 0)
      println(s"The replica-alter-dir throttle limit was set to ${throttle.replicaAlterLogDirsLimit} B/s")
  }
Entering the assignThrottledReplicas method. It writes the throttled leader and follower replica lists to ZooKeeper under /config/topics/<topic>, which belongs to the dynamic configuration mechanism.
private[admin] def assignThrottledReplicas(existingPartitionAssignment: Map[TopicPartition, Seq[Int]],
                                             proposedPartitionAssignment: Map[TopicPartition, Seq[Int]],
                                             adminZkClient: AdminZkClient): Unit = {
    for (topic <- proposedPartitionAssignment.keySet.map(_.topic).toSeq.distinct) {
      val existingPartitionAssignmentForTopic = existingPartitionAssignment.filter { case (tp, _) => tp.topic == topic }
      val proposedPartitionAssignmentForTopic = proposedPartitionAssignment.filter { case (tp, _) => tp.topic == topic }

      //Apply leader throttle to all replicas that exist before the re-balance.
      val leader = format(preRebalanceReplicaForMovingPartitions(existingPartitionAssignmentForTopic, proposedPartitionAssignmentForTopic))

      //Apply follower throttle to all "move destinations".
      val follower = format(postRebalanceReplicasThatMoved(existingPartitionAssignmentForTopic, proposedPartitionAssignmentForTopic))

      val configs = adminZkClient.fetchEntityConfig(ConfigType.Topic, topic)
      configs.put(LeaderReplicationThrottledReplicasProp, leader)
      configs.put(FollowerReplicationThrottledReplicasProp, follower)
      adminZkClient.changeTopicConfig(topic, configs)

      info(s"Updated leader-throttled replicas for topic $topic with: $leader")
      info(s"Updated follower-throttled replicas for topic $topic with: $follower")
    }
  }
  • Take a look at the format that gets written: the number to the left of the colon is the partition, the number to the right is the broker id hosting that replica.
  • For leader.replication.throttled.replicas, the value below means replicas 1 and 3 of partition 0 and replicas 2 and 3 of partition 2 are throttled.
  • follower.replication.throttled.replicas is the follower side; since a reassignment moves partition data, this lists the newly added replicas (the move destinations).
{
  "version" : 1,
  "config" : {
    "follower.replication.throttled.replicas" : "0:2,2:1",
    "leader.replication.throttled.replicas" : "2:2,2:3,0:3,0:1"
  }
}
maybeLimit then writes the per-broker throttle config to ZooKeeper; the data below lives at /config/brokers/<brokerId>:
{
  "version" : 1,
  "config" : {
    "leader.replication.throttled.rate" : "1",
    "follower.replication.throttled.rate" : "1"
  }
}
Finally, the reassignment task itself is written to ZooKeeper at /admin/reassign_partitions; every partition that changes is included in the payload (you can also inspect this node directly, see the command after the JSON):
{
    "version": 1,
    "partitions": [
        {
            "topic": "test",
            "partition": 1,
            "replicas": [
                2,
                1
            ]
        }
    ]
}
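To peek at the pending task directly, the ZooKeeper shell shipped with Kafka can read the znode (connection string is a placeholder):

bin/zookeeper-shell.sh localhost:2181 get /admin/reassign_partitions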

Summary

  • This part covered writing the throttle configs and the reassignment task itself to ZooKeeper. You can think of the reassignment tool as a built-in client that connects to the cluster and issues its instructions through ZooKeeper; the actual work is still carried out on the server side.
  • For the details of how the client and server communicate, see
    https://www.szzdzhp.com/kafka/Source_code/controller-2-broker.html
  • For how the state machines work, see
    https://www.szzdzhp.com/kafka/Source_code/controller-state-machine.html

Source analysis of ZooKeeper config changes

The entry point is kafka/server/DynamicConfigManager.scala, which starts a listener thread watching the config data in ZooKeeper and invokes the corresponding handler to process each change.
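As a mental model, the dispatch works roughly like the sketch below (my own simplification, not the DynamicConfigManager source): a change notification names an entity type and entity name, the entity's full config is re-read from /config/<entityType>/<entityName>, and the handler registered for that type is invoked.

import java.util.Properties

// Sketch only: mirrors the idea of DynamicConfigManager dispatching to per-type handlers
trait ConfigChangeHandler {
  def processConfigChanges(entityName: String, props: Properties): Unit
}

class DynamicConfigDispatcherSketch(handlers: Map[String, ConfigChangeHandler],
                                    fetchEntityConfig: (String, String) => Properties) {
  // Invoked for every notification written under /config/changes
  def onNotification(entityType: String, entityName: String): Unit = {
    val props = fetchEntityConfig(entityType, entityName) // e.g. re-read /config/topics/test
    handlers.get(entityType).foreach(_.processConfigChanges(entityName, props))
  }
}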

Topic configs

  • The handler for topics is TopicConfigHandler; its processConfigChanges method is ultimately invoked. The key code is as follows.
  def processConfigChanges(topic: String, topicConfig: Properties): Unit = {
    // Validate the configurations.
    val configNamesToExclude = excludedConfigs(topic, topicConfig)

    updateLogConfig(topic, topicConfig, configNamesToExclude)

    def updateThrottledList(prop: String, quotaManager: ReplicationQuotaManager) = {
      if (topicConfig.containsKey(prop) && topicConfig.getProperty(prop).length > 0) {
        val partitions = parseThrottledPartitions(topicConfig, kafkaConfig.brokerId, prop)
        quotaManager.markThrottled(topic, partitions)
        info(s"Setting $prop on broker ${kafkaConfig.brokerId} for topic: $topic and partitions $partitions")
      } else {
        quotaManager.removeThrottle(topic)
        info(s"Removing $prop from broker ${kafkaConfig.brokerId} for topic $topic")
      }
    }
    updateThrottledList(LogConfig.LeaderReplicationThrottledReplicasProp, quotas.leader)
    updateThrottledList(LogConfig.FollowerReplicationThrottledReplicasProp, quotas.follower)

    if (Try(topicConfig.getProperty(KafkaConfig.UncleanLeaderElectionEnableProp).toBoolean).getOrElse(false)) {
      kafkaController.enableTopicUncleanLeaderElection(topic)
    }
  }
  
  def markThrottled(topic: String, partitions: Seq[Int]): Unit = {
    // The config finally ends up in memory; throttledPartitions is a ConcurrentHashMap
    throttledPartitions.put(topic, partitions)
  }
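The throttled-replicas value uses the partition:brokerId format shown earlier. A self-contained sketch of turning it into the partitions throttled on a given broker (illustrative, not the actual parseThrottledPartitions implementation, which also handles the "*" wildcard):

// Sketch only: "0:1,0:3,2:2" -> partitions whose throttled replica lives on brokerId
def throttledPartitionsFor(value: String, brokerId: Int): Seq[Int] =
  value.split(",").map(_.trim).filter(_.nonEmpty)
    .map(_.split(":"))
    .collect { case Array(partition, broker) if broker.toInt == brokerId => partition.toInt }
    .toSeq.distinct

// throttledPartitionsFor("0:1,0:3,2:2", brokerId = 3) == Seq(0)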

Broker configs

  • The handler for brokers is BrokerConfigHandler; its processConfigChanges method is likewise invoked.
def processConfigChanges(brokerId: String, properties: Properties): Unit = {
    def getOrDefault(prop: String): Long = {
      if (properties.containsKey(prop))
        properties.getProperty(prop).toLong
      else
        DefaultReplicationThrottledRate
    }
    if (brokerId == ConfigEntityName.Default)
      brokerConfig.dynamicConfig.updateDefaultConfig(properties)
    else if (brokerConfig.brokerId == brokerId.trim.toInt) {
      // Update the broker's dynamic config
      brokerConfig.dynamicConfig.updateBrokerConfig(brokerConfig.brokerId, properties)
      quotaManagers.leader.updateQuota(upperBound(getOrDefault(LeaderReplicationThrottledRateProp)))
      quotaManagers.follower.updateQuota(upperBound(getOrDefault(FollowerReplicationThrottledRateProp)))
      quotaManagers.alterLogDirs.updateQuota(upperBound(getOrDefault(ReplicaAlterLogDirsIoMaxBytesPerSecondProp)))
    }
  }
    private[server] def updateBrokerConfig(brokerId: Int, persistentProps: Properties): Unit = CoreUtils.inWriteLock(lock) {
      try {
        val props = fromPersistentProps(persistentProps, perBrokerConfig = true)
        dynamicBrokerConfigs.clear()
        dynamicBrokerConfigs ++= props.asScala
        // Update the current KafkaConfig
        updateCurrentConfig()
      } catch {
        case e: Exception => error(s"Per-broker configs of $brokerId could not be applied: $persistentProps", e)
      }
    }
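These are the same rates you can set or inspect by hand with kafka-configs.sh, which goes through exactly this dynamic-config path (broker id, address and rate below are placeholders):

# Set the replication throttle rates on broker 1 to 10 MB/s; BrokerConfigHandler picks this up
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type brokers --entity-name 1 \
  --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760

# Inspect broker 1's current dynamic config
bin/kafka-configs.sh --zookeeper localhost:2181 --describe \
  --entity-type brokers --entity-name 1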

How the server handles the reassignment task

Some reassignment terminology

  • RS: the current replica set, i.e. reassignment.replicas

  • ORS: the original replica set, i.e. reassignment.originReplicas

  • TRS: the target replica set, i.e. reassignment.targetReplicas

  • AR: the replicas to be added, i.e. reassignment.addingReplicas

  • RR: the replicas to be removed, i.e. reassignment.removingReplicas (a small worked example follows this list)
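A small worked example of these sets, using the same numbers as the /brokers/topics/<topic> JSON shown further below for partition 1 (the ORS/TRS values are inferred for illustration; this is a sketch, not Kafka's ReplicaAssignment class):

val ors = Seq(2, 1)              // ORS: original replica set
val trs = Seq(2, 3)              // TRS: target replica set
val rs  = (trs ++ ors).distinct  // RS while reassigning: Seq(2, 3, 1)
val ar  = trs.diff(ors)          // AR, replicas to be added:   Seq(3)
val rr  = ors.diff(trs)          // RR, replicas to be removed: Seq(1)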

The server-side entry point is as follows:

//kafka.controller.KafkaController#process:processZkPartitionReassignment
// We need to register the watcher if the path doesn't exist in order to detect future
    // reassignments and we get the `path exists` check for free
    if (isActive && zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
      val reassignmentResults = mutable.Map.empty[TopicPartition, ApiError]
      val partitionsToReassign = mutable.Map.empty[TopicPartition, ReplicaAssignment]
      // Read the reassignment plan from ZooKeeper
      zkClient.getPartitionReassignment().foreach { case (tp, targetReplicas) =>
        maybeBuildReassignment(tp, Some(targetReplicas)) match {
          case Some(context) => partitionsToReassign.put(tp, context)
          case None => reassignmentResults.put(tp, new ApiError(Errors.NO_REASSIGNMENT_IN_PROGRESS))
        }
      }

      // Step into this method
      reassignmentResults ++= maybeTriggerPartitionReassignment(partitionsToReassign)
      val (partitionsReassigned, partitionsFailed) = reassignmentResults.partition(_._2.error == Errors.NONE)
      if (partitionsFailed.nonEmpty) {
        warn(s"Failed reassignment through zk with the following errors: $partitionsFailed")
        maybeRemoveFromZkReassignment((tp, _) => partitionsFailed.contains(tp))
      }
      partitionsReassigned.keySet
    } else {
      Set.empty
    }
    
     private def maybeTriggerPartitionReassignment(reassignments: Map[TopicPartition, ReplicaAssignment]): Map[TopicPartition, ApiError] = {
        reassignments.map { case (tp, reassignment) =>
          val topic = tp.topic
    
          val apiError = if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
            info(s"Skipping reassignment of $tp since the topic is currently being deleted")
            new ApiError(Errors.UNKNOWN_TOPIC_OR_PARTITION, "The partition does not exist.")
          } else {
            val assignedReplicas = controllerContext.partitionReplicaAssignment(tp)
            if (assignedReplicas.nonEmpty) {
              try {
                // For each topic partition, the real entry point of the reassignment
                onPartitionReassignment(tp, reassignment)
                ApiError.NONE
              } catch {
                case e: ControllerMovedException =>
                  info(s"Failed completing reassignment of partition $tp because controller has moved to another broker")
                  throw e
                case e: Throwable =>
                  error(s"Error completing reassignment of partition $tp", e)
                  new ApiError(Errors.UNKNOWN_SERVER_ERROR)
              }
            } else {
                new ApiError(Errors.UNKNOWN_TOPIC_OR_PARTITION, "The partition does not exist.")
            }
          }
    
          tp -> apiError
        }
      }

This reads like the main flow of the reassignment: it shows at a high level what a reassignment does, while the concrete work is encapsulated in each sub-method.

private def onPartitionReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
    // 4.1 Mark the topic in memory as being reassigned so it cannot be deleted
    topicDeletionManager.markTopicIneligibleForDeletion(Set(topicPartition.topic), reason = "topic reassignment in progress")
    // 4.2 Write the topic's reassignment replica data to ZooKeeper and to memory
    updateCurrentReassignment(topicPartition, reassignment)

    val addingReplicas = reassignment.addingReplicas
    val removingReplicas = reassignment.removingReplicas

    // 4.3 If the reassignment is not yet complete, take the if branch
    if (!isReassignmentComplete(topicPartition, reassignment)) {
      // 4.3.1 Send a LeaderAndIsrRequest to the brokers hosting every replica in RS (ORS + TRS)
      updateLeaderEpochAndSendRequest(topicPartition, reassignment)
      // 4.3.2 Move all replicas in AR to the NewReplica state
      startNewReplicasForReassignedPartition(topicPartition, addingReplicas)
    } else {
      // 4.4 The reassignment is complete
      // 4.4.1 The replica state machine moves the newly added replicas to the OnlineReplica state
      replicaStateMachine.handleStateChanges(addingReplicas.map(PartitionAndReplica(topicPartition, _)), OnlineReplica)
      // 4.4.2 Build completedReassignment, whose RS equals reassignment.targetReplicas, i.e. TRS
      val completedReassignment = ReplicaAssignment(reassignment.targetReplicas)
      // Update the partition's replicas in controllerContext.partitionAssignments to TRS
      controllerContext.updatePartitionFullReplicaAssignment(topicPartition, completedReassignment)
      // 4.4.3 If the current partition leader is not in TRS, elect a new leader
      moveReassignedPartitionLeaderIfRequired(topicPartition, completedReassignment)
      // 4.4.4 Take the replicas that need to be removed offline and delete them
      stopRemovedReplicasOfReassignedPartition(topicPartition, removingReplicas)
      // 4.4.5 Update the replica assignment in ZooKeeper
      updateReplicaAssignmentForPartition(topicPartition, completedReassignment)
      // 4.4.6 Remove this partition from the /admin/reassign_partitions znode
      removePartitionFromReassigningPartitions(topicPartition, completedReassignment)
      // 4.4.7 Send an UpdateMetadataRequest so brokers refresh their metadata
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
      // 4.5 Once done, lift the topic-deletion restriction
      topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
    }
  }

Mark the topic in memory as being reassigned, preventing it from being deleted.

Write the replica data that needs to be reassigned for the topic to ZooKeeper and to memory; the node touched is /brokers/topics/<topic>. The data structure written (below) explicitly marks which replicas are being added and which are being removed:

{
    "version": 2,
    "partitions": {
        "2": [
            3,
            2
        ],
        "1": [
            2,
            3,
            1
        ],
        "0": [
            3,
            2
        ]
    },
    "adding_replicas": {
        "1": [
            3
        ]
    },
    "removing_replicas": {
        "1": [
            1
        ]
    }
}

If the reassignment is not yet complete, the if branch is taken:

private def isReassignmentComplete(partition: TopicPartition, assignment: ReplicaAssignment): Boolean = {
    if (!assignment.isBeingReassigned) {
      true
    } else {
      // Read /brokers/topics/<topic>/partitions/<partition>/state from ZooKeeper and check whether targetReplicas is a subset of the ISR; if so, the reassignment is complete
      zkClient.getTopicPartitionStates(Seq(partition)).get(partition).exists { leaderIsrAndControllerEpoch =>
        val isr = leaderIsrAndControllerEpoch.leaderAndIsr.isr.toSet
        val targetReplicas = assignment.targetReplicas.toSet
        targetReplicas.subsetOf(isr)
      }
    }
  }
This step sends a LeaderAndIsrRequest to the broker of every replica in the assignment.
  • This LeaderAndIsrRequest is important: replica synchronization is kicked off by it.
private def updateLeaderEpochAndSendRequest(topicPartition: TopicPartition,
                                              assignment: ReplicaAssignment): Unit = {
    val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
    updateLeaderEpoch(topicPartition) match {
      case Some(updatedLeaderIsrAndControllerEpoch) =>
        try {
          brokerRequestBatch.newBatch()
          brokerRequestBatch.addLeaderAndIsrRequestForBrokers(assignment.replicas, topicPartition,
            updatedLeaderIsrAndControllerEpoch, assignment, isNew = false)
          brokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
        } catch {
          case e: IllegalStateException =>
            handleIllegalState(e)
        }
        stateChangeLog.trace(s"Sent LeaderAndIsr request $updatedLeaderIsrAndControllerEpoch with " +
          s"new replica assignment $assignment to leader ${updatedLeaderIsrAndControllerEpoch.leaderAndIsr.leader} " +
          s"for partition being reassigned $topicPartition")

      case None => // fail the reassignment
        stateChangeLog.error(s"Failed to send LeaderAndIsr request with new replica assignment " +
          s"$assignment to leader for partition being reassigned $topicPartition")
    }
  }
This step has the replica state machine put the new replicas into the NewReplica state:
 private def startNewReplicasForReassignedPartition(topicPartition: TopicPartition, newReplicas: Seq[Int]): Unit = {
    // send the start replica request to the brokers in the reassigned replicas list that are not in the assigned
    // replicas list
    newReplicas.foreach { replica =>
      replicaStateMachine.handleStateChanges(Seq(PartitionAndReplica(topicPartition, replica)), NewReplica)
    }
  }

Summary

  • This part covered how the server processes a reassignment task once it receives one. The controller logic here works like a CPU dispatching all the resources; the concrete implementations are encapsulated in sub-methods, which we will analyze one by one later.