The script for Kafka partition replica reassignment is kafka-reassign-partitions.sh.
Its parameters are:
No. | Parameter | Description |
---|---|---|
1 | bootstrap-server | List of Kafka broker addresses |
2 | command-config | Property file with configs for the admin client used by the tool |
3 | zookeeper | ZooKeeper cluster address |
4 | generate | Generate a partition replica reassignment plan |
5 | execute | Execute the reassignment according to the plan JSON |
6 | verify | Verify whether the reassignment has completed |
7 | reassignment-json-file | Path of the reassignment plan file |
8 | topics-to-move-json-file | JSON file listing the topics to move, used together with generate |
9 | broker-list | Target brokers, used together with generate |
10 | disable-rack-aware | Ignore rack awareness |
11 | throttle | Throttle the inter-broker replication rate (B/s) |
12 | replica-alter-log-dirs-throttle | Throttle replica movement across log directories (B/s) |
13 | timeout | Maximum time in ms to wait for the reassignment to start |
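As a usage illustration (not from the original text; the topic, file name, broker ids and addresses are made-up examples), a --generate run takes a topics-to-move JSON file, e.g. topics-to-move.json containing

{
  "version": 1,
  "topics": [
    { "topic": "test" }
  ]
}

and a broker list:

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3" \
  --generate

It prints the current assignment plus a proposed reassignment JSON, which can be saved to a file and later passed to --execute.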
Parameters that can be used when executing a reassignment:
No. | Parameter | Description |
---|---|---|
1 | bootstrap-server | List of Kafka broker addresses |
2 | command-config | Property file with configs for the admin client, e.g. --command-config config/producer.properties |
3 | zookeeper | ZooKeeper cluster address |
4 | execute | Execute the reassignment according to the plan JSON |
5 | reassignment-json-file | Path of the reassignment plan file |
6 | disable-rack-aware | Ignore rack awareness |
7 | throttle | Throttle the inter-broker replication rate (B/s) |
8 | replica-alter-log-dirs-throttle | Throttle replica movement across log directories (B/s) |
9 | timeout | Maximum time in ms to wait for the reassignment to start |
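Continuing the illustration (same made-up addresses and file name), executing a plan with a throttle and then verifying it looks like:

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json \
  --throttle 10485760 \
  --execute

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json \
  --verify

As the warning printed by --execute says, --verify has to be run after the reassignment finishes, since it is also what removes the throttle configuration again.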
Main flowchart of partition reassignment
The partition replica reassignment client
kafka.admin.ReassignPartitionsCommand
def main(args: Array[String]): Unit = {
val opts = validateAndParseArgs(args)
val zkConnect = opts.options.valueOf(opts.zkConnectOpt)
val time = Time.SYSTEM
val zkClient = KafkaZkClient(zkConnect, JaasUtils.isZkSaslEnabled, 30000, 30000, Int.MaxValue, time)
//Connect to the brokers (create the admin client)
val adminClientOpt = createAdminClient(opts)
try {
if(opts.options.has(opts.verifyOpt))
verifyAssignment(zkClient, adminClientOpt, opts)
else if(opts.options.has(opts.generateOpt))
generateAssignment(zkClient, opts)
else if (opts.options.has(opts.executeOpt))
executeAssignment(zkClient, adminClientOpt, opts)
} catch {
case e: Throwable =>
println("Partitions reassignment failed due to " + e.getMessage)
println(Utils.stackTrace(e))
} finally zkClient.close()
}
- Entering executeAssignment: it mainly validates the reassignment JSON file and then calls the overload below; the real work is done in reassignPartitionsCommand.reassignPartitions.
def executeAssignment(zkClient: KafkaZkClient, adminClientOpt: Option[Admin], reassignmentJsonString: String, throttle: Throttle, timeoutMs: Long = 10000L): Unit = {
//partitionAssignment: the partition-to-replicas assignment parsed from the JSON
val (partitionAssignment, replicaAssignment) = parseAndValidate(zkClient, reassignmentJsonString)
val adminZkClient = new AdminZkClient(zkClient)
val reassignPartitionsCommand = new ReassignPartitionsCommand(zkClient, adminClientOpt, partitionAssignment.toMap, replicaAssignment, adminZkClient)
// If there is an existing rebalance running, attempt to change its throttle
//2.1 If a reassignment is already in progress, only allow changing the throttle
if (zkClient.reassignPartitionsInProgress()) {
println("There is an existing assignment running.")
// Only adjust the throttle of the reassignment already in flight
reassignPartitionsCommand.maybeLimit(throttle)
} else {
printCurrentAssignment(zkClient, partitionAssignment.map(_._1.topic))
if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
println(String.format("Warning: You must run Verify periodically, until the reassignment completes, to ensure the throttle is removed. You can also alter the throttle by rerunning the Execute command passing a new value."))
//2.2 The method that actually starts the reassignment
if (reassignPartitionsCommand.reassignPartitions(throttle, timeoutMs)) {
println("Successfully started reassignment of partitions.")
} else
println("Failed to reassign partitions %s".format(partitionAssignment))
}
}
Entering reassignPartitionsCommand.maybeLimit (its body is not reproduced here).
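As a rough sketch of what maybeLimit does (paraphrased from the surrounding description, not the verbatim Kafka source), it collects every broker that holds a replica before or after the move and rewrites that broker's dynamic config in ZooKeeper with the throttled-rate keys:

// Sketch only: illustrates the behaviour described in this article, not the exact Kafka implementation
def maybeLimit(throttle: Throttle): Unit = {
if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0) {
//every broker that owns a replica before or after the reassignment
val brokers = (existingAssignment().values.flatten ++ proposedPartitionAssignment.values.flatten).toSeq.distinct
for (id <- brokers) {
val configs = adminZkClient.fetchEntityConfig(ConfigType.Broker, id.toString)
if (throttle.interBrokerLimit >= 0) {
configs.put(LeaderReplicationThrottledRateProp, throttle.interBrokerLimit.toString)
configs.put(FollowerReplicationThrottledRateProp, throttle.interBrokerLimit.toString)
}
if (throttle.replicaAlterLogDirsLimit >= 0)
configs.put(ReplicaAlterLogDirsIoMaxBytesPerSecondProp, throttle.replicaAlterLogDirsLimit.toString)
adminZkClient.changeBrokerConfig(Seq(id), configs)
}
}
}

This is what produces the /config/brokers/<brokerId> data shown later in this article.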
Entering reassignPartitionsCommand.reassignPartitions:
def reassignPartitions(throttle: Throttle = NoThrottle, timeoutMs: Long = 10000L): Boolean = {
//2.2.1 Write the throttle configuration into ZooKeeper
maybeThrottle(throttle)
try {
//Validate the proposed partition assignment
val validPartitions = proposedPartitionAssignment.groupBy(_._1.topic())
.flatMap { case (topic, topicPartitionReplicas) =>
validatePartition(zkClient, topic, topicPartitionReplicas)
}
if (validPartitions.isEmpty) false
else {
//proposedReplicaAssignment holds the log-directory (path) reassignment
if (proposedReplicaAssignment.nonEmpty && adminClientOpt.isEmpty)
throw new AdminCommandFailedException("bootstrap-server needs to be provided in order to reassign replica to the specified log directory")
val startTimeMs = System.currentTimeMillis()
// Send AlterReplicaLogDirsRequest to allow broker to create replica in the right log dir later if the replica has not been created yet.
if (proposedReplicaAssignment.nonEmpty)
alterReplicaLogDirsIgnoreReplicaNotAvailable(proposedReplicaAssignment, adminClientOpt.get, timeoutMs)
// Create reassignment znode so that controller will send LeaderAndIsrRequest to create replica in the broker
//2.2.2 Write the reassignment task into ZooKeeper
zkClient.createPartitionReassignment(validPartitions.map({case (key, value) => (new TopicPartition(key.topic, key.partition), value)}).toMap)
// Send AlterReplicaLogDirsRequest again to make sure broker will start to move replica to the specified log directory.
// It may take some time for controller to create replica in the broker. Retry if the replica has not been created.
var remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
val replicasAssignedToFutureDir = mutable.Set.empty[TopicPartitionReplica]
while (remainingTimeMs > 0 && replicasAssignedToFutureDir.size < proposedReplicaAssignment.size) {
replicasAssignedToFutureDir ++= alterReplicaLogDirsIgnoreReplicaNotAvailable(
proposedReplicaAssignment.filter { case (replica, _) => !replicasAssignedToFutureDir.contains(replica) },
adminClientOpt.get, remainingTimeMs)
Thread.sleep(100)
remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
}
replicasAssignedToFutureDir.size == proposedReplicaAssignment.size
}
} catch {
case _: NodeExistsException =>
val partitionsBeingReassigned = zkClient.getPartitionReassignment()
throw new AdminCommandFailedException("Partition reassignment currently in " +
"progress for %s. Aborting operation".format(partitionsBeingReassigned))
}
}
Entering maybeThrottle:
private def maybeThrottle(throttle: Throttle): Unit = {
if (throttle.interBrokerLimit >= 0)
//2.2.1.1 Write the per-topic leader/follower throttled-replica lists into ZooKeeper
assignThrottledReplicas(existingAssignment(), proposedPartitionAssignment, adminZkClient)
//2.2.1.2 Write the per-broker throttle rate config into ZooKeeper
maybeLimit(throttle)
if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
throttle.postUpdateAction()
if (throttle.interBrokerLimit >= 0)
println(s"The inter-broker throttle limit was set to ${throttle.interBrokerLimit} B/s")
if (throttle.replicaAlterLogDirsLimit >= 0)
println(s"The replica-alter-dir throttle limit was set to ${throttle.replicaAlterLogDirsLimit} B/s")
}
Entering assignThrottledReplicas: this method writes the throttled leader and follower replica lists into ZooKeeper under /config/topics/<topic>, i.e. as dynamic topic configuration.
private[admin] def assignThrottledReplicas(existingPartitionAssignment: Map[TopicPartition, Seq[Int]],
proposedPartitionAssignment: Map[TopicPartition, Seq[Int]],
adminZkClient: AdminZkClient): Unit = {
for (topic <- proposedPartitionAssignment.keySet.map(_.topic).toSeq.distinct) {
val existingPartitionAssignmentForTopic = existingPartitionAssignment.filter { case (tp, _) => tp.topic == topic }
val proposedPartitionAssignmentForTopic = proposedPartitionAssignment.filter { case (tp, _) => tp.topic == topic }
//Apply leader throttle to all replicas that exist before the re-balance.
val leader = format(preRebalanceReplicaForMovingPartitions(existingPartitionAssignmentForTopic, proposedPartitionAssignmentForTopic))
//Apply follower throttle to all "move destinations".
val follower = format(postRebalanceReplicasThatMoved(existingPartitionAssignmentForTopic, proposedPartitionAssignmentForTopic))
val configs = adminZkClient.fetchEntityConfig(ConfigType.Topic, topic)
configs.put(LeaderReplicationThrottledReplicasProp, leader)
configs.put(FollowerReplicationThrottledReplicasProp, follower)
adminZkClient.changeTopicConfig(topic, configs)
info(s"Updated leader-throttled replicas for topic $topic with: $leader")
info(s"Updated follower-throttled replicas for topic $topic with: $follower")
}
}
- Note the format of what gets written: in each "partition:brokerId" pair, the left side is the partition and the right side is the broker hosting that replica.
- In the example below, leader.replication.throttled.replicas means the replicas of partition 0 on brokers 3 and 1, and of partition 2 on brokers 2 and 3, are leader-throttled.
- follower.replication.throttled.replicas applies on the follower side; since a reassignment migrates partition data, the entries here are the newly added replicas (the move destinations).
{
"version" : 1,
"config" : {
"follower.replication.throttled.replicas" : "0:2,2:1",
"leader.replication.throttled.replicas" : "2:2,2:3,0:3,0:1"
}
}
maybeLimit then writes the broker-level throttle config into ZooKeeper; the data, stored under /config/brokers/<brokerId>, looks like this:
{
"version" : 1,
"config" : {
"leader.replication.throttled.rate" : "1",
"follower.replication.throttled.rate" : "1"
}
}
Finally the reassignment task itself is written into ZooKeeper under /admin/reassign_partitions, with one entry per partition being changed:
{
"version": 1,
"partitions": [
{
"topic": "test",
"partition": 1,
"replicas": [
2,
1
]
}
]
}
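For comparison, the file passed to --execute via --reassignment-json-file uses the same partitions layout, optionally extended with a log_dirs array when replicas should also be placed into specific log directories (a made-up example; "any" means the broker chooses the directory):

{
  "version": 1,
  "partitions": [
    {
      "topic": "test",
      "partition": 1,
      "replicas": [2, 1],
      "log_dirs": ["any", "/data/kafka-logs-1"]
    }
  ]
}

Entries other than "any" in log_dirs are what end up in proposedReplicaAssignment and trigger the AlterReplicaLogDirsRequest path seen in reassignPartitions above.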
Summary
- This part covered how the throttle configuration and the reassignment task are written into ZooKeeper. The reassignment tool can be seen as a built-in client connected to the server side: it issues instructions through ZooKeeper, while the actual work is carried out by the server (controller and brokers).
- How the controller communicates with the brokers: https://www.szzdzhp.com/kafka/Source_code/controller-2-broker.html
- How the state machines work: https://www.szzdzhp.com/kafka/Source_code/controller-state-machine.html
Source code analysis of ZooKeeper config changes
The entry point is kafka/server/DynamicConfigManager.scala: it starts a listener that watches config-change notifications in ZooKeeper and dispatches each change to the corresponding handler; a sketch of how the handlers are wired is shown below.
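For context, these handlers are registered when the broker starts. A simplified sketch of that wiring (constructor arguments abridged, cited from memory rather than copied from the source) might look roughly like:

// Sketch: each config entity type maps to a ConfigHandler; DynamicConfigManager watches the
// ZooKeeper change-notification path and calls processConfigChanges on the matching handler.
val dynamicConfigHandlers = Map[String, ConfigHandler](
ConfigType.Topic -> new TopicConfigHandler(logManager, config, quotaManagers, kafkaController),
ConfigType.Broker -> new BrokerConfigHandler(config, quotaManagers))
val dynamicConfigManager = new DynamicConfigManager(zkClient, dynamicConfigHandlers)
dynamicConfigManager.startup()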
Topic configs
- The handler for topics is TopicConfigHandler; it ends up in processConfigChanges, whose key code is:
def processConfigChanges(topic: String, topicConfig: Properties): Unit = {
// Validate the configurations.
val configNamesToExclude = excludedConfigs(topic, topicConfig)
updateLogConfig(topic, topicConfig, configNamesToExclude)
def updateThrottledList(prop: String, quotaManager: ReplicationQuotaManager) = {
if (topicConfig.containsKey(prop) && topicConfig.getProperty(prop).length > 0) {
val partitions = parseThrottledPartitions(topicConfig, kafkaConfig.brokerId, prop)
quotaManager.markThrottled(topic, partitions)
info(s"Setting $prop on broker ${kafkaConfig.brokerId} for topic: $topic and partitions $partitions")
} else {
quotaManager.removeThrottle(topic)
info(s"Removing $prop from broker ${kafkaConfig.brokerId} for topic $topic")
}
}
updateThrottledList(LogConfig.LeaderReplicationThrottledReplicasProp, quotas.leader)
updateThrottledList(LogConfig.FollowerReplicationThrottledReplicasProp, quotas.follower)
if (Try(topicConfig.getProperty(KafkaConfig.UncleanLeaderElectionEnableProp).toBoolean).getOrElse(false)) {
kafkaController.enableTopicUncleanLeaderElection(topic)
}
}
def markThrottled(topic: String, partitions: Seq[Int]): Unit = {
//The config finally ends up in memory; throttledPartitions is a ConcurrentHashMap
throttledPartitions.put(topic, partitions)
}
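To make the "partition:brokerId" format concrete, parseThrottledPartitions (called above but not shown) essentially keeps only the pairs whose broker id matches the current broker. A standalone sketch of that filtering (throttledPartitionsForBroker is a hypothetical helper name, and the real code additionally handles the "*" wildcard meaning all replicas):

// Sketch: reduce a throttled-replicas string such as "0:1,0:3,2:2" to the partitions
// that are throttled on the given broker.
def throttledPartitionsForBroker(configValue: String, brokerId: Int): Seq[Int] =
configValue.split(",")
.map(_.split(":")) // "0:1" -> Array("0", "1")
.filter(_(1).toInt == brokerId) // keep pairs located on this broker
.map(_(0).toInt) // keep only the partition id
.toSeq

// e.g. throttledPartitionsForBroker("0:1,0:3,2:2", brokerId = 1) returns Seq(0)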
Broker configs
- The handler for brokers is BrokerConfigHandler, which likewise handles changes in processConfigChanges:
def processConfigChanges(brokerId: String, properties: Properties): Unit = {
def getOrDefault(prop: String): Long = {
if (properties.containsKey(prop))
properties.getProperty(prop).toLong
else
DefaultReplicationThrottledRate
}
if (brokerId == ConfigEntityName.Default)
brokerConfig.dynamicConfig.updateDefaultConfig(properties)
else if (brokerConfig.brokerId == brokerId.trim.toInt) {
//Update the broker's dynamic config
brokerConfig.dynamicConfig.updateBrokerConfig(brokerConfig.brokerId, properties)
quotaManagers.leader.updateQuota(upperBound(getOrDefault(LeaderReplicationThrottledRateProp)))
quotaManagers.follower.updateQuota(upperBound(getOrDefault(FollowerReplicationThrottledRateProp)))
quotaManagers.alterLogDirs.updateQuota(upperBound(getOrDefault(ReplicaAlterLogDirsIoMaxBytesPerSecondProp)))
}
}
private[server] def updateBrokerConfig(brokerId: Int, persistentProps: Properties): Unit = CoreUtils.inWriteLock(lock) {
try {
val props = fromPersistentProps(persistentProps, perBrokerConfig = true)
dynamicBrokerConfigs.clear()
dynamicBrokerConfigs ++= props.asScala
//Apply the merged configuration to the current KafkaConfig
updateCurrentConfig()
} catch {
case e: Exception => error(s"Per-broker configs of $brokerId could not be applied: $persistentProps", e)
}
}
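For reference, these are the same per-broker throttle keys that an operator can also set by hand with kafka-configs.sh, which on ZooKeeper-based clusters ends up in the /config/brokers/<brokerId> data shown earlier (broker id, address and rate below are illustrative):

bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 1 \
  --alter --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760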
How the server side handles a reassignment task
Reassignment terminology
- RS: the current replica set, reassignment.replicas
- ORS: the original replica set, reassignment.originReplicas
- TRS: the target replica set, reassignment.targetReplicas
- AR: the replicas to be added, reassignment.addingReplicas
- RR: the replicas to be removed, reassignment.removingReplicas

For example, moving a partition from ORS = [1, 2] to TRS = [2, 3] gives RS = [2, 3, 1], AR = [3] and RR = [1] while the move is in flight; this matches the ZooKeeper data shown further below.
The server-side entry point:
//kafka.controller.KafkaController#process:processZkPartitionReassignment
// We need to register the watcher if the path doesn't exist in order to detect future
// reassignments and we get the `path exists` check for free
if (isActive && zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
val reassignmentResults = mutable.Map.empty[TopicPartition, ApiError]
val partitionsToReassign = mutable.Map.empty[TopicPartition, ReplicaAssignment]
//Read the reassignment plan from ZooKeeper
zkClient.getPartitionReassignment().foreach { case (tp, targetReplicas) =>
maybeBuildReassignment(tp, Some(targetReplicas)) match {
case Some(context) => partitionsToReassign.put(tp, context)
case None => reassignmentResults.put(tp, new ApiError(Errors.NO_REASSIGNMENT_IN_PROGRESS))
}
}
//Then enter this method
reassignmentResults ++= maybeTriggerPartitionReassignment(partitionsToReassign)
val (partitionsReassigned, partitionsFailed) = reassignmentResults.partition(_._2.error == Errors.NONE)
if (partitionsFailed.nonEmpty) {
warn(s"Failed reassignment through zk with the following errors: $partitionsFailed")
maybeRemoveFromZkReassignment((tp, _) => partitionsFailed.contains(tp))
}
partitionsReassigned.keySet
} else {
Set.empty
}
private def maybeTriggerPartitionReassignment(reassignments: Map[TopicPartition, ReplicaAssignment]): Map[TopicPartition, ApiError] = {
reassignments.map { case (tp, reassignment) =>
val topic = tp.topic
val apiError = if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
info(s"Skipping reassignment of $tp since the topic is currently being deleted")
new ApiError(Errors.UNKNOWN_TOPIC_OR_PARTITION, "The partition does not exist.")
} else {
val assignedReplicas = controllerContext.partitionReplicaAssignment(tp)
if (assignedReplicas.nonEmpty) {
try {
//For each topic partition, the real entry point of the reassignment
onPartitionReassignment(tp, reassignment)
ApiError.NONE
} catch {
case e: ControllerMovedException =>
info(s"Failed completing reassignment of partition $tp because controller has moved to another broker")
throw e
case e: Throwable =>
error(s"Error completing reassignment of partition $tp", e)
new ApiError(Errors.UNKNOWN_SERVER_ERROR)
}
} else {
new ApiError(Errors.UNKNOWN_TOPIC_OR_PARTITION, "The partition does not exist.")
}
}
tp -> apiError
}
}
The code below is essentially the main flow of a reassignment on the controller: it shows what the reassignment does at a high level, while the concrete work is encapsulated in the individual helper methods.
private def onPartitionReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
//4.1 Mark the topic as being reassigned in memory so that it cannot be deleted
topicDeletionManager.markTopicIneligibleForDeletion(Set(topicPartition.topic), reason = "topic reassignment in progress")
//4.2 Record the replicas to be reassigned for this partition in ZooKeeper and in memory
updateCurrentReassignment(topicPartition, reassignment)
val addingReplicas = reassignment.addingReplicas
val removingReplicas = reassignment.removingReplicas
//4.3 If the reassignment has not completed yet, take the if branch
if (!isReassignmentComplete(topicPartition, reassignment)) {
//4.3.1 Send a LeaderAndIsr request to every broker holding a replica in RS (ORS + TRS)
updateLeaderEpochAndSendRequest(topicPartition, reassignment)
// 4.3.2 Move all replicas in AR to the NewReplica state
startNewReplicasForReassignedPartition(topicPartition, addingReplicas)
} else {
//4.4 The reassignment has completed
// 4.4.1 The replica state machine moves the newly added replicas to OnlineReplica
replicaStateMachine.handleStateChanges(addingReplicas.map(PartitionAndReplica(topicPartition, _)), OnlineReplica)
// 4.4.2 Build completedReassignment, whose RS equals reassignment.targetReplicas, i.e. TRS
val completedReassignment = ReplicaAssignment(reassignment.targetReplicas)
//Update the partition's replicas in controllerContext.partitionAssignments to TRS
controllerContext.updatePartitionFullReplicaAssignment(topicPartition, completedReassignment)
//4.4.3 If the current leader is not in TRS, elect a new one
moveReassignedPartitionLeaderIfRequired(topicPartition, completedReassignment)
//4.4.4 Take the replicas in RR offline and delete them
stopRemovedReplicasOfReassignedPartition(topicPartition, removingReplicas)
//4.4.5 Update the replica assignment in ZooKeeper
updateReplicaAssignmentForPartition(topicPartition, completedReassignment)
//4.4.6 Remove this partition from the /admin/reassign_partitions znode
removePartitionFromReassigningPartitions(topicPartition, completedReassignment)
//4.4.7 Send an UpdateMetadataRequest so that all brokers refresh their metadata
sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
//4.5 Once complete, lift the topic-deletion restriction
topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
}
}
4.1 Mark the topic as being reassigned in memory so it cannot be deleted.
4.2 Write the replicas to be reassigned into ZooKeeper and memory; the znode touched is /brokers/topics/<topic>. The data written there explicitly marks which replicas are being added and which are being removed:
{
"version": 2,
"partitions": {
"2": [
3,
2
],
"1": [
2,
3,
1
],
"0": [
3,
2
]
},
"adding_replicas": {
"1": [
3
]
},
"removing_replicas": {
"1": [
1
]
}
}
4.3 Deciding whether the reassignment has completed:
private def isReassignmentComplete(partition: TopicPartition, assignment: ReplicaAssignment): Boolean = {
if (!assignment.isBeingReassigned) {
true
} else {
//Read /brokers/topics/<topic>/partitions/<partition>/state from ZooKeeper and check whether targetReplicas is a subset of the ISR; if so, the reassignment has completed
zkClient.getTopicPartitionStates(Seq(partition)).get(partition).exists { leaderIsrAndControllerEpoch =>
val isr = leaderIsrAndControllerEpoch.leaderAndIsr.isr.toSet
val targetReplicas = assignment.targetReplicas.toSet
targetReplicas.subsetOf(isr)
}
}
}
4.3.1 This step sends a LeaderAndIsrRequest to every broker that hosts a replica of the partition.
- This LeaderAndIsrRequest is important: it is what triggers replica synchronization, i.e. the new replicas starting to fetch from the leader.
private def updateLeaderEpochAndSendRequest(topicPartition: TopicPartition,
assignment: ReplicaAssignment): Unit = {
val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
updateLeaderEpoch(topicPartition) match {
case Some(updatedLeaderIsrAndControllerEpoch) =>
try {
brokerRequestBatch.newBatch()
brokerRequestBatch.addLeaderAndIsrRequestForBrokers(assignment.replicas, topicPartition,
updatedLeaderIsrAndControllerEpoch, assignment, isNew = false)
brokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
} catch {
case e: IllegalStateException =>
handleIllegalState(e)
}
stateChangeLog.trace(s"Sent LeaderAndIsr request $updatedLeaderIsrAndControllerEpoch with " +
s"new replica assignment $assignment to leader ${updatedLeaderIsrAndControllerEpoch.leaderAndIsr.leader} " +
s"for partition being reassigned $topicPartition")
case None => // fail the reassignment
stateChangeLog.error(s"Failed to send LeaderAndIsr request with new replica assignment " +
s"$assignment to leader for partition being reassigned $topicPartition")
}
}
4.3.2 This step uses the replica state machine to put the newly added replicas into the NewReplica state.
private def startNewReplicasForReassignedPartition(topicPartition: TopicPartition, newReplicas: Seq[Int]): Unit = {
// send the start replica request to the brokers in the reassigned replicas list that are not in the assigned
// replicas list
newReplicas.foreach { replica =>
replicaStateMachine.handleStateChanges(Seq(PartitionAndReplica(topicPartition, replica)), NewReplica)
}
}
Summary
- This part covered how the server side processes a reassignment task once it arrives. The controller logic here is like a CPU orchestrating all the resources: the concrete work is encapsulated in the helper methods, which will be analyzed one by one later.