Kafka Source Code Analysis, Part 9: ReplicaManager

Original article: https://blog.csdn.net/wl044090432/article/details/51035614

First, two terms:

AR (assigned replicas): all replicas assigned to a partition. ISR (in-sync replicas): the replicas that are currently in sync with the leader, as illustrated by the following structures:

 

 
Partition {
  topic : string                  // topic name
  partition_id : int              // partition id
  leader : Replica                // the leader replica of this partition; it is one of the replicas in the ISR
  ISR : Set[Replica]              // the set of replicas currently in sync
  AR : Set[Replica]               // all replicas assigned to this partition; a broker holds at most one replica of a given partition
  LeaderAndISRVersionInZK : long  // version id of the LeaderAndISR path; used for conditionally update the LeaderAndISR path in ZK
}

Replica {                         // information about one replica of a partition
  broker_id : int
  partition : Partition           // the partition this replica belongs to
  log : Log                       // the local log backing this replica
  hw : long                       // offset of the last committed message (high watermark)
  leo : long                      // log end offset
  isLeader : Boolean              // whether this replica is the leader of the partition
}
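The difference between hw and leo is easiest to see with a small example. The leader advances the high watermark to the smallest log end offset among the ISR members; the sketch below uses simplified stand-in types (not the real kafka.cluster classes) to show that rule:

// Simplified stand-ins for illustration only.
case class ReplicaSketch(brokerId: Int, leo: Long)

case class PartitionSketch(topic: String, partitionId: Int, isr: Set[ReplicaSketch]) {
  // The HW is the offset up to which every ISR member has replicated,
  // i.e. the minimum log end offset across the ISR.
  def highWatermark: Long = if (isr.isEmpty) 0L else isr.map(_.leo).min
}

object HwDemo extends App {
  val p = PartitionSketch("test", 0,
    Set(ReplicaSketch(1, 120L), ReplicaSketch(2, 115L), ReplicaSketch(3, 118L)))
  println(p.highWatermark) // prints 115: only messages below this offset count as committed
}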

 

 

Now let's look at what ReplicaManager actually does. Its role is to receive commands from the controller and carry out replica management. The controller sends two kinds of commands, LeaderAndISRCommand and StopReplicaCommand, so ReplicaManager's work breaks down into three parts:

1) handling LeaderAndISRCommand;
2) handling StopReplicaCommand;
3) running the periodic maybeShrinkIsr task, which detects replicas that have fallen out of sync.
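These three responsibilities can be pictured with a small self-contained sketch; every name here (ControllerCommand, ReplicaManagerSketch, and so on) is a simplified stand-in for illustration, not the real kafka.server code:

import java.util.concurrent.{Executors, TimeUnit}

// Simplified stand-ins for the two controller commands.
sealed trait ControllerCommand
case class LeaderAndIsrCommand(partitions: Set[(String, Int)]) extends ControllerCommand
case class StopReplicaCommand(partitions: Set[(String, Int)], deletePartitions: Boolean) extends ControllerCommand

class ReplicaManagerSketch(isrCheckIntervalMs: Long) {
  // 1) and 2): commands pushed by the controller are handled as they arrive.
  def handle(cmd: ControllerCommand): Unit = cmd match {
    case LeaderAndIsrCommand(ps)        => println(s"become leader or follower for $ps")
    case StopReplicaCommand(ps, delete) => println(s"stop replicas $ps (delete=$delete)")
  }

  // 3): a periodic task that looks for replicas that have fallen out of sync.
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  def startup(): Unit =
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = println("maybeShrinkIsr tick")
    }, isrCheckIntervalMs, isrCheckIntervalMs, TimeUnit.MILLISECONDS)
}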

9.1 LeaderAndISRCommand processing flow

 

When the KafkaServer receives a LeaderAndIsrRequest, it calls ReplicaManager's becomeLeaderOrFollower function:
 
def becomeLeaderOrFollower(leaderAndISRRequest: LeaderAndIsrRequest,
                           offsetManager: OffsetManager): (collection.Map[(String, Int), Short], Short) = {
  leaderAndISRRequest.partitionStateInfos.foreach { case ((topic, partition), stateInfo) =>
    stateChangeLogger.trace("Broker %d received LeaderAndIsr request %s correlation id %d from controller %d epoch %d for partition [%s,%d]"
      .format(localBrokerId, stateInfo, leaderAndISRRequest.correlationId,
              leaderAndISRRequest.controllerId, leaderAndISRRequest.controllerEpoch, topic, partition))
  }
  replicaStateChangeLock synchronized {
    val responseMap = new collection.mutable.HashMap[(String, Int), Short]
    if(leaderAndISRRequest.controllerEpoch < controllerEpoch) { // check the request's controller epoch
      leaderAndISRRequest.partitionStateInfos.foreach { case ((topic, partition), stateInfo) =>
        stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d since " +
          "its controller epoch %d is old. Latest known controller epoch is %d").format(localBrokerId, leaderAndISRRequest.controllerId,
          leaderAndISRRequest.correlationId, leaderAndISRRequest.controllerEpoch, controllerEpoch))
      }
      (responseMap, ErrorMapping.StaleControllerEpochCode)
    } else {
      val controllerId = leaderAndISRRequest.controllerId
      val correlationId = leaderAndISRRequest.correlationId
      controllerEpoch = leaderAndISRRequest.controllerEpoch

      // First check partition's leader epoch
      // The request's controller epoch was checked above; the leader epoch inside each partitionStateInfo must be checked as well
      val partitionState = new HashMap[Partition, PartitionStateInfo]()
      leaderAndISRRequest.partitionStateInfos.foreach{ case ((topic, partitionId), partitionStateInfo) =>
        val partition = getOrCreatePartition(topic, partitionId)
        val partitionLeaderEpoch = partition.getLeaderEpoch()
        // If the leader epoch is valid record the epoch of the controller that made the leadership decision.
        // This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
        // The local partitionLeaderEpoch must be smaller than the leaderEpoch in the request; otherwise the request is stale
        if (partitionLeaderEpoch < partitionStateInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leaderEpoch) {
          // Check whether this partition is assigned to the current broker
          if(partitionStateInfo.allReplicas.contains(config.brokerId))
            // Only partitions assigned to this broker go into partitionState: partition is the current local state,
            // partitionStateInfo is the latest state carried by the request
            partitionState.put(partition, partitionStateInfo)
          else {
            stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
              "epoch %d for partition [%s,%d] as itself is not in assigned replica list %s")
              .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                      topic, partition.partitionId, partitionStateInfo.allReplicas.mkString(",")))
          }
        } else {
          // Otherwise record the error code in response
          stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
            "epoch %d for partition [%s,%d] since its associated leader epoch %d is old. Current leader epoch is %d")
            .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                    topic, partition.partitionId, partitionStateInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leaderEpoch, partitionLeaderEpoch))
          responseMap.put((topic, partitionId), ErrorMapping.StaleLeaderEpochCode)
        }
      }
      // Core logic: decide whether this broker becomes leader or follower for each partition and call makeLeaders / makeFollowers.
      // In (partition, partitionStateInfo), partition holds the replicaManager's current state while
      // partitionStateInfo holds the new assignment carried by the request.
      // Select the partitions for which this broker is designated leader
      val partitionsTobeLeader = partitionState
        .filter{ case (partition, partitionStateInfo) => partitionStateInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leader == config.brokerId}
      // Select the partitions for which this broker becomes a follower
      val partitionsToBeFollower = (partitionState -- partitionsTobeLeader.keys)

      // If this broker is the leader for some partitions, run the leader flow
      if (!partitionsTobeLeader.isEmpty)
        makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, leaderAndISRRequest.correlationId, responseMap, offsetManager)
      // If this broker is a follower for some partitions, run the follower flow
      if (!partitionsToBeFollower.isEmpty)
        makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, leaderAndISRRequest.leaders, leaderAndISRRequest.correlationId, responseMap, offsetManager)

      // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
      // have been completely populated before starting the checkpointing there by avoiding weird race conditions
      if (!hwThreadInitialized) {
        // Start the HighWaterMarksCheckPointThread. The HW matters: it must be flushed to disk periodically
        // so that it can be reloaded after a failover
        startHighWaterMarksCheckPointThread()
        hwThreadInitialized = true
      }
      // Shut down idle fetchers; once this broker becomes leader for a partition it no longer needs to fetch it
      replicaFetcherManager.shutdownIdleFetcherThreads()
      (responseMap, ErrorMapping.NoError)
    }
  }
}

 

In short, the function picks out the partition replicas assigned to this broker, splits them into leaders and followers depending on whether the leader in the request is this broker's id, and then runs the corresponding flow for each group.
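A minimal sketch of that split, using plain maps instead of the real Partition/PartitionStateInfo types: the request carries the designated leader for every partition assigned to this broker, and the broker becomes leader exactly where that id matches its own.

val localBrokerId = 1
// (topic, partition) -> leader broker id, as carried by the LeaderAndIsrRequest.
val requestedLeaders: Map[(String, Int), Int] = Map(
  ("orders", 0) -> 1, // this broker leads orders-0
  ("orders", 1) -> 2, // broker 2 leads orders-1, so this broker follows
  ("clicks", 0) -> 1)

val (toBeLeader, toBeFollower) =
  requestedLeaders.partition { case (_, leaderId) => leaderId == localBrokerId }
// toBeLeader   == Map((orders,0) -> 1, (clicks,0) -> 1)  -> makeLeaders
// toBeFollower == Map((orders,1) -> 2)                   -> makeFollowers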

Entering makeLeaders:

 
private def makeLeaders(controllerId: Int, epoch: Int,
                        partitionState: Map[Partition, PartitionStateInfo],
                        correlationId: Int, responseMap: mutable.Map[(String, Int), Short],
                        offsetManager: OffsetManager) = {
  partitionState.foreach(state =>
    stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "starting the become-leader transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId))))
  for (partition <- partitionState.keys)
    responseMap.put((partition.topic, partition.partitionId), ErrorMapping.NoError)
  try {
    // First stop fetchers for all the partitions
    // A leader does not fetch from anyone, so the fetcher threads for these partitions are removed
    replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(new TopicAndPartition(_)))
    partitionState.foreach { state =>
      stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-leader request from controller " +
        "%d epoch %d with correlation id %d for partition %s")
        .format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(state._1.topic, state._1.partitionId)))
    }
    // Update the partition information to be the leader
    // Update the Partition's internal state
    partitionState.foreach{ case (partition, partitionStateInfo) =>
      partition.makeLeader(controllerId, partitionStateInfo, correlationId, offsetManager)}
  } catch {
    case e: Throwable =>
      partitionState.foreach { state =>
        val errorMsg = ("Error on broker %d while processing LeaderAndIsr request correlationId %d received from controller %d" +
          " epoch %d for partition %s").format(localBrokerId, correlationId, controllerId, epoch,
          TopicAndPartition(state._1.topic, state._1.partitionId))
        stateChangeLogger.error(errorMsg, e)
      }
      // Re-throw the exception for it to be caught in KafkaApis
      throw e
  }
  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "for the become-leader transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }
}

Entering makeFollowers:
 
private def makeFollowers(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionStateInfo],
                          leaders: Set[Broker], correlationId: Int, responseMap: mutable.Map[(String, Int), Short],
                          offsetManager: OffsetManager) {
  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "starting the become-follower transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }
  for (partition <- partitionState.keys)
    responseMap.put((partition.topic, partition.partitionId), ErrorMapping.NoError)
  try {
    var partitionsToMakeFollower: Set[Partition] = Set()
    // TODO: Delete leaders from LeaderAndIsrRequest in 0.8.1
    partitionState.foreach{ case (partition, partitionStateInfo) =>
      val leaderIsrAndControllerEpoch = partitionStateInfo.leaderIsrAndControllerEpoch
      val newLeaderBrokerId = leaderIsrAndControllerEpoch.leaderAndIsr.leader
      leaders.find(_.id == newLeaderBrokerId) match { // only change partitions whose new leader is an available broker
        // Only change partition state when the leader is available
        case Some(leaderBroker) =>
          // makeFollower returns true only when the partition's leader actually changed;
          // if the leader is unchanged there is nothing to do
          if (partition.makeFollower(controllerId, partitionStateInfo, correlationId, offsetManager))
            partitionsToMakeFollower += partition
          else
            stateChangeLogger.info(("Broker %d skipped the become-follower state change after marking its partition as follower with correlation id %d from " +
              "controller %d epoch %d for partition [%s,%d] since the new leader %d is the same as the old leader")
              .format(localBrokerId, correlationId, controllerId, leaderIsrAndControllerEpoch.controllerEpoch,
                      partition.topic, partition.partitionId, newLeaderBrokerId))
        case None =>
          // The leader broker should always be present in the leaderAndIsrRequest.
          // If not, we should record the error message and abort the transition process for this partition
          stateChangeLogger.error(("Broker %d received LeaderAndIsrRequest with correlation id %d from controller" +
            " %d epoch %d for partition [%s,%d] but cannot become follower since the new leader %d is unavailable.")
            .format(localBrokerId, correlationId, controllerId, leaderIsrAndControllerEpoch.controllerEpoch,
                    partition.topic, partition.partitionId, newLeaderBrokerId))
          // Create the local replica even if the leader is unavailable. This is required to ensure that we include
          // the partition's high watermark in the checkpoint file (see KAFKA-1647)
          partition.getOrCreateReplica()
      }
    }
    // The leader has changed, so the old fetcher must be removed: it still points at the old leader and fetches from it
    replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(new TopicAndPartition(_)))
    partitionsToMakeFollower.foreach { partition =>
      stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-follower request from controller " +
        "%d epoch %d with correlation id %d for partition %s")
        .format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
    }
    // Because the leader changed, data synced from the old leader above the HW may be inconsistent with the new leader.
    // Data below the HW is guaranteed consistent on all replicas, so everything above the HW is truncated to avoid divergence.
    logManager.truncateTo(partitionsToMakeFollower.map(partition => (new TopicAndPartition(partition), partition.getOrCreateReplica().highWatermark.messageOffset)).toMap)
    partitionsToMakeFollower.foreach { partition =>
      stateChangeLogger.trace(("Broker %d truncated logs and checkpointed recovery boundaries for partition [%s,%d] as part of " +
        "become-follower request with correlation id %d from controller %d epoch %d").format(localBrokerId,
        partition.topic, partition.partitionId, correlationId, controllerId, epoch))
    }
    if (isShuttingDown.get()) { // if the broker is shutting down, do not add new fetchers
      partitionsToMakeFollower.foreach { partition =>
        stateChangeLogger.trace(("Broker %d skipped the adding-fetcher step of the become-follower state change with correlation id %d from " +
          "controller %d epoch %d for partition [%s,%d] since it is shutting down").format(localBrokerId, correlationId,
          controllerId, epoch, partition.topic, partition.partitionId))
      }
    }
    else {
      // we do not need to check if the leader exists again since this has been done at the beginning of this process
      val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
        new TopicAndPartition(partition) -> BrokerAndInitialOffset(
          leaders.find(_.id == partition.leaderReplicaIdOpt.get).get,
          partition.getReplica().get.logEndOffset.messageOffset)).toMap
      // Add new fetchers that point at the new leader
      replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
      partitionsToMakeFollower.foreach { partition =>
        stateChangeLogger.trace(("Broker %d started fetcher to new leader as part of become-follower request from controller " +
          "%d epoch %d with correlation id %d for partition [%s,%d]")
          .format(localBrokerId, controllerId, epoch, correlationId, partition.topic, partition.partitionId))
      }
    }
  } catch {
    case e: Throwable =>
      val errorMsg = ("Error on broker %d while processing LeaderAndIsr request with correlationId %d received from controller %d " +
        "epoch %d").format(localBrokerId, correlationId, controllerId, epoch)
      stateChangeLogger.error(errorMsg, e)
      // Re-throw the exception for it to be caught in KafkaApis
      throw e
  }

  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "for the become-follower transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }
}
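The truncate-to-HW step deserves a closer look, since it is what keeps a follower's log from diverging after a leader change. Below is a minimal sketch with a toy in-memory log (not the real kafka.log.Log API), where offsets are simply indices into the message vector:

class FollowerLogSketch(private var messages: Vector[String], val highWatermark: Long) {
  def logEndOffset: Long = messages.length.toLong
  // Everything above the HW may differ from the new leader, so it is dropped before re-fetching.
  def truncateToHighWatermark(): Unit =
    messages = messages.take(highWatermark.toInt)
  def contents: Vector[String] = messages
}

object TruncateDemo extends App {
  // LEO = 5, HW = 3: offsets 3 and 4 were replicated from the old leader but never committed.
  val log = new FollowerLogSketch(Vector("m0", "m1", "m2", "m3", "m4"), highWatermark = 3L)
  log.truncateToHighWatermark()
  println(log.contents)     // Vector(m0, m1, m2) -- offsets 3 and 4 will be re-fetched from the new leader
  println(log.logEndOffset) // 3
}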

 

9.2 StopReplicaCommand processing flow

 

When a broker is stopped or a user deletes a replica, the KafkaServer receives a StopReplicaRequest and calls ReplicaManager's stopReplicas function:
 
def stopReplicas(stopReplicaRequest: StopReplicaRequest): (mutable.Map[TopicAndPartition, Short], Short) = {
  replicaStateChangeLock synchronized {
    val responseMap = new collection.mutable.HashMap[TopicAndPartition, Short]
    if(stopReplicaRequest.controllerEpoch < controllerEpoch) {
      stateChangeLogger.warn("Broker %d received stop replica request from an old controller epoch %d."
        .format(localBrokerId, stopReplicaRequest.controllerEpoch) +
        " Latest known controller epoch is %d " + controllerEpoch)
      (responseMap, ErrorMapping.StaleControllerEpochCode)
    } else {
      controllerEpoch = stopReplicaRequest.controllerEpoch
      // First stop fetchers for all partitions, then stop the corresponding replicas
      // The fetcher threads of the affected partitions are stopped via the ReplicaFetcherManager
      replicaFetcherManager.removeFetcherForPartitions(stopReplicaRequest.partitions.map(r => TopicAndPartition(r.topic, r.partition)))
      for(topicAndPartition <- stopReplicaRequest.partitions){
        // Then stop the replica for each topicAndPartition
        val errorCode = stopReplica(topicAndPartition.topic, topicAndPartition.partition, stopReplicaRequest.deletePartitions)
        responseMap.put(topicAndPartition, errorCode)
      }
      (responseMap, ErrorMapping.NoError)
    }
  }
}

In many cases stopReplica does not actually delete the replica, for example when the broker is simply going down:
 
def stopReplica(topic: String, partitionId: Int, deletePartition: Boolean): Short = {
  stateChangeLogger.trace("Broker %d handling stop replica (delete=%s) for partition [%s,%d]".format(localBrokerId,
    deletePartition.toString, topic, partitionId))
  val errorCode = ErrorMapping.NoError
  getPartition(topic, partitionId) match {
    case Some(partition) =>
      if(deletePartition) { // the partition is actually deleted only when deletePartition is true
        val removedPartition = allPartitions.remove((topic, partitionId))
        if (removedPartition != null)
          removedPartition.delete() // this will delete the local log
      }
    case None =>
      // Delete log and corresponding folders in case replica manager doesn't hold them anymore.
      // This could happen when topic is being deleted while broker is down and recovers.
      if(deletePartition) {
        val topicAndPartition = TopicAndPartition(topic, partitionId)

        if(logManager.getLog(topicAndPartition).isDefined) {
          logManager.deleteLog(topicAndPartition)
        }
      }
      stateChangeLogger.trace("Broker %d ignoring stop replica (delete=%s) for partition [%s,%d] as replica doesn't exist on broker"
        .format(localBrokerId, deletePartition, topic, partitionId))
  }
  stateChangeLogger.trace("Broker %d finished handling stop replica (delete=%s) for partition [%s,%d]"
    .format(localBrokerId, deletePartition, topic, partitionId))
  errorCode
}
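The distinction above, replication always stops but data is only removed on request, can be summarized with a small sketch; LogStoreSketch and ReplicaStopperSketch are stand-ins for illustration, not the real LogManager/ReplicaManager:

import scala.collection.mutable

// A toy log store standing in for the real LogManager.
class LogStoreSketch {
  private val logs = mutable.Map[(String, Int), Vector[String]]()
  def append(tp: (String, Int), msg: String): Unit =
    logs(tp) = logs.getOrElse(tp, Vector.empty) :+ msg
  def getLog(tp: (String, Int)): Option[Vector[String]] = logs.get(tp)
  def deleteLog(tp: (String, Int)): Unit = logs.remove(tp)
}

class ReplicaStopperSketch(logStore: LogStoreSketch, fetching: mutable.Set[(String, Int)]) {
  // Mirrors the flow above: the fetcher is always stopped, the local log is removed only on a real delete.
  def stopReplica(tp: (String, Int), deletePartition: Boolean): Unit = {
    fetching -= tp                              // replication stops in every case
    if (deletePartition) logStore.deleteLog(tp) // data is removed only when the partition is being deleted
  }
}

A controlled shutdown would call stopReplica with deletePartition = false, so the data stays on disk and can be reused when the broker comes back; a topic deletion calls it with deletePartition = true.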

 

9.3 maybeShrinkIsr processing flow

At startup, ReplicaManager registers the maybeShrinkIsr task with the scheduler. Its job is to periodically check, for each replica in the ISR, how long ago it last synced and how many messages it is behind, in order to decide whether some replicas are no longer in sync.
 
 
// In ReplicaManager:
def startup() {
  // start ISR expiration thread
  scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
}

private def maybeShrinkIsr(): Unit = {
  trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
  allPartitions.values.foreach(partition => partition.maybeShrinkIsr(config.replicaLagTimeMaxMs, config.replicaLagMaxMessages))
}

// In Partition:
def maybeShrinkIsr(replicaMaxLagTimeMs: Long, replicaMaxLagMessages: Long) {
  inWriteLock(leaderIsrUpdateLock) {
    leaderReplicaIfLocal() match {
      case Some(leaderReplica) =>
        // getOutOfSyncReplicas returns the replicas that are no longer in sync
        val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs, replicaMaxLagMessages)
        if(outOfSyncReplicas.size > 0) {
          val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
          assert(newInSyncReplicas.size > 0)
          info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
            inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
          // update ISR in zk and in cache
          updateIsr(newInSyncReplicas) // write the new ISR to ZooKeeper
          // we may need to increment high watermark since ISR could be down to 1
          maybeIncrementLeaderHW(leaderReplica)
          replicaManager.isrShrinkRate.mark()
        }
      case None => // do nothing if no longer leader
    }
  }
}

 
 
// In Partition:
def getOutOfSyncReplicas(leaderReplica: Replica, keepInSyncTimeMs: Long, keepInSyncMessages: Long): Set[Replica] = {
  /**
   * there are two cases that need to be handled here -
   * 1. Stuck followers: If the leo of the replica hasn't been updated for keepInSyncTimeMs ms,
   *    the follower is stuck and should be removed from the ISR
   * 2. Slow followers: If the leo of the slowest follower is behind the leo of the leader by keepInSyncMessages, the
   *    follower is not catching up and should be removed from the ISR
   **/
  val leaderLogEndOffset = leaderReplica.logEndOffset
  val candidateReplicas = inSyncReplicas - leaderReplica
  // Case 1 above
  // logEndOffsetUpdateTimeMs is refreshed each time the follower fetches
  val stuckReplicas = candidateReplicas.filter(r => (time.milliseconds - r.logEndOffsetUpdateTimeMs) > keepInSyncTimeMs)
  if(stuckReplicas.size > 0)
    debug("Stuck replicas for partition [%s,%d] are %s".format(topic, partitionId, stuckReplicas.map(_.brokerId).mkString(",")))
  // Case 2 above
  // check how many messages the follower is behind
  val slowReplicas = candidateReplicas.filter(r =>
    r.logEndOffset.messageOffset >= 0 &&
    leaderLogEndOffset.messageOffset - r.logEndOffset.messageOffset > keepInSyncMessages)
  if(slowReplicas.size > 0)
    debug("Slow replicas for partition [%s,%d] are %s".format(topic, partitionId, slowReplicas.map(_.brokerId).mkString(",")))
  stuckReplicas ++ slowReplicas
}
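To make the two out-of-sync conditions concrete, here is a small worked example with simplified stand-in types (the real code operates on Replica objects and their log offset metadata):

// Simplified follower state: broker id, log end offset, and the time its LEO was last advanced.
case class FollowerState(brokerId: Int, leo: Long, lastLeoUpdateMs: Long)

def outOfSync(followers: Set[FollowerState], leaderLeo: Long, nowMs: Long,
              keepInSyncTimeMs: Long, keepInSyncMessages: Long): Set[Int] = {
  val stuck = followers.filter(f => nowMs - f.lastLeoUpdateMs > keepInSyncTimeMs) // case 1: no fetch progress for too long
  val slow  = followers.filter(f => leaderLeo - f.leo > keepInSyncMessages)       // case 2: too many messages behind
  (stuck ++ slow).map(_.brokerId)
}

object OutOfSyncDemo extends App {
  // With keepInSyncTimeMs = 10000 and keepInSyncMessages = 4000, a leader at LEO 20000 and now = 100000:
  // broker 2 (leo 19000, last update at 85000) is stuck: 15 s without progress;
  // broker 3 (leo 15000, last update at 99000) is slow: 5000 messages behind;
  // broker 4 (leo 19500, last update at 99500) stays in the ISR.
  println(outOfSync(
    Set(FollowerState(2, 19000L, 85000L), FollowerState(3, 15000L, 99000L), FollowerState(4, 19500L, 99500L)),
    leaderLeo = 20000L, nowMs = 100000L, keepInSyncTimeMs = 10000L, keepInSyncMessages = 4000L)) // Set(2, 3)
}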