ReplicaManager Analysis

A broker may host replicas of multiple partitions. ReplicaManager manages the partition and replica information within the scope of a single broker, and it also carries out the commands sent by the Kafka Controller, such as LeaderAndIsr and StopReplica.

1 ReplicaManager Data Structures

Assume a cluster of 5 brokers and a topic with 3 partitions, each with two follower replicas in addition to its leader (i.e. a replication factor of 3):

broker1, broker2, broker3, broker4, broker5

broker1 currently hosts replicas of 3 partitions: p11, p12, p13

p11: (broker1, broker3, broker5)

broker1 is the leader; broker3 & broker5 are followers

p12: (broker2, broker1, broker4)

broker2 is the leader; broker1 & broker4 are followers

p13: (broker1, broker2, broker4)

broker1 is the leader; broker2 & broker4 are followers
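From broker1's point of view, this layout means its ReplicaManager holds leader replicas for p11 and p13 and a follower replica for p12. A purely illustrative sketch of that view (plain Scala, not actual Kafka types):

// Purely illustrative: the role broker1's ReplicaManager plays for each partition it hosts
val rolesOnBroker1: Map[String, String] = Map(
  "p11" -> "leader",   // assigned replicas: broker1, broker3, broker5
  "p12" -> "follower", // leader is broker2
  "p13" -> "leader"    // assigned replicas: broker1, broker2, broker4
)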



2 Core Fields

logManager: LogManager - handles log read and write requests, delegating the actual work to the underlying Log objects

scheduler: Scheduler - runs ReplicaManager's periodic tasks; there are three in total: highwatermark-checkpoint, isr-expiration, and isr-change-propagation

quotaManager: ReplicationQuotaManager - replication quota management

controllerEpoch: Int - the KafkaController generation, incremented every time a new Controller leader is elected. When ReplicaManager handles a request coming from the KafkaController, it first checks the controllerEpoch carried in the request so that requests from a stale Controller are rejected

localBrokerId: Int - the id of the current broker

allPartitions: Pool[(String, Int), Partition] - all partitions hosted on this broker

replicaFetcherManager: ReplicaFetcherManager - manages the ReplicaFetcherThread threads, which send FetchRequests to leader replicas to pull messages, keeping followers in sync with their leaders

highWatermarkCheckpoints: Map[String, OffsetCheckpoint] - maps each log directory to an OffsetCheckpoint. The OffsetCheckpoint wraps the replication-offset-checkpoint file in that log directory, which records the high watermark of every partition stored under that data directory; the highwatermark-checkpoint task in ReplicaManager periodically rewrites this file

isrChangeSet: mutable.Set[TopicAndPartition] - records the partitions whose ISR has changed

delayedProducePurgatory: DelayedOperationPurgatory[DelayedProduce] - the DelayedOperationPurgatory that manages DelayedProduce operations

delayedFetchPurgatory: DelayedOperationPurgatory[DelayedFetch] - the DelayedOperationPurgatory that manages DelayedFetch operations
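Putting these fields together, a condensed skeleton of the class looks roughly like the following. This is a simplified sketch based on the 0.10.x source, not a verbatim copy; constructor parameters and initializers that are not relevant here are elided with ???:

// Condensed, simplified skeleton of ReplicaManager's state, matching the fields listed above
class ReplicaManager(val config: KafkaConfig,
                     val logManager: LogManager,
                     val scheduler: Scheduler,
                     quotaManager: ReplicationQuotaManager /* other dependencies elided */) {
  // Controller generation: bumped on every controller re-election, used to reject stale controller requests
  @volatile var controllerEpoch: Int = KafkaController.InitialControllerEpoch - 1
  private val localBrokerId: Int = config.brokerId
  // All partition replicas hosted on this broker, keyed by (topic, partitionId)
  private val allPartitions = new Pool[(String, Int), Partition]
  // Manages the ReplicaFetcherThreads that pull messages from leader replicas
  val replicaFetcherManager: ReplicaFetcherManager = ???
  // One OffsetCheckpoint (the replication-offset-checkpoint file) per log directory
  val highWatermarkCheckpoints: Map[String, OffsetCheckpoint] = config.logDirs.map(dir =>
    (new File(dir).getAbsolutePath, new OffsetCheckpoint(new File(dir, ReplicaManager.HighWatermarkFilename)))).toMap
  // Partitions whose ISR changed but whose change has not yet been propagated to ZooKeeper
  private val isrChangeSet: mutable.Set[TopicAndPartition] = new mutable.HashSet[TopicAndPartition]()
  // Purgatories holding the pending DelayedProduce / DelayedFetch operations
  val delayedProducePurgatory: DelayedOperationPurgatory[DelayedProduce] = ???
  val delayedFetchPurgatory: DelayedOperationPurgatory[DelayedFetch] = ???
}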

 

3 Key Methods

3.1 Replica role switching

Based on the state of a partition's leader and follower replicas, the KafkaController sends a LeaderAndIsrRequest to the relevant brokers; this request drives replica role switching. The LeaderAndIsrRequest is first handled by KafkaApis.handleLeaderAndIsrRequest, whose core logic is implemented by ReplicaManager's becomeLeaderOrFollower method, which in turn relies on Partition's own makeLeader and makeFollower methods.
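For context, here is a condensed sketch of that entry point. It is simplified from the 0.10.x source (the authorization check and error handling are omitted), so treat the exact signatures as approximate:

// Condensed sketch of KafkaApis.handleLeaderAndIsrRequest (0.10.x); authorization and error handling omitted
def handleLeaderAndIsrRequest(request: RequestChannel.Request) {
  val correlationId = request.header.correlationId
  val leaderAndIsrRequest = request.body.asInstanceOf[LeaderAndIsrRequest]

  // Callback invoked after the role switch: lets the GroupCoordinator react when leadership of
  // __consumer_offsets partitions moves onto or off this broker
  def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
    updatedLeaders.foreach { partition =>
      if (partition.topic == Topic.GroupMetadataTopicName)
        coordinator.handleGroupImmigration(partition.partitionId)
    }
    updatedFollowers.foreach { partition =>
      if (partition.topic == Topic.GroupMetadataTopicName)
        coordinator.handleGroupEmigration(partition.partitionId)
    }
  }

  val responseHeader = new ResponseHeader(correlationId)
  // The core work is delegated to ReplicaManager.becomeLeaderOrFollower
  val result = replicaManager.becomeLeaderOrFollower(correlationId, leaderAndIsrRequest, metadataCache, onLeadershipChange)
  val leaderAndIsrResponse = new LeaderAndIsrResponse(result.errorCode, result.responseMap.mapValues(new JShort(_)).asJava)
  requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, leaderAndIsrResponse)))
}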

 

# First, a look at the message formats of LeaderAndIsrRequest and LeaderAndIsrResponse
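In outline, the version-0 request and response carry roughly the following fields:

LeaderAndIsrRequest
  controller_id: int32
  controller_epoch: int32
  partition_states: [topic: string, partition: int32, controller_epoch: int32, leader: int32,
                     leader_epoch: int32, isr: [int32], zk_version: int32, replicas: [int32]]
  live_leaders: [id: int32, host: string, port: int32]

LeaderAndIsrResponse
  error_code: int16
  partitions: [topic: string, partition: int32, error_code: int16]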





def becomeLeaderOrFollower(correlationId: Int, leaderAndISRRequest: LeaderAndIsrRequest, metadataCache: MetadataCache,
    onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): BecomeLeaderOrFollowerResult = {
  leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
    stateChangeLogger.trace("Broker %d received LeaderAndIsr request %s correlation id %d from controller %d epoch %d for partition [%s,%d]"
                              .format(localBrokerId, stateInfo, correlationId,
                                      leaderAndISRRequest.controllerId, leaderAndISRRequest.controllerEpoch, topicPartition.topic, topicPartition.partition))
  }
  replicaStateChangeLock synchronized {
    val responseMap = new mutable.HashMap[TopicPartition, Short]
    // If the controllerEpoch in the LeaderAndIsr request is older than the locally known controllerEpoch,
    // the request comes from a stale controller and is rejected
    if (leaderAndISRRequest.controllerEpoch < controllerEpoch) {
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
      stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d since " +
        "its controller epoch %d is old. Latest known controller epoch is %d").format(localBrokerId, leaderAndISRRequest.controllerId,
        correlationId, leaderAndISRRequest.controllerEpoch, controllerEpoch))
      }
      BecomeLeaderOrFollowerResult(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
    } else {
      val controllerId = leaderAndISRRequest.controllerId
      controllerEpoch = leaderAndISRRequest.controllerEpoch

      // First check the leader epoch of each partition
      val partitionState = new mutable.HashMap[Partition, PartitionState]()
      // Iterate over the partition states carried by the LeaderAndIsr request
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        // Get or create the Partition object for this topic partition
        val partition = getOrCreatePartition(topicPartition.topic, topicPartition.partition)
        // The current leader epoch of the local partition
        val partitionLeaderEpoch = partition.getLeaderEpoch()
        // Only accept the request if the local leader epoch is smaller than the leaderEpoch in the request; otherwise the request is stale
        if (partitionLeaderEpoch < stateInfo.leaderEpoch) {
          // Check whether this partition is assigned to the current broker
          if (stateInfo.replicas.contains(config.brokerId))
            // Record the partition assigned to this broker: `partition` is the current local state,
            // `stateInfo` is the new state carried by the request
            partitionState.put(partition, stateInfo)
          else {
            stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
              "epoch %d for partition [%s,%d] as itself is not in assigned replica list %s")
              .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                topicPartition.topic, topicPartition.partition, stateInfo.replicas.asScala.mkString(",")))
            responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION.code)
          }
        } else {
          // Otherwise record the error code in the response
          stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
            "epoch %d for partition [%s,%d] since its associated leader epoch %d is not higher than the current leader epoch %d")
            .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
              topicPartition.topic, topicPartition.partition, stateInfo.leaderEpoch, partitionLeaderEpoch))
          responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH.code)
        }
      }
      /** Decide which partitions become leader and which become follower, then call makeLeaders and makeFollowers */
      // partitionState holds the partitions stored on this broker together with their new state;
      // filter out the partitions whose new leader is the current broker
      val partitionsTobeLeader = partitionState.filter { case (partition, stateInfo) =>
        stateInfo.leader == config.brokerId
      }
      // Removing the leaders from partitionState leaves the partitions for which this broker is a follower
      val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys
      // If partitionsTobeLeader is not empty, call makeLeaders
      val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
        makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
      else
        Set.empty[Partition]
      // If partitionsToBeFollower is not empty, call makeFollowers
      val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
        makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap, metadataCache)
      else
        Set.empty[Partition]

      // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
      // have been completely populated before starting the checkpointing thereby avoiding weird race conditions
      // After the first LeaderAndIsr request, start the highwatermark checkpoint thread and mark it as initialized
      if (!hwThreadInitialized) {
        startHighWaterMarksCheckPointThread()
        hwThreadInitialized = true
      }
      // Ask the ReplicaFetcherManager to shut down idle fetcher threads
      replicaFetcherManager.shutdownIdleFetcherThreads()
      // Trigger the leadership-change callback
      onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
      BecomeLeaderOrFollowerResult(responseMap, Errors.NONE.code)
    }
  }
}

private def makeLeaders(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
    correlationId: Int, responseMap: mutable.Map[TopicPartition, Short]): Set[Partition] = {
  partitionState.foreach(state =>
    stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "starting the become-leader transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId))))

  for (partition <- partitionState.keys)
    responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)

  val partitionsToMakeLeaders: mutable.Set[Partition] = mutable.Set()

  try {
    // First stop the fetcher threads for these partitions' replicas
    replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(p => new TopicPartition(p.topic, p.partitionId)))
    // Make the local replica of each partition the leader and update its state
    partitionState.foreach{ case (partition, partitionStateInfo) =>
      if(partition.makeLeader(controllerId, partitionStateInfo, correlationId))
        partitionsToMakeLeaders += partition
      else
        stateChangeLogger.info(("Broker %d skipped the become-leader state change after marking its partition as leader with correlation id %d from " +
          "controller %d epoch %d for partition %s since it is already the leader for the partition.")
          .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(partition.topic, partition.partitionId)));
    }
    partitionsToMakeLeaders.foreach { partition =>
      stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-leader request from controller " +
        "%d epoch %d with correlation id %d for partition %s")
        .format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
    }
  } catch {
    case e: Throwable =>
      partitionState.foreach { state =>
        val errorMsg = ("Error on broker %d while processing LeaderAndIsr request correlationId %d received from controller %d" +
          " epoch %d for partition %s").format(localBrokerId, correlationId, controllerId, epoch,
                                              TopicAndPartition(state._1.topic, state._1.partitionId))
        stateChangeLogger.error(errorMsg, e)
      }
      // Re-throw the exception for it to be caught in KafkaApis
      throw e
  }

  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "for the become-leader transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }

  partitionsToMakeLeaders
}
private def makeFollowers(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
    correlationId: Int, responseMap: mutable.Map[TopicPartition, Short], metadataCache: MetadataCache) : Set[Partition] = {
  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "starting the become-follower transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }

  for (partition <- partitionState.keys)
    responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)

  val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()

  try {

    // TODO: Delete leaders from LeaderAndIsrRequest
    partitionState.foreach{ case (partition, partitionStateInfo) =>
      // Check whether the new leader broker is alive
      val newLeaderBrokerId = partitionStateInfo.leader
      metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
        // Only change partition state when the leader is available
        case Some(leaderBroker) =>
          // Call Partition.makeFollower to switch the partition's local replica to a follower
          if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
            partitionsToMakeFollower += partition
          else
            stateChangeLogger.info(("Broker %d skipped the become-follower state change after marking its partition as follower with correlation id %d from " +
              "controller %d epoch %d for partition [%s,%d] since the new leader %d is the same as the old leader")
              .format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
              partition.topic, partition.partitionId, newLeaderBrokerId))
        case None =>
          // The leader broker should always be present in the metadata cache.
          // If not, we should record the error message and abort the transition process for this partition
          stateChangeLogger.error(("Broker %d received LeaderAndIsrRequest with correlation id %d from controller" +
            " %d epoch %d for partition [%s,%d] but cannot become follower since the new leader %d is unavailable.")
            .format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
            partition.topic, partition.partitionId, newLeaderBrokerId))
          // Create the local replica even if the leader is unavailable. This is required to ensure that we include
          // the partition's high watermark in the checkpoint file (see KAFKA-1647)
          partition.getOrCreateReplica()
      }
    }
    // Stop the fetcher threads that were syncing from the old leader
    replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(p => new TopicPartition(p.topic, p.partitionId)))
    partitionsToMakeFollower.foreach { partition =>
      stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-follower request from controller " +
        "%d epoch %d with correlation id %d for partition %s")
        .format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
    }
    // Since the leader has changed, the old and new leaders may disagree on the messages between HW and LEO, while messages below the HW are consistent, so the log is truncated to the high watermark
    logManager.truncateTo(partitionsToMakeFollower.map(partition => (new TopicAndPartition(partition), partition.getOrCreateReplica().highWatermark.messageOffset)).toMap)
    partitionsToMakeFollower.foreach { partition =>
      val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topic, partition.partitionId)
      tryCompleteDelayedProduce(topicPartitionOperationKey)
      tryCompleteDelayedFetch(topicPartitionOperationKey)
    }

    partitionsToMakeFollower.foreach { partition =>
      stateChangeLogger.trace(("Broker %d truncated logs and checkpointed recovery boundaries for partition [%s,%d] as part of " +
        "become-follower request with correlation id %d from controller %d epoch %d").format(localBrokerId,
        partition.topic, partition.partitionId, correlationId, controllerId, epoch))
    }
    // Check whether the ReplicaManager is shutting down
    if (isShuttingDown.get()) {
      partitionsToMakeFollower.foreach { partition =>
        stateChangeLogger.trace(("Broker %d skipped the adding-fetcher step of the become-follower state change with correlation id %d from " +
          "controller %d epoch %d for partition [%s,%d] since it is shutting down").format(localBrokerId, correlationId,
          controllerId, epoch, partition.topic, partition.partitionId))
      }
    }
    else {
      // Add fetchers so these partitions resume syncing from their new leaders
      val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
        new TopicPartition(partition.topic, partition.partitionId) -> BrokerAndInitialOffset(
          metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get.getBrokerEndPoint(config.interBrokerSecurityProtocol),
          partition.getReplica().get.logEndOffset.messageOffset)).toMap
      replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)

      partitionsToMakeFollower.foreach { partition =>
        stateChangeLogger.trace(("Broker %d started fetcher to new leader as part of become-follower request from controller " +
          "%d epoch %d with correlation id %d for partition [%s,%d]")
          .format(localBrokerId, controllerId, epoch, correlationId, partition.topic, partition.partitionId))
      }
    }
  } catch {
    case e: Throwable =>
      val errorMsg = ("Error on broker %d while processing LeaderAndIsr request with correlationId %d received from controller %d " +
        "epoch %d").format(localBrokerId, correlationId, controllerId, epoch)
      stateChangeLogger.error(errorMsg, e)
      // Re-throw the exception for it to be caught in KafkaApis
      throw e
  }

  partitionState.foreach { state =>
    stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
      "for the become-follower transition for partition %s")
      .format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
  }

  partitionsToMakeFollower
}

 

3.2 Appending and reading messages

private def appendToLocalLog(internalTopicsAllowed: Boolean, messagesPerPartition: Map[TopicPartition, MessageSet],
            requiredAcks: Short): Map[TopicPartition, LogAppendResult] = {
  trace("Append [%s] to local log ".format(messagesPerPartition))
  messagesPerPartition.map { case (topicPartition, messages) =>
    BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).totalProduceRequestRate.mark()
    BrokerTopicStats.getBrokerAllTopicsStats().totalProduceRequestRate.mark()
    // Appends to internal topics are rejected unless explicitly allowed
    if (Topic.isInternal(topicPartition.topic) && !internalTopicsAllowed) {
      (topicPartition, LogAppendResult(
        LogAppendInfo.UnknownLogAppendInfo,
        Some(new InvalidTopicException("Cannot append to internal topic %s".format(topicPartition.topic)))))
    } else {
      try {
        // Look up the Partition object among all partitions hosted on this broker
        val partitionOpt = getPartition(topicPartition.topic, topicPartition.partition)
        val info = partitionOpt match {
          case Some(partition) =>
            // Call Partition.appendMessagesToLeader to write the messages to the leader's log
            partition.appendMessagesToLeader(messages.asInstanceOf[ByteBufferMessageSet], requiredAcks)
          case None => throw new UnknownTopicOrPartitionException("Partition %s doesn't exist on %d"
            .format(topicPartition, localBrokerId))
        }

        val numAppendedMessages =
          if (info.firstOffset == -1L || info.lastOffset == -1L)
            0
          else
            info.lastOffset - info.firstOffset + 1

        // update stats for successfully appended bytes and messages as bytesInRate and messageInRate
        BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).bytesInRate.mark(messages.sizeInBytes)
        BrokerTopicStats.getBrokerAllTopicsStats.bytesInRate.mark(messages.sizeInBytes)
        BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).messagesInRate.mark(numAppendedMessages)
        BrokerTopicStats.getBrokerAllTopicsStats.messagesInRate.mark(numAppendedMessages)

        trace("%d bytes written to log %s-%d beginning at offset %d and ending at offset %d"
          .format(messages.sizeInBytes, topicPartition.topic, topicPartition.partition, info.firstOffset, info.lastOffset))
        (topicPartition, LogAppendResult(info))
      } catch {
        // ... exception handling elided in the original excerpt ...
      }
    }
  }
}
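appendToLocalLog is called from ReplicaManager.appendMessages, which also decides whether the produce request can be answered immediately or must wait in the purgatory as a DelayedProduce. A condensed, simplified sketch of that decision (based on the 0.10.x source; stats, error paths and the construction of the per-partition response status are elided):

// Condensed sketch of ReplicaManager.appendMessages (0.10.x); many details elided
def appendMessages(timeout: Long, requiredAcks: Short, internalTopicsAllowed: Boolean,
                   messagesPerPartition: Map[TopicPartition, MessageSet],
                   responseCallback: Map[TopicPartition, PartitionResponse] => Unit) {
  if (isValidRequiredAcks(requiredAcks)) {          // acks must be 0, 1 or -1
    // Write to the local (leader) logs first
    val localProduceResults = appendToLocalLog(internalTopicsAllowed, messagesPerPartition, requiredAcks)
    // ... build produceMetadata / produceResponseStatus from localProduceResults (elided) ...
    if (delayedRequestRequired(requiredAcks, messagesPerPartition, localProduceResults)) {
      // acks == -1: the response must wait until the ISR has replicated the data, so a DelayedProduce
      // is parked in delayedProducePurgatory, keyed by the affected partitions; follower fetches
      // (via updateFollowerLogReadResults -> tryCompleteDelayedProduce) complete it later
      val delayedProduce = new DelayedProduce(timeout, produceMetadata, this, responseCallback)
      val producerRequestKeys = messagesPerPartition.keys.map(new TopicPartitionOperationKey(_)).toSeq
      delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys)
    } else {
      // acks == 0 or 1: the local append is sufficient, respond immediately
      responseCallback(produceResponseStatus)
    }
  } else {
    // invalid acks value: respond with INVALID_REQUIRED_ACKS for every partition (elided)
  }
}

// A DelayedProduce is only needed when acks == -1, there is data to append, and at least one
// partition was appended locally without error
private def delayedRequestRequired(requiredAcks: Short, messagesPerPartition: Map[TopicPartition, MessageSet],
                                   localProduceResults: Map[TopicPartition, LogAppendResult]): Boolean = {
  requiredAcks == -1 &&
  messagesPerPartition.nonEmpty &&
  localProduceResults.values.count(_.error.isDefined) < messagesPerPartition.size
}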

 

fetchMessages reads messages from the leader replicas and waits until enough data is available, returning either when the timeout expires or when the fetch requirements are satisfied:

def fetchMessages(timeout: Long, replicaId: Int, fetchMinBytes: Int, fetchMaxBytes: Int, hardMaxBytesLimit: Boolean,
  fetchInfos: Seq[(TopicAndPartition, PartitionFetchInfo)], quota: ReplicaQuota = UnboundedQuota,
  responseCallback: Seq[(TopicAndPartition, FetchResponsePartitionData)] => Unit) {
  val isFromFollower = replicaId >= 0
  val fetchOnlyFromLeader: Boolean = replicaId != Request.DebuggingConsumerId
  val fetchOnlyCommitted: Boolean = ! Request.isValidBrokerId(replicaId)

  // Read from the local log
  val logReadResults = readFromLocalLog(
    replicaId = replicaId,
    fetchOnlyFromLeader = fetchOnlyFromLeader,
    readOnlyCommitted = fetchOnlyCommitted,
    fetchMaxBytes = fetchMaxBytes,
    hardMaxBytesLimit = hardMaxBytesLimit,
    readPartitionInfo = fetchInfos,
    quota = quota)

  // If the fetch request comes from a follower, update that follower's state (e.g. its LEO)
  if(Request.isValidBrokerId(replicaId))
    /*
     * Main steps:
     * 1. The leader keeps track of the state of each follower replica; update the corresponding follower's state, e.g. its LEO
     * 2. Check whether the ISR needs to be expanded; if the ISR changes, record the change so it can be propagated to ZooKeeper
     * 3. Check whether the high watermark can be advanced
     * 4. Check the DelayedProduce operations watching the affected keys in delayedProducePurgatory and complete those that are now satisfied
     */
    updateFollowerLogReadResults(replicaId, logReadResults)

  // Collect the per-partition read results
  val logReadResultValues = logReadResults.map { case (_, v) => v }
  // Total number of bytes read
  val bytesReadable = logReadResultValues.map(_.info.messageSet.sizeInBytes).sum
  // Check whether any read result carries an error
  val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
    errorIncurred || (readResult.errorCode != Errors.NONE.code))

  /*
   * The FetchResponse can be returned immediately if any of the following holds:
   * 1. The caller does not want to wait (timeout <= 0)
   * 2. The FetchRequest does not specify any partitions to read
   * 3. Enough data has already been read (bytesReadable >= fetchMinBytes)
   * 4. An error occurred while reading the data (errorReadingData)
   */
  if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
    val fetchPartitionData = logReadResults.map { case (tp, result) =>
      tp -> FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet)
    }
    // Invoke the response callback directly
    responseCallback(fetchPartitionData)
  } else {
    // Wrap the per-partition fetch status
    val fetchPartitionStatus = logReadResults.map { case (topicAndPartition, result) =>
      val fetchInfo = fetchInfos.collectFirst {
        case (tp, v) if tp == topicAndPartition => v
      }.getOrElse(sys.error(s"Partition $topicAndPartition not found in fetchInfos"))
      (topicAndPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
    }
    // Build the FetchMetadata object
    val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
      fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus)
    // Build a DelayedFetch operation
    val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, responseCallback)

    // Create a list of (topic, partition) keys to serve as the watch keys of the delayed fetch operation
    val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }
    // Try to complete the request immediately, otherwise put it into the purgatory
    delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
  }
}

Once all follower replicas in the ISR have synchronized a message, Kafka considers the message committed and the HW can be advanced. Because of this, Fetch requests coming from followers get one extra processing step, updateFollowerLogReadResults, which:

# updates the per-follower state maintained on the leader replica

# as followers keep fetching they eventually catch up with the leader, which may allow the ISR to be expanded; if the ISR changes, the new ISR is recorded so it can be written to ZooKeeper

# checks whether the high watermark can be advanced

# checks whether the DelayedOperations watching the related keys in delayedProducePurgatory are now satisfied, and completes them if so

private def updateFollowerLogReadResults(replicaId: Int, readResults: Seq[(TopicAndPartition, LogReadResult)]) {
  debug("Recording follower broker %d log read results: %s ".format(replicaId, readResults))
  // Iterate over the log read results
  readResults.foreach { case (topicAndPartition, readResult) =>
    getPartition(topicAndPartition.topic, topicAndPartition.partition) match {
      case Some(partition) =>
        // Call Partition#updateReplicaLogReadResult, which updates the follower's state and may expand the ISR
        partition.updateReplicaLogReadResult(replicaId, readResult)
        // Try to complete the DelayedProduce operations watching this partition
        tryCompleteDelayedProduce(new TopicPartitionOperationKey(topicAndPartition))
      case None =>
        warn("While recording the replica LEO, the partition %s hasn't been created.".format(topicAndPartition))
    }
  }
}
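The ISR expansion mentioned above happens inside Partition.updateReplicaLogReadResult via maybeExpandIsr. A condensed sketch of the expansion criterion, simplified from the 0.10.x source:

// Condensed sketch of Partition.maybeExpandIsr (0.10.x): a follower rejoins the ISR once it is an
// assigned replica and its LEO has caught up to the leader's high watermark
def maybeExpandIsr(replicaId: Int) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
    leaderReplicaIfLocal() match {
      case Some(leaderReplica) =>
        val replica = getReplica(replicaId).get
        val leaderHW = leaderReplica.highWatermark
        if (!inSyncReplicas.contains(replica) &&
            assignedReplicas.map(_.brokerId).contains(replicaId) &&
            replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
          // The follower has caught up: add it to the ISR and persist the new ISR to ZooKeeper
          updateIsr(inSyncReplicas + replica)
          replicaManager.isrExpandRate.mark()
        }
        // A larger ISR or an advanced follower LEO may allow the HW to move forward
        maybeIncrementLeaderHW(leaderReplica)
      case None => false // not the leader, nothing to do
    }
  }
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}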

 

3.3 Message synchronization

Synchronizing follower replicas with the leader replica is implemented by the ReplicaFetcherManager component, which extends AbstractFetcherManager.

In AbstractFetcherThread, the addPartitions and removePartitions methods add and remove entries in the partition fetch-state map (partitionStates) and wake up the fetcher thread so it can start syncing:

def addPartitions(partitionAndOffsets: Map[TopicPartition, Long]) {
  partitionMapLock.lockInterruptibly()
  try {
    // Keep only the partitions that do not already have a fetch state
    val newPartitionToState = partitionAndOffsets.filter { case (tp, _) =>
      !partitionStates.contains(tp)
    }.map { case (tp, offset) =>
      val fetchState =
        if (PartitionTopicInfo.isOffsetInvalid(offset)) new PartitionFetchState(handleOffsetOutOfRange(tp))
        else new PartitionFetchState(offset)
      tp -> fetchState
    }
    val existingPartitionToState = partitionStates.partitionStates.asScala.map { state =>
      state.topicPartition -> state.value
    }.toMap
    partitionStates.set((existingPartitionToState ++ newPartitionToState).asJava)
    partitionMapCond.signalAll() // Wake up the fetcher thread so it starts syncing
  } finally partitionMapLock.unlock()
}
def removePartitions(topicPartitions: Set[TopicPartition]) {
  partitionMapLock.lockInterruptibly()
  try {
    topicPartitions.foreach { topicPartition =>
      partitionStates.remove(topicPartition)
      fetcherLagStats.unregister(topicPartition.topic, topicPartition.partition)
    }
  } finally partitionMapLock.unlock()
}
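On the manager side, AbstractFetcherManager.addFetcherForPartitions decides which fetcher thread handles a given partition: partitions are grouped by (leader broker, fetcher id), where the fetcher id is derived from a hash of the topic and partition, so followers of one leader can be spread over several fetcher threads (num.replica.fetchers). A condensed sketch, simplified from the 0.10.x source:

// Condensed sketch from AbstractFetcherManager (0.10.x): map a partition to one of numFetchers threads
private def getFetcherId(topic: String, partitionId: Int): Int = {
  Utils.abs(31 * topic.hashCode() + partitionId) % numFetchers
}

// addFetcherForPartitions (simplified): group partitions by (leader broker, fetcher id), create and start
// the fetcher thread for each group if it does not exist yet, then hand it the partitions and start offsets
def addFetcherForPartitions(partitionAndOffsets: Map[TopicPartition, BrokerAndInitialOffset]) {
  val partitionsPerFetcher = partitionAndOffsets.groupBy { case (topicPartition, brokerAndInitialOffset) =>
    BrokerAndFetcherId(brokerAndInitialOffset.broker, getFetcherId(topicPartition.topic, topicPartition.partition))
  }
  for ((brokerAndFetcherId, initialOffsets) <- partitionsPerFetcher) {
    val fetcherThread = fetcherThreadMap.getOrElseUpdate(brokerAndFetcherId, {
      val thread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
      thread.start()
      thread
    })
    fetcherThread.addPartitions(initialOffsets.map { case (tp, brokerAndInitOffset) => tp -> brokerAndInitOffset.initOffset })
  }
}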

 

override def doWork() {

  val fetchRequest = inLock(partitionMapLock) {
    // Build the FetchRequest
    val fetchRequest = buildFetchRequest(partitionStates.partitionStates.asScala.map { state =>
      state.topicPartition -> state.value
    })
    if (fetchRequest.isEmpty) {
      trace("There are no active partitions. Back off for %d ms before sending a fetch request".format(fetchBackOffMs))
      partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
    }
    fetchRequest
  }
  if (!fetchRequest.isEmpty)
    // Send the FetchRequest and process the response
    processFetchRequest(fetchRequest)
}

 

Processing the fetch request:

private def processFetchRequest(fetchRequest: REQ) {
  val partitionsWithError = mutable.Set[TopicPartition]()

  def updatePartitionsWithError(partition: TopicPartition): Unit = {
    partitionsWithError += partition
    partitionStates.moveToEnd(partition)
  }

  var responseData: Seq[(TopicPartition, PD)] = Seq.empty

  try {
    trace("Issuing to broker %d of fetch request %s".format(sourceBroker.id, fetchRequest))
    // Send the FetchRequest and wait for the FetchResponse
    responseData = fetch(fetchRequest)
  } catch {
    case t: Throwable =>
      if (isRunning.get) {
        warn(s"Error in fetch $fetchRequest", t)
        inLock(partitionMapLock) {
          partitionStates.partitionSet.asScala.foreach(updatePartitionsWithError)
          // there is an error occurred while fetching partitions, sleep a while
          // note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
          // partition with error effectively doubling the delay. It would be good to improve this.
          partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
        }
      }
  }
  fetcherStats.requestRate.mark()

  if (responseData.nonEmpty) { // process the fetch response
    inLock(partitionMapLock) {
      // Iterate over the response data of each partition
      responseData.foreach { case (topicPartition, partitionData) =>
        val topic = topicPartition.topic
        val partitionId = topicPartition.partition
        Option(partitionStates.stateValue(topicPartition)).foreach(currentPartitionFetchState =>
          // Only process the data if the fetch offset has not changed between sending the FetchRequest and receiving the FetchResponse
          if (fetchRequest.offset(topicPartition) == currentPartitionFetchState.offset) {
            Errors.forCode(partitionData.errorCode) match {
              case Errors.NONE =>
                try {
                  // The returned message set
                  val messages = partitionData.toByteBufferMessageSet
                  // The offset following the last returned message (or the current offset if nothing was returned)
                  val newOffset = messages.shallowIterator.toSeq.lastOption.map(_.nextOffset).getOrElse(
                    currentPartitionFetchState.offset)

                  fetcherLagStats.getAndMaybePut(topic, partitionId).lag = Math.max(0L, partitionData.highWatermark - newOffset)
                  // Append the messages fetched from the leader to the local log
                  processPartitionData(topicPartition, currentPartitionFetchState.offset, partitionData)

                  val validBytes = messages.validBytes // number of valid bytes in the fetched data
                  if (validBytes > 0) {
                    // No error occurred, so advance the partition's fetch state
                    partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
                    fetcherStats.byteRate.mark(validBytes)
                  }
                } catch {
                  case ime: CorruptRecordException =>
                    // we log the error and continue. This ensures two things
                    // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread down and cause other topic partition to also lag
                    // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes can cause this), we simply continue and
                    // should get fixed in the subsequent fetches
                    logger.error("Found invalid messages during fetch for partition [" + topic + "," + partitionId + "] offset " + currentPartitionFetchState.offset  + " error " + ime.getMessage)
                    updatePartitionsWithError(topicPartition);
                  case e: Throwable =>
                    throw new KafkaException("error processing data for partition [%s,%d] offset %d"
                      .format(topic, partitionId, currentPartitionFetchState.offset), e)
                }
              case Errors.OFFSET_OUT_OF_RANGE => // the requested offset is outside the range of the leader's log, so reset it
                try {
                  val newOffset = handleOffsetOutOfRange(topicPartition)
                  partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
                  error("Current offset %d for partition [%s,%d] out of range; reset offset to %d"
                    .format(currentPartitionFetchState.offset, topic, partitionId, newOffset))
                } catch {
                  case e: Throwable =>
                    error("Error getting offset for partition [%s,%d] to broker %d".format(topic, partitionId, sourceBroker.id), e)
                    updatePartitionsWithError(topicPartition)
                }
              case _ =>
                if (isRunning.get) {
                  error("Error for partition [%s,%d] to broker %d:%s".format(topic, partitionId, sourceBroker.id,
                    partitionData.exception.get))
                  updatePartitionsWithError(topicPartition)
                }
            }
          })
      }
    }
  }

  if (partitionsWithError.nonEmpty) {
    debug("handling partitions with error for %s".format(partitionsWithError))
    handlePartitionsWithErrors(partitionsWithError)
  }
}

 

protected def fetch(fetchRequest: FetchRequest): Seq[(TopicPartition, PartitionData)] = {
  // Send the fetch request
  val clientResponse = sendRequest(ApiKeys.FETCH, Some(fetchRequestVersion), fetchRequest.underlying)
  // Parse and return the FetchResponse data
  new FetchResponse(clientResponse.responseBody).responseData.asScala.toSeq.map { case (key, value) =>
    key -> new PartitionData(value)
  }
}

 

3.4 Stopping replicas

When the KafkaController sends a StopReplicaRequest, the broker shuts down the specified replicas, and a field in the StopReplicaRequest decides whether the logs backing those replicas are deleted as well. The request is used during partition reassignment and when a broker is being shut down; it does not always imply deleting the old replica and its log, e.g. in the broker-shutdown case.

First, the message formats of StopReplicaRequest and StopReplicaResponse:
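In outline, the version-0 request and response carry roughly the following fields:

StopReplicaRequest
  controller_id: int32
  controller_epoch: int32
  delete_partitions: boolean
  partitions: [topic: string, partition: int32]

StopReplicaResponse
  error_code: int16
  partitions: [topic: string, partition: int32, error_code: int16]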





The request is first handled by KafkaApis#handleStopReplicaRequest:

def handleStopReplicaRequest(request: RequestChannel.Request) {
  // Cast the request body to StopReplicaRequest
  val stopReplicaRequest = request.body.asInstanceOf[StopReplicaRequest]
  // Build the response header
  val responseHeader = new ResponseHeader(request.header.correlationId)
  val response =
    if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
      // Delegate to replicaManager#stopReplicas
      val (result, error) = replicaManager.stopReplicas(stopReplicaRequest)
      // Walk through the results and handle the internal group metadata topic
      result.foreach { case (topicPartition, errorCode) =>
        if (errorCode == Errors.NONE.code && stopReplicaRequest.deletePartitions() && topicPartition.topic == Topic.GroupMetadataTopicName) {
          coordinator.handleGroupEmigration(topicPartition.partition)
        }
      }
      // Build the response
      new StopReplicaResponse(error, result.asInstanceOf[Map[TopicPartition, JShort]].asJava)
    } else {
      val result = stopReplicaRequest.partitions.asScala.map((_, new JShort(Errors.CLUSTER_AUTHORIZATION_FAILED.code))).toMap
      new StopReplicaResponse(Errors.CLUSTER_AUTHORIZATION_FAILED.code, result.asJava)
    }

  requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, response)))
  // Shut down replica fetcher threads that have become idle
  replicaManager.replicaFetcherManager.shutdownIdleFetcherThreads()
}

 

 

This delegates to ReplicaManager's stopReplicas method:

def stopReplicas(stopReplicaRequest: StopReplicaRequest): (mutable.Map[TopicPartition, Short], Short) = {
  replicaStateChangeLock synchronized {
    // Map from partition to error code, used to build the response
    val responseMap = new collection.mutable.HashMap[TopicPartition, Short]
    // If the controllerEpoch in the request is smaller than the currently known controllerEpoch, the request comes from a stale controller and is rejected
    if(stopReplicaRequest.controllerEpoch() < controllerEpoch) {
      stateChangeLogger.warn("Broker %d received stop replica request from an old controller epoch %d. Latest known controller epoch is %d"
        .format(localBrokerId, stopReplicaRequest.controllerEpoch, controllerEpoch))
      (responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
    } else {
      // The partitions listed in the StopReplica request
      val partitions = stopReplicaRequest.partitions.asScala
      controllerEpoch = stopReplicaRequest.controllerEpoch
      // First remove the fetchers for all requested partitions
      replicaFetcherManager.removeFetcherForPartitions(partitions)
      // For each partition, stop the replica from serving further requests
      for (topicPartition <- partitions){
        val errorCode = stopReplica(topicPartition.topic, topicPartition.partition, stopReplicaRequest.deletePartitions)
        responseMap.put(topicPartition, errorCode)
      }
      (responseMap, Errors.NONE.code)
    }
  }
}

 

def stopReplica(topic: String, partitionId: Int, deletePartition: Boolean): Short  = {
  val errorCode = Errors.NONE.code
  getPartition(topic, partitionId) match {
    case Some(partition) =>
      // Whether the request also asks for the partition and its log to be deleted
      if(deletePartition) {
        // Remove the partition from allPartitions
        val removedPartition = allPartitions.remove((topic, partitionId))
        if (removedPartition != null) {
          // Delete the partition's local log
          removedPartition.delete() // this will delete the local log
          val topicHasPartitions = allPartitions.keys.exists { case (t, _) => topic == t }
          if (!topicHasPartitions)
              BrokerTopicStats.removeMetrics(topic)
        }
      }
    case None =>
      // The partition is not hosted on this broker; if requested, just delete any leftover log
      if(deletePartition) {
        val topicAndPartition = TopicAndPartition(topic, partitionId)

        if(logManager.getLog(topicAndPartition).isDefined) {
            logManager.deleteLog(topicAndPartition)
        }
      }
  }
  errorCode
}

 

 

3.5 Scheduled tasks in ReplicaManager

ReplicaManager runs three scheduled tasks: highwatermark-checkpoint, isr-expiration, and isr-change-propagation.

highwatermark-checkpoint: periodically records the HW of every replica and saves it to the replication-offset-checkpoint file in the corresponding log directory

def startHighWaterMarksCheckPointThread() = {
  if(highWatermarkCheckPointThreadStarted.compareAndSet(false, true))
    scheduler.schedule("highwatermark-checkpoint", checkpointHighWatermarks, period = config.replicaHighWatermarkCheckpointIntervalMs, unit = TimeUnit.MILLISECONDS)
}
def checkpointHighWatermarks() {
  // Collect the local replica of every partition on this broker
  val replicas = allPartitions.values.flatMap(_.getReplica(config.brokerId))
  // Group the replicas by the log directory they live in
  val replicasByDir = replicas.filter(_.log.isDefined).groupBy(_.log.get.dir.getParentFile.getAbsolutePath)
  // Iterate over the log directories
  for ((dir, reps) <- replicasByDir) {
    // Collect the HW of every replica under this log directory
    val hwms = reps.map(r => new TopicAndPartition(r) -> r.highWatermark.messageOffset).toMap
    try {
      // Rewrite the replication-offset-checkpoint file of this log directory
      highWatermarkCheckpoints(dir).write(hwms)
    } catch {
      case e: IOException =>
        fatal("Error writing to highwatermark file: ", e)
        Runtime.getRuntime().halt(1)
    }
  }
}
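The replication-offset-checkpoint file written here is a small plain-text file: a format version, the number of entries, and then one "topic partition highwatermark" line per partition. An illustrative example (topic names and offsets are made up):

0                <- format version
3                <- number of entries
topicA 0 4273
topicA 1 4100
topicB 0 9125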
 
 
def startup() {
  // start ISR expiration thread
  scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
  scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges, period = 2500L, unit = TimeUnit.MILLISECONDS)
}

 

isr-expiration: periodically calls maybeShrinkIsr to check whether each partition needs to shrink its ISR

private def maybeShrinkIsr(): Unit = {
  trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
  allPartitions.values.foreach(partition => partition.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}

 

def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
    // Only the leader manages the ISR, so first check whether the local replica is the leader
    leaderReplicaIfLocal() match {
      // The local replica is the leader
      case Some(leaderReplica) =>
        // Find the out-of-sync replicas, i.e. the ones that have fallen too far behind the leader
        val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
        // If there are any out-of-sync replicas
        if(outOfSyncReplicas.nonEmpty) {
          // Remove them from the ISR
          val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
          assert(newInSyncReplicas.nonEmpty)
          info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
            inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
          // Update the ISR in ZooKeeper and in the local cache
          updateIsr(newInSyncReplicas)
          // we may need to increment high watermark since ISR could be down to 1
          // Since replicas were removed, the ISR may now be down to the leader alone, so the high watermark may need to be advanced
          replicaManager.isrShrinkRate.mark()
          maybeIncrementLeaderHW(leaderReplica)
        } else {
          false
        }

      case None => false // do nothing if no longer leader
    }
  }

  // If the HW was advanced, try to complete delayed requests that may now be satisfied
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}
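The shrink decision relies on getOutOfSyncReplicas. In the 0.10.x code there is a single criterion based on lastCaughtUpTimeMs, which covers both stuck followers (no fetch at all) and slow followers (fetching but never catching up); roughly:

// Condensed sketch of Partition.getOutOfSyncReplicas (0.10.x): a follower is out of sync if it has not
// been fully caught up with the leader's LEO for more than maxLagMs (replica.lag.time.max.ms)
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
  val candidateReplicas = inSyncReplicas - leaderReplica
  // lastCaughtUpTimeMs is the last time this replica had read fully up to the leader's log end offset
  candidateReplicas.filter(r => (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
}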

 

isr-change-propagation: periodically records the partitions whose ISR has changed to ZooKeeper

 

/*
 * This task periodically checks whether ISR changes need to be propagated. Propagation happens when:
 * 1. there are ISR changes that have not been propagated yet, and
 * 2. there has been no ISR change in the last 5 seconds, or more than 60 seconds have passed since the last propagation.
 * This allows an occasional ISR change to be propagated within a few seconds, while avoiding overwhelming the
 * controller and other brokers when a large number of ISR changes occur.
 * In practice, when replicas of a partition fall out of sync, the affected partitions are written to the
 * /isr_change_notification/isr_change_ node in ZooKeeper.
 */
def maybePropagateIsrChanges() {
  val now = System.currentTimeMillis()
  isrChangeSet synchronized {
    if (isrChangeSet.nonEmpty &&
      (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
        lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
      ReplicationUtils.propagateIsrChanges(zkUtils, isrChangeSet)
      isrChangeSet.clear()
      lastIsrPropagationMs.set(now)
    }
  }
}
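The two time thresholds used in the condition above are constants in the ReplicaManager companion object, and ReplicationUtils.propagateIsrChanges writes the accumulated set as a persistent sequential znode under /isr_change_notification, which the controller watches. Roughly (the sequence number and partition list below are illustrative):

// Constants from object ReplicaManager (0.10.x)
val IsrChangePropagationBlackOut = 5000L   // do not propagate while changes are still arriving (last 5 s)
val IsrChangePropagationInterval = 60000L  // but propagate at least once per minute if anything is pending

// Example znode created by ReplicationUtils.propagateIsrChanges:
//   /isr_change_notification/isr_change_0000000042
//   {"version":1,"partitions":[{"topic":"topicA","partition":0},{"topic":"topicB","partition":3}]}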

