A single broker may host replicas of many partitions. ReplicaManager manages all partition information within the scope of one broker, and it also executes the commands sent to it by the Kafka Controller, such as LeaderAndIsr and StopReplica.
1 ReplicaManager Data Structures
Suppose there are 5 broker nodes and 3 partitions, each replicated on three brokers (a leader plus two follower replicas):
broker1, broker2, broker3, broker4, broker5
broker1 currently hosts replicas of 3 partitions: p11, p12, p13
p11: (broker1, broker3, broker5)
broker1 is the leader; broker3 and broker5 are followers
p12: (broker2, broker1, broker4)
broker2 is the leader; broker1 and broker4 are followers
p13: (broker1, broker2, broker4)
broker1 is the leader; broker2 and broker4 are followers
2 Core Fields
logManager: LogManager. Handles log read and write requests, delegating the actual work to Log objects.
scheduler: Scheduler. Runs the periodic tasks in ReplicaManager; there are three in total: highwatermark-checkpoint, isr-expiration, and isr-change-propagation.
quotaManager: ReplicationQuotaManager. Replication quota management.
controllerEpoch: Int. The KafkaController epoch, incremented whenever a new Controller leader is elected. When ReplicaManager handles a request from the KafkaController, it first checks the controllerEpoch field carried in the request, so that requests from a stale Controller are rejected.
localBrokerId: Int. The id of the current broker.
allPartitions: Pool[(String, Int), Partition]. Information about every partition on this broker.
replicaFetcherManager: ReplicaFetcherManager. Manages the ReplicaFetcherThread threads, which send FetchRequests to leader replicas to pull messages, keeping followers in sync with their leaders.
highWatermarkCheckpoints: Map[String, OffsetCheckpoint]. Maps each log directory to an OffsetCheckpoint. The OffsetCheckpoint wraps the replication-offset-checkpoint file in that log directory, which records the highwatermark of every partition under the directory; the highwatermark-checkpoint task in ReplicaManager periodically rewrites this file.
isrChangeSet: mutable.Set[TopicAndPartition]. Records the partitions whose ISR has changed.
delayedProducePurgatory: DelayedOperationPurgatory[DelayedProduce]. The DelayedOperationPurgatory that manages DelayedProduce operations.
delayedFetchPurgatory: DelayedOperationPurgatory[DelayedFetch]. The DelayedOperationPurgatory that manages DelayedFetch operations.
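To make allPartitions concrete: for broker1 in the example from section 1, the pool would contain three entries, sketched below (the topic name "t" and the surrounding setup are hypothetical, for illustration only):
// Hypothetical sketch: the contents of broker1's allPartitions pool for the
// example in section 1 (the topic name "t" is made up for illustration)
val allPartitions = new Pool[(String, Int), Partition]()
allPartitions.put(("t", 1), p11) // local replica is the leader
allPartitions.put(("t", 2), p12) // local replica follows the leader on broker2
allPartitions.put(("t", 3), p13) // local replica is the leader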
3 Key Methods
3.1 Replica Role Transitions
The KafkaController sends a LeaderAndIsrRequest to the relevant brokers according to the state of each partition's leader and follower replicas; this request drives replica role transitions. A LeaderAndIsrRequest is first handled by the KafkaApis.handleLeaderAndIsrRequest method, whose core logic is implemented by ReplicaManager's becomeLeaderOrFollower method, which in turn relies on Partition's own makeLeader and makeFollower methods.
# First, a look at the LeaderAndIsrRequest and LeaderAndIsrResponse message formats, outlined below
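Roughly, based on this Kafka version's protocol definitions (layout abbreviated; take this as a sketch rather than the authoritative schema):
LeaderAndIsrRequest =>
  controller_id: int32
  controller_epoch: int32
  partition_states: [topic, partition, controller_epoch, leader, leader_epoch, isr: [int32], zk_version, replicas: [int32]]
  live_leaders: [id, host, port]
LeaderAndIsrResponse =>
  error_code: int16
  partitions: [topic, partition, error_code]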
def becomeLeaderOrFollower(correlationId: Int, leaderAndISRRequest: LeaderAndIsrRequest, metadataCache: MetadataCache,
    onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): BecomeLeaderOrFollowerResult = {
  leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
    stateChangeLogger.trace("Broker %d received LeaderAndIsr request %s correlation id %d from controller %d epoch %d for partition [%s,%d]"
      .format(localBrokerId, stateInfo, correlationId,
        leaderAndISRRequest.controllerId, leaderAndISRRequest.controllerEpoch, topicPartition.topic, topicPartition.partition))
  }
  replicaStateChangeLock synchronized {
    val responseMap = new mutable.HashMap[TopicPartition, Short]
    // If the request's controllerEpoch is lower than the broker's current controllerEpoch,
    // it comes from a stale controller: respond with STALE_CONTROLLER_EPOCH
    if (leaderAndISRRequest.controllerEpoch < controllerEpoch) {
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d since " +
          "its controller epoch %d is old. Latest known controller epoch is %d").format(localBrokerId, leaderAndISRRequest.controllerId,
          correlationId, leaderAndISRRequest.controllerEpoch, controllerEpoch))
      }
      BecomeLeaderOrFollowerResult(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
    } else {
      val controllerId = leaderAndISRRequest.controllerId
      controllerEpoch = leaderAndISRRequest.controllerEpoch
      // Check each partition's leader epoch first
      val partitionState = new mutable.HashMap[Partition, PartitionState]()
      // Iterate over the partition states carried in the LeaderAndIsr request
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        // Get or create the Partition for this topic and partition
        val partition = getOrCreatePartition(topicPartition.topic, topicPartition.partition)
        // The partition's current leader epoch
        val partitionLeaderEpoch = partition.getLeaderEpoch()
        // Only proceed if the local leader epoch is lower than the one in the request; otherwise the request is stale
        if (partitionLeaderEpoch < stateInfo.leaderEpoch) {
          // Check whether this partition is assigned to the current broker
          if (stateInfo.replicas.contains(config.brokerId))
            // Record the partition in partitionState: partition holds the current state,
            // stateInfo the new state carried in the request
            partitionState.put(partition, stateInfo)
          else {
            stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
              "epoch %d for partition [%s,%d] as itself is not in assigned replica list %s")
              .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                topicPartition.topic, topicPartition.partition, stateInfo.replicas.asScala.mkString(",")))
            responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION.code)
          }
        } else {
          // Otherwise record the error code in the response
          stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
            "epoch %d for partition [%s,%d] since its associated leader epoch %d is not higher than the current leader epoch %d")
            .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
              topicPartition.topic, topicPartition.partition, stateInfo.leaderEpoch, partitionLeaderEpoch))
          responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH.code)
        }
      }
      /** Decide which partitions become leaders and which become followers, then call makeLeaders and makeFollowers */
      // partitionState holds the partitions stored on this broker together with their new state
      // Filter out the partitions whose new leader is the current broker
      val partitionsTobeLeader = partitionState.filter { case (partition, stateInfo) =>
        stateInfo.leader == config.brokerId
      }
      // Removing the leaders from partitionState leaves the followers
      val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys
      // If partitionsTobeLeader is non-empty, call makeLeaders
      val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
        makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
      else
        Set.empty[Partition]
      // If partitionsToBeFollower is non-empty, call makeFollowers
      val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
        makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap, metadataCache)
      else
        Set.empty[Partition]
      // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
      // have been completely populated before starting the checkpointing thereby avoiding weird race conditions
      // After the first LeaderAndIsr request, start the highwatermark thread and mark it as initialized
      if (!hwThreadInitialized) {
        startHighWaterMarksCheckPointThread()
        hwThreadInitialized = true
      }
      // Have the ReplicaFetcherManager shut down idle fetcher threads
      replicaFetcherManager.shutdownIdleFetcherThreads()
      // Trigger the leadership-change callback
      onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
      BecomeLeaderOrFollowerResult(responseMap, Errors.NONE.code)
    }
  }
}
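The onLeadershipChange callback is supplied by the caller. For reference, the callback that KafkaApis.handleLeaderAndIsrRequest passes in looks roughly like the sketch below (simplified): when a partition of the internal offsets topic changes role, the GroupCoordinator loads or unloads the corresponding group metadata.
// Simplified sketch of the callback passed in by KafkaApis
def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
  // For __consumer_offsets partitions, the coordinator must load the group
  // metadata on the new leader and unload it on the new followers
  updatedLeaders.foreach { partition =>
    if (partition.topic == Topic.GroupMetadataTopicName)
      coordinator.handleGroupImmigration(partition.partitionId)
  }
  updatedFollowers.foreach { partition =>
    if (partition.topic == Topic.GroupMetadataTopicName)
      coordinator.handleGroupEmigration(partition.partitionId)
  }
}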
private def makeLeaders(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
correlationId: Int, responseMap: mutable.Map[TopicPartition, Short]): Set[Partition] = {
partitionState.foreach(state =>
stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"starting the become-leader transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId))))
for (partition <- partitionState.keys)
responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)
val partitionsToMakeLeaders: mutable.Set[Partition] = mutable.Set()
try {
// First stop the fetcher threads for these partitions
replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(p => new TopicPartition(p.topic, p.partitionId)))
// Update the partition info, making the local replica the leader
partitionState.foreach { case (partition, partitionStateInfo) =>
if (partition.makeLeader(controllerId, partitionStateInfo, correlationId))
partitionsToMakeLeaders += partition
else
stateChangeLogger.info(("Broker %d skipped the become-leader state change after marking its partition as leader with correlation id %d from " +
"controller %d epoch %d for partition %s since it is already the leader for the partition.")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(partition.topic, partition.partitionId)));
}
partitionsToMakeLeaders.foreach { partition =>
stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-leader request from controller " +
"%d epoch %d with correlation id %d for partition %s")
.format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
}
} catch {
case e: Throwable =>
partitionState.foreach { state =>
val errorMsg = ("Error on broker %d while processing LeaderAndIsr request correlationId %d received from controller %d" +
" epoch %d for partition %s").format(localBrokerId, correlationId, controllerId, epoch,
TopicAndPartition(state._1.topic, state._1.partitionId))
stateChangeLogger.error(errorMsg, e)
}
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"for the become-leader transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
partitionsToMakeLeaders
}
private def makeFollowers(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
correlationId: Int, responseMap: mutable.Map[TopicPartition, Short], metadataCache: MetadataCache) : Set[Partition] = {
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"starting the become-follower transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
for (partition <- partitionState.keys)
responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)
val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()
try {
// TODO: Delete leaders from LeaderAndIsrRequest
partitionState.foreach { case (partition, partitionStateInfo) =>
// Check whether the new leader broker is alive
val newLeaderBrokerId = partitionStateInfo.leader
metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
// Only change partition state when the leader is available
case Some(leaderBroker) =>
// Call Partition.makeFollower to switch the partition's local replica to a follower
if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
partitionsToMakeFollower += partition
else
stateChangeLogger.info(("Broker %d skipped the become-follower state change after marking its partition as follower with correlation id %d from " +
"controller %d epoch %d for partition [%s,%d] since the new leader %d is the same as the old leader")
.format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
partition.topic, partition.partitionId, newLeaderBrokerId))
case None =>
// The leader broker should always be present in the metadata cache.
// If not, we should record the error message and abort the transition process for this partition
stateChangeLogger.error(("Broker %d received LeaderAndIsrRequest with correlation id %d from controller" +
" %d epoch %d for partition [%s,%d] but cannot become follower since the new leader %d is unavailable.")
.format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
partition.topic, partition.partitionId, newLeaderBrokerId))
// Create the local replica even if the leader is unavailable. This is required to ensure that we include
// the partition's high watermark in the checkpoint file (see KAFKA-1647)
partition.getOrCreateReplica()
}
}
// Stop the fetcher threads that were syncing from the old leader
replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(p => new TopicPartition(p.topic, p.partitionId)))
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-follower request from controller " +
"%d epoch %d with correlation id %d for partition %s")
.format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
}
// The leader has changed, so messages between the HW and the LEO may differ between the old and the new leader, while messages below the HW are consistent; the log therefore has to be truncated to the HW
logManager.truncateTo(partitionsToMakeFollower.map(partition => (new TopicAndPartition(partition), partition.getOrCreateReplica().highWatermark.messageOffset)).toMap)
partitionsToMakeFollower.foreach { partition =>
val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topic, partition.partitionId)
tryCompleteDelayedProduce(topicPartitionOperationKey)
tryCompleteDelayedFetch(topicPartitionOperationKey)
}
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d truncated logs and checkpointed recovery boundaries for partition [%s,%d] as part of " +
"become-follower request with correlation id %d from controller %d epoch %d").format(localBrokerId,
partition.topic, partition.partitionId, correlationId, controllerId, epoch))
}
// Check whether the ReplicaManager is shutting down
if (isShuttingDown.get()) {
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d skipped the adding-fetcher step of the become-follower state change with correlation id %d from " +
"controller %d epoch %d for partition [%s,%d] since it is shutting down").format(localBrokerId, correlationId,
controllerId, epoch, partition.topic, partition.partitionId))
}
}
else {
// Restart the fetcher threads so the followers resume syncing from their new leaders
val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
new TopicPartition(partition.topic, partition.partitionId) -> BrokerAndInitialOffset(
metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get.getBrokerEndPoint(config.interBrokerSecurityProtocol),
partition.getReplica().get.logEndOffset.messageOffset)).toMap
replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d started fetcher to new leader as part of become-follower request from controller " +
"%d epoch %d with correlation id %d for partition [%s,%d]")
.format(localBrokerId, controllerId, epoch, correlationId, partition.topic, partition.partitionId))
}
}
} catch {
case e: Throwable =>
val errorMsg = ("Error on broker %d while processing LeaderAndIsr request with correlationId %d received from controller %d " +
"epoch %d").format(localBrokerId, correlationId, controllerId, epoch)
stateChangeLogger.error(errorMsg, e)
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"for the become-follower transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
partitionsToMakeFollower
}
3.2 Appending and Reading Messages
private def appendToLocalLog(internalTopicsAllowed: Boolean, messagesPerPartition: Map[TopicPartition, MessageSet],
requiredAcks: Short): Map[TopicPartition, LogAppendResult] = {
trace("Append [%s] to local log ".format(messagesPerPartition))
messagesPerPartition.map { case (topicPartition, messages) =>
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).totalProduceRequestRate.mark()
BrokerTopicStats.getBrokerAllTopicsStats().totalProduceRequestRate.mark()
// Appending to internal topics is rejected unless explicitly allowed
if (Topic.isInternal(topicPartition.topic) && !internalTopicsAllowed) {
(topicPartition, LogAppendResult(
LogAppendInfo.UnknownLogAppendInfo,
Some(new InvalidTopicException("Cannot append to internal topic %s".format(topicPartition.topic)))))
} else {
try {
// Look up the corresponding Partition object among all partitions on this broker
val partitionOpt = getPartition(topicPartition.topic, topicPartition.partition)
val info = partitionOpt match {
case Some(partition) =>
// Call Partition.appendMessagesToLeader to write the messages to the log
partition.appendMessagesToLeader(messages.asInstanceOf[ByteBufferMessageSet], requiredAcks)
case None => throw new UnknownTopicOrPartitionException("Partition %s doesn't exist on %d"
.format(topicPartition, localBrokerId))
}
val numAppendedMessages =
if (info.firstOffset == -1L || info.lastOffset == -1L)
0
else
info.lastOffset - info.firstOffset + 1
// update stats for successfully appended bytes and messages as bytesInRate and messageInRate
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).bytesInRate.mark(messages.sizeInBytes)
BrokerTopicStats.getBrokerAllTopicsStats.bytesInRate.mark(messages.sizeInBytes)
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).messagesInRate.mark(numAppendedMessages)
BrokerTopicStats.getBrokerAllTopicsStats.messagesInRate.mark(numAppendedMessages)
trace("%d bytes written to log %s-%d beginning at offset %d and ending at offset %d"
.format(messages.sizeInBytes, topicPartition.topic, topicPartition.partition, info.firstOffset, info.lastOffset))
(topicPartition, LogAppendResult(info))
} catch {
// omitted
}
}
}
}
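appendToLocalLog is driven by ReplicaManager.appendMessages, which applies the producer's acks setting: for acks=-1 the response is parked in the produce purgatory until the ISR has replicated the messages. A condensed sketch of that wrapper (validation and error handling omitted; treat it as an outline, not the exact source):
def appendMessages(timeout: Long, requiredAcks: Short, internalTopicsAllowed: Boolean,
                   messagesPerPartition: Map[TopicPartition, MessageSet],
                   responseCallback: Map[TopicPartition, PartitionResponse] => Unit) {
  // Write the messages to the local leader logs
  val localProduceResults = appendToLocalLog(internalTopicsAllowed, messagesPerPartition, requiredAcks)
  val produceStatus = localProduceResults.map { case (topicPartition, result) =>
    topicPartition -> ProducePartitionStatus(
      result.info.lastOffset + 1, // offset the ISR must reach before the request completes
      new PartitionResponse(result.errorCode, result.info.firstOffset, result.info.logAppendTime))
  }
  if (requiredAcks == -1 && messagesPerPartition.nonEmpty) {
    // acks=-1: wait (up to `timeout` ms) until the ISR has replicated the messages,
    // by parking a DelayedProduce in the purgatory under the partitions' keys
    val delayedProduce = new DelayedProduce(timeout, ProduceMetadata(requiredAcks, produceStatus), this, responseCallback)
    val produceRequestKeys = messagesPerPartition.keys.map(new TopicPartitionOperationKey(_)).toSeq
    delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, produceRequestKeys)
  } else {
    // acks=0/1: respond immediately after the local write
    responseCallback(produceStatus.mapValues(_.responseStatus))
  }
}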
fetchMessages reads messages from the leader replica and waits until enough data has accumulated, until the timeout expires, or until the request can otherwise be answered:
def fetchMessages(timeout: Long, replicaId: Int, fetchMinBytes: Int, fetchMaxBytes: Int, hardMaxBytesLimit: Boolean,
fetchInfos: Seq[(TopicAndPartition, PartitionFetchInfo)], quota: ReplicaQuota = UnboundedQuota,
responseCallback: Seq[(TopicAndPartition, FetchResponsePartitionData)] => Unit) {
val isFromFollower = replicaId >= 0
val fetchOnlyFromLeader: Boolean = replicaId != Request.DebuggingConsumerId
val fetchOnlyCommitted: Boolean = ! Request.isValidBrokerId(replicaId)
// Read from the local log
val logReadResults = readFromLocalLog(
replicaId = replicaId,
fetchOnlyFromLeader = fetchOnlyFromLeader,
readOnlyCommitted = fetchOnlyCommitted,
fetchMaxBytes = fetchMaxBytes,
hardMaxBytesLimit = hardMaxBytesLimit,
readPartitionInfo = fetchInfos,
quota = quota)
// If the fetch request comes from a follower, update the follower's state (e.g. its LEO)
if (Request.isValidBrokerId(replicaId))
/*
 * Main steps:
 * 1 the leader maintains per-follower state; update this follower's state, e.g. its LEO
 * 2 check whether the ISR needs to be expanded; if the ISR changes, record the change in ZooKeeper
 * 3 check whether the HighWatermark can be advanced
 * 4 check the DelayedProduce operations under the affected keys in delayedProducePurgatory and complete those that are now satisfied
 */
updateFollowerLogReadResults(replicaId, logReadResults)
// Collect the read result values
val logReadResultValues = logReadResults.map { case (_, v) => v }
// Total number of bytes read
val bytesReadable = logReadResultValues.map(_.info.messageSet.sizeInBytes).sum
// Check whether any of the reads hit an error
val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
errorIncurred || (readResult.errorCode != Errors.NONE.code))
/*
 * Decide whether the FetchResponse can be returned immediately:
 * 1 the caller does not want to wait (timeout <= 0)
 * 2 the FetchRequest does not specify any partitions to read
 * 3 enough data has already been accumulated
 * 4 an error occurred while reading the data (errorReadingData)
 */
if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
val fetchPartitionData = logReadResults.map { case (tp, result) =>
tp -> FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet)
}
// Invoke the response callback directly
responseCallback(fetchPartitionData)
} else {
// Otherwise build the per-partition fetch status
val fetchPartitionStatus = logReadResults.map { case (topicAndPartition, result) =>
val fetchInfo = fetchInfos.collectFirst {
case (tp, v) if tp == topicAndPartition => v
}.getOrElse(sys.error(s"Partition $topicAndPartition not found in fetchInfos"))
(topicAndPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
}
// Build the FetchMetadata object
val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus)
// Build a DelayedFetch object
val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, responseCallback)
// Create a list of (topic, partition) pairs to serve as keys for the delayed fetch operation
val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }
// Try to complete the request immediately; otherwise put it into the purgatory and watch it
delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
}
}
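For context, this is roughly how KafkaApis.handleFetchRequest invokes fetchMessages (a simplified sketch; authorization and version handling omitted, and the FetchRequest field names are from this version's Scala request class):
// Simplified sketch of the call site in KafkaApis.handleFetchRequest.
// sendResponseCallback is invoked either immediately or later, when the
// DelayedFetch parked in the purgatory completes or times out.
replicaManager.fetchMessages(
  timeout = fetchRequest.maxWait.toLong,
  replicaId = fetchRequest.replicaId, // a valid broker id only for follower fetches
  fetchMinBytes = fetchRequest.minBytes,
  fetchMaxBytes = fetchRequest.maxBytes,
  hardMaxBytesLimit = fetchRequest.versionId <= 2,
  fetchInfos = fetchRequest.requestInfo,
  quota = replicationQuota(fetchRequest),
  responseCallback = sendResponseCallback)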
Once every follower replica in the ISR has synchronized a message, Kafka considers the message committed and the HW can be advanced. That is why a Fetch request coming from a follower gets one extra processing step, updateFollowerLogReadResults:
# update the per-follower state that the leader replica maintains
# as the follower keeps fetching it eventually catches up with the leader, at which point the ISR may be expanded; if the ISR changes, record the new ISR in ZooKeeper
# check whether the highwatermark can be advanced
# check whether the DelayedOperations under the affected keys in delayedProducePurgatory are now satisfied, and complete them if so
private def updateFollowerLogReadResults(replicaId: Int, readResults: Seq[(TopicAndPartition, LogReadResult)]) {
debug("Recording follower broker %d log read results: %s ".format(replicaId, readResults))
// Iterate over the log read results
readResults.foreach { case (topicAndPartition, readResult) =>
getPartition(topicAndPartition.topic, topicAndPartition.partition) match {
case Some(partition) =>
// Partition#updateReplicaLogReadResult updates the follower replica's state and may try to expand the ISR
partition.updateReplicaLogReadResult(replicaId, readResult)
// Try to complete any pending DelayedProduce for this partition
tryCompleteDelayedProduce(new TopicPartitionOperationKey(topicAndPartition))
case None =>
warn("While recording the replica LEO, the partition %s hasn't been created.".format(topicAndPartition))
}
}
}
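The ISR expansion mentioned above happens inside Partition. A condensed sketch of Partition.maybeExpandIsr (simplified from this code path):
def maybeExpandIsr(replicaId: Int) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
    // Only the leader manages the ISR
    leaderReplicaIfLocal() match {
      case Some(leaderReplica) =>
        val replica = getReplica(replicaId).get
        val leaderHW = leaderReplica.highWatermark
        // A follower that has caught up to the leader's HW rejoins the ISR
        if (!inSyncReplicas.contains(replica) &&
            assignedReplicas.map(_.brokerId).contains(replicaId) &&
            replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
          val newInSyncReplicas = inSyncReplicas + replica
          // Record the expanded ISR in ZooKeeper and the local cache
          updateIsr(newInSyncReplicas)
          replicaManager.isrExpandRate.mark()
        }
        // The updated follower state may allow the HW to advance
        maybeIncrementLeaderHW(leaderReplica)
      case None => false // nothing to do if this broker is no longer the leader
    }
  }
  // Some delayed operations may now be completable
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}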
3.3 Message Synchronization
Synchronization between follower and leader replicas is implemented by the ReplicaFetcherManager component, which extends AbstractFetcherManager.
AbstractFetcherThread exposes the methods addPartitions and removePartitions to add entries to and remove entries from its partitionMap field; addPartitions also wakes up the fetcher thread so that syncing starts.
def addPartitions(partitionAndOffsets: Map[TopicPartition, Long]) {
partitionMapLock.lockInterruptibly()
try {
// Keep only the partitions that are not already being tracked
val newPartitionToState = partitionAndOffsets.filter { case (tp, _) =>
!partitionStates.contains(tp)
}.map { case (tp, offset) =>
val fetchState =
if (PartitionTopicInfo.isOffsetInvalid(offset)) new PartitionFetchState(handleOffsetOutOfRange(tp))
else new PartitionFetchState(offset)
tp -> fetchState
}
val existingPartitionToState = partitionStates.partitionStates.asScala.map { state =>
state.topicPartition -> state.value
}.toMap
partitionStates.set((existingPartitionToState ++ newPartitionToState).asJava)
partitionMapCond.signalAll() // wake up the fetcher thread so it starts syncing
} finally partitionMapLock.unlock()
}
def removePartitions(topicPartitions: Set[TopicPartition]) {
partitionMapLock.lockInterruptibly()
try {
topicPartitions.foreach { topicPartition =>
partitionStates.remove(topicPartition)
fetcherLagStats.unregister(topicPartition.topic, topicPartition.partition)
}
} finally partitionMapLock.unlock()
}
override def doWork() {
val fetchRequest = inLock(partitionMapLock) {
// Build the FetchRequest
val fetchRequest = buildFetchRequest(partitionStates.partitionStates.asScala.map { state =>
state.topicPartition -> state.value
})
if (fetchRequest.isEmpty) {
trace("There are no active partitions. Back off for %d ms before sending a fetch request".format(fetchBackOffMs))
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
fetchRequest
}
if (!fetchRequest.isEmpty)
// Send the FetchRequest and process the response
processFetchRequest(fetchRequest)
}
Processing the fetch request:
private def processFetchRequest(fetchRequest: REQ) {
val partitionsWithError = mutable.Set[TopicPartition]()
def updatePartitionsWithError(partition: TopicPartition): Unit = {
partitionsWithError += partition
partitionStates.moveToEnd(partition)
}
var responseData: Seq[(TopicPartition, PD)] = Seq.empty
try {
trace("Issuing to broker %d of fetch request %s".format(sourceBroker.id, fetchRequest))
// Send the FetchRequest and wait for the FetchResponse
responseData = fetch(fetchRequest)
} catch {
case t: Throwable =>
if (isRunning.get) {
warn(s"Error in fetch $fetchRequest", t)
inLock(partitionMapLock) {
partitionStates.partitionSet.asScala.foreach(updatePartitionsWithError)
// there is an error occurred while fetching partitions, sleep a while
// note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
// partition with error effectively doubling the delay. It would be good to improve this.
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
}
}
fetcherStats.requestRate.mark()
if (responseData.nonEmpty) { // process the fetch response
inLock(partitionMapLock) {
// Iterate over the response data for each partition
responseData.foreach { case (topicPartition, partitionData) =>
val topic = topicPartition.topic
val partitionId = topicPartition.partition
Option(partitionStates.stateValue(topicPartition)).foreach(currentPartitionFetchState =>
// Only act on the response if the requested offset still matches the current fetch state; the offset may have changed while the request was in flight
if (fetchRequest.offset(topicPartition) == currentPartitionFetchState.offset) {
Errors.forCode(partitionData.errorCode) match {
case Errors.NONE =>
try {
// The returned message set
val messages = partitionData.toByteBufferMessageSet
// The offset right after the last returned message, i.e. the next offset to fetch
val newOffset = messages.shallowIterator.toSeq.lastOption.map(_.nextOffset).getOrElse(
currentPartitionFetchState.offset)
fetcherLagStats.getAndMaybePut(topic, partitionId).lag = Math.max(0L, partitionData.highWatermark - newOffset)
// Append the messages fetched from the leader to the local log
processPartitionData(topicPartition, currentPartitionFetchState.offset, partitionData)
val validBytes = messages.validBytes // number of valid bytes in the response
if (validBytes > 0) {
// On success, advance the partition fetch state
partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
fetcherStats.byteRate.mark(validBytes)
}
} catch {
case ime: CorruptRecordException =>
// we log the error and continue. This ensures two things
// 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread down and cause other topic partition to also lag
// 2. If the message is corrupt due to a transient state in the log (truncation, partial writes can cause this), we simply continue and
// should get fixed in the subsequent fetches
logger.error("Found invalid messages during fetch for partition [" + topic + "," + partitionId + "] offset " + currentPartitionFetchState.offset + " error " + ime.getMessage)
updatePartitionsWithError(topicPartition);
case e: Throwable =>
throw new KafkaException("error processing data for partition [%s,%d] offset %d"
.format(topic, partitionId, currentPartitionFetchState.offset), e)
}
case Errors.OFFSET_OUT_OF_RANGE => // the follower requested an offset outside the leader's log range
try {
val newOffset = handleOffsetOutOfRange(topicPartition)
partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
error("Current offset %d for partition [%s,%d] out of range; reset offset to %d"
.format(currentPartitionFetchState.offset, topic, partitionId, newOffset))
} catch {
case e: Throwable =>
error("Error getting offset for partition [%s,%d] to broker %d".format(topic, partitionId, sourceBroker.id), e)
updatePartitionsWithError(topicPartition)
}
case _ =>
if (isRunning.get) {
error("Error for partition [%s,%d] to broker %d:%s".format(topic, partitionId, sourceBroker.id,
partitionData.exception.get))
updatePartitionsWithError(topicPartition)
}
}
})
}
}
}
if (partitionsWithError.nonEmpty) {
debug("handling partitions with error for %s".format(partitionsWithError))
handlePartitionsWithErrors(partitionsWithError)
}
}
protected def fetch(fetchRequest: FetchRequest): Seq[(TopicPartition, PartitionData)] = {
// Send the fetch request
val clientResponse = sendRequest(ApiKeys.FETCH, Some(fetchRequestVersion), fetchRequest.underlying)
// Parse and return the FetchResponse
new FetchResponse(clientResponse.responseBody).responseData.asScala.toSeq.map { case (key, value) =>
key -> new PartitionData(value)
}
}
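When the leader answers with OFFSET_OUT_OF_RANGE, the handleOffsetOutOfRange hook decides where to resume fetching. In ReplicaFetcherThread it works roughly as sketched below (condensed; the unclean-leader-election checks are simplified): either the follower's log is ahead of the leader's and must be truncated back, or it has fallen below the leader's start offset and must start over from there.
// Condensed sketch of ReplicaFetcherThread.handleOffsetOutOfRange
def handleOffsetOutOfRange(topicPartition: TopicPartition): Long = {
  val replica = replicaMgr.getReplica(topicPartition.topic, topicPartition.partition).get
  val tp = TopicAndPartition(topicPartition.topic, topicPartition.partition)
  val leaderEndOffset = earliestOrLatestOffset(topicPartition, ListOffsetRequest.LATEST_TIMESTAMP, brokerConfig.brokerId)
  if (leaderEndOffset < replica.logEndOffset.messageOffset) {
    // The leader's log is shorter than ours (possible after an unclean leader
    // election): truncate our log to the leader's LEO and resume from there
    replicaMgr.logManager.truncateTo(Map(tp -> leaderEndOffset))
    leaderEndOffset
  } else {
    // Our fetch offset fell below the leader's start offset (e.g. this broker
    // was down long enough for retention to delete those segments on the
    // leader): wipe the local log and restart from the leader's earliest offset
    val leaderStartOffset = earliestOrLatestOffset(topicPartition, ListOffsetRequest.EARLIEST_TIMESTAMP, brokerConfig.brokerId)
    val offsetToFetch = Math.max(leaderStartOffset, replica.logEndOffset.messageOffset)
    if (leaderStartOffset > replica.logEndOffset.messageOffset)
      replicaMgr.logManager.truncateFullyAndStartAt(tp, leaderStartOffset)
    offsetToFetch
  }
}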
3.4 Shutting Down Replicas
When the KafkaController sends a StopReplicaRequest, the broker shuts down the specified replicas and, depending on a field in the request, may also delete the replicas' logs. This request is used both during partition replica reassignment and during broker shutdown; it does not always imply deleting the old replica and its log (it does not when a broker is simply being shut down, for example).
First, a look at the StopReplicaRequest and StopReplicaResponse message formats, outlined below.
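Roughly, based on this version's protocol definitions (layout abbreviated; a sketch rather than the authoritative schema):
StopReplicaRequest =>
  controller_id: int32
  controller_epoch: int32
  delete_partitions: boolean
  partitions: [topic, partition]
StopReplicaResponse =>
  error_code: int16
  partitions: [topic, partition, error_code]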
The request is first handled by the KafkaApis#handleStopReplicaRequest method:
def handleStopReplicaRequest(request: RequestChannel.Request) {
// Cast the request body to StopReplicaRequest
val stopReplicaRequest = request.body.asInstanceOf[StopReplicaRequest]
// Build the response header
val responseHeader = new ResponseHeader(request.header.correlationId)
val response =
if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
// Delegate to replicaManager#stopReplicas
val (result, error) = replicaManager.stopReplicas(stopReplicaRequest)
// Walk the results, handling the internal offsets topic specially
result.foreach { case (topicPartition, errorCode) =>
if (errorCode == Errors.NONE.code && stopReplicaRequest.deletePartitions() && topicPartition.topic == Topic.GroupMetadataTopicName) {
coordinator.handleGroupEmigration(topicPartition.partition)
}
}
// Build the response
new StopReplicaResponse(error, result.asInstanceOf[Map[TopicPartition, JShort]].asJava)
} else {
val result = stopReplicaRequest.partitions.asScala.map((_, new JShort(Errors.CLUSTER_AUTHORIZATION_FAILED.code))).toMap
new StopReplicaResponse(Errors.CLUSTER_AUTHORIZATION_FAILED.code, result.asJava)
}
requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, response)))
// Shut down idle replica fetcher threads
replicaManager.replicaFetcherManager.shutdownIdleFetcherThreads()
}
Next, ReplicaManager's stopReplicas method:
def stopReplicas(stopReplicaRequest: StopReplicaRequest): (mutable.Map[TopicPartition, Short], Short) = {
replicaStateChangeLock synchronized {
// Map from partition to error code
val responseMap = new collection.mutable.HashMap[TopicPartition, Short]
// If the request's controllerEpoch is lower than the broker's current controllerEpoch, the request comes from a stale controller: respond with STALE_CONTROLLER_EPOCH
if (stopReplicaRequest.controllerEpoch() < controllerEpoch) {
stateChangeLogger.warn("Broker %d received stop replica request from an old controller epoch %d. Latest known controller epoch is %d"
.format(localBrokerId, stopReplicaRequest.controllerEpoch, controllerEpoch))
(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
} else {
// The partitions named in the StopReplica request
val partitions = stopReplicaRequest.partitions.asScala
controllerEpoch = stopReplicaRequest.controllerEpoch
// First stop the fetcher threads for all requested partitions
replicaFetcherManager.removeFetcherForPartitions(partitions)
// Then iterate over the partitions and stop each replica from serving
for (topicPartition <- partitions) {
val errorCode = stopReplica(topicPartition.topic, topicPartition.partition, stopReplicaRequest.deletePartitions)
responseMap.put(topicPartition, errorCode)
}
(responseMap, Errors.NONE.code)
}
}
}
def stopReplica(topic: String, partitionId: Int, deletePartition: Boolean): Short = {
val errorCode = Errors.NONE.code
getPartition(topic, partitionId) match {
case Some(partition) =>
// Check whether the request also asks to delete the partition and its log
if (deletePartition) {
// Remove the partition from allPartitions
val removedPartition = allPartitions.remove((topic, partitionId))
if (removedPartition != null) {
removedPartition.delete() // this will delete the local log
val topicHasPartitions = allPartitions.keys.exists { case (t, _) => topic == t }
if (!topicHasPartitions)
BrokerTopicStats.removeMetrics(topic)
}
}
case None =>
// The partition does not exist on this broker; if requested, delete its log directly
if (deletePartition) {
val topicAndPartition = TopicAndPartition(topic, partitionId)
if (logManager.getLog(topicAndPartition).isDefined) {
logManager.deleteLog(topicAndPartition)
}
}
}
errorCode
}
3.5 Scheduled Tasks in ReplicaManager
ReplicaManager has three scheduled tasks in total: highwatermark-checkpoint, isr-expiration, and isr-change-propagation.
highwatermark-checkpoint: periodically records each replica's HW to the replication-offset-checkpoint file in the replica's log directory
def startHighWaterMarksCheckPointThread() = {
if(highWatermarkCheckPointThreadStarted.compareAndSet(false, true))
scheduler.schedule("highwatermark-checkpoint", checkpointHighWatermarks, period = config.replicaHighWatermarkCheckpointIntervalMs, unit = TimeUnit.MILLISECONDS)
}
def checkpointHighWatermarks() {
// Collect the local replica of every partition on this broker
val replicas = allPartitions.values.flatMap(_.getReplica(config.brokerId))
// Group the replicas by the log directory that holds them
val replicasByDir = replicas.filter(_.log.isDefined).groupBy(_.log.get.dir.getParentFile.getAbsolutePath)
// Iterate over the log directories
for ((dir, reps) <- replicasByDir) {
// Collect the HW of every replica under this log directory
val hwms = reps.map(r => new TopicAndPartition(r) -> r.highWatermark.messageOffset).toMap
try {
// Rewrite the replication-offset-checkpoint file in this log directory
highWatermarkCheckpoints(dir).write(hwms)
} catch {
case e: IOException =>
fatal("Error writing to highwatermark file: ", e)
Runtime.getRuntime().halt(1)
}
}
}
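The checkpoint itself is a small plain-text file: a format version line, an entry count, then one "topic partition highwatermark" line per partition. For the example layout from section 1 it could look roughly like this (topic name and offsets are made up):
0
3
t 1 4000
t 2 3500
t 3 4200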
def startup() {
// start ISR expiration thread
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges, period = 2500L, unit = TimeUnit.MILLISECONDS)
}
isr-expiration: periodically calls maybeShrinkIsr to check whether each partition's ISR needs to shrink
private def maybeShrinkIsr(): Unit = {
trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
allPartitions.values.foreach(partition => partition.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
// First check whether the current broker is the leader of this partition; only the leader manages the ISR
leaderReplicaIfLocal() match {
// This broker is the leader
case Some(leaderReplica) =>
// Find the out-of-sync replicas, i.e. those that have fallen too far behind the leader
val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
// If any out-of-sync replicas exist
if (outOfSyncReplicas.nonEmpty) {
// Remove the out-of-sync replicas from the ISR
val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
assert(newInSyncReplicas.nonEmpty)
info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
// Update the ISR in ZooKeeper and in the local cache
updateIsr(newInSyncReplicas)
// we may need to increment high watermark since ISR could be down to 1
// with a replica removed from the ISR, the remaining replicas may all be caught up, so the high watermark may now advance
replicaManager.isrShrinkRate.mark()
maybeIncrementLeaderHW(leaderReplica)
} else {
false
}
case None => false // do nothing if no longer leader
}
}
// If the HW advanced, try to complete pending delayed operations
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
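Which replicas count as out of sync is decided by Partition.getOutOfSyncReplicas; a condensed sketch:
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
  // The leader itself is never considered out of sync
  val candidateReplicas = inSyncReplicas - leaderReplica
  // A follower is lagging if it has not fully caught up to the leader's LEO
  // for more than maxLagMs (replica.lag.time.max.ms); this single condition
  // covers both stuck followers and slow followers
  val laggingReplicas = candidateReplicas.filter(r =>
    (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
  laggingReplicas
}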
isr-change-propagation: periodically records the partitions whose ISR has changed to ZooKeeper
/*
 * This task periodically checks whether ISR changes need to be propagated. Propagation happens when:
 * 1 the ISR has changed and the change has not been propagated yet, and
 * 2 no ISR change has been seen for 5 seconds, or more than 60 seconds have passed since the last propagation
 * This lets an occasional ISR change be propagated within a few seconds, while protecting the controller and the
 * other brokers from a flood of ISR change notifications.
 * The function writes the partitions whose ISR changed to the /isr_change_notification/isr_change_ sequential
 * nodes in ZooKeeper.
 */
def maybePropagateIsrChanges() {
val now = System.currentTimeMillis()
isrChangeSet synchronized {
if (isrChangeSet.nonEmpty &&
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
ReplicationUtils.propagateIsrChanges(zkUtils, isrChangeSet)
isrChangeSet.clear()
lastIsrPropagationMs.set(now)
}
}
}
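ReplicationUtils.propagateIsrChanges writes the accumulated change set as a persistent sequential znode under /isr_change_notification; the payload is a small JSON document listing the affected partitions, roughly like this (topic name and sequence number are made up):
// znode: /isr_change_notification/isr_change_0000000042
{"version":1,"partitions":[{"topic":"t","partition":1},{"topic":"t","partition":2}]}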