Kafka Source Code Reading: How the Broker Handles Consumer Fetch and Follower Replica Sync Requests

Overview

The Kafka source code is organized into multiple modules, each responsible for different functionality. Here is an overview of some core modules and what they do:

  1. Server-side source: implements the core functionality of the Kafka Broker, including log storage, the controller, coordinators, metadata and state-machine management, the delayed-operation mechanism, consumer group management, and the high-concurrency network architecture.

  2. Java client source: implements how the Producer and Consumer interact with the Broker, plus shared supporting components.

  3. Connect source: used to build bidirectional streaming synchronization services between heterogeneous data systems.

  4. Streams source: implements real-time stream processing functionality.

  5. Raft source: implements the Raft consensus protocol.

  6. Admin module: Kafka's administration module, which operates on and manages topics and partitions, including creating and deleting topics and expanding partitions.

  7. Api module: responsible for data exchange, encoding and decoding the data exchanged between client and server.

  8. Client module: contains the classes the Producer uses to read Kafka Broker metadata, such as topics, partitions, and leaders.

  9. Cluster module: contains entity classes such as Broker, Cluster, Partition, and Replica.

  10. Common module: contains the various exception classes and error validation.

  11. Consumer module: the consumer processing module, responsible for client-side consumer data and logic.

  12. Controller module: responsible for electing the central controller, partition leader election, replica assignment and reassignment, and expanding partitions and replicas.

  13. Coordinator module: manages its share of the consumer groups and their offsets.

  14. Javaapi module: provides the Java-language Producer and Consumer API interfaces.

  15. Log module: responsible for Kafka's file storage, reading and writing all topic message data.

  16. Message module: wraps multiple records into record sets or compressed record sets.

  17. Metrics module: responsible for internal state monitoring.

  18. Network module: handles client connections; the network event module.

  19. Producer module: the producer implementation details, including synchronous and asynchronous message sending.

  20. Security module: responsible for Kafka's security validation and management.

  21. Serializer module: serializes and deserializes message content.

  22. Server module: covers leader and offset checkpoints, dynamic configuration, delayed topic creation and deletion, leader election, and Admin and Replica management.

  23. Tools module: contains assorted tools, such as exporting consumer offsets, LogSegment information, a topic's log position information, and offsets stored in ZooKeeper.

  24. Utils module: contains various utility classes, such as Json, ZkUtils, thread-pool utilities, and the KafkaScheduler shared scheduler class.

Together, these modules make up Kafka's overall architecture, enabling it to provide a high-throughput, highly available message queue service.

The Kafka source branch used in this article is 1.0.2.

The handleFetchRequest() method of the KafkaApis class is the API entry point:

  /**
   * Handle a fetch request
   */
  // Handles consumer fetch requests and follower replica sync requests
  def handleFetchRequest(request: RequestChannel.Request) {
    val fetchRequest = request.body[FetchRequest]
    val versionId = request.header.apiVersion
    val clientId = request.header.clientId

    val unauthorizedTopicResponseData = mutable.ArrayBuffer[(TopicPartition, FetchResponse.PartitionData)]()
    val nonExistingTopicResponseData = mutable.ArrayBuffer[(TopicPartition, FetchResponse.PartitionData)]()
    val authorizedRequestInfo = mutable.ArrayBuffer[(TopicPartition, FetchRequest.PartitionData)]()

    // replicaId >= 0 indicates a follower replica sync request
    if (fetchRequest.isFromFollower() && !authorize(request.session, ClusterAction, Resource.ClusterResource))    
      for (topicPartition <- fetchRequest.fetchData.asScala.keys)
        unauthorizedTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.CLUSTER_AUTHORIZATION_FAILED,
          FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET,
          FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY)
    else
      for ((topicPartition, partitionData) <- fetchRequest.fetchData.asScala) {
        if (!authorize(request.session, Read, new Resource(Topic, topicPartition.topic)))
          unauthorizedTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.TOPIC_AUTHORIZATION_FAILED,
            FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET,
            FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY)
        else if (!metadataCache.contains(topicPartition.topic))
          nonExistingTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.UNKNOWN_TOPIC_OR_PARTITION,
            FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET,
            FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY)
        else
          authorizedRequestInfo += (topicPartition -> partitionData)
      }

    def convertedPartitionData(tp: TopicPartition, data: FetchResponse.PartitionData) = {

      // Down-conversion of the fetched records is needed when the stored magic version is
      // greater than that supported by the client (as indicated by the fetch request version). If the
      // configured magic version for the topic is less than or equal to that supported by the version of the
      // fetch request, we skip the iteration through the records in order to check the magic version since we
      // know it must be supported. However, if the magic version is changed from a higher version back to a
      // lower version, this check will no longer be valid and we will fail to down-convert the messages
      // which were written in the new format prior to the version downgrade.
      replicaManager.getMagic(tp).flatMap { magic =>
        val downConvertMagic = {
          if (magic > RecordBatch.MAGIC_VALUE_V0 && versionId <= 1 && !data.records.hasCompatibleMagic(RecordBatch.MAGIC_VALUE_V0))
            Some(RecordBatch.MAGIC_VALUE_V0)
          else if (magic > RecordBatch.MAGIC_VALUE_V1 && versionId <= 3 && !data.records.hasCompatibleMagic(RecordBatch.MAGIC_VALUE_V1))
            Some(RecordBatch.MAGIC_VALUE_V1)
          else
            None
        }

        downConvertMagic.map { magic =>
          trace(s"Down converting records from partition $tp to message format version $magic for fetch request from $clientId")
          val converted = data.records.downConvert(magic, fetchRequest.fetchData.get(tp).fetchOffset, time)
          updateRecordsProcessingStats(request, tp, converted.recordsProcessingStats)
          new FetchResponse.PartitionData(data.error, data.highWatermark, FetchResponse.INVALID_LAST_STABLE_OFFSET,
            data.logStartOffset, data.abortedTransactions, converted.records)
        }

      }.getOrElse(data)
    }

    // the callback for process a fetch response, invoked before throttling
    def processResponseCallback(responsePartitionData: Seq[(TopicPartition, FetchPartitionData)]) {
      val partitionData = {
        responsePartitionData.map { case (tp, data) =>
          val abortedTransactions = data.abortedTransactions.map(_.asJava).orNull
          val lastStableOffset = data.lastStableOffset.getOrElse(FetchResponse.INVALID_LAST_STABLE_OFFSET)
          tp -> new FetchResponse.PartitionData(data.error, data.highWatermark, lastStableOffset,
            data.logStartOffset, abortedTransactions, data.records)
        }
      }

      val mergedPartitionData = partitionData ++ unauthorizedTopicResponseData ++ nonExistingTopicResponseData
      val fetchedPartitionData = new util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData]()

      mergedPartitionData.foreach { case (topicPartition, data) =>
        if (data.error != Errors.NONE)
          debug(s"Fetch request with correlation id ${request.header.correlationId} from client $clientId " +
            s"on partition $topicPartition failed due to ${data.error.exceptionName}")

        fetchedPartitionData.put(topicPartition, data)
      }

      // fetch response callback invoked after any throttling
      def fetchResponseCallback(bandwidthThrottleTimeMs: Int) {
        def createResponse(requestThrottleTimeMs: Int): FetchResponse = {
          val convertedData = new util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData]
          fetchedPartitionData.asScala.foreach { case (tp, partitionData) =>
            convertedData.put(tp, convertedPartitionData(tp, partitionData))
          }
          val response = new FetchResponse(convertedData, bandwidthThrottleTimeMs + requestThrottleTimeMs)
          response.responseData.asScala.foreach { case (topicPartition, data) =>
            // record the bytes out metrics only when the response is being sent
            brokerTopicStats.updateBytesOut(topicPartition.topic, fetchRequest.isFromFollower, data.records.sizeInBytes)
          }
          response
        }

        if (fetchRequest.isFromFollower)
          sendResponseExemptThrottle(request, createResponse(0))
        else
          sendResponseMaybeThrottle(request, requestThrottleMs => createResponse(requestThrottleMs))
      }

      // When this callback is triggered, the remote API call has completed.
      // Record time before any byte-rate throttling.
      request.apiRemoteCompleteTimeNanos = time.nanoseconds

      if (fetchRequest.isFromFollower) {
        // We've already evaluated against the quota and are good to go. Just need to record it now.
        val responseSize = sizeOfThrottledPartitions(versionId, fetchRequest, mergedPartitionData, quotas.leader)
        quotas.leader.record(responseSize)
        fetchResponseCallback(bandwidthThrottleTimeMs = 0)
      } else {
        // Fetch size used to determine throttle time is calculated before any down conversions.
        // This may be slightly different from the actual response size. But since down conversions
        // result in data being loaded into memory, it is better to do this after throttling to avoid OOM.
        val response = new FetchResponse(fetchedPartitionData, 0)
        val responseStruct = response.toStruct(versionId)
        quotas.fetch.maybeRecordAndThrottle(request.session.sanitizedUser, clientId, responseStruct.sizeOf,
          fetchResponseCallback)
      }
    }

    if (authorizedRequestInfo.isEmpty)
      processResponseCallback(Seq.empty)
    else {
      // call the replica manager to fetch messages from the local replica
      replicaManager.fetchMessages(
        fetchRequest.maxWait.toLong, // maximum wait time
        fetchRequest.replicaId, // broker id of the follower replica
        fetchRequest.minBytes, // minimum bytes the fetch must return, as set in the request
        fetchRequest.maxBytes, // maximum bytes to fetch, as set in the request
        versionId <= 2,
        authorizedRequestInfo,
        replicationQuota(fetchRequest),
        processResponseCallback, // response callback
        fetchRequest.isolationLevel) // isolation level: READ_UNCOMMITTED or READ_COMMITTED
    }
  }
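A note on the replicaId convention that the follower check relies on: ordinary consumers send replicaId = -1 and the special debugging consumer sends -2, while a follower sends its own (non-negative) broker id. A minimal sketch of the convention (names mirror kafka.api.Request in the 1.0.2 branch; treat the exact constants as illustrative):

  object FetchOrigin {
    // Sentinel replica ids used by non-follower fetchers (illustrative constants,
    // mirroring kafka.api.Request in the 1.0.2 branch)
    val OrdinaryConsumerId: Int = -1
    val DebuggingConsumerId: Int = -2

    // A follower's fetch carries its own broker id, which is always >= 0
    def isFromFollower(replicaId: Int): Boolean = replicaId >= 0
  }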

ReplicaManager.fetchMessages() goes on to call the readFromLocalLog() method:

  /**
   * Fetch messages from the leader replica, and wait until enough data can be fetched and return;
   * the callback function will be triggered either when timeout or required fetch info is satisfied
   */
  def fetchMessages(timeout: Long,
                    replicaId: Int,
                    fetchMinBytes: Int,
                    fetchMaxBytes: Int,
                    hardMaxBytesLimit: Boolean,
                    fetchInfos: Seq[(TopicPartition, PartitionData)],
                    quota: ReplicaQuota = UnboundedQuota,
                    responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit,
                    isolationLevel: IsolationLevel) {
    // replicaId >= 0 means the fetch comes from the follower replica on that broker; otherwise it is a consumer fetch
    val isFromFollower = Request.isValidBrokerId(replicaId)
    // Fetches are only served from the leader, so fetchOnlyFromLeader is true except for the special debugging consumer
    val fetchOnlyFromLeader = replicaId != Request.DebuggingConsumerId
    // fetchOnlyCommitted = true means the fetch comes from a consumer, which may only read data up to the HW; follower replica sync requests have no such restriction (false)
    val fetchOnlyCommitted = !isFromFollower

    def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
      // Read from the local log
      val result = readFromLocalLog(
        replicaId = replicaId,
        fetchOnlyFromLeader = fetchOnlyFromLeader,
        readOnlyCommitted = fetchOnlyCommitted,
        fetchMaxBytes = fetchMaxBytes,
        hardMaxBytesLimit = hardMaxBytesLimit,
        readPartitionInfo = fetchInfos,
        quota = quota,
        isolationLevel = isolationLevel)
      // If the request comes from a follower replica, update the follower's fetch state
      if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
      else result
    }

    val logReadResults = readFromLog()

    // check if this fetch request can be satisfied right away
    val logReadResultValues = logReadResults.map { case (_, v) => v }
    val bytesReadable = logReadResultValues.map(_.info.records.sizeInBytes).sum
    val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
      errorIncurred || (readResult.error != Errors.NONE))

    // respond immediately if 1) fetch request does not want to wait
    //                        2) fetch request does not require any data
    //                        3) has enough data to respond
    //                        4) some error happens while reading data
    if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
      val fetchPartitionData = logReadResults.map { case (tp, result) =>
        tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
          result.lastStableOffset, result.info.abortedTransactions)
      }
      responseCallback(fetchPartitionData)
    } else {
      // The conditions for an immediate response were not met, so the response must be delayed
      // construct the fetch results from the read results
      val fetchPartitionStatus = logReadResults.map { case (topicPartition, result) =>
        val fetchInfo = fetchInfos.collectFirst {
          case (tp, v) if tp == topicPartition => v
        }.getOrElse(sys.error(s"Partition $topicPartition not found in fetchInfos"))
        (topicPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
      }
      val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
        fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus)
      val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, isolationLevel, responseCallback)

      // create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
      val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }

      // try to complete the request immediately, otherwise put it into the purgatory;
      // this is because while the delayed fetch operation is being created, new requests
      // may arrive and hence make this operation completable.
      delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
    }
  }
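The purgatory call at the end deserves a closer look. tryCompleteElseWatch() first attempts to complete the delayed fetch immediately, because new data may have arrived while the DelayedFetch was being constructed; only if that fails is the operation registered under one watch key per partition, so that a later produce to any of those partitions can retry completion. A simplified sketch of that contract (hypothetical names, not the actual kafka.server.DelayedOperation API):

  trait DelayedOp {
    def tryComplete(): Boolean // true if the completion criteria are already satisfied
  }

  class MiniPurgatory[K] {
    private val watchers = scala.collection.mutable.Map.empty[K, List[DelayedOp]]

    // Try to finish the operation right away; otherwise watch it under every key
    // so a later append to any of those partitions can re-trigger tryComplete().
    def tryCompleteElseWatch(op: DelayedOp, keys: Seq[K]): Boolean = {
      if (op.tryComplete()) true
      else {
        keys.foreach(k => watchers(k) = op :: watchers.getOrElse(k, Nil))
        false
      }
    }

    // Called on the append path: re-check every operation watching this key.
    def checkAndComplete(key: K): Unit =
      watchers.getOrElse(key, Nil).foreach(_.tryComplete())
  }

readFromLocalLog() itself reads each requested partition from the local log: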

  /**
   * Read from multiple topic partitions at the given offset up to maxSize bytes
   */
  def readFromLocalLog(replicaId: Int,
                       fetchOnlyFromLeader: Boolean,
                       readOnlyCommitted: Boolean,
                       fetchMaxBytes: Int,
                       hardMaxBytesLimit: Boolean,
                       readPartitionInfo: Seq[(TopicPartition, PartitionData)],
                       quota: ReplicaQuota,
                       isolationLevel: IsolationLevel): Seq[(TopicPartition, LogReadResult)] = {

    def read(tp: TopicPartition, fetchInfo: PartitionData, limitBytes: Int, minOneMessage: Boolean): LogReadResult = {
      val offset = fetchInfo.fetchOffset
      val partitionFetchSize = fetchInfo.maxBytes
      val followerLogStartOffset = fetchInfo.logStartOffset

      brokerTopicStats.topicStats(tp.topic).totalFetchRequestRate.mark()
      brokerTopicStats.allTopicsStats.totalFetchRequestRate.mark()

      try {
        trace(s"Fetching log segment for partition $tp, offset $offset, partition fetch size $partitionFetchSize, " +
          s"remaining response limit $limitBytes" +
          (if (minOneMessage) s", ignoring response/partition size limits" else ""))

        // decide whether to only fetch from leader
        // (fetchOnlyFromLeader is true except for the debugging consumer)
        val localReplica = if (fetchOnlyFromLeader) {
          // Get the leader Replica object for this partition
          getLeaderReplicaIfLocal(tp)
        } else
          getReplicaOrException(tp)

        // Record the HW position; follower sync requests do not use it as a read limit
        val initialHighWatermark = localReplica.highWatermark.messageOffset
        val lastStableOffset = if (isolationLevel == IsolationLevel.READ_COMMITTED)
          Some(localReplica.lastStableOffset.messageOffset)
        else
          None

        // decide whether to only fetch committed data (i.e. messages below high watermark)
        // readOnlyCommitted = true means a consumer can only read messages before the HW; false means follower sync, which has no such limit
        val maxOffsetOpt = if (readOnlyCommitted)
          Some(lastStableOffset.getOrElse(initialHighWatermark))
        else
          None

        /* Read the LogOffsetMetadata prior to performing the read from the log.
         * We use the LogOffsetMetadata to determine if a particular replica is in-sync or not.
         * Using the log end offset after performing the read can lead to a race condition
         * where data gets appended to the log immediately after the replica has consumed from it
         * This can cause a replica to always be out of sync.
         */
        //Log end offset
        val initialLogEndOffset = localReplica.logEndOffset.messageOffset
        // Log start offset
        val initialLogStartOffset = localReplica.logStartOffset
        val fetchTimeMs = time.milliseconds
        val logReadInfo = localReplica.log match {
          case Some(log) =>
            val adjustedFetchSize = math.min(partitionFetchSize, limitBytes)

            // Try the read first, this tells us whether we need all of adjustedFetchSize for this partition
            // Read data starting from the given offset; for follower replica sync, maxOffsetOpt = None
            val fetch = log.read(offset, adjustedFetchSize, maxOffsetOpt, minOneMessage, isolationLevel)

            // If the partition is being throttled, simply return an empty set.
            if (shouldLeaderThrottle(quota, tp, replicaId))
              FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY)
            // For FetchRequest version 3, we replace incomplete message sets with an empty one as consumers can make
            // progress in such cases and don't need to report a `RecordTooLargeException`
            else if (!hardMaxBytesLimit && fetch.firstEntryIncomplete)
              FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY)
            else fetch

          case None =>
            error(s"Leader for partition $tp does not have a local log")
            FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY)
        }

        LogReadResult(info = logReadInfo,
                      highWatermark = initialHighWatermark,
                      leaderLogStartOffset = initialLogStartOffset,
                      leaderLogEndOffset = initialLogEndOffset,
                      followerLogStartOffset = followerLogStartOffset,
                      fetchTimeMs = fetchTimeMs,
                      readSize = partitionFetchSize,
                      lastStableOffset = lastStableOffset,
                      exception = None)
      } catch {
        // NOTE: Failed fetch requests metric is not incremented for known exceptions since it
        // is supposed to indicate un-expected failure of a broker in handling a fetch request
        case e@ (_: UnknownTopicOrPartitionException |
                 _: NotLeaderForPartitionException |
                 _: ReplicaNotAvailableException |
                 _: KafkaStorageException |
                 _: OffsetOutOfRangeException) =>
          LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                        highWatermark = -1L,
                        leaderLogStartOffset = -1L,
                        leaderLogEndOffset = -1L,
                        followerLogStartOffset = -1L,
                        fetchTimeMs = -1L,
                        readSize = partitionFetchSize,
                        lastStableOffset = None,
                        exception = Some(e))
        case e: Throwable =>
          brokerTopicStats.topicStats(tp.topic).failedFetchRequestRate.mark()
          brokerTopicStats.allTopicsStats.failedFetchRequestRate.mark()
          error(s"Error processing fetch operation on partition $tp, offset $offset", e)
          LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                        highWatermark = -1L,
                        leaderLogStartOffset = -1L,
                        leaderLogEndOffset = -1L,
                        followerLogStartOffset = -1L,
                        fetchTimeMs = -1L,
                        readSize = partitionFetchSize,
                        lastStableOffset = None,
                        exception = Some(e))
      }
    }

    var limitBytes = fetchMaxBytes
    // Holds the read results for each partition
    val result = new mutable.ArrayBuffer[(TopicPartition, LogReadResult)]
    var minOneMessage = !hardMaxBytesLimit
    // Iterate over the partitions to read and call read() for each
    readPartitionInfo.foreach { case (tp, fetchInfo) =>
      val readResult = read(tp, fetchInfo, limitBytes, minOneMessage)
      val recordBatchSize = readResult.info.records.sizeInBytes
      // Once we read from a non-empty partition, we stop ignoring request and partition level size limits
      if (recordBatchSize > 0)
        minOneMessage = false
      limitBytes = math.max(0, limitBytes - recordBatchSize)
      result += (tp -> readResult)
    }
    result
  }
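Note how limitBytes and minOneMessage interact across the loop: each partition's read is capped by the remaining request-level budget, and once any partition returns data, minOneMessage flips to false so the size limits are enforced strictly. A worked sketch of the byte-budget accounting with hypothetical sizes (fetchMaxBytes = 1 MB, three partitions each able to return about 700 KB):

  // Byte-budget accounting across partitions (hypothetical sizes).
  val fetchMaxBytes = 1024 * 1024                         // request-level limit: 1 MB
  val availablePerPartition = Seq(716800, 716800, 716800) // ~700 KB each

  var limitBytes = fetchMaxBytes
  availablePerPartition.foreach { available =>
    val returned = math.min(available, limitBytes)        // each read is capped by the remaining budget
    limitBytes = math.max(0, limitBytes - returned)
    println(s"returned=$returned remaining=$limitBytes")
  }
  // Output: returned=716800 remaining=331776
  //         returned=331776 remaining=0
  //         returned=0      remaining=0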

The Log.read() method:

  /**
   * Read messages from the log.
   *
   * @param startOffset The offset to begin reading at
   * @param maxLength The maximum number of bytes to read
   * @param maxOffset The offset to read up to, exclusive. (i.e. this offset NOT included in the resulting message set)
   * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxLength` (if one exists)
   * @param isolationLevel The isolation level of the fetcher. The READ_UNCOMMITTED isolation level has the traditional
   *                       read semantics (e.g. consumers are limited to fetching up to the high watermark). In
   *                       READ_COMMITTED, consumers are limited to fetching up to the last stable offset. Additionally,
   *                       in READ_COMMITTED, the transaction index is consulted after fetching to collect the list
   *                       of aborted transactions in the fetch range which the consumer uses to filter the fetched
   *                       records before they are returned to the user. Note that fetches from followers always use
   *                       READ_UNCOMMITTED.
   *
   * @throws OffsetOutOfRangeException If startOffset is beyond the log end offset or before the log start offset
   * @return The fetch data information including fetch starting offset metadata and messages read.
   */
  def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false,
           isolationLevel: IsolationLevel): FetchDataInfo = {
    maybeHandleIOException(s"Exception while reading from $topicPartition in dir ${dir.getParent}") {
      trace(s"Reading $maxLength bytes from offset $startOffset of length $size bytes")

      // Because we don't use lock for reading, the synchronization is a little bit tricky.
      // We create the local variables to avoid race conditions with updates to the log.
      val currentNextOffsetMetadata = nextOffsetMetadata
      val next = currentNextOffsetMetadata.messageOffset
      if (startOffset == next) {
        val abortedTransactions =
          if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(List.empty[AbortedTransaction])
          else None
        return FetchDataInfo(currentNextOffsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false,
          abortedTransactions = abortedTransactions)
      }

      // Find the corresponding log segment (LogSegment) in the skip list
      var segmentEntry = segments.floorEntry(startOffset)

      // return error on attempt to read beyond the log end offset or read below log start offset
      if (startOffset > next || segmentEntry == null || startOffset < logStartOffset)
        throw new OffsetOutOfRangeException("Request for offset %d but we only have log segments in the range %d to %d.".format(startOffset, logStartOffset, next))

      // Do the read on the segment with a base offset less than the target offset
      // but if that segment doesn't contain any messages with an offset greater than that
      // continue to read from successive segments until we get some messages or we reach the end of the log
      while (segmentEntry != null) {
        val segment = segmentEntry.getValue

        // If the fetch occurs on the active segment, there might be a race condition where two fetch requests occur after
        // the message is appended but before the nextOffsetMetadata is updated. In that case the second fetch may
        // cause OffsetOutOfRangeException. To solve that, we cap the reading up to exposed position instead of the log
        // end of the active segment.
        val maxPosition = {
          if (segmentEntry == segments.lastEntry) {
            val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong
            // Check the segment again in case a new segment has just rolled out.
            if (segmentEntry != segments.lastEntry)
            // New log segment has rolled out, we can read up to the file end.
              segment.size
            else
              exposedPos
          } else {
            segment.size
          }
        }
        // Read the corresponding data from the LogSegment
        val fetchInfo = segment.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage)
        if (fetchInfo == null) {
          // If this segment returned no data, move on to the next higher segment
          segmentEntry = segments.higherEntry(segmentEntry.getKey)
        } else {
          return isolationLevel match {
            case IsolationLevel.READ_UNCOMMITTED => fetchInfo
            case IsolationLevel.READ_COMMITTED => addAbortedTransactions(startOffset, segmentEntry, fetchInfo)
          }
        }
      }

      // okay we are beyond the end of the last segment with no data fetched although the start offset is in range,
      // this can happen when all messages with offset larger than start offsets have been deleted.
      // In this case, we will return the empty set with log end offset metadata
      FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY)
    }
  }
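The "skip list" mentioned in the comment is a java.util.concurrent.ConcurrentSkipListMap keyed by each segment's base offset: floorEntry(startOffset) returns the segment with the greatest base offset <= startOffset, and higherEntry() advances to the next segment when the current one yields no data. A minimal sketch, assuming segments with base offsets 0, 100, and 250:

  import java.util.concurrent.ConcurrentSkipListMap

  // Hypothetical segment map: base offset -> segment file name
  val segments = new ConcurrentSkipListMap[java.lang.Long, String]()
  segments.put(0L, "00000000000000000000.log")
  segments.put(100L, "00000000000000000100.log")
  segments.put(250L, "00000000000000000250.log")

  // floorEntry: greatest base offset <= the requested start offset
  println(segments.floorEntry(173L).getValue)  // 00000000000000000100.log
  // higherEntry: the next segment, tried when the current one returns no data
  println(segments.higherEntry(100L).getValue) // 00000000000000000250.log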

This in turn calls LogSegment.read():

  /**
   * Read a message set from this segment beginning with the first offset >= startOffset. The message set will include
   * no more than maxSize bytes and will end before maxOffset if a maxOffset is specified.
   *
   * @param startOffset A lower bound on the first offset to include in the message set we read
   * @param maxSize The maximum number of bytes to include in the message set we read
   * @param maxOffset An optional maximum offset for the message set we read
   * @param maxPosition The maximum position in the log segment that should be exposed for read
   * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxSize` (if one exists)
   *
   * @return The fetched data and the offset metadata of the first message whose offset is >= startOffset,
   *         or null if the startOffset is larger than the largest offset in this log
   */
  @threadsafe
  def read(startOffset: Long, maxOffset: Option[Long], maxSize: Int, maxPosition: Long = size,
           minOneMessage: Boolean = false): FetchDataInfo = {
    if (maxSize < 0)
      throw new IllegalArgumentException("Invalid max size for log read (%d)".format(maxSize))

    // Size of the log file
    val logSize = log.sizeInBytes // this may change, need to save a consistent copy
    // Translate the starting offset into its actual physical position in the file
    val startOffsetAndSize = translateOffset(startOffset)

    // if the start position is already off the end of the log, return null
    if (startOffsetAndSize == null)
      return null

    val startPosition = startOffsetAndSize.position
    val offsetMetadata = new LogOffsetMetadata(startOffset, this.baseOffset, startPosition)

    val adjustedMaxSize =
      if (minOneMessage) math.max(maxSize, startOffsetAndSize.size)
      else maxSize

    // return a log segment but with zero size in the case below
    if (adjustedMaxSize == 0)
      return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY)

    // calculate the length of the message set to read based on whether or not they gave us a maxOffset
    val fetchSize: Int = maxOffset match {
      // maxOffset = None: the calculation used for follower replica sync
      case None =>
        // no max offset, just read until the max position
        min((maxPosition - startPosition).toInt, adjustedMaxSize)
      // The calculation used for consumer fetches
      case Some(offset) =>
        // there is a max offset, translate it to a file position and use that to calculate the max read size;
        // when the leader of a partition changes, it's possible for the new leader's high watermark to be less than the
        // true high watermark in the previous leader for a short window. In this window, if a consumer fetches on an
        // offset between new leader's high watermark and the log end offset, we want to return an empty response.
        if (offset < startOffset)
          return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false)
        val mapping = translateOffset(offset, startPosition)
        val endPosition =
          if (mapping == null)
            logSize // the max offset is off the end of the log, use the end of the file
          else
            mapping.position
        min(min(maxPosition, endPosition) - startPosition, adjustedMaxSize).toInt
    }

    // Read the data file from the starting physical position for the computed length
    FetchDataInfo(offsetMetadata, log.read(startPosition, fetchSize),
      firstEntryIncomplete = adjustedMaxSize < startOffsetAndSize.size)
  }
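A quick worked example of the fetchSize computation above, with hypothetical positions: suppose startOffset translates to startPosition = 4000, the consumer's maxOffset (typically the HW) translates to endPosition = 9000, maxPosition = 12000, and adjustedMaxSize = 3000. The consumer path yields min(min(12000, 9000) - 4000, 3000) = 3000, so the read stops at the size limit before reaching the committed boundary; the follower path (maxOffset = None) yields min(12000 - 4000, 3000) = 3000 as well:

  // Worked example of the fetchSize computation (hypothetical positions and sizes).
  val startPosition   = 4000L  // file position of startOffset
  val endPosition     = 9000L  // file position of maxOffset (e.g. the HW)
  val maxPosition     = 12000L // exposed read limit within the segment
  val adjustedMaxSize = 3000L  // byte limit for this partition

  // Consumer path (maxOffset defined): capped by the committed boundary and the size limit
  val consumerFetchSize = math.min(math.min(maxPosition, endPosition) - startPosition, adjustedMaxSize).toInt
  // Follower path (maxOffset = None): only the exposed position and the size limit apply
  val followerFetchSize = math.min(maxPosition - startPosition, adjustedMaxSize).toInt
  println(s"consumer=$consumerFetchSize follower=$followerFetchSize") // consumer=3000 follower=3000

The offset-to-position translation used above is handled by translateOffset():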

  /**
   * Find the physical file position for the first message with offset >= the requested offset.
   *
   * The startingFilePosition argument is an optimization that can be used if we already know a valid starting position
   * in the file higher than the greatest-lower-bound from the index.
   *
   * @param offset The offset we want to translate
   * @param startingFilePosition A lower bound on the file position from which to begin the search. This is purely an optimization and
   * when omitted, the search will begin at the position in the offset index.
   * @return The position in the log storing the message with the least offset >= the requested offset and the size of the
    *        message or null if no message meets this criteria.
   */
    // Translating an offset is a two-step lookup: call the offset index file's lookup() to get the physical position closest to (at or below) the requested offset,
    // then call the data file's searchForOffsetWithSize() to scan forward from that position until the entry for the requested offset is found.
  @threadsafe
  private[log] def translateOffset(offset: Long, startingFilePosition: Int = 0): LogOffsetPosition = {
    val mapping = index.lookup(offset)
    log.searchForOffsetWithSize(offset, max(mapping.position, startingFilePosition))
  }
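The offset index consulted here is sparse (one entry roughly every index.interval.bytes of appended data), which is why the second, linear step is needed: lookup() finds the greatest indexed offset at or below the target, and searchForOffsetWithSize() then scans the log file forward from that position. A minimal in-memory sketch of the two-step translation (hypothetical data; the real index binary-searches a memory-mapped file and the real scan reads record batches from disk):

  // Two-step offset -> file-position translation (hypothetical, in-memory sketch).
  case class OffsetPosition(offset: Long, position: Int)

  // Sparse index: one entry per ~4 KB of log data, sorted by offset.
  val sparseIndex = Vector(OffsetPosition(0, 0), OffsetPosition(120, 4096), OffsetPosition(260, 8192))

  // Step 1: find the greatest indexed entry with offset <= target.
  def lookup(target: Long): OffsetPosition =
    sparseIndex.takeWhile(_.offset <= target).last

  // Step 2: scan forward from that position until the entry holding the
  // target offset is reached (stubbed here; the real scan reads the log file).
  def searchFrom(position: Int, target: Long): Int = position

  val nearest = lookup(173L)                          // OffsetPosition(120, 4096)
  val physicalPos = searchFrom(nearest.position, 173L)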

Finally, FileRecords.read() is called, which reads the data according to the position and size parameters:

    /**
     * Return a slice of records from this instance, which is a view into this set starting from the given position
     * and with the given size limit.
     *
     * If the size is beyond the end of the file, the end will be based on the size of the file at the time of the read.
     *
     * If this message set is already sliced, the position will be taken relative to that slicing.
     *
     * @param position The start position to begin the read from
     * @param size The number of bytes after the start position to include
     * @return A sliced wrapper on this message set limited based on the given position and size
     */
    public FileRecords read(int position, int size) throws IOException {
        if (position < 0)
            throw new IllegalArgumentException("Invalid position: " + position);
        if (size < 0)
            throw new IllegalArgumentException("Invalid size: " + size);

        final int end;
        // handle integer overflow
        if (this.start + position + size < 0)
            end = sizeInBytes();
        else
            end = Math.min(this.start + position + size, sizeInBytes());
        return new FileRecords(file, channel, this.start + position, end, true);
    }
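Note the overflow guard at the end: start + position + size is computed in int arithmetic, so a very large size can wrap negative, in which case the slice end is clamped to the current file size. A small sketch of that clamping logic (hypothetical values, in Scala for consistency with the snippets above):

  // Slice-end clamping as in FileRecords.read (hypothetical values).
  def sliceEnd(start: Int, position: Int, size: Int, fileSize: Int): Int = {
    val rawEnd = start + position + size
    if (rawEnd < 0) fileSize        // int overflow wrapped negative: clamp to file size
    else math.min(rawEnd, fileSize) // otherwise cap the view at the current file size
  }

  println(sliceEnd(start = 0, position = 4096, size = Int.MaxValue, fileSize = 1 << 20)) // 1048576

This completes the read path: KafkaApis.handleFetchRequest -> ReplicaManager.fetchMessages -> ReplicaManager.readFromLocalLog -> Log.read -> LogSegment.read -> FileRecords.read.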