Kafka 3.0 Source Notes (7): How the Kafka Server Handles Client Produce Requests

1. Introduction

In Kafka 3.0 Source Notes (6): Kafka Producer Source Code Analysis I analyzed the main actions the client-side Producer performs to produce messages; this article focuses on how the Kafka server handles the resulting Produce requests. This part is actually fairly straightforward: Kafka 3.0 Source Notes (4): How the Kafka Server Handles Client Fetch Requests already covered the server's file storage layout and the main path for reading message data, and the write path does not differ much from it.

2. Source Code Analysis


  1. After a client request reaches the Kafka server, it first goes through protocol parsing in the underlying network layer and is then dispatched to KafkaApis.scala#handle() for business-level routing. For a Produce request the handler is KafkaApis.scala#handleProduceRequest(), whose core logic is as follows:

    1. First, every piece of message data carried by the Produce request is validated by ProduceRequest.java#validateRecords(). Since the record format differs between Kafka versions, this is mostly a version-compatibility check on the record batch structure and contains little other logic (a small sketch of building a batch in the current record format follows the code below)
    2. After validation, the core step is calling ReplicaManager.scala#appendRecords() to process the message data. Note that KafkaApis.scala#handleProduceRequest()#sendResponseCallback() is passed in here as the callback to run once the request has been processed
    def handleProduceRequest(request: RequestChannel.Request, requestLocal: RequestLocal): Unit = {
     val produceRequest = request.body[ProduceRequest]
     val requestSize = request.sizeInBytes
    
     if (RequestUtils.hasTransactionalRecords(produceRequest)) {
       val isAuthorizedTransactional = produceRequest.transactionalId != null &&
         authHelper.authorize(request.context, WRITE, TRANSACTIONAL_ID, produceRequest.transactionalId)
       if (!isAuthorizedTransactional) {
         requestHelper.sendErrorResponseMaybeThrottle(request, Errors.TRANSACTIONAL_ID_AUTHORIZATION_FAILED.exception)
         return
       }
     }
    
     val unauthorizedTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
     val nonExistingTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
     val invalidRequestResponses = mutable.Map[TopicPartition, PartitionResponse]()
     val authorizedRequestInfo = mutable.Map[TopicPartition, MemoryRecords]()
     // cache the result to avoid redundant authorization calls
     val authorizedTopics = authHelper.filterByAuthorized(request.context, WRITE, TOPIC,
       produceRequest.data().topicData().asScala)(_.name())
    
     produceRequest.data.topicData.forEach(topic => topic.partitionData.forEach { partition =>
       val topicPartition = new TopicPartition(topic.name, partition.index)
       // This caller assumes the type is MemoryRecords and that is true on current serialization
       // We cast the type to avoid causing big change to code base.
       // https://issues.apache.org/jira/browse/KAFKA-10698
       val memoryRecords = partition.records.asInstanceOf[MemoryRecords]
       if (!authorizedTopics.contains(topicPartition.topic))
         unauthorizedTopicResponses += topicPartition -> new PartitionResponse(Errors.TOPIC_AUTHORIZATION_FAILED)
       else if (!metadataCache.contains(topicPartition))
         nonExistingTopicResponses += topicPartition -> new PartitionResponse(Errors.UNKNOWN_TOPIC_OR_PARTITION)
       else
         try {
           ProduceRequest.validateRecords(request.header.apiVersion, memoryRecords)
           authorizedRequestInfo += (topicPartition -> memoryRecords)
         } catch {
           case e: ApiException =>
             invalidRequestResponses += topicPartition -> new PartitionResponse(Errors.forException(e))
         }
     })
    
     // the callback for sending a produce response
     // The construction of ProduceResponse is able to accept auto-generated protocol data so
     // KafkaApis#handleProduceRequest should apply auto-generated protocol to avoid extra conversion.
     // https://issues.apache.org/jira/browse/KAFKA-10730
     @nowarn("cat=deprecation")
     def sendResponseCallback(responseStatus: Map[TopicPartition, PartitionResponse]): Unit = {
       val mergedResponseStatus = responseStatus ++ unauthorizedTopicResponses ++ nonExistingTopicResponses ++ invalidRequestResponses
       var errorInResponse = false
    
       mergedResponseStatus.forKeyValue { (topicPartition, status) =>
         if (status.error != Errors.NONE) {
           errorInResponse = true
           debug("Produce request with correlation id %d from client %s on partition %s failed due to %s".format(
             request.header.correlationId,
             request.header.clientId,
             topicPartition,
             status.error.exceptionName))
         }
       }
    
       // Record both bandwidth and request quota-specific values and throttle by muting the channel if any of the quotas
       // have been violated. If both quotas have been violated, use the max throttle time between the two quotas. Note
       // that the request quota is not enforced if acks == 0.
       val timeMs = time.milliseconds()
       val bandwidthThrottleTimeMs = quotas.produce.maybeRecordAndGetThrottleTimeMs(request, requestSize, timeMs)
       val requestThrottleTimeMs =
         if (produceRequest.acks == 0) 0
         else quotas.request.maybeRecordAndGetThrottleTimeMs(request, timeMs)
       val maxThrottleTimeMs = Math.max(bandwidthThrottleTimeMs, requestThrottleTimeMs)
       if (maxThrottleTimeMs > 0) {
         request.apiThrottleTimeMs = maxThrottleTimeMs
         if (bandwidthThrottleTimeMs > requestThrottleTimeMs) {
           requestHelper.throttle(quotas.produce, request, bandwidthThrottleTimeMs)
         } else {
           requestHelper.throttle(quotas.request, request, requestThrottleTimeMs)
         }
       }
    
       // Send the response immediately. In case of throttling, the channel has already been muted.
       if (produceRequest.acks == 0) {
         // no operation needed if producer request.required.acks = 0; however, if there is any error in handling
         // the request, since no response is expected by the producer, the server will close socket server so that
         // the producer client will know that some error has happened and will refresh its metadata
         if (errorInResponse) {
           val exceptionsSummary = mergedResponseStatus.map { case (topicPartition, status) =>
             topicPartition -> status.error.exceptionName
           }.mkString(", ")
           info(
             s"Closing connection due to error during produce request with correlation id ${request.header.correlationId} " +
               s"from client id ${request.header.clientId} with ack=0\n" +
               s"Topic and partition to exceptions: $exceptionsSummary"
           )
           requestChannel.closeConnection(request, new ProduceResponse(mergedResponseStatus.asJava).errorCounts)
         } else {
           // Note that although request throttling is exempt for acks == 0, the channel may be throttled due to
           // bandwidth quota violation.
           requestHelper.sendNoOpResponseExemptThrottle(request)
         }
       } else {
         requestChannel.sendResponse(request, new ProduceResponse(mergedResponseStatus.asJava, maxThrottleTimeMs), None)
       }
     }
    
     def processingStatsCallback(processingStats: FetchResponseStats): Unit = {
       processingStats.forKeyValue { (tp, info) =>
         updateRecordConversionStats(request, tp, info)
       }
     }
    
     if (authorizedRequestInfo.isEmpty)
       sendResponseCallback(Map.empty)
     else {
       val internalTopicsAllowed = request.header.clientId == AdminUtils.AdminClientId
    
       // call the replica manager to append messages to the replicas
       replicaManager.appendRecords(
         timeout = produceRequest.timeout.toLong,
         requiredAcks = produceRequest.acks,
         internalTopicsAllowed = internalTopicsAllowed,
         origin = AppendOrigin.Client,
         entriesPerPartition = authorizedRequestInfo,
         requestLocal = requestLocal,
         responseCallback = sendResponseCallback,
         recordConversionStatsCallback = processingStatsCallback)
    
       // if the request is put into the purgatory, it will have a held reference and hence cannot be garbage collected;
       // hence we clear its data here in order to let GC reclaim its memory since it is already appended to log
       produceRequest.clearPartitionRecords()
     }
    }
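
     To make the record-format check in ProduceRequest.java#validateRecords() a bit more concrete, below is a minimal client-side sketch (not broker code) that builds a MemoryRecords batch in the current v2 ("magic 2") record format; the partition.records field that the handler casts above is exactly such an object. The key/value bytes are made up for illustration.

     import org.apache.kafka.common.record.{CompressionType, MemoryRecords, RecordBatch, SimpleRecord}

     // Build a small in-memory batch in the current (v2 / "magic 2") record format.
     // validateRecords() essentially rejects batches whose format is incompatible with
     // the negotiated Produce API version.
     val records: MemoryRecords = MemoryRecords.withRecords(
       RecordBatch.MAGIC_VALUE_V2,   // record format ("magic") version
       CompressionType.NONE,
       new SimpleRecord(System.currentTimeMillis(), "key".getBytes("UTF-8"), "value".getBytes("UTF-8"))
     )

     println(s"magic = ${records.batches.iterator.next.magic}, bytes = ${records.sizeInBytes}")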
    
  2. The implementation of ReplicaManager.scala#appendRecords() is shown below. In short, the core processing is:

    1. First call ReplicaManager.scala#isValidRequiredAcks() to check that the producer's request.required.acks (acks) setting is valid. This setting controls the message durability level and has 3 possible values (a client-side sketch follows the code below):
      • 0: the producer does not wait for the server to write the message at all, for the lowest latency
      • 1: the producer waits for the leader replica on the server to finish writing the message before the send is confirmed
      • -1 (all): the producer waits until the leader replica and all follower replicas in its ISR (in-sync replicas) set have written the message
    2. Call ReplicaManager.scala#appendToLocalLog() to do the actual processing
    3. When that completes, decide, based on the request parameters and the processing outcome, whether to respond to the client immediately or defer the response; in either case the core response handling still runs in the callback mentioned in step 1
    def appendRecords(timeout: Long,
                     requiredAcks: Short,
                     internalTopicsAllowed: Boolean,
                     origin: AppendOrigin,
                     entriesPerPartition: Map[TopicPartition, MemoryRecords],
                     responseCallback: Map[TopicPartition, PartitionResponse] => Unit,
                     delayedProduceLock: Option[Lock] = None,
                     recordConversionStatsCallback: Map[TopicPartition, RecordConversionStats] => Unit = _ => (),
                     requestLocal: RequestLocal = RequestLocal.NoCaching): Unit = {
     if (isValidRequiredAcks(requiredAcks)) {
       val sTime = time.milliseconds
       val localProduceResults = appendToLocalLog(internalTopicsAllowed = internalTopicsAllowed,
         origin, entriesPerPartition, requiredAcks, requestLocal)
       debug("Produce to local log in %d ms".format(time.milliseconds - sTime))
    
       val produceStatus = localProduceResults.map { case (topicPartition, result) =>
         topicPartition -> ProducePartitionStatus(
           result.info.lastOffset + 1, // required offset
           new PartitionResponse(
             result.error,
             result.info.firstOffset.map(_.messageOffset).getOrElse(-1),
             result.info.logAppendTime,
             result.info.logStartOffset,
             result.info.recordErrors.asJava,
             result.info.errorMessage
           )
         ) // response status
       }
    
       actionQueue.add {
         () =>
           localProduceResults.foreach {
             case (topicPartition, result) =>
               val requestKey = TopicPartitionOperationKey(topicPartition)
               result.info.leaderHwChange match {
                 case LeaderHwChange.Increased =>
                   // some delayed operations may be unblocked after HW changed
                   delayedProducePurgatory.checkAndComplete(requestKey)
                   delayedFetchPurgatory.checkAndComplete(requestKey)
                   delayedDeleteRecordsPurgatory.checkAndComplete(requestKey)
                 case LeaderHwChange.Same =>
                   // probably unblock some follower fetch requests since log end offset has been updated
                   delayedFetchPurgatory.checkAndComplete(requestKey)
                 case LeaderHwChange.None =>
                   // nothing
               }
           }
       }
    
       recordConversionStatsCallback(localProduceResults.map { case (k, v) => k -> v.info.recordConversionStats })
    
       if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
         // create delayed produce operation
         val produceMetadata = ProduceMetadata(requiredAcks, produceStatus)
         val delayedProduce = new DelayedProduce(timeout, produceMetadata, this, responseCallback, delayedProduceLock)
    
         // create a list of (topic, partition) pairs to use as keys for this delayed produce operation
         val producerRequestKeys = entriesPerPartition.keys.map(TopicPartitionOperationKey(_)).toSeq
    
         // try to complete the request immediately, otherwise put it into the purgatory
         // this is because while the delayed produce operation is being created, new
         // requests may arrive and hence make this operation completable.
         delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys)
    
       } else {
         // we can respond immediately
         val produceResponseStatus = produceStatus.map { case (k, status) => k -> status.responseStatus }
         responseCallback(produceResponseStatus)
       }
     } else {
       // If required.acks is outside accepted range, something is wrong with the client
       // Just return an error and don't handle the request at all
       val responseStatus = entriesPerPartition.map { case (topicPartition, _) =>
         topicPartition -> new PartitionResponse(
           Errors.INVALID_REQUIRED_ACKS,
           LogAppendInfo.UnknownLogAppendInfo.firstOffset.map(_.messageOffset).getOrElse(-1),
           RecordBatch.NO_TIMESTAMP,
           LogAppendInfo.UnknownLogAppendInfo.logStartOffset
         )
       }
       responseCallback(responseStatus)
     }
    }
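
     For reference on the client side, here is a minimal producer sketch showing where the acks value checked by isValidRequiredAcks() comes from (the broker address localhost:9092 and topic demo-topic are assumptions for illustration; acks=all is equivalent to -1):

     import java.util.Properties
     import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

     val props = new Properties()
     props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed address
     props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
     props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
     // acks=all (same as -1): wait for the leader and every replica currently in the ISR
     props.put(ProducerConfig.ACKS_CONFIG, "all")

     val producer = new KafkaProducer[String, String](props)
     producer.send(new ProducerRecord[String, String]("demo-topic", "key", "value"))
     producer.close()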
    
  3. ReplicaManager.scala#appendToLocalLog() iterates over the Produce request's data partition by partition (appends to internal topics are rejected unless the request explicitly allows them; a small sketch follows the code below). For each topic partition the processing consists of two main steps:

    1. Call ReplicaManager.scala#getPartitionOrException() to locate the Partition object for the target partition
    2. Call Partition.scala#appendRecordsToLeader() to append the message data
    private def appendToLocalLog(internalTopicsAllowed: Boolean,
                                origin: AppendOrigin,
                                entriesPerPartition: Map[TopicPartition, MemoryRecords],
                                requiredAcks: Short,
                                requestLocal: RequestLocal): Map[TopicPartition, LogAppendResult] = {
     val traceEnabled = isTraceEnabled
     def processFailedRecord(topicPartition: TopicPartition, t: Throwable) = {
       val logStartOffset = onlinePartition(topicPartition).map(_.logStartOffset).getOrElse(-1L)
       brokerTopicStats.topicStats(topicPartition.topic).failedProduceRequestRate.mark()
       brokerTopicStats.allTopicsStats.failedProduceRequestRate.mark()
       error(s"Error processing append operation on partition $topicPartition", t)
    
       logStartOffset
     }
    
     if (traceEnabled)
       trace(s"Append [$entriesPerPartition] to local log")
    
     entriesPerPartition.map { case (topicPartition, records) =>
       brokerTopicStats.topicStats(topicPartition.topic).totalProduceRequestRate.mark()
       brokerTopicStats.allTopicsStats.totalProduceRequestRate.mark()
    
       // reject appending to internal topics if it is not allowed
       if (Topic.isInternal(topicPartition.topic) && !internalTopicsAllowed) {
         (topicPartition, LogAppendResult(
           LogAppendInfo.UnknownLogAppendInfo,
           Some(new InvalidTopicException(s"Cannot append to internal topic ${topicPartition.topic}"))))
       } else {
         try {
           val partition = getPartitionOrException(topicPartition)
           val info = partition.appendRecordsToLeader(records, origin, requiredAcks, requestLocal)
           val numAppendedMessages = info.numMessages
    
           // update stats for successfully appended bytes and messages as bytesInRate and messageInRate
           brokerTopicStats.topicStats(topicPartition.topic).bytesInRate.mark(records.sizeInBytes)
           brokerTopicStats.allTopicsStats.bytesInRate.mark(records.sizeInBytes)
           brokerTopicStats.topicStats(topicPartition.topic).messagesInRate.mark(numAppendedMessages)
           brokerTopicStats.allTopicsStats.messagesInRate.mark(numAppendedMessages)
    
           if (traceEnabled)
             trace(s"${records.sizeInBytes} written to log $topicPartition beginning at offset " +
               s"${info.firstOffset.getOrElse(-1)} and ending at offset ${info.lastOffset}")
    
           (topicPartition, LogAppendResult(info))
         } catch {
           // NOTE: Failed produce requests metric is not incremented for known exceptions
           // it is supposed to indicate un-expected failures of a broker in handling a produce request
           case e@ (_: UnknownTopicOrPartitionException |
                    _: NotLeaderOrFollowerException |
                    _: RecordTooLargeException |
                    _: RecordBatchTooLargeException |
                    _: CorruptRecordException |
                    _: KafkaStorageException) =>
             (topicPartition, LogAppendResult(LogAppendInfo.UnknownLogAppendInfo, Some(e)))
           case rve: RecordValidationException =>
             val logStartOffset = processFailedRecord(topicPartition, rve.invalidException)
             val recordErrors = rve.recordErrors
             (topicPartition, LogAppendResult(LogAppendInfo.unknownLogAppendInfoWithAdditionalInfo(
               logStartOffset, recordErrors, rve.invalidException.getMessage), Some(rve.invalidException)))
           case t: Throwable =>
             val logStartOffset = processFailedRecord(topicPartition, t)
             (topicPartition, LogAppendResult(LogAppendInfo.unknownLogAppendInfoWithLogStartOffset(logStartOffset), Some(t)))
         }
       }
     }
    }
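
     As a small aside on the internal-topic check in the code above, Topic.isInternal() from org.apache.kafka.common.internals is what decides whether a topic is internal; a trivial sketch:

     import org.apache.kafka.common.internals.Topic

     // __consumer_offsets and __transaction_state are internal topics; ordinary topics are not
     println(Topic.isInternal(Topic.GROUP_METADATA_TOPIC_NAME))    // true
     println(Topic.isInternal(Topic.TRANSACTION_STATE_TOPIC_NAME)) // true
     println(Topic.isInternal("demo-topic"))                       // false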
    
  4. Partition.scala#appendRecordsToLeader() performs the data write inside a read lock (inReadLock(leaderIsrUpdateLock)). The main processing here is:

    1. First check the ISR state of the current leader replica; only if it meets the requirement (at least min.insync.replicas in-sync replicas when acks = -1) is Log.scala#appendAsLeader() called to write the message data (a sketch of changing this topic config follows the code below)
    2. Call Partition.scala#maybeIncrementLeaderHW() to try to advance the partition's high watermark. This part is mainly tied to Kafka's leader/follower data replication mechanism and is not covered in depth in this article
      def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int,
                             requestLocal: RequestLocal): LogAppendInfo = {
     val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
       leaderLogIfLocal match {
         case Some(leaderLog) =>
           val minIsr = leaderLog.config.minInSyncReplicas
           val inSyncSize = isrState.isr.size
    
           // Avoid writing to leader if there are not enough insync replicas to make it safe
           if (inSyncSize < minIsr && requiredAcks == -1) {
             throw new NotEnoughReplicasException(s"The size of the current ISR ${isrState.isr} " +
               s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
           }
    
           val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
             interBrokerProtocolVersion, requestLocal)
    
           // we may need to increment high watermark since ISR could be down to 1
           (info, maybeIncrementLeaderHW(leaderLog))
    
         case None =>
           throw new NotLeaderOrFollowerException("Leader not local for partition %s on broker %d"
             .format(topicPartition, localBrokerId))
       }
     }
    
     info.copy(leaderHwChange = if (leaderHWIncremented) LeaderHwChange.Increased else LeaderHwChange.Same)
    }
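
     The NotEnoughReplicasException branch above is driven by the topic-level min.insync.replicas config. A hedged sketch of raising it with the AdminClient follows (the broker address and topic name are made up for illustration):

     import java.util.{Collections, Properties}
     import org.apache.kafka.clients.admin.{Admin, AdminClientConfig, AlterConfigOp, ConfigEntry}
     import org.apache.kafka.common.config.ConfigResource

     val props = new Properties()
     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed address
     val admin = Admin.create(props)

     // Require at least 2 in-sync replicas before acks=-1 writes are accepted for this topic
     val resource = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic")
     val configs = new java.util.HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
     configs.put(resource, Collections.singletonList(
       new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET)))

     admin.incrementalAlterConfigs(configs).all().get()
     admin.close()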
    
  5. Log.scala#appendAsLeader() is really just an entry point; the core work happens in Log.scala#append(). The method is fairly long, but the key steps are clear enough:

    1. First call Log.scala#analyzeAndValidateRecords() to analyze the message data and check that it is valid
    2. Call LogValidator.scala#validateMessagesAndAssignOffsets() to validate the data further and assign an offset and a timestamp to every record (a small worked sketch of the offset assignment follows the code below)
    3. Call Log.scala#maybeRoll() to check whether the current active LogSegment can hold this batch of messages; if not, roll a new LogSegment to store them. Rolling a new LogSegment creates new message storage files; the writes go through the file system into the page cache first and are flushed to disk at an appropriate later time
    4. Next call Log.scala#analyzeAndValidateProducerState() to analyze and validate the producer's state on the server, which determines, among other things, whether the batch is a duplicate
    5. If the batch is not a duplicate, call LogSegment.scala#append() to write the message data, and afterwards update the server-side state kept for this producer via ProducerStateManager
      def appendAsLeader(records: MemoryRecords,
                      leaderEpoch: Int,
                      origin: AppendOrigin = AppendOrigin.Client,
                      interBrokerProtocolVersion: ApiVersion = ApiVersion.latestVersion,
                      requestLocal: RequestLocal = RequestLocal.NoCaching): LogAppendInfo = {
     val validateAndAssignOffsets = origin != AppendOrigin.RaftLeader
     append(records, origin, interBrokerProtocolVersion, validateAndAssignOffsets, leaderEpoch, Some(requestLocal), ignoreRecordSize = false)
    }
    
    private def append(records: MemoryRecords,
                      origin: AppendOrigin,
                      interBrokerProtocolVersion: ApiVersion,
                      validateAndAssignOffsets: Boolean,
                      leaderEpoch: Int,
                      requestLocal: Option[RequestLocal],
                      ignoreRecordSize: Boolean): LogAppendInfo = {
     // We want to ensure the partition metadata file is written to the log dir before any log data is written to disk.
     // This will ensure that any log data can be recovered with the correct topic ID in the case of failure.
     maybeFlushMetadataFile()
    
     val appendInfo = analyzeAndValidateRecords(records, origin, ignoreRecordSize, leaderEpoch)
    
     // return if we have no valid messages or if this is a duplicate of the last appended entry
     if (appendInfo.shallowCount == 0) appendInfo
     else {
    
       // trim any invalid bytes or partial messages before appending it to the on-disk log
       var validRecords = trimInvalidBytes(records, appendInfo)
    
       // they are valid, insert them in the log
       lock synchronized {
         maybeHandleIOException(s"Error while appending records to $topicPartition in dir ${dir.getParent}") {
           checkIfMemoryMappedBufferClosed()
           if (validateAndAssignOffsets) {
             // assign offsets to the message set
             val offset = new LongRef(nextOffsetMetadata.messageOffset)
             appendInfo.firstOffset = Some(LogOffsetMetadata(offset.value))
             val now = time.milliseconds
             val validateAndOffsetAssignResult = try {
               LogValidator.validateMessagesAndAssignOffsets(validRecords,
                 topicPartition,
                 offset,
                 time,
                 now,
                 appendInfo.sourceCodec,
                 appendInfo.targetCodec,
                 config.compact,
                 config.recordVersion.value,
                 config.messageTimestampType,
                 config.messageTimestampDifferenceMaxMs,
                 leaderEpoch,
                 origin,
                 interBrokerProtocolVersion,
                 brokerTopicStats,
                 requestLocal.getOrElse(throw new IllegalArgumentException(
                   "requestLocal should be defined if assignOffsets is true")))
             } catch {
               case e: IOException =>
                 throw new KafkaException(s"Error validating messages while appending to log $name", e)
             }
             validRecords = validateAndOffsetAssignResult.validatedRecords
             appendInfo.maxTimestamp = validateAndOffsetAssignResult.maxTimestamp
             appendInfo.offsetOfMaxTimestamp = validateAndOffsetAssignResult.shallowOffsetOfMaxTimestamp
             appendInfo.lastOffset = offset.value - 1
             appendInfo.recordConversionStats = validateAndOffsetAssignResult.recordConversionStats
             if (config.messageTimestampType == TimestampType.LOG_APPEND_TIME)
               appendInfo.logAppendTime = now
    
             // re-validate message sizes if there's a possibility that they have changed (due to re-compression or message
             // format conversion)
             if (!ignoreRecordSize && validateAndOffsetAssignResult.messageSizeMaybeChanged) {
               validRecords.batches.forEach { batch =>
                 if (batch.sizeInBytes > config.maxMessageSize) {
                   // we record the original message set size instead of the trimmed size
                   // to be consistent with pre-compression bytesRejectedRate recording
                   brokerTopicStats.topicStats(topicPartition.topic).bytesRejectedRate.mark(records.sizeInBytes)
                   brokerTopicStats.allTopicsStats.bytesRejectedRate.mark(records.sizeInBytes)
                   throw new RecordTooLargeException(s"Message batch size is ${batch.sizeInBytes} bytes in append to" +
                     s"partition $topicPartition which exceeds the maximum configured size of ${config.maxMessageSize}.")
                 }
               }
             }
           } else {
             // we are taking the offsets we are given
             if (!appendInfo.offsetsMonotonic)
               throw new OffsetsOutOfOrderException(s"Out of order offsets found in append to $topicPartition: " +
                 records.records.asScala.map(_.offset))
    
             if (appendInfo.firstOrLastOffsetOfFirstBatch < nextOffsetMetadata.messageOffset) {
               // we may still be able to recover if the log is empty
               // one example: fetching from log start offset on the leader which is not batch aligned,
               // which may happen as a result of AdminClient#deleteRecords()
               val firstOffset = appendInfo.firstOffset match {
                 case Some(offsetMetadata) => offsetMetadata.messageOffset
                 case None => records.batches.asScala.head.baseOffset()
               }
    
               val firstOrLast = if (appendInfo.firstOffset.isDefined) "First offset" else "Last offset of the first batch"
               throw new UnexpectedAppendOffsetException(
                 s"Unexpected offset in append to $topicPartition. $firstOrLast " +
                   s"${appendInfo.firstOrLastOffsetOfFirstBatch} is less than the next offset ${nextOffsetMetadata.messageOffset}. " +
                   s"First 10 offsets in append: ${records.records.asScala.take(10).map(_.offset)}, last offset in" +
                   s" append: ${appendInfo.lastOffset}. Log start offset = $logStartOffset",
                 firstOffset, appendInfo.lastOffset)
             }
           }
    
           // update the epoch cache with the epoch stamped onto the message by the leader
           validRecords.batches.forEach { batch =>
             if (batch.magic >= RecordBatch.MAGIC_VALUE_V2) {
               maybeAssignEpochStartOffset(batch.partitionLeaderEpoch, batch.baseOffset)
             } else {
               // In partial upgrade scenarios, we may get a temporary regression to the message format. In
               // order to ensure the safety of leader election, we clear the epoch cache so that we revert
               // to truncation by high watermark after the next leader election.
               leaderEpochCache.filter(_.nonEmpty).foreach { cache =>
                 warn(s"Clearing leader epoch cache after unexpected append with message format v${batch.magic}")
                 cache.clearAndFlush()
               }
             }
           }
    
           // check messages set size may be exceed config.segmentSize
           if (validRecords.sizeInBytes > config.segmentSize) {
             throw new RecordBatchTooLargeException(s"Message batch size is ${validRecords.sizeInBytes} bytes in append " +
               s"to partition $topicPartition, which exceeds the maximum configured segment size of ${config.segmentSize}.")
           }
    
           // maybe roll the log if this segment is full
           val segment = maybeRoll(validRecords.sizeInBytes, appendInfo)
    
           val logOffsetMetadata = LogOffsetMetadata(
             messageOffset = appendInfo.firstOrLastOffsetOfFirstBatch,
             segmentBaseOffset = segment.baseOffset,
             relativePositionInSegment = segment.size)
    
           // now that we have valid records, offsets assigned, and timestamps updated, we need to
           // validate the idempotent/transactional state of the producers and collect some metadata
           val (updatedProducers, completedTxns, maybeDuplicate) = analyzeAndValidateProducerState(
             logOffsetMetadata, validRecords, origin)
    
           maybeDuplicate match {
             case Some(duplicate) =>
               appendInfo.firstOffset = Some(LogOffsetMetadata(duplicate.firstOffset))
               appendInfo.lastOffset = duplicate.lastOffset
               appendInfo.logAppendTime = duplicate.timestamp
               appendInfo.logStartOffset = logStartOffset
             case None =>
               // Before appending update the first offset metadata to include segment information
               appendInfo.firstOffset = appendInfo.firstOffset.map { offsetMetadata =>
                 offsetMetadata.copy(segmentBaseOffset = segment.baseOffset, relativePositionInSegment = segment.size)
               }
    
               segment.append(largestOffset = appendInfo.lastOffset,
                 largestTimestamp = appendInfo.maxTimestamp,
                 shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
                 records = validRecords)
    
               // Increment the log end offset. We do this immediately after the append because a
               // write to the transaction index below may fail and we want to ensure that the offsets
               // of future appends still grow monotonically. The resulting transaction index inconsistency
               // will be cleaned up after the log directory is recovered. Note that the end offset of the
               // ProducerStateManager will not be updated and the last stable offset will not advance
               // if the append to the transaction index fails.
               updateLogEndOffset(appendInfo.lastOffset + 1)
    
               // update the producer state
               updatedProducers.values.foreach(producerAppendInfo => producerStateManager.update(producerAppendInfo))
    
               // update the transaction index with the true last stable offset. The last offset visible
               // to consumers using READ_COMMITTED will be limited by this value and the high watermark.
               completedTxns.foreach { completedTxn =>
                 val lastStableOffset = producerStateManager.lastStableOffset(completedTxn)
                 segment.updateTxnIndex(completedTxn, lastStableOffset)
                 producerStateManager.completeTxn(completedTxn)
               }
    
               // always update the last producer id map offset so that the snapshot reflects the current offset
               // even if there isn't any idempotent data being written
               producerStateManager.updateMapEndOffset(appendInfo.lastOffset + 1)
    
               // update the first unstable offset (which is used to compute LSO)
               maybeIncrementFirstUnstableOffset()
    
               trace(s"Appended message set with last offset: ${appendInfo.lastOffset}, " +
                 s"first offset: ${appendInfo.firstOffset}, " +
                 s"next offset: ${nextOffsetMetadata.messageOffset}, " +
                 s"and messages: $validRecords")
    
               if (unflushedMessages >= config.flushInterval) flush()
           }
           appendInfo
         }
       }
     }
    }
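
     For step 2 above, a tiny worked sketch (not broker code) of the offset assignment done by LogValidator.scala#validateMessagesAndAssignOffsets(): the leader starts from the current log end offset and stamps consecutive offsets onto the incoming records, which is why appendInfo.lastOffset ends up as offset.value - 1 in the code above. The starting offset is made up.

     var nextOffset = 42L                // assumed current log end offset
     val incomingRecordCount = 3

     val firstOffset = nextOffset
     val assignedOffsets = (0 until incomingRecordCount).map { _ =>
       val o = nextOffset
       nextOffset += 1
       o
     }

     // prints: firstOffset=42, lastOffset=44, nextOffset=45
     println(s"firstOffset=$firstOffset, lastOffset=${nextOffset - 1}, nextOffset=$nextOffset, assigned=$assignedOffsets")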
    
  6. The write performed by LogSegment.scala#append() is quite compact and roughly breaks down into the following steps:

    1. First call LogSegment.scala#ensureOffsetInRange() to check, via the offset index (OffsetIndex), that the offset of the incoming messages is within the allowed range
    2. Then call FileRecords.java#append() to write the message data to the file
    3. Finally, based on the index interval config (index.interval.bytes) and the bytesSinceLastIndexEntry accumulator, decide whether to add an index entry. This is the sparse index behavior mentioned in earlier articles in this series (see the small sketch after the code below)
    @nonthreadsafe
    def append(largestOffset: Long,
              largestTimestamp: Long,
              shallowOffsetOfMaxTimestamp: Long,
              records: MemoryRecords): Unit = {
     if (records.sizeInBytes > 0) {
       trace(s"Inserting ${records.sizeInBytes} bytes at end offset $largestOffset at position ${log.sizeInBytes} " +
             s"with largest timestamp $largestTimestamp at shallow offset $shallowOffsetOfMaxTimestamp")
       val physicalPosition = log.sizeInBytes()
       if (physicalPosition == 0)
         rollingBasedTimestamp = Some(largestTimestamp)
    
       ensureOffsetInRange(largestOffset)
    
       // append the messages
       val appendedBytes = log.append(records)
       trace(s"Appended $appendedBytes to ${log.file} at end offset $largestOffset")
       // Update the in memory max timestamp and corresponding offset.
       if (largestTimestamp > maxTimestampSoFar) {
         maxTimestampAndOffsetSoFar = TimestampOffset(largestTimestamp, shallowOffsetOfMaxTimestamp)
       }
       // append an entry to the index (if needed)
       if (bytesSinceLastIndexEntry > indexIntervalBytes) {
         offsetIndex.append(largestOffset, physicalPosition)
         timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestampSoFar)
         bytesSinceLastIndexEntry = 0
       }
       bytesSinceLastIndexEntry += records.sizeInBytes
     }
    }
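
     A minimal sketch (not broker code) of the sparse-index decision in the append above, assuming the default index.interval.bytes of 4096: an index entry is only written once at least that many bytes have accumulated since the previous entry, so the offset index stays small and a lookup falls back to a short sequential scan from the nearest preceding entry.

     val indexIntervalBytes = 4096       // broker default for index.interval.bytes
     var bytesSinceLastIndexEntry = 0

     // Made-up batch sizes (in bytes) arriving at LogSegment#append()
     val batchSizes = Seq(1200, 800, 2500, 3000, 500)

     batchSizes.zipWithIndex.foreach { case (size, i) =>
       if (bytesSinceLastIndexEntry > indexIntervalBytes) {
         println(s"write index entry before appending batch $i") // offsetIndex.append(...) in the real code
         bytesSinceLastIndexEntry = 0
       }
       bytesSinceLastIndexEntry += size
     }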
    
  7. FileRecords.java#append() writes the message data to the file system as shown below. It essentially just calls MemoryRecords.java#writeFullyTo() to write the data into a FileChannel, i.e. into the OS page cache (a bare FileChannel sketch follows the code below). At this point the server-side handling of message storage is essentially complete

     public int append(MemoryRecords records) throws IOException {
         if (records.sizeInBytes() > Integer.MAX_VALUE - size.get())
             throw new IllegalArgumentException("Append of size " + records.sizeInBytes() +
                     " bytes is too large for segment with current file position at " + size.get());
    
         int written = records.writeFullyTo(channel);
         size.getAndAdd(written);
         return written;
     }
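
     Finally, a bare-bones FileChannel sketch (plain NIO, not Kafka code) of what MemoryRecords.java#writeFullyTo() boils down to: write() hands the bytes to the OS page cache, and durability only comes from a later force()/fsync, which Kafka triggers separately via LogSegment flushes.

     import java.nio.ByteBuffer
     import java.nio.channels.FileChannel
     import java.nio.file.{Files, StandardOpenOption}

     val path = Files.createTempFile("segment-sketch", ".log")
     val channel = FileChannel.open(path, StandardOpenOption.WRITE)

     val buffer = ByteBuffer.wrap("record-bytes".getBytes("UTF-8"))
     while (buffer.hasRemaining)
       channel.write(buffer)             // may take several calls, like writeFullyTo()

     channel.force(true)                 // explicit flush to disk (fsync); Kafka defers this to flush()
     channel.close()
     Files.deleteIfExists(path)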
    
