The main fields of DelayedFetch are as follows:
class DelayedFetch(
// How long the operation may be delayed
delayMs: Long,
// Records the state of every partition involved in the FetchRequest; mainly used to decide whether the DelayedFetch satisfies its completion conditions
fetchMetadata: FetchMetadata,
replicaManager: ReplicaManager,
// Callback invoked from onComplete when the conditions are met or the operation times out; it mainly builds the FetchResponse and adds it to the corresponding responseQueue in RequestChannel
responseCallback: Map[TopicAndPartition, FetchResponsePartitionData] => Unit)
extends DelayedOperation(delayMs) {}
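DelayedFetch inherits its completion machinery from DelayedOperation. The sketch below is a rough model of that contract, not the actual Kafka class: timer handling is omitted, and SketchOperation and SketchFetch are hypothetical names. It assumes completion is guarded by an atomic flag, so that onComplete runs only once whether it is triggered by tryComplete or by a timeout.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object DelayedOperationSketch {
  // Simplified stand-in for the DelayedOperation base class (timer handling omitted).
  abstract class SketchOperation {
    private val completed = new AtomicBoolean(false)

    // Runs the completion logic at most once, whichever caller wins the race.
    def forceComplete(): Boolean =
      if (completed.compareAndSet(false, true)) {
        onComplete()
        true
      } else false

    def isCompleted: Boolean = completed.get()

    // Completion logic, e.g. re-reading the log and invoking the response callback.
    def onComplete(): Unit

    // Returns true if the operation was completed by this call.
    def tryComplete(): Boolean
  }

  // Hypothetical operation that completes once enough bytes are available.
  class SketchFetch(minBytes: Int, available: () => Int) extends SketchOperation {
    var responsesSent = 0
    def onComplete(): Unit = responsesSent += 1
    def tryComplete(): Boolean =
      if (available() >= minBytes) forceComplete() else false
  }
}
```

A second call to forceComplete after the operation has completed returns false and does not send another response, which is why the timeout path and the tryComplete path can race safely.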
DelayedFetch.tryComplete checks whether the DelayedFetch can be completed and, if so, calls forceComplete. Meeting any one of the following four conditions is sufficient:
/**
* The operation can be completed if:
*
* Case A: This broker is no longer the leader for some partitions it tries to fetch
* Case B: This broker does not know of some partitions it tries to fetch
* Case C: The fetch offset locates not on the last segment of the log
* Case D: The accumulated bytes from all the fetching partitions exceeds the minimum bytes
*
* Upon completion, should return whatever data is available for each valid partition
*/
override def tryComplete() : Boolean = {
var accumulatedSize = 0
// Iterate over the status of every partition in fetchMetadata
fetchMetadata.fetchPartitionStatus.foreach {
case (topicAndPartition, fetchStatus) =>
// The offset metadata of the fetch start position, as resolved by the earlier read from the log
val fetchOffset = fetchStatus.startOffsetMetadata
try {
if (fetchOffset != LogOffsetMetadata.UnknownOffsetMetadata) {
// Look up the leader replica of the partition; throws an exception if it cannot be found
val replica = replicaManager.getLeaderReplicaIfLocal(topicAndPartition.topic, topicAndPartition.partition)
// Choose the maximum readable offset based on where the FetchRequest came from: the HW for consumers, the LEO for follower replicas
val endOffset =
if (fetchMetadata.fetchOnlyCommitted)
replica.highWatermark
else
replica.logEndOffset
// Go directly to the check for Case D if the message offsets are the same. If the log segment
// has just rolled, then the high watermark offset will remain the same but be on the old segment,
// which would incorrectly be seen as an instance of Case C.
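// Check whether endOffset differs from the fetch offset. If it does not, a read that found too little data before would still find too little now, so the condition remains unmet; if it does, continue with the checks below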
if (endOffset.messageOffset != fetchOffset.messageOffset) {
// Case C: the fetch offset is not in the activeSegment
if (endOffset.onOlderSegment(fetchOffset)) { // the log may have been truncated
// Case C, this can happen when the new fetch operation is on a truncated leader
debug("Satisfying fetch %s since it is fetching later segments of partition %s.".format(fetchMetadata, topicAndPartition))
return forceComplete()
} else if (fetchOffset.onOlderSegment(endOffset)) { // fetchOffset is before endOffset, but a new activeSegment has been rolled: fetchOffset is on an old segment while endOffset is on the new one
// Case C, this can happen when the fetch operation is falling behind the current segment
// or the partition has just rolled a new segment
debug("Satisfying fetch %s immediately since it is fetching older segments.".format(fetchMetadata))
return forceComplete()
} else if (fetchOffset.messageOffset < endOffset.messageOffset) {
// we need take the partition fetch size as upper bound when accumulating the bytes
// The fetch offset and endOffset are in the same activeSegment and endOffset has moved forward, so add to the accumulated byte count
accumulatedSize += math.min(endOffset.positionDiff(fetchOffset), fetchStatus.fetchInfo.fetchSize)
}
}
}
} catch {
case utpe: UnknownTopicOrPartitionException => // Case B: this broker cannot find a replica of the partition to be read
debug("Broker no longer know of %s, satisfy %s immediately".format(topicAndPartition, fetchMetadata))
return forceComplete()
case nle: NotLeaderForPartitionException => // Case A: the leader replica has moved to another broker
debug("Broker is no longer the leader of %s, satisfy %s immediately".format(topicAndPartition, fetchMetadata))
return forceComplete()
}
}
// Case D: the accumulated readable bytes reach the minimum-bytes threshold
if (accumulatedSize >= fetchMetadata.fetchMinBytes)
forceComplete()
else
false
}
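The Case D bookkeeping can be isolated into a small sketch. It assumes a flattened, hypothetical PartitionState that holds plain byte positions within one segment instead of Kafka's LogOffsetMetadata, and it clamps negative differences to zero where the original guards with an offset comparison.

```scala
object CaseDSketch {
  // Hypothetical, flattened view of one partition: byte positions within the same segment.
  final case class PartitionState(fetchPosition: Int, endPosition: Int, fetchSize: Int)

  // Bytes available per partition, capped by that partition's fetch size.
  def accumulatedBytes(states: Seq[PartitionState]): Int =
    states.map { s =>
      // Assumption: clamp to zero instead of checking message offsets first.
      val available = math.max(0, s.endPosition - s.fetchPosition)
      math.min(available, s.fetchSize)
    }.sum

  // Case D: complete once the capped total reaches fetchMinBytes.
  def satisfiesCaseD(states: Seq[PartitionState], fetchMinBytes: Int): Boolean =
    accumulatedBytes(states) >= fetchMinBytes
}
```

With two partitions holding 300 and 30 readable bytes and a fetch size of 200 each, the first contributes only 200, so the total is 230. The per-partition cap matters because a single busy partition should not count for more data than the response can actually carry for it.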
The DelayedFetch.onComplete method is as follows:
override def onComplete() {
// Re-read the data from the log
val logReadResults = replicaManager.readFromLocalLog(fetchMetadata.fetchOnlyLeader,
fetchMetadata.fetchOnlyCommitted,
fetchMetadata.fetchPartitionStatus.mapValues(status => status.fetchInfo))
// Wrap the read results
val fetchPartitionData = logReadResults.mapValues(result =>
FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet))
// Invoke the callback
responseCallback(fetchPartitionData)
}
The responseCallback passed to DelayedFetch is the sendResponseCallback function defined in KafkaApis.handleFetchRequest:
// the callback for sending a fetch response
def sendResponseCallback(responsePartitionData: Map[TopicAndPartition, FetchResponsePartitionData]) {
val convertedPartitionData =
// Need to down-convert message when consumer only takes magic value 0.
if (fetchRequest.versionId <= 1) {
responsePartitionData.map { case (tp, data) =>
// We only do down-conversion when:
// 1. The message format version configured for the topic is using magic value > 0, and
// 2. The message set contains message whose magic > 0
// This is to reduce the message format conversion as much as possible. The conversion will only occur
// when new message format is used for the topic and we see an old request.
// Please note that if the message format is changed from a higher version back to lower version this
// test might break because some messages in new message format can be delivered to consumers before 0.10.0.0
// without format down conversion.
val convertedData = if (replicaManager.getMessageFormatVersion(tp).exists(_ > Message.MagicValue_V0) &&
!data.messages.isMagicValueInAllWrapperMessages(Message.MagicValue_V0)) {
trace(s"Down converting message to V0 for fetch request from ${fetchRequest.clientId}")
new FetchResponsePartitionData(data.error, data.hw, data.messages.asInstanceOf[FileMessageSet].toMessageFormat(Message.MagicValue_V0))
} else data
tp -> convertedData
}
} else responsePartitionData
val mergedPartitionData = convertedPartitionData ++ unauthorizedPartitionData
mergedPartitionData.foreach { case (topicAndPartition, data) =>
if (data.error != Errors.NONE.code)
debug(s"Fetch request with correlation id ${fetchRequest.correlationId} from client ${fetchRequest.clientId} " +
s"on partition $topicAndPartition failed due to ${Errors.forCode(data.error).exceptionName}")
// record the bytes out metrics only when the response is being sent
BrokerTopicStats.getBrokerTopicStats(topicAndPartition.topic).bytesOutRate.mark(data.messages.sizeInBytes)
BrokerTopicStats.getBrokerAllTopicsStats().bytesOutRate.mark(data.messages.sizeInBytes)
}
// Define the fetchResponseCallback function
def fetchResponseCallback(delayTimeMs: Int) {
trace(s"Sending fetch response to client ${fetchRequest.clientId} of " +
s"${convertedPartitionData.values.map(_.messages.sizeInBytes).sum} bytes")
// Build the FetchResponse object
val response = FetchResponse(fetchRequest.correlationId, mergedPartitionData, fetchRequest.versionId, delayTimeMs)
// Add a SendAction response, which wraps the FetchResponse built above, to the corresponding responseQueue
requestChannel.sendResponse(new RequestChannel.Response(request, new FetchResponseSend(request.connectionId, response)))
}
// When this callback is triggered, the remote API call has completed
request.apiRemoteCompleteTimeMs = SystemTime.milliseconds
// Do not throttle replication traffic
if (fetchRequest.isFromFollower) {
// Call fetchResponseCallback to return the FetchResponse
fetchResponseCallback(0)
} else {
// Internally this also ends up calling fetchResponseCallback
quotaManagers(ApiKeys.FETCH.id).recordAndMaybeThrottle(fetchRequest.clientId,
FetchResponse.responseSize(mergedPartitionData.groupBy(_._1.topic),
fetchRequest.versionId),
fetchResponseCallback)
}
}
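The DelayedFetch flow is as follows:
1. A follower replica or a consumer sends a FetchRequest to fetch messages from some partitions.
2. After passing through the network layer and the API layer, the FetchRequest reaches ReplicaManager, which reads data from the log storage subsystem, checks whether the ISR set, the HW, and so on need updating, and then completes any DelayedProduce operations in delayedProducePurgatory whose conditions are now satisfied.
3. The log storage subsystem returns the messages it read along with related information, such as the offset that was read.
4. ReplicaManager creates a DelayedFetch object for the FetchRequest and hands it to delayedFetchPurgatory.
5. delayedFetchPurgatory uses SystemTimer to track whether the DelayedFetch has timed out.
6. When a producer sends a ProduceRequest to append messages, the broker also checks whether the DelayedFetch now satisfies its completion conditions.
7. When the DelayedFetch executes, it invokes the callback to build a FetchResponse and adds it to the RequestChannel.
8. The network layer returns the FetchResponse to the client.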
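The steps above can be sketched as a toy model of one partition. Everything here is an assumption made for illustration: ToyBroker and PendingFetch are hypothetical names, the log is reduced to a byte counter, and the timer is replaced by an explicit expireAll call. It is not how ReplicaManager or the purgatory are actually implemented.

```scala
object FetchFlowSketch {
  // Hypothetical delayed fetch: waits until minBytes have been appended after startPosition.
  final class PendingFetch(startPosition: Int, minBytes: Int, respond: Int => Unit) {
    private var done = false

    // Completes the fetch if enough bytes are available; true if this call completed it.
    def tryComplete(logEnd: Int): Boolean =
      if (!done && logEnd - startPosition >= minBytes) { complete(logEnd); true }
      else false

    // Timeout path: answer with whatever is available, possibly nothing.
    def expire(logEnd: Int): Unit = if (!done) complete(logEnd)

    private def complete(logEnd: Int): Unit = {
      done = true
      respond(logEnd - startPosition)
    }
  }

  // Toy stand-in for the log of one partition plus the list of waiting fetches.
  final class ToyBroker {
    private var logEnd = 0
    private var watchers = List.empty[PendingFetch]

    // Steps 2 to 4: try to answer immediately, otherwise park the fetch.
    def fetch(op: PendingFetch): Unit =
      if (!op.tryComplete(logEnd)) watchers = op :: watchers

    // Step 6: an append re-checks the parked fetches and drops the completed ones.
    def append(bytes: Int): Unit = {
      logEnd += bytes
      watchers = watchers.filterNot(op => op.tryComplete(logEnd))
    }

    // Step 5, simplified: expire everything still waiting.
    def expireAll(): Unit = {
      watchers.foreach(_.expire(logEnd))
      watchers = Nil
    }

    def pending: Int = watchers.size
  }
}
```

A fetch that starts at position 100 with a 50-byte minimum stays parked after a 20-byte append and is answered with 60 bytes once a further 40 bytes arrive. A fetch that never sees enough data is answered with whatever exists when it expires.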