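Overview
The OffsetForLeaderEpoch API was originally used only for inter-broker communication and required cluster permission. Since KIP-320, the consumer also uses this API to check whether log truncation has occurred after a leader change.
First, I will walk through an example of a follower replica sending an OffsetForLeaderEpoch request to the leader replica.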
A(leader, epoch=1): 1, 2, 3, 4, 5, 6
A cache: leaderEpoch = 1, startOffset = 1
B(follower): 1, 2, 3, 4
B cache: leaderEpoch = 1, startOffset = 1
=============================================
B(leader, epoch=2): 1, 2, 3, 4, 5, 6, 7
B cache:
leaderEpoch = 1, startOffset = 1
leaderEpoch = 2, startOffset = 5
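After A crashes, B becomes the new leader (epoch=2) and appends new data, so B's leaderEpochCache gains a new entry (leaderEpoch=2, startOffset=5). A then recovers and rejoins as a follower.
When A starts replicating from B, it sends a request with epoch=1. B finds epoch=2 (the smallest epoch larger than 1) and returns the corresponding startOffset=5. On receiving this, A truncates its own records with offset >= 5 (here offsets 5 and 6), sets its fetch offset to 5, and resumes replication. B returns the records at offsets 5, 6, and 7 (epoch=2). While appending them, A sees that the data carries epoch=2 and adds the entry (epoch=2, startOffset=5) to its own leaderEpochCache.
When handling an OffsetForLeaderEpoch request, the leader replica returns the startOffset of the smallest epoch larger than requestedLeaderEpoch, or its log end offset (LEO) if the requested epoch is the latest one. The source code below shows how.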
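The follower side of this example can be sketched as a small stand-alone model. This is only an illustration under simplifying assumptions: `Record`, `truncateTo`, and `epochCache` are hypothetical names, not Kafka classes, and the log is modelled as a plain in-memory vector.

```scala
// Hypothetical model of follower A in the example; none of these names are Kafka's.
object FollowerTruncationSketch {
  final case class Record(offset: Long, epoch: Int)

  // Drop every record at or beyond the end offset returned by the leader.
  def truncateTo(log: Vector[Record], endOffset: Long): Vector[Record] =
    log.filter(_.offset < endOffset)

  // Derive (epoch, startOffset) entries from the log, ordered by epoch.
  def epochCache(log: Vector[Record]): Vector[(Int, Long)] =
    log.groupBy(_.epoch).toVector
      .map { case (epoch, records) => (epoch, records.map(_.offset).min) }
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val logA  = (1L to 6L).map(Record(_, 1)).toVector // A's stale log, all written in epoch 1
    val fromB = (5L to 7L).map(Record(_, 2)).toVector // records B serves from offset 5, epoch 2
    val truncated = truncateTo(logA, 5L)              // B answered end offset 5 for epoch 1
    val caughtUp  = truncated ++ fromB
    println(truncated.map(_.offset))
    println(epochCache(caughtUp))
  }
}
```

After the truncation and re-fetch, the derived cache of A matches B's cache from the example.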
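KafkaApis#handleOffsetForLeaderEpochRequest()
Decodes the OffsetsForLeaderEpochRequest body, checks authorization, calls ReplicaManager#lastOffsetForLeaderEpoch() to obtain, for each partition, the leader epoch and end offset that match the requested epoch, and finally sends an OffsetsForLeaderEpochResponse back to the client.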
def handleOffsetForLeaderEpochRequest(request: RequestChannel.Request): Unit = {
  val offsetForLeaderEpoch = request.body[OffsetsForLeaderEpochRequest]
  val requestInfo = offsetForLeaderEpoch.epochsByTopicPartition.asScala
  // The OffsetsForLeaderEpoch API was initially only used for inter-broker communication and required
  // cluster permission. With KIP-320, the consumer now also uses this API to check for log truncation
  // following a leader change, so we also allow topic describe permission.
  val (authorizedPartitions, unauthorizedPartitions) = if (isAuthorizedClusterAction(request)) {
    (requestInfo, Map.empty[TopicPartition, OffsetsForLeaderEpochRequest.PartitionData])
  } else {
    requestInfo.partition {
      case (tp, _) => authorize(request.session, Describe, Resource(Topic, tp.topic, LITERAL))
    }
  }
  // ReplicaManager#lastOffsetForLeaderEpoch() looks up the end offset for each requested leader epoch
  val endOffsetsForAuthorizedPartitions = replicaManager.lastOffsetForLeaderEpoch(authorizedPartitions)
  val endOffsetsForUnauthorizedPartitions = unauthorizedPartitions.mapValues(_ =>
    new EpochEndOffset(Errors.TOPIC_AUTHORIZATION_FAILED, EpochEndOffset.UNDEFINED_EPOCH,
      EpochEndOffset.UNDEFINED_EPOCH_OFFSET))
  val endOffsetsForAllPartitions = endOffsetsForAuthorizedPartitions ++ endOffsetsForUnauthorizedPartitions
  sendResponseMaybeThrottle(request, requestThrottleMs =>
    new OffsetsForLeaderEpochResponse(requestThrottleMs, endOffsetsForAllPartitions.asJava))
}
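The authorization step above splits the requested partitions into two groups before any lookup happens. A minimal sketch of that split, assuming a hypothetical `canDescribe` check in place of Kafka's authorizer and a plain `Int` in place of the per-partition request data:

```scala
// Hypothetical model of the authorization split; not Kafka's actual types.
object AuthorizationSplitSketch {
  final case class TopicPartition(topic: String, partition: Int)

  // Stand-in for the topic Describe check (assumption: a simple allow-list).
  def canDescribe(allowedTopics: Set[String], tp: TopicPartition): Boolean =
    allowedTopics.contains(tp.topic)

  // Returns (authorized, unauthorized). A caller with cluster permission keeps everything.
  def split(requested: Map[TopicPartition, Int],
            hasClusterAction: Boolean,
            allowedTopics: Set[String]): (Map[TopicPartition, Int], Map[TopicPartition, Int]) =
    if (hasClusterAction) (requested, Map.empty[TopicPartition, Int])
    else requested.partition { case (tp, _) => canDescribe(allowedTopics, tp) }

  def main(args: Array[String]): Unit = {
    val requested = Map(TopicPartition("orders", 0) -> 1, TopicPartition("audit", 0) -> 1)
    println(split(requested, hasClusterAction = false, allowedTopics = Set("orders")))
    println(split(requested, hasClusterAction = true, allowedTopics = Set.empty))
  }
}
```

A broker (cluster permission) gets every partition looked up, while a consumer only gets the topics it may describe; the rest are answered with an authorization error.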
ReplicaManager#lastOffsetForLeaderEpoch() method
Iterates over requestedEpochInfo and returns a map from each partition to its EpochEndOffset.
def lastOffsetForLeaderEpoch(requestedEpochInfo: Map[TopicPartition, OffsetsForLeaderEpochRequest.PartitionData]): Map[TopicPartition, EpochEndOffset] = {
  // Build an EpochEndOffset for every requested partition
  requestedEpochInfo.map { case (tp, partitionData) =>
    val epochEndOffset = getPartition(tp) match {
      case Some(partition) =>
        if (partition eq ReplicaManager.OfflinePartition)
          new EpochEndOffset(Errors.KAFKA_STORAGE_ERROR, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
        else
          partition.lastOffsetForLeaderEpoch(partitionData.currentLeaderEpoch, partitionData.leaderEpoch,
            fetchOnlyFromLeader = true)
      case None if metadataCache.contains(tp) =>
        new EpochEndOffset(Errors.NOT_LEADER_FOR_PARTITION, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
      case None =>
        new EpochEndOffset(Errors.UNKNOWN_TOPIC_OR_PARTITION, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
    }
    tp -> epochEndOffset
  }
}
Partition#lastOffsetForLeaderEpoch() method
Returns the end offset of the largest epoch less than or equal to requestedLeaderEpoch. The end offset of an epoch is defined as the startOffset of the first epoch larger than it, or, if the epoch is the latest one, as the log end offset (LEO).
/**
  * Find the (exclusive) last offset of the largest epoch less than or equal to the requested epoch.
  *
  * @param currentLeaderEpoch The expected epoch of the current leader (if known)
  * @param leaderEpoch Requested leader epoch
  * @param fetchOnlyFromLeader Whether or not to require servicing only from the leader
  *
  * @return The requested leader epoch and the end offset of this leader epoch, or if the requested
  *         leader epoch is unknown, the leader epoch less than the requested leader epoch and the end offset
  *         of this leader epoch. The end offset of a leader epoch is defined as the start
  *         offset of the first leader epoch larger than the leader epoch, or else the log end
  *         offset if the leader epoch is the latest leader epoch.
  */
def lastOffsetForLeaderEpoch(currentLeaderEpoch: Optional[Integer],
                             leaderEpoch: Int,
                             fetchOnlyFromLeader: Boolean): EpochEndOffset = {
  inReadLock(leaderIsrUpdateLock) {
    // Look up the local replica; an error is returned instead if the currentLeaderEpoch
    // or leadership check fails
    val localReplicaOrError = getLocalReplica(localBrokerId, currentLeaderEpoch, fetchOnlyFromLeader)
    localReplicaOrError match {
      case Left(replica) =>
        // Look up the (leaderEpoch, endOffset) pair for the requested leader epoch
        replica.endOffsetForEpoch(leaderEpoch) match {
          case Some(epochAndOffset) => new EpochEndOffset(NONE, epochAndOffset.leaderEpoch, epochAndOffset.offset)
          case None => new EpochEndOffset(NONE, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
        }
      case Right(error) =>
        new EpochEndOffset(error, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
    }
  }
}
Replica#endOffsetForEpoch method
def endOffsetForEpoch(leaderEpoch: Int): Option[OffsetAndEpoch] = {
  if (isLocal) {
    log.get.endOffsetForEpoch(leaderEpoch)
  } else {
    throw new KafkaException(s"Cannot lookup end offset for epoch of non-local replica of $topicPartition")
  }
}
Log#endOffsetForEpoch() method
If the leaderEpochCache exists, calls LeaderEpochFileCache#endOffsetFor and converts an undefined end offset into None.
def endOffsetForEpoch(leaderEpoch: Int): Option[OffsetAndEpoch] = {
  // Only answers if a leaderEpochCache exists
  leaderEpochCache.flatMap { cache =>
    val (foundEpoch, foundOffset) = cache.endOffsetFor(leaderEpoch)
    if (foundOffset == EpochEndOffset.UNDEFINED_EPOCH_OFFSET)
      None
    else
      Some(OffsetAndEpoch(foundOffset, foundEpoch))
  }
}
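The two ways this method can yield None (no cache at all, or a sentinel offset from the cache) can be seen in a small stand-alone sketch. The cache is modelled here as an optional lookup function, and the sentinel value -1 is an assumption of the sketch.

```scala
// Hypothetical model: the epoch cache is just an optional lookup function.
object EndOffsetOptionSketch {
  final case class OffsetAndEpoch(offset: Long, epoch: Int)
  val UndefinedOffset: Long = -1L // assumed sentinel for "no end offset known"

  def endOffsetForEpoch(cache: Option[Int => (Int, Long)], leaderEpoch: Int): Option[OffsetAndEpoch] =
    cache.flatMap { lookup =>
      val (foundEpoch, foundOffset) = lookup(leaderEpoch)
      if (foundOffset == UndefinedOffset) None
      else Some(OffsetAndEpoch(foundOffset, foundEpoch))
    }

  def main(args: Array[String]): Unit = {
    val known: Int => (Int, Long) = epoch => (epoch, 5L)
    val unknown: Int => (Int, Long) = _ => (-1, UndefinedOffset)
    println(endOffsetForEpoch(None, 1))          // no cache at all
    println(endOffsetForEpoch(Some(unknown), 1)) // cache present, epoch not resolvable
    println(endOffsetForEpoch(Some(known), 1))   // cache present, end offset found
  }
}
```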
LeaderEpochFileCache#endOffsetFor method
Given requestedLeaderEpoch, returns a (leaderEpoch, endOffset) pair.
The returned leaderEpoch is the largest epoch less than or equal to requestedLeaderEpoch, and endOffset is the end offset of that returned epoch.
The rule can be summarized as follows:
When requestedLeaderEpoch != latestEpoch
- return leaderEpoch = the largest cached epoch <= requestedLeaderEpoch
- return endOffset = the startOffset of the smallest cached epoch > requestedLeaderEpoch
When requestedLeaderEpoch == latestEpoch
- return leaderEpoch = requestedLeaderEpoch
- return endOffset = the current log end offset (LEO)
/**
  * Returns the Leader Epoch and the End Offset for a requested Leader Epoch.
  *
  * The Leader Epoch returned is the largest epoch less than or equal to the requested Leader
  * Epoch. The End Offset is the end offset of this epoch, which is defined as the start offset
  * of the first Leader Epoch larger than the Leader Epoch requested, or else the Log End
  * Offset if the latest epoch was requested.
  *
  * During the upgrade phase, where there are existing messages may not have a leader epoch,
  * if requestedEpoch is < the first epoch cached, UNSUPPORTED_EPOCH_OFFSET will be returned
  * so that the follower falls back to High Water Mark.
  *
  * @param requestedEpoch requested leader epoch
  * @return found leader epoch and end offset
  */
def endOffsetFor(requestedEpoch: Int): (Int, Long) = {
  inReadLock(lock) {
    val epochAndOffset =
      if (requestedEpoch == UNDEFINED_EPOCH) {
        // This may happen if a bootstrapping follower sends a request with undefined epoch or
        // a follower is on the older message format where leader epochs are not recorded
        (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
      } else if (latestEpoch.contains(requestedEpoch)) {
        // For the leader, the latest epoch is always the current leader epoch that is still being written to.
        // Followers should not have any reason to query for the end offset of the current epoch, but a consumer
        // might if it is verifying its committed offset following a group rebalance. In this case, we return
        // the current log end offset which makes the truncation check work as expected.
        (requestedEpoch, logEndOffset())
      } else {
        // Split the cached epochs around requestedEpoch: subsequentEpochs holds the epochs larger than
        // requestedEpoch, previousEpochs holds the epochs less than or equal to it.
        val (subsequentEpochs, previousEpochs) = epochs.partition { e => e.epoch > requestedEpoch}
        if (subsequentEpochs.isEmpty) {
          // The requested epoch is larger than any known epoch. This case should never be hit because
          // the latest cached epoch is always the largest.
          (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
        } else if (previousEpochs.isEmpty) {
          // The requested epoch is smaller than any known epoch, so we return the start offset of the first
          // known epoch which is larger than it. This may be inaccurate as there could have been
          // epochs in between, but the point is that the data has already been removed from the log
          // and we want to ensure that the follower can replicate correctly beginning from the leader's
          // start offset.
          (requestedEpoch, subsequentEpochs.head.startOffset)
        } else {
          // We have at least one previous epoch and one subsequent epoch. The result is the first
          // prior epoch and the starting offset of the first subsequent epoch.
          (previousEpochs.last.epoch, subsequentEpochs.head.startOffset)
        }
      }
    debug(s"Processed end offset request for epoch $requestedEpoch and returning epoch ${epochAndOffset._1} " +
      s"with end offset ${epochAndOffset._2} from epoch cache of size ${epochs.size}")
    epochAndOffset
  }
}
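To check these branches against B's cache from the example at the start, here is a simplified stand-alone model of the lookup. It is a sketch, not the Kafka class: `EpochEntry`, `UndefinedEpoch`, and `UndefinedOffset` are hypothetical names, the sentinel value -1 and B's log end offset of 8 (one past offset 7) are assumptions of the sketch, and locking and logging are left out.

```scala
// Hypothetical, lock-free model of the epoch lookup; not the Kafka implementation.
object EpochLookupSketch {
  final case class EpochEntry(epoch: Int, startOffset: Long)
  val UndefinedEpoch: Int = -1    // assumed sentinel
  val UndefinedOffset: Long = -1L // assumed sentinel

  // `epochs` must be sorted by epoch, oldest first.
  def endOffsetFor(epochs: Vector[EpochEntry], logEndOffset: Long, requested: Int): (Int, Long) =
    if (requested == UndefinedEpoch) (UndefinedEpoch, UndefinedOffset)
    else if (epochs.lastOption.exists(_.epoch == requested)) (requested, logEndOffset)
    else {
      val (later, earlierOrEqual) = epochs.partition(_.epoch > requested)
      if (later.isEmpty) (UndefinedEpoch, UndefinedOffset)                  // newer than anything cached
      else if (earlierOrEqual.isEmpty) (requested, later.head.startOffset)  // older than anything cached
      else (earlierOrEqual.last.epoch, later.head.startOffset)              // the common case
    }

  def main(args: Array[String]): Unit = {
    val cacheB = Vector(EpochEntry(1, 1L), EpochEntry(2, 5L)) // B's cache from the example
    val leoB = 8L                                             // assumed: next offset after 7
    println(endOffsetFor(cacheB, leoB, 1)) // A's request: epoch 1 ends where epoch 2 starts
    println(endOffsetFor(cacheB, leoB, 2)) // latest epoch: answered with the log end offset
    println(endOffsetFor(cacheB, leoB, 3)) // unknown future epoch
  }
}
```

For A's request (epoch 1) the model yields end offset 5, which is the offset A truncates to in the example.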