A single broker may host replicas of many partitions. ReplicaManager manages all partition information within the scope of one broker, and it also executes the commands sent to it by the Kafka Controller, such as LeaderAndIsr and StopReplica.
1 ReplicaManager Data Structures
Suppose there are 5 broker nodes and 3 partitions, each replicated on three brokers (a leader plus two follower replicas):
broker1, broker2, broker3, broker4, broker5
broker1 currently hosts replicas of 3 partitions: p11, p12, p13
p11: (broker1, broker3, broker5)
broker1 is the leader; broker3 and broker5 are followers
p12: (broker2, broker1, broker4)
broker2 is the leader; broker1 and broker4 are followers
p13: (broker1, broker2, broker4)
broker1 is the leader; broker2 and broker4 are followers
2 Core Fields
logManager: LogManager. Handles log read and write requests, delegating the actual work to Log objects.
scheduler: Scheduler. Runs the periodic tasks in ReplicaManager; there are three in total: highwatermark-checkpoint, isr-expiration, and isr-change-propagation.
quotaManager: ReplicationQuotaManager. Replication quota management.
controllerEpoch: Int. The KafkaController epoch, incremented whenever a new Controller leader is elected. When ReplicaManager handles a request from the KafkaController, it first checks the controllerEpoch field carried in the request, so that requests from a stale Controller are rejected.
localBrokerId: Int. The id of the current broker.
allPartitions: Pool[(String, Int), Partition]. Information about every partition on this broker.
replicaFetcherManager: ReplicaFetcherManager. Manages the ReplicaFetcherThread threads, which send FetchRequests to leader replicas to pull messages, keeping followers in sync with their leaders.
highWatermarkCheckpoints: Map[String, OffsetCheckpoint]. Maps each log directory to an OffsetCheckpoint. The OffsetCheckpoint wraps the replication-offset-checkpoint file in that log directory, which records the highwatermark of every partition under the directory; the highwatermark-checkpoint task in ReplicaManager periodically rewrites this file.
isrChangeSet: mutable.Set[TopicAndPartition]. Records the partitions whose ISR has changed.
delayedProducePurgatory: DelayedOperationPurgatory[DelayedProduce]. The DelayedOperationPurgatory that manages DelayedProduce operations.
delayedFetchPurgatory: DelayedOperationPurgatory[DelayedFetch]. The DelayedOperationPurgatory that manages DelayedFetch operations.
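To make allPartitions concrete: for broker1 in the example from section 1, the pool would contain three entries, sketched below (the topic name "t" and the surrounding setup are hypothetical, for illustration only):
// Hypothetical sketch: the contents of broker1's allPartitions pool for the
// example in section 1 (the topic name "t" is made up for illustration)
val allPartitions = new Pool[(String, Int), Partition]()
allPartitions.put(("t", 1), p11) // local replica is the leader
allPartitions.put(("t", 2), p12) // local replica follows the leader on broker2
allPartitions.put(("t", 3), p13) // local replica is the leader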
3 Key Methods
3.1 Replica Role Transitions
The KafkaController sends a LeaderAndIsrRequest to the relevant brokers according to the state of each partition's leader and follower replicas; this request drives replica role transitions. A LeaderAndIsrRequest is first handled by the KafkaApis.handleLeaderAndIsrRequest method, whose core logic is implemented by ReplicaManager's becomeLeaderOrFollower method, which in turn relies on Partition's own makeLeader and makeFollower methods.
# First, a look at the LeaderAndIsrRequest and LeaderAndIsrResponse message formats, outlined below
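Roughly, based on this Kafka version's protocol definitions (layout abbreviated; take this as a sketch rather than the authoritative schema):
LeaderAndIsrRequest =>
  controller_id: int32
  controller_epoch: int32
  partition_states: [topic, partition, controller_epoch, leader, leader_epoch, isr: [int32], zk_version, replicas: [int32]]
  live_leaders: [id, host, port]
LeaderAndIsrResponse =>
  error_code: int16
  partitions: [topic, partition, error_code]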
def becomeLeaderOrFollower(correlationId: Int, leaderAndISRRequest: LeaderAndIsrRequest, metadataCache: MetadataCache,
    onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): BecomeLeaderOrFollowerResult = {
  leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
    stateChangeLogger.trace("Broker %d received LeaderAndIsr request %s correlation id %d from controller %d epoch %d for partition [%s,%d]"
      .format(localBrokerId, stateInfo, correlationId,
        leaderAndISRRequest.controllerId, leaderAndISRRequest.controllerEpoch, topicPartition.topic, topicPartition.partition))
  }
  replicaStateChangeLock synchronized {
    val responseMap = new mutable.HashMap[TopicPartition, Short]
    // If the request's controllerEpoch is lower than the broker's current controllerEpoch,
    // it comes from a stale controller: respond with STALE_CONTROLLER_EPOCH
    if (leaderAndISRRequest.controllerEpoch < controllerEpoch) {
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d since " +
          "its controller epoch %d is old. Latest known controller epoch is %d").format(localBrokerId, leaderAndISRRequest.controllerId,
          correlationId, leaderAndISRRequest.controllerEpoch, controllerEpoch))
      }
      BecomeLeaderOrFollowerResult(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
    } else {
      val controllerId = leaderAndISRRequest.controllerId
      controllerEpoch = leaderAndISRRequest.controllerEpoch
      // Check each partition's leader epoch first
      val partitionState = new mutable.HashMap[Partition, PartitionState]()
      // Iterate over the partition states carried in the LeaderAndIsr request
      leaderAndISRRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
        // Get or create the Partition for this topic and partition
        val partition = getOrCreatePartition(topicPartition.topic, topicPartition.partition)
        // The partition's current leader epoch
        val partitionLeaderEpoch = partition.getLeaderEpoch()
        // Only proceed if the local leader epoch is lower than the one in the request; otherwise the request is stale
        if (partitionLeaderEpoch < stateInfo.leaderEpoch) {
          // Check whether this partition is assigned to the current broker
          if (stateInfo.replicas.contains(config.brokerId))
            // Record the partition in partitionState: partition holds the current state,
            // stateInfo the new state carried in the request
            partitionState.put(partition, stateInfo)
          else {
            stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
              "epoch %d for partition [%s,%d] as itself is not in assigned replica list %s")
              .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
                topicPartition.topic, topicPartition.partition, stateInfo.replicas.asScala.mkString(",")))
            responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION.code)
          }
        } else {
          // Otherwise record the error code in the response
          stateChangeLogger.warn(("Broker %d ignoring LeaderAndIsr request from controller %d with correlation id %d " +
            "epoch %d for partition [%s,%d] since its associated leader epoch %d is not higher than the current leader epoch %d")
            .format(localBrokerId, controllerId, correlationId, leaderAndISRRequest.controllerEpoch,
              topicPartition.topic, topicPartition.partition, stateInfo.leaderEpoch, partitionLeaderEpoch))
          responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH.code)
        }
      }
      /** Decide which partitions become leaders and which become followers, then call makeLeaders and makeFollowers */
      // partitionState holds the partitions stored on this broker together with their new state
      // Filter out the partitions whose new leader is the current broker
      val partitionsTobeLeader = partitionState.filter { case (partition, stateInfo) =>
        stateInfo.leader == config.brokerId
      }
      // Removing the leaders from partitionState leaves the followers
      val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys
      // If partitionsTobeLeader is non-empty, call makeLeaders
      val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
        makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
      else
        Set.empty[Partition]
      // If partitionsToBeFollower is non-empty, call makeFollowers
      val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
        makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap, metadataCache)
      else
        Set.empty[Partition]
      // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
      // have been completely populated before starting the checkpointing thereby avoiding weird race conditions
      // After the first LeaderAndIsr request, start the highwatermark thread and mark it as initialized
      if (!hwThreadInitialized) {
        startHighWaterMarksCheckPointThread()
        hwThreadInitialized = true
      }
      // Have the ReplicaFetcherManager shut down idle fetcher threads
      replicaFetcherManager.shutdownIdleFetcherThreads()
      // Trigger the leadership-change callback
      onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
      BecomeLeaderOrFollowerResult(responseMap, Errors.NONE.code)
    }
  }
}
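The onLeadershipChange callback is supplied by the caller. For reference, the callback that KafkaApis.handleLeaderAndIsrRequest passes in looks roughly like the sketch below (simplified): when a partition of the internal offsets topic changes role, the GroupCoordinator loads or unloads the corresponding group metadata.
// Simplified sketch of the callback passed in by KafkaApis
def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
  // For __consumer_offsets partitions, the coordinator must load the group
  // metadata on the new leader and unload it on the new followers
  updatedLeaders.foreach { partition =>
    if (partition.topic == Topic.GroupMetadataTopicName)
      coordinator.handleGroupImmigration(partition.partitionId)
  }
  updatedFollowers.foreach { partition =>
    if (partition.topic == Topic.GroupMetadataTopicName)
      coordinator.handleGroupEmigration(partition.partitionId)
  }
}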
private def makeLeaders(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
correlationId: Int, responseMap: mutable.Map[TopicPartition, Short]): Set[Partition] = {
partitionState.foreach(state =>
stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"starting the become-leader transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId))))
for (partition <- partitionState.keys)
responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)
val partitionsToMakeLeaders: mutable.Set[Partition] = mutable.Set()
try {
// First stop the fetcher threads for these partitions
replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(p => new TopicPartition(p.topic, p.partitionId)))
// Update the partition info, making the local replica the leader
partitionState.foreach { case (partition, partitionStateInfo) =>
if (partition.makeLeader(controllerId, partitionStateInfo, correlationId))
partitionsToMakeLeaders += partition
else
stateChangeLogger.info(("Broker %d skipped the become-leader state change after marking its partition as leader with correlation id %d from " +
"controller %d epoch %d for partition %s since it is already the leader for the partition.")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(partition.topic, partition.partitionId)));
}
partitionsToMakeLeaders.foreach { partition =>
stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-leader request from controller " +
"%d epoch %d with correlation id %d for partition %s")
.format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
}
} catch {
case e: Throwable =>
partitionState.foreach { state =>
val errorMsg = ("Error on broker %d while processing LeaderAndIsr request correlationId %d received from controller %d" +
" epoch %d for partition %s").format(localBrokerId, correlationId, controllerId, epoch,
TopicAndPartition(state._1.topic, state._1.partitionId))
stateChangeLogger.error(errorMsg, e)
}
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"for the become-leader transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
partitionsToMakeLeaders
}
private def makeFollowers(controllerId: Int, epoch: Int, partitionState: Map[Partition, PartitionState],
correlationId: Int, responseMap: mutable.Map[TopicPartition, Short], metadataCache: MetadataCache) : Set[Partition] = {
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d handling LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"starting the become-follower transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
for (partition <- partitionState.keys)
responseMap.put(new TopicPartition(partition.topic, partition.partitionId), Errors.NONE.code)
val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()
try {
// TODO: Delete leaders from LeaderAndIsrRequest
partitionState.foreach { case (partition, partitionStateInfo) =>
// Check whether the new leader broker is alive
val newLeaderBrokerId = partitionStateInfo.leader
metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
// Only change partition state when the leader is available
case Some(leaderBroker) =>
// Call Partition.makeFollower to switch the partition's local replica to a follower
if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
partitionsToMakeFollower += partition
else
stateChangeLogger.info(("Broker %d skipped the become-follower state change after marking its partition as follower with correlation id %d from " +
"controller %d epoch %d for partition [%s,%d] since the new leader %d is the same as the old leader")
.format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
partition.topic, partition.partitionId, newLeaderBrokerId))
case None =>
// The leader broker should always be present in the metadata cache.
// If not, we should record the error message and abort the transition process for this partition
stateChangeLogger.error(("Broker %d received LeaderAndIsrRequest with correlation id %d from controller" +
" %d epoch %d for partition [%s,%d] but cannot become follower since the new leader %d is unavailable.")
.format(localBrokerId, correlationId, controllerId, partitionStateInfo.controllerEpoch,
partition.topic, partition.partitionId, newLeaderBrokerId))
// Create the local replica even if the leader is unavailable. This is required to ensure that we include
// the partition's high watermark in the checkpoint file (see KAFKA-1647)
partition.getOrCreateReplica()
}
}
// Stop the fetcher threads that were syncing from the old leader
replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(p => new TopicPartition(p.topic, p.partitionId)))
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d stopped fetchers as part of become-follower request from controller " +
"%d epoch %d with correlation id %d for partition %s")
.format(localBrokerId, controllerId, epoch, correlationId, TopicAndPartition(partition.topic, partition.partitionId)))
}
// The leader has changed, so messages between the HW and the LEO may differ between the old and the new leader, while messages below the HW are consistent; the log therefore has to be truncated to the HW
logManager.truncateTo(partitionsToMakeFollower.map(partition => (new TopicAndPartition(partition), partition.getOrCreateReplica().highWatermark.messageOffset)).toMap)
partitionsToMakeFollower.foreach { partition =>
val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topic, partition.partitionId)
tryCompleteDelayedProduce(topicPartitionOperationKey)
tryCompleteDelayedFetch(topicPartitionOperationKey)
}
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d truncated logs and checkpointed recovery boundaries for partition [%s,%d] as part of " +
"become-follower request with correlation id %d from controller %d epoch %d").format(localBrokerId,
partition.topic, partition.partitionId, correlationId, controllerId, epoch))
}
// Check whether the ReplicaManager is shutting down
if (isShuttingDown.get()) {
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d skipped the adding-fetcher step of the become-follower state change with correlation id %d from " +
"controller %d epoch %d for partition [%s,%d] since it is shutting down").format(localBrokerId, correlationId,
controllerId, epoch, partition.topic, partition.partitionId))
}
}
else {
// Restart the fetcher threads so the followers resume syncing from their new leaders
val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
new TopicPartition(partition.topic, partition.partitionId) -> BrokerAndInitialOffset(
metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get.getBrokerEndPoint(config.interBrokerSecurityProtocol),
partition.getReplica().get.logEndOffset.messageOffset)).toMap
replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(("Broker %d started fetcher to new leader as part of become-follower request from controller " +
"%d epoch %d with correlation id %d for partition [%s,%d]")
.format(localBrokerId, controllerId, epoch, correlationId, partition.topic, partition.partitionId))
}
}
} catch {
case e: Throwable =>
val errorMsg = ("Error on broker %d while processing LeaderAndIsr request with correlationId %d received from controller %d " +
"epoch %d").format(localBrokerId, correlationId, controllerId, epoch)
stateChangeLogger.error(errorMsg, e)
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionState.foreach { state =>
stateChangeLogger.trace(("Broker %d completed LeaderAndIsr request correlationId %d from controller %d epoch %d " +
"for the become-follower transition for partition %s")
.format(localBrokerId, correlationId, controllerId, epoch, TopicAndPartition(state._1.topic, state._1.partitionId)))
}
partitionsToMakeFollower
}
3.2 Appending and Reading Messages
private def appendToLocalLog(internalTopicsAllowed: Boolean, messagesPerPartition: Map[TopicPartition, MessageSet],
requiredAcks: Short): Map[TopicPartition, LogAppendResult] = {
trace("Append [%s] to local log ".format(messagesPerPartition))
messagesPerPartition.map { case (topicPartition, messages) =>
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).totalProduceRequestRate.mark()
BrokerTopicStats.getBrokerAllTopicsStats().totalProduceRequestRate.mark()
// Appending to internal topics is rejected unless explicitly allowed
if (Topic.isInternal(topicPartition.topic) && !internalTopicsAllowed) {
(topicPartition, LogAppendResult(
LogAppendInfo.UnknownLogAppendInfo,
Some(new InvalidTopicException("Cannot append to internal topic %s".format(topicPartition.topic)))))
} else {
try {
// Look up the corresponding Partition object among all partitions on this broker
val partitionOpt = getPartition(topicPartition.topic, topicPartition.partition)
val info = partitionOpt match {
case Some(partition) =>
// Call Partition.appendMessagesToLeader to write the messages to the log
partition.appendMessagesToLeader(messages.asInstanceOf[ByteBufferMessageSet], requiredAcks)
case None => throw new UnknownTopicOrPartitionException("Partition %s doesn't exist on %d"
.format(topicPartition, localBrokerId))
}
val numAppendedMessages =
if (info.firstOffset == -1L || info.lastOffset == -1L)
0
else
info.lastOffset - info.firstOffset + 1
// update stats for successfully appended bytes and messages as bytesInRate and messageInRate
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).bytesInRate.mark(messages.sizeInBytes)
BrokerTopicStats.getBrokerAllTopicsStats.bytesInRate.mark(messages.sizeInBytes)
BrokerTopicStats.getBrokerTopicStats(topicPartition.topic).messagesInRate.mark(numAppendedMessages)
BrokerTopicStats.getBrokerAllTopicsStats.messagesInRate.mark(numAppendedMessages)
trace("%d bytes written to log %s-%d beginning at offset %d and ending at offset %d"
.format(messages.sizeInBytes, topicPartition.topic, topicPartition.partition, info.firstOffset, info.lastOffset))
(topicPartition, LogAppendResult(info))
} catch {
// omitted
}
}
}
}
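appendToLocalLog is driven by ReplicaManager.appendMessages, which applies the producer's acks setting: for acks=-1 the response is parked in the produce purgatory until the ISR has replicated the messages. A condensed sketch of that wrapper (validation and error handling omitted; treat it as an outline, not the exact source):
def appendMessages(timeout: Long, requiredAcks: Short, internalTopicsAllowed: Boolean,
                   messagesPerPartition: Map[TopicPartition, MessageSet],
                   responseCallback: Map[TopicPartition, PartitionResponse] => Unit) {
  // Write the messages to the local leader logs
  val localProduceResults = appendToLocalLog(internalTopicsAllowed, messagesPerPartition, requiredAcks)
  val produceStatus = localProduceResults.map { case (topicPartition, result) =>
    topicPartition -> ProducePartitionStatus(
      result.info.lastOffset + 1, // offset the ISR must reach before the request completes
      new PartitionResponse(result.errorCode, result.info.firstOffset, result.info.logAppendTime))
  }
  if (requiredAcks == -1 && messagesPerPartition.nonEmpty) {
    // acks=-1: wait (up to `timeout` ms) until the ISR has replicated the messages,
    // by parking a DelayedProduce in the purgatory under the partitions' keys
    val delayedProduce = new DelayedProduce(timeout, ProduceMetadata(requiredAcks, produceStatus), this, responseCallback)
    val produceRequestKeys = messagesPerPartition.keys.map(new TopicPartitionOperationKey(_)).toSeq
    delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, produceRequestKeys)
  } else {
    // acks=0/1: respond immediately after the local write
    responseCallback(produceStatus.mapValues(_.responseStatus))
  }
}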
fetchMessages reads messages from the leader replica and waits until enough data has accumulated, until the timeout expires, or until the request can otherwise be answered:
def fetchMessages(timeout: Long, replicaId: Int, fetchMinBytes: Int, fetchMaxBytes: Int, hardMaxBytesLimit: Boolean,
fetchInfos: Seq[(TopicAndPartition, PartitionFetchInfo)], quota: ReplicaQuota = UnboundedQuota,
responseCallback: Seq[(TopicAndPartition, FetchResponsePartitionData)] => Unit) {
val isFromFollower = replicaId >= 0
val fetchOnlyFromLeader: Boolean = replicaId != Request.DebuggingConsumerId
val fetchOnlyCommitted: Boolean = ! Request.isValidBrokerId(replicaId)
// Read from the local log
val logReadResults = readFromLocalLog(
replicaId = replicaId,
fetchOnlyFromLeader = fetchOnlyFromLeader,
readOnlyCommitted = fetchOnlyCommitted,
fetchMaxBytes = fetchMaxBytes,
hardMaxBytesLimit = hardMaxBytesLimit,
readPartitionInfo = fetchInfos,
quota = quota)
// If the fetch request comes from a follower, update the follower's state (e.g. its LEO)
if (Request.isValidBrokerId(replicaId))
/*
 * Main steps:
 * 1 the leader maintains per-follower state; update this follower's state, e.g. its LEO
 * 2 check whether the ISR needs to be expanded; if the ISR changes, record the change in ZooKeeper
 * 3 check whether the HighWatermark can be advanced
 * 4 check the DelayedProduce operations under the affected keys in delayedProducePurgatory and complete those that are now satisfied
 */
updateFollowerLogReadResults(replicaId, logReadResults)
// Collect the read result values
val logReadResultValues = logReadResults.map { case (_, v) => v }
// Total number of bytes read
val bytesReadable = logReadResultValues.map(_.info.messageSet.sizeInBytes).sum
// Check whether any of the reads hit an error
val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
errorIncurred || (readResult.errorCode != Errors.NONE.code))
/*
 * Decide whether the FetchResponse can be returned immediately:
 * 1 the caller does not want to wait (timeout <= 0)
 * 2 the FetchRequest does not specify any partitions to read
 * 3 enough data has already been accumulated
 * 4 an error occurred while reading the data (errorReadingData)
 */
if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
val fetchPartitionData = logReadResults.map { case (tp, result) =>
tp -> FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet)
}
// Invoke the response callback directly
responseCallback(fetchPartitionData)
} else {
// Otherwise build the per-partition fetch status
val fetchPartitionStatus = logReadResults.map { case (topicAndPartition, result) =>
val fetchInfo = fetchInfos.collectFirst {
case (tp, v) if tp == topicAndPartition => v
}.getOrElse(sys.error(s"Partition $topicAndPartition not found in fetchInfos"))
(topicAndPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
}
// Build the FetchMetadata object
val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus)
// Build a DelayedFetch object
val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, responseCallback)
// Create a list of (topic, partition) pairs to serve as keys for the delayed fetch operation
val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }
// Try to complete the request immediately; otherwise put it into the purgatory and watch it
delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
}
}
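For context, this is roughly how KafkaApis.handleFetchRequest invokes fetchMessages (a simplified sketch; authorization and version handling omitted, and the FetchRequest field names are from this version's Scala request class):
// Simplified sketch of the call site in KafkaApis.handleFetchRequest.
// sendResponseCallback is invoked either immediately or later, when the
// DelayedFetch parked in the purgatory completes or times out.
replicaManager.fetchMessages(
  timeout = fetchRequest.maxWait.toLong,
  replicaId = fetchRequest.replicaId, // a valid broker id only for follower fetches
  fetchMinBytes = fetchRequest.minBytes,
  fetchMaxBytes = fetchRequest.maxBytes,
  hardMaxBytesLimit = fetchRequest.versionId <= 2,
  fetchInfos = fetchRequest.requestInfo,
  quota = replicationQuota(fetchRequest),
  responseCallback = sendResponseCallback)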
Once every follower replica in the ISR has synchronized a message, Kafka considers the message committed and the HW can be advanced. That is why a Fetch request coming from a follower gets one extra processing step, updateFollowerLogReadResults:
# update the per-follower state that the leader replica maintains
# as the follower keeps fetching it eventually catches up with the leader, at which point the ISR may be expanded; if the ISR changes, record the new ISR in ZooKeeper
# check whether the highwatermark can be advanced
# check whether the DelayedOperations under the affected keys in delayedProducePurgatory are now satisfied, and complete them if so
private def updateFollowerLogReadResults(replicaId: Int, readResults: Seq[(TopicAndPartition, LogReadResult)]) {
debug("Recording follower broker %d log read results: %s ".format(replicaId, readResults))
// Iterate over the log read results
readResults.foreach { case (topicAndPartition, readResult) =>
getPartition(topicAndPartition.topic, topicAndPartition.partition) match {
case Some(partition) =>
// Partition#updateReplicaLogReadResult updates the follower replica's state and may try to expand the ISR
partition.updateReplicaLogReadResult(replicaId, readResult)
// Try to complete any pending DelayedProduce for this partition
tryCompleteDelayedProduce(new TopicPartitionOperationKey(topicAndPartition))
case None =>
warn("While recording the replica LEO, the partition %s hasn't been created.".format(topicAndPartition))
}
}
}
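The ISR expansion mentioned above happens inside Partition. A condensed sketch of Partition.maybeExpandIsr (simplified from this code path):
def maybeExpandIsr(replicaId: Int) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
    // Only the leader manages the ISR
    leaderReplicaIfLocal() match {
      case Some(leaderReplica) =>
        val replica = getReplica(replicaId).get
        val leaderHW = leaderReplica.highWatermark
        // A follower that has caught up to the leader's HW rejoins the ISR
        if (!inSyncReplicas.contains(replica) &&
            assignedReplicas.map(_.brokerId).contains(replicaId) &&
            replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
          val newInSyncReplicas = inSyncReplicas + replica
          // Record the expanded ISR in ZooKeeper and the local cache
          updateIsr(newInSyncReplicas)
          replicaManager.isrExpandRate.mark()
        }
        // The updated follower state may allow the HW to advance
        maybeIncrementLeaderHW(leaderReplica)
      case None => false // nothing to do if this broker is no longer the leader
    }
  }
  // Some delayed operations may now be completable
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}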
3.3 Message Synchronization
Synchronization between follower and leader replicas is implemented by the ReplicaFetcherManager component, which extends AbstractFetcherManager.
AbstractFetcherThread exposes the methods addPartitions and removePartitions to add entries to and remove entries from its partitionMap field; addPartitions also wakes up the fetcher thread so that syncing starts.
def addPartitions(partitionAndOffsets: Map[TopicPartition, Long]) {
partitionMapLock.lockInterruptibly()
try {
// Keep only the partitions that are not already being tracked
val newPartitionToState = partitionAndOffsets.filter { case (tp, _) =>
!partitionStates.contains(tp)
}.map { case (tp, offset) =>
val fetchState =
if (PartitionTopicInfo.isOffsetInvalid(offset)) new PartitionFetchState(handleOffsetOutOfRange(tp))
else new PartitionFetchState(offset)
tp -> fetchState
}
val existingPartitionToState = partitionStates.partitionStates.asScala.map { state =>
state.topicPartition -> state.value
}.toMap
partitionStates.set((existingPartitionToState ++ newPartitionToState).asJava)
partitionMapCond.signalAll() // wake up the fetcher thread so it starts syncing
} finally partitionMapLock.unlock()
}
def removePartitions(topicPartitions: Set[TopicPartition]) {
partitionMapLock.lockInterruptibly()
try {
topicPartitions.foreach { topicPartition =>
partitionStates.remove(topicPartition)
fetcherLagStats.unregister(topicPartition.topic, topicPartition.partition)
}
} finally partitionMapLock.unlock()
}
override def doWork() {
val fetchRequest = inLock(partitionMapLock) {
// Build the FetchRequest
val fetchRequest = buildFetchRequest(partitionStates.partitionStates.asScala.map { state =>
state.topicPartition -> state.value
})
if (fetchRequest.isEmpty) {
trace("There are no active partitions. Back off for %d ms before sending a fetch request".format(fetchBackOffMs))
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
fetchRequest
}
if (!fetchRequest.isEmpty)
// Send the FetchRequest and process the response
processFetchRequest(fetchRequest)
}
Processing the fetch request:
private def processFetchRequest(fetchRequest: REQ) {
val partitionsWithError = mutable.Set[TopicPartition]()
def updatePartitionsWithError(partition: TopicPartition): Unit = {
partitionsWithError += partition
partitionStates.moveToEnd(partition)
}
var responseData: Seq[(TopicPartition, PD)] = Seq.empty
try {
trace("Issuing to broker %d of fetch request %s".format(sourceBroker.id, fetchRequest))
// Send the FetchRequest and wait for the FetchResponse
responseData = fetch(fetchRequest)
} catch {
case t: Throwable =>
if (isRunning.get) {
warn(s"Error in fetch $fetchRequest", t)
inLock(partitionMapLock) {
partitionStates.partitionSet.asScala.foreach(updatePartitionsWithError)
// there is an error occurred while fetching partitions, sleep a while
// note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
// partition with error effectively doubling the delay. It would be good to improve this.
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
}
}
fetcherStats.requestRate.mark()
if (responseData.nonEmpty) { // process the fetch response
inLock(partitionMapLock) {
// Iterate over the response data for each partition
responseData.foreach { case (topicPartition, partitionData) =>
val topic = topicPartition.topic
val partitionId = topicPartition.partition
Option(partitionStates.stateValue(topicPartition)).foreach(currentPartitionFetchState =>
// Only act on the response if the requested offset still matches the current fetch state; the offset may have changed while the request was in flight
if (fetchRequest.offset(topicPartition) == currentPartitionFetchState.offset) {
Errors.forCode(partitionData.errorCode) match {
case Errors.NONE =>
try {
// The returned message set
val messages = partitionData.toByteBufferMessageSet
// The offset right after the last returned message, i.e. the next offset to fetch
val newOffset = messages.shallowIterator.toSeq.lastOption.map(_.nextOffset).getOrElse(
currentPartitionFetchState.offset)
fetcherLagStats.getAndMaybePut(topic, partitionId).lag = Math.max(0L, partitionData.highWatermark - newOffset)
// Append the messages fetched from the leader to the local log
processPartitionData(topicPartition, currentPartitionFetchState.offset, partitionData)
val validBytes = messages.validBytes // number of valid bytes in the response
if (validBytes > 0) {
// On success, advance the partition fetch state
partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
fetcherStats.byteRate.mark(validBytes)
}
} catch {
case ime: CorruptRecordException =>
// we log the error and continue. This ensures two things
// 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread down and cause other topic partition to also lag
// 2. If the message is corrupt due to a transient state in the log (truncation, partial writes can cause this), we simply continue and
// should get fixed in the subsequent fetches
logger.error("Found invalid messages during fetch for partition [" + topic + "," + partitionId + "] offset " + currentPartitionFetchState.offset + " error " + ime.getMessage)
updatePartitionsWithError(topicPartition);
case e: Throwable =>
throw new KafkaException("error processing data for partition [%s,%d] offset %d"
.format(topic, partitionId, currentPartitionFetchState.offset), e)
}
case Errors.OFFSET_OUT_OF_RANGE => // the follower requested an offset outside the leader's log range
try {
val newOffset = handleOffsetOutOfRange(topicPartition)
partitionStates.updateAndMoveToEnd(topicPartition, new PartitionFetchState(newOffset))
error("Current offset %d for partition [%s,%d] out of range; reset offset to %d"
.format(currentPartitionFetchState.offset, topic, partitionId, newOffset))
} catch {
case e: Throwable =>
error("Error getting offset for partition [%s,%d] to broker %d".format(topic, partitionId, sourceBroker.id), e)
updatePartitionsWithError(topicPartition)
}
case _ =>
if (isRunning.get) {
error("Error for partition [%s,%d] to broker %d:%s".format(topic, partitionId, sourceBroker.id,
partitionData.exception.get))
updatePartitionsWithError(topicPartition)
}
}
})
}
}
}
if (partitionsWithError.nonEmpty) {
debug("handling partitions with error for %s".format(partitionsWithError))
handlePartitionsWithErrors(partitionsWithError)
}
}
protected def fetch(fetchRequest: FetchRequest): Seq[(TopicPartition, PartitionData)] = {
// Send the fetch request
val clientResponse = sendRequest(ApiKeys.FETCH, Some(fetchRequestVersion), fetchRequest.underlying)
// Parse and return the FetchResponse
new FetchResponse(clientResponse.responseBody).responseData.asScala.toSeq.map { case (key, value) =>
key -> new PartitionData(value)
}
}
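When the leader answers with OFFSET_OUT_OF_RANGE, the handleOffsetOutOfRange hook decides where to resume fetching. In ReplicaFetcherThread it works roughly as sketched below (condensed; the unclean-leader-election checks are simplified): either the follower's log is ahead of the leader's and must be truncated back, or it has fallen below the leader's start offset and must start over from there.
// Condensed sketch of ReplicaFetcherThread.handleOffsetOutOfRange
def handleOffsetOutOfRange(topicPartition: TopicPartition): Long = {
  val replica = replicaMgr.getReplica(topicPartition.topic, topicPartition.partition).get
  val tp = TopicAndPartition(topicPartition.topic, topicPartition.partition)
  val leaderEndOffset = earliestOrLatestOffset(topicPartition, ListOffsetRequest.LATEST_TIMESTAMP, brokerConfig.brokerId)
  if (leaderEndOffset < replica.logEndOffset.messageOffset) {
    // The leader's log is shorter than ours (possible after an unclean leader
    // election): truncate our log to the leader's LEO and resume from there
    replicaMgr.logManager.truncateTo(Map(tp -> leaderEndOffset))
    leaderEndOffset
  } else {
    // Our fetch offset fell below the leader's start offset (e.g. this broker
    // was down long enough for retention to delete those segments on the
    // leader): wipe the local log and restart from the leader's earliest offset
    val leaderStartOffset = earliestOrLatestOffset(topicPartition, ListOffsetRequest.EARLIEST_TIMESTAMP, brokerConfig.brokerId)
    val offsetToFetch = Math.max(leaderStartOffset, replica.logEndOffset.messageOffset)
    if (leaderStartOffset > replica.logEndOffset.messageOffset)
      replicaMgr.logManager.truncateFullyAndStartAt(tp, leaderStartOffset)
    offsetToFetch
  }
}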
3.4 Shutting Down Replicas
When the KafkaController sends a StopReplicaRequest, the broker shuts down the specified replicas and, depending on a field in the request, may also delete the replicas' logs. This request is used both during partition replica reassignment and during broker shutdown; it does not always imply deleting the old replica and its log (it does not when a broker is simply being shut down, for example).
First, a look at the StopReplicaRequest and StopReplicaResponse message formats, outlined below.
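Roughly, based on this version's protocol definitions (layout abbreviated; a sketch rather than the authoritative schema):
StopReplicaRequest =>
  controller_id: int32
  controller_epoch: int32
  delete_partitions: boolean
  partitions: [topic, partition]
StopReplicaResponse =>
  error_code: int16
  partitions: [topic, partition, error_code]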
The request is first handled by the KafkaApis#handleStopReplicaRequest method:
def handleStopReplicaRequest(request: RequestChannel.Request) {
// Cast the request body to StopReplicaRequest
val stopReplicaRequest = request.body.asInstanceOf[StopReplicaRequest]
// Build the response header
val responseHeader = new ResponseHeader(request.header.correlationId)
val response =
if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
// Delegate to replicaManager#stopReplicas
val (result, error) = replicaManager.stopReplicas(stopReplicaRequest)
// Walk the results, handling the internal offsets topic specially
result.foreach { case (topicPartition, errorCode) =>
if (errorCode == Errors.NONE.code && stopReplicaRequest.deletePartitions() && topicPartition.topic == Topic.GroupMetadataTopicName) {
coordinator.handleGroupEmigration(topicPartition.partition)
}
}
// Build the response
new StopReplicaResponse(error, result.asInstanceOf[Map[TopicPartition, JShort]].asJava)
} else {
val result = stopReplicaRequest.partitions.asScala.map((_, new JShort(Errors.CLUSTER_AUTHORIZATION_FAILED.code))).toMap
new StopReplicaResponse(Errors.CLUSTER_AUTHORIZATION_FAILED.code, result.asJava)
}
requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, response)))
// Shut down idle replica fetcher threads
replicaManager.replicaFetcherManager.shutdownIdleFetcherThreads()
}
Next, ReplicaManager's stopReplicas method:
def stopReplicas(stopReplicaRequest: StopReplicaRequest): (mutable.Map[TopicPartition, Short], Short) = {
replicaStateChangeLock synchronized {
// Map from partition to error code
val responseMap = new collection.mutable.HashMap[TopicPartition, Short]
// If the request's controllerEpoch is lower than the broker's current controllerEpoch, the request comes from a stale controller: respond with STALE_CONTROLLER_EPOCH
if (stopReplicaRequest.controllerEpoch() < controllerEpoch) {
stateChangeLogger.warn("Broker %d received stop replica request from an old controller epoch %d. Latest known controller epoch is %d"
.format(localBrokerId, stopReplicaRequest.controllerEpoch, controllerEpoch))
(responseMap, Errors.STALE_CONTROLLER_EPOCH.code)
} else {
// The partitions named in the StopReplica request
val partitions = stopReplicaRequest.partitions.asScala
controllerEpoch = stopReplicaRequest.controllerEpoch
// First stop the fetcher threads for all requested partitions
replicaFetcherManager.removeFetcherForPartitions(partitions)
// Then iterate over the partitions and stop each replica from serving
for (topicPartition <- partitions) {
val errorCode = stopReplica(topicPartition.topic, topicPartition.partition, stopReplicaRequest.deletePartitions)
responseMap.put(topicPartition, errorCode)
}
(responseMap, Errors.NONE.code)
}
}
}
def stopReplica(topic: String, partitionId: Int, deletePartition: Boolean): Short = {
val errorCode = Errors.NONE.code
getPartition(topic, partitionId) match {
case Some(partition) =>
// Check whether the request also asks to delete the partition and its log
if (deletePartition) {
// Remove the partition from allPartitions
val removedPartition = allPartitions.remove((topic, partitionId))
if (removedPartition != null) {
removedPartition.delete() // this will delete the local log
val topicHasPartitions = allPartitions.keys.exists { case (t, _) => topic == t }
if (!topicHasPartitions)
BrokerTopicStats.removeMetrics(topic)
}
}
case None =>
// The partition does not exist on this broker; if requested, delete its log directly
if (deletePartition) {
val topicAndPartition = TopicAndPartition(topic, partitionId)
if (logManager.getLog(topicAndPartition).isDefined) {
logManager.deleteLog(topicAndPartition)
}
}
}
errorCode
}
3.5 Scheduled Tasks in ReplicaManager
ReplicaManager has three scheduled tasks in total: highwatermark-checkpoint, isr-expiration, and isr-change-propagation.
highwatermark-checkpoint: periodically records each replica's HW to the replication-offset-checkpoint file in the replica's log directory
def startHighWaterMarksCheckPointThread() = {
if(highWatermarkCheckPointThreadStarted.compareAndSet(false, true))
scheduler.schedule("highwatermark-checkpoint", checkpointHighWatermarks, period = config.replicaHighWatermarkCheckpointIntervalMs, unit = TimeUnit.MILLISECONDS)
}
def checkpointHighWatermarks() {
// Collect the local replica of every partition on this broker
val replicas = allPartitions.values.flatMap(_.getReplica(config.brokerId))
// Group the replicas by the log directory that holds them
val replicasByDir = replicas.filter(_.log.isDefined).groupBy(_.log.get.dir.getParentFile.getAbsolutePath)
// Iterate over the log directories
for ((dir, reps) <- replicasByDir) {
// Collect the HW of every replica under this log directory
val hwms = reps.map(r => new TopicAndPartition(r) -> r.highWatermark.messageOffset).toMap
try {
// Rewrite the replication-offset-checkpoint file in this log directory
highWatermarkCheckpoints(dir).write(hwms)
} catch {
case e: IOException =>
fatal("Error writing to highwatermark file: ", e)
Runtime.getRuntime().halt(1)
}
}
}
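The checkpoint itself is a small plain-text file: a format version line, an entry count, then one "topic partition highwatermark" line per partition. For the example layout from section 1 it could look roughly like this (topic name and offsets are made up):
0
3
t 1 4000
t 2 3500
t 3 4200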
def startup() {
// start ISR expiration thread
scheduler.schedule("isr-expiration", maybeShrinkIsr, period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges, period = 2500L, unit = TimeUnit.MILLISECONDS)
}
isr-expiration: periodically calls maybeShrinkIsr to check whether each partition's ISR needs to shrink
private def maybeShrinkIsr(): Unit = {
trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
allPartitions.values.foreach(partition => partition.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
// First check whether the current broker is the leader of this partition; only the leader manages the ISR
leaderReplicaIfLocal() match {
// This broker is the leader
case Some(leaderReplica) =>
// Find the out-of-sync replicas, i.e. those that have fallen too far behind the leader
val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
// If any out-of-sync replicas exist
if (outOfSyncReplicas.nonEmpty) {
// Remove the out-of-sync replicas from the ISR
val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
assert(newInSyncReplicas.nonEmpty)
info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
// Update the ISR in ZooKeeper and in the local cache
updateIsr(newInSyncReplicas)
// we may need to increment high watermark since ISR could be down to 1
// with a replica removed from the ISR, the remaining replicas may all be caught up, so the high watermark may now advance
replicaManager.isrShrinkRate.mark()
maybeIncrementLeaderHW(leaderReplica)
} else {
false
}
case None => false // do nothing if no longer leader
}
}
// If the HW advanced, try to complete pending delayed operations
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
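Which replicas count as out of sync is decided by Partition.getOutOfSyncReplicas; a condensed sketch:
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
  // The leader itself is never considered out of sync
  val candidateReplicas = inSyncReplicas - leaderReplica
  // A follower is lagging if it has not fully caught up to the leader's LEO
  // for more than maxLagMs (replica.lag.time.max.ms); this single condition
  // covers both stuck followers and slow followers
  val laggingReplicas = candidateReplicas.filter(r =>
    (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
  laggingReplicas
}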
isr-change-propagation: periodically records the partitions whose ISR has changed to ZooKeeper
/*
 * This task periodically checks whether ISR changes need to be propagated. Propagation happens when:
 * 1 the ISR has changed and the change has not been propagated yet, and
 * 2 no ISR change has been seen for 5 seconds, or more than 60 seconds have passed since the last propagation
 * This lets an occasional ISR change be propagated within a few seconds, while protecting the controller and the
 * other brokers from a flood of ISR change notifications.
 * The function writes the partitions whose ISR changed to the /isr_change_notification/isr_change_ sequential
 * nodes in ZooKeeper.
 */
def maybePropagateIsrChanges() {
val now = System.currentTimeMillis()
isrChangeSet synchronized {
if (isrChangeSet.nonEmpty &&
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
ReplicationUtils.propagateIsrChanges(zkUtils, isrChangeSet)
isrChangeSet.clear()
lastIsrPropagationMs.set(now)
}
}
}
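ReplicationUtils.propagateIsrChanges writes the accumulated change set as a persistent sequential znode under /isr_change_notification; the payload is a small JSON document listing the affected partitions, roughly like this (topic name and sequence number are made up):
// znode: /isr_change_notification/isr_change_0000000042
{"version":1,"partitions":[{"topic":"t","partition":1},{"topic":"t","partition":2}]}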