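Partition Overview
A Partition holds information about all replicas of one partition, including the local replica and the leader replica.
Each Partition maintains a leaderEpoch, a TopicPartition, a localBrokerId, a leaderReplicaIdOpt, the AR (all replicas) set, and the ISR (in-sync replicas) set.
TopicPartition identifies the partition id and the topic the partition belongs to.
leaderReplicaIdOpt records the id of the broker that hosts the leader replica.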
@volatile private var leaderEpoch: Int = LeaderAndIsr.initialLeaderEpoch - 1
// allReplicasMap includes both assigned replicas and the future replica if there is ongoing replica movement
private val allReplicasMap = new Pool[Int, Replica]
@volatile var inSyncReplicas: Set[Replica] = Set.empty[Replica]
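Given a replicaId, a Partition looks up the corresponding Replica in allReplicasMap (the AR set), using the id as the key.
Using localBrokerId as the key returns the local replica. If the lookup yields None, the partition has no replica on the local broker.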
def getReplica(replicaId: Int): Option[Replica] = Option(allReplicasMap.get(replicaId))
def localReplica: Option[Replica] = getReplica(localBrokerId)
def localReplicaOrException: Replica = localReplica.getOrElse {
  throw new ReplicaNotAvailableException(s"Replica for partition $topicPartition is not available " +
    s"on broker $localBrokerId")
}
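The lookup pattern above relies on Option(...) turning a null result into None. The following is a minimal sketch of that idea only, not the Kafka implementation: SimpleReplica and SimplePartition are hypothetical stand-ins, and a ConcurrentHashMap is assumed to behave like Pool in returning null for a missing key.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for Replica; only the broker id is modeled.
final case class SimpleReplica(brokerId: Int)

// Hypothetical, stripped-down Partition that only keeps the replica map.
final class SimplePartition(val localBrokerId: Int) {
  // ConcurrentHashMap.get returns null for a missing key (assumed to match Pool.get).
  private val allReplicasMap = new ConcurrentHashMap[Int, SimpleReplica]()

  def addReplica(replica: SimpleReplica): Unit =
    allReplicasMap.put(replica.brokerId, replica)

  // Option(null) evaluates to None, so callers never see a raw null.
  def getReplica(replicaId: Int): Option[SimpleReplica] =
    Option(allReplicasMap.get(replicaId))

  def localReplica: Option[SimpleReplica] = getReplica(localBrokerId)

  // Fails loudly when this broker hosts no replica of the partition.
  def localReplicaOrException: SimpleReplica = localReplica.getOrElse {
    throw new IllegalStateException(s"No replica on broker $localBrokerId")
  }
}
```

Returning Option from the lookups and offering a separate "or exception" variant lets each caller decide whether a missing local replica is an expected case or an error.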
Partition also provides a method for checking whether the local replica is the leader replica: it returns the local replica if so, and None otherwise.
def leaderReplicaIfLocal: Option[Replica] = {
  if (leaderReplicaIdOpt.contains(localBrokerId))
    localReplica
  else
    None
}
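The check above hinges on Option.contains, which is true only when the option is defined and holds the given value. A small sketch of the same shape follows; LeaderCheck and leaderIfLocal are hypothetical names introduced here for illustration and are not part of Kafka.

```scala
object LeaderCheck {
  // Hypothetical helper with the same shape as leaderReplicaIfLocal:
  // yield the local replica only when the recorded leader id equals the local broker id.
  def leaderIfLocal[A](leaderReplicaIdOpt: Option[Int],
                       localBrokerId: Int,
                       localReplica: Option[A]): Option[A] =
    if (leaderReplicaIdOpt.contains(localBrokerId)) localReplica else None
}
```

With this shape, an unknown leader (None) and a leader on another broker are handled by the same branch, so callers only need to match on Some or None.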
In addition, if the local replica exists, Partition provides a method that makes it the leader replica.
/**
 * Make the local replica the leader by resetting LogEndOffset for remote replicas (there could be old LogEndOffset
 * from the time when this broker was the leader last time) and setting the new leader and ISR.
 * If the leader replica id does not change, return false to indicate the replica manager.
 */
def makeLeader(controllerId: Int, partitionStateInfo: LeaderAndIsrRequest.PartitionState, correlationId: Int): Boolean = {
  val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
    val newAssignedReplicas = partitionStateInfo.basePartitionState.replicas.asScala.map(_.toInt)
    // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
    // to maintain the decision maker controller's epoch in the zookeeper path
    controllerEpoch = partitionStateInfo.basePartitionState.controllerEpoch
    // add replicas that are new
    val newInSyncReplicas = partitionStateInfo.basePartitionState.isr.asScala.map(r => getOrCreateReplica(r, partitionStateInfo.isNew)).toSet
    // remove assigned replicas that have been removed by the controller
    (assignedReplicas.map(_.brokerId) -- newAssignedReplicas).foreach(removeReplica)
    // replace the ISR with the new set
    inSyncReplicas = newInSyncReplicas
    newAssignedReplicas.foreach(id => getOrCreateReplica(id, partitionStateInfo.isNew))

    // the local replica becomes the leader replica; an exception is thrown if no local replica exists
    val leaderReplica = localReplicaOrException
    // the new leader epoch starts at the leader replica's LEO
    val leaderEpochStartOffset = leaderReplica.logEndOffset
    info(s"$topicPartition starts at Leader Epoch ${partitionStateInfo.basePartitionState.leaderEpoch} from " +
      s"offset $leaderEpochStartOffset. Previous Leader Epoch was: $leaderEpoch")

    // We cache the leader epoch here, persisting it only if it's local (hence having a log dir)
    leaderEpoch = partitionStateInfo.basePartitionState.leaderEpoch
    leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
    zkVersion = partitionStateInfo.basePartitionState.zkVersion

    // In the case of successive leader elections in a short time period, a follower may have
    // entries in its log from a later epoch than any entry in the new leader's log. In order
    // to ensure that these followers can truncate to the right offset, we must cache the new
    // leader epoch and the start offset since it should be larger than any epoch that a follower
    // would try to query.
    // In short: leaderEpoch and leaderEpochStartOffset are cached in the Log's LeaderEpochFileCache,
    // which followers consult (through the leader) to find the offset to truncate to.
    leaderReplica.log.foreach { log =>
      log.maybeAssignEpochStartOffset(leaderEpoch, leaderEpochStartOffset)
    }

    // true if the previous leader replica was not on this broker
    val isNewLeader = !leaderReplicaIdOpt.contains(localBrokerId)
    val curLeaderLogEndOffset = leaderReplica.logEndOffset
    val curTimeMs = time.milliseconds
    // initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
    (assignedReplicas - leaderReplica).foreach { replica =>
      val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
      replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
    }

    // only when leadership has moved to this broker
    if (isNewLeader) {
      // construct the high watermark metadata for the new leader replica
      leaderReplica.convertHWToLocalOffsetMetadata()
      // mark local replica as the leader after converting hw
      leaderReplicaIdOpt = Some(localBrokerId)
      // reset log end offset for remote replicas, i.e. the LEOs this Partition has recorded for the other replicas
      // assignedReplicas is the set of valid replicas in allReplicasMap
      assignedReplicas.filter(_.brokerId != localBrokerId).foreach(_.updateLogReadResult(LogReadResult.UnknownLogReadResult))
    }
    // we may need to increment high watermark since ISR could be down to 1
    (maybeIncrementLeaderHW(leaderReplica), isNewLeader)
  }
  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
  isNewLeader
}
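The leadership bookkeeping in makeLeader can be reduced to a small model. The sketch below is an illustration under simplifying assumptions, not the Kafka code: LeaderState and its members are hypothetical, remote replica state is reduced to a map of log end offsets, and locking is omitted (the real method runs under a write lock).

```scala
// Hypothetical, simplified model of the leader bookkeeping; not thread-safe.
final class LeaderState(val localBrokerId: Int) {
  private var leaderReplicaIdOpt: Option[Int] = None
  private var leaderEpoch: Int = -1
  // Last known log end offsets of remote replicas, keyed by broker id.
  private var remoteLogEndOffsets: Map[Int, Long] = Map.empty

  def currentLeaderEpoch: Int = leaderEpoch

  def knownLogEndOffset(brokerId: Int): Option[Long] = remoteLogEndOffsets.get(brokerId)

  def recordRemoteLogEndOffset(brokerId: Int, offset: Long): Unit =
    remoteLogEndOffsets = remoteLogEndOffsets.updated(brokerId, offset)

  // Returns true only when leadership moves to this broker.
  def makeLeader(newLeaderEpoch: Int): Boolean = {
    // The epoch is updated on every call, even if the leader stays the same.
    leaderEpoch = newLeaderEpoch
    val isNewLeader = !leaderReplicaIdOpt.contains(localBrokerId)
    if (isNewLeader) {
      leaderReplicaIdOpt = Some(localBrokerId)
      // Offsets remembered from an earlier term as leader may be stale, so forget them.
      remoteLogEndOffsets = Map.empty
    }
    isNewLeader
  }
}
```

Calling makeLeader twice in a row on the same broker returns true and then false: the second call only advances the epoch and keeps the remote offsets collected in the meantime.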
If the local replica is the leader replica, records can be appended to the local log.
def appendRecordsToLeader(records: MemoryRecords, isFromClient: Boolean, requiredAcks: Int = 0): LogAppendInfo = {
  val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
    // proceed only if the local replica is the leader replica
    leaderReplicaIfLocal match {
      case Some(leaderReplica) =>
        val log = leaderReplica.log.get
        // the configured minimum ISR size
        val minIsr = log.config.minInSyncReplicas
        // the current ISR size
        val inSyncSize = inSyncReplicas.size

        // Avoid writing to leader if there are not enough insync replicas to make it safe
        // i.e. throw if the current ISR is smaller than the configured minimum and acks is -1
        if (inSyncSize < minIsr && requiredAcks == -1) {
          throw new NotEnoughReplicasException(s"The size of the current ISR ${inSyncReplicas.map(_.brokerId)} " +
            s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
        }

        // append the records through the underlying Log
        val info = log.appendAsLeader(records, leaderEpoch = this.leaderEpoch, isFromClient,
          interBrokerProtocolVersion)
        // we may need to increment high watermark since ISR could be down to 1
        (info, maybeIncrementLeaderHW(leaderReplica))

      case None =>
        throw new NotLeaderForPartitionException("Leader not local for partition %s on broker %d"
          .format(topicPartition, localBrokerId))
    }
  }

  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
  else {
    // probably unblock some follower fetch requests since log end offset has been updated
    replicaManager.tryCompleteDelayedFetch(new TopicPartitionOperationKey(topicPartition))
  }

  info
}
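The guard at the top of appendRecordsToLeader can be isolated as a pure function. The following is a hedged sketch of that condition only: IsrCheck, NotEnoughReplicas and ensureEnoughReplicas are hypothetical names, and the ISR is represented as a plain set of broker ids.

```scala
object IsrCheck {
  // Hypothetical exception type standing in for NotEnoughReplicasException.
  final class NotEnoughReplicas(message: String) extends RuntimeException(message)

  // Rejects the write only when the producer asked for acks = -1
  // and the ISR is smaller than the configured minimum.
  def ensureEnoughReplicas(isr: Set[Int], minIsr: Int, requiredAcks: Int): Unit =
    if (isr.size < minIsr && requiredAcks == -1)
      throw new NotEnoughReplicas(s"ISR $isr is smaller than the required minimum of $minIsr")
}
```

Note that the two conditions are combined with a logical AND: a producer using acks = 0 or acks = 1 is not rejected by this check even when the ISR has shrunk below the minimum.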
Replica Overview
A Replica records the current log storage state of one partition on one broker.
A Replica is identified by two fields: brokerId and TopicPartition. The replica id is equal to the brokerId.
class Replica(val brokerId: Int,
              val topicPartition: TopicPartition,
              time: Time = Time.SYSTEM,
              initialHighWatermarkValue: Long = 0L,
              @volatile var log: Option[Log] = None) extends Logging {
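The log state a Replica tracks consists of the HW (high watermark), logStartOffset, logEndOffset, and lastFetchLeaderLogEndOffset. lastFetchLeaderLogEndOffset is the leader replica's LEO at the time the leader received the last FetchRequest from this follower.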
// the high watermark offset value, in non-leader replicas only its message offsets are kept
@volatile private[this] var highWatermarkMetadata = new LogOffsetMetadata(initialHighWatermarkValue)
// the log end offset value, kept in all replicas;
// for local replica it is the log's end offset, for remote replicas its value is only updated by follower fetch
@volatile private[this] var _logEndOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
// the log start offset value, kept in all replicas;
// for local replica it is the log's start offset, for remote replicas its value is only updated by follower fetch
@volatile private[this] var _logStartOffset = Log.UnknownLogStartOffset
// The log end offset value at the time the leader received the last FetchRequest from this follower
// This is used to determine the lastCaughtUpTimeMs of the follower
@volatile private[this] var lastFetchLeaderLogEndOffset = 0L
// The time when the leader received the last FetchRequest from this follower
// This is used to determine the lastCaughtUpTimeMs of the follower
@volatile private[this] var lastFetchTimeMs = 0L
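The comments above say that lastFetchLeaderLogEndOffset and lastFetchTimeMs are used to determine the follower's lastCaughtUpTimeMs. One plausible way to do that is sketched below; this is an assumption about the update rule, not code quoted from Kafka, and CaughtUpTracker and onFetch are hypothetical names.

```scala
// Hypothetical tracker for a single follower, kept on the leader side.
final class CaughtUpTracker {
  private var lastFetchLeaderLogEndOffset = 0L
  private var lastFetchTimeMs = 0L
  private var lastCaughtUpTimeMs = 0L

  def caughtUpTimeMs: Long = lastCaughtUpTimeMs

  // fetchOffset: the offset the follower asks for in this fetch.
  // leaderLogEndOffset: the leader's LEO when this fetch arrives.
  def onFetch(fetchOffset: Long, leaderLogEndOffset: Long, fetchTimeMs: Long): Unit = {
    if (fetchOffset >= leaderLogEndOffset)
      // The follower already has everything the leader has right now.
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, fetchTimeMs)
    else if (fetchOffset >= lastFetchLeaderLogEndOffset)
      // The follower has reached the leader's LEO as of the previous fetch,
      // so it was caught up at the time of that previous fetch.
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, lastFetchTimeMs)
    // Remember this fetch for the next comparison.
    lastFetchLeaderLogEndOffset = leaderLogEndOffset
    lastFetchTimeMs = fetchTimeMs
  }
}
```

Keeping the previous fetch's leader LEO and timestamp lets the leader credit a follower that keeps pace with a steadily growing log, even if it never fetches from the exact current end.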