Kafka Source Code Analysis: Replica Management with ReplicaManager

Latest recommended article published on 2022-01-30 11:37:29

隔壁老杨hongs

Tags: kafka 0.9.0 source analysis, kafka log management source, kafka replica management source

Article link: https://blog.csdn.net/u014393917/article/details/52043040

ReplicaManager

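This component manages the replica information of every partition in Kafka. An instance depends on the kafkaScheduler and logManager instances. It handles appending and reading messages, and it also handles data synchronization between replicas.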

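Instance creation and startup

Instance creation

replicaManager = new ReplicaManager(config, metrics, time, kafkaMetricsTime,
                                    zkUtils, kafkaScheduler, logManager,
                                    isShuttingDown)
replicaManager.startup()

First, look at what happens when the instance is constructed.

A controller epoch value is initialized here. It records the epoch of the controller that last changed the leader, and it is updated when a leader change is processed.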

/* epoch of the controller that last changed the leader */
@volatile var controllerEpoch: Int = KafkaController.InitialControllerEpoch - 1
private val localBrokerId = config.brokerId
private val allPartitions = new Pool[(String, Int), Partition]
private val replicaStateChangeLock = new Object

A manager for the threads that synchronize partition replica data (the replica fetcher threads) is created here.
val replicaFetcherManager = new ReplicaFetcherManager(config, this, metrics, jTime,
                                                      threadNamePrefix)
private val highWatermarkCheckPointThreadStarted = new AtomicBoolean(false)

An OffsetCheckpoint is created here for the replication-offset-checkpoint file in each log directory. The file records the most recently checkpointed high watermark offset of every partition stored in that directory.
val highWatermarkCheckpoints = config.logDirs.map(dir =>
  (new File(dir).getAbsolutePath,
   new OffsetCheckpoint(new File(dir, ReplicaManager.HighWatermarkFilename)))).toMap
private var hwThreadInitialized = false
this.logIdent = "[Replica Manager on Broker " + localBrokerId + "]: "
val stateChangeLogger = KafkaController.stateChangeLogger
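To make the checkpoint file more concrete, here is a minimal sketch of a parser for it. It is not Kafka's OffsetCheckpoint class. The object name CheckpointSketch is hypothetical, and the layout it assumes (a version line containing 0, a line with the entry count, then one "topic partition offset" line per partition) is an assumption about the 0.9 format that should be checked against the real source.

```scala
// Hypothetical helper for illustration only; not part of Kafka.
object CheckpointSketch {
  // Assumed layout: line 0 = version (0), line 1 = entry count,
  // remaining lines = "topic partition offset" separated by whitespace.
  def parse(lines: Seq[String]): Map[(String, Int), Long] = {
    require(lines.size >= 2, "expected a version line and a count line")
    val version = lines(0).trim.toInt
    require(version == 0, s"unsupported version $version")
    val expected = lines(1).trim.toInt
    val entries = lines.drop(2).filter(_.trim.nonEmpty).map { line =>
      val parts = line.trim.split("\\s+")
      require(parts.length == 3, s"malformed entry: $line")
      (parts(0), parts(1).toInt) -> parts(2).toLong
    }.toMap
    require(entries.size == expected, s"expected $expected entries, found ${entries.size}")
    entries
  }
}
```

The result maps each (topic, partition) pair to its checkpointed high watermark, which is the shape of information ReplicaManager needs when it restores partitions after a restart.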

A set is defined here to hold the partitions whose ISR has changed and has not yet been propagated.
private val isrChangeSet: mutable.Set[TopicAndPartition] =
  new mutable.HashSet[TopicAndPartition]()
private val lastIsrChangeMs = new AtomicLong(System.currentTimeMillis())
private val lastIsrPropagationMs = new AtomicLong(System.currentTimeMillis())

The producer.purgatory.purge.interval.requests setting (default 1000) is read here. The purgatory holds DelayedProduce operations, which track whether a produce request has been fully acknowledged when the producer's acks setting is -1.
val delayedProducePurgatory = new DelayedOperationPurgatory[DelayedProduce](
  purgatoryName = "Produce", config.brokerId,
  config.producerPurgatoryPurgeIntervalRequests)

The fetch.purgatory.purge.interval.requests setting (default 1000) is read here for the purgatory that holds DelayedFetch operations.
val delayedFetchPurgatory = new DelayedOperationPurgatory[DelayedFetch](
  purgatoryName = "Fetch", config.brokerId,
  config.fetchPurgatoryPurgeIntervalRequests)

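Starting the ReplicaManager instance:

Two background scheduled tasks are created here. The first periodically checks each partition's ISR for replicas that have fallen out of sync. Its period is set by replica.lag.time.max.ms, which defaults to 10 seconds.

The second periodically notifies the corresponding ZooKeeper path that the ISR of some partitions has changed. Its period is 2.5 seconds.

def startup() {
  // start ISR expiration thread
  scheduler.schedule("isr-expiration", maybeShrinkIsr,
                     period = config.replicaLagTimeMaxMs, unit = TimeUnit.MILLISECONDS)
  scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges,
                     period = 2500L, unit = TimeUnit.MILLISECONDS)
}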
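The fixed-period scheduling used by startup() can be imitated with the JDK's scheduled executor. The sketch below is not KafkaScheduler; SchedulerSketch and runPeriodically are hypothetical names, and the CountDownLatch exists only so the example can stop after a known number of runs.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Hypothetical stand-in for a periodic background task; not Kafka code.
object SchedulerSketch {
  // Runs `task` every `periodMs` milliseconds and returns true if it
  // completed `runs` executions within `waitMs` milliseconds.
  def runPeriodically(periodMs: Long, runs: Int, waitMs: Long)(task: () => Unit): Boolean = {
    val executor = Executors.newScheduledThreadPool(1)
    val latch = new CountDownLatch(runs)
    val runnable = new Runnable {
      override def run(): Unit = { task(); latch.countDown() }
    }
    try {
      executor.scheduleAtFixedRate(runnable, 0L, periodMs, TimeUnit.MILLISECONDS)
      latch.await(waitMs, TimeUnit.MILLISECONDS)
    } finally {
      executor.shutdownNow() // always release the worker thread
    }
  }
}
```

In this model, an ISR expiration check would be a task with a period equal to the replica lag limit, and ISR change propagation would be a second task with a period of 2500 ms.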

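Periodic replica expiration check on the partition leader:

After ReplicaManager starts, the maybeShrinkIsr function is called periodically to do this work.

When a follower replica fetches from the leader replica and has read up to the end offset of the leader's log, the follower is fully caught up, and its last caught-up time (its "heartbeat") is updated. This function periodically checks whether that time is older than the configured limit. If it is, the replica is removed from the ISR.

private def maybeShrinkIsr(): Unit = {
  trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")

This iterates over all partitions assigned to the current broker and calls each partition's own handler. The allPartitions pool in ReplicaManager holds every partition on the current broker.
  allPartitions.values.foreach(partition =>
    partition.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}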

Next, look at the ISR expiration check inside Partition:

def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {

This first checks whether the leader of the partition is the current broker. If it is not, the value obtained here is false; otherwise the leaderReplica branch runs.
    leaderReplicaIfLocal() match {
      case Some(leaderReplica) =>

When the current broker is the leader of the partition, all replicas of the partition are checked for expired ones, that is, replicas that have not updated their last caught-up time within the given limit. The result is the set of expired replicas.
        val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
        if(outOfSyncReplicas.size > 0) {

Some replicas of the partition have expired, so the new set of replicas that are still in sync is computed.
          val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
          assert(newInSyncReplicas.size > 0)
          info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
            inSyncReplicas.map(_.brokerId).mkString(","),
            newInSyncReplicas.map(_.brokerId).mkString(",")))

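The updateIsr function updates the ISR after the replica set changes. Its steps are:

1. Write the latest ISR record to the ZooKeeper path /brokers/topics/topicname/partitions/partitionid/state.

2. Add the affected TopicAndPartition to the isrChangeSet collection in ReplicaManager, marking that the ISR of this partition has been modified.

3. Update inSyncReplicas to the new replica set.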
          // update ISR in zk and in cache
          updateIsr(newInSyncReplicas)
          // we may need to increment high watermark since ISR could be down to 1

          replicaManager.isrShrinkRate.mark()

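Here the leaderReplica's current highWatermark is compared with the smallest log end offset among the in-sync replicas of the partition. If the offset of the leader's high watermark is lower than that value, or if the segment holding the high watermark has a smaller base offset than the segment holding that value, the high watermark is advanced and the function returns true. Otherwise it returns false.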
          maybeIncrementLeaderHW(leaderReplica)
        } else {
          false
        }

      case None => false // do nothing if no longer leader
    }
  }

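If the value returned above is true, the leader's high watermark was advanced, so the delayed operations currently pending for this partition on this replica are tried again.

The pending operations include both fetch and produce operations.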
  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}
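The shrink logic above can be modelled without any Kafka classes. The following is a simplified sketch: ReplicaModel, IsrSketch and their members are hypothetical names, offsets are plain Long values rather than LogOffsetMetadata, and the segment comparison is left out.

```scala
// Hypothetical, simplified model of a replica; not Kafka's Replica class.
final case class ReplicaModel(brokerId: Int, logEndOffset: Long, lastCaughtUpTimeMs: Long)

object IsrSketch {
  // Followers whose last caught-up time is older than maxLagMs.
  // The leader itself is never a candidate for removal.
  def outOfSync(isr: Set[ReplicaModel], leaderId: Int, nowMs: Long, maxLagMs: Long): Set[ReplicaModel] =
    isr.filter(r => r.brokerId != leaderId && nowMs - r.lastCaughtUpTimeMs > maxLagMs)

  // The candidate high watermark is the smallest log end offset in the ISR.
  def candidateHighWatermark(isr: Set[ReplicaModel]): Long = {
    require(isr.nonEmpty, "the ISR must at least contain the leader")
    isr.map(_.logEndOffset).min
  }

  // Returns the shrunk ISR and the resulting high watermark,
  // which never moves below the previous value.
  def shrink(isr: Set[ReplicaModel], leaderId: Int, oldHw: Long,
             nowMs: Long, maxLagMs: Long): (Set[ReplicaModel], Long) = {
    val newIsr = isr -- outOfSync(isr, leaderId, nowMs, maxLagMs)
    (newIsr, math.max(oldHw, candidateHighWatermark(newIsr)))
  }
}
```

The model shows why shrinking the ISR can unblock delayed requests: once a slow follower leaves the ISR, the smallest log end offset among the remaining replicas may be higher, so the high watermark can advance.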

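Propagating ISR changes to ZooKeeper

A timer runs the maybePropagateIsrChanges function every 2500 ms.

When the periodically executed maybeShrinkIsr function finds a replica whose last caught-up time has expired, it calls the partition's updateIsr, which records the partition whose ISR changed in the isrChangeSet collection.

maybePropagateIsrChanges then writes the partitions whose ISR changed to an /isr_change_notification/isr_change_ node in ZooKeeper.

The update happens when isrChangeSet is not empty and either more than 5 seconds have passed since the last ISR change or more than 60 seconds have passed since the changes were last written to ZooKeeper.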
def maybePropagateIsrChanges() {
  val now = System.currentTimeMillis()
  isrChangeSet synchronized {
    if (isrChangeSet.nonEmpty &&
      (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
        lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
      ReplicationUtils.propagateIsrChanges(zkUtils, isrChangeSet)
      isrChangeSet.clear()
      lastIsrPropagationMs.set(now)
    }
  }
}
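The propagation condition is easier to reason about as a pure function. This is a sketch under the values quoted in the text (5 seconds and 60 seconds); IsrPropagationSketch and shouldPropagate are hypothetical names, not part of Kafka.

```scala
// Hypothetical restatement of the condition in maybePropagateIsrChanges.
object IsrPropagationSketch {
  val BlackOutMs = 5000L  // value the text gives for IsrChangePropagationBlackOut
  val IntervalMs = 60000L // value the text gives for IsrChangePropagationInterval

  // pending: number of partitions waiting in the change set.
  def shouldPropagate(pending: Int, lastChangeMs: Long,
                      lastPropagationMs: Long, nowMs: Long): Boolean =
    pending > 0 &&
      (lastChangeMs + BlackOutMs < nowMs || lastPropagationMs + IntervalMs < nowMs)
}
```

The blackout window batches bursts of ISR changes into one ZooKeeper write, while the 60 second interval bounds how long a change can wait if changes keep arriving.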

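Handling replica leader changes

After the Kafka controller node processes a change of partition leader, it sends a LeaderAndIsr request to the affected broker nodes. This request covers what a replica must do when it becomes the leader or changes from leader to follower. The details are covered in the KafkaApis analysis of the LeaderAndIsr request, so only the main flow is described here:

If a replica switches from follower to leader, the thread that synchronizes data for this replica is stopped, and the partition's leaderId is set to the brokerId of the partition's leader node.

If a replica switches from leader to follower, the partition is added to this node's data synchronization threads, and the partition's leaderId is set to the id of the current leader node.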
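The two transitions can be sketched as a tiny state model. This is an illustration only: TransitionSketch, PartitionState and becomeLeaderOrFollower here are hypothetical and ignore epochs, ISR contents and log truncation, all of which the real request handling has to deal with.

```scala
// Hypothetical per-partition state: who leads, and whether this broker fetches.
final case class PartitionState(leaderId: Option[Int], fetching: Boolean)

class TransitionSketch(localBrokerId: Int) {
  private var partitions = Map.empty[(String, Int), PartitionState]

  // Applies a leader change for one partition and returns the new local state.
  def becomeLeaderOrFollower(tp: (String, Int), newLeaderId: Int): PartitionState = {
    // A leader does not fetch; a follower fetches from the new leader.
    val state = PartitionState(Some(newLeaderId), fetching = newLeaderId != localBrokerId)
    partitions = partitions.updated(tp, state)
    state
  }

  def stateOf(tp: (String, Int)): Option[PartitionState] = partitions.get(tp)
}
```

For a broker with id 1, becoming leader of a partition turns fetching off, and a later change to leader 2 turns it back on with leaderId set to 2.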

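Handling message appends

Start with the function's definition. The last parameter, responseCallback, is a function supplied by the caller, mainly to handle the status after the messages are appended. For a Produce request, for example, the caller combines these results with any authorization failures to decide whether failure information must be sent back to the client. The third parameter states whether the operation is internal, that is, whether writes to internal topics are allowed. For a normal Produce this parameter is false (the clientId is not __admin_client). This operation runs when a producer writes messages to a partition, and also for a consumer's syncGroup and commitOffset when a topic is used to record group and offset information.

appendMessages function

When a producer writes data to the broker, the messages are appended through appendMessages in replicaManager. The last parameter of this function is a callback, namely the function defined in the produce request handling described above, which reports the outcome back to the client according to the acks setting.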
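The callback style can be shown with a small in-memory model. It is a sketch, not the ReplicaManager code: AppendSketch and PartitionResult are hypothetical, messages are plain strings, and the error code values (0 for success, 21 for invalid acks) are used for illustration and should be checked against Kafka's error definitions.

```scala
// Hypothetical per-partition result handed to the callback.
final case class PartitionResult(errorCode: Short, baseOffset: Long)

object AppendSketch {
  val NoError: Short = 0              // illustrative value
  val InvalidRequiredAcks: Short = 21 // illustrative value

  // Only -1 (all in-sync replicas), 1 (leader only) and 0 (no ack) are accepted.
  def isValidRequiredAcks(acks: Short): Boolean = acks == -1 || acks == 1 || acks == 0
}

class AppendSketch {
  // In-memory stand-in for the local logs, keyed by (topic, partition).
  private var logs = Map.empty[(String, Int), Vector[String]]

  def appendMessages(acks: Short, batches: Map[(String, Int), Seq[String]])
                    (responseCallback: Map[(String, Int), PartitionResult] => Unit): Unit = {
    if (AppendSketch.isValidRequiredAcks(acks)) {
      val results = batches.map { case (tp, msgs) =>
        val log = logs.getOrElse(tp, Vector.empty)
        logs = logs.updated(tp, log ++ msgs)
        // The base offset is the position of the first appended message.
        tp -> PartitionResult(AppendSketch.NoError, log.size.toLong)
      }
      responseCallback(results)
    } else {
      // Invalid acks: nothing is appended, every partition reports an error.
      responseCallback(batches.map { case (tp, _) =>
        tp -> PartitionResult(AppendSketch.InvalidRequiredAcks, -1L)
      })
    }
  }
}
```

Passing the response handling in as a function lets the caller decide how to answer the client while the append path stays unaware of the request type. The sketch always calls back immediately; it does not model waiting for replicas when acks is -1.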

def appendMessages(timeout: Long,
                   requiredAcks: Short,
                   internalTopicsAllowed: Boolean,
                   messagesPerPartition: Map[TopicAndPartition, MessageSet],
        responseCallback: Map[TopicAndPartition, ProducerResponseStatus] => Unit) {

  if (isValidRequiredAcks(requiredAcks)) {

If the acks value is valid, the appendToLocalLog function writes each partition's messages to its log and returns the status of the writes.
    val sTime = SystemTime.milliseconds
    val localProduceResults = appendToLocalLog(internalTopicsAllowed,
                                               messagesPerPartition, requiredAcks)
    debug("Produce to local log in %d ms".format(SystemTime.milliseconds - sTime))

The produce status is built here. For each partition it records the error code of the write, the first offset of the appended messages, and the offset that must be reached before the request is considered complete (the last offset plus one).
    val produceStatus = localProduceResults.map {
      case (topicAndPartition, result) =>
        topicAndPartition ->
          ProducePartitionStatus(
            result.info.lastOffset + 1, // required offset
            ProducerResponseStatus(result.errorCode, result.info.firstOffset))
    }

    if (delayedRequestRequired(requiredAcks, messagesPerPartition,