An Analysis of the Kafka Controller

Preface

In a Kafka cluster, how does Kafka handle adding or removing a broker node or a topic, or adding a partition to an existing topic? In this article we analyze one of Kafka's core components: the controller.

What is the controller?

The Controller Broker (KafkaController) is a Kafka service that runs on every broker in a Kafka cluster, but at any point in time only one of them can be active, i.e. elected.
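The active controller's id is recorded in the /controller znode in ZooKeeper. As a quick way to see which broker currently holds the role, here is a minimal sketch (not Kafka code) that reads that znode with the plain ZooKeeper client; the connect string is an assumption for illustration.

import java.util.concurrent.CountDownLatch
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

object ShowActiveController {
  def main(args: Array[String]): Unit = {
    val connected = new CountDownLatch(1)
    val zk = new ZooKeeper("localhost:2181", 30000, new Watcher {
      override def process(event: WatchedEvent): Unit = connected.countDown()
    })
    connected.await()
    // /controller holds a small JSON blob that includes the elected broker's id
    val data = zk.getData("/controller", false, null)
    println(new String(data, "UTF-8"))
    zk.close()
  }
}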

The controller election process

Let's first look at the startup process of each broker node.

[Sequence diagram: KafkaServer.startup() starts SocketServer, AlterIsrManager, ReplicaManager, and finally KafkaController]

As the diagram shows, the KafkaController component is started only after SocketServer (the network server), AlterIsrManager, ReplicaManager (the replica manager), and the other components have finished starting.

  def startup() = {
    zkClient.registerStateChangeHandler(new StateChangeHandler {
      override val name: String = StateChangeHandlers.ControllerHandler
      override def afterInitializingSession(): Unit = {
        // after a new ZooKeeper session is established (e.g. after a network blip),
        // enqueue an event to re-register this broker and trigger a new controller election
        eventManager.put(RegisterBrokerAndReelect)
      }
      override def beforeInitializingSession(): Unit = {
        val queuedEvent = eventManager.clearAndPut(Expire)

        // Block initialization of the new session until the expiration event is being handled,
        // which ensures that all pending events have been processed before creating the new session
        queuedEvent.awaitProcessing()
      }
    })
    eventManager.put(Startup) // put the Startup event onto the event queue
    eventManager.start()      // start the event processing thread
  }

The function above does three things:

  • registers a ZooKeeper state-change handler
  • puts the Startup event into the event manager
  • starts the event processing thread

We will look at how the EventManager works a bit later; for now let's continue with how the Startup event is handled.
  private def processStartup(): Unit = {
    // register the ControllerChangeHandler that watches the /controller znode
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    elect() // run the election
  }
  • register the ControllerChangeHandler that watches the controller znode
  • run the election
class ControllerChangeHandler(eventManager: ControllerEventManager) extends ZNodeChangeHandler {
  override val path: String = ControllerZNode.path

  // when the /controller znode is created, enqueue a ControllerChange event
  override def handleCreation(): Unit = eventManager.put(ControllerChange)
  // when the /controller znode is deleted, enqueue a Reelect event to elect a new active controller
  override def handleDeletion(): Unit = eventManager.put(Reelect)
  // when the data of the /controller znode changes, enqueue a ControllerChange event
  override def handleDataChange(): Unit = eventManager.put(ControllerChange)
}

Now let's look at the election method, elect:

private def elect(): Unit = {
    activeControllerId = zkClient.getControllerId.getOrElse(-1) // read the controller id from the /controller znode, or -1 if the znode does not exist
    if (activeControllerId != -1) { // an id other than -1 means another broker is already the active controller, so skip the election
      return
    }
    try {
      val (epoch, epochZkVersion) = zkClient.registerControllerAndIncrementControllerEpoch(config.brokerId) // create the ephemeral /controller znode; throws if registration fails
      controllerContext.epoch = epoch
      controllerContext.epochZkVersion = epochZkVersion
      activeControllerId = config.brokerId


      onControllerFailover()
    } catch {
      case e: ControllerMovedException =>
        maybeResign()

        if (activeControllerId != -1)
          debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}", e)
        else
          warn("A controller has been elected but just resigned, this will result in another round of election", e)
      case t: Throwable =>
        error(s"Error while electing or becoming controller on broker ${config.brokerId}. " +
          s"Trigger controller movement immediately", t)
        triggerControllerMove()
    }
  }

The actual election process:

  • read the controller information from the /controller znode; if the znode does not exist, the id defaults to -1
  • if the id is not -1, another broker has already created the controller znode, so stop here
  • otherwise, create the ephemeral controller znode; if the registration fails an exception is thrown, and if it succeeds onControllerFailover runs, which is where all the broker, topic, and partition handling happens (analyzed later; a minimal sketch of the election pattern follows this list)
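To make the pattern concrete, here is a minimal, self-contained sketch of the same ephemeral-znode election idea using the plain ZooKeeper client instead of Kafka's KafkaZkClient. The connect string, broker id, and the payload written to the znode are simplified assumptions; the real controller stores a JSON blob and also bumps the controller epoch.

import java.util.concurrent.CountDownLatch
import org.apache.zookeeper.{CreateMode, KeeperException, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ControllerElectionSketch {
  def main(args: Array[String]): Unit = {
    val brokerId = 1 // illustrative broker id
    val connected = new CountDownLatch(1)
    val zk = new ZooKeeper("localhost:2181", 30000, new Watcher {
      override def process(event: WatchedEvent): Unit = connected.countDown()
    })
    connected.await()

    try {
      // Try to create the ephemeral node; only one session can succeed.
      // The node disappears automatically when the winner's session dies,
      // which is what allows the surviving brokers to re-elect.
      zk.create("/controller", brokerId.toString.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
      println(s"broker $brokerId won the election")
      // the real controller would run onControllerFailover() here
    } catch {
      case _: KeeperException.NodeExistsException =>
        // somebody else is already the controller; just read who it is
        val current = new String(zk.getData("/controller", false, null), "UTF-8")
        println(s"broker $brokerId lost the election; current controller payload: $current")
    }
  }
}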

Analysis of ControllerEventManager


ControllerEventManager is mainly made up of ControllerEventProcessor, QueuedEvent, and ControllerEventThread:

  • ControllerEventProcessor: the event processor interface; currently only KafkaController implements it.
  • QueuedEvent: the event object placed on the event queue.
  • ControllerEventThread: a thread class extending ShutdownableThread.

ControllerEventManager holds a ControllerEventThread and a LinkedBlockingQueue of QueuedEvent. When ControllerEventManager starts, it starts the ControllerEventThread, whose run method calls doWork in a loop:
  override def run(): Unit = {
    isStarted = true
    info("Starting")
    try {
      while (isRunning)
        doWork()
    } catch {
      case e: FatalExitError =>
        shutdownInitiated.countDown()
        shutdownComplete.countDown()
        info("Stopped")
        Exit.exit(e.statusCode())
      case e: Throwable =>
        if (isRunning)
          error("Error due to", e)
    } finally {
      shutdownComplete.countDown()
    }
    info("Stopped")
  }

ControllerEventThread's implementation of doWork looks like this:

 override def doWork(): Unit = {
      val dequeued = pollFromEventQueue()
      dequeued.event match {
        case ShutdownEventThread => // The shutting down of the thread has been initiated at this point. Ignore this event.
        case controllerEvent =>
          _state = controllerEvent.state

          eventQueueTimeHist.update(time.milliseconds() - dequeued.enqueueTimeMs)

          try {
            def process(): Unit = dequeued.process(processor)

            rateAndTimeMetrics.get(state) match {
              case Some(timer) => timer.time { process() }
              case None => process()
            }
          } catch {
            case e: Throwable => error(s"Uncaught error processing event $controllerEvent", e)
          }

          _state = ControllerState.Idle
      }
    } 

The doWork method keeps taking QueuedEvent objects off the blocking queue and invoking each event's process method:

 def process(processor: ControllerEventProcessor): Unit = {
   if (spent.getAndSet(true))
     return
   // count down so that, after a session expiration or network blip, creating the new session
   // can wait until all pending events have started being processed
   processingStarted.countDown()
   // delegate to KafkaController's process method to handle the event
   processor.process(event)
 }

This delegates to ControllerEventProcessor.process. ControllerEventProcessor is an interface and currently only KafkaController implements it, so let's look at KafkaController's process method:

override def process(event: ControllerEvent): Unit = {
   try {
     event match {
       case event: MockEvent =>
         // Used only in test cases
         event.process()
       case ShutdownEventThread =>
         error("Received a ShutdownEventThread event. This type of event is supposed to be handle by ControllerEventThread")
       case AutoPreferredReplicaLeaderElection =>
         processAutoPreferredReplicaLeaderElection()
       case ReplicaLeaderElection(partitions, electionType, electionTrigger, callback) =>
         processReplicaLeaderElection(partitions, electionType, electionTrigger, callback)
       case UncleanLeaderElectionEnable =>
         processUncleanLeaderElectionEnable()
       case TopicUncleanLeaderElectionEnable(topic) =>
         processTopicUncleanLeaderElectionEnable(topic)
       case ControlledShutdown(id, brokerEpoch, callback) =>
         processControlledShutdown(id, brokerEpoch, callback)
       case LeaderAndIsrResponseReceived(response, brokerId) =>
         processLeaderAndIsrResponseReceived(response, brokerId)
       case UpdateMetadataResponseReceived(response, brokerId) =>
         processUpdateMetadataResponseReceived(response, brokerId)
       case TopicDeletionStopReplicaResponseReceived(replicaId, requestError, partitionErrors) =>
         processTopicDeletionStopReplicaResponseReceived(replicaId, requestError, partitionErrors)
       case BrokerChange =>
         processBrokerChange()
       case BrokerModifications(brokerId) =>
         processBrokerModification(brokerId)
       case ControllerChange =>
         processControllerChange()
       case Reelect =>
         processReelect()
       case RegisterBrokerAndReelect =>
         processRegisterBrokerAndReelect()
       case Expire =>
         processExpire()
       case TopicChange =>
         processTopicChange()
       case LogDirEventNotification =>
         processLogDirEventNotification()
       case PartitionModifications(topic) =>
         processPartitionModifications(topic)
       case TopicDeletion =>
         processTopicDeletion()
       case ApiPartitionReassignment(reassignments, callback) =>
         processApiPartitionReassignment(reassignments, callback)
       case ZkPartitionReassignment =>
         processZkPartitionReassignment()
       case ListPartitionReassignments(partitions, callback) =>
         processListPartitionReassignments(partitions, callback)
       case UpdateFeatures(request, callback) =>
         processFeatureUpdates(request, callback)
       case PartitionReassignmentIsrChange(partition) =>
         processPartitionReassignmentIsrChange(partition)
       case IsrChangeNotification =>
         processIsrChangeNotification()
       case AlterIsrReceived(brokerId, brokerEpoch, isrsToAlter, callback) =>
         processAlterIsr(brokerId, brokerEpoch, isrsToAlter, callback)
       case Startup =>
         processStartup()
     }
   } catch {
     case e: ControllerMovedException =>
       info(s"Controller moved to another broker when processing $event.", e)
       maybeResign()
     case e: Throwable =>
       error(s"Error processing event $event", e)
   } finally {
     updateMetrics()
   }
 }

As the code above shows, each event type has a corresponding handler defined in KafkaController; the Startup event maps to the processStartup method. During election the Startup event is put into the ControllerEventManager and, after flowing through the ControllerEventThread, it is eventually dispatched back to KafkaController.processStartup.
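To summarize the pattern, here is a minimal, self-contained sketch of a queue-backed, single-threaded event manager in the same spirit as ControllerEventManager. The event types and class names below are simplified stand-ins, not Kafka's classes.

import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// simplified stand-ins for ControllerEvent / ControllerEventProcessor
sealed trait Event
case object StartupEvent extends Event
case object ControllerChangeEvent extends Event
case object ShutdownEvent extends Event

trait EventProcessor {
  def process(event: Event): Unit
}

class MiniEventManager(processor: EventProcessor) {
  private val queue = new LinkedBlockingQueue[Event]()
  private val running = new AtomicBoolean(true)

  // single consumer thread: take events in FIFO order and dispatch them to the processor,
  // mirroring ControllerEventThread.doWork
  private val loop: Runnable = () => {
    while (running.get()) {
      queue.take() match {
        case ShutdownEvent => running.set(false)
        case event =>
          try processor.process(event)
          catch { case t: Throwable => t.printStackTrace() } // keep the loop alive on errors
      }
    }
  }
  private val thread = new Thread(loop, "mini-event-thread")

  def put(event: Event): Unit = queue.put(event)
  def start(): Unit = thread.start()
  def shutdown(): Unit = { queue.put(ShutdownEvent); thread.join() }
}

object MiniEventManagerDemo extends App {
  val manager = new MiniEventManager(e => println(s"processing $e"))
  manager.start()
  manager.put(StartupEvent)
  manager.put(ControllerChangeEvent)
  manager.shutdown()
}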

Analysis of ControllerChannelManager

Operations such as creating a topic or adding partitions to an existing topic are carried out on the controller; once the controller has processed them, it broadcasts the result to the other brokers. ControllerChannelManager is the component that manages the network connections to all the other broker nodes and sends requests over them.

class ControllerChannelManager(controllerContext: ControllerContext,
                               config: KafkaConfig,
                               time: Time,
                               metrics: Metrics,
                               stateChangeLogger: StateChangeLogger,
                               threadNamePrefix: Option[String] = None) extends Logging with KafkaMetricsGroup {
  import ControllerChannelManager._

  protected val brokerStateInfo = new HashMap[Int, ControllerBrokerStateInfo]
  
  def addBroker(broker: Broker): Unit = {
    brokerLock synchronized {
      if (!brokerStateInfo.contains(broker.id)) {
        addNewBroker(broker)
        startRequestSendThread(broker.id)
      }
    }
  }

  def removeBroker(brokerId: Int): Unit = {
    brokerLock synchronized {
      removeExistingBroker(brokerStateInfo(brokerId))
    }
  }

The main member variables of ControllerChannelManager are controllerContext and brokerStateInfo:

  • controllerContext: the controller's metadata.
  • brokerStateInfo: per-broker state that the active controller uses to send requests to the other brokers. The key is the broker id; when a request needs to go to a particular broker, the matching ControllerBrokerStateInfo is looked up by id and takes care of the network request. brokerStateInfo therefore tracks every broker in the cluster.

When a new broker is added to the cluster or comes online, the controller notices the new node through its ZooKeeper watchers and eventually calls ControllerChannelManager.addBroker(broker: Broker). As the code above shows, addBroker does two things: addNewBroker(broker) and startRequestSendThread(broker.id).

  • addNewBroker builds a ControllerBrokerStateInfo instance and adds it to brokerStateInfo
  • startRequestSendThread starts the RequestSendThread held by the ControllerBrokerStateInfo that addNewBroker just built

Now let's look at ControllerBrokerStateInfo:
case class ControllerBrokerStateInfo(networkClient: NetworkClient,
                                     brokerNode: Node,
                                     messageQueue: BlockingQueue[QueueItem],
                                     requestSendThread: RequestSendThread,
                                     queueSizeGauge: Gauge[Int],
                                     requestRateAndTimeMetrics: Timer,
                                     reconfigurableChannelBuilder: Option[Reconfigurable])

The main members of ControllerBrokerStateInfo are networkClient, brokerNode, messageQueue, and requestSendThread:

  • networkClient: the client used for network requests
  • brokerNode: the broker's registration information, such as host IP and port
  • messageQueue: the message queue
  • requestSendThread: the request-sending thread

When the controller needs to send a request to a broker, it simply calls sendRequest:
  def sendRequest(brokerId: Int, request: AbstractControlRequest.Builder[_ <: AbstractControlRequest],
                  callback: AbstractResponse => Unit = null): Unit = {
    brokerLock synchronized {
      val stateInfoOpt = brokerStateInfo.get(brokerId)
      stateInfoOpt match {
        case Some(stateInfo) =>
          stateInfo.messageQueue.put(QueueItem(request.apiKey, request, callback, time.milliseconds()))
        case None =>
          warn(s"Not sending request $request to broker $brokerId, since it is offline.")
      }
    }
  }

sendRequest looks up the ControllerBrokerStateInfo for the given brokerId and puts the request onto its queue. The RequestSendThread then keeps polling the queue and sends whatever requests it finds (a simplified sketch of this per-broker pattern follows the code below):

override def doWork(): Unit = {

    def backoff(): Unit = pause(100, TimeUnit.MILLISECONDS)

    val QueueItem(apiKey, requestBuilder, callback, enqueueTimeMs) = queue.take()
    requestRateAndQueueTimeMetrics.update(time.milliseconds() - enqueueTimeMs, TimeUnit.MILLISECONDS)

    var clientResponse: ClientResponse = null
    try {
      var isSendSuccessful = false
      while (isRunning && !isSendSuccessful) {
        // if a broker goes down for a long time, then at some point the controller's zookeeper listener will trigger a
        // removeBroker which will invoke shutdown() on this thread. At that point, we will stop retrying.
        try {
          if (!brokerReady()) {
            isSendSuccessful = false
            backoff()
          }
          else {
            val clientRequest = networkClient.newClientRequest(brokerNode.idString, requestBuilder,
              time.milliseconds(), true)
            clientResponse = NetworkClientUtils.sendAndReceive(networkClient, clientRequest, time)
            isSendSuccessful = true
          }
        } catch {
          case e: Throwable => // if the send was not successful, reconnect to broker and resend the message
            warn(s"Controller $controllerId epoch ${controllerContext.epoch} fails to send request $requestBuilder " +
              s"to broker $brokerNode. Reconnecting to broker.", e)
            networkClient.close(brokerNode.idString)
            isSendSuccessful = false
            backoff()
        }
      }
      if (clientResponse != null) {
        val requestHeader = clientResponse.requestHeader
        val api = requestHeader.apiKey
        if (api != ApiKeys.LEADER_AND_ISR && api != ApiKeys.STOP_REPLICA && api != ApiKeys.UPDATE_METADATA)
          throw new KafkaException(s"Unexpected apiKey received: $apiKey")

        val response = clientResponse.responseBody

        stateChangeLogger.withControllerEpoch(controllerContext.epoch).trace(s"Received response " +
          s"$response for request $api with correlation id " +
          s"${requestHeader.correlationId} sent to broker $brokerNode")

        if (callback != null) {
          callback(response)
        }
      }
    } catch {
      case e: Throwable =>
        error(s"Controller $controllerId fails to send a request to broker $brokerNode", e)
        // If there is any socket error (eg, socket timeout), the connection is no longer usable and needs to be recreated.
        networkClient.close(brokerNode.idString)
    }
  }
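The pattern here (one blocking queue plus one dedicated sender thread per destination broker) can be reduced to a small sketch. The PendingRequest type, the class and thread names, and the fabricated response below are assumptions for illustration only; Kafka actually enqueues QueueItem(apiKey, requestBuilder, callback, enqueueTimeMs) and sends it with NetworkClientUtils.sendAndReceive.

import java.util.concurrent.LinkedBlockingQueue

// illustrative request type, standing in for Kafka's QueueItem
final case class PendingRequest(payload: String, callback: String => Unit)

// one sender per destination broker: requests for that broker are queued and sent in order,
// so a slow or offline broker never blocks requests destined for the others
class BrokerSender(brokerId: Int) {
  private val queue = new LinkedBlockingQueue[PendingRequest]()
  @volatile private var running = true

  private val loop: Runnable = () => {
    while (running) {
      try {
        val req = queue.take()
        // stand-in for networkClient.sendAndReceive; here we just fabricate a response
        val response = s"response-from-broker-$brokerId-for-${req.payload}"
        req.callback(response)
      } catch {
        case _: InterruptedException => // shutdown requested; the running flag ends the loop
      }
    }
  }
  private val thread = new Thread(loop, s"request-send-thread-$brokerId")

  def start(): Unit = thread.start()
  def send(req: PendingRequest): Unit = queue.put(req)
  def shutdown(): Unit = { running = false; thread.interrupt() }
}

object ChannelManagerSketch extends App {
  // a map keyed by broker id, playing the role of brokerStateInfo
  val senders = Map(1 -> new BrokerSender(1), 2 -> new BrokerSender(2))
  senders.values.foreach(_.start())
  senders(1).send(PendingRequest("UPDATE_METADATA", resp => println(resp)))
  senders(2).send(PendingRequest("LEADER_AND_ISR", resp => println(resp)))
  Thread.sleep(200)
  senders.values.foreach(_.shutdown())
}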

Analysis of the onControllerFailover process

Once a broker wins the controller election, onControllerFailover is invoked. Its main responsibilities are:

  • register child-change handlers on the broker, topic, topic deletion, and ISR change notification znodes, and node-change handlers on the reassign_partitions and preferred_replica_election znodes
  • initialize the controller's metadata object, ControllerContext, which records the current topics, which brokers are alive, and the leaders of all partitions
  • start ControllerChannelManager so that requests can be sent to every broker
  • start the replica state machine
  • start the partition state machine
  • start the automatic partition-rebalance timer
private def onControllerFailover(): Unit = {
    maybeSetupFeatureVersioning()

    info("Registering handlers")

    // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
    // register child-change handlers for the broker, topic, topic deletion, log dir event, and isr change notification znodes
    val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
      isrChangeNotificationHandler)
    childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
    // register node-change handlers for the reassign_partitions and preferred_replica_election znodes
    val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
    nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

    info("Deleting log dir event notifications")
    zkClient.deleteLogDirEventNotifications(controllerContext.epochZkVersion)
    info("Deleting isr change notifications")
    zkClient.deleteIsrChangeNotifications(controllerContext.epochZkVersion)
    info("Initializing controller context")

    initializeControllerContext() // initialize the controller metadata from the data stored in ZooKeeper
    info("Fetching topic deletions in progress")
    val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
    info("Initializing topic deletion manager")
    topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)

    // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
    // are started. The is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
    // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
    // partitionStateMachine.startup().
    info("Sending update metadata request")
    // send a metadata update request to every broker in the cluster
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)

    replicaStateMachine.startup() // start the replica state machine
    partitionStateMachine.startup() // start the partition state machine

    info(s"Ready to serve as the new controller with epoch $epoch")

    initializePartitionReassignments()
    topicDeletionManager.tryTopicDeletion()
    val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
    onReplicaElection(pendingPreferredReplicaElections, ElectionType.PREFERRED, ZkTriggered)
    info("Starting the controller scheduler")
    kafkaScheduler.startup()
    if (config.autoLeaderRebalanceEnable) {
      scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS) // start the automatic leader-rebalance timer
    }

    if (config.tokenAuthEnabled) {
      info("starting the token expiry check scheduler")
      tokenCleanScheduler.startup()
      tokenCleanScheduler.schedule(name = "delete-expired-tokens",
        fun = () => tokenManager.expireTokens(),
        period = config.delegationTokenExpiryCheckIntervalMs,
        unit = TimeUnit.MILLISECONDS)
    }
  }

The main responsibilities of initializeControllerContext are:

  • read all broker information and epochs from the /brokers/ids znode and store them in controllerContext
  • read all topics from ZooKeeper and store them in controllerContext
  • register handlers for metadata changes on those topics
  • fetch the ReplicaAssignment for every topic
  • update the leader and ISR information for every partition
  • start ControllerChannelManager (whose role was described above)
  private def initializeControllerContext(): Unit = {
    // update controller cache with delete topic information
    // read all broker information and epochs from the /brokers/ids znode and store them in controllerContext
    val curBrokerAndEpochs = zkClient.getAllBrokerAndEpochsInCluster
    val (compatibleBrokerAndEpochs, incompatibleBrokerAndEpochs) = partitionOnFeatureCompatibility(curBrokerAndEpochs)
    if (!incompatibleBrokerAndEpochs.isEmpty) {
      warn("Ignoring registration of new brokers due to incompatibilities with finalized features: " +
        incompatibleBrokerAndEpochs.map { case (broker, _) => broker.id }.toSeq.sorted.mkString(","))
    }
    controllerContext.setLiveBrokers(compatibleBrokerAndEpochs)
    info(s"Initialized broker epochs cache: ${controllerContext.liveBrokerIdAndEpochs}")
    // read all topics from ZooKeeper and store them in controllerContext
    controllerContext.setAllTopics(zkClient.getAllTopicsInCluster(true))
    // register handlers for metadata changes on these topics
    registerPartitionModificationsHandlers(controllerContext.allTopics.toSeq)
    val replicaAssignmentAndTopicIds = zkClient.getReplicaAssignmentAndTopicIdForTopics(controllerContext.allTopics.toSet)
    processTopicIds(replicaAssignmentAndTopicIds)
    replicaAssignmentAndTopicIds.foreach { case TopicIdReplicaAssignment(_, _, assignments) =>
      assignments.foreach { case (topicPartition, replicaAssignment) =>
        controllerContext.updatePartitionFullReplicaAssignment(topicPartition, replicaAssignment)
        if (replicaAssignment.isBeingReassigned)
          controllerContext.partitionsBeingReassigned.add(topicPartition)
      }
    }
    controllerContext.clearPartitionLeadershipInfo()
    controllerContext.shuttingDownBrokerIds.clear()
    // register broker modifications handlers
    registerBrokerModificationsHandler(controllerContext.liveOrShuttingDownBrokerIds)
    // update the leader and isr cache for all existing partitions from Zookeeper
    updateLeaderAndIsrCache() // update the leader and isr cache for every existing partition
    // start the channel manager
    controllerChannelManager.startup() // start the ControllerChannelManager
  }

ReplicaStateMachine, the replica state machine, mainly listens for the following states and handles them accordingly:
