Preface
In a Kafka cluster, how does Kafka manage adding or removing a broker node or a topic, or adding a partition to an existing topic? In this article we analyze one of Kafka's core components: the controller.
What is the controller
The controller (KafkaController) is a Kafka component that runs on every broker in the cluster, but at any point in time only one of them can be active; the active one is chosen by election.
The controller election process
First, look at the startup process of each broker node.
As the broker startup sequence shows, the KafkaController component is started only after SocketServer (the network layer), AlterIsrManager, ReplicaManager (the replica manager), and the other components have finished starting.
def startup() = {
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      // After a session is re-established (e.g. after a network blip), enqueue an event
      // to re-register this broker in ZooKeeper and trigger a new controller election
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val queuedEvent = eventManager.clearAndPut(Expire)
      // Block initialization of the new session until the expiration event is being handled,
      // which ensures that all pending events have been processed before creating the new session
      queuedEvent.awaitProcessing()
    }
  })
  eventManager.put(Startup) // put the Startup event into the event queue
  eventManager.start()      // start the event processing thread
}
The function above does three things:
- registers a ZooKeeper state change handler
- puts the Startup event into the event manager
- starts the event processing thread
We will analyze how the EventManager works in a moment; first let's look at how the Startup event is processed.
private def processStartup(): Unit = {
  // register the ControllerChangeHandler, which watches the /controller znode
  zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
  elect() // run the election
}
- register the ControllerChangeHandler, which watches the /controller znode
- run the election
class ControllerChangeHandler(eventManager: ControllerEventManager) extends ZNodeChangeHandler {
  override val path: String = ControllerZNode.path
  // when the /controller znode is created, enqueue a ControllerChange event
  override def handleCreation(): Unit = eventManager.put(ControllerChange)
  // when the /controller znode is deleted, enqueue a Reelect event to elect a new active controller
  override def handleDeletion(): Unit = eventManager.put(Reelect)
  // when the /controller znode's data changes, enqueue a ControllerChange event
  override def handleDataChange(): Unit = eventManager.put(ControllerChange)
}
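The handler above is pure bookkeeping: each ZooKeeper watch callback is translated into an event on the controller's queue. A minimal Java sketch of that mapping (the class name, enum, and queue here are illustrative stand-ins, not Kafka's actual types):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ChangeHandlerSketch {
    enum ControllerEvent { ControllerChange, Reelect }

    // stands in for the ControllerEventManager's event queue
    static final Queue<ControllerEvent> eventQueue = new ArrayDeque<>();

    // each ZooKeeper watch callback just enqueues the matching event
    static void handleCreation()   { eventQueue.add(ControllerEvent.ControllerChange); }
    static void handleDeletion()   { eventQueue.add(ControllerEvent.Reelect); }
    static void handleDataChange() { eventQueue.add(ControllerEvent.ControllerChange); }

    public static void main(String[] args) {
        handleCreation();  // znode created  -> ControllerChange
        handleDeletion();  // znode deleted  -> Reelect
        System.out.println(eventQueue); // [ControllerChange, Reelect]
    }
}
```

The point of this indirection is that watch callbacks return immediately; all real work happens later, serially, on the event thread.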
Next, look at the election itself in elect:
private def elect(): Unit = {
  // read the /controller znode; if it does not exist, getControllerId returns -1
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  if (activeControllerId != -1) {
    // an id other than -1 means some broker has already become the active controller,
    // so return without joining the election
    return
  }
  try {
    // create the ephemeral /controller znode; an exception is thrown if creation fails
    val (epoch, epochZkVersion) = zkClient.registerControllerAndIncrementControllerEpoch(config.brokerId)
    controllerContext.epoch = epoch
    controllerContext.epochZkVersion = epochZkVersion
    activeControllerId = config.brokerId
    onControllerFailover()
  } catch {
    case e: ControllerMovedException =>
      maybeResign()
      if (activeControllerId != -1)
        debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}", e)
      else
        warn("A controller has been elected but just resigned, this will result in another round of election", e)
    case t: Throwable =>
      error(s"Error while electing or becoming controller on broker ${config.brokerId}. " +
        s"Trigger controller movement immediately", t)
      triggerControllerMove()
  }
}
The actual election process:
- read the /controller znode; if it does not exist, the returned id is -1
- if the id is not -1, another broker has already created the /controller znode, so execution stops here
- create the ephemeral /controller znode; if creation fails an exception is thrown, and if it succeeds onControllerFailover is executed, which contains all the broker, topic, and partition handling that we analyze later
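The election therefore boils down to "the first broker to create the ephemeral /controller znode wins". As a rough illustration (not Kafka code), the same first-writer-wins logic can be simulated in Java with an atomic compare-and-set standing in for the znode creation; every name here is hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ElectionSketch {
    static final int NO_CONTROLLER = -1;
    // Stands in for the ephemeral /controller znode: the first writer wins.
    static final AtomicInteger controllerId = new AtomicInteger(NO_CONTROLLER);

    // Mirrors elect(): read the current controller id, return early if one
    // already exists, otherwise try to "create the znode" atomically.
    static boolean tryElect(int brokerId) {
        if (controllerId.get() != NO_CONTROLLER) return false; // someone already won
        // compareAndSet fails if another broker registered concurrently,
        // playing the role of the node-exists error from ZooKeeper
        return controllerId.compareAndSet(NO_CONTROLLER, brokerId);
    }

    public static void main(String[] args) {
        System.out.println(tryElect(1));        // true  (broker 1 becomes controller)
        System.out.println(tryElect(2));        // false (the "znode" already exists)
        System.out.println(controllerId.get()); // 1
    }
}
```

In the real system the ephemeral znode also gives automatic failover: when the winner's ZooKeeper session expires, the znode disappears and the ControllerChangeHandler's handleDeletion enqueues Reelect on the surviving brokers.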
Analysis of ControllerEventManager
ControllerEventManager is mainly composed of ControllerEventProcessor, QueuedEvent, and ControllerEventThread:
- ControllerEventProcessor: the event processing interface; currently only KafkaController implements it.
- QueuedEvent: an event object on the event queue.
- ControllerEventThread: a thread class extending ShutdownableThread.
ControllerEventManager holds a ControllerEventThread and a LinkedBlockingQueue of QueuedEvent. When ControllerEventManager starts, it starts the ControllerEventThread, whose run method calls doWork in a loop:
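This queue-plus-single-thread pattern can be sketched in a few lines of Java. The sketch below is an illustrative stand-in for the structure just described, not Kafka's implementation; all names in it are made up:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class EventManagerSketch {
    interface Event { void process(); }

    private final LinkedBlockingQueue<Event> queue = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::run, "event-thread");
    private volatile boolean running = true;

    void put(Event e) { queue.add(e); } // like eventManager.put(Startup)
    void start() { worker.start(); }    // like eventManager.start()

    // like ControllerEventThread: take events off the queue one at a time,
    // so all controller events are processed serially on one thread
    private void run() {
        try {
            while (running) queue.take().process();
        } catch (InterruptedException ignored) { }
    }

    void shutdown() { running = false; worker.interrupt(); }

    public static void main(String[] args) throws Exception {
        EventManagerSketch m = new EventManagerSketch();
        StringBuilder log = new StringBuilder();
        CountDownLatch done = new CountDownLatch(1);
        m.put(() -> log.append("Startup "));
        m.put(() -> { log.append("ControllerChange"); done.countDown(); });
        m.start();
        done.await();
        m.shutdown();
        System.out.println(log); // Startup ControllerChange
    }
}
```

Serializing every event through one thread is what lets the controller mutate its metadata without fine-grained locking.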
override def run(): Unit = {
  isStarted = true
  info("Starting")
  try {
    while (isRunning)
      doWork()
  } catch {
    case e: FatalExitError =>
      shutdownInitiated.countDown()
      shutdownComplete.countDown()
      info("Stopped")
      Exit.exit(e.statusCode())
    case e: Throwable =>
      if (isRunning)
        error("Error due to", e)
  } finally {
    shutdownComplete.countDown()
  }
  info("Stopped")
}
ControllerEventThread implements doWork as follows:
override def doWork(): Unit = {
  val dequeued = pollFromEventQueue()
  dequeued.event match {
    case ShutdownEventThread => // The shutting down of the thread has been initiated at this point. Ignore this event.
    case controllerEvent =>
      _state = controllerEvent.state
      eventQueueTimeHist.update(time.milliseconds() - dequeued.enqueueTimeMs)
      try {
        def process(): Unit = dequeued.process(processor)
        rateAndTimeMetrics.get(state) match {
          case Some(timer) => timer.time { process() }
          case None => process()
        }
      } catch {
        case e: Throwable => error(s"Uncaught error processing event $controllerEvent", e)
      }
      _state = ControllerState.Idle
  }
}
doWork keeps taking QueuedEvent objects off the blocking queue and calls each event's process method:
def process(processor: ControllerEventProcessor): Unit = {
  if (spent.getAndSet(true))
    return
  // Signal that processing has started. When a new session is built after a network
  // blip, session creation waits until all pending events have started processing.
  processingStarted.countDown()
  // delegate to KafkaController's process method to handle the event
  processor.process(event)
}
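The at-most-once behavior of process comes from the spent flag, while processingStarted is what awaitProcessing (seen earlier in beforeInitializingSession) blocks on. A minimal Java sketch of the same guard, with illustrative names and the event body reduced to a Runnable:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class QueuedEventSketch {
    private final Runnable event;
    private final AtomicBoolean spent = new AtomicBoolean(false);
    private final CountDownLatch processingStarted = new CountDownLatch(1);

    QueuedEventSketch(Runnable event) { this.event = event; }

    void process() {
        if (spent.getAndSet(true)) return; // already processed: do nothing
        processingStarted.countDown();     // unblock any awaitProcessing() callers
        event.run();
    }

    // blocks until process() has at least begun, as used before
    // re-initializing a ZooKeeper session
    void awaitProcessing() throws InterruptedException { processingStarted.await(); }

    public static void main(String[] args) {
        int[] runs = {0};
        QueuedEventSketch q = new QueuedEventSketch(() -> runs[0]++);
        q.process();
        q.process(); // second call is a no-op thanks to the spent flag
        System.out.println(runs[0]); // 1
    }
}
```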
This calls the process method of ControllerEventProcessor. ControllerEventProcessor is an interface, and currently only KafkaController implements it, so let's look at KafkaController's process method:
override def process(event: ControllerEvent): Unit = {
  try {
    event match {
      case event: MockEvent =>
        // Used only in test cases
        event.process()
      case ShutdownEventThread =>
        error("Received a ShutdownEventThread event. This type of event is supposed to be handle by ControllerEventThread")
      case AutoPreferredReplicaLeaderElection =>
        processAutoPreferredReplicaLeaderElection()
      case ReplicaLeaderElection(partitions, electionType, electionTrigger, callback) =>
        processReplicaLeaderElection(partitions, electionType, electionTrigger, callback)
      case UncleanLeaderElectionEnable =>
        processUncleanLeaderElectionEnable()
      case TopicUncleanLeaderElectionEnable(topic) =>
        processTopicUncleanLeaderElectionEnable(topic)
      case ControlledShutdown(id, brokerEpoch, callback) =>
        processControlledShutdown(id, brokerEpoch, callback)
      case LeaderAndIsrResponseReceived(response, brokerId) =>
        processLeaderAndIsrResponseReceived(response, brokerId)
      case UpdateMetadataResponseReceived(response, brokerId) =>
        processUpdateMetadataResponseReceived(response, brokerId)
      case TopicDeletionStopReplicaResponseReceived(replicaId, requestError, partitionErrors) =>
        processTopicDeletionStopReplicaResponseReceived(replicaId, requestError, partitionErrors)
      case BrokerChange =>
        processBrokerChange()
      case BrokerModifications(brokerId) =>
        processBrokerModification(brokerId)
      case ControllerChange =>
        processControllerChange()
      case Reelect =>
        processReelect()
      case RegisterBrokerAndReelect =>
        processRegisterBrokerAndReelect()
      case Expire =>
        processExpire()
      case TopicChange =>
        processTopicChange()
      case LogDirEventNotification =>
        processLogDirEventNotification()
      case PartitionModifications(topic) =>
        processPartitionModifications(topic)
      case TopicDeletion =>
        processTopicDeletion()
      case ApiPartitionReassignment(reassignments, callback) =>
        processApiPartitionReassignment(reassignments, callback)
      case ZkPartitionReassignment =>
        processZkPartitionReassignment()
      case ListPartitionReassignments(partitions, callback) =>
        processListPartitionReassignments(partitions, callback)
      case UpdateFeatures(request, callback) =>
        processFeatureUpdates(request, callback)
      case PartitionReassignmentIsrChange(partition) =>
        processPartitionReassignmentIsrChange(partition)
      case IsrChangeNotification =>
        processIsrChangeNotification()
      case AlterIsrReceived(brokerId, brokerEpoch, isrsToAlter, callback) =>
        processAlterIsr(brokerId, brokerEpoch, isrsToAlter, callback)
      case Startup =>
        processStartup()
    }
  } catch {
    case e: ControllerMovedException =>
      info(s"Controller moved to another broker when processing $event.", e)
      maybeResign()
    case e: Throwable =>
      error(s"Error processing event $event", e)
  } finally {
    updateMetrics()
  }
}
As the code above shows, each event type has a corresponding handler function defined in KafkaController; the Startup event maps to processStartup. During election, the Startup event is put into the ControllerEventManager, and after passing through the ControllerEventThread it eventually calls back into KafkaController's processStartup.
Analysis of ControllerChannelManager
Operations such as creating a topic or adding partitions to a topic are performed on the controller. After the controller finishes processing, it broadcasts the result to the other brokers. ControllerChannelManager manages the network connections to all the other broker nodes and the sending of requests to them.
class ControllerChannelManager(controllerContext: ControllerContext,
                               config: KafkaConfig,
                               time: Time,
                               metrics: Metrics,
                               stateChangeLogger: StateChangeLogger,
                               threadNamePrefix: Option[String] = None) extends Logging with KafkaMetricsGroup {
  import ControllerChannelManager._

  protected val brokerStateInfo = new HashMap[Int, ControllerBrokerStateInfo]

  def addBroker(broker: Broker): Unit = {
    brokerLock synchronized {
      if (!brokerStateInfo.contains(broker.id)) {
        addNewBroker(broker)
        startRequestSendThread(broker.id)
      }
    }
  }

  def removeBroker(brokerId: Int): Unit = {
    brokerLock synchronized {
      removeExistingBroker(brokerStateInfo(brokerId))
    }
  }
The main member variables of ControllerChannelManager are controllerContext and brokerStateInfo:
- controllerContext: the controller's metadata.
- brokerStateInfo: per-broker state, keyed by broker id, used by the active controller to send requests to the other brokers. When a request must be sent to a broker, its ControllerBrokerStateInfo is looked up by id and handles the corresponding network request. brokerStateInfo thus tracks every broker in the cluster.
When a broker is added to the cluster or comes online, the controller notices the new node through its ZooKeeper watches, which eventually calls ControllerChannelManager's addBroker(broker: Broker). As the code above shows, addBroker does two things:
- addNewBroker builds a ControllerBrokerStateInfo instance and adds it to brokerStateInfo
- startRequestSendThread starts the RequestSendThread inside the ControllerBrokerStateInfo that addNewBroker just built
Next, look at what ControllerBrokerStateInfo contains:
case class ControllerBrokerStateInfo(networkClient: NetworkClient,
brokerNode: Node,
messageQueue: BlockingQueue[QueueItem],
requestSendThread: RequestSendThread,
queueSizeGauge: Gauge[Int],
requestRateAndTimeMetrics: Timer,
reconfigurableChannelBuilder: Option[Reconfigurable])
The main members of ControllerBrokerStateInfo are networkClient, brokerNode, messageQueue, and requestSendThread:
- networkClient: the network client used to issue requests
- brokerNode: the broker's registration info, such as its host IP and port
- messageQueue: the queue of pending requests
- requestSendThread: the thread that sends the queued requests
When the controller needs to send a request to a broker, it calls sendRequest:
def sendRequest(brokerId: Int, request: AbstractControlRequest.Builder[_ <: AbstractControlRequest],
                callback: AbstractResponse => Unit = null): Unit = {
  brokerLock synchronized {
    val stateInfoOpt = brokerStateInfo.get(brokerId)
    stateInfoOpt match {
      case Some(stateInfo) =>
        stateInfo.messageQueue.put(QueueItem(request.apiKey, request, callback, time.milliseconds()))
      case None =>
        warn(s"Not sending request $request to broker $brokerId, since it is offline.")
    }
  }
}
sendRequest simply looks up the ControllerBrokerStateInfo for the target brokerId and puts the request into its message queue. The RequestSendThread then continuously takes requests off the queue and sends them:
override def doWork(): Unit = {
  def backoff(): Unit = pause(100, TimeUnit.MILLISECONDS)

  val QueueItem(apiKey, requestBuilder, callback, enqueueTimeMs) = queue.take()
  requestRateAndQueueTimeMetrics.update(time.milliseconds() - enqueueTimeMs, TimeUnit.MILLISECONDS)

  var clientResponse: ClientResponse = null
  try {
    var isSendSuccessful = false
    while (isRunning && !isSendSuccessful) {
      // if a broker goes down for a long time, then at some point the controller's zookeeper listener will trigger a
      // removeBroker which will invoke shutdown() on this thread. At that point, we will stop retrying.
      try {
        if (!brokerReady()) {
          isSendSuccessful = false
          backoff()
        }
        else {
          val clientRequest = networkClient.newClientRequest(brokerNode.idString, requestBuilder,
            time.milliseconds(), true)
          clientResponse = NetworkClientUtils.sendAndReceive(networkClient, clientRequest, time)
          isSendSuccessful = true
        }
      } catch {
        case e: Throwable => // if the send was not successful, reconnect to broker and resend the message
          warn(s"Controller $controllerId epoch ${controllerContext.epoch} fails to send request $requestBuilder " +
            s"to broker $brokerNode. Reconnecting to broker.", e)
          networkClient.close(brokerNode.idString)
          isSendSuccessful = false
          backoff()
      }
    }
    if (clientResponse != null) {
      val requestHeader = clientResponse.requestHeader
      val api = requestHeader.apiKey
      if (api != ApiKeys.LEADER_AND_ISR && api != ApiKeys.STOP_REPLICA && api != ApiKeys.UPDATE_METADATA)
        throw new KafkaException(s"Unexpected apiKey received: $apiKey")
      val response = clientResponse.responseBody
      stateChangeLogger.withControllerEpoch(controllerContext.epoch).trace(s"Received response " +
        s"$response for request $api with correlation id " +
        s"${requestHeader.correlationId} sent to broker $brokerNode")
      if (callback != null) {
        callback(response)
      }
    }
  } catch {
    case e: Throwable =>
      error(s"Controller $controllerId fails to send a request to broker $brokerNode", e)
      // If there is any socket error (eg, socket timeout), the connection is no longer usable and needs to be recreated.
      networkClient.close(brokerNode.idString)
  }
}
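Putting the two pieces together: sendRequest only enqueues, while a per-broker sender thread drains the queue and retries with a backoff on failure. A simplified Java sketch of that structure (all names are invented for the sketch, and a predicate stands in for the real network client):

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Predicate;

public class ChannelManagerSketch {
    record QueueItem(String request, Runnable callback) {}

    // one message queue per broker id, like brokerStateInfo
    private final Map<Integer, BlockingQueue<QueueItem>> queues = new ConcurrentHashMap<>();

    // like addBroker + startRequestSendThread: create the broker's queue and
    // start a sender thread that drains it, retrying failed sends with backoff
    void addBroker(int brokerId, Predicate<String> trySend) {
        BlockingQueue<QueueItem> q = new LinkedBlockingQueue<>();
        queues.put(brokerId, q);
        Thread sender = new Thread(() -> {
            try {
                while (true) {
                    QueueItem item = q.take();
                    while (!trySend.test(item.request()))
                        Thread.sleep(100); // backoff(), then retry the same request
                    if (item.callback() != null) item.callback().run();
                }
            } catch (InterruptedException ignored) { }
        });
        sender.setDaemon(true);
        sender.start();
    }

    // like sendRequest: enqueue if the broker is known, otherwise warn and drop
    void sendRequest(int brokerId, String request, Runnable callback) {
        BlockingQueue<QueueItem> q = queues.get(brokerId);
        if (q == null) System.out.println("broker " + brokerId + " is offline");
        else q.add(new QueueItem(request, callback));
    }

    public static void main(String[] args) throws Exception {
        ChannelManagerSketch m = new ChannelManagerSketch();
        CountDownLatch done = new CountDownLatch(1);
        m.addBroker(1, request -> true);         // pretend the network always succeeds
        m.sendRequest(99, "LeaderAndIsr", null); // unknown broker: warn and drop
        m.sendRequest(1, "UpdateMetadata", () -> {
            System.out.println("response received");
            done.countDown();
        });
        done.await();
    }
}
```

One queue and one thread per broker means a slow or down broker delays only its own requests, not those to the rest of the cluster.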
Analysis of the onControllerFailover process
Once a broker wins the controller election, onControllerFailover is called back. Its main responsibilities are:
- register child-change handlers for the broker, topic, topic deletion, and ISR change znodes, and node-change handlers for the reassign_partitions and preferred_replica_election znodes
- initialize the controller's metadata object ControllerContext, which records the current topics, which brokers are alive, and the leaders of all partitions
- start ControllerChannelManager so that requests can be sent to every broker
- start the replica state machine
- start the partition state machine
- start the automatic partition rebalance timer
private def onControllerFailover(): Unit = {
  maybeSetupFeatureVersioning()

  info("Registering handlers")

  // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
  // child-change handlers for the broker, topic, topic deletion, log dir and ISR change znodes
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
    isrChangeNotificationHandler)
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)

  // node-change handlers for the reassign_partitions and preferred_replica_election znodes
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

  info("Deleting log dir event notifications")
  zkClient.deleteLogDirEventNotifications(controllerContext.epochZkVersion)
  info("Deleting isr change notifications")
  zkClient.deleteIsrChangeNotifications(controllerContext.epochZkVersion)
  info("Initializing controller context")
  initializeControllerContext() // initialize the metadata cache from the data stored in ZooKeeper
  info("Fetching topic deletions in progress")
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)

  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  // send a metadata update request to every broker in the cluster
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)

  replicaStateMachine.startup()   // start the replica state machine
  partitionStateMachine.startup() // start the partition state machine

  info(s"Ready to serve as the new controller with epoch $epoch")

  initializePartitionReassignments()
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onReplicaElection(pendingPreferredReplicaElections, ElectionType.PREFERRED, ZkTriggered)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS) // start the automatic partition rebalance timer
  }
  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = () => tokenManager.expireTokens(),
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}
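The auto-rebalance timer at the end is just a delayed scheduled task. A rough Java sketch of the same idea using ScheduledExecutorService (the task body and names are illustrative; Kafka uses its own KafkaScheduler and a 5-second delay, shortened here so the sketch finishes quickly):

```java
import java.util.concurrent.*;

public class RebalanceSchedulerSketch {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch ran = new CountDownLatch(1);
        // in the real controller this task checks preferred-leader imbalance and,
        // if needed, enqueues an AutoPreferredReplicaLeaderElection event
        scheduler.schedule(() -> {
            System.out.println("checking preferred leader imbalance");
            ran.countDown();
        }, 50, TimeUnit.MILLISECONDS);
        ran.await();
        scheduler.shutdown();
    }
}
```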
The main responsibilities of initializeControllerContext:
- fetch all broker info and epochs from the /brokers/ids znode and store them in controllerContext
- fetch all topics from ZooKeeper and store them in controllerContext
- watch those topics for metadata changes
- fetch the replica assignment of every topic
- update the leader and ISR cache for every partition
- start ControllerChannelManager (whose design and purpose were described above)
private def initializeControllerContext(): Unit = {
  // update controller cache with delete topic information
  // fetch all broker info and epochs from the /brokers/ids znode and store them in controllerContext
  val curBrokerAndEpochs = zkClient.getAllBrokerAndEpochsInCluster
  val (compatibleBrokerAndEpochs, incompatibleBrokerAndEpochs) = partitionOnFeatureCompatibility(curBrokerAndEpochs)
  if (!incompatibleBrokerAndEpochs.isEmpty) {
    warn("Ignoring registration of new brokers due to incompatibilities with finalized features: " +
      incompatibleBrokerAndEpochs.map { case (broker, _) => broker.id }.toSeq.sorted.mkString(","))
  }
  controllerContext.setLiveBrokers(compatibleBrokerAndEpochs)
  info(s"Initialized broker epochs cache: ${controllerContext.liveBrokerIdAndEpochs}")
  // fetch all topics from ZooKeeper and store them in controllerContext
  controllerContext.setAllTopics(zkClient.getAllTopicsInCluster(true))
  // watch those topics for metadata changes
  registerPartitionModificationsHandlers(controllerContext.allTopics.toSeq)
  val replicaAssignmentAndTopicIds = zkClient.getReplicaAssignmentAndTopicIdForTopics(controllerContext.allTopics.toSet)
  processTopicIds(replicaAssignmentAndTopicIds)

  replicaAssignmentAndTopicIds.foreach { case TopicIdReplicaAssignment(_, _, assignments) =>
    assignments.foreach { case (topicPartition, replicaAssignment) =>
      controllerContext.updatePartitionFullReplicaAssignment(topicPartition, replicaAssignment)
      if (replicaAssignment.isBeingReassigned)
        controllerContext.partitionsBeingReassigned.add(topicPartition)
    }
  }
  controllerContext.clearPartitionLeadershipInfo()
  controllerContext.shuttingDownBrokerIds.clear()
  // register broker modifications handlers
  registerBrokerModificationsHandler(controllerContext.liveOrShuttingDownBrokerIds)
  // update the leader and isr cache for all existing partitions from Zookeeper
  updateLeaderAndIsrCache()
  // start the channel manager
  controllerChannelManager.startup()
}
The ReplicaStateMachine (replica state machine) mainly listens for the following states and their handler methods