Table of Contents
4.1 How partitions are evenly distributed across brokers when the command creates a topic
4.2 How the Kafka Controller handles the topic-creation event
4.3 How a broker becomes a partition leader or follower
1. Overview
KafkaController is the control and management module of a Kafka cluster. Every broker runs a controller, but there is exactly one leader and the rest act as followers. The leader controller manages the metadata of all nodes and, by registering various listeners with ZooKeeper, handles cluster membership, partition leader election, rebalancing, and so on. External events update the data in ZooKeeper, and whenever that data changes the controller has to react accordingly.
KafkaController registers a variety of handlers to carry out this processing:
class KafkaController(val config: KafkaConfig,
zkClient: KafkaZkClient,
time: Time,
metrics: Metrics,
initialBrokerInfo: BrokerInfo,
initialBrokerEpoch: Long,
tokenManager: DelegationTokenManager,
threadNamePrefix: Option[String] = None)
extends ControllerEventProcessor with Logging with KafkaMetricsGroup {
this.logIdent = s"[Controller id=${config.brokerId}] "
@volatile private var brokerInfo = initialBrokerInfo
@volatile private var _brokerEpoch = initialBrokerEpoch
private val stateChangeLogger = new StateChangeLogger(config.brokerId, inControllerContext = true, None)
val controllerContext = new ControllerContext
var controllerChannelManager = new ControllerChannelManager(controllerContext, config, time, metrics,
stateChangeLogger, threadNamePrefix)
private[controller] val kafkaScheduler = new KafkaScheduler(1)
private[controller] val eventManager = new ControllerEventManager(config.brokerId, this, time,
controllerContext.stats.rateAndTimeMetrics)
private val brokerRequestBatch = new ControllerBrokerRequestBatch(config, controllerChannelManager,
eventManager, controllerContext, stateChangeLogger)
val replicaStateMachine: ReplicaStateMachine = new ZkReplicaStateMachine(config, stateChangeLogger, controllerContext, zkClient,
new ControllerBrokerRequestBatch(config, controllerChannelManager, eventManager, controllerContext, stateChangeLogger))
val partitionStateMachine: PartitionStateMachine = new ZkPartitionStateMachine(config, stateChangeLogger, controllerContext, zkClient,
new ControllerBrokerRequestBatch(config, controllerChannelManager, eventManager, controllerContext, stateChangeLogger))
val topicDeletionManager = new TopicDeletionManager(config, controllerContext, replicaStateMachine,
partitionStateMachine, new ControllerDeletionClient(this, zkClient))
private val controllerChangeHandler = new ControllerChangeHandler(eventManager)
private val brokerChangeHandler = new BrokerChangeHandler(eventManager)
private val brokerModificationsHandlers: mutable.Map[Int, BrokerModificationsHandler] = mutable.Map.empty
private val topicChangeHandler = new TopicChangeHandler(eventManager)
private val topicDeletionHandler = new TopicDeletionHandler(eventManager)
private val partitionModificationsHandlers: mutable.Map[String, PartitionModificationsHandler] = mutable.Map.empty
private val partitionReassignmentHandler = new PartitionReassignmentHandler(eventManager)
private val preferredReplicaElectionHandler = new PreferredReplicaElectionHandler(eventManager)
private val isrChangeNotificationHandler = new IsrChangeNotificationHandler(eventManager)
private val logDirEventNotificationHandler = new LogDirEventNotificationHandler(eventManager)
2. Key Classes
KafkaController
As the controller of the Kafka cluster, KafkaController has exactly one leader and several followers. The leader can send concrete requests to the followers, such as RequestKeys.LeaderAndIsrKey, RequestKeys.StopReplicaKey, and RequestKeys.UpdateMetadataKey.
*Handler
When KafkaController starts it instantiates the various *Handler classes. Each handler puts the ControllerEvents it cares about into the ControllerEventManager queue and declares the ZooKeeper path it needs to watch; one handler may care about several ControllerEvents. The handlers include: ControllerChangeHandler, BrokerChangeHandler, BrokerModificationsHandler, TopicChangeHandler, TopicDeletionHandler, PartitionModificationsHandler, PartitionReassignmentHandler, PreferredReplicaElectionHandler, IsrChangeNotificationHandler, LogDirEventNotificationHandler.
ControllerEvent
These include MockEvent, ShutdownEventThread, ControllerChange, ReplicaLeaderElection, BrokerChange, and so on.
ControllerEventManager
When KafkaController instantiates the handlers, and whenever a watcher observes a ZooKeeper node change, an event is put into the ControllerEventManager queue. An internal thread of ControllerEventManager takes events from the queue and calls ControllerEventProcessor.process, which performs the appropriate handling for each event type.
ControllerEventProcessor
Performs the appropriate processing for each ControllerEvent.
3. Main Flows
3.1 Electing the controller leader
The controller leader is elected through ZooKeeper's leader-election mechanism. Every node takes part in the election, but only one can become the leader; the other nodes re-enter the election only when the leader fails or its session expires. Each node, acting as a ZooKeeper client, tries to create the ephemeral znode /controller on the ZooKeeper server, and in the end only one broker succeeds in creating it.
At startup, every node competes for the controller role by trying to create the /controller znode, but only one wins. All nodes register a session-expiration listener and a data-change listener on the /controller node.
If the leader's session expires, the ephemeral /controller znode is deleted. The other nodes then receive the data-change event for /controller, and their electors all try to recreate /controller to compete for leadership again.
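A minimal sketch of this election idea, using the raw ZooKeeper client API rather than Kafka's actual KafkaZkClient code (the object name and the znode payload here are simplified assumptions):
object ControllerElectionSketch {
  import org.apache.zookeeper.{CreateMode, KeeperException, ZooDefs, ZooKeeper}

  // Every broker races to create the ephemeral /controller znode. Exactly one create()
  // succeeds; the losers register a watch and retry when the ephemeral node disappears
  // (i.e. when the current controller's session expires).
  def tryElectController(zk: ZooKeeper, brokerId: Int): Boolean = {
    try {
      zk.create("/controller", s"""{"brokerid":$brokerId}""".getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
      true  // this broker is now the controller leader
    } catch {
      case _: KeeperException.NodeExistsException =>
        zk.exists("/controller", true) // watch the node so we are notified when it goes away
        false
    }
  }
}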
How does Kafka avoid split-brain?
- Every request the Controller sends to a broker carries the controller epoch. If a broker finds that the epoch in a request is smaller than the value it has cached, the request must come from an old Controller and is rejected; in the normal case this is sufficient.
- What about abnormal cases, for example a broker that first receives and processes a request from the stale Controller? From the current code, Kafka does not have a complete solution for this case.
- Normally, once a new Controller is elected it sends a metadata request to all brokers, so every broker learns the latest controller epoch. This still does not rule the problem out entirely, but the probability is very small, and even if it happens the impact is limited thanks to Kafka's highly reliable architecture; for now it is not considered a serious problem.
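A small sketch of this broker-side fencing (the object and field names are illustrative, not Kafka's actual code; Kafka itself answers such requests with a STALE_CONTROLLER_EPOCH error):
object ControllerEpochFencing {
  // The broker remembers the highest controller epoch it has seen and rejects any
  // controller request that carries a smaller (i.e. stale) epoch.
  @volatile private var cachedControllerEpoch: Int = 0

  def handleControllerRequest(requestEpoch: Int)(doWork: => Unit): Unit = {
    if (requestEpoch < cachedControllerEpoch) {
      // request from an old controller: reject it
      throw new IllegalStateException(s"Stale controller epoch $requestEpoch < $cachedControllerEpoch")
    } else {
      cachedControllerEpoch = requestEpoch
      doWork
    }
  }
}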
3.2 Managing the partition and replica state machines
This process is essentially a publish/subscribe system built on ZooKeeper: when a topic is created, the publisher is the TopicCommand class and the subscriber is the KafkaController class. The KafkaController elects the leader of each partition (the first replica in the replica list) and sends LeaderAndIsrRequest to the brokers that TopicCommand assigned; each of those brokers then starts its replica as the leader (if the assigned leader's broker id equals its own broker id) or as a follower.
Partitions and replicas each have four states: new, online, offline, and non-existent. The four partition states are:
- NewPartition
- OnlinePartition
- OfflinePartition
- NonExistentPartition
The four replica states are:
- NewReplica
- OnlineReplica
- OfflineReplica
- NonExistentReplica
When an external event occurs, the state machine's transition method is invoked and a different response is made depending on the state. Through the partition and replica state machines the controller manages the cluster nodes and performs leader election, rebalancing, and so on; a simplified sketch of the transition check follows.
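The stripped-down model below shows how such a state machine validates transitions: each state declares which previous states it may legally be reached from, mirroring the idea behind Kafka's PartitionState (object name and layout are illustrative):
object PartitionStateModel {
  // A transition is legal only if the current state is in the target state's
  // set of valid previous states.
  sealed trait PartitionState { def validPreviousStates: Set[PartitionState] }
  case object NewPartition extends PartitionState {
    val validPreviousStates: Set[PartitionState] = Set(NonExistentPartition)
  }
  case object OnlinePartition extends PartitionState {
    val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
  }
  case object OfflinePartition extends PartitionState {
    val validPreviousStates: Set[PartitionState] = Set(NewPartition, OnlinePartition, OfflinePartition)
  }
  case object NonExistentPartition extends PartitionState {
    val validPreviousStates: Set[PartitionState] = Set(OfflinePartition)
  }

  def isValidTransition(from: PartitionState, to: PartitionState): Boolean =
    to.validPreviousStates.contains(from)
}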
The KafkaController listening flow:
3.3 Electing the partition leader
Whenever a partition transitions from the "offline" state or the "online" state back to the "online" state, the partition leader has to be (re)elected:
- First read the partition's current leader and ISR set
- Prefer the first replica in the ISR as the leader. If the first replica is down, choose another replica as the leader.
- If the whole ISR is down, choose the first live replica from the AR (the full set of assigned replicas) as the leader.
Electing the preferred replica (the first replica in the assignment) is done for partition balance: Kafka's partition assignment guarantees that preferred replicas are spread evenly across all nodes, so keeping them as leaders keeps leadership balanced. A background partition-rebalance thread periodically checks whether the preferred replica (the first replica) is the current leader; if it is not, a new election is triggered to make the preferred replica the leader. The KafkaController then sends LeaderAndIsr requests to all live replicas of the partition so that those nodes update their metadata. A simplified sketch of this election rule follows.
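A simplified sketch of the election rule above (the object/function names and the allowUnclean flag are illustrative; in Kafka the fallback beyond the ISR is governed by unclean.leader.election.enable):
object LeaderElectionSketch {
  // Prefer the first live replica that is also in the ISR; only when the whole ISR is
  // down (and unclean election is allowed) fall back to the first live replica in the
  // assigned replica list (AR).
  def electLeader(assignedReplicas: Seq[Int],
                  isr: Set[Int],
                  liveBrokers: Set[Int],
                  allowUnclean: Boolean): Option[Int] = {
    assignedReplicas.find(r => liveBrokers.contains(r) && isr.contains(r)).orElse {
      if (allowUnclean) assignedReplicas.find(liveBrokers.contains) else None
    }
  }
}
The background preferred-replica rebalance mentioned above is controlled by broker configs such as auto.leader.rebalance.enable and leader.imbalance.check.interval.seconds.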
4. Topic Creation Source Code
4.1 How partitions are evenly distributed across brokers when the command creates a topic
The concrete execution path starts in TopicCommand.main, which uses the admin client to send a CreateTopicsRequest:
val createResult = adminClient.createTopics(Collections.singleton(newTopic))
KafkaApis handles the CreateTopicsRequest:
class KafkaApis(val requestChannel: RequestChannel,
val replicaManager: ReplicaManager,
val adminManager: AdminManager,
val groupCoordinator: GroupCoordinator,
val txnCoordinator: TransactionCoordinator,
val controller: KafkaController,
val zkClient: KafkaZkClient,
val brokerId: Int,
val config: KafkaConfig,
val metadataCache: MetadataCache,
val metrics: Metrics,
val authorizer: Option[Authorizer],
val quotas: QuotaManagers,
val fetchManager: FetchManager,
brokerTopicStats: BrokerTopicStats,
val clusterId: String,
time: Time,
val tokenManager: DelegationTokenManager) extends Logging {
def handle(request: RequestChannel.Request): Unit = {
try {
request.header.apiKey match {
// handle topic creation
case ApiKeys.CREATE_TOPICS => handleCreateTopicsRequest(request)
// ... inside handleCreateTopicsRequest, the topic is created via AdminManager:
adminManager.createTopics(createTopicsRequest.data.timeoutMs,
createTopicsRequest.data.validateOnly,
toCreate,
authorizedForDescribeConfigs,
handleCreateTopicsResults)
AdminManager then computes the partition and replica assignment:
class AdminManager(val config: KafkaConfig,
val metrics: Metrics,
val metadataCache: MetadataCache,
val zkClient: KafkaZkClient) extends Logging with KafkaMetricsGroup {
// compute the topic / partition / replica assignment
AdminUtils.assignReplicasToBrokers(
brokers, resolvedNumPartitions, resolvedReplicationFactor)
// create the topic (write the assignment) in ZooKeeper
adminZkClient.createTopicWithAssignment(topic.name, configs, assignments)
Computing the topic / partition / replica assignment:
1. Pick a random broker position as startIndex
2. Start the current partition id at startPartitionId (at least 0)
3. Pick a random shift within the broker count as the offset for the next replica (nextReplicaShift)
4. For each partition:
4.1. (currentPartitionId + startIndex) % brokerArray.length gives the broker of the first replica
4.2. For each remaining replica:
4.2.1. compute the replica's broker position with the replicaIndex helper
4.3. increment the current partition id by one
private def assignReplicasToBrokersRackUnaware(nPartitions: Int,
replicationFactor: Int,
brokerList: Seq[Int],
fixedStartIndex: Int,
startPartitionId: Int): Map[Int, Seq[Int]] = {
val ret = mutable.Map[Int, Seq[Int]]()
val brokerArray = brokerList.toArray
// pick a random broker position as startIndex
val startIndex = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
// the current partition id starts at startPartitionId (at least 0)
var currentPartitionId = math.max(0, startPartitionId)
// pick a random shift within the broker count
var nextReplicaShift = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
for (_ <- 0 until nPartitions) {
// increase the shift only after a full round of brokerArray.length partitions
if (currentPartitionId > 0 && (currentPartitionId % brokerArray.length == 0))
nextReplicaShift += 1
// (currentPartitionId + startIndex) % broker count gives the broker of the first replica
val firstReplicaIndex = (currentPartitionId + startIndex) % brokerArray.length
val replicaBuffer = mutable.ArrayBuffer(brokerArray(firstReplicaIndex))
for (j <- 0 until replicationFactor - 1)
// the positions of the remaining replicas are computed by the replicaIndex helper
replicaBuffer += brokerArray(replicaIndex(firstReplicaIndex, nextReplicaShift, j, brokerArray.length))
ret.put(currentPartitionId, replicaBuffer)
// move on to the next partition
currentPartitionId += 1
}
ret
}
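For completeness, the replicaIndex helper referenced in the comments above (as it appears in AdminUtils) spaces the remaining replicas out from the first one:
// The j-th extra replica is shifted away from the first replica by
// 1 + (nextReplicaShift + j) % (nBrokers - 1) broker positions, so the replicas of one
// partition never collide and different partitions get different replica patterns.
private def replicaIndex(firstReplicaIndex: Int, secondReplicaShift: Int,
                         replicaIndex: Int, nBrokers: Int): Int = {
  val shift = 1 + (secondReplicaShift + replicaIndex) % (nBrokers - 1)
  (firstReplicaIndex + shift) % nBrokers
}
For example, with 5 brokers, replication factor 3 and startIndex = nextReplicaShift = 0, partition 0 is assigned to brokers (0, 1, 2), partition 1 to (1, 2, 3), and so on; after the first 5 partitions nextReplicaShift becomes 1, so partition 5 gets (0, 2, 3).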
Write the topic config; the ZooKeeper path is /config/topics/TopicName.
Write the topic's partition assignment; the ZooKeeper path is /brokers/topics/TopicName.
// write out the config if there is any, this isn't transactional with the partition assignments
zkClient.setOrCreateEntityConfigs(ConfigType.Topic, topic, config)
// create the partition assignment
writeTopicPartitionAssignment(topic, partitionReplicaAssignment.mapValues(ReplicaAssignment(_)).toMap, isUpdate = false)
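As an illustration only (the exact JSON layout varies across Kafka versions), the data written to these two znodes for a hypothetical 3-partition, 2-replica topic "my-topic" looks roughly like this:
/config/topics/my-topic  ->  {"version":1,"config":{}}
/brokers/topics/my-topic ->  {"version":1,"partitions":{"0":[1,2],"1":[2,3],"2":[3,1]}}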
4.2 How the Kafka Controller handles the topic-creation event
When KafkaController starts it instantiates the various *Handler classes; through them the paths to watch are initialized and events are put onto the queue for the corresponding handling.
class KafkaController(val config: KafkaConfig,
zkClient: KafkaZkClient,
time: Time,
metrics: Metrics,
initialBrokerInfo: BrokerInfo,
initialBrokerEpoch: Long,
tokenManager: DelegationTokenManager,
threadNamePrefix: Option[String] = None)
extends ControllerEventProcessor with Logging with KafkaMetricsGroup {
// the TopicChangeHandler is instantiated here (the BrokerChangeHandler below shows the general handler pattern)
private val topicChangeHandler = new TopicChangeHandler(eventManager)
class BrokerChangeHandler(eventManager: ControllerEventManager) extends ZNodeChildChangeHandler {
// the ZooKeeper path this handler watches
override val path: String = BrokerIdsZNode.path
override def handleChildChange(): Unit = {
// put a BrokerChange event onto the queue
eventManager.put(BrokerChange)
}
}
class ControllerEventManager(controllerId: Int,
processor: ControllerEventProcessor,
time: Time,
rateAndTimeMetrics: Map[ControllerState, KafkaTimer]) extends KafkaMetricsGroup {
// events from KafkaController initialization and from ZK watcher callbacks all go onto this queue
def put(event: ControllerEvent): QueuedEvent = inLock(putLock) {
val queuedEvent = new QueuedEvent(event, time.milliseconds())
queue.put(queuedEvent)
queuedEvent
}
class ControllerEventThread(name: String) extends ShutdownableThread(name = name, isInterruptible = false) {
logIdent = s"[ControllerEventThread controllerId=$controllerId] "
override def doWork(): Unit = {
val dequeued = queue.take()
dequeued.event match {
case ShutdownEventThread => // The shutting down of the thread has been initiated at this point. Ignore this event.
case controllerEvent =>
_state = controllerEvent.state
eventQueueTimeHist.update(time.milliseconds() - dequeued.enqueueTimeMs)
try {
// take the event from the queue and process it
def process(): Unit = dequeued.process(processor)
rateAndTimeMetrics.get(state) match {
case Some(timer) => timer.time { process() }
case None => process()
}
} catch {
case e: Throwable => error(s"Uncaught error processing event $controllerEvent", e)
}
_state = ControllerState.Idle
}
}
}
}
The processor then handles the event:
override def process(event: ControllerEvent): Unit = {
try {
event match {
case event: MockEvent =>
............
case BrokerChange =>
processBrokerChange()
case BrokerModifications(brokerId) =>
processBrokerModification(brokerId)
case ControllerChange =>
processControllerChange()
case Reelect =>
processReelect()
case RegisterBrokerAndReelect =>
processRegisterBrokerAndReelect()
case Expire =>
processExpire()
// handle the topic-change event
case TopicChange =>
processTopicChange()
...........
case Startup =>
processStartup()
}
} catch {
..........
} finally {
updateMetrics()
}
}
The actual topic-change handling logic:
private def processTopicChange(): Unit = {
if (!isActive) return
// read all topics currently in ZooKeeper (registering a watch on the topics path)
val topics = zkClient.getAllTopicsInCluster(true)
// topics that have been added
val newTopics = topics -- controllerContext.allTopics
// topics that have been deleted
val deletedTopics = controllerContext.allTopics -- topics
controllerContext.allTopics = topics
// register the path-to-handler mapping: the handlers are added to zNodeChangeHandlers, which registers the watchers
registerPartitionModificationsHandlers(newTopics.toSeq)
// fetch the replica assignment of the newly added topics from ZooKeeper
val addedPartitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(newTopics)
deletedTopics.foreach(controllerContext.removeTopic)
addedPartitionReplicaAssignment.foreach {
case (topicAndPartition, newReplicaAssignment) => controllerContext.updatePartitionFullReplicaAssignment(topicAndPartition, newReplicaAssignment)
}
info(s"New topics: [$newTopics], deleted topics: [$deletedTopics], new partition replica assignment " +
s"[$addedPartitionReplicaAssignment]")
if (addedPartitionReplicaAssignment.nonEmpty)
// proceed to the actual partition-creation handling
onNewPartitionCreation(addedPartitionReplicaAssignment.keySet)
}
The partition and replica state machines then run handleStateChanges for the new partitions and replicas:
private def onNewPartitionCreation(newPartitions: Set[TopicPartition]): Unit = {
info(s"New partition creation callback for ${newPartitions.mkString(",")}")
// move the new partitions to the NewPartition state
partitionStateMachine.handleStateChanges(newPartitions.toSeq, NewPartition)
// move the new replicas to the NewReplica state
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, NewReplica)
// move the new partitions to the OnlinePartition state (this is where the leader is initialized)
partitionStateMachine.handleStateChanges(
newPartitions.toSeq,
OnlinePartition,
Some(OfflinePartitionLeaderElectionStrategy(false))
)
// move the new replicas to the OnlineReplica state
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, OnlineReplica)
}
Moving new partitions to the NewPartition state and then to the OnlinePartition state:
private def doHandleStateChanges(
partitions: Seq[TopicPartition],
targetState: PartitionState,
partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
partitions.foreach(partition => controllerContext.putPartitionStateIfNotExists(partition, NonExistentPartition))
val (validPartitions, invalidPartitions) = controllerContext.checkValidPartitionStateChange(partitions, targetState)
invalidPartitions.foreach(partition => logInvalidTransition(partition, targetState))
targetState match {
// handle the NewPartition target state
case NewPartition =>
validPartitions.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState with " +
s"assigned replicas ${controllerContext.partitionReplicaAssignment(partition).mkString(",")}")
controllerContext.putPartitionState(partition, NewPartition)
}
Map.empty
// handle the OnlinePartition target state
case OnlinePartition =>
val uninitializedPartitions = validPartitions.filter(partition => partitionState(partition) == NewPartition)
val partitionsToElectLeader = validPartitions.filter(partition => partitionState(partition) == OfflinePartition || partitionState(partition) == OnlinePartition)
if (uninitializedPartitions.nonEmpty) {
// initialize the leader and ISR for brand-new partitions
val successfulInitializations = initializeLeaderAndIsrForPartitions(uninitializedPartitions)
successfulInitializations.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to $targetState with state " +
s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
controllerContext.putPartitionState(partition, OnlinePartition)
}
}
if (partitionsToElectLeader.nonEmpty) {
val electionResults = electLeaderForPartitions(
partitionsToElectLeader,
partitionLeaderElectionStrategyOpt.getOrElse(
throw new IllegalArgumentException("Election strategy is a required field when the target state is OnlinePartition")
)
)
electionResults.foreach {
case (partition, Right(leaderAndIsr)) =>
stateChangeLog.trace(
s"Changed partition $partition from ${partitionState(partition)} to $targetState with state $leaderAndIsr"
)
controllerContext.putPartitionState(partition, OnlinePartition)
case (_, Left(_)) => // Ignore; no need to update partition state on election error
}
electionResults
} else {
Map.empty
}
case OfflinePartition =>
validPartitions.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
controllerContext.putPartitionState(partition, OfflinePartition)
}
Map.empty
case NonExistentPartition =>
validPartitions.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
controllerContext.putPartitionState(partition, NonExistentPartition)
}
Map.empty
}
}
In initializeLeaderAndIsrForPartitions, the first broker in the live-replica sequence is taken as the leader:
val leaderIsrAndControllerEpochs = partitionsWithLiveReplicas.map { case (partition, liveReplicas) =>
// pick the first live replica as the leader
val leaderAndIsr = LeaderAndIsr(liveReplicas.head, liveReplicas.toList)
val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
partition -> leaderIsrAndControllerEpoch
}.toMap
val createResponses = try {
zkClient.createTopicPartitionStatesRaw(leaderIsrAndControllerEpochs, controllerContext.epochZkVersion)
}
........
// store the leader/ISR info in controllerContext
controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
// queue a LeaderAndIsr request for the brokers holding this partition (goes into leaderAndIsrRequestMap)
controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(leaderIsrAndControllerEpoch.leaderAndIsr.isr,
  partition, leaderIsrAndControllerEpoch, controllerContext.partitionFullReplicaAssignment(partition), isNew = true)
The topic, partition and replica information is put into leaderAndIsrRequestMap so that it can later be looked up by broker id:
def addLeaderAndIsrRequestForBrokers(brokerIds: Seq[Int],
topicPartition: TopicPartition,
leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch,
replicaAssignment: ReplicaAssignment,
isNew: Boolean): Unit = {
brokerIds.filter(_ >= 0).foreach { brokerId =>
val result = leaderAndIsrRequestMap.getOrElseUpdate(brokerId, mutable.Map.empty)
val alreadyNew = result.get(topicPartition).exists(_.isNew)
val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
result.put(topicPartition, new LeaderAndIsrPartitionState()
.setTopicName(topicPartition.topic)
.setPartitionIndex(topicPartition.partition)
.setControllerEpoch(leaderIsrAndControllerEpoch.controllerEpoch)
.setLeader(leaderAndIsr.leader)
.setLeaderEpoch(leaderAndIsr.leaderEpoch)
.setIsr(leaderAndIsr.isr.map(Integer.valueOf).asJava)
.setZkVersion(leaderAndIsr.zkVersion)
.setReplicas(replicaAssignment.replicas.map(Integer.valueOf).asJava)
.setAddingReplicas(replicaAssignment.addingReplicas.map(Integer.valueOf).asJava)
.setRemovingReplicas(replicaAssignment.removingReplicas.map(Integer.valueOf).asJava)
.setIsNew(isNew || alreadyNew))
}
Finally, sendRequestsToBrokers is called for the brokers that need to be notified:
def sendRequestsToBrokers(controllerEpoch: Int): Unit = {
try {
val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerEpoch)
sendLeaderAndIsrRequest(controllerEpoch, stateChangeLog)
sendUpdateMetadataRequests(controllerEpoch, stateChangeLog)
sendStopReplicaRequests(controllerEpoch)
} catch {
}
}
4.3 How a broker becomes a partition leader or follower
When a broker receives the LeaderAndIsrRequest from the Controller, the request is dispatched by KafkaApis.handle:
case RequestKeys.LeaderAndIsrKey => handleLeaderAndIsrRequest(request)
The entry point for the broker to become the leader or a follower of a partition is replicaManager.becomeLeaderOrFollower.
Whether the broker becomes the leader of a partition depends on whether its broker id matches the broker id assigned as that partition's leader: if they match it becomes the leader, otherwise a follower. A sketch of this split is shown below, followed by the corresponding excerpt.
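A sketch of that split inside becomeLeaderOrFollower (simplified; the real code also validates the leader epoch of each partition state before splitting the map):
// Simplified: partitions whose assigned leader id equals this broker's id become
// leaders on this broker; all the others become followers.
val (partitionsTobeLeader, partitionsToBeFollower) = partitionStates.partition {
  case (_, partitionState) => partitionState.leader == localBrokerId
}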
val highWatermarkCheckpoints = new LazyOffsetCheckpoints(this.highWatermarkCheckpoints)
val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
// become leader for these partitions
makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap,
highWatermarkCheckpoints)
else
Set.empty[Partition]
val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
// become follower for these partitions
makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap,
highWatermarkCheckpoints)
else
Set.empty[Partition]
To make the current broker the leader of the given partitions, the following steps are performed:
* 1. Stop the fetchers for these partitions
* 2. Update the cached metadata of these partitions
* 3. Add the partitions to the set of leader partitions
// First stop fetchers for all the partitions
replicaFetcherManager.removeFetcherForPartitions(partitionStates.keySet.map(_.topicPartition))
// Update the partition information to be the leader
partitionStates.foreach { case (partition, partitionState) =>
try {
if (partition.makeLeader(controllerId, partitionState, correlationId, highWatermarkCheckpoints)) {
To make the current broker a follower of the given partitions, the following steps are performed:
* 1. Remove the partitions from the set of leader partitions
* 2. Mark the replicas as followers so that producers stop writing to them
* 3. Stop all fetchers for these partitions so that the replica fetcher threads stop writing to them
* 4. Truncate the log and checkpoint the offsets for these partitions
* 5. If the broker is not shutting down, add replica fetcher threads that fetch data from the new leaders
metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
// Only change partition state when the leader is available
case Some(_) =>
if (partition.makeFollower(controllerId, partitionState, correlationId, highWatermarkCheckpoints))
.....
replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(_.topicPartition))
partitionsToMakeFollower.foreach { partition =>
completeDelayedFetchOrProduceRequests(partition.topicPartition)
}
replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
4.4 Summary
This article followed the source code of the topic-creation process, which roughly goes as follows:
1. The admin client writes the topic's znodes in ZooKeeper
2. The controller's watcher observes the znode change, elects the partition leaders, and notifies the brokers
3. Each broker, depending on whether the election made it a leader or a follower, handles the partition accordingly