In a Kafka cluster, only one Controller becomes the leader and manages the entire cluster. Every other broker still creates a KafkaController object; the only thing those brokers can do with it is compete to become the new Controller when the current leader fails.
Both the startup and the failover of KafkaController are closely tied to ZookeeperLeaderElector, which has two important fields, leaderId and leaderChangeListener:
class ZookeeperLeaderElector(controllerContext: ControllerContext,
                             electionPath: String,
                             onBecomingLeader: () => Unit,
                             onResigningAsLeader: () => Unit,
                             brokerId: Int)
  extends LeaderElector with Logging {
  var leaderId = -1 // caches the broker id of the current Controller leader
  // create the election path in ZK, if one does not exist
  val index = electionPath.lastIndexOf("/")
  if (index > 0)
    controllerContext.zkUtils.makeSurePersistentPathExists(electionPath.substring(0, index))
  // LeaderChangeListener watches the /controller node; whenever the leader id
  // stored there changes, the listener's callbacks below are triggered.
  val leaderChangeListener = new LeaderChangeListener
}
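For context, ZookeeperLeaderElector.startup() is where the listener gets wired up: it subscribes leaderChangeListener on the election path and then immediately attempts an election. A sketch of startup(), based on the same class (treat the exact body as an approximation):

def startup {
  inLock(controllerContext.controllerLock) {
    // watch /controller for data changes and deletion, then try to become leader
    controllerContext.zkUtils.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
    elect
  }
}

Once subscribed, a change to the data stored in /controller fires LeaderChangeListener.handleDataChange():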
def handleDataChange(dataPath: String, data: Object) {
  inLock(controllerContext.controllerLock) {
    val amILeaderBeforeDataChange = amILeader
    // record the broker id of the new Controller
    leaderId = KafkaController.parseControllerId(data.toString)
    info("New leader is %d".format(leaderId))
    // The old leader needs to resign leadership if it is no longer the leader
    // If this broker just changed from Controller leader to follower, do the cleanup work.
    if (amILeaderBeforeDataChange && !amILeader)
      onResigningAsLeader()
  }
}
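The data parsed here is the JSON blob written by elect() (see electString below), so parseControllerId() only needs to pull out the brokerid field. A hypothetical example of the node's content:

// hypothetical content of /controller after broker 1 wins an election
val controllerInfo = """{"version":1,"brokerid":1,"timestamp":"1489492978000"}"""
val id = KafkaController.parseControllerId(controllerInfo) // => 1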
When the data in the /controller node is deleted, handleDataDeleted() is triggered to handle it:
def handleDataDeleted(dataPath: String) {
  inLock(controllerContext.controllerLock) {
    debug("%s leader change listener fired for path %s to handle data deleted: trying to elect as a leader"
      .format(brokerId, dataPath))
    if(amILeader)
      onResigningAsLeader()
    // try a new round of leader election
    elect
  }
}
The ZookeeperLeaderElector.elect() method is as follows:
def elect: Boolean = {
  val timestamp = SystemTime.milliseconds.toString
  val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))
  // read the id of the Controller leader currently recorded in ZK
  leaderId = getControllerID
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
   * it's possible that the controller has already been elected when we get here. This check will prevent the following
   * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
   */
  // a leader already exists, so give up this election
  if(leaderId != -1) {
    debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
    return amILeader
  }
  try {
    // try to create the ephemeral node; if it already exists, a ZkNodeExistsException is thrown
    val zkCheckedEphemeral = new ZKCheckedEphemeral(electionPath,
                                                    electString,
                                                    controllerContext.zkUtils.zkConnection.getZookeeper,
                                                    JaasUtils.isZkSecurityEnabled())
    zkCheckedEphemeral.create()
    info(brokerId + " successfully elected as leader")
    // creation succeeded, so this broker becomes the leader; update leaderId
    leaderId = brokerId
    // this actually invokes onControllerFailover()
    onBecomingLeader()
  } catch {
    case e: ZkNodeExistsException =>
      // If someone else has written the path, then
      leaderId = getControllerID
      if (leaderId != -1)
        debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
      else
        warn("A leader has been elected but just resigned, this will result in another round of election")
    case e2: Throwable =>
      error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
      // handle exceptions from onBecomingLeader(): reset leaderId and delete the /controller path
      resign()
  }
  amILeader
}
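Two helpers used above deserve a quick look. Roughly (a sketch of the same class; treat the bodies as approximations), getControllerID reads the election path and resign deletes it:

// read the current controller id from /controller; -1 if no controller exists
private def getControllerID(): Int = {
  controllerContext.zkUtils.readDataMaybeNull(electionPath)._1 match {
    case Some(controller) => KafkaController.parseControllerId(controller)
    case None => -1
  }
}

// give up leadership: forget the cached id and delete /controller, which makes
// every broker's LeaderChangeListener fire handleDataDeleted() and re-elect
def resign() = {
  leaderId = -1
  controllerContext.zkUtils.deletePath(electionPath)
}

When the election succeeds, onBecomingLeader() runs onControllerFailover(), documented and shown next: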
/**
* This callback is invoked by the zookeeper leader elector on electing the current broker as the new controller.
* It does the following things on the become-controller state change -
* 1. Register controller epoch changed listener
* 2. Increments the controller epoch
* 3. Initializes the controller's context object that holds cache objects for current topics, live brokers and
* leaders for all existing partitions.
* 4. Starts the controller's channel manager
* 5. Starts the replica state machine
* 6. Starts the partition state machine
* If it encounters any unexpected exception/error while becoming controller, it resigns as the current controller.
* This ensures another controller election will be triggered and there will always be an actively serving controller
*/
def onControllerFailover() {
  if(isRunning) {
    info("Broker %d starting become controller state transition".format(config.brokerId))
    // read the controller epoch from ZK and cache it in the ControllerContext
    readControllerEpochFromZookeeper()
    // increment the controller epoch and write it back to ZK
    incrementControllerEpoch(zkUtils.zkClient)
    // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
    // (these are the series of ZooKeeper listeners introduced earlier)
    registerReassignedPartitionsListener()
    registerIsrChangeNotificationListener()
    registerPreferredReplicaElectionListener()
    partitionStateMachine.registerListeners()
    replicaStateMachine.registerListeners()
    // initialize the ControllerContext, mainly by reading topic, partition and
    // replica metadata from ZooKeeper
    initializeControllerContext()
    // start the replicaStateMachine and initialize the state of every replica
    replicaStateMachine.startup()
    // start the partitionStateMachine and initialize the state of every partition
    partitionStateMachine.startup()
    // register the partition change listeners for all existing topics on failover
    controllerContext.allTopics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic))
    info("Broker %d is ready to serve as the new controller with epoch %d".format(config.brokerId, epoch))
    // switch the broker state
    brokerState.newState(RunningAsController)
    // resume any in-flight partition reassignment
    maybeTriggerPartitionReassignment()
    // resume any in-flight preferred replica election
    maybeTriggerPreferredReplicaElection()
    /* send partition leadership info to all live brokers via UpdateMetadataRequest */
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
    // start the automatic partition rebalance task if it is enabled in the config
    if (config.autoLeaderRebalanceEnable) {
      info("starting the partition rebalance scheduler")
      autoRebalanceScheduler.startup()
      autoRebalanceScheduler.schedule("partition-rebalance-thread", checkAndTriggerPartitionRebalance,
        5, config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)
    }
    deleteTopicManager.start()
  }
  else
    info("Controller has been shut down, aborting startup/failover")
}
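The epoch bump in step 2 is what fences off a zombie controller: it is a conditional update keyed on the ZooKeeper node version, so two brokers can never both increment the same epoch. A simplified sketch of incrementControllerEpoch() (the real method also handles creating /controller_epoch on first use):

def incrementControllerEpoch(zkClient: ZkClient) = {
  val newControllerEpoch = controllerContext.epoch + 1
  // conditional update: succeeds only if the znode still has the version we last read
  val (updateSucceeded, newVersion) = zkUtils.conditionalUpdatePersistentPathIfExists(
    ZkUtils.ControllerEpochPath, newControllerEpoch.toString, controllerContext.epochZkVersion)
  if (!updateSucceeded)
    throw new ControllerMovedException("Controller moved to another broker, aborting controller startup")
  controllerContext.epoch = newControllerEpoch
  controllerContext.epochZkVersion = newVersion
}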
The initializeControllerContext() method reads the following state from ZooKeeper:
private def initializeControllerContext() {
  // read /brokers/ids to find the live brokers
  controllerContext.liveBrokers = zkUtils.getAllBrokersInCluster().toSet
  // read the /brokers/topics path
  controllerContext.allTopics = zkUtils.getAllTopics().toSet
  // read /brokers/topics/xxx/partitions and initialize the AR set of each partition
  controllerContext.partitionReplicaAssignment = zkUtils.getReplicaAssignmentForTopics(controllerContext.allTopics.toSeq)
  controllerContext.partitionLeadershipInfo = new mutable.HashMap[TopicAndPartition, LeaderIsrAndControllerEpoch]
  controllerContext.shuttingDownBrokerIds = mutable.Set.empty[Int]
  // update the leader and isr cache for all existing partitions from Zookeeper
  updateLeaderAndIsrCache()
  // start the ControllerChannelManager
  startChannelManager()
  // read the partitions that need a preferred replica election
  initializePreferredReplicaElection()
  // read /admin/reassign_partitions and initialize the partitions whose replicas need reassignment
  initializePartitionReassignment()
  // set up the TopicDeletionManager
  initializeTopicDeletion()
  info("Currently active brokers in the cluster: %s".format(controllerContext.liveBrokerIds))
  info("Currently shutting brokers in the cluster: %s".format(controllerContext.shuttingDownBrokerIds))
  info("Current list of topics in the cluster: %s".format(controllerContext.allTopics))
}
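To make the AR cache concrete, here is a hypothetical example (topic name and broker ids invented for illustration) of what partitionReplicaAssignment might hold after this method runs. The head of each replica list is the partition's preferred replica, which is exactly what the rebalance logic below keys on:

// hypothetical cache for topic "orders": 2 partitions, replication factor 2
controllerContext.partitionReplicaAssignment = mutable.Map(
  TopicAndPartition("orders", 0) -> Seq(1, 2), // preferred replica: broker 1
  TopicAndPartition("orders", 1) -> Seq(2, 1)  // preferred replica: broker 2
)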
Partition Rebalance
To balance load across the cluster, onControllerFailover() starts a periodic task named partition-rebalance-thread that implements automatic leader rebalancing. When some brokers go down, leadership can pile up on the surviving brokers and overload them; preferred replica elections then need to be triggered to move leaders back. The task body is checkAndTriggerPartitionRebalance():
private def checkAndTriggerPartitionRebalance(): Unit = {
  if (isActive()) {
    trace("checking need to trigger partition rebalance")
    // get all the active brokers
    var preferredReplicasForTopicsByBrokers: Map[Int, Map[TopicAndPartition, Seq[Int]]] = null
    inLock(controllerContext.controllerLock) {
      // group partitions by the broker where their preferred replica lives (the head of the AR list)
      preferredReplicasForTopicsByBrokers =
        controllerContext.partitionReplicaAssignment.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p._1.topic)).groupBy {
          case(topicAndPartition, assignedReplicas) => assignedReplicas.head
        }
    }
    debug("preferred replicas by broker " + preferredReplicasForTopicsByBrokers)
    // for each broker, check if a preferred replica election needs to be triggered
    // by computing the broker's imbalance ratio
    preferredReplicasForTopicsByBrokers.foreach {
      case(leaderBroker, topicAndPartitionsForBroker) => {
        var imbalanceRatio: Double = 0
        var topicsNotInPreferredReplica: Map[TopicAndPartition, Seq[Int]] = null
        inLock(controllerContext.controllerLock) {
          // partitions whose current leader is not the preferred replica
          topicsNotInPreferredReplica =
            topicAndPartitionsForBroker.filter {
              case(topicPartition, replicas) => {
                controllerContext.partitionLeadershipInfo.contains(topicPartition) &&
                controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader != leaderBroker
              }
            }
          debug("topics not in preferred replica " + topicsNotInPreferredReplica)
          val totalTopicPartitionsForBroker = topicAndPartitionsForBroker.size
          val totalTopicPartitionsNotLedByBroker = topicsNotInPreferredReplica.size
          // partitions not led by their preferred replica, divided by the total number
          // of partitions whose preferred replica lives on this broker
          imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker
          trace("leader imbalance ratio for broker %d is %f".format(leaderBroker, imbalanceRatio))
        }
        // check ratio and if greater than desired ratio, trigger a rebalance for the topic partitions
        // that need to be on this broker
        if (imbalanceRatio > (config.leaderImbalancePerBrokerPercentage.toDouble / 100)) {
          topicsNotInPreferredReplica.foreach {
            case(topicPartition, replicas) => {
              inLock(controllerContext.controllerLock) {
                // do this check only if the broker is live and there are no partitions being reassigned currently
                // and preferred replica election is not in progress
                if (controllerContext.liveBrokerIds.contains(leaderBroker) &&
                    controllerContext.partitionsBeingReassigned.size == 0 &&
                    controllerContext.partitionsUndergoingPreferredReplicaElection.size == 0 &&
                    !deleteTopicManager.isTopicQueuedUpForDeletion(topicPartition.topic) &&
                    controllerContext.allTopics.contains(topicPartition.topic)) {
                  // trigger a preferred replica election for this partition
                  onPreferredReplicaElection(Set(topicPartition), true)
                }
              }
            }
          }
        }
      }
    }
  }
}
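A worked example with invented numbers: suppose broker 1 is the preferred replica for 10 partitions but currently leads only 8 of them.

val totalTopicPartitionsForBroker = 10     // partitions whose preferred replica is broker 1
val totalTopicPartitionsNotLedByBroker = 2 // of those, partitions broker 1 does not currently lead
val imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker // 0.2

With the default leader.imbalance.per.broker.percentage of 10, the ratio 0.2 exceeds 0.10, so preferred replica elections are triggered for those two partitions.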
onControllerResignation
When the old leader observes that the data in /controller has been deleted, it invokes a callback to clean up after itself; that callback is the onControllerResignation() method:
def onControllerResignation() {
  debug("Controller resigning, broker id %d".format(config.brokerId))
  // de-register the ZooKeeper listeners
  deregisterIsrChangeNotificationListener()
  deregisterReassignedPartitionsListener()
  deregisterPreferredReplicaElectionListener()
  // shutdown delete topic manager
  if (deleteTopicManager != null)
    deleteTopicManager.shutdown()
  // shutdown leader rebalance scheduler
  if (config.autoLeaderRebalanceEnable)
    autoRebalanceScheduler.shutdown()
  inLock(controllerContext.controllerLock) {
    // de-register partition ISR listeners for on-going partition reassignment tasks
    deregisterReassignedPartitionsIsrChangeListeners()
    // shutdown partition state machine
    partitionStateMachine.shutdown()
    // shutdown replica state machine
    replicaStateMachine.shutdown()
    // shutdown controller channel manager
    if(controllerContext.controllerChannelManager != null) {
      controllerContext.controllerChannelManager.shutdown()
      controllerContext.controllerChannelManager = null
    }
    // reset controller context
    controllerContext.epoch = 0
    controllerContext.epochZkVersion = 0
    // switch the broker state back
    brokerState.newState(RunningAsBroker)
    info("Broker %d resigned as the controller".format(config.brokerId))
  }
}
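To close the loop: the two callbacks are handed to the elector when KafkaController constructs it, which is how onBecomingLeader() and onResigningAsLeader() inside ZookeeperLeaderElector resolve to the methods above. Approximately, from KafkaController:

// the controller's elector, with failover and resignation callbacks bound
private val controllerElector = new ZookeeperLeaderElector(controllerContext, ZkUtils.ControllerPath,
  onControllerFailover, onControllerResignation, config.brokerId)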