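While using Kafka for log collection recently, I noticed that utilization differed considerably across the disks in the cluster. That prompted me to look into how Kafka balances partitions across brokers and how it balances data across the multiple log directories (disks) of a single broker. This post summarizes what I found.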
Background
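Some time ago I expanded the disks on every broker in the Kafka cluster. Each broker originally had a single disk; I mounted two more disks on each broker and added a directory on each of them to Kafka's log.dirs setting. Recently, when creating a new topic, I found that most of its partitions on each broker were placed on the two newly added disks, so the first disk sat nearly idle while the two new ones ran at very high utilization. My understanding had been that Kafka balances across both brokers and disks, so why not here? With that question in mind, I pulled the Kafka source from GitHub to find out.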
Investigation
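First I checked how many partitions of the newly created topic each broker held and found the counts were equal, so nothing was wrong at that level. The assignment strategy in the source code shows the same thing: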
private def assignReplicasToBrokersRackUnaware(nPartitions: Int,
                                               replicationFactor: Int,
                                               brokerList: Seq[Int],
                                               fixedStartIndex: Int,
                                               startPartitionId: Int): Map[Int, Seq[Int]] = {
  val ret = mutable.Map[Int, Seq[Int]]()
  val brokerArray = brokerList.toArray
  val startIndex = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
  var currentPartitionId = math.max(0, startPartitionId)
  var nextReplicaShift = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
  for (_ <- 0 until nPartitions) {
    if (currentPartitionId > 0 && (currentPartitionId % brokerArray.length == 0))
      nextReplicaShift += 1
    val firstReplicaIndex = (currentPartitionId + startIndex) % brokerArray.length
    val replicaBuffer = mutable.ArrayBuffer(brokerArray(firstReplicaIndex))
    for (j <- 0 until replicationFactor - 1)
      replicaBuffer += brokerArray(replicaIndex(firstReplicaIndex, nextReplicaShift, j, brokerArray.length))
    ret.put(currentPartitionId, replicaBuffer)
    currentPartitionId += 1
  }
  ret
}
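To see why the per-broker counts come out equal, here is a minimal, self-contained sketch of the same round-robin idea with a fixed start index instead of a random one. The object name `AssignmentSketch` and the helper `replicasPerBroker` are made up for illustration, and the `replicaIndex` formula is reproduced from memory of Kafka's `AdminUtils`, so treat it as an assumption rather than a verbatim copy.

```scala
object AssignmentSketch {
  // Assumed to match Kafka's replicaIndex helper: shifts follower replicas
  // away from the leader's broker. Requires at least 2 brokers when called.
  def replicaIndex(firstReplicaIndex: Int, secondReplicaShift: Int, replicaIdx: Int, nBrokers: Int): Int = {
    val shift = 1 + (secondReplicaShift + replicaIdx) % (nBrokers - 1)
    (firstReplicaIndex + shift) % nBrokers
  }

  // Simplified, deterministic version of the rack-unaware assignment:
  // the first replica of partition p goes to broker (p + startIndex) % n.
  def assign(nPartitions: Int, replicationFactor: Int, brokers: Seq[Int], startIndex: Int = 0): Map[Int, Seq[Int]] = {
    val n = brokers.length
    var nextReplicaShift = startIndex
    (0 until nPartitions).map { p =>
      // Every full pass over the broker list, shift the followers by one more slot.
      if (p > 0 && p % n == 0) nextReplicaShift += 1
      val first = (p + startIndex) % n
      val followers = (0 until replicationFactor - 1).map(j => brokers(replicaIndex(first, nextReplicaShift, j, n)))
      p -> (brokers(first) +: followers)
    }.toMap
  }

  // Hypothetical helper: how many replicas each broker ends up hosting.
  def replicasPerBroker(assignment: Map[Int, Seq[Int]]): Map[Int, Int] =
    assignment.values.flatten.groupBy(identity).map { case (b, rs) => b -> rs.size }
}
```

With 6 partitions, replication factor 2 and brokers 0, 1, 2, `AssignmentSketch.replicasPerBroker(AssignmentSketch.assign(6, 2, Seq(0, 1, 2)))` gives each broker 4 replicas, which matches the even per-broker counts observed on the cluster.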
Next I looked at why the disks within each broker were unbalanced, which meant reading the log directory selection logic:
/**
 * If the log already exists, just return a copy of the existing log
 * Otherwise if isNew=true or if there is no offline log directory, create a log for the given topic and the given partition
 * Otherwise throw KafkaStorageException
 *
 * @param topicPartition The partition whose log needs to be returned or created
 * @param config The configuration of the log that should be applied for log creation
 * @param isNew Whether the replica should have existed on the broker or not
 * @param isFuture True iff the future log of the specified partition should be returned or created
 * @throws KafkaStorageException if isNew=false, log is not found in the cache and there is offline log directory on the broker
 */
def getOrCreateLog(topicPartition: TopicPartition, config: LogConfig, isNew: Boolean = false, isFuture: Boolean = false): Log = {
  logCreationOrDeletionLock synchronized {
    getLog(topicPartition, isFuture).getOrElse {
      // create the log if it has not already been created in another thread
      if (!isNew && offlineLogDirs.nonEmpty)
        throw new KafkaStorageException(s"Can not create log for $topicPartition because log directories ${offlineLogDirs.mkString(",")} are offline")
      val logDirs: List[File] = {
        val preferredLogDir = preferredLogDirs.get(topicPartition)
        if (isFuture) {
          if (preferredLogDir == null)
            throw new IllegalStateException(s"Can not create the future log for $topicPartition without having a preferred log directory")
          else if (getLog(topicPartition).get.dir.getParent == preferredLogDir)
            throw new IllegalStateException(s"Can not create the future log for $topicPartition in the current log directory of this partition")
        }
        if (preferredLogDir != null)
          List(new File(preferredLogDir))
        else
          nextLogDirs() // with the default configuration, directory selection ends up here
      }
      val logDirName = {
        if (isFuture)
          Log.logFutureDirName(topicPartition)
        else
          Log.logDirName(topicPartition)
      }
      val logDir = logDirs
        .toStream // to prevent actually mapping the whole list, lazy map
        .map(createLogDirectory(_, logDirName))
        .find(_.isSuccess)
        .getOrElse(Failure(new KafkaStorageException("No log directories available. Tried " + logDirs.map(_.getAbsolutePath).mkString(", "))))
        .get // If Failure, will throw
      val log = Log(
        dir = logDir,
        config = config,
        logStartOffset = 0L,
        recoveryPoint = 0L,
        maxProducerIdExpirationMs = maxPidExpirationMs,
        producerIdExpirationCheckIntervalMs = LogManager.ProducerIdExpirationCheckIntervalMs,
        scheduler = scheduler,
        time = time,
        brokerTopicStats = brokerTopicStats,
        logDirFailureChannel = logDirFailureChannel)
      if (isFuture)
        futureLogs.put(topicPartition, log)
      else
        currentLogs.put(topicPartition, log)
      info(s"Created log for partition $topicPartition in $logDir with properties " + s"{${config.originals.asScala.mkString(", ")}}.")
      // Remove the preferred log dir since it has already been satisfied
      preferredLogDirs.remove(topicPartition)
      log
    }
  }
}
Then the method that actually picks the directory:
/**
 * Count the partitions in each directory, then prefer the directory that holds the fewest
 */
private def nextLogDirs(): List[File] = {
  if(_liveLogDirs.size == 1) {
    List(_liveLogDirs.peek())
  } else {
    // count the number of logs in each parent directory (including 0 for empty directories)
    val logCounts = allLogs.groupBy(_.dir.getParent).mapValues(_.size)
    val zeros = _liveLogDirs.asScala.map(dir => (dir.getPath, 0)).toMap
    val dirCounts = (zeros ++ logCounts).toBuffer
    // choose the directory with the least logs in it
    dirCounts.sortBy(_._2).map {
      case (path: String, _: Int) => new File(path)
    }.toList
  }
}
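The effect of this count-based ordering is easy to reproduce in isolation. The sketch below is a simplified model, not Kafka code: `LogDirSketch`, `place`, the paths `/data1` to `/data3`, and the starting count of 17 are all hypothetical. It keeps only the "sort by number of logs, take the first" rule and places new partitions one at a time.

```scala
object LogDirSketch {
  // Same ordering rule as nextLogDirs: fewest existing partition logs first.
  def nextLogDirs(counts: Map[String, Int]): List[String] =
    counts.toList.sortBy(_._2).map(_._1)

  // Place n new partitions one by one, each into the currently least-populated directory.
  def place(counts: Map[String, Int], n: Int): Map[String, Int] =
    (0 until n).foldLeft(counts) { (acc, _) =>
      val target = nextLogDirs(acc).head
      acc.updated(target, acc(target) + 1)
    }
}
```

Starting from `Map("/data1" -> 17, "/data2" -> 0, "/data3" -> 0)`, placing 20 partitions leaves `/data1` untouched at 17 and puts 10 on each of the other two directories. The old directory receives nothing until the new ones have caught up in count, regardless of how many bytes each directory actually stores.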
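So the disk a partition lands on depends on how many partition directories each log directory already holds: the directory with the fewest is tried first. With that in mind, I went to look at the idle first disk.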
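There I found a large number of __consumer_offsets directories, and it finally clicked. The two newer disks were added later, whereas the topic Kafka uses to track consumer offsets is created automatically the first time the cluster is consumed from, so all of its partitions on the broker were placed on the first disk, the only one at the time. That topic has 50 partitions in total, and its directories inflated the first disk's directory count. Later partitions were therefore steered to the new disks: the directory counts were uneven, and the stored data followed.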
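A quick way to confirm this kind of skew is to count the partition directories under each log directory. The helper below is a hypothetical diagnostic sketch (`DirCountSketch` and `partitionDirCounts` are made-up names); it relies only on the fact that Kafka keeps one subdirectory per partition replica, named `<topic>-<partition>`, and uses plain `java.io.File` calls.

```scala
import java.io.File

object DirCountSketch {
  // Count immediate subdirectories under each log directory, optionally
  // restricted to those whose name starts with a given topic prefix.
  def partitionDirCounts(logDirs: Seq[File], topicPrefix: String = ""): Map[String, Int] =
    logDirs.map { dir =>
      // listFiles() returns null for a missing or unreadable directory.
      val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
      dir.getPath -> children.count(c => c.isDirectory && c.getName.startsWith(topicPrefix))
    }.toMap
}
```

Calling `DirCountSketch.partitionDirCounts(dirs, "__consumer_offsets-")` on the broker's log directories would show how many of the offsets topic's partitions sit on each disk.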
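In conclusion: at the time of writing, Kafka balances the number of partitions across brokers and the number of partition directories across the log directories configured on each broker, but it does not balance by the amount of data stored on each disk. Anyone interested could implement size-based balancing in the directory selection logic. My temporary fix was to create some throwaway topic partitions to fill up the disks that held fewer directories; topics created after that are spread evenly.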
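Both ideas can be sketched in a few lines. This is not a patch to Kafka: `SizeAwareSketch`, `orderByUsedBytes` and `fillersNeeded` are hypothetical names, and the inputs are plain maps rather than Kafka's internal state. The first function shows what a size-based ordering might look like in place of the count-based one; the second estimates how many filler partition directories each log directory would need for the workaround described above.

```scala
object SizeAwareSketch {
  // Size-based alternative: prefer the directory with the fewest bytes used.
  def orderByUsedBytes(usedBytes: Map[String, Long]): List[String] =
    usedBytes.toList.sortBy(_._2).map(_._1)

  // Workaround helper: filler directories needed per log dir so that every
  // directory reaches the current maximum partition-directory count.
  def fillersNeeded(dirCounts: Map[String, Int]): Map[String, Int] = {
    val target = if (dirCounts.isEmpty) 0 else dirCounts.values.max
    dirCounts.map { case (dir, count) => dir -> (target - count) }
  }
}
```

For example, with hypothetical counts `Map("/data1" -> 17, "/data2" -> 0, "/data3" -> 0)`, `fillersNeeded` reports 17 fillers for each of the two new directories and 0 for the old one. A real size-based policy would also need to decide how to measure usage (for instance log sizes versus free space on the volume), which this sketch leaves open.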