The "log" here refers to the files in which Kafka stores written messages.
Kafka's log cleanup policies fall into two groups: deletion policies based on time and size, and the compact policy.
This post mainly covers log cleaning under the compact policy. Compact means compaction: it can only be applied to topics whose messages all carry a key. Messages with the same key are merged, keeping only the latest message.
During compaction, messages whose payload is null (tombstones) are also removed.
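As a minimal illustration of the compact semantics just described (keep only the latest message per key, and drop null-payload tombstones), here is a hypothetical standalone sketch; `Message` and `compact` are made up for this example and are not Kafka source:

```scala
object CompactSketch {
  // payload = None models a message with a null payload (a tombstone)
  case class Message(offset: Long, key: String, payload: Option[String])

  def compact(log: Seq[Message]): Seq[Message] = {
    // toMap keeps the last value for each duplicate key, i.e. the latest offset
    val latestOffset: Map[String, Long] = log.map(m => m.key -> m.offset).toMap
    log.filter(m => latestOffset(m.key) == m.offset) // keep only the latest message per key
       .filter(_.payload.isDefined)                  // drop tombstones as well
  }
}
```

For example, compacting the sequence (0,a), (1,b), (2,a), (3,b=null) leaves only the latest non-null message for key a, at offset 2.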
Here is a diagram from the official Kafka documentation to give you a feel for it:

(figure: log compaction, from the Kafka documentation)

Log cleaning involves three states: LogCleaningInProgress, LogCleaningAborted, and LogCleaningPaused. Their meanings are easy to guess from the names; the source comment reads:

If a partition is to be cleaned, it enters the LogCleaningInProgress state.
While a partition is being cleaned, it can be requested to be aborted and paused. Then the partition first enters
the LogCleaningAborted state. Once the cleaning task is aborted, the partition enters the LogCleaningPaused state.
While a partition is in the LogCleaningPaused state, it won't be scheduled for cleaning again, until cleaning is requested to be resumed.

The LogCleanerManager class manages the state of each log being cleaned and the transitions between states:

```scala
def abortCleaning(topicAndPartition: TopicAndPartition)
def abortAndPauseCleaning(topicAndPartition: TopicAndPartition)
def resumeCleaning(topicAndPartition: TopicAndPartition)
def checkCleaningAborted(topicAndPartition: TopicAndPartition)
```

Selecting the logs to clean: because compaction rewrites the log and index files, it is IO-intensive. Kafka therefore throttles it: before each compaction round it first decides, by rule, which TopicAndPartition logs will be cleaned.
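The state transitions quoted in the source comment above can be sketched as a small state machine. This is a hypothetical illustration, not Kafka source; the real LogCleanerManager keeps these states in a map keyed by TopicAndPartition and guarded by a lock, and resumeCleaning simply removes the entry so the partition becomes schedulable again:

```scala
object CleanerStateSketch {
  sealed trait State
  case object LogCleaningInProgress extends State
  case object LogCleaningAborted extends State
  case object LogCleaningPaused extends State

  // Returns the next state, or None when the transition is not allowed.
  // Event names here are made up for the sketch.
  def next(current: State, event: String): Option[State] = (current, event) match {
    case (LogCleaningInProgress, "abortAndPause") => Some(LogCleaningAborted) // abortAndPauseCleaning() requested
    case (LogCleaningAborted, "cleanerDone")      => Some(LogCleaningPaused)  // cleaning task notices the abort
    case _                                        => None
  }
}
```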
The LogToClean class represents a log to be cleaned:

```scala
private case class LogToClean(topicPartition: TopicAndPartition, log: Log, firstDirtyOffset: Long)
  extends Ordered[LogToClean] {
  val cleanBytes = log.logSegments(-1, firstDirtyOffset).map(_.size).sum
  val dirtyBytes = log.logSegments(firstDirtyOffset,
    math.max(firstDirtyOffset, log.activeSegment.baseOffset)).map(_.size).sum
  val cleanableRatio = dirtyBytes / totalBytes.toDouble
  def totalBytes = cleanBytes + dirtyBytes
  override def compare(that: LogToClean): Int =
    math.signum(this.cleanableRatio - that.cleanableRatio).toInt
}
```

firstDirtyOffset: the starting point of this cleaning round; the offsets before it have already been cleaned and will be key-merged with the messages after it.
cleanableRatio = dirtyBytes / totalBytes.toDouble is the fraction of the log that still needs cleaning; the larger this value, the more likely the log is to be selected for cleaning.
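To make the selection rule concrete, here is a hypothetical sketch using the same ratio formula as LogToClean; the names and byte counts are invented for the example:

```scala
object RatioSketch {
  case class Candidate(name: String, cleanBytes: Long, dirtyBytes: Long) {
    def totalBytes: Long = cleanBytes + dirtyBytes
    // Same formula as LogToClean: the fraction of the log that is still dirty
    def cleanableRatio: Double = dirtyBytes / totalBytes.toDouble
  }

  // The candidate with the highest dirty ratio is the most worth cleaning
  def pick(candidates: Seq[Candidate]): Candidate = candidates.maxBy(_.cleanableRatio)
}
```

A mostly-clean log (ratio 0.1) loses to a mostly-dirty one (ratio 0.8), so IO is spent where compaction reclaims the most space.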
After each cleaning round, the position cleaned up to is updated and recorded in the cleaner-offset-checkpoint file, which serves as the basis for computing firstDirtyOffset in the next round.
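A minimal sketch of that checkpoint bookkeeping, assuming a simple in-memory map in place of the real cleaner-offset-checkpoint file (the object and method names here are made up for illustration):

```scala
object CheckpointSketch {
  // partition -> offset cleaned up to in the last round
  type Checkpoint = Map[String, Long]

  // After a round finishes, record how far this partition has been cleaned
  def update(cp: Checkpoint, partition: String, cleanedUpTo: Long): Checkpoint =
    cp + (partition -> cleanedUpTo)

  // Next round starts from the checkpointed offset; 0 for a never-cleaned partition
  def firstDirtyOffset(cp: Checkpoint, partition: String): Long =
    cp.getOrElse(partition, 0L)
}
```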