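1. Overview
Kafka is well known for its high throughput. Three design choices account for most of it:
- Asynchronous NIO message handling on the broker, which separates I/O threads from request-handling threads.
- Sequential disk writes.
- Zero-copy transfer.
This article covers sequential disk writes and zero-copy. For the broker's message-handling architecture, see "kafka源码-kafka的启动&内部模块" and "kafka源码--broker的基础模块serversocket".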
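2. Log structure
The relationship among Topic, Partition, and Replica:
A Log manages a partition's data as a set of segments, each represented by a LogSegment. The relationship among LogManager, Log, and LogSegment:
Each partition corresponds to one directory on disk, and that directory holds multiple log segments (LogSegment); all segments in a directory belong to the same partition. Because the log keeps growing, Kafka splits each partition's log into segments to make it easier to manage and search: once the current segment file exceeds a size threshold, a new segment is created. Physically, each segment consists of a data file (00000000000000000000.log) and an index file (00000000000000000000.index). The data file stores the actual messages, and the index file stores entries that point into the data file so that reads can locate messages faster. Producers append messages to a partition, and sequential disk writes are fast. Consumers usually read sequentially as well, which is also fast. When a read has to start from a specific offset, the segment's index file is used to speed up the lookup.
The main member variables of Log and LogSegment: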
/* the actual segments of the log */
private val segments: ConcurrentNavigableMap[java.lang.Long, LogSegment] = new ConcurrentSkipListMap[java.lang.Long, LogSegment]

class LogSegment private[log] (val log: FileRecords,
                               val lazyOffsetIndex: LazyIndex[OffsetIndex],
                               val lazyTimeIndex: LazyIndex[TimeIndex],
                               val txnIndex: TransactionIndex,
                               val baseOffset: Long,
                               val indexIntervalBytes: Int,
                               val rollJitterMs: Long,
                               val time: Time) extends Logging {
}
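Because the segments map is a sorted map keyed by base offset, locating the segment that contains a given offset is a "floor" lookup. The following is a minimal sketch of that idea rather than Kafka's code: the class SegmentLookup and its methods are hypothetical, and a file name stands in for the LogSegment value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-in for the segments map: base offset -> segment file name.
class SegmentLookup {
    private final ConcurrentNavigableMap<Long, String> segments = new ConcurrentSkipListMap<>();

    void addSegment(long baseOffset) {
        // Segment files are named after the zero-padded, 20-digit base offset.
        segments.put(baseOffset, String.format("%020d.log", baseOffset));
    }

    // Returns the segment with the greatest base offset <= targetOffset,
    // or null when targetOffset is below the first segment.
    String segmentFor(long targetOffset) {
        Map.Entry<Long, String> entry = segments.floorEntry(targetOffset);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SegmentLookup log = new SegmentLookup();
        log.addSegment(0L);
        log.addSegment(769L);
        log.addSegment(1500L);
        // Offset 771 falls in the segment whose base offset is 769.
        System.out.println(log.segmentFor(771L));
    }
}
```

A skip-list map keeps the keys ordered and supports concurrent readers, so this lookup costs O(log n) in the number of segments.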
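LogSegment data structure
log holds the segment's messages. Every message has an offset, which is its position within the partition. OffsetIndex maps offsets to file positions as key-value pairs: the key is the message's offset relative to baseOffset, and the value is the message's physical position in the log file. baseOffset is the first offset of the segment. indexIntervalBytes is the number of bytes written between two index entries. Because the offset index does not store an entry for every message, the *.index file is a sparse index:
index file: offset relative to baseOffset, physical position in the log file
log file: offset, message size, message payload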
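To make the role of indexIntervalBytes concrete, here is a minimal sketch of how a sparse index can be built during appends. It mirrors the bytesSinceLastIndexEntry check in the LogSegment.append excerpt in section 4, but the class SparseIndexBuilder, its fields, and the batch sizes are hypothetical and simplified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one index entry roughly every indexIntervalBytes of appended data.
class SparseIndexBuilder {
    private final long baseOffset;
    private final int indexIntervalBytes;
    private final List<long[]> entries = new ArrayList<>(); // {relativeOffset, position}
    private int logSizeBytes = 0;
    private int bytesSinceLastIndexEntry = 0;

    SparseIndexBuilder(long baseOffset, int indexIntervalBytes) {
        this.baseOffset = baseOffset;
        this.indexIntervalBytes = indexIntervalBytes;
    }

    void append(long largestOffset, int batchSizeBytes) {
        int physicalPosition = logSizeBytes; // where this batch starts in the .log file
        logSizeBytes += batchSizeBytes;      // simulate appending the batch
        // Add an index entry only when enough bytes were written since the last one.
        if (bytesSinceLastIndexEntry > indexIntervalBytes) {
            entries.add(new long[] { largestOffset - baseOffset, physicalPosition });
            bytesSinceLastIndexEntry = 0;
        }
        bytesSinceLastIndexEntry += batchSizeBytes;
    }

    List<long[]> entries() {
        return entries;
    }

    public static void main(String[] args) {
        SparseIndexBuilder index = new SparseIndexBuilder(769L, 100);
        for (long offset = 769L; offset <= 773L; offset++) {
            index.append(offset, 60); // five 60-byte batches
        }
        for (long[] e : index.entries()) {
            System.out.println("(" + e[0] + "," + e[1] + ")");
        }
    }
}
```

With a 100-byte interval and 60-byte batches, only every other batch after the first gets an entry, which is what keeps the index file small.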
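The file 0000000769.log stores the message contents, and its name gives baseOffset = 0000000769. 0000000769.index is the matching index file, stored as key-value pairs. For example, the first index entry (1,0) says that the message at relative offset 1, that is, absolute offset 0000000770 (0000000769+1), starts at physical position 0 in the log file.
Suppose we want the record with offset 0000000771. A binary search over the segments' base offsets selects 0000000769.index. The relative offset within that segment is n = 0000000771 - 0000000769 = 2. A second binary search inside the index finds the last entry whose relative offset does not exceed 2, which is (1,0). The log file is then scanned forward from physical position 0 until the message with offset 0000000771 is reached.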
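The index step of this lookup can be sketched as a binary search for the last entry at or below the target. This is an illustrative sketch only: SparseIndexLookup is a hypothetical class, the index is modeled as an in-memory array, and every entry except (1,0) is made up for the example.

```java
// Hypothetical sketch of the sparse-index lookup described above.
class SparseIndexLookup {
    // index[i] = {relativeOffset, position}, sorted by relativeOffset.
    // Returns the position of the last entry whose relative offset is <= the target,
    // or 0 when there is no such entry (scan from the start of the file).
    static int lookupPosition(int[][] index, long baseOffset, long targetOffset) {
        long relative = targetOffset - baseOffset;
        int lo = 0;
        int hi = index.length - 1;
        int position = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index[mid][0] <= relative) {
                position = index[mid][1]; // candidate; keep looking to the right
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return position;
    }

    public static void main(String[] args) {
        int[][] index = { {1, 0}, {3, 497}, {6, 1407} };
        // 771 - 769 = 2, so the entry (1,0) is selected and the scan starts at position 0.
        System.out.println(lookupPosition(index, 769L, 771L));
    }
}
```

The returned position is only a starting point: the reader still has to scan forward in the .log file to the exact offset, which is the price of keeping the index sparse.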
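Zero-copy transfer
When the broker serves a client fetch request, it reads the file-backed message set and passes it to the network channel through transferTo, which moves bytes from the file channel to the network channel directly. The data is not first copied into a buffer and then copied again from the buffer into the network channel. The file channel and the network channel live in kernel space, while the buffer lives in user space, so skipping the buffer avoids the switch from kernel space to user space and makes sending more efficient.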
@Override
public long writeTo(GatheringByteChannel destChannel, long offset, int length) throws IOException {
    long newSize = Math.min(channel.size(), end) - start;
    int oldSize = sizeInBytes();
    // fail if the file was truncated after this FileRecords view was created
    if (newSize < oldSize)
        throw new KafkaException(String.format(
                "Size of FileRecords %s has been truncated during write: old size %d, new size %d",
                file.getAbsolutePath(), oldSize, newSize));

    long position = start + offset;
    int count = Math.min(length, oldSize);
    final long bytesTransferred;
    // transfer from the file channel to the destination channel without an intermediate buffer
    if (destChannel instanceof TransportLayer) {
        TransportLayer tl = (TransportLayer) destChannel;
        bytesTransferred = tl.transferFrom(channel, position, count);
    } else {
        bytesTransferred = channel.transferTo(position, count, destChannel);
    }
    return bytesTransferred;
}
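The same FileChannel.transferTo call can be tried outside Kafka. The sketch below is an assumption-laden demo, not broker code: TransferToDemo, send, and demo are hypothetical names, and the destination is an in-memory channel so the example runs anywhere. With a non-socket destination like this the JVM may fall back to an ordinary copy; the zero-copy path applies when the destination is a socket channel and the operating system supports it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TransferToDemo {
    // Sends count bytes starting at position from the file to dest.
    // transferTo may move fewer bytes than requested, so loop until done.
    static long send(FileChannel file, long position, long count, WritableByteChannel dest)
            throws IOException {
        long sent = 0;
        while (sent < count) {
            long n = file.transferTo(position + sent, count - sent, dest);
            if (n <= 0) {
                break; // nothing left to transfer, for example end of file
            }
            sent += n;
        }
        return sent;
    }

    // Writes a small temp file, then transfers bytes [6, 11) into an in-memory sink.
    static String demo() {
        try {
            Path path = Files.createTempFile("segment", ".log");
            try {
                Files.write(path, "hello kafka".getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                try (FileChannel file = FileChannel.open(path, StandardOpenOption.READ);
                     WritableByteChannel dest = Channels.newChannel(sink)) {
                    send(file, 6, 5, dest);
                }
                return new String(sink.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "kafka"
    }
}
```

The position and count arguments play the same role as in writeTo above: they select a slice of the file without reading it into application memory first.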
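3. Log management
Kafka maintains checkpoints. Each log directory has one global checkpoint file that records the checkpoint of every log stored in that directory. A checkpoint marks the position up to which a log has been flushed to disk, and in a distributed system it is mainly used for failure recovery. On startup, Kafka creates the log manager, reads the checkpoint files, uses each partition's checkpoint as that log's recovery point, and finally creates the log instance for each partition.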
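As an illustration of reading recovery points at startup, the sketch below parses a checkpoint file held in memory. The layout is an assumption stated here rather than taken from this article: a version line, an entry-count line, then one "topic partition offset" line per partition. CheckpointParser and the topic names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for an assumed checkpoint layout:
//   line 1: version, line 2: number of entries, then "topic partition offset" per line.
class CheckpointParser {
    static Map<String, Long> parse(List<String> lines) {
        int version = Integer.parseInt(lines.get(0).trim());
        if (version != 0) {
            throw new IllegalArgumentException("unexpected version " + version);
        }
        int expected = Integer.parseInt(lines.get(1).trim());
        Map<String, Long> recoveryPoints = new LinkedHashMap<>();
        for (String line : lines.subList(2, lines.size())) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.trim().split("\\s+");
            // Key is "topic-partition", value is the recovery point offset.
            recoveryPoints.put(parts[0] + "-" + parts[1], Long.parseLong(parts[2]));
        }
        if (recoveryPoints.size() != expected) {
            throw new IllegalArgumentException("expected " + expected + " entries");
        }
        return recoveryPoints;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("0", "2", "orders 0 100", "orders 1 250");
        System.out.println(parse(lines)); // prints {orders-0=100, orders-1=250}
    }
}
```

Each parsed value would then serve as the point from which that partition's log needs to be checked during recovery.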
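If every message sent to the broker were written to the log and flushed to disk immediately, performance would suffer. Kafka relies on the operating system instead: it writes the log to the OS page cache first, which is faster than writing to the file on disk. The trade-off is that if the node crashes before the page cache has been flushed to disk, the unflushed data is lost.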
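The difference between writing to the page cache and forcing data to disk is visible in the Java NIO API. This is a minimal sketch under stated assumptions, not Kafka's write path: AppendThenForce and its methods are hypothetical, and the messages are plain strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AppendThenForce {
    // Appends a message and optionally forces it to the storage device.
    // Returns the file size after the append.
    static long append(Path file, String message, boolean flush) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer); // typically returns once the bytes are in the page cache
            }
            if (flush) {
                channel.force(true); // blocks until data and metadata reach the device
            }
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long demo() {
        try {
            Path file = Files.createTempFile("pagecache", ".log");
            try {
                append(file, "m1", false);       // fast path: no flush
                return append(file, "m2", true); // flushed append
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4
    }
}
```

Calling force on every append gives the strongest durability but removes most of the benefit of the page cache, which is why flushing is governed by a policy instead.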
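Two policies flush the log to disk: a time-based policy and a size-based policy.
- Time-based: when the log manager starts, it starts a timer that periodically calls flushDirtyLogs (the check period is log.flush.scheduler.interval.ms). A log that has not been flushed for log.flush.interval.ms has its page-cache data written to disk.
- Size-based: once the number of unflushed messages exceeds a threshold (log.flush.interval.messages), flush() is called to write the page cache to disk, but this check runs only when a new message is appended.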
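The two policies can be modeled with a small state machine. This is a simplified sketch rather than Kafka's implementation: FlushPolicy is a hypothetical class, time is passed in explicitly so the behavior is deterministic, and the actual disk flush is replaced by a counter.

```java
// Hypothetical model of the time-based and size-based flush policies.
class FlushPolicy {
    private final long flushIntervalMs;       // plays the role of log.flush.interval.ms
    private final long flushIntervalMessages; // assumed message-count threshold
    private long lastFlushMs;
    private long unflushedMessages = 0;
    private int flushCount = 0;

    FlushPolicy(long flushIntervalMs, long flushIntervalMessages, long nowMs) {
        this.flushIntervalMs = flushIntervalMs;
        this.flushIntervalMessages = flushIntervalMessages;
        this.lastFlushMs = nowMs;
    }

    // Size policy: evaluated only on the append path.
    void onAppend(long messageCount, long nowMs) {
        unflushedMessages += messageCount;
        if (unflushedMessages >= flushIntervalMessages) {
            flush(nowMs);
        }
    }

    // Time policy: evaluated by a periodic scheduler tick.
    void onTimerTick(long nowMs) {
        if (unflushedMessages > 0 && nowMs - lastFlushMs >= flushIntervalMs) {
            flush(nowMs);
        }
    }

    private void flush(long nowMs) {
        // A real implementation would force the file channel to disk here.
        unflushedMessages = 0;
        lastFlushMs = nowMs;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        FlushPolicy policy = new FlushPolicy(1000, 3, 0);
        policy.onAppend(2, 10);   // below the message threshold, no flush
        policy.onAppend(1, 20);   // threshold reached, flush
        policy.onAppend(1, 30);
        policy.onTimerTick(500);  // interval not yet elapsed, no flush
        policy.onTimerTick(1020); // interval elapsed, flush
        System.out.println(policy.flushCount()); // prints 2
    }
}
```

The sketch shows why the size policy alone is not enough: if appends stop just below the threshold, only the timer will ever flush the remaining messages.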
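When the total size of a log's segments exceeds the threshold log.retention.bytes, the log manager's periodic cleanup removes old segments. There are two cleanup policies:
- Delete (delete): segments that exceed the size or time threshold are physically deleted as a whole.
- Compact (compact): segments are not deleted outright; they are merged and compacted instead.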
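3.1 Log compaction
Appending to the log is fast, but an append-only log cannot update a record in place. An update therefore has to be written as a new append, which makes the files grow and leaves several versions of the same message stored in different places. To address this, Kafka compacts the log files: a background compaction task merges records that share a key and keeps only the latest one.
The figure illustrates this process.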
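The effect of compaction, keeping only the latest record per key, can be sketched in a few lines. This shows the outcome only, not Kafka's cleaner algorithm: CompactionSketch is a hypothetical class, and records are modeled as {key, value} string pairs in log order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: retain only the newest value for each key.
class CompactionSketch {
    // Each record is {key, value}; records later in the list are newer.
    static Map<String, String> compact(List<String[]> records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : records) {
            latest.remove(record[0]);         // drop the older version, if any
            latest.put(record[0], record[1]); // re-insert at the newest position
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"k1", "v1"},
            new String[] {"k2", "v2"},
            new String[] {"k1", "v3"},
            new String[] {"k3", "v4"},
            new String[] {"k2", "v5"});
        System.out.println(compact(log)); // prints {k1=v3, k3=v4, k2=v5}
    }
}
```

Removing before re-inserting keeps the surviving records in the order of their original positions in the log, which matches the idea that compaction drops old records without reordering the remaining ones.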
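3.2 Log deletion
Deletion works by subtracting the size of the next candidate segment (the oldest remaining one) from the current total log size. If the result still exceeds the threshold (log.retention.bytes), that segment may be deleted; if it is smaller, the segment is kept. The figure shows the deletion process with the threshold set to 1 GB: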
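The size check can be written as a short loop over the segments, oldest first. This is a sketch of the rule as described, with assumptions: SizeRetention is a hypothetical class, sizes are given in MB for readability, and a result equal to the threshold is treated as still deletable.

```java
// Hypothetical sketch of size-based retention.
class SizeRetention {
    // segmentSizes is ordered oldest first. Returns how many of the oldest
    // segments can be deleted while the remaining log stays at or above retentionBytes.
    static int deletableCount(long[] segmentSizes, long retentionBytes) {
        long total = 0;
        for (long size : segmentSizes) {
            total += size;
        }
        int count = 0;
        for (long size : segmentSizes) {
            if (total - size >= retentionBytes) {
                total -= size; // deleting this segment keeps the log above the threshold
                count++;
            } else {
                break; // deleting it would drop the log below the threshold
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] sizesMb = {400, 400, 400, 400}; // 1600 MB in total
        // 1600 - 400 = 1200 >= 1024, delete; 1200 - 400 = 800 < 1024, stop.
        System.out.println(deletableCount(sizesMb, 1024)); // prints 1
    }
}
```

Because whole segments are removed, the retained log can be larger than the threshold by up to one segment's size.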
4. Source code
KafkaApis.handle*Request
class KafkaApis(val requestChannel: RequestChannel,
                val replicaManager: ReplicaManager,
                val adminManager: AdminManager,
                val groupCoordinator: GroupCoordinator,
                val txnCoordinator: TransactionCoordinator,
                val controller: KafkaController,
                val zkClient: KafkaZkClient,
                val brokerId: Int,
                val config: KafkaConfig,
                val metadataCache: MetadataCache,
                val metrics: Metrics,
                val authorizer: Option[Authorizer],
                val quotas: QuotaManagers,
                val fetchManager: FetchManager,
                brokerTopicStats: BrokerTopicStats,
                val clusterId: String,
                time: Time,
                val tokenManager: DelegationTokenManager) extends Logging {
  def handle(request: RequestChannel.Request): Unit = {
    request.header.apiKey match {
      case ApiKeys.PRODUCE => handleProduceRequest(request)
      case ApiKeys.FETCH => handleFetchRequest(request)
      ................
    }
  }
}
ReplicaManager.appendRecords
class ReplicaManager(val config: KafkaConfig,
                     metrics: Metrics,
                     time: Time,
                     val zkClient: KafkaZkClient,
                     scheduler: Scheduler,
                     val logManager: LogManager,
                     val isShuttingDown: AtomicBoolean,
                     quotaManagers: QuotaManagers,
                     val brokerTopicStats: BrokerTopicStats,
                     val metadataCache: MetadataCache,
                     logDirFailureChannel: LogDirFailureChannel,
                     val delayedProducePurgatory: DelayedOperationPurgatory[DelayedProduce],
                     val delayedFetchPurgatory: DelayedOperationPurgatory[DelayedFetch],
                     val delayedDeleteRecordsPurgatory: DelayedOperationPurgatory[DelayedDeleteRecords],
                     val delayedElectLeaderPurgatory: DelayedOperationPurgatory[DelayedElectLeader],
                     threadNamePrefix: Option[String]) extends Logging with KafkaMetricsGroup {
  def appendRecords(timeout: Long,
                    requiredAcks: Short,
                    internalTopicsAllowed: Boolean,
                    origin: AppendOrigin,
                    entriesPerPartition: Map[TopicPartition, MemoryRecords],
                    responseCallback: Map[TopicPartition, PartitionResponse] => Unit,
                    delayedProduceLock: Option[Lock] = None,
                    recordConversionStatsCallback: Map[TopicPartition, RecordConversionStats] => Unit = _ => ()): Unit = {
    ................
  }
  private def appendToLocalLog(internalTopicsAllowed: Boolean,
                               origin: AppendOrigin,
                               entriesPerPartition: Map[TopicPartition, MemoryRecords],
                               requiredAcks: Short): Map[TopicPartition, LogAppendResult] = {
    ................
  }
}
Partition.appendRecordsToLeader
class Partition(val topicPartition: TopicPartition,
                val replicaLagTimeMaxMs: Long,
                interBrokerProtocolVersion: ApiVersion,
                localBrokerId: Int,
                time: Time,
                stateStore: PartitionStateStore,
                delayedOperations: DelayedOperations,
                metadataCache: MetadataCache,
                logManager: LogManager) extends Logging with KafkaMetricsGroup {
  def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int): LogAppendInfo = {
    val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
      leaderLogIfLocal match {
        case Some(leaderLog) =>
          val minIsr = leaderLog.config.minInSyncReplicas
          val inSyncSize = inSyncReplicaIds.size
          // with acks=-1 the append is rejected when the ISR is smaller than min.insync.replicas
          if (inSyncSize < minIsr && requiredAcks == -1) {
            throw new NotEnoughReplicasException(s"The size of the current ISR $inSyncReplicaIds " +
              s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
          }
          val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
            interBrokerProtocolVersion)
          // we may need to increment high watermark since ISR could be down to 1
          (info, maybeIncrementLeaderHW(leaderLog))
        case None =>
          throw new NotLeaderForPartitionException("Leader not local for partition %s on broker %d"
            .format(topicPartition, localBrokerId))
      }
    }
    // some delayed operations may be unblocked after HW changed
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()
    else {
      // probably unblock some follower fetch requests since log end offset has been updated
      delayedOperations.checkAndCompleteFetch()
    }
    info
  }
  ................
}
Log.append
class Log(@volatile var dir: File,
          @volatile var config: LogConfig,
          @volatile var logStartOffset: Long,
          @volatile var recoveryPoint: Long,
          scheduler: Scheduler,
          brokerTopicStats: BrokerTopicStats,
          val time: Time,
          val maxProducerIdExpirationMs: Int,
          val producerIdExpirationCheckIntervalMs: Int,
          val topicPartition: TopicPartition,
          val producerStateManager: ProducerStateManager,
          logDirFailureChannel: LogDirFailureChannel) extends Logging with KafkaMetricsGroup {
  // leader path: offsets are assigned by this broker
  def appendAsLeader(records: MemoryRecords,
                     leaderEpoch: Int,
                     origin: AppendOrigin = AppendOrigin.Client,
                     interBrokerProtocolVersion: ApiVersion = ApiVersion.latestVersion): LogAppendInfo = {
    append(records, origin, interBrokerProtocolVersion, assignOffsets = true, leaderEpoch)
  }
  // follower path: offsets already assigned by the leader are kept
  def appendAsFollower(records: MemoryRecords): LogAppendInfo = {
    append(records,
      origin = AppendOrigin.Replication,
      interBrokerProtocolVersion = ApiVersion.latestVersion,
      assignOffsets = false,
      leaderEpoch = -1)
  }
  private def append(records: MemoryRecords,
                     origin: AppendOrigin,
                     interBrokerProtocolVersion: ApiVersion,
                     assignOffsets: Boolean,
                     leaderEpoch: Int): LogAppendInfo = {
    ................
  }
}
LogSegment.append
def append(largestOffset: Long,
           largestTimestamp: Long,
           shallowOffsetOfMaxTimestamp: Long,
           records: MemoryRecords): Unit = {
  if (records.sizeInBytes > 0) {
    trace(s"Inserting ${records.sizeInBytes} bytes at end offset $largestOffset at position ${log.sizeInBytes} " +
      s"with largest timestamp $largestTimestamp at shallow offset $shallowOffsetOfMaxTimestamp")
    val physicalPosition = log.sizeInBytes()
    if (physicalPosition == 0)
      rollingBasedTimestamp = Some(largestTimestamp)

    ensureOffsetInRange(largestOffset)

    // append the messages
    val appendedBytes = log.append(records)
    trace(s"Appended $appendedBytes to ${log.file} at end offset $largestOffset")
    // Update the in memory max timestamp and corresponding offset.
    if (largestTimestamp > maxTimestampSoFar) {
      maxTimestampSoFar = largestTimestamp
      offsetOfMaxTimestampSoFar = shallowOffsetOfMaxTimestamp
    }
    // append an entry to the index (if needed)
    if (bytesSinceLastIndexEntry > indexIntervalBytes) {
      offsetIndex.append(largestOffset, physicalPosition)
      timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestampSoFar)
      bytesSinceLastIndexEntry = 0
    }
    bytesSinceLastIndexEntry += records.sizeInBytes
  }
}