A Log is an ordered sequence of LogSegments that together form one logical log. To locate a LogSegment quickly, Log manages its segments with a skip list (ConcurrentSkipListMap):
private val segments: ConcurrentNavigableMap[java.lang.Long, LogSegment] =
  new ConcurrentSkipListMap[java.lang.Long, LogSegment]
Each LogSegment is stored in this skip list with its baseOffset as the key and the LogSegment object as the value. For example, to find messages with offsets greater than 4205, we first use the segments skip list to locate the LogSegment that may contain that offset, then call its read method to fetch the data.
Messages are appended to the Log sequentially, so only the last LogSegment is writable; all earlier LogSegments are full and can no longer be written to.
Log.activeSegment returns this last LogSegment, i.e. the last entry in the segments skip list. When the activeSegment's log file reaches a configured threshold, a new activeSegment is created and subsequent messages are written to it.
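The lookup described above can be sketched in plain Java with the same ConcurrentSkipListMap the Log uses; the segment file names and the locate helper below are hypothetical stand-ins for real LogSegment objects:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class SegmentIndex {
    // Hypothetical stand-in: baseOffset -> segment file name.
    static final ConcurrentNavigableMap<Long, String> segments = new ConcurrentSkipListMap<>();

    // Find the base offset of the segment that may contain `offset`:
    // the greatest key <= offset, exactly what Log does with segments.floorEntry.
    static Long locate(long offset) {
        Map.Entry<Long, String> e = segments.floorEntry(offset);
        return e == null ? null : e.getKey();
    }

    public static void main(String[] args) {
        segments.put(0L, "00000000000000000000.log");
        segments.put(3000L, "00000000000000003000.log");
        segments.put(4100L, "00000000000000004100.log");
        // A read at offset 4205 lands in the segment with baseOffset 4100.
        System.out.println(locate(4205));
        // The active (writable) segment is always the last entry.
        System.out.println(segments.lastKey());
    }
}
```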
1 Core Fields
dir: the on-disk directory holding this log's files
lock: guards against concurrent writes from multiple threads
segments: the skip list that manages the LogSegments
recoveryPoint: the starting offset for recovery operations
nextOffsetMetadata: a LogOffsetMetadata used to hand out offsets to new messages; its messageOffset is also this replica's LEO (log end offset)
2 Key Methods
2.1 append: appending messages
Appends a message set to the active segment of the log, rolling to a new segment if necessary. Its main job is to assign offsets to the messages; if assignOffsets = false it only checks that the existing offsets are valid.
def append(messages: ByteBufferMessageSet, assignOffsets: Boolean = true): LogAppendInfo = {
  // Analyze and validate the ByteBufferMessageSet, returning a LogAppendInfo that
  // records the first offset, the last offset, the compression codec used by the
  // producer, the append timestamp, the broker-side codec, the shallow (outer)
  // message count, and the number of validated bytes
  val appendInfo = analyzeAndValidateMessageSet(messages)
  // Return early if there are no valid messages
  if (appendInfo.shallowCount == 0)
    return appendInfo
  // Trim any messages that failed validation before appending to the on-disk log
  var validMessages = trimInvalidBytes(messages, appendInfo)
  try {
    // Write the messages to the log file
    lock synchronized {
      // Decide whether offsets need to be assigned (the default)
      if (assignOffsets) {
        // Assign offsets to the message set
        val offset = new LongRef(nextOffsetMetadata.messageOffset)
        // The first assigned offset becomes firstOffset
        appendInfo.firstOffset = offset.value
        val now = time.milliseconds
        // Validate the message set further: convert message formats, adjust the
        // magic value, rewrite timestamps, and assign an offset to each message
        val validateAndOffsetAssignResult = try {
          validMessages.validateMessagesAndAssignOffsets(offset, now, appendInfo.sourceCodec, appendInfo.targetCodec,
            config.compact, config.messageFormatVersion.messageFormatVersion,
            config.messageTimestampType, config.messageTimestampDifferenceMaxMs)
        } catch {
          case e: IOException => throw new KafkaException("Error in validating messages while appending to log '%s'".format(name), e)
        }
        // The validated messages
        validMessages = validateAndOffsetAssignResult.validatedMessages
        // The maximum timestamp
        appendInfo.maxTimestamp = validateAndOffsetAssignResult.maxTimestamp
        // The offset of the message carrying the maximum timestamp
        appendInfo.offsetOfMaxTimestamp = validateAndOffsetAssignResult.offsetOfMaxTimestamp
        // Update the last offset
        appendInfo.lastOffset = offset.value - 1
        // Record the append time if the log uses LOG_APPEND_TIME
        if (config.messageTimestampType == TimestampType.LOG_APPEND_TIME)
          appendInfo.logAppendTime = now
        // Re-check message sizes if validation may have changed them
        if (validateAndOffsetAssignResult.messageSizeMaybeChanged) {
          for (messageAndOffset <- validMessages.shallowIterator) {
            if (MessageSet.entrySize(messageAndOffset.message) > config.maxMessageSize) {
              // we record the original message set size instead of the trimmed size
              // to be consistent with pre-compression bytesRejectedRate recording
              BrokerTopicStats.getBrokerTopicStats(topicAndPartition.topic).bytesRejectedRate.mark(messages.sizeInBytes)
              BrokerTopicStats.getBrokerAllTopicsStats.bytesRejectedRate.mark(messages.sizeInBytes)
              throw new RecordTooLargeException("Message size is %d bytes which exceeds the maximum configured message size of %d."
                .format(MessageSet.entrySize(messageAndOffset.message), config.maxMessageSize))
            }
          }
        }
      } else {
        // Throw if the offsets in appendInfo are not monotonically increasing
        if (!appendInfo.offsetsMonotonic || appendInfo.firstOffset < nextOffsetMetadata.messageOffset)
          throw new IllegalArgumentException("Out of order offsets found in " + messages)
      }
      // Throw if the message set exceeds the configured segment size (segment.bytes)
      if (validMessages.sizeInBytes > config.segmentSize) {
        throw new RecordBatchTooLargeException("Message set size is %d bytes which exceeds the maximum configured segment size of %d."
          .format(validMessages.sizeInBytes, config.segmentSize))
      }
      // Roll to a new segment if the current one is full; otherwise keep the current activeSegment
      val segment = maybeRoll(messagesSize = validMessages.sizeInBytes,
        maxTimestampInMessages = appendInfo.maxTimestamp)
      // Append the messages to the segment
      segment.append(firstOffset = appendInfo.firstOffset, largestTimestamp = appendInfo.maxTimestamp,
        offsetOfLargestTimestamp = appendInfo.offsetOfMaxTimestamp, messages = validMessages)
      // Update the log end offset (LEO), i.e. the nextOffsetMetadata field
      updateLogEndOffset(appendInfo.lastOffset + 1)
      trace("Appended message set to log %s with first offset: %d, next offset: %d, and messages: %s"
        .format(this.name, appendInfo.firstOffset, nextOffsetMetadata.messageOffset, validMessages))
      // Flush to disk once enough unflushed messages have accumulated
      if (unflushedMessages >= config.flushInterval)
        flush()
      appendInfo
    }
  } catch {
    case e: IOException => throw new KafkaStorageException("I/O exception in append to log '%s'".format(name), e)
  }
}
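The offset bookkeeping in the assignOffsets branch reduces to a small invariant: a batch of n messages gets firstOffset = nextOffset and lastOffset = firstOffset + n - 1, after which the LEO becomes lastOffset + 1. A minimal sketch, with a hypothetical class name (not a Kafka API):

```java
// OffsetAssigner is a hypothetical stand-in for the Log's nextOffsetMetadata
// counter; assign() mirrors what append() does under the lock.
public class OffsetAssigner {
    private long nextOffset; // corresponds to nextOffsetMetadata.messageOffset

    OffsetAssigner(long initial) { this.nextOffset = initial; }

    // Returns {firstOffset, lastOffset} for a batch of `count` messages
    // and advances the log end offset to lastOffset + 1.
    synchronized long[] assign(int count) {
        long first = nextOffset;
        long last = first + count - 1;
        nextOffset = last + 1;      // updateLogEndOffset(appendInfo.lastOffset + 1)
        return new long[] { first, last };
    }

    synchronized long logEndOffset() { return nextOffset; }
}
```

Because every batch starts where the previous one ended, offsets stay dense and strictly increasing across batches.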
2.2 analyzeAndValidateMessageSet
Validates the following: 1 each message's CRC value 2 each message's size
It also computes:
1 the first offset of the message set
2 the last offset of the message set
3 the number of messages
4 the number of valid bytes
5 whether the offsets are monotonically increasing
6 whether a compression codec is used
private def analyzeAndValidateMessageSet(messages: ByteBufferMessageSet): LogAppendInfo = {
  var shallowMessageCount = 0 // number of shallow (outer) messages
  var validBytesCount = 0 // total bytes of messages that passed validation
  var firstOffset, lastOffset = -1L // offsets of the first and last messages
  var sourceCodec: CompressionCodec = NoCompressionCodec
  var monotonic = true
  var maxTimestamp = Message.NoTimestamp
  var offsetOfMaxTimestamp = -1L
  for (messageAndOffset <- messages.shallowIterator) {
    // Record the first message's offset; at this point it is still the producer-assigned offset
    if (firstOffset < 0)
      firstOffset = messageAndOffset.offset
    // Check whether the offsets are monotonically increasing
    if (lastOffset >= messageAndOffset.offset)
      monotonic = false
    // Update the last message's offset
    lastOffset = messageAndOffset.offset
    val m = messageAndOffset.message
    // Check the message size
    val messageSize = MessageSet.entrySize(m)
    if (messageSize > config.maxMessageSize) {
      BrokerTopicStats.getBrokerTopicStats(topicAndPartition.topic).bytesRejectedRate.mark(messages.sizeInBytes)
      BrokerTopicStats.getBrokerAllTopicsStats.bytesRejectedRate.mark(messages.sizeInBytes)
      throw new RecordTooLargeException("Message size is %d bytes which exceeds the maximum configured message size of %d."
        .format(messageSize, config.maxMessageSize))
    }
    // Verify the message's CRC32 checksum
    m.ensureValid()
    if (m.timestamp > maxTimestamp) {
      maxTimestamp = m.timestamp
      offsetOfMaxTimestamp = lastOffset
    }
    shallowMessageCount += 1 // count the shallow message once it passes validation
    validBytesCount += messageSize // add the validated bytes
    val messageCodec = m.compressionCodec
    if (messageCodec != NoCompressionCodec)
      sourceCodec = messageCodec // record the codec used by the producer
  }
  // Determine the codec the broker will use
  val targetCodec = BrokerCompressionCodec.getTargetCompressionCodec(config.compressionType, sourceCodec)
  LogAppendInfo(firstOffset, lastOffset, maxTimestamp, offsetOfMaxTimestamp, Message.NoTimestamp, sourceCodec, targetCodec, shallowMessageCount, validBytesCount, monotonic)
}
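The offset bookkeeping in this loop (firstOffset, lastOffset, monotonicity, message count) can be isolated in a small hypothetical sketch; the class name and add method are illustrative only:

```java
// BatchStats mirrors the per-message bookkeeping of analyzeAndValidateMessageSet,
// stripped of the size/CRC checks. Feed it the producer-assigned offsets in order.
public class BatchStats {
    long firstOffset = -1L, lastOffset = -1L;
    int count = 0;
    boolean monotonic = true;

    void add(long offset) {
        if (firstOffset < 0)
            firstOffset = offset;          // first message seen
        if (lastOffset >= offset)
            monotonic = false;             // offsets must be strictly increasing
        lastOffset = offset;
        count++;
    }
}
```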
2.3 validateMessagesAndAssignOffsets: update the message set's offsets and validate further
Updates the offsets of the message set and performs further validation:
For a compacted topic, every message must have a key.
Messages are converted when a message's magic value is 0 but messageFormatVersion is 1,
or when a message's magic value is 1 but messageFormatVersion is 0.
If no format conversion or value rewriting is required, the method operates in place and avoids recompression.
Returns a ValidationAndOffsetAssignResult containing the validated message set, the maximum timestamp, and the offset of the shallow message carrying that timestamp.
private[kafka] def validateMessagesAndAssignOffsets(offsetCounter: LongRef,
                                                    now: Long,
                                                    sourceCodec: CompressionCodec,
                                                    targetCodec: CompressionCodec,
                                                    compactedTopic: Boolean = false,
                                                    messageFormatVersion: Byte = Message.CurrentMagicValue,
                                                    messageTimestampType: TimestampType,
                                                    messageTimestampDiffMaxMs: Long): ValidationAndOffsetAssignResult = {
  // Neither the source nor the target uses compression
  if (sourceCodec == NoCompressionCodec && targetCodec == NoCompressionCodec) {
    // Check whether every message's magic value matches the specified one
    if (!isMagicValueInAllWrapperMessages(messageFormatVersion))
      // Some messages have a different magic value, so they must be unified, which may
      // change the total message size; a new ByteBufferMessageSet is created, offsets
      // are assigned, and CRC32 values, timestamps, etc. are validated and updated
      convertNonCompressedMessages(offsetCounter, compactedTopic, now, messageTimestampType, messageTimestampDiffMaxMs,
        messageFormatVersion)
    else
      // Uncompressed messages with a uniform magic value: sizes cannot change, so just
      // assign offsets and validate/update CRC32 values, timestamps, etc. in place
      validateNonCompressedMessagesAndAssignOffsetInPlace(offsetCounter, now, compactedTopic, messageTimestampType,
        messageTimestampDiffMaxMs)
  } else { // handle compressed messages
    // The current ByteBufferMessageSet cannot be reused when:
    // 1. the current codec differs from the target codec, so messages must be recompressed
    // 2. magic is 0, so inner offsets must be rewritten as absolute offsets
    // 3. magic is greater than 0 but some fields of the inner messages must change, e.g. timestamps
    // 4. the message format must be converted
    // Whether the current ByteBufferMessageSet can be reused as-is
    var inPlaceAssignment = sourceCodec == targetCodec && messageFormatVersion > Message.MagicValue_V0
    var maxTimestamp = Message.NoTimestamp
    var offsetOfMaxTimestamp = -1L
    val expectedInnerOffset = new LongRef(0)
    val validatedMessages = new mutable.ArrayBuffer[Message]
    this.internalIterator(isShallow = false, ensureMatchingMagic = true).foreach { messageAndOffset =>
      val message = messageAndOffset.message
      validateMessageKey(message, compactedTopic) // validate the message key
      if (message.magic > Message.MagicValue_V0 && messageFormatVersion > Message.MagicValue_V0) {
        // No in place assignment situation 3
        // Validate the timestamp
        validateTimestamp(message, now, messageTimestampType, messageTimestampDiffMaxMs)
        // Case 3: check whether the inner offsets are the expected relative offsets
        if (messageAndOffset.offset != expectedInnerOffset.getAndIncrement())
          inPlaceAssignment = false
        if (message.timestamp > maxTimestamp) {
          maxTimestamp = message.timestamp
          offsetOfMaxTimestamp = offsetCounter.value + expectedInnerOffset.value - 1
        }
      }
      if (sourceCodec != NoCompressionCodec && message.compressionCodec != NoCompressionCodec)
        throw new InvalidMessageException("Compressed outer message should not have an inner message with a " +
          s"compression attribute set: $message")
      // Case 4: the message format must be converted
      if (message.magic != messageFormatVersion)
        inPlaceAssignment = false
      // Collect the messages that passed the checks and conversions above
      validatedMessages += message.toFormatVersion(messageFormatVersion)
    }
    // The current ByteBufferMessageSet cannot be reused
    if (!inPlaceAssignment) {
      // Cannot do in place assignment.
      val (largestTimestampOfMessageSet, offsetOfMaxTimestampInMessageSet) = {
        if (messageFormatVersion == Message.MagicValue_V0)
          (Some(Message.NoTimestamp), -1L)
        else if (messageTimestampType == TimestampType.CREATE_TIME)
          (Some(maxTimestamp), {if (targetCodec == NoCompressionCodec) offsetOfMaxTimestamp else offsetCounter.value + validatedMessages.length - 1})
        else // Log append time
          (Some(now), {if (targetCodec == NoCompressionCodec) offsetCounter.value else offsetCounter.value + validatedMessages.length - 1})
      }
      // Create a new ByteBufferMessageSet and recompress
      ValidationAndOffsetAssignResult(validatedMessages = new ByteBufferMessageSet(compressionCodec = targetCodec,
          offsetCounter = offsetCounter,
          wrapperMessageTimestamp = largestTimestampOfMessageSet,
          timestampType = messageTimestampType,
          messages = validatedMessages: _*),
        maxTimestamp = largestTimestampOfMessageSet.get,
        offsetOfMaxTimestamp = offsetOfMaxTimestampInMessageSet,
        messageSizeMaybeChanged = true)
    } else { // reuse the current ByteBufferMessageSet, saving one recompression
      // Update the wrapper message's offset to the offset of the last inner message
      buffer.putLong(0, offsetCounter.addAndGet(validatedMessages.size) - 1)
      // validate the messages
      validatedMessages.foreach(_.ensureValid())
      var crcUpdateNeeded = true
      val timestampOffset = MessageSet.LogOverhead + Message.TimestampOffset
      val attributeOffset = MessageSet.LogOverhead + Message.AttributesOffset
      val timestamp = buffer.getLong(timestampOffset)
      val attributes = buffer.get(attributeOffset)
      // Update the wrapper message's timestamp
      buffer.putLong(timestampOffset, maxTimestamp)
      if (messageTimestampType == TimestampType.CREATE_TIME && timestamp == maxTimestamp)
        // We don't need to recompute crc if the timestamp is not updated.
        crcUpdateNeeded = false
      else if (messageTimestampType == TimestampType.LOG_APPEND_TIME) {
        // Set timestamp type and timestamp
        buffer.putLong(timestampOffset, now)
        buffer.put(attributeOffset, messageTimestampType.updateAttributes(attributes))
      }
      if (crcUpdateNeeded) {
        // need to recompute the crc value
        buffer.position(MessageSet.LogOverhead)
        val wrapperMessage = new Message(buffer.slice())
        Utils.writeUnsignedInt(buffer, MessageSet.LogOverhead + Message.CrcOffset, wrapperMessage.computeChecksum)
      }
      buffer.rewind()
      // Return the existing message set; its size is unchanged
      ValidationAndOffsetAssignResult(validatedMessages = this,
        maxTimestamp = buffer.getLong(timestampOffset),
        offsetOfMaxTimestamp = buffer.getLong(0),
        messageSizeMaybeChanged = false)
    }
  }
}
2.4 maybeRoll: check whether a new segment is needed and, if so, create a new activeSegment
private def maybeRoll(messagesSize: Int, maxTimestampInMessages: Long): LogSegment = {
  val segment = activeSegment
  val now = time.milliseconds // current time
  // Whether the segment has exceeded its maximum allowed lifetime
  val reachedRollMs = segment.timeWaitedForRoll(now, maxTimestampInMessages) > config.segmentMs - segment.rollJitterMs
  // Roll if the segment size plus this batch would exceed the configured maximum
  // segment size, if the segment's lifetime has expired, or if an index is full
  if (segment.size > config.segmentSize - messagesSize ||
      (segment.size > 0 && reachedRollMs) ||
      segment.index.isFull || segment.timeIndex.isFull) {
    debug(s"Rolling new log segment in $name (log_size = ${segment.size}/${config.segmentSize}, " +
      s"index_size = ${segment.index.entries}/${segment.index.maxEntries}, " +
      s"time_index_size = ${segment.timeIndex.entries}/${segment.timeIndex.maxEntries}, " +
      s"inactive_time_ms = ${segment.timeWaitedForRoll(now, maxTimestampInMessages)}/${config.segmentMs - segment.rollJitterMs}).")
    roll()
  } else {
    segment
  }
}
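The roll decision above can be summarized as a standalone predicate. This is a sketch with hypothetical names; it ignores roll jitter and collapses the two index-fullness checks into one flag:

```java
// RollPolicy mirrors maybeRoll's condition: roll on size overflow, on the
// time budget expiring for a non-empty segment, or on a full index.
public class RollPolicy {
    final long segmentSize; // byte budget (config.segmentSize)
    final long segmentMs;   // time budget (config.segmentMs)

    RollPolicy(long segmentSize, long segmentMs) {
        this.segmentSize = segmentSize;
        this.segmentMs = segmentMs;
    }

    boolean shouldRoll(long currentSize, int batchSize, long msSinceRoll, boolean indexFull) {
        return currentSize > segmentSize - batchSize      // batch would overflow the segment
            || (currentSize > 0 && msSinceRoll > segmentMs) // non-empty segment too old
            || indexFull;                                  // offset or time index is full
    }
}
```

Note the size check compares against `segmentSize - batchSize`, so a roll happens before the append, never leaving a segment over budget.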
2.5 roll: create a new segment and make it the activeSegment
def roll(): LogSegment = {
val start = time.nanoseconds
lock synchronized {
val newOffset = logEndOffset
val logFile = logFilename(dir, newOffset)
val indexFile = indexFilename(dir, newOffset)
val timeIndexFile = timeIndexFilename(dir, newOffset)
for(file <- List(logFile, indexFile, timeIndexFile); if file.exists) {
warn("Newly rolled segment file " + file.getName + " already exists; deleting it first")
file.delete()
}
segments.lastEntry() match {
case null =>
case entry => {
val seg = entry.getValue
seg.onBecomeInactiveSegment()
seg.index.trimToValidSize()
seg.timeIndex.trimToValidSize()
seg.log.trim()
}
}
val segment = new LogSegment(dir,
startOffset = newOffset,
indexIntervalBytes = config.indexInterval,
maxIndexSize = config.maxIndexSize,
rollJitterMs = config.randomSegmentJitter,
time = time,
fileAlreadyExists = false,
initFileSize = initFileSize,
preallocate = config.preallocate)
val prev = addSegment(segment)
if(prev != null)
throw new KafkaException("Trying to roll a new log segment for topic partition %s with start offset %d while it already exists.".format(name, newOffset))
// We need to update the segment base offset and append position data of the metadata when log rolls.
// The next offset should not change.
updateLogEndOffset(nextOffsetMetadata.messageOffset)
// schedule an asynchronous flush of the old segment
scheduler.schedule("flush-log", () => flush(newOffset), delay = 0L)
info("Rolled new log segment for '" + name + "' in %.0f ms.".format((System.nanoTime - start) / (1000.0*1000.0)))
segment
}
}
2.6 flush: flush data to disk
def flush(offset: Long): Unit = {
  // Everything before this.recoveryPoint is already on disk, so there is nothing to flush
  if (offset <= this.recoveryPoint)
    return
  debug("Flushing log '" + name + " up to offset " + offset + ", last flushed: " + lastFlushTime + " current time: " +
    time.milliseconds + " unflushed = " + unflushedMessages)
  // logSegments: the LogSegments between recoveryPoint and offset
  for (segment <- logSegments(this.recoveryPoint, offset))
    segment.flush() // fsync the segment's files to disk
  lock synchronized {
    if (offset > this.recoveryPoint) {
      this.recoveryPoint = offset // advance the recoveryPoint
      lastflushedTime.set(time.milliseconds) // update lastflushedTime
    }
  }
}
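The recoveryPoint logic amounts to a small state machine, sketched here with a hypothetical class: a flush up to an offset at or below the recovery point is a no-op, and any real flush advances the recovery point afterwards:

```java
// FlushTracker mirrors Log.flush's handling of recoveryPoint; the actual
// fsync of segment files is elided (represented by the comment).
public class FlushTracker {
    private long recoveryPoint;

    FlushTracker(long initialRecoveryPoint) { this.recoveryPoint = initialRecoveryPoint; }

    // Returns true if a flush was actually performed.
    synchronized boolean flushUpTo(long offset) {
        if (offset <= recoveryPoint)
            return false;            // already durable, nothing to do
        // ... fsync the segments covering [recoveryPoint, offset) here ...
        recoveryPoint = offset;      // everything below `offset` is now on disk
        return true;
    }

    synchronized long recoveryPoint() { return recoveryPoint; }
}
```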
2.7 read: reading messages from a LogSegment
Uses the segments skip list to quickly locate the LogSegment to read from.
def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false): FetchDataInfo = {
  trace("Reading %d bytes from offset %d in log %s of length %d bytes".format(maxLength, startOffset, name, size))
  // Reads do not hold the lock, which makes synchronization tricky: we first save
  // the volatile nextOffsetMetadata into a local variable so the rest of the
  // method works with a consistent snapshot and stays thread-safe
  val currentNextOffsetMetadata = nextOffsetMetadata
  val next = currentNextOffsetMetadata.messageOffset
  // If the start offset equals the next offset there is no data to return
  if (startOffset == next)
    return FetchDataInfo(currentNextOffsetMetadata, MessageSet.Empty)
  // The entry with the greatest key <= startOffset
  var entry = segments.floorEntry(startOffset)
  // attempt to read beyond the log end offset is an error
  if (startOffset > next || entry == null)
    throw new OffsetOutOfRangeException("Request for offset %d but we only have log segments in the range %d to %d.".format(startOffset, segments.firstKey, next))
  // Start the read on the segment whose base offset is <= the target offset; if that
  // segment contains no messages with a larger offset, continue with successive
  // segments until we find some messages or reach the end of the log
  while (entry != null) {
    // A fetch on the active segment can race with an append: two fetch requests may
    // arrive after a message is appended but before nextOffsetMetadata is updated,
    // and the second could then read past the exposed offsets. To avoid this we cap
    // the read at the exposed position instead of the active segment's log end
    val maxPosition = {
      if (entry == segments.lastEntry) {
        // the position we are allowed to expose
        val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong
        // Check the segment again in case a new segment has just rolled out.
        if (entry != segments.lastEntry)
          // New log segment has rolled out, we can read up to the file end.
          entry.getValue.size
        else
          exposedPos
      } else {
        entry.getValue.size
      }
    }
    // Read from the segment
    val fetchInfo = entry.getValue.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage)
    // If the segment returned nothing,
    if (fetchInfo == null) {
      // move on to the entry with the next greater key
      entry = segments.higherEntry(entry.getKey)
    } else {
      return fetchInfo
    }
  }
  FetchDataInfo(nextOffsetMetadata, MessageSet.Empty)
}
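The floorEntry/higherEntry scan loop can be modeled on its own. In this hypothetical sketch, the map records for each base offset whether a read at that segment would return data (a stand-in for the real code's null-vs-non-null fetchInfo):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;

public class ReadPath {
    // Start at the segment with the greatest baseOffset <= startOffset; if a segment
    // yields no data, advance to the segment with the next greater baseOffset,
    // exactly as Log.read does with floorEntry and higherEntry.
    static Long findReadableSegment(ConcurrentNavigableMap<Long, Boolean> segments, long startOffset) {
        Map.Entry<Long, Boolean> entry = segments.floorEntry(startOffset);
        while (entry != null) {
            if (entry.getValue())
                return entry.getKey();                  // this segment's read succeeded
            entry = segments.higherEntry(entry.getKey()); // try the next segment
        }
        return null; // reached the end of the log without finding data
    }
}
```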