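Message storage is both the core and the hardest part of RocketMQ. To avoid losing messages, MQ persists them to disk. To speed up file writes, it writes sequentially and relies on the page cache and memory mapping: all messages are appended in order to a CommitLog file, and when that file is full a new one is created. Each file has a fixed size of 1 GB.
Because consumption is organized by topic, MQ creates several consume queues for each topic. Each consume queue corresponds to a ConsumeQueue file, which makes it fast to find the position in the CommitLog of the message to be consumed.
Messages also need to be queryable. To avoid scanning the whole CommitLog, MQ maintains an additional IndexFile that allows a message to be looked up quickly.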
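As a rough illustration of the fixed-size, rolling CommitLog files described above, the sketch below maps a global physical offset to the file that holds it and to the position inside that file. It assumes the 1 GB file size from the text and RocketMQ's usual convention of naming each file after its starting offset, zero-padded to 20 digits; the class and method names are hypothetical.

```java
import java.util.Locale;

final class CommitLogFileNames {
    // Assumed file size: 1 GB, as stated in the text.
    static final long FILE_SIZE = 1024L * 1024 * 1024;

    // Name of the file holding the given physical offset:
    // the file's starting offset, zero-padded to 20 digits (assumed convention).
    static String fileNameFor(long physicalOffset) {
        long fileStart = physicalOffset - (physicalOffset % FILE_SIZE);
        return String.format(Locale.ROOT, "%020d", fileStart);
    }

    // Position of the offset relative to the start of its file.
    static int positionInFile(long physicalOffset) {
        return (int) (physicalOffset % FILE_SIZE);
    }

    public static void main(String[] args) {
        long offset = FILE_SIZE + 5;
        System.out.println(fileNameFor(offset) + " @ " + positionInFile(offset));
    }
}
```

Because every file has the same size, locating a message by offset needs only a division and a remainder, with no scan over the files.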
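Storage design overview
CommitLog: the main store for message bodies and metadata. It holds the message content written by Producers; messages are not fixed-length.
ConsumeQueue: the message consume queue, introduced mainly to improve consumption performance.
IndexFile: the index file, which provides a way to query messages by key or by time range.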
Saving to the memory-mapped file's ByteBuffer
After the Broker receives a message, it stores it:
org.apache.rocketmq.store.DefaultMessageStore#putMessage
public PutMessageResult putMessage(MessageExtBrokerInner msg) {
if (this.shutdown) {
log.warn("message store has shutdown, so putMessage is forbidden");
return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
}
if (BrokerRole.SLAVE == this.messageStoreConfig.getBrokerRole()) {
long value = this.printTimes.getAndIncrement();
if ((value % 50000) == 0) {
log.warn("message store is slave mode, so putMessage is forbidden ");
}
return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
}
if (!this.runningFlags.isWriteable()) {
long value = this.printTimes.getAndIncrement();
if ((value % 50000) == 0) {
log.warn("message store is not writeable, so putMessage is forbidden " + this.runningFlags.getFlagBits());
}
return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
} else {
this.printTimes.set(0);
}
if (msg.getTopic().length() > Byte.MAX_VALUE) {
log.warn("putMessage message topic length too long " + msg.getTopic().length());
return new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, null);
}
if (msg.getPropertiesString() != null && msg.getPropertiesString().length() > Short.MAX_VALUE) {
log.warn("putMessage message properties length too long " + msg.getPropertiesString().length());
return new PutMessageResult(PutMessageStatus.PROPERTIES_SIZE_EXCEEDED, null);
}
if (this.isOSPageCacheBusy()) {
return new PutMessageResult(PutMessageStatus.OS_PAGECACHE_BUSY, null);
}
long beginTime = this.getSystemClock().now();
PutMessageResult result = this.commitLog.putMessage(msg);
long eclipseTime = this.getSystemClock().now() - beginTime;
if (eclipseTime > 500) {
log.warn("putMessage not in lock eclipse time(ms)={}, bodyLength={}", eclipseTime, msg.getBody().length);
}
this.storeStatsService.setPutMessageEntireTimeMax(eclipseTime);
if (null == result || !result.isOk()) {
this.storeStatsService.getPutMessageFailedTimes().incrementAndGet();
}
return result;
}
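The method checks, in order: whether the store has been shut down, whether this Broker is a Master, whether this Broker is writable, whether the topic is too long, whether the message properties are too long, and whether the page cache is busy (more than 1s has passed since the last message store locked the commitLog).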
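The page-cache check relies on the beginTimeInLock field that CommitLog#putMessage sets when it takes the lock and resets to 0 when it is done. A minimal sketch of such a check follows; the method name, the parameters, and the 10,000,000 ms upper bound used to ignore the "not in lock" value 0 are assumptions for illustration, not a copy of the Broker's code.

```java
final class PageCacheBusyCheck {
    // beginTimeInLock is 0 when no write currently holds the lock.
    // In that case diff is roughly the current epoch time, which is huge,
    // so the upper bound (an assumed sentinel) filters it out.
    static boolean isBusy(long nowMillis, long beginTimeInLock, long busyTimeoutMillis) {
        long diff = nowMillis - beginTimeInLock;
        return diff < 10_000_000L && diff > busyTimeoutMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(isBusy(now, 0L, 1000L));         // no write in progress
        System.out.println(isBusy(now, now - 1500L, 1000L)); // a write has held the lock for 1.5s
    }
}
```

The point of the check is back-pressure: if one append has been stuck inside the lock for longer than the timeout, new writes are rejected early instead of queuing behind it.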
Next, org.apache.rocketmq.store.CommitLog#putMessage is called:
public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
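//... delayed-message handling omitted (for example, the topic is changed to SCHEDULE_TOPIC_XXXX)
// Get the MappedFile to write to, i.e. the last file under the ..\commitlog directory; returns null if none exists
MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
// Spin lock by default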
putMessageLock.lock(); //spin or ReentrantLock ,depending on store config
try {
long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
this.beginTimeInLock = beginLockTimestamp;
// Here settings are stored timestamp, in order to ensure an orderly
// global
// beginTimeInLock now holds the lock start time; the same value is used as the message's store timestamp
msg.setStoreTimestamp(beginLockTimestamp);
if (null == mappedFile || mappedFile.isFull()) {
// No file exists or the current one is full: fetch the last file again, creating a new one if necessary
mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
}
if (null == mappedFile) {
log.error("create mapped file1 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
beginTimeInLock = 0;
// Creation failed: return an error
return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, null);
}
// Append the message to the mapped file's in-memory buffer
result = mappedFile.appendMessage(msg, this.appendMessageCallback);
switch (result.getStatus()) {
case PUT_OK:
break;
case END_OF_FILE:
// Not enough space left in this file: create a new file and write the message there
...
result = mappedFile.appendMessage(msg, this.appendMessageCallback);
break;
case MESSAGE_SIZE_EXCEEDED:
case PROPERTIES_SIZE_EXCEEDED:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, result);
case UNKNOWN_ERROR:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
default:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
}
eclipseTimeInLock = this.defaultMessageStore.getSystemClock().now() - beginLockTimestamp;
// Reset the lock start time to 0
beginTimeInLock = 0;
} finally {
putMessageLock.unlock();
}
...
PutMessageResult putMessageResult = new PutMessageResult(PutMessageStatus.PUT_OK, result);
// Statistics: count the number and total size of messages per topic
storeStatsService.getSinglePutMessageTopicTimesTotal(msg.getTopic()).incrementAndGet();
storeStatsService.getSinglePutMessageTopicSizeTotal(topic).addAndGet(result.getWroteBytes());
// doAppend only wrote the message into the byteBuffer; it still has to be persisted to disk
handleDiskFlush(result, putMessageResult, msg);
// Master-slave replication
handleHA(result, putMessageResult, msg);
return putMessageResult;
}
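The append above runs under putMessageLock, which the comments describe as a spin lock by default. The following is a minimal sketch of a CAS-based spin lock of that kind, with a small demo that protects a shared counter. The class and method names are hypothetical and the code is only meant to show the technique, not RocketMQ's own lock class.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SpinLockSketch {
    // true means the lock is free.
    private final AtomicBoolean free = new AtomicBoolean(true);

    void lock() {
        // Busy-wait until we flip the flag from free to taken.
        while (!free.compareAndSet(true, false)) {
            // spin
        }
    }

    void unlock() {
        free.set(true);
    }

    // Demo: several threads increment a plain int under the lock.
    static int runDemo(int threads, int perThread) {
        SpinLockSketch lock = new SpinLockSketch();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10_000));
    }
}
```

A spin lock suits this critical section because the work inside it is a short memory copy; when the lock is held for longer, a blocking lock such as ReentrantLock wastes less CPU, which is presumably why the store config offers both.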
Writing the message into the memory-mapped file MappedFile
org.apache.rocketmq.store.CommitLog.DefaultAppendMessageCallback#doAppend(file's starting physical offset, byteBuffer, maximum writable space, message content)
public AppendMessageResult doAppend(final long fileFromOffset, final ByteBuffer byteBuffer, final int maxBlank,
final MessageExtBrokerInner msgInner) {
// STORETIMESTAMP + STOREHOSTADDRESS + OFFSET <br>
// PHY OFFSET
long wroteOffset = fileFromOffset + byteBuffer.position();
this.resetByteBuffer(hostHolder, 8);
// Create the message's unique ID
String msgId = MessageDecoder.createMessageId(this.msgIdMemory, msgInner.getStoreHostBytes(hostHolder), wroteOffset);
// Record ConsumeQueue information: look up the current write offset of the consume queue
keyBuilder.setLength(0);
keyBuilder.append(msgInner.getTopic());
keyBuilder.append('-');
keyBuilder.append(msgInner.getQueueId());
String key = keyBuilder.toString();
Long queueOffset = CommitLog.this.topicQueueTable.get(key);
if (null == queueOffset) {
queueOffset = 0L;
CommitLog.this.topicQueueTable.put(key, queueOffset);
}
// Transaction messages that require special handling
... prepare the content to be stored
final int msgLen = calMsgLength(bodyLength, topicLength, propertiesLength);
// Determines whether there is sufficient free space
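// END_FILE_MIN_BLANK_LENGTH: at least 8 bytes are kept free to store the file's remaining space and the magic code
if ((msgLen + END_FILE_MIN_BLANK_LENGTH) > maxBlank) {
// The message does not fit in the remaining space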
return new AppendMessageResult(AppendMessageStatus.END_OF_FILE, wroteOffset, maxBlank, msgId, msgInner.getStoreTimestamp(),
queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);
}
// For the omitted fields, see the message layout in calMsgLength
... write the fields into msgStoreItemMemory following the message format
// Write messages to the queue buffer
byteBuffer.put(this.msgStoreItemMemory.array(), 0, msgLen);
AppendMessageResult result = new AppendMessageResult(AppendMessageStatus.PUT_OK, wroteOffset, msgLen, msgId,
msgInner.getStoreTimestamp(), queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);
... special handling for transactional messages
return result;
}
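offsetMsgId is the globally unique message ID, used for message queries in the RocketMQ console.
It is built by concatenating the Broker's "IP address + port" with the message's physical offset in the CommitLog.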
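To make the concatenation concrete, here is a minimal sketch that packs an IPv4 address, a port, and a CommitLog offset into 16 bytes and renders them as a hex string, then recovers the offset from the string. The exact byte layout (4 + 4 + 8 bytes, upper-case hex) is an assumption for illustration, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class OffsetMsgIdSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Assumed layout: 4-byte IPv4 address, 4-byte port, 8-byte CommitLog offset.
    static String encode(byte[] ipv4, int port, long commitLogOffset) {
        if (ipv4.length != 4) {
            throw new IllegalArgumentException("expected a 4-byte IPv4 address");
        }
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.put(ipv4);
        buf.putInt(port);
        buf.putLong(commitLogOffset);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : buf.array()) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    // The last 16 hex characters carry the physical offset.
    static long decodeOffset(String msgId) {
        return Long.parseUnsignedLong(msgId.substring(16), 16);
    }

    public static void main(String[] args) {
        String id = encode(new byte[] {127, 0, 0, 1}, 10911, 1024L);
        System.out.println(id + " -> offset " + decodeOffset(id));
    }
}
```

Because the ID embeds both the Broker address and the physical offset, a query by this ID can go straight to the right Broker and read the message from the CommitLog without consulting any index.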
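The method that computes the message length shows the storage format of a message in the file. Messages are stored in the CommitLog with variable length.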
protected static int calMsgLength(int bodyLength, int topicLength, int propertiesLength) {
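final int msgLen = 4 //TOTALSIZE total length of the message
+ 4 //MAGICCODE magic code
+ 4 //BODYCRC CRC checksum of the body
+ 4 //QUEUEID consume queue ID
+ 4 //FLAG flag set by the application
+ 8 //QUEUEOFFSET offset of the message in its consume queue
+ 8 //PHYSICALOFFSET physical offset of the message in the commitLog
+ 4 //SYSFLAG system flag
+ 8 //BORNTIMESTAMP time at which the producer created the message
+ 8 //BORNHOST producer IP + port
+ 8 //STORETIMESTAMP store time
+ 8 //STOREHOSTADDRESS IP + port of the storing Broker
+ 4 //RECONSUMETIMES number of consume retries
+ 8 //Prepared Transaction Offset offset of the transactional message
+ 4 + (bodyLength > 0 ? bodyLength : 0) //BODY 4-byte length field + body
+ 1 + topicLength //TOPIC 1-byte length field + topic
+ 2 + (propertiesLength > 0 ? propertiesLength : 0) //PROPERTIES 2-byte length field + properties
+ 0;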
return msgLen;
}
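As a worked example of the layout above: the fixed fields add up to 84 bytes, and the three length prefixes add 4 + 1 + 2 more, so an empty message occupies 91 bytes. The sketch below repeats the arithmetic and the END_OF_FILE check from doAppend; the class name and the fitsInFile helper are hypothetical.

```java
final class MsgLengthExample {
    // Bytes reserved at the end of each file: remaining length (4) + magic code (4).
    static final int END_FILE_MIN_BLANK_LENGTH = 4 + 4;

    static int calMsgLength(int bodyLength, int topicLength, int propertiesLength) {
        // Fixed-size header fields, in the order listed in the text: 84 bytes.
        final int fixed = 4 + 4 + 4 + 4 + 4 + 8 + 8 + 4 + 8 + 8 + 8 + 8 + 4 + 8;
        return fixed
            + 4 + Math.max(bodyLength, 0)        // body length field + body
            + 1 + topicLength                    // topic length field + topic
            + 2 + Math.max(propertiesLength, 0); // properties length field + properties
    }

    // Mirrors the check in doAppend: the message plus the end-of-file marker must fit.
    static boolean fitsInFile(int msgLen, int maxBlank) {
        return msgLen + END_FILE_MIN_BLANK_LENGTH <= maxBlank;
    }

    public static void main(String[] args) {
        int len = calMsgLength(100, 9, 20); // 100-byte body, 9-byte topic, 20 bytes of properties
        System.out.println(len + " bytes, fits in 228 free bytes: " + fitsInFile(len, 228));
    }
}
```

The 1-byte and 2-byte length prefixes also explain the earlier validation in DefaultMessageStore#putMessage: the topic may not exceed Byte.MAX_VALUE and the properties may not exceed Short.MAX_VALUE.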
doAppend returns an AppendMessageResult object:
public class AppendMessageResult {
// Return code
private AppendMessageStatus status;
// Where to start writing
private long wroteOffset;
// Write Bytes: length of the message
private int wroteBytes;
// Message ID
private String msgId;
// Message storage timestamp
private long storeTimestamp;
// Consume queue's offset(step by one)
private long logicsOffset;
private long pagecacheRT = 0;
private int msgNum = 1;
}
public enum AppendMessageStatus {
PUT_OK,// success
END_OF_FILE,// not enough space left in the file
MESSAGE_SIZE_EXCEEDED,// message too long
PROPERTIES_SIZE_EXCEEDED,// properties too long
UNKNOWN_ERROR,// unknown error
}
The message is then flushed to disk, synchronously or asynchronously depending on the configured flush policy:
org.apache.rocketmq.store.CommitLog#handleDiskFlush
public void handleDiskFlush(AppendMessageResult result, PutMessageResult putMessageResult, MessageExt messageExt) {
// Synchronization flush
if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
if (messageExt.isWaitStoreMsgOK()) {
GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
service.putRequest(request);
boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
if (!flushOK) {
log.error("do groupcommit, wait for flush failed, topic: " + messageExt.getTopic() + " tags: " + messageExt.getTags()
+ " client address: " + messageExt.getBornHostString());
// Timed out: report it to the caller
putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_DISK_TIMEOUT);
}
} else {
service.wakeup();
}
}
// Asynchronous flush
// Wake up the async thread and return immediately, without waiting for the flush to succeed
else {
if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
flushCommitLogService.wakeup();
} else {
commitLogService.wakeup();
}
}
}
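- GroupCommitService: synchronous flush. It calls MappedFile.mappedByteBuffer.force() and returns once the flush has succeeded or the wait has timed out;
- CommitLog.flushCommitLogService (a FlushRealTimeService): asynchronous flush with transientStorePoolEnable = false. It calls MappedFile.mappedByteBuffer.force() to flush;
- CommitLog.commitLogService (a CommitRealTimeService): asynchronous flush with transientStorePoolEnable = true, i.e. with the off-heap memory pool enabled. It periodically commits the data in the off-heap MappedFile.writeBuffer to MappedFile.fileChannel, after which flushCommitLogService calls MappedFile.fileChannel.force().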
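The synchronous path hinges on the storing thread blocking in waitForFlush until the flush thread calls wakeupCustomer or the timeout expires. A minimal sketch of such a request object, built on a CountDownLatch, could look like this; the class name is hypothetical and the real GroupCommitRequest may be implemented differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class FlushRequestSketch {
    private final long nextOffset;
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile boolean flushOK = false;

    FlushRequestSketch(long nextOffset) {
        this.nextOffset = nextOffset;
    }

    // The offset up to which data must be on disk before the request is satisfied.
    long getNextOffset() {
        return nextOffset;
    }

    // Called by the flush thread once it has tried to flush up to nextOffset.
    void wakeupCustomer(boolean ok) {
        this.flushOK = ok;
        latch.countDown();
    }

    // Called by the storing thread; returns false on timeout or interruption.
    boolean waitForFlush(long timeoutMillis) {
        try {
            latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return flushOK;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        FlushRequestSketch request = new FlushRequestSketch(4096L);
        new Thread(() -> request.wakeupCustomer(true)).start();
        System.out.println("flushed in time: " + request.waitForFlush(1000L));
    }
}
```

On timeout the storing thread does not roll anything back; as handleDiskFlush shows, it only changes the result status to FLUSH_DISK_TIMEOUT.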
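TransientStorePool: the off-heap memory pool
Setting transientStorePoolEnable to true enables the off-heap memory pool. MQ creates a TransientStorePool that by default allocates 5 off-heap ByteBuffers at initialization and uses JNA to lock that memory so it cannot be swapped out. The buffers are kept in a double-ended queue, and each MappedFile borrows one as its org.apache.rocketmq.store.MappedFile#writeBuffer.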
public class TransientStorePool {
public void init() {
for (int i = 0; i < poolSize; i++) {
ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);
final long address = ((DirectBuffer) byteBuffer).address();
Pointer pointer = new Pointer(address);
LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));
availableBuffers.offer(byteBuffer);
}
}
public ByteBuffer borrowBuffer() {
ByteBuffer buffer = availableBuffers.pollFirst();
if (availableBuffers.size() < poolSize * 0.4) {
log.warn("TransientStorePool only remain {} sheets.", availableBuffers.size());
}
return buffer;
}
}
When a MappedFile is created, its writeBuffer is borrowed from the transientStorePool:
public void init(final String fileName, final int fileSize,
final TransientStorePool transientStorePool) throws IOException {
init(fileName, fileSize);
this.writeBuffer = transientStorePool.borrowBuffer();
this.transientStorePool = transientStorePool;
}
Messages are first written to MappedFile.writeBuffer; when writeBuffer is not null, its content is subsequently written to MappedFile.fileChannel.
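The borrow-and-return behaviour of such a pool can be sketched without JNA. The version below pre-allocates direct buffers and hands them out from a deque. It deliberately omits the mlock step, and the class name and the returnBuffer and remaining methods are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

final class DirectBufferPoolSketch {
    private final int bufferSize;
    private final Deque<ByteBuffer> available = new ConcurrentLinkedDeque<>();

    DirectBufferPoolSketch(int poolSize, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolSize; i++) {
            // Off-heap allocation; the real pool additionally locks this memory.
            available.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Returns null when the pool is exhausted.
    ByteBuffer borrowBuffer() {
        return available.pollFirst();
    }

    // Resets the buffer so the next borrower sees it empty, then puts it back.
    void returnBuffer(ByteBuffer buffer) {
        buffer.position(0);
        buffer.limit(bufferSize);
        available.offerFirst(buffer);
    }

    int remaining() {
        return available.size();
    }

    public static void main(String[] args) {
        DirectBufferPoolSketch pool = new DirectBufferPoolSketch(5, 1024);
        ByteBuffer buffer = pool.borrowBuffer();
        System.out.println("direct: " + buffer.isDirect() + ", left in pool: " + pool.remaining());
        pool.returnBuffer(buffer);
    }
}
```

Pre-allocating a small, fixed number of buffers keeps allocation off the write path; the trade-off is that a borrower must cope with an exhausted pool.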
Synchronous flush
The GroupCommitRequest is handed to the GroupCommitService, and the caller then blocks until the flush succeeds.
class FlushCommitLogService extends ServiceThread
class ServiceThread implements Runnable
class GroupCommitService extends FlushCommitLogService {
protected final CountDownLatch2 waitPoint = new CountDownLatch2(1);
// Write queue
private volatile List<GroupCommitRequest> requestsWrite = new ArrayList<GroupCommitRequest>();
// Read queue
private volatile List<GroupCommitRequest> requestsRead = new ArrayList<GroupCommitRequest>();
public void run() {
while (!this.isStopped()) {
try {
this.waitForRunning(10);
// Flush to disk
this.doCommit();
} catch (Exception e) {
CommitLog.log.warn(this.getServiceName() + " service has exception. ", e);
}
}
... shut down the flush service
}
}
By default the loop waits up to 10ms in waitForRunning(10) and then runs one flush pass with doCommit():
protected void waitForRunning(long interval) {
// Already notified: return immediately
if (hasNotified.compareAndSet(true, false)) {
this.onWaitEnd();
return;
}
//entry to wait
// Reset the CountDownLatch2, a copy of CountDownLatch that adds a reset() method to restore the count
waitPoint.reset();
try {
waitPoint.await(interval, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
log.error("Interrupted", e);
} finally {
hasNotified.set(false);
// Whether woken up or timed out, onWaitEnd always runs and calls swapRequests()
this.onWaitEnd();
}
}
Before doCommit() runs, MQ performs one special step, swapRequests():
private void swapRequests() {
List<GroupCommitRequest> tmp = this.requestsWrite;
this.requestsWrite = this.requestsRead;
this.requestsRead = tmp;
}
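MQ uses two queues for flush requests: requestsWrite receives new requests, while requestsRead holds the requests being processed. Separating adding from processing means producers are not blocked while a batch is being flushed.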
private void doCommit() {
synchronized (this.requestsRead) {
if (!this.requestsRead.isEmpty()) {
for (GroupCommitRequest req : this.requestsRead) {
// There may be a message in the next file, so a maximum of
// two times the flush
boolean flushOK = false;
for (int i = 0; i < 2 && !flushOK; i++) {
flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
if (!flushOK) {
// Flush to disk
CommitLog.this.mappedFileQueue.flush(0);
}
}
// Wake up the thread that stored the message
req.wakeupCustomer(flushOK);
}
long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
if (storeTimestamp > 0) {
// Record the flush timestamp in the checkpoint, for use when recovering the store's state
CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
}
this.requestsRead.clear();
} else {
// Because of individual messages is set to not sync flush, it
// will come to this process
CommitLog.this.mappedFileQueue.flush(0);
}
}
}
While flushing, doCommit holds the monitor of requestsRead and clears the list when it is done; before the next flush, requestsRead and requestsWrite are swapped again.
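The two-list swap can be shown in isolation. The sketch below is a simplified version that assumes a single consumer thread: producers append to the write list under a lock, and the consumer swaps the lists under the same lock and then processes its batch without holding it. The class and method names are hypothetical and the locking is simpler than in GroupCommitService.

```java
import java.util.ArrayList;
import java.util.List;

final class SwapQueueSketch<T> {
    private List<T> writeList = new ArrayList<>();
    private List<T> readList = new ArrayList<>();

    // Producers: a short critical section that only appends.
    synchronized void put(T request) {
        writeList.add(request);
    }

    // Single consumer: swap under the lock, then drain outside it.
    List<T> swapAndDrain() {
        synchronized (this) {
            List<T> tmp = writeList;
            writeList = readList; // empty, because the previous batch was cleared
            readList = tmp;
        }
        List<T> batch = new ArrayList<>(readList);
        readList.clear();
        return batch;
    }

    public static void main(String[] args) {
        SwapQueueSketch<String> queue = new SwapQueueSketch<>();
        queue.put("flush up to 100");
        queue.put("flush up to 250");
        System.out.println(queue.swapAndDrain());
    }
}
```

The consumer holds the lock only for the pointer swap, so a slow disk flush never delays threads that are adding new requests.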
Flushing the memory-mapped file
RocketMQ mainly reads and writes its files through MappedByteBuffer (the mmap approach).
public boolean flush(final int flushLeastPages) {
boolean result = true;
// Find the CommitLog file that contains the last flushed position
MappedFile mappedFile = this.findMappedFileByOffset(this.flushedWhere, this.flushedWhere == 0);
if (mappedFile != null) {
long tmpTimeStamp = mappedFile.getStoreTimestamp();
int offset = mappedFile.flush(flushLeastPages);
long where = mappedFile.getFileFromOffset() + offset;
result = where == this.flushedWhere;
// Update the flushed position; data before it has been persisted
this.flushedWhere = where;
if (0 == flushLeastPages) {
this.storeTimestamp = tmpTimeStamp;
}
}
return result;
}
/**
* @return The current flushed position
*/
public int flush(final int flushLeastPages) {
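// Flush if the file is full, if the number of dirty pages >= flushLeastPages, or (when flushLeastPages is 0) if any dirty data exists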
if (this.isAbleToFlush(flushLeastPages)) {
if (this.hold()) {
// Get the readable position: the write position, or the committed position when writeBuffer is in use
int value = getReadPosition();
try {
//We only append data to fileChannel or mappedByteBuffer, never both.
if (writeBuffer != null || this.fileChannel.position() != 0) {
this.fileChannel.force(false);
} else {
this.mappedByteBuffer.force();
}
} catch (Throwable e) {
log.error("Error occurred when force data to disk.", e);
}
// Record how far the flush has progressed
this.flushedPosition.set(value);
this.release();
} else {
log.warn("in flush, hold failed, flush offset = " + this.flushedPosition.get());
this.flushedPosition.set(getReadPosition());
}
}
return this.getFlushedPosition();
}
public int getReadPosition() {
return this.writeBuffer == null ? this.wrotePosition.get() : this.committedPosition.get();
}
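For readers who have not used mmap from Java, the following self-contained sketch shows the mechanism that the flush code above depends on: map a file region, write through the MappedByteBuffer, and call force() to push the dirty pages to disk. It uses only the JDK and a temporary file; the class and method names are hypothetical and it does not reproduce MappedFile.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class MmapFlushSketch {
    // Maps fileSize bytes of the file, writes the payload at the start, and forces it to disk.
    static byte[] writeAndFlush(Path file, int fileSize, byte[] payload) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            mapped.put(payload); // lands in the page cache
            mapped.force();      // asks the OS to write the dirty pages to the device
        }
        // Read the file back through ordinary file I/O to confirm the content.
        byte[] all = Files.readAllBytes(file);
        return Arrays.copyOf(all, payload.length);
    }

    static String demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            byte[] payload = "hello commitlog".getBytes(StandardCharsets.UTF_8);
            return new String(writeAndFlush(file, 4096, payload), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Until force() returns, the data lives only in the page cache, which is why the store tracks a flushed position separately from the write position.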
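Asynchronous flush
When the off-heap memory pool is disabled, messages are appended to mappedByteBuffer and then flushed to disk. When it is enabled, messages are first appended to off-heap memory, then committed to fileChannel, and finally flushed to disk.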
transientStorePoolEnable = true
class CommitRealTimeService extends FlushCommitLogService {
public void run() {
while (!this.isStopped()) {
// Every 200ms by default
int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitIntervalCommitLog();
...
try {
// Run commit: write the data held in off-heap memory to the file's FileChannel
boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
long end = System.currentTimeMillis();
if (!result) {
this.lastCommitTimestamp = end; // result = false means some data committed.
//now wake up flush thread.
// New data was committed: wake up the flush thread so it can flush that data promptly
flushCommitLogService.wakeup();
}
this.waitForRunning(interval);
} catch (Throwable e) {
CommitLog.log.error(this.getServiceName() + " service has exception. ", e);
}
...
}
}
}
commit eventually calls org.apache.rocketmq.store.MappedFile#commit0:
protected void commit0(final int commitLeastPages) {
int writePos = this.wrotePosition.get();
int lastCommittedPosition = this.committedPosition.get();
if (writePos - this.committedPosition.get() > 0) {
try {
ByteBuffer byteBuffer = writeBuffer.slice();
// Start this commit at the last committed position
byteBuffer.position(lastCommittedPosition);
// Commit only up to the latest write position
byteBuffer.limit(writePos);
this.fileChannel.position(lastCommittedPosition);
this.fileChannel.write(byteBuffer);
// Record how far the commit has progressed
this.committedPosition.set(writePos);
} catch (Throwable e) {
log.error("Error occurred when commit data to FileChannel.", e);
}
}
}
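The position bookkeeping in commit0 is easier to follow in a standalone form. The sketch below appends to an off-heap buffer, then writes only the not-yet-committed range to a FileChannel and advances the committed position. It uses duplicate() views instead of the original's slice(), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class CommitSketch {
    private final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(1024);
    private final FileChannel fileChannel;
    private int wrotePosition = 0;
    private int committedPosition = 0;

    CommitSketch(FileChannel fileChannel) {
        this.fileChannel = fileChannel;
    }

    // Append into the off-heap buffer through a view, leaving writeBuffer's own position untouched.
    void append(byte[] data) {
        ByteBuffer view = writeBuffer.duplicate();
        view.position(wrotePosition);
        view.put(data);
        wrotePosition += data.length;
    }

    // Write the range [committedPosition, wrotePosition) to the channel.
    int commit() throws IOException {
        if (wrotePosition > committedPosition) {
            ByteBuffer view = writeBuffer.duplicate();
            view.position(committedPosition);
            view.limit(wrotePosition);
            fileChannel.position(committedPosition);
            while (view.hasRemaining()) {
                fileChannel.write(view);
            }
            committedPosition = wrotePosition;
        }
        return committedPosition;
    }

    static String demo() {
        try {
            Path file = Files.createTempFile("commit-sketch", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                CommitSketch sketch = new CommitSketch(channel);
                sketch.append("abc".getBytes(StandardCharsets.UTF_8));
                sketch.commit();
                sketch.append("def".getBytes(StandardCharsets.UTF_8));
                sketch.commit();
                channel.force(false); // the separate flush step
            }
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping commit and flush as separate steps lets the commit thread move data out of the off-heap buffer frequently while the more expensive force() runs on its own schedule.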
After the periodic commit comes the periodic flush:
class FlushRealTimeService extends FlushCommitLogService {
public void run() {
CommitLog.log.info(this.getServiceName() + " service started");
while (!this.isStopped()) {
// Every 500ms by default
int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushIntervalCommitLog();
try {
if (flushCommitLogTimed) {
Thread.sleep(interval);
} else {
this.waitForRunning(interval);
}
// Flush to disk
CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
if (storeTimestamp > 0) {
// Record the store timestamp of the flushed data in the checkpoint
CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
}
...
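At this point the message has been persisted to the disk file. Master-slave replication, the consume queue, and the index file are covered in later articles.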