RocketMQ Source Code Analysis: Message Storage

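Message storage is both the core and the hardest part of RocketMQ. To guarantee that messages are not lost, the broker persists them to disk. To speed up file writes, it writes sequentially and relies on the page cache and memory mapping: all messages are appended in order to a CommitLog file, and when that file is full a new one is created. Each file has a fixed size of 1 GB.
Because consumption is topic-based, RocketMQ creates several consume queues for each topic. Each consume queue corresponds to a ConsumeQueue file, which makes it fast to find the CommitLog position of the message to consume.
Messages also need to be queryable. To avoid scanning the whole CommitLog, RocketMQ additionally maintains an IndexFile, through which a message can be looked up quickly.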
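
Because every CommitLog file has the same fixed size and the files are filled one after another, a global physical offset can be split into a file index and a position inside that file with plain integer arithmetic. The sketch below only illustrates that idea: the class and method names are hypothetical and not part of RocketMQ, and it assumes the first file starts at offset 0.

```java
// Hypothetical helper, not RocketMQ code: splits a global CommitLog offset
// into "which file" and "where in that file".
final class CommitLogOffsets {
    // 1 GiB, the fixed file size mentioned in the text
    static final long FILE_SIZE = 1024L * 1024 * 1024;

    // 0-based index of the file that contains the given global offset
    static long fileIndex(long physicalOffset) {
        return physicalOffset / FILE_SIZE;
    }

    // Position of the offset inside its file
    static int positionInFile(long physicalOffset) {
        return (int) (physicalOffset % FILE_SIZE);
    }

    // Global offset at which the containing file starts
    static long fileStartOffset(long physicalOffset) {
        return physicalOffset - (physicalOffset % FILE_SIZE);
    }

    public static void main(String[] args) {
        long offset = FILE_SIZE + 100;
        System.out.println("file index      = " + fileIndex(offset));
        System.out.println("position        = " + positionInFile(offset));
        System.out.println("file start      = " + fileStartOffset(offset));
    }
}
```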

Storage design overview

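CommitLog: the main store for message bodies and metadata. It holds the message content written by Producers; messages are not fixed-length.
ConsumeQueue: the message consume queue, introduced mainly to improve consumption performance.
IndexFile: the index file, which provides a way to query messages by key or by time range.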


Saving to the ByteBuffer of the memory-mapped file

After the Broker receives a message, it stores it:
org.apache.rocketmq.store.DefaultMessageStore#putMessage

public PutMessageResult putMessage(MessageExtBrokerInner msg) {
    if (this.shutdown) {
        log.warn("message store has shutdown, so putMessage is forbidden");
        return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
    }

    if (BrokerRole.SLAVE == this.messageStoreConfig.getBrokerRole()) {
        long value = this.printTimes.getAndIncrement();
        if ((value % 50000) == 0) {
            log.warn("message store is slave mode, so putMessage is forbidden ");
        }

        return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
    }

    if (!this.runningFlags.isWriteable()) {
        long value = this.printTimes.getAndIncrement();
        if ((value % 50000) == 0) {
            log.warn("message store is not writeable, so putMessage is forbidden " + this.runningFlags.getFlagBits());
        }

        return new PutMessageResult(PutMessageStatus.SERVICE_NOT_AVAILABLE, null);
    } else {
        this.printTimes.set(0);
    }

    if (msg.getTopic().length() > Byte.MAX_VALUE) {
        log.warn("putMessage message topic length too long " + msg.getTopic().length());
        return new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, null);
    }

    if (msg.getPropertiesString() != null && msg.getPropertiesString().length() > Short.MAX_VALUE) {
        log.warn("putMessage message properties length too long " + msg.getPropertiesString().length());
        return new PutMessageResult(PutMessageStatus.PROPERTIES_SIZE_EXCEEDED, null);
    }

    if (this.isOSPageCacheBusy()) {
        return new PutMessageResult(PutMessageStatus.OS_PAGECACHE_BUSY, null);
    }

    long beginTime = this.getSystemClock().now();
    PutMessageResult result = this.commitLog.putMessage(msg);

    long eclipseTime = this.getSystemClock().now() - beginTime;
    if (eclipseTime > 500) {
        log.warn("putMessage not in lock eclipse time(ms)={}, bodyLength={}", eclipseTime, msg.getBody().length);
    }
    this.storeStatsService.setPutMessageEntireTimeMax(eclipseTime);

    if (null == result || !result.isOk()) {
        this.storeStatsService.getPutMessageFailedTimes().incrementAndGet();
    }

    return result;
}

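This method checks, in order: whether the store has been shut down, whether this Broker is a Master, whether this Broker is writable, whether the topic is too long, whether the message properties are too long, and whether the page cache is busy (more than 1 s has elapsed since a store operation locked the CommitLog and it has not finished yet).
It then calls org.apache.rocketmq.store.CommitLog#putMessage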

public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
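    // ... handling of delayed messages is omitted (for example, the topic is changed to SCHEDULE_TOPIC_XXXX)

    // Get the MappedFile to write to: the last file under the ..\commitlog directory, or null if none exists
    MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
    // Spin lock by default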
    putMessageLock.lock(); //spin or ReentrantLock ,depending on store config
    try {
        long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
        this.beginTimeInLock = beginLockTimestamp;

        // Here settings are stored timestamp, in order to ensure an orderly
        // global
        // Use the lock acquisition time as the message's store timestamp
        msg.setStoreTimestamp(beginLockTimestamp);

        if (null == mappedFile || mappedFile.isFull()) {
            // No file exists or the current file is full: fetch the last file again, creating one if necessary
            mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
        }
        if (null == mappedFile) {
            log.error("create mapped file1 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
            beginTimeInLock = 0;
            // Creation failed, return an error
            return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, null);
        }
        // Append the message to memory
        result = mappedFile.appendMessage(msg, this.appendMessageCallback);
        switch (result.getStatus()) {
            case PUT_OK:
                break;
            case END_OF_FILE:
                // Not enough space left in this file: create a new file and write the message there
                ...
                result = mappedFile.appendMessage(msg, this.appendMessageCallback);
                break;
            case MESSAGE_SIZE_EXCEEDED:
            case PROPERTIES_SIZE_EXCEEDED:
                beginTimeInLock = 0;
                return new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, result);
            case UNKNOWN_ERROR:
                beginTimeInLock = 0;
                return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
            default:
                beginTimeInLock = 0;
                return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
        }

        eclipseTimeInLock = this.defaultMessageStore.getSystemClock().now() - beginLockTimestamp;
        // Reset the lock time to 0
        beginTimeInLock = 0;
    } finally {
        putMessageLock.unlock();
    }

    ...

    PutMessageResult putMessageResult = new PutMessageResult(PutMessageStatus.PUT_OK, result);

    // Statistics: count the number and size of messages
    storeStatsService.getSinglePutMessageTopicTimesTotal(msg.getTopic()).incrementAndGet();
    storeStatsService.getSinglePutMessageTopicSizeTotal(topic).addAndGet(result.getWroteBytes());
    // doAppend only wrote the message into the byteBuffer; it still has to be persisted to disk
    handleDiskFlush(result, putMessageResult, msg);
    // Master-slave synchronous replication
    handleHA(result, putMessageResult, msg);

    return putMessageResult;
}

Writing the message into the memory-mapped file MappedFile
org.apache.rocketmq.store.CommitLog.DefaultAppendMessageCallback#doAppend(physical start offset of the file, byteBuffer, maximum writable space, message content)

public AppendMessageResult doAppend(final long fileFromOffset, final ByteBuffer byteBuffer, final int maxBlank,
    final MessageExtBrokerInner msgInner) {
    // STORETIMESTAMP + STOREHOSTADDRESS + OFFSET <br>

    // PHY OFFSET
    long wroteOffset = fileFromOffset + byteBuffer.position();

    this.resetByteBuffer(hostHolder, 8);
    // Create the unique message ID
    String msgId = MessageDecoder.createMessageId(this.msgIdMemory, msgInner.getStoreHostBytes(hostHolder), wroteOffset);

    // Record ConsumeQueue information: keep the write offset of the consume queue
    keyBuilder.setLength(0);
    keyBuilder.append(msgInner.getTopic());
    keyBuilder.append('-');
    keyBuilder.append(msgInner.getQueueId());
    String key = keyBuilder.toString();
    Long queueOffset = CommitLog.this.topicQueueTable.get(key);
    if (null == queueOffset) {
        queueOffset = 0L;
        CommitLog.this.topicQueueTable.put(key, queueOffset);
    }

    // Transaction messages that require special handling

    // ... preparation of the content to store is omitted
    final int msgLen = calMsgLength(bodyLength, topicLength, propertiesLength);
    
    // Determines whether there is sufficient free space
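    // END_FILE_MIN_BLANK_LENGTH: at least 8 bytes are kept free to store the file's remaining space and the magic code
    if ((msgLen + END_FILE_MIN_BLANK_LENGTH) > maxBlank) {
        // The message is too long for the remaining space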
        return new AppendMessageResult(AppendMessageStatus.END_OF_FILE, wroteOffset, maxBlank, msgId, msgInner.getStoreTimestamp(),
            queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);
    }
    
    // For the omitted fields, see the message layout in calMsgLength
    // ... writing the fields into msgStoreItemMemory in the message format is omitted
    
    // Write messages to the queue buffer
    byteBuffer.put(this.msgStoreItemMemory.array(), 0, msgLen);

    AppendMessageResult result = new AppendMessageResult(AppendMessageStatus.PUT_OK, wroteOffset, msgLen, msgId,
        msgInner.getStoreTimestamp(), queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);

    // ... special handling of transactional messages is omitted
    return result;
}

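offsetMsgId is the globally unique ID of a message and is used for message lookup in the RocketMQ console.
It is a string built by concatenating "IP address + port" with the "physical offset in the CommitLog".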
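
As an illustration of that concatenation, here is a minimal sketch. It assumes an IPv4 address (4 bytes), a port stored as a 4-byte int, and an 8-byte offset, all hex-encoded into a 32-character string. The class and method names are hypothetical; the real encoding lives in MessageDecoder.createMessageId and may differ in detail.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of an "address + offset" message ID; not RocketMQ code.
final class OffsetMsgIdSketch {

    // Packs IPv4 (4 bytes) + port (4 bytes) + CommitLog offset (8 bytes) and hex-encodes the result.
    static String encode(byte[] ipv4, int port, long commitLogOffset) {
        if (ipv4.length != 4) {
            throw new IllegalArgumentException("this sketch assumes an IPv4 address");
        }
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.put(ipv4);
        buf.putInt(port);
        buf.putLong(commitLogOffset);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : buf.array()) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // The port occupies hex characters 8..15.
    static int decodePort(String msgId) {
        return Integer.parseUnsignedInt(msgId.substring(8, 16), 16);
    }

    // The CommitLog offset occupies the last 16 hex characters.
    static long decodeOffset(String msgId) {
        return Long.parseUnsignedLong(msgId.substring(16), 16);
    }

    public static void main(String[] args) {
        byte[] ip = {(byte) 192, (byte) 168, 0, 1};
        String id = encode(ip, 10911, 1024L);
        System.out.println(id + " -> port " + decodePort(id) + ", offset " + decodeOffset(id));
    }
}
```

Because the ID embeds the broker address and the physical offset, a lookup by such an ID can go straight to the right broker and CommitLog position without consulting an index.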

The method that computes the message length shows the storage format of a message; messages are stored in the CommitLog file with variable length.

protected static int calMsgLength(int bodyLength, int topicLength, int propertiesLength) {
    final int msgLen = 4 //TOTALSIZE: total length
        + 4 //MAGICCODE: magic code
        + 4 //BODYCRC: CRC checksum of the body
        + 4 //QUEUEID: consume queue ID
        + 4 //FLAG: application-defined flag
        + 8 //QUEUEOFFSET: offset of the message in the consume queue
        + 8 //PHYSICALOFFSET: physical offset of the message in the CommitLog
        + 4 //SYSFLAG: system flag
        + 8 //BORNTIMESTAMP: time at which the producer created the message
        + 8 //BORNHOST: producer IP + port
        + 8 //STORETIMESTAMP: store time
        + 8 //STOREHOSTADDRESS: IP + port of the storing broker
        + 4 //RECONSUMETIMES: number of retries
        + 8 //Prepared Transaction Offset: offset of the transactional message
        + 4 + (bodyLength > 0 ? bodyLength : 0) //BODY: 4-byte body length followed by the body
        + 1 + topicLength //TOPIC: 1-byte topic length followed by the topic
        + 2 + (propertiesLength > 0 ? propertiesLength : 0) //PROPERTIES: 2-byte properties length followed by the properties
        + 0;
    return msgLen;
}
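
As a worked example of the layout above: the fixed-size fields add up to 84 bytes and the three length prefixes to 7 bytes, so every message costs 91 bytes plus its body, topic, and properties. The sketch below repeats the same arithmetic in a standalone class (the class name and the fitsInFile helper are hypothetical) and also mirrors the END_OF_FILE check from doAppend, which keeps 8 bytes in reserve.

```java
// Standalone restatement of the calMsgLength arithmetic shown above; names are illustrative.
final class MsgLengthExample {
    // Sum of the fixed-size header fields, TOTALSIZE through Prepared Transaction Offset
    static final int FIXED_FIELDS = 4 + 4 + 4 + 4 + 4 + 8 + 8 + 4 + 8 + 8 + 8 + 8 + 4 + 8;
    // Length prefixes: 4 bytes for the body, 1 for the topic, 2 for the properties
    static final int LENGTH_PREFIXES = 4 + 1 + 2;
    // Bytes reserved at the end of a file (remaining space + magic code)
    static final int END_FILE_MIN_BLANK_LENGTH = 4 + 4;

    static int calMsgLength(int bodyLength, int topicLength, int propertiesLength) {
        return FIXED_FIELDS + LENGTH_PREFIXES
            + Math.max(bodyLength, 0) + topicLength + Math.max(propertiesLength, 0);
    }

    // True when a message of msgLen bytes still fits into maxBlank free bytes of the current file
    static boolean fitsInFile(int msgLen, int maxBlank) {
        return msgLen + END_FILE_MIN_BLANK_LENGTH <= maxBlank;
    }

    public static void main(String[] args) {
        int len = calMsgLength(100, "TopicTest".length(), 20);
        System.out.println("message length = " + len);
        System.out.println("fits in 227 free bytes? " + fitsInFile(len, 227));
        System.out.println("fits in 228 free bytes? " + fitsInFile(len, 228));
    }
}
```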

doAppend returns an AppendMessageResult object

public class AppendMessageResult {
    // Return code
    private AppendMessageStatus status;
    // Where to start writing
    private long wroteOffset;
    // Write Bytes: the length of the message
    private int wroteBytes;
    // Message ID
    private String msgId;
    // Message storage timestamp
    private long storeTimestamp;
    // Consume queue's offset(step by one)
    private long logicsOffset;
    private long pagecacheRT = 0;

    private int msgNum = 1;
}
public enum AppendMessageStatus {
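    PUT_OK, // success
    END_OF_FILE, // not enough space left in the file
    MESSAGE_SIZE_EXCEEDED, // message too long
    PROPERTIES_SIZE_EXCEEDED, // properties too long
    UNKNOWN_ERROR, // unknown error
}

Flushing to disk according to the storage policy (synchronous or asynchronous)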

org.apache.rocketmq.store.CommitLog#handleDiskFlush

public void handleDiskFlush(AppendMessageResult result, PutMessageResult putMessageResult, MessageExt messageExt) {
    // Synchronization flush
    if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
        final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
        if (messageExt.isWaitStoreMsgOK()) {
            GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
            service.putRequest(request);
            boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
            if (!flushOK) {
                log.error("do groupcommit, wait for flush failed, topic: " + messageExt.getTopic() + " tags: " + messageExt.getTags()
                    + " client address: " + messageExt.getBornHostString());
                // Timed out
                putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_DISK_TIMEOUT);
            }
        } else {
            service.wakeup();
        }
    }
    // Asynchronous flush
    // Wake up the asynchronous thread and return immediately, without waiting for the flush result
    else {
        if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            flushCommitLogService.wakeup();
        } else {
            commitLogService.wakeup();
        }
    }
}
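  • GroupCommitService: synchronous flush. Calls MappedFile.mappedByteBuffer.force() and returns once the flush succeeds or the wait times out;
  • FlushRealTimeService (the flushCommitLogService field): asynchronous flush with transientStorePoolEnable = false. Calls MappedFile.mappedByteBuffer.force();
  • CommitRealTimeService (the commitLogService field): asynchronous flush with transientStorePoolEnable = true, that is, with the off-heap memory pool enabled. It periodically commits the data in the off-heap MappedFile.writeBuffer to MappedFile.fileChannel, and then flushCommitLogService calls MappedFile.fileChannel.force().

TransientStorePool: the off-heap memory pool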

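Setting transientStorePoolEnable to true enables the off-heap memory pool. RocketMQ creates a TransientStorePool that by default allocates 5 off-heap ByteBuffers at initialization and locks them in memory through JNA. The pool keeps them in a double-ended queue, from which org.apache.rocketmq.store.MappedFile#writeBuffer is taken.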

public class TransientStorePool {
    public void init() {
        for (int i = 0; i < poolSize; i++) {
            ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);

            final long address = ((DirectBuffer) byteBuffer).address();
            Pointer pointer = new Pointer(address);
            LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));

            availableBuffers.offer(byteBuffer);
        }
    }
    
    public ByteBuffer borrowBuffer() {
        ByteBuffer buffer = availableBuffers.pollFirst();
        if (availableBuffers.size() < poolSize * 0.4) {
            log.warn("TransientStorePool only remain {} sheets.", availableBuffers.size());
        }
        return buffer;
    }
}

When a MappedFile is created, its writeBuffer is borrowed from the transientStorePool

public void init(final String fileName, final int fileSize,
    final TransientStorePool transientStorePool) throws IOException {
    init(fileName, fileSize);
    this.writeBuffer = transientStorePool.borrowBuffer();
    this.transientStorePool = transientStorePool;
}

Messages are first written to MappedFile.writeBuffer; if writeBuffer is not null, its content is then put into MappedFile.fileChannel.
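
The borrow-and-return pattern of such a pool can be sketched without the JNA memory locking. The class below is a simplified, hypothetical stand-in for TransientStorePool: it pre-allocates direct buffers, hands them out from the head of a deque, and resets a buffer before taking it back. The method names other than borrowBuffer are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified, hypothetical pool of pre-allocated direct buffers (no mlock).
final class DirectBufferPoolSketch {
    private final int bufferSize;
    private final Deque<ByteBuffer> availableBuffers = new ConcurrentLinkedDeque<>();

    DirectBufferPoolSketch(int poolSize, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolSize; i++) {
            // Off-heap allocation, done once up front
            availableBuffers.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Returns null when the pool is exhausted, so callers must handle that case.
    ByteBuffer borrowBuffer() {
        return availableBuffers.pollFirst();
    }

    // Resets position and limit so the next borrower sees an empty buffer.
    void returnBuffer(ByteBuffer buffer) {
        buffer.position(0);
        buffer.limit(bufferSize);
        availableBuffers.offerFirst(buffer);
    }

    int availableCount() {
        return availableBuffers.size();
    }
}
```

Pre-allocating keeps allocation cost off the write path; the trade-off is that the pool size caps how many files can use a write buffer at the same time.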

Synchronous flush

The GroupCommitRequest flush request is put into the GroupCommitService, and the caller then blocks until the flush succeeds.

class FlushCommitLogService extends ServiceThread
class ServiceThread implements Runnable
class GroupCommitService extends FlushCommitLogService {
    protected final CountDownLatch2 waitPoint = new CountDownLatch2(1);
    // Write queue
    private volatile List<GroupCommitRequest> requestsWrite = new ArrayList<GroupCommitRequest>();
    // Read queue
    private volatile List<GroupCommitRequest> requestsRead = new ArrayList<GroupCommitRequest>();

    public void run() {
       while (!this.isStopped()) {
           try {
               this.waitForRunning(10);
               // Flush
               this.doCommit();
           } catch (Exception e) {
               CommitLog.log.warn(this.getServiceName() + " service has exception. ", e);
           }
       }
       // ... shutdown of the flush service is omitted
    }
}

By default the loop waits up to 10 ms in waitForRunning(10) between runs of doCommit()

protected void waitForRunning(long interval) {
    // Already notified: return immediately
    if (hasNotified.compareAndSet(true, false)) {
        this.onWaitEnd();
        return;
    }

    //entry to wait
    // Reset the CountDownLatch2, a copy of CountDownLatch with an extra reset() method that restores the count
    waitPoint.reset();

    try {
        waitPoint.await(interval, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        log.error("Interrupted", e);
    } finally {
        hasNotified.set(false);
        // Whether woken up or timed out, onWaitEnd is always executed, which calls swapRequests()
        this.onWaitEnd();
    }
}

Before doCommit() runs, RocketMQ performs one special step, swapRequests()

private void swapRequests() {
    List<GroupCommitRequest> tmp = this.requestsWrite;
    this.requestsWrite = this.requestsRead;
    this.requestsRead = tmp;
}

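RocketMQ uses two queues to handle flush tasks: requestsRead holds the requests being processed and requestsWrite receives new requests. Separating additions from reads improves processing efficiency.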
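
The two-queue idea can be sketched in isolation. In the hypothetical class below, producers only touch the write list under a short lock, and the single consumer swaps the lists and then works through its batch without holding that lock. It assumes exactly one consumer thread, and the names are illustrative rather than taken from RocketMQ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical double-buffered queue: many producers, exactly one consumer.
final class DoubleBufferedQueue<T> {
    private List<T> writeList = new ArrayList<>();
    private List<T> readList = new ArrayList<>();

    // Called by producers; the lock is held only for the add.
    synchronized void put(T item) {
        writeList.add(item);
    }

    // Exchanges the two lists so the consumer owns everything queued so far.
    private synchronized void swap() {
        List<T> tmp = writeList;
        writeList = readList;
        readList = tmp;
    }

    // Called by the single consumer: swap, copy out the batch, clear for reuse.
    List<T> drain() {
        swap();
        List<T> batch = new ArrayList<>(readList);
        readList.clear();
        return batch;
    }
}
```

The benefit is that producers are blocked only for the duration of an add or a pointer swap, never for the time it takes to process a batch.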

private void doCommit() {
    synchronized (this.requestsRead) {
        if (!this.requestsRead.isEmpty()) {
            for (GroupCommitRequest req : this.requestsRead) {
                // There may be a message in the next file, so a maximum of
                // two times the flush
                boolean flushOK = false;
                for (int i = 0; i < 2 && !flushOK; i++) {
                    flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();

                    if (!flushOK) {
                        // Flush
                        CommitLog.this.mappedFileQueue.flush(0);
                    }
                }
                // Wake up the thread that submitted the message
                req.wakeupCustomer(flushOK);
            }

            long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
            if (storeTimestamp > 0) {
                // Record the flush timestamp, in preparation for recovering the store state
                CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
            }

            this.requestsRead.clear();
        } else {
            // Because of individual messages is set to not sync flush, it
            // will come to this process
            CommitLog.this.mappedFileQueue.flush(0);
        }
    }
}

During a flush the thread holds the lock on the requestsRead object and clears requestsRead when it is done; before the next flush, requestsRead and requestsWrite are swapped.
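
The handshake between the thread that stores a message and the flush thread (waitForFlush on one side, wakeupCustomer on the other) can be approximated with a plain CountDownLatch. The class below is a hypothetical simplification of GroupCommitRequest, not its actual implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical simplification of a flush request that a caller can block on.
final class FlushRequestSketch {
    private final long nextOffset;
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile boolean flushOk = false;

    FlushRequestSketch(long nextOffset) {
        this.nextOffset = nextOffset;
    }

    // Offset up to which data must be on disk for this request to be satisfied
    long getNextOffset() {
        return nextOffset;
    }

    // Called by the flush thread once it knows the outcome
    void wakeupCustomer(boolean ok) {
        this.flushOk = ok;
        latch.countDown();
    }

    // Called by the storing thread; returns false on timeout or failed flush
    boolean waitForFlush(long timeoutMillis) {
        try {
            boolean completed = latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return completed && flushOk;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would create the request, hand it to the flush thread, and map a false return value to a timeout status, much as handleDiskFlush does with FLUSH_DISK_TIMEOUT.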

Flushing the memory-mapped file

RocketMQ reads and writes files mainly through MappedByteBuffer (the mmap approach).
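
For readers unfamiliar with the mmap approach, the following standalone sketch maps a file with FileChannel.map, writes through the MappedByteBuffer, and calls force() to push the dirty pages to disk. It uses only JDK APIs; the class name, method name, and file size are chosen for the example and have nothing to do with RocketMQ's MappedFile.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal mmap write + force example using only the JDK.
final class MmapAppendSketch {

    // Maps fileSize bytes of the file, writes the payload at position 0 and forces it to disk.
    static void writeAndForce(Path file, int fileSize, byte[] payload) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping beyond the current file length in READ_WRITE mode grows the file
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            // The write lands in the page cache first
            mapped.put(payload);
            // force() asks the OS to write the modified pages to the storage device
            mapped.force();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```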

public boolean flush(final int flushLeastPages) {
    boolean result = true;
    // Find the CommitLog file that the last flush stopped in
    MappedFile mappedFile = this.findMappedFileByOffset(this.flushedWhere, this.flushedWhere == 0);
    if (mappedFile != null) {
        long tmpTimeStamp = mappedFile.getStoreTimestamp();
        int offset = mappedFile.flush(flushLeastPages);
        long where = mappedFile.getFileFromOffset() + offset;
        result = where == this.flushedWhere;
        // Update the flush position; data before this position has been persisted
        this.flushedWhere = where;
        if (0 == flushLeastPages) {
            this.storeTimestamp = tmpTimeStamp;
        }
    }

    return result;
}

/**
 * @return The current flushed position
 */
public int flush(final int flushLeastPages) {
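    // Flush when the file is full, when the dirty pages reach flushLeastPages, or (for flushLeastPages == 0) when any unflushed data exists
    if (this.isAbleToFlush(flushLeastPages)) {
        if (this.hold()) {
            // The readable position: the write position, or the commit position when writeBuffer is in use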
            int value = getReadPosition();

            try {
                //We only append data to fileChannel or mappedByteBuffer, never both.
                if (writeBuffer != null || this.fileChannel.position() != 0) {
                    this.fileChannel.force(false);
                } else {
                    this.mappedByteBuffer.force();
                }
            } catch (Throwable e) {
                log.error("Error occurred when force data to disk.", e);
            }
            // Record how far the flush has progressed
            this.flushedPosition.set(value);
            this.release();
        } else {
            log.warn("in flush, hold failed, flush offset = " + this.flushedPosition.get());
            this.flushedPosition.set(getReadPosition());
        }
    }
    return this.getFlushedPosition();
}

public int getReadPosition() {
    return this.writeBuffer == null ? this.wrotePosition.get() : this.committedPosition.get();
}
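
The three flush conditions named in the comment on isAbleToFlush can be written down as a small pure function. This is a sketch under stated assumptions: the page size is taken to be 4 KiB, and the positions are passed in as parameters instead of being read from the MappedFile fields, so the signature is hypothetical.

```java
// Hypothetical, parameterised restatement of the "is it worth flushing?" check.
final class FlushThresholdSketch {
    // Assumed OS page size of 4 KiB
    static final int OS_PAGE_SIZE = 4 * 1024;

    static boolean isAbleToFlush(int flushedPosition, int readPosition, int fileSize, int flushLeastPages) {
        // A full file is always flushed
        if (readPosition == fileSize) {
            return true;
        }
        // With a threshold, flush only when enough whole pages are dirty
        if (flushLeastPages > 0) {
            return (readPosition / OS_PAGE_SIZE) - (flushedPosition / OS_PAGE_SIZE) >= flushLeastPages;
        }
        // Without a threshold, flush whenever anything is unflushed
        return readPosition > flushedPosition;
    }
}
```

With flushLeastPages = 0, as used by the synchronous path (flush(0)), any unflushed byte triggers a flush; a positive threshold lets the asynchronous path batch several pages per force() call.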
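
Asynchronous flush

With the off-heap memory pool disabled, a message is appended to mappedByteBuffer and then flushed to disk. With it enabled, the message is first appended to off-heap memory, then committed to fileChannel, and finally flushed to disk.
With transientStorePoolEnable = true: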

class CommitRealTimeService extends FlushCommitLogService {
    public void run() {
       while (!this.isStopped()) {
           // 200 ms between runs by default
           int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitIntervalCommitLog();
           ...
           try {
               // Commit: write the data held in off-heap memory to the FileChannel of the physical file
               boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
               long end = System.currentTimeMillis();
               if (!result) {
                   this.lastCommitTimestamp = end; // result = false means some data committed.
                   //now wake up flush thread.
                   // New data was committed, so wake up the flush thread to flush it to disk
                   flushCommitLogService.wakeup();
               }

               this.waitForRunning(interval);
           } catch (Throwable e) {
               CommitLog.log.error(this.getServiceName() + " service has exception. ", e);
           }
           ...
		}
	}
}

commit eventually calls org.apache.rocketmq.store.MappedFile#commit0

protected void commit0(final int commitLeastPages) {
    int writePos = this.wrotePosition.get();
    int lastCommittedPosition = this.committedPosition.get();

    if (writePos - this.committedPosition.get() > 0) {
        try {
            ByteBuffer byteBuffer = writeBuffer.slice();
            // Start this commit at the last committed position
            byteBuffer.position(lastCommittedPosition);
            // Commit up to the latest write position
            byteBuffer.limit(writePos);
            this.fileChannel.position(lastCommittedPosition);
            this.fileChannel.write(byteBuffer);
            // Record how far the commit has progressed
            this.committedPosition.set(writePos);
        } catch (Throwable e) {
            log.error("Error occurred when commit data to FileChannel.", e);
        }
    }
}
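
The append, commit, flush sequence can be tried out in isolation with a direct buffer and a FileChannel. The class below is a single-threaded sketch with hypothetical names. It uses duplicate() instead of slice() because, unlike MappedFile, it appends directly to the buffer and so moves the buffer's own position.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Single-threaded, hypothetical sketch of "append to off-heap buffer, commit to channel, flush".
final class CommitThenFlushSketch implements AutoCloseable {
    private final ByteBuffer writeBuffer;
    private final FileChannel fileChannel;
    private int wrotePosition = 0;
    private int committedPosition = 0;

    CommitThenFlushSketch(Path file, int capacity) {
        this.writeBuffer = ByteBuffer.allocateDirect(capacity);
        try {
            this.fileChannel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 1: data only reaches the off-heap buffer
    void append(byte[] data) {
        writeBuffer.put(data);
        wrotePosition += data.length;
    }

    // Step 2: copy the not yet committed range into the FileChannel (page cache)
    void commit() {
        if (wrotePosition - committedPosition > 0) {
            ByteBuffer pending = writeBuffer.duplicate();
            pending.position(committedPosition);
            pending.limit(wrotePosition);
            try {
                fileChannel.position(committedPosition);
                while (pending.hasRemaining()) {
                    fileChannel.write(pending);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            committedPosition = wrotePosition;
        }
    }

    // Step 3: force the channel's content to the storage device
    void flush() {
        try {
            fileChannel.force(false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int getCommittedPosition() {
        return committedPosition;
    }

    @Override
    public void close() {
        try {
            fileChannel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Until commit() runs, appended data lives only in the off-heap buffer and is lost if the process dies, which is the durability trade-off of this mode.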

After the periodic commit comes the periodic flush

class FlushRealTimeService extends FlushCommitLogService {

    public void run() {
        CommitLog.log.info(this.getServiceName() + " service started");

        while (!this.isStopped()) {
            // 500 ms interval by default
            int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushIntervalCommitLog();
            try {
                if (flushCommitLogTimed) {
                    Thread.sleep(interval);
                } else {
                    this.waitForRunning(interval);
                }
                // Flush
                CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
                long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
                if (storeTimestamp > 0) {
                    // Record the flush timestamp in the store checkpoint
                    CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
                }
                ...

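At this point the message has been persisted to a disk file. Master-slave synchronization, the consume queue, and the index file are covered in later articles.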
