RocketMQ消息存储高性能,底层依赖于nio提供的基于内核的I/O管理(ZeroCopy)。在RocketMQ中,文件的读取主要通过MappedByteBuffer进行操作,文件的中转主要通过FileChannel模型。文件基于内核操作,以及大部分的使用基于内存都是直接提高RocketMQ的关键点。
对RocketMQ消息存储的理解,我是分两部分进行的,第一部分是梳理消息存储相关的关键对象(也就是数据结构,不过RocketMQ的这部分的数据结构和数据管理显然是融合在一起的,也就只能勉强叫做关键对象了);第二部分梳理RocketMQ对关键对象的处理策略(为什么说RocketMQ的大部分文件都是基于内存操作就是这点,换言之也就是顺序写,随机读的实现方式。不过这部分我会在下一篇文章里面写,这一篇会分享一些我觉得关键的源码,篇幅可能比较长,不喜欢的看完关键对象介绍就结束吧)。
关键对象:
DefaultMessageStore:默认文件存储管理类,包含了很多对存储文件的操作API,存储模块里面最重要的一个类。
CommitLog: commitLog是物理消息实体的存储主体,文件存储路径为$HOME/store/commitLog。默认文件大小为1G,文件名按照文件偏移量递增。
ConsumeQueue:ConsumeQueue是消息消费队列, 它是一个逻辑队列,每个Topic下的每个queueId对应一个Consumequeue,其中存储了消息对应在CommitLog文件中的物理偏移量offset(maxPhysicOffset),消息大小size(mappedFileSize),存储路径(storePath),主题(topic),队列ID(queueId)。
其提供了根据消息的key值,时间区间来快速检索消息的方法
indexFile:具体消息索引文件,提供快速检索消息的方法。底层使用HashMap管理key,使用MappedByteBuffer管理索引内容(MappedByteBuffer使用虚拟内存,因此分配(map)的内存大小不受JVM的-Xmx参数限制,但是如内存占用、文件关闭不确定,被其打开的文件只有在垃圾回收的才会被关闭,内存操作高效但是不安全就是这个原因了)。
MapedFileQueue:提供维护管理多个mapedFile,提供MappedFile的添加,查询,删除,持久化。MapedFileQueue管理的一个目录下的MappedFile。
MappedFile:通过MapedFile类对于commitlog、consumequeue、index三类大文件进行磁盘读写操作
FileChannel:一个用于输入/输出操作的连接。通道表示对实体的开放连接,如硬件设备、文件、网络套接字或能够执行一个或多个不同的输入/输出操作的程序组件,例如读取或写入。
MappedByteBuffer:java nio提供的一种大文件的操作方式,MappedByteBuffer继承自ByteBuffer,内部维护了一个逻辑地址address。使用虚拟内存,因此分配(map)的内存大小不受JVM的-Xmx参数限制,其打开的文件通过垃圾管理器直接管理。
1.DefaultMessageStore源码解读
public class DefaultMessageStore implements MessageStore {
private static final Logger log = LoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);
//消息存储配置,文件根目录,commit文件目录,commitLog大小(默认1G),commitQueue大小(默认30w),刷新频率等
private final MessageStoreConfig messageStoreConfig;
// CommitLog
private final CommitLog commitLog;
//主题,队列表,RocketMQ中文件目录结构是以topic和queueID为基础
private final ConcurrentMap<String/* topic */, ConcurrentMap<Integer/* queueId */, ConsumeQueue>> consumeQueueTable;
//commitQueue刷新服务,本质是一个循环执行的线程
private final FlushConsumeQueueService flushConsumeQueueService;
private final CleanCommitLogService cleanCommitLogService;
private final CleanConsumeQueueService cleanConsumeQueueService;
//索引服务,负责读写IndexFile的服务
private final IndexService indexService;
//MappedFile预分配服务线程
private final AllocateMappedFileService allocateMappedFileService;
//consumerQueue和Index的构建服务,消息存储到commitLog后由ReputMessageService负责将消息分
发到ConsumeQueue和IndexService
private final ReputMessageService reputMessageService;
//负责将master-slave之间的数据同步服务
private final HAService haService;
private final ScheduleMessageService scheduleMessageService;
//消息状态记录,通过静态的内部类CallSnapshot 以快照形式记录消息的一切行为
private final StoreStatsService storeStatsService;
private final TransientStorePool transientStorePool;
private final RunningFlags runningFlags = new RunningFlags();
private final SystemClock systemClock = new SystemClock();
private final ScheduledExecutorService scheduledExecutorService =
Executors.newSingleThreadScheduledExecutor(new ThreadFactoryImpl("StoreScheduledThread"));
private final BrokerStatsManager brokerStatsManager;
private final MessageArrivingListener messageArrivingListener;
private final BrokerConfig brokerConfig;
private volatile boolean shutdown = true;
private StoreCheckpoint storeCheckpoint;
private AtomicLong printTimes = new AtomicLong(0);
//转发comitlog日志
private final LinkedList<CommitLogDispatcher> dispatcherList;
private RandomAccessFile lockFile;
private FileLock lock;
boolean shutDownNormal = false;
}
2.CommitLog日志对象,体现高性能方面,重点关注一下日志文件的创建和查询,比如NIO方式对文件的管理,日志文件异步刷新(Commit通过线程来进行的刷新而不是直接持久化,从这点上来说RocketMQ日志文件的刷新都是异步。但是从宏观层面上来看刷盘策略也可再分为同步刷盘和异步刷盘,主要是刷新间隔时间的处理。)
public class CommitLog {
// Message's MAGIC CODE daa320a7
public final static int MESSAGE_MAGIC_CODE = 0xAABBCCDD ^ 1880681586 + 8;
private static final Logger log = LoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);
// End of file empty MAGIC CODE cbd43194
private final static int BLANK_MAGIC_CODE = 0xBBCCDDEE ^ 1880681586 + 8;
private final MappedFileQueue mappedFileQueue;
private final DefaultMessageStore defaultMessageStore;
private final FlushCommitLogService flushCommitLogService;
//If TransientStorePool enabled, we must flush message to FileChannel at fixed periods
private final FlushCommitLogService commitLogService;
private final AppendMessageCallback appendMessageCallback;
private final ThreadLocal<MessageExtBatchEncoder> batchEncoderThreadLocal;
private HashMap<String/* topic-queueid */, Long/* offset */> topicQueueTable = new HashMap<String, Long>(1024);
private volatile long confirmOffset = -1L;
private volatile long beginTimeInLock = 0;
private final PutMessageLock putMessageLock;
/**
* 读取日志
*/
public SelectMappedBufferResult getData(final long offset) {
return this.getData(offset, offset == 0);
}
public SelectMappedBufferResult getData(final long offset, final boolean returnFirstOnNotFound) {
int mappedFileSize = this.defaultMessageStore.getMessageStoreConfig().getMapedFileSizeCommitLog();
MappedFile mappedFile = this.mappedFileQueue.findMappedFileByOffset(offset, returnFirstOnNotFound);
if (mappedFile != null) {
int pos = (int) (offset % mappedFileSize);
SelectMappedBufferResult result = mappedFile.selectMappedBuffer(pos);
return result;
}
return null;
}
/**
*存储日志:存储日志文件mappedFile,更新mappedFileQueue,文件的持久化通过刷盘服务完成,具体的时
*间点不确定
**/
public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
// Set the storage time
msg.setStoreTimestamp(System.currentTimeMillis());
// Set the message body BODY CRC (consider the most appropriate setting
// on the client)
msg.setBodyCRC(UtilAll.crc32(msg.getBody()));
// Back to Results
AppendMessageResult result = null;
StoreStatsService storeStatsService = this.defaultMessageStore.getStoreStatsService();
String topic = msg.getTopic();
int queueId = msg.getQueueId();
final int tranType = MessageSysFlag.getTransactionValue(msg.getSysFlag());
if (tranType == MessageSysFlag.TRANSACTION_NOT_TYPE
|| tranType == MessageSysFlag.TRANSACTION_COMMIT_TYPE) {
// Delay Delivery
if (msg.getDelayTimeLevel() > 0) {
if (msg.getDelayTimeLevel() > this.defaultMessageStore.getScheduleMessageService().getMaxDelayLevel()) {
msg.setDelayTimeLevel(this.defaultMessageStore.getScheduleMessageService().getMaxDelayLevel());
}
topic = ScheduleMessageService.SCHEDULE_TOPIC;
queueId = ScheduleMessageService.delayLevel2QueueId(msg.getDelayTimeLevel());
// Backup real topic, queueId
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_REAL_TOPIC, msg.getTopic());
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_REAL_QUEUE_ID, String.valueOf(msg.getQueueId()));
msg.setPropertiesString(MessageDecoder.messageProperties2String(msg.getProperties()));
msg.setTopic(topic);
msg.setQueueId(queueId);
}
}
long eclipseTimeInLock = 0;
MappedFile unlockMappedFile = null;
MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
putMessageLock.lock(); //spin or ReentrantLock ,depending on store config
try {
long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
this.beginTimeInLock = beginLockTimestamp;
// Here settings are stored timestamp, in order to ensure an orderly
// global
msg.setStoreTimestamp(beginLockTimestamp);
if (null == mappedFile || mappedFile.isFull()) {
mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
}
if (null == mappedFile) {
log.error("create mapped file1 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, null);
}
result = mappedFile.appendMessage(msg, this.appendMessageCallback);
switch (result.getStatus()) {
case PUT_OK:
break;
case END_OF_FILE:
unlockMappedFile = mappedFile;
// Create a new file, re-write the message
mappedFile = this.mappedFileQueue.getLastMappedFile(0);
if (null == mappedFile) {
// XXX: warn and notify me
log.error("create mapped file2 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, result);
}
result = mappedFile.appendMessage(msg, this.appendMessageCallback);
break;
case MESSAGE_SIZE_EXCEEDED:
case PROPERTIES_SIZE_EXCEEDED:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, result);
case UNKNOWN_ERROR:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
default:
beginTimeInLock = 0;
return new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result);
}
eclipseTimeInLock = this.defaultMessageStore.getSystemClock().now() - beginLockTimestamp;
beginTimeInLock = 0;
} finally {
putMessageLock.unlock();
}
if (eclipseTimeInLock > 500) {
log.warn("[NOTIFYME]putMessage in lock cost time(ms)={}, bodyLength={} AppendMessageResult={}", eclipseTimeInLock, msg.getBody().length, result);
}
if (null != unlockMappedFile && this.defaultMessageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
this.defaultMessageStore.unlockMappedFile(unlockMappedFile);
}
PutMessageResult putMessageResult = new PutMessageResult(PutMessageStatus.PUT_OK, result);
// Statistics
storeStatsService.getSinglePutMessageTopicTimesTotal(msg.getTopic()).incrementAndGet();
storeStatsService.getSinglePutMessageTopicSizeTotal(topic).addAndGet(result.getWroteBytes());
handleDiskFlush(result, putMessageResult, msg);
handleHA(result, putMessageResult, msg);
return putMessageResult;
}
}
3.ConsumeQueue,通过ReputMessageService定时构建ConsumeQueue。
public class ConsumeQueue {
private static final Logger log = LoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);
public static final int CQ_STORE_UNIT_SIZE = 20;
private static final Logger LOG_ERROR = LoggerFactory.getLogger(LoggerName.STORE_ERROR_LOGGER_NAME);
private final DefaultMessageStore defaultMessageStore;
private final MappedFileQueue mappedFileQueue;
private final String topic;
private final int queueId;
private final ByteBuffer byteBufferIndex;
private final String storePath;
private final int mappedFileSize;
private long maxPhysicOffset = -1;
private volatile long minLogicOffset = 0;
private ConsumeQueueExt consumeQueueExt = null;
//实际对mappedfile的加载时通过MappedFileQueue 实现
public boolean load() {
boolean result = this.mappedFileQueue.load();
log.info("load consume queue " + this.topic + "-" + this.queueId + " " + (result
? "OK" : "Failed"));
if (isExtReadEnable()) {
result &= this.consumeQueueExt.load();
}
return result;
}
}
4.indexFile重点关注putKey(索引构建)和selectPhyOffset(索引查询)这两个方法。
public class IndexFile {
private static final Logger log = LoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);
private static int hashSlotSize = 4;
private static int indexSize = 20;
private static int invalidIndex = 0;
private final int hashSlotNum;
private final int indexNum;
private final MappedFile mappedFile;
private final FileChannel fileChannel;
private final MappedByteBuffer mappedByteBuffer;
private final IndexHeader indexHeader;
public boolean putKey(final String key, final long phyOffset, final long storeTimestamp) {
if (this.indexHeader.getIndexCount() < this.indexNum) {
int keyHash = indexKeyHashMethod(key);
int slotPos = keyHash % this.hashSlotNum;
int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
FileLock fileLock = null;
try {
// fileLock = this.fileChannel.lock(absSlotPos, hashSlotSize,
// false);
int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
if (slotValue <= invalidIndex || slotValue > this.indexHeader.getIndexCount()) {
slotValue = invalidIndex;
}
long timeDiff = storeTimestamp - this.indexHeader.getBeginTimestamp();
timeDiff = timeDiff / 1000;
if (this.indexHeader.getBeginTimestamp() <= 0) {
timeDiff = 0;
} else if (timeDiff > Integer.MAX_VALUE) {
timeDiff = Integer.MAX_VALUE;
} else if (timeDiff < 0) {
timeDiff = 0;
}
int absIndexPos =
IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
+ this.indexHeader.getIndexCount() * indexSize;
this.mappedByteBuffer.putInt(absIndexPos, keyHash);
this.mappedByteBuffer.putLong(absIndexPos + 4, phyOffset);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8, (int) timeDiff);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8 + 4, slotValue);
this.mappedByteBuffer.putInt(absSlotPos, this.indexHeader.getIndexCount());
if (this.indexHeader.getIndexCount() <= 1) {
this.indexHeader.setBeginPhyOffset(phyOffset);
this.indexHeader.setBeginTimestamp(storeTimestamp);
}
this.indexHeader.incHashSlotCount();
this.indexHeader.incIndexCount();
this.indexHeader.setEndPhyOffset(phyOffset);
this.indexHeader.setEndTimestamp(storeTimestamp);
return true;
} catch (Exception e) {
log.error("putKey exception, Key: " + key + " KeyHashCode: " + key.hashCode(), e);
} finally {
if (fileLock != null) {
try {
fileLock.release();
} catch (IOException e) {
log.error("Failed to release the lock", e);
}
}
}
} else {
log.warn("Over index file capacity: index count = " + this.indexHeader.getIndexCount()
+ "; index max num = " + this.indexNum);
}
return false;
}
public void selectPhyOffset(final List<Long> phyOffsets, final String key, final int maxNum,
final long begin, final long end, boolean lock) {
if (this.mappedFile.hold()) {
int keyHash = indexKeyHashMethod(key);
int slotPos = keyHash % this.hashSlotNum;
int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
FileLock fileLock = null;
try {
if (lock) {
// fileLock = this.fileChannel.lock(absSlotPos,
// hashSlotSize, true);
}
int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
// if (fileLock != null) {
// fileLock.release();
// fileLock = null;
// }
if (slotValue <= invalidIndex || slotValue > this.indexHeader.getIndexCount()
|| this.indexHeader.getIndexCount() <= 1) {
} else {
for (int nextIndexToRead = slotValue; ; ) {
if (phyOffsets.size() >= maxNum) {
break;
}
int absIndexPos =
IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
+ nextIndexToRead * indexSize;
int keyHashRead = this.mappedByteBuffer.getInt(absIndexPos);
long phyOffsetRead = this.mappedByteBuffer.getLong(absIndexPos + 4);
long timeDiff = (long) this.mappedByteBuffer.getInt(absIndexPos + 4 + 8);
int prevIndexRead = this.mappedByteBuffer.getInt(absIndexPos + 4 + 8 + 4);
if (timeDiff < 0) {
break;
}
timeDiff *= 1000L;
long timeRead = this.indexHeader.getBeginTimestamp() + timeDiff;
boolean timeMatched = (timeRead >= begin) && (timeRead <= end);
if (keyHash == keyHashRead && timeMatched) {
phyOffsets.add(phyOffsetRead);
}
if (prevIndexRead <= invalidIndex
|| prevIndexRead > this.indexHeader.getIndexCount()
|| prevIndexRead == nextIndexToRead || timeRead < begin) {
break;
}
nextIndexToRead = prevIndexRead;
}
}
} catch (Exception e) {
log.error("selectPhyOffset exception ", e);
} finally {
if (fileLock != null) {
try {
fileLock.release();
} catch (IOException e) {
log.error("Failed to release the lock", e);
}
}
this.mappedFile.release();
}
}
}
}
以上是存储相关的几个关键对象,源码部分没有做太多的解释,逻辑性的东西,有兴趣的看看源码都能理解。