RocketMQ Delay Queue Implementation

太阳伞下的阿呆

Last modified 2023-07-16 15:03:58

First published 2023-04-16 21:54:22

Article link: https://blog.csdn.net/u010597819/article/details/130189019

Version 4.2.0 (before 4.6.1)

rocketmq-delayQ.4.2.0.png

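The broker receives a delayed message
The delayed message is staged in the topic SCHEDULE_TOPIC_XXXX, in queue delayLevel-1. For example, delayLevel=3 maps to reviveQueueId=2
A scheduled task polls the messages in SCHEDULE_TOPIC_XXXX. Once a message has reached its agreed delivery time deliverTimeMs, it is requeued to its original queue in its original topic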
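
The mapping and the due check above can be sketched in a few lines of Java. This is a minimal illustration, not the broker's code: the class name DelayLevelSketch and the helpers delayLevelToQueueId, queueIdToDelayLevel and isDue are hypothetical names chosen for this example.

```java
// Illustrative sketch only; class and method names are assumptions, not RocketMQ APIs.
final class DelayLevelSketch {
    static final String SCHEDULE_TOPIC = "SCHEDULE_TOPIC_XXXX";

    // Queue that stages a given delay level: delayLevel - 1.
    static int delayLevelToQueueId(int delayLevel) {
        if (delayLevel < 1) {
            throw new IllegalArgumentException("delayLevel must be >= 1");
        }
        return delayLevel - 1;
    }

    // Inverse mapping, used when scanning a staging queue.
    static int queueIdToDelayLevel(int queueId) {
        return queueId + 1;
    }

    // A staged message may be requeued once the clock reaches deliverTimeMs.
    static boolean isDue(long deliverTimeMs, long nowMs) {
        return nowMs >= deliverTimeMs;
    }

    public static void main(String[] args) {
        System.out.println(SCHEDULE_TOPIC + " queueId=" + delayLevelToQueueId(3));
    }
}
```

Because each queue holds exactly one delay level, messages inside a queue become due in arrival order, so the scan can stop at the first message that is not yet due.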

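Version 5.1.0

Older versions of the delay queue supported only a handful of fixed delay levels rather than arbitrary delay times. The new version was upgraded to support arbitrary delay times

Model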

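TimerWheel (org.apache.rocketmq.store.timer.TimerWheel): the time-wheel structure. It is a ring made up of a series of Slots. Because delayed messages are produced at random times, the occupied slots are logically non-contiguous. The wheel is backed by a single physical binary file, for example: /var/folders/_m/sx5bwyvj6z577f3vzk8pw4lc0000gn/T/unitteststore-4b570306-1d9a-4c49-9f07-486249fd2187
1. slotsTotal: the number of slots on the wheel
2. precisionMs: the time precision of the wheel. For example, with 500ms, all messages whose delay time falls between 20230320 21:37:50.000 and 20230320 21:37:50.500 land in the same slot
3. wheelLength: the physical length of the wheel
Slot (org.apache.rocketmq.store.timer.Slot): each slot is the bucket for one time range of delayedTime values
1. timeMs: the delayedTime, for example: 1679318932500 (2023-03-20 21:28:52)
2. firstPos: the physical start offset of the first TimerLog record in this slot, similar to the head of the linked list used for HashMap collisions
3. lastPos: the physical start offset of the last TimerLog record in this slot
4. num: the number of delayed messages in this slot
5. magic: currently unused
6. Case 1: write four messages (three due at 20:00:00 and one at 20:05:00). The slot data after each write is as follows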

timerLog.append delayTime=2023-04-16 20:00:00 
firstPos=0 lastPos=0 num=1 size=52
timerLog.append delayTime=2023-04-16 20:05:00 
firstPos=52 lastPos=52 num=1 size=52
timerLog.append delayTime=2023-04-16 20:00:00 
firstPos=0 lastPos=104 num=2 size=52
timerLog.append delayTime=2023-04-16 20:00:00 
firstPos=0 lastPos=156 num=3 size=52
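
To make the role of precisionMs concrete, here is a small Java sketch of how a timestamp could be bucketed into a slot. The formula (divide by the precision, then wrap around the ring) is an assumption made for illustration; the real TimerWheel may use a different modulus. SlotIndexSketch and slotIndex are hypothetical names.

```java
// Illustrative sketch only; the index formula is an assumption, not the TimerWheel source.
final class SlotIndexSketch {
    // Bucket the timestamp by precision, then wrap around the ring of slots.
    static int slotIndex(long timeMs, int precisionMs, int slotsTotal) {
        return (int) ((timeMs / precisionMs) % slotsTotal);
    }

    public static void main(String[] args) {
        long base = 1679318932500L; // sample timeMs from the Slot description
        int precisionMs = 500;
        int slotsTotal = 1000;      // arbitrary ring size for the example
        System.out.println(slotIndex(base, precisionMs, slotsTotal));
        System.out.println(slotIndex(base + 499, precisionMs, slotsTotal)); // same window
        System.out.println(slotIndex(base + 500, precisionMs, slotsTotal)); // next window
    }
}
```

Two timestamps inside the same 500ms window share a slot, and a timestamp one full revolution later (precisionMs * slotsTotal) wraps back onto the same slot, which is why each slot also stores its own timeMs.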

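TimerLog (org.apache.rocketmq.store.timer.TimerLog):
1. size: the size of the log record
2. prevPos: the start offset of the previous log record
3. magic: flags such as whether the record has been deleted
4. currentWriteTime: the current write time
5. delayedTime: the delay of the message, stored as delayedTime-tmpWriteTimeMs(currentWriteTime)
6. offsetPy: the offset of the message in the message store
7. sizePy: the size of the message
8. hashCode: the hash of realTopic, used for metrics
9. reservedValue: reserved field
10. Continuing Case 1: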

timerLog.append delayTime=2023-04-16 20:00:00 
prevPos=-1 size=52
timerLog.append delayTime=2023-04-16 20:05:00 
prevPos=-1 size=52
timerLog.append delayTime=2023-04-16 20:00:00 
prevPos=0 size=52
timerLog.append delayTime=2023-04-16 20:00:00 
prevPos=104 size=52
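
The interplay between Slot and TimerLog in Case 1 can be replayed with a small in-memory model. This is a sketch under stated assumptions: records are a fixed 52 bytes as in the example, slots are keyed directly by delay time instead of by a ring index, and TimerLogChainSketch, SlotState and append are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model of Case 1; not the RocketMQ implementation.
final class TimerLogChainSketch {
    static final int UNIT_SIZE = 52; // record size taken from the example

    static final class SlotState {
        long firstPos = -1; // position of the first record of this slot
        long lastPos = -1;  // position of the most recent record of this slot
        int num = 0;        // number of records in this slot
    }

    // Each log record is {prevPos, delayedTime}; its position is index * UNIT_SIZE.
    final List<long[]> log = new ArrayList<>();
    // Simplification: one slot per distinct delay time.
    final Map<Long, SlotState> slots = new HashMap<>();

    long append(long delayedTime) {
        SlotState slot = slots.computeIfAbsent(delayedTime, k -> new SlotState());
        long pos = (long) log.size() * UNIT_SIZE;
        log.add(new long[] {slot.lastPos, delayedTime}); // link to the slot's previous tail
        if (slot.firstPos == -1) {
            slot.firstPos = pos;
        }
        slot.lastPos = pos;
        slot.num++;
        return pos;
    }

    public static void main(String[] args) {
        TimerLogChainSketch t = new TimerLogChainSketch();
        long[] times = {2000L, 2005L, 2000L, 2000L}; // stand-ins for 20:00 and 20:05
        for (long time : times) {
            long pos = t.append(time);
            SlotState s = t.slots.get(time);
            System.out.println("pos=" + pos + " prevPos=" + t.log.get((int) (pos / UNIT_SIZE))[0]
                + " firstPos=" + s.firstPos + " lastPos=" + s.lastPos + " num=" + s.num);
        }
    }
}
```

Every write is a pure append to the log plus a constant-size update of one slot, which matches the firstPos, lastPos, num and prevPos values listed in Case 1.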

TimerRequest (org.apache.rocketmq.store.timer.TimerRequest)

TimerWheel and TimerLog

TimerWheel.drawio.png

Business logic

org.apache.rocketmq.broker.BrokerController#registerMessageStoreHook: registers the hook methods. The hook that handles scheduled messages is org.apache.rocketmq.broker.util.HookUtils#handleScheduleMessage; the logic that older versions kept in CommitLog to rewrite the topic of a delayed message has moved into this hook
TimerMessageStore: storage for timer messages. It handles persistence, maintains and updates the TimerWheel, and recovers state after a restart. It is loaded when BrokerController initializes, and BrokerController also manages its start and stop
org.apache.rocketmq.store.timer.TimerMessageStore#enqueue
1. Reads the queue of org.apache.rocketmq.store.timer.TimerMessageStore#TIMER_TOPIC; queueId is hard-coded to 0
2. Reads the ConsumeQueue data in fixed-size units of org.apache.rocketmq.store.ConsumeQueue#CQ_STORE_UNIT_SIZE
3. Wraps the message org.apache.rocketmq.common.message.MessageExt together with other metadata into a TimerRequest and puts it on the queue org.apache.rocketmq.store.timer.TimerMessageStore#enqueuePutQueue
org.apache.rocketmq.store.timer.TimerMessageStore.TimerEnqueuePutService#run
1. Polls the enqueuePutQueue
2. Packs the message metadata (offset + size) and the slot's previous position into a TimerLog buffer (org.apache.rocketmq.store.timer.TimerMessageStore#timerLogBuffer), then appends that buffer to org.apache.rocketmq.store.timer.TimerMessageStore#timerLog
3. Wraps the TimerLog physical position and the delay time into a Slot and writes it to the time wheel
org.apache.rocketmq.store.timer.TimerMessageStore.TimerDequeueGetService#run
1. Starting from the current read time (org.apache.rocketmq.store.timer.TimerMessageStore#currReadTimeMs), keeps rolling forward through the wheel's Slots. These hold the messages whose delay has expired and that now need to be delivered
2. Reads the physical TimerLog data starting from the Slot data (slot.lastPos)
3. Wraps the metadata recorded in the TimerLog into a TimerRequest and puts it on the queue org.apache.rocketmq.store.timer.TimerMessageStore#dequeueGetQueue
org.apache.rocketmq.store.timer.TimerMessageStore.TimerDequeueGetMessageService#run
1. Polls the dequeueGetQueue
2. Uses the TimerRequest metadata to read the MessageExt from the message store (org.apache.rocketmq.store.timer.TimerMessageStore#messageStore)
3. Writes the MessageExt into the TimerRequest and puts it on the queue org.apache.rocketmq.store.timer.TimerMessageStore#dequeuePutQueue
org.apache.rocketmq.store.timer.TimerMessageStore.TimerDequeuePutMessageService#run
1. Polls the dequeuePutQueue
2. Converts the MessageExt to a MessageExtBrokerInner and writes it to the real message queue. org.apache.rocketmq.store.timer.TimerMessageStore#convertMessage changes the topic from the delay-queue topic back to the real topic stored in org.apache.rocketmq.common.message.MessageConst#PROPERTY_REAL_TOPIC
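
The hand-off between the stages above can be sketched as a chain of queues. The sketch below runs the stages one after another on a single thread so the flow is easy to follow, whereas the services described above each run in their own loop. Only the three queue names come from the text; TimerPipelineSketch, Request, the wheel list (standing in for TimerLog plus TimerWheel) and the delivered list (standing in for the real topic) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Single-threaded illustration of the stage hand-off; not the RocketMQ implementation.
final class TimerPipelineSketch {
    // Hypothetical stand-in for TimerRequest.
    static final class Request {
        final long offsetPy;    // where the message sits in the store
        final long delayedTime; // when the message is due
        String body;            // filled in by the "get message" stage

        Request(long offsetPy, long delayedTime) {
            this.offsetPy = offsetPy;
            this.delayedTime = delayedTime;
        }
    }

    final BlockingQueue<Request> enqueuePutQueue = new LinkedBlockingQueue<>();
    final BlockingQueue<Request> dequeueGetQueue = new LinkedBlockingQueue<>();
    final BlockingQueue<Request> dequeuePutQueue = new LinkedBlockingQueue<>();
    final List<Request> wheel = new ArrayList<>();    // stands in for TimerLog + TimerWheel
    final List<String> delivered = new ArrayList<>(); // stands in for the real topic

    // enqueue: metadata read from the timer topic enters the pipeline.
    void enqueue(Request r) {
        enqueuePutQueue.offer(r);
    }

    // TimerEnqueuePutService: persist the request into the log and the wheel.
    void enqueuePut() {
        Request r;
        while ((r = enqueuePutQueue.poll()) != null) {
            wheel.add(r);
        }
    }

    // TimerDequeueGetService: pick up every request that is due at nowMs.
    void dequeueGet(long nowMs) {
        for (Iterator<Request> it = wheel.iterator(); it.hasNext(); ) {
            Request r = it.next();
            if (r.delayedTime <= nowMs) {
                it.remove();
                dequeueGetQueue.offer(r);
            }
        }
    }

    // TimerDequeueGetMessageService: load the message body for each due request.
    void dequeueGetMessage() {
        Request r;
        while ((r = dequeueGetQueue.poll()) != null) {
            r.body = "msg@" + r.offsetPy; // pretend to read MessageExt from the store
            dequeuePutQueue.offer(r);
        }
    }

    // TimerDequeuePutMessageService: write the message to its real topic.
    void dequeuePut() {
        Request r;
        while ((r = dequeuePutQueue.poll()) != null) {
            delivered.add(r.body);
        }
    }

    public static void main(String[] args) {
        TimerPipelineSketch p = new TimerPipelineSketch();
        p.enqueue(new Request(0L, 100L));
        p.enqueue(new Request(52L, 200L));
        p.enqueuePut();
        p.dequeueGet(150L);
        p.dequeueGetMessage();
        p.dequeuePut();
        System.out.println(p.delivered);
    }
}
```

Splitting the work into queue-connected stages lets the slow steps (reading message bodies, writing to the real topic) proceed independently of the wheel scan.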

Flowchart

rmq_delay.drawio.png

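Questions

Why do delayed messages of different granularity use different queues?

Performance. Suppose a single queue were used: if the head of the queue holds hour-level messages and the middle holds minute-level messages, the minute-level messages will most likely reach their requeue time before the hour-level ones. To fetch them, the scan has to walk from the head to the index of the minute-level messages, which is comparable to a disk full of fragments. Performance would be poor, and delay times could drift badly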

Why is the queueOffset of MessageExt rewritten?

// Source: org.apache.rocketmq.store.timer.TimerMessageStore#enqueue
// use CQ offset, not offset in Message
msgExt.setQueueOffset(offset + (i / ConsumeQueue.CQ_STORE_UNIT_SIZE));

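When sending, there may be both single messages and batch messages, so the message offsets are uneven, for example: 0, 30, 50, 90
Delayed messages are read in fixed-length steps of the queue storage unit size (org.apache.rocketmq.store.ConsumeQueue#CQ_STORE_UNIT_SIZE=20), so the offsets are even and should be: 0, 20, 40, 60
This is why queueOffset has to be rewritten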
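
The arithmetic in the quoted line can be checked in isolation. In this sketch, offset is assumed to be the ConsumeQueue index of the first unit in the buffer being read and i the byte position inside that buffer; QueueOffsetSketch and cqOffset are hypothetical names.

```java
// Illustrative sketch of the offset arithmetic; not the TimerMessageStore source.
final class QueueOffsetSketch {
    static final int CQ_STORE_UNIT_SIZE = 20; // unit size stated in the text

    // Byte position i inside the buffer corresponds to unit number i / CQ_STORE_UNIT_SIZE.
    static long cqOffset(long offset, int i) {
        return offset + (i / CQ_STORE_UNIT_SIZE);
    }

    public static void main(String[] args) {
        long offset = 100L; // arbitrary starting index for the example
        for (int i = 0; i < 4 * CQ_STORE_UNIT_SIZE; i += CQ_STORE_UNIT_SIZE) {
            System.out.println("byte " + i + " -> queueOffset " + cqOffset(offset, i));
        }
    }
}
```

Byte positions 0, 20, 40 and 60 map to consecutive queue offsets, regardless of how large the underlying messages are.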

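Why is the number of Slots twice totalSlot?
No idea -_-!!!
Does reading the TimerLog pick up data that is not needed?
For example, in the Slot case above, a delayed message that fires later (2023-04-16 20:05:00) sits between the delayed messages due at 2023-04-16 20:00:00
The answer is of course no. Continuing Case 1: read timer log, the read order is as follows: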

prevPos=156
prevPos=104
prevPos=0
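
The read order above follows directly from walking the prevPos chain backwards from the slot's lastPos. Here is a minimal Java sketch of that traversal, using the prevPos values from Case 1 and the 52-byte record size; SlotReadSketch and readSlot are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative backward traversal of one slot's records; not the RocketMQ implementation.
final class SlotReadSketch {
    static final int UNIT_SIZE = 52; // record size taken from the example

    // prevPos[k] is the prevPos field of the record stored at position k * UNIT_SIZE.
    static List<Long> readSlot(long[] prevPos, long lastPos) {
        List<Long> visited = new ArrayList<>();
        long pos = lastPos;
        while (pos != -1) {
            visited.add(pos);
            pos = prevPos[(int) (pos / UNIT_SIZE)]; // jump to the previous record of the same slot
        }
        return visited;
    }

    public static void main(String[] args) {
        long[] prevPos = {-1, -1, 0, 104};          // records at 0, 52, 104, 156
        System.out.println(readSlot(prevPos, 156)); // walk the 20:00:00 slot
    }
}
```

The record at position 52 belongs to the 20:05:00 slot and is never visited, because no record in the 20:00:00 chain points to it.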

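Why abstract TimerLog and Slot separately, instead of letting Slot record the TimerLog data directly?

Recording delayed messages through TimerLog is faster. Slot is a bucket of the Wheel and is itself an ordered store. If Slot held the data, every write would first need a read (to find which slot the message belongs to) before writing, because the delay times of incoming messages are scattered and unordered. TimerLog only needs an append, so it is necessarily faster than Slot
If Slot recorded the TimerLog data, the TimerWheel would turn into a ring-shaped hashMap, while the physical storage is a flat file. Maintaining a hashMap in a single file makes offset handling very complex, and concurrent reads and writes must also be considered. Using multiple files simply leads back to the current abstraction: one file for the TimerWheel and multiple files for the slot linked lists. The current TimerLog uses a single file whose format is a pointer-based linked list implemented on an array, with prevPos used to jump between the records of the same slot.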

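Why are there two pointers: the slot's current read position and the slot's current write position?

Note: the discussion assumes a delay precision of 1s
A normal time wheel needs only a read position. For example, a 60s wheel ticks once per second (moving one slot per tick), and each write goes into the slot just before the read position, so after one full revolution 60s have passed and the message can fire.
RocketMQ's delay queue, however, ticks at 1/10 of the precision. For example, with a precision of 500ms the tick is 50ms
Why not a fixed 1s?
Because of failure recovery. While the service is restarting, many delayed messages expire and should have fired. What happens after the restart? If the wheel ticked once per second from the last read position up to the current write time, every delayed message would be affected and delay times would become inaccurate
So reading has to be sped up. rmq reads at a fixed rate of 1/10 of the precision, so the read position will always catch up with the write position. When the read position reaches the write time, it waits in place, which is why the current write time has to be recorded
Can the current time be used as the current write time?
Yes, and that is what rmq does
Recording an extra current write position (time) is really just a way to mark where reading must pause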
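
The catch-up argument can be illustrated with a small simulation. It is a simplification under stated assumptions: each tick lasts precisionMs / 10 of wall-clock time and advances the read pointer by one slot (precisionMs of wheel time), and the write time is the wall clock. ReadCatchUpSketch and catchUpMs are hypothetical names, and the model does not reproduce the actual service loop.

```java
// Illustrative simulation of the read pointer catching up after a restart.
final class ReadCatchUpSketch {
    // Returns the wall-clock milliseconds needed for the read pointer to reach the write time.
    static long catchUpMs(long lagMs, int precisionMs) {
        long tickMs = precisionMs / 10; // tick length described in the text
        long readTimeMs = 0;            // wheel time already processed
        long writeTimeMs = lagMs;       // current write time, i.e. the wall clock
        long ticks = 0;
        while (readTimeMs < writeTimeMs) {
            readTimeMs += precisionMs;  // one slot processed per tick
            writeTimeMs += tickMs;      // the wall clock keeps moving meanwhile
            ticks++;
        }
        return ticks * tickMs;
    }

    public static void main(String[] args) {
        // A broker that was down for 60s with a 1s precision.
        System.out.println(catchUpMs(60_000L, 1_000) + " ms to catch up");
    }
}
```

Once the loop exits, the read pointer has reached the write time and must wait, which is the pause condition the second pointer provides. With a tick equal to the precision, the gap would never shrink.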