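Monitoring RocketMQ with Prometheus

📚 Overview

Introduction: the RocketMQ website provides an example of monitoring RocketMQ. This article fleshes out that example and walks through it in practice.
Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter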

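📗 Installing rocketmq-exporter

This article uses version 4.9.4 as an example. For other versions, change the version number accordingly and swap the matching package into the scripts.

🧩 rocketmq-exporter configuration

GitHub repository: https://github.com/apache/rocketmq-exporter

Steps:

🧾 Download the source and fix the bug

Related GitHub issue: BrokerRuntimeStats#loadTps NPE #131
The stock rocketmq-exporter has a bug. In the constructor org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps.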
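
The rename can also be applied from the command line. The following is a minimal sketch, assuming a standard sed; the helper name fix_tps_key is made up for illustration, and the sample input line only stands in for the real statement in BrokerRuntimeStats, which may look different.

```shell
#!/bin/sh
# Rewrite the misspelled key on stdin and print the result on stdout.
# fix_tps_key is a hypothetical helper name, not part of rocketmq-exporter.
fix_tps_key() {
  sed 's/getTransferredTps/getTransferedTps/g'
}

# Illustrative input line; the real statement in BrokerRuntimeStats may differ.
printf '%s\n' 'get("getTransferredTps")' | fix_tps_key
```

To patch a checked-out source file, redirect it through the function into a new file and review the diff before replacing the original.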

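📑 Modify the configuration

✨ pom.xml

In pom.xml, change the RocketMQ version to the one your cluster runs.

✨ application.yml

In application.yml, set namesrvAddr and the other relevant settings. The task execution intervals can be left as they are or adjusted to your situation.

  • rocketmq.config.enableACL: if ACL verification is enabled on the RocketMQ cluster, set this to true and configure the corresponding ak and sk in accessKey and secretKey.
  • rocketmq.config.outOfTimeSeconds: the expiry time for stored metrics and their values. If the node for a key in the cache sees no write within this time, it is deleted. 60s is usually enough; choose it according to the interval at which Prometheus scrapes metrics, making sure the expiry time is greater than or equal to the scrape interval.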
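
The rule for outOfTimeSeconds can be checked mechanically before deploying. This is only a sketch: check_expiry is a hypothetical helper, and both values are assumed to be whole seconds.

```shell
#!/bin/sh
# Succeeds when the cache expiry is at least as long as the scrape interval.
# Both arguments are whole seconds; check_expiry is a hypothetical helper.
check_expiry() {
  out_of_time_seconds=$1
  scrape_interval_seconds=$2
  [ "$out_of_time_seconds" -ge "$scrape_interval_seconds" ]
}

if check_expiry 60 30; then
  echo "ok: metrics survive until the next scrape"
else
  echo "too short: metrics may expire between scrapes"
fi
```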

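📑 Build and start

Build

Build with Maven and use the resulting rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar file. The scripts below refer to it as rocketmq-exporter.jar.

Startup script

# rocketmq.config.namesrvAddr sets the NameServer address; separate multiple addresses with semicolons
nohup java -Xms512m -Xmx512m -jar rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/dev/null 2>&1 &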
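
A small helper can pick up the executable jar after the build. This is a sketch under assumptions: Maven is assumed to write to its default target/ directory, find_exec_jar is a made-up name, and the Maven invocation is shown only as a comment.

```shell
#!/bin/sh
# Print the first *-exec.jar found directly inside the given directory.
# find_exec_jar is a hypothetical helper name.
find_exec_jar() {
  find "$1" -maxdepth 1 -type f -name 'rocketmq-exporter-*-exec.jar' | head -n 1
}

# Typical use from the project root (commented out so the sketch has no side effects):
#   mvn clean package -Dmaven.test.skip=true
#   cp "$(find_exec_jar target)" rocketmq-exporter.jar
```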
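
When several NameServer addresses are passed, the semicolon needs care on the command line, because an unquoted `;` ends the shell command. The sketch below does a rough host:port format check and shows the quoted argument; valid_namesrv_addr is a hypothetical helper, the addresses are examples, and IPv6 literals are not handled.

```shell
#!/bin/sh
# Succeeds when every semicolon-separated entry looks like host:port.
# valid_namesrv_addr is a hypothetical helper name; this is a format check only.
valid_namesrv_addr() {
  printf '%s\n' "$1" | grep -Eq '^[^;:]+:[0-9]+(;[^;:]+:[0-9]+)*$'
}

addr='192.168.0.1:9876;192.168.0.2:9876'   # example addresses
if valid_namesrv_addr "$addr"; then
  # Quote the argument: an unquoted ';' would end the command early.
  echo java -jar rocketmq-exporter.jar "--rocketmq.config.namesrvAddr=$addr"
fi
```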

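Complete scripts

🔊 Note:
A service file cannot use environment variables, so the install script checks up front whether a JDK is installed and creates a symlink at /usr/bin/java. The later scripts use that path directly.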

install.sh:

#!/bin/bash

# Installation directory
installDir="/opt/gdmp/exporter"

# Exporter name, also used as the launch file name
exporterName="rocketmq-exporter"

# Exporter package name
exporterPackageName="${exporterName}"
exporterPackageNameTar="${exporterPackageName}.jar"
# Exporter port
exporterPort="5557"

# Description
description="Default exposed port: ${exporterPort}. To change the configuration, edit the registered service /etc/systemd/system/${exporterName}.service, then run systemctl daemon-reload && systemctl restart ${exporterName} to restart the ${exporterName} service"

# Only CentOS 7 is supported
if ! grep -Eq "7\.[0-9]" /etc/redhat-release 2>/dev/null; then
  printf -- '\033[31m ERROR: only CentOS 7 is supported \033[0m\n'
  exit 1
fi

# Create the directory if it does not exist
function mkdirIfNotExist() {
  if [ ! -d "$1" ]; then
    echo "mkdir -p $1"
    mkdir -p "$1"
  fi
}

# Symlink java into /usr/bin so the service does not depend on environment variables
if [ -n "$JAVA_HOME" ]; then
  # Skip the link when /usr/bin/java already exists, so re-running is harmless
  if [ ! -e /usr/bin/java ]; then
    echo "ln -s $JAVA_HOME/jre/bin/java /usr/bin/java"
    ln -s "$JAVA_HOME/jre/bin/java" /usr/bin/java
  fi
else
  echo "JDK is not installed or JAVA_HOME is not set"
  exit 1
fi

# Create the install directory
mkdirIfNotExist "${installDir}/${exporterName}"

# Copy the package
echo "/usr/bin/cp -rf ${exporterPackageNameTar} ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf "${exporterPackageNameTar}" "${installDir}/${exporterPackageName}/"

# Copy the start script and make sure systemd can execute it
echo "/usr/bin/cp -rf start.sh ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf start.sh "${installDir}/${exporterPackageName}/"
chmod +x "${installDir}/${exporterPackageName}/start.sh"

# Copy the systemd service file
echo "/usr/bin/cp -f ${exporterName}.service /etc/systemd/system/"
/usr/bin/cp -f "${exporterName}.service" /etc/systemd/system/

systemctl daemon-reload
systemctl enable "${exporterName}"
systemctl start "${exporterName}"

echo "Started the ${exporterName} client"

echo "Registered the ${exporterName} service daemon"

printf -- "\033[32m ${exporterName} status: \033[0m\n"
systemctl --type=service --state=active | grep "${exporterName}"
printf -- "\033[32m exporter address: http://127.0.0.1:${exporterPort}/metrics \033[0m\n"

echo "${description}"

rocketmq-exporter.service:

[Unit]
Description=https://github.com/apache/rocketmq-exporter
After=network-online.target

[Service]
ExecStart=/opt/gdmp/exporter/rocketmq-exporter/start.sh
#ExecStart=/usr/bin/java -jar -Xms1G   -Xmx1G /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/data/rocketmq/rocketmq-exporter/exporter.log 2>&1
Restart=always
RestartSec=5
StartLimitInterval=0
StartLimitBurst=10
# The append: prefix needs systemd 240 or newer. Older versions (CentOS 7 ships 219)
# may ignore the next two lines and send output to the journal instead.
StandardOutput=append:/data/rocketmq/rocketmq-exporter/startup.log
StandardError=append:/data/rocketmq/rocketmq-exporter/error.log

[Install]
WantedBy=multi-user.target

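start.sh:

#!/bin/bash

# Prefer JAVA_HOME when it is set; otherwise use the symlink created by install.sh
if [ -n "$JAVA_HOME" ]; then
  JAVA="$JAVA_HOME/bin/java"
else
  JAVA='/usr/bin/java'
fi

echo "$JAVA"
# rocketmq.config.namesrvAddr sets the NameServer address; separate multiple addresses with semicolons
# exec replaces the shell so that systemd supervises the java process directly
exec "$JAVA" -Xms1G -Xmx1G -jar /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 2>&1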

Uninstall script:

#!/bin/bash

# Installation directory
installDir="/opt/gdmp/exporter"

# Exporter name
exporterName="rocketmq-exporter"

echo "systemctl stop ${exporterName}"
systemctl stop "${exporterName}"
systemctl disable "${exporterName}"

# Remove the installed files
echo "rm -rf ${installDir}/${exporterName}"
rm -rf "${installDir:?}/${exporterName}"

# Remove the service file
echo "rm -f /etc/systemd/system/${exporterName}.service"
rm -f "/etc/systemd/system/${exporterName}.service"

# Reload systemd once the unit file is gone
systemctl daemon-reload

printf -- "\033[32m Uninstall complete \033[0m\n"

Installation package:

Link: https://pan.baidu.com/s/1f9nMH1oSxyr8azUepu-Q1g

Extraction code: gcjk

🧫 Installation

Run the install.sh script directly.
Access address: http://127.0.0.1:5557/metrics

🧾 Log path

# View the log
tail -f ~/logs/exporterlogs/rocketmq-exporter.log
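
A quick way to confirm that the exporter is serving RocketMQ metrics is to count the rocketmq_ sample lines in its output. The sketch below reads Prometheus text format from stdin; count_rocketmq_samples is a hypothetical helper, and the sample lines (label names and values) are illustrative rather than captured from a real exporter.

```shell
#!/bin/sh
# Count sample lines whose metric name starts with rocketmq_ in Prometheus
# text-format input read from stdin. Comment lines (# HELP / # TYPE) are skipped
# because they start with '#'. count_rocketmq_samples is a hypothetical helper.
count_rocketmq_samples() {
  grep -c '^rocketmq_'
}

# Illustrative input; labels and values are made up.
count_rocketmq_samples <<'EOF'
# TYPE rocketmq_broker_tps gauge
rocketmq_broker_tps{cluster="demo",broker="broker-a"} 12.5
rocketmq_broker_qps{cluster="demo",broker="broker-a"} 8.0
EOF
```

Against a running exporter, the same function can be fed with `curl -s http://127.0.0.1:5557/metrics | count_rocketmq_samples`.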

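🔖 Troubleshooting notes

🔊 Note:

  1. The stock rocketmq-exporter has a bug. In the constructor org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps. Without the fix, the exporter logs the NullPointerException shown after this list.
  2. If your RocketMQ version differs, change the version in rocketmq-exporter accordingly. This involves the pom.xml and application.yml files.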
java.lang.NullPointerException: null
	at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.loadTps(BrokerRuntimeStats.java:149)
	at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.<init>(BrokerRuntimeStats.java:94)
	at org.apache.rocketmq.exporter.task.MetricsCollectTask.collectBrokerRuntimeStats(MetricsCollectTask.java:685)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
	at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)


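📘 How it works

Rocketmq-exporter is a system for monitoring all relevant metrics on the RocketMQ broker side and client side. It fetches metric values from the broker through mqAdmin and wraps them in 87 caches.

🔊 Warning
Earlier versions used 87 ConcurrentHashMaps. A Map never removes expired metrics, so every label change produced a new metric while the stale ones could not be cleaned up automatically, and over time this caused an out-of-memory error. A Cache structure supports expiry-based eviction, and the expiry time is configurable.

The issue above is described on the RocketMQ website, and it is also something to watch for when writing your own exporter. Rocketmq-exporter is an important reference for developing our own exporters.

Rocketmq-exporter collects metrics as follows: the exporter requests data from the MQ cluster through MQAdminExt, MetricService normalizes the returned data into the format Prometheus expects, and the result is exposed to Prometheus through the /metrics endpoint.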
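
The difference between a plain map and an expiring cache can be illustrated outside Java. The sketch below is only an analogy, not the exporter's implementation: it keeps entries as text lines carrying their last write time and drops those older than the expiry. The name evict_expired and the line layout are assumptions made for the example.

```shell
#!/bin/sh
# Read "<last_write_epoch> <key> <value>" lines on stdin and print only the
# entries written within the last ttl seconds.
# $1 = current time in seconds, $2 = ttl in seconds. evict_expired is a made-up name.
evict_expired() {
  awk -v now="$1" -v ttl="$2" '(now - $1) <= ttl'
}

# At time 100 with a 60 s expiry, the entry last written at 30 is dropped.
printf '%s\n' '30 topicA|groupOld 7' '50 topicA|groupNew 9' | evict_expired 100 60
```

A map without such a pass would keep the groupOld entry forever, which is the growth pattern the warning describes.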

🗞️ Metric structure

See the official documentation for details; they are not repeated here. Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter#metric-%E7%BB%93%E6%9E%84

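🧩 Prometheus configuration

🌵 Configure and start following the Prometheus documentation

In the Prometheus configuration, set static_configs: - targets to the IP and port the exporter listens on, for example localhost:5557.

  - job_name: 'rocketmq'
    scrape_interval: 30s
    static_configs:
      - targets: ['10.0.107.158:5557']
        labels:
          # Overriding instance here means queries must match this value, not the bare host:port
          instance: 'monitoring(10.0.107.158:5557)'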
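
When several exporters need the same job layout, the scrape entry can be generated instead of typed by hand. This is a sketch only: render_scrape_job is a hypothetical helper, and it emits just the fields used in the example above, indented for placement under scrape_configs.

```shell
#!/bin/sh
# Print a scrape job for one exporter target.
# $1 = host, $2 = port. render_scrape_job is a hypothetical helper name.
render_scrape_job() {
  cat <<EOF
  - job_name: 'rocketmq'
    scrape_interval: 30s
    static_configs:
      - targets: ['$1:$2']
EOF
}

render_scrape_job 10.0.107.158 5557
```

After pasting the output into prometheus.yml, `promtool check config prometheus.yml` can be used to validate the file before reloading Prometheus.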

☘️ Grafana dashboard

This dashboard is a modified version of the one provided on the official site.
Rocketmq_dashboard.json

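📙 Metrics

💻 Server-side metrics

| Metric name | Meaning | Broker metric name |
| --- | --- | --- |
| rocketmq_broker_tps | Broker-level production TPS | |
| rocketmq_broker_qps | Broker-level consumption QPS | |
| rocketmq_broker_commitlog_diff | Size of the messages by which the slave in a broker group lags behind in sync | |
| rocketmq_brokeruntime_pmdt_0ms | Time from the server starting to process a write request to completing the write (0ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_0to10ms | Time from the server starting to process a write request to completing the write (0~10ms) | |
| rocketmq_brokeruntime_pmdt_10to50ms | Time from the server starting to process a write request to completing the write (10~50ms) | |
| rocketmq_brokeruntime_pmdt_50to100ms | Time from the server starting to process a write request to completing the write (50~100ms) | |
| rocketmq_brokeruntime_pmdt_100to200ms | Time from the server starting to process a write request to completing the write (100~200ms) | |
| rocketmq_brokeruntime_pmdt_200to500ms | Time from the server starting to process a write request to completing the write (200~500ms) | |
| rocketmq_brokeruntime_pmdt_500to1s | Time from the server starting to process a write request to completing the write (500~1000ms) | |
| rocketmq_brokeruntime_pmdt_1to2s | Time from the server starting to process a write request to completing the write (1~2s) | |
| rocketmq_brokeruntime_pmdt_2to3s | Time from the server starting to process a write request to completing the write (2~3s) | |
| rocketmq_brokeruntime_pmdt_3to4s | Time from the server starting to process a write request to completing the write (3~4s) | |
| rocketmq_brokeruntime_pmdt_4to5s | Time from the server starting to process a write request to completing the write (4~5s) | |
| rocketmq_brokeruntime_pmdt_5to10s | Time from the server starting to process a write request to completing the write (5~10s) | |
| rocketmq_brokeruntime_pmdt_10stomore | Time from the server starting to process a write request to completing the write (> 10s) | |
| rocketmq_brokeruntime_dispatch_behind_bytes | Bytes of messages not yet dispatched (index building and similar operations) as of now | dispatchBehindBytes |
| rocketmq_brokeruntime_put_message_size_total | Total size of messages written by the broker | putMessageSizeTotal |
| rocketmq_brokeruntime_put_message_average_size | Average size of messages written by the broker | putMessageAverageSize |
| rocketmq_brokeruntime_remain_transientstore_buffer_numbs | Capacity of the queue in TransientStorePool | remainTransientStoreBufferNumbs |
| rocketmq_brokeruntime_earliest_message_timestamp | Timestamp of the earliest message stored on the broker | earliestMessageTimeStamp |
| rocketmq_brokeruntime_putmessage_entire_time_max | Maximum message write latency since the broker started running | putMessageEntireTimeMax |
| rocketmq_brokeruntime_start_accept_sendrequest_time | Time at which the broker started accepting send requests | startAcceptSendRequestTimeStamp |
| rocketmq_brokeruntime_putmessage_times_total | Total number of message writes on the broker | putMessageTimesTotal |
| rocketmq_brokeruntime_getmessage_entire_time_max | Maximum latency of handling a message pull since the broker started | getMessageEntireTimeMax |
| rocketmq_brokeruntime_pagecache_lock_time_mills | | pageCacheLockTimeMills |
| rocketmq_brokeruntime_commitlog_disk_ratio | Usage ratio of the disk that holds the commitLog | commitLogDiskRatio |
| rocketmq_brokeruntime_dispatch_maxbuffer | Not computed by the broker; always 0 | dispatchMaxBuffer |
| rocketmq_brokeruntime_pull_threadpoolqueue_capacity | Capacity of the thread pool queue that handles pull requests | pullThreadPoolQueueCapacity |
| rocketmq_brokeruntime_send_threadpoolqueue_capacity | Capacity of the thread pool queue that handles send requests | sendThreadPoolQueueCapacity |
| rocketmq_brokeruntime_query_threadpool_queue_capacity | Capacity of the thread pool queue that handles query requests | queryThreadPoolQueueCapacity |
| rocketmq_brokeruntime_pull_threadpoolqueue_size | Actual size of the thread pool queue that handles pull requests | pullThreadPoolQueueSize |
| rocketmq_brokeruntime_query_threadpoolqueue_size | Actual size of the thread pool queue that handles query requests | queryThreadPoolQueueSize |
| rocketmq_brokeruntime_send_threadpool_queue_size | Actual size of the thread pool queue that handles send requests | sendThreadPoolQueueSize |
| rocketmq_brokeruntime_pull_threadpoolqueue_headwait_timemills | Wait time of the head task in the pull-request thread pool queue | pullThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_query_threadpoolqueue_headwait_timemills | Wait time of the head task in the query-request thread pool queue | queryThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills | Wait time of the head task in the send-request thread pool queue | sendThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_msg_gettotal_yesterdaymorning | Total number of message reads up to midnight last night | msgGetTotalYesterdayMorning |
| rocketmq_brokeruntime_msg_puttotal_yesterdaymorning | Total number of message writes up to midnight last night | msgPutTotalYesterdayMorning |
| rocketmq_brokeruntime_msg_gettotal_todaymorning | Total number of message reads up to midnight tonight | msgGetTotalTodayMorning |
| rocketmq_brokeruntime_msg_puttotal_todaymorning | Total number of message writes up to midnight last night | putMessageTimesTotal |
| rocketmq_brokeruntime_msg_put_total_today_now | Number of messages written on each broker so far | msgPutTotalTodayNow |
| rocketmq_brokeruntime_msg_gettotal_today_now | Number of messages read on each broker so far | msgGetTotalTodayNow |
| rocketmq_brokeruntime_commitlogdir_capacity_free | Free space in the directory that holds the commitLog | commitLogDirCapacity |
| rocketmq_brokeruntime_commitlogdir_capacity_total | Total space of the directory that holds the commitLog | |
| rocketmq_brokeruntime_commitlog_maxoffset | Maximum offset of the commitLog | commitLogMaxOffset |
| rocketmq_brokeruntime_commitlog_minoffset | Minimum offset of the commitLog | commitLogMinOffset |
| rocketmq_brokeruntime_remain_howmanydata_toflush | | remainHowManyDataToFlush |
| rocketmq_brokeruntime_getfound_tps600 | Average TPS over 600s of getMessage calls that found messages | getFoundTps |
| rocketmq_brokeruntime_getfound_tps60 | Average TPS over 60s of getMessage calls that found messages | |
| rocketmq_brokeruntime_getfound_tps10 | Average TPS over 10s of getMessage calls that found messages | |
| rocketmq_brokeruntime_gettotal_tps600 | Average TPS over 600s of getMessage calls | getTotalTps |
| rocketmq_brokeruntime_gettotal_tps60 | Average TPS over 60s of getMessage calls | |
| rocketmq_brokeruntime_gettotal_tps10 | Average TPS over 10s of getMessage calls | |
| rocketmq_brokeruntime_gettransfered_tps600 | | getTransferedTps |
| rocketmq_brokeruntime_gettransfered_tps60 | | |
| rocketmq_brokeruntime_gettransfered_tps10 | | |
| rocketmq_brokeruntime_getmiss_tps600 | Average TPS over 600s of getMessage calls that found no messages | getMissTps |
| rocketmq_brokeruntime_getmiss_tps60 | Average TPS over 60s of getMessage calls that found no messages | |
| rocketmq_brokeruntime_getmiss_tps10 | Average TPS over 10s of getMessage calls that found no messages | |
| rocketmq_brokeruntime_put_tps600 | Average TPS over 600s of message writes | putTps |
| rocketmq_brokeruntime_put_tps60 | Average TPS over 60s of message writes | |
| rocketmq_brokeruntime_put_tps10 | Average TPS over 10s of message writes | |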

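💻 Producer-side metrics

| Metric name | Meaning |
| --- | --- |
| rocketmq_producer_offset | Maximum offset of the topic at the current time |
| rocketmq_topic_retry_offset | Maximum offset of the retry topic at the current time |
| rocketmq_topic_dlq_offset | Maximum offset of the dead-letter topic at the current time |
| rocketmq_producer_tps | Production TPS of a topic on one broker group |
| rocketmq_producer_message_size | TPS of produced message size for a topic on one broker group |
| rocketmq_queue_producer_tps | Queue-level production TPS |
| rocketmq_queue_producer_message_size | Queue-level TPS of produced message size |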

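💻 Consumer-side metrics

| Metric name | Meaning |
| --- | --- |
| rocketmq_group_diff | Number of messages backlogged for the consumer group |
| rocketmq_group_retrydiff | Number of messages backlogged in the consumer group's retry queue |
| rocketmq_group_dlqdiff | Number of messages backlogged in the consumer group's dead-letter queue |
| rocketmq_group_count | Number of consumers in the consumer group |
| rocketmq_client_consume_fail_msg_count | Number of failed consumptions by the consumer in the past 1h |
| rocketmq_client_consume_fail_msg_tps | TPS of failed consumptions by the consumer |
| rocketmq_client_consume_ok_msg_tps | TPS of successful consumptions by the consumer |
| rocketmq_client_consume_rt | Time from a message being pulled to it being consumed |
| rocketmq_client_consumer_pull_rt | Time the client takes to pull messages |
| rocketmq_client_consumer_pull_tps | TPS of message pulls by the client |
| rocketmq_consumer_tps | Consumption TPS of a subscription group on each broker group |
| rocketmq_group_consume_tps | Current consumption TPS of a subscription group (rocketmq_consumer_tps aggregated by broker) |
| rocketmq_consumer_offset | Current consumption offset of a subscription group on one broker group |
| rocketmq_group_consume_total_offset | Current consumption offset of a subscription group (rocketmq_consumer_offset aggregated by broker) |
| rocketmq_consumer_message_size | TPS of consumed message size for a subscription group on one broker group |
| rocketmq_send_back_nums | Number of times a subscription group failed to consume on one broker group and wrote a retry message |
| rocketmq_group_get_latency_by_storetime | Consumption latency of the consumer group, computed by the exporter as the difference between the current time and the fetched message's time |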

🧱 Selected monitoring metrics

| Metric | PromQL |
| --- | --- |
| Production message TPS | sum by (broker,topic) (rocketmq_producer_tps{instance="$instance",broker=~"$broker"}) |
| Consumption message TPS | sum by (broker) (rocketmq_consumer_tps{instance="$instance",broker=~"$broker"}) |
| Message backlog count | sum(rocketmq_producer_offset{instance="$instance"}) by (topic) - …$instance"}) by (group,topic) |
| Highest disk usage | max(rocketmq_brokeruntime_commitlog_disk_ratio{instance="$instance"}) * 100 |
| Consumer group consumption latency | sum by (group) (rocketmq_group_get_latency_by_storetime{instance="$instance"}) |

The middle of the backlog expression is truncated in the source. The backlog alert rule in the next section contains a complete form of the same subtraction.
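
Outside Grafana, where the $instance variable is not available, the expressions can be filled in and sent to Prometheus by hand. The sketch below builds the disk usage expression and wraps an instant query against the Prometheus HTTP API; the function names are hypothetical, and the Prometheus base URL in the usage comment is an assumed default rather than something taken from this setup.

```shell
#!/bin/sh
# Build the disk-usage expression for one exporter instance.
# disk_usage_expr is a hypothetical helper name.
disk_usage_expr() {
  printf 'max(rocketmq_brokeruntime_commitlog_disk_ratio{instance="%s"}) * 100\n' "$1"
}

# Run an instant query against the Prometheus HTTP API.
# $1 = Prometheus base URL (assumed, e.g. http://localhost:9090), $2 = PromQL.
promql_query() {
  curl -sS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

disk_usage_expr '10.0.107.158:5557'
# Example call, not executed here:
#   promql_query http://localhost:9090 "$(disk_usage_expr '10.0.107.158:5557')"
```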

🧶 Alert rule examples

Define the specific rules according to your own requirements.

groups:
  - name: 'RocketMQ anomalies'
    rules:
      - alert: 'Production message TPS'
        expr: sum by (instance) (rocketmq_producer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 1m
        labels:
          severity: '4'
        annotations:
          description: 'Production message TPS for {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} msg/s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} msg/s'
          thresholdValue: 'Production message TPS ≥ 50 msg/s'

      - alert: 'Consumption message TPS'
        expr: sum by (instance) (rocketmq_consumer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Consumption message TPS for {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} msg/s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} msg/s'
          thresholdValue: 'Consumption message TPS ≥ 50 msg/s'

      - alert: 'Message backlog count'
        expr: sum by (instance) (sum(rocketmq_producer_offset{instance="10.0.107.158:5557"}) by (topic,gdmpId) - on(topic,gdmpId)  group_right  sum(rocketmq_consumer_offset{instance="10.0.107.158:5557"}) by (group,topic,gdmpId)) >= 100
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Message backlog for {{ $labels.gdmpName }} is currently {{ $value }} messages, please handle it promptly!'
          currentValue: '{{ $value }} messages'
          thresholdValue: 'Message backlog ≥ 100 messages'

      - alert: 'Highest disk usage'
        expr: max by (instance)(rocketmq_brokeruntime_commitlog_disk_ratio{instance="10.0.107.158:5557"})  * 100 >= 80
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Highest disk usage for {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }}%, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }}%'
          thresholdValue: 'Highest disk usage ≥ 80%'

      - alert: 'Highest consumption latency'
        expr: max by (instance)(rocketmq_group_get_latency_by_storetime{instance="10.0.107.158:5557"}) / 1000 >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Highest consumption latency for {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} s'
          thresholdValue: 'Highest consumption latency ≥ 50 s'
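
The disk usage rule multiplies the ratio by 100 before comparing it with the threshold, which is easy to get wrong when thresholds are edited. The following sketch reproduces that arithmetic for a single sample, assuming the metric reports a fraction between 0 and 1 as the rule implies; disk_alert_fires is a hypothetical helper.

```shell
#!/bin/sh
# Succeeds when the ratio (a fraction between 0 and 1), expressed as a
# percentage, reaches the threshold. $1 = ratio, $2 = threshold in percent.
# disk_alert_fires is a hypothetical helper name.
disk_alert_fires() {
  awk -v ratio="$1" -v threshold="$2" 'BEGIN { exit !(ratio * 100 >= threshold) }'
}

if disk_alert_fires 0.83 80; then
  echo "firing"
else
  echo "ok"
fi
```

A rule file as a whole can be validated with `promtool check rules <file>` before Prometheus loads it.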

📖 References

  1. RocketMQ Promethus Exporter | RocketMQ: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter