Scenario
Suppose I am monitoring an MQ cluster and have set up alerting with the following two rules:
- alert: "RocketMQ, xxx_consumer message backlog"
  expr: sum by(group, topic) (rocketmq_group_diff{group="xxx_consumer",topic="xxx"}) > 1000
  for: 1m
  labels:
    severity: busi
  annotations:
    description: 'Consumer group xxx_consumer has a message backlog on topic xxx; the backlog exceeds 1000'
    summary: 'RocketMQ, xxx_consumer message backlog'
- alert: "Broker node down"
  expr: count(rocketmq_broker_disk_ratio{cluster="XXXCluster"}) < 4
  for: 0m
  labels:
    severity: warning
  annotations:
    description: 'Fewer than 4 broker nodes are up'
    summary: 'Broker node down'
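The two rules above mean the following:
- Rule 1: a consumer group owned by a business team (core business) must not build up a message backlog; once the backlog exceeds 1000 messages, that team is alerted.
- Rule 2: if a broker node in our MQ cluster goes down, we ourselves need to be alerted promptly.
In practice, when rule 1 fires, the alert must reach the DingTalk group of the corresponding business team and our own DingTalk group at the same time. When rule 2 fires, it only goes to our own group, because the business side does not care about it. In other words, some alerts have to be delivered to several groups, while others go to a single group.
Note the severity label in the rules above, since it is what tells the two apart: rule 1 uses busi and rule 2 uses warning.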
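The two rules are shown as bare list entries. In a Prometheus rule file they sit under a named group, so a minimal sketch of the enclosing structure might look like the following. The group name `rocketmq-alerts` is a hypothetical placeholder, and only rule 2 is repeated, with its annotations left out for brevity.

```yaml
# Sketch of the enclosing rule file.
# The group name below is an assumption, not taken from the original setup.
groups:
  - name: rocketmq-alerts
    rules:
      - alert: "Broker node down"
        expr: count(rocketmq_broker_disk_ratio{cluster="XXXCluster"}) < 4
        for: 0m
        labels:
          severity: warning   # Alertmanager routes on this label
```

A file like this can be validated with `promtool check rules` before Prometheus loads it.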
Example configuration
Alertmanager configuration
global:
  resolve_timeout: 5m
  smtp_from: from@email.com
  smtp_smarthost: smtp.net:port
  smtp_auth_username: from@email.com
  smtp_auth_password: PASS
  smtp_require_tls: false
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  routes:
    - receiver: 'our'
      group_wait: 10s
      match_re:
        severity: warning
    - receiver: 'other'
      group_wait: 10s
      match_re:
        severity: busi
templates:
  - '*.html'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xuxd@email.com'
        send_resolved: false
        html: '{{ template "default-monitor.html" . }}'
        headers: { Subject: "[WARN] Alert email" }  # email subject
  - name: 'our'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/our/send
  - name: 'other'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/our/send
      - url: http://127.0.0.1:8060/dingtalk/other/send
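global: sets the default SMTP settings. An alert that matches none of the child routes falls back to the top-level receiver, which here sends an email.
route: besides the top-level receiver email, the routes list defines two specific receivers. One is called "our" and matches alerts with severity warning; the other is called "other" and matches alerts with severity busi. These two values come from the rules at the top. They are not reserved keywords, just labels I made up.
receivers: holds the settings for the receivers named above. email says who the mail goes to. "our" gives the DingTalk send URL; note that the path segment just before send is "our". "other" lists two URLs that differ only in that segment: one is "our" and the other is "other".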
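Alertmanager 0.22 and later also accept a `matchers` list in place of `match` and `match_re`, which are marked as deprecated in those versions. As a sketch, assuming such a version is in use, the two child routes could be written like this with the same receivers and label values:

```yaml
# Equivalent child routes using the newer matchers syntax (Alertmanager >= 0.22).
route:
  receiver: 'email'
  group_by: ['alertname']
  routes:
    - receiver: 'our'
      matchers:
        - severity="warning"
    - receiver: 'other'
      matchers:
        - severity="busi"
```

The routing behavior is the same; only the matching syntax changes.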
While I am at it, here is the email template I use (file name: default-monitor.html). It renders the alerts as a table:
{{ define "default-monitor.html" }}
<table>
  <tr><td>Alert name</td><td>Description</td><td>Start time</td></tr>
  {{ range $i, $alert := .Alerts }}
  <tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ index $alert.Annotations "description" }}</td><td>{{ $alert.StartsAt }}</td></tr>
  {{ end }}
</table>
{{ end }}
prometheus-webhook-dingtalk configuration
## Customizable templates path
templates:
  - /home/user/monitor/alert/prometheus-webhook-dingtalk-1.4.0.linux-amd64/template/template.tmpl
## Targets, previously was known as "profiles"
targets:
  our:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    secret: xxx_secret
  other:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx_other
    secret: xxx_other_secret
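There are two entries under targets, "our" and "other". They correspond to the "our" and "other" segments in the URLs of the Alertmanager configuration above. The access_token and secret are generated when you add a robot to the DingTalk group.
With this setup, when rule 1 fires, Alertmanager sends the notification through the receiver named other, so it reaches both our DingTalk group and the business team's group. When rule 2 fires, it goes through our and reaches only our own group.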
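The same fan-out can also be expressed in the routing tree instead of by listing two URLs under one receiver. Alertmanager routes support `continue: true`, which lets an alert keep matching the sibling routes after it has matched one. The fragment below is a sketch of that alternative, not the setup used above. The receiver name `other-only` is hypothetical, and the `email` receiver and global settings are assumed to stay as they are.

```yaml
# Alternative sketch: fan out with continue instead of two URLs in one receiver.
route:
  receiver: 'email'
  group_by: ['alertname']
  routes:
    # busi alerts go to the business group, then evaluation continues with the next route.
    - receiver: 'other-only'      # hypothetical receiver name
      match_re:
        severity: busi
      continue: true
    # Both severities end up in our own group.
    - receiver: 'our'
      match_re:
        severity: busi|warning
receivers:
  - name: 'our'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/our/send
  - name: 'other-only'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/other/send
```

With this layout each receiver maps to exactly one DingTalk group, which may be easier to extend as more business teams are added. The cost is that the routing tree needs one extra entry per fan-out.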