Understanding Alertmanager Alert Inhibition and Silencing in One Article

云计算-Security

Last modified: 2023-10-05 11:17:01

Tags: prometheus, monitoring

First published: 2023-04-03 16:17:04

Article link: https://blog.csdn.net/IT_ZRS/article/details/129932353

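Preface

Earlier posts covered Prometheus metric collection, Grafana dashboards, and DingTalk alert notifications, but only as an experiment that never examined how alerts are actually triggered. So how exactly does Prometheus trigger an alert? Read on.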

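1. Prometheus Architecture

Before looking at how Prometheus triggers alerts, we need a picture of its overall architecture. Setting the other components aside, focus on Alertmanager: as the architecture diagram shows, it runs as a separate, independent component, and Prometheus communicates with it by pushing alerts to it. Alertmanager renders the alerts pushed by Prometheus through its templates and routes them to the configured users and receivers (email, DingTalk, WeCom, and so on), which is how alert notification is implemented.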

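The examples from the previous two posts are reused here and explained step by step.

Prometheus - SSL Certificate Expiry Monitoring | Rabcnops

Prometheus - SSL Certificate Expiry Monitoring - DingTalk Alerts | Rabcnops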

2. Prometheus Alert Triggering

2.1 Prometheus Rule File and Field Explanations

First, look at the Prometheus alerting rules:

cat /home/data/prometheus/rules/ssl_cert_alerts.yml

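groups:                        # List of rule groups
- name: "SSL证书过期提醒"        # Rule group name ("SSL certificate expiry reminder"). Several groups can be listed under groups, each identified by its name field; Prometheus organizes alerting rules by group.
  rules:                       # Alerting rules that belong to this group.
  - alert: "证书过期时间<30天"   # Alert name ("certificate expires in < 30 days").
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 30  # Alert condition (the alert is active only while this expression returns results)
    for: 0s                    # How long the condition must hold continuously before the alert fires and is sent to Alertmanager
    labels:                    # Labels (key/value pairs, e.g. an alert level: 提示 = notice, 告警 = warning, 灾难 = disaster)
      severity: "提示"
    annotations:               # Annotations (summary: short summary; description: details)
      summary: "SSL certificate is about to expire!"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 30 days, please renew it in time!"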
  - alert: "证书过期时间<7天"
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 7
    for: 0s
    labels:
      severity: "告警"
    annotations:
      summary: "SSL certificate is about to expire!"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 7 days, please renew it in time!"
  - alert: "证书过期时间<1天"
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 1
    for: 0s
    labels:
      severity: "灾难"
    annotations:
      summary: "SSL certificate is about to expire!"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 1 day, please renew it in time!"
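
To see how the three thresholds overlap, the rule conditions can be mimicked in plain Python. This is only a sketch: the function name `firing_thresholds` and the sample timestamp are made up for illustration, and the real evaluation happens inside Prometheus on the `probe_ssl_earliest_cert_expiry` series.

```python
DAY = 86400  # seconds per day, the same constant the rule expressions use

# Day thresholds of the three rules above, widest first.
THRESHOLD_DAYS = [30, 7, 1]


def firing_thresholds(cert_expiry_ts, now_ts):
    """Return the day thresholds whose condition `expiry - now < 86400 * N` holds."""
    remaining = cert_expiry_ts - now_ts
    return [days for days in THRESHOLD_DAYS if remaining < DAY * days]


now = 1_700_000_000  # arbitrary sample timestamp (assumption)
print(firing_thresholds(now + 5 * DAY, now))   # certificate with 5 days left
print(firing_thresholds(now + 40 * DAY, now))  # certificate with 40 days left
```

A certificate with five days left satisfies both the 30-day and the 7-day condition, so two alerts with different severities fire for the same instance. That overlap is what the inhibition rules later in this article are meant to tame.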

2.2 Validating the Rule File

Check the rule file for syntax errors; SUCCESS means the syntax is valid.

docker exec prometheus promtool check rules conf/rules/ssl_cert_alerts.yml

Checking conf/rules/ssl_cert_alerts.yml
  SUCCESS: 3 rules found

2.3 Restart / Hot-Reload Prometheus

Restart Prometheus so that the configuration takes effect.

docker restart prometheus
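
As an alternative to a restart, Prometheus can reload its configuration over HTTP through a POST to `/-/reload`, provided it was started with `--web.enable-lifecycle`. The sketch below assumes that flag is set; the helper names and the `http://localhost:9090` address in the comment are placeholders, not values taken from this setup.

```python
import urllib.request


def build_reload_request(base_url):
    """Build (but do not send) a POST request to the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (assumed address, adjust to your own Prometheus):
# reload_config("http://localhost:9090")
```

The request is built in a separate function so the URL and method can be checked without touching the network.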

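3. Alertmanager Alert Notification

3.1 Basics

On the Prometheus server an alert is in one of three states:

inactive: nothing is wrong.
pending: the threshold has been crossed, but the condition has not yet held for the required duration (the for field of the rule).
firing: the threshold has been crossed and the required duration has been met; the alert is sent to Alertmanager, which delivers it to email, DingTalk, and so on according to its templates.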
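
The three states can be sketched as a small state machine. This is a simplified model, not Prometheus code: `alert_states` is a hypothetical helper, and the 15-second evaluation interval is an assumed value.

```python
def alert_states(condition_by_tick, for_seconds, interval=15):
    """Return the alert state after each rule evaluation.

    condition_by_tick: list of booleans, one per evaluation of the rule.
    for_seconds: the rule's `for` duration in seconds.
    interval: assumed number of seconds between evaluations.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for tick, condition in enumerate(condition_by_tick):
        now = tick * interval
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


print(alert_states([False, True, True, True], for_seconds=0))
print(alert_states([False, True, True, True], for_seconds=30))
```

With `for_seconds=0` the alert goes from inactive straight to firing, while with `for_seconds=30` it spends two evaluations in pending before it fires.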

In my example, three SSL certificates have fewer than 7 days left, as shown in the figure below:

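3.2 Receiving Prometheus Alerts

The DingTalk alert configuration was covered earlier; here we verify the whole alerting flow.

1. Define the Prometheus rules and restart the service so they take effect.

2. Watch the Prometheus alert move through its three states.

To avoid generating alerts, I had disabled the rules for under 30 days and under 7 days.

Now enable the under-7-days rule and watch how its state changes.

The alert never passes through Pending because the rule's for value is 0s, so it jumps straight to Firing.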

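3.3 Alertmanager Sends the Alert Message

Once an alert is Firing, Prometheus pushes it to Alertmanager.

When Alertmanager receives an alert from Prometheus, it sends the message to email, DingTalk, and so on, according to its own configuration (for example the group wait time and the group send interval).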
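
How the group timers interact is easier to follow with numbers. The sketch below is a deliberately simplified model of a single alert group, not Alertmanager's implementation: `notification_times` is a hypothetical helper, alerts are assumed to keep firing, and the values 10s / 1m / 5m are the ones used in the configuration file in section 3.4.

```python
def notification_times(group_wait, group_interval, repeat_interval,
                       new_alert_times, horizon):
    """Return the seconds at which a simplified alert group sends notifications.

    new_alert_times: times at which a new alert joins the group
                     (the earliest one creates the group).
    horizon: stop simulating after this many seconds.
    """
    pending = sorted(new_alert_times)
    sent = []
    flush = pending[0] + group_wait  # first flush waits for group_wait
    while flush <= horizon:
        arrived = [t for t in pending if t <= flush]
        pending = [t for t in pending if t > flush]
        repeat_due = bool(sent) and flush - sent[-1] >= repeat_interval
        if arrived or repeat_due:
            sent.append(flush)
        flush += group_interval  # later flushes happen every group_interval
    return sent


# One alert that keeps firing for more than ten minutes.
print(notification_times(10, 60, 300, [0], 700))
# A second alert joins the same group 30 seconds after the first.
print(notification_times(10, 60, 300, [0, 30], 700))
```

In this model the first call yields notifications at 10, 310 and 610 seconds: one after `group_wait`, then one each time `repeat_interval` has elapsed. In the second call the late alert triggers an extra notification at 70 seconds, the first flush after `group_interval`.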

3.4 Alertmanager Configuration File and Field Explanations

Start with a simple configuration file and an explanation of its fields.

global:
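  resolve_timeout: 1m                               # Resolve timeout (an alert that carries no end time and has not been updated for 1m is treated as resolved; alerts sent by Prometheus normally carry an end time)
  smtp_smarthost: 'smtp.163.com:465'                # SMTP server
  smtp_from: 'zhurongsen_admin@163.com'             # Sender address
  smtp_auth_username: 'zhurongsen_admin@163.com'    # SMTP login name
  smtp_auth_password: 'DYKIFIZYKUOXRPFV'            # Mailbox authorization code (the authorization code, not the login password)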
  smtp_require_tls: false

templates:
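- '/etc/alertmanager/template/*.tmpl'     # Alertmanager template files (define the notification templates, e.g. HTML or email templates; a receiver such as DingTalk that brings its own template uses that instead, so these act as the defaults)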

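route:                                    # Root route
  group_by: ['alertname']                 # Grouping (alerts are grouped by the value of the alertname label)
  group_wait: 10s                         # When a new group is created, wait 10s before the first notification so that alerts arriving in that window are sent together
  group_interval: 1m                      # After the first notification, wait at least 1m before notifying about new alerts added to the same group
  repeat_interval: 5m                     # If the alerts are still firing and nothing has changed, resend the notification after 5m
  receiver: 'ops'                         # Default receiver: handles every alert that matches none of the child routes below
  # continue: false                       # false: stop at the first matching sibling route; true: keep evaluating the following sibling routes
  routes:                                 # Child routes
  - match:                                # Exact match
      severity: critical                  # Matches the value critical (only useful if your Prometheus rules actually set this label value)
    receiver: 'dev'                       # Only alerts with severity critical are sent to dev
    # continue: true                      # Whether to keep matching the following routes (depends on your setup); without it, critical alerts stop here and never reach the webhook route
  - match_re:                             # Regular-expression match
      severity: ^(warning|critical)$      # Matches warning or critical
    receiver: 'webhook'                   # Only alerts matching warning or critical are sent to webhook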

receivers:                            # Receiver details
- name: 'ops'
  email_configs:
  - to: '2564395767@qq.com'
    send_resolved: true
- name: 'dev'
  email_configs:
  - to: '2318099451@qq.com'
    send_resolved: true
- name: 'test'
  email_configs:
  - to: 'zhurongsen_admin@126.com'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: http://192.168.56.142:8060/dingtalk/webhook1/send
    send_resolved: true

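inhibit_rules:                                  # Inhibition rules (while a critical alert is firing, mute the matching warning alerts to avoid duplicate notifications)
  - source_match:
      severity: 'critical'                      # Source alert: this one is still notified
    target_match:
      severity: 'warning'                       # Target alert: this one is inhibited
    equal: ['alertname', 'dev', 'instance']     # Label names whose values must be equal in the source and target alerts for the inhibition to apply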
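
The routing part of this configuration can be traced with a few lines of Python. This is a simplified sketch of first-match routing with the `continue` flag, not Alertmanager's code; `ROUTES`, `ROOT_RECEIVER` and `receivers_for` are names invented for the illustration.

```python
import re

# Child routes from the configuration above:
# (matcher kind, label name, value or pattern, receiver, continue flag)
ROUTES = [
    ("match", "severity", "critical", "dev", False),
    ("match_re", "severity", r"^(warning|critical)$", "webhook", False),
]
ROOT_RECEIVER = "ops"


def receivers_for(labels):
    """Return the receivers an alert with these labels reaches."""
    matched = []
    for kind, label, value, receiver, keep_going in ROUTES:
        actual = labels.get(label, "")  # a missing label counts as empty
        if kind == "match":
            hit = actual == value
        else:
            hit = re.fullmatch(value, actual) is not None
        if hit:
            matched.append(receiver)
            if not keep_going:
                break  # continue: false stops at the first matching route
    # Alerts that match no child route fall back to the root receiver.
    return matched or [ROOT_RECEIVER]


print(receivers_for({"severity": "critical"}))
print(receivers_for({"severity": "warning"}))
print(receivers_for({"severity": "info"}))
```

Under these assumptions a critical alert reaches only `dev`, a warning alert reaches `webhook`, and anything else falls back to `ops`. Setting the continue flag of the first route to `True` would let critical alerts reach `webhook` as well.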

3.5 Restart the Alertmanager Service

Any change to the configuration file takes effect only after a restart or hot reload.

docker restart alertmanager

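4. Example

4.1 Prometheus Rules

As mentioned above, three certificates were about to expire, but I have since renewed them. To demonstrate the effect, I change the expiry thresholds in the rules.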

groups:
- name: "SSL证书过期提醒"
  rules:
  - alert: "证书过期时间<200天"
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 200
    for: 0s
    labels:
      severity: "提示"
      type: ssl
    annotations:
      summary: "SSL certificate is about to expire - notice"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 200 days, please renew it in time!"
  - alert: "证书过期时间<100天"
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 100
    for: 0s
    labels:
      severity: "告警"
      type: ssl
    annotations:
      summary: "SSL certificate is about to expire - warning"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 100 days, please renew it in time!"
  - alert: "证书过期时间<1天"
    expr: probe_ssl_earliest_cert_expiry{job="SSL证书时间"} - time() < 86400 * 1
    for: 0s
    labels:
      severity: "灾难"
      type: ssl
    annotations:
      summary: "SSL certificate is about to expire - disaster"
      description: "{{ $labels.instance }}: SSL certificate expires in less than 1 day, please renew it in time!"

4.2 Alertmanager Alerting and Inhibition

global:
  resolve_timeout: 30s
route:
  group_wait: 10s
  group_interval: 5s
  repeat_interval: 1m
  group_by: ['alertname']
  receiver: 'ops'
  routes:
  - match:
      severity: '提示'
    receiver: 'web.hook.prometheusalert'
  - match:
      severity: '告警'
    receiver: 'web.hook.prometheusalert'
  - match:
      severity: '灾难'
    receiver: 'web.hook.prometheusalert'

receivers:
- name: 'ops'
  webhook_configs:
  - url: 'http://192.168.56.150:8060/dingtalk/webhook1/send'
    send_resolved: true
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: 'http://192.168.56.142:8060/dingtalk/webhook1/send'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: '告警'
    target_match:
      severity: '提示'
    equal: ['type']    # Label name whose value must match in both alerts (the rules in 4.1 set type: ssl)

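Alerting policy analysis: when a child route matches an alert's severity, the alert is sent to that child route's receiver; in this example all three severities go to web.hook.prometheusalert. An alert that matches no child route falls back to the receiver of the root route (ops), so every alert is delivered to some receiver.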

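Inhibition policy analysis: while an alert with severity 告警 is firing, alerts with severity 提示 are inhibited, whichever route they belong to. Note that equal lists label names, and inhibition applies only when the source and target alerts have the same value for each listed label. For example, setting equal to alertname did not work for me, because the alertname values differ on the Prometheus side. The fix is either to give the rules the same alertname or to add a custom label; in this example every rule carries an extra label type with the value ssl, so equal should name type. A name that neither alert carries, such as ssl, counts as equal on both sides (both empty) and therefore adds no constraint. With the shared label in place, a firing 告警 alert suppresses the 提示 notification whenever both alerts have the same type value.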
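
The effect of different equal settings can be checked with a small model of a single inhibit rule. This is a sketch under stated assumptions rather than Alertmanager's implementation: `is_inhibited` is a hypothetical helper, the instance name `a.example.com` is invented, and the rule that an alert cannot inhibit itself is left out.

```python
def is_inhibited(target, source, source_match, target_match, equal):
    """Return True if `source` inhibits `target` under one inhibit rule.

    target, source, source_match, target_match: dicts of label name -> value.
    equal: list of label names that must carry the same value in both alerts.
    A missing label is treated like an empty value.
    """
    if any(source.get(k, "") != v for k, v in source_match.items()):
        return False
    if any(target.get(k, "") != v for k, v in target_match.items()):
        return False
    return all(source.get(name, "") == target.get(name, "") for name in equal)


# Labels as produced by the rules in section 4.1 (instance name is made up).
warning = {"alertname": "证书过期时间<100天", "severity": "告警",
           "type": "ssl", "instance": "a.example.com"}
notice = {"alertname": "证书过期时间<200天", "severity": "提示",
          "type": "ssl", "instance": "a.example.com"}

rule = dict(source_match={"severity": "告警"}, target_match={"severity": "提示"})

print(is_inhibited(notice, warning, equal=["alertname"], **rule))
print(is_inhibited(notice, warning, equal=["type"], **rule))
print(is_inhibited(notice, warning, equal=["type", "instance"], **rule))
```

With `equal=["alertname"]` nothing is inhibited because the two alert names differ, while `["type"]` inhibits the notice. Adding `instance` to the list keeps the inhibition scoped to one host, so a warning on one host does not mute a notice on another.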