Prometheus Alert Notification Configuration and Alertmanager High Availability - Day03

Latest recommended article published on 2025-05-06 13:54:54

圣圣不爱学习

Category: Prometheus Monitoring. Tags: prometheus

Article link: https://blog.csdn.net/qq_42515722/article/details/136528443

0. Alert rule examples and exporters (ready to copy)

Collection: https://samber.github.io/awesome-prometheus-alerts/

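1. Alerting workflow

Prometheus scrapes metrics through the various exporters and stores them in local or remote storage.
When the data volume is large, a common setup is a distributed, highly available remote store with separate read and write paths; Grafana is then pointed at the read-only instances so that dashboards do not affect the performance of the Prometheus server itself.
For alerting, rules are configured in Prometheus first. Rules live in plain text files and there can be as many as needed; in practice they are split by category, for example one file for nodes and another for pods. Prometheus loads these rule files at startup and from then on knows which metric should alert at which threshold.
An alert is not sent the instant a threshold is crossed. The condition has to hold for the duration given by the rule's "for" field; only then does the alert fire and Prometheus pushes it to Alertmanager.
Alertmanager works alongside Prometheus to handle notifications: email, WeCom (企业微信), DingTalk and similar channels are all configured in Alertmanager.
Besides its built-in receivers such as email and WeChat, Alertmanager can POST alerts to a webhook URL, so any third-party tool that exposes a webhook (directly or through a small adapter) can be integrated.
Alertmanager also supports grouping, inhibition and silencing of alerts.
prometheus -> threshold reached -> "for" duration exceeded -> alertmanager -> grouping | inhibition | silencing -> media type -> email | DingTalk | WeChat, etc.
Grouping (group): alerts of a similar nature are merged into a single notification; combined with routing, that notification goes to the right recipients, for example network alerts to the network engineers and database alerts to the DBAs.
Silences (silences): a simple mechanism to mute alerts for a given time window, for example a silence set in advance for a server upgrade or maintenance window.
Inhibition (inhibition): while one alert is firing, notifications for other alerts caused by it are suppressed, so the many alerts produced by a single fault collapse into one and redundant notifications are eliminated.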
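
The "threshold reached, then duration exceeded" step can be sketched as a small state machine. This is a simplified model, not Prometheus code: the function name alert_states is made up for illustration, and the "for" duration is assumed to be expressed as a whole number of evaluation intervals.

```shell
#!/bin/sh
# alert_states FOR_EVALS RESULT...
# FOR_EVALS: the "for" duration in evaluation intervals (assumption: for / evaluation_interval)
# RESULT:    1 if the rule expression matched at that evaluation, 0 otherwise
alert_states() (
  need=$1; shift
  held=0
  for matched in "$@"; do
    if [ "$matched" -eq 1 ]; then
      held=$((held + 1))
      # the first match starts the clock; the alert fires once "for" has fully elapsed
      if [ "$held" -gt "$need" ]; then echo firing; else echo pending; fi
    else
      # any evaluation that does not match resets the alert
      held=0
      echo inactive
    fi
  done
)

# for: 30s with evaluation_interval: 15s corresponds to 2 evaluation intervals
alert_states 2 1 1 1 0 1
```

Only alerts in the firing state are pushed to Alertmanager; a pending alert that stops matching goes back to inactive without any notification.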

2. Installing Alertmanager

All demo environments here are installed from binaries.

2.1 Download the package


[root@prometheus-server apps]# wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz

[root@prometheus-server apps]# ll -h alertmanager-0.27.0.linux-amd64.tar.gz
-rw-r--r-- 1 root root 30M 3月   7 13:29 alertmanager-0.27.0.linux-amd64.tar.gz

[root@prometheus-server apps]# tar xf alertmanager-0.27.0.linux-amd64.tar.gz
[root@prometheus-server apps]# ln -s alertmanager-0.27.0.linux-amd64 alertmanager
[root@prometheus-server apps]# ll
总用量 30144
lrwxrwxrwx 1 root root       31 3月   7 13:31 alertmanager -> alertmanager-0.27.0.linux-amd64
drwxr-xr-x 2 1001 1002       93 2月  28 19:56 alertmanager-0.27.0.linux-amd64
-rw-r--r-- 1 root root 30866868 3月   7 13:29 alertmanager-0.27.0.linux-amd64.tar.gz
drwxr-xr-x 3 root root       75 3月   6 21:41 blackbox_exporter

2.2 Configure and start Alertmanager

2.2.1 Create the systemd unit file

[root@prometheus-server alertmanager]# cat /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus alertmanager
After=network.target

[Service]
ExecStart=/apps/alertmanager/alertmanager --config.file=/apps/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target

[root@prometheus-server alertmanager]# systemctl restart alertmanager.service
[root@prometheus-server alertmanager]# systemctl enable alertmanager.service

[root@prometheus-server alertmanager]# ss -lntup |grep 9093
tcp    LISTEN     0      128    [::]:9093               [::]:*                   users:(("alertmanager",pid=14417,fd=8))
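
A slightly more defensive variant of the unit file can be sketched as follows. Restart=on-failure and --storage.path are additions for illustration, and /apps/alertmanager/data is a hypothetical data directory that does not appear in the original setup.

```shell
#!/bin/sh
# write_alertmanager_unit DEST
# Writes a unit file to DEST. Restart=on-failure and --storage.path are
# illustrative additions; /apps/alertmanager/data is a hypothetical path.
write_alertmanager_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Prometheus alertmanager
After=network.target

[Service]
ExecStart=/apps/alertmanager/alertmanager \
  --config.file=/apps/alertmanager/alertmanager.yml \
  --storage.path=/apps/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Write to a scratch location first and review the result.
unit_file="${TMPDIR:-/tmp}/alertmanager.service"
write_alertmanager_unit "$unit_file"
cat "$unit_file"

# After copying it to /etc/systemd/system/, systemd has to re-read its unit files:
#   systemctl daemon-reload
#   systemctl enable --now alertmanager.service
```

Running systemctl daemon-reload after creating or changing a unit file makes sure systemd picks up the new definition before the service is started.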

2.2.2 Access the web UI

[Screenshot: Alertmanager web UI, served on port 9093]

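3. Alert notification configuration

Official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/ (Prometheus); the Alertmanager configuration reference is at https://prometheus.io/docs/alerting/latest/configuration/

3.1 Email notifications

3.1.1 Configuration walkthrough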

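global:
  smtp_from: # sender email address
  smtp_smarthost: # SMTP server address, as host:port
  smtp_auth_username: # login user name of the sender, often the same as the sender address
  smtp_auth_password: # login password of the sender; some providers require an authorization code instead
  smtp_require_tls: # whether TLS is required; defaults to true
  #wechat_api_url: # WeCom API URL
  #wechat_api_secret: # WeCom API secret
  #wechat_api_corp_id: # WeCom corp ID
  resolve_timeout: 60s # an alert that carries no end time is marked resolved once it has not been updated for this long
route:
  group_by: [alertname] # group alerts by the value of alertname, e.g. "- alert: 物理节点cpu使用率" in a rule file
  group_wait: 10s # how long to wait before the first notification for a new group, so that alerts arriving within these 10 seconds are sent together (typically 0s to a few minutes)
  group_interval: 2m # how long to wait before notifying about new alerts added to a group that has already been notified (typically 5m or more)
  repeat_interval: 5m # how long to wait before re-sending a notification that was already sent successfully (typically 3h or more)
  # Timing example:
  #group_wait: 10s # first alert: wait 10s; other alerts of the group that arrive meanwhile go out in the same notification
  #group_interval: 2m # the group is re-evaluated every 2 minutes; new or resolved alerts are notified then
  #repeat_interval: 5m # an unchanged, still-firing group is notified again once 5 minutes have passed since the last notification
  receiver: default-receiver # everything else goes to default-receiver
  routes: # send critical alerts to myalertname
  - receiver: myalertname
    group_wait: 10s
    match_re:
      severity: critical
receivers: # several receivers can be defined
- name: 'default-receiver'
  email_configs:
  - to: 'rooroot@aliyun.com'
    send_resolved: true # also notify when the alert resolves
- name: myalertname
  webhook_configs:
  - url: 'http://172.30.7.101:8060/dingtalk/alertname/send'
    send_resolved: true # also notify when the alert resolves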
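
How the three timers interact is easier to see with numbers. The sketch below is a simplified model, not Alertmanager code: it assumes a single group whose alerts never change or resolve, that the group is re-evaluated every group_interval after the first notification, and that a repeat goes out at the first such evaluation where repeat_interval has elapsed. The function name notify_times is made up for illustration; all values are in seconds.

```shell
#!/bin/sh
# notify_times GROUP_WAIT GROUP_INTERVAL REPEAT_INTERVAL HORIZON
# Prints the seconds (since the first alert arrived) at which a notification
# would be sent for an unchanged, still-firing group, up to HORIZON.
notify_times() (
  group_wait=$1; group_interval=$2; repeat_interval=$3; horizon=$4
  t=$group_wait                      # first notification after group_wait
  last=$t
  echo "$t"
  while [ $((t + group_interval)) -le "$horizon" ]; do
    t=$((t + group_interval))        # the group is evaluated again every group_interval
    if [ $((t - last)) -ge "$repeat_interval" ]; then
      echo "$t"                      # repeat_interval has passed since the last send
      last=$t
    fi
  done
)

# group_wait: 10s, group_interval: 2m, repeat_interval: 5m, first 1000 seconds
notify_times 10 120 300 1000
```

Under these assumptions the repeat lands on a group_interval boundary, so a 5m repeat_interval combined with a 2m group_interval yields a repeat roughly every 6 minutes rather than every 5.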

3.1.2 Configure email alerts

[root@prometheus-server alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 60s
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'tangshengx@qq.com'
  smtp_auth_username: '1184964356'
  smtp_auth_password: 'cqrkfcsnpjgqifbj' # QQ Mail authorization code
  smtp_hello: "@qq.com"
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 2s
  repeat_interval: 1m
  receiver: 'web.hook' # send to the receiver named web.hook
receivers:
  - name: 'web.hook' # must match the receiver name used in route
    #webhook_configs:
    #  - url: 'http://127.0.0.1:5001/'
    email_configs:
      - to: 'tangshengx@qq.com' # recipient; the sender is configured under global, and the same account can mail itself
inhibit_rules: # inhibition rules
  - source_match: # source alerts: an alert matching this is notified as usual and suppresses the targets below
      severity: 'critical' # alert severity, here "critical"
    target_match: # target alerts: the ones that get suppressed
      severity: 'warning' # a 'warning' alert firing at the same time as a matching 'critical' alert is suppressed; only the 'critical' one is sent
    equal: ['alertname', 'dev', 'instance'] # source and target must carry the same values for these labels for the inhibition to apply
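
YAML does not allow tabs for indentation, which is an easy mistake to make when pasting a configuration like the one above. A minimal pre-flight check, assuming amtool sits on PATH (the function names are made up for illustration), could look like this:

```shell
#!/bin/sh
# has_tab_indent FILE: succeeds (exit 0) if any line of FILE is indented with a tab
has_tab_indent() {
  grep -q "^[[:space:]]*$(printf '\t')" "$1"
}

# check_alertmanager_config FILE: cheap tab check first, then amtool if it is installed
check_alertmanager_config() {
  if has_tab_indent "$1"; then
    echo "tab indentation found in $1" >&2
    return 1
  fi
  if command -v amtool >/dev/null 2>&1; then
    amtool check-config "$1"
  fi
}

# Usage:
#   check_alertmanager_config /apps/alertmanager/alertmanager.yml \
#     && systemctl restart alertmanager.service
```

Checking before the restart avoids taking Alertmanager down with a configuration it cannot parse.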

3.1.3 Configure Prometheus alert rules

3.1.3.1 Alert rules for physical nodes

[root@prometheus-server alertmanager]# cd /usr/local/src/prometheus
[root@prometheus-server prometheus]# mkdir rule
[root@prometheus-server prometheus]# cd rule
[root@prometheus-server rule]# cat node-rule.yaml
groups:
  - name: 物理节点状态-监控告警
    rules:
    - alert: 物理节点cpu使用率
      expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.instance }}cpu使用率过高"
        description: "{{ $labels.instance }}的cpu使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
    - alert: 物理节点内存使用率
      expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.instance }}内存使用率过高"
        description: "{{ $labels.instance }}的内存使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
    - alert: InstanceDown
      expr: up == 0
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.instance }}: 服务器宕机"
        description: "{{ $labels.instance }}: 服务器延时超过2分钟"
    - alert: 物理节点磁盘的IO性能 # disk I/O utilization
      expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100 > 60 # average disk busy time above 60%
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}} 流入磁盘IO使用率过高！"
        description: "{{$labels.instance}} 流入磁盘IO大于60%(目前使用:{{$value}})"
    - alert: 入网流量带宽 # inbound network bandwidth
      expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400 # i.e. above 10240000 bytes/s
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}} 流入网络带宽过高！"
        description: "{{$labels.instance}}流入网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}"
    - alert: 出网流量带宽 # outbound network bandwidth
      expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400 # i.e. above 10240000 bytes/s
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}} 流出网络带宽过高！"
        description: "{{$labels.instance}}流出网络带宽持续5分钟高于100M. TX带宽使用率{{$value}}"
    - alert: TCP会话 # established TCP connections
      expr: node_netstat_Tcp_CurrEstab > 1000
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}} TCP_ESTABLISHED过高！"
        description: "{{$labels.instance}} TCP_ESTABLISHED大于1000(目前使用:{{$value}})"
    - alert: 磁盘容量
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 10 # 测试告警
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
        description: "{{$labels.mountpoint }} 磁盘分区使用大于10%(目前使用:{{$value}}%)"
 # $value is the value that expr evaluated to
[root@prometheus-server rule]#

3.1.3.2 Reference the rule file from the Prometheus configuration

[root@prometheus-server prometheus]# cat prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 10.31.200.103:9093 # Alertmanager address

rule_files:
  - "/usr/local/src/prometheus/rule/node-rule.yaml" # reference the alert rule file
…… (content omitted)

3.1.4 Check that the alert was generated

[root@prometheus-server prometheus]# systemctl restart prometheus.service
[root@prometheus-server alertmanager]# ./amtool alert --alertmanager.url=http://10.31.200.103:9093
Alertname  Starts At                Summary           State
磁盘容量       2024-03-07 12:53:23 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-07 12:53:23 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-07 12:53:23 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-07 12:53:23 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-07 12:53:23 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-07 12:53:23 UTC  /boot 磁盘分区使用率过高！  active

[Screenshot: alert notification received by email]

3.1.5 Resolve the alert

[root@prometheus-server rule]# cat node-rule.yaml
…… (content omitted)
    - alert: 磁盘容量
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90 # set the threshold back to a normal value
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
        description: "{{$labels.mountpoint }} 磁盘分区使用大于90%(目前使用:{{$value}}%)"

[root@prometheus-server rule]# systemctl restart prometheus.service

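3.2 DingTalk notifications

3.2.1 Create a DingTalk robot

[Screenshot: creating a custom robot in a DingTalk group]

3.2.2 Test that the webhook works

3.2.2.1 Test with a shell script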

[root@prometheus-server ~]# mkdir /data/scripts -p
[root@prometheus-server ~]# cd /data/scripts/
[root@prometheus-server scripts]# cat dingding-test-webhook.sh
#!/bin/bash
source /etc/profile

# message text, passed as the first argument
MESSAGE=$1

# the variable is double-quoted so that a message containing spaces stays a single curl argument
/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=557e3c5d31db639ae29201caa50c24188072cf84b7f42db56f002281c6158dfc' \
  -H 'Content-Type: application/json' \
  -d '{"msgtype": "text", "text": { "content": "'"${MESSAGE}"'"}}'

[root@prometheus-server scripts]# sh dingding-test-webhook.sh "告警测试"
{"errcode":310000,"errmsg":"错误描述:关键词不匹配;解决方案:请联系群管理员查看此机器人的关键词，并在发送的信息中包含此关键词;"} # keyword mismatch: the robot rejects any message that does not contain its configured keyword, so nothing is delivered

[root@prometheus-server scripts]# sh dingding-test-webhook.sh "alertname=告警测试"
{"errcode":0,"errmsg":"ok"}
[root@prometheus-server scripts]#
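
Splicing the message straight into the JSON body breaks as soon as the text contains a double quote or a backslash. A small helper that escapes those two characters first is sketched below; the function names are made up for illustration, and newlines or other control characters in the message are not handled.

```shell
#!/bin/sh
# json_escape TEXT: escape backslashes and double quotes for use inside a JSON string
# (newlines and other control characters are not handled in this sketch)
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# dingtalk_text_payload TEXT: print the JSON body of a DingTalk "text" message
dingtalk_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$(json_escape "$1")"
}

dingtalk_text_payload 'alertname=disk "full"'
echo

# To send it, replace <token> with the robot's access token:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$(dingtalk_text_payload 'alertname=disk "full"')" \
#     'https://oapi.dingtalk.com/robot/send?access_token=<token>'
```

The message still has to contain the robot's keyword (alertname in this article) to be accepted.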

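3.2.2.2 Test result

[Screenshot: test message received in the DingTalk group]

3.2.3 Deploy webhook-dingtalk

Alertmanager has no native DingTalk receiver, so it cannot send messages to DingTalk directly. A webhook adapter is needed in between; this article uses prometheus-webhook-dingtalk.

3.2.3.1 Download webhook-dingtalk

Download page: https://github.com/timonwong/prometheus-webhook-dingtalk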

[root@prometheus-server data]# cd /usr/local/src/
[root@prometheus-server src]# wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

[root@prometheus-server src]# ll -h prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
-rw-r--r-- 1 root root 8.1M 3月   8 10:03 prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

3.2.3.2 Prepare the webhook-dingtalk configuration

Configuration requirements differ between webhook-dingtalk versions; see the official documentation for details.

[root@prometheus-server src]# tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
[root@prometheus-server src]# ln -s prometheus-webhook-dingtalk-1.4.0.linux-amd64 prometheus-webhook-dingtalk

[root@prometheus-server src]# cd prometheus-webhook-dingtalk
[root@prometheus-server prometheus-webhook-dingtalk]# ll
总用量 15708
-rw-r--r-- 1 3434 3434     1194 12月 11 2019 config.example.yml
drwxr-xr-x 3 3434 3434       23 12月 11 2019 contrib
-rw-r--r-- 1 3434 3434    11358 12月 11 2019 LICENSE
-rwxr-xr-x 1 3434 3434 16065409 12月 11 2019 prometheus-webhook-dingtalk

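3.2.3.3 Start

Note: in --ding.profile="alertname=...", the profile name alertname becomes part of the webhook path (/dingtalk/alertname/send). The author uses the same word as the keyword configured on the DingTalk robot; DingTalk only accepts a message whose text contains that keyword, otherwise the message is not delivered.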

[root@prometheus-server prometheus-webhook-dingtalk]# nohup ./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="alertname=https://oapi.dingtalk.com/robot/send?access_token=557e3c5d31db639ae29201caa50c24188072cf84b7f42db56f002281c6158dfc" &

[root@prometheus-server prometheus-webhook-dingtalk]# ss -lntup |grep 8060
tcp    LISTEN     0      128    [::]:8060               [::]:*                   users:(("prometheus-webh",pid=17617,fd=3))

3.2.4 Update the Alertmanager configuration to add the receiver

[root@prometheus-server prometheus-webhook-dingtalk]# cat /apps/alertmanager/alertmanager.yml
…… (content omitted)
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 2s
  repeat_interval: 1m
  #receiver: 'web.hook' # comment out the old receiver so that no more email is sent
  receiver: 'dingding' # use the new receiver
receivers:
  - name: 'web.hook'
    #webhook_configs:
    #  - url: 'http://127.0.0.1:5001/'
    email_configs:
      - to: 'tangshengx@qq.com'
  - name: 'dingding' # the new receiver
    webhook_configs:
    - url: 'http://10.31.200.103:8060/dingtalk/alertname/send' # alertname here must match the profile name given in --ding.profile="alertname=..." on the command line
      send_resolved: true
…… (content omitted)

[root@prometheus-server prometheus-webhook-dingtalk]# systemctl restart alertmanager.service
[root@prometheus-server prometheus-webhook-dingtalk]# systemctl is-active alertmanager.service
active

3.2.5 Trigger an alert

3.2.5.1 Adjust the Prometheus alert rule

[root@prometheus-server prometheus-webhook-dingtalk]# tail -8 /usr/local/src/prometheus/rule/node-rule.yaml
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 10 # lower this threshold again to trigger the alert
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
        description: "{{$labels.mountpoint }} 磁盘分区使用大于10%(目前使用:{{$value}}%)"

[root@prometheus-server prometheus-webhook-dingtalk]# systemctl is-active prometheus.service
active

3.2.5.2 Check that the alert was sent

[root@prometheus-server alertmanager]# ./amtool alert --alertmanager.url=http://10.31.200.103:9093
Alertname  Starts At                Summary           State
磁盘容量       2024-03-08 02:23:38 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-08 02:23:38 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 02:23:38 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-08 02:23:38 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 02:23:38 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 02:23:38 UTC  /boot 磁盘分区使用率过高！  active

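[Screenshot: alert message received in DingTalk]

3.2.5.3 Resolve the alert

Omitted.

3.2.6 Custom DingTalk message template

3.2.6.1 Create the message template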

[root@prometheus-server alertmanager]# cd /usr/local/src/prometheus-webhook-dingtalk
[root@prometheus-server prometheus-webhook-dingtalk]# cat template.yaml
{{ define "dingding.to.message" }}

{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}

=========  **Alert firing** =========  

**Alert source:**     Alertmanager   
**Alert name:**    {{ $alert.Labels.alertname }}   
**Severity:**    {{ $alert.Labels.severity }}   
**Status:**    {{ .Status }}   
**Affected host:**    {{ $alert.Labels.instance }} {{ $alert.Labels.device }}   
**Summary:**    {{ .Annotations.summary }}   
**Details:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}   
**Host labels:**    {{ range .Labels.SortedPairs  }}  </br> [{{ .Name }}: {{ .Value | markdown | html }} ] 
{{- end }} </br>

**Started at:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  
========= = end =  =========  
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}

========= Alert resolved =========  
**Alert source:**     Alertmanager   
**Summary:**    {{ $alert.Annotations.summary }}  
**Affected host:**    {{ .Labels.instance }}   
**Alert name:**    {{ .Labels.alertname }}  
**Severity:**    {{ $alert.Labels.severity }}   
**Status:**    {{   .Status }}  
**Details:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}  
**Started at:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  
**Resolved at:**    {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}  

========= = **end** =  =========
{{- end }}
{{- end }}
{{- end }}
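
The constant 28800e9 in the template is a time offset in nanoseconds, the unit Go uses for durations. It shifts the UTC timestamps by 8 hours, i.e. to UTC+8. The arithmetic can be checked with a one-line helper (the function name is made up for illustration):

```shell
#!/bin/sh
# hours_to_ns HOURS: number of nanoseconds in HOURS hours
hours_to_ns() {
  echo $(( $1 * 3600 * 1000000000 ))
}

hours_to_ns 8   # prints 28800000000000, which is 28800e9
```

For another time zone, the same calculation with a different hour count gives the value to put into the Add call.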

3.2.6.2 Configure webhook-dingtalk to load the message template

[root@prometheus-server prometheus-webhook-dingtalk]# cp config.example.yml config.yml
[root@prometheus-server prometheus-webhook-dingtalk]# cat config.yml
## Request timeout
# timeout: 5s

## Customizable templates path
templates: # load the message template
  - /usr/local/src/prometheus-webhook-dingtalk/template.yaml

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
# default_message:
#   title: '{{ template "legacy.title" . }}'
#   text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  alertname: # target name, used in the webhook path; the author keeps it identical to the DingTalk robot keyword
    url: https://oapi.dingtalk.com/robot/send?access_token=557e3c5d31db639ae29201caa50c24188072cf84b7f42db56f002281c6158dfc
    message: # which template to render the message with
      text: '{{ template "dingding.to.message" . }}'

3.2.7 Restart webhook-dingtalk to load the template and enable the web UI

3.2.7.1 Restart webhook-dingtalk

[root@prometheus-server prometheus-webhook-dingtalk]# ps -ef |grep ding
root     17617 17531  0 10:06 pts/0    00:00:00 ./prometheus-webhook-dingtalk --web.listen-address=0.0.0.0:8060 --ding.profile=alertname=https://oapi.dingtalk.com/robot/send?access_token=557e3c5d31db639ae29201caa50c24188072cf84b7f42db56f002281c6158dfc
root     18144 17531  0 11:12 pts/0    00:00:00 grep --color=auto ding
[root@prometheus-server prometheus-webhook-dingtalk]# kill -9 17617

[root@prometheus-server prometheus-webhook-dingtalk]# nohup ./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --web.enable-ui --config.file="config.yml" &

[root@prometheus-server prometheus-webhook-dingtalk]# ss -lntup |grep 8060
tcp    LISTEN     0      128    [::]:8060               [::]:*                   users:(("prometheus-webh",pid=18153,fd=3))

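3.2.7.2 Verify the template in the web UI

[Screenshot: webhook-dingtalk web UI showing the loaded template]

3.2.8 Trigger an alert

3.2.8.1 Adjust the Prometheus alert rule

Same as before.

3.2.8.2 Check that the alert fired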

[root@prometheus-server alertmanager]# ./amtool alert --alertmanager.url=http://10.31.200.103:9093
Alertname  Starts At                Summary           State
磁盘容量       2024-03-08 05:46:53 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-08 05:46:53 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 05:46:53 UTC  /boot 磁盘分区使用率过高！  active
磁盘容量       2024-03-08 05:46:53 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 05:46:53 UTC  / 磁盘分区使用率过高！      active
磁盘容量       2024-03-08 05:46:53 UTC  /boot 磁盘分区使用率过高！  active

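[Screenshot: DingTalk message rendered with the custom template]

3.2.8.3 Resolve the alert

Omitted.

3.3 WeCom (企业微信) notifications

Note: robots (WeCom applications) created after May 20, 2022 must pass trusted-IP verification before they can be used.
The robot used here was created earlier and is not affected.

3.3.1 Find the robot (application) ID and secret

[Screenshot: application page showing the agent ID and secret]

3.3.2 Find the corp ID

[Screenshot: company information page showing the corp ID]

3.3.3 Update the Alertmanager configuration to send alerts to WeCom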

[root@prometheus-server ~]# cat /apps/alertmanager/alertmanager.yml
global:
  resolve_timeout: 60s
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'tangshengx@qq.com'
  smtp_auth_username: '1184964356'
  smtp_auth_password: 'cqrkfcsnpjgqifbj'
  smtp_hello: "@qq.com"
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 2s
  repeat_interval: 1m
  #receiver: 'web.hook'
  #receiver: 'dingding'
  receiver: 'wechat' # route messages to wechat

receivers:
…… (content omitted)
  - name: 'wechat' # add the WeCom receiver
    wechat_configs:
    - corp_id: wwe3783933e46b2b99 # corp ID
      to_user: '@all' # which users to send to
      #to_party: 2 # department ID to send to; not needed here because the line above already sends to everyone
      agent_id: 1000002 # robot (application) ID
      api_secret: Wjz7awS17De1w32Na9mYeiQD7uBz6l-CrBP3JS586c4 # robot (application) secret
      send_resolved: true
…… (content omitted)

[root@prometheus-server ~]# systemctl restart alertmanager.service

3.3.4 Adjust the Prometheus rule to trigger an alert

[root@prometheus-server ~]# cat /usr/local/src/prometheus/rule/node-rule.yaml
…… (content omitted)
    - alert: 磁盘容量
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 10 # lowered to trigger the alert
      for: 2s
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
        description: "{{$labels.mountpoint }} 磁盘分区使用大于10%(目前使用:{{$value}}%)"

[root@prometheus-server ~]# systemctl restart prometheus.service

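3.3.5 Check the alert

[Screenshot: alert message received in WeCom]

3.3.6 Send messages to a specific department

[Screenshot: department ID used for to_party]

3.3.7 Custom WeCom alert message template

3.3.7.1 Create the message template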

[root@prometheus-server ~]# cd /apps/alertmanager
[root@prometheus-server alertmanager]# cat message-template.templ
{{ define "wechat.default.message" }}
{{ range $i, $alert :=.Alerts }}
===alertmanager alert===
Status: {{   .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Application: {{ $alert.Annotations.summary }}
Affected host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Trigger value: {{ $alert.Annotations.value }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===========end============
{{ end }}
{{ end }}

3.3.7.2 Configure Alertmanager to load the template

[root@prometheus-server alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 60s
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'tangshengx@qq.com'
  smtp_auth_username: '1184964356'
  smtp_auth_password: 'cqrkfcsnpjgqifbj'
  smtp_hello: "@qq.com"
  smtp_require_tls: false

templates: # load the WeCom message template; the author places it before route, although key order in YAML does not matter
  - '/apps/alertmanager/message-template.templ'

route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 2s
  repeat_interval: 1m
  #receiver: 'web.hook'
  #receiver: 'dingding'
  receiver: 'wechat'

receivers:
  - name: 'web.hook'
    #webhook_configs:
    #  - url: 'http://127.0.0.1:5001/'
    email_configs:
      - to: 'tangshengx@qq.com'
  - name: 'dingding'
    webhook_configs:
    - url: 'http://10.31.200.103:8060/dingtalk/alertname/send'
      send_resolved: true

  - name: 'wechat'
    wechat_configs:
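    - corp_id: wwe3783933e46b2b99 # corp ID
      to_user: '@all' # which users to send to
      #to_party: 2 # department ID to send to; not needed here because the line above already sends to everyone
      agent_id: 1000002 # robot (application) ID
      api_secret: Wjz7awS17De1w32Na9mYeiQD7uBz6l-CrBP3JS586c4 # robot (application) secret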
      send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

[root@prometheus-server alertmanager]# systemctl restart alertmanager.service

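3.3.8 Check that the template takes effect

[Screenshot: WeCom message rendered with the custom template]

3.4 Routing alerts by category

Routing rules based on the labels of an alert send each message to the right place. In the example below, alerts with severity critical go to DingTalk and the rest go to WeCom.
In practice, network alerts go to the network engineers, database alerts go to the DBAs, and so on.

3.4.1 Adjust the Prometheus alert rule file

3.4.1.1 Add labels to the alert rules

Labels are added so that alerts can be classified more precisely.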

[root@prometheus-server alertmanager]# cat /usr/local/src/prometheus/rule/node-rule.yaml
groups:
  - name: 物理节点状态-监控告警
    rules:
    - alert: 物理节点cpu使用率
      expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 10
      for: 2s
      labels: # custom labels, entirely free-form, add whatever is needed
        severity: critical # alert severity
        type: node # alert type
        project: 在线商城 # project or business line (here: "online mall")
      annotations:
        summary: "{{ $labels.instance }}cpu使用率过高"
        description: "{{ $labels.instance }}的cpu使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
…… (content omitted)
    - alert: 磁盘容量
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90
      for: 2s
      labels:
        severity: warn # alert severity
        type: k8s # alert type
        project: 金牌客服 # project or business line (here: "gold customer service")
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"

[root@prometheus-server alertmanager]# systemctl restart prometheus.service

3.4.2 Adjust the Alertmanager configuration

[root@prometheus-server alertmanager]# cat alertmanager.yml
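…… (content omitted)
route:
  group_by: ['alertname']
  group_wait: 2s
  group_interval: 2s
  repeat_interval: 1m
  #receiver: 'web.hook'
  receiver: 'default' # alerts that match no child route go to this default receiver; a receiver named 'default' must exist under receivers or Alertmanager rejects the configuration (the author did not add one in this listing)

  routes: # child routes
  - receiver: 'dingding' # critical alerts go to DingTalk
    group_wait: 1s
    match_re: # note: with several conditions, all of them must match for the route to apply
      severity: critical # match the critical severity
      type: node
      project: 在线商城
  - receiver: 'wechat' # warn alerts go to WeCom
    group_wait: 1s
    match_re:
      severity: warn
      type: k8s
      project: 金牌客服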



receivers:
…… (configuration omitted)

  - name: 'dingding'
    webhook_configs:
    - url: 'http://10.31.200.103:8060/dingtalk/alertname/send'
      send_resolved: true

  - name: 'wechat'
    wechat_configs:
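    - corp_id: wwe3783933e46b2b99 # corp ID
      to_user: '@all' # which users to send to
      #to_party: 2 # department ID to send to; not needed here because the line above already sends to everyone
      agent_id: 1000002 # robot (application) ID
      api_secret: Wjz7awS17De1w32Na9mYeiQD7uBz6l-CrBP3JS586c4 # robot (application) secret
      send_resolved: true
…… (configuration omitted)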

[root@prometheus-server alertmanager]# systemctl restart alertmanager.service
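
The routing decision made by the configuration above can be mimicked in a few lines. This is only a sketch of the matching logic, not Alertmanager code: the function name pick_receiver is made up, plain string equality stands in for the anchored regular expressions of match_re, and the first matching child route wins.

```shell
#!/bin/sh
# pick_receiver SEVERITY TYPE PROJECT
# Prints the receiver an alert with these three label values would be routed to.
pick_receiver() {
  # child route 1: every listed label has to match
  if [ "$1" = critical ] && [ "$2" = node ] && [ "$3" = "在线商城" ]; then
    echo dingding
  # child route 2
  elif [ "$1" = warn ] && [ "$2" = k8s ] && [ "$3" = "金牌客服" ]; then
    echo wechat
  # no child route matched: fall back to the receiver of the top-level route
  else
    echo default
  fi
}

pick_receiver critical node "在线商城"   # dingding
pick_receiver warn k8s "金牌客服"        # wechat
pick_receiver critical k8s "金牌客服"    # default
```

The third call shows why the top-level receiver matters: a critical alert that carries a different type or project label matches neither child route and ends up at the default receiver.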

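3.4.3 Check the routing result

[Screenshot: alerts delivered to DingTalk and WeCom according to their labels]

3.5 Alert inhibition and silencing

3.5.1 Alert inhibition

Inhibition builds on the alert rules: once usage exceeds the higher threshold, the alert for the lower threshold is no longer sent and only the higher one goes out. This requires two rules, one per threshold, together with an inhibit_rules entry in Alertmanager that ties them together.
The two rules look like the following (thresholds of 60% and 90%).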

- alert: 磁盘容量
  expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 60
  for: 2s
  labels:
    severity: P1
    type: k8s
    project: 金牌客服
  annotations:
    summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
    description: "{{$labels.mountpoint }} 磁盘分区使用大于60%(目前使用:{{$value}}%)"

- alert: 磁盘容量
  expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90
  for: 2s
  labels:
    severity: P2
    type: k8s
    project: 金牌客服
  annotations:
    summary: "{{$labels.mountpoint}} 磁盘分区使用率过高！"
    description: "{{$labels.mountpoint }} 磁盘分区使用大于90%(目前使用:{{$value}}%)"
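
The two rules only inhibit each other when Alertmanager has a matching inhibit_rules entry. The sketch below prints one possible entry and mimics the decision it leads to. It is an illustration under assumptions: the 90% alert (severity P2) is taken as the source and the 60% alert (severity P1) as the target, the equal labels alertname, instance and mountpoint are a choice made here, and is_inhibited is a made-up helper that works on alerts written as "severity|alertname|instance|mountpoint".

```shell
#!/bin/sh
# One possible inhibit_rules entry for the two rules (an assumption, not from the article).
print_inhibit_rule() {
  cat <<'EOF'
inhibit_rules:
  - source_matchers: ['severity="P2"']
    target_matchers: ['severity="P1"']
    equal: ['alertname', 'instance', 'mountpoint']
EOF
}

# is_inhibited TARGET SOURCE...
# Succeeds if TARGET is a P1 alert and some firing SOURCE is a P2 alert
# with identical alertname, instance and mountpoint.
is_inhibited() (
  target=$1; shift
  [ "${target%%|*}" = P1 ] || exit 1          # only P1 alerts can be inhibited
  for source in "$@"; do
    if [ "${source%%|*}" = P2 ] && [ "${source#*|}" = "${target#*|}" ]; then
      exit 0                                   # same equal-labels: target is muted
    fi
  done
  exit 1
)

print_inhibit_rule
```

With such a rule, the 60% alert for /boot is muted while the 90% alert for /boot on the same instance is firing, but the 60% alert for / is still delivered.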

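3.5.2 Manual silencing

This is done in the Alertmanager web UI:
(1) Find the alert to be silenced.

(2) Click Silence.

(3) Fill in a comment so that others know why the alert was silenced (the silence expires after two hours).

(4) View the currently silenced alerts.

(5) Expire the silence manually.

[Screenshots: creating, viewing and expiring a silence in the Alertmanager web UI]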
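
The same steps can be scripted with amtool, which ships in the Alertmanager tarball. The wrappers below are a sketch: the function names are made up, AM_URL defaults to the address used in this article, and the functions are only defined here, not executed.

```shell
#!/bin/sh
# Alertmanager address; override by exporting AM_URL before sourcing this file.
AM_URL=${AM_URL:-http://10.31.200.103:9093}

# silence_alert ALERTNAME INSTANCE DURATION COMMENT
# Creates a silence for one alert on one instance, e.g. with a duration of 2h.
silence_alert() {
  amtool silence add "alertname=$1" "instance=$2" \
    --duration="$3" --comment="$4" --author="$(id -un)" \
    --alertmanager.url="$AM_URL"
}

# list_silences: show the active silences
list_silences() {
  amtool silence query --alertmanager.url="$AM_URL"
}

# expire_silence ID: end a silence early, ID as printed by "silence add" or "silence query"
expire_silence() {
  amtool silence expire "$1" --alertmanager.url="$AM_URL"
}

# Usage (the instance value is a made-up example):
#   silence_alert 磁盘容量 10.31.200.103:9100 2h "planned maintenance"
```

Scripted silences are handy for maintenance windows that are started from a deployment job instead of a browser.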

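4. Alertmanager high availability

4.1 Single-node architecture

[Diagram: single-node Alertmanager architecture]

4.2 Load-balanced architecture

Deploy several Alertmanager instances and put LVS or nginx in front of them. Note that the Alertmanager documentation advises against load balancing traffic between Prometheus and its Alertmanagers; with the Gossip-based cluster in 4.3, Prometheus lists every instance instead.

4.3 Gossip-based clustering (official)

Reference: https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/readmd/alertmanager-high-availability

Alertmanager introduces a Gossip mechanism, which lets multiple Alertmanager instances exchange information with each other. It ensures that even when several Alertmanager instances each receive the same alert, only one notification is sent to the receiver.
Setting up the cluster:
For the Alertmanager nodes to communicate with each other, the relevant flags have to be set at startup. The main ones are:
--cluster.listen-address string: the cluster listen address of the current instance
--cluster.peer value: the cluster address of another instance to join at startup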
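
For a three-node cluster, the flags for each node can be derived from the node list. The sketch below assumes the default cluster port 9094; the peer addresses 10.31.200.104 and 10.31.200.105 are made up for illustration, and cluster_flags is a hypothetical helper.

```shell
#!/bin/sh
# cluster_flags SELF_IP PEER_IP...
# Prints the cluster flags for the Alertmanager running on SELF_IP.
cluster_flags() (
  self=$1; shift
  flags="--cluster.listen-address=$self:9094"
  for peer in "$@"; do
    flags="$flags --cluster.peer=$peer:9094"   # one --cluster.peer per other node
  done
  echo "$flags"
)

# Flags for the first node; the other nodes are generated the same way with their own IP first.
cluster_flags 10.31.200.103 10.31.200.104 10.31.200.105

# The result is appended to the start command, for example:
#   /apps/alertmanager/alertmanager --config.file=/apps/alertmanager/alertmanager.yml \
#     $(cluster_flags 10.31.200.103 10.31.200.104 10.31.200.105)
```

On the Prometheus side, all three instances are then listed under alerting.alertmanagers targets, so that every Alertmanager receives every alert and the cluster deduplicates the notifications.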

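5. A China-developed alerting component: PrometheusAlert

Project page: https://github.com/feiyu563/PrometheusAlert

PrometheusAlert is an open-source message forwarding system for operations alerting. It accepts alert messages from mainstream monitoring systems (Prometheus, Zabbix), log systems (Graylog2, Graylog3), data visualization systems (Grafana), SonarQube, Alibaba Cloud Monitor, and any other system that supports a WebHook interface. It can forward these messages to DingTalk, WeChat, email, Feishu, Tencent SMS, Tencent voice calls, Alibaba Cloud SMS, Alibaba Cloud voice calls, Huawei SMS, Baidu Cloud SMS, Ronglian Cloud voice calls, 7moor SMS, 7moor voice, Telegram, Baidu Hi (Ruliu) and others.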