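Original author: Xie Jin, operations engineer
Prometheus WeChat Alerting
Sending alerts to WeChat Work (企业微信) with Prometheus
Prometheus can deliver alerts to designated recipients through several channels, most commonly email and SMS. A growing number of companies, however, pair Prometheus with WeChat as their primary alert channel, because alerts reach the recipients promptly and can be handled without delay.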
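- Deploy Alertmanager
Alertmanager is the alerting component of Prometheus. Alerts are routed through it to email, WeChat, DingTalk, and other channels.
Install the Alertmanager component
Download the Alertmanager package from the official site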
# tar xf alertmanager-0.23.0.linux-amd64.tar.gz -C /usr/local/
# ln -s /usr/local/alertmanager-0.23.0.linux-amd64/ /usr/local/alertmanager
# cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
[Service]
ExecStart=/usr/local/alertmanager/alertmanager \
  --config.file=/usr/local/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl start alertmanager.service
Open the Alertmanager web UI on port 9093
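Besides opening the web UI, the service can be probed from the command line. The sketch below assumes Alertmanager exposes its usual `/-/healthy` endpoint on the default port 9093; the function names `am_health_url` and `am_is_healthy` are illustrative and not part of Alertmanager.

```shell
#!/usr/bin/env bash
# Build the health-check URL for an Alertmanager instance.
# Arguments: host (default: localhost), port (default: 9093).
am_health_url() {
  local host="${1:-localhost}" port="${2:-9093}"
  printf 'http://%s:%s/-/healthy\n' "$host" "$port"
}

# Succeed only if the health endpoint answers with a 2xx status within 5 seconds.
am_is_healthy() {
  curl -fsS --max-time 5 "$(am_health_url "$@")" >/dev/null
}
```

For example, `am_is_healthy 192.168.20.106 && echo "Alertmanager is up"` prints the message only when the endpoint responds.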
- Check the alertmanager.yml syntax
# pwd
/usr/local/alertmanager
# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 0 templates
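- Obtain the WeChat Work IDs
Log in to WeChat Work at https://work.weixin.qq.com/ with an enterprise administrator account
Click Application Management (应用管理)
Add the application name and the people or groups who should receive the alerts
Note three values here, AgentId, Secret, and the enterprise ID (企业ID); they are needed later when the alert configuration is created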
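The Secret and the enterprise ID can be checked before they go into any configuration file. A minimal sketch, assuming the WeChat Work `gettoken` endpoint on `qyapi.weixin.qq.com`; the function names and the placeholder arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Build the gettoken URL from the enterprise ID and the application Secret.
wechat_token_url() {
  local corp_id="$1" secret="$2"
  printf 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s\n' \
    "$corp_id" "$secret"
}

# Request an access token; the JSON reply carries "errcode": 0 on success.
wechat_get_token() {
  curl -fsS --max-time 10 "$(wechat_token_url "$1" "$2")"
}
```

Calling `wechat_get_token 'YOUR_CORP_ID' 'YOUR_SECRET'` with the real values should return an `access_token`; a non-zero `errcode` usually means the two values do not belong together.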
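- Create the alert configuration file
# cat /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
templates: # alert templates
  - '/usr/local/alertmanager/wechat.tmpl'
route: # alert routing policy
  group_by: ['linux'] # label to group alerts by
  group_wait: 10s # after the first alert of a new group arrives, wait 10s so alerts of the same group go out together
  group_interval: 10s # wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 1m # wait before re-sending a notification for an alert that is still firing; 1 minute is for testing only
  receiver: 'wechat' # default receiver
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        agent_id: '1000003' # agentId of the custom application
        to_party: '2' # department that receives the alerts
        api_secret: 'zdevZA7hR-gsc4N0aeKYSjSFMz1aL38YLQnkeGuaVIY' # Secret of the custom application
        corp_id: 'wwc5982c624dfffbb9' # enterprise ID
        message: '{{ template "wechat.tmpl" . }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
Fill in the three values obtained from WeChat Work in the previous step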
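If Alertmanager later reports delivery errors, it can help to send a plain text message through the WeChat Work API directly and confirm that the `agent_id` and `to_party` values are accepted. A rough sketch, assuming the `message/send` endpoint and an access token already stored in a `TOKEN` variable (hypothetical); `wechat_text_payload` is an illustrative helper that does no JSON escaping, so the content must not contain quotes or backslashes.

```shell
#!/usr/bin/env bash
# Print the JSON body for a text message to one department.
# Arguments: department id (to_party), agent id, message content.
# No escaping is done: keep quotes and backslashes out of the content.
wechat_text_payload() {
  local party="$1" agent_id="$2" content="$3"
  printf '{"toparty":"%s","msgtype":"text","agentid":%s,"text":{"content":"%s"}}\n' \
    "$party" "$agent_id" "$content"
}

# Example body for department 2 and agent 1000003.
wechat_text_payload 2 1000003 'test message'
```

The body could then be posted with `curl -fsS -X POST -d "$(wechat_text_payload 2 1000003 'test message')" "https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token=$TOKEN"`.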
- Configure prometheus.yml
Add the Alertmanager address to the prometheus.yml file
# vi /usr/local/prometheus/prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.20.106:9093']
          # - alertmanager:9093
- Configure the alert rules
# cat /usr/local/prometheus/rules/node_status.yml
groups:
  - name: Instance liveness rules
    rules:
      - alert: InstanceDown
        expr: up{job="linux"} == 0
        for: 1m
        labels:
          user: prometheus
          severity: Disaster
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
          value: "{{ $value }}"
  - name: Memory alert rules
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 75
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }}: memory alert"
          description: "{{ $labels.instance }}: memory utilization is above 75% (current value: {{ $value }}%)"
          value: "{{ $value }}"
  - name: CPU alert rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 70
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }}: CPU alert"
          description: "Server {{ $labels.instance }}: CPU usage is above 70% (current value: {{ $value }}%)"
          value: "{{ $value }}"
  - name: Disk alert rules
    rules:
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }}: disk alert"
          description: "Server {{ $labels.instance }}: disk device {{ $labels.device }} is more than 80% full (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
          value: "{{ $value }}"
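The memory expression is plain arithmetic over four gauges, so a threshold is easy to check by hand. The sketch below repeats the same calculation with awk; the function name `mem_used_pct` and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env bash
# Used-memory percentage, mirroring the rule's expression:
# (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100
# Arguments: total free buffers cached (same unit for all four).
mem_used_pct() {
  awk -v t="$1" -v f="$2" -v b="$3" -v c="$4" \
    'BEGIN { printf "%.2f\n", (t - (f + b + c)) / t * 100 }'
}

# 16000 total, 2000 free, 1000 buffers, 1000 cached
mem_used_pct 16000 2000 1000 1000
```

With these sample values the result is 75.00, which would not fire the rule because the comparison is strictly greater than 75.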
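Add the location of the alert rule file to prometheus.yml
# cat /usr/local/prometheus/prometheus.yml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "rules/node_status.yml"
Restart or reload Prometheus afterwards so that the new alerting and rule_files settings take effect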
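A rule file with a YAML or PromQL error will not be loaded, so it can be worth validating the file before Prometheus picks it up. A small sketch around `promtool check rules`; the default binary path is an assumption based on the /usr/local/prometheus install directory, and `validate_rules` and the `PROMTOOL` override are illustrative names.

```shell
#!/usr/bin/env bash
# Validate rule files with promtool before Prometheus loads them.
# PROMTOOL may override the binary path; the default is an assumption.
validate_rules() {
  local promtool="${PROMTOOL:-/usr/local/prometheus/promtool}"
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 127
  fi
  "$promtool" check rules "$@"
}
```

For example, `validate_rules /usr/local/prometheus/rules/node_status.yml` returns a non-zero status when promtool rejects the file.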
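- Enable Alertmanager
Start the service
# systemctl start alertmanager.service
Enable start on boot
# systemctl enable alertmanager.service
- Test alert delivery
Shut down one of the monitored virtual machines to simulate a host outage; WeChat Work then receives the alert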
If no alert template is configured, the notification arrives in Alertmanager's default message format.
- Configure the alert template
# vi /usr/local/alertmanager/wechat.tmpl
{{ define "wechat.tmpl" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Alert firing =========
Status: {{ $alert.Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ $alert.Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Alert resolved =========
Alert name: {{ $alert.Labels.alertname }}
Status: {{ $alert.Status }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
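The constant `28800e9` passed to `Add` in the template is a duration in nanoseconds: 8 hours, which shifts the UTC timestamps to UTC+8. The arithmetic can be confirmed as below; `hours_to_ns` is an illustrative helper, and a deployment in another time zone would presumably need a different value.

```shell
#!/usr/bin/env bash
# Convert whole hours to nanoseconds, the unit of the duration added in the template.
hours_to_ns() {
  echo $(( $1 * 3600 * 1000000000 ))
}

# 8 hours, the offset of UTC+8
hours_to_ns 8
```

This prints 28800000000000, the same value as 28800e9.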
In alertmanager.yml, point the receiver's message at the wechat.tmpl template
# vi /usr/local/alertmanager/alertmanager.yml
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        agent_id: '1000002' # agentId of the custom application
        to_party: '2' # department that receives the alerts
        api_secret: 'NZvnlNNFvnpk4k-0_YNmE-ULRynAEU8PYkyT1k_MTm8' # Secret of the custom application
        corp_id: 'wwc5982c624dfffbb9' # enterprise ID
        message: '{{ template "wechat.tmpl" . }}'
Test the alert
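Shutting down a machine is not the only way to exercise the pipeline. A synthetic alert can also be posted straight to Alertmanager; the sketch below assumes the v2 API path `/api/v2/alerts` and uses made-up label values. `test_alert_payload` is an illustrative helper that does no JSON escaping.

```shell
#!/usr/bin/env bash
# Print a one-element JSON array describing a synthetic alert.
# Arguments: alert name, instance, severity. Values are not JSON-escaped.
test_alert_payload() {
  local name="$1" instance="$2" severity="$3"
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},' \
    "$name" "$instance" "$severity"
  printf '"annotations":{"summary":"synthetic test alert"}}]\n'
}

# Example payload with made-up label values.
test_alert_payload TestAlert 192.168.20.106:9100 warning
```

The payload could be sent with `curl -fsS -X POST -H 'Content-Type: application/json' -d "$(test_alert_payload TestAlert 192.168.20.106:9100 warning)" http://192.168.20.106:9093/api/v2/alerts`. Because no end time is given, Alertmanager should treat the alert as resolved once `resolve_timeout` has passed without a refresh.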