To install Prometheus and Grafana, see this article: https://blog.csdn.net/anqixiang/article/details/104283549
1. Architecture diagram
2. Deploy Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar xvf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/bin/alertmanager
3. Edit the main Alertmanager configuration file (email alerting)
cd /usr/local/bin/alertmanager
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '<sender email address>'
  smtp_auth_username: '<sender email address>'
  smtp_auth_password: '<password>'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '<recipient email address>'
EOF
Check that the configuration file is valid:
./amtool check-config alertmanager.yml
Start Alertmanager:
./alertmanager --config.file=alertmanager.yml &
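Once the process is running in the background, one quick way to confirm it came up is to query Alertmanager's health endpoint. The sketch below assumes the default listen address localhost:9093 (the command above does not set a different one) and that curl is installed; check_alertmanager is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if the Alertmanager health endpoint answers.
# Assumption: default listen address http://localhost:9093.
check_alertmanager() {
  base_url="${1:-http://localhost:9093}"
  # -f: fail on HTTP errors, -sS: quiet but keep error reporting, --max-time: do not hang
  curl -fsS --max-time 3 "${base_url}/-/healthy" > /dev/null 2>&1
}

if check_alertmanager; then
  echo "Alertmanager is healthy"
else
  echo "Alertmanager is not reachable yet"
fi
```

If the check keeps failing, the output of the background process is usually the first place to look.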
4. Configure Prometheus to communicate with Alertmanager
vim prometheus.yml
mkdir rules
./promtool check config prometheus.yml
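The article does not show what was changed in prometheus.yml. A typical edit, sketched here under assumptions, adds an alerting block that points at Alertmanager and a rule_files entry for the new rules directory. The target localhost:9093 (Alertmanager's default port, same host) and the glob rules/*.yml are assumptions. The fragment is written to a scratch file with the hypothetical name alerting-snippet.yml so that the real configuration is not overwritten.

```shell
# Write the stanzas to a scratch file for reference; merge them into prometheus.yml by hand.
cat > alerting-snippet.yml << 'EOF'
# Assumption: Alertmanager runs on the same host on its default port 9093.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'localhost:9093'

# Assumption: rule files live in the rules/ directory next to prometheus.yml.
rule_files:
  - 'rules/*.yml'
EOF
cat alerting-snippet.yml
```

If prometheus.yml already contains alerting or rule_files keys, edit those in place instead of adding a second copy, then rerun the promtool check.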
5. Write alerting rules
Official examples: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
cat > rules/test.yml << 'EOF'
groups:
- name: general.rules
  rules:
  # Alert for any instance that is unreachable for >1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} has stopped working"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"
EOF
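Because the rule file is generated through a shell here-document and its templates contain $labels, the quoting of the delimiter matters. With an unquoted EOF the shell expands $labels before the text reaches the file, while a quoted 'EOF' passes it through literally. A small demonstration, assuming the shell variable labels is empty (as it normally would be):

```shell
labels=""  # simulate the usual case: the shell has no value for $labels

# Unquoted delimiter: the shell substitutes $labels (empty) before writing.
unquoted=$(cat << EOF
summary: "Instance {{ $labels.instance }} down"
EOF
)

# Quoted delimiter: the text is passed through literally.
quoted=$(cat << 'EOF'
summary: "Instance {{ $labels.instance }} down"
EOF
)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Only the quoted form preserves the template for Prometheus to render, so it is worth opening rules/test.yml after writing it to confirm that $labels is still present.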
./promtool check config prometheus.yml
systemctl restart prometheus
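The newly written rule is now visible in Prometheus.
6. Verify alerting
On 172.16.38.238, stop the service scraped by the node job.
The node target now shows as down.
Wait about two minutes and the alert email should arrive.
The alert state changes to FIRING.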
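The wait of roughly two minutes is consistent with the timers involved. As a rough upper-bound estimate, assume that Prometheus's scrape_interval and evaluation_interval are both 15s (an assumption; the article does not show them) and take for: 1m and group_wait: 10s from the configuration in this article. The delay before the first notification is dispatched can then be added up as follows, with mail delivery time coming on top.

```shell
# Assumed Prometheus settings (not shown in the article).
scrape_interval=15
evaluation_interval=15
# Values taken from the rule file and alertmanager.yml in this article.
for_duration=60
group_wait=10

# Worst case: the failure is noticed at the next scrape, the rule turns PENDING at the
# next evaluation, becomes FIRING once `for` has elapsed, and Alertmanager then waits
# group_wait before sending the first notification.
max_delay=$((scrape_interval + evaluation_interval + for_duration + group_wait))
echo "Notification dispatched within about ${max_delay}s"
```

Longer scrape or evaluation intervals push this figure up accordingly.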
7. Alert states explained
8. Inhibit rules
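The article gives no example for this section. As a hedged sketch in the Alertmanager 0.20 syntax (source_match, target_match, equal), an inhibit rule could mute error alerts for an instance while a critical alert with the same instance label is firing. The critical severity and the choice of instance as the matching label are assumptions for illustration; only severity: error appears in the rule written earlier. The fragment goes to a scratch file with the hypothetical name inhibit-snippet.yml.

```shell
# Write an example inhibit rule to a scratch file; merge it into alertmanager.yml by hand.
cat > inhibit-snippet.yml << 'EOF'
inhibit_rules:
- source_match:
    severity: 'critical'   # assumption: a higher severity used by other rules
  target_match:
    severity: 'error'      # the severity used by InstanceDown in this article
  equal: ['instance']      # only inhibit alerts that share the same instance label
EOF
cat inhibit-snippet.yml
```

After merging inhibit_rules into alertmanager.yml as a top-level key, rerunning ./amtool check-config alertmanager.yml should confirm that the file still parses.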