1. Alert rule reference
https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware
The deployment steps follow.
2. Deploy Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar xvf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/bin/alertmanager
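Before editing any configuration, it can be worth confirming that the unpacked directory holds the two binaries used in the later steps (alertmanager and amtool). The following is a minimal sketch, assuming the install directory /usr/local/bin/alertmanager from the mv command above; check_am_install is a hypothetical helper name, not something shipped with Alertmanager.

```shell
# check_am_install [DIR]: report whether DIR holds executable
# alertmanager and amtool binaries.
# check_am_install is a hypothetical helper, not part of Alertmanager.
check_am_install() {
    dir="${1:-/usr/local/bin/alertmanager}"
    for bin in alertmanager amtool; do
        if [ ! -x "$dir/$bin" ]; then
            echo "missing or not executable: $dir/$bin" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}
```

Usage: check_am_install && /usr/local/bin/alertmanager/alertmanager --version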
3. Edit the main Alertmanager configuration file (email alerting)
cd /usr/local/bin/alertmanager
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: '<sender email address>'
  smtp_auth_username: '<sender email address>'
  smtp_auth_password: '<password>'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '<recipient email address>'
EOF
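Check that the configuration file is valid:
./amtool check-config alertmanager.yml
Start Alertmanager (using the install directory from step 2):
nohup /usr/local/bin/alertmanager/alertmanager --config.file=/usr/local/bin/alertmanager/alertmanager.yml &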
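Because nohup returns immediately, it does not show whether Alertmanager actually came up. One way to check is to poll its /-/healthy endpoint. The sketch below assumes the default listen port 9093 and that curl is installed; wait_healthy is a hypothetical helper name.

```shell
# wait_healthy BASE_URL [TRIES]: poll BASE_URL/-/healthy about once per
# second until it answers successfully or TRIES attempts are used up.
# wait_healthy is a hypothetical helper name.
wait_healthy() {
    url="$1"
    tries="${2:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -fsS --max-time 2 "$url/-/healthy" >/dev/null 2>&1; then
            echo "healthy: $url"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not healthy after $tries attempts: $url" >&2
    return 1
}
```

Usage: wait_healthy http://localhost:9093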
4. Configure Prometheus to communicate with Alertmanager (prometheus.yml)
vim prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # replace ip with the address of the Alertmanager host
      - ip:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # first_rules.yml sits in the same directory as prometheus.yml
  - "first_rules.yml"
  # - "second_rules.yml"
./promtool check config prometheus.yml
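Once Prometheus has been restarted with this configuration, its HTTP API can report which Alertmanager instances it is sending to (GET /api/v1/alertmanagers). The sketch below is a rough check without a JSON parser. It assumes that the response is compact, single-line JSON and that the activeAlertmanagers list appears before droppedAlertmanagers; am_is_active is a hypothetical helper name.

```shell
# am_is_active PROM_URL TARGET: succeed if TARGET (host:port) appears among
# the active Alertmanagers reported by Prometheus.
# Assumption: "activeAlertmanagers" precedes "droppedAlertmanagers" in the
# single-line JSON response, so everything from the latter key on is cut off.
# TARGET is used as a grep pattern, so "." matches any character.
am_is_active() {
    prom_url="$1"
    target="$2"
    curl -fsS --max-time 5 "$prom_url/api/v1/alertmanagers" \
        | sed 's/"droppedAlertmanagers".*//' \
        | grep -q "\"url\":\"http://$target/"
}
```

Usage: am_is_active http://localhost:9090 ip:9093 && echo "Prometheus sees Alertmanager" (with ip replaced by the real address).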
5. Write the alert rules
Official examples: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
cat > first_rules.yml << 'EOF'
groups:
- name: general.rules
  rules:
  # Alert for any instance that is unreachable for more than 1 minute.
  - alert: InstanceDown   # alert name
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} has stopped working"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"
EOF
./promtool check config prometheus.yml
systemctl restart prometheus
The rules you wrote are then visible in the browser:
http://ip:9090/rules
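One detail to watch when a rule file is written through a heredoc: the annotations contain {{ $labels.instance }}, and with an unquoted delimiter the shell substitutes $labels before the file is written, which silently breaks the template. Quoting the delimiter ('EOF') writes the text literally. The sketch below shows the difference using a throwaway temp file; the variable names are illustrative only.

```shell
# Compare an unquoted and a quoted heredoc delimiter using a throwaway file.
# labels is set to empty here to mimic a variable the shell does not know.
labels=""
tmp_rules="$(mktemp)"

# Unquoted delimiter: the shell expands $labels (to nothing here).
cat > "$tmp_rules" << EOF
summary: "Instance {{ $labels.instance }} down"
EOF
unquoted="$(cat "$tmp_rules")"

# Quoted delimiter: the text is written exactly as typed.
cat > "$tmp_rules" << 'EOF'
summary: "Instance {{ $labels.instance }} down"
EOF
quoted="$(cat "$tmp_rules")"

rm -f "$tmp_rules"
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

In the unquoted case the template loses the $labels prefix, so grepping the generated file for '$labels' is a quick sanity check.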
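6. Verify the alert
On 172.16.38.238, stop the service scraped by the node job.
The node target then shows as down.
The alert state changes to FIRING, and after about two minutes the alert email arrives.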
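To exercise the email path without stopping a real target, a synthetic alert can be posted straight to Alertmanager's v2 API (POST /api/v2/alerts). This is a minimal sketch, assuming curl is available and Alertmanager listens on port 9093; send_test_alert and the TestAlert label values are made up for illustration.

```shell
# send_test_alert AM_URL: post one synthetic alert to Alertmanager's v2 API.
# send_test_alert and the "TestAlert" label values are hypothetical
# illustration values, not part of the setup described above.
send_test_alert() {
    am_url="$1"
    payload='[{"labels":{"alertname":"TestAlert","severity":"error","instance":"test"},"annotations":{"summary":"manual test alert"}}]'
    curl -fsS --max-time 5 -X POST \
        -H 'Content-Type: application/json' \
        -d "$payload" \
        "$am_url/api/v2/alerts"
}
```

Usage: send_test_alert http://localhost:9093
Because the payload carries no end time, the alert should be treated as resolved once resolve_timeout passes without it being sent again.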