AlertManager
1. Install AlertManager
cd /usr/software
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/prometheus-2.26.0.linux-amd64
./
2. Configure prometheus.yml
cd /usr/local/prometheus-2.26.0.linux-amd64
vim prometheus.yml
######## prometheus.yml configuration file ###########
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:19911'
rule_files:
  - "rules.yml"
3. Configure alertmanager.yml
cd /usr/local/alertmanager-0.21.0.linux-amd64
vim alertmanager.yml
######## alertmanager.yml configuration file ###########
global:
  resolve_timeout: 5m
  # SMTP settings
  smtp_from: "ezrealer@qq.com"
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: "ezrealer@qq.com"
  smtp_auth_password: "123456"
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'ezrealer_email'
receivers:
  - name: 'ezrealer_email'
    email_configs:
      - to: '2695138379@qq.com'
        send_resolved: true
        headers:
          from: "Prometheus Alert Center"
          subject: "Alert email"
          to: "ezrealer2"
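To check that the route and email receiver above actually deliver mail, one option is to push a hand-made alert into Alertmanager. The sketch below is Python and rests on assumptions: Alertmanager listens on localhost:19911 (the target used in prometheus.yml) and serves the v2 API at /api/v2/alerts. The helper names build_test_alert and send_alerts, and the alert name manual_test, are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: same address that prometheus.yml lists under alertmanagers.
ALERTMANAGER_URL = "http://localhost:19911/api/v2/alerts"


def build_test_alert(alertname, instance, minutes=5):
    """Return a one-element alert list in the shape the v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "instance": instance},
        "annotations": {"summary": "manual test alert"},
        "startsAt": now.isoformat(),
        # After endsAt the alert is treated as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, url=ALERTMANAGER_URL):
    """POST the alerts as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Calling send_alerts(build_test_alert("manual_test", "localhost")) on the Alertmanager host should, with group_wait: 10s, produce an email to the configured receiver shortly afterwards, provided the SMTP settings are valid.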
4. Configure rules.yml
cd /usr/local/prometheus-2.26.0.linux-amd64
vim rules.yml
######## rules.yml configuration file ###########
groups:
  - name: node_status
    rules:
      - alert: node_status
        expr: probe_success == 0
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "group:{{$labels.group}},instance:{{$labels.instance}} has been down"
          description: "group:{{$labels.group}},instance:{{$labels.instance}} has been down"
          value: "{{$value}}"
  - name: CPU
    rules:
      - alert: CPUUsage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[6m]))) by (instance) * 100 > 80
        for: 1m
        labels:
          status: moderate
        annotations:
          summary: "group:{{$labels.group}},instance:{{$labels.instance}}: CPU usage is above 80%"
          value: "{{$value}}"
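Both rules above use for: 1m, which means the expression has to stay true for a full minute before the alert is sent to Alertmanager. The Python sketch below imitates that behaviour as a small state machine; the function name alert_states and the 15-second evaluation spacing in the usage note are illustrative assumptions, not part of Prometheus.

```python
def alert_states(evaluations, for_seconds):
    """Map rule evaluations to alert states.

    evaluations: list of (timestamp_seconds, expr_is_true) in time order.
    Returns one of "inactive", "pending" or "firing" per evaluation.
    """
    states = []
    active_since = None  # timestamp at which the expr first became true
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states
```

With evaluations every 15 seconds and for_seconds=60, a target that goes down, recovers once, and then stays down only reaches "firing" 60 seconds after the second failure began; the short earlier failure never leaves "pending".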
Server monitoring and alerting
Reference: https://mp.weixin.qq.com/s/DILXvkvpS25VJbb3FalBqQ
CPU
Memory
Disk
Availability
Service status
Network
CPU
100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
node_load5 > on (instance) 2 * count by(instance)(node_cpu_seconds_total{mode="idle"})
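The two CPU expressions above can be read as plain arithmetic: the first turns the per-CPU idle counters into a usage percentage, the second compares the 5-minute load with twice the number of CPUs. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and counter resets (which irate handles in Prometheus) are ignored here.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, counter_value) samples."""
    (t1, v1), (t2, v2) = samples[-2:]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_by_cpu):
    """100 - avg(irate(idle counter)) * 100 across the CPUs of one instance."""
    rates = [irate(samples) for samples in idle_samples_by_cpu.values()]
    return 100 - (sum(rates) / len(rates)) * 100


def load_is_high(load5, cpu_count, factor=2):
    """Mirror of: node_load5 > 2 * count(node_cpu_seconds_total{mode="idle"})."""
    return load5 > factor * cpu_count
```

For example, if cpu0 was idle for 3 of the last 15 seconds and cpu1 for 9 of them, the average idle rate is 0.4 and the usage comes out at about 60%, which is right at the threshold of the first expression.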
Memory
node_memory_MemTotal_bytes: total memory on the host
node_memory_MemFree_bytes: free memory on the host
node_memory_Buffers_bytes: memory used by the buffer cache
node_memory_Cached_bytes: memory used by the page cache
100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"})by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"})by(instance)*100 > 80
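The memory expression above treats free, buffer and page-cache memory as reclaimable and reports everything else as used. A short Python sketch of the same formula, with a hypothetical function name and made-up byte counts in the usage note:

```python
def memory_usage_percent(mem_total, mem_free, buffers, cached):
    """100 - (MemFree + Buffers + Cached) / MemTotal * 100, all in bytes."""
    reclaimable = mem_free + buffers + cached
    return 100 - reclaimable / mem_total * 100
```

On a host with 8 GiB in total, 1 GiB free, 0.5 GiB in buffers and 0.5 GiB in page cache, this gives 75% used, so the > 80 condition would not match yet.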
Disk
predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600)
(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100)>80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!="",device!="rootfs"}[1h],4 * 3600) < 0)
100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100)
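The predict_linear expressions above fit a straight line through the last hour of free-space samples and extrapolate it 4 hours (4*3600 seconds) ahead; a result below 0 means the disk is expected to fill up. The Python sketch below shows that idea with an ordinary least-squares fit. It is a simplified stand-in rather than the Prometheus implementation, and the function names and the default evaluation time (the last sample's timestamp) are assumptions.

```python
def predict_linear(samples, horizon_seconds, eval_time=None):
    """Least-squares extrapolation of (timestamp, value) samples.

    Returns the fitted value at eval_time + horizon_seconds.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if eval_time is None:
        eval_time = samples[-1][0]
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var  # bytes per second; negative when the disk is filling
    intercept = mean_v - slope * mean_t
    return slope * (eval_time + horizon_seconds) + intercept


def disk_will_fill(avail_bytes, size_bytes, predicted_free_bytes):
    """Usage above 80% now AND predicted free space below zero."""
    used_percent = 100 - avail_bytes / size_bytes * 100
    return used_percent > 80 and predicted_free_bytes < 0
```

If free space drops by 1 byte per second from 7200 bytes over one hour, the fit predicts roughly -10800 bytes four hours later, so the combined rule would match once usage is also above 80%.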
Availability
up{job="node-exporter"}==0
Service status
1. docker
node_systemd_unit_state{name="docker.service",state="active"} == 1