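Prometheus: From Getting Started to Alerting
1. Workflow overview
---->Install Prometheus server
---->Install Node-Exporter, then pull data from the nodes
---->Install Alertmanager, write its configuration file, and register the alert email
---->Verify alerting
2. Install Prometheus server
Note: if you only want to install and test the alerting example, go straight to section 4 (Install Alertmanager and set up alerting).
2.1 Extract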
[root@node01 ~]# tar -zxvf prometheus-2.30.3.linux-amd64.tar.gz
[root@node01 ~]# mv prometheus-2.30.3.linux-amd64 prometheus-2.30.3
[root@node01 ~]# mv prometheus-2.30.3 /opt/yjx/
[root@node01 ~]# rm -rf prometheus-2.30.3.linux-amd64.tar.gz
2.2 Edit the configuration file and register the systemd service
This step installs only the Prometheus server, so the only scrape job is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
[root@node01 ~]# vim /etc/systemd/system/prometheus.service
[Unit]
Description=prometheus
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=root
ExecStart=/opt/yjx/prometheus-2.30.3/prometheus \
--config.file=/opt/yjx/prometheus-2.30.3/prometheus.yml \
--storage.tsdb.path=/opt/yjx/prometheus-2.30.3/data/ \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
[root@node01 ~]# systemctl daemon-reload
2.3 Start and verify
[root@node01 ~]# systemctl start prometheus
[root@node01 ~]# systemctl status prometheus
Note: use the IP address of your own cluster.
http://192.168.10.101:9090/
http://node01:9090/
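Besides opening the web UI, the server can also be checked from the shell. The following is a minimal sketch, assuming the same address as above. The /-/healthy and /-/ready paths are Prometheus management endpoints; the function name prom_check and the PROM_URL variable are made up for this example.

```shell
# Hypothetical helper: query the Prometheus health and readiness endpoints.
# PROM_URL is an assumption; replace it with your own server address.
PROM_URL="${PROM_URL:-http://192.168.10.101:9090}"

prom_check() {
    local endpoint
    for endpoint in healthy ready; do
        # -f makes curl return non-zero on HTTP errors, -s hides progress output
        if curl -fs "${PROM_URL}/-/${endpoint}" >/dev/null; then
            echo "${endpoint}: ok"
        else
            echo "${endpoint}: FAILED"
            return 1
        fi
    done
}
```

Run prom_check after systemctl start prometheus; a FAILED line usually means the service is not up yet or the address is wrong.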
3. Install Node-Exporter and pull data
3.1 Install
[root@node01 ~]# tar -zxvf node_exporter-1.2.2.linux-amd64.tar.gz
[root@node01 ~]# mv node_exporter-1.2.2.linux-amd64 node_exporter-1.2.2
[root@node01 ~]# mv node_exporter-1.2.2 /opt/yjx/
[root@node01 ~]# rm -rf node_exporter-1.2.2.linux-amd64.tar.gz
[root@node01 ~]# vim /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/yjx/node_exporter-1.2.2/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
[root@node01 ~]# yjxrsync /opt/yjx/node_exporter-1.2.2
# Run the following on all three hosts
[root@123 ~]# systemctl daemon-reload
[root@123 ~]# systemctl start node_exporter
[root@123 ~]# systemctl status node_exporter
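To confirm that every exporter answers before adding it to Prometheus, a loop like the one below can help. It is only a sketch: the three IP addresses are the ones used in this tutorial, and the NODES variable and check_exporters function are hypothetical names.

```shell
# Hypothetical helper: check that Node-Exporter answers on port 9100 on each host.
# NODES is an assumption; list the hosts of your own cluster here.
NODES="${NODES:-192.168.10.101 192.168.10.102 192.168.10.103}"

check_exporters() {
    local host
    for host in ${NODES}; do
        # Node-Exporter serves its metrics at /metrics on port 9100 by default
        if curl -fs "http://${host}:9100/metrics" >/dev/null; then
            echo "${host}: ok"
        else
            echo "${host}: unreachable"
        fi
    done
}
```

Calling check_exporters prints one line per host, which makes it easy to spot a node where the service was not started.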
3.2 Scrape and verify
[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/prometheus.yml
Note 1: YAML is indentation-sensitive, so keep the formatting exact.
Note 2: use the node IP addresses of your own cluster.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'yjx-nodes'
    metrics_path: /metrics
    static_configs:
      - targets: ['192.168.10.101:9100','192.168.10.102:9100','192.168.10.103:9100']
# Hot reload. This command is used often: run it after every change to the configuration file. It works because the service was started with --web.enable-lifecycle.
[root@node01 ~]# curl -XPOST http://192.168.10.101:9090/-/reload
You can now open port 9100 of a node in the browser to view the Node-Exporter metrics.
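Whether Prometheus itself sees the nodes can be checked through its HTTP API. The sketch below queries the built-in up metric for one job; the endpoint /api/v1/query is part of the Prometheus HTTP API, while query_up and PROM_URL are names invented for this example.

```shell
# Hypothetical helper: ask Prometheus for the `up` metric of one scrape job.
# PROM_URL is an assumption; replace it with your own server address.
PROM_URL="${PROM_URL:-http://192.168.10.101:9090}"

query_up() {
    local job="$1"
    # -G sends the data as a GET query string;
    # --data-urlencode lets curl escape the PromQL expression.
    curl -fsG "${PROM_URL}/api/v1/query" \
        --data-urlencode "query=up{job=\"${job}\"}"
}
```

For example, query_up yjx-nodes returns JSON with one result per target; a value of "1" means the last scrape succeeded and "0" means it failed.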
4. Install Alertmanager and set up alerting
4.1 First edit the configuration file and add the alerting rules
[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/prometheus.yml
Note: mind the YAML formatting; if it is wrong, the hot reload will report an error.
Enable alerting in Prometheus
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.10.101:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/yjx/prometheus-2.30.3/rules/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'yjx-nodes'
    metrics_path: /metrics
    static_configs:
      - targets: ['192.168.10.101:9100','192.168.10.102:9100','192.168.10.103:9100']
[root@node01 ~]# mkdir -p /opt/yjx/prometheus-2.30.3/rules
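[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/rules/node_alived.yml
Custom rules: use the example given here. It takes a node going down as the test case; for other cases, adapt this template yourself.
groups:
  # Instance liveness alert rules
  - name: instance-liveness-rules
    rules:
      # Instance liveness alert
      - alert: InstanceDown # alert name
        expr: up == 0 # expression that triggers the alert
        for: 1m # how long the condition must hold before the alert fires
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Host is down !!!"
          description: "Uh-oh, this instance has been down for more than one minute."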
[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/rules/memory_over.yml
[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/rules/cpu_over.yml
[root@node01 ~]# vim /opt/yjx/prometheus-2.30.3/rules/disk_over.yml
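The contents of these three files are not shown here. As one possible illustration of adapting the node-down template, the sketch below writes a memory rule. The metric names node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes come from Node-Exporter on Linux; the alert name, the 80% threshold, and the write_memory_rule function are assumptions made for this example, not values from the original setup.

```shell
# Hypothetical example: write a memory-usage rule modelled on node_alived.yml.
# RULES_DIR, the alert name, and the 80% threshold are assumptions.
RULES_DIR="${RULES_DIR:-/opt/yjx/prometheus-2.30.3/rules}"

write_memory_rule() {
    mkdir -p "${RULES_DIR}"
    # The quoted 'EOF' keeps the shell from expanding $labels inside the rule.
    cat > "${RULES_DIR}/memory_over.yml" <<'EOF'
groups:
  - name: memory-usage-rules
    rules:
      - alert: MemoryUsageHigh
        # used memory as a percentage of total memory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Memory usage is high"
          description: "Memory usage on {{ $labels.instance }} has been above 80% for more than one minute."
EOF
}
```

The CPU and disk rules can follow the same shape with a different expr; pick thresholds that suit your own hosts.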
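Important: after editing the yml files, do not run curl -XPOST http://192.168.10.101:9090/-/reload right away.
First check that the configuration files are valid!
Change into the Prometheus directory and run: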
[root@node01 prometheus-2.30.3]# ./promtool check config /opt/yjx/prometheus-2.30.3/prometheus.yml
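If the check fails, promtool prints a FAILED line describing the error.
If the format and configuration are correct, it prints SUCCESS for the configuration file and the rule files.
Now Prometheus can be reloaded: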
curl -XPOST http://192.168.10.101:9090/-/reload
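If no error is returned, the reload succeeded. Then open the web UI and check that the custom rules are listed.
If the list is empty, the yml files are not configured correctly.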
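The check-then-reload sequence can be wrapped in one function so the reload is never sent when the check fails. This is a sketch under the paths used in this tutorial; safe_reload, PROM_HOME, PROMTOOL, and PROM_URL are hypothetical names.

```shell
# Hypothetical wrapper: reload Prometheus only if promtool accepts the config.
# PROM_HOME and PROM_URL are assumptions; adjust them to your installation.
PROM_HOME="${PROM_HOME:-/opt/yjx/prometheus-2.30.3}"
PROM_URL="${PROM_URL:-http://192.168.10.101:9090}"
PROMTOOL="${PROMTOOL:-${PROM_HOME}/promtool}"

safe_reload() {
    # promtool check config also validates the files matched by rule_files
    if ! "${PROMTOOL}" check config "${PROM_HOME}/prometheus.yml"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    # -f turns HTTP errors into a non-zero exit code, -sS stays quiet but shows errors
    curl -fsS -XPOST "${PROM_URL}/-/reload" && echo "reloaded"
}
```

Calling safe_reload after each edit keeps a broken yml file from ever reaching the running server.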
4.2 Install Alertmanager
[root@node01 ~]# tar -zxvf alertmanager-0.23.0.linux-amd64.tar.gz
[root@node01 ~]# mv alertmanager-0.23.0.linux-amd64 alertmanager-0.23.0
[root@node01 ~]# mv alertmanager-0.23.0 /opt/yjx/
[root@node01 ~]# rm -rf alertmanager-0.23.0.linux-amd64.tar.gz
[root@node01 ~]# vim /opt/yjx/alertmanager-0.23.0/alertmanager.yml
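Use the yml given here.
Note: set smtp_from and smtp_auth_username to your own mailbox.
smtp_auth_password is not the mailbox login password. It is the authorization code that the mail provider generates
when you enable the POP3/IMAP/SMTP/Exchange/CardDAV/CalDAV service for the mailbox.
# Global settings
global:
  resolve_timeout: 5m # resolve timeout, default 5m
  smtp_smarthost: 'smtp.qq.com:465' # SMTP server of the mailbox
  smtp_from: '1075810796@qq.com'
  smtp_auth_username: '1075810796@qq.com'
  smtp_auth_password: 'hmgtugxzibdggjee'
  smtp_require_tls: false
# Template definitions
templates:
  - '/opt/yjx/alertmanager-0.23.0/template/*.tmpl' # path
# Routing
route:
  group_by: ['alertname'] # labels used to group alerts
  group_wait: 10s # wait before sending the first notification for a new group
  group_interval: 10s # wait before notifying about new alerts added to an existing group
  repeat_interval: 1h # wait before repeating a notification that was already sent
  receiver: 'mail' # default receiver
# Receivers
receivers:
  - name: 'mail' # receiver name
    email_configs:
      - to: '{{ template "email.to" . }}' # address that receives the alerts
        html: '{{ template "email.to.html" . }}' # template for the email body
        send_resolved: true
# Inhibition
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']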
Next, create the notification template, which defines the content of the alert emails.
[root@node01 ~]# mkdir -p /opt/yjx/alertmanager-0.23.0/template
[root@node01 ~]# vim /opt/yjx/alertmanager-0.23.0/template/email.tmpl
{{ define "email.from" }}863159469@qq.com{{ end }}
{{ define "email.to" }}540667499@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: yjxxt_prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
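As with promtool on the Prometheus side, the Alertmanager configuration and its templates can be validated before the service is started. The sketch assumes the amtool binary that ships in the Alertmanager tarball; check_am_config, AM_HOME, and AMTOOL are names invented for this example.

```shell
# Hypothetical helper: validate alertmanager.yml (and the templates it references).
# AM_HOME is an assumption; adjust it to your installation path.
AM_HOME="${AM_HOME:-/opt/yjx/alertmanager-0.23.0}"
AMTOOL="${AMTOOL:-${AM_HOME}/amtool}"

check_am_config() {
    # amtool check-config exits non-zero when the file cannot be loaded
    "${AMTOOL}" check-config "${AM_HOME}/alertmanager.yml"
}
```

Running check_am_config right after editing the yml or the .tmpl file catches indentation and template syntax errors early.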
Register the systemd service
[root@node01 ~]# vim /etc/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=root
ExecStart=/opt/yjx/alertmanager-0.23.0/alertmanager \
--config.file /opt/yjx/alertmanager-0.23.0/alertmanager.yml \
--storage.path /opt/yjx/alertmanager-0.23.0/data/
[Install]
WantedBy=multi-user.target
[root@node01 ~]# systemctl daemon-reload
[root@node01 ~]# systemctl start alertmanager
[root@node01 ~]# systemctl status alertmanager
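Before shutting down a real node, the email path can be tried on its own by pushing a hand-made alert to Alertmanager. This is only a sketch: /api/v2/alerts is the Alertmanager v2 API endpoint, while the alert name TestAlert, the label values, and the send_test_alert function are made up for illustration.

```shell
# Hypothetical helper: push a hand-made alert to the Alertmanager v2 API.
# AM_URL is an assumption; replace it with your own Alertmanager address.
AM_URL="${AM_URL:-http://192.168.10.101:9093}"

send_test_alert() {
    # The label and annotation values are invented; the email template
    # reads severity, alertname, instance, summary and description.
    curl -fsS -XPOST "${AM_URL}/api/v2/alerts" \
        -H 'Content-Type: application/json' \
        -d '[{
              "labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"},
              "annotations": {"summary": "Test alert", "description": "Sent by hand to check email delivery."}
            }]'
}
```

If the SMTP settings are correct, send_test_alert should result in an email shortly after the group_wait period. Because the test alert carries no end time, Alertmanager is expected to treat it as resolved once resolve_timeout has passed.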
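4.3 Verify alerting
Shut down host node02 to simulate a node failure.
Wait for the alert to fire. The rule above uses for: 1m and the route uses group_wait: 10s, so the notification goes out a little over a minute after the target is seen as down. The resolve_timeout of 5 minutes is not this delay; it only applies to alerts that carry no end time.
Open the web pages:
http://192.168.10.101:9090/classic/targets
http://192.168.10.101:9093
The alert email should arrive within a few minutes.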
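While waiting for the email, the state of the alert can be followed from the shell on both sides. The sketch below uses the Prometheus endpoint /api/v1/alerts and the Alertmanager endpoint /api/v2/alerts; show_alerts, PROM_URL, and AM_URL are hypothetical names, and the addresses are the ones used in this tutorial.

```shell
# Hypothetical helper: show alerts as seen by Prometheus and by Alertmanager.
# PROM_URL and AM_URL are assumptions; replace them with your own addresses.
PROM_URL="${PROM_URL:-http://192.168.10.101:9090}"
AM_URL="${AM_URL:-http://192.168.10.101:9093}"

show_alerts() {
    echo "== Prometheus (pending or firing) =="
    curl -fsS "${PROM_URL}/api/v1/alerts"
    echo
    echo "== Alertmanager (received) =="
    curl -fsS "${AM_URL}/api/v2/alerts"
    echo
}
```

An alert that shows up as pending in Prometheus is still inside its for: period; once it is firing it should also appear in the Alertmanager list.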