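1. Prometheus alerting
Prometheus splits metric collection and storage from alerting across two independent components: Prometheus Server and Alertmanager (a general-purpose component). The former only evaluates alerting rules and generates alert notifications; the latter carries out the actual alert delivery.
Alertmanager handles the alert notifications sent by its clients. The client is usually Prometheus Server, but Alertmanager also accepts alerts from other tools.
After grouping and deduplicating alert notifications, Alertmanager routes them to different receivers according to its routing rules, such as email, SMS, or PagerDuty.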
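2. Silencing, inhibition, and grouping
- Grouping: merges similar alerts into a single notification. When a large-scale failure triggers a flood of alerts, grouping keeps users from being buried in alert noise and missing the critical information.
- Inhibition: when a component or service fails and triggers an alert, other components or services that depend on it may fire alerts as well. Inhibition suppresses such cascading alerts so that users can focus on the actual fault.
- Silences: within a given time window, Alertmanager does not actually send notifications to users even if it receives matching alerts. Silences are typically enabled during routine maintenance.
- Routing (route): configures how Alertmanager handles specific types of incoming alerts. Each alert is matched against the routing rules, and the match result determines the path and handling of that alert.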
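The concepts in this list map onto a few sections of alertmanager.yml. The fragment below is a minimal sketch, not a complete configuration: the receiver names, the example.com addresses, and the extra group_by label are hypothetical, and the email receivers assume that SMTP settings are defined in a global section elsewhere. It uses the match / source_match / target_match syntax of Alertmanager 0.21, the version deployed later in this article.

```yaml
# Sketch only: receiver names, label values, and addresses are hypothetical.
route:
  receiver: 'default-mail'             # fallback receiver for alerts no child route matches
  group_by: ['alertname', 'instance']  # grouping: alerts sharing these label values form one notification
  group_wait: 30s                      # wait before sending the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts added to an existing group
  repeat_interval: 1h                  # wait before re-sending a notification that is still firing
  routes:                              # routing: child routes are tried in order
    - match:
        severity: 'critical'
      receiver: 'oncall-mail'

receivers:
  - name: 'default-mail'
    email_configs:
      - to: 'ops@example.com'
  - name: 'oncall-mail'
    email_configs:
      - to: 'oncall@example.com'

inhibit_rules:                         # inhibition: a firing critical alert mutes matching warnings
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']   # only when both alerts carry the same values for these labels
```

Silences do not appear in this file. They are created at runtime, for example through the Alertmanager web UI.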
3. Deploying alerting with QQ Mail
[root@qq ~]# systemctl stop firewalld.service
[root@qq ~]# setenforce 0
## Upload the alertmanager-0.21.0.linux-amd64.tar.gz archive to /opt
[root@qq /opt]# ls
alertmanager-0.21.0.linux-amd64.tar.gz
[root@qq /opt]# tar zxf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
[root@qq /usr/local]# ln -s /usr/local/alertmanager-0.21.0.linux-amd64/ /usr/local/alertmanager
[root@qq /usr/local/alertmanager]# cat /usr/local/alertmanager/alertmanager.yml
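global:                                     # global parameters
  resolve_timeout: 5m
route:                                      # routing
  group_by: ['alertname']                   # labels used for grouping
  group_wait: 30s                           # wait before the first notification of a new group
  group_interval: 5m                        # wait before notifying about new alerts in an existing group
  repeat_interval: 1h                       # wait before re-sending a notification that is still firing
  receiver: 'web.hook'                      # receiver / notification channel
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'           # webhook endpoint on port 5001
inhibit_rules:                              # inhibition rules
  - source_match:                           # alerts that trigger the inhibition
      severity: 'critical'                  # critical severity
    target_match:                           # alerts that get inhibited
      severity: 'warning'                   # warning severity
    equal: ['alertname', 'dev', 'instance'] # only when alertname, dev, and instance have equal values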
Edit the configuration file
[root@qq /usr/local/alertmanager]# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 1441596016@qq.com
  smtp_auth_username: 1441596016@qq.com
  smtp_auth_password: 'authorization-code'  # QQ Mail authorization code, not the login password
  smtp_require_tls: false
  smtp_smarthost: 'smtp.qq.com:465'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-test'
receivers:
- name: 'email-test'
  email_configs:
  - to: 1441596016@qq.com
    send_resolved: true
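Because send_resolved is enabled, the same mailbox receives both firing and resolved notifications. One optional way to tell them apart at a glance is to template the subject line. The fragment below is a sketch of that idea and is not part of the original setup; the subject format is an assumption, and it relies on alerts being grouped by alertname so that the label is common to the whole group.

```yaml
# Sketch only: the Subject format is a hypothetical choice.
receivers:
- name: 'email-test'
  email_configs:
  - to: 1441596016@qq.com
    send_resolved: true
    headers:
      # .Status is "firing" or "resolved"; alertname is shared by the group
      Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
```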
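Set up the bound QQ mailbox (the authorization code used as smtp_auth_password is generated in the mailbox settings)
Start Alertmanager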
[root@qq /usr/local/alertmanager]# ./alertmanager
Related configuration files
[root@qq /usr/local]# cd prometheus-2.27.1.linux-amd64/
[root@qq /usr/local/prometheus-2.27.1.linux-amd64]# mkdir alert_config
[root@qq /usr/local/prometheus-2.27.1.linux-amd64]# cd alert_config/
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config]# mkdir alert_rules targets
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config]# cd alert_rules/
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config/alert_rules]# vim instance_down.yaml
groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    # Condition for alerting
    expr: up == 0
    for: 1m
    # Annotations - additional informational fields that describe the alert
    annotations:
      title: 'Instance down'
      description: 'Instance has been down for more than 1 minute.'
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'critical'
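The annotations in this rule are static, so the email does not say which instance went down. A possible variant, shown here only as a sketch and not as part of the original setup, fills the annotations from the labels of the firing series using Prometheus rule templating:

```yaml
# Sketch only: same rule as in instance_down.yaml, with templated annotations.
groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: 'critical'
    annotations:
      # $labels holds the labels of the series that triggered the alert
      title: 'Instance {{ $labels.instance }} down'
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
```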
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config]# cd targets/
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config/targets]# vim alertmanagers.yaml
- targets:
  - 192.168.68.40:9093
  labels:
    app: alertmanager
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config/targets]# vim nodes-linux.yaml
- targets:
  - 192.168.68.30:9100
  - 192.168.68.105:9100
  labels:
    app: node-exporter
    job: node
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config/targets]# vim prometheus-servers.yaml
- targets:
  - 192.168.68.40:9090
  labels:
    app: prometheus
    job: prometheus
Prometheus startup configuration file
[root@qq /usr/local/prometheus-2.27.1.linux-amd64/alert_config]# vim prometheus.yml
# my global config
# Author: MageEdu <mage@magedu.com>
# Repo: http://gitlab.magedu.com/MageEdu/prometheus-configs/
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - "targets/alertmanagers*.yaml"
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yaml"
  - "alert_rules/*.yaml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
    - files:
      - targets/prometheus-*.yaml
      refresh_interval: 2m
  # All nodes
  - job_name: 'nodes'
    file_sd_configs:
    - files:
      - targets/nodes-*.yaml
      refresh_interval: 2m
  - job_name: 'alertmanagers'
    file_sd_configs:
    - files:
      - targets/alertmanagers*.yaml
      refresh_interval: 2m
Start Prometheus
[root@qq /usr/local/prometheus-2.27.1.linux-amd64]# ./prometheus --config.file=./alert_config/prometheus.yml