Containerized Prometheus deployment, with Grafana for dashboards
Deploy Prometheus
Environment
Hostname | IP address | Services |
---|---|---|
prometheus | 192.168.129.205 | prometheus, grafana |
node-exporter | 192.168.129.33 | node_exporter, cAdvisor |
Preparation
//Set the hostname
[root@localhost ~]# hostnamectl set-hostname prometheus
[root@localhost ~]# bash
//Disable the firewall and SELinux
[root@prometheus ~]# systemctl disable --now firewalld.service
[root@prometheus ~]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
[root@prometheus ~]# setenforce 0
[root@prometheus ~]# reboot
//Configure the yum repository
[root@prometheus ~]# curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-8.repo
[root@prometheus ~]# sed -i -e '/mirrors.cloud.aliyuncs.com/d' -e '/mirrors.aliyuncs.com/d' /etc/yum.repos.d/CentOS-Base.repo
//Configure the Docker yum repository
[root@prometheus ~]# cd /etc/yum.repos.d/
[root@prometheus ~]# curl -o docker-ce.repo https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo
[root@prometheus ~]# sed -i 's@https://download.docker.com@https://mirrors.tuna.tsinghua.edu.cn/docker-ce@g' docker-ce.repo
//Install docker-ce and its components
[root@prometheus ~]# yum -y install docker-ce
[root@prometheus ~]# systemctl enable --now docker
[root@prometheus ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2021-12-29 23:32:00 CST; 7min ago
Docs: https://docs.docker.com
Main PID: 1002 (dockerd)
Tasks: 13
Memory: 128.2M
CGroup: /system.slice/docker.service
└─1002 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
//Configure the Aliyun registry mirror
[root@prometheus ~]# tee /etc/docker/daemon.json << 'EOF'
{
"registry-mirrors": ["https://b9pmyelo.mirror.aliyuncs.com"]
}
EOF
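A malformed daemon.json will prevent dockerd from starting, so it is worth validating the file before restarting Docker. A minimal sketch, assuming python3 is available (the scratch path is arbitrary; run the same check against /etc/docker/daemon.json on the real host):

```shell
# Write a scratch copy of the mirror config and confirm it parses as JSON
cat > /tmp/daemon.json << 'EOF'
{
  "registry-mirrors": ["https://b9pmyelo.mirror.aliyuncs.com"]
}
EOF
# json.tool exits non-zero (and prints an error) on invalid JSON
python3 -m json.tool /tmp/daemon.json
```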
[root@prometheus ~]# systemctl restart docker
//Check the Docker version
[root@prometheus ~]# docker version
Client: Docker Engine - Community
Version: 20.10.12
API version: 1.41
Go version: go1.16.12
Git commit: e91ed57
Built: Mon Dec 13 11:45:22 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.12
API version: 1.41 (minimum version 1.12)
Go version: go1.16.12
Git commit: 459d0df
Built: Mon Dec 13 11:43:44 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Run the Prometheus container
Pull the official prometheus image
[root@prometheus ~]# docker pull prom/prometheus
Using default tag: latest
latest: Pulling from prom/prometheus
3cb635b06aa2: Pull complete
34f699df6fe0: Pull complete
33d6c9635e0f: Pull complete
f2af7323bed8: Pull complete
c16675a6a294: Pull complete
827843f6afe6: Pull complete
3d272942eeaf: Pull complete
7e785cfa34da: Pull complete
05e324559e3b: Pull complete
170620261a59: Pull complete
ec35f5996032: Pull complete
5509173eb708: Pull complete
Digest: sha256:cb9817249c346d6cfadebe383ed3b3cd4c540f623db40c4ca00da2ada45259bb
Status: Downloaded newer image for prom/prometheus:latest
docker.io/prom/prometheus:latest
Create a directory for the Prometheus configuration and provide a default config file
[root@prometheus ~]# mkdir -p /prometheus/config/
[root@prometheus ~]# vi /prometheus/config/prometheus.yml
[root@prometheus ~]# cat /prometheus/config/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Create the container from the official prometheus image
[root@prometheus ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
prom/prometheus latest a3d385fc29f9 11 days ago 201MB
[root@prometheus ~]# docker run --name prometheus -d --restart always -p 9090:9090 -v /prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro prom/prometheus:latest
268b771e2b8ad9680cd33ab48d7789de1e0da8f28b37f59acfe1bb32ea0235f2
[root@prometheus ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
268b771e2b8a prom/prometheus:latest "/bin/prometheus --c…" 7 seconds ago Up 5 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp prometheus
//Check the listening ports
[root@prometheus ~]# ss -antl
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:9090 0.0.0.0:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 [::]:9090 [::]:*
LISTEN 0 128 [::]:22 [::]:*
Browse to http://192.168.129.205:9090/targets to verify.
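Besides the browser, the deployment can be checked from the shell. Both the health endpoint and the promtool binary ship with stock Prometheus; the container name and config path below assume the `docker run` command used above:

```shell
# Liveness endpoint of the running server
curl -s http://192.168.129.205:9090/-/healthy

# Validate the mounted config with the promtool binary bundled in the image
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
```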
Deploy node_exporter
Preparation
node_exporter-1.3.0.linux-amd64.tar.gz
//Set the hostname
[root@localhost ~]# hostnamectl set-hostname node-exporter
[root@localhost ~]# bash
//Disable the firewall and SELinux
[root@node-exporter ~]# systemctl disable --now firewalld.service
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@node-exporter ~]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
[root@node-exporter ~]# setenforce 0
[root@node-exporter ~]# reboot
//Download the release tarball
[root@node-exporter ~]# cd /usr/src/
[root@node-exporter src]# wget https://github.com/prometheus/node_exporter/releases/download/v1.3.0/node_exporter-1.3.0.linux-amd64.tar.gz
[root@node-exporter src]# ls
debug kernels node_exporter-1.3.0.linux-amd64.tar.gz
//Extract and rename
[root@node-exporter src]# tar -xf node_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
[root@node-exporter src]# mv /usr/local/node_exporter-1.3.0.linux-amd64/ /usr/local/node_exporter
[root@node-exporter src]# ls /usr/local/
bin etc games include lib lib64 libexec node_exporter sbin share src
//Create the systemd unit file
[root@node-exporter ~]# cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=The node_exporter Server
After=network.target
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
RestartSec=15s
SyslogIdentifier=node_exporter
[Install]
WantedBy=multi-user.target
EOF
[root@node-exporter ~]# systemctl daemon-reload
//Enable and start the service
[root@node-exporter ~]# systemctl enable --now node_exporter
Created symlink from /etc/systemd/system/multi-user.target.wants/node_exporter.service to /usr/lib/systemd/system/node_exporter.service.
[root@node-exporter ~]# systemctl status node_exporter
● node_exporter.service
Loaded: loaded (/usr/lib/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
Active: active (running) since 四 2021-12-30 00:24:44 CST; 1s ago
Main PID: 62288 (node_exporter)
Tasks: 4
Memory: 6.5M
CGroup: /system.slice/node_exporter.service
└─62288 /usr/local/node_exporter/node_exporter
//Check the listening ports
[root@node-exporter ~]# ss -antl
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 *:9100 *:*
LISTEN 0 128 [::]:22 [::]:*
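A quick functional check is to fetch a metric over HTTP; node_exporter serves plain-text metrics on port 9100 (assuming the service above is running locally):

```shell
# node_load1 is a standard node_exporter metric; a healthy exporter returns one sample line
curl -s http://localhost:9100/metrics | grep '^node_load1'
```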
Add the node to Prometheus
Edit the Prometheus config file prometheus.yml under /prometheus/config
[root@prometheus ~]# cd /prometheus/config/
[root@prometheus config]# vi prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node1" # job name for the new node
    static_configs:
      - targets: ["192.168.129.33:9100"] # node_exporter IP address and port
//Restart the prometheus container
[root@prometheus ~]# docker restart prometheus
prometheus
[root@prometheus ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
268b771e2b8a prom/prometheus:latest "/bin/prometheus --c…" 42 minutes ago Up 5 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp prometheus
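Restarting the container works, but Prometheus can also reload its configuration in place, which avoids a gap in scraping. A hedged alternative:

```shell
# SIGHUP asks Prometheus to re-read prometheus.yml without restarting the process
docker kill --signal=HUP prometheus

# Or, if the container was started with --web.enable-lifecycle:
# curl -X POST http://192.168.129.205:9090/-/reload
```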
Browse to http://192.168.129.205:9090/targets to confirm the new target is up.
View the metrics
Open http://192.168.129.205:9090/metrics in the browser.
Deploy the Grafana dashboarding tool
Pull the official grafana/grafana image
[root@prometheus ~]# docker pull grafana/grafana
Using default tag: latest
latest: Pulling from grafana/grafana
97518928ae5f: Pull complete
5b58818b7f48: Pull complete
d9a64d9fd162: Pull complete
4e368e1b924c: Pull complete
867f7fdd92d9: Pull complete
387c55415012: Pull complete
07f94c8f51cd: Pull complete
ce8cf00ff6aa: Pull complete
e44858b5f948: Pull complete
4000fdbdd2a3: Pull complete
Digest: sha256:18d94ae734accd66bccf22daed7bdb20c6b99aa0f2c687eea3ce4275fe275062
Status: Downloaded newer image for grafana/grafana:latest
docker.io/grafana/grafana:latest
[root@prometheus ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
prom/prometheus latest a3d385fc29f9 11 days ago 201MB
grafana/grafana latest 9b957e098315 2 weeks ago 275MB
Run the container from the official grafana image
[root@prometheus ~]# docker run -d --name grafana -p 3000:3000 --restart always grafana/grafana
32239d6bf896ca59c12df52d070865a007a2ea44de43149d1cacdd7582a5d2b0
[root@prometheus ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
32239d6bf896 grafana/grafana "/run.sh" 30 seconds ago Up 28 seconds 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp grafana
268b771e2b8a prom/prometheus:latest "/bin/prometheus --c…" 54 minutes ago Up 6 minutes 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp prometheus
[root@prometheus ~]# ss -antl
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:3000 0.0.0.0:*
LISTEN 0 128 0.0.0.0:9090 0.0.0.0:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 [::]:3000 [::]:*
LISTEN 0 128 [::]:9090 [::]:*
LISTEN 0 128 [::]:22 [::]:*
Browse to http://192.168.129.205:3000.
- Default account: admin, password: admin
Enter the password again (the first login asks you to change it)
The home page
Configure the data source
Import a dashboard template
Dashboard template download page
- Download the JSON file locally and import it via Upload
- Or type the dashboard ID and click Load to fetch the template automatically
The result
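Grafana's health can also be confirmed from the shell; /api/health is a standard unauthenticated Grafana endpoint (IP as above):

```shell
# Returns a small JSON document with the database status and version
curl -s http://192.168.129.205:3000/api/health
```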
Deploy cAdvisor
[root@node-exporter ~]# docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
google/cadvisor
ac1ef6a06434feaf84272a454a95c64178eeac37b6d57d6aa544f4d0cd744418
Access
Browse to the node-exporter host IP on port 8080.
Add the cAdvisor target to Prometheus
[root@prometheus ~]# vi /prometheus/config/prometheus.yml
[root@prometheus ~]# cat /prometheus/config/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node1" # job name for the node_exporter host
    static_configs:
      - targets: ["192.168.129.33:9100"] # node_exporter IP address and port
  - job_name: "docker"
    static_configs:
      - targets: ["192.168.129.33:8080"] # cAdvisor IP address and port
//Restart prometheus
[root@prometheus ~]# docker restart prometheus
prometheus
Verify in the browser
The containers are now being monitored.
So far the dashboards only cover host resources; to visualize Docker container metrics,
find a Docker-oriented dashboard template on the Grafana site and import it.
Template 11600 works well for Docker.
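The same container metrics can also be queried directly over the Prometheus HTTP API; a sketch using a standard cAdvisor metric name (IP and port as above):

```shell
# Per-container CPU usage rate over the last 5 minutes, as JSON
curl -s 'http://192.168.129.205:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])'
```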
Configure Alertmanager alerting
Alertmanager and Prometheus work together as follows:
- Prometheus collects the monitoring data
- prometheus.yml references rule files, which define the alert conditions
- Prometheus pushes firing alerts to Alertmanager, which is configured with the senders and recipients
- Alertmanager delivers the alert to a mailbox or WeChat
Alert severity levels
Start and configure Alertmanager
[root@prometheus ~]# docker pull prom/alertmanager:latest
latest: Pulling from prom/alertmanager
aa2a8d90b84c: Pull complete
b45d31ee2d7f: Pull complete
e64c3c57ffe7: Pull complete
7665a4a59238: Pull complete
9a345be9cdfe: Pull complete
aa42aae1183b: Pull complete
Digest: sha256:9ab73a421b65b80be072f96a88df756fc5b52a1bc8d983537b8ec5be8b624c5a
Status: Downloaded newer image for prom/alertmanager:latest
docker.io/prom/alertmanager:latest
[root@prometheus ~]# mkdir /alertmanager
//Copy the default config out of the image via a temporary container (docker cp needs an existing container)
[root@prometheus ~]# docker create --name alertmanager-tmp prom/alertmanager
[root@prometheus ~]# docker cp alertmanager-tmp:/etc/alertmanager/alertmanager.yml /alertmanager
[root@prometheus ~]# docker rm alertmanager-tmp
[root@prometheus ~]# ls /alertmanager/
alertmanager.yml
[root@prometheus ~]# docker run --name alertmanager -d -v /alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -p 9093:9093 --restart always prom/alertmanager
4afabc63c74ea8372035878ef2ca1f84b289450e585efd3efdb7a87bfcc6c4bb
Alertmanager listens on port 9093 by default. Once it is up, browse to http://<IP>:9093.
You will see the default UI, with no alerts yet, because no alerting rules have been configured to trigger them.
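Like Prometheus, Alertmanager exposes a health endpoint that can be checked from the shell (IP as above):

```shell
# A healthy Alertmanager answers with a short OK message
curl -s http://192.168.129.205:9093/-/healthy
```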
Configure email alerting in Alertmanager
The default Alertmanager config file is alertmanager.yml, located inside the container at /etc/alertmanager/alertmanager.yml. The defaults are:
[root@prometheus ~]# cat /alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
The main sections are:
- global: global settings, including the resolve timeout, the SMTP settings, and API addresses for the various notification channels.
- route: the alert routing policy; it is a tree, matched depth-first from left to right.
- receivers: the alert recipients, e.g. the common email, wechat, slack, and webhook notification channels.
- inhibit_rules: inhibition rules; while an alert matching the source exists, alerts matching the target are suppressed.
Now let's configure email notification, using a QQ mailbox as the example:
global:
  resolve_timeout: 5m
  # SMTP server
  smtp_from: 'xxxxxxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxxxxxx@qq.com'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
# routing tree
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
# recipients
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxxxx@qq.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
- smtp_smarthost: the QQ mailbox SMTP server, officially smtp.qq.com on port 465 or 587; the mailbox must also have the POP3/SMTP service enabled.
- smtp_auth_password: the authorization code for third-party SMTP access, not the QQ account password (the login password will fail); QQ Mail displays the code when you enable the POP3/SMTP service.
- smtp_require_tls: whether to use TLS; choose per your environment. If you see the error "email.loginAuth failed: 530 Must issue a STARTTLS command first", set it to true. Note that with TLS enabled, the error "starttls failed: x509: certificate signed by unknown authority" means you need to add insecure_skip_verify: true under email_configs to skip TLS verification.
To set up email alerting, first enable the SMTP service on the mailbox and obtain the authorization code.
Edit the alertmanager.yml file:
[root@prometheus ~]# cat /alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '2464146612@qq.com' # sender address
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '2464146612@qq.com'
  smtp_auth_password: 'rgxloknrxwpyebcc'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '2464146612@qq.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Add Alertmanager and alerting rules to Prometheus
Next, configure the Alertmanager address and an alerting rule in Prometheus. Create the rule file node_up.rules as follows:
[root@prometheus ~]# mkdir /prometheus/rules && cd /prometheus/rules
[root@prometheus rules]# vi node_up.rules
[root@prometheus rules]# cat node_up.rules
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="node1"} == 0 # "node1" is the job_name defined in prometheus.yml
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
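Before reloading, it is worth confirming the rule file parses. A minimal sketch on a scratch copy, assuming PyYAML is available (promtool check rules is the proper tool when it is on the PATH, since it also validates the PromQL):

```shell
# Scratch copy of the rule file (quoted heredoc so the Go template is not expanded)
cat > /tmp/node_up.rules << 'EOF'
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="node1"} == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
EOF
# Basic well-formedness check: fails loudly on indentation mistakes
python3 -c "import yaml; yaml.safe_load(open('/tmp/node_up.rules')); print('OK')"
```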
Edit prometheus.yml to add the rule files:
[root@prometheus ~]# vi /prometheus/config/prometheus.yml
...
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 192.168.129.205:9093 # Alertmanager address and port
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/prometheus/rules/*.rules"
  # - "first_rules.yml"
  # - "second_rules.yml"
...
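Note that the Prometheus container was created with only prometheus.yml mounted, so /usr/local/prometheus/rules does not exist inside it, and a plain docker restart will not pick up the rule file. One way to fix this (paths chosen here to match the config above) is to recreate the container with the rules directory also mounted:

```shell
# Recreate the container; --restart always and the config mount as before,
# plus a read-only mount of the host rules directory at the rule_files path
docker rm -f prometheus
docker run --name prometheus -d --restart always -p 9090:9090 \
  -v /prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /prometheus/rules:/usr/local/prometheus/rules:ro \
  prom/prometheus:latest
```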
//Restart the service
[root@prometheus ~]# docker restart prometheus
prometheus
Check the rule in the Prometheus UI.
A Prometheus alert has three states: Inactive, Pending, and Firing.
- Inactive: the rule is being evaluated but nothing has triggered.
- Pending: the alert condition has been met. Because alerts can be grouped, inhibited, or silenced, the alert waits for those checks; once they all pass, it moves to Firing.
- Firing: the alert is sent to Alertmanager, which delivers it to all configured receivers. Once the alert resolves, the state returns to Inactive, and the cycle repeats.
Trigger an alert email
The rule above watches whether the job="node1" node is alive, so stopping the node_exporter service simulates the node going down and triggers the alert.
[root@node-exporter ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ac1ef6a06434 google/cadvisor "/usr/bin/cadvisor -…" 2 hours ago Up 2 hours 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp cadvisor
[root@node-exporter ~]# systemctl stop node_exporter
After stopping the service, wait about 15s: the node_exporter target turns unhealthy on the Prometheus targets page. After another 15s, the Alerts page goes from the green node-up (0 active) Inactive state to the yellow node-up (1 active) Pending state; 15s later it turns red Firing, the alert is pushed to Alertmanager, and Alertmanager emails the recipients according to its configuration.
Check the mailbox
Configure a custom email template in Alertmanager
The alert above works, but the message could be more readable.
Alertmanager supports custom email templates.
First create a template file email.tmpl:
[root@prometheus ~]# mkdir -p /prometheus/alertmanager/template && cd /prometheus/alertmanager/template
[root@prometheus template]# vim email.tmpl
{{ define "email.from" }}2464146612@qq.com{{ end }}
{{ define "email.to" }}2464146612@qq.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Failing host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
The template file above defines three template variables, email.from, email.to, and email.to.html, which can be referenced directly from alertmanager.yml. email.to.html is the email body; both HTML and plain text are supported, and HTML is used here for nicer formatting. {{ range .Alerts }} is a loop that iterates over the matched alerts; the fields shown are the same core values as in the default email. (Note that Go's time formatting requires the reference time "2006-01-02 15:04:05" as the layout string, not an arbitrary date.) Next, add a templates section to alertmanager.yml:
[root@prometheus ~]# vi /alertmanager/alertmanager.yml
[root@prometheus ~]# cat /alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '{{ template "email.from" . }}' # sender address
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '{{ template "email.from" . }}'
  smtp_auth_password: 'rgxloknrxwpyebcc'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/etc/alertmanager-tmpl/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '{{ template "email.to" . }}'
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Then recreate the Alertmanager container so the local email.tmpl file is mounted into the container at the path referenced by templates:
[root@prometheus ~]# docker rm -f alertmanager
[root@prometheus ~]# docker run --name alertmanager -d -v /alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /prometheus/alertmanager/template/email.tmpl:/etc/alertmanager-tmpl/email.tmpl -p 9093:9093 --restart always prom/alertmanager
Once the container is back up, trigger the alert condition again (stop the node_exporter service); the templated email is delivered as expected.
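Finally, the mounted Alertmanager configuration can be sanity-checked with amtool, which ships in the prom/alertmanager image (container name as above):

```shell
# Validates alertmanager.yml, including the templates and receiver sections
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```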