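Day07 - Prometheus Monitoring Project
9. Site-wide monitoring with Prometheus
9.1 Preliminary communication
- Agree on the metrics to monitor.
- Developers may need to help by writing status pages.
- Other communication…
9.2 Environment preparation
- Monitoring targets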
What to monitor | Exporter | Hosts involved |
---|---|---|
Basic system information | node_exporter | All hosts |
Load balancer, web | nginx_exporter | Load balancers, web servers |
Web middleware: PHP, Java | jmx_exporter (JMX) | Web servers |
Database | mysqld_exporter | Database servers |
Redis | redis_exporter | Cache servers |
Storage | xxx_exporter | NFS (custom), object storage (OSS), Ceph, MinIO |
9.3 Setup procedure
1) nginx_exporter
- Preparation
Requesting the configured URI and port returns the nginx stub_status page.
[root@web01 ~]# cat /etc/nginx/conf.d/status.conf
server {
listen 8000;
location / {
stub_status;
}
}
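- Deploy nginx_exporter
nginx_exporter can be downloaded and deployed directly, in the same way as node_exporter:
https://github.com/nginxinc/nginx-prometheus-exporter
- Run a test environment in a container (simulating the web server and LB environment)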
[root@docker01 ~]# mkdir -p /app/project/ngx
[root@docker01 ~]# cd /app/project/ngx
[root@docker01 ngx]# ll
total 0
[root@docker01 ngx]# vim status.conf
server {
listen 8000;
location / {
stub_status;
}
}
[root@docker01 ngx]# docker run -d --name "ngx" \
-v `pwd`/status.conf:/etc/nginx/conf.d/status.conf \
--restart=always \
-p 80:80 \
-p 8000:8000 \
nginx:1.22-alpine
# --rm is normally used for throwaway tests and conflicts with --restart
# Check the test environment
[root@docker01 ngx]# curl 172.16.1.81:8000/
Active connections: 1
server accepts handled requests
1 1 1
Reading: 0 Writing: 1 Waiting: 0
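The stub_status page is plain text, so it can be checked from a script before the exporter is involved. The sketch below is only an illustration: the function name `parse_stub_status` is hypothetical, and the sample text is the curl output shown above.

```shell
#!/usr/bin/env bash
# Read stub_status text on stdin and print:
#   active_connections accepts handled requests
parse_stub_status() {
    awk '
        /^Active connections:/ { active = $3 }
        # The counters line holds exactly three integers
        /^ *[0-9]+ +[0-9]+ +[0-9]+ *$/ { accepts = $1; handled = $2; requests = $3 }
        END { print active, accepts, handled, requests }
    '
}

# Sample copied from the curl check above
sample='Active connections: 1
server accepts handled requests
 1 1 1
Reading: 0 Writing: 1 Waiting: 0'

printf '%s\n' "$sample" | parse_stub_status
```

Against the live test container the same function could be fed with `curl -s http://172.16.1.81:8000/ | parse_stub_status`.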
- Run nginx_exporter
[root@docker01 ngx]# docker pull nginx/nginx-prometheus-exporter:0.10.0
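# Run the container and pass the URL to scrape (host + port + URI)
# The default is "http://127.0.0.1:8080/stub_status"; 127.0.0.1 is the IP inside the container, not the host's IP.
# Foreground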
[root@docker01 ngx]# docker run -it --rm -p 9113:9113 nginx/nginx-prometheus-exporter:0.10.0 -nginx.scrape-uri "http://172.16.1.81:8000/"
2024/05/11 03:25:58 Starting NGINX Prometheus Exporter version=0.10.0 commit=7a03d0314425793cf4001f0d9b0b2cfd19563433 date=2021-12-21T19:24:34Z
2024/05/11 03:25:58 Listening on :9113
2024/05/11 03:25:58 NGINX Prometheus Exporter has successfully started
# Open in a browser
http://10.0.0.81:9113/metrics
# Background (detached)
docker run -d --name "ngx_exporter_8000" -p 9113:9113 \
--restart=always \
nginx/nginx-prometheus-exporter:0.10.0 \
-nginx.scrape-uri "http://172.16.1.81:8000/"
- Update the Prometheus server configuration
Just add a job section to the server configuration file
- job_name: "web_lb_ngx_exporter"
static_configs:
- targets:
- "10.0.0.81:9113"
# Restart
[root@m04-prometheus prometheus]# systemctl restart prometheus.service
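- Find a Grafana dashboard template
- Template ID: 9512
- Custom dashboard
- Add a job variable
label_values(LABEL_NAME) # generic form: any Prometheus label name
label_values(job) # returns all jobs
- Create an instance variable
label_values(instance) # returns all instances
# To tie the instance list to the selected job, use this instead
label_values(up{job="$job"},instance)
- Add panels
- Summary:
- Have a matching environment in place
- Deploy the exporter (in Docker or directly on the host)
- Check that the exporter returns data: 10.0.0.81:9113/metrics
- Configure the Prometheus server to scrape the exporter
- Configure Grafana (an existing dashboard template, or a custom dashboard with job and instance variables)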
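The "check that the exporter returns data" step can be scripted by reading one metric out of the text exposition format. This is a minimal sketch: `metric_value` is a hypothetical helper, it only handles metrics without labels, and the sample lines are illustrative rather than captured from the exporter above (`nginx_up` and `nginx_connections_active` are names exposed by nginx-prometheus-exporter).

```shell
#!/usr/bin/env bash
# Print the value of an unlabelled metric from Prometheus text-format
# input on stdin. Returns non-zero if the metric is not present.
metric_value() {
    local name="$1"
    awk -v name="$name" '
        $1 == name { print $2; found = 1; exit }
        END { exit !found }
    '
}

# Illustrative sample of exporter output
sample='# HELP nginx_connections_active Active client connections
# TYPE nginx_connections_active gauge
nginx_connections_active 1
# HELP nginx_up Status of the last metric scrape
# TYPE nginx_up gauge
nginx_up 1'

printf '%s\n' "$sample" | metric_value nginx_up
```

On the real host this could be run as `curl -s http://10.0.0.81:9113/metrics | metric_value nginx_up`; a value of 1 means the exporter reached the stub_status page.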
2) db_exporter
https://github.com/prometheus/mysqld_exporter
- Create the exporter user in the database
[root@db01 ~]# mysql -uroot -poldboy123
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
CREATE USER 'exporter'@'172.%' IDENTIFIED BY 'exporter123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'172.%';
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
CREATE USER 'exp'@'%' IDENTIFIED BY 'exporter123';
GRANT ALL ON *.* TO 'exp'@'%';
- Configure mysqld_exporter
[root@docker01 db]# cat my.cnf
[client]
user = exporter
password = exporter123
# Set the variable used to connect to the database
export DATA_SOURCE_NAME='login:password@(hostname:port)/'
export DATA_SOURCE_NAME='exporter:exporter123@(172.16.1.51:3306)/'
# When running in a container, pass it with
-e or --env
- Start the mysqld_exporter container
docker run -d --restart=always --name "mysqld_exporter" \
-v `pwd`/my.cnf:/etc/exporter-my.cnf \
-e DATA_SOURCE_NAME='exporter:exporter123@(172.16.1.51:3306)/' \
-p 9104:9104 \
prom/mysqld-exporter \
--config.my-cnf=/etc/exporter-my.cnf
[root@docker01 db]# docker logs mysqld_exporter
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:277 msg="Starting msqyld_exporter" version="(version=0.13.0, branch=HEAD, revision=ad2847c7fa67b9debafccd5a08bacb12fc9031f1)"
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:278 msg="Build context" (gogo1.16.4,userroot@e2043849cb1f,date20210531-07:30:16)=(MISSING)
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=global_status
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=global_variables
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=slave_status
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=info_schema.innodb_cmp
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=info_schema.innodb_cmpmem
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:293 msg="Scraper enabled" scraper=info_schema.query_response_time
level=info ts=2024-05-11T06:45:13.829Z caller=mysqld_exporter.go:303 msg="Listening on address" address=:9104
level=info ts=2024-05-11T06:45:13.829Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
# Open in a browser
http://10.0.0.81:9104
Just add a job section to the server configuration file
- job_name: "db_mysqld_exporter"
static_configs:
- targets:
- "10.0.0.81:9104"
# Restart
[root@m04-prometheus prometheus]# systemctl restart prometheus.service
- The official image had problems here, so a custom image is built with a Dockerfile
[root@docker01 db]# cat Dockerfile
FROM alpine:latest
LABEL author=lidao996
ADD mysqld_exporter-0.15.1.linux-amd64.tar.gz /app/tools/
COPY my.cnf /app/tools/mysqld_exporter-0.15.1.linux-amd64/
ENV DATA_SOURCE_NAME='exporter:exporter123@(172.16.1.51:3306)/'
RUN ln -s /app/tools/mysqld_exporter-0.15.1.linux-amd64 /app/tools/mysqld_exporter
WORKDIR /app/tools/mysqld_exporter-0.15.1.linux-amd64/
EXPOSE 9104
CMD ["./mysqld_exporter","--config.my-cnf=./my.cnf"]
[root@docker01 db]# docker build -t db:db_exporter_v1 .
[root@docker01 db]# docker run -d -p 9104:9104 db:db_exporter_v1
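- The package is available in the class group chat or from the official site: https://prometheus.io/
10. Alertmanager alerting
- Provides the alerting function.
- Workflow:
- Deploy Alertmanager (on the Prometheus server)
- Edit the Alertmanager configuration
- Write the alerting rules and update the Prometheus server configuration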
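10.1 Alertmanager configuration
- Alertmanager
- Deploy the Alertmanager service
- Configure the notification channels: email, third-party platforms
- Prometheus server
- Configure rules (similar to Zabbix triggers) that call Alertmanager to send alerts
1) Deploy Alertmanager
#1. Extract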
[root@m04-prometheus tools]# tar xf alertmanager-0.24.0.linux-amd64.tar.gz -C /app/tools/
#2. Create symlinks
[root@m04-prometheus tools]# ln -s /app/tools/alertmanager-0.24.0.linux-amd64/ /app/tools/alertmanager
[root@m04-prometheus tools]# ln -s /app/tools/alertmanager/alertmanager /bin/
#3. Verify
[root@m04-prometheus tools]# alertmanager --version
alertmanager, version 0.24.0 (branch: HEAD, revision: f484b17fa3c583ed1b2c8bbcec20ba1db2aa5f11)
build user: root@265f14f5c6fc
build date: 20220325-09:31:33
go version: go1.17.8
platform: linux/amd64
#4. Start
[root@m04-prometheus tools]# alertmanager --config.file=/app/tools/alertmanager/alertmanager.yml
2) Configure Alertmanager to use a third-party platform
- Get an API key from the aiops platform
c9ce580c3473404f9773a230365994d3
[root@m04-prometheus alertmanager]# cat alertmanager.yml
route:
group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://api.aiops.com/alert/api/event/prometheus/c9ce580c3473404f9773a230365994d3'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
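- The following configures email alerting
Alertmanager configuration explained
global: global settings; configures the sender.
resolve_timeout: 5m how long Alertmanager waits without an update before it declares an alert resolved.
smtp_from: sender address
smtp_smarthost: SMTP server (host:port)
smtp_hello: qq.com 163.com hostname announced to the SMTP server, typically the mail provider's domain.
smtp_auth_username: mailbox account name
smtp_auth_password: authorization code
smtp_require_tls: false
route: configures grouping, notification intervals, and which receiver is used.
group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h interval between repeated notifications, e.g. an alert sent at 11:00 is sent again at 12:00 if it is still firing.
receiver: 'email' name of the receiver that handles the alert.
- Complete configuration file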
[root@m05-prometheus /app/prometheus/alertmanager]# cat alertmanager.yml
global:
resolve_timeout: 5m
smtp_from: 'lidao996@163.com'
smtp_smarthost: 'smtp.163.com:465'
smtp_hello: '163.com'
smtp_auth_username: 'lidao996@163.com'
smtp_auth_password: 'MMNKQYHMJON'
smtp_require_tls: false
route:
group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'email'
receivers:
- name: "email"
email_configs:
- to: 'youjiu_linux@qq.com'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
- End of the email alerting configuration
Tip: for more options, see https://prometheus.io/docs/alerting/latest/configuration/
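To test a receiver without waiting for Prometheus to fire a rule, an alert can be posted to Alertmanager by hand through its v2 HTTP API. The helper below only builds the JSON body; the name `alert_payload` and all argument values are hypothetical, and it is a sketch that assumes the values contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Build a JSON array holding one alert for POST /api/v2/alerts.
# Arguments: alertname, severity, instance, summary.
# Values are inserted as-is, so they must not contain " or \ characters.
alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]\n' \
        "$1" "$2" "$3" "$4"
}

alert_payload "manual_test" "warning" "10.0.0.81:9100" "manual test alert"
```

Assuming Alertmanager listens on its default port on the local host, the body could be sent with `alert_payload manual_test warning 10.0.0.81:9100 "manual test alert" | curl -s -XPOST -H 'Content-Type: application/json' --data-binary @- http://localhost:9093/api/v2/alerts`. The alert should then appear on the Alertmanager web page and be routed to the configured receiver after group_wait.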
10.2 Prometheus configuration
- Enable alerting.
- Configure the alerting rules.
- Prometheus server configuration file
[root@m04-prometheus prometheus]# cat prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- "prom.oldboylinux.cn:9093" # Alertmanager address
rule_files:
- "/app/tools/prometheus/alerts_check_node.yml" # rule file
scrape_configs:
- job_name: "prometheus_server"
static_configs:
- targets: ["localhost:9090"]
# - job_name: "basic_info_node_exporter"
# static_configs:
# - targets:
# - "prom.oldboylinux.cn:9100"
# - "gra.oldboylinux.cn:9100"
- job_name: "basic_info_node_exporter_discovery"
file_sd_configs:
- files:
- /app/tools/prometheus/discovery_node_exporter.json
refresh_interval: 5s
- job_name: "pushgateway"
static_configs:
- targets:
- "gra.oldboylinux.cn:9091"
- job_name: "web_lb_ngx_exporter"
static_configs:
- targets:
- "10.0.0.81:9113"
- job_name: "db_mysqld_exporter"
static_configs:
- targets:
- "10.0.0.81:9104"
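The basic_info_node_exporter_discovery job reads its targets from discovery_node_exporter.json, whose contents are not shown in these notes. As a hedged sketch, a file in the file_sd JSON format (a list of objects with "targets" and optional "labels") could be generated as follows. The function name `make_file_sd` and the "env" label are assumptions for illustration; the two host names come from the commented-out static job above.

```shell
#!/usr/bin/env bash
# Print a file_sd target list (JSON) for the host:port arguments given.
# The "env" label and its value are illustrative only.
make_file_sd() {
    local first=1 t
    printf '[\n  {\n    "targets": ['
    for t in "$@"; do
        # Separate entries with a comma, but not before the first one
        if [ "$first" -eq 1 ]; then first=0; else printf ', '; fi
        printf '"%s"' "$t"
    done
    printf '],\n    "labels": { "env": "demo" }\n  }\n]\n'
}

make_file_sd "prom.oldboylinux.cn:9100" "gra.oldboylinux.cn:9100"
```

Redirecting the output to /app/tools/prometheus/discovery_node_exporter.json would let Prometheus pick up the targets on its next refresh, without a restart.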
- Prometheus alerting rule file
[root@m04-prometheus prometheus]# cat alerts_check_node.yml
groups:
- name: check_node_status
rules:
- alert: check_node_is_up
# Expression that detects the failure
expr: up == 0
for: 15s
labels:
severity: 1
team: node
annotations:
summary: " {{ $labels.instance }} has been down for more than 15 seconds!"
[root@m04-prometheus prometheus]# systemctl restart prometheus.service
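A syntax error in prometheus.yml or in the rule file stops the service from starting, so it can help to validate both before restarting. The sketch below uses promtool, which ships in the Prometheus release tarball. It assumes promtool is on PATH and that the files live in the directory used in these notes; `check_and_restart` and `PROM_DIR` are hypothetical names.

```shell
#!/usr/bin/env bash
# Directory holding prometheus.yml and the rule file (assumed from the
# paths used in these notes; override by exporting PROM_DIR).
PROM_DIR="${PROM_DIR:-/app/tools/prometheus}"

# Validate the main config and the rule file; restart only if both pass.
check_and_restart() {
    promtool check config "${PROM_DIR}/prometheus.yml" || return 1
    promtool check rules "${PROM_DIR}/alerts_check_node.yml" || return 1
    systemctl restart prometheus.service
}

# Usage: check_and_restart
```

If either check fails, promtool prints the reason and the running service is left untouched.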
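- Check the rules
- Cause a failure
- The Alertmanager page in the browser shows the alert: http://10.0.0.64:9093/#/alerts
- Third-party platform alerts
- Configure dispatching
- Add alert rules
- Alerts shown on the platform
- Check the results
- If no alert email arrives, check the Alertmanager configuration, the Prometheus configuration, and the rules file.
10.3 Summary
- Write expr expressions (PromQL filter statements) that select exactly the data you want to alert on.
- Test these expressions in the Prometheus web UI.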
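11. Site-wide monitoring: supplement
- The Prometheus server collects all the metrics.
- Data is obtained through the various exporters.
- Custom metrics can be pushed through pushgateway.
- Alerting uses Alertmanager. An alert rule is essentially an expr (a PromQL expression).
- For display, find a Grafana dashboard and adapt it.
Exporter selection | |
---|---|
Basic information (system monitoring) | node_exporter |
Custom monitoring | pushgateway |
Database | mysqld_exporter |
nginx | nginx_exporter (third-party project on GitHub) |
Containers | cadvisor |
… |
- Monitoring containers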
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
google/cadvisor:latest
--privileged runs the container in Docker's privileged mode.
- Prometheus server configuration file
[root@m04-prometheus prometheus]# cat prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- "prom.oldboylinux.cn:9093"
rule_files:
- "/app/tools/prometheus/alerts_check_node.yml"
scrape_configs:
- job_name: "prometheus_server"
static_configs:
- targets: ["localhost:9090"]
# - job_name: "basic_info_node_exporter"
# static_configs:
# - targets:
# - "prom.oldboylinux.cn:9100"
# - "gra.oldboylinux.cn:9100"
- job_name: "basic_info_node_exporter_discovery"
file_sd_configs:
- files:
- /app/tools/prometheus/discovery_node_exporter.json
refresh_interval: 5s
- job_name: "pushgateway"
static_configs:
- targets:
- "gra.oldboylinux.cn:9091"
- job_name: "web_lb_ngx_exporter"
static_configs:
- targets:
- "10.0.0.81:9113"
- job_name: "db_mysqld_exporter"
static_configs:
- targets:
- "10.0.0.81:9104"
- job_name: "docker_cadvisor" # cAdvisor job
static_configs:
- targets:
- "10.0.0.81:8080"
# Restart the server
[root@m04-prometheus prometheus]# systemctl restart prometheus.service
- Open in a browser: http://10.0.0.81:8080/containers/
- View the metrics in Prometheus
- Grafana dashboard template ID: 10619
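Before building dashboards it is worth confirming that cAdvisor reports the containers you expect. The sketch below lists the distinct values of the `name` label found in metrics text. `container_names` is a hypothetical helper and the sample lines are illustrative, not captured output; it assumes cAdvisor attaches a `name` label to Docker container series.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and print the distinct
# values of the "name" label, one per line.
container_names() {
    # Require "{" or "," before name= so labels such as image_name= are skipped
    grep -o '[{,]name="[^"]*"' | cut -d'"' -f2 | sort -u
}

# Illustrative sample of cAdvisor output
sample='container_memory_usage_bytes{id="/docker/abc",image="nginx:1.22-alpine",name="ngx"} 1024
container_memory_usage_bytes{id="/docker/def",image="google/cadvisor:latest",name="cadvisor"} 2048
container_cpu_usage_seconds_total{cpu="total",id="/docker/abc",image="nginx:1.22-alpine",name="ngx"} 1.5'

printf '%s\n' "$sample" | container_names
```

On the real host the input could come from `curl -s http://10.0.0.81:8080/metrics | container_names`.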
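12. Summary
- The Prometheus monitoring architecture: understand how Prometheus and its related services fit together and what each service does.
- Site-wide monitoring built on Prometheus + Grafana.
- Deploying and using the various exporters
- node_exporter
- nginx_exporter
- mysqld_exporter
- cadvisor for container monitoring
- Static and dynamic (file-based discovery) target loading in the configuration
- pushgateway
- alertmanager
- grafana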