[Monitoring] Common Prometheus monitoring and alerting configuration for traditional environments

hh真是个慢性子

Last modified 2024-03-27 16:37:22

Tags: prometheus, operations monitoring, mysql, alerting, alertmanager, grafana

First published 2024-03-26 17:28:59

Article link: https://blog.csdn.net/weixin_45385457/article/details/137050935


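This monitoring stack is simple, but it looks complicated if you do not know the overall flow. Learn the order in which the pieces are configured first. Once you understand the framework, split the configuration into parts and look up each part as you go. The main obstacle is not knowing how each layer is configured, and therefore not knowing where to start. Skim a few books on the subject to understand the flow, write down your questions in configuration order, and by the time you have answered them one by one the stack will be running. Once the path works end to end, start refining the configuration. Practice makes perfect: until you have built it 20 times, do not claim you have learned it.

When learning something new, avoid perfectionism. Get a simplified version of the whole path working first, which does a lot for your confidence. Then study each component in depth. Finally, in light of the problems you meet in production, consider how well each feature fits your environment, and arrive at a configuration that suits your own company.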

Simplified diagram (image not preserved)

Server information:

Node    IP address   Services
node01  10.10.8.62   grafana, prometheus, alertmanager, node_exporter, mysqld_exporter
node02  10.10.8.63   node_exporter, MySQL instance (scraped through mysqld_exporter on node01)

Create a dedicated user and group

groupadd monitor
useradd -MN -s /sbin/nologin monitor  -g monitor

grafana

Installation

node01

#wget https://dl.grafana.com/oss/release/grafana-10.4.0.linux-amd64.tar.gz

cd /home/zcsadmin/
tar xf grafana-10.4.0.linux-amd64.tar.gz  -C /usr/local/
mv /usr/local/grafana-v10.4.0/  /usr/local/grafana

Configuration

node01

mkdir -p  /usr/local/grafana/data/{log,plugins,socket}

cp /usr/local/grafana/conf/defaults.ini /usr/local/grafana/conf/granfana.ini

chown -R monitor:monitor /usr/local/grafana/

sed -i 's#socket = /tmp/grafana.sock#socket = data/socket/grafana.sock#g' /usr/local/grafana/conf/granfana.ini
sed -i 's#en-US#zh-CN#g' /usr/local/grafana/conf/granfana.ini

Startup

node01

cat >/usr/lib/systemd/system/grafana.service<<'EOF'
[Unit]
Description=Grafana
After=network.target 


[Service]
User=monitor
Group=monitor
Environment="GRAFANA_HOME=/usr/local/grafana"
ExecStart=/usr/local/grafana/bin/grafana-server   --config=/usr/local/grafana/conf/granfana.ini --homepath=/usr/local/grafana
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl restart grafana
systemctl status  grafana
systemctl enable  grafana

Default credentials: admin/admin

prometheus

A collection of ready-made alerting rules. There is no need to write rules by hand: adapt these and use them.

https://github.com/samber/awesome-prometheus-alerts#-rules

https://samber.github.io/awesome-prometheus-alerts/

Installation

node01

#wget  https://github.com/prometheus/prometheus/releases/download/v2.50.1/prometheus-2.50.1.linux-amd64.tar.gz

cd /home/zcsadmin/
tar xf prometheus-2.50.1.linux-amd64.tar.gz  -C /usr/local/
mv /usr/local/prometheus-2.50.1.linux-amd64/ /usr/local/prometheus
cd /usr/local/prometheus

Configuration

node01

cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:

  scrape_interval: 15s     # How often targets are scraped. Set to 15s here; the default is 1m. Typical values are 10-60s.
  evaluation_interval: 15s # How often Prometheus evaluates rules. Set to 15s here.

alerting:
  alertmanagers:
    - static_configs:    # Static Alertmanager addresses; service discovery can be used instead
      - targets:         # Several addresses can be listed
        - 10.10.8.62:9093

# Alerting rule files
rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  # Linux hosts
  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - 'localhost:9100'


  # MySQL
  - job_name: 'mysqld-exporter'
    # Multi-target scrapes go through the exporter's /probe endpoint
    metrics_path: /probe
    static_configs:
      - targets:
        - localhost:3306
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104

EOF

# Create the directory for alerting rule files
mkdir /usr/local/prometheus/rules

# The systemd unit below sets neither --storage.tsdb.path nor WorkingDirectory,
# so the default data/ directory resolves to /data
mkdir -p /data

chown -R monitor:monitor /usr/local/prometheus/
chown -R monitor:monitor /data

Check the configuration

node01

/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
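
At this point the rules directory is still empty. As a minimal sketch of what a rule file looks like, the function below writes one rule that fires when any scrape target stops answering. The file name, group name, 1m duration and severity value are illustrative assumptions, not part of the original setup; `up` is the metric Prometheus records for every scrape.

```shell
#!/bin/sh
# Sketch: write a minimal alerting rule file into the given rules directory.
# Rule name, "for" duration and severity label are illustrative assumptions.
write_instance_down_rule() {
  rules_dir="$1"
  mkdir -p "$rules_dir"
  cat >"$rules_dir/instance-down.yml" <<'EOF'
groups:
- name: availability
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.job }} target {{ $labels.instance }} has been unreachable for more than 1 minute."
EOF
}
```

On node01 this could be called as `write_instance_down_rule /usr/local/prometheus/rules`, followed by `/usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/*.yml` to validate the file before reloading.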

Startup

node01

cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
  --config.file "/usr/local/prometheus/prometheus.yml" \
  --web.listen-address "0.0.0.0:9090" \
  --storage.tsdb.retention.time=1095d \
  --web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus

node01

# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
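
Since check and reload always go together, they can be wrapped so that a reload only happens when the check passes. This is a sketch rather than part of the original setup: the function name and the overridable `PROMTOOL`, `CURL`, `PROM_CONFIG` and `PROM_URL` variables are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: validate the configuration first, reload only if the check passes.
# All four variables can be overridden from the environment.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"
CURL="${CURL:-curl}"
PROM_CONFIG="${PROM_CONFIG:-/usr/local/prometheus/prometheus.yml}"
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

safe_reload() {
  if "$PROMTOOL" check config "$PROM_CONFIG"; then
    # -f makes curl return non-zero on an HTTP error status
    "$CURL" -fsS -X POST "$PROM_URL/-/reload"
  else
    echo "config check failed, not reloading" >&2
    return 1
  fi
}
```

Calling `safe_reload` after every edit keeps a broken prometheus.yml from ever reaching the running server.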

Configure the Prometheus data source in Grafana
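
Besides clicking through the UI, Grafana can also load data sources from provisioning files at startup. The sketch below writes such a file; the data source name and the localhost URL are assumptions for this setup, and the target directory relies on the `provisioning = conf/provisioning` default that comes with defaults.ini.

```shell
#!/bin/sh
# Sketch: provision the Prometheus data source from a file instead of the UI.
# The data source name and URL are assumptions for this two-node setup.
write_prometheus_datasource() {
  ds_dir="$1"
  mkdir -p "$ds_dir"
  cat >"$ds_dir/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

On node01 this would be `write_prometheus_datasource /usr/local/grafana/conf/provisioning/datasources`, then `chown -R monitor:monitor /usr/local/grafana/` and `systemctl restart grafana` so the file is picked up.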

alertmanager

Installation

node01

#wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz

cd /home/zcsadmin/
tar xf alertmanager-0.27.0.linux-amd64.tar.gz -C /usr/local
mv /usr/local/alertmanager-0.27.0.linux-amd64/ /usr/local/alertmanager

Configuration

node01

cat  >/usr/local/alertmanager/alertmanager.yml<<'EOF'
global:
  resolve_timeout: 5m

  # Email (SMTP)
  smtp_smarthost: 'mail.test.com:25'
  smtp_from: 'test@test.com'
  smtp_auth_username: 'test@test.com'
  smtp_auth_password: 'test@!QAZ'
  smtp_require_tls: false

  # WeCom (Enterprise WeChat)
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww2edb882dtest93222'      # Corp ID from the WeCom admin console

# Routing tree
route:
  # group_by: ['alertname'] # Group alerts by alert name
  group_wait: 1s      # How long to wait before sending the first notification for a new group
  group_interval: 1s  # How long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h # How long to wait before re-sending a notification for alerts that are still firing
  receiver: 'email_wechat'

# Receivers
receivers:
- name: 'email_wechat'
  # Email
  email_configs:
  - to: 'duyuhang@inmyshow.com'
    html: '{{ template "email.html" . }}'
    send_resolved: true

  # WeCom
  wechat_configs:
  - send_resolved: true
    api_secret: 'x7NQ305cPcR1dsdsHDSnW9oU_ioOaGqdsdsdsdsds6Oy4M'
    agent_id: '10000034'   # AgentId shown in the WeCom admin console
    message: '{{ template "wechat.message" . }}'
    to_party: '57'
    to_user: '@all'

# Notification template location
templates:
- '/usr/local/alertmanager/templates/*.tmpl'

# Inhibition rules
#inhibit_rules:
#- source_match:
#    severity: 'critical'
#  target_match:
#    severity: 'warning'
#  equal: ['alertname', 'dev', 'instance']
EOF

Creating the WeCom (Enterprise WeChat) bot: search online for instructions

A trusted IP must be configured: https://blog.csdn.net/weixin_45385457/article/details/132278442

Email template

node01

# Notification templates
mkdir /usr/local/alertmanager/templates
cat >/usr/local/alertmanager/templates/email.tmpl<<'EOF'
{{ define "email.html" }}
{{ range .Alerts }}
Summary: {{ .Annotations.summary }} <br>
Host: {{ .Labels.instance }} <br>
Source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Details: {{ .Annotations.description }} <br>
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
EOF

WeChat notification template

node01

cat >/usr/local/alertmanager/templates/wechat.tmpl<<'EOF'
{{ define "wechat.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
Alert: {{ .Labels.instance }} {{ .Annotations.summary }}
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
Resolved: {{ .Labels.instance }} {{ .Annotations.summary }}
Alert name: {{ .Labels.alertname }}
Status: {{ .Status }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
EOF

chown -R monitor:monitor /usr/local/alertmanager/

Startup

node01

cat >/usr/lib/systemd/system/alertmanager.service<<'EOF'
[Unit]
Description=alertmanager
After=network.target 

[Service]
User=monitor
Group=monitor
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

chown -R monitor:monitor /usr/local/alertmanager/
systemctl daemon-reload
systemctl restart alertmanager
systemctl status  alertmanager
systemctl enable  alertmanager
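
Once Alertmanager is running, the email and WeChat path can be exercised without waiting for a real problem by posting a hand-made alert to its v2 API. The sketch below only builds the JSON payload; the function name and the annotation texts are made up for this test.

```shell
#!/bin/sh
# Sketch: build a one-alert payload for Alertmanager's /api/v2/alerts endpoint.
# Annotation texts are placeholders for a manual test.
test_alert_payload() {
  alertname="$1"; severity="$2"; instance="$3"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"test alert","description":"manual test, safe to ignore"}}]\n' \
    "$alertname" "$severity" "$instance"
}

# Usage (not run here): post the payload to the local Alertmanager.
# test_alert_payload ManualTest warning node01 |
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#        -d @- http://localhost:9093/api/v2/alerts
```

If the receivers are configured correctly, a notification rendered with the templates above should arrive shortly after the request.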

Configure the data source in Grafana

node_exporter

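It must be installed on every server that is to be monitored.

node_exporter handles Linux system monitoring: add the node_exporter targets to the Prometheus configuration file and import a dashboard template into Grafana.

Installation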

node01 node02

#wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

cd /home/zcsadmin/
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

tar xf node_exporter-1.7.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/node_exporter-1.7.0.linux-amd64  /usr/local/node_exporter

Startup

node01 node02

cat >/usr/lib/systemd/system/node_exporter.service<<'EOF'
[Unit]
Description=node_exporter
After=network.target 

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl restart node_exporter
systemctl status  node_exporter
systemctl enable  node_exporter

Configuration

Grafana dashboard template to import:

https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Alerting rules

node01

cd  /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/host-and-hardware/node-exporter.yml

# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload

Verification

node01 node02

curl 'http://localhost:9100/metrics' |grep cpu
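
`grep` shows raw lines, comments included. When only the value of one metric is wanted, a small filter over the exposition text is enough. This is a sketch; the function name is an assumption, and the metric name must match the first field of the sample line exactly, labels included.

```shell
#!/bin/sh
# Sketch: print the value of the first sample whose name matches $1 exactly.
# Reads Prometheus text-format metrics from standard input.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage (not run here):
# curl -s http://localhost:9100/metrics | metric_value node_load1
```

Lines starting with `# HELP` or `# TYPE` are skipped automatically because their first field is `#`.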

mysqld_exporter

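It does not need to be installed on every monitored server. The procedure is:

1. Install mysqld_exporter on the Prometheus server
2. Create one shared MySQL credentials file for the exporter
3. Create the matching account in every MySQL instance to be monitored. Note: the account must be able to connect from the Prometheus server
4. Open the required firewall rules

Installation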

node01

#wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz

cd /home/zcsadmin/
tar xf mysqld_exporter-0.15.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/mysqld_exporter-0.15.1.linux-amd64  /usr/local/mysqld_exporter

Startup

node01

cat >/usr/lib/systemd/system/mysqld_exporter.service<<'EOF'
[Unit]
Description=mysqld_exporter
After=network.target 

[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/config.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl restart mysqld_exporter
systemctl status  mysqld_exporter
systemctl enable  mysqld_exporter

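Configuration

Install a test MySQL

node01 node02

yum install -y mariadb mariadb-server
systemctl start mariadb

An account for the exporter must be created in every MySQL instance to be monitored

node01 node02

# Create the account in the database. The host part must match the address the
# exporter connects from; a target of localhost:3306 connects from localhost, not 10.10.8.62
create user exporter@'10.10.8.62' identified by 'exportertest';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'10.10.8.62';

node01

# Create the MySQL credentials file used by mysqld_exporter
cat >/usr/local/mysqld_exporter/config.my.cnf<<'EOF'
[client]
user = exporter
password = exportertest
EOF

# mysqld_exporter was started before this file existed, so restart it
systemctl restart mysqld_exporter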
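
The credentials file can hold more than one section, and the `auth_module` scrape parameter selects which one the exporter uses for a probe. The sketch below writes a file with the default `[client]` section plus a second `[client.servers]` section and restricts it to its owner. The second password is a placeholder and the function name is an assumption.

```shell
#!/bin/sh
# Sketch: write an exporter credentials file with two sections,
# readable only by its owner. All credentials are placeholders.
write_exporter_cnf() {
  cnf_path="$1"
  (
    umask 077   # new files are created without group/other permissions
    cat >"$cnf_path" <<'EOF'
[client]
user = exporter
password = exportertest

[client.servers]
user = exporter
password = another-password
EOF
  )
  chmod 600 "$cnf_path"   # also tighten a file that already existed
}
```

With such a file, `auth_module: [client.servers]` in the scrape job would pick the second section, while omitting the parameter falls back to `[client]`.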

node01

cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:

  scrape_interval: 15s     # How often targets are scraped. Set to 15s here; the default is 1m. Typical values are 10-60s.
  evaluation_interval: 15s # How often Prometheus evaluates rules. Set to 15s here.

alerting:
  alertmanagers:
    - static_configs:    # Static Alertmanager addresses; service discovery can be used instead
      - targets:         # Several addresses can be listed
        - 10.10.8.62:9093

# Alerting rule files
rule_files:
  - "rules/*.yml"


scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  # Linux hosts
  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - 'localhost:9100'
        - '10.10.8.63:9100'

  # MySQL
  - job_name: 'mysqld-exporter'
    # Multi-target scrapes go through the exporter's /probe endpoint
    metrics_path: /probe
    params:
      # Optional. Selects the section of config.my.cnf to use; the default is "client".
      auth_module: [client]
    static_configs:
      - targets:
        - localhost:3306
        - 10.10.8.63:3306 # Added line; append further instances below

    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104
EOF
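
The relabel_configs in the MySQL job move each listed MySQL address into the `target` query parameter and send the actual scrape to the exporter. The sketch below reproduces that rewrite by hand, assuming the job scrapes the exporter's `/probe` endpoint; the function name is an illustration only.

```shell
#!/bin/sh
# Sketch: build the URL Prometheus effectively requests for one MySQL target.
# __address__ (the MySQL address) becomes ?target=..., and the request
# itself goes to the exporter address.
probe_url() {
  exporter="$1"; mysql_target="$2"
  printf 'http://%s/probe?target=%s\n' "$exporter" "$mysql_target"
}

# Usage (not run here):
# curl -s "$(probe_url localhost:9104 10.10.8.63:3306)" | grep mysql_up
```

Fetching that URL with curl is a quick way to see whether the exporter can log in to a given instance before Prometheus is reloaded.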

Alerting rules

node01

cd  /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/mysql/mysqld-exporter.yml

# Fix ownership
chown -R monitor:monitor /usr/local/prometheus/

# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml

# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload

Grafana dashboard ID to import: 7362

Verification

node01

curl 'http://localhost:9104/metrics' |grep mysql
curl 'http://localhost:9104/probe?target=10.10.8.63:3306' |grep mysql

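Service discovery

Traditional environments do not need service discovery, and it is not very convenient there; a static configuration file is enough. If you do want it, file-based discovery is an option. If you run k8s, Consul is worth learning.

Security

Configuring HTTPS for Grafana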

mkdir  /usr/local/grafana/certificate
cd /usr/local/grafana/certificate

openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 3650 -out certificate.pem  # Press Enter at every prompt

# Grafana runs as the monitor user and must be able to read the key
chown -R monitor:monitor /usr/local/grafana/certificate

vim /usr/local/grafana/conf/granfana.ini
# In the [server] section:
protocol = https
cert_file = /usr/local/grafana/certificate/certificate.pem
cert_key = /usr/local/grafana/certificate/key.pem

systemctl restart grafana.service
systemctl status  grafana.service
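
The certificate can also be generated without any prompts by passing the subject on the command line, which is handy in scripts. The sketch below wraps the same `openssl req` call; the function name is an assumption, and the common name is whatever host name or IP browsers will use to reach Grafana.

```shell
#!/bin/sh
# Sketch: create a self-signed key and certificate non-interactively.
# The CN passed by the caller is an assumption about how Grafana is reached.
make_self_signed_cert() {
  cert_dir="$1"; common_name="$2"
  mkdir -p "$cert_dir"
  openssl req -newkey rsa:2048 -nodes -x509 -days 3650 \
    -subj "/CN=$common_name" \
    -keyout "$cert_dir/key.pem" -out "$cert_dir/certificate.pem"
}
```

For this setup the call would be `make_self_signed_cert /usr/local/grafana/certificate 10.10.8.62`, followed by a `chown -R monitor:monitor` on the directory so that the Grafana service user can read the key.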

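Setting a user name and password for Prometheus

After this change, the connection settings of the Prometheus data source in Grafana must be updated with the credentials.

Generate the password hash with htpasswd

# Install the htpasswd tool
yum install httpd-tools -y

# Run the command; the password used here is admintest
htpasswd -nBC 12 '' | tr -d ':\n'
New password:  
Re-type new password: 

# The resulting bcrypt hash
$2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm

Configure the web user and password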

cat >/usr/local/prometheus/web-config.yml<<'EOF'
basic_auth_users:
    admin: $2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm
EOF

Add basic_auth to the Prometheus scrape configuration

vim /usr/local/prometheus/prometheus.yml
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    basic_auth:
      username: admin         # user name: admin
      password: admintest     # password: admintest (plain text here; web-config.yml stores only the hash)
    static_configs:
      - targets: ['localhost:9090']

Update the systemd unit

cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
  --config.file "/usr/local/prometheus/prometheus.yml" \
  --web.listen-address "0.0.0.0:9090" \
  --web.config.file=/usr/local/prometheus/web-config.yml \
  --storage.tsdb.retention.time=1095d \
  --web.enable-lifecycle 
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus

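Using labels for classification

Labels can be attached when targets are defined

vim /usr/local/prometheus/prometheus.yml
- job_name: 'example'
  static_configs:
    - targets: ['server:9100']
      labels:  # Custom labels
        environment: 'production'

In practice:

In the alerting rule file, use labels to mark the severity of each alert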

vim /usr/local/prometheus/rules/test.yml
groups:
- name: example-alerts
  rules:
  - alert: HighHttpRequests
    expr: http_requests_total{job="example", instance="example-instance"} > 100
    for: 5m
    labels:
      severity: critical  # Alerts are handled differently depending on the value of the severity label
    annotations:
      summary: "High HTTP Requests"
      description: "The number of HTTP requests is high on example-instance"

In Alertmanager, group_by under route controls how alerts are grouped into notifications, while the matchers under routes decide which receiver each alert is sent to

vim  /usr/local/alertmanager/alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'sms-critical'   # Default receiver when no child route matches
  routes:
  - matchers:
    - severity="critical"
    receiver: 'sms-critical'

receivers:
- name: 'sms-critical'
  webhook_configs:
  - url: 'https://your-sms-provider/api/send'
    send_resolved: true
    http_config:
      bearer_token: 'your-bearer-token'
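
Whether a given label set really ends up at the intended receiver can be checked offline with amtool, which ships in the Alertmanager tarball. The wrapper below is a sketch: the function name and the overridable `AMTOOL` variable are assumptions, and the default path assumes the installation layout used in this article.

```shell
#!/bin/sh
# Sketch: ask amtool which receiver a set of labels would be routed to.
# AMTOOL can be overridden; the default assumes the tarball layout used here.
AMTOOL="${AMTOOL:-/usr/local/alertmanager/amtool}"

route_receiver() {
  config_file="$1"; shift
  # Remaining arguments are label=value pairs describing the alert
  "$AMTOOL" config routes test --config.file="$config_file" "$@"
}
```

For example, `route_receiver /usr/local/alertmanager/alertmanager.yml severity=critical` prints the receiver that the routing tree selects for a critical alert.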

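Summary

When monitoring production with Prometheus, make full use of labels: define different alerting rules for machines in different environments and with different roles, so that a flood of alerts does not cause some to be missed. Keep tight control over security to prevent information leaks.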
