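This monitoring stack is simple, though it looks complicated if you don't know the workflow. Learn the order of configuration first; once you understand the overall framework, split the configuration into parts and look up how to configure each one. The main obstacle is not knowing how each layer is configured, and so not knowing where to start. Skim a few books on the subject, understand the flow, and list your questions in configuration order; by the time you have worked through them one by one, the stack will be up. Once the whole path works, start fine-tuning. Practice makes perfect: until you have built it 20 times, don't claim you have learned it.
When learning something new, avoid perfectionism. Get a simplified version of the whole path working first, which does a lot for your confidence. Then study each technical point in depth. Finally, drawing on the problems you meet in production, consider how well each feature fits your environment, and arrive at a configuration that suits your own company.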
Simplified diagram
Server information:
Node     IP address    Services
node01   10.10.8.62    grafana prometheus alertmanager node_exporter mysqld_exporter
node02   10.10.8.63    node_exporter mysqld_exporter
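The table above can be turned into a quick list of metrics endpoints to check once everything is running. This is a minimal sketch, assuming every service listens on its default port; Grafana's port 3000 is the upstream default and is not stated in this article, and `service_port` and `metrics_url` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Map a service name to its default listening port.
service_port() {
  case "$1" in
    grafana)         echo 3000 ;;  # assumption: Grafana's default http_port
    prometheus)      echo 9090 ;;
    alertmanager)    echo 9093 ;;
    node_exporter)   echo 9100 ;;
    mysqld_exporter) echo 9104 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Build the metrics URL for a service on a given host.
metrics_url() {
  local host="$1" service="$2" port
  port="$(service_port "$service")" || return 1
  echo "http://${host}:${port}/metrics"
}

# Example: print every endpoint from the table.
for svc in grafana prometheus alertmanager node_exporter mysqld_exporter; do
  metrics_url 10.10.8.62 "$svc"
done
for svc in node_exporter mysqld_exporter; do
  metrics_url 10.10.8.63 "$svc"
done
```

Each printed URL can be passed to curl after the corresponding service has been started.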
Create a dedicated user and group
groupadd monitor
useradd -MN -s /sbin/nologin monitor -g monitor
grafana
Install
node01
#wget https://dl.grafana.com/oss/release/grafana-10.4.0.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf grafana-10.4.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/grafana-v10.4.0/ /usr/local/grafana
Configure
node01
mkdir -p /usr/local/grafana/data/{log,plugins,socket}
cp /usr/local/grafana/conf/defaults.ini /usr/local/grafana/conf/granfana.ini
chown -R monitor:monitor /usr/local/grafana/
sed -i 's#socket = /tmp/grafana.sock#socket = data/socket/grafana.sock#g' /usr/local/grafana/conf/granfana.ini
sed -i 's#en-US#zh-CN#g' /usr/local/grafana/conf/granfana.ini
Start
node01
cat >/usr/lib/systemd/system/grafana.service<<'EOF'
[Unit]
Description=Grafana
After=network.target
[Service]
User=monitor
Group=monitor
Environment="GRAFANA_HOME=/usr/local/grafana"
ExecStart=/usr/local/grafana/bin/grafana-server --config=/usr/local/grafana/conf/granfana.ini --homepath=/usr/local/grafana
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart grafana
systemctl status grafana
systemctl enable grafana
Default username/password: admin/admin
prometheus
A collection of alerting rules: no need to write monitoring rules by hand, just adapt these and use them
https://github.com/samber/awesome-prometheus-alerts#-rules
https://samber.github.io/awesome-prometheus-alerts/
Install
node01
#wget https://github.com/prometheus/prometheus/releases/download/v2.50.1/prometheus-2.50.1.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf prometheus-2.50.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/prometheus-2.50.1.linux-amd64/ /usr/local/prometheus
cd /usr/local/prometheus
Configure
node01
cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:
  scrape_interval: 15s     # How often targets are scraped. Set to 15s; the default is 1 minute. 10-60s is typical.
  evaluation_interval: 15s # How often rules are evaluated. Set to 15s.
alerting:
  alertmanagers:
    - static_configs:      # Static Alertmanager addresses; service discovery can be used instead
        - targets:         # Multiple addresses may be listed
            - 10.10.8.62:9093
# Alerting rule files
rule_files:
  - "rules/*.yml"
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']
  # Linux system monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'
  # MySQL monitoring
  - job_name: 'mysqld-exporter'
    metrics_path: /probe   # multi-target endpoint of mysqld_exporter; the MySQL instance is passed as ?target=
    static_configs:
      - targets:
          - localhost:3306
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104
EOF
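# Create the alerting rules directory
mkdir /usr/local/prometheus/rules
chown -R monitor:monitor /usr/local/prometheus/
# No --storage.tsdb.path is set, so the TSDB uses the default relative path data/,
# which resolves to /data when started by systemd (working directory /)
mkdir -p /data
chown -R monitor:monitor /data
Check the configuration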
node01
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
Start
node01
cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
--config.file "/usr/local/prometheus/prometheus.yml" \
--web.listen-address "0.0.0.0:9090" \
--storage.tsdb.retention.time=1095d \
--web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus
node01
# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
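The check and the reload are easy to tie together so that a broken file is never reloaded. The following is only a sketch; `reload_prometheus` is a hypothetical helper, and the `PROMTOOL`, `PROM_CONFIG` and `PROM_URL` variables are assumptions whose defaults follow the paths used in this article.

```shell
#!/usr/bin/env bash
# Reload Prometheus only when promtool accepts the configuration.
reload_prometheus() {
  local promtool="${PROMTOOL:-/usr/local/prometheus/promtool}"
  local config="${PROM_CONFIG:-/usr/local/prometheus/prometheus.yml}"
  local url="${PROM_URL:-http://127.0.0.1:9090}"

  if ! "$promtool" check config "$config"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero on an HTTP error status
  curl -fsS -X POST "${url}/-/reload"
}

# Usage: reload_prometheus && echo "reloaded"
```

The reload endpoint only works because the service is started with --web.enable-lifecycle.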
Configure the data source in Grafana
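Besides the web UI, Grafana can also load data sources from provisioning files at startup. The sketch below writes such a file; `write_prometheus_datasource` is a hypothetical helper, and the provisioning directory in the usage comment is an assumption based on Grafana's default `conf/provisioning` location under the home path used above.

```shell
#!/usr/bin/env bash
# Write a Grafana data source provisioning file into the given directory.
write_prometheus_datasource() {
  local dir="$1"
  local url="${2:-http://localhost:9090}"
  mkdir -p "$dir" || return 1
  cat >"${dir}/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}

# Usage on node01 (assumed provisioning path):
#   write_prometheus_datasource /usr/local/grafana/conf/provisioning/datasources
#   chown -R monitor:monitor /usr/local/grafana/conf/provisioning
#   systemctl restart grafana
```

A provisioned data source is recreated on every restart, which is handy when the stack is rebuilt repeatedly for practice.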
alertmanager
Install
node01
#wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf alertmanager-0.27.0.linux-amd64.tar.gz -C /usr/local
mv /usr/local/alertmanager-0.27.0.linux-amd64/ /usr/local/alertmanager
Configure
node01
cat >/usr/local/alertmanager/alertmanager.yml<<'EOF'
global:
  resolve_timeout: 5m
  # Email
  smtp_smarthost: 'mail.test.com:25'
  smtp_from: 'test@test.com'
  smtp_auth_username: 'test@test.com'
  smtp_auth_password: 'test@!QAZ'
  smtp_require_tls: false
  # WeCom (企业微信)
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww2edb882dtest93222' # Corp ID from the WeCom admin console
# Routing tree
route:
  # group_by: ['alertname'] # Group alerts by alert name
  group_wait: 1s       # How long to wait before sending the first notification for a new group
  group_interval: 1s   # How long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h  # How long to wait before re-sending a notification that is still firing
  receiver: 'email_wechat'
# Receivers
receivers:
  - name: 'email_wechat'
    # Email
    email_configs:
      - to: 'duyuhang@inmyshow.com'
        html: '{{ template "email.html" . }}'
        send_resolved: true
    # WeCom
    wechat_configs:
      - send_resolved: true
        api_secret: 'x7NQ305cPcR1dsdsHDSnW9oU_ioOaGqdsdsdsdsds6Oy4M'
        agent_id: '10000034' # agentid shown in the WeCom admin console
        message: '{{ template "wechat.message" . }}'
        to_party: '57'
        to_user: '@all'
# Notification template location
templates:
  - '/usr/local/alertmanager/templates/*.tmpl'
# Inhibition rules
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
EOF
Creating the WeCom (企业微信) bot: search online for instructions
A trusted IP must be configured: https://blog.csdn.net/weixin_45385457/article/details/132278442
Email template
node01
# Notification templates
mkdir /usr/local/alertmanager/templates
cat >/usr/local/alertmanager/templates/email.tmpl<<'EOF'
{{ define "email.html" }}
{{ range .Alerts }}
Summary: {{ .Annotations.summary }} <br>
Host: {{ .Labels.instance }} <br>
Source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Details: {{ .Annotations.description }} <br>
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
EOF
WeChat notification template
node01
cat >/usr/local/alertmanager/templates/wechat.tmpl<<'EOF'
{{ define "wechat.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
Alert: {{ .Labels.instance }} {{ .Annotations.summary }}
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
Resolved: {{ .Labels.instance }} {{ .Annotations.summary }}
Alert name: {{ .Labels.alertname }}
Status: {{ .Status }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
EOF
chown -R monitor:monitor /usr/local/alertmanager/
Start
node01
cat >/usr/lib/systemd/system/alertmanager.service<<'EOF'
[Unit]
Description=alertmanager
After=network.target
[Service]
User=monitor
Group=monitor
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
chown -R monitor:monitor /usr/local/alertmanager/
systemctl daemon-reload
systemctl restart alertmanager
systemctl status alertmanager
systemctl enable alertmanager
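Once Alertmanager is running, the email and WeCom receivers can be exercised without waiting for a real alert by posting a hand-made alert to the v2 API. This is a sketch only: `test_alert_json` is a hypothetical helper, and the label and annotation values are made up to match the fields the templates above read.

```shell
#!/usr/bin/env bash
# Print a one-alert JSON payload for Alertmanager's v2 API.
test_alert_json() {
  local name="$1" severity="$2" instance="$3"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},' \
    "$name" "$severity" "$instance"
  printf '"annotations":{"summary":"test alert","description":"sent manually to verify notifications"}}]\n'
}

# Usage on node01:
#   test_alert_json TestAlert warning node01 |
#     curl -fsS -X POST -H 'Content-Type: application/json' \
#          --data @- http://localhost:9093/api/v2/alerts
```

The alert carries no end time, so it should resolve on its own once resolve_timeout has passed without it being re-sent.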
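Configure the data source in Grafana
node_exporter
Must be installed on every server to be monitored.
node_exporter provides Linux system monitoring: add it to the Prometheus configuration file and import a dashboard template in Grafana.
Install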
node01 node02
#wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
cd /home/zcsadmin/
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xf node_exporter-1.7.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/node_exporter-1.7.0.linux-amd64 /usr/local/node_exporter
Start
node01 node02
cat >/usr/lib/systemd/system/node_exporter.service<<'EOF'
[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart node_exporter
systemctl status node_exporter
systemctl enable node_exporter
Configure
Grafana dashboard template to import:
https://grafana.com/grafana/dashboards/1860-node-exporter-full/
Alert rules
node01
cd /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/host-and-hardware/node-exporter.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
Verify
node01 node02
curl 'http://localhost:9100/metrics' |grep cpu
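Grepping for cpu returns a lot of lines. To pull out a single value, a small filter over the text exposition format is enough. This is a sketch; `metric_value` is a hypothetical helper, and it matches on the first whitespace-separated field, so the name must include the label set exactly as exposed and label values containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and print the value of the
# first sample whose name (including any label set) matches exactly.
# Exits non-zero when no such sample is found.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}

# Usage: curl -s http://localhost:9100/metrics | metric_value node_load1
```

The non-zero exit status makes the helper usable in a simple health-check script.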
mysqld_exporter
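Does not need to be installed on every monitored server. The procedure is:
- Install mysqld_exporter on the Prometheus server
- Configure a single MySQL credentials file for the connection
- Create the matching account in every MySQL instance to be monitored. Note: the account must be able to connect from the Prometheus server
- Open the required firewall rules
Install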
node01
#wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf mysqld_exporter-0.15.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/mysqld_exporter-0.15.1.linux-amd64 /usr/local/mysqld_exporter
Start
node01
cat >/usr/lib/systemd/system/mysqld_exporter.service<<'EOF'
[Unit]
Description=mysqld_exporter
After=network.target
[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/config.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart mysqld_exporter
systemctl status mysqld_exporter
systemctl enable mysqld_exporter
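Configure
Install a test MySQL
node01 node02
yum install -y mariadb-server
systemctl start mariadb
Client: create an account in every MySQL instance to be monitored
node01 node02
# Create the account in the database
create user exporter@'10.10.8.62' identified by 'exportertest';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'10.10.8.62';
node01
# Create the config file mysqld_exporter uses to connect to MySQL
cat >/usr/local/mysqld_exporter/config.my.cnf<<'EOF'
[client]
user = exporter
password = exportertest
EOF
# The service was started before this file existed, so restart it to pick up the credentials
systemctl restart mysqld_exporter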
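The credentials file holds a plaintext password, so it is worth restricting its permissions. A minimal sketch, assuming the exporter runs as root as in the unit file above (it has no User= line); `write_exporter_cnf` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Write a MySQL client credentials file that only its owner can read.
write_exporter_cnf() {
  local file="$1" user="$2" password="$3"
  (
    umask 077   # a newly created file gets mode 0600
    cat >"$file" <<EOF
[client]
user = ${user}
password = ${password}
EOF
  ) || return 1
  chmod 600 "$file"   # also tighten a file that already existed
}

# Usage on node01:
#   write_exporter_cnf /usr/local/mysqld_exporter/config.my.cnf exporter exportertest
```

If the service is later changed to run as the monitor user, the file owner has to be changed to match.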
node01
cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:
  scrape_interval: 15s     # How often targets are scraped. Set to 15s; the default is 1 minute. 10-60s is typical.
  evaluation_interval: 15s # How often rules are evaluated. Set to 15s.
alerting:
  alertmanagers:
    - static_configs:      # Static Alertmanager addresses; service discovery can be used instead
        - targets:         # Multiple addresses may be listed
            - 10.10.8.62:9093
# Alerting rule files
rule_files:
  - "rules/*.yml"
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']
  # Linux system monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'
          - '10.10.8.63:9100'
  # MySQL monitoring
  - job_name: 'mysqld-exporter'
    metrics_path: /probe   # multi-target endpoint of mysqld_exporter; the MySQL instance is passed as ?target=
    params:
      # Optional. Selects the section of config.my.cnf to use; defaults to "client".
      # It must name a section that exists in the file (only [client] is defined above).
      auth_module: [client]
    static_configs:
      - targets:
          - localhost:3306
          - 10.10.8.63:3306 # added; append further instances below
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104
EOF
Alert rules
node01
cd /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/mysql/mysqld-exporter.yml
# Fix ownership
chown -R monitor:monitor /usr/local/prometheus/
# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
Grafana dashboard ID to import: 7362
Verify
node01
curl 'http://localhost:9104/metrics' |grep mysql
curl 'http://localhost:9104/probe?target=10.10.8.63:3306' |grep mysql
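Service discovery
A traditional (non-containerized) environment does not need service discovery, and it is not very convenient there; listing targets directly in the configuration file is enough. If you do want it, set up file-based discovery. If you run k8s, Consul is worth studying.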
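For reference, file-based discovery only needs a target file plus a `file_sd_configs` entry in the scrape job. The sketch below writes the target file; `write_node_targets` is a hypothetical helper, and the targets directory and the `environment` label value are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Write a file_sd target file listing the node_exporter endpoints.
write_node_targets() {
  local dir="$1"
  mkdir -p "$dir" || return 1
  cat >"${dir}/node.json" <<'EOF'
[
  {
    "targets": ["10.10.8.62:9100", "10.10.8.63:9100"],
    "labels": {"environment": "production"}
  }
]
EOF
}

# Usage on node01 (the targets directory is a hypothetical location):
#   write_node_targets /usr/local/prometheus/targets
#
# Matching scrape job in prometheus.yml:
#   - job_name: 'node-exporter'
#     file_sd_configs:
#       - files: ['/usr/local/prometheus/targets/*.json']
#         refresh_interval: 1m
```

Prometheus watches the listed files, so adding a host becomes an edit to node.json with no reload of the main configuration.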
Security
Configure HTTPS for Grafana
mkdir /usr/local/grafana/certificate
cd /usr/local/grafana/certificate
openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 3650 -out certificate.pem # press Enter at every prompt
chown -R monitor:monitor /usr/local/grafana/certificate # Grafana runs as monitor and must be able to read the key
vim /usr/local/grafana/conf/granfana.ini
[server]
protocol = https
cert_file = /usr/local/grafana/certificate/certificate.pem
cert_key = /usr/local/grafana/certificate/key.pem
systemctl restart grafana.service
systemctl status grafana.service
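The certificate can also be generated without answering any prompts by passing the subject on the command line, which helps when the setup is scripted. A sketch under that assumption; `make_self_signed` and `show_cert` are hypothetical helper names, and using the node IP as the common name is only an example.

```shell
#!/usr/bin/env bash
# Create a self-signed certificate and key without interactive prompts.
make_self_signed() {
  local dir="$1" cn="$2"
  mkdir -p "$dir" || return 1
  openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
    -subj "/CN=${cn}" \
    -keyout "${dir}/key.pem" -out "${dir}/certificate.pem" 2>/dev/null
}

# Print the subject and expiry date of a certificate.
show_cert() {
  openssl x509 -in "$1" -noout -subject -enddate
}

# Usage on node01:
#   make_self_signed /usr/local/grafana/certificate 10.10.8.62
#   show_cert /usr/local/grafana/certificate/certificate.pem
```

Browsers will still warn about a self-signed certificate; that is expected for an internal test setup.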
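Configure a username and password for Prometheus
Once this is enabled, the connection settings of the Grafana data source must be updated to match
Generate the password hash with htpasswd
# Install htpasswd
yum install httpd-tools -y
# Run the command; the password used here is admintest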
htpasswd -nBC 12 '' | tr -d ':\n'
New password:
Re-type new password:
# The hashed password
$2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm
Configure the web username and password
cat >/usr/local/prometheus/web-config.yml<<'EOF'
basic_auth_users:
  admin: $2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm
EOF
Add basic_auth to the Prometheus configuration
vim /usr/local/prometheus/prometheus.yml
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    basic_auth:
      username: admin       # username is admin
      password: admintest   # password is admintest
    static_configs:
      - targets: ['localhost:9090']
Update the systemd unit
cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
--config.file "/usr/local/prometheus/prometheus.yml" \
--web.listen-address "0.0.0.0:9090" \
--web.config.file=/usr/local/prometheus/web-config.yml \
--storage.tsdb.retention.time=1095d \
--web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus
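With basic auth enabled, every HTTP call to Prometheus needs credentials, including the reload request used earlier. A small wrapper keeps that in one place. This is a sketch: `prom_curl` is a hypothetical helper, and `PROM_USER`, `PROM_PASS`, `PROM_URL` and `CURL` are assumed environment variables.

```shell
#!/usr/bin/env bash
# Call a Prometheus HTTP path with basic auth.
# Extra arguments after the path are passed through to curl.
prom_curl() {
  local path="$1"; shift
  if [ -z "${PROM_PASS:-}" ]; then
    echo "PROM_PASS is not set" >&2
    return 1
  fi
  "${CURL:-curl}" -fsS -u "${PROM_USER:-admin}:${PROM_PASS}" "$@" \
    "${PROM_URL:-http://127.0.0.1:9090}${path}"
}

# Usage:
#   PROM_PASS=admintest prom_curl /-/healthy
#   PROM_PASS=admintest prom_curl /-/reload -X POST
```

Reading the password from the environment keeps it out of the script itself.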
Using labels for classification
Labels can be defined when configuring targets
vim /usr/local/prometheus/prometheus.yml
  - job_name: 'example'
    static_configs:
      - targets: ['server:9100']
        labels: # define labels
          environment: 'production'
In practice:
In the alerting rule file, use labels to distinguish alert severity
vim /usr/local/prometheus/rules/test.yml
groups:
  - name: example-alerts
    rules:
      - alert: HighHttpRequests
        expr: http_requests_total{job="example", instance="example-instance"} > 100
        for: 5m
        labels:
          severity: critical # alerts are handled differently depending on the value of the severity label
        annotations:
          summary: "High HTTP Requests"
          description: "The number of HTTP requests is high on example-instance"
In Alertmanager, group_by under route controls how alerts are grouped into notifications, while the matchers on child routes decide which receiver each alert is sent to
vim /usr/local/alertmanager/alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'sms-critical'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'sms-critical'
receivers:
  - name: 'sms-critical'
    webhook_configs:
      - url: 'https://your-sms-provider/api/send'
        send_resolved: true
        http_config:
          bearer_token: 'your-bearer-token'
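Routing can be checked offline with amtool, which ships in the Alertmanager tarball: given a set of labels, it reports which receiver the routing tree would select. A sketch, assuming amtool sits next to the alertmanager binary installed above; `route_receiver` is a hypothetical helper, and `AMTOOL` and `AM_CONFIG` are assumed override variables.

```shell
#!/usr/bin/env bash
# Ask amtool which receiver a label set would be routed to.
route_receiver() {
  "${AMTOOL:-/usr/local/alertmanager/amtool}" config routes test \
    --config.file="${AM_CONFIG:-/usr/local/alertmanager/alertmanager.yml}" "$@"
}

# Usage: route_receiver severity=critical alertname=HighHttpRequests
```

Running this before restarting Alertmanager catches a label or matcher typo that would otherwise silently send alerts to the default receiver.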
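Summary
When using Prometheus for monitoring in production, make full use of labels: define different alerting rules for machines in different environments and with different roles, so that a flood of alerts does not cause real problems to be missed. Keep tight control over security to prevent information leaks.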