I. Deploying Prometheus
1. Download the binary release
https://github.com/prometheus/prometheus/releases/download/v2.28.0/prometheus-2.28.0.linux-amd64.tar.gz
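2. Extract the archive; the binaries run as-is, with no installation step. Note that the unit file in step 3 expects them under /opt/monitor/prometheus, so move or symlink the extracted directory there.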
tar xf prometheus-2.28.0.linux-amd64.tar.gz
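The unit file in the next step points at /opt/monitor/prometheus, while the tarball unpacks into a versioned directory. One way to reconcile the two is sketched below, assuming /opt/monitor as the install root and a stable symlink named prometheus. The helper function names are hypothetical and not part of any Prometheus tooling.

```shell
#!/bin/sh
# Sketch only: the function names and the symlink layout are assumptions.
PROM_VERSION="2.28.0"
INSTALL_ROOT="/opt/monitor"

# Name of the extracted directory (and tarball stem) for a given version.
prom_release_name() {
    printf 'prometheus-%s.linux-amd64' "$1"
}

# GitHub download URL for a given version.
prom_release_url() {
    printf 'https://github.com/prometheus/prometheus/releases/download/v%s/%s.tar.gz' \
        "$1" "$(prom_release_name "$1")"
}

# Download, unpack under the install root, and point a stable
# "prometheus" symlink at the versioned directory.
install_prometheus() {
    version="$1"
    root="$2"
    name="$(prom_release_name "$version")"
    mkdir -p "$root" &&
    wget -q -O "/tmp/$name.tar.gz" "$(prom_release_url "$version")" &&
    tar xf "/tmp/$name.tar.gz" -C "$root" &&
    ln -sfn "$root/$name" "$root/prometheus"
}

# Usage (run as root):
#   install_prometheus "$PROM_VERSION" "$INSTALL_ROOT"
```

A symlink rather than a rename keeps the version visible on disk, which may make later upgrades easier: unpack the new release next to the old one and repoint the link.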
3. Manage the service with systemd
[root@prometheus ~]# cat /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus
[Service]
ExecStart=/opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
4. Reload systemd and start the service
systemctl daemon-reload
systemctl start prometheus.service
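After starting the service it is worth confirming that Prometheus actually answers. The sketch below polls an HTTP endpoint a fixed number of times; wait_for_http is a hypothetical helper name, and the URL in the usage comment assumes Prometheus listens on its default port 9090.

```shell
#!/bin/sh
# Sketch: poll an HTTP endpoint until it answers or the retries run out.
# wait_for_http is a hypothetical helper name.
wait_for_http() {
    url="$1"
    retries="$2"
    i=0
    while [ "$i" -lt "$retries" ]; do
        # -f makes curl return non-zero on HTTP errors (4xx/5xx).
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage (run as root on the Prometheus host):
#   systemctl enable prometheus.service          # also start on boot
#   systemctl status prometheus.service --no-pager
#   wait_for_http "http://localhost:9090/-/healthy" 10 && echo "Prometheus is up"
```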
5. Edit the Prometheus configuration file as follows
[root@prometheus prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093 # remove the leading # to send alerts to Alertmanager
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml" # alerting/recording rule files loaded by Prometheus
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node server'
    static_configs:
      - targets: ['192.168.33.145:9100','192.168.33.142:9100'] # node_exporter endpoints: host-level metrics (memory, CPU, load, etc.)
  - job_name: 'docker'
    static_configs:
      - targets: ['192.168.33.145:8080'] # cAdvisor endpoint: Docker container metrics
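The rule_files entries in this configuration are commented out, so no alerts are evaluated yet. If alerting is switched on, Prometheus needs at least one rule file. The sketch below writes a minimal one; the file name node_rules.yml, the group name, and the InstanceDown alert are illustrative assumptions and do not come from the original setup.

```shell
#!/bin/sh
# Sketch: write a minimal alerting rule file. File name, group name and
# alert name are illustrative assumptions.
RULES_FILE="${RULES_FILE:-./node_rules.yml}"

# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$RULES_FILE" <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF

echo "wrote $RULES_FILE"
```

To use it, place the file next to prometheus.yml, list it under rule_files (for example `- "node_rules.yml"`), and check it with `./promtool check rules node_rules.yml` before reloading.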
6. Hot-reload the Prometheus configuration (send SIGHUP to the running process)
[root@prometheus prometheus]# ps -ef|grep prometheus
root 1081 1 0 13:25 ? 00:00:10 /opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml
root 3123 2619 0 14:10 pts/0 00:00:00 grep --color=auto prometheus
[root@prometheus prometheus]# kill -HUP 1081
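A reload with a broken configuration is rejected by Prometheus, but it is easy to miss in the logs. The sketch below validates the file first with promtool, which ships in the same release tarball, and only then reloads through systemd (the unit's ExecReload sends the same SIGHUP). safe_reload is a hypothetical helper name, and the paths in the usage comment assume the /opt/monitor/prometheus layout used by the unit file.

```shell
#!/bin/sh
# Sketch: reload only if promtool accepts the config.
# safe_reload is a hypothetical helper name.
safe_reload() {
    promtool="$1"
    config="$2"
    if "$promtool" check config "$config"; then
        # The unit's ExecReload sends SIGHUP to the main process.
        systemctl reload prometheus.service
    else
        echo "config check failed; not reloading" >&2
        return 1
    fi
}

# Usage:
#   safe_reload /opt/monitor/prometheus/promtool /opt/monitor/prometheus/prometheus.yml
```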
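7. Prometheus ships with a built-in web UI:
Open the Prometheus host address on port 9090 in a browser (192.168.33.139:9090). Port 9100 belongs to node_exporter, not to the Prometheus UI.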
II. Deploying node_exporter
1. Download the binary release
https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
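2. Extract (the unit file in step 3 expects /opt/monitor/node_exporter, so rename or symlink the extracted directory accordingly)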
tar xf node_exporter-1.2.2.linux-amd64.tar.gz -C /opt/monitor
3. Manage the service with systemd
[root@prometheus ~]# cat /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
[Service]
ExecStart=/opt/monitor/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-include=(docker|sshd|nginx).service
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
4. Reload systemd and start the service
systemctl daemon-reload
systemctl start node_exporter.service
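The unit file enables the systemd collector for docker, sshd and nginx. One way to confirm that it works is to filter the exporter's output for units reported as active. This is a sketch: active_units is a hypothetical helper that reads metrics text on stdin, and it assumes the node_systemd_unit_state metric with name and state labels.

```shell
#!/bin/sh
# Sketch: list systemd units that node_exporter reports as active.
# active_units is a hypothetical helper; it reads metrics text on stdin
# and prints the unit name of every state="active" sample whose value is 1.
active_units() {
    sed -n 's/^node_systemd_unit_state{.*name="\([^"]*\)".*state="active".*} 1$/\1/p'
}

# Usage (on a host running node_exporter):
#   curl -s http://localhost:9100/metrics | active_units
```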
III. Deploying Grafana
1. Download the binary release
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.0.3.linux-amd64.tar.gz
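2. Extract the archive (the unit file in step 3 expects /opt/monitor/grafana, so rename or symlink the extracted directory accordingly)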
tar -zxvf grafana-enterprise-8.0.3.linux-amd64.tar.gz -C /opt/monitor
3. Manage the service with systemd
[root@prometheus ~]# cat /usr/lib/systemd/system/grafana.service
[Unit]
Description=grafana
[Service]
ExecStart=/opt/monitor/grafana/bin/grafana-server -homepath=/opt/monitor/grafana
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
4. Reload systemd and start the service
systemctl daemon-reload
systemctl start grafana.service
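Grafana still needs Prometheus registered as a data source before any dashboard shows data. This can be done in the web UI, or with a provisioning file as sketched below. The sketch assumes Grafana's default provisioning directory (conf/provisioning under the home path); write_datasource is a hypothetical helper, and the data source name and the Prometheus URL in the usage comment are assumptions to adapt.

```shell
#!/bin/sh
# Sketch: write a Grafana data source provisioning file for Prometheus.
# write_datasource is a hypothetical helper; the data source name and the
# URL passed in are assumptions.
write_datasource() {
    dir="$1"
    prom_url="$2"
    mkdir -p "$dir" &&
    cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $prom_url
    isDefault: true
EOF
}

# Usage (then restart Grafana so it picks up the file):
#   write_datasource /opt/monitor/grafana/conf/provisioning/datasources http://192.168.33.139:9090
#   systemctl restart grafana.service
```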
5. Grafana dashboard templates can be downloaded from
https://grafana.com/grafana/dashboards
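Commonly used dashboards (by ID)
193 Docker monitoring dashboard
9276 node host monitoring dashboard
7362 MySQL monitoring dashboard
6. Grafana web UI (192.168.33.145:3000)
6.1 Monitoring node hosts
6.2 Monitoring a Kubernetes cluster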
IV. Deploying Alertmanager
1. Download the Alertmanager binary package
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
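2. Extract the package (the unit file in step 3 expects /opt/monitor/alertmanager, so rename or symlink the extracted directory accordingly)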
tar xf alertmanager-0.23.0.linux-amd64.tar.gz -C /opt/monitor/
3. Manage the service with systemd
[root@prometheus alertmanager]# cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
[Service]
ExecStart=/opt/monitor/alertmanager/alertmanager --config.file=/opt/monitor/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
4. Reload systemd and start the service
systemctl daemon-reload
systemctl start alertmanager.service
5. Edit the Alertmanager configuration (DingTalk alert version)
[root@prometheus alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
templates:
  - '/opt/monitor/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true
inhibit_rules:
  - source_match:
      alertname: 'ApplicationDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname',"target","job","instance"]
6. Edit the Alertmanager configuration (email alert version)
[root@prometheus alertmanager]# cat alertmanager.yml.bak20210830
global:
  resolve_timeout: 5m
  # SMTP server
  smtp_smarthost: 'smtp.126.com:25'
  smtp_from: 'liujixiao6@126.com'
  smtp_auth_username: 'liujixiao6@126.com'
  smtp_auth_password: 'BBELDJWBPLMLIMUR'
  smtp_require_tls: false
# Routing tree
route:
  group_by: ['alertname'] # group alerts by the alertname label
  group_wait: 10s # wait before sending the first notification for a new group; alerts arriving within these 10s are merged into it
  group_interval: 10s # wait before notifying about new alerts added to an existing group
  repeat_interval: 1h # wait before re-sending a notification for alerts that are still firing
  receiver: 'mail'
# Receivers
receivers:
  - name: 'mail'
    email_configs:
      - to: '1665111913@qq.com'
7. Restart Alertmanager
systemctl restart alertmanager
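Once Alertmanager is back up, the receiver can be exercised without waiting for a real incident by pushing a synthetic alert to its v2 API. This is a sketch: the helper names are hypothetical, the alert name and labels are made up for the test, and the URL in the usage comment assumes Alertmanager's default port 9093.

```shell
#!/bin/sh
# Sketch: push a synthetic alert to Alertmanager to exercise the receiver.
# test_alert_payload and send_test_alert are hypothetical helpers; the
# labels and annotation are invented for the test.
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"test"},"annotations":{"summary":"manual test alert"}}]' "$1"
}

send_test_alert() {
    am_url="$1"
    name="$2"
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "$(test_alert_payload "$name")" \
        "$am_url/api/v2/alerts"
}

# Usage:
#   send_test_alert http://localhost:9093 TestAlert
```

The notification is not immediate: it should arrive only after the group_wait period configured in the route (30s in the DingTalk version, 10s in the email version).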
Note: for the detailed DingTalk alerting setup, see my other blog post
https://blog.csdn.net/ljx1528/article/details/120070330
DingTalk alert screenshot
Email alert screenshot