Setting Up Prometheus Monitoring

wyf_wyf_001

Article link: https://blog.csdn.net/wyf_wyf_001/article/details/139220199

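Alerting workflow

I. Lab preparation

1. Prepare two virtual machines

2. Make sure both virtual machines can reach the internet

II. Lab steps

1. Prometheus server

1.1 Download and install the Prometheus server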

$ rz promethues.zip

$ unzip promethues.zip

# Extract the Prometheus tarball by itself to complete the installation

$ tar -xf prometheus-2.14.0.linux-amd64.tar.gz

$ cd prometheus-2.14.0.linux-amd64/

$ tree ./

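# Installing Prometheus is simple: extract the archive and copy it to a directory of your choice. The conventional location is /usr/local/prometheus.

$ cd ..

$ cp -r prometheus-2.14.0.linux-amd64/ /usr/local/prometheus

1.2 Write the Prometheus systemd service unit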

$ cat>/usr/local/prometheus/prometheus.service<<EOF

[Unit]

Description=Prometheus

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/prometheus

ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml

Restart=on-failure

LimitNOFILE=65536

[Install]

WantedBy=multi-user.target

EOF
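The service units in this walkthrough share one layout and differ only in description, working directory, and start command. As a minimal sketch, assuming a hypothetical helper named render_unit (it is not part of Prometheus or systemd), that shared layout could be generated like this:

```shell
# render_unit DESCRIPTION WORKDIR EXECSTART
# Prints a systemd unit with the same fields as the Prometheus unit.
# The helper name is illustrative only.
render_unit() {
    cat <<EOF
[Unit]
Description=$1
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=$2
ExecStart=$3
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
}

render_unit Prometheus /usr/local/prometheus \
    "/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml"
```

Redirecting the output to /usr/local/prometheus/prometheus.service would produce a file equivalent to the one written by the heredoc.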

1.3 Register the unit with systemd

$ ln -s /usr/local/prometheus/prometheus.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start prometheus

$ systemctl enable prometheus

$ netstat -antp | grep LISTEN | grep :9090

tcp6 0 0 :::9090 :::* LISTEN 37984/prometheus
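A plain grep for :9090 also matches foreign addresses and longer port numbers. The following sketch, assuming a hypothetical helper named is_listening, checks the local-address column and the LISTEN state explicitly; it reads netstat-style lines from standard input, so it can be tried on the sample line without a running server.

```shell
# is_listening PORT
# Reads `netstat -antp`-style lines on stdin and succeeds if a socket
# in LISTEN state is bound to PORT. The helper name is illustrative.
is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")   # local address, e.g. :::9090 or 0.0.0.0:9090
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Sample line copied from the netstat output in this step.
sample='tcp6 0 0 :::9090 :::* LISTEN 37984/prometheus'
if printf '%s\n' "$sample" | is_listening 9090; then
    echo "port 9090 is listening"
fi
```

On the server itself the same check would be `netstat -antp | is_listening 9090`.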

1.4 From a browser on the Windows host, open 192.168.152.13:9090 to test the server.

1.5 Configuration file walkthrough

# Configuration file (stock, unmodified)

$ vim /usr/local/prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

# -----------------------------------------------------------------------------------------------

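# Key configuration sections (alertmanager, the exporters, and so on are not installed yet, so their settings are covered in detail later)

global: # Global settings (a value set inside a specific job overrides the global one)

scrape_interval: 15s

# Global default interval between scrapes

evaluation_interval: 15s

# Global default interval at which rules (mainly alerting rules) are evaluated

scrape_timeout: 10s

# Global default timeout for a single scrape; it is 10s even when the line is left commented out. If a target fails with "context deadline exceeded", set this field under that job. Note: scrape_timeout must not be larger than scrape_interval, otherwise Prometheus rejects the configuration.

alerting: # Alerting component definition; this is where the alertmanager instance is configured.

rule_files: # Alerting rule files, loaded according to the configured paths. They define custom alert rules (similar to triggers); notification delivery is handled by alertmanager.

scrape_configs: # Scrape configuration: defines the data sources, grouped by job_name with their targets, using either static configuration or service discovery.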
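The rule that scrape_timeout may not exceed scrape_interval can be checked before restarting the service. The sketch below is a simplified illustration, assuming durations written as a whole number followed by s, m, or h only; the helper names to_seconds and timeout_ok are hypothetical.

```shell
# to_seconds DURATION
# Converts a simple duration (whole number followed by s, m, or h) to seconds.
# Other Prometheus duration forms such as "500ms" or "1h30m" are not handled.
to_seconds() {
    num=${1%[smh]}
    case $1 in
        *s) echo "$num" ;;
        *m) echo $((num * 60)) ;;
        *h) echo $((num * 3600)) ;;
        *)  echo "unsupported duration: $1" >&2; return 1 ;;
    esac
}

# timeout_ok INTERVAL TIMEOUT
# Succeeds when TIMEOUT is less than or equal to INTERVAL.
timeout_ok() {
    [ "$(to_seconds "$2")" -le "$(to_seconds "$1")" ]
}

if timeout_ok 15s 10s; then
    echo "scrape_interval=15s scrape_timeout=10s: valid"
fi
if ! timeout_ok 15s 1m; then
    echo "scrape_interval=15s scrape_timeout=1m: timeout exceeds interval"
fi
```

For a full validation of the real file, the promtool binary shipped in the Prometheus tarball offers `promtool check config /usr/local/prometheus/prometheus.yml`.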

1.6 In the browser, click Status > Targets; two target groups are listed.

2. Prometheus - Node Exporter deployment

2.1 Extract and install Node Exporter

$ tar -xf node_exporter-0.18.1.linux-amd64.tar.gz

$ cp -r node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

2.2 Write the Node Exporter service unit

$ cat>/usr/local/node_exporter/node_exporter.service<<EOF

[Unit]

Description=Node Exporter

After=network.target

Wants=network-online.target

[Service]

Type=simple

User=root

ExecStart=/usr/local/node_exporter/node_exporter

Restart=on-failure

LimitNOFILE=65536

[Install]

WantedBy=multi-user.target

EOF

2.3 Register the unit with systemd

$ ln -s /usr/local/node_exporter/node_exporter.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start node_exporter

$ systemctl enable node_exporter

$ netstat -antp | grep LISTEN | grep :9100

tcp6 0 0 :::9100 :::* LISTEN 38150/node_exporter

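# From a browser on the Windows host, open 192.168.152.14:9100 to check the result

2.5 Install Node Exporter on the Prometheus server as well so that the server itself is monitored

$ scp -r /usr/local/node_exporter/ 192.168.152.13:/usr/local/

2.6 Update the monitoring configuration on the server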

$ ln -s /usr/local/node_exporter/node_exporter.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start node_exporter

$ systemctl enable node_exporter

$ vim /usr/local/prometheus/prometheus.yml

# Add the scrape targets: targets: ["192.168.152.13:9100","192.168.152.14:9100"]

$ systemctl restart prometheus

# Go back to the browser, refresh the page, and click Status > Targets
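Each Node Exporter target needs its port, 9100, in the target string. As a sketch, assuming a hypothetical helper named node_job and a job name of 'node' (the original does not say which job name was used), the scrape_configs entry for both hosts could be generated like this:

```shell
# node_job HOST...
# Prints a scrape_configs entry that scrapes port 9100 on every host given.
# The helper name and the job name 'node' are assumptions for illustration.
node_job() {
    printf "  - job_name: 'node'\n"
    printf "    static_configs:\n"
    printf "      - targets:\n"
    for host in "$@"; do
        printf "          - '%s:9100'\n" "$host"
    done
}

node_job 192.168.152.13 192.168.152.14
```

The printed fragment is indented to sit under the existing scrape_configs: key in /usr/local/prometheus/prometheus.yml.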

3. Prometheus - Grafana deployment

3.1. Download and install Grafana

$ tar -xf grafana-10.2.2.linux-amd64.tar.gz

$ cp -r grafana-v10.2.2 /usr/local/grafana

3.2. Write the Grafana service unit

$ cat>/usr/local/grafana/grafana-server.service<<EOF

[Unit]

Description=Grafana Server

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/grafana

ExecStart=/usr/local/grafana/bin/grafana-server

Restart=on-failure

LimitNOFILE=65536

[Install]

WantedBy=multi-user.target

EOF

3.3. Register the unit with systemd

$ ln -s /usr/local/grafana/grafana-server.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start grafana-server

$ systemctl enable grafana-server

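3.4. Web access: http://192.168.152.13:3000/login

Log in to Grafana with username admin and password admin, then enter and confirm a new password that satisfies the three password rules.

3.5. Import a dashboard template for Linux system resources

# Click DATA SOURCES > Prometheus > Name (enter Prometheus) > Prometheus server URL (http://192.168.152.13:9090) > Save & test

Dashboard ID: 8919. The dashboard author publishes new revisions, and a revision can become incompatible with an older Grafana release, so upgrade Grafana in step with the dashboard.

# Click the + icon in the upper-right corner > Import dashboard > enter 8919 > Load > select the Prometheus data source added above > Import.

3.6. Result: set the dashboard refresh interval to 15 seconds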

4. Prometheus - Alertmanager deployment (alerting configuration)

1. Installing the alertmanager component

4.1 Download and install the alertmanager component

$ tar -xf alertmanager-0.20.0.linux-amd64.tar.gz

$ cp -r alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager

4.2 Write the alertmanager service unit

$ cat>/usr/local/alertmanager/alertmanager.service<<EOF

[Unit]

Description=Alertmanager

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/alertmanager

ExecStart=/usr/local/alertmanager/alertmanager

Restart=on-failure

[Install]

WantedBy=multi-user.target

EOF

4.3 Register the unit with systemd

$ ln -s /usr/local/alertmanager/alertmanager.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start alertmanager

$ netstat -anpt |grep 9093

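# By default Prometheus does not send alerts to Alertmanager; edit the Prometheus configuration file to point it at the alertmanager component

2. Configuring the alertmanager component (email alerts)

4.4 Edit the configuration file to enable email alerts (back up the original, then overwrite it)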

$ vim /usr/local/alertmanager/alertmanager.yml

global:
  # Time after which an alert is declared resolved if it has not fired again
  resolve_timeout: 5m
  # SMTP settings for outgoing mail
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'wyf_wyf_001@163.com'
  smtp_auth_username: 'wyf_wyf_001@163.com'
  smtp_auth_password: 'OELAGVCQVDKGHZDM'
  # The password above is the third-party client authorization code requested in the mail provider's web console
  smtp_hello: '163.com'
  smtp_require_tls: false

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: default

receivers:
  - name: 'default'
    email_configs:
      - to: 'xbz_002@126.com'
        send_resolved: true

4.5 Add alerting rules and test them

# Edit /usr/local/prometheus/prometheus.yml to register the rule files

$ vim /usr/local/prometheus/prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.152.13:9093 # server IP and port

rule_files:
  - "rules/*rules.yml"

# Create /usr/local/prometheus/rules/test_rules.yml and add the monitoring rules to it

$ mkdir /usr/local/prometheus/rules/

$ vim /usr/local/prometheus/rules/test_rules.yml

groups:
  - name: test-rules
    rules:
      - alert: InstanceDown
        expr: avg(irate(node_cpu_seconds_total{cpu="0",mode="user"}[30m])) by (instance) >= 0.1
        # Per-second user-mode CPU rate on CPU 0, taken from the last two samples inside the 30m window and averaged per instance; the rule fires at 0.1 (10%) or above
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "{{$labels.instance}}: CPU Load is too high"
          description: "{{$labels.instance}}: {{$labels.job}} CPU Load is too high"

$ systemctl restart prometheus alertmanager

# Check in the browser that the alert rule has been loaded
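The irate() function in the test rule works from the last two samples in the range and returns a per-second rate. The sketch below reproduces that arithmetic for a single pair of samples, using made-up values and a hypothetical helper named irate_of; it ignores counter resets, which Prometheus handles itself.

```shell
# irate_of T1 V1 T2 V2
# Per-second increase of a counter between two samples taken at times T1 and T2.
# Counter resets are not handled in this simplified version.
irate_of() {
    awk -v t1="$1" -v v1="$2" -v t2="$3" -v v2="$4" \
        'BEGIN { printf "%.3f\n", (v2 - v1) / (t2 - t1) }'
}

# Hypothetical samples 15 seconds apart: user-mode CPU seconds grew from 100.0 to 101.8.
rate=$(irate_of 0 100.0 15 101.8)
echo "irate = $rate"

# Compare against the rule threshold of 0.1 (10% of one core).
if awk -v r="$rate" 'BEGIN { exit (r >= 0.1) ? 0 : 1 }'; then
    echo "condition >= 0.1 holds"
fi
```

With these numbers the rate is 1.8 / 15 = 0.120, so the condition holds; the alert itself would still wait for the `for: 1m` period before firing.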

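4.6 Generate load on the monitored machine

$ dd if=/dev/zero of=/dev/null & # can be run several times

4.7 Check the web page in the browser; the alert is now shown.

4.8 Create additional alerting rules on the Prometheus server

$ cd /usr/local/prometheus/rules

$ vim test2_rules.yml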

groups:
  - name: test
    rules:
      - alert: CPU 1-minute load alert
        expr: node_load1 / count (count (node_cpu_seconds_total) without (mode)) by (instance, job) > 2.5
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} CPU load alert"
          description: "{{$labels.instance}} 1-minute CPU load (current value: {{ $value }})"
      - alert: CPU usage alert
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle",job!~"(IDC-GPU|hw-nodes-prod-ES|nodes-test-GPU|nodes-dev-GPU)"}[30m])) by (instance) > 0.85
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage alert"
          description: "{{$labels.instance}} CPU usage is above 85% (current value: {{ $value }})"
      - alert: CPU usage alert
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle",job=~"(IDC-GPU|hw-nodes-prod-ES)"}[30m])) by (instance) > 0.9
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage alert"
          description: "{{$labels.instance}} CPU usage is above 90% (current value: {{ $value }})"
      - alert: Memory usage alert
        expr: (1-node_memory_MemAvailable_bytes{job!="IDC-GPU"} / node_memory_MemTotal_bytes{job!="IDC-GPU"}) * 100 > 60
        labels:
          level: critical
        annotations:
          summary: "{{ $labels.instance }} low available memory alert"
          description: "{{$labels.instance}} memory usage is above 60% (current value: {{ $value }})"
      - alert: Disk usage alert
        expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs", mountpoint !~ "/var/lib/(kubelet|rancher).*" } / node_filesystem_size_bytes{fstype=~"ext4|xfs", mountpoint !~ "/var/lib/(kubelet|rancher).*"}) * 100 > 85
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} disk usage alert"
          description: "{{$labels.instance}} disk usage is above 85% (current value: {{ $value }})"
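The disk rule excludes some mountpoints with a regular expression, and the choice of brackets matters there. Square brackets, as in [kubelet|rancher], form a character class that matches one single character, while parentheses, as in (kubelet|rancher), match either whole word. The sketch below compares the two with grep -E; the -x flag anchors the pattern at both ends, which is how Prometheus treats label matchers. The helper name matches and the sample mountpoints are illustrative.

```shell
# matches PATTERN STRING
# Succeeds if STRING matches the extended regex PATTERN, anchored at both ends.
matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

for mp in /var/lib/kubelet/pods /var/lib/rancher /var/lib/backup /data; do
    if matches '/var/lib/[kubelet|rancher].*' "$mp"; then class=excluded; else class=kept; fi
    if matches '/var/lib/(kubelet|rancher).*' "$mp"; then group=excluded; else group=kept; fi
    echo "$mp: character class -> $class, alternation -> $group"
done
```

With the character class, /var/lib/backup is excluded as well, because its first letter b is one of the listed characters; the alternation form keeps it under monitoring.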

$ systemctl restart prometheus

# Check the browser again; several more alerts are listed
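To make the thresholds in the load and memory rules concrete, the following sketch repeats their arithmetic with made-up readings. The helper names load_per_core and mem_used_pct are hypothetical, and the input values are not taken from the lab machines.

```shell
# load_per_core LOAD1 CORES
# 1-minute load average divided by the number of CPU cores.
load_per_core() {
    awk -v load1="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load1 / cores }'
}

# mem_used_pct AVAILABLE TOTAL
# Memory usage in percent: (1 - available / total) * 100.
mem_used_pct() {
    awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.2f\n", (1 - avail / total) * 100 }'
}

# Hypothetical readings: load 6.0 on 2 cores; 3 GiB available out of 8 GiB.
echo "load per core: $(load_per_core 6.0 2)"   # 6.0 / 2 = 3.00, above the 2.5 threshold
echo "memory used %: $(mem_used_pct 3 8)"      # (1 - 3/8) * 100 = 62.50, above the 60 threshold
```

The same load of 6.0 on a 4-core machine gives 1.50 per core and would not trigger the load rule.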

3. DingTalk alerts

1. In DingTalk, add a robot: set the robot name > tick the signing option > copy the signing secret (keep a copy) > copy the API URL

Signing secret: SEC445e54a9c3a877038ce302bfe07c1c511fe76c5034b4d05bddae2a5fb5fadef3

API URL: https://oapi.dingtalk.com/robot/send?access_token=fa20ed62eb9ed8dc06d3078399e55ca15266ad96dc3280482b1e903eafd707d7
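For background on what the signing secret is used for: DingTalk's robot documentation describes a signature computed as HMAC-SHA256 over the millisecond timestamp, a newline, and the secret, keyed with the secret, then Base64-encoded and URL-encoded. The DingTalk plugin installed in the next step performs this itself once the secret is configured, so the sketch below is only an illustration. It assumes the openssl command is available; the helper name dingtalk_sign and the placeholder secret and token are not real values.

```shell
# dingtalk_sign TIMESTAMP_MS SECRET
# Prints the URL-encoded signature: HMAC-SHA256 over "TIMESTAMP\nSECRET",
# keyed with SECRET, Base64-encoded, with +, / and = percent-encoded.
dingtalk_sign() {
    printf '%s\n%s' "$1" "$2" \
        | openssl dgst -sha256 -hmac "$2" -binary \
        | openssl base64 -A \
        | sed -e 's/+/%2B/g' -e 's,/,%2F,g' -e 's/=/%3D/g'
}

# Placeholder values for illustration only.
secret='SECxxxxxxxx'
timestamp=1700000000000   # milliseconds since the epoch

if command -v openssl >/dev/null 2>&1; then
    sign=$(dingtalk_sign "$timestamp" "$secret")
    echo "https://oapi.dingtalk.com/robot/send?access_token=TOKEN&timestamp=${timestamp}&sign=${sign}"
fi
```

A request sent by hand would append the timestamp and sign parameters to the API URL exactly as printed.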

2. Install the DingTalk plugin on the Prometheus server

$ tar -xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

$ cp -r prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/dingtalk/

# Create the service unit file

$ cd /usr/local/dingtalk/

$ vim dingtalk.service

[Unit]

Description=dingtalk

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/dingtalk/

ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk

Restart=on-failure

[Install]

WantedBy=multi-user.target

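3. Edit the DingTalk plugin configuration file

$ mv config.example.yml config.yml

$ vim config.yml

# Uncomment templates and the line below it to enable the alert template path

# url: enter the DingTalk API URL here

# secret: enter the DingTalk signing secret here

# mobiles: enter the phone numbers to mention in DingTalk alerts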

$ cd /usr/local/dingtalk/contrib/templates/legacy

$ vim template.tmpl

[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert name**: {{ index .Annotations "title" }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert name**: {{ index .Annotations "title" }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**====Detected {{ .Alerts.Firing | len }} firing alerts====**

---

**====Resolved {{ .Alerts.Resolved | len }} alerts====**

$ ln -s /usr/local/dingtalk/dingtalk.service /lib/systemd/system

$ systemctl daemon-reload

$ systemctl start dingtalk

$ netstat -anpt | grep 8060 # check port 8060

4. Edit the Alertmanager configuration so that incoming alerts are forwarded to the DingTalk plugin

$ vim /usr/local/alertmanager/alertmanager.yml

# Append at the very bottom

    webhook_configs:
      - url: 'http://192.168.152.13:8060/dingtalk/webhook1/send'
        send_resolved: true

$ systemctl restart alertmanager

$ netstat -anpt |grep 9093
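The receivers can also be exercised without waiting for a real alert, by posting a hand-made alert to Alertmanager's v2 HTTP API. The sketch below only builds and prints the JSON body; the helper name test_alert_payload, the alert name ManualTest, and the label values are assumptions for illustration, and the curl call is left as a comment so nothing is sent.

```shell
# test_alert_payload NAME INSTANCE
# Prints a JSON array containing one alert, in the shape accepted by
# Alertmanager's POST /api/v2/alerts endpoint.
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","level":"warning"},' "$1" "$2"
    printf '"annotations":{"summary":"manual test alert","description":"sent by hand to exercise the receivers"}}]\n'
}

payload=$(test_alert_payload ManualTest 192.168.152.14:9100)
echo "$payload"

# To send it (not executed here), assuming Alertmanager listens on 192.168.152.13:9093:
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" http://192.168.152.13:9093/api/v2/alerts
```

If the routing works, the alert arrives by email and in the DingTalk group after the group_wait delay of 30s.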

5. Create alert conditions on hosts 11 and 12

$ dd if=/dev/zero of=/dev/null & # run several times, then check the alerts

$ dd if=/dev/zero of=/a.txt bs=1G count=16
