Prometheus Monitoring Setup

Alerting Workflow

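I. Lab Preparation

1. Prepare two virtual machines

2. Make sure the virtual machines can reach the internet

II. Lab Steps

1. Prometheus server

 1.1 Download and install the Prometheus Server service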

$ rz promethues.zip

$ unzip promethues.zip

# Extract the Prometheus tarball on its own to complete the installation

$ tar -xf prometheus-2.14.0.linux-amd64.tar.gz

$ cd prometheus-2.14.0.linux-amd64/

$ tree ./

# Installing Prometheus is simple: extract the archive and copy it to a directory of your choice. The conventional location is /usr/local/prometheus

$ cd ..

$ cp -r prometheus-2.14.0.linux-amd64/ /usr/local/prometheus

 1.2 Write the Prometheus systemd unit file

$ cat>/usr/local/prometheus/prometheus.service<<EOF

[Unit]

Description=Prometheus

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/prometheus

ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml

Restart=on-failure

LimitNOFILE=65536

[Install]

WantedBy=multi-user.target

EOF

 1.3 Register the unit file with systemd

$ ln -s /usr/local/prometheus/prometheus.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start prometheus

$ systemctl enable prometheus

$ netstat -antp | grep LISTEN | grep :9090

tcp6       0      0 :::9090                 :::*                    LISTEN      37984/prometheus

 1.4 From a browser on the Windows host, open 192.168.152.13:9090 to test.
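The browser test can also be scripted. The following is a minimal sketch, assuming curl is installed and the server address used above; it polls Prometheus's readiness endpoint (/-/ready) until it answers or the attempts run out. The function name wait_ready and the default of 10 attempts are illustrative choices, not part of the original setup.

```shell
# Poll a URL until it responds successfully or the attempts run out.
# $1 = URL to probe, $2 = maximum number of attempts (default 10).
wait_ready() {
  url="$1"
  tries="${2:-10}"
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the last one.
    [ "$i" -le "$tries" ] && sleep 1
  done
  echo "not ready after $tries attempts: $url" >&2
  return 1
}

# Usage:
#   wait_ready http://192.168.152.13:9090/-/ready 10
```

The same function works for the other services in this walkthrough by passing a different URL, since it only checks that the request succeeds.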

1.5 Configuration file walkthrough

# The configuration file (original, unmodified)

$ vim /usr/local/prometheus/prometheus.yml

# my global config

global:

  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.

  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  # scrape_timeout is set to the global default (10s).

 

# Alertmanager configuration

alerting:

  alertmanagers:

  - static_configs:

    - targets:

      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

  # - "first_rules.yml"

  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'

    # scheme defaults to 'http'.

    static_configs:

    - targets: ['localhost:9090']

# -----------------------------------------------------------------------------------------------

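# Key configuration fields (Alertmanager, exporters, etc. are not installed yet; their settings are covered in detail later)

global:                        # Global settings (a value set inside an individual job overrides the one here)

  scrape_interval:     15s

  # Global default interval between scrapes

  evaluation_interval: 15s

  # Global default interval for evaluating rules (mainly alerting rules)

  scrape_timeout:            10s

  # Global default timeout for a single scrape; the default is 10s. If a target reports "context deadline exceeded", set this field in that specific job. Note: scrape_timeout must not be greater than scrape_interval, otherwise Prometheus reports an error.

alerting:                # Alerting component settings; this is where Alertmanager is configured.

rule_files:                # Alerting rule files, loaded according to the patterns given here. Used to define custom alert rules (similar to triggers); the notifications themselves are delivered by Alertmanager.

scrape_configs:        # Scrape configuration: defines the data sources, grouped by job_name with their targets, using either static configuration or service discovery.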
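The constraint that scrape_timeout must not exceed scrape_interval can be checked before editing the file. A minimal sketch, assuming both values are written as whole seconds with an "s" suffix (Prometheus also accepts other units such as "m", which this helper does not handle); the function name timeout_ok is hypothetical.

```shell
# Hypothetical helper: succeeds when the timeout does not exceed the interval.
# Assumes both arguments look like "15s" (whole seconds only).
timeout_ok() {
  interval="${1%s}"   # strip the trailing "s"
  timeout="${2%s}"
  [ "$timeout" -le "$interval" ]
}

if timeout_ok 15s 10s; then
  echo "scrape_timeout fits within scrape_interval"
fi
```

With the values from the file above (15s interval, 10s timeout) the check passes; a 20s timeout against a 15s interval would fail it.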

1.6 In the browser, click Status -> Targets; two host groups are listed

2. Prometheus - Node Exporter deployment

2.1 Extract and install Node Exporter

$ tar -xf node_exporter-0.18.1.linux-amd64.tar.gz

$ cp -r node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

2.2 Write the Node Exporter systemd unit file

$ cat>/usr/local/node_exporter/node_exporter.service<<EOF

[Unit]

Description=Node Exporter

After=network.target

Wants=network-online.target

[Service]

Type=simple

User=root

ExecStart=/usr/local/node_exporter/node_exporter

Restart=on-failure

LimitNOFILE=65536

[Install]

WantedBy=multi-user.target

EOF

2.3 Register the unit file with systemd

$ ln -s /usr/local/node_exporter/node_exporter.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start node_exporter

$ systemctl enable node_exporter

$ netstat -antp | grep LISTEN | grep :9100

tcp6       0      0 :::9100                 :::*                    LISTEN      38150/node_exporter

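2.4 From a browser on the Windows host, open 192.168.152.14:9100 to check the result

2.5 Install Node Exporter on the server as well so that it is monitored too

$ scp -r /usr/local/node_exporter/   192.168.152.13:/usr/local/

2.6 Update the monitoring configuration on the server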

$ ln -s /usr/local/node_exporter/node_exporter.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start node_exporter

$ systemctl enable node_exporter

$ vim /usr/local/prometheus/prometheus.yml

# Add the monitored targets: targets: ["192.168.152.13:9100","192.168.152.14:9100"]

$ systemctl restart prometheus

# Go back to the browser, refresh the page, and click Status -> Targets
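The original does not say which job the new targets belong to. One option is a separate scrape job for the exporters; the sketch below prints such a block for pasting under scrape_configs in prometheus.yml. The job name 'node' and the function name make_node_job are assumptions made for illustration.

```shell
# Print a scrape job entry for the given host:port targets.
# The job name 'node' is an illustrative choice.
make_node_job() {
  printf "  - job_name: 'node'\n    static_configs:\n    - targets: ["
  sep=""
  for host in "$@"; do
    printf "%s'%s'" "$sep" "$host"
    sep=", "
  done
  printf "]\n"
}

# Usage:
#   make_node_job 192.168.152.13:9100 192.168.152.14:9100
```

Every target needs an explicit port: without one, Prometheus falls back to the default port of the scheme (80 for http) rather than Node Exporter's 9100.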

3. Prometheus - Grafana deployment

3.1. Download and install Grafana

   $ tar -xf grafana-10.2.2.linux-amd64.tar.gz

   $ cp -r grafana-v10.2.2  /usr/local/grafana

3.2. Write the Grafana systemd unit file

   $ cat>/usr/local/grafana/grafana-server.service<<EOF

   [Unit]

   Description=Grafana Server

   After=network.target

   

   [Service]

   Type=simple

   User=root

   WorkingDirectory=/usr/local/grafana

   ExecStart=/usr/local/grafana/bin/grafana-server

   

   Restart=on-failure

   LimitNOFILE=65536

   

   [Install]

   WantedBy=multi-user.target

   EOF

3.3. Register the unit file with systemd

   $ ln -s /usr/local/grafana/grafana-server.service /lib/systemd/system/

   $ systemctl daemon-reload

   $ systemctl start grafana-server

   $ systemctl enable grafana-server

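3.4. Open the web UI: http://192.168.152.13:3000/login

Log in to Grafana with username admin and password admin, then enter and confirm a new password that satisfies the three password rules.

3.5. Install a dashboard template for Linux system resources

# Click DATA SOURCES -> Prometheus -> Name (enter Prometheus) -> Prometheus server URL (http://192.168.152.13:9090) -> Save & test

   Template ID: 8919. The template author publishes new versions, and a version change can make the template incompatible with the installed software, so keep the software updated to match the template.

# Click the + sign in the upper-right corner -> Import dashboard -> 8919 -> Load -> select the Prometheus data source added above -> Import.

3.6. View the result: set the refresh interval to 15 seconds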

4. Prometheus - Alertmanager deployment (alerting configuration)

       1. Installing the Alertmanager component

4.1 Download and install the Alertmanager component

$ tar -xf alertmanager-0.20.0.linux-amd64.tar.gz

$ cp -r alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager

4.2 Write the Alertmanager systemd unit file

$ cat>/usr/local/alertmanager/alertmanager.service<<EOF

[Unit]

Description=Alertmanager

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/alertmanager

ExecStart=/usr/local/alertmanager/alertmanager

Restart=on-failure

[Install]

WantedBy=multi-user.target

EOF

4.3 Register the unit file with systemd

$ ln -s /usr/local/alertmanager/alertmanager.service /lib/systemd/system/

$ systemctl daemon-reload

$ systemctl start alertmanager

$ netstat -anpt |grep 9093

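# By default Prometheus does not call Alertmanager; the Prometheus configuration file must be modified to use the Alertmanager component

  

      2. Alertmanager component setup (email alerts)

4.4 Edit the configuration file to enable email alerts (back up the original, then overwrite it)

$  vim /usr/local/alertmanager/alertmanager.yml

global:

  # Time after which an alert is declared resolved if it has not been updated

  resolve_timeout: 5m

  # Email sending settings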

  smtp_smarthost: 'smtp.163.com:25'

  smtp_from: 'wyf_wyf_001@163.com'

  smtp_auth_username: 'wyf_wyf_001@163.com'

  smtp_auth_password: 'OELAGVCQVDKGHZDM'

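  # The password above is an authorization code for third-party clients, which must be requested in the mail provider's web interface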

  smtp_hello: '163.com'

  smtp_require_tls: false

route:

  group_by: ['alertname', 'cluster']

  group_wait: 30s

  group_interval: 5m

  repeat_interval: 5m

  receiver: default

 

receivers:

- name: 'default'

  email_configs:

  - to: 'xbz_002@126.com'

    send_resolved: true

4.5 Add alerting rules and test them

# Edit /usr/local/prometheus/prometheus.yml to add the rule files

$ vim /usr/local/prometheus/prometheus.yml

alerting:

  alertmanagers:

  - static_configs:

    - targets:

       - 192.168.152.13:9093                        # Server IP and port

rule_files:

  - "rules/*rules.yml"

# Create and edit /usr/local/prometheus/rules/test_rules.yml to add the monitoring rules

 

$ mkdir /usr/local/prometheus/rules/

$ vim /usr/local/prometheus/rules/test_rules.yml

groups:

    - name: test-rules

      rules:

      - alert: InstanceDown

        expr: avg(irate(node_cpu_seconds_total{cpu="0",mode="user"}[30m])) by (instance) >= 0.1

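# Fires when the per-second rate of user-mode CPU time on CPU 0 is at least 0.1 (10%); irate derives the rate from the last two samples within the 30-minute window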

        for: 1m

        labels:

          status: warning

        annotations:

          summary: "{{$labels.instance}}: CPU Load is too high"

          description: "{{$labels.instance}}:  {{$labels.job}} CPU Load is too high"

 

 

$ systemctl restart  prometheus alertmanager

# Check in the browser that the alert rule was added successfully
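A syntax error in prometheus.yml or in a rule file keeps Prometheus from starting, so it can help to validate both before the restart. The sketch below uses promtool, which ships in the Prometheus tarball; the install paths follow the steps above, while the function name validate_prometheus and the PROMTOOL override variable are assumptions made for illustration.

```shell
# Validate the main config and the rule files before restarting Prometheus.
# PROMTOOL may be overridden; the default assumes the install path used above.
validate_prometheus() {
  tool="${PROMTOOL:-/usr/local/prometheus/promtool}"
  if [ ! -x "$tool" ]; then
    echo "promtool not found at $tool" >&2
    return 1
  fi
  "$tool" check config /usr/local/prometheus/prometheus.yml &&
    "$tool" check rules /usr/local/prometheus/rules/*rules.yml
}

# Usage (restart only when validation passes):
#   validate_prometheus && systemctl restart prometheus alertmanager
```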

 

 

4.6 Generate load on the monitored machine

$ dd if=/dev/zero  of=/dev/null  &  # Can be run several times.
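Background dd processes started this way keep running until they are killed. A minimal sketch of a time-limited variant, assuming GNU coreutils timeout is available; the function name start_load and its arguments are illustrative.

```shell
# Start N background dd workers that stop by themselves after SECS seconds.
# $1 = number of workers, $2 = duration in seconds.
start_load() {
  n="$1"
  secs="$2"
  i=0
  while [ "$i" -lt "$n" ]; do
    timeout "$secs" dd if=/dev/zero of=/dev/null >/dev/null 2>&1 &
    i=$((i + 1))
  done
  echo "started $n dd workers for ${secs}s"
}

# Usage: four workers for five minutes
#   start_load 4 300
```

The duration should exceed the rule's "for" period plus the scrape interval, otherwise the alert may never leave the pending state.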

4.7 Check the web page in the browser; the alert should appear.

4.8 Create additional alerting rules on the Prometheus server

$ cd /usr/local/prometheus/rules

$ vim test2_rules.yml

groups:

- name: test

  rules:

  - alert: CPU 1-minute load alert

    expr:  node_load1 / count (count (node_cpu_seconds_total) without (mode)) by (instance, job) > 2.5

    for: 1m 

    labels:

      level: warning

    annotations:

      summary: "{{ $labels.instance }} CPU load alert"

      description: "{{$labels.instance}} 1-minute CPU load (current value: {{ $value }})"

  - alert: CPU usage alert

    expr:  1 - avg(irate(node_cpu_seconds_total{mode="idle",job!~"(IDC-GPU|hw-nodes-prod-ES|nodes-test-GPU|nodes-dev-GPU)"}[30m])) by (instance) > 0.85

    for: 1m 

    labels:

      level: warning

    annotations:

      summary: "{{ $labels.instance }} CPU usage alert"

      description: "{{$labels.instance}} CPU usage is above 85% (current value: {{ $value }})"

  - alert: CPU usage alert

    expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle",job=~"(IDC-GPU|hw-nodes-prod-ES)"}[30m])) by (instance) > 0.9

    for: 1m

    labels:

      level: warning

    annotations:

      summary: "{{ $labels.instance }} CPU usage alert"

      description: "{{$labels.instance}} CPU usage is above 90% (current value: {{ $value }})"

  - alert: Memory usage alert

    expr:  (1-node_memory_MemAvailable_bytes{job!="IDC-GPU"} /  node_memory_MemTotal_bytes{job!="IDC-GPU"}) * 100 > 60

    labels:

      level: critical

    annotations:

      summary: "{{ $labels.instance }} low available memory alert"

      description: "{{$labels.instance}} memory usage is above 60% (current value: {{ $value }})"

  - alert: Disk usage alert

    expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs", mountpoint !~ "/var/lib/(kubelet|rancher).*" } / node_filesystem_size_bytes{fstype=~"ext4|xfs", mountpoint !~ "/var/lib/(kubelet|rancher).*"}) * 100 > 85

    labels:

      level: warning

    annotations:

      summary: "{{ $labels.instance }} disk usage alert"

      description: "{{$labels.instance}} disk usage is above 85% (current value: {{ $value }})"

 

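$ systemctl restart prometheus

# Check the browser again; several more alerts should be listed

 

  3. DingTalk alerts

1. In DingTalk, add a robot: set the robot name -> enable signing -> copy the signing secret (keep a copy) -> copy the API endpoint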

 

Signing secret: SEC445e54a9c3a877038ce302bfe07c1c511fe76c5034b4d05bddae2a5fb5fadef3

API endpoint: https://oapi.dingtalk.com/robot/send?access_token=fa20ed62eb9ed8dc06d3078399e55ca15266ad96dc3280482b1e903eafd707d7

2. Install the DingTalk plugin on the Prometheus server

$ tar -xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

$ cp -r prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/dingtalk/

# Create the systemd unit file

$ cd /usr/local/dingtalk/

$ vim dingtalk.service

[Unit]

Description=dingtalk

After=network.target

[Service]

Type=simple

User=root

WorkingDirectory=/usr/local/dingtalk/

ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk

Restart=on-failure

[Install]

WantedBy=multi-user.target

3. Edit the DingTalk plugin configuration file

$ mv config.example.yml  config.yml

$ vim config.yml

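# Uncomment templates and the line after it to enable the alert template path

# url: fill in the DingTalk API endpoint

# secret: fill in the DingTalk signing secret

# mobiles: fill in the phone numbers to notify in DingTalk

$ cd /usr/local/dingtalk/contrib/template/legacy

$ vim template.tmpl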

{{ define "__subject" }}

[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]

{{ end }}

{{ define "__alert_list" }}{{ range . }}

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert name**: {{ index .Annotations "summary" }}

**Severity**: {{ .Labels.level }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Fired at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert name**: {{ index .Annotations "summary" }}

**Severity**: {{ .Labels.level }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Fired at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

{{ end }}{{ end }}

{{ define "default.title" }}

{{ template "__subject" . }}

{{ end }}

{{ define "default.content" }}

{{ if gt (len .Alerts.Firing) 0 }}

**====Detected {{ .Alerts.Firing | len  }} firing alert(s)====**

{{ template "__alert_list" .Alerts.Firing }}

---

{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}

**===={{ .Alerts.Resolved | len  }} alert(s) resolved====**

{{ template "__resolved_list" .Alerts.Resolved }}

{{ end }}

{{ end }}

{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}

{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

{{ template "default.title" . }}

{{ template "default.content" . }}

$ ln -s /usr/local/dingtalk/dingtalk.service /lib/systemd/system

$ systemctl  daemon-reload

$ systemctl  start dingtalk

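$ netstat -anpt |grep 8060    # Check port 8060

4. Edit the Alertmanager configuration file so that the alerts it receives are forwarded to the DingTalk plugin

$ vim /usr/local/alertmanager/alertmanager.yml

# Append at the bottom, under the 'default' receiver

  webhook_configs: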

  - url: 'http://192.168.152.13:8060/dingtalk/webhook1/send'

    send_resolved: true

$ systemctl restart alertmanager

$ netstat -anpt |grep 9093
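Notification routing can be tested without waiting for a real rule to fire by posting an alert to Alertmanager by hand. The sketch below builds a payload for Alertmanager's v2 API (POST /api/v2/alerts) and sends it with curl. The function names, the alert name, and the label and annotation values are illustrative assumptions; the default address is the server used above.

```shell
# Print a one-element JSON array in the shape Alertmanager's v2 API accepts.
# $1 = alert name, $2 = instance label (both illustrative placeholders).
make_test_alert() {
  name="$1"
  instance="$2"
  printf '[{"labels":{"alertname":"%s","instance":"%s","level":"warning"},"annotations":{"summary":"manual test alert","description":"sent by hand to test notification routing"}}]\n' "$name" "$instance"
}

# POST the payload to Alertmanager.
# $3 = base URL (defaults to the server address used above).
send_test_alert() {
  make_test_alert "$1" "$2" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${3:-http://192.168.152.13:9093}/api/v2/alerts"
}

# Usage:
#   send_test_alert ManualTest demo-host
```

With the route settings shown earlier (group_wait: 30s), the notification should arrive roughly half a minute after the alert is posted.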

5. Create alert conditions on hosts 11 and 12

$ dd if=/dev/zero  of=/dev/null  &        # Run it several times, then check the alerts.

$ dd if=/dev/zero of=/a.txt   bs=1G count=16
