Prometheus and Its Companions

事后清晨

Last modified: 2023-02-22 09:45:28


First published: 2023-02-21 18:14:13

Article link: https://blog.csdn.net/AiTmm/article/details/129146073

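A hands-on, step-by-step introduction 👀

The components are:

prometheus: the monitoring server

node_exporter: host status monitoring

snmp_exporter: switch and firewall monitoring

grafana: dashboards for the monitoring data

alertmanager: alert rules and email alerting

PrometheusAlert: an alert relay that forwards messages to a Feishu bot, completing the platform.

You need the installation package for each component.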

1: prometheus-2.36.0.linux-amd64.tar.gz

2: node_exporter-1.4.0.linux-amd64.tar.gz

3: grafana-9.2.3.linux-amd64.tar.gz

4: alertmanager-0.23.0.linux-amd64.tar.gz

5: PrometheusAlert v4.7 (linux.zip)

6: snmp_exporter-0.20.0.linux-amd64.tar.gz

#prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.36.0/prometheus-2.36.0.linux-amd64.tar.gz

#node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz

#Grafana
wget https://dl.grafana.com/oss/release/grafana-9.2.3.linux-amd64.tar.gz

#alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz

#PrometheusAlert
wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.7/linux.zip

#snmp_exporter
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz

Preparation before installing:

#Stop and disable the firewall
[root@localhost]# systemctl stop firewalld
[root@localhost]# systemctl disable firewalld
#Disable SELinux
[root@localhost]# setenforce 0 #temporary, until the next reboot
[root@localhost]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config  #permanent
#Time synchronization
[root@localhost]# systemctl restart ntpd
[root@localhost]# systemctl enable ntpd
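
Since the sed command edits /etc/selinux/config in place, it can be reassuring to preview the substitution first. The following is only a sketch: the function name preview_selinux_disable and the sample file are made up for illustration, and the pattern is anchored to the start of the line so that SELINUXTYPE is left alone.

```shell
#!/usr/bin/env bash
# preview_selinux_disable FILE
# Prints FILE with SELINUX=enforcing switched to SELINUX=disabled,
# without modifying FILE itself (no -i flag).
preview_selinux_disable() {
    sed 's/^SELINUX=enforcing/SELINUX=disabled/' "$1"
}

# Demo on a scratch file that mimics the relevant lines of /etc/selinux/config
sample=$(mktemp)
printf '# sample\nSELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$sample"
preview_selinux_disable "$sample"
```

On a real host you would call preview_selinux_disable /etc/selinux/config and, once the output looks right, run the sed -i command.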

1、prometheus

Download the version you need from the official site and upload it to the server. The official release is a prebuilt binary, so it runs straight after extraction with no compilation needed.

[root@localhost]# wget https://github.com/prometheus/prometheus/releases/download/v2.36.0/prometheus-2.36.0.linux-amd64.tar.gz
[root@localhost]# tar -xf prometheus-2.36.0.linux-amd64.tar.gz -C /data/
[root@localhost]# mv /data/prometheus-2.36.0.linux-amd64/ /data/prometheus

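Add startup parameters as needed.

ExecStart=/data/prometheus/prometheus #binary that starts the service
--config.file=/data/prometheus/prometheus.yml #configuration file
--web.listen-address="0.0.0.0:9090" #IP and port of the Prometheus web UI, default "0.0.0.0:9090"
--web.enable-lifecycle #enable shutdown and reload via HTTP requests
--storage.tsdb.path=/data/prometheus/data/ #base path for stored data
--storage.tsdb.retention=15d #how long to keep data (default 15 days); deprecated alias of --storage.tsdb.retention.time
--log.level=info #log level

Without --web.enable-lifecycle, running curl -X POST http://localhost:9090/-/reload returns an error.

With --web.enable-lifecycle set, the command curl -X POST http://localhost:9090/-/reload

reloads the configuration file without restarting the service.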
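
The reload call can be wrapped in a small helper so that a failure is reported instead of passing silently. This is a minimal sketch, not part of Prometheus: the function name reload_prometheus and its messages are invented here, and it assumes curl is installed and that Prometheus was started with --web.enable-lifecycle.

```shell
#!/usr/bin/env bash
# reload_prometheus [BASE_URL]
# Sends the lifecycle reload request and reports the outcome.
reload_prometheus() {
    local base_url="${1:-http://localhost:9090}"
    # -f: treat HTTP errors as failures, -sS: quiet, but still print errors
    if curl -fsS -X POST "${base_url}/-/reload"; then
        echo "reload requested at ${base_url}"
    else
        echo "reload failed; is --web.enable-lifecycle set?" >&2
        return 1
    fi
}
```

Typical use is reload_prometheus with no argument on the Prometheus host, or reload_prometheus http://xx.xx.xx.xx:9090 from elsewhere.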

Add a systemd unit file

[root@localhost]# cat <<'EOF' > /usr/lib/systemd/system/prometheus.service

[Unit]
Description=prometheus server daemon
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
ExecStart=/data/prometheus/prometheus --config.file=/data/prometheus/prometheus.yml --web.enable-lifecycle --storage.tsdb.path=/data/prometheus/data/  --storage.tsdb.retention=15d --log.level=info
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
KillMode=process
Restart=on-failure
RestartSec=42s
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable prometheus.service

/data/prometheus/prometheus.yml, original configuration

[root@localhost]# cat /data/prometheus/prometheus.yml
# my global config
global: #global settings (a job-level setting overrides these values)
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting: #alerting component; this is where Alertmanager is configured
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files: #alert rules: files listed here are loaded as custom alert rules; notification channels and routing are handled by Alertmanager
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs: #scrape settings: data sources, grouped by job_name with their targets, via static configuration or service discovery
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus" #scrape job

    # metrics_path defaults  to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"] #monitored IP and port

# remote_write: #remote storage write settings
# remote_read: #remote storage read settings

/data/prometheus/prometheus.yml, after the changes

[root@localhost]# cat /data/prometheus/prometheus.yml
#my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - xx.xx.xx.xx:9093
rule_files:
   - /data/prometheus/rules/*.yml #rule files

scrape_configs:
  - job_name: "prometheus" #job name
    static_configs:
      - targets: ["xx.xx.xx.xx:9100"] #monitored IP and port

  - job_name: "containers"
    static_configs:
      - targets: ["xx.xx.xx.xx:9100","xx.xx.xx.xx:9100"] #several IPs and ports. Note: after adding a node to static_configs, Prometheus has to be restarted (or reloaded) to pick it up.
        labels:
          Docker: docker-containers

  - job_name: "Docker-Dev"
    file_sd_configs:
    - files:
      - '/data/prometheus/monitor/docker_dev.yml' #target file for dynamic discovery
      refresh_interval: 15s #re-read the file every 15 seconds
#file_sd_configs: service discovery that dynamically finds the target instances to monitor.

#Configuration syntax check: after editing the configuration file, run this command to verify it

[root@localhost]#  ./promtool check config prometheus.yml 
Checking prometheus.yml
  SUCCESS: 1 rule files found

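When monitoring a cluster, we cannot restart the service every time a node is added or removed just to apply a configuration change, which is where file_sd_configs comes in. Replace the static_configs block in prometheus.yml with a file_sd_configs block and list the target files; each job gets its own target file under a common parent directory.

Target file for dynamic discovery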

[root@localhost]# cat /data/prometheus/monitor/docker_dev.yml
- targets: ["docker-dev-01.os:9100","docker-dev-02.os:9100","docker-dev-03.os:9100"]
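
Target files like this one are easy to generate from a host list. The sketch below is only an illustration: render_targets and write_targets are made-up helper names, and writing to a temporary file before renaming is a precaution chosen here so that Prometheus never sees a half-written file.

```shell
#!/usr/bin/env bash
# render_targets PORT HOST...
# Prints a file_sd target list in the same one-line YAML form used above.
render_targets() {
    local port="$1"; shift
    local list="" host
    for host in "$@"; do
        # Append "host:port", with a comma in front of every entry but the first
        list="${list:+${list},}\"${host}:${port}\""
    done
    printf '%s\n' "- targets: [${list}]"
}

# write_targets FILE PORT HOST...
# Writes to a temporary file first, then renames it into place.
write_targets() {
    local file="$1"; shift
    local tmp="${file}.tmp.$$"
    render_targets "$@" > "$tmp" && mv "$tmp" "$file"
}

render_targets 9100 docker-dev-01.os docker-dev-02.os docker-dev-03.os
```

For example, write_targets /data/prometheus/monitor/docker_dev.yml 9100 docker-dev-01.os docker-dev-02.os would regenerate the file, and Prometheus picks up the change on its next refresh.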

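Start the service

[root@localhost]# systemctl start prometheus.service
[root@localhost]# netstat -tlnp | grep 9090  #or: lsof -i:9090; check that the port is listening
[root@localhost]# systemctl status prometheus.service  #check the service status

Open IP:9090 in a browser to reach the Prometheus home page.

The status of the monitored nodes can be checked there.

With that, the Prometheus server is set up.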

2、node_exporter

https://github.com/prometheus/node_exporter/releases

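[root@localhost]# wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
[root@localhost]# mkdir -p /data/node_exporter
[root@localhost]# tar zxvf node_exporter-1.4.0.linux-amd64.tar.gz -C /data/node_exporter --strip-components=1

[root@localhost]# nohup /data/node_exporter/node_exporter &  #run in the background

[root@localhost]# netstat -tlnp | grep 9100  #or: lsof -i:9100; check that the port is listening

Unless told otherwise, node_exporter listens on port 9100.

Open IP:9100 to see the metrics node_exporter collects.

With that, node_exporter is set up.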
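
A process started with nohup does not come back after a reboot, so it may be worth giving node_exporter a systemd unit like the other components. The following is a sketch under assumptions: the helper name write_node_exporter_unit is invented, the binary path matches the extraction step above, and the demo writes to a scratch file rather than to the systemd directory.

```shell
#!/usr/bin/env bash
# write_node_exporter_unit DEST
# Writes a minimal unit file to DEST. The quoted 'EOF' keeps the shell from
# expanding anything inside the heredoc.
write_node_exporter_unit() {
    cat <<'EOF' > "$1"
[Unit]
Description=node_exporter
After=network.target

[Service]
ExecStart=/data/node_exporter/node_exporter --web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Demo: write to a scratch file and show it
unit=$(mktemp)
write_node_exporter_unit "$unit"
cat "$unit"
```

On a real host the destination would be /usr/lib/systemd/system/node_exporter.service, followed by systemctl daemon-reload and systemctl enable --now node_exporter.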

3、Grafana

Download the package and add a systemd unit file

Download page:

https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1&edition=oss

Download and extract:

[root@localhost]# wget https://dl.grafana.com/oss/release/grafana-9.2.3.linux-amd64.tar.gz
[root@localhost]# mkdir -p /data/grafana
[root@localhost]# tar -zxvf grafana-9.2.3.linux-amd64.tar.gz -C /data/grafana

# Register it as a system service
[root@localhost]# cat /usr/lib/systemd/system/grafana.service

[Unit]
Description=grafana
After=network.target
 
[Service]
ExecStart=/data/grafana/grafana-9.2.3/bin/grafana-server --config=/data/grafana/grafana-9.2.3/conf/defaults.ini  --homepath=/data/grafana/grafana-9.2.3
 
[Install]
WantedBy=multi-user.target

# Reload systemd / enable at boot / start / check status

[root@localhost]# systemctl daemon-reload
[root@localhost]# systemctl enable grafana
[root@localhost]# systemctl start grafana
[root@localhost]# systemctl status grafana

# Check that the service is running

[root@localhost]# lsof -i:3000
[root@localhost]# ps -ef | grep grafana

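Open in a browser:

http://127.0.0.1:3000

Default username and password: admin/admin

The first login asks you to change the password; click Skip to skip it.

Data source configuration: add the Prometheus server (IP:9090) as a data source.

Import a monitoring dashboard template; the following address is a good place to look.

https://grafana.com/grafana/dashboards/

For Linux server monitoring, dashboard ID 8919 or 10180 works. More templates can be found at the link above.

Take a look: simple and intuitive.

With that, Grafana is set up.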

4、Alertmanager

Alertmanager is also written in Go; download it, extract it, and it is ready to use.

[root@localhost]# wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
[root@localhost]# tar zvxf alertmanager-0.23.0.linux-amd64.tar.gz -C /data/
[root@localhost]# mv /data/alertmanager-0.23.0.linux-amd64 /data/alertmanager

[root@localhost]# cat /usr/lib/systemd/system/alertmanager.service 

[Unit]
Description=alertmanager server daemon
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
 
[Service]
ExecStart=/data/alertmanager/alertmanager --config.file=/data/alertmanager/alertmanager.yml --storage.path=/data/alertmanager/data
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
KillMode=process
Restart=on-failure
RestartSec=42s
 
[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable alertmanager.service

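Alertmanager options: a few important ones are listed here; for the full list run ./alertmanager -h

Option	Description
--config.file	Path to the alertmanager.yml configuration file
--web.listen-address	Address and port to listen on, default :9093
--data.retention	How long to keep data, default 120 hours

An Alertmanager configuration file usually consists of these main sections: global (global settings), templates (alert templates), route (alert routing), receivers, and inhibit_rules (inhibition rules).

This is the layout of alertmanager.yml; for more options see https://prometheus.io/docs/alerting/latest/configuration/#filepath

The extracted Alertmanager directory contains a default alertmanager.yml; an example configuration looks like this: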

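[root@localhost]# cat /data/alertmanager/alertmanager.yml
global:
  #time after which an alert is declared resolved if it is no longer reported
  resolve_timeout: 5m
  #mail sending settings
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'hxx@163.com'
  smtp_auth_username: 'hxx@163.com'
  smtp_auth_password: '<mailbox authorization code>'
  smtp_hello: '163.com'
  smtp_require_tls: false
#root route that every incoming alert enters; it defines how alerts are dispatched
route:
  #labels used to regroup incoming alerts; for example, many alerts carrying cluster=A and alertname=LatencyHigh are aggregated into a single group
  group_by: ['alertname','cluster']
  #after a new alert group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
  group_wait: 30s
  # after the first notification has been sent, wait 'group_interval' before sending a notification about new alerts in the group
  group_interval: 5m

  # if a notification was sent successfully, wait 'repeat_interval' before sending it again
  repeat_interval: 5m

  # default receiver: an alert that matches no child route is sent to it
  receiver: default

  # all of the attributes above are inherited by child routes and can be overridden on each of them
  routes:
  - receiver: email
    group_wait: 10s
    match:
      team: node
receivers:
- name: 'default'
  email_configs:
  - to: 'xxx@qq.com,xxx@163.com' #several recipient addresses
    send_resolved: true
- name: 'email'
  email_configs:
  - to: 'xxx@qq.com'
    send_resolved: true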

The modified /data/alertmanager/alertmanager.yml

[root@localhost]# cat /data/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m #resolve timeout, default 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'hxxx@163.com' #mailbox that sends the alerts
  smtp_auth_username: 'hxxx@163.com'  #mailbox that sends the alerts
  smtp_auth_password: '<mailbox authorization code>' #mailbox authorization code
  smtp_require_tls: false


templates:
  - './alertmanager/template/*.html'


inhibit_rules:
- source_match: #when an alert matching this fires, the target alerts are inhibited
    severity: 'critical'
  target_match: #alerts that get inhibited
    severity: 'warning'
# equal: the rule only applies when the listed labels have the same value on both alerts; several labels can be listed
  equal: ['alertname']

route:
  receiver: webhook1  # default receiver
  group_wait: 30s # wait time: after an alert fires, wait 30s so that alerts of the same group go out together
  group_interval: 5m # interval between notifications for a group
  repeat_interval: 2h  # interval before repeating a notification, to reduce how often the same message is sent
  group_by: ['alertname'] # label used for grouping
  routes: # child routes; they match on the team label set in the Prometheus rule file
  - receiver: webhook1
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    group_by: [team]
    matchers:
    - team = CPU

  - receiver: webhook1
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    group_by: [team]
    matchers:
    - team = MEM

receivers:  #the Feishu bot receiver is relayed through PrometheusAlert; details follow in that section
- name: webhook1
  webhook_configs:
  - url: 'http://xx.xx.xx.xxx:8800/prometheusalert?type=fs&tpl=prometheus-fsv2&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/e90f7203-7f3c-4248-87f3-f94cdd8d9s5a'

Alertmanager can also check its configuration file: the installation directory contains a tool called "amtool".

[root@localhost]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 1 templates
  SUCCESS

Before starting Alertmanager, alert rules have to be configured; the relevant part of /data/prometheus/prometheus.yml is:

[root@localhost]# cat /data/prometheus/prometheus.yml

# Alertmanager settings
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - xxx.xxx.xxx.xxx:9093 # Alertmanager IP and port
# Rule file paths
rule_files:
  - /data/prometheus/rules/*.yml  # rule path

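This is the rule file /data/prometheus/rules/ecs.yml; any number of yml files with distinct names can be created under the rules path to define alert rules. The resulting alerts show up in the PrometheusAlert section below.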

[root@localhost]# cat /data/prometheus/rules/ecs.yml 
groups:
  - name: ecs.rules # rule group name
    rules:
    - alert: MEM-UP #alert name
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 #condition
      for: 1m #the condition must hold for 1 minute before the alert fires
      labels: #labels attached to the alert
        severity: warning
        team: MEM
      annotations: #extra information about the alert; not used to identify it
        description: "{{$labels.instance}}\n Memory available: {{ $value }}%" #description
        summary: "Host {{ $labels.instance }} is low on memory"

    - alert: CPU-UP
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 1m
      labels:
        severity: warning
        team: CPU
      annotations:
        summary: "Host {{ $labels.instance }} CPU usage above 80%"
        description: "CPU usage > 80%\n Current usage = {{ $value }}\n"  #LABELS = {{ $labels }}"
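
To get a feel for the MEM-UP condition, the same arithmetic can be run by hand. This is a rough sketch with made-up function names and sample byte counts; it only mirrors the expression (available / total * 100 < 10) and does not model the for: 1m hold time.

```shell
#!/usr/bin/env bash
# mem_available_pct AVAILABLE_BYTES TOTAL_BYTES
# Same arithmetic as the MEM-UP expression: available / total * 100.
mem_available_pct() {
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", a / t * 100 }'
}

# mem_alert_pending AVAILABLE_BYTES TOTAL_BYTES
# Exits 0 when the value is below the 10% threshold of the rule, 1 otherwise.
mem_alert_pending() {
    awk -v a="$1" -v t="$2" 'BEGIN { exit !(a / t * 100 < 10) }'
}

# Sample values: 512 MiB available out of 8 GiB
mem_available_pct 536870912 8589934592    # prints 6.25
```

With those sample values the condition holds (6.25 is below 10), so the alert would become pending and fire once it has stayed that way for a minute.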

Log in to Prometheus at xxx.xxx.xxx.xxx:9090

or to Alertmanager at xxx.xxx.xxx.xxx:9093 to check that it started and that the rules were picked up.

The alert then arrives by email.
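
Rather than waiting for a host to actually run low on memory, a hand-made alert can be pushed to Alertmanager to exercise routing and delivery. The following is a sketch under assumptions: test_alert_payload is an invented helper, the label values are placeholders, and the commented curl call assumes Alertmanager listens on localhost:9093 and exposes its v2 API.

```shell
#!/usr/bin/env bash
# test_alert_payload ALERTNAME TEAM INSTANCE
# Prints a one-element JSON array of alerts, the body expected by
# POST /api/v2/alerts. Label values are placeholders.
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","team":"%s","severity":"warning","instance":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2" "$3"
}

test_alert_payload MEM-UP MEM demo-host:9100

# To actually send it:
# test_alert_payload MEM-UP MEM demo-host:9100 |
#   curl -sS -X POST -H 'Content-Type: application/json' \
#        --data @- http://localhost:9093/api/v2/alerts
```

Because the team label is set, the alert follows the same child route a real MEM alert would take.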

To send alerts to WeCom, DingTalk, Feishu and so on, read on.

With that, Alertmanager is set up.

5、PrometheusAlert

[root@localhost]# wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.7/linux.zip
[root@localhost]# unzip linux.zip 
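[root@localhost]# mkdir -p /data/PrometheusAlert && mv ./linux/* /data/PrometheusAlert && chmod +x /data/PrometheusAlert/PrometheusAlert

#Run PrometheusAlert
[root@localhost]#  ./PrometheusAlert #to run it in the background: nohup ./PrometheusAlert &

#Once started, open this address in a browser: http://127.0.0.1:8080

#The default login account and password are set in app.conf

conf/app.conf is the default configuration file; the listening port, username, password and so on can be changed there.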

[root@localhost]# cat /data/PrometheusAlert/conf/app.conf
#---------------------↓Global settings-----------------------
appname = PrometheusAlert
#login username
login_user=prometheusalert
#login password
login_password=prometheusalert
#listen address
httpaddr ="0.0.0.0"
#listen port
httpport =8080
runmode = dev
#proxy setting, e.g. proxy = http://123.123.123.123:8080
proxy =
#enable JSON request bodies
copyrequestbody =true
#alert message title
title=PrometheusAlert
#link to the alerting platform
GraylogAlerturl=http://graylog.org
#DingTalk alerts: logo URL for firing alerts
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#DingTalk alerts: logo URL for resolved alerts
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#SMS alert level (an SMS is sent when the level equals 3). Levels: 0 info, 1 warning, 2 major, 3 severe, 4 disaster
messagelevel=3
#phone call alert level (a voice call is made when the level equals 4). Levels: 0 info, 1 warning, 2 major, 3 severe, 4 disaster
phonecalllevel=4
#default phone number (required for testing SMS and phone calls from the web page)
defaultphone=xxxxxxxx
#phone notification on recovery: 0 off, 1 on
phonecallresolved=0
#log output: file or console
logtype=file
#log file path
logpath=logs/prometheusalertcenter.log
#convert the time zone of Prometheus and Graylog alert messages to CST (leave off if they are already in CST)
prometheus_cst_time=0
#database driver: sqlite3, mysql or postgres; for mysql or postgres, uncomment db_host, db_port, db_user, db_password and db_name
db_driver=sqlite3
#db_host=127.0.0.1
#db_port=3306
#db_user=root
#db_password=root
#db_name=prometheusalert
#alert recording: 0 off, 1 on
AlertRecord=0
#scheduled deletion of alert records: 0 off, 1 on
RecordLive=0
#retention period for alert records, in days
RecordLiveDay=7
# write alert records to es7: 0 off, 1 on
alert_to_es=0
# es addresses, a []string
# beego.Appconfig.Strings reads the value as []string; separate entries with ";" rather than ","
to_es_url=http://localhost:9200
# to_es_url=http://es1:9200;http://es2:9200;http://es3:9200
# es username and password
# to_es_user=username
# to_es_pwd=password

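#---------------------↓webhook-----------------------
#enable the DingTalk alert channel (several channels can be enabled at once): 0 off, 1 on
open-dingding=1
#default DingTalk bot address
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
#@ everyone: 0 off, 1 on
dd_isatall=1

#enable the WeChat alert channel (several channels can be enabled at once): 0 off, 1 on
open-weixin=1
#default WeCom bot address
wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx

#enable the Feishu alert channel (several channels can be enabled at once): 0 off, 1 on
open-feishu=0
#default Feishu bot address
fsurl=https://open.feishu.cn/open-apis/bot/hook/xxxxxxxxx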

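#---------------------↓Tencent Cloud API-----------------------
#enable the Tencent Cloud SMS alert channel (several channels can be enabled at once): 0 off, 1 on
open-txdx=0
#Tencent Cloud SMS API key
TXY_DX_appkey=xxxxx
#Tencent Cloud SMS template ID; the template content can follow this pattern: prometheus alert:{1}
TXY_DX_tpl_id=xxxxx
#Tencent Cloud SMS sdk app id
TXY_DX_sdkappid=xxxxx
#Tencent Cloud SMS signature; use the signature that was approved for your account
TXY_DX_sign=腾讯云

#enable the Tencent Cloud phone alert channel (several channels can be enabled at once): 0 off, 1 on
open-txdh=0
#Tencent Cloud phone API key
TXY_DH_phonecallappkey=xxxxx
#Tencent Cloud phone template ID
TXY_DH_phonecalltpl_id=xxxxx
#Tencent Cloud phone sdk app id
TXY_DH_phonecallsdkappid=xxxxx

#---------------------↓Huawei Cloud API-----------------------
#enable the Huawei Cloud SMS alert channel (several channels can be enabled at once): 0 off, 1 on
open-hwdx=0
#Huawei Cloud SMS API key
HWY_DX_APP_Key=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud SMS API secret
HWY_DX_APP_Secret=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud APP access address (endpoint with port)
HWY_DX_APP_Url=https://rtcsms.cn-north-1.myhuaweicloud.com:10743
#Huawei Cloud SMS template ID
HWY_DX_Templateid=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud signature name; it must be an approved signature whose type matches the template, so fill in your own
HWY_DX_Signature=华为云
#Huawei Cloud signature channel number
HWY_DX_Sender=xxxxxxxxxx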

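#---------------------↓Alibaba Cloud API-----------------------
#enable the Alibaba Cloud SMS alert channel (several channels can be enabled at once): 0 off, 1 on
open-alydx=0
#AccessKey ID of the Alibaba Cloud SMS main account
ALY_DX_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
#Alibaba Cloud SMS API secret
ALY_DX_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
#Alibaba Cloud SMS signature name
ALY_DX_SignName=阿里云
#Alibaba Cloud SMS template ID
ALY_DX_Template=xxxxxxxxxxxxxxxxxxxxxx

#enable the Alibaba Cloud phone alert channel (several channels can be enabled at once): 0 off, 1 on
open-alydh=0
#AccessKey ID of the Alibaba Cloud phone main account
ALY_DH_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
#Alibaba Cloud phone API secret
ALY_DH_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
#caller number shown to the callee; it must be a number you have purchased
ALY_DX_CalledShowNumber=xxxxxxxxx
#Alibaba Cloud text-to-speech (TTS) template ID
ALY_DH_TtsCode=xxxxxxxx

#---------------------↓Ronglian Cloud API-----------------------
#enable the Ronglian Cloud phone alert channel (several channels can be enabled at once): 0 off, 1 on
open-rlydh=0
#Ronglian Cloud base API address
RLY_URL=https://app.cloopen.com:8883/2013-12-26/Accounts/
#Ronglian Cloud console SID
RLY_ACCOUNT_SID=xxxxxxxxxxx
#Ronglian Cloud api-token
RLY_ACCOUNT_TOKEN=xxxxxxxxxx
#Ronglian Cloud app_id
RLY_APP_ID=xxxxxxxxxxxxx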

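#---------------------↓Email settings-----------------------
#enable email
open-email=0
#outgoing mail server address
Email_host=smtp.qq.com
#outgoing mail server port
Email_port=465
#mail account
Email_user=xxxxxxx@qq.com
#mail password
Email_password=xxxxxx
#mail subject
Email_title=Ops alert
#default recipients
Default_emails=xxxxx@qq.com,xxxxx@qq.com

#---------------------↓7moor Cloud API-----------------------
#enable the 7moor SMS alert channel (several channels can be enabled at once): 0 off, 1 on
open-7moordx=0
#7moor account ID
7MOOR_ACCOUNT_ID=Nxxx
#7moor account APISecret
7MOOR_ACCOUNT_APISECRET=xxx
#7moor SMS template number
7MOOR_DX_TEMPLATENUM=n
#note: only one 7moor SMS variable, var1, is used; it is hard-coded.
#-----------
#enable the 7moor webcall voice alert channel (several channels can be enabled at once): 0 off, 1 on
open-7moordh=0
#add a virtual service number and a text node on the 7moor platform first
#virtual service number of the 7moor webcall account
7MOOR_WEBCALL_SERVICENO=xxx
# variable replaced in the text node; text is used here. Change this setting if your variable has a different name
7MOOR_WEBCALL_VOICE_VAR=text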

#---------------------↓telegram API-----------------------
#enable the telegram alert channel (several channels can be enabled at once): 0 off, 1 on
open-tg=0
#tg bot token
TG_TOKEN=xxxxx
#tg message mode, personal or channel message: 0 off (push to a person), 1 on (push to a channel)
TG_MODE_CHAN=0
#tg user ID
TG_USERID=xxxxx
#tg channel name or id; a channel name must start with @
TG_CHANNAME=xxxxx
#tg api address; can be set to a proxy address
#TG_API_PROXY="https://api.telegram.org/bot%s/%s"

#---------------------↓workwechat API-----------------------
#enable the workwechat alert channel (several channels can be enabled at once): 0 off, 1 on
open-workwechat=0
# corp ID
WorkWechat_CropID=xxxxx
# agent (application) ID
WorkWechat_AgentID=xxxx
# agent (application) secret
WorkWechat_AgentSecret=xxxx
# recipient users
WorkWechat_ToUser="zhangsan|lisi"
# recipient departments
WorkWechat_ToParty="ops|dev"
# recipient tags
WorkWechat_ToTag=""
# message type; only markdown is supported for now
# WorkWechat_Msgtype = "markdown"

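#---------------------↓Baidu Cloud API-----------------------
#enable the Baidu Cloud SMS alert channel (several channels can be enabled at once): 0 off, 1 on
open-baidudx=0
#Baidu Cloud SMS API AK (ACCESS_KEY_ID)
BDY_DX_AK=xxxxx
#Baidu Cloud SMS API SK (SECRET_ACCESS_KEY)
BDY_DX_SK=xxxxx
#Baidu Cloud SMS ENDPOINT (must be the domain of the chosen region; for a service in the Beijing region it is the value below)
BDY_DX_ENDPOINT=http://smsv3.bj.baidubce.com
#Baidu Cloud SMS template ID; use the template approved for your account (the template takes one parameter, code, e.g. prometheus alert:{code})
BDY_DX_TEMPLATE_ID=xxxxx
#Baidu Cloud SMS signature ID; use the signature approved for your account
TXY_DX_SIGNATURE_ID=xxxxx

#---------------------↓Baidu Hi (Ruliu)-----------------------
#enable the Baidu Hi (Ruliu) alert channel (several channels can be enabled at once): 0 off, 1 on
open-ruliu=0
#default Baidu Hi (Ruliu) bot address
BDRL_URL=https://api.im.baidu.com/api/msg/groupmsgsend?access_token=xxxxxxxxxxxxxx
#Baidu Hi (Ruliu) group ID
BDRL_ID=123456
#---------------------↓bark API-----------------------
#enable the bark alert channel (several channels can be enabled at once): 0 off, 1 on
open-bark=0
#default bark address; deploying your own bark-server is recommended
BARK_URL=https://api.day.app
#bark key; several keys can be given, separated by a delimiter
BARK_KEYS=xxxxx
# copy; recommended on
BARK_COPY=1
# keep history; recommended on
BARK_ARCHIVE=1
# message group
BARK_GROUP=PrometheusAlert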

#---------------------↓Voice broadcast-----------------------
#voice broadcast only works together with the voice broadcast plugin
#enable the voice broadcast channel: 0 off, 1 on
open-voice=1
VOICE_IP=127.0.0.1
VOICE_PORT=9999

#---------------------↓Feishu bot app-----------------------
#enable the feishuapp alert channel (several channels can be enabled at once): 0 off, 1 on
open-feishuapp=1
# APPID
FEISHU_APPID=cli_xxxxxxxxxxxxx
# APPSECRET
FEISHU_APPSECRET=xxxxxxxxxxxxxxxxxxxxxx
# accepts a Feishu user open_id, user_id, union_ids, or a department open_department_id
AT_USER_ID="xxxxxxxx"

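A Feishu bot is used for the test here, so the Feishu settings need to be enabled. Everything else can stay at its default.

#listen port (8080 was already in use, so another port was picked; choose whatever suits you)
httpport =8800

#enable the Feishu alert channel (several channels can be enabled at once): 0 off, 1 on; changed from 0 to 1 here
open-feishu=1
#default Feishu bot address
fsurl=https://open.feishu.cn/open-apis/bot/hook/xxxxxxxxx

Start the service
nohup ./PrometheusAlert &

With that, PrometheusAlert is configured.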

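One pitfall worth mentioning: pay attention to the Feishu bot version. If the bot is v2, pick the v2 template in PrometheusAlert, otherwise the Feishu bot will not send any messages.

After creating the bot, copy its webhook address and add it to /data/alertmanager/alertmanager.yml. Remember to restart the Alertmanager service afterwards.

[root@localhost]# cat /data/alertmanager/alertmanager.yml
receivers:  #the Feishu bot receiver is relayed through PrometheusAlert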
- name: webhook1
  webhook_configs:
  - url: 'http://xx.xx.xx.xxx:8800/prometheusalert?type=fs&tpl=prometheus-fsv2&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/e90f7203-7f3c-4248-87f3-f94cdd8d9s5a'

The Feishu bot then delivers the alerts; the rules behind them are the ones defined in /data/prometheus/rules/ecs.yml.
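
The receiver URL is long enough that a typo is easy to make, so here is a small sketch that assembles it from its parts. The helper name prometheusalert_webhook_url and the host, port and hook values are placeholders chosen for illustration; the query parameters (type, tpl, fsurl) are the ones used in the webhook_configs entry.

```shell
#!/usr/bin/env bash
# prometheusalert_webhook_url HOST_PORT TEMPLATE FEISHU_HOOK_URL
# Assembles the receiver URL in the same form as the webhook_configs entry:
# type=fs selects Feishu, tpl names the PrometheusAlert template,
# fsurl is the webhook address of the Feishu bot.
prometheusalert_webhook_url() {
    printf 'http://%s/prometheusalert?type=fs&tpl=%s&fsurl=%s\n' "$1" "$2" "$3"
}

# Placeholder values; replace them with your own host and bot hook.
prometheusalert_webhook_url 127.0.0.1:8800 prometheus-fsv2 \
    https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx
```

Note how the v2 hook path and the prometheus-fsv2 template go together, which is the version pitfall described earlier.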

With that, PrometheusAlert is done!

6、snmp_exporter

SNMP Exporter is the open-source Prometheus exporter for collecting metrics over SNMP. To deploy it from the binary release, download it here: https://github.com/prometheus/snmp_exporter/releases

https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.windows-amd64.tar.gz

Deploying the snmp_exporter binary on Linux

#Extract the archive

[root@localhost]# cd /data/
[root@localhost]# tar -zxvf snmp_exporter-0.20.0.linux-amd64.tar.gz
[root@localhost]# mv snmp_exporter-0.20.0.linux-amd64 snmp_exporter

Add a systemd unit file

[root@localhost]# cat /usr/lib/systemd/system/snmp_exporter.service

[Unit]
Description=snmp_exporter
After=network.target
[Service]
ExecStart=/data/snmp_exporter/snmp_exporter --config.file=/data/snmp_exporter/snmp.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target

#Start the service

[root@localhost]# systemctl daemon-reload
[root@localhost]# systemctl start snmp_exporter
[root@localhost]# systemctl enable snmp_exporter

The following is the relevant part of /data/prometheus/prometheus.yml

[root@localhost]# cat /data/prometheus/prometheus.yml
  - job_name: "snmp-list"
    scrape_timeout: 30s
    scrape_interval: 30s
    file_sd_configs:
    - files:
      - '/data/prometheus/monitor/snmp.yml'
      refresh_interval: 15s
    metrics_path: /snmp
    params:
       module: [if_mib]    #module defined in generator.yml or in the snmp_exporter configuration file
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: xxx.xxx.xxx.xxx:9116  # IP address and port of the snmp_exporter service

If an SNMP scrape fails with GET "http://127.0.0.1:9116/snmp?module=if_mib&target=127.0.0.1": context deadline exceeded,

the scrape duration probably exceeded the timeout; raising the scrape_timeout: 30s parameter fixes it.
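
When chasing such a timeout it helps to request the probe URL by hand and see how long the device really takes. The sketch below rebuilds the URL that Prometheus ends up requesting after the relabel_configs above; snmp_probe_url is an invented helper name, and the commented timing command assumes snmp_exporter is running locally on port 9116.

```shell
#!/usr/bin/env bash
# snmp_probe_url EXPORTER MODULE TARGET
# Builds the URL Prometheus requests after relabeling: __address__ becomes
# the exporter, and the original target address is passed as ?target=.
snmp_probe_url() {
    printf 'http://%s/snmp?module=%s&target=%s\n' "$1" "$2" "$3"
}

snmp_probe_url 127.0.0.1:9116 if_mib 127.0.0.1

# To time a real probe by hand:
# time curl -s -o /dev/null "$(snmp_probe_url 127.0.0.1:9116 if_mib 127.0.0.1)"
```

If the measured time is close to or above scrape_timeout, the timeout (and scrape_interval, which must not be smaller) needs to go up.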

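After configuring Prometheus, restart the Prometheus service and the monitoring data of the Cisco switch becomes available.

7、cAdvisor

cAdvisor can be used to monitor Docker containers. Its default port, 8080, was already in use here, so a different host port is mapped.

[root@localhost]# docker run -v /:/rootfs:ro -v /var/run:/var/run/:rw -v /sys:/sys:ro -v /var/lib/docker:/var/lib/docker:ro -v /dev/disk/:/dev/disk:ro -d -p 18080:8080  --restart=always --name=cadvisor  google/cadvisor

Afterwards, run docker ps -a to check that the container started.

Then reference it in prometheus.yml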

[root@localhost]# cat /data/prometheus/prometheus.yml
****
  - job_name: "Docker-Containers"
    file_sd_configs:
    - files:
      - '/data/prometheus/monitor/docker_containers.yml'
      refresh_interval: 15s


[root@localhost]# cat /data/prometheus/monitor/docker_containers.yml
- targets: ["docker-dev-01.os.bnqoa.com:18080","docker-dev-02.os.bnqoa.com:18080","docker-dev-03.os.bnqoa.com:18080"]

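After adding nodes, check them at this address.

http://xxx.xxx.xxx.xxx:9090/targets

Dashboard templates can be found by searching for docker at this address

https://grafana.com/grafana/dashboards/

Comments and discussion are welcome~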
