Prometheus Installation and Deployment

huaishu

Last modified: 2022-12-13 12:12:00


Tags: prometheus, mongodb, database

First published: 2022-12-13 11:59:36

Article link: https://blog.csdn.net/huaishu/article/details/128296964


Software versions
prometheus-2.33 (Prometheus download page)
grafana-8.3.6 (Grafana Labs download page)
node_exporter-1.3.1 (prometheus/node_exporter releases on GitHub)

blackbox_exporter-0.20.0

alertmanager-0.24.0

cadvisor
mongodb_exporter: https://github.com/percona/mongodb_exporter

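Metric types:

Counter: a cumulative value that only increases (it restarts from zero when the process restarts); do not use a counter for a value that can decrease.

Gauge: a single numeric value that can go up and down arbitrarily.

Summary: reports the results of sampling observations over a period of time; quantiles are computed on the client side.

Histogram: counts observations into buckets defined by upper bounds, and also exposes the sum of the observed values and the total number of observations; quantiles are computed on the server side.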
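To make the Histogram type more concrete, here is a minimal sketch in plain Python. It keeps cumulative bucket counts, a sum, and a count, and then estimates a quantile by linear interpolation inside a bucket, which is roughly how a server-side quantile estimate is derived from bucket data. The class and function names are made up for this illustration and do not belong to any Prometheus client library. The sketch also assumes positive bucket bounds and at least one finite bucket.

```python
import math


class Histogram:
    """Toy histogram: cumulative bucket counts plus sum and count."""

    def __init__(self, upper_bounds):
        # There is always an implicit +Inf bucket that catches every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.cumulative = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Buckets are cumulative: an observation is counted in every bucket
        # whose upper bound is greater than or equal to the value.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.cumulative[i] += 1
        self.sum += value
        self.count += 1


def estimate_quantile(q, hist):
    """Estimate the q-quantile (0 < q <= 1) from the bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if hist.count == 0:
        return math.nan
    rank = q * hist.count
    # First bucket whose cumulative count reaches the rank.
    i = next(i for i, c in enumerate(hist.cumulative) if c >= rank)
    if math.isinf(hist.bounds[i]):
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return hist.bounds[-2]
    lower = hist.bounds[i - 1] if i > 0 else 0.0
    below = hist.cumulative[i - 1] if i > 0 else 0
    in_bucket = hist.cumulative[i] - below
    # Linear interpolation between the bucket's lower and upper bound.
    return lower + (hist.bounds[i] - lower) * (rank - below) / in_bucket


h = Histogram([0.1, 0.5, 1.0])
for v in (0.05, 0.2, 0.3, 0.7):
    h.observe(v)
print(h.cumulative, h.sum, h.count)
print(estimate_quantile(0.5, h))
```

The estimate can only be as precise as the bucket layout allows, which is why bucket bounds are usually chosen around the values that matter for alerting.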

Prometheus server

Configuring prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 192.168.21.120:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  - "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'linux'
    static_configs:
      - targets: ['localhost:9222','192.168.21.11:9222']
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:8080','192.168.21.11:8080']
  - job_name: 'mongo_exp'
    static_configs:
      - targets: ['192.168.21.11:9223']
        labels:
          unitname: "Mongodb_exporter"
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['192.168.21.120:3000']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115 # host and port where blackbox_exporter is running

  - job_name: 'port_status_gyds'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: 
        - 192.168.21.11:9110
        - 192.168.21.11:9210
        - 192.168.21.11:8090
        - 192.168.21.11:9100
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.21.11:9115 # host and port where blackbox_exporter is running
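The three relabel rules in the blackbox jobs can be hard to read at first. The following sketch simulates them in plain Python for the `port_status` job. It is a simplification that assumes only the default `replace` action, the default `(.*)` regex, and the default `;` separator. The function name `apply_relabel` is hypothetical and is not Prometheus code.

```python
def apply_relabel(labels, rules):
    """Apply 'replace' relabel rules (the default action) to a label set."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' (the default separator).
        source = ";".join(
            labels.get(name, "") for name in rule.get("source_labels", [])
        )
        # With the default regex (.*) the whole source value is captured as $1,
        # and the default replacement is "$1".
        replacement = rule.get("replacement", "$1")
        labels[rule["target_label"]] = replacement.replace("$1", source)
    return labels


# The relabel_configs of the 'port_status' job, written as Python dicts.
RULES = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9115"},
]

target = {"__address__": "192.168.21.120:3000", "job": "port_status"}
result = apply_relabel(target, RULES)
print(result)
```

After relabeling, the probed address lives in `__param_target` (sent as the `target` URL parameter) and in `instance`, while `__address__` points at blackbox_exporter. Prometheus therefore scrapes the exporter's `/probe` endpoint instead of the probed port itself.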

Alerting rules

Scrape target is down

groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} has been down for more than 1 minute"

node_exporter alerting rules

groups:
- name: test
  rules:
  - alert: HighMemoryUsage
    expr: 100-(node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 80
    for: 1m  # how long the condition must hold before the alert is sent to Alertmanager
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{$labels.job}}: memory usage is above 80%, current value [{{ $value }}]."

  - alert: HighCpuUsage
    expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance, job)*100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      # Put as much detail as possible into the summary annotation, because the
      # SMS/email/DingTalk notification text is built from the summary value.
      summary: "Instance {{ $labels.instance }} CPU usage is too high"
      description: "{{ $labels.instance }} of job {{$labels.job}}: CPU usage is above 80%, current value [{{ $value }}]."
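The arithmetic behind the two node_exporter expressions can be checked by hand. The sketch below reproduces it in plain Python with made-up sample values. The function names are hypothetical, and the CPU helper assumes the per-core idle rates (the result of `irate(...)`, each between 0 and 1) have already been computed.

```python
def memory_used_percent(total, free, buffers, cached):
    """Same arithmetic as 100 - (Buffers + Cached + MemFree) / MemTotal * 100."""
    return 100 - (buffers + cached + free) / total * 100


def cpu_used_percent(idle_rates):
    """Same arithmetic as 100 - avg(idle rate per core) * 100."""
    return 100 - sum(idle_rates) / len(idle_rates) * 100


# Made-up values in bytes: 8000 total, 1600 free/buffered/cached.
mem = memory_used_percent(total=8000, free=1000, buffers=200, cached=400)
# Two cores that were idle 90% and 70% of the time.
cpu = cpu_used_percent([0.9, 0.7])
print(mem, cpu)
```

Comparing these values with the threshold in `expr` shows whether a rule would enter the pending state. The alert only fires once the condition has held for the whole `for` duration.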

blackbox_exporter alerting rules

groups:
- name: site-status-alerts
  rules:
  - alert: docker_port  # alertname shown in notifications
    expr: probe_success == 0
    for: 1h
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}} is unreachable"
      description: "{{$labels.instance}} is unreachable"

Deploying Prometheus

cd /home/suer/prometheus/prometheus-2.33
nohup ./prometheus     --config.file=./prometheus.yml  --log.level=debug  --log.format=logfmt  --web.enable-lifecycle &

Configuring the systemd unit for start on boot

vi /usr/lib/systemd/system/prometheus.service

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/

[Service]
ExecStart=/home/suer/prometheus/prometheus-2.33/prometheus --config.file=/home/suer/prometheus/prometheus-2.33/prometheus.yml --log.level=debug  --log.format=logfmt  --web.enable-lifecycle  --web.listen-address=:9090 

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

# Hot reload of the configuration (requires --web.enable-lifecycle)
curl -XPOST http://192.168.21.120:9090/-/reload
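The same reload can be triggered from a script. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes Prometheus was started with `--web.enable-lifecycle` as in the commands above.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(base_url, timeout=10):
    """Send the reload request; urlopen raises an error on a non-2xx response."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status == 200


req = build_reload_request("http://192.168.21.120:9090/")
print(req.get_method(), req.full_url)
```

Calling `reload_prometheus("http://192.168.21.120:9090")` sends the request. If the new configuration is invalid, Prometheus keeps running with the old one and answers with an error status.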

node_exporter collector

Deployment

cd /home/suer/prometheus/node_exporter
nohup ./node_exporter --web.listen-address=":9222"  --log.level="info"   --log.format="logfmt"   &

Configuring the systemd unit for start on boot

vim /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/home/suer/prometheus/node_exporter/node_exporter --web.listen-address=":9222"  --log.level="info"   --log.format="logfmt" 
[Install]
WantedBy=multi-user.target

systemctl enable node_exporter.service
systemctl start node_exporter.service

blackbox_exporter collector

Deployment

cd /home/suer/prometheus/blackbox_exporter-0.20.0
nohup ./blackbox_exporter  --config.file=./blackbox.yml --web.listen-address=":9115"  --log.level=debug > ./blackbox.out 2>&1 &

mongodb_exporter collector

Deployment

nohup ./mongodb_exporter --mongodb.uri='mongodb://admin:654321@192.168.21.11:27017/?authSource=admin' --compatible-mode  --discovering-mode   --web.listen-address=":9223"  --log.level=debug > ./mongodb_exporter.out 2>&1 &

cAdvisor collector

Collects resource usage and performance information for running containers.

Deploying with Docker

docker run -d \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest

Download the binary: https://github.com/google/cadvisor/releases/latest
Run locally: ./cadvisor  -port=8080 &>>/var/log/cadvisor.log

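snmp_exporter collector

Huawei switches: "snmp_exporter监控华为网络设备" (Monitoring Huawei network devices with snmp_exporter), Jianshu

DELL servers: "Prometheus 实现监控Dell服务器相关硬件指标" (Monitoring Dell server hardware metrics with Prometheus), 屌丝的IT, cnblogs

snmp.yml MIB configuration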

huawei_mib:
    walk: 
      - sysUpTime
      - interfaces
      - ifXTable
      - sysDescr
      - sysName
      - 1.3.6.1.2.1.31.1.1.1.1
                  ***
    version: 2
    auth:
      community: public_read
    lookups:
      - source_indexes: [ifIndex]
        lookup: ifAlias
      - source_indexes: [ifIndex]
        # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
        lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
      - source_indexes: [ifIndex]
        # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
        lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
    overrides:
      ifAlias:
        ignore: true # Lookup metric
      ifDescr:
        ignore: true # Lookup metric
      ifName:
        ignore: true # Lookup metric
      ifType:
        type: EnumAsInfo

Prometheus configuration

  - job_name: 'snmp_dell'
    scrape_interval: 1m  # scrape interval; Prometheus rejects a scrape_timeout longer than the interval
    scrape_timeout: 1m   # snmp_exporter can be slow to return data, so allow a longer timeout
    static_configs:
      - targets:
        - 10.1.0.1  # switch IP address
    metrics_path: /snmp
    params:
      module: [huawei_mib]  # module name defined in the custom generator.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.21.120:9116 # address of the snmp_exporter service

Alertmanager alerting service

Configuring alertmanager.yml

global: # Global settings: resolve timeout, SMTP settings, API URLs of the notification channels, and so on.
  resolve_timeout: 5m
  smtp_from: 'xxxxxxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxxxxxx@qq.com'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route: # Alert routing policy: a tree that is matched depth-first, from left to right.
  group_by: ['alertname'] # label(s) used to group alerts
  group_wait: 10s # wait 10s after the first alert of a new group so that alerts of the same group are sent together
  group_interval: 5s # wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 5m # wait before repeating a notification that is still firing; reduces duplicate emails
  receiver: 'email' # default receiver
  routes:   # child routes: which receiver handles which alerts
  - receiver: 'default-receiver'
    continue: true
    group_wait: 10s
  - receiver: 'fping-receiver'
    group_wait: 10s
    match_re:  # match on labels: alerts with dest=szjf go to fping-receiver
      dest: szjf
receivers: # Receiver definitions, for example email, wechat, slack, or webhook notifications.
- name: 'default-receiver'
  email_configs:
  - to: 'xxxxxxxx@qq.com'
- name: "fping-receiver"
  webhook_configs:
  - url: 'http://127.0.0.1:9095/dingtalk'
    send_resolved: true
- name: 'email'
  #webhook_configs
  email_configs:
  - to: 'xxxxxxxx@qq.com'
    send_resolved: true
inhibit_rules: # Inhibition: while an alert matching the source matchers is firing, alerts matching the target matchers are muted.
  - source_match: # alerts that suppress other alerts while they are firing
      severity: 'critical' # severity of the suppressing alert
    target_match:
      severity: 'warning' # severity of the alerts to be suppressed
    equal: ['alertname', 'dev', 'instance'] # inhibition applies only if source and target alert have equal values for all of these labels
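The inhibition rule can be read as a small predicate over label sets. The sketch below is a simplified model in plain Python and is not Alertmanager's implementation. The function names are hypothetical, only exact-match matchers are handled, and a missing label is treated like an empty value when comparing the `equal` labels.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the required value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert mutes the target alert under the rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label is compared as an empty string.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "HostProblem", "instance": "192.168.21.11:9222", "severity": "critical"}
warning_same = {"alertname": "HostProblem", "instance": "192.168.21.11:9222", "severity": "warning"}
warning_other = {"alertname": "HostProblem", "instance": "localhost:9222", "severity": "warning"}

print(is_inhibited(warning_same, [critical], RULE))
print(is_inhibited(warning_other, [critical], RULE))
```

Because `alertname` is part of `equal`, a critical alert only mutes warnings that carry the same alert name. The alert name `HostProblem` above is invented for the example.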

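Silences: a simple mechanism for muting alerts for a given period of time.

QQ Mail configuration: a third-party login password (authorization code) must be requested and used as the SMTP password.

Installation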

cd /home/suer/prometheus/alertmanager
nohup ./alertmanager --config.file=alertmanager.yml &

Grafana

cd /home/suer/prometheus/grafana-8.3.6/bin
nohup ./grafana-server &

http://192.168.21.120:3000/
admin 123456

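Self-healing

Webhook (Python: FastAPI, uvicorn)

# Start script (uvicorn expects a module:attribute import string, so run it from the directory that contains alert.py)
cd /home/suer && uvicorn alert:app --reload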

ansible

ansible docker_node -m shell -a 'docker restart cadvisor'
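The article does not show the webhook code, so the following is only a sketch of the decision logic such a service might contain, written with the standard library so it runs without FastAPI. It relies on the JSON body that Alertmanager posts to a webhook receiver, which contains an `alerts` list whose entries carry `status` and `labels`. The mapping from the `docker_port` alert to the ansible command is an assumption made for this example, and the function only returns the commands instead of executing them.

```python
import json

# Hypothetical mapping from alert name to a remediation command (argv form).
REMEDIATIONS = {
    "docker_port": ["ansible", "docker_node", "-m", "shell", "-a", "docker restart cadvisor"],
}


def plan_remediation(payload):
    """Return the commands to run for the firing alerts in a webhook payload."""
    commands = []
    for alert in payload.get("alerts", []):
        # Resolved alerts need no action.
        if alert.get("status") != "firing":
            continue
        name = alert.get("labels", {}).get("alertname")
        if name in REMEDIATIONS:
            commands.append(REMEDIATIONS[name])
    return commands


# Trimmed-down example of a webhook body with one firing and one resolved alert.
body = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",   "labels": {"alertname": "docker_port", "instance": "192.168.21.11:8090"}},
    {"status": "resolved", "labels": {"alertname": "docker_port", "instance": "192.168.21.11:9100"}}
  ]
}
""")
commands = plan_remediation(body)
print(commands)
```

In a FastAPI application, a POST handler would pass the parsed request body to a function like this and then hand each command to `subprocess.run`. Keeping the planning step free of side effects makes it easy to test before anything restarts containers automatically.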

Opening firewall ports

sudo firewall-cmd --zone=public --add-port=9093/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --zone=public --list-ports

Grafana dashboards

node_exporter dashboard: 8919
blackbox_exporter dashboard: 9965
cadvisor dashboard: 193
Kong for prometheus: 7424
mongodb: 14997
mysql: 14057
Test environment:
Grafana:
http://192.168.21.120:3000/
admin 123456
Alerts:
http://192.168.21.120:9093/#/alerts
Prometheus:
http://192.168.21.120:9090/
