Official site: https://prometheus.io/
Download and Installation
Download page: https://prometheus.io/download/
- Download prometheus, alertmanager, node_exporter, and mysqld_exporter
Prometheus Server
Extract the prometheus archive and start it:
tar -zxvf prometheus-2.36.0.linux-amd64.tar.gz
cd prometheus-2.36.0.linux-amd64
./prometheus
Visit http://192.168.10.129:9090 (192.168.10.129 is the VM's address).
Visit http://192.168.10.129:9090/metrics to view the raw metric data.
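The /metrics endpoint returns data in the Prometheus text exposition format: one `name{labels} value` sample per line, plus `# HELP`/`# TYPE` comment lines. A minimal parsing sketch (the sample lines below are illustrative, not actual node_exporter output):

```python
# Minimal parser for the Prometheus text exposition format.
# The SAMPLE input is illustrative; a real /metrics page has many more series.
import re

SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 890.12
"""

LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Yield (metric_name, labels_dict, float_value) for each sample line."""
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip HELP/TYPE comments and blank lines
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        yield name, labels, float(value)

samples = list(parse_metrics(SAMPLE))
print(samples[0])  # ('node_cpu_seconds_total', {'cpu': '0', 'mode': 'idle'}, 12345.67)
```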
Exporter
Exporters produce the monitoring data; Prometheus Server pulls (scrapes) it from them.
Full list of exporters: https://prometheus.io/docs/instrumenting/exporters/
Install node_exporter, which exposes host- and OS-level metrics:
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64
./node_exporter
Visit http://192.168.10.129:9100/
node_cpu_xxx: CPU metrics
node_disk_xxx: disk metrics
node_filesystem_xxx: filesystem metrics
node_memory_xxx: memory metrics
and so on.
node_exporter GitHub repo: https://github.com/prometheus/node_exporter
To disable a collector, use --no-collector.<name>; to enable one, use --collector.<name>
# disable the CPU collector
./node_exporter --no-collector.cpu
Edit prometheus.yml on the Prometheus Server:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find the exporter
      - targets: ["localhost:9100"]
Restart Prometheus Server; the node_exporter metrics now show up.
AlertManager Alerting
Alerting rules docs: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Configure the alerting rules in alert.yaml:
groups:
  # name of the rule group
  - name: test-group
    rules:
      # name of the alerting rule
      - alert: TestRule
        # PromQL expression; when it evaluates true, the alert is triggered
        expr: node_disk_read_bytes_total{device="sda", instance="localhost:9100", job="node_exporter"} > 20
        # how long the condition must hold continuously before the alert fires
        for: 10s
        # custom labels
        labels:
          node_disk_read_bytes_total: node_disk_read_bytes_total
        # annotations attach extra information, e.g. the alert details
        annotations:
          # short summary of the alert
          summary: "disk metric abnormal, custom label: {{ $labels.node_disk_read_bytes_total }}, instance: {{ $labels.instance }}"
          # detailed description of the alert
          description: "disk metric on {{ $labels.instance }} is abnormal, value: {{ $value }}"
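The `for: 10s` clause means the rule sits in a "pending" state until the expression has been true continuously for that duration, and only then moves to "firing". A rough sketch of that transition (this mimics the pending/firing semantics, not the actual Prometheus code):

```python
# Sketch of the `for:` semantics in an alerting rule: the alert fires only
# after the expression has held continuously for the given duration.
FOR_SECONDS = 10

def alert_states(samples, threshold=20, for_seconds=FOR_SECONDS):
    """samples: list of (timestamp_seconds, value). Returns the state per evaluation."""
    states = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition became true at this evaluation
            state = 'firing' if ts - pending_since >= for_seconds else 'pending'
        else:
            pending_since = None  # condition broke; the pending timer resets
            state = 'inactive'
        states.append(state)
    return states

# Evaluated every 5s (cf. evaluation_interval); the value crosses the threshold at t=5.
print(alert_states([(0, 10), (5, 25), (10, 30), (15, 31), (20, 5)]))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```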
Add the rule file path to prometheus.yml:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.yaml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find the exporter
      - targets: ["localhost:9100"]
Restart Prometheus Server.
AlertManager
When Prometheus Server finds that an alerting rule's condition is satisfied, it pushes the alert to AlertManager.
AlertManager configuration docs: https://prometheus.io/docs/alerting/latest/configuration/
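What Prometheus pushes is a JSON array of alert objects; you can build one yourself and POST it to AlertManager's v2 API (`POST /api/v2/alerts`) to test routing before wiring up real rules. A sketch of such a payload (field names follow the AlertManager v2 API; the label values are examples):

```python
# Build a test alert payload of the kind Prometheus pushes to AlertManager.
# You could send it with e.g. curl or urllib to http://localhost:9093/api/v2/alerts.
import json
from datetime import datetime, timezone

alert = {
    "labels": {"alertname": "TestRule", "instance": "localhost:9100",
               "severity": "warning"},
    "annotations": {"summary": "manual test alert"},
    "startsAt": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
}
payload = json.dumps([alert])  # the API expects a JSON *array* of alerts
print(payload)
```

With AlertManager running, something like `curl -XPOST -H 'Content-Type: application/json' -d "$payload" http://localhost:9093/api/v2/alerts` should make the alert appear in its UI.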
tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz
cd alertmanager-0.24.0.linux-amd64
Edit alertmanager.yml:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'email'
    email_configs:
      - to: xxx@qq.com
        from: yyy@qq.com
        smarthost: smtp.qq.com:465
        auth_username: yyy@qq.com
        # authorization code (not the account password)
        auth_password: uinynfsegdlibage
        # must be false when using port 465, otherwise startup fails
        require_tls: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
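The `inhibit_rules` entry above means: while a 'critical' alert is firing, suppress 'warning' alerts that carry the same values for every label listed in `equal`. A simplified sketch of that matching logic (real AlertManager matching also supports regexes):

```python
# Sketch of the inhibition rule above: a firing 'critical' alert suppresses
# 'warning' alerts that agree on all labels in EQUAL. (Simplified.)
EQUAL = ['alertname', 'dev', 'instance']

def is_inhibited(target, firing_alerts, equal=EQUAL):
    for source in firing_alerts:
        if source.get('severity') != 'critical' or target.get('severity') != 'warning':
            continue
        # labels missing on both sides compare equal, as in AlertManager
        if all(source.get(k) == target.get(k) for k in equal):
            return True
    return False

critical      = {'alertname': 'DiskFull', 'instance': 'host1:9100', 'severity': 'critical'}
warning_same  = {'alertname': 'DiskFull', 'instance': 'host1:9100', 'severity': 'warning'}
warning_other = {'alertname': 'DiskFull', 'instance': 'host2:9100', 'severity': 'warning'}

print(is_inhibited(warning_same, [critical]))   # True  -- same instance, suppressed
print(is_inhibited(warning_other, [critical]))  # False -- different instance
```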
Start AlertManager:
./alertmanager
Edit prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093
Restart Prometheus Server.
Visit the AlertManager UI (port 9093 by default).
PromQL
Prometheus' built-in query language, used to compute over, aggregate, and summarize the data stored in Prometheus.
Range vectors
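An instant vector selector such as node_cpu_seconds_total returns the latest sample of each matching series, while adding a duration, node_cpu_seconds_total[1m], returns a range vector: all samples within the last minute. Functions like rate() take a range vector and compute the per-second increase of a counter. A rough sketch of what rate() does (simplified: real rate() also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) within the range.
    Per-second increase between the first and last sample (no reset handling)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter sampled every 15s (cf. scrape_interval) over a 1m window:
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 220.0)]
print(simple_rate(window))  # 2.0 -- the counter grows by 2 units per second
```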
Grafana
A visualization tool for monitoring metrics.
Official site: https://grafana.com/
Download page: https://grafana.com/grafana/download
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.3.linux-amd64.tar.gz
tar -zxvf grafana-enterprise-8.5.3.linux-amd64.tar.gz
cd grafana-8.5.3/bin
./grafana-server
Add Prometheus as a data source.
Official dashboards: https://grafana.com/dashboards
Cluster
192.168.10.129 : Prometheus Server , node_exporter
192.168.10.130 : node_exporter
192.168.10.131 : node_exporter
1. Building on the setup above, add host-monitoring nodes (every server to be monitored needs its own node_exporter).
Install and start node_exporter on the 192.168.10.130 and 192.168.10.131 VMs.
2. Edit prometheus.yml on the Prometheus Server and restart:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.yaml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    static_configs:
      # static targets: tell the server where to find each exporter
      - targets: ["192.168.10.129:9100"]
      - targets: ["192.168.10.130:9100"]
      - targets: ["192.168.10.131:9100"]
Grafana dashboard IDs: 8919, 1860
HTTP API
HTTP API docs: https://prometheus.io/docs/prometheus/latest/querying/api/
Instant queries
The query API evaluates a PromQL expression at a single point in time.
GET /api/v1/query
URL query parameters:
query=<string>: the PromQL expression.
time=<rfc3339 | unix_timestamp>: the timestamp at which to evaluate the expression. Optional; defaults to the current server time.
timeout=<duration>: evaluation timeout. Optional; defaults to the value of the -query.timeout flag.
Example: query CPU info.
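Querying a CPU metric through this API boils down to URL-encoding the PromQL expression into the `query` parameter. A sketch of building such a request URL (the metric selector here is just an example):

```python
# Build an instant-query URL; the PromQL expression must be URL-encoded.
from urllib.parse import urlencode

base = "http://192.168.10.129:9090/api/v1/query"
params = {"query": 'node_cpu_seconds_total{mode="idle"}'}
url = base + "?" + urlencode(params)
print(url)
# http://192.168.10.129:9090/api/v1/query?query=node_cpu_seconds_total%7Bmode%3D%22idle%22%7D
```

Fetching this URL (e.g. with curl) returns a JSON body whose `data.result` holds the matching series and their current values.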
Range queries
GET /api/v1/query_range
URL query parameters:
query=<string>: the PromQL expression.
start=<rfc3339 | unix_timestamp>: start timestamp.
end=<rfc3339 | unix_timestamp>: end timestamp.
step=<duration | float>: query resolution step; the expression is evaluated once every step seconds within the range.
timeout=<duration>: evaluation timeout. Optional; defaults to the value of the -query.timeout flag.
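Because query_range evaluates the expression once per step between start and end, each series in the result contains roughly (end - start) / step + 1 points, which is why a small step over a wide window can get expensive. A quick sketch of that arithmetic:

```python
# Number of evaluation timestamps a range query produces per series:
# one at start, then one every `step` seconds up to and including end.
def range_points(start, end, step):
    return int((end - start) // step) + 1

# A 1-hour window at a 15s step:
print(range_points(0, 3600, 15))  # 241
```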