Prometheus Study Notes

WillianmsLee

Article link: https://blog.csdn.net/wsylina/article/details/116332034

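This article introduces Prometheus, a monitoring platform, and how it supports accurate and precise monitoring. It covers the basic concepts, including the time series data model, the key/value data model, and data collection over HTTP pull and push. It then describes the Prometheus metric types (Gauges, Counters, and Histograms), key PromQL functions such as rate, increase, sum, and topk, service discovery, production-grade data collection, Exporters, Pushgateway, and installing and configuring Grafana. It closes with alert configuration and management.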

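Prometheus Study Materials

What it is

A full-featured monitoring platform.

What it offers

It contributes substantially to meeting monitoring requirements for accuracy and precision.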

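Features

Based on a time series model

A time series is an ordered sequence of data points, usually sampled at equal time intervals.
A key/value (K/V) data model.
Data is collected over HTTP in two ways: pull and push.
Built-in graphing for debugging queries.
The finest sampling granularity can theoretically reach the one-second level.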
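
To make the key/value time series model concrete, here is a minimal Python sketch that splits one sample line, as an exporter would return it, into a metric name, a label set, and a value. The function name `parse_sample` and the sample line are illustrative assumptions, and the patterns are deliberately simplified (no timestamps, comment lines, or escaped quotes), so this is not a full parser for the exposition format.

```python
import re

# Simplified patterns: no timestamps, comments, or escaped quotes inside label values.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    # The metric name plus its labels is the key; the number is the value.
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


name, labels, value = parse_sample('node_cpu{cpu="cpu0",mode="idle"} 1234.5')
print(name, labels, value)
```

Each distinct combination of metric name and labels identifies one time series; scraping the same line again later adds another point to that series.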

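Prometheus components

The Prometheus metrics concept

In Prometheus, all collected data is referred to as metrics. The term does not denote a specific data format; it is an abstraction over units of measurement.

Gauges

The simplest metric type: a single returned value representing an instantaneous state. Examples are the number of tasks in a waiting queue, CPU usage, and memory usage.
Counters

A Counter starts at 0 and accumulates. Ideally it only ever increases or stays the same and never decreases (the exception is a reset to zero, for example when the monitored process restarts).

An example is the cumulative number of user visits.
Histograms

Histograms record how observed values are distributed, so that statistics such as the median and the 75th, 90th, 95th, and 99th percentiles can be estimated. Prometheus stores cumulative bucket counts plus a sum and a count, so the exact minimum and maximum are not recorded.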
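
As a rough illustration of how a percentile can be estimated from histogram buckets, the Python sketch below interpolates linearly inside the bucket that contains the requested rank. The function name `estimate_quantile` and the bucket values are assumptions made for this example; it follows the general idea behind PromQL's `histogram_quantile` but is not the Prometheus implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    where the last upper bound is normally math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the open-ended bucket: report the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-latency buckets in seconds.
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(estimate_quantile(0.5, latency_buckets))
```

The estimate is only as fine as the bucket boundaries, which is why bucket layout matters when a histogram is defined.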

Advanced PromQL

Prometheus monitoring example: CPU

node_cpu: the key (metric name) for CPU monitoring; node_exporter 0.16 and later expose it as node_cpu_seconds_total
node_cpu{mode="idle"} # cumulative CPU idle time
increase(node_cpu{mode="idle"}[1m]) # CPU idle time accumulated over the last minute
sum(increase(node_cpu{mode="idle"}[1m])) # idle time over the last minute, summed across all CPU cores
by(instance): a clause that splits the value added up by sum() along the given label; instance identifies the machine
sum(increase(node_cpu{mode="idle"}[1m])) by(instance) # the same sum, split back out per server
sum(increase(node_cpu[1m])) by(instance) # one-minute increase of total CPU time across all modes
sum(increase(node_cpu{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu[1m])) by(instance) # fraction of CPU time spent idle
(1 - sum(increase(node_cpu{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu[1m])) by(instance)) * 100 # percentage of CPU time spent non-idle

The same pattern applies to the other modes:
sum(increase(node_cpu{mode="user"}[1m])) by(instance) / sum(increase(node_cpu[1m])) by(instance) # user-mode CPU usage
sum(increase(node_cpu{mode="system"}[1m])) by(instance) / sum(increase(node_cpu[1m])) by(instance) # system-mode CPU usage
sum(increase(node_cpu{mode="iowait"}[1m])) by(instance) / sum(increase(node_cpu[1m])) by(instance) # iowait CPU usage
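
The arithmetic behind the non-idle percentage can be checked by hand. The Python sketch below takes two snapshots of cumulative CPU seconds, computes the per-mode increase, sums it per instance, and derives the busy percentage. The function name `cpu_busy_percent` and all numbers are made up for illustration; they are not taken from a real host.

```python
def cpu_busy_percent(before, after):
    """before/after map (instance, mode) to cumulative CPU seconds."""
    idle = {}
    total = {}
    for (instance, mode), end_value in after.items():
        # Increase of this counter between the two snapshots.
        delta = end_value - before.get((instance, mode), 0.0)
        total[instance] = total.get(instance, 0.0) + delta
        if mode == "idle":
            idle[instance] = idle.get(instance, 0.0) + delta
    # (1 - idle / total) * 100, grouped by instance.
    return {
        instance: (1 - idle.get(instance, 0.0) / total[instance]) * 100
        for instance in total
        if total[instance] > 0
    }


before = {("web1", "idle"): 100.0, ("web1", "user"): 20.0, ("web1", "system"): 10.0}
after = {("web1", "idle"): 145.0, ("web1", "user"): 30.0, ("web1", "system"): 15.0}
print(cpu_busy_percent(before, after))
```

Here the idle counter grew by 45 seconds out of 60 in total, so a quarter of the CPU time was spent doing work.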

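Prometheus command line and common functions

Command line

node_cpu{exported_instance=~"web.*"} > 400
# The part inside {} is the label set, used to filter the data more precisely
# Labels also come from data collection; they can be user-defined or the exporter defaults
# The most important label here is exported_instance, which identifies the monitored server
# =~ matches the label value against a regular expression; a plain = would compare against the literal string "web.*"
# > 400 further filters the output by value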
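
The two filtering steps, first by label and then by value, can be mimicked in a few lines of Python. This is only a sketch of the semantics: the function names `label_matches` and `select` and the sample series are assumptions, and Python's `re` module stands in for the regular expression engine Prometheus uses. As in PromQL, a regular expression has to match the whole label value, and a missing label is treated as an empty string.

```python
import re


def label_matches(labels, name, op, pattern):
    """Evaluate one label matcher against a label set."""
    value = labels.get(name, "")
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(series, matchers, min_value):
    """Keep series whose labels satisfy every matcher and whose value exceeds min_value."""
    return [
        (labels, value)
        for labels, value in series
        if all(label_matches(labels, *matcher) for matcher in matchers)
        and value > min_value
    ]


series = [
    ({"instance": "web01"}, 500.0),
    ({"instance": "web02"}, 100.0),
    ({"instance": "db01"}, 900.0),
]
print(select(series, [("instance", "=~", "web.*")], 400))
```

Swapping `=~` for `=` in the last line returns nothing, because no instance is literally named `web.*`.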

Common functions

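rate function

rate() is meant specifically for counter-type data. Over the given time range, it returns the counter's average per-second increase.

rate(node_network_receive_bytes[1m]) # per-second increase of node_network_receive_bytes over one minute

increase function
increase() is also meant specifically for counter-type data. It returns the counter's total increase over the given time range.

increase(node_cpu{mode="idle"}[1m])

sum function

Adds up the enclosed values; the result can be split by label with a by() clause.

topk function

Returns the k highest values, where k is user-defined. The result is usually viewed in the console, since it is rarely meaningful as a graph.

topk(3, sum(increase(node_cpu[1m])) by(instance))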
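
To show how rate() and increase() relate, here is a simplified Python sketch that computes both from a list of raw counter samples. It treats any drop in value as a counter reset and continues counting from zero. The function names and sample values are assumptions for illustration, and unlike Prometheus the sketch does not extrapolate to the edges of the time window, so real query results can differ slightly.

```python
def increase(samples):
    """samples: list of (timestamp_seconds, counter_value), ordered by time."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so it was reset; everything since zero is new.
            total += current
    return total


def rate(samples):
    """Average per-second increase over the span covered by the samples."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical samples scraped every 15 seconds, with a reset before t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(increase(samples), rate(samples))
```

In other words, the rate is the increase divided by the length of the window, which is why both functions only make sense for counters.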
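
The combination of sum with by() and topk can be sketched in Python as a group-and-add step followed by picking the largest totals. The helper names `sum_by` and `topk` and the sample data are assumptions for this example only.

```python
import heapq
from collections import defaultdict


def sum_by(series, label):
    """Add up values of all series that share the same value for one label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Return the k (label value, total) pairs with the highest totals."""
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical per-core increases for three machines.
series = [
    ({"instance": "a", "cpu": "0"}, 10.0),
    ({"instance": "a", "cpu": "1"}, 5.0),
    ({"instance": "b", "cpu": "0"}, 30.0),
    ({"instance": "c", "cpu": "0"}, 1.0),
]
print(topk(2, sum_by(series, "instance")))
```

Without the grouping step the sum collapses everything into a single number, and taking the top k of one value tells you nothing.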

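Prometheus service discovery

Dynamic discovery

File-based service discovery

This does not depend on any platform or third-party service. The Prometheus server periodically loads target information from files. The files can be in JSON or YAML format and contain a list of target definitions. They can be generated by other programs, such as Ansible.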

- targets:
  - localhost:9090
  labels:
    app: prometheus
    job: prometheus
- targets:
  - localhost:9100
  labels:
    app: nodes-exporters
    job: node
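
Since the target files can be produced by another program, here is a small Python sketch of such a generator. It builds the list of target groups and writes it as JSON. The function names, the host inventory, and the output file name are all hypothetical; also note that a scrape job would need a `files` pattern ending in `.json` to pick up a file written this way.

```python
import json


def build_target_groups(hosts_by_job, port_by_job):
    """Turn a simple inventory into a list of file-based discovery target groups."""
    groups = []
    for job, hosts in sorted(hosts_by_job.items()):
        groups.append({
            "targets": [f"{host}:{port_by_job[job]}" for host in hosts],
            "labels": {"job": job},
        })
    return groups


def write_targets(path, groups):
    """Write the target groups to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)


# Hypothetical inventory; in practice this could come from a CMDB or Ansible.
groups = build_target_groups(
    {"node": ["192.168.1.80", "192.168.1.81"]},
    {"node": 9100},
)
print(json.dumps(groups, indent=2))
# write_targets("target/nodes.json", groups)  # hypothetical output path
```

Because Prometheus re-reads the files on its own, regenerating the file is enough to add or remove targets without restarting the server.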


# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus' # job name
    # static_configs: # settings for the monitored hosts
    # - targets: ['192.168.1.90:9100']
    file_sd_configs:
    - files:
      - target/prometheus*.yaml
      refresh_interval: 1m

  - job_name: 'nodes' # job name
    # static_configs: # settings for the monitored hosts
    # - targets: ['192.168.1.80:9100','192.168.1.81:9100','192.168.1.82:9100']
    file_sd_configs:
    - files:
      - target/nodes*.yaml
      refresh_interval: 1m

  - job_name: 'pushgateway' # job name
    # static_configs: # settings for the monitored hosts
    # - targets: ['192.168.1.70:9091']
    file_sd_configs:
    - files:
      - target/pushgateway*.yaml
      refresh_interval: 1m

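Production-grade monitoring data collection

Installing and configuring Prometheus

tar -zxvf ……
nohup ./prometheus --web.read-timeout=5m --web.max-connections=512 --storage.tsdb.retention=15d --query.timeout=2m --query.max-concurrency=20 &
--web.read-timeout=5m # maximum time to wait on a request connection
--web.max-connections=512 # maximum number of simultaneous connections
--storage.tsdb.retention=15d # how long data is kept (newer releases name this flag --storage.tsdb.retention.time)
--storage.tsdb.path="data/" # data storage path
--query.timeout=2m # maximum time a slow query may run
--query.max-concurrency=20 # maximum number of queries executed concurrently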

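Prometheus data storage

Recent data is held in memory and is also written at intervals to the wal/ directory (the write-ahead log), so that the in-memory data can be recovered after a sudden power loss or a restart.

Prometheus configuration file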


# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus' # job name
    static_configs: # settings for the monitored hosts
    - targets: ['192.168.1.90:9100']

  - job_name: 'nodes' # job name
    static_configs: # settings for the monitored hosts
    - targets: ['192.168.1.80:9100','192.168.1.81:9100','192.168.1.82:9100']

  - job_name: 'pushgateway' # job name
    static_configs: # settings for the monitored hosts
    - targets: ['192.168.1.70:9091']

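Exporter

It is itself an HTTP server and responds to GET requests sent from outside.
It runs in the background and can be triggered periodically to collect local data.
The content returned to Prometheus must conform to the metrics format (key-value).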
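
The three properties above can be seen in a toy exporter written with only the Python standard library. This is a hedged sketch, not how node_exporter or the official client libraries are built: the metric name `demo_load1`, the port 9200, and the function names are assumptions, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render a dict of metric name to value as plain key-value lines."""
    return "".join(f"{name} {value}\n" for name, value in sorted(values.items()))


def collect():
    """Gather local data at request time (Unix only)."""
    load1, _, _ = os.getloadavg()
    return {"demo_load1": load1}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Run the exporter until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the server, after which a GET request to `/metrics` on that port returns the current load average as one key-value line.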

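Pushgateway

Introduction to Pushgateway

Pushgateway is a Prometheus add-on that obtains monitoring data by having it pushed in, rather than having it pulled as with an exporter.

It can run on its own on any node, not necessarily on the monitored client.

Custom scripts send the data to be monitored to Pushgateway, and Prometheus then collects it from Pushgateway.

Pushgateway has no ability to collect monitoring data itself; it only waits for data to be pushed to it.

Pushgateway example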

#!/bin/bash
instance_name=$(hostname) # store the host name for use as a label
if [ "$instance_name" = "localhost" ]; then # the host name must not be localhost, or the label cannot tell machines apart
  echo "must FQDN hostname"
  exit 1
fi

label="count_netstat_wait_connections" # define a new key (metric name)
count_netstat_wait_connections=$(netstat -an | grep -ci wait) # number of connections in a *WAIT state
echo "$label:$count_netstat_wait_connections"
echo "$label $count_netstat_wait_connections" | curl --data-binary @- "http://192.168.1.70:9091/metrics/job/pushgateway/instance/$instance_name"
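
For comparison, the same push can be sketched in Python with only the standard library. The function names `build_push_request` and `push` are assumptions for this example; the URL layout and the one-line `name value` body mirror what the curl command in the script sends. The sketch assumes job and instance names without slashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, instance, metrics):
    """Build a POST request carrying 'name value' lines for one job/instance group."""
    url = (
        f"{base_url}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


def push(request, timeout=5):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_push_request(
    "http://192.168.1.70:9091",  # Pushgateway address used in the script
    "pushgateway",
    "web01",  # hypothetical host name
    {"count_netstat_wait_connections": 12},
)
print(request.full_url)
# push(request)  # uncomment to actually send the data
```

Building the request separately from sending it makes the URL and body easy to inspect before anything goes over the network.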

Grafana

Installing and starting

wget https://dl.grafana.com/oss/release/grafana-7.5.0-1.x86_64.rpm
yum -y install grafana-7.5.0-1.x86_64.rpm
systemctl start grafana-server.service
systemctl status grafana-server.service

## Initial Grafana login user name and password: admin admin

Configuring the Grafana data source
Creating a dashboard

Prometheus monitoring alerts

Alertmanager configuration file

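global:
  resolve_timeout: 5m # treat an alert as resolved after 5 minutes without receiving it again

route:
  group_by: ["instance"]            # labels to group alerts by
  group_wait: 30s                   # after an alert arrives, wait 30 seconds for further alerts and send them together
  group_interval: 5m                # interval between notifications for a group
  repeat_interval: 3h               # interval before repeating a notification
  receiver: mail                    # default receiver; required, and must match a receiver name below

receivers:
- name: 'mail'                      # receiver name
  email_configs:
  - to: '*************'             # recipient address
    from: '***********'             # sender address
    smarthost: 'smtp.qq.com:465'    # SMTP server address
    auth_username: '*********'      # mailbox user
    auth_identity: '**********'     # authentication identity
    auth_password: '**********'     # mailbox password
    require_tls: false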

Example Prometheus alerting rule file

groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    # Condition for alerting
    expr: up == 0
    for: 1m
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Instance down'
      description: Instance has been down for more than 1 minute.
      # Labels - additional labels to be attached to the alert
    labels:
      severity: 'critical'
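
The effect of the `for: 1m` clause in this rule can be traced with a small Python simulation: the alert stays pending while `up == 0` has held for less than the configured duration and only then starts firing. The function name `alert_states` and the evaluation values are assumptions for illustration; this is a sketch of the state transitions, not Prometheus's rule evaluator.

```python
def alert_states(up_values, interval_seconds, for_seconds):
    """Return the alert state at each evaluation of the expression up == 0."""
    states = []
    active_since = None
    for index, up in enumerate(up_values):
        now = index * interval_seconds
        if up != 0:
            # Condition no longer holds: the alert goes back to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 15 seconds; the target is down for five evaluations in a row.
print(alert_states([1, 0, 0, 0, 0, 0, 1], 15, 60))
```

A short outage that recovers before the duration elapses therefore never produces a notification, which is the point of the `for` clause.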

Prometheus main configuration file

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - target/alertmanager*.yaml

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "rules/*.yaml"


# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus' # job name
    file_sd_configs:
    - files:
      - target/prometheus*.yaml
      refresh_interval: 1m
  - job_name: 'nodes' # job name
    file_sd_configs:
    - files:
      - target/nodes*.yaml
      refresh_interval: 1m
  - job_name: 'pushgateway' # job name
    file_sd_configs:
    - files:
      - target/pushgateway*.yaml
      refresh_interval: 1m
  - job_name: 'alertmanager' # job name
    file_sd_configs:
    - files:
      - target/alertmanager*.yaml

Inhibiting alerts

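global:
  resolve_timeout: 5m # treat an alert as resolved after 5 minutes without receiving it again

route:
  group_by: ["instance"]            # labels to group alerts by
  group_wait: 30s                   # after an alert arrives, wait 30 seconds for further alerts and send them together
  group_interval: 5m                # interval between notifications for a group
  repeat_interval: 3h               # interval before repeating a notification
  receiver: mail                    # default receiver; required, and must match a receiver name below

receivers:
- name: 'mail'                      # receiver name
  email_configs:
  - to: '*************'             # recipient address
    from: '***********'             # sender address
    smarthost: 'smtp.qq.com:465'    # SMTP server address
    auth_username: '*********'      # mailbox user
    auth_identity: '**********'     # authentication identity
    auth_password: '**********'     # mailbox password
    require_tls: false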
inhibit_rules:
- source_match:
    alertname: InstanceDown
    severity: critical
  target_match:
    alertname: InstanceDown
    severity: critical
  equal:
    - instance
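
How an inhibit rule decides whether to mute an alert can be sketched in Python. A target alert is muted when some firing alert matches the source side and both share the same values for the labels listed under `equal`. The sketch also models Alertmanager's documented safeguard that an alert matching both sides cannot be inhibited by alerts that also match both sides, which is worth noting because the rule above uses identical source and target matchers. The function names, the second rule, and the alerts are hypothetical, and this is not Alertmanager's implementation.

```python
def _matches(alert, matchers):
    """True if the alert carries every label/value pair in matchers."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Decide whether target is muted by any firing alert under one inhibit rule."""
    if not _matches(target, rule["target_match"]):
        return False
    target_is_also_source = _matches(target, rule["source_match"])
    for source in firing_alerts:
        if not _matches(source, rule["source_match"]):
            continue
        # Safeguard against self-inhibition when both alerts match both sides.
        if target_is_also_source and _matches(source, rule["target_match"]):
            continue
        if all(source.get(label, "") == target.get(label, "") for label in rule["equal"]):
            return True
    return False


# Hypothetical rule: a critical InstanceDown mutes warnings on the same instance.
rule = {
    "source_match": {"alertname": "InstanceDown", "severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["instance"],
}
down = {"alertname": "InstanceDown", "severity": "critical", "instance": "web01"}
warning = {"alertname": "HighLoad", "severity": "warning", "instance": "web01"}
print(is_inhibited(warning, [down], rule))
```

Under this reading, a rule whose source and target matchers are the same never mutes anything, so a practical rule usually targets a lower severity or a different alert name than its source.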