Prometheus Monitoring System

xyc1211

Last modified: 2023-03-27 18:26:29

Tags: prometheus

First published: 2023-03-23 11:36:10

Original link: https://prometheus.io/

https://prometheus.io/

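Prometheus

A combined monitoring, alerting, and time-series database system.
It is used mainly for monitoring: based on time series, it actively scrapes, stores, queries, and graphs metrics, and raises alerts according to rules.

It collects metrics with a pull model over HTTP.
It stores all data as time series: [timestamp, metric value].

Sample values are float64, with limited support for strings; timestamps have millisecond resolution.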

Architecture

[Figure: architecture diagram, not included]

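Concepts

Target
The definition of an object to scrape
Exporter
Converts metrics from a format Prometheus does not support into the metric format Prometheus supports
Endpoint
A source of metrics that can be scraped, usually corresponding to a single process.
Sample
A single value at a point in time in a time series (a float64 value and a millisecond-precision timestamp)

Job
A collection of targets with the same purpose, for example monitoring a group of like processes replicated for scalability or reliability
A set of instances
Instance
An instance is a label that uniquely identifies a target within a job.

Alertmanager
Alert
Notification
A group of alerts
Silence
Alerts whose labels match a silence are not notified
PromQL
PromQL is the Prometheus Query Language
Remote read
Transparently reads time series from other systems (such as long-term storage) as part of queries
Remote write
Sends ingested samples on the fly to other systems, such as long-term storage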

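Data model

Metric

Specifies the general feature of a system that is measured

Example:
http_requests_total metric: the total number of HTTP requests received

Labels

Labels enable Prometheus's dimensional data model: they identify a particular dimensional instance of the same metric
The query language allows filtering and aggregation based on these dimensions.
Changing any label value, including adding or removing a label, creates a new time series.

Example:
Add a method label with the value "get" or "post" to the http_requests_total metric to distinguish GET requests from POST requests

Notation

A time series is identified by a metric name plus a set of labels
<metric name>{<label name>=<label value>, ...}
For example: http_requests_total{method="POST", handler="/messages"}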
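
To make the "metric name plus label set" identity concrete, here is a minimal Python sketch. The helper names series_key and notation are hypothetical and only model the idea; they are not part of Prometheus or any client library.

```python
from typing import Dict, Tuple

# Hypothetical type: a hashable identity for one time series.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a series: the metric name plus its sorted label pairs."""
    return (metric, tuple(sorted(labels.items())))


def notation(metric: str, labels: Dict[str, str]) -> str:
    """Render a series as <metric name>{<label name>="<label value>", ...}."""
    pairs = ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{pairs}}}"


post = series_key("http_requests_total", {"method": "POST", "handler": "/messages"})
get = series_key("http_requests_total", {"method": "GET", "handler": "/messages"})

print(notation("http_requests_total", {"method": "POST", "handler": "/messages"}))
# Changing a single label value yields a different series identity.
print(post == get)  # False
```

Label order does not matter in this model because the pairs are sorted before comparison.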

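Metric types

Counter

Monotonically increasing: the value can only increase or be reset to zero

Gauge

A value that can go up and down arbitrarily

Histogram

Samples observations to measure and analyze their distribution, counting them in configurable buckets

Summary

Also samples observations to measure and analyze their distribution, reporting configurable quantiles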
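
The behaviour of these types can be sketched in a few lines of plain Python. This is an illustrative toy model, not the API of any Prometheus client library; the class and method names are assumptions, and Summary is left out because its quantile estimation is more involved.

```python
from bisect import bisect_left
from typing import List


class Counter:
    """Toy counter: may only increase, or be reset to zero (e.g. on restart)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount

    def reset(self) -> None:
        self.value = 0.0


class Gauge:
    """Toy gauge: can be set to any value, so it may go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Toy histogram: counts observations per bucket, plus a sum and a count."""

    def __init__(self, upper_bounds: List[float]) -> None:
        # A final +Inf bucket catches every observation above the last bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        index = bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> List[int]:
        """Cumulative counts: each bucket includes all smaller buckets."""
        total, result = 0, []
        for bucket_count in self.bucket_counts:
            total += bucket_count
            result.append(total)
        return result


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(latency.cumulative())  # [1, 3, 3, 4]
```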

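Download and installation

Download
https://prometheus.io/download/
Extract the archive
Configure the YAML file
prometheus.yml

Start

./prometheus --config.file=prometheus.yml

Docker installation

docker run -p 9090:9090 prom/prometheus

# Run with the configuration file bind-mounted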
docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

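Configuration file

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

# Global configuration of the Prometheus server
global:
  # How frequently to scrape targets
  scrape_interval:     15s
  # How frequently to evaluate rules
  evaluation_interval: 15s

# Rule files to load (recording rules and alerting rules)
rule_files:
  # - "first.rules"
  # - "second.rules"

# Resources and objects to monitor
scrape_configs:
    # A job
  - job_name: prometheus
    # Statically configured targets for this job
    static_configs:
      - targets: ['localhost:9090']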

Recording rules
Precompute an expression and save its result as a new set of time series

groups:
  - name: example
    rules:
    - record: code:prometheus_http_requests_total:sum
      expr: sum by (code) (prometheus_http_requests_total)

Alerting rules

groups:
  - name: example
    rules:
    - alert: HighRequestLatency
      expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
      for: 10m
      labels:
        severity: page
      annotations:
        summary: High request latency

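PromQL queries

Comments
Line comments start with #
Staleness (todo)
When a query runs, the timestamps at which data is sampled are chosen independently of the timestamps of the actual time-series data. This mainly supports cases such as aggregation (sum, avg, and so on),
because multiple time series are independent and do not align exactly in time. For each series, the newest sample before the chosen timestamp is taken
If a target scrape or a rule evaluation no longer returns a sample for a time series that previously existed, that time series is marked as stale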
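
A minimal Python sketch of "take the newest sample before the chosen timestamp": it assumes a lookback window of 5 minutes (the Prometheus default) and treats a sample exactly at the evaluation time as eligible. The function name instant_value is hypothetical, and explicit stale markers are not modelled.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)
LOOKBACK_SECONDS = 300.0      # assumed lookback window of 5 minutes


def instant_value(samples: List[Sample], eval_time: float,
                  lookback: float = LOOKBACK_SECONDS) -> Optional[Sample]:
    """Return the newest sample in (eval_time - lookback, eval_time], if any."""
    candidates = [s for s in samples if eval_time - lookback < s[0] <= eval_time]
    return max(candidates, key=lambda s: s[0]) if candidates else None


series = [(100.0, 1.0), (115.0, 2.0), (130.0, 3.0)]
print(instant_value(series, eval_time=120.0))  # (115.0, 2.0)
print(instant_value(series, eval_time=500.0))  # None: the last sample is too old
```

Because every series is sampled at the same evaluation time, series scraped at slightly different moments can still be aggregated together.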

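Data types

In Prometheus's expression language, an expression or subexpression can evaluate to one of four types

Instant vector

A set of time series, each containing a single sample, all sharing the same timestamp
Per series: exactly one value

Range vector

A set of time series, each containing a range of data points over time
Per series: a range of values

Scalar

No time series, just a single floating-point value

String

A string value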

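Time series selectors

`Instant vector selector`

Syntax:

# Select all time series with this metric name
<metric name>

# Select time series whose metric name satisfies the matcher
{__name__ <operator> "<metric name>"}

# Metric name + labels: a list of label matchers further filters the time series
<metric name>{<label name> <operator> <label value>, ...}

Example:

Query:
http_requests_total{}

Returns: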
http_requests_total{code="200",handler="alerts",instance="localhost:9090",job="prometheus",method="get"}=
(20889@1518096812.326)
http_requests_total{code="200",handler="graph",instance="localhost:9090",job="prometheus",method="get"}=
(21287@1518096812.326)

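Supported operators:

=: selects labels that are exactly equal to the provided string.
!=: selects labels that are not equal to the provided string.
=~: selects labels that regex-match the provided string.
!~: selects labels that do not regex-match the provided string.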
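
The four matching operators can be sketched in Python as follows. Two points are modelled here: regex matches are fully anchored, and a missing label behaves like an empty string. The function name matches is hypothetical, and Python's re module is used as a stand-in for the RE2 syntax that Prometheus uses, so edge cases may differ.

```python
import re
from typing import Dict


def matches(labels: Dict[str, str], name: str, op: str, pattern: str) -> bool:
    """Apply one label matcher (=, !=, =~, !~) to a label set."""
    value = labels.get(name, "")  # a missing label is treated as ""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # fullmatch: the pattern must match the whole label value.
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")


labels = {"method": "get", "code": "200"}
print(matches(labels, "method", "=~", "get|post"))  # True
print(matches(labels, "code", "=~", "2"))           # False: anchored match
print(matches(labels, "code", "=~", "2.."))         # True
```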

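`Range vector selector`

A duration is appended in square brackets at the end of the selector
Syntax: <instant vector selector>[<duration>]

# Select all samples recorded within the last 5 minutes
http_requests_total{}[5m] 

http_requests_total{job="prometheus"}[5m]

Example:

Query:
http_requests_total{}[5m] 

Returns: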
http_requests_total{code="200",handler="alerts",instance="localhost:9090",job="prometheus",method="get"}=
	[
	    1@1518096812.326
	    1@1518096817.326
	    1@1518096822.326
	    1@1518096827.326
	    1@1518096832.326
	    1@1518096837.325
	]
http_requests_total{code="200",handler="graph",instance="localhost:9090",job="prometheus",method="get"}=
	[
	    4@1518096812.326
	    4@1518096817.326
	    4@1518096822.326
	    4@1518096827.326
	    4@1518096832.326
	    4@1518096837.325
	]

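Time units:

ms - milliseconds
s - seconds
m - minutes
h - hours
d - days, assuming a day always has 24h
w - weeks, assuming a week always has 7d
y - years, assuming a year always has 365d
Durations can be combined by concatenation: 1h30m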
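
As a rough illustration of how concatenated durations add up, this Python sketch converts a duration string to milliseconds using the unit assumptions listed above. The function name parse_duration_ms is hypothetical, and the sketch is more lenient than a real parser: it does not check that units appear from longest to shortest or only once.

```python
import re

# Milliseconds per unit, using the fixed-length assumptions listed above.
UNIT_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "w": 604_800_000,
    "y": 31_536_000_000,
}
# "ms" is listed before "m" and "s" so that "5ms" is not read as 5 minutes.
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration_ms(text: str) -> int:
    """Convert a duration such as '1h30m' to milliseconds."""
    if not text:
        raise ValueError("empty duration")
    position, total = 0, 0
    while position < len(text):
        part = _PART.match(text, position)
        if part is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(part.group(1)) * UNIT_MS[part.group(2)]
        position = part.end()
    return total


print(parse_duration_ms("1h30m"))  # 5400000
print(parse_duration_ms("5ms"))    # 5
```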

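Evaluation time

By default, selectors sample relative to the current query evaluation time

offset: shifts the evaluation time
@: sets the evaluation time explicitly

`offset modifier`

The offset modifier changes the time offset of individual instant and range vectors in a query,
that is, it shifts the evaluation time for that selector

Syntax: <selector> offset <duration>

# Value of http_requests_total 5 minutes in the past
http_requests_total offset 5m

# The offset modifier must immediately follow the selector; they form one unit
sum(http_requests_total{method="GET"} offset 5m)

# Range vector + offset
rate(http_requests_total[5m] offset 1w)

# Negative offset (shifts forward in time)
rate(http_requests_total[5m] offset -1w)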

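`@ modifier`

The @ modifier changes the evaluation time of individual instant and range vectors in a query
Syntax: <selector> @ <timestamp>

http_requests_total @ 1609746000

# The @ modifier must also immediately follow the selector; they form one unit
sum(http_requests_total{method="GET"} @ 1609746000)

# offset + @: the order makes no difference; the offset is always applied relative to the @ time
http_requests_total @ 1609746000 offset 5m
http_requests_total offset 5m @ 1609746000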
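
The interaction of the two modifiers reduces to simple arithmetic: start from the @ time if one is given, otherwise from the query evaluation time, then subtract the offset. A small Python sketch, with the hypothetical helper name effective_time:

```python
from typing import Optional


def effective_time(query_time: float, at: Optional[float] = None,
                   offset: float = 0.0) -> float:
    """Time at which a selector is sampled, given optional @ and offset (seconds)."""
    base = query_time if at is None else at
    return base - offset


# http_requests_total @ 1609746000 offset 5m
print(effective_time(query_time=1700000000, at=1609746000, offset=300))  # 1609745700
# A negative offset looks forward in time.
print(effective_time(query_time=1000, offset=-60))  # 1060
```

Because subtraction does not depend on the order in which the modifiers are written, both spellings in the example above select the same point in time.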

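Special values: start(), end()

Instant query
start() and end() both resolve to the evaluation time
Range query
They resolve to the start and end of the range query respectively and stay the same for all steps.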

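Subquery

Runs an instant query over a given range at a given resolution
The result of a subquery is a range vector

Syntax:
<instant_query> '[' <range> ':' [<resolution>] ']' [ @ <float_literal> ] [ offset <duration> ]

instant query [ range : resolution ] [@ time] [offset duration]

Resolution
The interval between evaluations, also called granularity or resolution; it is optional and defaults to the global evaluation interval

# Evaluate http_requests_total over the last 10 minutes at a resolution of 1 minute
http_requests_total[10m:1m]
# Result: each value is the instant query's result at one step, and @timestamp is the time of that step
[value @timestamp]

# Return the 5-minute rate of the http_requests_total metric for the past 30 minutes, with a resolution of 1 minute.
rate(http_requests_total[5m])[30m:1m]

max_over_time(deriv(rate(distance_covered_total[5s])[30s:5s])[10m:])

Example:

http_requests_total[10m:1m]

Returns: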
http_requests_total{code="200", handler="prometheus", instance="172.18.0.10:9100", job="host_node_export", method="get"}=
	[
		97606 @1679642520
		97610 @1679642580
		97614 @1679642640
		97618 @1679642700
		97622 @1679642760
		97626 @1679642820
		97630 @1679642880
		97634 @1679642940
		97638 @1679643000
		97642 @1679643060
	]
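
The timestamps in the result above are all multiples of 60, which suggests how the steps of a subquery are laid out. The Python sketch below reproduces them under two assumptions that are not stated in the original: the steps are aligned to multiples of the resolution, and the left edge of the window is excluded. The function name subquery_steps and the evaluation time 1679643090 are hypothetical.

```python
from typing import List


def subquery_steps(eval_time: int, range_seconds: int, step_seconds: int) -> List[int]:
    """Step timestamps in (eval_time - range, eval_time], aligned to the step."""
    window_start = eval_time - range_seconds
    # First multiple of the step that lies strictly after the window start.
    first = (window_start // step_seconds + 1) * step_seconds
    return list(range(first, eval_time + 1, step_seconds))


# [10m:1m] evaluated at an assumed time of 1679643090
steps = subquery_steps(eval_time=1679643090, range_seconds=600, step_seconds=60)
print(len(steps), steps[0], steps[-1])  # 10 1679642520 1679643060
```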

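Operators

Arithmetic operators

Between two scalars
The operation is applied to the two floating-point values
Between an instant vector and a scalar:
The operation is applied to the value of every sample in the vector and the scalar
Between two instant vectors:
Each element in the left-hand vector is matched with the element in the right-hand vector that has exactly the same labels, and the operation is applied to the pair
Elements without a match are dropped
The resulting series do not carry the metric name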
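
The vector-to-vector case can be sketched in Python as a one-to-one join on identical label sets. The names signature and binary_op, and the sample metrics, are hypothetical; the sketch ignores the grouping modifiers that real queries can use and assumes each label set appears at most once per side.

```python
from operator import truediv
from typing import Callable, Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Vector = List[Tuple[Labels, float]]


def signature(labels: Labels) -> FrozenSet[Tuple[str, str]]:
    """Label set used for matching: every label except the metric name."""
    return frozenset((k, v) for k, v in labels.items() if k != "__name__")


def binary_op(left: Vector, right: Vector,
              op: Callable[[float, float], float]) -> Vector:
    """Apply op to pairs of elements whose label sets are identical."""
    right_by_signature = {signature(labels): value for labels, value in right}
    result: Vector = []
    for labels, value in left:
        sig = signature(labels)
        if sig in right_by_signature:  # unmatched elements are dropped
            # The output keeps the labels but not the metric name.
            result.append((dict(sig), op(value, right_by_signature[sig])))
    return result


errors = [({"__name__": "errors_total", "method": "get"}, 2.0),
          ({"__name__": "errors_total", "method": "post"}, 1.0)]
requests = [({"__name__": "requests_total", "method": "get"}, 8.0),
            ({"__name__": "requests_total", "method": "put"}, 4.0)]
print(binary_op(errors, requests, truediv))  # [({'method': 'get'}, 0.25)]
```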

Vector matching (todo)

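Aggregation operators

<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
or
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

vector expression
The instant vector expression to aggregate
aggr-op:
sum (calculate sum over dimensions)
min (select minimum over dimensions)
max (select maximum over dimensions)
avg (calculate the average over dimensions)
group (all values in the resulting vector are 1)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population standard variance over dimensions)
count (count number of elements in the vector)
count_values (count number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
without | by
without: removes the listed labels from the result vector; all other labels are preserved in the output.
by: the opposite of without; only the labels listed in the by clause are kept in the output
label list
A list of unquoted labels that may include a trailing comma
(label1, label2), (label1, label2,)
parameter
Required for count_values, quantile, topk, and bottomk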

sum by (job) (
  rate(http_requests_total[5m])
)
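
To show what by and without do to the label set, here is a minimal Python sketch of a sum aggregation over an in-memory instant vector. The function name aggregate_sum and the sample values are hypothetical; only sum is modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Vector = List[Tuple[Labels, float]]
GroupKey = Tuple[Tuple[str, str], ...]


def aggregate_sum(vector: Vector, label_list: List[str],
                  without: bool = False) -> Dict[GroupKey, float]:
    """Sum values per group; groups are defined by the labels that are kept."""
    groups: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in vector:
        if without:
            # without: drop the listed labels (and the metric name), keep the rest.
            kept = {k: v for k, v in labels.items()
                    if k not in label_list and k != "__name__"}
        else:
            # by: keep only the listed labels.
            kept = {k: v for k, v in labels.items() if k in label_list}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


rates = [({"job": "api", "instance": "a"}, 1.5),
         ({"job": "api", "instance": "b"}, 2.5),
         ({"job": "db", "instance": "c"}, 4.0)]
print(aggregate_sum(rates, ["job"]))
# {(('job', 'api'),): 4.0, (('job', 'db'),): 4.0}
```

For this data, aggregate_sum(rates, ["instance"], without=True) yields the same groups, since job is the only label left after instance is removed.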

Functions

https://prometheus.io/docs/prometheus/latest/querying/functions/

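Usage

Browser access

Open http://ip:9090/graph in a browser
Table view
Enter the metric name node_cpu in the search box and click "Execute"
This returns the time-series data of the node_cpu metric with its four labels (cpu, instance, job, mode)
Graph view

Range vectors
View monitored targets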

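HTTP API

http://<prometheus_server_url>/api/v1/query_range?query=<promql_expression>&start=<start_time>&end=<end_time>&step=<step>

Where:
<prometheus_server_url>: the URL of the Prometheus server.
<promql_expression>: the PromQL expression used to compute the metric data. For example, the rate() function computes a per-second rate of increase: rate(metric_name[time_range]).
<start_time>: start of the queried time range, in RFC3339 format or as a Unix timestamp.
<end_time>: end of the queried time range, in RFC3339 format or as a Unix timestamp.
<step>: query resolution step, as a duration such as 1h, 30m, or 10s, or as a number of seconds.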
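
PromQL expressions contain characters such as brackets and braces that must be URL-encoded. The Python sketch below builds a query_range URL with the standard library only; the helper name query_range_url, the server address, and the time range are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(server: str, query: str, start: str, end: str, step: str) -> str:
    """Build a /api/v1/query_range URL with properly encoded parameters."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{server.rstrip('/')}/api/v1/query_range?{params}"


url = query_range_url(
    server="http://localhost:9090",          # assumed local server
    query="rate(http_requests_total[5m])",
    start="2023-03-24T07:00:00Z",            # RFC3339
    end="2023-03-24T08:00:00Z",
    step="1m",
)
print(url)

# Decoding the query string returns the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])  # rate(http_requests_total[5m])
```

The resulting URL can then be fetched with any HTTP client.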
