Prometheus Components Explained

CN-FuWei

Category: Prometheus. Tags: middleware

Article link: https://blog.csdn.net/zfw_666666/article/details/124444958

1. Introduction

1.1 Monitoring system overview

Note: monitoring and alerting are different things; keep them distinct.

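Designing a monitoring system:
- Assess the system's business processes, business types, and architecture. You need a reasonable understanding of the details in each area
- Classify the kinds of monitoring items you need:
  - Business monitoring: QPS, PV, UV, SUCC_RATE, complaint rate ...
  - System monitoring: CPU, MEM, Load, IO, Traffic ...
  - Network monitoring: TCP retransmissions, packet loss, latency ...
  - Log monitoring: the various logs that need to be collected; usually designed and implemented separately
  - Application monitoring: embedded in the program, either reading traffic directly, exposing a dedicated interface, or writing a special log format
- Choose the monitoring approach and software: pick a solution that fits the internal architecture, its size and variety, and the number of people available
- Staffing: assign clear ownership and divide the work into parts, with the development team helping with the selection
Steps to build a monitoring system:
- Set up a single server
- Deploy a single client
- Test the single client
- Deploy the collector on a single node
- Roll out the collector in bulk
- Make the monitoring server highly available (HA)
- Set up dashboards for the monitoring data (Grafana)
- Test the alerting system
- Test the alerting rules
- Test monitoring and alerting together
- Put monitoring into production
Data collection:
- Scripts can be used to collect data:
  - for example shell/python/awk/lua/perl/go
- Forms of data collection:
  - Periodic: for example, crontab runs a collection job at a fixed interval
  - Background: the collector runs as a daemon and collects data continuously
  - ...
Analysing monitoring data / algorithms:
- How should a metric be judged? For example, if CPU load is above 10, how long must that last before an alert fires?
- How large must the difference between the current QPS and the historical QPS (for the same time of day) be before an alert fires?
- ...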
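The two questions above can be made concrete with a small sketch. This is only an illustration, not how any particular tool decides: the function names, the "N consecutive samples" rule, and the ratio-based comparison are all assumptions chosen for the example.

```python
def should_alert(samples, threshold, min_consecutive):
    """Return True when the newest `min_consecutive` samples all exceed `threshold`.

    `samples` is a list of readings ordered oldest to newest (hypothetical input).
    """
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def deviates_from_history(current, historical, max_ratio):
    """Return True when `current` differs from `historical` by more than `max_ratio`.

    `max_ratio` is a relative difference, for example 0.3 for 30%.
    """
    if historical == 0:
        return current != 0
    return abs(current - historical) / historical > max_ratio
```

With a 15-second collection interval, `min_consecutive=4` would correspond to "above the threshold for about one minute".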
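Stability testing of monitoring:
- Whether it is a one-off test or background collection, any collection has some impact on the system
- Stability testing means deploying on a single node, observing it for a while, and judging the scope of the impact
Monitoring automation:
- Bulk client deployment, server HA, and changes to monitoring items all require a lot of manual work
- Recommended automation tools: Puppet (configuration file deployment), Jenkins (continuous integration and deployment), CMDB (configuration management)
Visualization:
- Grafana ...

Common monitoring tools: Nagios/Cacti/Zabbix/Ntop/Prometheus/...
Common alerting tools: PagerDuty/in-house voice alerting/in-house SMS notification/in-house email system/...
Directions for monitoring: self-healing monitoring systems / full-link (end-to-end) monitoring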

1.2 Introduction to Prometheus

What exactly is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open source project and maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.

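Prometheus features

Time-series data model //storage
Key/value (label-based) data model //storage
Queries over collected data are built from mathematical operations and functions, and a dedicated query web UI is provided //display
Data is collected by HTTP pull; push is supported through the Pushgateway //transport
Open source, with a large number of community plugins //extensions
The push path is very flexible and accepts almost any kind of data //transport
Built-in graphing for debugging (fairly basic) //display
Fine-grained sampling, in theory down to one-second intervals //storage
No clustering //newer versions support federation
Performance bottlenecks when the monitored cluster is large //limitation
Occasional data loss //improved since 2.0
Poor Chinese support and few Chinese-language resources //limitation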

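1.3 Prometheus architecture

Prometheus Server: collects, stores, and retrieves time-series data
Pushgateway: supports the parts of monitoring that Prometheus cannot pull directly
Service discovery: discovers the targets to scrape
Alertmanager: handles alerts
PromQL: used for querying and display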

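1.4 Storage

Prometheus can store time-series data locally and also supports remote storage.

Prometheus stores time series on the local disk in its own custom format.
The local time-series database groups samples into blocks of two hours each. Each block is a directory that holds a chunks subdirectory with the time-series samples, a metadata file, and an index file.
The index file indexes metric names (in Prometheus, one collected key/value series is called a metric) and labels to the time series in the chunks directory. Chunks are the basic unit of storage; the index and metadata accompany them.
Prometheus first keeps newly collected data in memory, where it acts like a cache that speeds up searches and access (its memory consumption is usually fairly modest).
To protect against crashes, Prometheus uses a write-ahead log (WAL): incoming data is also written to a log on disk, and on restart the log is replayed to restore the in-memory data.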

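1.4.1 Data model

Prometheus stores essentially all data as time series. Every time series is uniquely identified by its *metric name* and optional key-value pairs called *labels*.

Metric names must match [a-zA-Z_:][a-zA-Z0-9_:]* and label names must match [a-zA-Z_][a-zA-Z0-9_]*. Label names beginning with __ are reserved for internal use. A label with an empty value is treated the same as a label that does not exist.

Notation for a time series:

<metric name>{<label name>=<label value>, ...} # one metric name can carry many labels; each distinct combination of labels is a separate series with its own value
api_http_requests_total{method="POST", handler="/messages"} # example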
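The naming rules and the series notation can be checked with a short sketch. It uses the two patterns quoted in this section; the helper names are made up for the example, and label values are assumed not to need escaping.

```python
import re

# Patterns as given in the data model description.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_user_label_name(name):
    # Names starting with "__" are reserved for internal use.
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")


def format_series(metric, labels):
    """Render a series in the <metric name>{<label>=<value>, ...} notation.

    Labels with an empty value are dropped, since they are equivalent to
    labels that do not exist. Values are assumed to contain no quotes.
    """
    pairs = [f'{k}="{v}"' for k, v in sorted(labels.items()) if v != ""]
    if not pairs:
        return metric
    return metric + "{" + ", ".join(pairs) + "}"


print(format_series("api_http_requests_total",
                    {"method": "POST", "handler": "/messages", "env": ""}))
```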

This is the same notation that OpenTSDB uses. OpenTSDB documentation: http://opentsdb.net/docs/build/html/index.html

[root@master1 pushgateway-1.4.1.linux-amd64]# curl -s  http://127.0.0.1:9100/metrics  |grep -v "#"   
...
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_seconds_total{cpu="0",mode="idle"} 446008.08
node_cpu_seconds_total{cpu="0",mode="iowait"} 1063.97
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0.18
node_cpu_seconds_total{cpu="0",mode="softirq"} 710.98
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 4323.16
node_cpu_seconds_total{cpu="0",mode="user"} 8415.35
...

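1.4.2 Metric types

Prometheus currently supports four metric types.

Gauges

The simplest metric type: a single value that reflects an instantaneous state, for example the number of tasks waiting in a queue.
Example: disk capacity or memory usage can be measured with a gauge, because both change up and down over time.

Counters

A counter starts at 0 and can only increase or be reset to 0.
Example: counting user visits, adding 1 for each visit. Ideally the value only ever increases, or at most stays the same.

Histograms

A histogram tracks the distribution of event sizes, such as request duration or response size. What makes it special is that it groups observations into buckets and also provides the count and the sum of all observations. For example, with 5 observations below 10, 1 between 10 and 20, and 2 between 20 and 30, count=8 and sum is the total of those 8 values. (Prometheus exposes the buckets cumulatively, so the same data appears as le="10": 5, le="20": 6, le="30": 8.)
Example: http_response_time (HTTP response time), collected from nginx_access.log to find the average response time users see.
Approach 1: take every http_response_time value in the day's nginx_access.log, sum them, and divide by the number of requests. The question is: what does that number actually tell you?
Special case 1: at 9:00 in the morning the system is sluggish and the average response time is 1~3s, but only for 5 minutes. The average over the whole day reveals nothing.
Special case 2: there is no outage and the average response time is 0.05s, but there are always some slow requests. Looking at the average alone will not reveal them.
Histogram and summary types show how the samples are distributed: how many requests took at most 0.5s, how many fell in between, and how many took longer than 2s.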
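To see why buckets show what an average hides, here is a minimal sketch of a cumulative histogram. It is not the client library implementation; the class name and the bucket bounds (0.5s and 2s) are assumptions taken from the example above.

```python
from bisect import bisect_left


class Histogram:
    """Counts observations into buckets with inclusive upper bounds (le)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value.
        self.bucket_counts[bisect_left(self.upper_bounds, value)] += 1
        self.count += 1
        self.sum += value

    def cumulative(self):
        """Return {upper_bound: observations <= upper_bound}."""
        result, running = {}, 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            result[bound] = running
        return result


h = Histogram([0.5, 2.0])
for seconds in [0.05, 0.05, 0.5, 1.5, 3.0]:
    h.observe(seconds)
print(h.cumulative(), h.count, h.sum / h.count)
```

The mean here is about 1s, which matches none of the actual requests; the buckets show that three of five were fast and one was slower than 2s.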

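Summary

Similar to a histogram, a summary represents samples taken over a period of time (usually request durations or response sizes). It provides quantiles, which split the tracked results by percentage. For example, the 0.95 quantile is the value that 95% of the sampled values do not exceed. The quantiles are calculated on the client, which makes them more precise than histogram estimates but also more expensive to compute.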

1.4.3 Storage

Data can be stored locally or on a remote system.

[root@master1 data]# ll
总用量 20
drwxr-xr-x 3  700 root    68 5月  30 21:02 01F6YNX0G290KVDSW5B0CV9F9N
drwxr-xr-x 3  700 root    68 6月   4 09:34 01F7AAGYNM9WJVQWEE76C3C4S3
drwxr-xr-x 3  700 root    68 6月   4 13:01 01F7APBV7124ZZXEDH0WFXNVBF
drwxr-xr-x 3  700 root    68 6月   4 19:00 01F7BAWAD1RMETVR2QJ7JR2C15
drwxr-xr-x 3  700 root    68 6月   7 11:12 01F7J79JZSQ4DZ393TZXRZ5HBA
drwxr-xr-x 3  700 root    68 6月   7 13:14 01F7JE8ZGWA0R0BNG87C8AMSSR
drwxr-xr-x 3  700 root    68 6月   7 19:00 01F7K22E6PDT4A5JTRXYV1QYE9
drwxr-xr-x 3 root root    68 6月   8 13:00 01F7MZVZAW0H7VBQV296FNKRGE
drwxr-xr-x 3 root root    68 6月   9 13:00 01F7QJ8PAVAJE5W4M5MW4QW2M2
drwxr-xr-x 3 root root    68 6月  10 01:00 01F7RVF1Y9CPN51HYDE5K3AW5K
drwxr-xr-x 3 root root    68 6月  13 15:19 01F823SWE6G0WV1K2WK0DQ34FP
drwxr-xr-x 3 root root    68 6月  13 15:19 01F823SWHHRDHM4MMP6WKKB0D3
drwxr-xr-x 3 root root    68 6月  13 15:19 01F823SWMH6HH1SQ8YKBV8399R
drwxr-xr-x 2  700 root    20 6月  13 16:00 chunks_head
-rw-r--r-- 1  700 root     0 5月  29 14:33 lock
-rw-r--r-- 1  700 root 20001 6月  13 16:05 queries.active
drwxr-xr-x 3  700 root    81 6月  13 15:31 wal

[root@master1 data]# cd  01F823SWMH6HH1SQ8YKBV8399R
[root@master1 01F823SWMH6HH1SQ8YKBV8399R]# ll
总用量 396
drwxr-xr-x 2 root root     20 6月  13 15:19 chunks
-rw-r--r-- 1 root root 395834 6月  13 15:19 index
-rw-r--r-- 1 root root    700 6月  13 15:19 meta.json
-rw-r--r-- 1 root root      9 6月  13 15:19 tombstones
[root@master1 01F823SWMH6HH1SQ8YKBV8399R]# ls chunks/
000001
[root@master1 01F823SWMH6HH1SQ8YKBV8399R]# ls chunks/000001 
chunks/000001

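By default a new block directory is created every 2h. Each block directory contains a chunks subdirectory holding all the time-series samples for that time window, a metadata file, and an index file (which indexes metric names and labels to the time series in the chunks directory). The samples in the chunks directory are grouped into one or more segment files of up to 512MB each by default. When series are deleted through the API, the deletion records are stored in a separate tombstones file instead of the data being removed from the chunk segments immediately.
The current block for recent data is kept in memory. It is protected against crashes by a write-ahead log (WAL), which is replayed when Prometheus restarts after a crash. WAL files are stored in the wal directory in 128MB segments and contain raw data that has not yet been compacted. Prometheus retains at least 3 WAL files.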
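The block directory names in the listing above are ULIDs, whose first 10 characters encode a creation time in milliseconds. The following sketch decodes that timestamp. The helper names are made up for the example, and it assumes an upper-case, well-formed ULID.

```python
from datetime import datetime, timedelta, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp from the first 10 characters."""
    value = 0
    for ch in ulid[:10]:
        value = value * 32 + CROCKFORD.index(ch)
    return value


def ulid_datetime(ulid):
    """Return the ULID timestamp as a timezone-aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=ulid_timestamp_ms(ulid))


print(ulid_datetime("01F823SWMH6HH1SQ8YKBV8399R").isoformat())
```

The result is in UTC, so it differs from the `ll` timestamps by the host's timezone offset.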


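Notes:
	1. Local storage has no clustering or replication, so using RAID underneath is recommended. Snapshots are recommended for backups.
	2. External storage can be used through the remote read/write API, but performance and efficiency vary widely.

Related flags: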

--storage.tsdb.path: Where Prometheus writes its database. Defaults to data/
--storage.tsdb.retention.time: When to remove old data. Defaults to 15d
--storage.tsdb.retention.size: [EXPERIMENTAL] The maximum number of bytes of storage blocks to retain. The oldest data will be removed first. Defaults to 0 or disabled. This flag is experimental and may change in future releases. Units supported: B, KB, MB, GB, TB, PB, EB. Ex: "512MB"
--storage.tsdb.wal-compression: Enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be halved with little extra cpu load. This flag was introduced in 2.11.0 and enabled by default in 2.20.0. Note that once enabled, downgrading Prometheus to a version below 2.11.0 will require deleting the WAL.

Remote storage:

Prometheus integrates with external storage in three ways:

Prometheus can write samples that it ingests to a remote URL in a standardized format.
Prometheus can receive samples from other Prometheus servers in a standardized format.
Prometheus can read (back) sample data from a remote URL in a standardized format.

Corresponding flags: --storage.remote.*

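1.5 Service discovery

Prometheus can discover and monitor services through custom mechanisms or by integrating with service discovery systems such as Consul, Kubernetes, and DNS.

For the supported service discovery mechanisms and how to configure them, see the sections with sd_config in their name at https://prometheus.io/docs/prometheus/latest/configuration/configuration/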
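One simple way to do custom discovery is file-based service discovery, where a script writes a JSON file of targets and Prometheus reads it. The sketch below only builds that JSON document. The function name, the targets, and the `cluster_name` label are illustrative, and Prometheus itself would still need a `file_sd_configs` entry pointing at the written file.

```python
import json


def build_file_sd(groups):
    """Build a file_sd document from (targets, labels) pairs.

    Each group becomes {"targets": [...], "labels": {...}}.
    """
    document = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(document, indent=2)


doc = build_file_sd([
    (["master1:9100", "master2:9100"], {"cluster_name": "db"}),
    (["master3:9100"], {"cluster_name": "web"}),
])
print(doc)
```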

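1.6 Pushgateway

Data reaches Prometheus from clients in two main ways:

1. Pull

The client (the monitored side) first installs one of the existing exporters, which runs as a daemon and starts collecting data.
The exporter is itself an HTTP server that answers HTTP requests with the collected data. Prometheus pulls by sending an HTTP GET to the exporter on each node and retrieving the data it needs.

2. Push

The officially provided Pushgateway is installed on the client (or server) side. Scripts written by the operations team format monitoring data as key/value metrics and send it to the Pushgateway, and Prometheus then scrapes the Pushgateway like any other target.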

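1.6.1 Exporter

About exporters:

Unlike the Pushgateway, an exporter is a collector that runs on its own. Its main characteristics:

It is an HTTP server that responds to HTTP GET requests from outside
It runs in the background and can periodically collect local monitoring data
What it returns to the Prometheus server must follow the Prometheus metric format (key-value), where the value is a number (float or int)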
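The three characteristics above fit in a few lines of standard-library Python. This is a toy sketch for illustration, not a replacement for node_exporter or the official client library: the metric name `demo_load1` and port 8000 are made up, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics():
    """Return the current metrics in the Prometheus text format."""
    load1, _, _ = os.getloadavg()
    return (
        "# HELP demo_load1 1-minute load average.\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the exporter until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the server; `curl http://127.0.0.1:8000/metrics` should then return the three lines built by `render_metrics()`.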

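Common exporters: https://prometheus.io/download/ lists many exporters, written in Go, Ruby, Python, and other languages

node_exporter is the most widely used and covers almost every system-level monitoring item. Some of its collectors: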

Name	Description	OS
arp	Exposes ARP statistics from `/proc/net/arp`.	Linux
bcache	Exposes bcache statistics from `/sys/fs/bcache/`.	Linux
bonding	Exposes the number of configured and active slaves of Linux bonding interfaces.	Linux
btrfs	Exposes btrfs statistics	Linux
boottime	Exposes system boot time derived from the `kern.boottime` sysctl.	Darwin, Dragonfly, FreeBSD, NetBSD, OpenBSD, Solaris
conntrack	Shows conntrack statistics (does nothing if no `/proc/sys/net/netfilter/` present).	Linux
cpu	Exposes CPU statistics	Darwin, Dragonfly, FreeBSD, Linux, Solaris, OpenBSD
cpufreq	Exposes CPU frequency statistics	Linux, Solaris
diskstats	Exposes disk I/O statistics.	Darwin, Linux, OpenBSD
edac	Exposes error detection and correction statistics.	Linux
entropy	Exposes available entropy.	Linux
exec	Exposes execution statistics.	Dragonfly, FreeBSD
fibrechannel	Exposes fibre channel information and statistics from `/sys/class/fc_host/`.	Linux
filefd	Exposes file descriptor statistics from `/proc/sys/fs/file-nr`.	Linux
filesystem	Exposes filesystem statistics, such as disk space used.	Darwin, Dragonfly, FreeBSD, Linux, OpenBSD

...

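1.6.2 Pushgateway

The Pushgateway is also an HTTP server. Operators write scripts that gather whatever monitoring data they want and push it to the Pushgateway (HTTP POST); Prometheus then scrapes it from the Pushgateway.

Why is the Pushgateway needed?

Exporters already collect a wide variety of data, but we still need plenty of custom, non-standard monitoring data
Exporters collect a lot of information, much of which we never use. A Pushgateway script usually defines just one kind of data, so collection uses fewer resources
Writing a Pushgateway script is far simpler and faster than developing a brand-new exporter.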
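A push script can be very small. The sketch below builds the HTTP POST for the Pushgateway's `/metrics/job/<job>/instance/<instance>` path without sending it. The job name, instance name, and metric are made up, port 9091 is assumed to be the Pushgateway address, and label values are assumed not to contain slashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, instance, metrics):
    """Build a POST request that pushes `metrics` (name -> value) for one job/instance."""
    path = f"/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    # One "name value" line per metric; the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


req = build_push_request("http://127.0.0.1:9091", "demo_job", "master1",
                         {"demo_queue_length": 7})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # uncomment to actually send the push
```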

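1.7 Alertmanager

The Prometheus server pushes alerts to Alertmanager, and Alertmanager forwards them to third-party notification channels. Grafana can now send alert notifications as well.

Alerting in Prometheus has two parts: 1. alerting rules in Prometheus; 2. Prometheus sends alerts to Alertmanager, which manages the alerts themselves, silences, and aggregation, and sends notifications to the notification platform.

Note: alerts can be delivered through Alertmanager or through Grafana.

Setting up alerting and notifications involves these main steps:

Install and configure Alertmanager
Configure Prometheus to send alerts to Alertmanager
Create alerting rules in Prometheus

Alertmanager focuses on handling the alerts it receives: deduplicating them, grouping them, and routing them to a notification gateway.
Grouping: combines alerts of a similar nature into a single notification
Inhibition: suppresses certain alerts when certain other alerts are already firing. Example: once the "cluster A unavailable" alert is firing, the other firing alerts for cluster A no longer need to notify anyone, which prevents an alert storm. Configured in the Alertmanager configuration file
Silences: mute alerts for a given period. Configured in the web UI
Client behavior: Alertmanager has special requirements for the behavior of its clients. These only matter for advanced use cases where Prometheus is not the one sending alerts.
High availability: Alertmanager supports high availability through the --cluster.* flags. Prometheus must be configured with the list of Alertmanager instances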
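Grouping is the easiest of these ideas to sketch. The code below is a simplified illustration, not Alertmanager's implementation: alerts are plain label dictionaries, and the function name is made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alert label sets by the values of the `group_by` labels.

    A missing label is treated as an empty string.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "NodeDown", "instance": "master1:9100"},
    {"alertname": "NodeDown", "instance": "master2:9100"},
    {"alertname": "DiskFull", "instance": "master1:9100"},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Each group would then result in one notification instead of one per alert.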

1.7.1 Configuration

Alertmanager is configured through command-line flags and a configuration file.

Default configuration:

[root@master1 alertmanager-0.22.2.linux-amd64]# cat alertmanager.yml 
route: 
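  group_by: ['alertname']  # labels to group alerts by
  group_wait: 30s # how long to wait before sending the first notification for a new group, so that more alerts for the same group can arrive and be sent together
  group_interval: 5m # how long to wait before notifying about new alerts added to a group whose initial notification has already been sent (usually 5m)
  repeat_interval: 1h  # how long to wait before re-sending a notification that was already sent successfully; here, groups that have already been notified are repeated every 1h
  receiver: 'web.hook' # receiver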
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

global: global settings
route: the grouping labels applied to incoming alerts and the routing (dispatch) policy
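The inhibit_rules entry in the configuration above reads as: mute a warning alert when a critical alert is firing with the same values for the labels listed under equal. Below is a simplified sketch of that logic. It is not Alertmanager's code, it only handles exact-match rules, and the function names are made up.

```python
def matches(labels, matcher):
    """True when every label in `matcher` has exactly that value in `labels`."""
    return all(labels.get(name, "") == value for name, value in matcher.items())


def is_inhibited(target, firing, rule):
    """Return True when `target` is muted by some other firing alert under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "NodeDown", "instance": "master1:9100", "severity": "critical"}
warning_same = {"alertname": "NodeDown", "instance": "master1:9100", "severity": "warning"}
warning_other = {"alertname": "NodeDown", "instance": "master2:9100", "severity": "warning"}
firing = [critical, warning_same, warning_other]
print([is_inhibited(a, firing, rule) for a in firing])
```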

For more, see: https://prometheus.io/docs/alerting/latest/configuration/

1.7.2 Notification template

Adjusting the content of alerts involves two parts: 1. the alerting rules in Prometheus; 2. the receiver configuration in Alertmanager

Look deeper when needed: https://prometheus.io/docs/alerting/latest/notifications/

1.7.3 HTTP API

GET /-/healthy # health check
GET /-/ready # readiness check
POST /-/reload # reload the configuration
[root@master1 alertmanager-0.22.2.linux-amd64]# curl -X GET  http://192.168.56.101:9093/api/v1/status  | python -m json.tool |sed 's/\\n/\n\t/g'  # show the status, including the loaded configuration

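1.8 PromQL

PromQL is the Prometheus Query Language. It allows a wide range of operations, including aggregation, slicing and dicing, prediction, and joins.

1.8.1 Basics

1. Querying with promtool

promtool supports four kinds of query; the same queries can also be run in the Prometheus web console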
  query instant [<flags>] <server> <expr>
    Run instant query.

  query range [<flags>] <server> <expr>  # takes an expression
    Run range query.

  query series --match=MATCH [<flags>] <server>
    Run series query.

  query labels [<flags>] <server> <name>
    Run labels query.


# Usage 1: query instant
[root@master1 prometheus-2.24.0.linux-amd64]#  ./promtool query instant -o promql  http://127.0.0.1:9090 instance:node_cpu:avg_rate5m # output is the same without -o promql; "instance:node_cpu:avg_rate5m" is a custom recording rule
instance:node_cpu:avg_rate5m{instance="master1:9100", severity="page"} => 6.533333333403178 @[1623656408.629]
instance:node_cpu:avg_rate5m{instance="master2:9100", severity="page"} => 1.79999999973613 @[1623656408.629]
instance:node_cpu:avg_rate5m{instance="master3:9100", severity="page"} => 1.6666666666666714 @[1623656408.629]

[root@master1 prometheus-2.24.0.linux-amd64]#  ./promtool query instant -o json  http://127.0.0.1:9090 instance:node_cpu:avg_rate5m
[{"metric":{"__name__":"instance:node_cpu:avg_rate5m","instance":"master1:9100","severity":"page"},"value":[1623656418.572,"6.533333333403178"]},{"metric":{"__name__":"instance:node_cpu:avg_rate5m","instance":"master2:9100","severity":"page"},"value":[1623656418.572,"1.6666666666666572"]},{"metric":{"__name__":"instance:node_cpu:avg_rate5m","instance":"master3:9100","severity":"page"},"value":[1623656418.572,"1.5999999999379213"]}]
[root@master1 prometheus-2.24.0.linux-amd64]# 


# Usage 2: query range, returns the values over a period of time
[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query range --start=$(date -d '06/14/2021 16:00:00' +"%s") --end=$(date -d '06/14/2021 16:10:00' +"%s")   http://127.0.0.1:9090 '(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / (sum(increase(node_cpu_seconds_total[1m])) by (instance))))) * 100' |grep -i instance -A2
{instance="master1:9100"} =>
3.522818254603355 @[1623657600]
3.522818254603355 @[1623657602]
--
{instance="master2:9100"} =>
0.9292431706227622 @[1623657600]
0.9292431706227622 @[1623657602]
--
{instance="master3:9100"} =>
0.8505875769438287 @[1623657600]
0.8505875769438287 @[1623657602]


# Usage 3: query series
[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query series --match=instance:node_cpu:avg_rate5m  http://127.0.0.1:9090 
{__name__="instance:node_cpu:avg_rate5m", instance="master1:9100", severity="page"}
{__name__="instance:node_cpu:avg_rate5m", instance="master2:9100", severity="page"}
{__name__="instance:node_cpu:avg_rate5m", instance="master3:9100", severity="page"}

--start and --end can also be specified
[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query series --match="node_netstat_Tcp_CurrEstab{instance='master1:9100'}" http://127.0.0.1:9090
{__name__="node_netstat_Tcp_CurrEstab", instance="master1:9100", job="prometheus"}
Note: the match value here is the same selector you would enter in the Prometheus web console -> Graph, "node_netstat_Tcp_CurrEstab{instance='master1:9100'}". query series does not show the metric's value, but query instant does

[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query instant  http://127.0.0.1:9090 "node_netstat_Tcp_CurrEstab{instance='master1:9100'}"
node_netstat_Tcp_CurrEstab{instance="master1:9100", job="prometheus"} => 197 @[1623659608.478]



# Usage 4: query labels. Lists the values that exist for the given label name
[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query labels http://127.0.0.1:9090 job
prometheus
pushgateway
[root@master1 prometheus-2.24.0.linux-amd64]# ./promtool query labels http://127.0.0.1:9090 instance
instance1
localhost:9091
localhost:9092
master1:9091
master1:9100
master2:9100
master3:9100
pushgateway_master1
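The four promtool queries correspond to endpoints of the Prometheus HTTP API: `/api/v1/query`, `/api/v1/query_range`, `/api/v1/series`, and `/api/v1/label/<name>/values`. The sketch below only builds the request URLs; the helper names are made up, and the server address is the one used in the examples above.

```python
from urllib.parse import quote, urlencode


def instant_url(server, expr, time=None):
    """URL for an instant query (like `promtool query instant`)."""
    params = {"query": expr}
    if time is not None:
        params["time"] = time
    return f"{server}/api/v1/query?{urlencode(params)}"


def range_url(server, expr, start, end, step):
    """URL for a range query (like `promtool query range`)."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{server}/api/v1/query_range?{urlencode(params)}"


def series_url(server, matchers):
    """URL for a series query; one match[] parameter per selector."""
    return f"{server}/api/v1/series?" + urlencode([("match[]", m) for m in matchers])


def label_values_url(server, label):
    """URL listing the values of one label (like `promtool query labels`)."""
    return f"{server}/api/v1/label/{quote(label, safe='')}/values"


server = "http://127.0.0.1:9090"
print(instant_url(server, "instance:node_cpu:avg_rate5m"))
print(range_url(server, "up", 1623657600, 1623658200, "15s"))
print(series_url(server, ["up"]))
print(label_values_url(server, "job"))
```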

2. Expression language data types

In Prometheus's expression language, an expression or sub-expression can evaluate to one of four types:

Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
Range vector - a set of time series containing a range of data points over time for each time series
Scalar - a simple numeric floating point value
String - a simple string value; currently unused

3. Time series selectors

1) Instant vector selectors

node_netstat_Tcp_CurrEstab # simple query, no filtering

node_netstat_Tcp_CurrEstab{job=~".*pro.*"} # job matches the pattern

node_netstat_Tcp_CurrEstab{job=~".*pro.*",instance!~"master1.*"} # excludes master1:9100

Matching operators:

=: Select labels that are exactly equal to the provided string.
!=: Select labels that are not equal to the provided string.
=~: Select labels that regex-match the provided string.
!~: Select labels that do not regex-match the provided string.

http_requests_total{environment=~"staging|testing|development",method!="GET"}

A selector needs at least one matcher that does not match the empty string:
{job=~".*"} # Bad!
{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!
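The four matchers can be imitated in a few lines. This is a sketch of the semantics only: regex matches are fully anchored, a missing label behaves like an empty value, and Python's `re` stands in for the RE2 syntax that Prometheus uses, which is close enough for simple patterns. The function names are made up.

```python
import re


def label_matches(labels, name, op, pattern):
    """Apply one matcher (=, !=, =~, !~) to a label set."""
    value = labels.get(name, "")  # a missing label behaves like an empty one
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(series, matchers):
    """Keep the label sets that satisfy every (name, op, pattern) matcher."""
    return [s for s in series if all(label_matches(s, *m) for m in matchers)]


series = [
    {"job": "prometheus", "instance": "master1:9100"},
    {"job": "prometheus", "instance": "master2:9100"},
    {"job": "pushgateway", "instance": "master1:9091"},
]
print(select(series, [("job", "=~", ".*pro.*"), ("instance", "!~", "master1.*")]))
```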

2) Range vector selectors

Like an instant vector selector, but with a time range appended: http_requests_total{job="prometheus"}[5m]

3) Time durations

ms - milliseconds
s - seconds
m - minutes
h - hours
d - days - assuming a day has always 24h
w - weeks - assuming a week has always 7d
y - years - assuming a year has always 365d
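As a worked example of these units, the sketch below converts a duration string to milliseconds. It assumes the same fixed-length day, week, and year, and it also accepts concatenated units such as 1h30m. The function name is made up.

```python
import re

# Milliseconds per unit, with a day fixed at 24h, a week at 7d, a year at 365d.
UNIT_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "w": 604_800_000,
    "y": 31_536_000_000,
}
# "ms" is listed first so that "5ms" is not read as 5 minutes.
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration_ms(text):
    """Convert a duration such as '5m' or '1h30m' to milliseconds."""
    if not text:
        raise ValueError("empty duration")
    position, total = 0, 0
    while position < len(text):
        part = _PART.match(text, position)
        if part is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(part.group(1)) * UNIT_MS[part.group(2)]
        position = part.end()
    return total


print(parse_duration_ms("5m"), parse_duration_ms("1h30m"), parse_duration_ms("1w"))
```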

4) Offset modifier

The offset modifier changes the time offset of individual instant and range vectors in a query

http_requests_total offset 5m  # the value of http_requests_total 5 minutes ago
sum(http_requests_total{method="GET"} offset 5m) // GOOD.  # offset must follow the selector immediately
sum(http_requests_total{method="GET"}) offset 5m // INVALID.
rate(http_requests_total[5m] offset 1w)  # works for range vectors too
rate(http_requests_total[5m] offset -1w) # a negative offset looks forward in time, which is useful for comparisons

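4. Subqueries

rate(pushgateway_http_requests_total[5m])[30m:1m] # returns the 5-minute rate of pushgateway_http_requests_total over the past 30 minutes at 1-minute resolution: 30m/1m = 30 points for each series (one series per status code).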

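1.8.2 Operators

PromQL supports basic logical and arithmetic operators.

Binary operators:
- Arithmetic operators: + - * / % ^
- Comparison operators: == != > < >= <=
- Logical/set operators: and (intersection), or (union), unless (complement)
Aggregation operators (on instant vectors). The built-in aggregation operators are:
- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- group (all values in the resulting vector are 1)
- stddev (calculate the standard deviation over dimensions)
- stdvar (calculate the variance over dimensions)
- count (count the number of elements in the vector)
- count_values (count the number of elements with the same value)
- bottomk (smallest k elements by sample value)
- topk (largest k elements by sample value)
- quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

Usage:

Form 1: <aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
Form 2: <aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

Example 1: if the metric http_requests_total has the labels application, instance, and group, the total number of HTTP requests seen per application and group across all instances can be calculated with either of:
sum without (instance) (http_requests_total)
sum by (application, group) (http_requests_total)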
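The equivalence of the two forms in Example 1 can be checked with a small sketch. Samples are modeled as (labels, value) pairs without the metric name; the function names and sample values are made up, and only `sum` is implemented.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum values, keeping only the labels listed in `by`."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in sorted(by))
        totals[key] += value
    return dict(totals)


def sum_without(samples, without):
    """Sum values, dropping the labels listed in `without`."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((n, v) for n, v in labels.items() if n not in without))
        totals[key] += value
    return dict(totals)


samples = [
    ({"application": "shop", "group": "canary", "instance": "0"}, 10.0),
    ({"application": "shop", "group": "canary", "instance": "1"}, 5.0),
    ({"application": "shop", "group": "prod", "instance": "0"}, 7.0),
]
print(sum_by(samples, ["application", "group"]))
print(sum_without(samples, ["instance"]))
```

Both calls return the same result here because application, group, and instance are the only labels on the samples.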

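Operator precedence, from highest to lowest. Operators at the same level are left-associative, except ^, which is right-associative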
- ^
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
- and, unless
- or

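1.8.3 Functions

A few commonly used functions are described here; the others can be studied in depth when needed.

rate()

This function is meant to be used with counter data. Given a time window, it returns the counter's average per-second increase over that window.

rate(node_network_receive_packets_total{device="enp0s8"}[1m]) gives the average number of packets per second received on "enp0s8" over the last minute.

Note: whenever the data is a counter, wrap it in rate() or increase() before doing anything else. Only then does the data become meaningful.

As a rule of thumb, increase() suits coarse-grained data, while rate() suits fine-grained data (such as network I/O or disk I/O).

increase(node_network_receive_packets_total{device="enp0s8"}[1m]) conveys the same information as a total rather than a per-second value.

rate(...[1m]) is the average per-second increase over the window: the total increase in one minute / 60s

increase(...[1m]) is the total increase over the window: the total increase within one minute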
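The relationship between the two functions, and the handling of counter resets, can be sketched as follows. This is a simplification and not the Prometheus implementation: it does not extrapolate to the window boundaries, and it divides by the time between the first and last sample rather than by the window length. The sample data is made up.

```python
def increase(samples):
    """Total increase of a counter over `samples`.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so it was reset; count the new value from 0.
            total += current
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples with increasing timestamps")
    return increase(samples) / elapsed


# One sample every 15s; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(window), rate(window))
```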

sum()

sum(rate(node_network_receive_bytes_total[1m])) gives the per-second receive rate in bytes, averaged over the last minute and summed over all interfaces on all nodes.

sum(rate(node_network_receive_bytes_total[1m])) by (instance) breaks the result down by instance

With custom labels, for example by (cluster_name), you can separate different products such as db server/web server/middleware server/...

topk

Definition: returns the k series with the highest values

Gauge: topk(2,node_netstat_Tcp_CurrEstab) # the 2 instances with the most current TCP connections
Counter: topk(2,rate(node_network_receive_bytes_total[1m])) # the 2 series with the highest network receive rate over the last minute

Other functions: https://prometheus.io/docs/prometheus/latest/querying/functions/

1.8.4 HTTP API

See https://prometheus.io/docs/prometheus/latest/querying/api/ ; a few commonly used endpoints follow

1. Status API

[root@master1 rules]# curl -s  http://localhost:9090/api/v1/status/config | python -m json.tool |sed 's/\\n/\n\t/g'    # show the loaded configuration
[root@master1 rules]# curl -s  http://localhost:9090/api/v1/status/flags | python -m json.tool |sed 's/\\n/\n\t/g'   # show the flag values Prometheus is running with
[root@master1 rules]# curl -s  http://localhost:9090/api/v1/status/buildinfo | python -m json.tool |sed 's/\\n/\n\t/g'   # show Prometheus version information

2. TSDB snapshot API

[root@master1 rules]# curl -s  http://localhost:9090/api/v1/status/tsdb | python -m json.tool |sed 's/\\n/\n\t/g'   # show TSDB status; jq can also be used to format the JSON output

The TSDB admin API requires --web.enable-admin-api to be set (already enabled here)
[root@master1 prometheus-2.24.0.linux-amd64]#  curl -s  -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot  | python -m json.tool  # create a snapshot
{
    "data": {
        "name": "20210618T070354Z-380704bb7b4d7c03"
    },
    "status": "success"
}
[root@master1 prometheus-2.24.0.linux-amd64]# ls -l data/snapshots/20210618T070354Z-380704bb7b4d7c03/  # the snapshot is created under <data-dir>/snapshots/...

Deleting series. Parameters:
    match[]=<series_selector>: Repeated label matcher argument that selects the series to delete. At least one match[] argument must be provided.
    start=<rfc3339 | unix_timestamp>: Start timestamp. Optional and defaults to minimum possible time.
    end=<rfc3339 | unix_timestamp>: End timestamp. Optional and defaults to maximum possible time.

curl -X POST  -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
curl -X POST  -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=instance:node_mem:percent' # "instance:node_mem:percent" is a custom metric name


Deletion: tombstones
Data in a Prometheus block is normally not modified once written. To delete part of it, Prometheus can only record the range of data to delete and drop it the next time the compactor builds a new block. The file that records this information is the tombstones file.
CleanTombstones removes the deleted data from disk and cleans up the existing tombstones. This can be used after deleting series to free up space.
[root@master1 prometheus-2.24.0.linux-amd64]# curl -X POST  -g 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones' -I  # 204 means success
HTTP/1.1 204 No Content
Date: Fri, 18 Jun 2021 07:20:09 GMT

3. Targets/rules/alerts

1) Targets
[root@master1 ~]# alias comm="python -m json.tool |sed 's/\\n/\n\t/g'" # jq works as well
[root@master1 ~]# curl  -s   http://localhost:9090/api/v1/targets | comm  # overview of the current Prometheus targets
[root@master1 ~]# curl  -s   http://localhost:9090/api/v1/targets?state=dropped  |comm # filter by state=[dropped|active|any]
2) Rules
[root@master1 ~]# curl  -s   http://localhost:9090/api/v1/rules  |comm
Supported parameter: type=alert|record
3) Alerts
[root@master1 ~]# curl  -s   http://localhost:9090/api/v1/alerts  | jq
4) Alertmanagers
[root@master1 ~]# curl  -s   http://localhost:9090/api/v1/alertmanagers | jq
5) Target metadata
[root@master1 ~]# curl  -s    http://localhost:9090/api/v1/targets/metadata | jq
match_target=<label_selectors>: Label selectors that match targets by their label sets. All targets are selected if left empty.
metric=<string>: A metric name to retrieve metadata for. All metric metadata is retrieved if left empty.
limit=<number>: Maximum number of targets to match.
6) Metric metadata
[root@master1 ~]# curl  -s    http://localhost:9090/api/v1/metadata?limit=3  | jq
limit=<number>: Maximum number of metrics to return.
metric=<string>: A metric name to filter metadata for. All metric metadata is retrieved if left empty.

1.8.5 Management API

The reload and quit endpoints below require Prometheus to be started with --web.enable-lifecycle.

GET /-/healthy # health check
[root@master1 ~]# curl http://127.0.0.1:9090/-/healthy
Prometheus is Healthy.
[root@master1 ~]# curl http://127.0.0.1:9090/-/ready
Prometheus is Ready.
[root@master1 ~]# curl -X POST  http://127.0.0.1:9090/-/reload -I  # reload the configuration
HTTP/1.1 200 OK
Date: Sat, 19 Jun 2021 06:37:47 GMT
Content-Length: 0
[root@master1 ~]# curl -X POST  http://127.0.0.1:9090/-/quit  # graceful shutdown
