[Big Data Ops Monitoring] Some Built-in Prometheus Metrics

Author: 开发实习生

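Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.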

Article link: https://blog.csdn.net/qq_33356083/article/details/106714983

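When using Prometheus, we inevitably run into the metrics that Prometheus exposes about itself, and some problems can only be analyzed with the help of these metrics. This post is a brief collection of Prometheus's own metrics.

Runtime state

Prometheus is written in Go, so it naturally exposes some basic Go runtime metrics. The Go metrics most commonly seen in Prometheus are: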

go_goroutines
go_memstats_heap_alloc_bytes
go_memstats_heap_released_bytes

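These metrics are mainly useful when tracking down high memory or CPU usage in a running Prometheus. To be honest, I don't use them much, because when there really is a leak these metrics alone are not enough; you need to combine them with pprof.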
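As a quick way to look at these values outside of a dashboard, the following minimal sketch parses a piece of text in the Prometheus exposition format and pulls out the three Go metrics listed above. The sample text and its values are made up for illustration, and `parse_unlabelled` is a hypothetical helper that only handles samples without labels; it is not a full parser.

```python
# Hypothetical excerpt of a /metrics response; the values are invented.
SAMPLE_METRICS = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.048576e+08
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.097152e+07
"""

WANTED = (
    "go_goroutines",
    "go_memstats_heap_alloc_bytes",
    "go_memstats_heap_released_bytes",
)


def parse_unlabelled(text, wanted=WANTED):
    """Return {metric_name: value} for label-free samples named in `wanted`."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            values[parts[0]] = float(parts[1])
    return values


runtime = parse_unlabelled(SAMPLE_METRICS)
heap_mib = runtime["go_memstats_heap_alloc_bytes"] / (1024 * 1024)
print(f"goroutines={runtime['go_goroutines']:.0f} heap_alloc={heap_mib:.1f} MiB")
```

In practice the text would come from the Prometheus server's own /metrics endpoint, and a steadily growing goroutine count or heap size is the cue to reach for pprof.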

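Rule evaluation state

A mature monitoring system may define many different rules, but evaluating them costs system resources. So when something unexpected happens, the following metrics may help:

prometheus_rule_evaluation_duration_seconds: the time taken to evaluate rules (recording/alerting), exposed as quantiles. It is useful for judging whether rules are too complex and whether the system is busy.
prometheus_rule_evaluation_duration_seconds_count: the number of rule evaluations observed so far (the cumulative duration is in the matching _sum series). I have not used it much.
prometheus_rule_group_duration_seconds: the time taken by rule group evaluations (quantiles).
prometheus_rule_group_interval_seconds: the evaluation interval of each rule group, which should match the configuration. If a group's evaluation time approaches or exceeds this interval, the system is probably under heavy load.
prometheus_rule_group_iterations_missed_total: the number of rule group iterations that were skipped because evaluation was too slow, for example when the system is busy.
prometheus_rule_group_last_duration_seconds: the duration of each rule group's most recent evaluation.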
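One way to use the last two gauges together is to compare each group's most recent evaluation time with its interval. The sketch below does that in plain Python over two dictionaries keyed by rule group; the group names, the values, and the 0.8 threshold are all assumptions chosen for illustration. In PromQL, a comparable ratio would be `prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds`.

```python
def groups_at_risk(last_duration, interval, threshold=0.8):
    """Return (group, ratio) pairs for rule groups whose last evaluation
    used more than `threshold` of their interval, highest ratio first."""
    at_risk = []
    for group, duration in last_duration.items():
        configured = interval.get(group)
        if not configured:
            continue  # no interval known (or zero): nothing to compare against
        ratio = duration / configured
        if ratio > threshold:
            at_risk.append((group, ratio))
    return sorted(at_risk, key=lambda item: item[1], reverse=True)


# Hypothetical values, in seconds, as they might be read from the two gauges.
last_duration = {"node.rules": 3.0, "heavy.rules": 27.0, "slow.rules": 70.0}
interval = {"node.rules": 30.0, "heavy.rules": 30.0, "slow.rules": 60.0}

risky = groups_at_risk(last_duration, interval)
for group, ratio in risky:
    print(f"{group}: last evaluation used {ratio:.0%} of its interval")
```

A group whose ratio exceeds 1 cannot finish before its next scheduled run, which is the situation where missed iterations would be expected to grow.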

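Scrape state

This is a fairly important category. For example, I often care about whether Prometheus is scraping an exporter correctly: did it really collect data, or did the exporter time out or fail in some other way? Metrics in this category are how I pin that down.

There are really only 5 metrics in this category:

up: whether the scrape target is healthy. 0 means it is not reachable (the scrape failed), 1 means it is fine.
scrape_duration_seconds: how long it took to scrape this target. This can be used to track down timeouts.
scrape_samples_post_metric_relabeling: the number of samples remaining after metric relabeling has been applied. For more on relabeling, see my other article.
scrape_samples_scraped: the number of samples the scrape target exposed.
scrape_series_added: a metric added in 2.10, giving the number of new series this scrape target added.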
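The checks described above can be sketched as a small function that takes one target's values for these metrics and returns human-readable findings. `diagnose_target` is a hypothetical helper, not part of Prometheus; the default `timeout` of 10 seconds assumes the default scrape_timeout, and the 90% margin is an arbitrary choice.

```python
def diagnose_target(up, duration, scraped, post_relabel, timeout=10.0):
    """Return a list of findings for a single scrape target.

    up           -- value of `up` (0 or 1)
    duration     -- value of `scrape_duration_seconds`
    scraped      -- value of `scrape_samples_scraped`
    post_relabel -- value of `scrape_samples_post_metric_relabeling`
    timeout      -- configured scrape timeout in seconds (assumed 10s here)
    """
    findings = []
    if up == 0:
        findings.append("scrape failed (up == 0)")
    if duration >= 0.9 * timeout:
        findings.append("scrape duration is close to the timeout")
    dropped = scraped - post_relabel
    if dropped > 0:
        findings.append(f"{dropped} samples dropped by metric relabeling")
    if up == 1 and scraped == 0:
        findings.append("target is up but exposed no samples")
    return findings


# Hypothetical readings for three targets.
print(diagnose_target(up=1, duration=0.2, scraped=500, post_relabel=500))
print(diagnose_target(up=0, duration=10.0, scraped=0, post_relabel=0))
print(diagnose_target(up=1, duration=0.5, scraped=800, post_relabel=650))
```

The difference between scrape_samples_scraped and scrape_samples_post_metric_relabeling is a quick way to see how much a relabeling rule actually drops.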

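Storage state

Prometheus uses its own storage engine and caches data in memory. So when query results look wrong or memory usage is high, it is worth finally suspecting the storage layer. Here are some metrics that may be valuable:

prometheus_tsdb_blocks_loaded: the number of blocks currently loaded
prometheus_tsdb_compactions_triggered_total: the number of times compaction was triggered (this can be large, but not every trigger results in a compaction)
prometheus_tsdb_compactions_total: the number of compactions since startup (by default, one every 2 hours)
prometheus_tsdb_compactions_failed_total: the number of failed compactions
prometheus_tsdb_head_chunks: the number of chunks held in the head
prometheus_tsdb_head_chunks_created_total: the number of chunks created in the head
prometheus_tsdb_head_chunks_removed_total: the number of chunks removed from the head
prometheus_tsdb_head_gc_duration_seconds: the duration of head GC (quantiles)
prometheus_tsdb_head_max_time: the maximum timestamp of valid data in the head (this one is quite valuable)
prometheus_tsdb_head_min_time: the minimum timestamp of valid data in the head (this one is quite valuable)
prometheus_tsdb_head_samples_appended_total: the total number of samples appended to the head (useful for watching the growth rate)
prometheus_tsdb_head_series: the number of series held in the head
prometheus_tsdb_reloads_total: the number of times the tsdb was reloaded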
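Two of the more useful derived numbers here are how much time the head currently covers and how fast samples are being appended. The following is a minimal sketch with hypothetical readings; it assumes the head min/max time gauges are millisecond timestamps, and it treats a decreasing counter as a restart in a simplified way. In PromQL the append rate would normally be obtained with something like `rate(prometheus_tsdb_head_samples_appended_total[5m])`.

```python
def head_span_hours(min_time_ms, max_time_ms):
    """Hours of data covered by the head, given millisecond timestamps."""
    return (max_time_ms - min_time_ms) / 1000 / 3600


def ingestion_rate(appended_before, appended_after, seconds_elapsed):
    """Samples appended per second between two readings of the counter.

    If the counter went down, assume Prometheus restarted and count the
    second reading as the whole increase (a simplification).
    """
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    if appended_after >= appended_before:
        increase = appended_after - appended_before
    else:
        increase = appended_after  # counter reset
    return increase / seconds_elapsed


# Hypothetical readings.
span = head_span_hours(1_700_000_000_000, 1_700_009_000_000)
rate = ingestion_rate(1_000_000, 1_600_000, 60)
print(f"head covers {span:.1f} h, ingesting {rate:.0f} samples/s")
```

A head span that keeps growing well past the usual compaction period, or an append rate that jumps suddenly, are both reasons to look more closely at compaction and series counts.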
