Prometheus and node_exporter Concepts

颗粒CloudCoder

Published 2024-08-15 18:24:15

Tags: prometheus, operations, cloud native

Article link: https://blog.csdn.net/m0_60125201/article/details/141228974

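1 Project objectives

(1) Understand Prometheus parameters

(2) Understand node_exporter

2 Project preparation

2.1 Node planning

Hostname	Host IP	Node role
prome-master01	10.0.1.10	Server
prome-node01	10.0.1.20	Client

3 Project implementation

3.1 Introduction to Prometheus

3.1.1 Basic Prometheus concepts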

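Sample (data point):

In Prometheus, a Sample is a single data point in a time series. It consists of a timestamp and a value.
The Sample struct is defined as follows:

type Sample struct {
    t int64 // timestamp, in milliseconds
    v float64 // floating-point value
}

Each Sample occupies 16 bytes: 8 bytes for the timestamp and 8 bytes for the value.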
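The 16-byte figure can be checked with a small sketch that mimics the struct layout using Python's struct module. The format string "<qd" (a little-endian int64 followed by a float64) is an assumption chosen to mirror the Go struct above. It is not how Prometheus stores samples on disk, where they are compressed.

```python
import struct

# Assumed layout mirroring the Go struct: int64 timestamp (ms) + float64 value.
SAMPLE_FORMAT = "<qd"

def pack_sample(t_ms, v):
    """Pack one sample into a raw 16-byte representation."""
    return struct.pack(SAMPLE_FORMAT, t_ms, v)

raw = pack_sample(1723717455000, 0.5)
print(len(raw))  # size in bytes of one uncompressed sample
t_ms, v = struct.unpack(SAMPLE_FORMAT, raw)
print(t_ms, v)
```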

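Label:

A Label is a key-value pair that Prometheus uses to identify a time series. Name is the label's name and Value is its value.
The Label struct is defined as follows:

type Label struct {
    Name, Value string
}

For example, cpu="0" and mode="user" are both labels; they denote the CPU number and the CPU mode respectively.

Labels (label set):

Labels is a slice of Label that holds all the labels of one time series.
The type is defined as follows:

type Labels []Label

Label sets give Prometheus a multi-dimensional data model, so users can query and aggregate data along different dimensions.

Metric:

A metric is the basic monitoring unit in Prometheus. A time series is uniquely identified by its metric name together with a set of labels.
For example, http_requests_total is a metric name that can be combined with different labels, as in http_requests_total{method="GET", handler="/messages"}, which is the total number of HTTP GET requests handled by the /messages handler.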
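To make "name plus labels identifies a series" concrete, here is a minimal sketch that builds a canonical key for a series. The function name series_key is hypothetical, and sorting labels by name is an illustrative choice so that label order does not matter. It is not a copy of Prometheus's internal hashing.

```python
def series_key(name, labels):
    """Build a canonical identity string from a metric name and its labels."""
    # Sort by label name so that label order does not affect identity.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"

a = series_key("http_requests_total", {"method": "GET", "handler": "/messages"})
b = series_key("http_requests_total", {"handler": "/messages", "method": "GET"})
c = series_key("http_requests_total", {"handler": "/messages", "method": "POST"})
print(a)
print(a == b, a == c)
```

Two label sets that differ only in order map to the same key, while changing any label value yields a different series.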

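3.1.2 Four types of Prometheus queries

Instant queries:

Query the current value of a metric:

http_requests_total

Query the metric's value for a specific label:

http_requests_total{method="GET"}

Range queries:

Query the per-second average rate of increase of a metric over the past 5 minutes:

rate(http_requests_total{method="GET"}[5m])

Query the highest CPU usage rate across containers, where each container's rate is averaged over the past hour:

max(rate(container_cpu_usage_seconds_total{container_name!=""}[1h]))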
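To show roughly what rate() computes over a range, the sketch below derives a per-second increase from a window of counter samples. It is deliberately simplified: it handles counter resets, but it omits the extrapolation to the window boundaries that Prometheus's real rate() performs. The function name simple_rate and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase over a window of (timestamp_seconds, value) samples.

    Simplified: handles counter resets, but skips the boundary
    extrapolation that Prometheus's rate() performs.
    """
    if len(samples) < 2:
        return None  # at least two points are needed to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count from zero again.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed

# Counter goes 100 -> 160, resets and reaches 10, then climbs to 70.
window = [(0, 100.0), (60, 160.0), (120, 10.0), (180, 70.0)]
print(simple_rate(window))
```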

Aggregate queries:

Sum the HTTP request totals across all instances:

sum(http_requests_total)

Average the HTTP request totals across all instances:

avg(http_requests_total)
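As a rough illustration of what these aggregations do, the sketch below collapses one value per instance into a single number. The instance addresses and values are invented for the example. PromQL's operators for this are sum and avg.

```python
from statistics import fmean

# Hypothetical instant-vector result: one value per instance.
http_requests_total = {
    "10.0.1.10:8080": 1200.0,
    "10.0.1.20:8080": 800.0,
}

total = sum(http_requests_total.values())      # like sum(http_requests_total)
average = fmean(http_requests_total.values())  # like avg(http_requests_total)
print(total, average)
```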

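Recording rules:

Suppose you have the following recording rule, which precomputes the per-second request rate averaged over 5 minutes:

groups:
- name: example_rules
  rules:
  - record: job:request_per_second:mean5m
    expr: rate(http_requests_total[5m])

You can then query the derived metric:

job:request_per_second:mean5m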

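3.1.3 Four label matching operators

1. = (equal)

Query: user-mode data for the first CPU core: node_cpu_seconds_total{mode="user",cpu="0"}

2. != (not equal)

Query: bytes received on network interfaces other than lo: node_network_receive_bytes_total{device!="lo"}

3. =~ (regex match)

Query: available bytes on filesystems whose mount point starts with /run: node_filesystem_avail_bytes{mountpoint=~"^/run.*"}

4. !~ (regex non-match)

Query: bytes read from block devices whose name does not contain vda: node_disk_read_bytes_total{device!~".*vda.*"}

Regex matchers are fully anchored, so the pattern must match the entire label value.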
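The four matchers can be sketched as a small function. Prometheus regex matchers are fully anchored, which is why this sketch uses re.fullmatch rather than re.search. Note that Prometheus uses RE2 syntax, so Python's re module is only an approximation that agrees on simple patterns like these. The function name matches is hypothetical.

```python
import re

def matches(op, pattern, value):
    """Evaluate one label matcher against a label value."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op not in ("=~", "!~"):
        raise ValueError(f"unknown matcher operator: {op}")
    # Regex matchers are fully anchored, hence fullmatch rather than search.
    hit = re.fullmatch(pattern, value) is not None
    return hit if op == "=~" else not hit

print(matches("=", "user", "user"))            # exact equality
print(matches("!=", "lo", "eth0"))             # anything but lo
print(matches("=~", "/run.*", "/run/user/0"))  # mount point starts with /run
print(matches("!~", ".*vda.*", "vda1"))        # name contains vda, so excluded
print(matches("=~", "vda", "vda1"))            # anchored: "vda" alone does not match "vda1"
```

The last call shows why a "contains" match needs .* on both sides of the literal.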

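3.1.4 Four metric types

Prometheus supports four main metric types for storing and representing monitoring data:

Counter:

- A counter is a cumulative value that only increases; it records how many events have occurred since the process started or the counter was last reset.
- A counter's value can only go up, or reset to zero when the process restarts. For a value that can also go down, use a gauge instead.
- Example: http_requests_total, the total number of HTTP requests received.

Gauge:

- A gauge is a value that can go up and down arbitrarily; it represents a quantity at a given moment, such as memory usage or disk space.
- Because it can both increase and decrease, it suits resource usage.
- Example: node_memory_MemTotal_bytes, the node's total memory in bytes.

Histogram:

- A histogram records the distribution of observations.
- It is configured with multiple buckets; each bucket counts the observations less than or equal to its upper bound, so bucket counts are cumulative.
- A histogram also exposes _sum and _count series: _sum is the total of all observed values and _count is the number of observations.
- Example: http_request_duration_seconds, the distribution of HTTP request handling times.

Summary:

- A summary also records the distribution of observations, but instead of buckets it computes quantiles on the client side.
- A summary has no buckets; the quantiles to track are configured in the client, and it also exposes _sum and _count series.
- Summaries are commonly used for response times: the average is _sum divided by _count, and the quantile series give values such as the 95th-percentile latency.
- Example: rpc_durations_seconds, the distribution of RPC call durations.

Each metric type has its own use cases, and choosing the right one helps you collect and analyze monitoring data more effectively. For example, a counter suits a request total, while histograms and summaries suit analyzing the distribution of request handling times.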
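The histogram type is the least obvious of the four, so here is a minimal sketch of how observations turn into cumulative bucket counts plus a sum and a count. The class name SimpleHistogram and the bucket bounds are hypothetical. Real client libraries expose the same information as _bucket series with an le label, plus _sum and _count.

```python
import bisect

class SimpleHistogram:
    """Histogram sketch that reports cumulative bucket counts."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [float("inf")]  # +Inf bucket
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper bound, cumulative count) pairs."""
        out, running = [], 0
        for bound, n in zip(self.bounds, self.counts):
            running += n
            out.append((bound, running))
        return out

h = SimpleHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 0.7, 2.0):
    h.observe(seconds)
print(h.cumulative())
print(h.sum, h.count)
```

Dividing the sum by the count gives the average observation, which is how an average latency is usually derived from a histogram or a summary.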

3.2 Introduction to node_exporter

3.2.1 node_exporter allowlist and denylist

Denylist: disable a collector that is enabled by default

--no-collector.<name>

# Before disabling
[root@prome_master_01 node_exporter]# curl -s localhost:9100/metrics| grep node_cpu |head -10
# HELP node_cpu_guest_seconds_total Seconds the CPUs spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="2",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="2",mode="user"} 0
node_cpu_guest_seconds_total{cpu="3",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="3",mode="user"} 0

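# Disable the cpu collector
./node_exporter --no-collector.cpu
curl -s localhost:9100/metrics | grep node_cpu

Allowlist: disable all default collectors and enable only the ones you name

--collector.disable-defaults --collector.<name>

# Enable only the meminfo collector
./node_exporter --collector.disable-defaults --collector.meminfo

# Enable only the meminfo and cpu collectors
./node_exporter --collector.disable-defaults --collector.meminfo --collector.cpu

Reasons to disable a collector

Too heavy
Too slow
Too much resource overhead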
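When the collector list is managed by a script, the two flag patterns can be generated rather than typed by hand. This is a small sketch: the helper names allowlist_args and denylist_args are hypothetical, and the only flags used are the ones shown in this section.

```python
def allowlist_args(collectors):
    """Build node_exporter flags that enable only the given collectors."""
    args = ["--collector.disable-defaults"]
    args += [f"--collector.{name}" for name in collectors]
    return args

def denylist_args(collectors):
    """Build node_exporter flags that disable the given default collectors."""
    return [f"--no-collector.{name}" for name in collectors]

allow_cmd = " ".join(["./node_exporter"] + allowlist_args(["meminfo", "cpu"]))
deny_cmd = " ".join(["./node_exporter"] + denylist_args(["cpu"]))
print(allow_cmd)
print(deny_cmd)
```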

3.2.2 Disabling Go SDK metrics

Use --web.disable-exporter-metrics to exclude metrics about the exporter itself.
promhttp_ metrics describe HTTP requests to /metrics:

[root@prome-node-01 node_exporter]# curl  -s  localhost:9100/metrics |grep promhttp_
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 86
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

go_ metrics describe the Go runtime:

[root@prome-node-01 node_exporter]# curl  -s  localhost:9100/metrics |grep go_
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.5308e-05
go_gc_duration_seconds{quantile="0.25"} 5.9731e-05
go_gc_duration_seconds{quantile="0.5"} 6.8292e-05
go_gc_duration_seconds{quantile="0.75"} 0.000100601
go_gc_duration_seconds{quantile="1"} 0.000458513
go_gc_duration_seconds_sum 0.008102807
go_gc_duration_seconds_count 83

process_ metrics describe the exporter process:

[root@prome-node-01 node_exporter]# curl  -s  localhost:9100/metrics |grep process_
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 3.62
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 4096
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 10
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5601664e+07
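To see which lines of a scrape belong to the exporter's own metrics, the three prefixes above can be checked line by line. This is only an illustrative classifier with a hypothetical name, is_exporter_self_metric. It does not reproduce how node_exporter implements the flag.

```python
SELF_METRIC_PREFIXES = ("promhttp_", "go_", "process_")

def is_exporter_self_metric(line):
    """True if an exposition line belongs to the exporter's own metrics."""
    parts = line.split()
    if not parts:
        return False
    # Comment lines look like "# HELP <name> ..." or "# TYPE <name> ...".
    if parts[0] == "#":
        if len(parts) < 3:
            return False
        name = parts[2]
    else:
        name = parts[0]
    return name.startswith(SELF_METRIC_PREFIXES)

lines = [
    "# TYPE go_gc_duration_seconds summary",
    "process_open_fds 10",
    'promhttp_metric_handler_requests_total{code="200"} 86',
    'node_cpu_guest_seconds_total{cpu="0",mode="user"} 0',
]
kept = [line for line in lines if not is_exporter_self_metric(line)]
print(kept)
```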

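3.2.3 Reporting custom metrics from the node

--collector.textfile.directory="" sets the local directory that the textfile collector reads.
Create a .prom file in that directory, written in the Prometheus text exposition format:

# Create the directory
mkdir ./text_file_dir
# Prepare the .prom file
cat <<EOF > ./text_file_dir/test.prom
# HELP nyy_test_metric just test
# TYPE nyy_test_metric gauge
nyy_test_metric{method="post",code="200"} 1027
EOF

# Start the service
./node_exporter --collector.textfile.directory=./text_file_dir

# Check the data with curl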
[root@prome_master_01 tgzs]# curl  -s  localhost:9100/metrics |grep nyy
# HELP nyy_test_metric just test
# TYPE nyy_test_metric gauge
nyy_test_metric{code="200",method="post"} 1027
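If the .prom file is regenerated by a script, for example from cron, it is safer to write to a temporary file and rename it so the collector never reads a half-written file. The sketch below assumes a single gauge. The function name write_prom_file is hypothetical, and the chmod to 0644 is an assumption for the case where node_exporter runs as a different user than the script.

```python
import os
import tempfile

def write_prom_file(directory, filename, name, help_text, labels, value):
    """Write a single gauge to a .prom file, replacing it atomically."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} {value}\n"
    )
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.chmod(tmp_path, 0o644)  # assumption: the exporter may run as another user
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path

# A temp directory keeps the demo self-contained; in practice this would be
# the directory passed to --collector.textfile.directory.
demo_dir = tempfile.mkdtemp()
path = write_prom_file(demo_dir, "test.prom", "nyy_test_metric",
                       "just test", {"method": "post", "code": "200"}, 1027)
with open(path) as f:
    content = f.read()
print(content)
```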

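3.2.4 Filtering metrics by collector with HTTP parameters

How it works: the collect[] HTTP request parameter selects which collectors are used for that request

HTTP access

# Only metrics from the cpu collector
http://192.168.0.112:9100/metrics?collect[]=cpu

# Only metrics from the cpu and meminfo collectors
http://192.168.0.112:9100/metrics?collect[]=cpu&collect[]=meminfo

Prometheus configuration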

params:
    collect[]:
      - cpu
      - meminfo
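The same filtered URL can be built in a script. This sketch uses only the Python standard library. The helper name metrics_url is hypothetical, and the host is the example address from this section. Note that urlencode percent-encodes the square brackets, which is equivalent to the literal collect[] form.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def metrics_url(host, collectors, port=9100):
    """Build a /metrics URL restricted to the given collectors."""
    # A list of pairs lets the same key repeat once per collector.
    query = urlencode([("collect[]", name) for name in collectors])
    return f"http://{host}:{port}/metrics?{query}"

url = metrics_url("192.168.0.112", ["cpu", "meminfo"])
print(url)
# Decoding the query string recovers the repeated collect[] parameter.
print(parse_qs(urlsplit(url).query))
```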
