Collecting GPU Metrics with node_exporter

Author: 一直学下去

Categories: Prometheus, K8S. Tags: prometheus, gpu, node_exporter

Article link: https://blog.csdn.net/lwlfox/article/details/113250805

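Introduction

node_exporter is the open-source component commonly used alongside Prometheus to collect operating-system-level metrics from hosts, but the official release does not collect GPU metrics. Because our company needed to monitor GPU usage on its GPU servers, I added GPU collection on top of the official version.

Git repository: https://gitee.com/kevinliu_CQ/node_exporter.git

Implementation overview

node_exporter's custom collectors live in the collector directory, so this project adds three files there, gpu_common.go, gpu.go and gpu_linux.go, to collect GPU metrics. Collection is built on NVML, NVIDIA's official low-level management library, so it should work with essentially every NVIDIA GPU series. The series I have tested are Tesla P4, Tesla T4, 2080Ti and 3080Ti.

Supported metrics: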

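gpuDriverVersion //GPU driver version
total //total GPU memory, in MiB
used //used GPU memory, in MiB
free //free GPU memory, in MiB
utilization //GPU utilization, in %
temp //GPU temperature, in °C
memUtilization //GPU memory utilization
maxClock //maximum clock frequency
fanSpeed //fan speed, in %
computeRunningProcesses //number of running compute processes
graphicsRunningProcesses //number of running graphics processes
maxPcieLinkWidth //maximum PCIe link width
pcieThroughput //PCIe throughput
performanceState //performance state
powerManagementDefLimit //default power management limit
powerManagementLimit //power management limit
powerState //power state
powerUsage //power usage
temperatureThreshold //GPU temperature threshold for throttling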
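As a rough illustration of how these values can be combined, the sketch below derives per-GPU memory usage from used and total. It assumes the series are exposed as node_gpu_used and node_gpu_total with an id label, by analogy with the node_gpu_free series this exporter emits; the function name and the sample numbers are made up, so check the real names in your own /metrics output first.

```shell
#!/bin/sh
# Print per-GPU memory usage (%) from Prometheus text-format input on stdin.
# ASSUMPTION: series are named node_gpu_used / node_gpu_total and carry id="N".
gpu_mem_used_pct() {
    awk '
    /^node_gpu_(used|total)[{]/ {
        # Extract the value of the id label; skip lines without one.
        if (match($0, /[{,]id="[^"]*"/) == 0) next
        id = substr($0, RSTART + 5, RLENGTH - 6)
        if ($0 ~ /^node_gpu_used/) used[id] = $NF + 0
        else total[id] = $NF + 0
    }
    END {
        for (id in total)
            if (total[id] > 0)
                printf "gpu %s: %.1f%% memory used\n", id, 100 * used[id] / total[id]
    }' | sort
}

# Demo with made-up sample values (not real measurements).
gpu_mem_used_pct <<'EOF'
node_gpu_total{hostname="gpuserver-01",id="0",type="Tesla P4",uuid="GPU-example-0"} 7680
node_gpu_used{hostname="gpuserver-01",id="0",type="Tesla P4",uuid="GPU-example-0"} 1920
EOF
# prints: gpu 0: 25.0% memory used
```

Against a running exporter the same function could be fed with curl -s 127.0.0.1:19200/metrics | gpu_mem_used_pct.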

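Deployment

If the metrics listed above already cover what you need, you can use the exporter as is.

1. Download my prebuilt binary and run it directly. (Only x86 is supported for now; ARM is not.)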

wget -O node_exporter_x86_64.zip https://gitee.com/kevinliu_CQ/node_exporter/attach_files/600618/download/node_exporter_x86_64.zip
unzip node_exporter_x86_64.zip
mv node_exporter_x86_64  node_exporter
chmod +x node_exporter
nohup ./node_exporter --web.listen-address=":19200" &

2. Check the collected GPU metrics

curl 127.0.0.1:19200/metrics|grep -i gpu

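This server has 8 GPUs. Taking free GPU memory as an example, the command above produces the output below. Every metric carries the GPU server's hostname, the GPU's index, the GPU model and the GPU's UUID as labels.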

# HELP node_gpu_free Framebuffer memory free (in MiB).
# TYPE node_gpu_free gauge
node_gpu_free{hostname="gpuserver-01",id="0",type="Tesla P4",uuid="GPU-672f3395-da98-6436-2940-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="1",type="Tesla P4",uuid="GPU-8ac8c01f-5679-881a-fee2-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="2",type="Tesla P4",uuid="GPU-21cf8879-c6ed-e5a8-8ea0-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="3",type="Tesla P4",uuid="GPU-5e194463-aeba-6054-fed4-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="4",type="Tesla P4",uuid="GPU-71a4605f-43de-03e2-758a-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="5",type="Tesla P4",uuid="GPU-1788f4f6-6b35-3762-ac57-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="6",type="Tesla P4",uuid="GPU-eae87306-9cc7-0a19-541f-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="7",type="Tesla P4",uuid="GPU-54e484c6-68b1-52bd-e414-****"} 7611
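Output in this format is easy to post-process on the command line. The following is a minimal sketch that totals node_gpu_free across all GPUs on a host. The function name total_gpu_free_mib is hypothetical, and the script assumes each sample line ends with the value (no trailing timestamp), as in the output above.

```shell
#!/bin/sh
# Sum node_gpu_free (MiB) over all GPUs in Prometheus text-format input on stdin.
# ASSUMPTION: the last whitespace-separated field of each sample line is the value.
total_gpu_free_mib() {
    awk '
    /^node_gpu_free[{]/ { sum += $NF; n++ }   # skips the "# HELP" and "# TYPE" lines
    END { printf "%d GPUs, %d MiB free\n", n, sum }'
}

# Demo on two lines taken from the sample output.
total_gpu_free_mib <<'EOF'
# TYPE node_gpu_free gauge
node_gpu_free{hostname="gpuserver-01",id="0",type="Tesla P4",uuid="GPU-672f3395-da98-6436-2940-****"} 7611
node_gpu_free{hostname="gpuserver-01",id="1",type="Tesla P4",uuid="GPU-8ac8c01f-5679-881a-fee2-****"} 7611
EOF
# prints: 2 GPUs, 15222 MiB free
```

On the live server this would be run as curl -s 127.0.0.1:19200/metrics | total_gpu_free_mib.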

3. Once Prometheus scrapes the exporter, the same metrics appear there: [screenshot of the Prometheus UI]
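The post does not show the Prometheus side of the setup, so here is a minimal sketch of a scrape configuration for the exporter started above. The job name gpu_node, the target gpuserver-01:19200, the 30s interval and the function name are all placeholders chosen for illustration, not values from the original setup.

```shell
#!/bin/sh
# Print a minimal Prometheus scrape config for the GPU-enabled node_exporter.
# ASSUMPTION: job name, interval and target are placeholders; adjust to your hosts.
gpu_scrape_config() {
    cat <<'EOF'
scrape_configs:
  - job_name: "gpu_node"
    scrape_interval: 30s
    static_configs:
      - targets: ["gpuserver-01:19200"]
EOF
}

# Show the config; redirect to a file to use it with Prometheus.
gpu_scrape_config
```

Redirecting the output (gpu_scrape_config > prometheus-gpu.yml) gives a file that can be validated with promtool check config prometheus-gpu.yml and loaded with prometheus --config.file=prometheus-gpu.yml, or merged into an existing scrape_configs section.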

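Development

The following is only an outline and assumes some familiarity with Go.

1. Import the project into your IDE; I use GoLand. (How to use GoLand, including proxy settings, is not covered here.)

2. This version adds collection of NVIDIA GPU information, so nvml.h must be copied into /usr/local/cuda/include before compiling.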

mkdir -p /usr/local/cuda/include
cp -p nvml.h /usr/local/cuda/include

3. Open collector/gpu_linux.go and collector/gpu_nvml.go, and copy the metrics you need from gpu_nvml.go into gpu_linux.go.

4. Add the new metrics in collector/gpu.go.

5. Build the binary

go build

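6. A node_exporter binary is generated in the project directory; run it directly.

PS: If the steps above are too difficult for you, leave a comment telling me which metrics you need (choose from the gpu_nvml.go file), and I will compile a build for you.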
