1. GitHub repository: https://github.com/NVIDIA/dcgm-exporter
2. Deploy dcgm-exporter with Docker to monitor GPU status. On Kubernetes (k8s), it can be deployed directly with the GPU Operator.
docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.1-ubuntu22.04
curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
…
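The DCGM_FI_DEV_MEMORY_TEMP value in the output above is clearly not a temperature. A likely explanation, stated here as an assumption to check against DCGM's dcgm_fields.h, is that DCGM reserves int64 values from 0x7ffffffffffffff0 upward as "blank" markers for fields a GPU cannot report. The sketch below parses a few of the sample lines and flags such values; the helper names parse_metrics and is_blank are made up for illustration and are not part of dcgm-exporter.

```python
import re

# Assumption: DCGM treats int64 values at or above this threshold as
# "blank" markers (no data, not found, not supported, not permissioned).
DCGM_INT64_BLANK = 0x7FFFFFFFFFFFFFF0

# Sample lines copied from the curl output shown above.
SAMPLE = '''\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, raw_value) tuples for each labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not name{labels} value
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), value))
    return samples


def is_blank(raw_value):
    """True if the value is an integer inside the assumed DCGM blank range."""
    try:
        return int(raw_value) >= DCGM_INT64_BLANK
    except ValueError:
        return False  # non-integer values are treated as real readings


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        status = "no valid reading" if is_blank(value) else value
        print(f"{name} gpu={labels.get('gpu')}: {status}")
```

If the assumption holds, such series are better filtered out in dashboards and alerts than plotted as real temperatures.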
3. Once the deployment is complete, update the existing Prometheus configuration file.
Edit /opt/prometheus/prometheus.yml:
  - job_name: node-gpu
    static_configs:
      - targets:
          - 'ip1:9100'
          - 'ip2:9400'
        labels:
          instance: dcgm-exporter
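job_name
Names this scrape job. The name can be used in the Prometheus query UI to tell different jobs apart.
static_configs
Its targets list specifies the concrete scrape targets. Replace the placeholder addresses with the real IP addresses and ports, usually those of the servers running the monitoring agents (9400 is the port dcgm-exporter was published on in step 2).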
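After Prometheus has reloaded the edited file, one way to confirm that the node-gpu targets are actually being scraped is to ask Prometheus's HTTP API for its target list. The following is a minimal sketch, assuming Prometheus listens on http://localhost:9090 (adjust to your setup); the function names fetch_targets and summarize_job are hypothetical helpers, not part of Prometheus.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Call Prometheus's /api/v1/targets endpoint and return the decoded JSON.

    The base URL is an assumption; point it at your own Prometheus server.
    """
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_job(payload, job="node-gpu"):
    """Return (scrapeUrl, health, lastError) for each active target of a job."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("labels", {}).get("job") == job:
            rows.append((
                target.get("scrapeUrl"),
                target.get("health"),
                target.get("lastError", ""),
            ))
    return rows


# Example use against a live server (not executed here):
#   for url, health, err in summarize_job(fetch_targets()):
#       print(f"{url}: {health} {err}")
```

A target whose health is "up" is being scraped successfully; for any other state, lastError usually indicates why, for example a wrong IP address or a port that is not reachable from the Prometheus host.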
4. Connect Prometheus to Grafana. For details, see https://blog.csdn.net/qq_41449217/article/details/147830745?spm=1001.2014.3001.5501