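Overview
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It can be integrated into a Prometheus-based monitoring setup.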
Deployment
Download the deb package from https://developer.nvidia.com/dcgm (registration required)
sudo dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb
sudo systemctl enable dcgm.service
sudo systemctl start dcgm.service
Download the dcgm toolkit archive from https://d.pr/free/f/qcUmPG
tar zxvf dcgm.tar.gz
cd dcgm
sudo cp dcgm-exporter /usr/local/bin/
sudo cp node_exporter /usr/local/bin/
sudo mkdir -p /run/prometheus
sudo cp prometheus-dcgm.service /etc/systemd/system/
sudo cp prometheus-node-exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable prometheus-dcgm.service
sudo systemctl enable prometheus-node-exporter.service
sudo systemctl start prometheus-dcgm.service
sudo systemctl start prometheus-node-exporter.service
Confirm that all related services have started
systemctl status dcgm.service
systemctl status prometheus-dcgm.service
systemctl status prometheus-node-exporter.service
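The three status checks can also be folded into one pass/fail helper, which is handy in a provisioning script. The following is only a sketch: the function name check_services is made up for illustration, and it relies on nothing beyond `systemctl is-active --quiet`, which exits with 0 only when the unit is active.

```bash
#!/usr/bin/env bash
# check_services is a hypothetical helper, not part of DCGM or systemd.
# It prints one line per unit and returns non-zero if any unit is not active.
check_services() {
    local failed=0 unit
    for unit in "$@"; do
        # is-active --quiet prints nothing and signals the state via exit code
        if systemctl is-active --quiet "$unit"; then
            echo "OK      $unit"
        else
            echo "FAILED  $unit"
            failed=1
        fi
    done
    return "$failed"
}
```

Called as `check_services dcgm.service prometheus-dcgm.service prometheus-node-exporter.service`, it should print one OK or FAILED line per unit. For any unit reported as FAILED, `systemctl status` on that unit shows the details.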
Screenshot of the result (Dashboard ID: 11752)
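If the dashboard stays empty, one way to narrow the problem down is to check whether GPU metrics reach the exporter endpoint at all. The sketch below rests on several assumptions that the steps above do not confirm: that the bundled unit files expose the DCGM metrics through node_exporter (for example via a textfile directory such as /run/prometheus), that node_exporter listens on its default port 9100, and that the GPU metric names start with `dcgm_`. The helper name count_dcgm_metrics is hypothetical.

```bash
#!/usr/bin/env bash
# count_dcgm_metrics is a hypothetical helper: it reads Prometheus text-format
# metrics on stdin and prints how many sample lines start with "dcgm_".
# The "dcgm_" prefix is an assumption about how the exporter names its metrics.
count_dcgm_metrics() {
    # grep -c prints the count; it exits 1 when the count is 0, so mask that
    grep -c '^dcgm_' || true
}
```

A possible invocation is `curl -s http://localhost:9100/metrics | count_dcgm_metrics`. A result of 0 would suggest looking at prometheus-dcgm.service first, while a non-zero count points toward the Prometheus scrape configuration or the dashboard's data source instead.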