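1. GPU host configuration:
Two installation options follow; nvidia-container-runtime (1.1.2) is the recommended one.
1.1.1. Install nvidia-container-toolkit
Set up the nvidia-container-toolkit repository and GPG key: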
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
Enable the experimental branch in the repository list:
yum-config-manager --enable libnvidia-container-experimental
Refresh the package list, then install the nvidia-container-toolkit package:
sudo yum clean expire-cache
sudo yum install -y nvidia-container-toolkit
Configure the Docker daemon to recognize the NVIDIA container runtime:
sudo nvidia-ctk runtime configure --runtime=docker
After configuring the runtime, restart the Docker daemon to complete the installation:
sudo systemctl restart docker
1.1.2. Install nvidia-container-runtime (recommended)
After configuring the nvidia-container-runtime.repo repository source, install the package:
sudo yum install -y nvidia-container-runtime
After the installation completes, restart Docker:
systemctl daemon-reload && systemctl restart docker
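Whichever option was used, it can help to confirm that Docker now lists the nvidia runtime before moving on. The sketch below is one possible check, assuming the plain-text "Runtimes:" line that docker info prints; the helper name has_nvidia_runtime is hypothetical and not part of any NVIDIA or Docker tool.

```shell
# has_nvidia_runtime: read `docker info` text on stdin and succeed only if
# the "Runtimes:" line lists a runtime named exactly "nvidia".
# (Hypothetical helper; assumes the default plain-text `docker info` output.)
has_nvidia_runtime() {
    awk '
        /^[[:space:]]*Runtimes:/ {
            # Field 1 is "Runtimes:", the remaining fields are runtime names
            for (i = 2; i <= NF; i++) if ($i == "nvidia") found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Possible usage: docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"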
1.2. Install NVIDIA monitoring (dcgm-exporter) with Docker:
docker run -d --gpus all --name dcgm-exporter --restart=always -p 9400:9400 nvidia/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
Run curl localhost:9400/metrics on the GPU host to retrieve the metrics.
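The endpoint returns metrics in the Prometheus text format, so a single value can be pulled out with a small filter. This is a minimal sketch, assuming the exporter exposes a metric named DCGM_FI_DEV_GPU_UTIL (GPU utilization) in its default metric set; the helper name metric_values is hypothetical.

```shell
# metric_values NAME: read Prometheus text-format metrics on stdin and print
# the sample value of every series whose metric name is exactly NAME.
# (Hypothetical helper; assumes samples carry no trailing timestamp.)
metric_values() {
    awk -v name="$1" '
        /^#/ { next }                        # skip HELP/TYPE comment lines
        {
            metric = $0
            sub(/[{ ].*$/, "", metric)        # drop labels and value, keep the name
            if (metric == name) print $NF     # the value is the last field
        }
    '
}
```

Possible usage: curl -s localhost:9400/metrics | metric_values DCGM_FI_DEV_GPU_UTIL
This prints one line per GPU. Taking the last field rather than the second keeps the filter working when a label value (such as the model name) contains spaces.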
1.3. Install node_exporter-1.5.0.linux-amd64
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
Create the service unit file with the following contents:
cat /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network.target
After=network.target
[Service]
# Path to the node_exporter binary; adjust to where it was extracted
ExecStart=/usr/local/src/node_exporter/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
Start the service:
systemctl daemon-reload && systemctl start node_exporter && systemctl enable node_exporter
systemctl status node_exporter.service
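The unit file can also be generated rather than typed by hand, which avoids path typos when the binary lives somewhere else. The sketch below is one way to do it; the helper name render_node_exporter_unit is hypothetical, and the output includes the [Unit] section header that systemd requires before Description=.

```shell
# render_node_exporter_unit BINARY_PATH: print a systemd unit for node_exporter
# that starts the binary at BINARY_PATH. (Hypothetical helper name.)
render_node_exporter_unit() {
    cat <<EOF
[Unit]
Description=Node Exporter
Wants=network.target
After=network.target

[Service]
ExecStart=$1
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}
```

Possible usage: render_node_exporter_unit /usr/local/src/node_exporter/node_exporter | sudo tee /etc/systemd/system/node_exporter.service
Run systemctl daemon-reload afterwards so systemd picks up the new file.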
2. Monitoring host configuration:
2.1. Configure prometheus.yml
cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s # default interval between metric scrapes
  external_labels:
    monitor: 'my-monitor'
scrape_configs: # scrape targets
  - job_name: 'GPU'
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.1.100:9400']
  - job_name: 'GPU-node'
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.1.100:9100']
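When more GPU hosts are added, each job's target list grows, and YAML indentation mistakes are easy to make. As a rough sketch, a small function can print a correctly indented scrape_configs entry for any number of targets; the helper name scrape_job and the second address in the usage line are hypothetical, and the 5s interval simply mirrors the configuration above.

```shell
# scrape_job JOB_NAME TARGET...: print one scrape_configs entry, indented to
# sit under the top-level "scrape_configs:" key. (Hypothetical helper name.)
scrape_job() {
    job=$1
    shift
    printf "  - job_name: '%s'\n" "$job"
    printf "    scrape_interval: 5s\n"
    printf "    static_configs:\n"
    printf "      - targets:\n"
    # One list item per host:port target
    for target in "$@"; do
        printf "          - '%s'\n" "$target"
    done
}
```

Possible usage: scrape_job GPU 192.168.1.100:9400 192.168.1.101:9400
The output can be pasted under scrape_configs in /etc/prometheus/prometheus.yml.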
2.2. Install Prometheus with Docker:
docker run -d --name prometheus --restart=always -p 9090:9090 -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus:latest
2.3. Install Grafana with Docker:
docker run -d --restart=always -p 3000:3000 grafana/grafana