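System installation plan (lab environment: VMware Workstation)
Role | Address | OS version | System spec |
Prometheus server | 192.168.188.129 (VMware NAT mode; the host machine is on the same subnet as the monitored machine) | Red Hat 7.4 | 2 vCPU / 4 GB RAM / 20 GB disk |
Monitored Windows machine | 192.168.124.40 | Windows 10 | N/A |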
1. Deploy Prometheus
Download the package
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0-rc.1/prometheus-2.53.0-rc.1.linux-amd64.tar.gz -O /usr/local/src/prometheus-2.53.0.tar.gz --no-check-certificate
Install
cd /usr/local/src/
mkdir -p /opt/monitor
tar -zxvf prometheus-2.53.0.tar.gz -C /opt/monitor/
cd /opt/monitor/
mv prometheus-2.53.0-rc.1.linux-amd64/ prometheus/
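Manage the Prometheus service with systemd
# Quote the heredoc delimiter so the shell writes $MAINPID literally instead of expanding it to an empty string
cat > /usr/lib/systemd/system/prometheus.service << 'EOF'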
[Unit]
Description=prometheus
[Service]
ExecStart=/opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start Prometheus
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
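Once the service is started, it can be worth confirming that Prometheus is actually answering requests. The following is a minimal sketch, assuming the default listen port 9090 (the unit file above does not override it) and that curl is installed; the function names are illustrative and not part of Prometheus.

```shell
#!/usr/bin/env bash
# Build the URL of Prometheus's built-in health endpoint (/-/healthy).
# $1: host (defaults to localhost), $2: port (defaults to 9090, an assumption here)
prom_health_url() {
  local host="${1:-localhost}" port="${2:-9090}"
  printf 'http://%s:%s/-/healthy\n' "$host" "$port"
}

# Return 0 if the health endpoint answers with a success status, non-zero otherwise.
prom_is_healthy() {
  curl --silent --fail --max-time 5 --output /dev/null "$(prom_health_url "$@")"
}
```

For example, running `prom_is_healthy && echo "Prometheus is up"` on the server should print the message if the service started correctly; if it does not, `systemctl status prometheus` is the next place to look.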
2. Deploy Grafana
cd /usr/local/src/
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-11.1.0.linux-amd64.tar.gz
tar -zxvf grafana-enterprise-11.1.0.linux-amd64.tar.gz -C /opt/monitor/
cd /opt/monitor/
mv grafana-v11.1.0/ grafana/
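Manage Grafana with systemd
# Quote the heredoc delimiter so the shell writes $MAINPID literally instead of expanding it to an empty string
cat > /usr/lib/systemd/system/grafana.service << 'EOF'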
[Unit]
Description=grafana
[Service]
ExecStart=/opt/monitor/grafana/bin/grafana-server -homepath=/opt/monitor/grafana
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start Grafana
systemctl daemon-reload
systemctl enable grafana
systemctl start grafana
Adding Prometheus as a data source in Grafana is left to the reader and omitted here.
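As an alternative to the UI, the data source can be provisioned from a file. The following is a minimal sketch, assuming Grafana reads provisioning files from conf/provisioning/datasources under the home path used above and that Prometheus listens on its default port 9090 on the same machine; the function name and the output file name are arbitrary choices.

```shell
#!/usr/bin/env bash
# Print a Grafana data source provisioning file for a Prometheus server.
# $1: Prometheus base URL (defaults to http://localhost:9090, an assumption here)
render_prometheus_datasource() {
  local url="${1:-http://localhost:9090}"
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}
```

A possible usage is `render_prometheus_datasource > /opt/monitor/grafana/conf/provisioning/datasources/prometheus.yaml` followed by `systemctl restart grafana`; the data source should then appear in Grafana without any manual steps.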
3. Deploy nvidia_gpu_exporter on the Windows test machine
Reference for installing the exporter: https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/INSTALL.md
# Open PowerShell as administrator
Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')
# If the command above fails, run: iex "& {$(irm get.scoop.sh)} -RunAsAdmin"
scoop install nssm --global
scoop bucket add nvidia_gpu_exporter https://github.com/utkuozdemir/scoop_nvidia_gpu_exporter.git
scoop install nvidia_gpu_exporter/nvidia_gpu_exporter --global
New-NetFirewallRule -DisplayName "Nvidia GPU Exporter" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 9835
nssm install nvidia_gpu_exporter "C:\ProgramData\scoop\apps\nvidia_gpu_exporter\current\nvidia_gpu_exporter.exe"
Start-Service nvidia_gpu_exporter
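Before adding the exporter to Prometheus, it helps to confirm from the Prometheus server that the endpoint is reachable through the firewall rule. The following is a minimal sketch, assuming the exporter serves its metrics at /metrics on port 9835 and prefixes its GPU metric names with nvidia_smi_ (check the exporter's README if your version differs); the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Count the sample lines on stdin whose metric name starts with the given prefix.
# Comment lines (# HELP / # TYPE) are not counted because they start with '#'.
count_metrics_with_prefix() {
  local prefix="$1"
  # grep -c exits non-zero when the count is 0; '|| true' keeps the function's status at 0.
  grep -c "^${prefix}" || true
}
```

Run on the Prometheus server, `curl -s http://192.168.124.40:9835/metrics | count_metrics_with_prefix nvidia_smi_` should print a non-zero count when the exporter is running and can see the GPU. A count of 0, or a curl error, points to the service, the firewall rule, or routing between the two networks.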
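4. Configure the monitored host in Prometheus (the test host's address is 192.168.124.40)
vi /opt/monitor/prometheus/prometheus.yml
Under the scrape_configs: key, add the monitored host using the same format and indentation as the existing jobs:
  - job_name: "GPU-Monitor"
    static_configs:
      - targets: ["192.168.124.40:9835"]
Restart Prometheus so that the new job takes effect:
systemctl restart prometheus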
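Because YAML is indentation-sensitive, generating the job entry can avoid copy-paste mistakes. The following is a minimal sketch that prints a static scrape job for a given job name and target; the function name is illustrative, and the two-space base indent assumes the layout of the prometheus.yml shipped in the tarball.

```shell
#!/usr/bin/env bash
# Print a static scrape job entry, indented to sit under `scrape_configs:`.
# $1: job name, $2: target in host:port form
render_scrape_job() {
  local job="$1" target="$2"
  cat <<EOF
  - job_name: "${job}"
    static_configs:
      - targets: ["${target}"]
EOF
}
```

If scrape_configs is the last section of the file, as it is in the shipped default, the entry can be appended with `render_scrape_job GPU-Monitor 192.168.124.40:9835 >> /opt/monitor/prometheus/prometheus.yml`. Running `/opt/monitor/prometheus/promtool check config /opt/monitor/prometheus/prometheus.yml` afterwards reports syntax errors before the configuration reaches the running server.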
5. Import the dashboard template in Grafana
Reference: https://grafana.com/grafana/dashboards/14574-nvidia-gpu-metrics/
The import procedure is omitted here.