Kubernetes GPU Metrics Monitoring
pod-gpu-monitoring
Prerequisites
- NVIDIA Tesla drivers >= R384 (download from the NVIDIA Driver Downloads page)
- nvidia-docker version > 2.0 (see how to install it and its prerequisites)
- Set the default runtime to nvidia
- Kubernetes version >= 1.13
- Enable the KubeletPodResources and DevicePlugins feature gates in /etc/kubernetes/kubelet.env (both gates must go in a single KUBELET_EXTRA_ARGS line, otherwise the second assignment overwrites the first):
  KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true,DevicePlugins=true
- In algor-deployment.yaml, request a GPU:
  resources:
    limits:
      nvidia.com/gpu: '1'
- Deploy algor-deployment.yaml; the kubelet then records the allocation in its device-plugin checkpoint:
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
{
"Data": {
"PodDeviceEntries": [{
"PodUID": "849652ef-4881-42de-a17b-fc8e2c737882",
"ContainerName": "myalgor",
"ResourceName": "nvidia.com/gpu",
"DeviceIDs": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"],
"AllocResp": "CmsKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSUUdQVS1kNWY3NGJlZC05YjYwLWU5ZDUtMzVlMS05YmYxODU4N2U2MzEsR1BVLTk1NjYxMDhkLTBjNWItMWRiMy02YmZkLWNiZThkYjMwZjVhYQ=="
}],
"RegisteredDevices": {
"nvidia.com/gpu": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
}
},
"Checksum": 2740882834
}
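The checkpoint above is what ties a GPU UUID back to a pod and container. A minimal sketch of parsing it (the function name is illustrative, and the sample below is trimmed from the checkpoint shown above with "AllocResp" omitted; note that kubelet 1.13+ also exposes the same information via the PodResources gRPC API, which is the more robust interface):

```python
import json

def pod_gpu_bindings(checkpoint_json: str) -> dict:
    """Return {(pod_uid, container_name): [gpu_device_ids]} for nvidia.com/gpu allocations."""
    data = json.loads(checkpoint_json)
    bindings = {}
    for entry in data["Data"]["PodDeviceEntries"]:
        if entry["ResourceName"] != "nvidia.com/gpu":
            continue  # skip non-GPU device plugins
        bindings[(entry["PodUID"], entry["ContainerName"])] = entry["DeviceIDs"]
    return bindings

# Sample trimmed from the checkpoint shown above ("AllocResp" omitted).
sample = '''{
  "Data": {
    "PodDeviceEntries": [{
      "PodUID": "849652ef-4881-42de-a17b-fc8e2c737882",
      "ContainerName": "myalgor",
      "ResourceName": "nvidia.com/gpu",
      "DeviceIDs": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",
                    "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
    }],
    "RegisteredDevices": {
      "nvidia.com/gpu": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",
                         "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
    }
  },
  "Checksum": 2740882834
}'''

for (pod_uid, container), gpus in pod_gpu_bindings(sample).items():
    print(container, "->", gpus)
```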
Deploy on Kubernetes cluster
# Deploy nvidia-k8s-device-plugin
$ kubectl create -f gpu-programe/nvidia-device-plugin.yaml
# Deploy GPU Pods
$ kubectl create -f gpu-programe/algor-deployment.yaml
# Create the monitoring namespace
$ kubectl create namespace monitoring
# Deploy the ConfigMaps, DaemonSets, Deployments, and Services for Prometheus and Grafana
$ kubectl create -f .
# List the services in the monitoring namespace
$ kubectl -n monitoring get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dcgm-exporter NodePort 10.236.9.60 <none> 9400:32410/TCP 2d5h
grafana NodePort 10.236.60.146 <none> 3000:30300/TCP 34d
pod-gpudcgm-exporter NodePort 10.236.44.122 <none> 9401:31059/TCP 6h20m
prometheus-service NodePort 10.236.12.67 <none> 9090:30909/TCP 34d
# Get gpu metrics
$ curl 10.236.9.60:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_RETIRED_SBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
...
# Get pod gpu metrics
$ curl 10.236.44.122:9401/gpu/metrics
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",pod_name="algorv3-87994b4c7-55697bdc54-gs4kp",pod_namespace="big-data",container_name="myalgor"} 405
...
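For Prometheus to scrape both endpoints, a minimal static scrape config could look like the fragment below. The targets reuse the ClusterIPs from the service listing above and the job names are illustrative; in a real cluster, kubernetes_sd_configs against the service endpoints is the more durable choice:

```yaml
scrape_configs:
  - job_name: dcgm-exporter          # node-level GPU metrics
    static_configs:
      - targets: ['10.236.9.60:9400']
  - job_name: pod-gpu-exporter       # pod-level GPU metrics
    metrics_path: /gpu/metrics       # this exporter serves a non-default path
    static_configs:
      - targets: ['10.236.44.122:9401']
```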
Logic
- container binding
- device registration
- pod GPU metrics
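The last step above, pod GPU metrics, amounts to joining per-GPU DCGM samples with the pod bindings from the kubelet checkpoint by GPU UUID. A hypothetical sketch (function and tuple shapes are invented for illustration; the sample values reuse the output shown earlier):

```python
def label_with_pods(samples, uuid_to_pod):
    """Attach pod labels to per-GPU DCGM samples.

    samples: iterable of (metric_name, gpu_uuid, value) tuples from DCGM.
    uuid_to_pod: gpu_uuid -> (pod_name, pod_namespace, container_name),
                 derived from the kubelet device-plugin checkpoint.
    Returns Prometheus exposition-format lines with pod labels attached.
    """
    lines = []
    for metric, uuid, value in samples:
        if uuid not in uuid_to_pod:
            continue  # GPU not currently allocated to any pod
        pod, ns, container = uuid_to_pod[uuid]
        lines.append(
            f'{metric}{{uuid="{uuid}",pod_name="{pod}",'
            f'pod_namespace="{ns}",container_name="{container}"}} {value}'
        )
    return lines

result = label_with_pods(
    [("dcgm_sm_clock", "GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", 405)],
    {"GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa":
         ("algorv3-87994b4c7-55697bdc54-gs4kp", "big-data", "myalgor")},
)
print(result[0])
```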