Kubernetes GPU Metrics Monitoring
pod-gpu-monitoring
Prerequisites
- NVIDIA Tesla drivers >= R384 (download from the NVIDIA Driver Downloads page)
- nvidia-docker version > 2.0 (see how to install it and its prerequisites)
- Set the default runtime to nvidia
- Kubernetes version >= 1.13
- Enable the KubeletPodResources and DevicePlugins feature gates in /etc/kubernetes/kubelet.env (both gates must go in a single KUBELET_EXTRA_ARGS line, otherwise the second assignment overwrites the first):
  KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true,DevicePlugins=true
- In algor-deployment.yaml, request a GPU:
  resources:
    limits:
      nvidia.com/gpu: '1'
- Deploy algor-deployment.yaml; the kubelet then records the allocation in its device-plugin checkpoint:
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
{
"Data": {
"PodDeviceEntries": [{
"PodUID": "849652ef-4881-42de-a17b-fc8e2c737882",
"ContainerName": "myalgor",
"ResourceName": "nvidia.com/gpu",
"DeviceIDs": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"],
"AllocResp": "CmsKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSUUdQVS1kNWY3NGJlZC05YjYwLWU5ZDUtMzVlMS05YmYxODU4N2U2MzEsR1BVLTk1NjYxMDhkLTBjNWItMWRiMy02YmZkLWNiZThkYjMwZjVhYQ=="
}],
"RegisteredDevices": {
"nvidia.com/gpu": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
}
},
"Checksum": 2740882834
}
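The checkpoint above is what ties a GPU UUID back to a pod and container. A minimal sketch of parsing it (the function name is illustrative, and the sample below is trimmed from the checkpoint shown above with "AllocResp" omitted; note that kubelet 1.13+ also exposes the same information via the PodResources gRPC API, which is the more robust interface):

```python
import json

def pod_gpu_bindings(checkpoint_json: str) -> dict:
    """Return {(pod_uid, container_name): [gpu_device_ids]} for nvidia.com/gpu allocations."""
    data = json.loads(checkpoint_json)
    bindings = {}
    for entry in data["Data"]["PodDeviceEntries"]:
        if entry["ResourceName"] != "nvidia.com/gpu":
            continue  # skip non-GPU device plugins
        bindings[(entry["PodUID"], entry["ContainerName"])] = entry["DeviceIDs"]
    return bindings

# Sample trimmed from the checkpoint shown above ("AllocResp" omitted).
sample = '''{
  "Data": {
    "PodDeviceEntries": [{
      "PodUID": "849652ef-4881-42de-a17b-fc8e2c737882",
      "ContainerName": "myalgor",
      "ResourceName": "nvidia.com/gpu",
      "DeviceIDs": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",
                    "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
    }],
    "RegisteredDevices": {
      "nvidia.com/gpu": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",
                         "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
    }
  },
  "Checksum": 2740882834
}'''

for (pod_uid, container), gpus in pod_gpu_bindings(sample).items():
    print(container, "->", gpus)
```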
Deploy on Kubernetes cluster
# Deploy nvidia-k8s-device-plugin
$ kubectl create -f gpu-programe/nvidia-device-plugin.yaml
# Deploy GPU Pods
$ kubectl create -f gpu-programe/algor-deployment.yaml
# Create the monitoring namespace
$ kubectl create namespace monitoring
# Deploy the ConfigMaps, DaemonSets, Deployments, and Services for Prometheus and Grafana
$ kubectl create -f .
# List the services in the monitoring namespace
$ kubectl -n monitoring get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dcgm-exporter NodePort 10.236.9.60 <none> 9400:32410/TCP 2d5h
grafana NodePort 10.236.60.146 <none> 3000:30300/TCP 34d
pod-gpudcgm-exporter NodePort 10.236.44.122 <none> 9401:31059/TCP 6h20m
prometheus-service NodePort 10.236.12.67 <none> 9090:30909/TCP 34d
# Get gpu metrics
$ curl 10.236.9.60:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_RETIRED_SBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
...
# Get pod gpu metrics
$ curl 10.236.44.122:9401/gpu/metrics
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",pod_name="algorv3-87994b4c7-55697bdc54-gs4kp",pod_namespace="big-data",container_name="myalgor"} 405
...
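For Prometheus to scrape both endpoints, a minimal static scrape config could look like the fragment below. The targets reuse the ClusterIPs from the service listing above and the job names are illustrative; in a real cluster, kubernetes_sd_configs against the service endpoints is the more durable choice:

```yaml
scrape_configs:
  - job_name: dcgm-exporter          # node-level GPU metrics
    static_configs:
      - targets: ['10.236.9.60:9400']
  - job_name: pod-gpu-exporter       # pod-level GPU metrics
    metrics_path: /gpu/metrics       # this exporter serves a non-default path
    static_configs:
      - targets: ['10.236.44.122:9401']
```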
Logic
- container binding
- device registration
- pod GPU metrics
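The last step above, pod GPU metrics, amounts to joining per-GPU DCGM samples with the pod bindings from the kubelet checkpoint by GPU UUID. A hypothetical sketch (function and tuple shapes are invented for illustration; the sample values reuse the output shown earlier):

```python
def label_with_pods(samples, uuid_to_pod):
    """Attach pod labels to per-GPU DCGM samples.

    samples: iterable of (metric_name, gpu_uuid, value) tuples from DCGM.
    uuid_to_pod: gpu_uuid -> (pod_name, pod_namespace, container_name),
                 derived from the kubelet device-plugin checkpoint.
    Returns Prometheus exposition-format lines with pod labels attached.
    """
    lines = []
    for metric, uuid, value in samples:
        if uuid not in uuid_to_pod:
            continue  # GPU not currently allocated to any pod
        pod, ns, container = uuid_to_pod[uuid]
        lines.append(
            f'{metric}{{uuid="{uuid}",pod_name="{pod}",'
            f'pod_namespace="{ns}",container_name="{container}"}} {value}'
        )
    return lines

result = label_with_pods(
    [("dcgm_sm_clock", "GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", 405)],
    {"GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa":
         ("algorv3-87994b4c7-55697bdc54-gs4kp", "big-data", "myalgor")},
)
print(result[0])
```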