Prometheus: Kubernetes GPU metrics monitoring

pod-gpu-monitoring

github

Prerequisites

  • NVIDIA Tesla drivers R384 or later (download from the NVIDIA Driver Downloads page)

  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)

  • Set the Docker default runtime to nvidia

  • Kubernetes version >= 1.13

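  • Enable the KubeletPodResources and DevicePlugins feature gates in /etc/kubernetes/kubelet.env. Put both in a single comma-separated --feature-gates flag, because a second KUBELET_EXTRA_ARGS assignment would replace the first:
    KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true,DevicePlugins=true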

  • In algor-deployment.yaml, set the GPU limit in the container's resources:
    resources:
      limits:
        nvidia.com/gpu: '1'

  • Deploy algor-deployment.yaml, then run cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint on the GPU node. The checkpoint records which devices are bound to which container:

{
	"Data": {
		"PodDeviceEntries": [{
			"PodUID": "849652ef-4881-42de-a17b-fc8e2c737882",
			"ContainerName": "myalgor",
			"ResourceName": "nvidia.com/gpu",
			"DeviceIDs": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"],
			"AllocResp": "CmsKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSUUdQVS1kNWY3NGJlZC05YjYwLWU5ZDUtMzVlMS05YmYxODU4N2U2MzEsR1BVLTk1NjYxMDhkLTBjNWItMWRiMy02YmZkLWNiZThkYjMwZjVhYQ=="
		}],
		"RegisteredDevices": {
			"nvidia.com/gpu": ["GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa", "GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"]
		}
	},
	"Checksum": 2740882834
}
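
The AllocResp value in the checkpoint is base64 text, and decoding it shows what the device plugin injected into the container. The sketch below assumes the payload is the device plugin API's serialized container allocate response, with the environment-variable map in protobuf field 1 (key in subfield 1, value in subfield 2). The function names are hypothetical, and the hand-written parser only handles length-delimited fields, which is all this particular payload contains.

```python
import base64

# AllocResp value copied from the checkpoint shown above.
ALLOC_RESP = "CmsKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSUUdQVS1kNWY3NGJlZC05YjYwLWU5ZDUtMzVlMS05YmYxODU4N2U2MzEsR1BVLTk1NjYxMDhkLTBjNWItMWRiMy02YmZkLWNiZThkYjMwZjVhYQ=="


def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def length_delimited_fields(buf):
    """Yield (field_number, payload) for each length-delimited field in buf."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if tag & 0x07 != 2:
            raise ValueError("this sketch only handles wire type 2")
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


def decode_envs(alloc_resp_b64):
    """Return the environment variables carried in the allocate response."""
    envs = {}
    raw = base64.b64decode(alloc_resp_b64)
    for number, entry in length_delimited_fields(raw):
        if number != 1:  # assumption: field 1 is the env map
            continue
        pair = dict(length_delimited_fields(entry))  # 1 -> key, 2 -> value
        envs[pair[1].decode()] = pair[2].decode()
    return envs


envs = decode_envs(ALLOC_RESP)
print(envs["NVIDIA_VISIBLE_DEVICES"].split(","))
```

For the checkpoint above, the decoded NVIDIA_VISIBLE_DEVICES value contains the same two GPU UUIDs that appear in DeviceIDs, which is how the container ends up seeing exactly the devices the kubelet bound to it.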

Deploy on Kubernetes cluster

# Deploy nvidia-k8s-device-plugin
$ kubectl create -f gpu-programe/nvidia-device-plugin.yaml
# Deploy GPU Pods
$ kubectl create -f gpu-programe/algor-deployment.yaml
# Create the monitoring namespace
$ kubectl create namespace monitoring

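# Deploy the ConfigMap, DaemonSet, Deployment, and Service manifests for Prometheus and Grafana
# (-f . reads every manifest in the current directory; a shell-expanded *.yaml
# passes the extra file names as arguments, which kubectl rejects)
$ kubectl create -f .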

# List the services in the monitoring namespace
$ kubectl -n monitoring get svc
NAME                   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
dcgm-exporter          NodePort   10.236.9.60     <none>        9400:32410/TCP   2d5h
grafana                NodePort   10.236.60.146   <none>        3000:30300/TCP   34d
pod-gpudcgm-exporter   NodePort   10.236.44.122   <none>        9401:31059/TCP   6h20m
prometheus-service     NodePort   10.236.12.67    <none>        9090:30909/TCP   34d

# Get node-level GPU metrics
$ curl 10.236.9.60:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_RETIRED_SBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
DCGM_FI_DEV_RETIRED_DBE{gpu="1", UUID="GPU-d5f74bed-9b60-e9d5-35e1-9bf18587e631"} 0
...

# Get per-pod GPU metrics
$ curl 10.236.44.122:9401/gpu/metrics
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",pod_name="algorv3-87994b4c7-55697bdc54-gs4kp",pod_namespace="big-data",container_name="myalgor"} 405
...
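
To check the per-pod output outside Prometheus, the exposition text can be parsed directly. The following is a minimal sketch, not part of the project: parse_samples and by_pod are hypothetical helpers, the sample text is the dcgm_sm_clock sample from the output above written on a single line, and the regular expressions assume label values without escaped quotes.

```python
import re

# One sample from the pod exporter output above, on a single line.
SAMPLE = '''\
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa",pod_name="algorv3-87994b4c7-55697bdc54-gs4kp",pod_namespace="big-data",container_name="myalgor"} 405
'''

SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) for every labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # this sketch ignores samples without labels
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def by_pod(samples, metric):
    """Map (namespace, pod, gpu index) to the value of the given metric."""
    result = {}
    for name, labels, value in samples:
        if name == metric and "pod_name" in labels:
            key = (labels.get("pod_namespace", ""), labels["pod_name"], labels.get("gpu", ""))
            result[key] = value
    return result


clocks = by_pod(parse_samples(SAMPLE), "dcgm_sm_clock")
for (namespace, pod, gpu), mhz in clocks.items():
    print(f"{namespace}/{pod} gpu={gpu}: {mhz} MHz")
```

In practice the text would come from the /gpu/metrics endpoint shown above rather than from an inline string.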

[Figure: pod GPU metrics]

Logic

[Figure: container binding]
[Figure: device register]
[Figure: pod GPU metrics]
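
The stage names above suggest a join: once each GPU UUID is known to be bound to a container, the node-level samples can be relabelled with pod information. The sketch below illustrates that reading only and is not the project's code. DEVICE_TO_POD is a hypothetical, hand-filled mapping (a real exporter would obtain it from the kubelet), and add_pod_labels is a hypothetical helper.

```python
import re

# Hypothetical device-to-pod mapping, filled in by hand for illustration.
DEVICE_TO_POD = {
    "GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa": {
        "pod_name": "algorv3-87994b4c7-55697bdc54-gs4kp",
        "pod_namespace": "big-data",
        "container_name": "myalgor",
    },
}

# The two exporters spell the label as UUID and uuid, so match either.
UUID_RE = re.compile(r'uuid="([^"]+)"', re.IGNORECASE)


def add_pod_labels(line, device_to_pod):
    """Append pod labels to one sample line if its GPU is bound to a pod."""
    if line.startswith("#"):
        return line  # HELP/TYPE comments pass through unchanged
    match = UUID_RE.search(line)
    if match is None or match.group(1) not in device_to_pod:
        return line  # unbound GPU: leave the sample as it is
    extra = ",".join(f'{key}="{value}"' for key, value in device_to_pod[match.group(1)].items())
    head, brace, tail = line.partition("}")
    return f"{head},{extra}{brace}{tail}"


node_sample = 'dcgm_sm_clock{gpu="0",uuid="GPU-9566108d-0c5b-1db3-6bfd-cbe8db30f5aa"} 405'
relabelled = add_pod_labels(node_sample, DEVICE_TO_POD)
print(relabelled)
```

Samples for GPUs that are not bound to any pod pass through unchanged, so the relabelled output stays a superset of the node-level metrics.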

Related document
