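1. dcgm-exporter is a tool that exports NVIDIA GPU monitoring data in Prometheus format. It collects GPU metrics in real time and exposes them in the format Prometheus expects, so they can be used for visualization and alerting.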
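2. dcgm-metrics is the ConfigMap that tells dcgm-exporter which NVIDIA Data Center GPU Manager (DCGM) fields to collect, such as GPU temperature, power draw, memory usage, and utilization. Each listed field is exported in Prometheus format.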
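3. ServiceMonitor is a custom resource from the Prometheus Operator that tells Prometheus which Kubernetes Services to scrape. Matching Services in the cluster are discovered automatically and added as scrape targets, which makes it easy to monitor the services in a cluster (for example, Pod CPU and memory usage) for troubleshooting and tuning.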
Grafana uses the datasource set up in the earlier rook-ceph post, together with the NVIDIA DCGM Exporter dashboard.
# Label the nodes that have GPUs
kubectl label nodes nodeName accelerator=nvidia-gpu
# At the time, this was run on the master node
docker pull v5cn/prometheus-adapter:v0.10.0
docker tag docker.io/v5cn/prometheus-adapter:v0.10.0 k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.10.0
dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: default
data:
  # Property-like keys; each key maps to a simple value
  default-counters.csv: |
    # Format,,
    # If line starts with a '#' it is considered a comment,,
    # DCGM FIELD, Prometheus metric type, help message
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    # Power,,
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # PCIE,,
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    # ECC,,
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    # Retired pages,,
    # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    # NVLink,,
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
    # VGPU License status,,
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    # Remapped rows,,
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
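Since the counters file mixes active lines with commented-out ones, it can be handy to list only the fields that will actually be exported. The following is a minimal sketch, not part of the original setup: the helper name enabled_fields is made up for illustration, and it assumes the three-column "field, type, help" layout shown in the ConfigMap.

```shell
# Hypothetical helper: print "FIELD TYPE" for every active line of a
# dcgm-exporter counters CSV read from stdin.
enabled_fields() {
  awk -F',' '
    /^[[:space:]]*#/ { next }              # commented lines are headings or disabled metrics
    NF < 2 { next }                        # skip blank or malformed lines
    {
      field = $1; type = $2
      gsub(/^[ \t]+|[ \t]+$/, "", field)   # trim whitespace around each column
      gsub(/^[ \t]+|[ \t]+$/, "", type)
      print field, type
    }'
}

# Possible usage against the live ConfigMap (requires cluster access):
# kubectl get configmap dcgm-metrics -n default \
#   -o jsonpath='{.data.default-counters\.csv}' | enabled_fields
```

Uncommenting a line in the CSV (for example one of the ECC counters) and re-applying the ConfigMap should make it show up in this list. The exporter pods typically need a restart to pick up the change.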
dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        accelerator: nvidia-gpu
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400
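Once the DaemonSet and Service are applied, a quick way to confirm the exporter works is to read one metric straight from its /metrics endpoint. This is only a sketch: metric_samples is a hypothetical helper name, and it assumes the exporter emits plain "name{labels} value" lines without trailing timestamps.

```shell
# Hypothetical helper: print the sample values of one metric from
# Prometheus text exposition output read on stdin.
metric_samples() {
  # $1 = metric name, e.g. DCGM_FI_DEV_GPU_UTIL
  awk -v name="$1" '
    /^#/ { next }    # skip HELP and TYPE lines
    index($0, name "{") == 1 || index($0, name " ") == 1 { print $NF }
  '
}

# Possible usage (requires cluster access):
# kubectl port-forward -n default svc/dcgm-exporter 9400:9400 &
# curl -s http://localhost:9400/metrics | metric_samples DCGM_FI_DEV_GPU_UTIL
```

One value is printed per GPU on the node behind the forwarded pod. An empty result usually means the field is commented out in the counters CSV or is not supported on that GPU.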
service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    relabelings:
    - action: replace
      sourceLabels: [__meta_kubernetes_endpoint_node_name]
      targetLabel: node_name
api.service.yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter
    namespace: default
    port: 443
  version: v1beta1
  versionPriority: 100
# dcgm-exporter
kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# service monitor
kubectl apply -f service-monitor.yaml
kubectl apply -f api.service.yaml
# If you changed Prometheus to a custom setup, edit the "monitoring" part of prometheus-adapter/values.yaml; otherwise leave it as is
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Run this from the parent directory of prometheus-adapter/
helm upgrade --install prometheus-adapter prometheus-adapter/ -n default
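After the adapter is installed, it is worth checking that the custom metrics API actually serves the metric the HPA will ask for. The sketch below assumes the adapter rules (not shown in this post) expose a pod metric named DEV_GPU_UTIL_current. custom_metric_names is a hypothetical helper that pulls resource names out of the JSON discovery document with grep and sed rather than a real JSON parser.

```shell
# Hypothetical helper: print every resource name found in a
# custom.metrics.k8s.io discovery document read on stdin.
custom_metric_names() {
  grep -o '"name"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Possible usage (requires cluster access):
# kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | custom_metric_names
# kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/triton/pods/*/DEV_GPU_UTIL_current"
```

If pods/DEV_GPU_UTIL_current is missing from the list, the HPA will report that it cannot fetch the metric. In that case the adapter rules and the Prometheus URL in the chart values are the first things to check.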
gpu-hpa-pod-behavior.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: triton
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: face-triton-server-deployment
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DEV_GPU_UTIL_current
      target:
        type: AverageValue
        averageValue: 40
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
kubectl apply -f gpu-hpa-pod-behavior.yaml
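To get a feel for how this HPA reacts, here is a rough sketch of the scaling rule from the Kubernetes documentation: desired = ceil(current replicas × current average / target average), left unchanged when the ratio is within the default 10% tolerance, then clamped to minReplicas and maxReplicas. The function name desired_replicas is made up for illustration, and the sketch ignores the stabilization window and the handling of pods that are not ready or have no metrics.

```shell
# Hypothetical helper approximating the HPA replica calculation.
# args: current_replicas current_avg_value target_avg_value min_replicas max_replicas
desired_replicas() {
  awk -v cur="$1" -v val="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
    ratio = val / target
    diff = ratio - 1; if (diff < 0) diff = -diff
    if (diff <= 0.1) {
      want = cur                 # inside the tolerance band: keep the current count
    } else {
      want = cur * ratio
      c = int(want)              # awk has no ceil(), so round up by hand
      if (c < want) c++
      want = c
    }
    if (want < min) want = min   # clamp to the HPA bounds
    if (want > max) want = max
    print want
  }'
}

# With the HPA above (target 40, min 1, max 2):
desired_replicas 1 85 40 1 2   # one pod at 85% average GPU utilization
desired_replicas 2 15 40 1 2   # two pods averaging 15%
```

The first call asks for three replicas before clamping, so the Deployment goes to its maximum of 2. The second scales back to 1, although in the real controller the 60-second scaleDown stabilization window delays that step.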