NVIDIA monitoring

Official website

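1. dcgm-exporter exports NVIDIA GPU monitoring data in Prometheus format. It collects GPU metrics in real time and exposes them in the format Prometheus expects, so they can be used for visualization and alerting.
2. dcgm-metrics is the ConfigMap used in this setup. It holds the CSV list of NVIDIA Data Center GPU Manager (DCGM) fields, such as GPU temperature, power draw, framebuffer memory usage, and utilization, that dcgm-exporter publishes as Prometheus metrics.
3. ServiceMonitor is a custom resource provided by the Prometheus Operator (apiVersion monitoring.coreos.com/v1). It selects Kubernetes Services by label so that Prometheus discovers and scrapes their endpoints automatically. Here it makes the dcgm-exporter Service a scrape target, which helps with troubleshooting and tuning.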
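
As a rough illustration of what "Prometheus format" means in item 1, the sketch below parses a few lines of Prometheus text exposition output. The sample lines, label names, and values are made up for illustration and were not captured from a real dcgm-exporter. The parser is deliberately simplified and skips any line that carries a trailing timestamp.

```python
import re

# Hypothetical sample of the text a scrape of :9400/metrics might return.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 43
"""

# Shape handled here: metric_name{label="value",...} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape (for example, a trailing timestamp)
        name, label_text, value = match.groups()
        labels = dict(LABEL_RE.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Calling parse_samples(SAMPLE) yields one tuple per GPU, which is the per-device shape that dashboards and alert rules then query.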

Grafana uses the datasources described in the earlier rook-ceph post, together with the NVIDIA DCGM Exporter dashboard.

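# Label the nodes that have GPUs (replace nodeName with the actual node name)
kubectl label nodes nodeName accelerator=nvidia-gpu
# Pull the prometheus-adapter image and retag it under the name the chart expects (at the time, this was run on the master node)
docker pull v5cn/prometheus-adapter:v0.10.0
docker tag docker.io/v5cn/prometheus-adapter:v0.10.0 k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.10.0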

dcgm-metrics.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: default
data:
  # Property-like keys; each key maps to a simple value
  default-counters.csv: |
    # Format,,
    # If line starts with a '#' it is considered a comment,,
    # DCGM FIELD, Prometheus metric type, help message
 
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
 
    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
 
    # Power,,
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
 
    # PCIE,,
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
 
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).
 
    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
 
    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
 
    # ECC,,
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
 
    # Retired pages,,
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
 
    # NVLink,,
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes
 
    # VGPU License status,,
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
 
    # Remapped rows,,
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
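
The CSV format described by the comments at the top of default-counters.csv (lines starting with # are comments, and every other line is DCGM field, Prometheus metric type, help message) can be sketched as a small parser. This only reflects the format as stated in the file's own comments. It is not dcgm-exporter's actual parsing code, and parse_counters is a hypothetical helper name.

```python
def parse_counters(csv_text):
    """Parse counters CSV text into (dcgm_field, metric_type, help_text) tuples."""
    counters = []
    for line_number, raw in enumerate(csv_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including disabled counters)
        # Split on the first two commas only, so the help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {line_number}: expected 3 fields, got {len(parts)}")
        counters.append(tuple(parts))
    return counters


# A short excerpt in the same shape as the ConfigMap data.
EXAMPLE = """\
# Temperature,,
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
"""
```

Running parse_counters(EXAMPLE) returns only the two enabled counters, which is a quick way to check which fields a given ConfigMap would ask the exporter to publish.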

dcgm-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        accelerator: nvidia-gpu
 
---
 
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400
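
The Service above only routes to Pods whose labels contain every key/value pair in its selector. A minimal sketch of that equality-based matching follows. selector_matches is a hypothetical helper for illustration, not a Kubernetes API, and the 2.5.0 version label is an invented example.

```python
def selector_matches(selector, labels):
    """True if every selector key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


service_selector = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "2.4.0",
}

# Labels as set in the DaemonSet's Pod template.
pod_labels = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "2.4.0",
}

# Hypothetical Pod after bumping only the version label in the template.
upgraded_pod_labels = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "2.5.0",
}
```

Because the version label is part of the selector, changing it in the Pod template without updating the Service would leave the Service with no matching Pods.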

service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    relabelings:
    - action: replace
      sourceLabels:  [__meta_kubernetes_endpoint_node_name]
      targetLabel: node_name
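
The relabeling rule above copies the discovered node name into a node_name label on every scraped series. What a replace action does can be sketched as follows, assuming Prometheus's documented defaults for the fields the manifest leaves out (separator ";", regex "(.*)", replacement "$1"). relabel_replace is a hypothetical helper and gpu-node-1 is an invented node name.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    separator=";", regex="(.*)", replacement="$1"):
    """Apply one 'replace' relabeling step and return the new label dict."""
    # Source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined, flags=re.DOTALL)
    result = dict(labels)
    if match is None:
        return result  # regex did not match: labels are left unchanged
    # Expand $1, $2, ... with the captured groups.
    value = re.sub(r"\$(\d+)",
                   lambda ref: match.group(int(ref.group(1))) or "",
                   replacement)
    if value:
        result[target_label] = value
    else:
        result.pop(target_label, None)  # an empty value removes the label
    return result


discovered = {
    "__meta_kubernetes_endpoint_node_name": "gpu-node-1",
    "job": "dcgm-exporter",
}
relabeled = relabel_replace(
    discovered, ["__meta_kubernetes_endpoint_node_name"], "node_name"
)
```

With the label in place, queries can group GPU metrics by node_name rather than by Pod IP.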

api.service.yaml

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter
    namespace: default
    port: 443
  version: v1beta1
  versionPriority: 100

# dcgm-exporter: the metrics ConfigMap, then the DaemonSet and Service
kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# ServiceMonitor
kubectl apply -f service-monitor.yaml
# Register the custom metrics API
kubectl apply -f api.service.yaml
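# If your Prometheus deployment is customized, edit the monitoring part of prometheus-adapter/values.yaml to match; otherwise leave it unchanged
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Run this from the directory one level above prometheus-adapter/ (this assumes the chart has been unpacked there, for example with: helm pull prometheus-community/prometheus-adapter --untar)
helm upgrade --install prometheus-adapter prometheus-adapter/ -n default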
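
Once the adapter is running, a per-Pod metric it exposes can be read through the aggregated custom metrics API registered by api.service.yaml. The sketch below only builds the request path. Whether DEV_GPU_UTIL_current is actually served depends on the adapter rules in use, which this post does not show, so the metric name and the triton namespace are taken from the HPA manifest further down rather than verified. pod_metric_path is a hypothetical helper.

```python
from urllib.parse import quote


def pod_metric_path(namespace, metric, pod="*",
                    group="custom.metrics.k8s.io", version="v1beta1"):
    """Build the custom metrics API path for a per-Pod metric ('*' means all Pods)."""
    return (
        f"/apis/{group}/{version}"
        f"/namespaces/{quote(namespace, safe='')}"
        f"/pods/{quote(pod, safe='*')}"
        f"/{quote(metric, safe='')}"
    )


path = pod_metric_path("triton", "DEV_GPU_UTIL_current")
# The path can be passed to kubectl to query the API server directly.
command = f'kubectl get --raw "{path}"'
```

If that request returns no items, the HPA will not be able to read the metric either, so it is worth checking before applying the autoscaler.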

gpu-hpa-pod-behavior.yaml

# autoscaling/v2beta2 was removed in Kubernetes 1.26; on 1.23 or later use autoscaling/v2 instead
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: triton
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: face-triton-server-deployment
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DEV_GPU_UTIL_current
      target:
        type: AverageValue
        averageValue: 40
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60

kubectl apply -f gpu-hpa-pod-behavior.yaml
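
How the autoscaler turns the averageValue of 40 into a replica count can be approximated with the formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas. The sketch below is a simplification: it assumes every Pod reports a value, uses the default 10% tolerance, and ignores the stabilization window and Pod readiness. The utilization numbers are made up.

```python
import math


def desired_replicas(current_replicas, pod_values, target_average,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for a Pods metric with an AverageValue target."""
    current_average = sum(pod_values) / len(pod_values)
    ratio = current_average / target_average
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# One Pod at 85% GPU utilization against a target of 40: scale up, capped at maxReplicas.
scale_up = desired_replicas(1, [85.0], 40, min_replicas=1, max_replicas=2)
# Two Pods averaging 12%: scale down to one.
scale_down = desired_replicas(2, [10.0, 14.0], 40, min_replicas=1, max_replicas=2)
```

With maxReplicas set to 2, any sustained average above roughly 44% on a single Pod is enough to reach the ceiling, so the limit may need raising if more headroom is wanted.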
