NVIDIA GPU Monitoring

Official documentation

1. dcgm-exporter is a tool that exports NVIDIA GPU monitoring data in Prometheus format. It collects GPU metrics in real time and exposes them in the format Prometheus expects, for visualization and alerting.
2. dcgm-metrics refers to the set of counters provided by NVIDIA Data Center GPU Manager (DCGM) for collecting GPU monitoring data, such as temperature, power draw, memory usage, and compute performance; dcgm-exporter reads this counter list and outputs the metrics in Prometheus format.
3. A ServiceMonitor is the Prometheus Operator mechanism for monitoring Kubernetes services: it automatically discovers matching services in the cluster and adds them as Prometheus scrape targets. This makes it easy to monitor individual services in the cluster (for example Pod CPU and memory usage) for troubleshooting and tuning.
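To make the "Prometheus format" concrete, here is a minimal sketch of parsing one line of the exposition format that dcgm-exporter emits. It is a simplification for illustration (no label-value escaping, no timestamps); the sample line mimics real dcgm-exporter output with the label set shortened.

```python
import re

def parse_metric_line(line: str):
    """Parse one line of Prometheus exposition format into
    (name, labels dict, value). Simplified sketch: assumes a label
    block is present and ignores escaping and timestamps."""
    m = re.match(r'^(\w+)\{([^}]*)\}\s+(\S+)$', line)
    if not m:
        raise ValueError(f"unrecognized line: {line!r}")
    name, label_str, value = m.groups()
    labels = dict(
        (k, v.strip('"'))
        for k, v in (pair.split("=", 1) for pair in label_str.split(","))
    )
    return name, labels, float(value)

# A line similar to what dcgm-exporter exposes (labels shortened):
line = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-1"} 35'
name, labels, value = parse_metric_line(line)
print(name, labels["gpu"], value)  # DCGM_FI_DEV_GPU_UTIL 0 35.0
```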

Grafana visualizes the Prometheus data source set up in the earlier rook-ceph post, using the NVIDIA DCGM Exporter dashboard.

# Label the nodes that have GPUs
kubectl label nodes <nodeName> accelerator=nvidia-gpu
# Pre-pull the prometheus-adapter image (in this setup, run on the master node)
docker pull v5cn/prometheus-adapter:v0.10.0
docker tag docker.io/v5cn/prometheus-adapter:v0.10.0 k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.10.0

dcgm-metrics.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: default
data:
  # File-like keys; each key maps to the contents of a file
  default-counters.csv: |
    # Format,,
    # If line starts with a '#' it is considered a comment,,
    # DCGM FIELD, Prometheus metric type, help message
 
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
 
    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
 
    # Power,,
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
 
    # PCIE,,
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
 
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).
 
    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
 
    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
 
    # ECC,,
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
 
    # Retired pages,,
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
 
    # NVLink,,
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes
 
    # VGPU License status,,
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
 
    # Remapped rows,,
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
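The CSV above follows a simple three-column format (DCGM field, Prometheus metric type, help message), with `#` lines as comments. A minimal sketch of a validator for that format, useful for catching typos before applying the ConfigMap (the strictness here is an assumption; dcgm-exporter's own parser is more lenient):

```python
def parse_counters(csv_text: str):
    """Parse the dcgm-exporter counters CSV. Each non-comment,
    non-blank line is 'DCGM field, prometheus type, help message'.
    Sketch only: checks the type column against common Prometheus types."""
    counters = []
    for raw in csv_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        field, ptype, help_msg = (p.strip() for p in line.split(",", 2))
        assert ptype in {"gauge", "counter", "histogram", "summary"}, ptype
        counters.append((field, ptype, help_msg))
    return counters

sample = """
# Utilization,,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
"""
print(parse_counters(sample))
```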

dcgm-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        accelerator: nvidia-gpu
 
---
 
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400

service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
  namespace: default
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    relabelings:
    - action: replace
      sourceLabels:  [__meta_kubernetes_endpoint_node_name]
      targetLabel: node_name
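The relabeling rule above copies the discovered endpoint's node name into a `node_name` label on every scraped series. A sketch of what the `replace` action does with the default regex `(.*)` and replacement `$1` (simplified: real Prometheus relabeling also supports regex matching, modulus, and other actions):

```python
def relabel_replace(labels, source_labels, target_label, separator=";"):
    """Sketch of the Prometheus 'replace' relabel action with default
    regex/replacement: join the source label values with the separator
    and write the result into the target label."""
    value = separator.join(labels.get(s, "") for s in source_labels)
    out = dict(labels)
    out[target_label] = value
    return out

target = {"__meta_kubernetes_endpoint_node_name": "gpu-node-1",
          "job": "dcgm-exporter"}
print(relabel_replace(target,
                      ["__meta_kubernetes_endpoint_node_name"],
                      "node_name"))
```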

api.service.yaml

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter
    namespace: default
    port: 443
  version: v1beta1
  versionPriority: 100
# dcgm-exporter
kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# service monitor
kubectl apply -f service-monitor.yaml
 
kubectl apply -f api.service.yaml
# If you gave Prometheus a custom name/namespace, update the prometheus settings in prometheus-adapter/values.yaml accordingly; otherwise leave them unchanged
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Run from the directory one level above prometheus-adapter/
helm upgrade --install prometheus-adapter prometheus-adapter/ -n default

gpu-hpa-pod-behavior.yaml

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: triton
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: face-triton-server-deployment
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DEV_GPU_UTIL_current
      target:
        type: AverageValue
        averageValue: 40
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60

kubectl apply -f gpu-hpa-pod-behavior.yaml
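With an `AverageValue` target, the HPA's documented algorithm reduces to `desired = ceil(sum(metric across pods) / target)`, then clamped to `minReplicas`/`maxReplicas`. A minimal sketch of that arithmetic (the clamping and stabilization window are omitted; pod names and values are illustrative):

```python
import math

def desired_replicas(metric_values, target_average):
    """HPA desired-replica count for an AverageValue target:
    ceil(total metric / target). Sketch of the documented formula;
    min/max clamping and the stabilization window happen elsewhere."""
    return math.ceil(sum(metric_values) / target_average)

# One pod reporting 75 (DEV_GPU_UTIL_current) against a target of 40:
print(desired_replicas([75], 40))  # 2 -> HPA scales from 1 to 2 pods
```

With `maxReplicas: 2`, the HPA above caps out at two pods even if utilization keeps climbing, and `stabilizationWindowSeconds: 60` delays scale-down to avoid flapping.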

DCGM is short for NVIDIA Data Center GPU Manager, and Prometheus is an open-source monitoring and alerting system; integrating the two gives you visibility into the performance and health of NVIDIA GPUs in a data-center environment. DCGM supplies GPU temperature, power draw, memory usage, and performance metrics; Prometheus collects and stores these as time series over HTTP and offers a flexible query language for analysis. By exporting DCGM data in a format Prometheus understands, the metrics can be persisted for long-term analysis, visualized in dashboards, and wired into alerting rules, so that problems such as overheating, abnormal power draw, or memory exhaustion trigger notifications and can be handled promptly. Together they provide a complete GPU monitoring and management solution that helps keep the data center running stably and reliably.
