Ascend 910B NPU Deployment Guide

Operating system: Ubuntu 22.04 (x86_64)

NPU: 910B

1. Driver Installation

1.1 Create the user and user group

groupadd HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
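
The command above fails if the group or user already exists, for example when this step is re-run. A minimal sketch of an idempotent variant follows, assuming a root shell. The helper names `group_exists`, `user_exists`, and `ensure_hwhiai_user` are illustrative and not part of any Ascend tooling.

```bash
#!/usr/bin/env bash
# Sketch: create the HwHiAiUser group and user only when they are missing.
# Must be run as root; helper names are hypothetical.

# True if a group with this name is listed in /etc/group.
group_exists() { grep -q "^$1:" /etc/group; }

# True if a user with this name can be resolved.
user_exists() { id -u "$1" >/dev/null 2>&1; }

ensure_hwhiai_user() {
  local name="HwHiAiUser"
  group_exists "$name" || groupadd "$name"
  user_exists "$name" || useradd -g "$name" -d "/home/$name" -m "$name" -s /bin/bash
}
```

Calling `ensure_hwhiai_user` a second time is then a no-op instead of an error.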

1.2 Download the driver and firmware

Official download page: Ascend Community, Community Edition, Firmware and Drivers

wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run

1.3 Install the driver

chmod +x Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
./Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run --full --install-for-all

1.4 Install the firmware

chmod +x Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
./Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run --full

1.5 Verify the installation

npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3                   Version: 23.0.3                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2C              | OK            | 87.2        33                0    / 0             |
| 0                         | 0000:5A:00.0  | 0           0    / 0          3160 / 65536         |
+===========================+===============+====================================================+
| 1     910B2C              | OK            | 89.2        35                0    / 0             |
| 0                         | 0000:19:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 2     910B2C              | OK            | 87.8        35                0    / 0             |
| 0                         | 0000:49:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 3     910B2C              | OK            | 89.8        35                0    / 0             |
| 0                         | 0000:39:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 4     910B2C              | OK            | 87.9        34                0    / 0             |
| 0                         | 0000:DA:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 5     910B2C              | OK            | 99.5        35                0    / 0             |
| 0                         | 0000:99:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 6     910B2C              | OK            | 90.8        36                0    / 0             |
| 0                         | 0000:B8:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 7     910B2C              | OK            | 90.4        35                0    / 0             |
| 0                         | 0000:C8:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 8     910B2C              | OK            | 87.0        34                0    / 0             |
| 0                         | 0000:59:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 9     910B2C              | OK            | 89.4        34                0    / 0             |
| 0                         | 0000:18:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 10    910B2C              | OK            | 91.0        33                0    / 0             |
| 0                         | 0000:48:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 11    910B2C              | OK            | 93.6        36                0    / 0             |
| 0                         | 0000:38:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 12    910B2C              | OK            | 91.5        36                0    / 0             |
| 0                         | 0000:D9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 13    910B2C              | OK            | 99.7        36                0    / 0             |
| 0                         | 0000:98:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 14    910B2C              | OK            | 88.6        33                0    / 0             |
| 0                         | 0000:B9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 15    910B2C              | OK            | 96.5        36                0    / 0             |
| 0                         | 0000:C9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 8                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 9                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 10                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 11                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 12                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 13                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 14                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 15                                                           |
+===========================+===============+====================================================+
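
With 16 devices the table is long to check by eye. The sketch below counts the devices whose Health column reads OK. It assumes the table layout shown above, which may differ between npu-smi versions, and `count_healthy_npus` is an illustrative helper name.

```bash
#!/usr/bin/env bash
# Sketch: count NPUs reported as healthy in `npu-smi info` output (read from stdin).
# Assumes the column layout shown above; the function name is hypothetical.
count_healthy_npus() {
  awk -F'|' '
    # Device rows split into at least 4 fields on "|":
    # field 2 holds "<npu id> <name>", field 3 holds the Health value.
    NF >= 4 {
      n = split($2, cols, " ")
      health = $3
      gsub(/ /, "", health)
      if (n == 2 && cols[1] ~ /^[0-9]+$/ && health == "OK") ok++
    }
    END { print ok + 0 }
  '
}
```

Run it as `npu-smi info | count_healthy_npus`; for the output shown above it prints 16.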

2. Connecting Kubernetes 1.25 to Ascend NPUs

2.1 Create the NPU device plugin

Source: mind-cluster component code repository (Gitee.com)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ascend-device-plugin-sa-910
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-role-910
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
  - apiGroups: [ "" ]
    resources: [ "nodes/proxy" ]
    verbs: [ "get" ]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update", "list", "watch"]
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "create" ]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-rolebinding-910
subjects:
  - kind: ServiceAccount
    name: ascend-device-plugin-sa-910
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: pods-node-ascend-device-plugin-role-910
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ascend-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ascend-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      ##### For Kubernetes versions lower than 1.19, seccomp is used with annotations.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      labels:
        name: ascend-device-plugin-ds
    spec:
      ##### For Kubernetes version 1.19 and above, seccomp is used with securityContext:seccompProfile
#      securityContext:
#        seccompProfile:
#          type: RuntimeDefault
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: huawei.com/Ascend910
          operator: Exists
          effect: NoSchedule
        - key: "device-plugin"
          operator: "Equal"
          value: "v2"
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      nodeSelector:
        accelerator: huawei-Ascend910
      serviceAccountName: ascend-device-plugin-sa-910
      containers:
      - image: ascend-k8sdeviceplugin:v3.0.0
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin  -useAscendDocker=true
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        imagePullPolicy: Never
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: pod-resource
            mountPath: /var/lib/kubelet/pod-resources
          - name: hiai-driver
            mountPath: /usr/local/Ascend/driver
            readOnly: true
          - name: log-path
            mountPath: /var/log/mindx-dl/devicePlugin
          - name: tmp
            mountPath: /tmp
          - name: lingqu-log
            mountPath: /var/log/lingqu
          - name: containerd
            mountPath: /run/containerd
            readOnly: true
          - name: localtime
            mountPath: /etc/localtime
            readOnly: true
          - name: data-trace-file-dir
            mountPath: /user/cluster-info/datatrace-config
        env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          - name: HOST_IP
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: data-trace-file-dir
          hostPath:
            path: /user/cluster-info/datatrace-config
            type: DirectoryOrCreate
        - name: tmp
          hostPath:
            path: /tmp
        - name: lingqu-log
          hostPath:
            path: /var/log/lingqu
            type: DirectoryOrCreate
        - name: containerd
          hostPath:
            path: /run/containerd # with an older Docker version, change this to the directory that contains containerd.sock
        - name: localtime
          hostPath:
            path: /etc/localtime
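
Two details of the manifest affect deployment: the `nodeSelector` only matches nodes labeled `accelerator=huawei-Ascend910`, and the `log-path` volume uses `type: Directory`, so `/var/log/mindx-dl/devicePlugin` must already exist on each NPU node. The sketch below labels a node and applies the manifest. The node name and manifest file name are placeholders, and the `KUBECTL` override is only there to allow a dry run.

```bash
#!/usr/bin/env bash
# Sketch: label an NPU node and deploy the device-plugin DaemonSet.
# Node name and manifest path are hypothetical arguments.
# On each NPU node, create the log directory first:
#   mkdir -p /var/log/mindx-dl/devicePlugin
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to print the commands instead

deploy_device_plugin() {
  local node="$1" manifest="$2"
  # The DaemonSet nodeSelector requires this label on the node.
  $KUBECTL label node "$node" accelerator=huawei-Ascend910 --overwrite &&
  $KUBECTL apply -f "$manifest" &&
  # Block until the DaemonSet pods have been rolled out.
  $KUBECTL rollout status daemonset/ascend-device-plugin-daemonset -n kube-system
}
```

For example, `deploy_device_plugin npu-node-1 device-plugin.yaml`, where both arguments are placeholders for your own node name and the file holding the manifest above.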

2.2 Download and install Ascend Docker Runtime

2.2.1 Download link

wget -c https://gitee.com/ascend/ascend-docker-runtime/releases/download/v6.0.0-RC3/Ascend-docker-runtime_{version}_linux-{arch}.run

2.2.2 Installation steps for the containerd runtime

  1. Make the installer executable
chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run
  2. Verify the package integrity
./Ascend-docker-runtime_{version}_linux-{arch}.run --check

Sample output

[WARNING]: --check is meaningless...
Verifying archive integrity... All good.
  3. Install the runtime
  • Install to the default path
./Ascend-docker-runtime_{version}_linux-{arch}.run --install
  • Install to a custom path
./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path=<custom-path>

Message on success

[INFO] Ascend Docker Runtime install success
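
The installation steps can be chained in one function that sets the executable bit first, stops at the first failure, and looks for the success line shown above. This is a sketch: it assumes the installer returns a non-zero exit status when `--check` or `--install` fails, and `install_runtime` is an illustrative name.

```bash
#!/usr/bin/env bash
# Sketch: make the package executable, verify it, install it, and confirm the result.
# The argument is the downloaded .run file; the function name is hypothetical.
install_runtime() {
  local pkg="$1" log
  # A bare file name would be looked up in PATH, so anchor it to the current directory.
  case "$pkg" in */*) ;; *) pkg="./$pkg" ;; esac
  chmod u+x "$pkg" || return 1
  # Verify archive integrity before installing anything.
  "$pkg" --check || return 1
  # Capture the installer output so the success line can be checked.
  log="$("$pkg" --install 2>&1)" || return 1
  printf '%s\n' "$log"
  printf '%s\n' "$log" | grep -q "Ascend Docker Runtime install success"
}
```

The function returns 0 only when the success message appears, so it can be used as `install_runtime <package>.run && echo done`.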

2.2.3 Modify the containerd configuration

  • If no configuration file exists
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
  • If a configuration file already exists
vim /etc/containerd/config.toml
  1. Update the runtime path
[plugins."io.containerd.runtime.v1.linux"]
  runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"  # replace with the actual install path
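
After editing, it helps to confirm which runtime the file actually configures. The sketch below reads the `runtime` key from the `io.containerd.runtime.v1.linux` section; `configured_runtime` is an illustrative helper, and the parsing assumes the simple one-key-per-line layout that `containerd config default` produces.

```bash
#!/usr/bin/env bash
# Sketch: print the runtime path set under [plugins."io.containerd.runtime.v1.linux"].
# The argument is the path to config.toml; the function name is hypothetical.
configured_runtime() {
  awk '
    # Track whether the current TOML section is the v1.linux runtime section.
    /^[ \t]*\[/ { in_section = ($0 ~ /io\.containerd\.runtime\.v1\.linux/) }
    # Match the "runtime" key only, not runtime_type or runtime_root.
    in_section && /^[ \t]*runtime[ \t]*=/ {
      split($0, parts, "\"")
      print parts[2]
      exit
    }
  ' "$1"
}
```

For example, `configured_runtime /etc/containerd/config.toml` should print the path set in the step above. containerd reads this file at startup, so a restart such as `systemctl restart containerd` is typically needed before the change takes effect.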

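### How to deploy DeepSeek with MindSpore on the Ascend 910B platform

#### Preparation

Preparation on the Ascend 910B platform is essential for deploying DeepSeek successfully. Make sure the required packages and dependencies are installed, including but not limited to a Python environment, the MindSpore framework, and the hardware-specific support libraries[^1].

#### Configure the development environment

Configuring a development environment for the Ascend 910B involves setting up the correct compiler version and the other toolchain components. This step is particularly important for achieving the best performance. Following the instructions in the official documentation is recommended[^2].

#### Install MindSpore

A MindSpore build optimized for the Ascend 910B can be obtained from the official website and installed according to the instructions given there. Because the installation procedure may differ between operating systems, follow the installation guide for your operating system.

```bash
pip install mindspore-ascend==1.x.x -i https://pypi.tuna.tsinghua.edu.cn/simple
```

#### Download the pretrained model

Download the pretrained DeepSeek-R1 671B weight files from ModelZoo or another trusted source. These weights initialize the network parameters for subsequent fine-tuning or inference.

#### Load and convert the model structure definition

Use the provided scripts or a custom code snippet to read the saved checkpoint files and convert them into a form that the current runtime can load. Take care that the dataflow graph defined in the source code matches its mapping onto the actual physical devices.

```python
import numpy as np
from mindspore import Tensor, load_checkpoint, load_param_into_net

# Assumes `net` is an already constructed model instance
param_dict = load_checkpoint("path/to/deepseek_r1_671b.ckpt")
load_param_into_net(net, param_dict)

# Placeholder input; replace it with the shape and dtype the model expects
input_tensor = Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
output = net(input_tensor)
print(output.shape)
```

#### Set up monitoring

Given the characteristics of the Ascend 910B, integrating an external monitoring service such as Prometheus is recommended to track key performance metrics such as NPU memory usage and data throughput. This helps surface potential bottlenecks early so that they can be addressed.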