Operating system: Ubuntu 22.04 (x86 architecture)
NPU: 910B
1. Driver installation
1.1 Create the user and user group
groupadd HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
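The one-liner above fails on a second run, because groupadd returns an error when the group already exists. A minimal sketch of a rerunnable variant follows. The helper names run and ensure_npu_user are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 (as root) to run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# ensure_npu_user [GROUP] [USER] is a hypothetical helper name.
ensure_npu_user() {
    local group="${1:-HwHiAiUser}" user="${2:-HwHiAiUser}"
    # Create the group only if the system does not already know it.
    if ! getent group "$group" >/dev/null 2>&1; then
        run groupadd "$group"
    fi
    # Create the user with the same options as the command above.
    if ! id -u "$user" >/dev/null 2>&1; then
        run useradd -g "$group" -d "/home/$user" -m -s /bin/bash "$user"
    fi
}
```

Calling ensure_npu_user with no arguments uses HwHiAiUser for both the group and the user.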
1.2 Download the driver and firmware
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
1.3 Install the driver
chmod +x Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
./Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run --full --install-for-all
1.4 Install the firmware
chmod +x Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
./Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run --full
1.5 Verify the installation
npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3 Version: 23.0.3 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B2C | OK | 87.2 33 0 / 0 |
| 0 | 0000:5A:00.0 | 0 0 / 0 3160 / 65536 |
+===========================+===============+====================================================+
| 1 910B2C | OK | 89.2 35 0 / 0 |
| 0 | 0000:19:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 2 910B2C | OK | 87.8 35 0 / 0 |
| 0 | 0000:49:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 3 910B2C | OK | 89.8 35 0 / 0 |
| 0 | 0000:39:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 4 910B2C | OK | 87.9 34 0 / 0 |
| 0 | 0000:DA:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 5 910B2C | OK | 99.5 35 0 / 0 |
| 0 | 0000:99:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 6 910B2C | OK | 90.8 36 0 / 0 |
| 0 | 0000:B8:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 7 910B2C | OK | 90.4 35 0 / 0 |
| 0 | 0000:C8:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 8 910B2C | OK | 87.0 34 0 / 0 |
| 0 | 0000:59:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 9 910B2C | OK | 89.4 34 0 / 0 |
| 0 | 0000:18:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 10 910B2C | OK | 91.0 33 0 / 0 |
| 0 | 0000:48:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 11 910B2C | OK | 93.6 36 0 / 0 |
| 0 | 0000:38:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 12 910B2C | OK | 91.5 36 0 / 0 |
| 0 | 0000:D9:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 13 910B2C | OK | 99.7 36 0 / 0 |
| 0 | 0000:98:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 14 910B2C | OK | 88.6 33 0 / 0 |
| 0 | 0000:B9:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
| 15 910B2C | OK | 96.5 36 0 / 0 |
| 0 | 0000:C9:00.0 | 0 0 / 0 3159 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
| No running processes found in NPU 8 |
+===========================+===============+====================================================+
| No running processes found in NPU 9 |
+===========================+===============+====================================================+
| No running processes found in NPU 10 |
+===========================+===============+====================================================+
| No running processes found in NPU 11 |
+===========================+===============+====================================================+
| No running processes found in NPU 12 |
+===========================+===============+====================================================+
| No running processes found in NPU 13 |
+===========================+===============+====================================================+
| No running processes found in NPU 14 |
+===========================+===============+====================================================+
| No running processes found in NPU 15 |
+===========================+===============+====================================================+
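With 16 devices, checking the Health column by eye is easy to get wrong. The sketch below counts the NPU rows and how many of them report OK. It assumes the table layout shown above (NPU id and name in the first column, Health in the second), and npu_health_summary is a hypothetical helper name, not part of npu-smi.

```shell
#!/usr/bin/env bash
# npu_health_summary reads `npu-smi info` text on stdin and prints
# "total=<NPU rows> ok=<rows whose Health is OK>".
npu_health_summary() {
    awk -F'|' '
        {
            # NPU rows carry exactly two tokens in the first column: "<id> <name>".
            n = split($2, cols, " ")
            health = $3
            gsub(/[ \t]/, "", health)
            if (n == 2 && cols[1] ~ /^[0-9]+$/) {
                total++
                if (health == "OK") ok++
            }
        }
        END { printf "total=%d ok=%d\n", total, ok }
    '
}
```

A typical call would be `npu-smi info | npu_health_summary`. Header rows, Chip rows and the process table are skipped because their first column does not match the id-plus-name pattern.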
2. Connecting Ascend NPUs to Kubernetes 1.25
2.1 Create the NPU device plugin
Source: mind-cluster (the mind-cluster component code repository on Gitee.com)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ascend-device-plugin-sa-910
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-role-910
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
  - apiGroups: [ "" ]
    resources: [ "nodes/proxy" ]
    verbs: [ "get" ]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update", "list", "watch"]
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "create" ]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-rolebinding-910
subjects:
  - kind: ServiceAccount
    name: ascend-device-plugin-sa-910
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: pods-node-ascend-device-plugin-role-910
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ascend-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ascend-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      ##### For Kubernetes versions lower than 1.19, seccomp is used with annotations.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      labels:
        name: ascend-device-plugin-ds
    spec:
      ##### For Kubernetes version 1.19 and above, seccomp is used with securityContext:seccompProfile
      # securityContext:
      #   seccompProfile:
      #     type: RuntimeDefault
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: huawei.com/Ascend910
          operator: Exists
          effect: NoSchedule
        - key: "device-plugin"
          operator: "Equal"
          value: "v2"
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      nodeSelector:
        accelerator: huawei-Ascend910
      serviceAccountName: ascend-device-plugin-sa-910
      containers:
        - image: ascend-k8sdeviceplugin:v3.0.0
          name: device-plugin-01
          resources:
            requests:
              memory: 500Mi
              cpu: 500m
            limits:
              memory: 500Mi
              cpu: 500m
          command: [ "/bin/bash", "-c", "--"]
          args: [ "device-plugin -useAscendDocker=true
                   -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
          securityContext:
            privileged: true
            readOnlyRootFilesystem: true
          imagePullPolicy: Never
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: pod-resource
              mountPath: /var/lib/kubelet/pod-resources
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: log-path
              mountPath: /var/log/mindx-dl/devicePlugin
            - name: tmp
              mountPath: /tmp
            - name: lingqu-log
              mountPath: /var/log/lingqu
            - name: containerd
              mountPath: /run/containerd
              readOnly: true
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: data-trace-file-dir
              mountPath: /user/cluster-info/datatrace-config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: data-trace-file-dir
          hostPath:
            path: /user/cluster-info/datatrace-config
            type: DirectoryOrCreate
        - name: tmp
          hostPath:
            path: /tmp
        - name: lingqu-log
          hostPath:
            path: /var/log/lingqu
            type: DirectoryOrCreate
        - name: containerd
          hostPath:
            path: /run/containerd # update the directory where the containerd.sock file is located When using older version of Docker
        - name: localtime
          hostPath:
            path: /etc/localtime
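The DaemonSet above only schedules onto nodes labeled accelerator=huawei-Ascend910, and imagePullPolicy: Never means the ascend-k8sdeviceplugin:v3.0.0 image must already be present on each node. The log-path volume uses type: Directory, so /var/log/mindx-dl/devicePlugin has to exist on the host as well. One possible way to label a node and apply the manifest is sketched below. The helper names, the node name and the manifest file name are placeholders, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# deploy_device_plugin NODE MANIFEST (hypothetical helper name)
deploy_device_plugin() {
    local node="$1" manifest="$2"
    # The nodeSelector in the DaemonSet matches only nodes carrying this label.
    run kubectl label nodes "$node" accelerator=huawei-Ascend910 --overwrite
    # Create the ServiceAccount, RBAC objects and DaemonSet from the manifest.
    run kubectl apply -f "$manifest"
    # List the plugin pods to confirm they were scheduled.
    run kubectl get pods -n kube-system -l name=ascend-device-plugin-ds -o wide
}
```

For example, `deploy_device_plugin <node-name> device-plugin.yaml` prints the three kubectl commands for review, assuming the manifest was saved as device-plugin.yaml.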
2.2 Download and install Ascend Docker Runtime
2.2.1 Download link
wget -c https://gitee.com/ascend/ascend-docker-runtime/releases/download/v6.0.0-RC3/Ascend-docker-runtime_{version}_linux-{arch}.run
2.2.2 Installation steps for the containerd runtime
- Add execute permission (the package cannot be run until this is done)
chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run
- Verify the integrity of the installation package
./Ascend-docker-runtime_{version}_linux-{arch}.run --check
Example output:
[WARNING]: --check is meaningless...
Verifying archive integrity... All good.
- Install the runtime
  - Install to the default path
  ./Ascend-docker-runtime_{version}_linux-{arch}.run --install
  - Install to a custom path
  ./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path=<custom-path>
Success message:
[INFO] Ascend Docker Runtime install success
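When these steps are scripted, the two messages quoted above can serve as pass/fail markers. The sketch below only matches the text as it appears in this guide. The helper names are hypothetical, and other installer versions may word the messages differently.

```shell
#!/usr/bin/env bash
# Both helpers read installer output on stdin and succeed only if the
# expected message from this guide is present. (Hypothetical helper names.)
runtime_archive_ok() {
    grep -qF 'Verifying archive integrity... All good.'
}

runtime_install_ok() {
    grep -qF '[INFO] Ascend Docker Runtime install success'
}
```

One way to use them is to save the installer output to a file, for example with `2>&1 | tee install.log`, and then run `runtime_install_ok < install.log`.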
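2.2.3 Modify the containerd configuration
- If no configuration file exists, generate the default one
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
- If a configuration file already exists, open it for editing
vim /etc/containerd/config.toml
- Update the runtime path
[plugins."io.containerd.runtime.v1.linux"]
runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime" # replace with the actual path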
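The runtime path can also be updated without opening an editor. The sketch below rewrites only the runtime key inside the [plugins."io.containerd.runtime.v1.linux"] section and leaves every other line untouched. set_ascend_runtime is a hypothetical helper name, and the sketch assumes the section and key already exist, as they do in a file generated by containerd config default.

```shell
#!/usr/bin/env bash
# set_ascend_runtime CONFIG RUNTIME_PATH (hypothetical helper name)
# Rewrites `runtime = ...` inside the v1 linux runtime section of CONFIG.
set_ascend_runtime() {
    local config="$1" runtime_path="$2" tmp
    tmp="$(mktemp)" || return 1
    if awk -v rt="$runtime_path" '
        # A line starting with "[" opens a new section; note whether it is the target.
        /^[ \t]*\[/ { in_section = ($0 ~ /io\.containerd\.runtime\.v1\.linux/) }
        # Inside the target section, replace the runtime key and keep its indentation.
        in_section && /^[ \t]*runtime[ \t]*=/ {
            indent = $0
            sub(/[^ \t].*$/, "", indent)
            printf "%sruntime = \"%s\"\n", indent, rt
            next
        }
        { print }
    ' "$config" > "$tmp"; then
        # Write back in place so the file keeps its owner and permissions.
        cat "$tmp" > "$config"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A typical call would be `set_ascend_runtime /etc/containerd/config.toml /usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime`. Back up the file first, and restart containerd afterwards (for example with `systemctl restart containerd`) so the change takes effect.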