1 Installation
-
- Environment requirements
MLU100, MLU270, x5k, MLU220 devices
MLU100 driver > 3.5; MLU270 driver > 2.2.0; MLU220 driver > 4.1.1
libcndev.so >= V1.8.0
Kubernetes >= v1.11.2
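The ">=" requirements above (libcndev.so and Kubernetes) can be checked with a small helper. This is a minimal sketch, not part of the plugin: the function name version_ge is hypothetical, and it assumes a sort that supports -V (GNU coreutils). How you obtain the installed version string is left to your environment.

```shell
# version_ge A B: succeed when version A >= version B.
# A leading "v" or "V" is ignored, so "v1.11.2" and "1.11.2" compare equal.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  a=${1#[vV]}
  b=${2#[vV]}
  # After a version sort, the smaller version comes first.
  # If that is B, then A >= B.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Example usage (the first argument is whatever version your cluster reports):
#   version_ge v1.18.0 v1.11.2 && echo "Kubernetes version OK"
```

The driver requirements use a strict ">" comparison, so this helper applies to them only if equality is ruled out separately.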
-
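- Download and build
(The following steps require network access. For an offline installation, download on a node that has network access.)
1.2.1 Clone the repository:
git clone https://github.com/Cambricon/cambricon-k8s-device-plugin.git
1.2.2 Enter the installation directory:
cd cambricon-k8s-device-plugin/device-plugin
1.2.3 Build the image:
For an offline installation, first modify build_image.sh and the Dockerfile; the required changes are listed in the appendix.
./build_image.sh
Note: after the build completes, transfer the image to the installation node.
Make sure Cambricon Neuware is installed, because the image build requires libcndev.so.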
-
- Load the image
docker load -i image/cambricon-k8s-device-plugin-amd64.tar
-
- Deploy the daemon
1.4.1 Edit the YAML file
vim ./example/cambricon-device-plugin-daemonset.yaml
Configurable parameters:
args:
  - --mode=default # device plugin mode: default, sriov or env-share
  - --virtualization-num=1 # virtualization number for each MLU, used only in sriov mode or env-share mode
1.4.2 Start the daemon
kubectl create -f ./example/cambricon-device-plugin-daemonset.yaml
-
- Run workloads on MLUs
Cambricon MLUs can now be requested through container-level resource requirements, using the resource name cambricon.com/mlu.
Example:
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: 1
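
Before scheduling a pod like the one above, it can help to confirm that a node actually advertises the cambricon.com/mlu resource. The following is a rough sketch: the helper name mlu_count is hypothetical, and it assumes the node description lists resources one per line as "name: value", which is how `kubectl describe node` normally prints its Capacity and Allocatable sections.

```shell
# mlu_count: read node description text on stdin and print the first
# cambricon.com/mlu value found (normally the one under "Capacity").
# Prints nothing if the resource is not advertised.
mlu_count() {
  awk '$1 == "cambricon.com/mlu:" { print $2; exit }'
}

# Example usage (<node-name> is a placeholder for one of your nodes):
#   kubectl describe node <node-name> | mlu_count
```

An empty result suggests the device plugin daemon is not running on that node or has not yet registered its devices.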
- Troubleshooting
- The image build fails with the following errors:
Err:1 http://deb.debian.org/debian buster InRelease
Temporary failure resolving 'deb.debian.org'
Err:2 http://security.debian.org/debian-security buster/updates InRelease
Temporary failure resolving 'security.debian.org'
Err:3 http://deb.debian.org/debian buster-updates InRelease
Temporary failure resolving 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Fix:
Edit /etc/docker/daemon.json (vim /etc/docker/daemon.json)
and add the line "dns": ["114.114.114.114","8.8.8.8"]
Then restart Docker: systemctl restart docker
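
For reference, a complete daemon.json containing only this DNS setting could look like the file written by the sketch below. The helper name write_sample_daemon_json is hypothetical, and it deliberately writes to a scratch path you pass in rather than to /etc/docker/daemon.json, because an existing daemon.json may hold other settings that must be merged by hand.

```shell
# write_sample_daemon_json PATH: write a sample Docker daemon config that
# contains only the DNS servers from the fix above.
# PATH should be a scratch file for review, not the live /etc/docker/daemon.json.
write_sample_daemon_json() {
  cat > "$1" <<'EOF'
{
  "dns": ["114.114.114.114", "8.8.8.8"]
}
EOF
}

# Example usage:
#   write_sample_daemon_json ./daemon.json.sample
#   cat ./daemon.json.sample
```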
-
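- Error: file not found in build context or excluded by .dockerignore
Cause: the Dockerfile cannot access files in the parent directory, because they lie outside the build context.
Fix: copy the files into the current directory.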
-
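- The Kubernetes pause image fails to download
If the Kubernetes cluster runs on an internal network and cannot reach gcr.io, pull the pause image on a machine that can reach gcr.io, export it, and import it into the internal private Docker registry. Then add --pod-infra-container-image to the kubelet startup arguments and restart kubelet.
docker pull kubernetes/pause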
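
The export and import steps described above can be sketched as two shell functions. The registry address registry.example.local:5000, the archive name pause.tar, and the function names are all hypothetical placeholders; the docker subcommands (pull, save, load, tag, push) are standard. Setting DRY_RUN to a non-empty value prints the commands without running them, which is a convenient way to review them first.

```shell
# Hypothetical values: adjust the registry address and archive name.
REGISTRY=${REGISTRY:-registry.example.local:5000}
ARCHIVE=${ARCHIVE:-pause.tar}

# run CMD...: print the command, and execute it only when DRY_RUN is empty.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# On a machine that can reach the public registry: pull and export the image.
export_pause() {
  run docker pull kubernetes/pause
  run docker save -o "$ARCHIVE" kubernetes/pause
}

# On a machine inside the internal network: import, retag, and push the image.
import_pause() {
  run docker load -i "$ARCHIVE"
  run docker tag kubernetes/pause "$REGISTRY/pause"
  run docker push "$REGISTRY/pause"
}

# Example usage:
#   DRY_RUN=1; export_pause; import_pause   # review the commands only
```

After the push, the kubelet flag would point at the mirrored image, for example --pod-infra-container-image=registry.example.local:5000/pause with the placeholder registry replaced by your own.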
-
- Error: spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by policy
Fix: add --allow-privileged=true to the startup scripts of kube-apiserver and kubelet.
Steps:
1. On the management node: vim /etc/sysconfig/kube-apiserver
2. Set KUBE_APISERVER_OPTS='--allow-privileged=true'
3. systemctl daemon-reload
   systemctl restart kube-apiserver
   systemctl status -l kube-apiserver
4. On the compute nodes: vi /etc/sysconfig/kubelet
5. Set KUBELET_OPTS='--allow-privileged=true'
6. systemctl daemon-reload
   systemctl restart kubelet
   systemctl status -l kubelet
-
- Error: http: server gave HTTP response to HTTPS client
Fix: edit /etc/docker/daemon.json (vim /etc/docker/daemon.json)
and set { "insecure-registries":["xxxxxxxxx:5000"] }
systemctl daemon-reload
systemctl restart docker