OpenPAI v1.5.0: A Detailed Installation Guide
I. Environment Preparation
Machine | Hardware requirements | Software requirements |
---|---|---|
dev box machine | | |
master machine | | |
GPU worker | | |
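II. Installation
1. After installing the operating system, install the following prerequisites:
sudo apt install openssh-server (installs the SSH server)
sudo apt install ntp
2. Configure passwordless SSH access from the dev box machine to the other machines:
ssh-keygen -t rsa (press Enter at every prompt)
ssh-copy-id (username)@(IP address) (copies the public key to another node; repeat for every node)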
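With several nodes, the key distribution step can be wrapped in a small loop. The sketch below is only an illustration: the function name distribute_key, the DRY_RUN switch, and the user name and IP addresses in the usage line are all hypothetical. By default it only prints the commands it would run.

```shell
# Hypothetical helper: copy the dev box public key to every node in a list.
# With DRY_RUN=1 (the default in this sketch) it only prints the commands.
distribute_key() {
  local user="$1"
  shift
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh-copy-id ${user}@${host}"
    else
      ssh-copy-id "${user}@${host}"
    fi
  done
}

# Usage (hypothetical user name and addresses):
#   DRY_RUN=0 distribute_key paiuser 10.0.0.1 10.0.0.2 10.0.0.3
```

Running it once with the default dry-run setting is a cheap way to confirm the host list before any key is actually copied.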
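3. Install the NVIDIA driver on the GPU nodes; the nvidia-418 driver is the recommended choice here.
First, blacklist the nouveau driver:
sudo vi /etc/modprobe.d/blacklist.conf
Append the line blacklist nouveau at the end of the file,
then save and exit.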
sudo update-initramfs -u
sudo reboot
Check whether the nouveau driver is still loaded (no output means it has been disabled):
lsmod | grep nouveau
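When several GPU nodes need the same change, editing the file by hand can be replaced by an idempotent append. The following is a minimal sketch; the function name ensure_nouveau_blacklisted is hypothetical, and writing to /etc/modprobe.d/blacklist.conf requires root privileges.

```shell
# Hypothetical helper: append "blacklist nouveau" to a modprobe config file
# only if that exact line is not present yet, so repeated runs add it once.
ensure_nouveau_blacklisted() {
  local conf="$1"
  if ! grep -qx 'blacklist nouveau' "$conf" 2>/dev/null; then
    echo 'blacklist nouveau' >> "$conf"
  fi
}

# Usage (from a root shell, followed by the same steps as in the text):
#   ensure_nouveau_blacklisted /etc/modprobe.d/blacklist.conf
#   update-initramfs -u && reboot
```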
Install the NVIDIA driver:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-418
sudo reboot
Verify that the driver was installed successfully:
nvidia-smi
4. Install Docker on every node, and also install nvidia-container-runtime; see the official site nvidia.github.io/nvidia-container-runtime/
1. List the available docker-ce versions; I installed the latest one:
sudo apt-cache madison docker-ce
sudo apt-get install docker-ce=5:20.10.5~3-0~ubuntu-xenial
After installation, change Docker's default runtime to nvidia:
sudo vi /etc/docker/daemon.json
Add the following content:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Restart the Docker service:
sudo systemctl daemon-reload
sudo systemctl restart docker
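Test whether containers can use the GPU directly:
sudo docker run --rm nvidia/cuda:10.0-base nvidia-smi (GPU information in the output means success. Note: do not add --gpus=all or --runtime=nvidia. With either flag the container no longer relies on the default runtime, so the test would not show that the GPU is reachable directly.)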
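Before starting a container, the daemon.json setting itself can be checked quickly. The sketch below is a rough text match rather than a real JSON parse, and the function name has_nvidia_default_runtime is hypothetical.

```shell
# Hypothetical helper: succeed if the given daemon.json sets
# "default-runtime" to "nvidia". This is a plain text match, not a JSON parser.
has_nvidia_default_runtime() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Usage:
#   has_nvidia_default_runtime /etc/docker/daemon.json && echo "default runtime is nvidia"
# After restarting Docker, the running daemon can also report its setting:
#   sudo docker info --format '{{.DefaultRuntime}}'
```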
2. With the environment ready, start installing OpenPAI
1. On the dev box machine, clone the OpenPAI repo with the following commands:
git clone https://github.com/microsoft/pai.git
cd pai
2. Check out a tag to select the OpenPAI version to install:
git checkout v1.5.0
3. Edit the layout.yaml and config.yaml files in the contrib/kubespray/config directory of the cloned repo.
layout.yaml
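Example format (adjust the memory, CPU, GPU, and IP address values to match your machines. If a value is wrong, a later step reports an error. My GPU nodes have different amounts of memory, which kept triggering errors, so I simply commented out that check.)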
# GPU cluster example
# This is a cluster with one master node and two worker nodes
machine-sku:
  master-machine: # define a machine sku
    # the resource requirements for all the machines of this sku
    # We use the same memory format as Kubernetes, e.g. Gi, Mi
    # Reference: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
    mem: 60Gi
    cpu:
      # the number of CPU vcores
      vcore: 24
  gpu-machine:
    computing-device:
      # For `type`, please follow the same format specified in device plugin.
      # For example, `nvidia.com/gpu` is for NVIDIA GPU, `amd.com/gpu` is for AMD GPU,
      # and `enflame.com/dtu` is for Enflame DTU.
      # Reference: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
      type: nvidia.com/gpu
      model: K80
      count: 4
    mem: 220Gi
    cpu:
      vcore: 24
machine-list:
  - hostname: pai-master # name of the machine, **do not** use upper case alphabet letters for hostname
    hostip: 10.0.0.1
    machine-type: master-machine # only one master-machine supported
    pai-master: "true"
  - hostname: pai-worker1
    hostip: 10.0.0.2
    machine-type: gpu-machine
    pai-worker: "true"
  - hostname: pai-worker2
    hostip: 10.0.0.3
    machine-type: gpu-machine
    pai-worker: "true"
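Two constraints stated in the comments of the example above (no upper-case letters in hostnames, and only one master-machine) can be checked before running the installer. The sketch below assumes the simple one-key-per-line layout shown in the example; it uses plain text matching rather than a YAML parser, and both function names are hypothetical.

```shell
# Hypothetical helper: print every hostname in a layout file that contains
# an upper-case letter (the layout comments forbid them).
uppercase_hostnames() {
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*hostname:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" \
    | grep '[[:upper:]]' || true
}

# Hypothetical helper: count the machines whose machine-type is
# master-machine (the layout comments allow only one).
count_master_machines() {
  grep -Ec '^[[:space:]]*machine-type:[[:space:]]*master-machine([[:space:]]|#|$)' "$1" || true
}

# Usage:
#   uppercase_hostnames layout.yaml      # prints nothing when all names are fine
#   count_master_machines layout.yaml    # should print 1
```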
config.yaml
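Example format (change the username and password. As for the image registry addresses, the mirrors inside China are often unreliable, so I downloaded everything manually via an Alibaba Cloud server.)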
user: forexample
password: forexample
docker_image_tag: v1.5.0
# Optional
#######################################################################
# OpenPAI Customized Settings #
#######################################################################
# enable_hived_scheduler: true
#############################################
# Ansible-playbooks' inventory hosts' vars. #
#############################################
# ssh_key_file_path: /path/to/you/key/file
#####################################
# OpenPAI's service image registry. #
#####################################
# docker_registry_domain: docker.io
# docker_registry_namespace: openpai
# docker_registry_username: exampleuser
# docker_registry_password: examplepasswd
################################################################
# OpenPAI's daemon qos config. #
# By default, the QoS class for PAI daemon is BestEffort. #
# If you want to promote QoS class to Burstable or Guaranteed, #
# you should set the value to true. #
################################################################
# qos-switch: "false"
###########################################################################################
# Pre-check setting #
###########################################################################################
# docker_check: true
# resource_check: true
########################################################################################
# Advanced docker configuration. If you are not familiar with them, don't change them. #
########################################################################################
# docker_data_root: /mnt/docker
# docker_config_file_path: /etc/docker/daemon.json
# docker_iptables_enabled: false
## An obvious use case is allowing insecure-registry access to self hosted registries.
## Can be ipaddress and domain_name.
## example define 172.19.16.11 or mirror.registry.io
# openpai_docker_insecure_registries:
#   - mirror.registry.io
#   - 172.19.16.11
## Add other registry,example China registry mirror.
# openpai_docker_registry_mirrors:
#   - https://registry.docker-cn.com
#   - https://mirror.aliyuncs.com
#######################################################################
# kubespray setting #
#######################################################################
# If you couldn't access to gcr.io or docker.io, please configure it.
# gcr_image_repo: "gcr.io"
# kube_image_repo: "gcr.io/google-containers"
# quay_image_repo: "quay.io"
# docker_image_repo: "docker.io"
# etcd_image_repo: "quay.io/coreos/etcd"
# pod_infra_image_repo: "gcr.io/google_containers/pause-{{ image_arch }}"
# kubeadm_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/kubeadm"  # replace with the address that is currently valid
# hyperkube_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/hyperkube"  # replace with the address that is currently valid
# openpai_kube_network_plugin: calico
# openpai_kubespray_extra_var:
#   key: value
#   key: value
#######################################################################
# host daemon port setting #
#######################################################################
# host_daemon_port_start: 40000
# host_daemon_port_end: 65535
4. Install Kubernetes
cd pai/contrib/kubespray
/bin/bash quick-start-kubespray.sh
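During installation, some components often fail to download automatically, although downloading them by hand may still succeed. Whenever an automatic download failed, I downloaded the file manually and placed it at the expected path. The items to download are generally all listed in one file.
For example, the first download often fails, and the script deletes the file automatically after every run, which forced me to download it again on the next run. I therefore commented that step out, downloaded the file manually, and copied it into the pai-deploy/ directory.
When the automated installation reaches this point, the environment check has finished. Correct your parameters according to the error messages. If you consider the reported errors harmless, type continue to proceed, then enter N at the next prompt, and the installation carries on.
Another error appeared at this step, probably because the docker-ce version I installed differs from the one the installer expects. I disabled that step as well by editing the file: sudo nano /home/ouc/pai-deploy/kubespray/roles/container-engine/docker/tasks/main.yml
Run the script again. More files then failed to download; as before, open the URL in a browser, download the file manually, and place it at the expected path. Files that still cannot be downloaded can be fetched through a proxy.
Finally, the following output indicates that the installation succeeded: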
You can run the following commands to setup kubectl on you local host:
ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass
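Once kubectl is set up, a quick look at the node status helps confirm that every machine joined the cluster. The sketch below assumes kubectl is already configured on the dev box; the function name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS column is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is therefore reported too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports Ready.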
5. Install the OpenPAI services (these downloads usually succeed without trouble)
/bin/bash quick-start-service.sh
Finally, the following output indicates success:
Kubernetes cluster config : ~/pai-deploy/kube/config
OpenPAI cluster config : ~/pai-deploy/cluster-cfg
OpenPAI cluster ID : pai
Default username : admin
Default password : admin-password
You can go to http://<your-master-ip>, then use the default username and password to log in.