OpenPAI v1.5.0: A Detailed Setup Tutorial

Latest recommended article published on 2022-10-24 15:49:14

风道北来

Tags: docker, deep learning

Article link: https://blog.csdn.net/qq_39045077/article/details/114971280

I. Environment Preparation

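Each machine type has its own hardware and software requirements.

dev box machine

Hardware requirements:
It can communicate with all other machines (master and worker machines).
It is a separate machine, distinct from the master and worker machines.
It can access the Internet, in particular Docker Hub. The deployment process pulls Docker images from Docker Hub.

Software requirements:
Ubuntu 16.04 (18.04 and 20.04 should work but have not been fully tested).
The SSH service is enabled.
It can log in to all master and worker machines without a password.
Docker is installed correctly.

master machine

Hardware requirements:
At least 40GB of memory.
It must have a fixed LAN IP address and be able to communicate with all other machines.
It can access the Internet, in particular Docker Hub. The deployment process pulls Docker images from Docker Hub.

Software requirements:
Ubuntu 16.04 (18.04 and 20.04 should work but have not been fully tested).
The SSH service is enabled.
It shares the same SSH username and password with all worker machines, and that SSH user has sudo privileges.
Docker is installed correctly.
NTP is enabled. You can run `apt install ntp` to check; the command installs the package if it is missing.
It is a dedicated OpenPAI server. OpenPAI manages all of its resources (CPU, memory, GPU, and so on). Running other workloads on it may cause unknown problems due to insufficient resources.

GPU worker machines

Hardware requirements:
At least 16GB of memory.
Each must have a fixed LAN IP address and be able to communicate with all other machines.
Each can access the Internet, in particular Docker Hub. The deployment process pulls Docker images from Docker Hub.

Software requirements:
Ubuntu 16.04 (18.04 and 20.04 should work but have not been fully tested).
The SSH service is enabled.
All master and worker machines share the same SSH username and password, and that SSH user has sudo privileges.
Docker is installed correctly.
Each is a dedicated OpenPAI server. OpenPAI manages all of its resources (CPU, memory, GPU, and so on). Running other workloads on it may cause unknown problems due to insufficient resources.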

II. Installation

1. After installing the operating system, install the following required packages

sudo apt install openssh-server   # install the SSH server
sudo apt install ntp

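2. Configure passwordless SSH access from the dev box machine to the other machines

ssh-keygen -t rsa                     # press Enter at every prompt
ssh-copy-id <username>@<ip-address>   # copy the public key to each of the other nodes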
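
With more than a couple of nodes, repeating ssh-copy-id by hand gets tedious. The following is a minimal sketch of a loop over the node addresses. The helper names `ssh_targets` and `distribute_key` are made up for illustration, and the user name and IP addresses in the usage comment are placeholders.

```bash
# Hypothetical helper: print one "user@host" target per line.
# $1 is the SSH user, the remaining arguments are host names or IPs.
ssh_targets() {
  user="$1"
  shift
  for host in "$@"; do
    printf '%s@%s\n' "$user" "$host"
  done
}

# Hypothetical helper: copy the dev box public key to every node.
# ssh-copy-id asks for each node's password once.
distribute_key() {
  for target in $(ssh_targets "$@"); do
    ssh-copy-id "$target" || echo "key copy failed for $target" >&2
  done
}

# Usage (placeholder user and addresses):
#   distribute_key forexample 10.0.0.1 10.0.0.2 10.0.0.3
```

Afterwards, `ssh <username>@<ip-address>` should log in without asking for a password.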

3. Install the NVIDIA driver on the GPU nodes; the nvidia-418 driver is the best choice here

First, disable the nouveau driver

sudo vi /etc/modprobe.d/blacklist.conf

Add blacklist nouveau as the last line

Save and exit

sudo update-initramfs -u
sudo reboot
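
The manual edit of blacklist.conf can also be done non-interactively, which helps when several GPU nodes need the same change. A minimal sketch, assuming a hypothetical helper named `ensure_line`; it has to run as root to write under /etc/modprobe.d.

```bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so running it twice does not duplicate the entry.
ensure_line() {
  file="$1"
  line="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root), followed by the same two commands as in the manual steps:
#   ensure_line /etc/modprobe.d/blacklist.conf 'blacklist nouveau'
#   update-initramfs -u
#   reboot
```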

Check whether the nouveau driver is still loaded (no output means it has been disabled)

lsmod | grep nouveau
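
For use in a script, the same check can be turned into a yes/no answer. This is only a sketch; the function name `has_nouveau` is an assumption made for illustration.

```bash
# Succeeds if the lsmod output passed as $1 lists the nouveau module itself.
# The ^ anchor avoids matching modules that merely mention nouveau in the
# "Used by" column.
has_nouveau() {
  printf '%s\n' "$1" | grep -q '^nouveau[[:space:]]'
}

# Report the state of the current machine.
if has_nouveau "$(lsmod 2>/dev/null)"; then
  echo "nouveau is still loaded"
else
  echo "nouveau is not loaded"
fi
```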

Install the NVIDIA driver

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-418
sudo reboot

Check that the driver was installed successfully

nvidia-smi
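
To confirm the driver version without reading the nvidia-smi table by eye, the version can be queried directly. A minimal sketch under the assumption that the 418 series is the target; `driver_major` and `check_driver` are hypothetical helper names.

```bash
# Extract the major version from a driver version string such as "418.87.01".
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed if the installed driver's major version equals $1.
check_driver() {
  command -v nvidia-smi >/dev/null 2>&1 || { echo "nvidia-smi not found" >&2; return 1; }
  # One line per GPU is printed; the first one is enough here.
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  [ "$(driver_major "$version")" = "$1" ]
}

# Usage:
#   check_driver 418 && echo "driver 418 is active"
```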

4. Every node needs Docker installed, together with nvidia-container-runtime; see the official site nvidia.github.io/nvidia-container-runtime/

1. List the available docker-ce versions; I installed the latest one

sudo apt-cache madison docker-ce
sudo apt-get install docker-ce=5:20.10.5~3-0~ubuntu-xenial
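
The version string can also be picked up from the madison output instead of being copied by hand. This sketch assumes the newest version is on the first line, which is how the list is usually ordered; the function name `newest_madison_version` is made up for illustration.

```bash
# Read `apt-cache madison` output on stdin and print the version column
# (the second |-separated field) of the first line.
newest_madison_version() {
  awk -F'|' 'NR == 1 { gsub(/ /, "", $2); print $2 }'
}

# Usage:
#   version="$(apt-cache madison docker-ce | newest_madison_version)"
#   sudo apt-get install "docker-ce=$version"
```

Installing the same pinned version on every node keeps the cluster consistent.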

After installation, change Docker's default runtime to nvidia

sudo vi /etc/docker/daemon.json

Add the following content

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
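
A syntax error in daemon.json (a stray comma, for example) keeps the Docker daemon from starting, so it is worth validating the file before restarting the service. A minimal sketch that relies on Python's standard json.tool module and assumes python3 is installed; `valid_json` is a hypothetical helper name.

```bash
# Succeeds if the file given as $1 parses as JSON.
valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Usage:
#   valid_json /etc/docker/daemon.json && echo "daemon.json is valid JSON"
```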

Restart the Docker service

sudo systemctl daemon-reload
sudo systemctl restart docker
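
After the restart, the daemon can be asked which runtime it now uses by default. The sketch below wraps that query; the function name `default_runtime_is_nvidia` is an assumption, and the optional argument exists only so the comparison can be tried without a running daemon.

```bash
# Succeeds when the default runtime is "nvidia".
# With an argument, that string is checked; with no argument, the local
# Docker daemon is queried via `docker info`.
default_runtime_is_nvidia() {
  if [ "$#" -gt 0 ]; then
    runtime="$1"
  else
    runtime="$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)"
  fi
  [ "$runtime" = "nvidia" ]
}

# Usage (may need sudo, depending on who can talk to the Docker socket):
#   default_runtime_is_nvidia && echo "default runtime is nvidia"
```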

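Test whether a container can use the GPU directly

sudo docker run --rm nvidia/cuda:10.0-base nvidia-smi

If GPU information is printed, the setup works. Note: do not add --gpus=all or --runtime=nvidia to this command; with either flag the test no longer exercises the default runtime set in daemon.json.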

2. With the environment in place, start installing OpenPAI

1. On the dev box machine, clone the OpenPAI repo with the following commands:

git clone https://github.com/microsoft/pai.git
cd pai

2. Check out a tag to select the OpenPAI version to install:

git checkout v1.5.0
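
Before editing any configuration, it can be useful to confirm that the working tree really sits on the intended tag. A minimal sketch; `on_tag` is a hypothetical helper, and if several tags point at the same commit, git describe reports only one of them.

```bash
# Succeeds if the repository in directory $1 has exactly tag $2 checked out.
on_tag() {
  [ "$(git -C "$1" describe --tags --exact-match 2>/dev/null)" = "$2" ]
}

# Usage (run from the directory that contains the pai clone):
#   on_tag pai v1.5.0 && echo "v1.5.0 is checked out"
```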

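3. Edit the layout.yaml and config.yaml files in the contrib/kubespray/config directory of the cloned repo

layout.yaml example (adjust the memory, CPU, GPU, and IP address values to match your machines. If a value is wrong, a later step reports an error. My GPU nodes have different amounts of memory, which kept failing the check, so I simply commented that check out)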

# GPU cluster example
# This is a cluster with one master node and two worker nodes

machine-sku:
  master-machine: # define a machine sku
    # the resource requirements for all the machines of this sku
    # We use the same memory format as Kubernetes, e.g. Gi, Mi
    # Reference: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
    mem: 60Gi
    cpu:
      # the number of CPU vcores
      vcore: 24
  gpu-machine:
    computing-device:
      # For `type`, please follow the same format specified in device plugin.
      # For example, `nvidia.com/gpu` is for NVIDIA GPU, `amd.com/gpu` is for AMD GPU,
      # and `enflame.com/dtu` is for Enflame DTU.
      # Reference: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
      type: nvidia.com/gpu
      model: K80
      count: 4
    mem: 220Gi
    cpu:
      vcore: 24

machine-list:
  - hostname: pai-master # name of the machine, **do not** use upper case alphabet letters for hostname
    hostip: 10.0.0.1
    machine-type: master-machine # only one master-machine supported
    pai-master: "true"
  - hostname: pai-worker1
    hostip: 10.0.0.2
    machine-type: gpu-machine
    pai-worker: "true"
  - hostname: pai-worker2
    hostip: 10.0.0.3
    machine-type: gpu-machine
    pai-worker: "true"
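
The comment in the example warns against upper-case letters in hostnames. The sketch below lists the hostnames found in a layout file and flags any that break that rule. It is a plain-text scan that assumes the `hostname:` layout shown in the example, not a real YAML parser, and both function names are made up for illustration.

```bash
# Print every hostname value found in the layout file given as $1.
layout_hostnames() {
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*hostname:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1"
}

# Print the hostnames that contain upper-case letters.
# Succeeds only if there are none.
uppercase_hostnames() {
  bad="$(layout_hostnames "$1" | grep '[[:upper:]]' || true)"
  if [ -z "$bad" ]; then
    return 0
  fi
  printf '%s\n' "$bad"
  return 1
}

# Usage (run inside the pai clone):
#   uppercase_hostnames contrib/kubespray/config/layout.yaml && echo "hostnames look fine"
```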

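config.yaml example (change the username and password. As for the image registries, mirrors inside China are often unreliable, so I downloaded everything manually through an Alibaba Cloud server)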

user: forexample
password: forexample
docker_image_tag: v1.5.0

# Optional

#######################################################################
#                    OpenPAI Customized Settings                      #
#######################################################################
# enable_hived_scheduler: true

#############################################
# Ansible-playbooks' inventory hosts' vars. #
#############################################
# ssh_key_file_path: /path/to/you/key/file

#####################################
# OpenPAI's service image registry. #
#####################################
# docker_registry_domain: docker.io
# docker_registry_namespace: openpai
# docker_registry_username: exampleuser
# docker_registry_password: examplepasswd

################################################################
# OpenPAI's daemon qos config.                                 #
# By default, the QoS class for PAI daemon is BestEffort.      #
# If you want to promote QoS class to Burstable or Guaranteed, #
# you should set the value to true.                            #
################################################################
# qos-switch: "false"

###########################################################################################
#                         Pre-check setting                                               #
###########################################################################################
# docker_check: true
# resource_check: true

########################################################################################
# Advanced docker configuration. If you are not familiar with them, don't change them. #
########################################################################################
# docker_data_root: /mnt/docker
# docker_config_file_path: /etc/docker/daemon.json
# docker_iptables_enabled: false

## An obvious use case is allowing insecure-registry access to self hosted registries.
## Can be ipaddress and domain_name.
## example define 172.19.16.11 or mirror.registry.io
# openpai_docker_insecure_registries:
#   - mirror.registry.io
#   - 172.19.16.11

## Add other registry,example China registry mirror.
# openpai_docker_registry_mirrors:
#   - https://registry.docker-cn.com
#   - https://mirror.aliyuncs.com

#######################################################################
#                       kubespray setting                             #
#######################################################################

# If you couldn't access to gcr.io or docker.io, please configure it.
# gcr_image_repo: "gcr.io"
# kube_image_repo: "gcr.io/google-containers"
# quay_image_repo: "quay.io"
# docker_image_repo: "docker.io"
# etcd_image_repo: "quay.io/coreos/etcd"
# pod_infra_image_repo: "gcr.io/google_containers/pause-{{ image_arch }}"
# kubeadm_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/kubeadm"       (replace this with the currently valid address)
# hyperkube_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/hyperkube"      (replace this with the currently valid address)

# openpai_kube_network_plugin: calico

# openpai_kubespray_extra_var:
#   key: value
#   key: value

#######################################################################
#                     host daemon port setting                        #
#######################################################################
# host_daemon_port_start: 40000
# host_daemon_port_end: 65535
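
In the example, docker_image_tag carries the same value as the git tag checked out earlier (v1.5.0). Assuming the two are meant to stay in sync, a quick read-back of the value catches a mismatch before the installer runs. The helper `yaml_value` is hypothetical and only handles simple unquoted top-level entries.

```bash
# Print the value of a top-level "key: value" entry ($1) in a simple YAML file ($2).
# Commented-out lines are ignored because the key must start the line.
yaml_value() {
  sed -n "s/^$1:[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$2" | head -n 1
}

# Usage (run inside the pai clone):
#   [ "$(yaml_value docker_image_tag contrib/kubespray/config/config.yaml)" = "v1.5.0" ] \
#     && echo "image tag matches the checked-out version"
```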

4. Install Kubernetes

cd pai/contrib/kubespray
/bin/bash quick-start-kubespray.sh

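During installation, some of the components that have to be downloaded often fail to download automatically, yet fetching them by hand frequently works. Whenever an automatic download failed, I downloaded the file manually and placed it at the expected path. The files to download are generally all listed in this file

For example, the very first download often fails, and the file is deleted automatically after every run, so the next run has to fetch it again. I therefore commented that step out, downloaded the file manually, and copied it into the pai-deploy/ directory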
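
Manual downloads like these can be scripted so that a file is fetched once, retried on failure, and left alone on later runs. A minimal sketch using curl; `fetch_if_missing` is a hypothetical helper, and the URL and destination in the usage comment are placeholders, not the installer's real values.

```bash
# Download $1 to $2 unless a non-empty file is already there.
# curl retries transient failures a few times before giving up.
fetch_if_missing() {
  url="$1"
  dest="$2"
  if [ -s "$dest" ]; then
    echo "already present: $dest"
    return 0
  fi
  mkdir -p "$(dirname "$dest")"
  curl -fL --retry 5 --retry-delay 3 -o "$dest" "$url"
}

# Usage (placeholder URL and file name):
#   fetch_if_missing "https://example.com/some-component" "$HOME/pai-deploy/some-component"
```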

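When the automated installation reaches this step, the environment check has finished. Fix the parameters according to the reported errors. If you consider a reported error harmless, type continue to proceed, then enter N at the next prompt and the installation carries on

Another error appeared at this step, probably because the docker-ce version I installed differs from the one the installer expects, so I disabled that part as well by editing the task file (adjust the home directory to your own): sudo nano /home/ouc/pai-deploy/kubespray/roles/container-engine/docker/tasks/main.yml

Run the script again. Another file then fails to download; as before, download it manually in a browser and place it at the expected path. Files that still cannot be downloaded may have to be fetched through a proxy

Output like the following at the end means the installation succeeded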

You can run the following commands to setup kubectl on you local host:
ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass
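
Once kubectl has been set up on the dev box, the cluster state can be checked by counting the nodes that report Ready. A small sketch that parses the text output of kubectl; the function name `count_ready_nodes` is an assumption, and nodes with a combined status such as Ready,SchedulingDisabled are deliberately not counted.

```bash
# Read `kubectl get nodes --no-headers` output on stdin and print how many
# nodes have a STATUS column (second field) of exactly "Ready".
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get nodes --no-headers | count_ready_nodes
```

For the layout.yaml example with one master and two workers, the expected count would be 3.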

5. Install the OpenPAI services (these downloads usually succeed on the first try)

/bin/bash quick-start-service.sh

Output like the following at the end means the installation succeeded

Kubernetes cluster config :     ~/pai-deploy/kube/config
OpenPAI cluster config    :     ~/pai-deploy/cluster-cfg
OpenPAI cluster ID        :     pai
Default username          :     admin
Default password          :     admin-password

You can go to http://<your-master-ip>, then use the default username and password to log in.
