Distributed Deep Learning Training with Horovod on Kubernetes

You may have noticed that even a powerful machine like the Nvidia DGX is not fast enough to train a deep learning model quickly, not to mention the long wait just to copy data onto the DGX. Datasets are getting larger, GPUs are disaggregated from storage, and workers with GPUs need to coordinate for model checkpointing and log saving. Your system may grow beyond a single server, and teams want to share both GPU hardware and data easily.

Enter distributed training with Horovod on Kubernetes. In this blog, I will walk through the setup to train a deep learning model in a multi-worker distributed environment with Horovod on Kubernetes.

Horovod with TensorFlow on Kubernetes

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Open sourced by Uber, Horovod has proved that, with little code change, it scales single-GPU training to run across many GPUs in parallel.

As an example, I will train a movie review sentiment model using Horovod with TensorFlow and Keras. Although Keras itself supports distributed training natively, I found it a little more complex and less stable compared to Horovod.

Oftentimes, customers ask me how to allocate and manage GPU schedules for team members in such an environment. This becomes more important in a multi-server environment. I have heard of solutions such as timetables in Excel (awful, but still surprisingly common), Python scripts, Kubernetes and commercial software. I will use Kubernetes because it provides a nice interface to run many application containers, including deep learning, on top of a cluster.

A fast shared storage/filesystem is critical to simplify distributed training. It is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share both GPU hardware and data. I will use FlashBlade S3 for hosting the dataset, and FlashBlade NFS for checkpointing and storing TensorBoard logs.

Below is the architecture of this setup:

Figure: Distributed training with Horovod for TensorFlow on Kubernetes and FlashBlade

Deploy Horovod on Kubernetes

In a multi-worker Horovod setup, a single primary node and multiple worker nodes coordinate to train the model in parallel. Horovod uses MPI and SSH to exchange and update model parameters. One way to run Horovod on Kubernetes is to use Kubeflow and its mpi-job library, but I found it overkill to introduce Kubeflow just for this purpose. Kubeflow itself is a big project. For now, let's keep it simple.

We need to install MPI and SSH first. Horovod provides an official Dockerfile for this, which I have customised to fit my needs. While the MPI and SSH setup can be baked into the Docker image, we still need to configure passwordless SSH authentication for the Horovod pods. This is not a hard requirement, but to keep the example concise, I use a Kubernetes persistent volume (PV) to store my SSH configuration and mount it on all containers at /root/.ssh.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: horovod-ssh-shared
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: pure-file

Note the PVC uses the pure-file storage class (backed by FlashBlade NFS) with the ReadWriteMany access mode. In the same way, I also create another PV called tf-shared for checkpointing and TensorBoard logs. I mount these PVs to all the containers:

volumeMounts:
  - name: horovod-ssh-vol
    mountPath: /root/.ssh
  - name: tf-shared-vol
    mountPath: /tf/models
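
For reference, the matching volumes section of the pod spec points these mounts at the two PVCs. A minimal sketch, assuming the claim names above:

volumes:
  - name: horovod-ssh-vol
    persistentVolumeClaim:
      claimName: horovod-ssh-shared
  - name: tf-shared-vol
    persistentVolumeClaim:
      claimName: tf-shared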

I use a Kubernetes init container to run an init-ssh.sh script that generates the passwordless SSH authentication configuration before the Horovod primary container starts.

initContainers:
  - name: init-ssh
    image: uprush/horovod-cpu:latest
    volumeMounts:
      - name: horovod-ssh-vol
        mountPath: /root/.ssh
    command: ['/bin/bash', '/root/init-ssh.sh']

The content of init-ssh.sh looks like this:

if [ -f /root/.ssh/authorized_keys ]
then
  echo "SSH already configured."
else
  ssh-keygen -t rsa -b 2048 -N '' -f /root/.ssh/id_rsa
  cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
  chmod 700 /root/.ssh
  chmod 600 /root/.ssh/authorized_keys
fi

I then declare two Kubernetes Deployments: one for the primary, another for the workers. While the primary does nothing, the workers start an SSH server in the pod.

- name: horovod-cpu
  image: "uprush/horovod-cpu:latest"
  command: [ "sh", "-c", "/usr/sbin/sshd -p 2222; sleep infinity" ]
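
Putting the pieces together, a worker Deployment might look like the minimal sketch below. The metadata names, labels and replica count are illustrative assumptions; the primary Deployment is similar, but it also runs the init-ssh init container shown earlier.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: horovod-worker        # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: horovod-worker
  template:
    metadata:
      labels:
        app: horovod-worker
    spec:
      containers:
        - name: horovod-cpu
          image: "uprush/horovod-cpu:latest"
          command: [ "sh", "-c", "/usr/sbin/sshd -p 2222; sleep infinity" ]
          volumeMounts:
            - name: horovod-ssh-vol
              mountPath: /root/.ssh
            - name: tf-shared-vol
              mountPath: /tf/models
      volumes:
        - name: horovod-ssh-vol
          persistentVolumeClaim:
            claimName: horovod-ssh-shared
        - name: tf-shared-vol
          persistentVolumeClaim:
            claimName: tf-shared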

With these, the root user on the primary pod can SSH to the workers without a password. The Horovod setup is ready.

Accessing Datasets on S3 in TensorFlow

My dataset is stored in FlashBlade S3 as TensorFlow record files. I want my TensorFlow script to access it directly instead of downloading it to a local directory. So I added several environment variables, backed by a Kubernetes Secret, to the deployments:

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: tf-s3
        key: access-key
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: tf-s3
        key: secret-key
  - name: S3_ENDPOINT
    value: 192.168.170.11
  - name: S3_USE_HTTPS
    value: "0"
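
The tf-s3 Secret referenced above is assumed to exist already. It can be created beforehand with something like the following, where the values are placeholders for the FlashBlade S3 credentials:

kubectl create secret generic tf-s3 \
  --from-literal=access-key=<S3_ACCESS_KEY> \
  --from-literal=secret-key=<S3_SECRET_KEY>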

Later in my TensorFlow script, I will use these variables for S3 authentication:

endpoint_url = f"http://{os.environ['S3_ENDPOINT']}"
kwargs = {'endpoint_url': endpoint_url}
s3 = s3fs.S3FileSystem(anon=False, client_kwargs=kwargs)

# List all training tfrecord files
training_files_list = s3.ls("s3://datasets/aclImdb/train/")
training_files = [f"s3://{f}" for f in training_files_list]

# Now let's create tf datasets
training_ds = tf.data.TFRecordDataset(training_files, num_parallel_reads=AUTO)

FlashBlade S3 is very fast; even a minimum deployment can consistently deliver up to 7GB/s read throughput at around 3ms latency. This should be good enough for many DL training workloads.

GPU Scheduling on Kubernetes

To let Kubernetes schedule pods based on GPU resource requests, we need to install the Nvidia k8s device plugin. It requires using the nvidia-docker2 package instead of regular docker as the default runtime. Follow the README on how to prepare your GPU nodes. The device plugin installation is straightforward using Helm. In my lab, I only install the plugin on nodes with Tesla GPUs, so I added a node label to my GPU nodes.

kubectl label nodes fb-ubuntu01 nvidia.com/gpu.family=tesla

helm install \
  --version=0.6.0 \
  --generate-name \
  --set compatWithCPUManager=true \
  --set nodeSelector."nvidia\.com/gpu\.family"=tesla \
  nvdp/nvidia-device-plugin

The plugin will be installed as a DaemonSet in the kube-system namespace. If everything went well, the GPU nodes should now expose GPU capacity:

kubectl describe node fb-ubuntu01

Capacity:
cpu: 32
ephemeral-storage: 292889880Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264092356Ki
nvidia.com/gpu: 1
pods: 110

We can then request GPU resources for the Horovod pods:

resources:
  limits:
    nvidia.com/gpu: 2 # requesting 2 GPUs

Prepare for Training

Next I use a pre-train script to prepare the environment for training. The script uses the Kubernetes CLI to select the Horovod pods and then does the following (a simplified sketch of such a script follows the list):

  • Generate a pip-install.sh script to install Python dependencies on all pods.

  • Generate a horovod-run.sh script to start the Horovod job.

  • Copy the source code and the generated scripts from my workstation to the shared PV of the Horovod primary pod.
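
A minimal sketch of such a pre-train.sh, under the assumption that the pods carry app=horovod-primary / app=horovod-worker labels (the label selectors, paths and generated file contents here are illustrative, not the original script):

#!/bin/bash
# Hypothetical sketch of pre-train.sh; label selectors and paths are assumptions.

# Find the Horovod primary pod name and the worker pod IPs.
PRIMARY=$(kubectl get pods -l app=horovod-primary -o jsonpath='{.items[0].metadata.name}')
WORKER_IPS=$(kubectl get pods -l app=horovod-worker -o jsonpath='{.items[*].status.podIP}')

# Build the Horovod host list: localhost plus one slot per worker pod,
# addressed by the <ip-with-dashes>.default.pod DNS name.
HOSTS="localhost:1"
WORKER_HOSTS=""
NP=1
for ip in $WORKER_IPS; do
  h="${ip//./-}.default.pod"
  HOSTS="$HOSTS,$h:1"
  WORKER_HOSTS="$WORKER_HOSTS $h"
  NP=$((NP + 1))
done

# Generate pip-install.sh: install dependencies locally, then on each worker over SSH.
cat > pip-install.sh <<EOF
pip install -r /tf/models/examples/requirements.txt
for h in $WORKER_HOSTS; do
  ssh -p 2222 \$h pip install -r /tf/models/examples/requirements.txt
done
EOF

# Generate horovod-run.sh to start the Horovod job.
cat > horovod-run.sh <<EOF
mkdir -p /tf/models/aclImdb/checkpoints
AWS_LOG_LEVEL=3 horovodrun -np $NP \\
 -H $HOSTS \\
 -p 2222 \\
 python imdb-sentiment.py
EOF

# Copy source code and the generated scripts to the primary pod's shared PV.
for f in imdb-sentiment.py requirements.txt pip-install.sh horovod-run.sh pre-train.sh; do
  kubectl cp "$f" "$PRIMARY:/tf/models/examples/"
done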

After running the pre-train.sh script, my primary pod will have these files in the shared PV:

root@horovod-primary-84fcd7bdfd-2j8tc:/tf/models/examples# ls
horovod-run.sh imdb-sentiment.py pip-install.sh pre-train.sh requirements.txt

Here is an example of a generated horovod-run.sh:

mkdir -p /tf/models/aclImdb/checkpoints

AWS_LOG_LEVEL=3 horovodrun -np 3 \
 -H localhost:1,10-244-1-129.default.pod:1,10-244-0-145.default.pod:1 \
 -p 2222 \
 python imdb-sentiment.py

This script runs the training job on three pods in parallel, with each pod using one CPU. We don't use GPUs here because the model is very small.

Because everything is automated, each time I change the training code in VSCode (I use the Remote extension to write code on the server over SSH), I run the following steps to start the training job:

  1. Run the pre-train.sh script to regenerate and copy the source code.

  2. Enter the Horovod primary pod.

  3. Run pip-install.sh to install dependencies on all pods.

  4. Run horovod-run.sh to start the Horovod training job.

So far this workflow works well for me.

Horovod with TensorFlow

The modifications to the training script required to use Horovod with TensorFlow are well documented here.

My example code is an end-to-end runnable script that trains a movie review sentiment model. It is similar to single-node training except for the following (a minimal sketch of the changes follows the list):

  • The code runs on all Horovod pods in parallel.

  • Each pod processes only part of the total number of training and validation batches, so shard the dataset (use tf.data.Dataset.shard()) and set steps_per_epoch and validation_steps properly when calling model.fit.

  • Some tasks, such as saving checkpoints, TensorBoard logs and the model, should run only on the primary pod (hvd.rank() == 0) to prevent the other workers from corrupting them.

  • Because the pods can run on any server in the Kubernetes cluster (GPU nodes only when requesting GPU resources), we should save checkpoints, TensorBoard logs and the model in a persistent volume (FlashBlade NFS in my example) or object storage (e.g., FlashBlade S3).
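
A minimal sketch of these changes using Horovod's tf.keras API, assuming the model and the training_ds/validation_ds datasets (for example, the TFRecord datasets built from FlashBlade S3 above) are already defined, and with BATCH_SIZE, TRAIN_STEPS, VAL_STEPS and EPOCHS as placeholder constants:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each local process to a single GPU, if any are available.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Shard the datasets so each pod trains on a different slice.
training_ds = training_ds.shard(hvd.size(), hvd.rank()).batch(BATCH_SIZE)
validation_ds = validation_ds.shard(hvd.size(), hvd.rank()).batch(BATCH_SIZE)

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

# Keep all workers in sync by broadcasting initial variables from rank 0.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Save checkpoints and TensorBoard logs only on the primary pod (rank 0),
# writing to the shared PV backed by FlashBlade NFS.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint(
        '/tf/models/aclImdb/checkpoints/ckpt-{epoch}'))
    callbacks.append(tf.keras.callbacks.TensorBoard(
        log_dir='/tf/models/aclImdb/logs'))

model.fit(training_ds,
          steps_per_epoch=TRAIN_STEPS // hvd.size(),
          validation_data=validation_ds,
          validation_steps=VAL_STEPS // hvd.size(),
          epochs=EPOCHS,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)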

I will skip the details of the training code here. Please refer to my example code.

Below is a sample running output:

Figure: Example running output

If I look at my Kubernetes monitoring UI, I can see the CPU usage of all the Horovod pods jump up, which indicates the training job is running on all the pods in parallel.

Figure: Pod resource usage during Horovod training

Summary

Distributed training is the future of deep learning. Using Horovod and Kubernetes, we demonstrated the steps to quickly spin up a dynamic distributed deep learning training environment. This enables deep learning engineers and researchers to easily share, schedule and fully leverage expensive GPUs and data.

Shared storage like FlashBlade plays an important role in this setup. FlashBlade makes it possible to share both the resources and the data, and it relieves me from manually saving and aggregating checkpoints, TensorBoard logs and models. Horovod with Kubernetes and FlashBlade just makes my deep learning life much easier.

No more timetables in Excel!

Translated from: https://towardsdatascience.com/distributed-deep-learning-training-with-horovod-on-kubernetes-6b28ac1d6b5d
