Lesson Plan: Horovod on Ray

路人与大师

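Course Objectives

Understand how Horovod integrates with Ray and what the integration offers.
Learn how to install and configure Horovod on a Ray cluster.
Learn how to run distributed training with RayExecutor.
Understand Ray's elastic execution and how it applies to Horovod.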

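Course Content

1. Introduction to Horovod on Ray

Purpose of the Integration
- Combine Horovod's distributed training strengths with Ray's cluster management and elastic scaling.
- Run distributed jobs through the RayExecutor API.
- Only the Gloo backend is currently supported.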

2. Installing Horovod and Ray

Installation Steps

$ HOROVOD_WITH_GLOO=1 ... pip install 'horovod[ray]'

Reference Documentation
- For advanced Ray installation instructions, see the Ray installation documentation.
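
Before moving on, it can help to confirm that both packages are visible in the active Python environment. The following is a minimal sketch using only the standard library. The helper name `check_packages` is made up for illustration, and it only checks that the packages can be found, not that Horovod was built with Gloo support.

```python
import importlib.util


def check_packages(names=("horovod", "ray")):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'installed' if found else 'MISSING'}")
```

To check Gloo support specifically, `horovodrun --check-build` lists the frameworks and controllers that were compiled in.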

3. Distributed Training with RayExecutor

Basic Usage

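import ray
from horovod.ray import RayExecutor

# Start a new Ray cluster or connect to an existing one
ray.init()

# Example value; set this to the number of workers you want
num_workers = 4

# Build the settings object that RayExecutor expects
settings = RayExecutor.create_settings(timeout_s=30)

# Configure the executor, then launch num_workers actors on the cluster
executor = RayExecutor(
    settings, num_workers=num_workers, use_gpu=True)
executor.start()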

Running a Function

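import horovod.torch as hvd

def simple_fn():
    hvd.init()
    print("hvd rank", hvd.rank())
    return hvd.rank()

# Run the function on all workers at once
result = executor.run(simple_fn)
# Every worker should report a distinct rank
assert len(set(result)) == num_workers
executor.shutdown()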
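
`run` can also forward arguments to the function. The sketch below assumes the `args` and `kwargs` parameters of `RayExecutor.run` as described in the Horovod documentation. `train_fn`, `launch`, and `ranks_are_unique` are illustrative names, and the Horovod and Ray imports are deferred so that the rank check can be exercised without a cluster.

```python
def ranks_are_unique(ranks, expected_workers):
    """True if every worker reported a distinct rank in [0, expected_workers)."""
    return sorted(ranks) == list(range(expected_workers))


def train_fn(num_epochs, lr=0.01):
    # Runs on every worker; the import happens inside the worker process.
    import horovod.torch as hvd
    hvd.init()
    print(f"rank {hvd.rank()}: {num_epochs} epochs, lr={lr}")
    return hvd.rank()


def launch(num_workers=2):
    # Assumes a reachable Ray cluster and a Gloo-enabled Horovod build.
    import ray
    from horovod.ray import RayExecutor

    ray.init()
    settings = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(settings, num_workers=num_workers, use_gpu=False)
    executor.start()
    try:
        # Positional and keyword arguments are forwarded to train_fn.
        ranks = executor.run(train_fn, args=[3], kwargs={"lr": 0.1})
        assert ranks_are_unique(ranks, num_workers)
    finally:
        # Release the actors even if the run fails.
        executor.shutdown()
    return ranks
```

Wrapping the run in `try`/`finally` is a design choice rather than a requirement: it keeps a failed job from leaving actors alive on the cluster.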

4. Stateful Execution

Support for Stateful Actors

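import ray
import torch
import horovod.torch as hvd
from horovod.ray import RayExecutor

class MyModel:
    def __init__(self, learning_rate):
        # Initialize Horovod before creating the distributed optimizer
        hvd.init()
        # NeuralNet is a placeholder for your own torch.nn.Module
        self.model = NeuralNet()
        optimizer = torch.optim.SGD(
            self.model.parameters(),
            lr=learning_rate,
        )
        self.optimizer = hvd.DistributedOptimizer(optimizer)

    def get_weights(self):
        # named_parameters() yields (name, tensor) pairs, so dict() works
        return dict(self.model.named_parameters())

    def train(self):
        # _train is a placeholder for your own training loop
        return self._train(self.model, self.optimizer)

ray.init()
executor = RayExecutor(...)
# Each worker constructs its own MyModel; the learning rate is an example value
executor.start(
    executable_cls=MyModel, executable_kwargs={"learning_rate": 0.01})

# Call train() on every worker for five rounds
for i in range(5):
    executor.execute(lambda worker: worker.train())

# Collect the weights from every worker
result = executor.execute(lambda worker: worker.get_weights())
assert all(isinstance(res, dict) for res in result)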
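
To see what `execute` does with the callable, it can be useful to mimic the pattern locally. The class below is a hypothetical stand-in, not part of Horovod: it builds one plain Python object per worker and applies the function to each, which mirrors the call pattern above (one result per worker) without Ray, Horovod, or any gradient synchronization. `LocalExecutor` and `Counter` are names invented for this illustration.

```python
class LocalExecutor:
    """Hypothetical single-process stand-in for the start/execute pattern."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.workers = []

    def start(self, executable_cls, executable_args=None, executable_kwargs=None):
        args = executable_args or []
        kwargs = executable_kwargs or {}
        # One instance per worker; each keeps its own state between calls.
        self.workers = [executable_cls(*args, **kwargs)
                        for _ in range(self.num_workers)]

    def execute(self, fn):
        # Apply fn to every worker instance and collect the results in order.
        return [fn(worker) for worker in self.workers]


class Counter:
    """Toy stateful worker used only for this illustration."""

    def __init__(self, step):
        self.step = step
        self.total = 0

    def train(self):
        self.total += self.step
        return self.total


executor = LocalExecutor(num_workers=3)
executor.start(executable_cls=Counter, executable_args=[2])
for _ in range(5):
    executor.execute(lambda worker: worker.train())
totals = executor.execute(lambda worker: worker.total)
print(totals)  # [10, 10, 10]
```

Because each instance persists between `execute` calls, the totals accumulate across the five rounds. This is the property that distinguishes stateful execution from the stateless `run` method.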

5. Elastic RayExecutor

Elastic Execution

$ ray up ray/python/ray/autoscaler/aws/example-full.yaml

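import torch
import horovod.torch as hvd

# Model, epochs, and on_state_reset are placeholders defined elsewhere
def training_fn():
    hvd.init()
    model = Model()
    torch.cuda.set_device(hvd.local_rank())
    # Example optimizer; the learning rate is illustrative
    optimizer = hvd.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.01))

    @hvd.elastic.run
    def train(state):
        # Resume from the last committed epoch after a worker change
        for state.epoch in range(state.epoch, epochs):
            ...
            # Save the state in memory so it can be restored on reset
            state.commit()

    state = hvd.elastic.TorchState(model, optimizer, batch=0, epoch=0)
    state.register_reset_callbacks([on_state_reset])
    train(state)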

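import ray
from horovod.ray import RayExecutor

# Connect to the Ray cluster that is already running
ray.init(address="auto")
settings = RayExecutor.create_settings(verbose=True)
# Elastic execution: specify min_workers rather than a fixed num_workers
executor = RayExecutor(
    settings, min_workers=1, use_gpu=True, cpus_per_slot=2)
executor.start()
executor.run(training_fn)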
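
The `on_state_reset` callback registered above is not shown in the source. A common use, following the pattern in Horovod's elastic training documentation, is to rescale the learning rate when the number of workers changes. The sketch below is an assumption about what such a callback might look like. `make_lr_reset_callback` and `get_world_size` are illustrative names, and in real training `get_world_size` would be `hvd.size`.

```python
def make_lr_reset_callback(optimizer, base_lr, get_world_size):
    """Build a callback that sets lr = base_lr * world size on every reset."""

    def on_state_reset():
        scaled = base_lr * get_world_size()
        # torch optimizers expose their hyperparameters via param_groups.
        for group in optimizer.param_groups:
            group["lr"] = scaled

    return on_state_reset


class FakeOptimizer:
    """Minimal stand-in exposing param_groups like a torch optimizer."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


opt = FakeOptimizer(lr=0.01)
callback = make_lr_reset_callback(opt, base_lr=0.01, get_world_size=lambda: 4)
callback()
print(opt.param_groups[0]["lr"])  # 0.04
```

Passing the world-size lookup as a function rather than a fixed number matters here: the callback runs after the worker set has changed, so it has to read the size at call time.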

6. Launching a Cluster on AWS

Cluster Configuration

cluster_name: horovod-cluster
provider: {type: aws, region: us-west-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3

head_node: {InstanceType: p3.2xlarge, ImageId: ami-0b294f219d14e6a82}
worker_nodes: {InstanceType: p3.2xlarge, ImageId: ami-0b294f219d14e6a82}
setup_commands:
    - HOROVOD_WITH_GLOO=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[ray]

Starting and Monitoring the Cluster

$ ray up ray_cluster.yaml
$ ray monitor ray_cluster.yaml
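
Once the cluster is up, the number of Horovod workers it can host follows from the resources Ray reports. The helper below is a small sketch: `max_gpu_workers` is an illustrative name, and the resource dictionary is assumed to have the shape returned by `ray.cluster_resources()`. It ignores how resources are spread across nodes, so treat the result as an upper bound. With the configuration above (one head node plus three workers, each a p3.2xlarge with a single GPU), the expected answer is four.

```python
def max_gpu_workers(resources, gpus_per_worker=1, cpus_per_worker=1):
    """Upper bound on workers that fit, given total cluster CPUs and GPUs."""
    by_gpu = int(resources.get("GPU", 0) // gpus_per_worker)
    by_cpu = int(resources.get("CPU", 0) // cpus_per_worker)
    return min(by_gpu, by_cpu)


# Totals for four p3.2xlarge nodes (8 vCPUs and 1 GPU each).
example = {"CPU": 32.0, "GPU": 4.0}
print(max_gpu_workers(example))  # 4
```

On a live cluster, the dictionary would come from `ray.cluster_resources()` after calling `ray.init(address="auto")`.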

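Teaching Activities

Lecture and Discussion
- Introduce the Horovod and Ray integration, covering its motivation and benefits.
- Demonstrate basic RayExecutor usage and how it applies to distributed training.
Hands-On Practice
- Install Horovod and Ray and configure a cluster environment.
- Modify an existing training script to run distributed training with RayExecutor.
Case Studies
- Run stateful and stateless distributed training with RayExecutor.
- Configure and run the elastic RayExecutor.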

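Homework

Installation and Configuration
- Install Horovod and Ray, and configure a local or cloud cluster environment.
Implementation
- Modify an existing training script to run distributed training with RayExecutor, and analyze its performance.
Elastic Training
- Configure the elastic RayExecutor, run a training job, and record and analyze the results.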

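Summary

By the end of this course, students will know how to use Horovod together with Ray, run training jobs efficiently in a distributed environment, and use Ray's elasticity for automatic scaling and resource management.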
