deepspeed 1

@[TOC] A brief DeepSpeed guide
Author: ygz
Date: 2024-07-19

Getting started

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)

Note: this call initializes the DeepSpeed engine. The engine wraps a torch.nn.Module and exposes the APIs for training and for saving/loading the model. During deepspeed.initialize, distributed data parallelism and mixed-precision training are configured appropriately, and DeepSpeed can also construct and manage the training optimizer and learning-rate scheduler for the model.

If you already have a distributed environment set up, for example through PyTorch's own distributed setup, you should use DeepSpeed's distributed initialization function instead, i.e. replace

torch.distributed.init_process_group(…)

with

deepspeed.init_distributed()

The default backend is NCCL, NVIDIA's multi-GPU communication library. It is well tuned for NVIDIA GPUs but not necessarily for other hardware; the backend can be overridden if needed.

But if you don’t need the distributed environment setup until after deepspeed.initialize() you don’t have to use this function, as DeepSpeed will automatically initialize the distributed environment during its initialize. Regardless, you will need to remove torch.distributed.init_process_group if you already had it in place.
[Question]: How exactly is the engine initialized? How does it set up the parallelism, and how is that achieved?
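To make the snippet above concrete, here is a minimal, self-contained sketch of an entry script (the toy model, argument names, and file layout are assumptions for illustration, not from the original post). The launcher described later passes --local_rank to every process, and deepspeed.add_config_arguments adds the --deepspeed/--deepspeed_config flags to your parser:

import argparse
import torch
import deepspeed

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 2)

    def forward(self, x):
        return self.linear(x)

def get_args():
    parser = argparse.ArgumentParser(description="toy DeepSpeed entry script")
    parser.add_argument("--local_rank", type=int, default=-1)   # set by the deepspeed launcher
    parser = deepspeed.add_config_arguments(parser)             # adds --deepspeed, --deepspeed_config, ...
    return parser.parse_args()

if __name__ == "__main__":
    cmd_args = get_args()
    model = ToyModel()
    params = [p for p in model.parameters() if p.requires_grad]
    # DeepSpeed wraps the model for data parallelism / mixed precision and
    # builds the optimizer and scheduler described in the JSON config.
    model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                         model=model,
                                                         model_parameters=params)

Such a script would then be started with the deepspeed launcher described below, passing a ds_config.json like the one shown in the configuration section.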

Training

Once the DeepSpeed engine has been initialized, training the model is simply a matter of running the forward pass, the backward pass, and the weight update with the engine:
for step, batch in enumerate(data_loader):
    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()

Under the hood, DeepSpeed automatically performs the operations needed for distributed data-parallel training, mixed precision, and the predefined learning-rate scheduler. In particular, it takes care of:
1. Gradient averaging: in distributed data-parallel training, the gradients are averaged across the data-parallel processes after the backward pass.
2. Loss scaling: in FP16/mixed precision training, the DeepSpeed engine automatically handles scaling the loss to avoid precision loss in the gradients (machine floating-point representations have limited precision).
3. Learning-rate scheduling: DeepSpeed calls the step() method of the scheduler at every training step (whenever model_engine.step() is executed). If you do not want the scheduler stepped once per training step, do not hand it to DeepSpeed and step it yourself, as sketched below.
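As a minimal sketch of point 3 (assuming cmd_args and model from the Getting Started sketch, the document's data_loader, and an epoch count num_epochs; also assuming ds_config.json defines neither an optimizer nor a scheduler), you can keep the scheduler entirely outside DeepSpeed and step it at your own cadence, e.g. once per epoch:

import torch
import deepspeed

# Build the optimizer yourself and hand it to DeepSpeed; keep the scheduler
# outside DeepSpeed so model_engine.step() never touches it.
client_optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(client_optimizer, step_size=1, gamma=0.9)

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     optimizer=client_optimizer)

for epoch in range(num_epochs):
    for step, batch in enumerate(data_loader):
        loss = model_engine(batch)
        model_engine.backward(loss)
        model_engine.step()   # no scheduler was given to DeepSpeed, so none is stepped here
    scheduler.step()          # user-managed: stepped once per epoch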

Model checkpointing

Use the save_checkpoint and load_checkpoint APIs to save and load a training checkpoint. They take two arguments to identify a checkpoint:
ckpt_dir: the directory where checkpoints will be saved.
ckpt_id: an identifier that uniquely identifies a checkpoint in the directory. In the following code snippet, we use the loss value as the checkpoint identifier.

# load checkpoint and recover the user-defined training state
_, client_sd = model_engine.load_checkpoint(args.load_dir, args.ckpt_id)
step = client_sd['step']

# advance the data loader to the checkpointed step
dataloader_to_step(data_loader, step + 1)

for step, batch in enumerate(data_loader):

    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()

    # save a checkpoint every save_interval steps
    if step % args.save_interval == 0:
        client_sd['step'] = step
        ckpt_id = loss.item()   # the loss value is used as the checkpoint identifier
        model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)

DeepSpeed automatically saves and restores the model, optimizer, and scheduler states (all of this is encapsulated in the engine).
However, the user may want to save additional data that are unique to a given model training. To support these items, save_checkpoint accepts a client state dictionary client_sd for saving. These items can be retrieved from load_checkpoint as a return argument. In the example above, the step value is stored as part of the client_sd.
[Explanation]: the example above saves the step in client_sd so that the next training run knows which step the previous run reached.
Note: every process must call this method, since each process needs to save its own weights and optimizer/scheduler state.
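Note that dataloader_to_step in the snippet above is not a DeepSpeed API; it is a placeholder for user code that fast-forwards the data loader to the checkpointed position. A minimal sketch of such a helper (the snippet above is schematic: in practice you would iterate over the iterator this helper returns, or skip batches through the loader's sampler):

import itertools

def dataloader_to_step(data_loader, step):
    # Skip the first `step` batches and return the remaining iterator,
    # so iteration resumes where the loaded checkpoint left off.
    return itertools.islice(iter(data_loader), step, None)

For example, iterating with enumerate(dataloader_to_step(data_loader, start_step), start=start_step) resumes both the batch stream and the step counter.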

DeepSpeed configuration

https://www.deepspeed.ai/docs/config-json/
[This page lists all configurable parameters.]
An example DeepSpeed configuration:

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": true
}
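The same configuration can also be passed programmatically: recent DeepSpeed versions accept a config argument in deepspeed.initialize that may be either the path to the JSON file or an equivalent Python dict (a sketch, reusing model and params from the earlier example and mirroring the JSON above):

ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 0.00015}},
    "fp16": {"enabled": True},
    "zero_optimization": True,
}

# No --deepspeed_config flag is needed in this case.
model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                     model_parameters=params,
                                                     config=ds_config)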

Launching DeepSpeed training

DeepSpeed training is launched with the deepspeed launcher script, which starts the distributed training job.
On Windows the script is installed in the Scripts folder of the Python installation; on Linux it is in the bin directory of the active environment, e.g.:
.conda/envs/minillm/bin

The examples below assume that:
1. You have already integrated DeepSpeed into your model.
2. client_entry.py is the entry script for your model.
3. client args is the argparse command line arguments.
4. ds_config.json is the configuration file for DeepSpeed.
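Under these assumptions, a typical single-node launch looks like the sketch below (--num_gpus is optional; by default the launcher uses all GPUs visible on the node):

deepspeed --num_gpus=2 client_entry.py <client args> \
  --deepspeed --deepspeed_config ds_config.json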

Multi-node resource configuration

For multi-node training, DeepSpeed is configured with hostfiles, which are compatible with OpenMPI and Horovod.
Background:
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

The nodes must be reachable from one another over passwordless SSH, and slots are used to specify the number of GPUs available on each node. For example, the hostfile

worker-1 slots=4
worker-2 slots=4

specifies that two machines named worker-1 and worker-2 each have four GPUs to use for training.

The following command launches a PyTorch training job across all available nodes and GPUs specified in myhostfile:

deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json

The arguments placed before the entry script are consumed by the deepspeed launcher (a Python script; see its location above), while the arguments after it are passed to your own Python program, which uses ds_config.json to initialize the engine.

For flexibility, the launcher also accepts flags that limit the number of nodes used, e.g.:

deepspeed --num_nodes=2 \
	<client_entry.py> <client args> \
	--deepspeed --deepspeed_config ds_config.json

You can also exclude specific GPUs from training, or restrict training to specific GPUs:

deepspeed --exclude="worker-2:0@worker-3:0,1" \
	<client_entry.py> <client args> \
	--deepspeed --deepspeed_config ds_config.json

deepspeed --include="worker-2:0,1" \
	<client_entry.py> <client args> \
	--deepspeed --deepspeed_config ds_config.json

Multi-node environment variables

It is often useful to propagate user-defined environment variables when training. By default, DeepSpeed propagates all NCCL- and Python-related environment variables that are already set. To propagate additional variables, list them, one per line, in a dotfile named .deepspeed_env; DeepSpeed looks for this file in the local working directory and then in the user's home directory. If you would like to override the default name or path of this file, you can specify your own with the environment variable DS_ENV_FILE.

As a concrete example, some clusters require special NCCL variables to set prior to training. The user can simply add these variables to a .deepspeed_env file in their home directory that looks like this:
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0

DeepSpeed will then propagate these environment variables to every node when launching training.

MPI and AzureML compatibility

DeepSpeed can also be deployed with an MPI launcher, i.e. started with mpirun instead of the deepspeed script. Even then it still uses the torch distributed NCCL backend, not an MPI backend, for communication between processes.

To launch your training job with mpirun + DeepSpeed or with AzureML (which uses mpirun as a launcher backend) you simply need to install the mpi4py python package. DeepSpeed will use this to discover the MPI environment and pass the necessary state (e.g., world size, rank) to the torch distributed backend.
In short: install the mpi4py package and DeepSpeed will discover the MPI environment and set up the corresponding state itself.

If you are using model parallelism, pipeline parallelism, or otherwise require torch.distributed calls before calling deepspeed.initialize(…), we provide the same MPI support with an additional DeepSpeed API call. Replace your initial torch.distributed.init_process_group(…) call with:

deepspeed.init_distributed()
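A brief sketch of that ordering (cmd_args, model, and params as in the earlier sketch; mpu stands for your own model-parallel utilities object and is an assumption here, passed through the mpu parameter of deepspeed.initialize):

import deepspeed

deepspeed.init_distributed()   # instead of torch.distributed.init_process_group(...)

# ... torch.distributed calls and model/pipeline-parallel setup can go here ...

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params,
                                                     mpu=mpu)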