【Framework】Simplifying Multi-GPU Training: An Introduction to HuggingFace accelerate

HuggingFace's accelerate library lets you turn an ordinary training script into a DDP training script by changing only a few lines of code, and it also supports mixed-precision training and TPU training (and even DeepSpeed).
accelerate covers CPU, single-GPU/TPU, and multi-GPU/TPU DDP training, in fp32 as well as fp16/bf16 precision.
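
At a high level, adopting accelerate boils down to three changes: create an Accelerator, pass the model/optimizer/dataloaders through accelerator.prepare(), and call accelerator.backward(loss) instead of loss.backward(). The following is a minimal, self-contained sketch with a toy linear model (placeholder names, not the MNIST example used later in this post):

import torch
import torch.nn as nn
from accelerate import Accelerator

# toy placeholders standing in for a real model / dataset
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)
loss_fn = nn.CrossEntropyLoss()

accelerator = Accelerator()  # 1. create the Accelerator (handles device placement, DDP, mixed precision)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)  # 2. wrap everything

for inputs, targets in loader:  # batches already arrive on the right device
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)  # 3. instead of loss.backward()
    optimizer.step()
    optimizer.zero_grad()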

Installation

pip install accelerate

Usage

The accelerate training code is identical for single-GPU and multi-GPU runs; the only difference is that on a single GPU the gather_for_metrics() call is not strictly needed to aggregate results. To keep the code unchanged across setups, gather_for_metrics() is kept here anyway. Below is a sample script, main.py, that trains a handwritten-digit classifier on the MNIST dataset with accelerate.

import datetime
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

#======================================================================
# import accelerate
from accelerate import Accelerator
from accelerate.utils import set_seed
#======================================================================


class BasicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        self.act = F.relu

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.act(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def main(epochs,
         lr=1e-3,
         batch_size= 1024,
         ckpt_dir = "ckpts",
         ckpt_path = "checkpoint.pt",
         mixed_precision="no", #'fp16' or 'bf16'
         ):

    if not os.path.exists(ckpt_dir):
        os.makedirs(ckpt_dir)

    ckpt_path = os.path.join(ckpt_dir, ckpt_path)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    train_dset = datasets.MNIST('data', train=True, download=True, transform=transform)
    test_dset = datasets.MNIST('data', train=False, transform=transform)

    train_loader = torch.utils.data.DataLoader(train_dset, shuffle=True, batch_size=batch_size, num_workers=2)
    test_loader = torch.utils.data.DataLoader(test_dset, shuffle=False, batch_size=batch_size, num_workers=2)

    model = BasicNet()
    optimizer = optim.AdamW(model.parameters(), lr=lr)

    #======================================================================
    # initialize accelerator and auto move data/model to accelerator.device
    set_seed(42)
    accelerator = Accelerator(mixed_precision=mixed_precision)
    accelerator.print(f'device {str(accelerator.device)} is used!')
    # Send everything through `accelerator.prepare`
    train_loader, test_loader, model, optimizer = accelerator.prepare(
        train_loader, test_loader, model, optimizer
    )
    #======================================================================


    optimizer.zero_grad()

    for epoch in range(epochs):
        # switch back to train mode each epoch (model.eval() is called during evaluation below)
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            output = model(data)
            loss = F.nll_loss(output, target)

            #======================================================================
            # use accelerator.backward() instead of loss.backward()
            accelerator.backward(loss)
            #======================================================================

            optimizer.step()
            optimizer.zero_grad()

        model.eval()
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data)
                pred = output.argmax(dim=1, keepdim=True)

                #======================================================================
                # gather predictions and targets from all processes (effectively a no-op on a single GPU)
                pred = accelerator.gather_for_metrics(pred)
                target = accelerator.gather_for_metrics(target)
                #======================================================================

                correct += pred.eq(target.view_as(pred)).sum().item()

            eval_metric = 100. * correct / len(test_loader.dataset)
        #======================================================================
        #print logs and save ckpt  
        accelerator.wait_for_everyone()
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        accelerator.print(f"epoch【{epoch}】@{nowtime} --> eval_accuracy= {eval_metric:.2f}%")
        net_dict = accelerator.get_state_dict(model)
        accelerator.save(net_dict, ckpt_path+"_"+str(epoch))
        #======================================================================


if __name__ == '__main__':
    # mixed_precision can be 'fp16' or 'bf16'

    main(epochs=5, 
         lr=1e-4,
         batch_size=1024,
         ckpt_dir = "ckpts",
         ckpt_path = "checkpoint.pt",
         mixed_precision="no") 

Single-GPU run

Simply run CUDA_VISIBLE_DEVICES=0 python main.py to launch the script on a single GPU; CUDA_VISIBLE_DEVICES can be set to any other GPU id.

Result:

device cuda is used!
epoch【0】@2024-05-20 11:46:21 --> eval_accuracy= 89.84%
epoch【1】@2024-05-20 11:46:27 --> eval_accuracy= 93.44%
epoch【2】@2024-05-20 11:46:32 --> eval_accuracy= 95.52%
epoch【3】@2024-05-20 11:46:39 --> eval_accuracy= 96.55%
epoch【4】@2024-05-20 11:46:44 --> eval_accuracy= 97.07%

Multi-GPU run

First, a default_config.yaml file needs to be generated under ~/.cache/huggingface/accelerate. It can be created interactively from the terminal with

accelerate config

but that dialog is fairly involved. Alternatively, the following code writes a basic configuration:

import os
from accelerate.utils import write_basic_config
write_basic_config() # Write a config file
os._exit(0) # Restart the notebook to reload info from the latest config file 

The generated default_config.yaml looks like this:

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

To be able to run several DDP jobs at the same time, it is recommended to also add a main_process_port entry. If you want to train on 2 GPUs, change num_processes to 2.
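
For example, for a 2-GPU run the only entries that change or get added might look like the fragment below (41011 is an arbitrary free port; keep the remaining fields as generated):

  "main_process_port": 41011,
  "num_processes": 2,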

Method 1: launch with the default config

CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py

Result

device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%

Method 2: default config + command-line overrides

The options that usually need to be overridden are main_process_port, num_processes, and the GPU ids.

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --main_process_port 41011 --num_processes 2 main.py

Result

device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%
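
Depending on your accelerate version, the GPUs can also be selected through the launcher itself rather than via CUDA_VISIBLE_DEVICES; a sketch, assuming the installed version supports the --gpu_ids flag:

accelerate launch --gpu_ids 0,1 --main_process_port 41011 --num_processes 2 main.py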

Method 3: launch with PyTorch

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node 2 \
    --use_env \
    --master_port 41011 \
    main.py

Result

[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] 
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
device cuda:0 is used!
epoch【0】@2024-05-20 11:26:20 --> eval_accuracy= 84.13%
epoch【1】@2024-05-20 11:26:24 --> eval_accuracy= 90.28%
epoch【2】@2024-05-20 11:26:27 --> eval_accuracy= 92.35%
epoch【3】@2024-05-20 11:26:31 --> eval_accuracy= 93.59%
epoch【4】@2024-05-20 11:26:34 --> eval_accuracy= 94.52%
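
Note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun. A roughly equivalent torchrun invocation (a sketch; exact flag names can vary slightly between PyTorch versions) would be:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 --master_port 41011 main.py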

Method 4: launch with notebook_launcher

from accelerate import notebook_launcher
args = dict(
    epochs = 5,
    lr = 1e-4,
    batch_size= 1024,
    ckpt_dir = "ckpts",
    ckpt_path = "checkpoint.pt",
    mixed_precision="no").values()

notebook_launcher(main, args, num_processes=2, use_port="41011")

Run

CUDA_VISIBLE_DEVICES=0,1 python main.py

Result

Launching training on 2 GPUs.
device cuda:0 is used!
epoch【0】@2024-05-20 12:09:43 --> eval_accuracy= 84.69%
epoch【1】@2024-05-20 12:09:47 --> eval_accuracy= 90.36%
epoch【2】@2024-05-20 12:09:51 --> eval_accuracy= 92.13%
epoch【3】@2024-05-20 12:09:54 --> eval_accuracy= 93.20%
epoch【4】@2024-05-20 12:09:57 --> eval_accuracy= 94.03%
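
Although the run above still launches the script from the command line, notebook_launcher is mainly meant to be called directly from a Jupyter cell. A sketch of that usage (assuming the main function above is defined in the notebook and CUDA_VISIBLE_DEVICES is set before CUDA is initialized):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must be set before any CUDA initialization

from accelerate import notebook_launcher

# positional args matching main(epochs, lr, batch_size, ckpt_dir, ckpt_path, mixed_precision)
args = (5, 1e-4, 1024, "ckpts", "checkpoint.pt", "no")
notebook_launcher(main, args, num_processes=2, use_port="41011")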
