Introduction to Accelerate
Accelerate is a PyTorch extension library from Hugging Face that aims to simplify distributed training. It provides a unified API so that the same training code can run on a variety of hardware configurations. Its main features include:
- a single code path that adapts to different hardware (CPU/GPU/TPU)
- simplified distributed training configuration
- automatic handling of mixed-precision training
- built-in model saving/loading utilities
- support for training techniques such as gradient accumulation
When training deep learning models you usually want to make good use of the available GPUs, and the choice between DP and DDP directly affects how well training runs. Building on a basic introduction to Accelerate, this article is a short usage guide focused on the single-machine multi-GPU setting; corrections are welcome.
Configuring a single machine with multiple GPUs
First of all, Accelerate needs to be installed:
pip install accelerate
For a single-machine multi-GPU setup, run the configuration wizard with the accelerate config command:
accelerate config
The configuration process asks the following questions interactively:
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use TensorParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
Do you wish to use mixed precision?
fp16
accelerate configuration saved at /home/a/.cache/huggingface/accelerate/default_config.yaml
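If you prefer to skip the interactive wizard, the most common settings can also be passed directly as flags to accelerate launch. This is just a sketch: the script name train.py is a placeholder, and the exact flag names for your version can be checked with accelerate launch --help.
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 train.py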
Alternatively, you can create the configuration file by hand at /home/a/.cache/huggingface/accelerate/default_config.yaml; some of the parameters you would set manually are:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2 # use 2 GPUs
mixed_precision: fp16
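For reference, a complete default_config.yaml generated by the wizard above looks roughly like the following. The exact set of keys varies between Accelerate versions, so treat this as an illustrative sketch rather than the authoritative format:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false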
This completes the configuration via accelerate config; next we walk through an example of how to train on multiple GPUs.
A complete Accelerate example on a single machine with multiple GPUs
Let's start with the code; since the topic is how to use the library, the official example is the best reference:
https://github.com/huggingface/accelerate/blob/main/examples/cv_example.py
It is a complete example that already covers the CPU, single-GPU, and multi-GPU cases, so there is no need to go through it line by line; here we only list the main steps.
Initialize the Accelerator
from accelerate import Accelerator
from accelerate.utils import set_seed
# args.cpu is passed in from the command line and can force execution on CPU;
# mixed_precision selects the mixed-precision mode (fp16/bf16).
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
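Once the Accelerator exists, it describes the distributed context of the current process. As a quick sanity check, a small sketch using attributes of the Accelerator object (device, num_processes, mixed_precision, is_main_process) can print how the run is laid out:
# accelerator.print only emits on the main process, so the output is not duplicated per GPU.
accelerator.print(f"device: {accelerator.device}")                    # e.g. cuda:0
accelerator.print(f"num_processes: {accelerator.num_processes}")      # 2 with the config above
accelerator.print(f"mixed_precision: {accelerator.mixed_precision}")  # 'fp16' here
if accelerator.is_main_process:
    print("this branch only runs on the main process (rank 0)")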
Prepare the data and model
from torch.utils.data import DataLoader
from timm import create_model  # the official example builds the model with timm

train_dataset = PetsDataset()  # create the datasets (see the official example for the actual arguments)
eval_dataset = PetsDataset()
# initialize the data loaders
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
# initialize the model
model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
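PetsDataset is defined in the official example and reads the pet images from the --data_dir argument. If you only want to smoke-test the multi-GPU plumbing without downloading data, a stand-in dataset that returns the same {"image", "label"} dictionary works just as well; the class below is a hypothetical replacement, not part of the official script:
import torch
from torch.utils.data import Dataset

class FakePetsDataset(Dataset):
    """Hypothetical stand-in: random images with random class labels."""
    def __init__(self, length=256, num_classes=37, image_size=224):
        self.length = length
        self.num_classes = num_classes
        self.image_size = image_size

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return {
            "image": torch.rand(3, self.image_size, self.image_size),  # same key the training loop expects
            "label": torch.randint(0, self.num_classes, ()).item(),    # integer class id
        }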
Create the optimizer and learning-rate scheduler
import torch
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
Wrap the objects with prepare()
# wrap the model, optimizer, data loaders, and scheduler with accelerator.prepare()
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
The prepare() method automatically:
- distributes the model across the GPUs
- wraps the optimizer to support mixed precision
- sets up distributed data samplers, so each process sees only its own shard of the data (see the sketch below)
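To make the last point concrete, you can compare how many batches a single process iterates per epoch with and without prepare(). This is only a diagnostic sketch that reuses batch_size and train_dataset from the snippets above; note that by default Accelerate does not split batches, so the effective global batch size is batch_size × num_processes (for example 64 × 2 = 128 with two GPUs):
from torch.utils.data import DataLoader

plain_loader = DataLoader(train_dataset, batch_size=batch_size)  # no sharding
accelerator.print(f"batches per epoch on one process, unprepared: {len(plain_loader)}")
# After prepare(), the dataloader uses a distributed sampler, so each of the
# num_processes workers only iterates over its own shard (about half with 2 GPUs).
accelerator.print(f"batches per epoch on each prepared process: {len(train_dataloader)}")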
Training loop
for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # We could avoid this line since we set the accelerator with `device_placement=True`.
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        inputs = (batch["image"] - mean) / std
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
Key points:
- use accelerator.backward() instead of loss.backward()
- gradients are synchronized across the GPUs automatically
- mixed-precision training is handled automatically
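The feature list at the top also mentions gradient accumulation. Accelerate supports it through the gradient_accumulation_steps argument of the Accelerator and the accumulate() context manager; below is a minimal sketch of how the loop above changes (the value 4 is only an example):
# When constructing the Accelerator, also pass gradient_accumulation_steps:
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision,
                          gradient_accumulation_steps=4)

# The inner training loop then wraps each step in accelerator.accumulate(model):
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        inputs = (batch["image"] - mean) / std
        loss = torch.nn.functional.cross_entropy(model(inputs), batch["label"])
        accelerator.backward(loss)  # gradient sync only happens on the final step of each accumulation window
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()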
Evaluation and model saving
model.eval()
accurate = 0
num_elems = 0
for _, batch in enumerate(eval_dataloader):
    # We could avoid this line since we set the accelerator with `device_placement=True`.
    batch = {k: v.to(accelerator.device) for k, v in batch.items()}
    inputs = (batch["image"] - mean) / std
    with torch.no_grad():
        outputs = model(inputs)
    predictions = outputs.argmax(dim=-1)
    # gather predictions and labels from all processes so the metric covers the whole eval set
    predictions, references = accelerator.gather_for_metrics((predictions, batch["label"]))
    accurate_preds = predictions == references
    num_elems += accurate_preds.shape[0]
    accurate += accurate_preds.long().sum()
eval_metric = accurate.item() / num_elems
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
# save the model only on the main process; unwrap it first so the state_dict
# keys do not carry the DDP "module." prefix
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    torch.save(unwrapped_model.state_dict(), "model.pth")
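If you also want resumable checkpoints (model, optimizer, scheduler, and RNG state together), Accelerate provides save_state and load_state; a brief sketch, with the checkpoint directory name chosen arbitrarily:
# Save everything that was passed to prepare() into a single checkpoint directory.
accelerator.save_state("checkpoints/last")
# Later, after re-creating and re-preparing the same objects, restore them in place:
accelerator.load_state("checkpoints/last")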
That completes this brief introduction to configuring and using Accelerate on a single machine with multiple GPUs. Note that the workflow in the official example is to create an accelerate config in the script's directory first and then launch the training script with it, as shown below:
accelerate config --config_file config.yaml # This will create a config file on your server to `config.yaml`
accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data # This will run the script on your server
This generates a config.yaml file in the current directory, again through the interactive wizard; the benefit is that you can keep several configuration files and pick the one you want for each training run. This article only stays at the level of using the tool; deeper and more specific usage will have to come from applying it in real projects. Corrections and suggestions are welcome.