Accelerate: A Single-Machine Multi-GPU Usage Guide

Introduction to Accelerate

Accelerate is a PyTorch extension library from Hugging Face designed to simplify distributed training. It provides a unified API so the same code can run training jobs on a variety of hardware configurations. Its main features include:

- One codebase adapts to different hardware (CPU/GPU/TPU)
- Simplified distributed training configuration
- Automatic handling of mixed-precision training
- Built-in model saving/loading utilities
- Support for training techniques such as gradient accumulation

When training deep learning models, you usually want to make full use of the available GPUs, and whether you use DP or DDP has a direct impact on training. Building on this introduction to Accelerate, this article is a short usage guide focused on the single-machine multi-GPU setting. Corrections are welcome.

Single-Machine Multi-GPU Configuration

First, Accelerate has to be installed:

pip install accelerate

For single-machine multi-GPU training, run the configuration wizard with the accelerate config command:

accelerate config

The wizard asks for the following information interactively:

------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use TensorParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
------------------------------------------------------------------------------------------
Do you wish to use mixed precision?
fp16
accelerate configuration saved at /home/a/.cache/huggingface/accelerate/default_config.yaml

Of course, you can also create the configuration file by hand at /home/a/.cache/huggingface/accelerate/default_config.yaml. The key parameters for this setup are:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2  # use 2 GPUs
mixed_precision: fp16
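
For reference, the full default_config.yaml produced by the wizard above usually looks roughly like the following; the exact fields vary by Accelerate version and by your answers, so treat this as an illustrative sketch rather than something to copy verbatim:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false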

That completes the accelerate config setup; next we use an example to show how to train on multiple GPUs.

Using Accelerate on a Single Machine with Multiple GPUs: A Complete Example

Code first; since the goal is to show how the library is used, the official example is the best reference:

https://github.com/huggingface/accelerate/blob/main/examples/cv_example.py

This is a complete example that already covers the CPU, single-GPU, and multi-GPU cases, so there is no need to walk through every line; here we only list the main steps.

Initialize the Accelerator

from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
# cpu: passed in from the command line; forces CPU execution instead of GPU when set
# mixed_precision: the mixed-precision mode to use (fp16/bf16)
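
Here args comes from the command-line parser in the official script. A minimal sketch of how those two flags might be defined (the flag names follow the official example; defaults and help strings are illustrative):

import argparse

parser = argparse.ArgumentParser(description="Simple CV training example with Accelerate")
parser.add_argument(
    "--mixed_precision",
    type=str,
    default=None,
    choices=["no", "fp16", "bf16"],
    help="Whether to use mixed precision; overrides the value from the accelerate config.",
)
parser.add_argument("--cpu", action="store_true", help="Force training on the CPU.")
args = parser.parse_args()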

Prepare the data and model

    train_dataset = PetsDataset()  # build the training dataset (see the official example for the actual arguments)
    eval_dataset = PetsDataset()

    # create the data loaders
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
    eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
    # create the model
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

Create the optimizer and learning-rate scheduler

    optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))

Wrap everything with prepare()

    # wrap the model, optimizer, data loaders, and scheduler with accelerator.prepare()
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

The prepare() method automatically:

- distributes the model across the GPUs
- wraps the optimizer so mixed precision works transparently
- installs a distributed sampler in each data loader, so every process only sees its own shard of the data (see the sketch after this list)
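
A quick way to see the sharding effect is to print the per-process view of the training loader right after prepare(); this snippet is purely illustrative and not part of the official example:

# Each process iterates over its own shard, so len(train_dataloader) is roughly
# the single-process length divided by the number of GPUs.
accelerator.print(f"world size: {accelerator.num_processes}")  # printed only on the main process
print(f"process {accelerator.process_index} sees {len(train_dataloader)} batches per epoch")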

Training loop

    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            inputs = (batch["image"] - mean) / std
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

 

Key points:

- Call accelerator.backward(loss) instead of loss.backward()
- Gradients are synchronized across GPUs automatically
- Mixed-precision scaling is handled automatically (a gradient-accumulation variant of this loop is sketched below)
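
Since the feature list at the top mentions gradient accumulation, here is a sketch of the same loop using Accelerate's accumulate() helper; the step count of 4 is arbitrary and this variant is not part of the official cv_example:

# Assumes the Accelerator was created with an accumulation step count, e.g.
# accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        inputs = (batch["image"] - mean) / std
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
        accelerator.backward(loss)  # gradients are only synced on the last accumulation step
        optimizer.step()            # skipped internally until the accumulation boundary
        lr_scheduler.step()
        optimizer.zero_grad()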

Evaluation and model saving

        model.eval()
        accurate = 0
        num_elems = 0
        for _, batch in enumerate(eval_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            inputs = (batch["image"] - mean) / std
            with torch.no_grad():
                outputs = model(inputs)
            predictions = outputs.argmax(dim=-1)
            predictions, references = accelerator.gather_for_metrics((predictions, batch["label"]))
            accurate_preds = predictions == references
            num_elems += accurate_preds.shape[0]
            accurate += accurate_preds.long().sum()

        eval_metric = accurate.item() / num_elems
        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

# save the model only on the main process
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    torch.save(model.state_dict(), "model.pth")
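
Note that after prepare() on multiple GPUs, model is wrapped in DistributedDataParallel, so the state dict saved above carries a module. prefix in its keys. A common pattern (a sketch, not taken from the official example) is to unwrap the model before saving:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)  # strip the DistributedDataParallel wrapper
if accelerator.is_main_process:
    torch.save(unwrapped_model.state_dict(), "model.pth")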

That concludes this short guide to configuring and using Accelerate on a single machine with multiple GPUs. Note that the official workflow first creates an accelerate config file and then launches the training script with it, as shown below:

accelerate config --config_file config.yaml  # This will create a config file on your server to `config.yaml`
accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data  # This will run the script on your server
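
Alternatively, the config file can be skipped and the same settings passed directly as flags to accelerate launch (the values here mirror the configuration chosen earlier; run accelerate launch --help for the full list of options):

accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 ./cv_example.py --data_dir path_to_data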

This generates a config.yaml file in the current directory through the same interactive prompts; the advantage is that you can keep several config files and choose one per training run. This article only covers basic usage of the tool; deeper, more concrete usage patterns need to be worked out in real projects. Corrections and suggestions are welcome.

 
