Introduction to Accelerate
Accelerate is a PyTorch extension library from Hugging Face that aims to simplify distributed training. It provides a unified API so that the same training code can run on a variety of hardware configurations. Its main features include:
- a single code path that adapts to different hardware (CPU/GPU/TPU)
- simplified distributed training configuration
- automatic handling of mixed-precision training
- built-in model saving/loading utilities
- support for training techniques such as gradient accumulation
When training deep learning models you usually want to make good use of the available GPUs, and the choice between DP and DDP directly affects how well training runs. Building on a basic introduction to Accelerate, this article is a short usage guide focused on the single-machine multi-GPU setting; corrections are welcome.
Configuring a single machine with multiple GPUs
First of all, Accelerate needs to be installed:
pip install accelerate
For a single-machine multi-GPU setup, run the configuration wizard with the accelerate config command:
accelerate config
The configuration process asks the following questions interactively:
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use TensorParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
Do you wish to use mixed precision?
fp16
accelerate configuration saved at /home/a/.cache/huggingface/accelerate/default_config.yaml
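If you prefer to skip the interactive wizard, the most common settings can also be passed directly as flags to accelerate launch. This is just a sketch: the script name train.py is a placeholder, and the exact flag names for your version can be checked with accelerate launch --help.
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 train.py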
Alternatively, you can create the configuration file by hand at /home/a/.cache/huggingface/accelerate/default_config.yaml; some of the parameters you would set manually are:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2 # use 2 GPUs
mixed_precision: fp16
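For reference, a complete default_config.yaml generated by the wizard above looks roughly like the following. The exact set of keys varies between Accelerate versions, so treat this as an illustrative sketch rather than the authoritative format:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false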
This completes the configuration via accelerate config; next we walk through an example of how to train on multiple GPUs.
A complete Accelerate example on a single machine with multiple GPUs
Let's start with the code; since the topic is how to use the library, the official example is the best reference:
https://github.com/huggingface/accelerate/blob/main/examples/cv_example.py
It is a complete example that already covers the CPU, single-GPU, and multi-GPU cases, so there is no need to go through it line by line; here we only list the main steps.
Initialize the Accelerator
from accelerate import Accelerator
from accelerate.utils import set_seed
# args.cpu is passed in from the command line and can force execution on CPU;
# mixed_precision selects the mixed-precision mode (fp16/bf16).
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
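Once the Accelerator exists, it describes the distributed context of the current process. As a quick sanity check, a small sketch using attributes of the Accelerator object (device, num_processes, mixed_precision, is_main_process) can print how the run is laid out:
# accelerator.print only emits on the main process, so the output is not duplicated per GPU.
accelerator.print(f"device: {accelerator.device}")                    # e.g. cuda:0
accelerator.print(f"num_processes: {accelerator.num_processes}")      # 2 with the config above
accelerator.print(f"mixed_precision: {accelerator.mixed_precision}")  # 'fp16' here
if accelerator.is_main_process:
    print("this branch only runs on the main process (rank 0)")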
Prepare the data and model
from torch.utils.data import DataLoader
from timm import create_model  # the official example builds the model with timm

train_dataset = PetsDataset()  # create the datasets (see the official example for the actual arguments)
eval_dataset = PetsDataset()
# initialize the data loaders
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size, num_workers=4)
# initialize the model
model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))
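PetsDataset is defined in the official example and reads the pet images from the --data_dir argument. If you only want to smoke-test the multi-GPU plumbing without downloading data, a stand-in dataset that returns the same {"image", "label"} dictionary works just as well; the class below is a hypothetical replacement, not part of the official script:
import torch
from torch.utils.data import Dataset

class FakePetsDataset(Dataset):
    """Hypothetical stand-in: random images with random class labels."""
    def __init__(self, length=256, num_classes=37, image_size=224):
        self.length = length
        self.num_classes = num_classes
        self.image_size = image_size

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return {
            "image": torch.rand(3, self.image_size, self.image_size),  # same key the training loop expects
            "label": torch.randint(0, self.num_classes, ()).item(),    # integer class id
        }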
Create the optimizer and learning-rate scheduler
import torch
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.Adam(params=model.parameters(), lr=lr / 25)
lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=lr, epochs=num_epochs, steps_per_epoch=len(train_dataloader))
Wrap the objects with prepare()
# wrap the model, optimizer, data loaders, and scheduler with accelerator.prepare()
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
The prepare() method automatically:
- distributes the model across the GPUs
- wraps the optimizer to support mixed precision
- sets up distributed data samplers, so each process sees only its own shard of the data (see the sketch below)
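To make the last point concrete, you can compare how many batches a single process iterates per epoch with and without prepare(). This is only a diagnostic sketch that reuses batch_size and train_dataset from the snippets above; note that by default Accelerate does not split batches, so the effective global batch size is batch_size × num_processes (for example 64 × 2 = 128 with two GPUs):
from torch.utils.data import DataLoader

plain_loader = DataLoader(train_dataset, batch_size=batch_size)  # no sharding
accelerator.print(f"batches per epoch on one process, unprepared: {len(plain_loader)}")
# After prepare(), the dataloader uses a distributed sampler, so each of the
# num_processes workers only iterates over its own shard (about half with 2 GPUs).
accelerator.print(f"batches per epoch on each prepared process: {len(train_dataloader)}")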
Training loop
for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # We could avoid this line since we set the accelerator with `device_placement=True`.
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        inputs = (batch["image"] - mean) / std
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
Key points:
- use accelerator.backward() instead of loss.backward()
- gradients are synchronized across the GPUs automatically
- mixed-precision training is handled automatically
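The feature list at the top also mentions gradient accumulation. Accelerate supports it through the gradient_accumulation_steps argument of the Accelerator and the accumulate() context manager; below is a minimal sketch of how the loop above changes (the value 4 is only an example):
# When constructing the Accelerator, also pass gradient_accumulation_steps:
accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision,
                          gradient_accumulation_steps=4)

# The inner training loop then wraps each step in accelerator.accumulate(model):
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        inputs = (batch["image"] - mean) / std
        loss = torch.nn.functional.cross_entropy(model(inputs), batch["label"])
        accelerator.backward(loss)  # gradient sync only happens on the final step of each accumulation window
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()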
Evaluation and model saving
model.eval()
accurate = 0
num_elems = 0
for _, batch in enumerate(eval_dataloader):
    # We could avoid this line since we set the accelerator with `device_placement=True`.
    batch = {k: v.to(accelerator.device) for k, v in batch.items()}
    inputs = (batch["image"] - mean) / std
    with torch.no_grad():
        outputs = model(inputs)
    predictions = outputs.argmax(dim=-1)
    # gather predictions and labels from all processes so the metric covers the whole eval set
    predictions, references = accelerator.gather_for_metrics((predictions, batch["label"]))
    accurate_preds = predictions == references
    num_elems += accurate_preds.shape[0]
    accurate += accurate_preds.long().sum()
eval_metric = accurate.item() / num_elems
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")
# save the model only on the main process; unwrap it first so the state_dict
# keys do not carry the DDP "module." prefix
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    torch.save(unwrapped_model.state_dict(), "model.pth")
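If you also want resumable checkpoints (model, optimizer, scheduler, and RNG state together), Accelerate provides save_state and load_state; a brief sketch, with the checkpoint directory name chosen arbitrarily:
# Save everything that was passed to prepare() into a single checkpoint directory.
accelerator.save_state("checkpoints/last")
# Later, after re-creating and re-preparing the same objects, restore them in place:
accelerator.load_state("checkpoints/last")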
That completes this brief introduction to configuring and using Accelerate on a single machine with multiple GPUs. Note that the workflow in the official example is to create an accelerate config in the script's directory first and then launch the training script with it, as shown below:
accelerate config --config_file config.yaml # This will create a config file on your server to `config.yaml`
accelerate launch --config_file config.yaml ./cv_example.py --data_dir path_to_data # This will run the script on your server
This generates a config.yaml file in the current directory, again through the interactive wizard; the benefit is that you can keep several configuration files and pick the one you want for each training run. This article only stays at the level of using the tool; deeper and more specific usage will have to come from applying it in real projects. Corrections and suggestions are welcome.