Implementing and Optimizing Distributed Training with PyTorch DDP
Introduction
In real-world deep learning work, training large neural networks often demands substantial compute and time. A common way to speed things up is to distribute training across multiple GPUs. This article shows how to run distributed training with the PyTorch framework, and in particular how to parallelize training efficiently with the DistributedDataParallel (DDP) module.
Environment Setup
Before writing any distributed training code, the multi-GPU runtime environment has to be configured. In the example code, the setup function prepares the environment for each process: it sets the master node address and port, initializes the process group, and pins the process to its own GPU:
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12357'
    torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
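The setup function runs once in every process. In this example the processes themselves are launched with torch.multiprocessing.spawn, which passes each process its rank automatically. The sketch below mirrors the entry point in the appendix; the only assumption added here is that world_size is taken from torch.cuda.device_count() instead of being hard-coded:

if __name__ == "__main__":
    # One process per visible GPU; spawn passes the process index to main as `rank`.
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size)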
Data Handling
To keep the workload balanced across GPUs, a DistributedSampler assigns each process its own non-overlapping shard of the dataset, with roughly equal size per rank. PyTorch's DataLoader then handles batching and loading on top of the sampler, keeping the input pipeline efficient.
train_sampler = DistributedSampler(train_set, num_replicas=world_size, rank=rank)
train_loader = DataLoader(train_set, batch_size=16, sampler=train_sampler)
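Each rank therefore iterates over roughly len(train_set) / world_size samples per epoch. A quick, hypothetical sanity check (not part of the training script) makes the partitioning visible:

# Hypothetical check: every rank should report roughly the same shard size.
print(f"rank {rank}: {len(train_sampler)} samples, {len(train_loader)} batches of size 16")

Note also that DistributedSampler shuffles using an internal epoch counter, which is why the training loop below calls train_sampler.set_epoch(epoch) at the start of every epoch; without it, every epoch would reuse the same shuffle order.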
Model Construction and Training
This example uses ResNet-18 as the model and wraps it with DistributedDataParallel to enable multi-GPU training. DDP keeps a full replica of the model on each GPU and synchronizes gradients across processes during the backward pass, so every replica applies identical parameter updates while the work is spread over all GPUs.
model = resnet18(pretrained=False, num_classes=10).cuda(rank)
model = DDP(model, device_ids=[rank])
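One consequence of the wrapping is that the original network is now reachable as model.module. A common pattern, sketched below and not part of the example script, is to save model.module.state_dict() from rank 0 only, so the checkpoint can later be loaded into an unwrapped model (the file name here is arbitrary):

if rank == 0:
    # Save the unwrapped weights so they can be reloaded without DDP later.
    torch.save(model.module.state_dict(), "resnet18_cifar10.pth")
torch.distributed.barrier()  # keep the other ranks in step with rank 0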
The training loop iterates over the training data and performs the forward pass, loss computation, backward pass, and parameter update. The tqdm library displays a progress bar, which makes the run easier to follow.
for epoch in range(10):
    model.train()
    train_sampler.set_epoch(epoch)  # reshuffle the per-rank shards each epoch
    pbar = tqdm(train_loader, desc="Training")
    for data in pbar:
        inputs, labels = data[0].cuda(rank), data[1].cuda(rank)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        pbar.set_postfix(Loss=loss.item(), Epoch=epoch, Rank=rank)
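The loss shown in the progress bar is local to each rank, because every process only sees its own shard of the data. If a single cross-process figure is wanted for logging, the per-rank losses can be combined with an all-reduce; the snippet below is an optional addition, not part of the original loop:

# Optional: average the last batch loss over all ranks before logging it.
loss_tensor = loss.detach().clone()
torch.distributed.all_reduce(loss_tensor, op=torch.distributed.ReduceOp.SUM)
loss_tensor /= world_size
if rank == 0:
    print(f"Epoch {epoch}, mean loss across ranks: {loss_tensor.item():.4f}")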
Performance Evaluation
After each training epoch, an evaluation function computes the model's accuracy on the test set, which makes it easy to monitor how well the model is learning.
if rank == 0:
    accuracy = evaluate(model, rank, test_loader)
    print(f"Rank {rank}, Test Accuracy: {accuracy}%")
Summary
This article has shown how to use PyTorch's DistributedDataParallel module for distributed training. With this technique, multiple GPUs can be used effectively to accelerate model training on large datasets. We hope it helps readers understand and apply PyTorch's distributed training tools.
References
- PyTorch official documentation: DistributedDataParallel
- CIFAR-10 dataset description
With the walkthrough above, readers should be able to set up similar multi-GPU training jobs in their own projects and improve both training efficiency and model performance.
Appendix: complete code
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, Subset
from torchvision.models import resnet18
import wandb
import numpy as np
from tqdm import tqdm
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12357'
    torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    torch.distributed.destroy_process_group()

def evaluate(model, device, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    return accuracy
def main(rank, world_size):
    setup(rank, world_size)
    # if rank == 0:
    #     wandb.init(
    #         project="project-DDP-Cifar10",
    #         # track hyperparameters and run metadata
    #         config={
    #             "learning_rate": 1e-3,
    #         }
    #     )
    # Data transforms
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    # Load the training set
    train_set = torchvision.datasets.CIFAR10(root='/mnt/sda/zjb/data/cifar10', train=True, download=True, transform=transform)
    Downsampled_Sample = False  # whether to train on a reduced subset of the data
    if Downsampled_Sample:
        # Keep the first quarter of the samples (the same subset on every rank)
        split = int(np.floor(0.25 * len(train_set)))
        subset_indices = list(range(len(train_set)))[:split]
        train_set = Subset(train_set, subset_indices)
    # The sampler and the loader must refer to the same dataset object
    train_sampler = DistributedSampler(train_set, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(train_set, batch_size=16, sampler=train_sampler)
    test_set = torchvision.datasets.CIFAR10(root='/mnt/sda/zjb/data/cifar10', train=False, download=True, transform=transform)
    # Only rank 0 runs evaluation, so the test loader is deliberately not sharded
    test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
    # Model definition
    model = resnet18(pretrained=False, num_classes=10).cuda(rank)
    model = DDP(model, device_ids=[rank])
    # Loss function and optimizer
    criterion = nn.CrossEntropyLoss().cuda(rank)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    for epoch in range(10):
        model.train()
        train_sampler.set_epoch(epoch)  # reshuffle the per-rank shards each epoch
        pbar = tqdm(train_loader, desc="Training")
        for data in pbar:
            inputs, labels = data[0].cuda(rank), data[1].cuda(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            pbar.set_postfix(Loss=loss.item(), Epoch=epoch, Rank=rank)
        print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item()}")
        # Evaluate after each epoch (rank 0 only)
        if rank == 0:
            accuracy = evaluate(model, rank, test_loader)
            print(f"Rank {rank}, Test Accuracy: {accuracy}%")
    cleanup()
if __name__ == "__main__":
    world_size = 4  # number of GPUs
    torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size)
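To run the appendix as a script, start it directly with Python; main is spawned once per GPU (world_size = 4 here). Adjust world_size to match the number of available GPUs, and point the dataset root (/mnt/sda/zjb/data/cifar10) at a local directory; since download=True, torchvision fetches CIFAR-10 there on the first run.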