- Understanding distributed training
- Distributed training is multi-process, multi-node training. However, if the model is so large that even a single image exceeds the memory of one GPU, it behaves no differently from data parallelism on a single multi-GPU machine: both run into out-of-memory errors;
- Model parallelism, on the other hand, runs too slowly: at any given moment only one GPU is actually computing while the others sit idle. See the write-up below and the sketch that follows it:
PyTorch single-machine parallel training
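As a rough illustration of why naive model parallelism leaves GPUs idle, here is a minimal sketch (assumes two visible GPUs; the class and layer names are made up for this note, not taken from any particular repository):

import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Naive model parallelism: the two halves live on different GPUs."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(1024, 1024).to('cuda:0')  # first half on GPU 0
        self.stage1 = nn.Linear(1024, 10).to('cuda:1')     # second half on GPU 1

    def forward(self, x):
        x = self.stage0(x.to('cuda:0'))
        # activations are copied to GPU 1; GPU 0 now sits idle until the next batch,
        # which is exactly the serialization described above
        return self.stage1(x.to('cuda:1'))

if __name__ == '__main__':
    model = TwoStageNet()
    out = model(torch.randn(8, 1024))
    print(out.shape)  # torch.Size([8, 10]), resident on cuda:1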
- Walkthrough of a torch 1.0 distributed training example; the main steps (pulled together into the minimal sketch after this list) are:
- add the relevant arguments (e.g. --local_rank) to the argument parser;
- before training, bind this process to its GPU according to its local rank:
torch.cuda.set_device(args.local_rank)
- initialize the process group:
torch.distributed.init_process_group(backend='nccl',init_method='env://')
- move the model to the GPU and wrap it for distributed execution:
self.model = self.model.cuda()
# with torch.distributed.launch, each spawned process drives a single GPU,
# so the wrapper is given that process's device: device_ids=[self.args.local_rank]
self.model = torch.nn.parallel.DistributedDataParallel(self.model, device_ids=[self.args.local_rank])
- run with torch.distributed.launch:
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 train_autodeeplab.py --backbone resnet --lr 0.007 --workers 4 --epochs 50 --batch-size 2 --gpu-ids 0,1,2,3 --checkname deeplab-resnet --eval-interval 1 --dataset pascal --arch-lr 0.1 --distributed True --local_rank 0
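Pulled together, a minimal runnable sketch of this launch-based flow (the script name ddp_minimal.py, the toy model, and the synthetic dataset are placeholders for illustration, not from the repository above):

import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# step 1: torch.distributed.launch passes --local_rank to every process it spawns
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# steps 2-3: bind this process to its GPU and join the process group
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

# step 4: move the model to the GPU and wrap it; each process owns one GPU
model = nn.Linear(32, 2).cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

# a DistributedSampler shards the dataset so every process sees a different slice
dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)          # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()               # gradients are all-reduced across processes
        optimizer.step()

Launched, for example, as: python -m torch.distributed.launch --nproc_per_node=4 ddp_minimal.py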
- A fuller torch distributed training example (the PyTorch ImageNet training script) is excerpted below; a condensed sketch of how it continues follows the excerpt:
import argparse
import os
import random
import shutil
import time
import warnings
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
model_names = sorted(name for name in models.__dict__
                     if name.islower() and not name.startswith("__")
                     and callable(models.__dict__[name]))
parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('data', metavar='DIR',
                    help='path to dataset')
parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet18',
                    choices=model_names,
                    help='model architecture: ' +
                         ' | '.join(model_names) +
                         ' (default: resnet18)')
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int,
                    metavar='N',
                    help='mini-batch size (default: 256), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.1, type=float,
                    metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum')
parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                    metavar='W', help='weight decay (default: 1e-4)',
                    dest='weight_decay')

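The remainder of the script is not reproduced here. Roughly, it spawns one worker per GPU and each worker joins the process group, wraps the model, and shards the data. A condensed sketch of that structure, continuing from the imports above (simplified, not the verbatim script; the rendezvous address below is a placeholder for the script's --dist-url argument):

def main():
    args = parser.parse_args()
    ngpus_per_node = torch.cuda.device_count()
    # spawn one worker process per GPU on this node
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

def main_worker(gpu, ngpus_per_node, args):
    # placeholder rendezvous address for a single-node run
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=ngpus_per_node, rank=gpu)
    torch.cuda.set_device(gpu)
    model = models.__dict__[args.arch]().cuda(gpu)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    train_dataset = datasets.ImageFolder(
        os.path.join(args.data, 'train'),
        transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))
    # shard the dataset across processes and split the total batch size per GPU
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size // ngpus_per_node,
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)
    # ... criterion, optimizer, and the per-epoch train/validate loop follow here

if __name__ == '__main__':
    main()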