Distributed Training Issues with YOLOv5

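Introduction:

While training on a multi-GPU server, I ran into this error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. A quick search showed that it is a distributed-training issue.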

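PyTorch's distributed training is built on the torch.distributed package, which provides a set of primitives for synchronizing computation and data across multiple processes. These primitives can communicate between processes on one or more machines and support several backends (such as NCCL, Gloo, and MPI).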

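PyTorch distributed training has two main modes:

Data parallel: the most common mode. Each worker holds a full replica of the model and processes a subset of the dataset. All workers run the forward and backward passes in parallel, then synchronize so that every replica applies the same parameter update. PyTorch offers two implementations: torch.nn.DataParallel (a single process driving several GPUs on one machine) and torch.nn.parallel.DistributedDataParallel (DDP for short, one process per GPU).

Model parallel: used when the model is too large to fit on a single GPU. Different parts of the model run on different GPUs. This takes more programming effort but makes it possible to train larger models.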

Local M6000: 20-21 min per epoch.
Two A800s on the server: 3-4 min per epoch.
Four A800s on the server: 38-40 s per epoch.
As of July 2024, a single A800 80GB costs about 100,000-130,000 yuan :(

1. Training on the local machine

Preparing the training set

Every training set uses the layout below:
dataset
|_____images
|_____labels

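The images folder holds the images and the labels folder holds the txt labels in YOLO format, one object per line: class_id center_x center_y width height, with coordinates normalized to the image size.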
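To make the label format concrete, here is a minimal sketch that turns a pixel-space box into one YOLO label line. The function name to_yolo_line and the sample box are made up for illustration; they are not part of YOLOv5.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label line.

    Hypothetical helper: YOLO labels store the box center and size,
    each divided by the image width or height so they fall in [0, 1].
    """
    cx = (x_min + x_max) / 2 / img_w   # normalized center x
    cy = (y_min + y_max) / 2 / img_h   # normalized center y
    w = (x_max - x_min) / img_w        # normalized width
    h = (y_max - y_min) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A person (class 0) occupying pixels (100, 200)-(300, 400) in a 640x480 image.
line = to_yolo_line(0, 100, 200, 300, 400, 640, 480)
print(line)  # 0 0.312500 0.625000 0.312500 0.416667
```

A box that extends past the image border produces values outside [0, 1], which is the kind of label the training log later reports as corrupt ("negative label values").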
You can prepare several such training sets and combine them through a single yaml file.
Dataset configuration file:

Dataset/person.yaml


train: 
  - Dataset/CoCoPerson_Mini/train
  - Dataset/ped/train 
  - Dataset/labeled_dataset_20240724/door_dataset_4_grid # door
  - Dataset/labeled_dataset_20240724/mask_dataset_4_grid # mask
  - Dataset/labeled_dataset_20240724/video_day_with_person # day
  - Dataset/labeled_dataset_20240724/video_night_with_person # night
  - Dataset/dark_augment_dataset  # dark
val: # val images (relative to 'path')    
  - Dataset/CoCoPerson_Mini/val
  - Dataset/val_dataset_20240724

nc: 1                     

# Classes
names:
  0: person

models/person.yaml

Model configuration file

nc: 1  # number of classes
depth_multiple: 0.33  # model depth multiple    
width_multiple: 0.25  # layer channel multiple  
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2      args in order: [ch_out, kernel, stride, padding, groups]
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
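The channel counts and repeat counts in this file are nominal values; depth_multiple and width_multiple shrink them when the model is built. The sketch below reproduces that scaling as I understand it from YOLOv5's model parser (the helper names here are my own, and the rounding rules are an assumption you should check against your copy of models/yolo.py). The results agree with the layer table printed in the training log further down.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_layer(n, c_out, depth_multiple=0.33, width_multiple=0.25):
    """Return (repeats, output channels) after applying the multiples.

    Assumed rule: repeats are scaled only when n > 1 and never drop below 1;
    channels are scaled and rounded up to a multiple of 8.
    """
    n_scaled = max(round(n * depth_multiple), 1) if n > 1 else n
    c_scaled = make_divisible(c_out * width_multiple, 8)
    return n_scaled, c_scaled


# (number, ch_out) pairs taken from the backbone section above.
for n, c in [(1, 64), (3, 128), (6, 256), (9, 512), (3, 1024)]:
    print((n, c), "->", scale_layer(n, c))
```

For example, the C3 block declared as [-1, 9, C3, [512]] ends up with 3 repeats and 128 channels.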

Training

python train.py --epochs 150 --data Dataset/person.yaml --batch-size 32 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4  --name yolov5n_person --device 0

train.py

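What each argument of the training script does:

--weights: path to the initial weights.
--cfg: path to the model configuration file (model.yaml).
--data: path to the dataset configuration file (dataset.yaml).
--hyp: path to the hyperparameter configuration file.
--epochs: total number of training epochs.
--batch-size: total batch size across all GPUs; -1 enables autobatch.
--imgsz (aliases --img, --img-size): training and validation image size in pixels.
--rect: rectangular training.
--resume: resume the most recent training run.
--nosave: only save the final checkpoint.
--noval: only validate on the final epoch.
--noautoanchor: disable AutoAnchor.
--noplots: do not save plot files.
--evolve: evolve hyperparameters.
--bucket: gsutil bucket.
--cache: cache images (in RAM or on disk).
--image-weights: use weighted image selection during training.
--device: CUDA device, e.g. 0 or 0,1,2,3 or cpu.
--multi-scale: vary the image size during training.
--single-cls: train multi-class data as a single class.
--optimizer: optimizer, one of 'SGD', 'Adam', 'AdamW'.
--sync-bn: use SyncBatchNorm; only available in DDP mode.
--workers: maximum number of dataloader workers (per RANK in DDP mode).
--project: directory to save to (the project part of project/name).
--name: run name to save to (the name part of project/name).
--exist-ok: allow an existing project/name without incrementing the run name.
--quad: use the quad dataloader.
--cos-lr: use the cosine learning-rate scheduler.
--label-smoothing: label smoothing epsilon.
--patience: early-stopping patience (number of epochs without improvement).
--freeze: layers to freeze, e.g. backbone=10, first3=0 1 2.
--save-period: save a checkpoint every x epochs (disabled if < 1).
--seed: global training seed.
--local_rank: automatic DDP multi-GPU argument; do not modify.
Logger arguments:

--entity: entity.
--upload_dataset: upload the dataset; accepts the "val" option.
--bbox_interval: bounding-box image logging interval.
--artifact_alias: version of the dataset artifact to use.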
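The --freeze argument is the least obvious of these, so here is a rough sketch of how I believe train.py turns it into a set of frozen parameters: a single number N means "layers 0 to N-1", several numbers mean exactly those layers, and a parameter is frozen when its name contains one of the resulting prefixes. The function names are hypothetical and the matching rule is my reading of the script, so treat it as an illustration only.

```python
def frozen_prefixes(freeze):
    """Map the --freeze list to parameter-name prefixes.

    Assumed behavior: [4] freezes layers 0..3, [0, 1, 2] freezes exactly those.
    """
    layers = freeze if len(freeze) > 1 else range(freeze[0])
    return [f"model.{i}." for i in layers]


def is_frozen(param_name, prefixes):
    """A parameter is frozen if any prefix occurs in its name."""
    return any(p in param_name for p in prefixes)


prefixes = frozen_prefixes([4])  # corresponds to --freeze 4
names = [
    "model.0.conv.weight",
    "model.3.bn.bias",
    "model.4.cv1.conv.weight",
    "model.13.cv1.conv.weight",
]
for name in names:
    print(name, "frozen" if is_frozen(name, prefixes) else "trainable")
```

With --freeze 4 this marks model.0 through model.3 as frozen, which matches the "freezing model.0..." to "freezing model.3..." lines in the training log.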

2. Distributed training

Training with two GPUs on the server failed with the following error:

AutoAnchor: 4.52 anchors/target, 0.998 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/yolov5m_person3/labels.jpg... 
Traceback (most recent call last):
  File "train.py", line 646, in <module>
    main(opt)
  File "train.py", line 540, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 235, in train
    model = smart_DDP(model)
  File "/code/utils/torch_utils.py", line 63, in smart_DDP
    return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 520, in __init__
    self.process_group = _get_default_group()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 394, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
  • Command for distributed training
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --epochs 100 --data Dataset/person.yaml --batch-size 256 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4  --name yolov5n_person --device 0,1

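The --nproc_per_node argument sets the number of processes per node (a node here usually means one machine); it is normally set to the number of GPUs. The launcher exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE as environment variables for every process. The --use_env flag additionally makes it pass the local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument. As the warning in the output below notes, torch.distributed.launch is deprecated in favor of torch.distributed.run, where --use_env is the default.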
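A small sketch of what each launched process sees may help here. It reads the three variables with the same defaults train.py uses (-1, -1, 1), and splits the total batch size evenly across processes. The helper names are made up, and the even split with a divisibility check is my understanding of how YOLOv5 treats --batch-size in DDP mode, not something verified against every version.

```python
import os


def read_ddp_env(env=None):
    """Read the launcher's variables; defaults describe a plain `python train.py` run."""
    env = os.environ if env is None else env
    return {
        "LOCAL_RANK": int(env.get("LOCAL_RANK", -1)),  # GPU index on this machine
        "RANK": int(env.get("RANK", -1)),              # global process index
        "WORLD_SIZE": int(env.get("WORLD_SIZE", 1)),   # total number of processes
    }


def per_gpu_batch(total_batch, world_size):
    """Split the total batch size evenly across processes (assumed behavior)."""
    if total_batch % world_size != 0:
        raise ValueError(f"batch size {total_batch} is not a multiple of {world_size}")
    return total_batch // world_size


# What the second of two launched processes would see, versus an unlaunched run.
launched = read_ddp_env({"LOCAL_RANK": "1", "RANK": "1", "WORLD_SIZE": "2"})
plain = read_ddp_env({})
print(launched, "per-GPU batch:", per_gpu_batch(256, launched["WORLD_SIZE"]))
print(plain)
```

With --batch-size 256 on two GPUs, each process would load 128 images per step.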

train.py is the training script; everything after it is passed to the script as arguments.
The output is shown below:

python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --epochs 100 --data Dataset/person.yaml --batch-size 256 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4  --name yolov5n_person --device 0,1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:177: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=weights/yolov5n.pt, cfg=models/person.yaml, data=Dataset/person.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=256, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=yolov5n_person, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[4], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2024-7-24 Python-3.8.10 torch-1.10.0a0+3fd9dcf CUDA:0 (NVIDIA A800-SXM4-80GB, 81251MiB)
                                                         CUDA:1 (NVIDIA A800-SXM4-80GB, 81251MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.0, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Bootstrap : Using bond0:10.0.0.156<0>
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Plugin name set by env to libnccl-net-none.so

cnwla-a800-p01107:46170:46170 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net-none.so), using internal implementation
cnwla-a800-p01107:46170:46170 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [1]mlx5_bond_1:1/RoCE [2]mlx5_bond_2:1/RoCE [3]mlx5_bond_3:1/RoCE [4]mlx5_bond_4:1/RoCE ; OOB bond0:10.0.0.156<0>
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Using network IB
NCCL version 2.11.4+cuda11.4
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Bootstrap : Using bond0:10.0.0.156<0>
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Plugin name set by env to libnccl-net-none.so

cnwla-a800-p01107:46171:46171 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net-none.so), using internal implementation
cnwla-a800-p01107:46171:46171 [1] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [1]mlx5_bond_1:1/RoCE [2]mlx5_bond_2:1/RoCE [3]mlx5_bond_3:1/RoCE [4]mlx5_bond_4:1/RoCE ; OOB bond0:10.0.0.156<0>
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Using network IB
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_TC set by environment to 136.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_SL set by environment to 5.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_TC set by environment to 136.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_SL set by environment to 5.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 00/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 01/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 02/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 03/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 04/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 05/16 :    0   1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 06/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 07/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 08/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 09/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 10/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 11/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 12/16 :    0   1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 13/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 14/16 :    0   1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 15/16 :    0   1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff,00000000
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 00 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 00 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 01 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 01 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 02 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 02 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 03 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 03 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 04 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 04 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 05 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 05 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 06 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 06 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 07 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 07 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 08 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 08 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 09 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 09 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 10 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 10 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 11 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 11 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 12 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 12 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 13 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 13 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 14 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 14 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 15 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 15 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Connected all rings
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Connected all trees
cnwla-a800-p01107:46171:48359 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cnwla-a800-p01107:46171:48359 [1] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Connected all rings
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Connected all trees
cnwla-a800-p01107:46170:48292 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cnwla-a800-p01107:46170:48292 [0] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
cnwla-a800-p01107:46170:48292 [0] NCCL INFO comm 0x7fc600008fb0 rank 0 nranks 2 cudaDev 0 busId 8d000 - Init COMPLETE
cnwla-a800-p01107:46171:48359 [1] NCCL INFO comm 0x7fa7b8008fb0 rank 1 nranks 2 cudaDev 1 busId 92000 - Init COMPLETE
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Launch mode Parallel

                 from  n    params  module                                  arguments                     
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]              
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]                
  2                -1  1      4800  models.common.C3                        [32, 32, 1]                   
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  4                -1  2     29184  models.common.C3                        [64, 64, 2]                   
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  6                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  8                -1  1    296448  models.common.C3                        [256, 256, 1]                 
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]                 
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]               
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]           
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]                
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]          
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 24      [17, 20, 23]  1      8118  models.yolo.Detect                      [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
person summary: 214 layers, 1765270 parameters, 1765270 gradients, 4.2 GFLOPs

Transferred 342/349 items from weights/yolov5n.pt
freezing model.0.conv.weight
freezing model.0.bn.weight
freezing model.0.bn.bias
freezing model.1.conv.weight
freezing model.1.bn.weight
freezing model.1.bn.bias
freezing model.2.cv1.conv.weight
freezing model.2.cv1.bn.weight
freezing model.2.cv1.bn.bias
freezing model.2.cv2.conv.weight
freezing model.2.cv2.bn.weight
freezing model.2.cv2.bn.bias
freezing model.2.cv3.conv.weight
freezing model.2.cv3.bn.weight
freezing model.2.cv3.bn.bias
freezing model.2.m.0.cv1.conv.weight
freezing model.2.m.0.cv1.bn.weight
freezing model.2.m.0.cv1.bn.bias
freezing model.2.m.0.cv2.conv.weight
freezing model.2.m.0.cv2.bn.weight
freezing model.2.m.0.cv2.bn.bias
freezing model.3.conv.weight
freezing model.3.bn.weight
freezing model.3.bn.bias
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.002), 60 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning /workspace/Dataset/CoCoPerson_Mini/train/labels.cache... 24223 images, 2501 backgrounds, 1 corrupt: 100%|██████████| 24
train: WARNING ⚠️ /workspace/Dataset/CoCoPerson_Mini/train/images/000000458309.jpg: ignoring corrupt image/label: negative label values [-0.00081699]
val: Scanning /workspace/Dataset/CoCoPerson_Mini/val/labels.cache... 10767 images, 536 backgrounds, 0 corrupt: 100%|██████████| 10767/1

AutoAnchor: 4.52 anchors/target, 0.998 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/yolov5n_person3/labels.jpg... 
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/yolov5n_person3
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       0/99      15.9G    0.07459    0.03704          0        467        640: 100%|██████████| 95/95 [01:46<00:00,  1.12s/it] 
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:35<00:00,  1.21it/s]
                   all      10767      32315      0.762      0.775      0.817      0.368

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       1/99      41.3G    0.05806    0.03021          0        410        640: 100%|██████████| 95/95 [01:00<00:00,  1.56it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:33<00:00,  1.28it/s]
                   all      10767      32315      0.743      0.794      0.818      0.436

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       2/99      41.3G    0.05144    0.02921          0        459        640: 100%|██████████| 95/95 [01:01<00:00,  1.53it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:32<00:00,  1.32it/s]
                   all      10767      32315       0.86      0.838       0.91      0.488

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       3/99      41.3G    0.04577    0.02913          0        419        640: 100%|██████████| 95/95 [01:01<00:00,  1.55it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:32<00:00,  1.31it/s]
                   all      10767      32315      0.861      0.851       0.92      0.614

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       4/99      41.3G    0.04273    0.02861          0        417        640: 100%|██████████| 95/95 [01:01<00:00,  1.55it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:33<00:00,  1.28it/s]
                   all      10767      32315      0.877      0.848      0.927      0.669

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       5/99      41.3G    0.04143    0.02842          0        420        640: 100%|██████████| 95/95 [01:01<00:00,  1.54it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:33<00:00,  1.28it/s]
                   all      10767      32315      0.864      0.837      0.913      0.618

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       6/99      41.3G    0.04074    0.02866          0        463        640: 100%|██████████| 95/95 [01:01<00:00,  1.55it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:34<00:00,  1.26it/s]
                   all      10767      32315      0.863      0.838      0.915       0.68

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       7/99      41.3G     0.0401    0.02827          0        418        640: 100%|██████████| 95/95 [01:00<00:00,  1.57it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:33<00:00,  1.28it/s]
                   all      10767      32315      0.857      0.814      0.899      0.655

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       8/99      41.3G    0.03952    0.02857          0        412        640: 100%|██████████| 95/95 [00:59<00:00,  1.59it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 43/43 [00:33<00:00,  1.28it/s]
                   all      10767      32315      0.863      0.834      0.916      0.681

3. Checking GPU usage

/code# nvidia-smi
Wed Jul 24 20:40:14 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   36C    P0   172W / 400W |  41510MiB / 81251MiB |     88%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:92:00.0 Off |                    0 |
| N/A   39C    P0   166W / 400W |  17158MiB / 81251MiB |     98%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1640244      C   /opt/conda/bin/python           41479MiB |
|    1   N/A  N/A   1640245      C   /opt/conda/bin/python           17151MiB |
+-----------------------------------------------------------------------------+

Both GPUs are clearly running!
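If you would rather check this from a script than read the table by eye, nvidia-smi can emit CSV through its --query-gpu and --format options. The sketch below parses that CSV; the function names are my own, and the sample string simply restates the memory and utilization figures from the table above so the parser can be tried without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '0, 41510, 81251, 88' into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total, util = (int(v.strip()) for v in line.split(","))
        rows.append(
            {"index": index, "mem_used_mib": used, "mem_total_mib": total, "util_pct": util}
        )
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Sample text built from the two-GPU table above.
sample = "0, 41510, 81251, 88\n1, 17158, 81251, 98\n"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```

On the server itself, calling query_gpus() in a loop gives a simple way to confirm that every GPU stays busy during an epoch.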

  • Training with 4 GPUs
Thu Jul 25 20:14:08 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:21:00.0 Off |                    0 |
| N/A   39C    P0   195W / 400W |  33384MiB / 81251MiB |     95%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   43C    P0   183W / 400W |  33382MiB / 81251MiB |     98%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   40C    P0   125W / 400W |  33382MiB / 81251MiB |     98%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:56:00.0 Off |                    0 |
| N/A   38C    P0   166W / 400W |  33286MiB / 81251MiB |     98%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3135579      C   /opt/conda/bin/python           33343MiB |
|    1   N/A  N/A   3135580      C   /opt/conda/bin/python           33373MiB |
|    2   N/A  N/A   3135581      C   /opt/conda/bin/python           33373MiB |
|    3   N/A  N/A   3135582      C   /opt/conda/bin/python           33277MiB |
+-----------------------------------------------------------------------------+

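When training with four GPUs, I kept hitting an assertion error:
assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'

Running torch.cuda.device_count() on the server outside of training prints 4, so four GPUs are attached. During the run, however, printing torch.cuda.device_count() just before the assertion gave 2, while LOCAL_RANK took the values 0, 1, 2, 3, so the assertion fired for LOCAL_RANK 2 and 3.
Strangely, once I printed torch.cuda.device_count() at the top of train.py, right below GIT_INFO, every later call returned 4 as well. Calling it early was enough to make the problem go away.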

LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # https://pytorch.org/docs/stable/elastic/run.html
RANK = int(os.getenv('RANK', -1))
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))
GIT_INFO = check_git_info()
# After printing the values below, everything worked normally
print("local rank:", LOCAL_RANK)
print("RANK: ",RANK)
print("WORLD_SIZE: ",WORLD_SIZE)
print("torch.cuda.device_count(): ",torch.cuda.device_count())

Launch command for 4 GPUs:

python -m torch.distributed.run --nproc_per_node=4  train.py --epochs 250 --data Dataset/person.yaml --batch-size 1024 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 3  --name yolov5n_person --device 0,1 --project /workspace/runs/model --save-period 10

Partial output

local rank: 2
RANK:  2
WORLD_SIZE:  4
local rank: 1
RANK:  1
WORLD_SIZE:  4
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) local rank: 3
RANK:  3
WORLD_SIZE:  4
torch.cuda.device_count():  4
torch.cuda.device_count():  4
torch.cuda.device_count():  4
4 LOCAL_RANK: 2
4 LOCAL_RANK: 3
4 LOCAL_RANK: 1

wandb: W&B disabled due to login timeout.
local rank: 0
RANK:  0
WORLD_SIZE:  4
torch.cuda.device_count():  4
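One possible explanation for the assertion, which I have not verified on this setup: the 4-GPU command still passes --device 0,1, and YOLOv5's device selection appears to copy that string into CUDA_VISIBLE_DEVICES. If CUDA has not been initialized yet at that point, only two devices would be visible afterwards, and ranks 2 and 3 would fail the check. Calling torch.cuda.device_count() early would initialize CUDA with all four devices before the variable changes, which would explain why the print statement "fixed" it. The sketch below only models that arithmetic; the function names are hypothetical and nothing here touches CUDA.

```python
def visible_devices(device_arg):
    """IDs that would be exposed if CUDA_VISIBLE_DEVICES were set to the --device string."""
    return [d for d in device_arg.replace(" ", "").split(",") if d]


def failing_ranks(device_arg, nproc_per_node):
    """Local ranks that would fail `device_count() > LOCAL_RANK` under that assumption."""
    count = len(visible_devices(device_arg))
    return [rank for rank in range(nproc_per_node) if not count > rank]


print(failing_ranks("0,1", 4))      # [2, 3]
print(failing_ranks("0,1,2,3", 4))  # []
```

If this reading is right, passing --device 0,1,2,3 together with --nproc_per_node=4 would be a cleaner fix than relying on the early call.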

Explanation of the local_rank parameter in PyTorch distributed training: https://pytorch.org/docs/stable/elastic/run.html
