RuntimeError: DataLoader worker (pid 32633) is killed by signal: Killed._torch.utils.data.dataloader killed by signal-CSDN博客

本文链接：https://blog.csdn.net/weixin_46151178/article/details/130051688

硬件设备jetson xavier nx开发套件

软件版本bsp3271 jetpack4.6.1

torch1.10 torchvision 0.11.1

error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 32633) is killed by signal: Killed.

 python3 train.py --img 640 --batch 8 --epochs 300 --data data/coco128.yaml --cfg models/yolov5s.yaml --weights weights/yolov5s.pt --device '0'
github: skipping check (offline)
YOLOv5 🚀 v5.0-0-gf5b8f7d torch 1.10.0 CUDA:0 (Xavier, 7772.16796875MB)

Namespace(adam=False, artifact_alias='latest', batch_size=8, bbox_interval=-1, bucket='', cache_images=False, cfg='models/yolov5s.yaml', data='data/coco128.yaml', device='0', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp3', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=8, upload_dataset=False, weights='weights/yolov5s.pt', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    156928  models.common.C3                        [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/home/chd321/.local/lib/python3.6/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /media/nvidia/NVME/pytorch/pytorch-v1.10.0/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.2 GFLOPS

Transferred 360/362 items from weights/yolov5s.pt

WARNING: Dataset not found, nonexistent paths: ['/home/chd321/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [01:45<00:00, 65.9kB/s]
Dataset autodownload success

Scaled weight_decay = 0.0005
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
train: Scanning '../coco128/labels/train2017' images and labels... 126 found, 2 missing, 0 empty, 0 corrupted: 100%|█████████████████████████████████████| 128/128 [00:00<00:00, 681.53it/s]
train: New cache created: ../coco128/labels/train2017.cache
val: Scanning '../coco128/labels/train2017.cache' images and labels... 126 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
Plotting labels...

autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 test
Using 6 dataloader workers
Logging results to runs/train/exp3
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
  0%|                                                                                                                                                                | 0/16 [00:56<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 543, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 303, in train
    pred = model(imgs)  # forward
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chd321/yolov5/models/yolo.py", line 123, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/home/chd321/yolov5/models/yolo.py", line 139, in forward_once
    x = m(x)  # run
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chd321/yolov5/models/common.py", line 138, in forward
    return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chd321/yolov5/models/common.py", line 42, in forward
    return self.act(self.bn(self.conv(x)))
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
  File "/home/chd321/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 29639) is killed by signal: Killed.

原因内存不够，添加虚拟内存内存方法如下

chd321@ubuntu:~/yolov5$ sudo vi trian.py
[sudo] password for chd321:
chd321@ubuntu:~/yolov5$ sudo vi train.py
chd321@ubuntu:~/yolov5$ git clone https://github.com/JetsonHacksNano/installSwapfile
Cloning into 'installSwapfile'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 37 (delta 0), reused 0 (delta 0), pack-reused 34
Unpacking objects: 100% (37/37), done.
chd321@ubuntu:~/yolov5$ cd installSwapfile/
chd321@ubuntu:~/yolov5/installSwapfile$ s
-bash: s: command not found
chd321@ubuntu:~/yolov5/installSwapfile$ ls
autoMount.sh  installSwapfile.sh  LICENSE  README.md
chd321@ubuntu:~/yolov5/installSwapfile$ vi installSwapfile.sh
chd321@ubuntu:~/yolov5/installSwapfile$ sudo ./installSwapfile.sh