Fixing "dimension specified as 0 but tensor has no dimensions" when running a model on multiple GPUs

When training with PyTorch on multiple GPUs via `DataParallel`, the following error appears while the per-GPU losses are being gathered:

Traceback (most recent call last):
  File "trainval_net.py", line 343, in <module>
    rois_label = fasterRCNN(im1_data,im2_data, im1_info, im2_info,gt_boxes, num_boxes)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 115, in forward
    return self.gather(outputs, self.output_device)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 127, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    return gather_map(outputs)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f0b38b5abe0>>
Traceback (most recent call last):
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
  File "/home/zhangxin/anaconda2/envs/py35/lib/python3.5/multiprocessing/queues.py", line 345, in get
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 954, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 887, in _find_spec
TypeError: 'NoneType' object is not iterable

The cause: as of PyTorch 0.4.0, reductions such as `mean()` and `sum()` return a 0-dimensional (scalar) tensor, and in this version `DataParallel`'s gather step cannot concatenate 0-dim tensors along dim 0 — as the traceback shows, it calls `i.size(ctx.dim)` on each per-GPU output, which fails on a tensor with no dimensions. The fix is simple: reshape every loss into a 1-element vector before returning it. For example, if the loss is currently returned as a scalar (`criterion` here stands in for whatever loss function the model uses):

def compute_loss(pred, target):
    loss = criterion(pred, target)   # 0-dim scalar tensor in torch >= 0.4.0
    return loss

change it to:

def compute_loss(pred, target):
    loss = criterion(pred, target)
    return loss.view(-1)             # 1-element vector; gather along dim 0 now works

Remember to apply this to every loss the model returns, since some models produce more than one (Faster R-CNN, for example, returns separate RPN and RCNN classification and bounding-box losses). Also note that after `DataParallel` gathers the outputs, each returned loss is a vector with one element per GPU, so call `loss.mean()` (or `.sum()`) before `backward()`.
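To see why the gather step fails, here is a minimal pure-Python sketch that mirrors the `i.size(ctx.dim)` call from the traceback. `FakeTensor` is a made-up stand-in for just the bits of `torch.Tensor` this sketch needs, not the real API:

```python
class FakeTensor:
    """Stand-in mimicking torch.Tensor's size()/view() for this sketch."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def size(self, dim):
        # A 0-dim tensor has no dimensions to index, just like in torch 0.4.0
        if len(self.shape) == 0:
            raise RuntimeError(
                "dimension specified as %d but tensor has no dimensions" % dim)
        return self.shape[dim]

    def view(self, *new_shape):
        # Only models the case used here: flattening a scalar to shape (1,)
        assert new_shape == (-1,) and self.shape == ()
        return FakeTensor((1,))


def gather_sizes(outputs, dim=0):
    # Mirrors torch/nn/parallel/_functions.py line 54 from the traceback:
    #   ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
    return tuple(o.size(dim) for o in outputs)


scalar_losses = [FakeTensor(()), FakeTensor(())]     # one 0-dim loss per GPU
try:
    gather_sizes(scalar_losses)
except RuntimeError as e:
    print(e)  # dimension specified as 0 but tensor has no dimensions

vector_losses = [l.view(-1) for l in scalar_losses]  # the fix: 1-element vectors
print(gather_sizes(vector_losses))                   # (1, 1)
```

With 1-element vectors, gather can read a size along dim 0 from each GPU's output and concatenate them, which is exactly what `loss.view(-1)` enables in the real model.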
