Bugs Encountered While Debugging PyTorch

1. Tensor shapes passed to the loss function do not match

Traceback (most recent call last):
  File "train.py", line 182, in <module>
    val_percent=args.val / 100)
  File "train.py", line 81, in train_net
    loss = criterion(masks_pred, true_masks)
  File "/public/home/lidd/.conda/envs/lgg/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/home/lidd/.conda/envs/lgg/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/public/home/lidd/.conda/envs/lgg/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/public/home/lidd/.conda/envs/lgg/lib/python3.6/site-packages/torch/nn/functional.py", line 1409, in nll_loss
    return torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: invalid argument 3: only batches of spatial targets supported (3D tensors) but got targets of dimension: 4 at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THNN/generic/SpatialClassNLLCriterion.c:59

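The error means that, when the loss is computed, the target must be a 3D tensor, but a 4D tensor was passed.

Input tensor: [batch-size, 7, H, W]
Target tensor: [batch-size, 1, H, W]

So which dimension should be dropped?
See the official PyTorch documentation for CrossEntropyLoss.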

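[Figure: screenshot of the CrossEntropyLoss documentation. The red box marks the input tensor shape; the blue box marks the target tensor shape.]
So the input and target tensors should have the following shapes:

Input tensor: [batch-size, 7, H, W]
Target tensor: [batch-size, H, W]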
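
The shape change above can be sketched as follows. This is a minimal illustration, not the original train.py: the tensor sizes are made up, random data stands in for the network output and the masks, and the variable names only mirror those in the traceback. It assumes the masks hold integer class indices in the range 0 to 6.

```python
import torch
import torch.nn as nn

batch_size, num_classes, height, width = 2, 7, 4, 4

# Stand-in for the network output: one score per class per pixel, [N, C, H, W].
masks_pred = torch.randn(batch_size, num_classes, height, width)

# Stand-in for the ground-truth masks with a singleton channel axis, [N, 1, H, W].
true_masks = torch.randint(0, num_classes, (batch_size, 1, height, width))

# Drop the channel axis and make sure the labels are int64 class indices.
target = true_masks.squeeze(1).long()

criterion = nn.CrossEntropyLoss()
loss = criterion(masks_pred, target)
print(target.shape, loss.item())
```

squeeze(1) only removes axis 1 when its size is 1, so it leaves a mask that is already [batch-size, H, W] unchanged.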

2. Tensor shape mismatch during evaluation

Traceback (most recent call last):                                                                                                                        
  File "train.py", line 182, in <module>
    val_percent=args.val / 100)
  File "train.py", line 99, in train_net
    val_score = eval_net(net, val_loader, device)
  File "/public/home/lidd/lgg/Pytorch-UNet-master/eval.py", line 25, in eval_net
    tot += F.cross_entropy(mask_pred, true_masks).item()
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/nn/functional.py", line 1840, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: invalid argument 3: only batches of spatial targets supported (3D tensors) but got targets of dimension: 4 at /tmp/pip-req-build-ocx5vxk7/aten/src/THNN/generic/SpatialClassNLLCriterion.c:61

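Fix: adjust the tensor shapes in the evaluation code as well, so that the target passed to F.cross_entropy has shape [batch-size, H, W].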
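
A hedged sketch of what the evaluation loop could look like after the change. The function eval_loss, the 1x1 convolution standing in for the network, and the random batches are all hypothetical. The real eval.py is not shown in this post, so only the F.cross_entropy call is taken from the traceback.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def eval_loss(net, batches):
    """Average cross entropy over a list of (image, mask) batches."""
    net.eval()
    total = 0.0
    with torch.no_grad():
        for imgs, true_masks in batches:
            mask_pred = net(imgs)                   # [N, C, H, W]
            target = true_masks.squeeze(1).long()   # [N, 1, H, W] -> [N, H, W]
            total += F.cross_entropy(mask_pred, target).item()
    return total / max(len(batches), 1)

# Stand-in model and data (assumptions, for illustration only).
net = nn.Conv2d(1, 7, kernel_size=1)
batches = [
    (torch.randn(2, 1, 4, 4), torch.randint(0, 7, (2, 1, 4, 4)))
    for _ in range(3)
]
val_score = eval_loss(net, batches)
print(val_score)
```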

3. Illegal instruction error

(lgg) [lidd@login03 Pytorch-UNet-master]$ python train.py -b=8
INFO: Using device cpu
INFO: Network:
        1 input channels
        7 output channels (classes)
        Bilinear upscaling
INFO: Creating dataset with 868 examples
INFO: Starting training:
        Epochs:          5
        Batch size:      8
        Learning rate:   0.001
        Training size:   782
        Validation size: 86
        Checkpoints:     True
        Device:          cpu
        Images scaling:  1
    
Epoch 1/5:   0%|                                        | 0/782 [00:00<?, ?img/s]
Illegal instruction (core dumped)

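The process crashes with an illegal instruction error. This typically means that a compiled binary in the environment uses CPU instructions that the machine does not support.
Solution: reinstalling the environment fixed it.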

4. With batch size n, the loss suddenly drops to 0 during training

// loss suddenly drops to 0
python train.py -b=8
INFO: Using device cpu
INFO: Network:
        1 input channels
        7 output channels (classes)
        Bilinear upscaling
INFO: Creating dataset with 868 examples
INFO: Starting training:
        Epochs:          5
        Batch size:      8
        Learning rate:   0.001
        Training size:   782
        Validation size: 86
        Checkpoints:     True
        Device:          cpu
        Images scaling:  1
    
Epoch 1/5:  10%|██████████████▏                                                                                                                            | 80/782 [01:33<13:21,  1.14s/img, loss (batch)=0.886]
INFO: Validation cross entropy: 1.86862473487854
Epoch 1/5:  20%|███████████████████████████▊                                                                                                            | 160/782 [03:34<11:51,  1.14s/img, loss (batch)=2.35e-7]
INFO: Validation cross entropy: 5.887489884504049e-10
Epoch 1/5:  31%|███████████████████████████████████████████▌                                                                                                  | 240/782 [05:41<11:29,  1.27s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  41%|██████████████████████████████████████████████████████████                                                                                    | 320/782 [07:49<09:16,  1.20s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  51%|████████████████████████████████████████████████████████████████████████▋                                                                     | 400/782 [09:55<07:31,  1.18s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  61%|███████████████████████████████████████████████████████████████████████████████████████▏                                                      | 480/782 [12:02<05:58,  1.19s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  72%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋                                        | 560/782 [14:04<04:16,  1.15s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  82%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                         | 640/782 [16:11<02:49,  1.20s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋           | 720/782 [18:21<01:18,  1.26s/img, loss (batch)=0]
INFO: Validation cross entropy: 0.0
Epoch 1/5:  94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋        | 736/782 [19:17<01:12,  1.57s/img, loss (batch)=0]
Traceback (most recent call last):
  File "train.py", line 182, in <module>
    val_percent=args.val / 100)
  File "train.py", line 66, in train_net
    for batch in train_loader:
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 4.
Original Traceback (most recent call last):
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/public/home/lidd/.conda/envs/lgg2/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: Expected object of scalar type Double but got scalar type Byte for sequence element 4 in sequence argument at position #1 'tensors'

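Analysis:
The training loss and the validation loss both look reasonable at the start.
The key question is what causes the loss to drop suddenly during training.
Separately, the crash at the end comes from the DataLoader: default_collate calls torch.stack on samples whose dtypes differ (Double versus Byte).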
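
The dtype mismatch in the traceback can be avoided by casting every sample to a fixed dtype inside the dataset. The sketch below uses a hypothetical ToyMaskDataset with made-up arrays. The dictionary keys "image" and "mask" are assumptions, since the post's real dataset class is not shown. It also prints the unique label values, which is a cheap way to check whether the masks contain more than one class. That is one possible reason for a zero loss, not one confirmed in this post.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyMaskDataset(Dataset):
    """Hypothetical dataset whose raw masks arrive with mixed dtypes."""

    def __init__(self):
        # One float64 mask and one uint8 mask, mimicking the Double/Byte mix.
        self.masks = [
            np.zeros((4, 4), dtype=np.float64),
            np.ones((4, 4), dtype=np.uint8),
        ]
        self.images = [np.random.rand(1, 4, 4) for _ in self.masks]

    def __len__(self):
        return len(self.masks)

    def __getitem__(self, i):
        return {
            # Explicit casts give every sample the same dtype, so that
            # default_collate can stack them into one batch tensor.
            "image": torch.from_numpy(self.images[i]).float(),
            "mask": torch.from_numpy(self.masks[i]).long(),
        }

loader = DataLoader(ToyMaskDataset(), batch_size=2)
batch = next(iter(loader))
print(batch["image"].dtype, batch["mask"].dtype)
print("labels present:", torch.unique(batch["mask"]).tolist())
```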

5. With batch size 1, the loss fluctuates during training

(lgg2) [lidd@login03 Pytorch-UNet-master]$ python train.py -b=1
INFO: Using device cpu
INFO: Network:
        1 input channels
        7 output channels (classes)
        Bilinear upscaling
INFO: Creating dataset with 868 examples
INFO: Starting training:
        Epochs:          5
        Batch size:      1
        Learning rate:   0.001
        Training size:   782
        Validation size: 86
        Checkpoints:     True
        Device:          cpu
        Images scaling:  1
    
Epoch 1/5:  11%|█████████████▉                                                                                                                 | 86/782 [02:12<12:41,  1.09s/img, loss (batch)=0.0636]
INFO: Validation cross entropy: 0.05946924779997315
Epoch 1/5:  22%|███████████████████████████▍                                                                                                 | 172/782 [04:52<10:44,  1.06s/img, loss (batch)=1.15e-8]
INFO: Validation cross entropy: 3.730315458357298
Epoch 1/5:  33%|█████████████████████████████████████████▏                                                                                   | 258/782 [08:16<25:50,  2.96s/img, loss (batch)=5.31e-9]
INFO: Validation cross entropy: 2.2019995843036284
Epoch 1/5:  44%|██████████████████████████████████████████████████████▉                                                                      | 344/782 [13:43<21:47,  2.98s/img, loss (batch)=7.86e-9]
INFO: Validation cross entropy: 1.200446563662029
Epoch 1/5:  54%|████████████████████████████████████████████████████████████████████▏                                                         | 423/782 [18:43<10:34,  1.77s/img, loss (batch)=6.7e-9]

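This is probably caused by the instability of stochastic gradient descent: with a batch size of 1, each update and each reported loss is based on a single sample.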
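
As a rough illustration of why the reported loss is noisier with a batch size of 1, the sketch below compares how much the per-batch mean loss varies for two batch sizes. The per-sample losses are random numbers invented for the example, not values from the run above, and the helper batch_loss_std is hypothetical.

```python
import torch

torch.manual_seed(0)

# Made-up per-sample losses for 800 training samples.
per_sample_loss = torch.rand(800)

def batch_loss_std(losses, batch_size):
    """Standard deviation of the per-batch mean loss for a given batch size."""
    usable = (len(losses) // batch_size) * batch_size
    batch_means = losses[:usable].view(-1, batch_size).mean(dim=1)
    return batch_means.std().item()

std_b1 = batch_loss_std(per_sample_loss, 1)
std_b8 = batch_loss_std(per_sample_loss, 8)
print(f"batch size 1: {std_b1:.3f}, batch size 8: {std_b8:.3f}")
```

Averaging over more samples per batch smooths the reported value. That explains noise in the progress-bar loss, but it does not by itself explain the large swings in the validation cross entropy.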
