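While training a model with PyTorch, the loss kept oscillating around the same large value, so something was clearly wrong with the model. After a long search, I found that PyTorch had been reporting the problem all along and I had simply overlooked it.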
/home/anaconda3/envs/M/lib/python3.6/site-packages/torch/nn/modules/loss.py:528: UserWarning: Using a target size (torch.Size([256])) that is different to the input size (torch.Size([256, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
/home/anaconda3/envs/M/lib/python3.6/site-packages/torch/nn/modules/loss.py:528: UserWarning: Using a target size (torch.Size([186])) that is different to the input size (torch.Size([186, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
epoch = 0 loss = 35.4517
epoch = 50 loss = 11.7757
epoch = 100 loss = 11.7866
epoch = 150 loss = 11.9761
epoch = 200 loss = 11.6557
epoch = 250 loss = 11.6361
epoch = 300 loss = 11.7160
epoch = 350 loss = 11.6047
epoch = 400 loss = 11.6412
epoch = 450 loss = 11.6390
epoch = 500 loss = 11.7137
epoch = 550 loss = 11.6388
epoch = 600 loss = 11.7710
epoch = 650 loss = 11.6640
epoch = 700 loss = 11.6348
epoch = 750 loss = 11.7574
epoch = 800 loss = 11.5811
epoch = 850 loss = 11.7239
epoch = 900 loss = 11.8139
epoch = 950 loss = 11.7817
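As the log above shows, PyTorch warns that the two arguments passed to the loss (input, target) have different shapes: target is one-dimensional, while input is two-dimensional with an extra trailing dimension of size 1. The element counts match, but PyTorch does not reshape or transpose either tensor; it only applies broadcasting. Broadcasting [256, 1] against [256] expands both to [256, 256], so every prediction is compared with every target in the batch rather than only with its own. Minimizing that loss pushes every prediction toward the mean of the targets, which explains why the predictions all fall within a very narrow range, as shown below: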
Predictions: [2.542457342147827, 2.467282295227051, 2.467350721359253, 2.4719324111938477, 2.4960854053497314, 2.5416831970214844, 2.5303843021392822]
Ground truth: [1.9479999542236328, 2.5869998931884766, 2.2699999809265137, 2.38100004196167, 1.4229999780654907, 1.8990000486373901, 1.4620000123977661]
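The broadcasting effect described above can be reproduced on a tiny example. This is only a sketch: the three-element tensors and the helper `broadcast_mse` are made up for illustration, and the helper computes the mean squared difference by hand so the broadcast is explicit. It shows that, under the broadcast loss, a constant prediction equal to the target mean scores better than a perfect prediction.

```python
import torch


def broadcast_mse(pred, target):
    # Hypothetical helper: mean squared difference computed by hand.
    # With pred of shape (N, 1) and target of shape (N,), the
    # subtraction broadcasts to an (N, N) matrix of all pairs.
    return ((pred - target) ** 2).mean()


target = torch.tensor([1.0, 2.0, 3.0])                    # shape (3,)
perfect = target.unsqueeze(1)                             # shape (3, 1), equals the targets
constant = torch.full((3, 1), target.mean().item())       # every prediction = mean(target)

diff_shape = (perfect - target).shape
print(diff_shape)                       # torch.Size([3, 3]): every prediction vs. every target

loss_perfect = broadcast_mse(perfect, target)
loss_constant = broadcast_mse(constant, target)
print(loss_perfect.item(), loss_constant.item())

# With matching shapes the perfect prediction gets zero loss, as expected.
loss_fixed = broadcast_mse(perfect.squeeze(1), target)
print(loss_fixed.item())
```

Because the constant prediction gets the lower broadcast loss, an optimizer minimizing it has no incentive to fit individual samples.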
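Solution:
Use the tensor squeeze operation (removes a dimension) or the unsqueeze operation (adds a dimension) so that input and target have the same shape. After this change the loss behaves normally and the warning no longer appears, as shown below: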
Starting training with saved checkpoints
epoch = 0 loss = 29.6703
epoch = 50 loss = 1.3070
epoch = 100 loss = 0.3954
epoch = 150 loss = 0.2571
epoch = 200 loss = 0.2827
epoch = 250 loss = 0.2776
epoch = 300 loss = 0.1670
epoch = 350 loss = 0.1842
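A minimal sketch of the fix, assuming the model output has shape [256, 1] and the target has shape [256] as in the warning. The random tensors are stand-ins for the real model output and labels, which the post does not show.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

output = torch.randn(256, 1)   # stand-in for the model output, shape (256, 1)
target = torch.randn(256)      # stand-in for the labels, shape (256,)

# Option 1: drop the trailing dimension of the output.
# squeeze(1) only touches dim 1, so a batch of size 1 keeps its batch dim.
loss_a = criterion(output.squeeze(1), target)

# Option 2: add a trailing dimension to the target.
loss_b = criterion(output, target.unsqueeze(1))

print(output.squeeze(1).shape, target.unsqueeze(1).shape)
print(loss_a.item(), loss_b.item())   # the two options give the same value
```

Either option works; passing an explicit dimension to squeeze is the safer habit, since a bare squeeze() would also remove the batch dimension when the last batch happens to contain a single sample.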
Second case: the loss stays high.
[INFO] EPOCH: 1/200 --> Val loss: 1.590242, Val accuracy: 0.2767
[INFO] EPOCH: 2/200 --> Val loss: 1.589383, Val accuracy: 0.2767
[INFO] EPOCH: 3/200 --> Val loss: 1.592154, Val accuracy: 0.2767
[INFO] EPOCH: 4/200 --> Val loss: 1.590423, Val accuracy: 0.2767
After checking, the cause turned out to be the weight_decay setting, which is passed to the optimizer rather than to the loss function:
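# Loss & Optimizer (fragment; assumes `import torch.nn as nn` and `import torch.optim as optim`)
criterion = nn.NLLLoss()
# With weight_decay=0.001 the loss stays flat
optimizer = optim.Adam(self.model.parameters(), lr=self.init_lr, weight_decay=0.001)
# Without weight_decay the loss decreases normally
# optimizer = optim.Adam(self.model.parameters(), lr=self.init_lr)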
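Without weight_decay, the loss decreases normally; with weight_decay set, the loss stays flat. In this case, reduce the value of weight_decay. Note that weight_decay does not change the learning rate: in optim.Adam it acts as an L2 penalty, adding weight_decay times each parameter to that parameter's gradient. If the penalty is too strong relative to the data loss, the weights are pulled toward zero, the model underfits, and the loss stops changing.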
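The following toy sketch illustrates what weight_decay does in optim.Adam. It is not the author's model: the two-element parameter `w` and the zero-gradient "data loss" are assumptions chosen so that the only force acting on the weights is the decay term. The learning rate stored in the optimizer stays unchanged, while the weights shrink toward zero.

```python
import torch
import torch.optim as optim

# Hypothetical toy parameter, not taken from the original model.
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
initial_abs = w.detach().abs().clone()

opt = optim.Adam([w], lr=0.01, weight_decay=0.001)

for _ in range(10):
    opt.zero_grad()
    loss = (w * 0.0).sum()   # data loss whose gradient w.r.t. w is zero
    loss.backward()
    opt.step()               # the only gradient seen by Adam is weight_decay * w

final_abs = w.detach().abs()
lr_after = opt.param_groups[0]["lr"]

print(lr_after)                  # learning rate is untouched by weight_decay
print(initial_abs, final_abs)    # magnitudes have moved toward zero
```

In a real model the data gradient competes with this pull, so the practical remedy is the one given above: lower weight_decay until the loss starts decreasing again.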