A small bug I ran into while changing some code last night. Writing it down here so I don't make the same mistake again.
The error
Traceback (most recent call last):
  File "/home/YiwenXia/trainstage2.py", line 116, in <module>
    train(args)
  File "/home/YiwenXia/trainstage2.py", line 59, in train
    loss = loss_criterion(pred, target)
  File "/home/xindima/anaconda3/envs/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xindima/anaconda3/envs/python3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/xindima/anaconda3/envs/python3.10/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: 0D or 1D target tensor expected, multi-target not supported
Searching online turned up the following:
- CrossEntropyLoss expects the target tensor to be 0D or 1D, but the target tensor being passed in is probably 2D or higher.
- This usually happens when the targets (the labels) are supplied in one-hot encoded form, whereas CrossEntropyLoss expects class indices.
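To make the requirement concrete, here is a minimal sketch (the shapes and values are illustrative, not taken from the project):
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)                      # (batch, num_classes)

# Correct: a 1D tensor of class indices (dtype long), one index per sample
target_ok = torch.tensor([3, 1, 0, 7])
loss = criterion(logits, target_ok)              # works

# Incorrect: an integer one-hot matrix (2D) instead of class indices
target_onehot = torch.nn.functional.one_hot(target_ok, num_classes=10)
# criterion(logits, target_onehot) raises:
# RuntimeError: 0D or 1D target tensor expected, multi-target not supported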
The code where the problem occurred:
for epoch in range(1, args.max_epoch + 1):
    model.train()
    total_loss = 0
    for batch, (data, target, *_) in enumerate(train_data):
        data, target = data.to(DEVICE), target.to(DEVICE)
        pred = model(data)
        loss = loss_criterion(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Print the average loss for each epoch
    print("epoch", epoch, "loss:", total_loss / len(train_dataset) * args.batch_size)
To fix this, we first need to make sure target is a 1D tensor in which every element is a class index, not a one-hot encoding.
Check the output of loaddataset.PreDataset to confirm that the labels it returns are class indices.
The loaddataset code is as follows:
# loaddataset.py
from torchvision.datasets import CIFAR10
from PIL import Image
import torch

class PreDataset(CIFAR10):
    def __getitem__(self, item):
        img, target = self.data[item], self.targets[item]
        img = Image.fromarray(img)
        if self.transform is not None:
            imgL = self.transform(img)
            imgR = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        assert imgL is not None and imgR is not None, "One of the images is None"
        return imgL, imgR, target

if __name__ == "__main__":
    import config
    train_data = PreDataset(root='/home/YiwenXia/dataset', train=True, transform=config.train_transform, download=True)
    print(train_data[0])
The output it returns looks like this:
[ 2.3677e-01, -1.7623e-01, -6.0891e-01, ..., -3.9257e-01,
-5.1057e-01, -6.0891e-01],
[ 3.1544e-01, -9.7567e-02, -6.4824e-01, ..., -3.1390e-01,
-3.9257e-01, -5.1057e-01],
[ 2.1710e-01, 7.6703e-04, -4.3190e-01, ..., -3.3357e-01,
-4.1224e-01, -4.7124e-01]],
[[-1.1428e-01, -2.1183e-01, -3.4841e-01, ..., -1.1288e+00,
-1.2849e+00, -1.4020e+00],
[-8.3616e-01, -9.1420e-01, -1.0313e+00, ..., -1.4800e+00,
-1.6946e+00, -1.8507e+00],
[-1.1873e+00, -1.2459e+00, -1.3044e+00, ..., -1.4995e+00,
-1.7531e+00, -1.9287e+00],
...,
[-1.5971e+00, -1.5971e+00, -1.5971e+00, ..., -9.7273e-01,
-1.1093e+00, -1.2849e+00],
[-1.5190e+00, -1.5776e+00, -1.6361e+00, ..., -9.1420e-01,
-9.7273e-01, -1.1678e+00],
[-1.4605e+00, -1.3629e+00, -1.4020e+00, ..., -9.9224e-01,
-1.0118e+00, -1.0703e+00]]]), 6)
We can see it returns a tuple: the first element is the data (presumably an image), and the last element printed is the integer 6, which should be the class label. So the label is already a class index, not a one-hot encoding.
The error is therefore probably not caused by the label format, and we need to look elsewhere.
Check the shape and values of pred to make sure the model output is correct:
pred = model(data)
print(pred.shape, pred)  # print the model output's shape and values
loss = loss_criterion(pred, target)
Which gives:
pred's shape is torch.Size([200, 10]): tensor([[-0.2345, -0.2942, -0.9322, ..., -0.4986, -0.0681, -1.6638],
[-0.1042, 0.1533, -0.6155, ..., 0.1431, -0.2920, -1.6294],
[-0.3651, 0.7063, -0.2860, ..., 0.3435, -0.6246, -0.7833],
...,
[-0.0724, 0.5727, 0.0755, ..., -0.8571, 0.1019, -0.4271],
[ 0.3314, -0.7106, -1.1914, ..., 0.3072, -0.5874, -1.4873],
[ 0.0365, 0.1306, -0.2654, ..., 0.2176, -0.7125, -0.8015]],
device='cuda:0', grad_fn=<AddmmBackward0>)
target's shape is torch.Size([200, 3, 32, 32]): tensor([[[[ 0.6144, 0.6338, 0.6919, ..., 1.4479, 1.4479, 1.4673],
[ 0.6338, 0.6531, 0.6919, ..., 1.4673, 1.4673, 1.4867],
[ 0.6725, 0.6919, 0.7113, ..., 1.4673, 1.4673, 1.5255],
...,
[-1.0721, -1.1109, -1.1303, ..., -0.7232, -0.8395, -0.9558],
[-0.9752, -0.9752, -0.9364, ..., -0.8201, -0.9752, -1.2078],
[-0.8783, -0.8977, -0.8977, ..., -0.8589, -1.0334, -1.2272]],
Now we can see where the problem is.
pred has shape (200, 10), which is correct: the batch size is 200 and the model predicts 10 classes.
However, target has shape (200, 3, 32, 32), i.e. a batch of 200 images with 3 channels and a height and width of 32. That is not label data at all.
We need to make sure the train_data DataLoader yields the right data and labels. From the earlier loaddataset output we know the dataset does return a label (the integer 6), but the training loop does not extract that label correctly: PreDataset returns (imgL, imgR, target), so the pattern (data, target, *_) binds target to the second image, imgR, and leaves the real label in *_.
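A tiny sketch of the unpacking behaviour (the values are made up) shows why the label ends up in *_:
# PreDataset.__getitem__ returns (imgL, imgR, target); the DataLoader keeps that order.
sample = ("imgL", "imgR", 6)

# Buggy unpacking: `target` grabs the second element (imgR); the real label goes into `_`
data, target, *_ = sample
print(target)   # "imgR"  -> in training this is a batch of (3, 32, 32) images

# Intended unpacking: skip imgR and take the third element as the label
data, _, target = sample
print(target)   # 6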
The fix:
Change this:
for batch, (data, target, *_) in enumerate(train_data):
    data, target = data.to(DEVICE), target.to(DEVICE)
    ...
to this:
for batch, (data, _, target) in enumerate(train_data):
    data, target = data.to(DEVICE), target.to(DEVICE)
    ...
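A cheap sanity check on the first batch would have caught this right away. A minimal sketch (the assert is my addition, not part of the original training script):
import torch

data, _, target = next(iter(train_data))
# Class-index targets for CrossEntropyLoss should be 1D and integer-typed
assert target.ndim == 1 and target.dtype == torch.long, \
    f"unexpected target: shape={tuple(target.shape)}, dtype={target.dtype}"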
Sometimes a tiny mistake like this costs a lot of time...