[Continuously updated] A collection of bugs encountered while training my own models

Article link: https://blog.csdn.net/hu_yinghui/article/details/127486188

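This post collects common problems encountered in deep learning development, including runtime errors, module initialization errors, function call errors, and model loading problems. Each entry gives the fix that worked, such as type conversion, calling the parent constructor, adjusting parameters, or modifying the network structure, so that developers can locate and fix these problems quickly.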


Problem 1:

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'target'

Solution:
CrossEntropyLoss expects class-index targets of integer type (torch.long), so replace label with label.long():

Loss = torch.nn.CrossEntropyLoss()
loss = Loss(out, label)         # before
loss = Loss(out, label.long())  # after
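
A minimal sketch of this fix. The logits and the float labels below are made up for illustration; only CrossEntropyLoss and .long() come from the post.

```python
import torch

criterion = torch.nn.CrossEntropyLoss()

# Hypothetical batch: 4 samples, 3 classes.
out = torch.randn(4, 3)
# Labels that arrived as floats, e.g. converted from a float array.
label = torch.tensor([0.0, 2.0, 1.0, 2.0])

# Class-index targets must be int64 (torch.long).
loss = criterion(out, label.long())
print(loss.item())
```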

Problem 2:

AttributeError: cannot assign module before Module.__init__() call
Solution:
Call the parent class constructor at the start of __init__, so that nn.Module is initialized before any submodule is assigned:

super(ResNet, self).__init__()
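
The following sketch reproduces the error and the fix side by side. BrokenNet and the single conv1 layer are hypothetical stand-ins, not the ResNet from the original project.

```python
import torch.nn as nn

class BrokenNet(nn.Module):
    def __init__(self):
        # super().__init__() is missing, so assigning a submodule fails.
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3)

class ResNet(nn.Module):
    def __init__(self):
        super(ResNet, self).__init__()  # initialize nn.Module first
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3)

try:
    BrokenNet()
    broken_error = None
except AttributeError as err:
    broken_error = str(err)

model = ResNet()
print(broken_error)
```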

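Problem 3:

raise NotImplementedError
NotImplementedError
Solution:
The method name forward was misspelled in its definition (written as def forword()), so the model had no forward method. Calling the model then fell back to the default implementation in nn.Module, which raises NotImplementedError. Correct the spelling to def forward().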
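
A small reproduction, assuming two toy modules (TypoNet and FixedNet are hypothetical names) that differ only in the spelling of the method:

```python
import torch
import torch.nn as nn

class TypoNet(nn.Module):
    def forword(self, x):  # misspelled: nn.Module never calls this
        return x * 2

class FixedNet(nn.Module):
    def forward(self, x):  # correct name: called by model(x)
        return x * 2

x = torch.ones(2)

try:
    TypoNet()(x)
    typo_raised = False
except NotImplementedError:
    typo_raised = True

y = FixedNet()(x)
print(typo_raised, y)
```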

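Problem 4:

RuntimeError: Error(s) in loading state_dict for ResNet:
Missing key(s) in state_dict: "conv1.weight", "bn1.weight", "bn1.bias", "bn1.running_mean", "bn1.running_var",
Unexpected key(s) in state_dict: "model.conv1.weight", "model.bn1.weight", "model.bn1.bias", "model.bn1.running_mean", "model.bn1.running_var", "model.bn1.num_batches_tracked",
Solution:
load_state_dict has an important parameter, strict, which defaults to True. It requires the keys of the pretrained state_dict to match the keys of your own network exactly (that is, the layer names). After the network structure has been modified, loading with strict=True therefore raises this error. Set strict to False. Note that with strict=False, keys that do not match are skipped silently, so the corresponding layers keep their initial weights.

load_state_dict(state_dict, strict=False)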
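
In the error above, every unexpected key is a missing key with a "model." prefix. With keys like that, strict=False loads nothing at all. The sketch below shows an alternative that is not from the original post: renaming the keys before loading. Net and Wrapper are hypothetical toy classes that reproduce the prefix.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.bn1 = nn.BatchNorm2d(8)

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Net()  # produces keys such as "model.conv1.weight"

saved = Wrapper().state_dict()
net = Net()

# strict=False does not raise, but reports what was skipped.
result = net.load_state_dict(saved, strict=False)
print(result.missing_keys)

# Strip the prefix so the keys match, then load strictly.
prefix = "model."
stripped = {
    (k[len(prefix):] if k.startswith(prefix) else k): v
    for k, v in saved.items()
}
net.load_state_dict(stripped)
```

Checking missing_keys and unexpected_keys on the returned value is a cheap way to confirm that the weights you care about were actually loaded.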

Problem 5:

TypeError: ConvBNRelu() got an unexpected keyword argument 'padding'
Solution:
ConvBNRelu was defined without a padding parameter. Add padding to its definition.
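
The original ConvBNRelu is not shown in the post, so the definition below is a hypothetical reconstruction. It only illustrates adding padding to the signature and forwarding it to nn.Conv2d.

```python
import torch
import torch.nn as nn

def ConvBNRelu(in_channels, out_channels, kernel_size, stride=1, padding=0):
    # Hypothetical helper: conv -> batch norm -> ReLU, with padding exposed.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size,
                  stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

block = ConvBNRelu(3, 16, kernel_size=3, padding=1)
out = block(torch.randn(2, 3, 8, 8))
print(out.shape)
```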

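Problem 6:

TypeError: unsupported operand type(s) for //: 'tuple' and 'int'
RuntimeError: The size of tensor a (320) must match the size of tensor b (192) at non-singleton dimension 1
Solution:
Adjust the channel counts of tensors a and b so that they are the same (dimension 1 is the channel dimension). The TypeError means a tuple was used as the left operand of a floor division where an integer was expected.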
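
A sketch of the size mismatch, using the 320 and 192 channel counts from the error message. The 1x1 convolution used to align the channels is one possible fix chosen for illustration; the post does not say how the original network was changed.

```python
import torch
import torch.nn as nn

a = torch.randn(1, 320, 4, 4)
b = torch.randn(1, 192, 4, 4)

try:
    a + b  # channel dimension differs: 320 vs 192
    mismatch = False
except RuntimeError:
    mismatch = True

# Assumed fix: project b to 320 channels before adding.
proj = nn.Conv2d(192, 320, kernel_size=1)
out = a + proj(b)
print(mismatch, out.shape)
```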

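Problem 7:

RuntimeError: running_mean should contain 64 elements not 128
Solution:
Make the number of features of the BN layer equal to the number of output channels of the Conv layer before it.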
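
One way to keep the two in sync is to derive the BN size from the conv layer instead of typing the number twice. A sketch with made-up layer sizes:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 128, kernel_size=3, padding=1)
# num_features is taken from the conv layer, so the two cannot drift apart.
bn = nn.BatchNorm2d(conv.out_channels)

y = bn(conv(torch.randn(2, 3, 8, 8)))
print(y.shape)
```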

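Problem 8:

RuntimeError: size mismatch, m1: [128 x 7168], m2: [1792 x 10] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:273
Solution:
The kernel_size of avgpool is too small. The flattened feature has 7168 = 1792 x 4 values per sample, while the fully connected layer expects 1792, so the feature map still has 4 spatial positions (for example 2 x 2) after pooling. Increase kernel_size so that the pooled output is 1 x 1.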
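
The arithmetic can be checked with a small sketch. The 2 x 2 feature map is an assumption that matches 7168 = 1792 x 4, and nn.AdaptiveAvgPool2d is shown as an optional alternative that the post does not mention.

```python
import torch
import torch.nn as nn

features = torch.randn(2, 1792, 2, 2)  # assumed shape before avgpool
fc = nn.Linear(1792, 10)

# Kernel too small: the 2 x 2 map survives and flattens to 7168 values.
too_small = torch.flatten(nn.AvgPool2d(kernel_size=1)(features), 1)

# Kernel covers the whole map: 1 x 1 output, flattens to 1792 values.
pooled = torch.flatten(nn.AvgPool2d(kernel_size=2)(features), 1)
logits = fc(pooled)

# Alternative: pool to 1 x 1 regardless of the input size.
adaptive = torch.flatten(nn.AdaptiveAvgPool2d(1)(features), 1)
print(too_small.shape, pooled.shape, logits.shape)
```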

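Problem 9:

RuntimeError: Given groups=1, weight of size 2048 1858 1 1, expected input[128, 704, 2, 2] to have 1858 channels, but got 704 channels instead
Solution:
The channel count is wrong: the conv layer expects 1858 input channels but receives 704. Recompute the channel counts of the preceding layers, correct them, and verify.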
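
For the verification step, printing the output shape after each layer makes the first wrong channel count easy to spot. A sketch on a made-up three-layer stack:

```python
import torch
import torch.nn as nn

# Hypothetical layers, only to demonstrate the shape trace.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

x = torch.randn(1, 3, 8, 8)
shapes = []
for layer in layers:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(type(layer).__name__, tuple(x.shape))
```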


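Problem 10:

RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.93 GiB total capacity; 6.76 GiB already allocated; 52.19 MiB free; 113.13 MiB cached)
Solution:
Check GPU memory usage: nvidia-smi
List the processes holding the GPU devices (stale ones can be killed to release their memory): sudo fuser /dev/nvidia*
Neither of these helped (×).
The model being loaded is probably too large for the available GPU memory.
Reduce batch_size.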

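Problem 11:

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
TracerWarning: Trace had nondeterministic nodes. Did you forget call .eval() on your model? Nodes: %out : Float(1, 1024, 1, 1) = aten::dropout(%input.29, %1001, %1002), scope: InceptionNet # /mnt/usr/local/anaconda3/envs/torch-python3.7/lib/python3.7/site-packages/torch/nn/functional.py:806:0
pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel()
Solution:
The number of classes in the model does not match the number of classes in the dataset, so some labels fall outside the range [0, num_classes - 1]. Set the output size of the final layer to the number of classes in the dataset. The TracerWarning is a separate message: as its text suggests, call .eval() on the model before tracing so that dropout is disabled.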
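
A quick check on the CPU can confirm this before training on the GPU, where the failure shows up as a less readable assertion. The values of num_classes and labels are made up for illustration.

```python
import torch

num_classes = 10                       # assumed output size of the model
labels = torch.tensor([0, 3, 9, 10])   # hypothetical dataset labels

# Valid class indices are 0 .. num_classes - 1.
out_of_range = (labels < 0) | (labels >= num_classes)
print(labels[out_of_range].tolist())
```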

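Problem 12:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x90 in position 22: illegal multibyte sequence

Solution:
The first line of the script did not declare the source encoding (this declaration only affects how the .py file itself is decoded):

# -*- coding: utf-8 -*-

Also pass the encoding explicitly to open(), which otherwise uses the system default (gbk here):

encoding='UTF-8'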
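
A self-contained sketch of reading a UTF-8 file with an explicit encoding. The temporary file and its content are made up so that the example can run anywhere.

```python
import os
import tempfile

text = "训练日志: epoch 1"
path = os.path.join(tempfile.mkdtemp(), "log.txt")

# Write and read with the same explicit encoding, so the result
# does not depend on the system default codec.
with open(path, "w", encoding="UTF-8") as f:
    f.write(text)

with open(path, "r", encoding="UTF-8") as f:
    content = f.read()

print(content)
```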

Problem 13:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte
Solution:
The file is not valid UTF-8. Pass a different encoding to open():

encoding='ISO-8859-1'
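
The sketch below reproduces the error with a small file written in ISO-8859-1 (the file and its content are hypothetical). Keep in mind that ISO-8859-1 maps every byte to a character, so it never raises; if the file was really written in another encoding such as GBK, the text will load but look garbled.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "wb") as f:
    f.write("café".encode("iso-8859-1"))  # the byte 0xE9 is not valid UTF-8 here

try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

with open(path, encoding="ISO-8859-1") as f:
    content = f.read()

print(utf8_failed, content)
```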

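Problem 14:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
One of the classes in y has only one sample. scikit-learn raises this error during a stratified split, which needs at least two samples per class.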
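
The post gives no fix for this one. One option, shown here only as a sketch on made-up data, is to find the classes with fewer than two samples and either collect more data for them or drop them before the stratified split:

```python
from collections import Counter

# Hypothetical dataset: "bird" has a single sample.
X = [[0], [1], [2], [3], [4]]
y = ["cat", "cat", "dog", "dog", "bird"]

counts = Counter(y)
rare_classes = sorted(c for c, n in counts.items() if n < 2)

# Keep only samples whose class has at least two members.
keep = [i for i, label in enumerate(y) if counts[label] >= 2]
X_kept = [X[i] for i in keep]
y_kept = [y[i] for i in keep]

print(rare_classes, y_kept)
```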