Deep Learning Debug Notes

澪mio

Last modified: 2024-04-20 00:10:36

First published: 2023-08-31 17:34:07

Article link: https://blog.csdn.net/qq_52358603/article/details/132608141

1. NumPy error when reading .npy data

ValueError: Object arrays cannot be loaded when allow_pickle=False

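Cause:
A NumPy version issue. Starting with version 1.16.3, allow_pickle defaults to False.
Solution:
1. Downgrade NumPy.
2. Set allow_pickle to True where numpy.load() is called: np.load(src, allow_pickle=True). Only do this for files from a trusted source, since unpickling can execute arbitrary code.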
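
The second option can be reproduced with a small sketch. The file name demo.npy and the dictionary contents are made up for illustration; the point is only that an object array fails to load by default and loads once allow_pickle=True is passed.

```python
import os
import tempfile

import numpy as np

# An object array (here a dict wrapped in a 0-d array) is stored via pickle.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")  # hypothetical file
np.save(path, np.array({"label": 1}, dtype=object), allow_pickle=True)

try:
    np.load(path)  # allow_pickle defaults to False on recent NumPy
except ValueError as err:
    print("default load failed:", err)

# Explicitly opting in makes the load succeed; only do this for trusted files.
data = np.load(path, allow_pickle=True)
print(data.item())
```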

2. Error when running unzip

Archive:  GoPro_large9G.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.

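Cause (based on other blog posts):
The archive extracted normally on a Windows machine, so the source file itself is fine. A Baidu search turned up this explanation: "On Linux, zip files are usually extracted with the system's default 'Extract Here' (which uses unzip by default).
If the .zip archive is larger than 2 GB, unzip cannot be used, because on a 32-bit machine the C library's long type can only represent file offsets up to 2 GB."
The exact cause is unclear; it is also possible that the archive was corrupted.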
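
To tell a corrupted archive apart from a tool limitation, one option is to check the file with Python's standard zipfile module, which supports ZIP64 archives. This is a minimal sketch, not the method used in the post; check_zip is a hypothetical helper name.

```python
import zipfile


def check_zip(path):
    """Return 'ok', 'not a zip ...', or 'corrupt member: <name>' for the file at path."""
    if not zipfile.is_zipfile(path):
        # No end-of-central-directory record: wrong format or a truncated file.
        return "not a zip (or truncated)"
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # reads every member and verifies its CRC
    return "ok" if bad is None else f"corrupt member: {bad}"
```

For example, check_zip("GoPro_large9G.zip") reporting "not a zip (or truncated)" would point to an incomplete download or copy rather than to unzip itself. Note that testzip() reads the whole archive, so it takes a while on a multi-gigabyte file.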

3. PyTorch version issue

anaconda3/envs/pytorchEnv/lib/python3.7/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

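Solution:
Open the functional.py file named in the warning, go to the return statement it points to (line 504 in the author's copy), and add indexing='ij':

return _VF.meshgrid(tensors, **kwargs, indexing='ij')  # type: ignore[attr-defined]

Note that this patches the installed package: the change is lost when PyTorch is reinstalled, and the line raises a TypeError if a caller already passes indexing. Passing indexing='ij' in your own torch.meshgrid call silences the warning without touching site-packages.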
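
An alternative that leaves the installed package alone is to pass the indexing argument at the call site. A minimal sketch, assuming a PyTorch version recent enough to accept indexing (the warning itself implies it does); the tensor names are illustrative.

```python
import torch

ys = torch.arange(3)
xs = torch.arange(4)

# 'ij' keeps the historical default behaviour: outputs have shape (len(ys), len(xs)).
grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
print(grid_y.shape, grid_x.shape)
```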

4. Error when loading model weights

RuntimeError: Error(s) in loading state_dict for ResNet:
	Unexpected key(s) in state_dict: "module.conv1.weight", "module.bn1.weight", "module.bn1.bias", "module.bn1.running_mean", "module.bn1.running_var", "module.conv2.weight", "module.bn2.weight", "module.bn2.bias", "module.bn2.running_mean", "module.bn2.running_var", "module.conv3.weight", "module.bn3.weight", "module.bn3.bias", "module.bn3.running_mean", "module.bn3.running_var", "module.layer1.0.conv1.weight", "module.layer1.0.bn1.weight", "module.layer1.0.bn1.bias", "module.layer1.0.bn1.running_mean", "module.layer1.0.bn1.running_var", "module.layer1.0.conv2.weight", "module.layer1.0.bn2.weight", "module.layer1.0.bn2.bias",

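Solution:
This is a problem with the weight file: every key carries a module. prefix, which appears when the checkpoint was saved from a model wrapped in nn.DataParallel or DistributedDataParallel. Strip the prefix from the keys before calling load_state_dict, or load the weights into a model wrapped the same way.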
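
One common way to handle the module. prefix seen in the error is to rename the keys before loading. The helper below is a sketch; strip_module_prefix is a hypothetical name, and it works on any mapping, so it can be tried without PyTorch.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Assumed usage (ckpt_path and model are placeholders, not from the post):
# state = torch.load(ckpt_path, map_location="cpu")
# model.load_state_dict(strip_module_prefix(state))
```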

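5. Expanding dimensions at test time

During training, inputs usually have shape (batch_size, c, h, w), but at test time a single image of shape (c, h, w) is fed in, so a batch dimension has to be added.
Expanding dimensions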

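import cv2
import torch

img_path = "test.jpg"  # placeholder path; replace with your own image

image = cv2.imread(img_path)     # NumPy array of shape (h, w, c)
# image = torch.tensor(image)    # alternative that copies the data
image = torch.from_numpy(image)  # shares memory with the NumPy array
print(image.size())

img = image.unsqueeze(dim=0)     # add a batch dimension at position 0
print(img.size())

img = img.squeeze(dim=0)         # remove the batch dimension again
print(img.size())

# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])
# Note: cv2 returns (h, w, c); use image.permute(2, 0, 1) to get (c, h, w).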
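
Since cv2.imread returns arrays in (h, w, c) order, a single image usually needs a channel reorder as well as a batch dimension before it matches (batch_size, c, h, w). The sketch below uses a zero tensor as a stand-in for a loaded image; the 32x48 size is arbitrary, and any normalization the model expects is left out.

```python
import torch

# Stand-in for torch.from_numpy(cv2.imread(...)): an (h, w, c) uint8 image.
image = torch.zeros(32, 48, 3, dtype=torch.uint8)

chw = image.permute(2, 0, 1)      # (h, w, c) -> (c, h, w)
batch = chw.unsqueeze(0).float()  # (c, h, w) -> (1, c, h, w), as float32
print(batch.shape)
```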

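Reducing dimensions
Dimension squeezing: torch.squeeze removes every dimension of size 1 from a tensor, which reduces its number of dimensions. If the input has shape $(A \times 1 \times B \times C \times 1 \times D)$, the output has shape $(A \times B \times C \times D)$. If the dim argument is given, only that dimension is processed.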

>>> x = torch.zeros(2, 1, 2, 1, 2)
>>> x.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x)
>>> y.size()
torch.Size([2, 2, 2])

>>> y = torch.squeeze(x, 0)
>>> y.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x, 1)
>>> y.size()
torch.Size([2, 2, 1, 2])

6. Port conflict

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

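Cause:
A previous DDP run was stopped midway without releasing its port and GPU memory. The next DDP run again uses the default DDP port, 29500, so the two conflict.
Solution:
1. Free the GPU memory manually: run kill -9 pid on the processes still holding GPU memory and close all terminals open on this server. This releases the GPU memory and port held by the previous DDP run.
2. Add --master_port=xxxxx to the DDP launch command (before xx.py) to pick a different port. The GPU memory held by the previous DDP run still needs to be released, otherwise the new run may run out of memory.
3. Kill one of the related processes listed by nvidia-smi. This forces DDP to stop, and DDP then releases its port and GPU resources. Pressing Ctrl+C on the command line also interrupts the program. Ctrl+Z only suspends the program rather than terminating it, so the port is not released; in that case either change the port DDP uses or kill the suspended job.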
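
When choosing a value for --master_port, it can help to check whether 29500 is still taken and to ask the operating system for an unused port. This is a sketch using only the standard library; find_free_port and port_in_use are hypothetical helper names.

```python
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("29500 in use:", port_in_use(29500))
    print("free port:", find_free_port())
```

The port returned by find_free_port() could in principle be claimed by another process before DDP binds to it, so treat it as a good candidate rather than a guarantee.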

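7. CUDA version issues

Installing pip packages can fail because of the CUDA version. Often the system CUDA version does not meet the requirement, reinstalling CUDA is too much trouble, and root access is usually unavailable. In that case, use the CUDA version that conda installs itself.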

conda create -n your_env_name python=3.10.13 # 1. Create the conda environment
conda activate your_env_name
conda install cudatoolkit==11.8 -c nvidia    # 2. Install the required CUDA version
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118 # 3. Install a CUDA-enabled PyTorch
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc # 4. Install cuda-nvcc
# Step 4 is the easiest to miss, and few blog posts mention it. In practice, without cuda-nvcc the system's own CUDA gets used.
conda install packaging
pip install causal-conv1d==1.1.1
pip install mamba-ssm
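
To confirm that step 4 took effect, one option is to check which nvcc the active environment resolves to. The sketch below assumes the environment was activated with conda activate, which sets the CONDA_PREFIX variable; nvcc_location is a hypothetical helper name.

```python
import os
import shutil


def nvcc_location():
    """Return (path, inside_conda_env) for the nvcc found on PATH, or (None, False)."""
    path = shutil.which("nvcc")
    if path is None:
        return None, False
    prefix = os.environ.get("CONDA_PREFIX")
    if not prefix:
        return path, False
    # Compare resolved paths so symlinks do not hide the real location.
    env_root = os.path.realpath(prefix) + os.sep
    return path, os.path.realpath(path).startswith(env_root)


if __name__ == "__main__":
    print(nvcc_location())
```

If the second value is False while a conda environment is active, the build is probably picking up the system nvcc rather than the one installed by conda.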