Deep Learning Debugging Notes

1. Error when reading .npy data with numpy

ValueError: Object arrays cannot be loaded when allow_pickle=False

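Cause:
A numpy version issue: starting with version 1.16.3, allow_pickle defaults to False.
Solution:
1. Downgrade numpy.
2. Set allow_pickle to True where numpy.load() is called: np.load(src, allow_pickle=True). Unpickling can execute arbitrary code, so do this only for files you trust.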
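
The error and the second fix can be reproduced in a few lines. This is a minimal sketch that assumes NumPy 1.16.3 or newer; the file name demo.npy and the sample dictionary are made up for illustration.

```python
import os
import tempfile

import numpy as np

# An object array (here a dict wrapped in a 0-d array) can only be saved via pickle.
data = np.array({"label": 1, "name": "sample"}, dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")  # hypothetical file name
    np.save(path, data)

    try:
        np.load(path)  # allow_pickle is False by default
        default_load_failed = False
    except ValueError:
        default_load_failed = True

    # Opting in explicitly loads the object array; .item() unwraps the dict.
    loaded = np.load(path, allow_pickle=True).item()

print(default_load_failed, loaded)
```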

2. Error when running unzip

Archive:  GoPro_large9G.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.

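Cause (based on other blog posts):
The archive extracts normally on a Windows host, so the source file itself is intact. One explanation found online: "On Linux, zip files are usually extracted with the system's default 'Extract Here' action, which uses unzip.
If the .zip file is larger than 2 GB, unzip cannot handle it, because on a 32-bit machine the C library's long type can only represent file offsets up to 2 GB."
The exact cause is unclear. It is also possible that the copy of the archive on the Linux machine is corrupted.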
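
To help tell a damaged archive from a tool limitation, Python's standard zipfile module can serve as a second opinion. The helper below is a sketch, not part of the original notes, and check_zip is a name chosen here for illustration. It first looks for the end-of-central-directory record (the signature unzip could not find), then verifies the CRC of every member.

```python
import zipfile


def check_zip(path):
    """Return 'not a zip', 'corrupt member: <name>', or 'ok' for the file at path."""
    # is_zipfile looks for the end-of-central-directory record.
    if not zipfile.is_zipfile(path):
        return "not a zip"
    try:
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # reads every member and checks its CRC
    except zipfile.BadZipFile:
        return "not a zip"
    return "ok" if bad is None else f"corrupt member: {bad}"
```

If check_zip reports "ok" for the file on the Linux machine, the archive is probably fine and the unzip build is the more likely culprit. In that case zipfile.ZipFile(path).extractall(target_dir) is one way to extract it, where target_dir is a directory of your choosing.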

3. PyTorch version warning

anaconda3/envs/pytorchEnv/lib/python3.7/site-packages/torch/functional.py:478::UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

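Solution:
Locate the functional.py file named in the warning, go to the line the warning points to (line 504 of functional.py in this case), and add indexing='ij' to the call:

return _VF.meshgrid(tensors, **kwargs, indexing='ij')  # type: ignore[attr-defined]

Because this edits an installed library file, the change is lost whenever PyTorch is reinstalled or upgraded. Where you control the calling code, passing indexing='ij' to torch.meshgrid directly has the same effect.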
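
A less invasive alternative is to pass the argument at the call site. The sketch below assumes a PyTorch version whose torch.meshgrid accepts the indexing keyword (1.10 or newer); the tensors ys and xs are made up for illustration.

```python
import torch

ys = torch.arange(3)
xs = torch.arange(4)

# Passing indexing explicitly avoids the warning; "ij" matches the previous default.
grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
print(grid_y.shape, grid_x.shape)
```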

4. Error when loading model weights

RuntimeError: Error(s) in loading state_dict for ResNet:
	Unexpected key(s) in state_dict: "module.conv1.weight", "module.bn1.weight", "module.bn1.bias", "module.bn1.running_mean", "module.bn1.running_var", "module.conv2.weight", "module.bn2.weight", "module.bn2.bias", "module.bn2.running_mean", "module.bn2.running_var", "module.conv3.weight", "module.bn3.weight", "module.bn3.bias", "module.bn3.running_mean", "module.bn3.running_var", "module.layer1.0.conv1.weight", "module.layer1.0.bn1.weight", "module.layer1.0.bn1.bias", "module.layer1.0.bn1.running_mean", "module.layer1.0.bn1.running_var", "module.layer1.0.conv2.weight", "module.layer1.0.bn2.weight", "module.layer1.0.bn2.bias", 

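Solution:
The problem lies in the weights file. Every unexpected key starts with "module.", the prefix that torch.nn.DataParallel and DistributedDataParallel add to parameter names, so the checkpoint was most likely saved from a wrapped model and is being loaded into an unwrapped ResNet. Either strip the "module." prefix from the keys before calling load_state_dict, or wrap the model the same way before loading.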
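
Stripping the prefix can be sketched in plain Python. The helper name strip_prefix and the two-entry checkpoint are illustrative stand-ins; a real checkpoint would come from torch.load and hold tensors as values.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a loaded checkpoint; the values would be tensors in practice.
checkpoint = OrderedDict([("module.conv1.weight", 0), ("module.bn1.bias", 1)])
print(list(strip_prefix(checkpoint)))
```

The cleaned dictionary can then be passed to model.load_state_dict in place of the original one.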

5. Adding a batch dimension at test time

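During training, the data usually has shape (batch_size, c, h, w). At test time only a single image of shape (c, h, w) is fed in, so a dimension has to be added.
Adding a dimension

import cv2
import torch

img_path = "test.jpg"  # hypothetical path; replace with your own image
image = cv2.imread(img_path)      # numpy array, shape (h, w, c)
# image = torch.tensor(image)     # alternative that copies the data
image = torch.from_numpy(image)   # shares memory with the numpy array
image = image.permute(2, 0, 1)    # (h, w, c) -> (c, h, w)
print(image.size())

img = image.unsqueeze(dim=0)      # add the batch dimension
print(img.size())

img = img.squeeze(dim=0)          # remove it again
print(img.size())

# output:
# torch.Size([c, h, w])
# torch.Size([1, c, h, w])
# torch.Size([c, h, w])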

Reducing dimensions
torch.squeeze compresses dimensions: it removes every dimension of size 1 from the tensor. If the input has shape (A × 1 × B × C × 1 × D), the output has shape (A × B × C × D). If the dim argument is given, only that dimension is processed, and it is removed only if its size is 1.

>>> x = torch.zeros(2, 1, 2, 1, 2)
>>> x.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x)
>>> y.size()
torch.Size([2, 2, 2])

>>> y = torch.squeeze(x, 0)
>>> y.size()
torch.Size([2, 1, 2, 1, 2])

>>> y = torch.squeeze(x, 1)
>>> y.size()
torch.Size([2, 2, 1, 2])

6. Port conflict

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

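Cause:
A DDP run was stopped partway through without releasing its port and GPU memory. The next DDP run again uses the default DDP port, 29500, which causes the conflict.
Solution:
1. Free the resources manually: run kill -9 pid on the processes still holding GPU memory and close every terminal opened on this server. This releases the GPU memory and the port held by the previous DDP run.
2. Add --master_port=xxxxx to the DDP launch command, before xx.py. The GPU memory held by the previous DDP run still needs to be freed, otherwise the new run may run out of GPU memory.
3. Find one of the related processes in the nvidia-smi output and kill it. This forces DDP to stop, after which it releases the port and the GPU resources. Pressing Ctrl+C in the terminal also interrupts the program. Ctrl+Z only suspends the process, so the DDP port is not released; in that case, change the port that DDP uses.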
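
For the second option, a free port number can be obtained from the operating system instead of guessed. This is a small sketch using only the standard library; find_free_port is a name chosen here, and another process could in principle claim the port between this check and the DDP launch.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


port = find_free_port()
print(port)
```

The printed number can then be used as the value of --master_port.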

7. CUDA version issues

When installing pip packages, you may run into CUDA version problems. Often the system CUDA version does not meet the requirement, reinstalling CUDA is too much trouble, and you usually have no root access. In that case, use the CUDA version that conda installs itself.

conda create -n your_env_name python=3.10.13 # 1. create the conda environment
conda activate your_env_name
conda install cudatoolkit==11.8 -c nvidia    # 2. install the specified CUDA version
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118 # 3. install a CUDA-enabled PyTorch build
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc # 4. install cuda-nvcc
conda install packaging
pip install causal-conv1d==1.1.1
pip install mamba-ssm

Step 4 is the easiest to miss, and few blog posts mention it. In the author's tests, skipping cuda-nvcc caused the system CUDA to be used instead.
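
One way to check which nvcc a build would pick up is to look at the first nvcc on PATH and compare it with the active conda environment. The sketch below relies on the CONDA_PREFIX variable that conda activate sets; nvcc_source is a helper name invented for this example, and the classification is only a rough heuristic based on path prefixes.

```python
import os
import shutil


def nvcc_source(nvcc_path, conda_prefix):
    """Classify an nvcc path as 'missing', 'conda', or 'system'."""
    if not nvcc_path:
        return "missing"
    if conda_prefix:
        nvcc_abs = os.path.abspath(nvcc_path)
        prefix_abs = os.path.abspath(conda_prefix)
        # nvcc lives inside the active environment's directory tree.
        if nvcc_abs.startswith(prefix_abs + os.sep):
            return "conda"
    return "system"


found = shutil.which("nvcc")  # first nvcc on PATH, or None
print(found, nvcc_source(found, os.environ.get("CONDA_PREFIX")))
```

A result of "system" or "missing" after activating the environment suggests that step 4 has not taken effect yet.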
