Common Problems When Reproducing CL_MVSNet

你不困我困

Last modified: 2024-06-23 21:43:06


Category: MVS. Tags: deep learning

First published: 2023-10-30 21:53:32

Article link: https://blog.csdn.net/Kunjpg/article/details/134126176


1. Set up Python 3.7 and PyTorch 1.10 as the project's README requires

Install PyTorch 1.10.1:

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge

If the install fails with an error, run the command below and then re-run the install:

conda install conda=23.10.0

Install the remaining packages

Install cv2 (OpenCV):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple  opencv-python

Install tensorboardX and TensorBoard:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboardX
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboard

Install torch-tb-profiler:

 pip install torch-tb-profiler

Install plyfile:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple  plyfile
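Once the installs finish, a quick import check can confirm that nothing is missing before starting a long training run. The snippet below is a minimal sketch: the list of import names is an assumption derived from the packages installed above, so adjust it to whatever the training script actually imports.

```python
import importlib.util

# Import names assumed from the packages installed above; edit as needed.
REQUIRED = ["torch", "torchvision", "cv2", "tensorboardX", "tensorboard", "plyfile"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated conda environment; any name it prints still needs to be installed.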

How to run a shell script
In a terminal, run "chmod +x <script name>.sh" to make the file executable.
Then run "./<script name>.sh" to execute the script.

2. [Solved] FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future.

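torch.distributed.launch is deprecated; replace it with torchrun (the torch.distributed.run module).
Solution:
Replace torch.distributed.launch in the training script with torchrun. For example, if the original command is: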

python -m torch.distributed.launch --nproc_per_node=2 train.py
change it to:

python -m torch.distributed.run --use-env --nproc_per_node=2 train.py

If the command still fails with an error like the one below, remove --use-env from it:

torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2344619) of binary: /home/vgg/anaco

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1105295) of binary: /home/vgg/anaconda3/envs/kunpython37/bin/python
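One thing to keep in mind when switching launchers: torch.distributed.launch passes --local_rank to the script by default, whereas torch.distributed.run (torchrun) sets the LOCAL_RANK environment variable instead. The helper below is a minimal sketch that accepts either form. The function name get_local_rank is hypothetical, and how the CL_MVSNet training script actually reads its rank is not shown in this post.

```python
import argparse
import os


def get_local_rank(argv=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores the script's other command-line options.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun exports LOCAL_RANK; fall back to 0 for plain single-process runs.
    return int(os.environ.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print("local rank:", get_local_rank())
```

With a helper like this, the same script can be started by either launcher without editing its argument parsing.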

3. [Solved] torch.distributed.elastic.multiprocessing.api:Sending process 2344620 closing signal SIGTERM

Running on a single GPU is enough; see item 5 for the fix.

4. [Solved] module 'progressbar' has no attribute 'Variable'

Solution
Uninstall both the progressbar2 and progressbar packages, then reinstall them.

pip uninstall progressbar2
pip uninstall progressbar

Reinstall; an older version of progressbar2 is recommended:

pip install progressbar2==3.51
pip install progressbar==2.1
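Both packages install a module named progressbar, so it can be unclear which one Python actually imports. The check below is a small sketch for confirming that the attribute from the error message is available after reinstalling. The helper name module_has_attr is hypothetical.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True or False if the module imports, or None if it is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr_name)


if __name__ == "__main__":
    result = module_has_attr("progressbar", "Variable")
    if result is None:
        print("progressbar is not installed")
    elif result:
        print("progressbar.Variable is available")
    else:
        print("progressbar was imported but has no attribute 'Variable'")
```

If the last message appears, the imported module is still the one that triggers the error, and the uninstall and reinstall steps should be repeated.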

5. [Solved] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__cudnn_convolution)

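Cause: the tensors involved in the operation are not on the same GPU. Either move all data onto the same GPU or simply run on a single GPU. To run on a single GPU, change these two settings in the training script:

CUDA_VISIBLE_DEVICES=0
--nproc_per_node=1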
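For the other option, moving everything onto one device, a recursive helper is a common pattern. The sketch below assumes that a batch is a nested dict, list, or tuple whose leaves expose a .to(device) method, as PyTorch tensors and modules do. The name move_to_device is hypothetical and is not part of CL_MVSNet.

```python
def move_to_device(obj, device):
    """Recursively move every object that has a .to() method onto `device`."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (plain list or tuple) with moved items.
        return type(obj)(move_to_device(item, device) for item in obj)
    # Strings, numbers, None and similar values are returned unchanged.
    return obj
```

In a training loop this would be called as sample = move_to_device(sample, device), with the model moved to the same device through model.to(device). The error message mentions the weight argument, so the model's parameters need to be on that device too.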

6. [Solved] CUDA out of memory.

Use gpustat to check how much memory each GPU has in use:

 gpustat

Then switch to a GPU that has free memory:

CUDA_VISIBLE_DEVICES=<index of a free GPU>
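Picking the GPU by hand can also be automated. The sketch below uses nvidia-smi rather than gpustat (a swap made here only because its CSV output is easy to parse) and returns the index of the GPU with the most free memory. The function names are hypothetical.

```python
import subprocess


def parse_free_memory(csv_text):
    """Parse 'index, memory.free' CSV lines into a list of (index, free_mib) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, free_mib = [field.strip() for field in line.split(",")]
        gpus.append((int(index), int(free_mib)))
    return gpus


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory."""
    gpus = parse_free_memory(csv_text)
    if not gpus:
        raise ValueError("no GPUs listed")
    return max(gpus, key=lambda gpu: gpu[1])[0]


def query_nvidia_smi():
    """Ask nvidia-smi for each GPU's index and free memory in MiB."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


if __name__ == "__main__":
    try:
        print(pick_freest_gpu(query_nvidia_smi()))
    except (OSError, subprocess.CalledProcessError, ValueError) as error:
        print("Could not query GPUs:", error)
```

The printed index can then be used as the value of CUDA_VISIBLE_DEVICES when launching training.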

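7. [Solved] Training was interrupted unexpectedly; resume from a checkpoint file

Find the checkpoint file under the .log folder and copy its path.
In the main function, find the training section.
There, find the 10th parameter, resume, and add default = '<path to the checkpoint file>'.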
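Instead of hard-coding the path as the default, the checkpoint can also be located automatically and passed on the command line. The sketch below is an illustration under stated assumptions: only the argument name resume comes from the steps above, while its string type, the *.ckpt file pattern, and the helper names are hypothetical and may differ from the real CL_MVSNet code.

```python
import argparse
import glob
import os


def latest_checkpoint(log_dir, pattern="*.ckpt"):
    """Return the most recently modified checkpoint in log_dir, or None if there is none."""
    candidates = glob.glob(os.path.join(log_dir, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def build_parser():
    """Build a parser with a --resume option (assumed here to take a file path)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--resume", type=str, default=None,
                        help="path of the checkpoint to resume from")
    return parser


if __name__ == "__main__":
    args, _unknown = build_parser().parse_known_args()
    # Fall back to the newest checkpoint in .log when --resume is not given.
    checkpoint = args.resume or latest_checkpoint(".log")
    print("resuming from:", checkpoint)
```

Setting the default by hand, as described above, gives the same result; the helper only saves copying the path after each interruption.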
