[Experiment Log 1: SimCC]

冰糖狮子头

Last modified 2023-05-14 17:10:12

Tags: python, deep learning, opencv

First published 2023-05-04 22:31:46

Article link: https://blog.csdn.net/yuandeyixinren11/article/details/130301563

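Summary (auto-generated by CSDN): While training a model on the COCO train2017 dataset, a changed opencv version caused an error. Switching to version 3.4.0.14 failed to install, and after moving to 4.4.0.46 a cuDNN error appeared at runtime. Adjusting the GPU setting and using an absolute path solved part of the problems, but all experiments were killed after running for a while, and debugging traced the failed resume to the checkpoint_file path.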

1. Running on the COCO train2017 dataset:

1.1

python tools/train.py     --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml

Parameter count: 43M

I had already gotten SimCC working on the server and it ran fine, but when I ran it again today it failed with:

cv2.error: OpenCV(3.4.11) /tmp/pip-req-build-9tmfflg3/opencv/modules/imgproc/src/color.cpp:182: erro

Most likely I changed the opencv version earlier while running RLE, so I started switching versions:

pip uninstall opencv-python

pip install opencv-python==3.4.0.14

That failed as well:
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> opencv-python

Posts online suggested the version I tried to install might be too old, so:

pip install opencv-python==4.4.0.46

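The install succeeded, but I am not sure whether it will fail again at runtime: the earlier error only showed up after several epochs of training, and I don't know why…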
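
Since the breakage came from the opencv version changing without notice, it may help to check the installed version before starting a long run. The following is only a sketch: the helper names `installed_version` and `check_pin` are made up for illustration, and the pinned value is simply the version installed above, not a version the SimCC repository is known to require.

```python
from importlib import metadata
from typing import Optional

# Assumption: the pin is just the version that was installed in this log.
PINNED_OPENCV = "4.4.0.46"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pin(dist_name: str, expected: str) -> bool:
    """Warn and return False when the installed version differs from the pin."""
    found = installed_version(dist_name)
    if found != expected:
        print(f"warning: {dist_name} is {found!r}, expected {expected!r}")
        return False
    return True


if __name__ == "__main__":
    check_pin("opencv-python", PINNED_OPENCV)
```

Calling `check_pin` at the top of a training script would at least surface a silent version change right away instead of several epochs in.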

Then I ran:

python tools/train.py     --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml

This raised an error I had never seen before:

 raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuman/HPE/code/SimCC-main/SimCC-main/tools/../lib/models/pose_resnet.py", line 205, in forward
    x = self.conv1(x)
  File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

The config file had GPUS: (0,1,2,3), so I tried a single card instead: GPUS: (0,)
(But yesterday I used GPUS: (0,1,2,3) and got no error…)
It is running now. Hopefully it won't fail midway!!!
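
Which physical card a single-GPU run uses can also be chosen through the `CUDA_VISIBLE_DEVICES` environment variable rather than the config file. Below is a minimal sketch of doing that from Python; `restrict_visible_gpus` is a hypothetical helper, not part of SimCC, and it says nothing about what caused the cuDNN error.

```python
import os
from typing import Sequence


def restrict_visible_gpus(gpu_ids: Sequence[int]) -> str:
    """Expose only the given physical GPUs to CUDA in this process.

    This has to run before CUDA is initialised (safest: before importing
    torch), because the variable is read at initialisation time.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: make only physical GPU 3 visible. Inside the process it is then
# numbered as device 0, so the config would still say GPUS: (0,).
restrict_visible_gpus([3])
```

The renumbering is the design point here: with one visible card, `GPUS: (0,)` stays valid no matter which physical GPU was picked.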

Results after the first epoch completed successfully:
[image]

[image]
1.2

CUDA_VISIBLE_DEVICES=3 python tools/train.py --cfg experiments/coco/lpn/simdr/lpn50_256x192_gd256x2_gc.yaml

Parameter count: 6.6M

[image]
Changed HEAD_INPUT: from 4096 to 3072, the product of the two HEATMAP_SIZE values:
3072 = 48*64

 HEATMAP_SIZE:
  - 48
  - 64
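
The arithmetic behind that change can be written down as a tiny check. This is a sketch only: `head_input_size` is a made-up helper, and the idea that HEAD_INPUT must equal the flattened heatmap size is inferred from the note above, not read from the model code.

```python
from math import prod  # available since Python 3.8


def head_input_size(heatmap_size):
    """Flattened heatmap length: the product of the HEATMAP_SIZE entries."""
    return prod(heatmap_size)


# HEATMAP_SIZE from the config above.
print(head_input_size([48, 64]))  # 3072

# 4096 is what a 64x64 heatmap would give (an assumption about where the
# old HEAD_INPUT value came from).
print(head_input_size([64, 64]))  # 4096
```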

[image]
1.3

CUDA_VISIBLE_DEVICES=6 python tools/train.py --cfg experiments/coco/lpn/sa-simdr/lpn50_256x192_gd256x2_gc.yaml

sa-simdr and simdr use different loss functions:
sa-simdr needs target_x and target_y
simdr needs target
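
One way to picture the difference in targets is a hard bin index per axis for simdr versus a soft 1-D distribution per axis for sa-simdr. The sketch below follows my reading of the method and is not taken from the repository code: the function names, the sum-to-one normalisation, and the example values (256x192 input, split ratio 2, sigma 6, borrowed from the `split2_sigma6` config name) are all assumptions.

```python
import math


def simdr_target(x, y, split_ratio):
    """Hard labels: one bin index per axis (a single `target`)."""
    return round(x * split_ratio), round(y * split_ratio)


def sa_simdr_target(x, y, width, height, split_ratio, sigma):
    """Soft labels: a 1-D Gaussian over the bins of each axis.

    Returns (target_x, target_y); each list is normalised to sum to 1
    (the normalisation is an assumption made for this sketch).
    """
    def gaussian(center, n_bins):
        vals = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
                for i in range(n_bins)]
        total = sum(vals)
        return [v / total for v in vals]

    target_x = gaussian(x * split_ratio, int(width * split_ratio))
    target_y = gaussian(y * split_ratio, int(height * split_ratio))
    return target_x, target_y


# A keypoint at pixel (100, 50) in a 192-wide, 256-high input.
hard = simdr_target(100, 50, split_ratio=2)
tx, ty = sa_simdr_target(100, 50, width=192, height=256,
                         split_ratio=2, sigma=6)
print(hard, len(tx), len(ty))
```

Because one variant yields a pair of indices and the other two full vectors, a loss written for one cannot consume the other's targets, which matches the note above.
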
Parameter count:
[image]
1.4

CUDA_VISIBLE_DEVICES=4 python tools/train.py --cfg experiments/coco/resnet/sa_simdr/original/res50_256x192_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml

Parameter count:
[image]

2. Running on the COCO val2017 dataset:

CUDA_VISIBLE_DEVICES=5 python tools/test.py --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
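
The variable name has to be spelled exactly `CUDA_VISIBLE_DEVICES`. The shell does not validate names, so a misspelled one is set without complaint and never read by CUDA, leaving every GPU visible. A small guard could catch near misses before a run; this is a sketch, and `find_misspelled_vars` and the 0.85 cutoff are illustrative choices, not part of any tool.

```python
import difflib
import os

# Variables whose exact spelling matters for this workflow.
KNOWN_VARS = ["CUDA_VISIBLE_DEVICES"]


def find_misspelled_vars(environ, known=KNOWN_VARS, cutoff=0.85):
    """Return (found_name, intended_name) pairs for near-miss variable names."""
    suspects = []
    for name in environ:
        if name in known:
            continue  # spelled correctly
        match = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
        if match:
            suspects.append((name, match[0]))
    return suspects


if __name__ == "__main__":
    for found, intended in find_misspelled_vars(os.environ):
        print(f"warning: {found} is set; did you mean {intended}?")
```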

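The program did not log per-epoch progress; the result came out within a few seconds:
[image]
Official result:

How is this even better than the official result…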

3. Running on the MPII dataset:

Training on the MPII dataset
Training with SimDR as the keypoint coordinate representation:

python tools/train.py --cfg experiments/mpii/hrnet/simdr/norm_w32_256x256_adam_lr1e-3_ls2e1.yaml

Results after the first epoch:
[image]
Results after epoch 210:

The program got killed!!!

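All three experiments were suddenly killed after running for three days, and I don't know why…
The config file sets AUTO_RESUME=True, so in principle a rerun should continue from the last finished epoch, but the rerun started from scratch.
With help from a senior labmate I started debugging and found that the program never entered this if statement:
[image]
That means checkpoint_file was not found. Because PyCharm connects to the server remotely, a relative path is only looked up under the current directory, the tools folder, so OUTPUT_DIR: 'output' in the config file has to be changed to an absolute path:
OUTPUT_DIR: '/home/liuman/HPE/code/SimCC-main/SimCC-main/output'
With that the file was found, and the rerun resumed from epoch 94.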
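
The failure mode can be reproduced in isolation: the same relative `output` path points to different places depending on the working directory. The sketch below is not the SimCC training code. `resolve_output_dir`, `find_checkpoint`, and the file name `checkpoint.pth` are assumptions for illustration; it anchors a relative OUTPUT_DIR to the project root, as an alternative to hard-coding an absolute path.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_output_dir(output_dir: str, project_root: Path) -> Path:
    """Anchor a relative OUTPUT_DIR at the project root instead of the cwd."""
    p = Path(output_dir)
    return p if p.is_absolute() else (project_root / p).resolve()


def find_checkpoint(out_dir: Path, name: str = "checkpoint.pth") -> Optional[Path]:
    """Return the checkpoint path if it exists, else None (no resume)."""
    ckpt = out_dir / name
    return ckpt if ckpt.exists() else None


def demo():
    """Show that a relative path misses the checkpoint when cwd is tools/."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / "tools").mkdir()
        (root / "output").mkdir()
        (root / "output" / "checkpoint.pth").touch()
        os.chdir(root / "tools")  # what the remote run effectively did
        try:
            relative_found = find_checkpoint(Path("output")) is not None
            resolved_found = find_checkpoint(
                resolve_output_dir("output", root)) is not None
        finally:
            os.chdir(old_cwd)
    return relative_found, resolved_found


print(demo())  # (False, True)
```

In a real script the project root could be derived from the script location, for example `Path(__file__).resolve().parent.parent` for a file inside tools/, so the result no longer depends on where the run is launched from.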

CUDA_VISIBLE_DEVICES=4 python tools/test.py --cfg experiments/mpii/lpn/lpn50_256x256_gd256x2_gc.yaml

[image]

CUDA_VISIBLE_DEVICES=4 python tools/test.py --cfg experiments/mpii/lpn/sa_simdr/lpn50_256x256_gd256x2_gc.yaml

[image]

The backbone here is LPN, which is an optimized variant of ResNet:
CUDA_VISIBLE_DEVICES=4 python tools/train.py --cfg experiments/mpii/lpn/sa_simdr/lpn50_256x256_gd256x2_gc.yaml
[image]
CUDA_VISIBLE_DEVICES=7 python tools/train.py --cfg experiments/mpii/resnet/sa_simdr/original/res50_256x256_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
