1. Training on the COCO train2017 dataset:
1.1
python tools/train.py --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
Parameter count: 43M
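A parameter count like the one above can be reproduced with a few lines of Python. This is a minimal sketch, not the repository's own code: `count_parameters` and `format_millions` are hypothetical helper names, and the model is only assumed to expose `.parameters()` the way a `torch.nn.Module` does.

```python
def count_parameters(model, trainable_only=False):
    """Return the number of scalar parameters in a model.

    Assumption: `model` exposes `.parameters()` like torch.nn.Module,
    and each parameter has `.numel()` and `.requires_grad`.
    """
    total = 0
    for p in model.parameters():
        if trainable_only and not p.requires_grad:
            continue  # skip frozen parameters when asked to
        total += p.numel()
    return total


def format_millions(n):
    """Format a raw count in millions with one decimal place."""
    return f"{n / 1e6:.1f}M"
```

Calling `format_millions(count_parameters(model))` right after the model is built in the training script prints the figure in the same form as these notes.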
I had already gotten SimCC working on the server earlier and it ran normally, but when I ran it again today it failed with:
cv2.error: OpenCV(3.4.11) /tmp/pip-req-build-9tmfflg3/opencv/modules/imgproc/src/color.cpp:182: erro
Running RLE earlier probably changed the OpenCV version, so I started switching versions:
pip uninstall opencv-python
pip install opencv-python==3.4.0.14
That produced another error:
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> opencv-python
Posts online suggest the version I tried to install may be too old, so:
pip install opencv-python==4.4.0.46
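The install succeeded, but I don't know whether the run will fail again. The earlier error only showed up after several epochs of training, and I don't know why…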
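Before starting another long run, it can help to confirm which OpenCV wheel is actually installed in the environment, since several pip packages provide `cv2`. The following is a small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a pip package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Several distributions ship the cv2 module; more than one being
    # installed at once is a common source of version confusion.
    for name in ("opencv-python", "opencv-python-headless",
                 "opencv-contrib-python"):
        print(name, installed_version(name))
```

If more than one of these reports a version, uninstalling all of them and reinstalling a single one is usually the simpler path.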
Then I ran:
python tools/train.py --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
This raised an error I had never seen before:
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/liuman/HPE/code/SimCC-main/SimCC-main/tools/../lib/models/pose_resnet.py", line 205, in forward
x = self.conv1(x)
File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
The config file sets GPUS: (0,1,2,3). Try a single card instead: GPUS: (0,)
(But yesterday I used GPUS: (0,1,2,3) and got no error…)
It's running now. Hopefully it doesn't fail midway!!!
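The traceback points at replica 1 on device 1, so one way to narrow this down is to run a tiny convolution on each GPU separately and see which one fails. This is a rough diagnostic sketch under the assumption that PyTorch is installed; `probe_gpus` is a hypothetical helper, and a passing probe does not rule out a memory or driver problem that only appears under load.

```python
def probe_gpus():
    """Run a small conv2d on every visible GPU and report the outcome.

    Returns a dict mapping device index to "ok" or an error message.
    Returns an empty dict when PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return {}

    results = {}
    if not torch.cuda.is_available():
        return results

    for idx in range(torch.cuda.device_count()):
        device = f"cuda:{idx}"
        try:
            x = torch.randn(2, 3, 64, 64, device=device)
            conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(device)
            conv(x)
            torch.cuda.synchronize(idx)  # surface asynchronous CUDA errors
            results[idx] = "ok"
        except RuntimeError as exc:
            results[idx] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    for idx, status in probe_gpus().items():
        print(f"GPU {idx}: {status}")
```

If only one index fails, leaving that card out of GPUS (or out of `CUDA_VISIBLE_DEVICES`) is a reasonable workaround while the cause is investigated.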
Results after successfully finishing the first epoch:
1.2
CUDA_VISIBLE_DEVICES=3 python tools/train.py --cfg experiments/coco/lpn/simdr/lpn50_256x192_gd256x2_gc.yaml
Parameter count: 6.6M
Change HEAD_INPUT: from 4096 to 3072
3072 = 48*64
HEATMAP_SIZE:
- 48
- 64
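The arithmetic behind this change can be checked quickly. The sketch below assumes that HEAD_INPUT is the flattened heatmap size (width times height) and that the backbone downsamples the input by a factor of 4; both are assumptions read off the numbers above, and the function names are hypothetical.

```python
def heatmap_size(image_size, stride=4):
    """Heatmap (width, height) for an input (width, height).

    Assumption: the backbone reduces resolution by `stride` (4 here).
    """
    width, height = image_size
    return (width // stride, height // stride)


def head_input_size(hm_size):
    """Flattened feature length fed to the head: width * height."""
    width, height = hm_size
    return width * height


# 256x192 input (height x width), i.e. (192, 256) as (width, height)
assert heatmap_size((192, 256)) == (48, 64)
assert head_input_size((48, 64)) == 3072

# The old value matches a square 64x64 heatmap, e.g. a 256x256 input
assert head_input_size(heatmap_size((256, 256))) == 4096
```

Under these assumptions, 4096 fits a 256x256 input and 3072 fits the 256x192 input used by this config.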
1.3
CUDA_VISIBLE_DEVICES=6 python tools/train.py --cfg experiments/coco/lpn/sa-simdr/lpn50_256x192_gd256x2_gc.yaml
sa-simdr and simdr use different loss functions
sa-simdr needs target_x and target_y
simdr needs target
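The difference in targets can be illustrated with a simplified sketch. As I understand it, simdr supervises each axis with a single bin index, while sa-simdr supervises each axis with a 1-D Gaussian centered on that bin. The code below is an illustration only, not the repository's implementation: the function names are hypothetical, the Gaussian is left unnormalized, and the defaults `split_ratio=2.0` and `sigma=6.0` are guessed from the `split2_sigma6` part of the config file names.

```python
import math


def gaussian_1d(center, num_bins, sigma):
    """Unnormalized 1-D Gaussian over bin indices 0..num_bins-1."""
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
            for i in range(num_bins)]


def simdr_target(x, y, split_ratio=2.0):
    """simdr-style target: one bin index per axis."""
    return (round(x * split_ratio), round(y * split_ratio))


def sa_simdr_targets(x, y, image_size, split_ratio=2.0, sigma=6.0):
    """sa-simdr-style targets: one soft 1-D distribution per axis."""
    width, height = image_size
    target_x = gaussian_1d(x * split_ratio, int(width * split_ratio), sigma)
    target_y = gaussian_1d(y * split_ratio, int(height * split_ratio), sigma)
    return target_x, target_y


if __name__ == "__main__":
    # A keypoint at pixel (100, 50) in a 192x256 (width x height) image
    print(simdr_target(100, 50))
    tx, ty = sa_simdr_targets(100, 50, (192, 256))
    print(len(tx), len(ty), tx.index(max(tx)), ty.index(max(ty)))
```

This is why the two variants cannot share a loss function: one compares logits against a class index, the other against a full per-axis distribution.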
Parameter count:
1.4
CUDA_VISIBLE_DEVICES=4 python tools/train.py --cfg experiments/coco/resnet/sa_simdr/original/res50_256x192_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
Parameter count:
2. Evaluating on the COCO val2017 dataset:
CUDA_VISIBLE_DEVICES=5 python tools/test.py --cfg experiments/coco/resnet/sa_simdr/original/res50_384x288_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml
The program did not log anything per epoch. Instead, the results came out within a few seconds:
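Official results:
How is it even better than the official results…
3. Running on the MPII dataset:
Training on the MPII dataset
Training with SimDR as the keypoint coordinate representation: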
python tools/train.py --cfg experiments/mpii/hrnet/simdr/norm_w32_256x256_adam_lr1e-3_ls2e1.yaml
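Results after the first epoch:
Results after epoch 210:
The processes got killed!!!
After running for three days, all three experiments were suddenly killed, and I don't know what happened…
The config file sets AUTO_RESUME=True, so in principle rerunning should continue from the epoch where the previous run stopped, but rerunning started from scratch again.
With help from a senior labmate I started debugging and found that the program never entered this if statement:
That means checkpoint_file was not found. PyCharm runs the script on the server over a remote connection, so a relative path is resolved against the current working directory, which is the tools folder. The fix is to change OUTPUT_DIR: 'output' in the config file to an absolute path:
OUTPUT_DIR: '/home/liuman/HPE/code/SimCC-main/SimCC-main/output'
With that change the checkpoint is found, and rerunning resumed from epoch 94.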
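An alternative to hard-coding an absolute path is to resolve a relative OUTPUT_DIR against the project root instead of the working directory. This is only a sketch of that idea: `resolve_output_dir` and `can_resume` are hypothetical helpers, not functions from the repository, and the resume condition is my assumption about what the if statement checks.

```python
import os


def resolve_output_dir(output_dir, project_root):
    """Make OUTPUT_DIR independent of the current working directory.

    Absolute paths are returned unchanged; relative ones are joined
    onto `project_root`.
    """
    if os.path.isabs(output_dir):
        return output_dir
    return os.path.normpath(os.path.join(project_root, output_dir))


def can_resume(auto_resume, checkpoint_file):
    """Resume only if enabled and the checkpoint exists on disk."""
    return bool(auto_resume) and os.path.exists(checkpoint_file)


if __name__ == "__main__":
    # Assumption: this file lives in <project_root>/tools/
    here = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.dirname(here)
    print(resolve_output_dir("output", project_root))
```

Printing the resolved checkpoint path together with the result of `os.path.exists` at startup makes this kind of silent restart much easier to spot.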
CUDA_VISIBLE_DEVICES=4 python tools/test.py --cfg experiments/mpii/lpn/lpn50_256x256_gd256x2_gc.yaml
CUDA_VISIBLE_DEVICES=4 python tools/test.py --cfg experiments/mpii/lpn/sa_simdr/lpn50_256x256_gd256x2_gc.yaml
The backbone here is LPN, which is an optimized variant of ResNet:
CUDA_VISIBLE_DEVICES=4 python tools/train.py --cfg experiments/mpii/lpn/sa_simdr/lpn50_256x256_gd256x2_gc.yaml
CUDA_VISIBLE_DEVICES=7 python tools/train.py --cfg experiments/mpii/resnet/sa_simdr/original/res50_256x256_d256x3_adam_lr1e-3_deconv3_split2_sigma6.yaml