1. Disable cudnn batch_norm:
1.1 Find where PyTorch is installed in the current environment:
Type python in a terminal or at the command line to enter the Python interpreter.
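The lookup in step 1.1 can be sketched with the standard library alone. The helper name find_package_dir is hypothetical, and it only reports where the package would be imported from without importing it.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_dir(name: str) -> Optional[Path]:
    """Return the directory a package is installed in, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at its __init__.py.
    return Path(spec.origin).parent


# Prints the torch directory of the active environment, or None if torch is absent.
print(find_package_dir("torch"))
```

The printed directory is the value to assign to PYTORCH in step 1.2.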
1.2 Set an environment variable for the PyTorch path:
~$ PYTORCH=/home/liuman/anaconda3/envs/pytorch02/lib/python3.8/site-packages/torch
2. Clone this repository. We will call the cloned directory ${POSE_ROOT}.
3. Install the dependencies:
(pytorch02) liuman@gpu01-beiserver:~$ cd /home/liuman/HPE/code/SimpleBaseline/
(pytorch02) liuman@gpu01-beiserver:~/HPE/code/SimpleBaseline$ ls
CONTRIBUTING.md experiments lib LICENSE pose_estimation README.md requirements.txt SECURITY.md
(pytorch02) liuman@gpu01-beiserver:~/HPE/code/SimpleBaseline$ pip install -r requirements.txt
Error:
Solution:
In requirements.txt, change the required opencv-python version to a nearby one, such as 3.4.11.41.
(If that version also fails to install, try another.)
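If you need to try several versions, the edit to requirements.txt can be scripted. This is a minimal sketch: pin_requirement is a hypothetical helper, and the old pin in the sample text is illustrative rather than copied from the repository.

```python
import re


def pin_requirement(requirements_text: str, package: str, version: str) -> str:
    """Return the requirements text with `package` pinned to `version`."""
    # Match the package name at the start of a line, plus any version specifier.
    pattern = re.compile(
        rf"^{re.escape(package)}[ \t]*(?:[=<>!~]=?.*)?$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(lambda _m: f"{package}=={version}", requirements_text)


# Illustrative file contents; the real requirements.txt may differ.
sample = "EasyDict==1.7\nopencv-python==3.4.1.15\nCython\n"
print(pin_requirement(sample, "opencv-python", "3.4.11.41"))
```

Write the returned text back to requirements.txt and rerun pip install -r requirements.txt.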
4. Build the libraries:
$ cd lib
$ make
This runs, in order, the commands in the Makefile under the lib directory:
all:
cd nms; python setup.py build_ext --inplace; rm -rf build; cd ../../
clean:
cd nms; rm *.so; cd ../../
5. Install COCOAPI:
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python3 setup.py install --user
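6. Download the PyTorch pretrained models for ImageNet, COCO, and MPII (I did not download the caffe-style ones):
(I had already downloaded the pretrained models locally, so I did not copy them into this project's model directory. I change the paths when they are needed.)
7. Create the output directory (for trained models) and the log directory (for TensorBoard logs):
mkdir output
mkdir log
(I did not re-download the data either. I change the paths when it is needed.)
8. Train on the COCO train2017 dataset: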
python pose_estimation/train.py \
--cfg experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml
Error:
Solution:
Leave the pyyaml version unchanged and replace the load() call directly:
use safe_load() in place of load().
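One way to apply this replacement to a source file is a small text rewrite. The sketch below assumes the call is written as yaml.load(...); use_safe_load is a hypothetical helper and the sample line is illustrative. Note that safe_load only constructs plain Python types, which should be enough for an ordinary config file.

```python
import re

# Matches yaml.load( but not yaml.safe_load(.
_YAML_LOAD = re.compile(r"\byaml\.load\(")


def use_safe_load(source: str) -> str:
    """Return Python source text with yaml.load( rewritten to yaml.safe_load(."""
    return _YAML_LOAD.sub("yaml.safe_load(", source)


# Illustrative input line; the real call site may look different.
print(use_safe_load("cfg = yaml.load(f)"))
```

Running the helper twice is harmless, since calls that already use safe_load are left alone.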
Solution:
Upgrade tensorboardX:
pip install --upgrade tensorboardx
Training again produces a warning:
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
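Fix:
Training again then fails with "Fail to read" for one of the images. The image exists on disk, so it simply was not being read.
When debugging train.py, the run still fails even though the cfg path is given: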
usage: train.py [-h] --cfg CFG
train.py: error: the following arguments are required: --cfg
Solution: remove required=True:
# parser.add_argument('--cfg',
# help='experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml',
# required=True,
# type=str)
parser.add_argument('--cfg',
help='experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml',
type=str)
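With required=True removed and no default set, argparse leaves args.cfg as None when the flag is omitted, which may fail later if the script opens that path. A possible variant, sketched here as an assumption rather than the repository's code, supplies the config path as a default. DEFAULT_CFG and build_parser are hypothetical names.

```python
import argparse

# Hypothetical default for debugger runs; this is the config used in this walkthrough.
DEFAULT_CFG = "experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train keypoints network")
    parser.add_argument(
        "--cfg",
        help="experiment configuration file",
        default=DEFAULT_CFG,  # used when --cfg is omitted, e.g. in an IDE debug run
        type=str,
    )
    return parser


# No arguments given: the default config path is used.
print(build_parser().parse_args([]).cfg)
```

Passing --cfg on the command line still overrides the default, so normal training commands keep working.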
While debugging, I tried to step into model=eval(…) and got:
Couldn’t apply path mapping to the remote file.
The solution is to start an SSH session.
Error during training (the image clearly exists but cannot be read):
ValueError: Caught ValueError in DataLoader worker process 3.
ValueError: Fail to read /home/liuman/HPE/dataset/mscoco/2017/images/train2017/000000468530.jpg
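Solution:
Set num_workers in train_loader to 0.
Some parameters were originally loaded on gpu0, but the program now expects them on gpu4. The program uses gpu0 as the main GPU by default, but I want to use gpu4: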
RuntimeError: module must have its parameters and buffers on device cuda:4 (device_ids[0]) but found one of them on device: cuda:0
Fix:
CUDA_VISIBLE_DEVICES=4 python pose_estimation/train.py \
--cfg experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml
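The same restriction can be set from inside Python, which is convenient for debugger runs where the shell prefix is awkward. This is a sketch with a hypothetical helper, select_gpus. The variable must be set before CUDA is initialized, so doing it before importing torch is the safe choice.

```python
import os


def select_gpus(gpu_ids: str) -> None:
    """Restrict this process to the given physical GPUs (comma-separated ids)."""
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids


select_gpus("4")
# Import torch only after this point.
# Inside the process, physical GPU 4 is now addressed as cuda:0.
```

Because the visible GPU is renumbered to cuda:0, code that assumes gpu0 is the main device keeps working without changes.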