Dung-Shoveling Diary: tf-pose 01

elvindp

Tags: tfboy, dung-shoveling, caffe, openpose

Article link: https://blog.csdn.net/elvindp/article/details/86630910

From open-pose to tf-pose, and from TF to Caffe (which I never actually reached)

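open-pose is a Caffe project open-sourced by the experts at CMU. Alas, for a scrub like me, C++ is painful to read and agonizing to debug. As luck would have it, my boss wanted something optimized (read: cheap), I found Ildoo's tf-pose on GitHub, and it even came with an optimized mobilenet version. Thanks to all these coincidences, I began my happy life of shoveling through other people's code. Except, not really: I was being far too naive back then…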

There is no Windows version! There is no Windows version! There is no Windows version!

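So right at the start I decisively dug out my long-neglected Ubuntu machine, and after a round of flawless moves, run.py ran perfectly in PyCharm.
Of course, the Python that ships with Ubuntu defaults to 2.7, and this kind of shoveling needs Python 3.5 first. So off I went to Google how to change the default Python version, change some ld links, and so on. I will post the terminal commands later.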

Python
Version 3.5 or later, because TF needs it.
Upgrade pip:
python -m pip install --upgrade pip
Swig
The Python/C++ interface tool needed to build pafprocess. Installing it with pip is enough.
Install the packages listed in requirements.txt with pip:
pip install -r requirements.txt

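Running tf-pose

run.py
Remember to change the arguments in the parser.add_argument calls.
- Change the model type to the one you need: cmu or mobilenet_thin.
- resize effectively defaults to 432x368. The default is written as 0x0, but further down the code checks w and h and turns 0x0 into 432x368.
- resize-out-ratio is a nasty parameter. It enlarges the heatmaps before pafprocess; its value affects speed, but not the inference part. The default is 4; 8 gives somewhat better results and 2 somewhat worse. Note that at 2 or below you sometimes get no result at all, i.e. no skeleton and no heatmap.
run_video.py
- run_video.py has no resize-out-ratio argument, but without it you easily get no skeleton. Change the arguments of the e.inference call further down:
  humans = e.inference(image, resize_to_default=True, upsample_size=4)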
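
To make the 0x0 fallback concrete, here is a minimal sketch of the logic described above. `parse_resize` is a hypothetical helper written for this note, not the function tf-pose itself uses; only the 432x368 fallback value comes from the behavior described for run.py.

```python
DEFAULT_SIZE = (432, 368)  # fallback described above for a 0x0 resize argument


def parse_resize(resize_str, default=DEFAULT_SIZE):
    """Parse a 'WxH' string and fall back to `default` when either side is 0.

    Hypothetical helper for illustration only.
    """
    width, height = (int(part) for part in resize_str.lower().split("x"))
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative: %r" % resize_str)
    if width == 0 or height == 0:
        # 0x0 means "not specified", so use the default network input size.
        return default
    return width, height
```

Calling `parse_resize("0x0")` gives the default size, while an explicit value such as `"656x368"` is passed through unchanged.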

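Actually, a Windows version is possible…

Download the source code from GitHub:
https://github.com/ildoonet/tf-pose-estimation
Download a prebuilt wget exe and add its folder to the PATH environment variable.
Download the prebuilt Swig archive for Windows (swigwin) and add its folder to PATH: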
D:\pvp\tf-pose-estimation-master\swigwin-3.0.12
Add the path of VS14's cl.exe to PATH:
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\amd64
Otherwise you get:
error: command 'cl.exe' failed: No such file or directory
Add the path of rc.exe to PATH:
C:\Program Files (x86)\Windows Kits\8.1\bin\x64
Otherwise you get:
LINK : fatal error LNK1158: cannot run 'rc.exe'
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\bin\\amd64\\link.exe' failed with exit status 1158
Build pafprocess:
swig -python -c++ pafprocess.i && python setup.py build_ext --inplace
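Because bash scripts hit a pile of bugs on Windows, I commented out that whole part of setup.py:
# subprocess.check_output(["bash", sh_location], cwd=cwd)
Then I downloaded the files in models by hand…
Build tf-pose as a Python library: python setup.py install
requirements.txt does not include tf or opencv-python, so install them yourself; pip handles it all.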
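
Since most of the Windows errors above come from a tool missing on PATH, a quick check before building can save a round trip. This is a minimal sketch using only the standard library; the tool list is assembled from the steps above for this note, and `missing_tools` is a hypothetical helper, not part of tf-pose.

```python
import shutil


def missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Tools the Windows build steps above rely on (list assembled for this sketch).
REQUIRED_TOOLS = ["swig", "cl", "rc", "wget"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All build tools were found on PATH.")
```

On Windows, `shutil.which` also tries the extensions listed in PATHEXT, so looking up `cl` finds `cl.exe`.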

models:

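cmu:
- Somewhat different from the original openpose version; accuracy is lower, and experiment.md has the comparison. When I ran the COCO evaluation myself, though, the numbers came out lower still… That said, my results were measured after resizing and on the coco2017 dataset, so the setups differ a little.
mobilenet_thin:
- The author provides a mobilenet_v1_0.75 model; the relevant code is in networks.py, with conv_width1=0.75 and conv_width2=0.50. conv_width1 is the depth multiplier of the mobilenet backbone. The author uses the 0.75 version, so the channel sequence 64->128->128->256->256->512->512->512->512->512->512 is multiplied by 0.75. conv_width2 is the depth multiplier of the stage part and is simply 0.5. There are 7 stages in total, but the first 2 and the last 5 differ slightly: in the second-to-last layer, the first two stages have 512x0.5=256 kernels and the last 5 have 128x0.5=64 kernels.
- The author no longer provides checkpoints, so if you want to convert frameworks or continue training, there are 2 options:
1. Run train.py yourself. It is painfully slow: on a 1080Ti, 100 steps take 20 minutes! The author reportedly ran 100k steps… Also, people on GitHub say the loss has to drop below 260 to be acceptable, and even at 260 the results are not necessarily good; the heatmaps still have occasional problems.
2. Define a network yourself, then load the weights and biases from the frozen pb one by one. I chose to just drop dead. Follow-up: it turns out the author ships the network definition, network_mobilenet_thin.py, so this is fairly feasible after all.
  Note 1: the get_network function in networks.py reads the input dimensions when naming the model, but the input dimensions of the author's trained model are None. I added a check for that: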
```
if sess_for_load is not None:
    if _type == 'cmu' or _type == 'vgg':
        if not os.path.isfile(pretrain_path_full):
            raise Exception('Model file doesn\'t exist, path=%s' % pretrain_path_full)
        net.load(os.path.join(_get_base_path(), pretrain_path), sess_for_load)
    else:
        # Static input dimensions; these are None when the input size is dynamic.
        dim1 = placeholder_input.shape[1].value
        dim2 = placeholder_input.shape[2].value
        if dim1 is not None and dim2 is not None:
            s = '%dx%d' % (dim1, dim2)
        else:
            # '%d' cannot format None, so fall back to a placeholder name.
            s = '?x?'
```
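  Follow-up: the author never released the ckpt at all, so it makes no difference whether you bother with this… However, one kind soul did share one, and I was incredibly moved: after two weeks of searching there was finally a result! A lot of time was wasted, but at least something came of it; otherwise it would have been a total loss.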
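
The width multipliers described for mobilenet_thin above are simple arithmetic, sketched below. `scale_channels` is a hypothetical helper for this note, and the `min_depth` floor of 8 is an assumption borrowed from common MobileNet implementations rather than something stated here.

```python
# Backbone channel sequence quoted above for the unscaled MobileNet.
BACKBONE_CHANNELS = [64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512]


def scale_channels(channels, multiplier, min_depth=8):
    """Scale each channel count by `multiplier`, never going below `min_depth`.

    `min_depth` is an assumed floor, not a value taken from the post.
    """
    return [max(int(c * multiplier), min_depth) for c in channels]


backbone = scale_channels(BACKBONE_CHANNELS, 0.75)  # conv_width1 = 0.75
stage_widths = scale_channels([512, 128], 0.5)      # conv_width2 = 0.50
```

With these inputs the backbone starts at 48 channels and ends at 384, and the two stage widths come out as 256 and 64, matching the kernel counts given above.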

training:

Running tf-pose/train.py: almost every path needs changing.
In the parser, set the location of the COCO dataset and the path where the model is saved:

    parser.add_argument('--datapath', type=str, default='/media/icduser/Data/Data/coco2017/annotations')
    parser.add_argument('--imgpath', type=str, default='/media/icduser/Data/Data/coco2017/')
    parser.add_argument('--batchsize', type=int, default=16)
    parser.add_argument('--gpus', type=int, default=1)
    parser.add_argument('--max-epoch', type=int, default=30)
    parser.add_argument('--lr', type=str, default='0.0001')
    parser.add_argument('--modelpath', type=str,
                        default='/media/icduser/Data/Caffe/tf-pose/tf_pose/private/tf-openpose-models-2019-1/')
    parser.add_argument('--logpath', type=str,
                        default='/media/icduser/Data/Caffe/tf-pose/tf_pose/private/tf-openpose-log-2019-1/')

Even with the locations specified, it still failed with a path-not-found error, so I added a path.exists check before the saver call and print the path.

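                # save weights
                _save_path = os.path.join(args.modelpath, training_name, 'model')
                # 'model' is the checkpoint file prefix, so make sure its parent directory exists
                _save_dir = os.path.dirname(_save_path)
                if not os.path.exists(_save_dir):
                    print('{} does not exist, creating it'.format(_save_dir))
                    os.makedirs(_save_dir)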

                saver.save(sess, _save_path, global_step=global_step)

Note: I trained on Ubuntu and later switched to Windows. Because of differences in folder naming rules, Windows may be unable to open the model folder.

Windows can run it, but cannot train:

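tensorpack has a parallel PrefetchData function that speeds up fetching the dataflow in parallel for multi-GPU training. It does not work on Windows: the code is forkable on Ubuntu, but Windows is stricter and the code is not forkable there. If any expert knows how to change this, please leave a comment explaining how; I would be very grateful…
Note:
The docstring of class MultiProcessPrefetchData(ProxyDataFlow) says: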

This DataFlow does support windows. However, Windows requires more strict picklability on processes, which means that some code that’s forkable on Linux may not be forkable on Windows. If that happens you’ll need to re-organize some part of code that’s not forkable.

The only related explanation I could find on GitHub:

Regarding the original issue, turned out that PrefetchData does support windows, but windows has a more strict picklability requirement for processes, i.e. it requires to pickle get_train_dataflow.preprocess. Ref: https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
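
The picklability requirement quoted above can be checked without tensorpack. The sketch below uses only the standard library to show the general rule: a function defined at module level can normally be pickled by reference, while a nested function or a lambda cannot. The function names are made up for this note, and this does not by itself fix train.py on Windows.

```python
import pickle


def preprocess(x):
    """Module-level function: normally picklable, because pickle stores its name."""
    return x * 2


def make_local_preprocess():
    """Return a nested function, which pickle cannot look up by name."""
    def local_preprocess(x):
        return x * 2
    return local_preprocess


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```

Under the spawn start method that Windows uses, whatever is handed to a worker process has to pass this kind of check, so moving nested preprocessing functions to module level is the usual first thing to try.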

Run ckpt:

My Goal: Convert tf-pose's mobilenet_thin model into a Caffemodel.

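Although two weeks of work did not get me there, at least I found a trained model and can start working.
Since I am using the model that kind soul shared, I felt a little uneasy and wanted to test whether it matches the author's original release. I modified run.py so that it loads model-388003 instead of graph_opt.pb. Running from the ckpt then raised:
TensorFlow: "Attempting to use uninitialized value" in variable initialization
This is because the smoother's variables are not initialized from the ckpt, whereas the opt pb presumably also stores the variables that were already initialized. If you simply run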

init = tf.global_variables_initializer()
sess.run(init)

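every restored parameter gets re-initialized and the output is a garbage image. Damn, how dumb…
After some Baidu and Google searching, I found code from an expert that initializes only some of the variables: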

            uninitialized_vars = []
            for var in tf.global_variables():
                try:
                    persistent_sess.run(var)
                except tf.errors.FailedPreconditionError:
                    uninitialized_vars.append(var)
            initialize_op = tf.variables_initializer(uninitialized_vars)
            persistent_sess.run(initialize_op)

Besides that, there is also:

persistent_sess.run(tf.report_uninitialized_variables())

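which also returns the uninitialized variables, but it outputs their names, which is a hassle, so I just ignored it.
Thankfully it finally produces results!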
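
For the name-based route, the missing piece is mapping the reported names back to variable objects. The helper below is a hypothetical sketch in plain Python; it assumes the reported names are the variable names without the ':0' output suffix and that they may arrive as bytes, which should be verified against your TensorFlow version.

```python
def select_uninitialized(all_vars, uninit_names):
    """Return the variables whose names appear in `uninit_names`.

    `all_vars` is any iterable of objects with a `.name` attribute such as
    'conv1/weights:0'; `uninit_names` holds names such as b'conv1/weights'
    (bytes or str), assumed to lack the ':0' suffix.
    """
    wanted = set()
    for name in uninit_names:
        if isinstance(name, bytes):
            name = name.decode("utf-8")
        wanted.add(name)
    return [v for v in all_vars if v.name.split(":")[0] in wanted]
```

In the session above it would be used roughly as `tf.variables_initializer(select_uninitialized(tf.global_variables(), persistent_sess.run(tf.report_uninitialized_variables())))`, which avoids running every variable inside a try/except.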

Those are a few notes on running tf-pose from a ckpt. I stepped into many pits, and although I did not accomplish much, I feel exhausted…

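Update 1: model-388003 and graph_opt.pb are not from the same training step; running eval.py shows that mAP and mAR differ a lot. I am not sure whether I changed something incorrectly.
Update 2: setting resize-to-default=True during inference lowers accuracy.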

err type1:

batch size too large -> set a smaller batch size
batch size too small -> args.batchsize // 16 becomes 0 -> shape (0,)

          sample_image = [enqueuer.last_dp[0][i] for i in range(4)]
          outputMat = sess.run(
              outputs,
              feed_dict={q_inp: np.array((sample_image + val_image)*(args.batchsize // 16))}
          )
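
The empty-batch error comes from list repetition: multiplying a Python list by 0 yields an empty list, and `args.batchsize // 16` is 0 for any batch size below 16. As an alternative to a larger batch size, a hedged sketch is to tile the preview images and truncate to the batch size. `tile_to_batch` is a hypothetical helper, and whether the input queue expects exactly `batchsize` images is an assumption.

```python
def tile_to_batch(images, batch_size):
    """Repeat `images` and truncate so the result has exactly `batch_size` items."""
    if not images:
        raise ValueError("images must not be empty")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    repeats = -(-batch_size // len(images))  # ceiling division
    return (list(images) * repeats)[:batch_size]


# The failure mode: 8 // 16 == 0, and a list times 0 is empty.
empty_preview = list(range(16)) * (8 // 16)
# The sketch keeps the preview batch non-empty for the same batch size.
fixed_preview = tile_to_batch(list(range(16)), 8)
```

The same call also covers batch sizes that are not multiples of 16, since the tiled list is cut to length.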

The sample results and test results use the images provided by the author; they have no influence on the training process. What's more, the number of sample and val images has to match the batch size, so I commented them out.

                sample_results = []
                for i in range(len(sample_image)):
                    test_result = CocoPose.display_image(sample_image[i], heatMat[i], pafMat[i], as_numpy=True)
                    test_result = cv2.resize(test_result, (640, 640))
                    test_result = test_result.reshape([640, 640, 3]).astype(float)
                    sample_results.append(test_result)

                test_results = []
                for i in range(len(val_image)):
                    test_result = CocoPose.display_image(val_image[i], heatMat[len(sample_image) + i],
                                                         pafMat[len(sample_image) + i], as_numpy=True)
                    test_result = cv2.resize(test_result, (640, 640))
                    test_result = test_result.reshape([640, 640, 3]).astype(float)
                    test_results.append(test_result)

                # save summary
                summary = sess.run(merged_validate_op, feed_dict={
                    valid_loss: average_loss / total_cnt,
                    valid_loss_ll: average_loss_ll / total_cnt,
                    valid_loss_ll_paf: average_loss_ll_paf / total_cnt,
                    valid_loss_ll_heat: average_loss_ll_heat / total_cnt,
                    # sample_valid: test_results,
                    # sample_train: sample_results
                })
