Notes:
1. Authors
Hsiao-Yu Fish Tung, Katerina Fragkiadaki (Carnegie Mellon University)
I. Overview
1. Abstract
(1) Rather than directly optimizing mesh and skeleton parameters, we optimize the weights of a network that predicts the 3D shape and skeleton configuration from a monocular RGB video;
(2) The model is an end-to-end framework;
(3) Training combines strong supervision from synthetic data with self-supervision from differentiable rendering of skeleton keypoints, dense 3D mesh motion, and human-background segmentation;
(4) Supervised learning and test-time optimization are used together: supervised learning initializes the model parameters in the right regime, which ensures a good pose and surface initialization at test time;
(5) Advantage: self-supervision by backpropagation (BP) through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers a much tighter fit than a pretrained fixed model.
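Points (3) to (5) can be illustrated with a toy sketch of test-time refinement through a differentiable keypoint reprojection loss. Everything here is an assumption made for illustration, not the paper's implementation: a pinhole camera with focal length `f`, a single 3D translation `t` standing in for the full set of predicted parameters, hand-derived gradients in place of backpropagation through a network, and the hypothetical names `project`, `reprojection_loss`, `loss_grad`, and `refine`.

```python
# Toy sketch (hypothetical names): refine a 3D translation at test time by
# gradient descent on a 2D keypoint reprojection loss.

def project(points3d, t, f=1.0):
    """Pinhole projection of 3D points after shifting them by translation t."""
    projected = []
    for x, y, z in points3d:
        X, Y, Z = x + t[0], y + t[1], z + t[2]
        projected.append((f * X / Z, f * Y / Z))
    return projected


def reprojection_loss(points3d, t, target2d, f=1.0):
    """Mean squared distance between projected and detected 2D keypoints."""
    proj = project(points3d, t, f)
    total = sum((u - tu) ** 2 + (v - tv) ** 2
                for (u, v), (tu, tv) in zip(proj, target2d))
    return total / len(points3d)


def loss_grad(points3d, t, target2d, f=1.0):
    """Analytic gradient of reprojection_loss with respect to t."""
    g = [0.0, 0.0, 0.0]
    n = len(points3d)
    for (x, y, z), (tu, tv) in zip(points3d, target2d):
        X, Y, Z = x + t[0], y + t[1], z + t[2]
        du = f * X / Z - tu  # residual along the image u axis
        dv = f * Y / Z - tv  # residual along the image v axis
        g[0] += 2.0 * du * f / Z
        g[1] += 2.0 * dv * f / Z
        g[2] += -2.0 * f * (du * X + dv * Y) / (Z * Z)
    return [gi / n for gi in g]


def refine(points3d, t_init, target2d, lr=1.0, steps=500, f=1.0):
    """Plain gradient descent starting from an initial estimate t_init."""
    t = list(t_init)
    for _ in range(steps):
        g = loss_grad(points3d, t, target2d, f)
        t = [ti - lr * gi for ti, gi in zip(t, g)]
    return t


if __name__ == "__main__":
    keypoints = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0),
                 (0.0, 0.5, 0.0), (-0.5, -0.5, 0.2)]
    detected = project(keypoints, (0.2, -0.1, 3.0))  # stand-in for 2D detections
    t_start = (0.0, 0.0, 3.5)                        # stand-in for the initial prediction
    t_refined = refine(keypoints, t_start, detected)
    print(reprojection_loss(keypoints, t_start, detected),
          reprojection_loss(keypoints, t_refined, detected))
```

The rough starting estimate plays the role of the supervised initialization in (4); the gradient steps on the reprojection error play the role of the unsupervised adaptation to test data in (5).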
2. Applications
Understanding humans and their motion from a single view, outside controlled lab settings, is important for applications such as:
automated gym and dancing teachers, rehabilitation guidance, patient monitoring, and safer human-robot interaction;
character motion capture (MOCAP) and retargeting in the film industry, which still requires tedious labor from artists to achieve the desired accuracy, or the use of expensive multi-camera setups and green-screen backgrounds.
II. Network architecture
1. Main ideas
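(1) We propose a network model for motion capture from monocular video that learns to map an image sequence to the corresponding sequence of 3D meshes;
(2) Strong supervision comes from synthetically rendered data; self-supervision comes from differentiably rendering the 3D prediction to 2D and matching it against 2D measurements extracted from real monocular videos;
(3) Self-supervision uses 2D body joint detection, 2D figure-ground segmentation, and 2D optical flow; in addition, 2D body joint annotations are relatively easy to obtain, and optical flow generalizes easily from synthetic to real data;
(4) Unlike previous optimization-based motion capture work, we use differentiable warping and differentiable camera projection for the optical flow and segmentation losses; together these make end-to-end learning with BP possible;
(5) SMPL is used as the dense human 3D mesh model; our task is to reverse-engineer the rendering process and predict the SMPL parameters;
(6) Given 3D mesh predictions for two consecutive frames, the 3D motion vectors of the mesh vertices are differentiably projected and matched against the estimated 2D optical flow vectors; differentiable motion rendering and matching requires vertex visibility estimation, which we perform using ray casting integrated with our neural model for code acceleration (see the figure below). Similarly, in each frame, 3D keypoints are projected and their distances to the corresponding detected 2D keypoints are penalized. Differentiable segmen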