[Paper reading] (MeshInversion) Monocular 3D Object Reconstruction with GAN inversion (ECCV 2022)

YuQiao0303

Last modified: 2023-12-26 15:53:23


First published: 2022-12-05 21:07:45

Article link: https://blog.csdn.net/qq_34342853/article/details/128187071


Overview

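Project page: https://www.mmlab-ntu.com/project/meshinversion/
Method name: MeshInversion
Input: a monocular image (in the wild, with background, not matted)
Output: textured 3D mesh
Key challenge: no 3D or multi-view supervision
Core idea: first pre-train a 3D GAN (ConvMesh, in which a mesh is represented as deformation and texture maps) that generates a textured mesh from a latent code z. At inference time, search for the z that best matches the input image. (This is an inference-time optimization method!) The generated mesh is rendered with the predicted camera parameters and supervised against the input image with a texture CD loss and a mask CD loss.
Main networks used or referenced: ConvMesh and PatchGAN; the mask comes from an off-the-shelf segmentation tool (PointRend).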

Related Work

Single View 3D Reconstruction

image-3D object pairs [46,35,32,39]
multi-view images [33,28,51,47,34]
SMPL for humans and 3DMM for faces [8,40,18]

CMR [19] reconstructs category-specific textured meshes.

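There are generally two ways to obtain texture. One is direct regression of pixel values in the UV texture map, which is often blurry, but it is the one the authors use.
The mainstream approach is learning the texture flow, which generalizes poorly to novel views.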

GAN inversion

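GAN inversion means first training a GAN, then finding a suitable z such that the output obtained by feeding z into the GAN meets the requirements as closely as possible.

Common approaches are
gradient descent (details omitted),

learning the mapping with an encoder: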
Bau, D., Strobelt, H., Peebles, W., Zhou, B., Zhu, J.Y., Torralba, A., et al.: Semantic photo manipulation with a generative image prior. In: SIGGRAPH (2019)

or a combination of the two:
Zhu, J., Shen, Y., Zhao, D., Zhou, B.: In-domain GAN inversion for real image
editing. In: ECCV (2020)

Recent work in the 3D domain includes using GAN inversion for point cloud completion:
Zhang, J., Chen, X., Cai, Z., Pan, L., Zhao, H., Yi, S., Yeo, C.K., Dai, B., Loy, C.C.:
Unsupervised 3D shape completion through GAN inversion. In: CVPR (2021)
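
The gradient-descent variant of GAN inversion mentioned above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the "generator" is a hypothetical frozen linear map G(z) = W z + b rather than a real GAN, the gradient is derived by hand instead of using autograd, and the names `invert_linear_generator`, `W`, `b`, and `reg` are made up for illustration.

```python
import numpy as np

def invert_linear_generator(W, b, target, steps=2000, lr=0.1, reg=0.0):
    """Gradient-descent inversion of a toy generator G(z) = W @ z + b.

    Minimises ||G(z) - target||^2 + reg * ||z||^2 over z while the generator
    parameters (W, b) stay frozen. Real GAN inversion follows the same idea,
    but relies on autograd because G is a deep network.
    """
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = W @ z + b - target                # G(z) - target
        grad = 2.0 * W.T @ residual + 2.0 * reg * z  # hand-derived gradient
        z -= lr * grad
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4)) / 4.0   # hypothetical frozen generator weights
b = rng.normal(size=16)
z_true = rng.normal(size=4)
target = W @ z_true + b              # an output the generator can reproduce

z_hat = invert_linear_generator(W, b, target)
recon_error = np.linalg.norm(W @ z_hat + b - target)
```

Here only z is updated; the generator is never retrained, which is what makes this an inversion rather than a fine-tuning step.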

textured mesh generation

6.Learning to predict 3D objects with an interpolation-based differentiable renderer.
In: NeurIPS (2019)
After differentiable rendering of the reconstructed mesh, the rendered multi-view images are used for discriminative supervision.

13.Leveraging 2D data to learn textured
3D mesh generation. In: CVPR (2020)
A VAE method; uses face colors instead of texture maps.

38.Convolutional generation of textured 3D meshes
Topology-aligned texture maps and deformation maps in the UV space. (This paper uses its pretrained model.)

Method

Broadly, the method uses a generator to produce geometry and texture from a latent code, then supervises them with a Chamfer mask loss and a Chamfer texture loss.

Preliminaries

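A mesh is represented as O = (V, F, T), i.e. vertices, faces, and a texture map.
Since

An individual mesh is isomorphic to a 2-pole sphere.

the vertex positions can be expressed as a deformation $\Delta \mathbf{V}$ of a sphere:
$\mathbf{V} = \mathbf{V}_{sphere} + \Delta \mathbf{V}$
Most previous methods use an MLP to regress $\Delta \mathbf{V}$; this paper uses a CNN.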

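Rendering uses weak-perspective projection (a projection model distinct from both full perspective and orthographic projection), with parameters π comprising scale s, translation t, and rotation r.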
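
To make the two preliminaries concrete, the following sketch builds vertices as a sphere template plus a deformation and projects them with a weak-perspective camera. It is an illustration under assumptions: the rotation is given as a 3×3 matrix (the original method may parameterize it differently), and `rotation_y`, `weak_perspective_project`, and the sample values are hypothetical.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the y axis (a stand-in for the rotation r)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def weak_perspective_project(vertices, scale, translation, rotation):
    """Rotate, drop depth, then apply one global scale and a 2D shift."""
    rotated = vertices @ rotation.T               # (N, 3)
    return scale * rotated[:, :2] + translation   # (N, 2)

# V = V_sphere + delta_V, with a made-up deformation for illustration.
v_sphere = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
delta_v = np.array([[0.5, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, -0.5]])
vertices = v_sphere + delta_v

xy = weak_perspective_project(
    vertices,
    scale=2.0,
    translation=np.array([0.1, -0.1]),
    rotation=rotation_y(0.0),
)
```

Unlike full perspective projection, the scale here does not depend on each vertex's depth: the third vertex sits at depth 0.5 but lands exactly where a point at the origin would.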

3.1 Reconstruction with Generative Prior

Pre-training Stage

This stage trains a 3D GAN.
The generator mainly follows ConvMesh:
- it operates in UV space
- it outputs a deformation map and a texture map.
The discriminator mainly follows PatchGAN.
The losses include
- generator loss
- discriminator loss in UV space
- discriminator loss in image space (following PatchGAN)

Inversion Stage

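Goal: find the z that best recovers the 3D object from the input image $\mathbf{I}_{in}$.
Required: the original image, its mask, and the camera parameters used to render the 3D shape.
- The mask comes from an off-the-shelf segmentation tool (PointRend).
  - The reason is given at https://github.com/junzhezhang/mesh-inversion/issues/5: for fair comparison, and to emphasize that this is test-time optimization.
- ConvMesh predicts the latent code z of the mesh (shape), and CMR predicts the camera parameters π.
  - How π is predicted: regressing the camera pose from scratch suffers from camera-shape ambiguity [24], so CMR is used to initialize the camera.
- The predicted mesh is rendered into a 2D image with the predicted camera parameters, and the losses are computed on it (see below).

Because the camera pose keeps being optimized, the images cannot be perfectly aligned, so a robust texture loss is needed (see below).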

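3.2 Chamfer texture loss (key reference)

Treat an image as a 2D point cloud in which each point has a 2D coordinate and a 3-channel RGB color.
The dissimilarity between two images is then expressed as a Chamfer distance.
- The distance D is decomposed into an appearance term and a spatial term, both measured as L2 distances.
- Important: because we only want tolerance to local misalignment, an exponent is applied to the spatial term to penalize points that are spatially far apart. The original formula image is missing; judging from the explanation below and the code at the end of this post, it should read approximately $D = \max\big((\epsilon_s + D_s)^{\alpha},\, 1\big) \cdot (D_a + \epsilon_a)$.
- Explanation: first, Da and Ds are multiplied.
  - epsilon is added because two points at the same position (Ds is zero) but with very different colors (Da is large) should still count as different points; otherwise the product would collapse to zero;
  - the exponent on the Ds side penalizes points that are too far apart, since only small misalignment should be tolerated;
  - then take the max.
- Note: the Ds term is not differentiable; it only serves as a weight for training Da (the texture).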
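
The pairwise cost described above can be sketched in a few lines. This is a rough illustration, not the authors' implementation: it assumes the form max((eps_s + D_s)^alpha, 1) * (D_a + eps_a) inferred from these notes, uses squared L2 distances, computes only one direction of the Chamfer distance, and uses NumPy, so the non-differentiability of Ds is not modelled. The function name and default values are hypothetical.

```python
import numpy as np

def chamfer_texture_distance(pos_a, rgb_a, pos_b, rgb_b,
                             eps_s=0.9, eps_a=1.0, alpha=1.0):
    """One-directional Chamfer texture distance from point set A to B.

    pos_*: (N, 2) pixel coordinates, rgb_*: (N, 3) colours.
    Pairwise cost = max((eps_s + D_s)**alpha, 1) * (D_a + eps_a), where
    D_s and D_a are squared L2 distances in position and colour.
    """
    d_s = ((pos_a[:, None, :] - pos_b[None, :, :]) ** 2).sum(-1)  # (Na, Nb)
    d_a = ((rgb_a[:, None, :] - rgb_b[None, :, :]) ** 2).sum(-1)  # (Na, Nb)
    spatial_weight = np.maximum((eps_s + d_s) ** alpha, 1.0)
    cost = spatial_weight * (d_a + eps_a)
    # For every point in A, keep the cheapest match in B, then average.
    return cost.min(axis=1).mean()

pos = np.array([[0.0, 0.0], [1.0, 0.0]])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one red, one blue point

aligned = chamfer_texture_distance(pos, rgb, pos, rgb)
small_shift = chamfer_texture_distance(pos, rgb, pos + [0.1, 0.0], rgb)
large_shift = chamfer_texture_distance(pos, rgb, pos + [2.0, 0.0], rgb)
```

In this toy case a small shift costs the same as perfect alignment, while a large shift is penalized, which matches the stated goal of tolerating only local misalignment.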

This is quite useful; see the ablation study:
[Figure: ablation study of the Chamfer texture loss; image not available]

Besides the pixel-level CD loss, there is also a feature-level CD loss:
Specifically, we apply the Chamfer texture loss between the (foreground) feature maps extracted with a pre-trained VGG-19 network [42] from the rendered image and the input image.
This is somewhat like the contextual loss (The contextual loss for image transformation with non-aligned data), with one difference:

Feature-level Chamfer texture loss: takes location into account, but does not require exact alignment.
Contextual loss: ignores location entirely.

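Ablation study of the losses

[Figure: loss ablation results; image not available]
CT denotes the Chamfer texture loss;
LpCT is the pixel-level one; LfCT is the feature-level one.

Looking at the middle three rows:
contextual is the worst,
followed by L1 only;
L1 + perceptual is somewhat better;
the CT loss is still the best.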

3.3 Chamfer Mask Loss

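A conventional mask loss usually quantizes the 3D shape onto a grid of pixels (a mask), then computes an L1 or IoU loss against the GT mask.
- Getting a mask from a 3D shape requires rasterization that discretizes the mesh into a grid of pixels. This step loses information and introduces error, which hurts the pre-trained ConvMesh especially.
To address this, the authors propose the Chamfer mask loss Lcm (CD instead of L1, so there is no quantization error).
- Instead of rendering the mesh into a binary mask, project the mesh vertices directly onto the image plane to get Sv.
- Then normalize the coordinates of the foreground pixels, obtained from the off-the-shelf segmentation tool, to the range -1 to 1 to get Sf.
- Then compute the Chamfer distance between Sv and Sf.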
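
The three steps above can be sketched as follows. This is a simplified illustration under assumptions: foreground pixels are represented by their pixel centres (the repository may use a different convention), the Chamfer distance uses squared L2 and is symmetric, and the projected vertices Sv are faked by shifting Sf. The helper names are hypothetical.

```python
import numpy as np

def foreground_points(mask):
    """Pixel centres of the foreground, normalised to [-1, 1] as (x, y)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    x = (xs + 0.5) / w * 2.0 - 1.0
    y = (ys + 0.5) / h * 2.0 - 1.0
    return np.stack([x, y], axis=1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two 2D point sets (squared L2)."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (Na, Nb)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# S_f: foreground pixels from a (made-up) 4x4 segmentation mask.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
s_f = foreground_points(mask)

# S_v: stand-in for projected mesh vertices, here S_f shifted by 0.1 in x.
s_v = s_f + np.array([0.1, 0.0])

loss_perfect = chamfer_distance(s_f, s_f)
loss_shifted = chamfer_distance(s_v, s_f)
```

Because Sv consists of continuous coordinates, a sub-pixel shift of the projected vertices changes the loss smoothly, whereas a rasterized mask would only change when a pixel flips.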

Total loss

pixel-level chamfer texture loss (appearance)
feature-level chamfer texture loss (appearance)
chamfer mask loss (geometry)
smooth loss (neighboring faces should have similar normals, i.e. low cosine distance)
latent space loss (L2 norm of z to ensure Gaussian distribution)
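
As a rough illustration of how these terms might be combined, the sketch below implements the two regularizers and a weighted sum. Everything here is an assumption made for illustration: the latent loss is taken as the mean squared value of z, the smooth loss as the mean of (1 - cosine similarity) between normals of adjacent faces, and the weights and loss values are placeholders, not the ones used by the authors.

```python
import numpy as np

def latent_l2_loss(z):
    """Mean squared value of z; small when z stays close to the prior mean."""
    return float(np.mean(z ** 2))

def normal_consistency_loss(normals, adjacent_pairs):
    """Mean (1 - cosine similarity) between normals of neighbouring faces."""
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    i, j = adjacent_pairs[:, 0], adjacent_pairs[:, 1]
    return float(np.mean(1.0 - (n[i] * n[j]).sum(axis=1)))

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

z = np.array([1.0, -1.0, 2.0, 0.0])
face_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
pairs = np.array([[0, 1], [1, 2]])  # hypothetical face adjacency

terms = {
    "pixel": 2.0,     # placeholder value for the pixel-level texture loss
    "feature": 10.0,  # placeholder value for the feature-level texture loss
    "mask": 0.5,      # placeholder value for the Chamfer mask loss
    "smooth": normal_consistency_loss(face_normals, pairs),
    "latent": latent_l2_loss(z),
}
weights = {"pixel": 1.0, "feature": 0.1, "mask": 10.0, "smooth": 1.0, "latent": 0.1}
loss = total_loss(terms, weights)
```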

To do: read the code carefully, especially the latent space loss,
and how the feature-level loss is actually implemented.

Experiments

Datasets:
- CUB-200-2011 (birds)
- PASCAL3D: cars
Pre-training ConvMesh: pseudo ground truths (?). This seems to mean the outputs of the segmentation and camera pose prediction networks mentioned above.
GAN inversion at inference: apparently also pseudo ground truths.
Evaluation: uses GT.
- geometry accuracy: 2D mask IoU between rendered masks and GT masks
- appearance quality: the image synthesis metric FID (single view and multi view), which reflects how similar the distributions of GT images and generated images are.
- user study: 40 users gave ratings.
- (PASCAL3D only: approximated 3D CAD shapes are available, so 3D IoU can be used)
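
For reference, the 2D mask IoU used for geometry accuracy can be computed as in this minimal sketch. The function name and the choice to return 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match here
    return float(np.logical_and(pred, gt).sum() / union)

pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[0:2, :] = True   # rows 0-1
gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[1:3, :] = True     # rows 1-2

iou = mask_iou(pred_mask, gt_mask)
```

The two masks share one row of four pixels out of twelve covered in total, so the IoU is 1/3.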

Texture Flow vs. Texture Regression

Texture flow is more common, but it is error-prone in invisible regions because it tends to copy foreground pixels, including obstacles.

Implementation (mainly from the supplementary material)

Time, memory, GPU hardware

Pre-training:
600 epochs, with a batch size of 128,
15 hours on four Nvidia V100 GPUs.

Network architecture: same as ConvMesh.

convolutional generator G with 2 branches.
- Input: latent code z (64-dimensional)
- Output: deformation map S, 32×32; texture map T, 512×512
UV space discriminator
- deformation map
- texture map
image space discriminator (PatchGAN)

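Implementation notes on the Chamfer texture loss

[Figure not available: Chamfer texture loss formula]

Explanation: first, Da and Ds are multiplied.
- epsilon is added because two points at the same position (Ds is zero) but with very different colors (Da is large) should still count as different points; otherwise the product would collapse to zero;
- the exponent on the Ds side penalizes points that are too far apart, since only small misalignment should be tolerated;
- then take the max.
Note: the Ds term is not differentiable; it only serves as a weight for training Da (the texture).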

Code for the texture CD loss

mesh_inversion.py
https://github.com/junzhezhang/mesh-inversion/blob/d6614726344f5a56c068df2750fefc593c4ca43d/lib/mesh_inversion.py#L265

if self.args.chamfer_texture_pixel_loss:
    # NOTE: batch size should be one
    pix_pos_pred = mask2proj(mask_pred)
    pix_pred = grid_sample_from_vtx(pix_pos_pred, image_pred)
    dist_map_c, idx_a, idx_b = distChamfer_downsample(pix_pred,color_target,resolution=self.args.chamfer_resolution)
    dist_map_p, _, _ = distChamfer_downsample(pix_pos_pred,vtx_target,resolution=self.args.chamfer_resolution, idx_a=idx_a, idx_b=idx_b)

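    xy_threshold = self.args.xy_threshold
    k = self.args.xy_k
    alpha = self.args.xy_alpha
    # eps is set so that, for k = 1 and alpha > 0, the spatial factor stays
    # at 1 until dist_map_p exceeds (2 * xy_threshold) ** 2
    eps = 1 - (2*k*xy_threshold)**2
    rgb_eps = self.args.rgb_eps
    if eps == 1:
        xy_term = torch.pow(1+k*dist_map_p, alpha)
    else:
        # relu(x - 1) + 1 is the same as max(x, 1)
        xy_term = F.relu(torch.pow(eps+k*dist_map_p, alpha)-1) + 1
    # spatial weight times (color distance + residual)
    dist_map = xy_term * (dist_map_c + rgb_eps)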

    dist_min_ab = dist_map.min(-1)[0]
    dist_mean_ab = dist_min_ab.mean(-1)

    loss += dist_mean_ab * self.args.chamfer_texture_pixel_loss_wt
    
    ### collect the matched points in the target for visualization
    indices = dist_map.argmin(dim=-1)
    self.matched_pos = torch.stack([vtx_target[i,indices[i]] for i in range(indices.shape[0])],0)
    self.matched_clr = torch.stack([color_target[i,indices[i]] for i in range(indices.shape[0])],0)
    # v2 from: grid sample
    self.matched_clr_v2 = grid_sample_from_vtx(self.matched_pos, target) # NOTE that back vertices color shown as well

The relevant arguments:
https://github.com/junzhezhang/mesh-inversion/blob/d6614726344f5a56c068df2750fefc593c4ca43d/lib/arguments.py#L135

# loss related
self._parser.add_argument('--chamfer_mask_loss', action='store_true', default=True, help='if use Chamfer mask loss')
self._parser.add_argument('--chamfer_mask_loss_wt', type=float, default=10.0)
self._parser.add_argument('--chamfer_texture_pixel_loss', action='store_true', default=True, help='Chamfer texture loss - pixel level')
self._parser.add_argument('--chamfer_texture_pixel_loss_wt', type=float, default=1.0)
self._parser.add_argument('--chamfer_texture_feat_loss', action='store_true', default=True, help='Chamfer texture loss - feature level')
self._parser.add_argument('--chamfer_texture_feat_loss_wt', type=float, default=0.04)
self._parser.add_argument('--xy_threshold', type=float, default=0.16)
self._parser.add_argument('--xy_k', type=float, default=1.0)
self._parser.add_argument('--xy_alpha', type=float, default=1)
self._parser.add_argument('--rgb_eps', type=float, default=1)
self._parser.add_argument('--subpool_threshold', type=float, default=0.5)
self._parser.add_argument('--chamfer_resolution', type=int, default=8192, help='resolution for computing chamfer texture losses')
# other losses
self._parser.add_argument('--mesh_regularization_loss', action='store_true', default=False, help='')
self._parser.add_argument('--mesh_regularization_loss_wt', type=float, default=0.00005)
self._parser.add_argument('--nll_loss', action='store_true', default=True, help='')
self._parser.add_argument('--nll_loss_wt', type=float, default=0.05)
