Image-based 3D Object Reconstruction

李伯爵的指间沙

Article link: https://blog.csdn.net/m0_37690102/article/details/106675005

Image-based 3D Object Reconstruction: The Latest Technologies and Trends in the Era of Deep Learning (Overview and Coding)

0. Summary

Three-dimensional reconstruction is a long-standing ill-posed problem that has been explored for decades by the computer vision, computer graphics, and machine learning communities.

Since 2015, image-based 3D reconstruction using convolutional neural networks (CNNs) has attracted increasing attention and has performed very well. Given this new era of rapid development, this paper provides a comprehensive survey of recent developments in the field, focusing on methods that use deep learning techniques to estimate the 3D shape of generic objects from a single or multiple RGB images.

1. Introduction

The goal of image-based 3D reconstruction is to infer the 3D geometry and structure of objects and scenes from one or more 2D images. Recovering the lost dimension from 2D images has long been the goal of classical multi-view stereo and shape-from-X methods, which have been studied extensively for decades.

The first generation of methods approached the problem from a geometric perspective. They focused on understanding and formalizing, mathematically, the 3D-to-2D projection process, with the aim of devising mathematical or algorithmic solutions to the ill-posed inverse problem. Effective solutions typically require multiple images captured with accurately calibrated cameras.

Interestingly, humans are good at solving such ill-posed inverse problems by leveraging prior knowledge.

Using only one eye, we can infer the approximate size and rough geometry of an object, and even guess what it would look like from another viewpoint. We can do this because all the objects and scenes we have seen before have allowed us to build prior knowledge and a mental model of what objects look like.

The second generation of 3D reconstruction methods tries to exploit this prior knowledge by formulating 3D reconstruction as a recognition problem. Advances in deep learning and, more importantly, the increasing availability of large training datasets have led to a new generation of methods that can recover the 3D structure of objects from one or more RGB images without a complex camera calibration process.

This paper provides a comprehensive and systematic review of recent advances in 3D object reconstruction using deep learning techniques. It covers 149 papers that have appeared since 2015 in leading computer vision, computer graphics, and machine learning conferences and journals. The goal is to help readers navigate this emerging field, which has gained significant momentum over the past few years.

2. Problem statement and taxonomy

Let I = {Ik, k = 1, ..., n} be a set of n (≥ 1) RGB images of one or more objects X. 3D reconstruction can be summarized as the process of learning a predictor fθ that infers a shape as close as possible to the known shape X.

In other words, fθ is the minimizer of the reconstruction objective L(I) = d(fθ(I), X). Here, θ is the set of parameters of f, and d(·,·) is a chosen measure of the distance between the target shape X and the reconstructed shape fθ(I).

This review discusses and categorizes the latest techniques according to the nature of the input I, the representation of the output, the neural network architectures used during training and testing to approximate the predictor f, the training procedures they use, and their degree of supervision. A visual summary is given in the table below.
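
To make the distance d(·,·) in the objective above more concrete, here is a minimal sketch of two choices that are commonly used in practice: 1 − IoU for voxel grids and the symmetric Chamfer distance for point clouds. The text above does not prescribe a particular metric, so the function names `iou_distance` and `chamfer_distance` and the toy inputs are assumptions made purely for illustration.

```python
import numpy as np

def iou_distance(pred, target):
    """Return 1 - IoU between two boolean occupancy grids of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:  # both grids are empty: treat them as identical
        return 0.0
    inter = np.logical_and(pred, target).sum()
    return 1.0 - inter / union

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
    # Mean squared nearest-neighbour distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy example: two 2x2x1 grids that share one of three occupied voxels.
a = np.array([[[1], [1]], [[0], [0]]])
b = np.array([[[1], [0]], [[1], [0]]])
print(iou_distance(a, b))
```

Either function can play the role of d when comparing a predicted shape with the ground-truth shape X; which one applies depends on the output representation.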

Specifically, the input I can be a single image, multiple images captured with RGB cameras whose intrinsic and extrinsic parameters may be known or unknown, or a video stream, that is, a sequence of images with temporal correlation.

The representation of the output is crucial to the choice of network architecture, and it also affects the computational efficiency and quality of the reconstruction. The main representations are:

• Volumetric representations: widely adopted in early deep learning-based 3D reconstruction techniques, they allow 3D shapes to be parameterized with regular voxel grids. As a result, the 2D convolutions used in image analysis can easily be extended to 3D. They are, however, very expensive in terms of memory, and only a few techniques achieve sub-pixel accuracy.

• Surface-based representations: these include meshes and point clouds. Although they are memory-efficient, they are not regular structures and therefore do not fit easily into deep learning architectures.

• Intermediate representations: some 3D reconstruction algorithms predict the 3D geometry of an object directly from RGB images, while others break the problem into sequential steps, each of which predicts an intermediate representation.

Various network architectures have been used to implement the predictor f. The backbone architecture, which can differ between training and testing, consists of an encoder h and a decoder g (i.e., f = g ∘ h). The encoder maps the input into a latent variable x, called a feature vector or code, using a sequence of convolution and pooling operations followed by fully connected layers. The decoder, also called the generator, decodes the feature vector into the desired output using either fully connected layers or a deconvolution network (a sequence of convolution and upsampling operations, also known as up-convolution). The former is suitable for unstructured outputs such as 3D point clouds, while the latter is used to reconstruct volumetric grids or parameterized surfaces.

While the architecture of the network and its building blocks are important, performance depends largely on how the network is trained. This paper covers the following aspects in detail:

• Datasets: various datasets are available for training and evaluating deep learning-based 3D reconstruction. Some use real data, while others are generated with computer graphics.

• Loss functions: the choice of loss function can significantly affect reconstruction quality, and it also determines the degree of supervision.

• Training procedures and degree of supervision: some methods require real images annotated with their corresponding 3D models, which are very expensive to obtain. Some rely on a combination of real and synthetic data. Others avoid full 3D supervision by using loss functions that exploit supervisory signals that are easy to obtain.
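
The encoder-decoder decomposition f = g ∘ h described above can be sketched in a few lines. This is only an illustrative toy, not an architecture from any of the surveyed papers: the encoder here is a single fully connected layer rather than a stack of convolution and pooling layers, the weights are random and untrained, and the names `make_encoder`, `make_decoder`, and `compose` as well as all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, code_dim):
    """Hypothetical encoder h: flatten the image, apply one fully connected layer."""
    w = rng.normal(scale=0.01, size=(in_dim, code_dim))
    def h(image):
        return np.tanh(image.reshape(-1) @ w)  # feature vector (code) x
    return h

def make_decoder(code_dim, n_points):
    """Hypothetical decoder g: one fully connected layer emitting a point cloud."""
    w = rng.normal(scale=0.01, size=(code_dim, n_points * 3))
    def g(x):
        return (x @ w).reshape(n_points, 3)  # unstructured output: N points in 3D
    return g

def compose(g, h):
    """Build the predictor f = g ∘ h."""
    return lambda image: g(h(image))

h = make_encoder(32 * 32 * 3, 128)   # assumed 32x32 RGB input, 128-D code
g = make_decoder(128, 256)           # assumed 256 output points
f = compose(g, h)

image = rng.random((32, 32, 3))
points = f(image)
print(points.shape)  # (256, 3)
```

A fully connected decoder is used here because the output is a point cloud; for a volumetric grid, the decoder would instead be built from up-convolution layers.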

3. Encoding stage

Deep learning-based 3D reconstruction algorithms encode the input I into a feature vector x = h(I) ∈ X, where X is the latent space. A good mapping function h should satisfy the following properties:

• Two inputs I1 and I2 that represent similar 3D objects should be mapped to x1 and x2 ∈ X that are close to each other in the latent space.

• A small perturbation ∂x of x should correspond to a small perturbation of the shape of the input.

• The latent representation induced by h should be invariant to extrinsic factors such as camera pose.

• A 3D model and its corresponding 2D images should be mapped onto the same point in the latent space. This ensures that the representation is not ambiguous and thus facilitates reconstruction.

The first two conditions can be addressed by using encoders that map the input into discrete or continuous latent spaces, which can be flat or hierarchical. The third can be addressed by using disentangled representations, and the last by using a TL architecture during the training phase.

3.1 Discrete latent space

In their pioneering work [1], Wu et al. introduced 3D ShapeNet, an encoding network that maps a 3D shape, represented as a discrete volumetric grid of size 30³, into a latent representation of size 4000×1. Its core network is composed of nconv = 3 convolutional layers (each using 3D convolution filters) and nfc = 3 fully connected layers. This standard vanilla architecture has been used for 3D shape classification and retrieval, and for 3D reconstruction from depth maps represented as voxel grids.

The 2D encoding networks that map input images into a latent space follow the same architecture as 3D ShapeNet but use 2D convolutions. Early works differ in the type and number of layers they use. Other works add pooling layers and activation functions; varying these can improve learning efficiency and lead to better results.
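
To see how a stack of three 3D convolutions shrinks a 30³ voxel grid before the fully connected layers take over, the sketch below traces the spatial size layer by layer using the standard output-size formula floor((n + 2p − k) / s) + 1. The filter counts, kernel sizes, and strides in `layers` are hypothetical values chosen for illustration; the text above only states that there are three convolutional layers.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size along one axis after a convolution."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_encoder(input_size, layers):
    """Return [(channels, size), ...] after each 3D convolutional layer.

    `layers` is a list of (out_channels, kernel, stride) tuples, no padding.
    """
    trace = []
    n = input_size
    for channels, kernel, stride in layers:
        n = conv_output_size(n, kernel, stride)
        trace.append((channels, n))
    return trace

# Hypothetical hyperparameters, for illustration only.
layers = [(48, 6, 2), (160, 5, 2), (512, 4, 1)]
trace = trace_encoder(30, layers)
for channels, n in trace:
    print(f"{channels} channels x {n}^3 voxels = {channels * n ** 3} activations")
```

Under these assumed settings the last convolutional output is flattened and passed through the fully connected layers, which produce the final latent vector.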

3.2 Continuous latent space

With the encoders described in the previous section, the latent space X may not be continuous, so it does not allow simple interpolation. In other words, if x1 = h(I1) and x2 = h(I2), there is no guarantee that (x1 + x2)/2 can be decoded into a valid 3D shape. In addition, a small perturbation of x1 does not necessarily correspond to a small perturbation of the input.

Variational autoencoders (VAEs) and their 3D extension (3D-VAE) have one fundamentally unique property that makes them suitable for generative modeling: by design, their latent spaces are continuous, which allows simple sampling and interpolation.

The key idea is that, instead of mapping the input to a feature vector, a VAE maps it to a mean vector μ and a vector of standard deviations σ of a multivariate Gaussian distribution. A sampling layer then takes these two vectors and generates, by random sampling from the Gaussian distribution, a feature vector x that serves as input to the subsequent decoding stage.
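
The sampling layer described above is usually implemented with the reparameterization x = μ + σ ⊙ ε, where ε is drawn from a standard normal distribution. The sketch below shows that step and the latent interpolation that a continuous latent space makes meaningful. The function names `sample_latent` and `interpolate`, the 8-dimensional code, and the values of μ and σ are assumptions for illustration; in a real VAE, μ and σ would be produced by the encoder.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Draw x ~ N(mu, diag(sigma^2)) as x = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def interpolate(x1, x2, t):
    """Linear interpolation between two latent codes, t in [0, 1]."""
    return (1.0 - t) * x1 + t * x2

rng = np.random.default_rng(0)

# Stand-ins for the encoder outputs of two different inputs.
mu1, sigma1 = np.zeros(8), np.full(8, 0.5)
mu2, sigma2 = np.ones(8), np.full(8, 0.5)

x1 = sample_latent(mu1, sigma1, rng)
x2 = sample_latent(mu2, sigma2, rng)
x_mid = interpolate(x1, x2, 0.5)  # the (x1 + x2) / 2 case discussed above
print(x_mid.shape)  # (8,)
```

Feeding `x_mid` to the decoder is what yields an in-between shape; with a plain (non-variational) encoder there is no such guarantee.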

3.3 Hierarchical latent space

Liu et al. [2] showed that encoders that map the input into a single latent representation cannot extract rich structure and may therefore produce blurry reconstructions. To improve reconstruction quality, they introduced a more complex internal variable structure, with the specific goal of encouraging the learning of a hierarchical arrangement of latent feature detectors.

The approach starts with a global latent variable layer that is hardwired to a set of local latent variable layers, each tasked with representing one level of feature abstraction. Skip connections tie the latent codes together in a top-down directed fashion: local codes close to the input tend to represent lower-level features, while local codes farther from the input tend to represent higher-level features. Finally, the local latent codes are concatenated into a flat structure when fed into a task-specific model such as 3D reconstruction.
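
The final step described above, concatenating the per-level local codes into one flat vector for a downstream model, can be sketched as follows. This covers only the concatenation, not the hierarchical model itself; the number of levels, the code sizes in `level_dims`, and the function name `flatten_hierarchy` are assumptions for illustration.

```python
import numpy as np

def flatten_hierarchy(local_codes):
    """Concatenate per-level latent codes into a single flat vector."""
    return np.concatenate([np.ravel(code) for code in local_codes])

rng = np.random.default_rng(0)

# Hypothetical code sizes, ordered from the level closest to the input
# (lower-level features) to the level farthest from it (higher-level features).
level_dims = [32, 16, 8]
local_codes = [rng.standard_normal(d) for d in level_dims]

flat_code = flatten_hierarchy(local_codes)
print(flat_code.shape)  # (56,)
```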

3.4 Disentangled representation

The appearance of an object in an image is affected by multiple factors, such as the object's shape, the camera pose, and the lighting conditions. Standard encoders represent all of these factors in the learned code x. This is undesirable in applications such as recognition and classification, which should be invariant to extrinsic factors such as pose and lighting. 3D reconstruction can also benefit from disentangled representations, in which shape, pose, and lighting are represented by separate codes.

[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in IEEE CVPR, 2015, pp. 1912–1920.

[2] S. Liu, C.L. Giles, I. Ororbia, and G. Alexander, “Learning a Hierarchical Latent-Variable Model of 3D Shapes,” International Conference on 3D Vision, 2018.

Reference: http://mp.weixin.qq.com/s?__biz=MzU2NTczMzQ1MQ==&mid=2247484583&idx=1&sn=0f814a36f0ee5d3950e25f46537a725f&chksm=fcb67583cbc1fc95205d4f5b859e425457840da829fae3e225eec77afbbe3cbe68b5f91a8545&mpshare=1&scene=24&srcid=&sharer_sharetime=1591242892880&sharer_shareid=cc5ffb1d306d67c81444a3aa7b0ae74c#rd

I hope this helps. If you have any questions, please comment on this blog or send me a private message. I will reply in my free time.
