Image-based 3D Object Reconstruction

Image-based 3D Object Reconstruction: The Latest Technologies and Trends in the Era of Deep Learning (Overview and Coding)

0. Overview

  • Three-dimensional reconstruction is a long-standing problem that has been explored for decades by the computer vision, computer graphics and machine learning communities.

  • Since 2015, the use of convolutional neural networks (CNNs) for image-based 3D reconstruction has attracted increasing attention and has demonstrated impressive performance. In view of this rapid development, this paper provides a comprehensive overview of recent progress in the field, with an emphasis on deep learning techniques for estimating the 3D shape of generic objects from single or multiple RGB images.

1. Introduction

  • The goal of image-based 3D reconstruction is to infer the 3D geometry and structure of objects and scenes from one or more 2D images. Recovering the lost dimension from 2D images has long been the goal of classical multi-view stereo and shape-from-X methods, which have been studied extensively for decades.

  • The first generation of methods approached the problem geometrically: they focused on understanding and formalizing, mathematically, the 3D-to-2D projection process, with the aim of devising mathematical or algorithmic solutions to the ill-posed inverse problem. Effective solutions typically require multiple images captured with accurately calibrated cameras.

  • Interestingly, humans are good at solving this problem by relying on prior knowledge.

  • With just one eye, we can infer the approximate size and rough geometry of an object, and even guess what it would look like from another viewpoint. This is possible because the objects and scenes we have seen before allow us to build up prior knowledge and a mental model of what objects look like.

  • The second generation of 3D reconstruction methods tries to exploit this prior knowledge by formulating 3D reconstruction as a recognition problem. Advances in deep learning and, more importantly, the increasing availability of large training data sets have led to a new generation of methods that can recover the 3D structure of objects from one or more RGB images without a complex camera calibration process.

  • This paper provides a comprehensive and systematic overview of recent advances in 3D object reconstruction using deep learning, covering 149 papers that have appeared since 2015 in leading computer vision, computer graphics and machine learning conferences and journals. The goal is to help readers navigate this emerging field, which has gained tremendous momentum over the past few years.

2. Problem statement and classification

  • Let I = {Ik, k = 1, ..., n} be a set of n (≥ 1) RGB images of one or more objects X. 3D reconstruction can be summarized as the process of learning a predictor fθ that infers a shape as close as possible to the target shape X.

  • In other words, fθ is the minimizer of a reconstruction objective L(I) = d(fθ(I), X). Here, θ is the set of parameters of f, and d(·,·) is a measure of the distance between the target shape X and the reconstructed shape fθ(I).
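To make the objective L(I) = d(fθ(I), X) concrete, here is a minimal sketch in plain Python. It assumes shapes are given as lists of 3D points and picks the symmetric Chamfer distance as d; that choice, the function names, and the toy "predictors" are illustrative assumptions, not something the text above prescribes.

```python
import math

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (lists of 3D tuples)."""
    def one_sided(a, b):
        # Mean distance from each point in a to its nearest neighbour in b.
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return one_sided(pred, target) + one_sided(target, pred)

def reconstruction_loss(f, images, target_shape, d=chamfer_distance):
    """L(I) = d(f(I), X) for a predictor f, input images I and target shape X."""
    return d(f(images), target_shape)

# Toy usage with hypothetical predictors that ignore their input.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for (x, y, z) in target]

def perfect_f(images):
    return target      # returns exactly the target shape

def shifted_f(images):
    return shifted     # returns the target moved by 0.1 along x

loss_perfect = reconstruction_loss(perfect_f, None, target)
loss_shifted = reconstruction_loss(shifted_f, None, target)
print(loss_perfect, loss_shifted)
```

Training would then search for the parameters θ that make this loss small over a data set. Other choices of d, such as a voxel-wise error, fit the same template.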

  • This review discusses and classifies recent techniques according to the nature of the input I, the representation of the output, the neural network architectures used to approximate the predictor fθ during training and testing, the training procedures they use, and their degree of supervision. The visual summary is shown in the table below.










  • Specifically, the input I can be a single image, multiple images captured with RGB cameras whose intrinsic and extrinsic parameters may be known or unknown, or a video stream, that is, a sequence of images with temporal correlation.

    The representation of the output is crucial to the choice of network architecture, and it also affects the computational efficiency and the quality of the reconstruction. The main representations are as follows:

    • Volumetric representations: widely adopted in early deep learning-based 3D reconstruction techniques, they parameterize 3D shapes using regular voxel grids. The 2D convolutions used in image analysis can therefore be easily extended to 3D. However, they are very expensive in terms of memory, and only a few techniques can achieve sub-voxel accuracy.

    • Surface-based representations: meshes and point clouds are memory-efficient, but they are not regular structures and therefore do not fit easily into deep learning architectures.

    • Intermediate representations: some 3D reconstruction algorithms predict the 3D geometry of an object directly from RGB images, while others break the problem down into sequential steps, each of which predicts an intermediate representation.
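The memory trade-off between volumetric and point-based representations can be illustrated with a small sketch. It assumes points lie in the unit cube and counts stored values only; `voxelize` and `dense_grid_cells` are hypothetical helper names introduced for this example.

```python
def voxelize(points, resolution):
    """Map points in the unit cube [0, 1)^3 to a set of occupied voxel indices."""
    occupied = set()
    for point in points:
        # Clamp so that a coordinate of exactly 1.0 still lands in the last cell.
        idx = tuple(min(int(c * resolution), resolution - 1) for c in point)
        occupied.add(idx)
    return occupied

def dense_grid_cells(resolution):
    # A dense occupancy grid stores one value per cell, occupied or not.
    return resolution ** 3

points = [(0.1, 0.1, 0.1), (0.12, 0.1, 0.1), (0.9, 0.5, 0.2)]
for res in (32, 64, 128):
    occ = voxelize(points, res)
    # Columns: resolution, dense cells, occupied cells, values in the point cloud.
    print(res, dense_grid_cells(res), len(occ), 3 * len(points))
```

The dense grid grows cubically with resolution while the point cloud stays at three values per point. At the coarsest resolution, two nearby points fall into the same voxel, which hints at why fine detail is hard to capture with coarse grids.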

  • Various network architectures have been used to implement the predictor f. The backbone architecture, which can differ between training and testing, consists of an encoder h followed by a decoder g, i.e., f = g ∘ h. The encoder maps the input into a latent variable x, called a feature vector or code, using a sequence of convolution and pooling operations followed by fully connected layers. The decoder, also known as the generator, decodes the feature vector into the desired output using either fully connected layers or a deconvolution network (a sequence of convolution and upsampling operations, also known as up-convolution). The former is suitable for unstructured outputs such as 3D point clouds, while the latter is used to reconstruct volumetric grids or parameterized surfaces.
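The composition f = g ∘ h can be sketched as follows. This is a deliberately simplified toy: the encoder here is a single fully connected layer with random, untrained weights rather than the convolution and pooling stack described above, and the decoder is a fully connected layer that emits a small point cloud. All sizes and names (`make_predictor`, `random_layer`) are assumptions made for illustration.

```python
import random

def linear(weights, bias, v):
    """Fully connected layer: weights is a list of rows, v a flat vector."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, x) for x in v]

def random_layer(n_out, n_in, rng):
    # Untrained stand-in weights; a real network would learn these.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_predictor(n_pixels, latent_dim, n_points, seed=0):
    rng = random.Random(seed)
    enc_w, enc_b = random_layer(latent_dim, n_pixels, rng)
    dec_w, dec_b = random_layer(3 * n_points, latent_dim, rng)

    def h(image):
        # Encoder: flattened image -> feature vector x.
        return relu(linear(enc_w, enc_b, image))

    def g(x):
        # Decoder: feature vector -> list of n_points 3D points.
        flat = linear(dec_w, dec_b, x)
        return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

    def f(image):
        # Predictor as the composition of decoder and encoder.
        return g(h(image))

    return h, g, f

h, g, f = make_predictor(n_pixels=16, latent_dim=8, n_points=5)
image = [i / 16 for i in range(16)]   # a flattened 4x4 toy "image"
cloud = f(image)
print(len(h(image)), len(cloud), len(cloud[0]))
```

A fully connected decoder is used here because the output is an unstructured point cloud; a volumetric output would instead call for up-convolution layers.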

  • While the architecture of the network and its building blocks are important, performance depends to a large extent on how the network is trained. This paper examines training from the following aspects:

    • Data sets: a variety of data sets are available for training and evaluating deep learning-based 3D reconstruction. Some consist of real data; others are synthetic, generated with computer graphics.

    • Loss functions: the choice of loss function significantly affects the quality of the reconstruction and determines the degree of supervision.

    • Training procedures and degree of supervision: some methods require real images annotated with their corresponding 3D models, which are very expensive to obtain. Some rely on a combination of real and synthetic data. Others avoid full 3D supervision by using loss functions that exploit supervisory signals that are easy to obtain.
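As one possible example of a supervisory signal that is easier to obtain than a full 3D model, the sketch below compares the silhouette of a predicted voxel grid with a 2D object mask. It assumes an orthographic projection along the z axis and a binary grid; real methods typically use the camera parameters and a differentiable projection, so this is only an illustration of the idea.

```python
def silhouette(voxels):
    """Orthographic projection of a binary grid voxels[z][y][x] along z."""
    depth, height, width = len(voxels), len(voxels[0]), len(voxels[0][0])
    # A pixel is covered if any voxel along the viewing ray is occupied.
    return [[max(voxels[z][y][x] for z in range(depth)) for x in range(width)]
            for y in range(height)]

def silhouette_loss(pred_voxels, mask):
    """Mean squared difference between the projected silhouette and a 2D mask."""
    proj = silhouette(pred_voxels)
    diffs = [(p - m) ** 2
             for proj_row, mask_row in zip(proj, mask)
             for p, m in zip(proj_row, mask_row)]
    return sum(diffs) / len(diffs)

# Toy 2x2x2 prediction and two candidate masks.
voxels = [[[1, 0], [0, 0]],
          [[0, 0], [0, 1]]]
mask_match = [[1, 0], [0, 1]]
mask_off = [[1, 1], [0, 1]]
print(silhouette(voxels))
print(silhouette_loss(voxels, mask_match))
print(silhouette_loss(voxels, mask_off))
```

The loss needs only a 2D mask per image, not the 3D ground truth, which is what makes this kind of supervision cheaper to collect.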

3. Encoding stage

    A deep learning-based 3D reconstruction algorithm encodes the input I into a feature vector x = h(I) ∈ X, where X is the latent space. A good mapping function h should satisfy the following properties:

    • Two inputs I1 and I2 that represent similar 3D objects should be mapped to x1 and x2 ∈ X that are close to each other in the latent space.

    • A small perturbation ∂x of x should correspond to a small perturbation of the shape of the input.

    • The latent representation induced by h should be invariant to extrinsic factors such as the camera pose.

    • A 3D model and its corresponding 2D images should be mapped to the same point in the latent space. This ensures that the representation is not ambiguous and thus facilitates reconstruction.

    The first two properties can be addressed with encoders that map the input into a discrete (Section 3.1) or continuous (Section 3.2) latent space, which can be flat or hierarchical (Section 3.3). The third can be addressed with disentangled representations (Section 3.4), and the last by using a TL architecture during the training phase.



  • 3.1 Discrete latent space

    In their pioneering work [1], Wu et al. introduced 3D ShapeNets, an encoding network that maps a 3D shape, represented as a discretized volumetric grid of size 30³, into a latent representation of size 4000×1. Its core network is composed of n_conv = 3 convolutional layers, each using 3D convolution filters, followed by n_fc = 3 fully connected layers. This standard vanilla architecture has been used for 3D shape classification and retrieval, and for 3D reconstruction from depth maps represented as voxel grids.

    The 2D encoding networks that map input images into a latent space follow the same architecture as 3D ShapeNets but use 2D convolutions. Early works differ in the type and number of layers they use. Others add pooling layers and vary the activation functions; such changes can improve learning efficiency and lead to better results.
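To see how a 30³ grid can shrink to a 4000-dimensional code through three convolutional and three fully connected layers, the sketch below traces tensor shapes only. The channel counts, kernel sizes, strides, and the two intermediate fully connected widths are illustrative assumptions; only the input size 30³, the layer counts, and the final size 4000 come from the text.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(grid_size, conv_layers, fc_sizes):
    """Trace activation shapes through 3D conv layers, a flatten, and FC layers."""
    shapes = []
    size = grid_size
    for channels, kernel, stride in conv_layers:
        size = conv_output_size(size, kernel, stride)
        shapes.append((channels, size, size, size))
    flat = shapes[-1][0] * size ** 3      # flatten the last conv activation
    shapes.append((flat,))
    shapes.extend((n,) for n in fc_sizes)
    return shapes

# (channels, kernel, stride) per conv layer: assumed values for illustration.
conv_layers = [(48, 6, 2), (160, 5, 2), (512, 4, 1)]
# Intermediate FC widths are hypothetical; the final 4000 matches the text.
fc_sizes = [2048, 2048, 4000]

shapes = encoder_shapes(30, conv_layers, fc_sizes)
for shape in shapes:
    print(shape)
```

A 2D image encoder follows the same bookkeeping with two spatial axes instead of three.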





  • 3.2 Continuous latent space

    With the encoders described in the previous section, the latent space X may not be continuous, and thus it does not allow easy interpolation. In other words, if x1 = h(I1) and x2 = h(I2), there is no guarantee that (x1 + x2)/2 can be decoded into a valid 3D shape. In addition, a small perturbation of x1 does not necessarily correspond to a small perturbation of the input.

    Variational autoencoders (VAEs) and their 3D extension (3D-VAE) have one fundamentally unique property that makes them suitable for generative modeling: their latent space is, by design, continuous, which allows easy sampling and interpolation.

    The key idea is that, instead of mapping the input to a feature vector, the encoder maps it to a mean vector and a standard deviation vector of a multivariate Gaussian distribution. A sampling layer then takes these two vectors and generates the feature vector x by random sampling from the Gaussian distribution; x then serves as input to the subsequent decoding stage.
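The sampling layer and the latent interpolation discussed above can be sketched in a few lines. The sketch assumes a diagonal Gaussian and uses the common formulation x = mean + std · ε with ε drawn from a standard normal; the mean and standard deviation vectors below are made-up stand-ins for encoder outputs.

```python
import random

def sample_latent(mean, std, rng):
    """Sampling layer: x = mean + std * eps, with eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def interpolate(x1, x2, t=0.5):
    """Linear interpolation between two latent codes; t=0.5 gives (x1 + x2)/2."""
    return [(1 - t) * a + t * b for a, b in zip(x1, x2)]

rng = random.Random(0)
# Hypothetical encoder outputs for two input images.
mean1, std1 = [0.0, 1.0, -1.0], [0.1, 0.1, 0.1]
mean2, std2 = [2.0, 1.0, 1.0], [0.1, 0.1, 0.1]

x1 = sample_latent(mean1, std1, rng)
x2 = sample_latent(mean2, std2, rng)
midpoint = interpolate(x1, x2)
print(x1, x2, midpoint)
```

With a continuous latent space, `midpoint` is expected to decode into a plausible shape between the two inputs, which is not guaranteed for the discrete encoders of the previous section.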




  • 3.3 Hierarchical latent space

    Liu et al. [2] showed that encoders that map the input into a single latent representation cannot extract rich structures, which may lead to blurry reconstructions. To improve the quality of the reconstruction, they introduced a more complex internal variable structure, with the specific goal of encouraging the learning of a hierarchical arrangement of latent feature detectors.

    The approach starts with a global latent variable layer that is hardwired to a set of local latent variable layers, each tasked with representing one level of feature abstraction. Skip connections tie the latent codes together in a top-down, directed fashion: local codes closer to the input tend to represent lower-level features, while local codes farther from the input tend to represent higher-level features. Finally, the local latent codes are concatenated into a flat structure when fed into a task-specific model such as 3D reconstruction.
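The final concatenation step can be sketched as below. The sketch covers only the flattening of per-level local codes, not how they are learned; the helper name and the choice to record each level's slice are assumptions made for illustration.

```python
def concatenate_local_codes(local_codes):
    """Concatenate per-level latent codes into one flat vector.

    local_codes is ordered from the level nearest the input (low-level
    features) to the level farthest from it (high-level features).
    Returns the flat vector and the (start, end) slice of each level.
    """
    flat, slices, start = [], [], 0
    for code in local_codes:
        flat.extend(code)
        slices.append((start, start + len(code)))
        start += len(code)
    return flat, slices

# Three hypothetical levels of different widths.
local_codes = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]
flat, slices = concatenate_local_codes(local_codes)
print(flat)
print(slices)
```

Keeping the slices makes it possible to inspect or perturb one level of abstraction at a time after flattening.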



  • 3.4 Disentangled representation

    The appearance of an object in an image is affected by multiple factors, such as the object's shape, the camera pose, and the lighting conditions. Standard encoders represent all of these variabilities in the learned code x. This is undesirable in applications such as recognition and classification, which should be invariant to extrinsic factors such as pose and lighting. 3D reconstruction can also benefit from disentangled representations, in which shape, pose, and lighting are represented by different codes.
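One way to picture a disentangled code is as a record with separate fields for shape, pose, and lighting. The sketch below is a hypothetical data structure, not a method from the text; the field sizes and values are arbitrary. It shows that, once factors are separated, pose can be exchanged between two codes without touching shape.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class DisentangledCode:
    shape: Tuple[float, ...]
    pose: Tuple[float, ...]
    lighting: Tuple[float, ...]

    def flat(self):
        # The single vector a standard decoder would consume.
        return self.shape + self.pose + self.lighting

def swap_pose(a, b):
    """Return copies of a and b with their pose codes exchanged."""
    return replace(a, pose=b.pose), replace(b, pose=a.pose)

chair = DisentangledCode(shape=(1.0, 2.0), pose=(0.0,), lighting=(0.5,))
table = DisentangledCode(shape=(3.0, 4.0), pose=(9.0,), lighting=(0.7,))
chair_new_pose, table_new_pose = swap_pose(chair, table)
print(chair_new_pose.flat())
print(table_new_pose.flat())
```

With an entangled code, no such clean swap exists because pose information is spread across all dimensions of x.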

[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in IEEE CVPR, 2015, pp. 1912–1920.

[2] S. Liu, C. L. Giles, I. Ororbia, and G. Alexander, “Learning a Hierarchical Latent-Variable Model of 3D Shapes,” International Conference on 3D Vision, 2018.


I hope this helps. If you have any questions, please comment on this blog or send me a private message, and I will reply in my free time.



