Neural Scene Graphs for Dynamic Scenes
1. What
What does this paper do? (Summarized in one sentence from the abstract and conclusion.)
Using video and annotated tracking data, this paper decomposes dynamic, multi-object scenes into a learned neural scene graph, which can also be used for 3D object detection via inverse rendering.
2. Why
Under what conditions or needs was this research proposed (Intro)? What core problems or deficiencies does it address, what have others done, and what are the innovations? (From the Introduction and Related Work.)
This covers the background, the core question, others' work, and the innovations:
Traditional pipelines built on point clouds allow learning hierarchical scene representations but cannot handle strongly view-dependent effects.
NeRF resolves view-dependent effects but supports neither hierarchical representations nor dynamic scenes.
NeRF-W makes some attempts: it incorporates an appearance embedding and decomposes transient and static elements via an uncertainty field, but it still relies on the consistency of the static scene.
Related works mentioned:
- Implicit Scene Representations: Existing methods learn features on discrete geometric primitives, such as points, meshes, and multi-planes.
- Neural Rendering: Differentiable rendering functions have made it possible to learn scene representations. NeRF stands out because it outputs a color value conditioned on the ray direction.
- Scene Graph Representations: Model a scene as a directed graph that represents objects as leaf nodes.
- Latent Class Encoding: By adding a latent vector z to the input 3D query point, similar objects can be modeled using the same network.
The last two points are introduced in more detail in Section 3.
3. How
3.1 Neural Scene Graphs
The first step of scene reconstruction is deciding how to model the scene, so we introduce the scene representation first.
On the left side (a) of the figure, there is an isometric view of a neural scene graph. The graph represents the elements of a scene as nodes and their spatial relationships as edges.
Each node is associated with a transformation (rotation and translation) and a scaling, denoted $T^w_i$ and $S_i$, indicating how each node (object) is posed and scaled within the world coordinate system $W$. The nodes are visualized as colored boxes, with edges indicating the relationships between them, such as the positions of objects relative to each other or to the world frame $W$. Objects carry latent object codes such as $l_1, l_2$, representing specific instances like cars and trucks. There is also a background node $F_{bckg}$ and per-class nodes such as $F_{\theta_{car}}$ and $F_{\theta_{truck}}$.
To sum up, we can define it as a directed acyclic graph:
$$\mathcal{S}=\langle\mathcal{W},C,F,L,E\rangle.$$
where, in addition to the nodes mentioned above, $C$ is a leaf node representing the camera and $E$ is the set of edges, each representing either an affine transformation from node $u$ to node $v$ (a relationship) or a property assignment.
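To make the definition concrete, here is a minimal sketch, assuming a plain Python layout, of how $\mathcal{S}=\langle\mathcal{W},C,F,L,E\rangle$ could be stored. The `SceneGraph`/`Edge` classes and all field names are illustrative stand-ins, not the authors' code.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Edge:
    """Edge u -> v: either an affine transformation or a property assignment."""
    parent: str                   # node id u
    child: str                    # node id v
    transform: np.ndarray = None  # 4x4 affine matrix (None for pure assignments)
    scale: np.ndarray = None      # per-axis scaling of the child's bounding box

@dataclass
class SceneGraph:
    """Directed acyclic graph S = <W, C, F, L, E> (illustrative layout)."""
    world: str = "W"                            # root node: the global frame
    camera: dict = field(default_factory=dict)  # C: camera leaf node (intrinsics, pose)
    models: dict = field(default_factory=dict)  # F: one representation model per class
    latents: dict = field(default_factory=dict) # L: one latent code l_o per object
    edges: list = field(default_factory=list)   # E: transformations / assignments

# Example: place object "car_1" of class "car" in the world frame at one frame
graph = SceneGraph()
graph.latents["car_1"] = np.random.randn(256)              # 256-d latent object code
graph.edges.append(Edge("W", "car_1",
                        transform=np.eye(4),               # pose of the object in W
                        scale=np.array([4.5, 1.8, 1.5])))  # bounding-box size
```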
3.2 Representation Models
Static and dynamic scene representations are different.
For the static scene, it is the same as the original NeRF, which uses the position $(x, y, z)$ and viewing direction $(d_x, d_y, d_z)$ as input, and the color $(c)$ and density $(\sigma)$ as output. We can summarize the process as:
$$\begin{aligned}[\sigma(\boldsymbol{x}),\boldsymbol{y}(\boldsymbol{x})] &= F_{\theta_{bckg,1}}(\gamma_{x}(\boldsymbol{x})) \\ \mathbf{c}(\boldsymbol{x}) &= F_{\theta_{bckg,2}}(\gamma_{d}(\boldsymbol{d}),\boldsymbol{y}(\boldsymbol{x})). \end{aligned}$$
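As a hedged illustration of this two-stage background model, a PyTorch sketch might look like the following; layer counts and widths are assumptions here (reduced from the original NeRF for brevity), not values from the paper.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs):
    """gamma(x): map each coordinate to [x, sin(2^k pi x), cos(2^k pi x)] features."""
    feats = [x]
    for k in range(n_freqs):
        feats += [torch.sin((2.0 ** k) * math.pi * x),
                  torch.cos((2.0 ** k) * math.pi * x)]
    return torch.cat(feats, dim=-1)

class BackgroundNeRF(nn.Module):
    """Stage 1: gamma_x(x) -> (sigma, y).  Stage 2: (gamma_d(d), y) -> c."""
    def __init__(self, n_freq_x=10, n_freq_d=4, width=256):
        super().__init__()
        in_x = 3 * (1 + 2 * n_freq_x)
        in_d = 3 * (1 + 2 * n_freq_d)
        self.n_freq_x, self.n_freq_d = n_freq_x, n_freq_d
        self.stage1 = nn.Sequential(
            nn.Linear(in_x, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width + 1),            # outputs [sigma, y]
        )
        self.stage2 = nn.Sequential(
            nn.Linear(width + in_d, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid()  # RGB color c
        )

    def forward(self, x, d):
        h = self.stage1(positional_encoding(x, self.n_freq_x))
        sigma, y = h[..., :1], h[..., 1:]
        c = self.stage2(torch.cat([positional_encoding(d, self.n_freq_d), y], dim=-1))
        return c, torch.relu(sigma)

# Usage: query a batch of 1024 points with their view directions
c, sigma = BackgroundNeRF()(torch.rand(1024, 3), torch.rand(1024, 3))
```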
For the dynamic scene, each object is represented by a neural radiance field.
Meanwhile, considering the limits of computation, we introduce a latent vector $\boldsymbol{l}$ encoding an object's representation. Conditioning on the latent code allows shared weights $\theta_c$ between all objects of class $c$. Adding $\boldsymbol{l}_o$ to the input of a volumetric scene function $F_{\theta_c}$ can be thought of as a mapping from the representation function of class $c$ to the radiance field of object $o$.
In the network architecture, we add this 256-dimensional latent vector $\boldsymbol{l}_o$, resulting in the following new first stage:
$$[y(\boldsymbol{x}),\sigma(\boldsymbol{x})]=F_{\theta_{c,1}}(\gamma_{\boldsymbol{x}}(\boldsymbol{x}),\boldsymbol{l}_{o}).$$
Because dynamic objects move through the scene over time in the video, we account for location-dependent effects by adding the object's global position $\boldsymbol{p}_o$ in the world frame as another input:
$$c(\boldsymbol{x},\boldsymbol{l}_o,\boldsymbol{p}_o)=F_{\theta_{c,2}}(\gamma_d(\boldsymbol{d}),y(\boldsymbol{x},\boldsymbol{l}_o),\boldsymbol{p}_o).$$
Note that the $\boldsymbol{x}$ in this formulation is expressed in the object's local coordinate frame, after a transformation and normalization:
$$\boldsymbol{x}_o=S_oT_o^w\boldsymbol{x}\ \text{ with }\ \boldsymbol{x}_o\in[-1,1].$$
We query the color and volume density in the local object coordinate system: when a ray traced in the global frame intersects an object, it is transformed into that object's local frame, which is reflected in the rendering step below.
So, all in all, the representation of a dynamic object is:
$$F_{\theta_c}:(\boldsymbol{x}_o,\boldsymbol{d}_o,\boldsymbol{p}_o,\boldsymbol{l}_o)\rightarrow(\boldsymbol{c},\sigma);\quad\forall\boldsymbol{x}_o\in[-1,1].$$
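To show how the pieces fit together for one dynamic object, here is a rough sketch of transforming a world-space point into the object's unit box and querying the class model conditioned on $\boldsymbol{l}_o$ and $\boldsymbol{p}_o$. The helpers `world_to_object` and `query_dynamic_object` are hypothetical names, and `F_class` is assumed to be a torch module with this call signature.

```python
import numpy as np
import torch

def world_to_object(x_w, T_o_w, scale):
    """x_o = S_o T_o^w x: map a world point into the object frame, then scale so x_o lies in [-1, 1]."""
    x_h = np.append(x_w, 1.0)       # homogeneous coordinates
    x_local = (T_o_w @ x_h)[:3]     # apply the 4x4 world-to-object transform T_o^w
    return x_local / (scale / 2.0)  # normalize by half the bounding-box extents

def query_dynamic_object(F_class, x_o, d_o, p_o, l_o):
    """Evaluate F_theta_c(x_o, d_o, p_o, l_o) -> (c, sigma), only inside the object's unit box."""
    if not np.all(np.abs(x_o) <= 1.0):
        return torch.zeros(3), torch.zeros(1)  # outside the box: the object contributes nothing
    to_t = lambda a: torch.as_tensor(a, dtype=torch.float32)
    # F_class is shared by all objects of the class and conditioned on the
    # per-object latent code l_o and the object's global position p_o.
    return F_class(to_t(x_o), to_t(d_o), to_t(p_o), to_t(l_o))
```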
3.3 Neural Scene Graph Rendering
Each ray $\boldsymbol{r}_j$ traced through the scene is discretized with $N_d$ sampling points at each of the $m_j$ dynamic node intersections and $N_s$ points in the background, as in the original NeRF, resulting in a set of quadrature points $\{\{t_i\}_{i=1}^{N_s+m_jN_d}\}_j$.
To test whether a ray intersects a dynamic object, a ray-box intersection test is used, and the ray is then transformed into that object's local coordinates to query its properties. The compositing calculation is the same as in NeRF:
$$\hat{C}(\boldsymbol{r})=\sum_{i=1}^{N_s+m_jN_d}T_i\alpha_ic_i,\quad\text{where } T_i=\exp\!\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)\ \text{ and }\ \alpha_i=1-\exp(-\sigma_i\delta_i).$$
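This quadrature can be written down almost verbatim. A small NumPy sketch follows; generating and merging the background and per-object samples (which must be sorted along the ray) is assumed to happen elsewhere.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Numerical quadrature C_hat(r) = sum_i T_i * alpha_i * c_i along one ray.

    sigmas: (N,) densities at the sorted quadrature points t_vals (N,)
    colors: (N, 3) colors at the same points
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)  # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)             # alpha_i
    # T_i = exp(-sum_{k<i} sigma_k * delta_k); the leading 0 gives T_1 = 1
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], sigmas[:-1] * deltas[:-1]])))
    return np.sum(trans[:, None] * alphas[:, None] * colors, axis=0)
```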
Finally, the loss function is:
$$\mathcal{L}=\sum_{\boldsymbol{r}\in\mathcal{R}}\|\hat{C}(\boldsymbol{r})-C(\boldsymbol{r})\|_{2}^{2}+\frac{1}{\sigma^{2}}\|\boldsymbol{z}\|_{2}^{2},$$
where, as in DeepSDF, a zero-mean Gaussian prior $p(\boldsymbol{z}_o)$ is placed on the latent codes, giving the second regularization term.
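In training code this reduces to a photometric term plus a latent-code regularizer; a minimal PyTorch sketch, assuming `sigma_prior` is a scalar hyperparameter:

```python
import torch

def scene_graph_loss(pred_rgb, gt_rgb, latent_codes, sigma_prior=1.0):
    """L = sum_r ||C_hat(r) - C(r)||^2 + (1 / sigma^2) * ||z||^2."""
    photometric = ((pred_rgb - gt_rgb) ** 2).sum()          # squared L2 over the ray batch
    latent_reg = sum((z ** 2).sum() for z in latent_codes)  # Gaussian prior p(z_o) on the codes
    return photometric + latent_reg / (sigma_prior ** 2)
```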
3.4 3D Object Detection as Inverse Rendering
Just like anchor-based detectors such as R-CNN, the method samples anchor positions in a bird's-eye-view plane and optimizes over anchor box positions and latent object codes to minimize the $\ell_1$ image loss between the synthesized image and an observed image.
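The idea can be sketched as a small optimization loop over candidate poses and latent codes. This is a sketch under stated assumptions: `render_scene` is a hypothetical differentiable renderer (not the authors' API), the pose parameterization and sampling grid are illustrative, and no scheduling or early stopping is shown.

```python
import torch

def detect_by_inverse_rendering(render_scene, observed_img, anchors, n_steps=100, lr=1e-2):
    """For each bird's-eye-view anchor, optimize the box pose and latent code
    to minimize the l1 loss between the rendered and observed image."""
    best = None
    for anchor_xy in anchors:                                       # sampled BEV positions
        pose = torch.tensor([*anchor_xy, 0.0], requires_grad=True)  # (x, y, yaw)
        latent = torch.zeros(256, requires_grad=True)               # latent object code l_o
        opt = torch.optim.Adam([pose, latent], lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = (render_scene(pose, latent) - observed_img).abs().mean()  # l1 image loss
            loss.backward()
            opt.step()
        if best is None or loss.item() < best[0]:
            best = (loss.item(), pose.detach(), latent.detach())
    return best
```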
4. Self-thoughts
- How should shadows be handled when scene graphs are merged?
- I have no concrete ideas yet on how to improve the method.