Neural Scene Graphs for Dynamic Scenes
1. What
What does this paper do? (Summarized in one sentence from the abstract and conclusion.)
Using video and annotated tracking data, this paper decomposes dynamic, multi-object scenes into a learned neural scene graph, which can also be used for 3D object detection via inverse rendering.
2. Why
Under what conditions or needs was this research proposed (Intro)? What core problems or deficiencies does it address, what have others done, and what are the innovations? (From the Introduction and Related Work.)
This covers the background, the core question, others' work, and the innovations:
Traditional pipelines built on point clouds allow learning hierarchical scene representations but cannot handle strongly view-dependent effects.
NeRF resolves view-dependent effects but supports neither hierarchical representations nor dynamic scenes.
NeRF-W makes some attempts: it incorporates an appearance embedding and decomposes transient and static elements via an uncertainty field, but it still relies on the consistency of the static scene.
Related works mentioned:
- Implicit Scene Representations: Existing methods learn features on discrete geometric primitives, such as points, meshes, and multi-planes.
- Neural Rendering: Differentiable rendering functions have made it possible to learn scene representations. NeRF stands out because it outputs a color value conditioned on the ray direction.
- Scene Graph Representations: Model a scene as a directed graph that represents objects as leaf nodes.
- Latent Class Encoding: By adding a latent vector z to the input 3D query point, similar objects can be modeled using the same network.
The last two points are introduced in more detail in Section 3.
3. How
3.1 Neural Scene Graphs
The first step of scene reconstruction is deciding how to model the scene, so we introduce the scene representation first.
On the left side (a) of the figure, there is an isometric view of a neural scene graph. The graph represents the elements of a scene as nodes and their spatial relationships as edges.
Each node is associated with a transformation (rotation and translation) and a scaling, denoted $T^w_i$ and $S_i$, indicating how each node (object) is posed and scaled within the world coordinate system $W$. The nodes are visualized as colored boxes, with edges indicating the relationships between them, such as the positions of objects relative to each other or to the world frame $W$. Objects carry latent object codes such as $l_1, l_2$, representing specific instances like cars and trucks. There is also a background node $F_{bckg}$ and per-class nodes such as $F_{\theta_{car}}$ and $F_{\theta_{truck}}$.
To sum up, we can define it as a directed acyclic graph:
$$\mathcal{S}=\langle\mathcal{W},C,F,L,E\rangle.$$
where, in addition to the nodes mentioned above, $C$ is a leaf node representing the camera and $E$ is the set of edges, each representing either an affine transformation from node $u$ to node $v$ (a relationship) or a property assignment.
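To make the definition concrete, here is a minimal sketch, assuming a plain Python layout, of how $\mathcal{S}=\langle\mathcal{W},C,F,L,E\rangle$ could be stored. The `SceneGraph`/`Edge` classes and all field names are illustrative stand-ins, not the authors' code.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Edge:
    """Edge u -> v: either an affine transformation or a property assignment."""
    parent: str                   # node id u
    child: str                    # node id v
    transform: np.ndarray = None  # 4x4 affine matrix (None for pure assignments)
    scale: np.ndarray = None      # per-axis scaling of the child's bounding box

@dataclass
class SceneGraph:
    """Directed acyclic graph S = <W, C, F, L, E> (illustrative layout)."""
    world: str = "W"                            # root node: the global frame
    camera: dict = field(default_factory=dict)  # C: camera leaf node (intrinsics, pose)
    models: dict = field(default_factory=dict)  # F: one representation model per class
    latents: dict = field(default_factory=dict) # L: one latent code l_o per object
    edges: list = field(default_factory=list)   # E: transformations / assignments

# Example: place object "car_1" of class "car" in the world frame at one frame
graph = SceneGraph()
graph.latents["car_1"] = np.random.randn(256)              # 256-d latent object code
graph.edges.append(Edge("W", "car_1",
                        transform=np.eye(4),               # pose of the object in W
                        scale=np.array([4.5, 1.8, 1.5])))  # bounding-box size
```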
3.2 Representation Models
Static and dynamic scene representations are different.
For the static scene, it is the same as the original NeRF, which uses the position $(x, y, z)$ and viewing direction $(d_x, d_y, d_z)$ as input, and the color $(c)$ and density $(\sigma)$ as output. We can summarize the process as:
$$\begin{aligned}[\sigma(\boldsymbol{x}),\boldsymbol{y}(\boldsymbol{x})] &= F_{\theta_{bckg,1}}(\gamma_{x}(\boldsymbol{x})) \\ \mathbf{c}(\boldsymbol{x}) &= F_{\theta_{bckg,2}}(\gamma_{d}(\boldsymbol{d}),\boldsymbol{y}(\boldsymbol{x})). \end{aligned}$$
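As a hedged illustration of this two-stage background model, a PyTorch sketch might look like the following; layer counts and widths are assumptions here (reduced from the original NeRF for brevity), not values from the paper.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs):
    """gamma(x): map each coordinate to [x, sin(2^k pi x), cos(2^k pi x)] features."""
    feats = [x]
    for k in range(n_freqs):
        feats += [torch.sin((2.0 ** k) * math.pi * x),
                  torch.cos((2.0 ** k) * math.pi * x)]
    return torch.cat(feats, dim=-1)

class BackgroundNeRF(nn.Module):
    """Stage 1: gamma_x(x) -> (sigma, y).  Stage 2: (gamma_d(d), y) -> c."""
    def __init__(self, n_freq_x=10, n_freq_d=4, width=256):
        super().__init__()
        in_x = 3 * (1 + 2 * n_freq_x)
        in_d = 3 * (1 + 2 * n_freq_d)
        self.n_freq_x, self.n_freq_d = n_freq_x, n_freq_d
        self.stage1 = nn.Sequential(
            nn.Linear(in_x, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width + 1),            # outputs [sigma, y]
        )
        self.stage2 = nn.Sequential(
            nn.Linear(width + in_d, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid()  # RGB color c
        )

    def forward(self, x, d):
        h = self.stage1(positional_encoding(x, self.n_freq_x))
        sigma, y = h[..., :1], h[..., 1:]
        c = self.stage2(torch.cat([positional_encoding(d, self.n_freq_d), y], dim=-1))
        return c, torch.relu(sigma)

# Usage: query a batch of 1024 points with their view directions
c, sigma = BackgroundNeRF()(torch.rand(1024, 3), torch.rand(1024, 3))
```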
For the dynamic scene, each object is represented by a neural radiance field.
Meanwhile, considering the limits of computation, we introduce a latent vector $\boldsymbol{l}$ encoding an object's representation. Conditioning on the latent code allows shared weights $\theta_c$ between all objects of class $c$. Adding $\boldsymbol{l}_o$ to the input of a volumetric scene function $F_{\theta_c}$ can be thought of as a mapping from the representation function of class $c$ to the radiance field of object $o$.
In the network architecture, we add this 256-dimensional latent vector $\boldsymbol{l}_o$, resulting in the following new first stage:
$$[y(\boldsymbol{x}),\sigma(\boldsymbol{x})]=F_{\theta_{c,1}}(\gamma_{\boldsymbol{x}}(\boldsymbol{x}),\boldsymbol{l}_{o}).$$
Because dynamic objects move through the scene over time in the video, we account for location-dependent effects by adding the object's global position $\boldsymbol{p}_o$ in the world frame as another input:
$$c(\boldsymbol{x},\boldsymbol{l}_o,\boldsymbol{p}_o)=F_{\theta_{c,2}}(\gamma_d(\boldsymbol{d}),y(\boldsymbol{x},\boldsymbol{l}_o),\boldsymbol{p}_o).$$
Note that the $\boldsymbol{x}$ in this formulation is expressed in the object's local coordinate frame, after a transformation and normalization:
$$\boldsymbol{x}_o=S_oT_o^w\boldsymbol{x}\ \text{ with }\ \boldsymbol{x}_o\in[-1,1].$$
We query the color and volume density in the local object coordinate system: when a ray traced in the global frame intersects an object, it is transformed into that object's local frame, which is reflected in the rendering step below.
So, all in all, the representation of a dynamic object is:
$$F_{\theta_c}:(\boldsymbol{x}_o,\boldsymbol{d}_o,\boldsymbol{p}_o,\boldsymbol{l}_o)\rightarrow(\boldsymbol{c},\sigma);\quad\forall\boldsymbol{x}_o\in[-1,1].$$
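To show how the pieces fit together for one dynamic object, here is a rough sketch of transforming a world-space point into the object's unit box and querying the class model conditioned on $\boldsymbol{l}_o$ and $\boldsymbol{p}_o$. The helpers `world_to_object` and `query_dynamic_object` are hypothetical names, and `F_class` is assumed to be a torch module with this call signature.

```python
import numpy as np
import torch

def world_to_object(x_w, T_o_w, scale):
    """x_o = S_o T_o^w x: map a world point into the object frame, then scale so x_o lies in [-1, 1]."""
    x_h = np.append(x_w, 1.0)       # homogeneous coordinates
    x_local = (T_o_w @ x_h)[:3]     # apply the 4x4 world-to-object transform T_o^w
    return x_local / (scale / 2.0)  # normalize by half the bounding-box extents

def query_dynamic_object(F_class, x_o, d_o, p_o, l_o):
    """Evaluate F_theta_c(x_o, d_o, p_o, l_o) -> (c, sigma), only inside the object's unit box."""
    if not np.all(np.abs(x_o) <= 1.0):
        return torch.zeros(3), torch.zeros(1)  # outside the box: the object contributes nothing
    to_t = lambda a: torch.as_tensor(a, dtype=torch.float32)
    # F_class is shared by all objects of the class and conditioned on the
    # per-object latent code l_o and the object's global position p_o.
    return F_class(to_t(x_o), to_t(d_o), to_t(p_o), to_t(l_o))
```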
3.3 Neural Scene Graph Rendering
Each ray $\boldsymbol{r}_j$ traced through the scene is discretized with $N_d$ sampling points at each of the $m_j$ dynamic node intersections and $N_s$ points in the background, as in the original NeRF, resulting in a set of quadrature points $\{\{t_i\}_{i=1}^{N_s+m_jN_d}\}_j$.
To test whether a ray intersects a dynamic object, a ray-box intersection test is used, and the ray is then transformed into that object's local coordinates to query its properties. The compositing calculation is the same as in NeRF:
$$\hat{C}(\boldsymbol{r})=\sum_{i=1}^{N_s+m_jN_d}T_i\alpha_ic_i,\quad\text{where } T_i=\exp\!\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)\ \text{ and }\ \alpha_i=1-\exp(-\sigma_i\delta_i).$$
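This quadrature can be written down almost verbatim. A small NumPy sketch follows; generating and merging the background and per-object samples (which must be sorted along the ray) is assumed to happen elsewhere.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Numerical quadrature C_hat(r) = sum_i T_i * alpha_i * c_i along one ray.

    sigmas: (N,) densities at the sorted quadrature points t_vals (N,)
    colors: (N, 3) colors at the same points
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)  # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)             # alpha_i
    # T_i = exp(-sum_{k<i} sigma_k * delta_k); the leading 0 gives T_1 = 1
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], sigmas[:-1] * deltas[:-1]])))
    return np.sum(trans[:, None] * alphas[:, None] * colors, axis=0)
```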
Finally, the loss function is:
$$\mathcal{L}=\sum_{\boldsymbol{r}\in\mathcal{R}}\|\hat{C}(\boldsymbol{r})-C(\boldsymbol{r})\|_{2}^{2}+\frac{1}{\sigma^{2}}\|\boldsymbol{z}\|_{2}^{2},$$
where, as in DeepSDF, a zero-mean Gaussian prior $p(\boldsymbol{z}_o)$ is placed on the latent codes, giving the second regularization term.
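In training code this reduces to a photometric term plus a latent-code regularizer; a minimal PyTorch sketch, assuming `sigma_prior` is a scalar hyperparameter:

```python
import torch

def scene_graph_loss(pred_rgb, gt_rgb, latent_codes, sigma_prior=1.0):
    """L = sum_r ||C_hat(r) - C(r)||^2 + (1 / sigma^2) * ||z||^2."""
    photometric = ((pred_rgb - gt_rgb) ** 2).sum()          # squared L2 over the ray batch
    latent_reg = sum((z ** 2).sum() for z in latent_codes)  # Gaussian prior p(z_o) on the codes
    return photometric + latent_reg / (sigma_prior ** 2)
```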
3.4 3D Object Detection as Inverse Rendering
Just like anchor-based detectors such as R-CNN, the method samples anchor positions in a bird's-eye-view plane and optimizes over anchor box positions and latent object codes to minimize the $\ell_1$ image loss between the synthesized image and an observed image.
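The idea can be sketched as a small optimization loop over candidate poses and latent codes. This is a sketch under stated assumptions: `render_scene` is a hypothetical differentiable renderer (not the authors' API), the pose parameterization and sampling grid are illustrative, and no scheduling or early stopping is shown.

```python
import torch

def detect_by_inverse_rendering(render_scene, observed_img, anchors, n_steps=100, lr=1e-2):
    """For each bird's-eye-view anchor, optimize the box pose and latent code
    to minimize the l1 loss between the rendered and observed image."""
    best = None
    for anchor_xy in anchors:                                       # sampled BEV positions
        pose = torch.tensor([*anchor_xy, 0.0], requires_grad=True)  # (x, y, yaw)
        latent = torch.zeros(256, requires_grad=True)               # latent object code l_o
        opt = torch.optim.Adam([pose, latent], lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = (render_scene(pose, latent) - observed_img).abs().mean()  # l1 image loss
            loss.backward()
            opt.step()
        if best is None or loss.item() < best[0]:
            best = (loss.item(), pose.detach(), latent.detach())
    return best
```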
4. Self-thoughts
- How should shadows be handled when scene graphs are merged?
- I have no concrete ideas yet on how to improve the method.