[Paper Reading] SUDS: Scalable Urban Dynamic Scenes

1. What

To extend NeRFs to dynamic large-scale urban scenes, this paper introduces two key ideas: (a) factorizing the scene into three separate hash-table data structures that encode static, dynamic, and far-field radiance fields, and (b) training on unlabeled target signals consisting of RGB images (from N videos), sparse LiDAR depth, off-the-shelf self-supervised 2D descriptors (DINO), and 2D optical flow.

2. Why

Challenge: existing city-scale NeRFs handle only static scenes, while dynamic NeRF methods are limited to small scenes or rely on labeled supervision such as object bounding boxes. SUDS targets city-scale, multi-video dynamic scenes using only unlabeled inputs.

The authors describe SUDS as the largest dynamic NeRF built to date.

3. How

3.1 Inputs

RGB images from N videos + sparse LiDAR depth measurements + self-supervised 2D pixel descriptors (DINO) + 2D optical flow. No bounding boxes or other manual annotations are used.

3.2 Representation

3.2.1 Scene composition and Hash tables

We have three branches to model the urban scenes:

  1. A static branch containing non-moving topography consistent across videos.
  2. A dynamic branch to disentangle video-specific objects [19,29,56], moving or otherwise.
  3. A far-field environment map to represent far-away objects and the sky [41, Urban Neural Field].

As for the hash tables: given an input coordinate $(\mathbf{x},\mathbf{d},\mathbf{t},\mathbf{vid})$, we first find the surrounding voxels at each resolution level $l$ of each table, which we denote $\mathbf{v}_{l,s}$, $\mathbf{v}_{l,d}$, and $\mathbf{v}_{l,e}$.

The static branch makes use of 3D spatial voxels $\mathbf{v}_{l,s}$, the dynamic branch makes use of 4D spacetime voxels $\mathbf{v}_{l,d}$, and the far-field branch makes use of 3D voxels $\mathbf{v}_{l,e}$ (implemented via normalized 3D direction vectors) that index an environment map.

Then we compute hash indices $\mathbf{i}_{l,s}$ (or $\mathbf{i}_{l,d}$, $\mathbf{i}_{l,e}$) for each voxel corner with the following hash functions:

$$
\begin{aligned}
\mathbf{i}_{l,s} &= \operatorname{statichash}\big(space(\mathbf{v}_{l,s})\big)\\
\mathbf{i}_{l,d} &= \operatorname{dynamichash}\big(space(\mathbf{v}_{l,d}),\, time(\mathbf{v}_{l,d}),\, \mathbf{vid}\big)\\
\mathbf{i}_{l,e} &= \operatorname{envhash}\big(dir(\mathbf{v}_{l,e}),\, \mathbf{vid}\big)
\end{aligned}
$$

Recall the multiresolution hash function from Instant-NGP: $h(\mathbf{x})=\left(\bigoplus_{i=1}^{d} x_i \pi_i\right) \bmod T$, where $\oplus$ denotes bitwise XOR, the $\pi_i$ are large prime numbers, and $T$ is the hash-table size.
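
As a concrete illustration (not the paper's implementation), here is a minimal NumPy sketch of this hash for 3D voxel corners; the primes are the ones listed in the Instant-NGP paper, and the table size is an arbitrary example value:

    import numpy as np

    # Per-dimension primes from the Instant-NGP hash (pi_1 = 1); extra input dimensions
    # such as time or the video id would each get their own prime in the same way.
    PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

    def spatial_hash(corners: np.ndarray, table_size: int) -> np.ndarray:
        """h(x) = (XOR_i x_i * pi_i) mod T for integer voxel-corner coordinates of shape (..., 3)."""
        prods = corners.astype(np.uint64) * PRIMES      # elementwise multiply, wraps modulo 2**64
        return np.bitwise_xor.reduce(prods, axis=-1) % np.uint64(table_size)

    # Example: hash the 8 corners of one voxel into a table with 2**19 entries.
    corners = np.stack(np.meshgrid([3, 4], [7, 8], [1, 2], indexing="ij"), axis=-1).reshape(-1, 3)
    print(spatial_hash(corners, table_size=2**19))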

Finally, we linearly interpolate the features stored at the surrounding voxel vertices to obtain the feature vector for the query point.

Notice: we add $\mathbf{vid}$ as an auxiliary input to the hash, but do not use it for interpolation (since averaging across distinct movers is unnatural). From this perspective, we leverage hashing to effectively index separate interpolating functions for each video, without memory growing linearly with the number of videos.
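
To make this concrete, here is a toy NumPy sketch of a single-level lookup, with a 2D $(x, t)$ grid standing in for the full 4D spacetime grid: $\mathbf{vid}$ is folded into the hash key of every surrounding corner, but the interpolation weights come only from the continuous coordinates, so each video effectively gets its own interpolant stored in the same table. Table size, resolution, and feature width are arbitrary example values, not the paper's settings.

    import numpy as np

    T, F, RES = 2**14, 2, 64                # example table size, feature width, grid resolution
    PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    table = np.random.default_rng(0).normal(size=(T, F))   # one level's feature table

    def hash_key(corner, vid):
        """Hash an integer grid corner together with the video id."""
        key = np.array([corner[0], corner[1], vid], dtype=np.uint64)
        return int(np.bitwise_xor.reduce(key * PRIMES) % np.uint64(T))

    def lookup(x, t, vid):
        """Bilinearly interpolate over (x, t); vid affects only which table entries are read."""
        pos = np.array([x, t]) * (RES - 1)
        lo = np.floor(pos).astype(np.int64)
        w = pos - lo                        # interpolation weights depend on (x, t) alone
        feat = np.zeros(F)
        for dx in (0, 1):
            for dt in (0, 1):
                weight = (w[0] if dx else 1 - w[0]) * (w[1] if dt else 1 - w[1])
                feat += weight * table[hash_key(lo + np.array([dx, dt]), vid)]
        return feat

    # The same (x, t) query reads different table entries for different videos.
    print(lookup(0.3, 0.5, vid=0))
    print(lookup(0.3, 0.5, vid=1))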

3.2.2 From branch to images

Next, we map the feature vectors obtained from the hash tables to the quantities each branch contributes to the rendered image:

  1. Static branch

    $$
    \begin{aligned}
    \sigma_{s}(\mathbf{x}) &\in \mathbb{R}\\
    \mathbf{c}_{s}(\mathbf{x},\mathbf{d},A_{vid}\mathcal{F}(t)) &\in \mathbb{R}^{3}
    \end{aligned}
    $$

    We add a latent embedding $A_{vid}\mathcal{F}(t)$, consisting of a video-specific matrix $A_{vid}$ and a Fourier-encoded time index $\mathcal{F}(t)$, to account for appearance variation across videos (see the sketch after this list).

  2. Dynamic branch

    $$
    \begin{aligned}
    \sigma_{d}(\mathbf{x},\mathbf{t},\mathbf{vid}) &\in \mathbb{R}\\
    \rho_{d}(\mathbf{x},\mathbf{t},\mathbf{vid}) &\in [0,1]\\
    \mathbf{c}_{d}(\mathbf{x},\mathbf{t},\mathbf{vid},\mathbf{d}) &\in \mathbb{R}^{3}
    \end{aligned}
    $$

    Because shadows play a crucial role in the appearance of urban scenes, we explicitly model a shadow field of scalar values $\rho_{d}\in[0,1]$, used to scale down the static color $\mathbf{c}_{s}$.

  3. Far-field branch

    $\mathbf{c}_{e}(\mathbf{d},\mathbf{vid})\in\mathbb{R}^{3}$

    The environment map depends only on the viewing direction and the video index and predicts a far-field color; it has no density and is blended in through the leftover transmittance during rendering.
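
(Referenced from item 1 above.) A minimal PyTorch-style sketch of the latent appearance code $A_{vid}\mathcal{F}(t)$: a Fourier encoding of the normalized time index multiplied by a learned per-video matrix. The number of frequencies and the embedding width are arbitrary example choices rather than the paper's settings.

    import math
    import torch
    import torch.nn as nn

    class VideoAppearanceEmbedding(nn.Module):
        """Latent appearance code A_vid F(t): Fourier-encoded time scaled by a per-video matrix."""

        def __init__(self, num_videos: int, num_freqs: int = 4, out_dim: int = 16):
            super().__init__()
            in_dim = 2 * num_freqs                                  # sin/cos per frequency
            self.A = nn.Parameter(torch.randn(num_videos, out_dim, in_dim) * 0.01)
            self.register_buffer('freqs', (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi)

        def forward(self, t: torch.Tensor, vid: torch.Tensor) -> torch.Tensor:
            """t: (B,) normalized times; vid: (B,) integer video ids; returns (B, out_dim)."""
            angles = t[:, None] * self.freqs                        # (B, num_freqs)
            fourier = torch.cat([angles.sin(), angles.cos()], dim=-1)
            return torch.einsum('boi,bi->bo', self.A[vid], fourier)

    # Codes for two samples from different videos; they are fed to c_s together with (x, d).
    emb = VideoAppearanceEmbedding(num_videos=3)
    print(emb(torch.tensor([0.1, 0.9]), torch.tensor([0, 2])).shape)    # torch.Size([2, 16])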

3.2.3 Rendering

Firstly, we derive a single density and radiance value for any position by computing the weighted sum of the static and dynamic components, combined with the pointwise shadow reduction:

$$
\begin{aligned}
\sigma(\mathbf{x},\mathbf{t},\mathbf{vid}) &= \sigma_{s}(\mathbf{x})+\sigma_{d}(\mathbf{x},\mathbf{t},\mathbf{vid})\\
\mathbf{c}(\mathbf{x},\mathbf{t},\mathbf{vid},\mathbf{d}) &= \frac{\sigma_{s}}{\sigma}\,(1-\rho_{d})\,\mathbf{c}_{s}(\mathbf{x},\mathbf{d},A_{vid}\mathcal{F}(t)) + \frac{\sigma_{d}}{\sigma}\,\mathbf{c}_{d}(\mathbf{x},\mathbf{t},\mathbf{vid},\mathbf{d})
\end{aligned}
$$

Then, similar to $\alpha$-blending, we calculate the final color $\hat{C}$:

$$
\begin{aligned}
\hat{C}(\mathbf{r},\mathbf{t},\mathbf{vid}) &= \int_{0}^{+\infty} T(t)\,\sigma(\mathbf{r}(t),\mathbf{t},\mathbf{vid})\,\mathbf{c}(\mathbf{r}(t),\mathbf{t},\mathbf{vid},\mathbf{d})\,dt + T(+\infty)\,\mathbf{c}_{e}(\mathbf{d},\mathbf{vid}),\\
\text{where}\quad T(t) &= \exp\!\left(-\int_{0}^{t}\sigma(\mathbf{r}(s),\mathbf{t},\mathbf{vid})\,ds\right).
\end{aligned}
$$
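
A minimal NumPy sketch of this compositing and rendering along a single ray, assuming the per-sample static and dynamic densities, colors, and shadow values have already been queried from the branches; sample counts, spacing, and values are made up purely for illustration:

    import numpy as np

    def composite(sigma_s, sigma_d, c_s, c_d, rho_d, eps=1e-10):
        """Combine static and dynamic samples into one density and one shadow-scaled color."""
        sigma = sigma_s + sigma_d
        c = ((sigma_s / (sigma + eps))[:, None] * (1.0 - rho_d)[:, None] * c_s
             + (sigma_d / (sigma + eps))[:, None] * c_d)
        return sigma, c

    def render_ray(sigma, c, deltas, c_env):
        """Discretized volume rendering; the leftover transmittance picks up the far-field color."""
        alpha = 1.0 - np.exp(-sigma * deltas)                               # per-sample opacity
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]       # T before each sample
        return ((trans * alpha)[:, None] * c).sum(0) + trans[-1] * (1.0 - alpha[-1]) * c_env

    # Toy example: 4 samples along one ray.
    rng = np.random.default_rng(0)
    sigma_s, sigma_d = rng.uniform(0, 2, 4), rng.uniform(0, 2, 4)
    c_s, c_d = rng.uniform(0, 1, (4, 3)), rng.uniform(0, 1, (4, 3))
    rho_d, deltas = rng.uniform(0, 0.3, 4), np.full(4, 0.25)
    sigma, c = composite(sigma_s, sigma_d, c_s, c_d, rho_d)
    print(render_ray(sigma, c, deltas, c_env=np.array([0.6, 0.7, 0.9])))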

3.2.4 Other supervision

  1. Feature distillation

    We add a C-dimensional output head to each of our branches to predict DINO features, which are then compared against the features produced offline by the DINO model.

    $$
    \begin{aligned}
    \Phi_{s}(\mathbf{x}) &\in \mathbb{R}^{C}\\
    \Phi_{d}(\mathbf{x},\mathbf{t},\mathbf{vid}) &\in \mathbb{R}^{C}\\
    \Phi_{e}(\mathbf{d},\mathbf{vid}) &\in \mathbb{R}^{C}
    \end{aligned}
    $$

    These heads mirror the color mapping: $\mathbf{x}$ is passed through a grid encoding and then through an MLP to obtain the final output. In the released code, however, the feature, flow, and color heads use separate grids, so their encodings are trained independently (a simplified sketch of the distillation loss follows this list).

                # From the SUDS code: a separate hash-grid encoding for the DINO-feature head.
                # 'SequentialGrid' and 'include_static' are not stock tiny-cuda-nn options and
                # appear to be project-specific extensions; n_input_dims=5 presumably covers the
                # 4D spacetime coordinate plus the video index (an assumption based on the
                # hashing scheme described above).
                self.encoding_feature = tcnn.Encoding(
                    n_input_dims=5,
                    encoding_config={
                        'otype': 'SequentialGrid',
                        'n_levels': num_levels,
                        'n_features_per_level': features_per_level,
                        'log2_hashmap_size': log2_hashmap_size,
                        'base_resolution': base_resolution,
                        'include_static': False
                    },
                )
    
  2. Scene flow

    $s_{t'\in[-1,1]}(\mathbf{x},\mathbf{t},\mathbf{vid})\in\mathbb{R}^{3}$

    The dynamic branch additionally predicts 3D scene flow to nearby times (offsets $t'\in[-1,1]$); this head has no direct 3D supervision and is trained indirectly through the 2D optical flow input and the warping losses below.
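
(Referenced from item 1 above.) A simplified sketch of the feature-distillation objective: per-sample C-dimensional features are volume-rendered with the same weights as color, and the rendered pixel feature is regressed against the DINO descriptor extracted offline for that pixel. The mean-squared-error form and the shapes here are illustrative assumptions, not necessarily the exact choices in the released code.

    import numpy as np

    def render_feature(phi, sigma, deltas):
        """Volume-render per-sample feature vectors with the standard NeRF weights."""
        alpha = 1.0 - np.exp(-sigma * deltas)
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
        return ((trans * alpha)[:, None] * phi).sum(0)                  # (C,)

    def feature_distillation_loss(phi, sigma, deltas, dino_target):
        """Squared error between the rendered feature and the offline DINO feature for a pixel."""
        return np.mean((render_feature(phi, sigma, deltas) - dino_target) ** 2)

    # Toy example: 8 samples along one ray, C = 64 feature channels.
    rng = np.random.default_rng(1)
    phi = rng.normal(size=(8, 64))          # predicted per-sample features Phi(x, t, vid)
    sigma, deltas = rng.uniform(0, 2, 8), np.full(8, 0.2)
    dino_target = rng.normal(size=64)       # DINO descriptor computed offline for this pixel
    print(feature_distillation_loss(phi, sigma, deltas, dino_target))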

3.3 Optimization

$$
\begin{aligned}
\mathcal{L} =\;& \underbrace{\left(\mathcal{L}_{c}+\lambda_{f}\mathcal{L}_{f}+\lambda_{d}\mathcal{L}_{d}+\lambda_{o}\mathcal{L}_{o}\right)}_{\text{reconstruction losses}}
+ \underbrace{\left(\mathcal{L}_{c}^{w}+\lambda_{f}\mathcal{L}_{f}^{w}\right)}_{\text{warping losses}}
+ \lambda_{flo}\underbrace{\left(\mathcal{L}_{cyc}+\mathcal{L}_{sm}+\mathcal{L}_{slo}\right)}_{\text{flow losses}}\\
&+ \underbrace{\left(\lambda_{e}\mathcal{L}_{e}+\lambda_{d}\mathcal{L}_{d}\right)}_{\text{static-dynamic factorization}}
+ \lambda_{\rho}\mathcal{L}_{\rho}.
\end{aligned}
$$
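
The sketch below simply assembles the total objective as the weighted sum above, assuming every component loss has already been computed as a scalar; the keys and λ values are ad-hoc placeholders, not the paper's hyperparameters (note that a λ_d-weighted term appears in two groups of the formula and is kept as two separate entries here).

    def total_loss(losses: dict, lam: dict) -> float:
        """Weighted sum of the SUDS loss groups; `losses` holds precomputed scalar terms."""
        recon = losses['c'] + lam['f'] * losses['f'] + lam['d'] * losses['d'] + lam['o'] * losses['o']
        warp = losses['c_w'] + lam['f'] * losses['f_w']
        flow = lam['flo'] * (losses['cyc'] + losses['sm'] + losses['slo'])
        factor = lam['e'] * losses['e'] + lam['d'] * losses['d_factor']
        return recon + warp + flow + factor + lam['rho'] * losses['rho']

    # Example with made-up scalar losses and weights.
    losses = dict(c=0.5, f=0.2, d=0.1, o=0.05, c_w=0.3, f_w=0.15,
                  cyc=0.02, sm=0.01, slo=0.03, e=0.2, d_factor=0.1, rho=0.05)
    lam = dict(f=1.0, d=0.1, o=0.1, flo=0.1, e=0.01, rho=0.01)
    print(total_loss(losses, lam))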

4. Experiment

4.1 City-Scale Reconstruction

The paper does not specify which dataset is used for this experiment.

4.2 KITTI Benchmarks

Evaluated on KITTI and Virtual KITTI (VKITTI), using the same setup as NSG.

4.3 Diagnostics


Flow-based warping is the single most important input, while depth is the least crucial input.

