[Paper Reading] Gaussian-Based Street Scene Reconstruction: HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

1. HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

1.1 What

Given only images, noisy 2D semantic labels, and noisy optical flow as input, without LiDAR scans or 3D bounding boxes, this paper jointly optimizes geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving-object poses are regularized via physical constraints.

1.2 Why

Introduction:

  • With the rise of neural rendering, many approaches have emerged to lift 2D information into 3D space, but those addressing dynamic scenes require additional 3D bounding boxes.
  • For PNF, naive joint optimization of per-frame pose transformations is prone to local minima and sensitive to initialization.
  • Some works handle the 2D setting well, but extracting accurate semantics in 3D is non-trivial due to inaccurate (inferred) 3D geometry.

Related work:

  • 3D Scene Understanding
    • Numerous techniques have focused on predicting semantic labels [5, 9, 35], depth maps [10, 28], and optical flow [42] solely from 2D input images, but they are limited in 3D understanding.
    • Another line of work conducts semantic scene understanding solely from 3D input [29, 31], but LiDAR is expensive.
    • NeRF-based works such as Semantic-NeRF are only applicable to static environments.
  • Urban Scene Reconstruction
    • Many of these approaches rely on the availability of accurate 3D bounding boxes, as seen in NSG, MARS, and UniSim.
    • SUDS [36] avoids the use of 3D bounding boxes by grouping the scene based on learned feature fields.

1.3 How

(Figure: the HUGS pipeline.)

From the pipeline, we can see that the scene is divided into several dynamic parts and one static part. The static part is represented with three types of Gaussians, while the dynamic parts use only two, because the optical flow can be computed from the motion of the Gaussian centers. Then, just as in NSG, the dynamic Gaussians are transformed into the full scene and combined with the static Gaussians to obtain the rendered results of all three output types, which are supervised during training.

1.3.1 Decomposed Scene Representation

  1. Static and Dynamic 3D Gaussians

The scene should be divided into static and dynamic parts, but the basis for this division still needs to be determined (perhaps following NSG).

Another point is to attach semantic logits $\mathbf{s}\in\mathbb{R}^{S}$ to each Gaussian, allowing 2D semantic labels to be rendered.

  2. Unicycle Model

The vehicle is represented by a unicycle model, as shown below, with three state variables: $(x_t, y_t, \theta_t)$.

(Figure: the unicycle model.)

Concretely,

$$
\begin{aligned}
x_{t+1} &= x_t + \Delta x = x_t + R(\sin\theta_{t+1}-\sin\theta_t) = x_t + \frac{v_t}{\omega_t}(\sin\theta_{t+1}-\sin\theta_t) \\
y_{t+1} &= y_t - \frac{v_t}{\omega_t}(\cos\theta_{t+1}-\cos\theta_t) \\
\theta_{t+1} &= \theta_t + \omega_t
\end{aligned}
$$
Compared to independently optimizing the transformation of each dynamic vehicle at every frame, this model integrates physical constraints.
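Below is a minimal numpy sketch of this integration, assuming per-frame speed $v_t$ and yaw rate $\omega_t$ are given (the paper optimizes them); the function name and the straight-line fallback for $\omega_t \approx 0$ are illustrative choices.

```python
import numpy as np

def unicycle_step(x, y, theta, v, omega, eps=1e-6):
    """Advance the unicycle state (x, y, theta) by one frame."""
    theta_next = theta + omega
    if abs(omega) < eps:
        # Limit of the circular-arc update as omega -> 0: straight motion.
        x_next = x + v * np.cos(theta)
        y_next = y + v * np.sin(theta)
    else:
        R = v / omega  # turning radius
        x_next = x + R * (np.sin(theta_next) - np.sin(theta))
        y_next = y - R * (np.cos(theta_next) - np.cos(theta))
    return x_next, y_next, theta_next

# Roll out a short trajectory with constant speed and yaw rate.
state = (0.0, 0.0, 0.0)
for _ in range(5):
    state = unicycle_step(*state, v=1.0, omega=0.1)
```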

1.3.2 Holistic Urban Gaussian Splatting

The model renders three types of outputs: RGB images, semantic maps, and optical flow.

  1. Novel View Synthesis

This is the same as the original Gaussian Splatting, but adds exposure compensation.

This is because, when working with 3D Gaussians, there is no neural network that can process appearance embeddings to compensate for exposure. So, inspired by Urban-NeRF, this paper generates an affine matrix $\mathbf{A}\in\mathbb{R}^{3\times3}$ and a vector $\mathbf{b}\in\mathbb{R}^{3}$ via a small MLP for each camera to model exposure:

$$
\tilde{\mathbf{C}} = \mathbf{A}\,\mathbf{C} + \mathbf{b}.
$$
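A minimal PyTorch sketch of this idea, assuming a learned per-camera embedding; the class name, dimensions, and near-identity initialization are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ExposureMLP(nn.Module):
    """Maps a per-camera embedding to an affine color transform (A, b)."""
    def __init__(self, num_cameras, embed_dim=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(num_cameras, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 12),  # 9 entries for A (3x3) + 3 for b
        )

    def forward(self, cam_id, image):  # image: (3, H, W)
        out = self.mlp(self.embed(cam_id))
        A = out[:9].view(3, 3) + torch.eye(3)  # start near identity
        b = out[9:]
        C = image.reshape(3, -1)               # flatten pixels to (3, H*W)
        return (A @ C + b[:, None]).reshape_as(image)  # C~ = A C + b

# Usage: correct the rendered image for camera 0 before the photometric loss.
model = ExposureMLP(num_cameras=4)
corrected = model(torch.tensor(0), torch.rand(3, 64, 64))
```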

Then $\alpha$-blending is applied:

$$
\pi:\quad \mathbf{C} = \sum_{i\in\mathcal{N}} \mathbf{c}_i \alpha_i' \prod_{j=1}^{i-1}(1-\alpha_j').
$$

  • Explanation
    1. For the first sample (closest to the viewer), its color is weighted fully by its alpha value because there are no preceding samples to occlude it.
    2. For the second sample, its color is weighted by its alpha value and also by the remaining light that passed through the first sample, i.e., $(1-\alpha_1')$.
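A minimal sketch of this front-to-back compositing for one pixel, assuming the Gaussians overlapping the pixel are already sorted by depth (function and variable names are illustrative):

```python
import numpy as np

def alpha_blend(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).

    colors: (N, 3) per-Gaussian colors c_i, nearest first.
    alphas: (N,) effective opacities alpha'_i in [0, 1].
    """
    transmittance = 1.0  # light remaining after earlier samples
    out = np.zeros(3)
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c)
        transmittance *= 1.0 - a
    return out
```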
  2. Semantic Reconstruction

Obtain 2D semantic labels via $\alpha$-blending based on the 3D semantic logits $\mathbf{s}$:

$$
\pi:\quad \mathbf{S} = \sum_{i\in\mathcal{N}} \operatorname{softmax}(\mathbf{s}_i)\,\alpha_i' \prod_{j=1}^{i-1}(1-\alpha_j').
$$

Note that the softmax operation is performed on the 3D semantic logits $\mathbf{s}_i$ *prior to* $\alpha$-blending, in contrast to most existing methods, which apply softmax to the rendered 2D maps.
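Rendering semantics then reuses the same compositing, with softmax applied per Gaussian before blending; a minimal sketch (names illustrative):

```python
import numpy as np

def render_semantics(logits, alphas):
    """Composite per-Gaussian semantic logits into a 2D class distribution.

    logits: (N, S) semantic logits s_i, depth-sorted nearest first.
    alphas: (N,) effective opacities alpha'_i.
    """
    transmittance = 1.0
    out = np.zeros(logits.shape[1])
    for s, a in zip(logits, alphas):
        p = np.exp(s - s.max())
        p /= p.sum()                   # softmax *before* blending
        out += transmittance * a * p
        transmittance *= 1.0 - a
    return out  # argmax over classes gives the 2D semantic label
```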

  3. Optical Flow

$$
\pi:\quad \mathbf{F} = \sum_{i\in\mathcal{N}} \mathbf{f}_i \alpha_i' \prod_{j=1}^{i-1}(1-\alpha_j'),
$$

where $\mathbf{f}_{t_1\to t_2} = \mu_2' - \mu_1'$ and $\mu'$ is the projected 2D center of the Gaussian.
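A minimal sketch of the per-Gaussian flow, assuming a standard pinhole projection with intrinsics `K` and world-to-camera extrinsics (both function names are hypothetical):

```python
import numpy as np

def project(mu, K, w2c):
    """Pinhole projection of a 3D Gaussian center mu to 2D pixels."""
    p_cam = w2c[:3, :3] @ mu + w2c[:3, 3]  # world -> camera
    p_img = K @ p_cam                      # camera -> image plane
    return p_img[:2] / p_img[2]            # perspective divide

def gaussian_flow(mu_t1, mu_t2, K, w2c_t1, w2c_t2):
    """f_{t1->t2} = mu'_2 - mu'_1; the 3D centers differ only for dynamic
    Gaussians, while camera motion alone induces flow for static ones."""
    return project(mu_t2, K, w2c_t2) - project(mu_t1, K, w2c_t1)
```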

1.3.3 Loss function

$$
\mathcal{L} = \mathcal{L}_\mathbf{I} + \lambda_\mathbf{S}\mathcal{L}_\mathbf{S} + \lambda_\mathbf{F}\mathcal{L}_\mathbf{F} + \lambda_\mathbf{t}\mathcal{L}_\mathbf{t} + \lambda_{uni}\mathcal{L}_{uni} + \lambda_{reg}\mathcal{L}_{reg}
$$

That is, the image loss plus the semantic loss, the optical-flow loss, and the unicycle model losses.
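Assembling the total objective is then a weighted sum; the weight values below are placeholders, not the paper's:

```python
def total_loss(L_I, L_S, L_F, L_t, L_uni, L_reg,
               lam_S=0.1, lam_F=0.05, lam_t=0.1, lam_uni=0.1, lam_reg=0.01):
    """Weighted sum of image, semantic, flow, and unicycle-model losses."""
    return (L_I + lam_S * L_S + lam_F * L_F
            + lam_t * L_t + lam_uni * L_uni + lam_reg * L_reg)
```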

1.3.4 Implementation Details

Pseudo-GTs:

We utilize InverseForm [5] to generate pseudo ground truth for semantic segmentation. For initializing the unicycle model, we employ a monocular method, QD-3DT [16], to acquire pseudo ground truth for 3D bounding boxes and tracking IDs at each training view. For optical flow, we use Unimatch [41] to obtain pseudo ground truth.

1.4 My Takeaway and Self-thoughts

  1. NSG + Gaussian Splatting
  2. Unicycle model
  3. Handle exposure
  4. Render semantic maps and optical flow in Gaussian
  5. Pseudo-GTs

References

[1] 13.1.2.3 A simple unicycle (uiuc.edu)

[2] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting (xdimlab.github.io)
