1. HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
1.1 What
Without LiDAR scans or 3D bounding boxes as input, this paper jointly optimizes geometry, appearance, semantics, and motion from images, noisy 2D semantic labels, and noisy optical flow, using a combination of static and dynamic 3D Gaussians whose moving-object poses are regularized via physical constraints.
1.2 Why
Introduction:
- With the rise of neural rendering, many approaches have emerged to lift 2D information into 3D space, but those addressing dynamic scenes require additional 3D bounding boxes as input.
- In PNF, naive joint optimization of per-frame pose transformations is prone to local minima and sensitive to initialization.
- Some works handle the 2D setting well, but extracting accurate semantics in 3D from them is non-trivial due to inaccurate (inferred) 3D geometry.
Related work:
- 3D Scene Understanding
- Numerous techniques focus on predicting semantic labels [5, 9, 35], depth maps [10, 28], and optical flow [42] solely from 2D input images, but they offer limited 3D understanding.
- Another line of work conducts semantic scene understanding solely from 3D input [29, 31], but LiDAR is expensive.
- NeRF-based works such as Semantic-NeRF are only applicable to static environments.
- Urban Scene Reconstruction
- Many of these approaches rely on the availability of accurate 3D bounding boxes, as seen in NSG, MARS, and UniSim.
- SUDS [36] avoids the use of 3D bounding boxes by grouping the scene based on learned feature fields.
1.3 How
As shown in the pipeline, the scene is divided into one static scene and several dynamic scenes. Three types of Gaussian outputs are used to represent the static scene, while only two types are used in the dynamic scenes, because optical flow can be calculated from the motion of the Gaussian centers. Then, as in NSG, the dynamic Gaussians are transformed into the full scene and combined with the static Gaussians to obtain the three types of rendered results, which are supervised during training.
1.3.1 Decomposed Scene Representation
- Static and Dynamic 3D Gaussians
The scene is divided into static and dynamic parts, but the basis for this division is not fully specified (possibly following NSG).
Another thing is to add semantic logits $\mathbf{s}\in\mathbb{R}^{S}$ to each Gaussian, allowing for rendering 2D semantic labels.
- Unicycle Model
The vehicle is represented by a unicycle model as below, with three elements: $(x_t, y_t, \theta_t)$.
Concretely,
$$\begin{aligned}
x_{t+1} &= x_t + \Delta x = x_t + R(\sin\theta_{t+1}-\sin\theta_{t}) = x_{t}+\frac{v_{t}}{\omega_{t}}(\sin\theta_{t+1}-\sin\theta_{t}) \\
y_{t+1} &= y_{t}-\frac{v_{t}}{\omega_{t}}(\cos\theta_{t+1}-\cos\theta_{t}) \\
\theta_{t+1} &= \theta_t+\omega_t
\end{aligned}$$
This model imposes physical constraints, in contrast to directly and independently optimizing the transformations of dynamic vehicles at every frame.
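To make the update concrete, here is a minimal NumPy sketch of the forward step; it is an illustration under assumed variable names rather than the paper's implementation, and the near-zero $\omega_t$ branch is my own addition to avoid division by zero.

```python
import numpy as np

def unicycle_step(x, y, theta, v, omega, eps=1e-6):
    """One forward step of the unicycle model.

    (x, y) is the planar position, theta the heading, v the speed and
    omega the yaw rate at time t; for near-zero omega the closed-form
    update degenerates to straight-line motion.
    """
    theta_next = theta + omega
    if abs(omega) < eps:
        # Straight-line limit of the formulas below as omega -> 0.
        x_next = x + v * np.cos(theta)
        y_next = y + v * np.sin(theta)
    else:
        R = v / omega  # turning radius
        x_next = x + R * (np.sin(theta_next) - np.sin(theta))
        y_next = y - R * (np.cos(theta_next) - np.cos(theta))
    return x_next, y_next, theta_next

# Roll out a short trajectory from the origin with constant controls.
state = (0.0, 0.0, 0.0)
for _ in range(5):
    state = unicycle_step(*state, v=1.0, omega=0.1)
    print(state)
```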
1.3.2 Holistic Urban Gaussian Splatting
Three types of results are rendered by this model: RGB images, semantic maps, and optical flow.
- Novel View Synthesis
Rendering follows the original Gaussian Splatting, but adds exposure compensation.
This is because, with 3D Gaussians, there is no neural network that can process appearance embeddings to compensate for exposure. Inspired by Urban-NeRF, the paper therefore predicts an affine matrix $\mathbf{A}\in\mathbb{R}^{3\times3}$ and a vector $\mathbf{b}\in\mathbb{R}^{3}$ via a small MLP for each camera to apply the exposure correction:
$$\tilde{\mathbf{C}}=\mathbf{A}\mathbf{C}+\mathbf{b}.$$
Then utilizing $\alpha$-blending (a toy sketch follows the explanation below):
$$\pi:\quad\mathbf{C}=\sum_{i\in\mathcal{N}}\mathbf{c}_i\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j').$$
- Explanation
- For the first sample (closest to the viewer), its color is weighted fully by its alpha value because there are no preceding samples to occlude it.
- For the second sample, its color is weighted by its alpha value and also by the remaining light that passed through the first sample (i.e., $(1-\alpha'_1)$).
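As a toy sketch of this rendering path (not the paper's code, and with made-up numbers), the following applies front-to-back $\alpha$-blending and then the per-camera affine exposure correction; whether the correction acts on the blended pixel color or earlier is my assumption from the ordering in the note.

```python
import numpy as np

def composite_alpha(values, alphas):
    """Front-to-back alpha compositing of per-Gaussian values.

    values: (N, D) array sorted by depth (closest first);
    alphas: (N,) projected opacities alpha'_i.
    Returns the blended D-dimensional pixel value.
    """
    out = np.zeros(values.shape[1])
    transmittance = 1.0  # running product of (1 - alpha'_j)
    for v, a in zip(values, alphas):
        out += v * a * transmittance
        transmittance *= (1.0 - a)
    return out

def apply_exposure(C, A, b):
    """Affine exposure correction C~ = A C + b; A and b would come from
    a small per-camera MLP in the paper, here they are fixed dummies."""
    return A @ C + b

# Three Gaussians along a ray, closest first.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
alphas = np.array([0.5, 0.4, 0.9])
C = composite_alpha(colors, alphas)
C_tilde = apply_exposure(C, 1.1 * np.eye(3), np.array([0.02, 0.02, 0.02]))
print(C, C_tilde)
```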
- Semantic Reconstruction
Obtain 2D semantic labels via $\alpha$-blending based on the 3D semantic logits $\mathbf{s}$:
$$\pi:\quad\mathbf{S}=\sum_{i\in\mathcal{N}}\text{softmax}(\mathbf{s}_i)\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j').$$
Note that the softmax operation is applied to the 3D semantic logits $\mathbf{s}_i$ *prior to* $\alpha$-blending, in contrast to most existing methods that apply softmax to the 2D output. A toy comparison follows below.
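The following NumPy sketch (with made-up logits and opacities) illustrates the distinction: applying softmax to the 3D logits before blending versus blending raw logits and applying softmax to the 2D result.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-Gaussian semantic logits (N Gaussians, S classes) and opacities,
# sorted front to back along one ray.
logits = np.array([[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [-0.5, 0.4, 2.2]])
alphas = np.array([0.6, 0.5, 0.9])
# Blending weights alpha'_i * prod_{j<i} (1 - alpha'_j).
weights = alphas * np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))

# HUGS: softmax on the 3D logits first, then alpha-blend the probabilities.
S_softmax_first = (softmax(logits) * weights[:, None]).sum(axis=0)

# Common alternative: blend raw logits, then softmax the 2D result.
S_blend_first = softmax((logits * weights[:, None]).sum(axis=0))
print(S_softmax_first, S_blend_first)
```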
- Optical Flow
$$\pi:\quad\mathbf{F}=\sum_{i\in\mathcal{N}}\mathbf{f}_i\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j'),$$
where $\mathbf{f}_{t_{1}\to t_{2}}=\mu_{2}'-\mu_{1}'$ and $\mu'$ is the projected center of the Gaussian.
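As a small illustration of this definition (hypothetical camera intrinsics and Gaussian centers, not values from the paper), the per-Gaussian flow is simply the difference of the projected center at two time steps; the rendered flow map would then $\alpha$-blend these vectors as in the formula above.

```python
import numpy as np

def project(mu_world, K, w2c):
    """Project a 3D Gaussian center to pixel coordinates with a pinhole
    camera (K: 3x3 intrinsics, w2c: 4x4 world-to-camera)."""
    p_cam = (w2c @ np.append(mu_world, 1.0))[:3]
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

# A dynamic Gaussian's center at t1 and, after its rigid motion, at t2.
mu_t1 = np.array([2.0, 0.5, 10.0])
mu_t2 = np.array([2.3, 0.5, 9.6])
K = np.array([[720.0, 0.0, 640.0],
              [0.0, 720.0, 360.0],
              [0.0, 0.0, 1.0]])
w2c = np.eye(4)  # camera assumed static between the two frames

# f_{t1 -> t2} = mu'_2 - mu'_1 in pixel space.
f = project(mu_t2, K, w2c) - project(mu_t1, K, w2c)
print(f)
```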
1.3.3 Loss function
$$\mathcal{L}=\mathcal{L}_\mathbf{I}+\lambda_\mathbf{S}\mathcal{L}_\mathbf{S}+\lambda_\mathbf{F}\mathcal{L}_\mathbf{F}+\lambda_\mathbf{t}\mathcal{L}_\mathbf{t}+\lambda_{uni}\mathcal{L}_{uni}+\lambda_{reg}\mathcal{L}_{reg}$$
That is, the total loss is the image loss plus the semantic loss, the optical flow loss, and the unicycle model losses, each with its own weight.
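A trivial sketch of how the weighted terms combine (the λ values below are placeholders, not the paper's settings):

```python
# Hypothetical loss weights; the paper's actual values are not listed here.
lambdas = {"S": 0.1, "F": 0.05, "t": 0.1, "uni": 0.01, "reg": 0.01}

def total_loss(losses, lambdas):
    """L = L_I + sum_k lambda_k * L_k over the remaining terms."""
    return losses["I"] + sum(lam * losses[k] for k, lam in lambdas.items())

print(total_loss({"I": 0.8, "S": 0.4, "F": 0.3, "t": 0.2, "uni": 0.05, "reg": 0.1}, lambdas))
```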
1.3.4 Implementation Details
Pseudo-GTs:
We utilize InverseForm [5] to generate pseudo ground truth for semantic segmentation. For initializing the unicycle model, we employ a monocular-based method, QD-3DT [16], to acquire pseudo ground truth for 3D bounding boxes and tracking IDs at each training view. For optical flow, we use Unimatch [41] to obtain pseudo-ground truth.
1.4 My Takeaway and Self-thoughts
- NSG + Gaussian Splatting
- Unicycle model
- Handle exposure
- Render semantic maps and optical flow in Gaussian
- Pseudo-GTs
References
[1] 13.1.2.3 A simple unicycle (uiuc.edu)
[2] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting (xdimlab.github.io)