1. HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
1.1 What
Without LiDAR scans or 3D bounding boxes as input, this paper jointly optimizes geometry, appearance, semantics, and motion from images, noisy 2D semantic labels, and noisy optical flow, using a combination of static and dynamic 3D Gaussians whose moving-object poses are regularized via physical constraints.
1.2 Why
Introduction:
- With the rise of neural rendering, many approaches have emerged to lift 2D information into 3D space, but those addressing dynamic scenes require additional 3D bounding boxes as input.
- In PNF, naive joint optimization of per-frame pose transformations is prone to local minima and sensitive to initialization.
- Some works handle the 2D setting well, but extracting accurate semantics in 3D from them is non-trivial due to inaccurate (inferred) 3D geometry.
Related work:
- 3D Scene Understanding
- Numerous techniques focus on predicting semantic labels [5, 9, 35], depth maps [10, 28], and optical flow [42] solely from 2D input images, but they offer limited 3D understanding.
- Another line of work conducts semantic scene understanding solely from 3D input [29, 31], but LiDAR is expensive.
- NeRF-based works such as Semantic-NeRF are only applicable to static environments.
- Urban Scene Reconstruction
- Many of these approaches rely on the availability of accurate 3D bounding boxes, as seen in NSG, MARS, and UniSim.
- SUDS [36] avoids the use of 3D bounding boxes by grouping the scene based on learned feature fields.
1.3 How
As shown in the pipeline, the scene is divided into one static scene and several dynamic scenes. Three types of Gaussian outputs are used to represent the static scene, while only two types are used in the dynamic scenes, because optical flow can be calculated from the motion of the Gaussian centers. Then, as in NSG, the dynamic Gaussians are transformed into the full scene and combined with the static Gaussians to obtain the three types of rendered results, which are supervised during training.
1.3.1 Decomposed Scene Representation
- Static and Dynamic 3D Gaussians
The scene is divided into static and dynamic parts, but the basis for this division is not fully specified (possibly following NSG).
Another thing is to add semantic logits $\mathbf{s}\in\mathbb{R}^{S}$ to each Gaussian, allowing for rendering 2D semantic labels.
- Unicycle Model
The vehicle is represented by a unicycle model as below, with three elements: $(x_t, y_t, \theta_t)$.
Concretely,
$$\begin{aligned}
x_{t+1} &= x_t + \Delta x = x_t + R(\sin\theta_{t+1}-\sin\theta_{t}) = x_{t}+\frac{v_{t}}{\omega_{t}}(\sin\theta_{t+1}-\sin\theta_{t}) \\
y_{t+1} &= y_{t}-\frac{v_{t}}{\omega_{t}}(\cos\theta_{t+1}-\cos\theta_{t}) \\
\theta_{t+1} &= \theta_t+\omega_t
\end{aligned}$$
This model imposes physical constraints, in contrast to directly and independently optimizing the transformations of dynamic vehicles at every frame.
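To make the update concrete, here is a minimal NumPy sketch of the forward step; it is an illustration under assumed variable names rather than the paper's implementation, and the near-zero $\omega_t$ branch is my own addition to avoid division by zero.

```python
import numpy as np

def unicycle_step(x, y, theta, v, omega, eps=1e-6):
    """One forward step of the unicycle model.

    (x, y) is the planar position, theta the heading, v the speed and
    omega the yaw rate at time t; for near-zero omega the closed-form
    update degenerates to straight-line motion.
    """
    theta_next = theta + omega
    if abs(omega) < eps:
        # Straight-line limit of the formulas below as omega -> 0.
        x_next = x + v * np.cos(theta)
        y_next = y + v * np.sin(theta)
    else:
        R = v / omega  # turning radius
        x_next = x + R * (np.sin(theta_next) - np.sin(theta))
        y_next = y - R * (np.cos(theta_next) - np.cos(theta))
    return x_next, y_next, theta_next

# Roll out a short trajectory from the origin with constant controls.
state = (0.0, 0.0, 0.0)
for _ in range(5):
    state = unicycle_step(*state, v=1.0, omega=0.1)
    print(state)
```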
1.3.2 Holistic Urban Gaussian Splatting
Three types of results are rendered by this model: RGB images, semantic maps, and optical flow.
- Novel View Synthesis
Rendering follows the original Gaussian Splatting, but adds exposure compensation.
This is because, with 3D Gaussians, there is no neural network that can process appearance embeddings to compensate for exposure. Inspired by Urban-NeRF, the paper therefore predicts an affine matrix $\mathbf{A}\in\mathbb{R}^{3\times3}$ and a vector $\mathbf{b}\in\mathbb{R}^{3}$ via a small MLP for each camera to apply the exposure correction:
$$\tilde{\mathbf{C}}=\mathbf{A}\mathbf{C}+\mathbf{b}.$$
Then utilizing $\alpha$-blending (a toy sketch follows the explanation below):
$$\pi:\quad\mathbf{C}=\sum_{i\in\mathcal{N}}\mathbf{c}_i\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j').$$
- Explanation
- For the first sample (closest to the viewer), its color is weighted fully by its alpha value because there are no preceding samples to occlude it.
- For the second sample, its color is weighted by its alpha value and also by the remaining light that passed through the first sample (i.e., $(1-\alpha'_1)$).
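As a toy sketch of this rendering path (not the paper's code, and with made-up numbers), the following applies front-to-back $\alpha$-blending and then the per-camera affine exposure correction; whether the correction acts on the blended pixel color or earlier is my assumption from the ordering in the note.

```python
import numpy as np

def composite_alpha(values, alphas):
    """Front-to-back alpha compositing of per-Gaussian values.

    values: (N, D) array sorted by depth (closest first);
    alphas: (N,) projected opacities alpha'_i.
    Returns the blended D-dimensional pixel value.
    """
    out = np.zeros(values.shape[1])
    transmittance = 1.0  # running product of (1 - alpha'_j)
    for v, a in zip(values, alphas):
        out += v * a * transmittance
        transmittance *= (1.0 - a)
    return out

def apply_exposure(C, A, b):
    """Affine exposure correction C~ = A C + b; A and b would come from
    a small per-camera MLP in the paper, here they are fixed dummies."""
    return A @ C + b

# Three Gaussians along a ray, closest first.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
alphas = np.array([0.5, 0.4, 0.9])
C = composite_alpha(colors, alphas)
C_tilde = apply_exposure(C, 1.1 * np.eye(3), np.array([0.02, 0.02, 0.02]))
print(C, C_tilde)
```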
- Semantic Reconstruction
Obtain 2D semantic labels via $\alpha$-blending based on the 3D semantic logits $\mathbf{s}$:
$$\pi:\quad\mathbf{S}=\sum_{i\in\mathcal{N}}\text{softmax}(\mathbf{s}_i)\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j').$$
Note that the softmax operation is applied to the 3D semantic logits $\mathbf{s}_i$ *prior to* $\alpha$-blending, in contrast to most existing methods that apply softmax to the 2D output. A toy comparison follows below.
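The following NumPy sketch (with made-up logits and opacities) illustrates the distinction: applying softmax to the 3D logits before blending versus blending raw logits and applying softmax to the 2D result.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-Gaussian semantic logits (N Gaussians, S classes) and opacities,
# sorted front to back along one ray.
logits = np.array([[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [-0.5, 0.4, 2.2]])
alphas = np.array([0.6, 0.5, 0.9])
# Blending weights alpha'_i * prod_{j<i} (1 - alpha'_j).
weights = alphas * np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))

# HUGS: softmax on the 3D logits first, then alpha-blend the probabilities.
S_softmax_first = (softmax(logits) * weights[:, None]).sum(axis=0)

# Common alternative: blend raw logits, then softmax the 2D result.
S_blend_first = softmax((logits * weights[:, None]).sum(axis=0))
print(S_softmax_first, S_blend_first)
```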
- Optical Flow
$$\pi:\quad\mathbf{F}=\sum_{i\in\mathcal{N}}\mathbf{f}_i\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j'),$$
where $\mathbf{f}_{t_{1}\to t_{2}}=\mu_{2}'-\mu_{1}'$ and $\mu'$ is the projected center of the Gaussian.
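As a small illustration of this definition (hypothetical camera intrinsics and Gaussian centers, not values from the paper), the per-Gaussian flow is simply the difference of the projected center at two time steps; the rendered flow map would then $\alpha$-blend these vectors as in the formula above.

```python
import numpy as np

def project(mu_world, K, w2c):
    """Project a 3D Gaussian center to pixel coordinates with a pinhole
    camera (K: 3x3 intrinsics, w2c: 4x4 world-to-camera)."""
    p_cam = (w2c @ np.append(mu_world, 1.0))[:3]
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

# A dynamic Gaussian's center at t1 and, after its rigid motion, at t2.
mu_t1 = np.array([2.0, 0.5, 10.0])
mu_t2 = np.array([2.3, 0.5, 9.6])
K = np.array([[720.0, 0.0, 640.0],
              [0.0, 720.0, 360.0],
              [0.0, 0.0, 1.0]])
w2c = np.eye(4)  # camera assumed static between the two frames

# f_{t1 -> t2} = mu'_2 - mu'_1 in pixel space.
f = project(mu_t2, K, w2c) - project(mu_t1, K, w2c)
print(f)
```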
1.3.3 Loss function
$$\mathcal{L}=\mathcal{L}_\mathbf{I}+\lambda_\mathbf{S}\mathcal{L}_\mathbf{S}+\lambda_\mathbf{F}\mathcal{L}_\mathbf{F}+\lambda_\mathbf{t}\mathcal{L}_\mathbf{t}+\lambda_{uni}\mathcal{L}_{uni}+\lambda_{reg}\mathcal{L}_{reg}$$
That is, the total loss is the image loss plus the semantic loss, the optical flow loss, and the unicycle model losses, each with its own weight.
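A trivial sketch of how the weighted terms combine (the λ values below are placeholders, not the paper's settings):

```python
# Hypothetical loss weights; the paper's actual values are not listed here.
lambdas = {"S": 0.1, "F": 0.05, "t": 0.1, "uni": 0.01, "reg": 0.01}

def total_loss(losses, lambdas):
    """L = L_I + sum_k lambda_k * L_k over the remaining terms."""
    return losses["I"] + sum(lam * losses[k] for k, lam in lambdas.items())

print(total_loss({"I": 0.8, "S": 0.4, "F": 0.3, "t": 0.2, "uni": 0.05, "reg": 0.1}, lambdas))
```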
1.3.4 Implementation Details
Pseudo-GTs:
We utilize InverseForm [5] to generate pseudo ground truth for semantic segmentation. For initializing the unicycle model, we employ a monocular-based method, QD-3DT [16], to acquire pseudo ground truth for 3D bounding boxes and tracking IDs at each training view. For optical flow, we use Unimatch [41] to obtain pseudo-ground truth.
1.4 My Takeaway and Self-thoughts
- NSG + Gaussian Splatting
- Unicycle model
- Handle exposure
- Render semantic maps and optical flow in Gaussian
- Pseudo-GTs
References
[1] 13.1.2.3 A simple unicycle (uiuc.edu)
[2] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting (xdimlab.github.io)