SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
1. What
What kind of thing is this article going to do (from the abstract and conclusion, try to summarize it in one sentence)
Given a monocular dynamic video as input, this paper drives 3D Gaussians with a set of sparse control points; each control point carries a time-varying 6-DoF transformation predicted by an MLP. The method enables dynamic novel-view synthesis and motion editing, but still struggles with inaccurate camera poses and intense motions.
2. Why
Under what conditions or needs this research plan was proposed (Intro), what problems/deficiencies should be solved at the core, what others have done, and what are the innovation points? (From Introduction and related work)
Maybe contain Background, Question, Others, Innovation:
NeRF-based methods struggle with low rendering quality, slow speed, and high memory usage, while the original 3D-GS only applies to static scenes. An intuitive extension [47] learns a flow vector for each 3D Gaussian, but it incurs a significant time cost for training and inference (the author of [47] is a co-author of this paper).
Related work:
- Dynamic NeRF
- Dynamic Gaussian Splatting
- 3D Deformation and Editing

The last direction is relatively unfamiliar to me. It covers traditional editing methods in graphics, which focus on preserving the geometric details of 3D objects during deformation, using tools such as Laplacian coordinates, the Poisson equation, and cage-based approaches.
Recently, other approaches have aimed to edit scene geometry learned from 2D images; this paper belongs to this class.
3. How
3.1 Sparse Control Points
We will first introduce the definition of control points, which is the core concept used in this article.
There is a set of sparse control points $\mathcal{P}=\{(p_{i}\in\mathbb{R}^{3},o_{i}\in\mathbb{R}^{+})\},\ i\in \{1,2,\cdots,N_{p}\}$, where $o_i$ is a learnable radius parameter that controls the extent of a control point's impact on a Gaussian.
Meanwhile, for each control point $i$, we learn a time-varying 6-DoF transformation $[R_i^t\,|\,T_i^t]\in\mathbf{SE}(3)$, consisting of a local frame rotation matrix $R_i^t\in\mathbf{SO}(3)$ and a translation vector $T_i^t\in\mathbb{R}^3$. But instead of directly optimizing the transformation parameters, we employ an MLP $\Psi$ to learn a time-varying transformation field:

$$\Psi:(p_{i},t)\rightarrow(R_{i}^{t},T_{i}^{t}).$$
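As a rough sketch of such a transformation field (the hidden width, depth, positional encoding, and quaternion parameterization below are my own illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """Encode inputs with sin/cos at multiple frequencies (NeRF-style)."""
    freqs = 2.0 ** np.arange(num_freqs)           # (F,)
    angles = x[..., None] * freqs                 # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., D * 2F)

class DeformationMLP:
    """Toy transformation field Psi: (p_i, t) -> (r_i^t, T_i^t).

    The rotation is predicted as a unit quaternion r_i^t. Weights are
    random here just to demonstrate shapes and data flow.
    """
    def __init__(self, hidden=64, num_freqs=6, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 4 * 2 * num_freqs                # encoded (x, y, z, t)
        self.num_freqs = num_freqs
        self.W1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 7)) # quaternion (4) + translation (3)
        self.b2 = np.zeros(7)

    def __call__(self, p, t):
        x = np.concatenate([p, np.full((p.shape[0], 1), t)], axis=1)  # (N, 4)
        h = np.maximum(positional_encoding(x, self.num_freqs) @ self.W1 + self.b1, 0.0)
        out = h @ self.W2 + self.b2
        quat = out[:, :4] + np.array([1.0, 0.0, 0.0, 0.0])  # bias toward identity
        quat /= np.linalg.norm(quat, axis=1, keepdims=True) # unit quaternion
        return quat, out[:, 4:]                   # (N, 4), (N, 3)
```

In practice the quaternion output would be converted to $R_i^t$ for the blending step, and the network would be trained jointly with the Gaussians.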
3.2 Dynamic Scene Rendering
After defining the control points, we need to establish their connection to the Gaussians.
We use k-nearest neighbor (KNN) search to obtain, for each Gaussian $G_j$, its $K$ (= 4) neighboring control points, denoted $\{p_{k}\mid k\in\mathcal{N}_{j}\}$. The interpolation weights are then defined as:

$$w_{jk}=\frac{\hat w_{jk}}{\sum_{k\in\mathcal{N}_{j}}\hat w_{jk}},\quad\text{where }\hat w_{jk}=\exp\!\left(-\frac{d_{jk}^{2}}{2o_{k}^{2}}\right),$$

where $d_{jk}$ is the distance between the center of Gaussian $G_j$ and the neighboring control point $p_k$, and $o_k$ is the learned radius parameter of $p_k$.
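These weights can be sketched in a few lines of NumPy (the function name is mine; brute-force KNN is used for clarity, whereas a KD-tree would be used in practice):

```python
import numpy as np

def control_point_weights(mu, points, radii, K=4):
    """Gaussian-kernel interpolation weights between Gaussian centers and
    their K nearest control points.

    mu:     (M, 3) Gaussian centers
    points: (N, 3) control-point positions p_k
    radii:  (N,)   learned radius parameters o_k
    Returns (M, K) neighbor indices and (M, K) normalized weights.
    """
    # Squared distances from every Gaussian center to every control point.
    d2 = ((mu[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (M, N)
    nbr = np.argsort(d2, axis=1)[:, :K]                          # (M, K) KNN indices
    d2_nbr = np.take_along_axis(d2, nbr, axis=1)                 # (M, K)
    w_hat = np.exp(-d2_nbr / (2.0 * radii[nbr] ** 2))            # unnormalized weights
    w = w_hat / w_hat.sum(axis=1, keepdims=True)                 # normalize over K neighbors
    return nbr, w
```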
We then borrow the idea of Linear Blend Skinning (LBS): each vertex in a model is assigned a weight for each bone, indicating how much influence that bone has over the vertex's position. When a bone moves, each vertex moves according to the weighted average of the transformations (rotations and translations) of all the bones that influence it. In the same way, we can deform each Gaussian by:
$$\begin{aligned}\mu_j^t&=\sum_{k\in\mathcal{N}_j}w_{jk}\left(R_k^t(\mu_j-p_k)+p_k+T_k^t\right)\\ q_j^t&=\Big(\sum_{k\in\mathcal{N}_j}w_{jk}r_k^t\Big)\otimes q_j,\end{aligned}$$
The new rotation is the weighted average of the rotations of the neighboring control points, represented as quaternions $r_k^t$ (the same rotation as $R_k^t$, just in a different mathematical form). This averaged quaternion is then composed with the original rotation quaternion of the Gaussian via the quaternion product, which combines the two rotations in a way that is appropriate for 3D.
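A minimal NumPy sketch of this LBS-style blending for a single Gaussian (helper names are mine; quaternions are in (w, x, y, z) order, and the averaged quaternion is renormalized before composition):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product a ⊗ b of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_to_mat(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def blend_gaussian(mu_j, q_j, nbr_p, nbr_r, nbr_T, w):
    """Blend one Gaussian (center mu_j, orientation q_j) by its K
    neighboring control points (positions nbr_p, rotations nbr_r as
    quaternions, translations nbr_T) with weights w."""
    mu_t = np.zeros(3)
    for p_k, r_k, T_k, w_k in zip(nbr_p, nbr_r, nbr_T, w):
        # Rotate around the control point, then translate.
        mu_t += w_k * (quat_to_mat(r_k) @ (mu_j - p_k) + p_k + T_k)
    r_blend = (w[:, None] * nbr_r).sum(0)
    r_blend /= np.linalg.norm(r_blend)        # renormalize the averaged quaternion
    return mu_t, quat_mul(r_blend, q_j)
```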
3.3 Optimization
We now know how to control the Gaussians via the control points. Next, we introduce two strategies used during optimization.
- ARAP Loss
This loss is borrowed from the paper "As-rigid-as-possible surface modeling" and helps maintain local rigidity.
First, it defines the trajectory $p_i^{\text{traj}}$ of each point over the scene motion as:

$$p_i^{\text{traj}}=\frac{1}{N_t}\,p_i^{t_1}\oplus p_i^{t_2}\oplus\cdots\oplus p_i^{t_{N_t}},$$

where $p_i^{t}$ represents the position of point $i$ at time $t$.
Then, a local neighborhood for each control point is determined via a ball query, i.e., all control points within a predefined radius define its local area of influence. To compute $\mathcal{L}_\mathrm{arap}$, we randomly sample two time steps $t_1$ and $t_2$. For each point $p_k$ within the radius (i.e., $k\in\mathcal{N}_{c_i}$), its transformed locations under the learned translations $T_k^{t_1}$ and $T_k^{t_2}$ are $p_k^{t_1}=p_k+T_k^{t_1}$ and $p_k^{t_2}=p_k+T_k^{t_2}$, and the rotation matrix $\hat{R}_i$ can be estimated as:

$$\hat{R}_{i}=\arg\min_{R\in\mathbf{SO}(3)}\sum_{k\in\mathcal{N}_{c_{i}}}w_{ik}\,\|(p_{i}^{t_{1}}-p_{k}^{t_{1}})-R(p_{i}^{t_{2}}-p_{k}^{t_{2}})\|^{2}.$$
Finally, $\mathcal{L}_\mathrm{arap}$ can be calculated as:

$$\mathcal{L}_{\mathrm{arap}}(p_{i},t_{1},t_{2})=\sum_{k\in\mathcal{N}_{c_{i}}}w_{ik}\,\|(p_{i}^{t_{1}}-p_{k}^{t_{1}})-\hat{R}_{i}(p_{i}^{t_{2}}-p_{k}^{t_{2}})\|^{2}.$$
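The inner minimization over $R\in\mathbf{SO}(3)$ is a weighted Procrustes problem with a closed-form SVD solution (the classic Kabsch algorithm). A sketch, with the edge vectors precomputed by the caller:

```python
import numpy as np

def arap_loss(p1, p2, w):
    """ARAP residual for one control point's neighborhood.

    p1, p2: (K, 3) edge vectors (p_i^t - p_k^t) at two sampled time steps;
    w: (K,) neighbor weights. The optimal rotation R_hat is the weighted
    Procrustes solution via SVD (Kabsch algorithm).
    """
    # Weighted cross-covariance between the two edge sets.
    S = (p2 * w[:, None]).T @ p1                  # (3, 3)
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # fix reflections -> proper rotation
    R_hat = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    resid = p1 - p2 @ R_hat.T                     # (K, 3) residual edges
    return float((w * (resid ** 2).sum(axis=1)).sum())
```

If the two neighborhoods differ only by a rigid rotation, the loss is (numerically) zero; any non-rigid distortion is penalized.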
- Adaptive Control Points
The other strategy is very similar to the adaptive density control of 3D Gaussians.
- Prune: calculate the overall impact $W_i=\sum_{j\in\tilde{\mathcal{N}}_i}w_{ji}$ of $p_i$ on the set of Gaussians $j\in\tilde{\mathcal{N}}_i$ whose $K$ nearest neighbors include $p_i$. Then, prune $p_i$ if $W_i$ is close to zero, indicating little contribution to the motion of the 3D Gaussians.
- Clone: calculate the weighted sum of the Gaussian gradient norms as:

$$g_{i}=\sum_{j\in\tilde{\mathcal{N}}_{i}}\tilde{w}_{j}\left\|\frac{d\mathcal{L}}{d\mu_{j}}\right\|_{2}^{2},\quad\text{where }\tilde{w}_{j}=\frac{w_{ji}}{\sum_{j'\in\tilde{\mathcal{N}}_{i}}w_{j'i}}.$$
A large $g_i$ indicates poor reconstruction around $p_i$, in which case a new point $p_i^{\prime}$ is added as:

$$p_i^{\prime}=\sum_{j\in\tilde{\mathcal{N}}_i}\tilde{w}_j\mu_j;\quad o_i^{\prime}=o_i.$$
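A toy sketch of the prune/clone logic (the thresholds `prune_eps` and `clone_tau`, and reusing the parent radius for the cloned point, are my own assumptions):

```python
import numpy as np

def adapt_control_points(points, radii, mu, nbr, w, grad_mu,
                         prune_eps=1e-3, clone_tau=1.0):
    """Prune low-impact control points and clone points near poorly
    reconstructed Gaussians.

    points: (N, 3), radii: (N,) control points and radius parameters
    mu: (M, 3) Gaussian centers, nbr: (M, K) KNN indices, w: (M, K) weights
    grad_mu: (M,) squared gradient norms ||dL/dmu_j||^2
    Returns the updated (points, radii).
    """
    keep, new_p, new_o = [], [], []
    for i in range(points.shape[0]):
        rows, cols = np.nonzero(nbr == i)   # Gaussians whose KNN includes p_i
        w_ji = w[rows, cols]
        W_i = w_ji.sum()                    # overall impact of p_i
        if W_i <= prune_eps:                # prune: negligible influence
            continue
        keep.append(i)
        w_tilde = w_ji / W_i
        g_i = (w_tilde * grad_mu[rows]).sum()
        if g_i > clone_tau:                 # clone: poor reconstruction nearby
            # Place the new point at the weighted centroid of influenced Gaussians.
            new_p.append((w_tilde[:, None] * mu[rows]).sum(0))
            new_o.append(radii[i])          # new point inherits the radius
    pts = np.concatenate([points[keep], np.array(new_p).reshape(-1, 3)])
    rad = np.concatenate([radii[keep], np.array(new_o)])
    return pts, rad
```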
With all these components in place, we can now follow the whole pipeline end to end.
3.4 Motion Editing
Given a set of user-defined handle points $\{h_l\in\mathbb{R}^3\mid l\in\mathcal{H}\subset \{1,2,\cdots,N_p\}\}$, the control graph $\mathcal{P}^{\prime}$ can be deformed by minimizing the ARAP energy formulated as:
$$E(\mathcal{P}^{\prime})=\sum_{i=1}^{N_{p}}\sum_{j\in\mathcal{N}_{i}}w_{ij}\,\|(p_{i}^{\prime}-p_{j}^{\prime})-\hat{R}_{i}(p_{i}-p_{j})\|^{2},$$

subject to the fixed-position constraints $p_{l}^{\prime}=h_{l}$ for $l\in\mathcal{H}$.
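Evaluating this energy is straightforward; here is a sketch that takes the per-point rotations $\hat{R}_i$ as given (in a full local-global ARAP solve they would be re-estimated each iteration and the free positions solved via a sparse linear system):

```python
import numpy as np

def arap_energy(p_new, p_old, nbrs, weights, R_hat):
    """Evaluate the ARAP deformation energy E(P') of the control graph.

    p_new/p_old: (N, 3) deformed and rest positions
    nbrs:        list of N neighbor-index arrays
    weights:     list of N matching weight arrays w_ij
    R_hat:       (N, 3, 3) per-point rotations (assumed given here)
    """
    E = 0.0
    for i in range(p_new.shape[0]):
        e_new = p_new[i] - p_new[nbrs[i]]        # deformed edges (K, 3)
        e_old = p_old[i] - p_old[nbrs[i]]        # rest edges (K, 3)
        resid = e_new - e_old @ R_hat[i].T       # rigidity residual
        E += (weights[i] * (resid ** 2).sum(axis=1)).sum()
    return E
```

Handle constraints $p_l^{\prime}=h_l$ would be enforced by simply fixing those rows of `p_new` during optimization.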
4. Self-thoughts
- Only one camera is used; if we had six cameras as in autonomous driving (AD), how would we combine them?
- The editing of Gaussians is guided by physical rigid-body motion.
- ARAP could be replaced by ARAPReg.
- The method only suits small objects with meaningful control points. Could we use just a few control points, as in human posture representation?