SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
1. What
What kind of thing is this article going to do (from the abstract and conclusion, try to summarize it in one sentence)
Given a monocular dynamic video as input, this paper drives 3D Gaussians with a set of sparse control points; each control point carries a time-varying 6-DoF transformation predicted by an MLP. The method enables dynamic novel-view synthesis and motion editing, but still struggles with inaccurate camera poses and intense motions.
2. Why
Under what conditions or needs this research plan was proposed (Intro), what problems/deficiencies should be solved at the core, what others have done, and what are the innovation points? (From Introduction and related work)
Maybe contain Background, Question, Others, Innovation:
NeRF-based methods struggle with low rendering quality, slow speed, and high memory usage, while the original 3D-GS only applies to static scenes. An intuitive extension [47] learns a flow vector for each 3D Gaussian, but it incurs a significant time cost for training and inference (the author of [47] is a co-author of this paper).
Related work:
- Dynamic NeRF
- Dynamic Gaussian Splatting
- 3D Deformation and Editing

The last direction is relatively unfamiliar to me. It covers traditional editing methods in graphics, which focus on preserving the geometric details of 3D objects during deformation, using tools such as Laplacian coordinates, the Poisson equation, and cage-based approaches.
Recently, other approaches have aimed to edit scene geometry learned from 2D images; this paper belongs to this class.
3. How
3.1 Sparse Control Points
We will first introduce the definition of control points, which is the core concept used in this article.
There is a set of sparse control points $\mathcal{P}=\{(p_{i}\in\mathbb{R}^{3},o_{i}\in\mathbb{R}^{+})\},\ i\in \{1,2,\cdots,N_{p}\}$, where $o_i$ is a learnable radius parameter that controls the extent of a control point's impact on a Gaussian.
Meanwhile, for each control point $i$, we learn a time-varying 6-DoF transformation $[R_i^t\,|\,T_i^t]\in\mathbf{SE}(3)$, consisting of a local frame rotation matrix $R_i^t\in\mathbf{SO}(3)$ and a translation vector $T_i^t\in\mathbb{R}^3$. But instead of directly optimizing the transformation parameters, we employ an MLP $\Psi$ to learn a time-varying transformation field:

$$\Psi:(p_{i},t)\rightarrow(R_{i}^{t},T_{i}^{t}).$$
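As a rough sketch of such a transformation field (the hidden width, depth, positional encoding, and quaternion parameterization below are my own illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """Encode inputs with sin/cos at multiple frequencies (NeRF-style)."""
    freqs = 2.0 ** np.arange(num_freqs)           # (F,)
    angles = x[..., None] * freqs                 # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., D * 2F)

class DeformationMLP:
    """Toy transformation field Psi: (p_i, t) -> (r_i^t, T_i^t).

    The rotation is predicted as a unit quaternion r_i^t. Weights are
    random here just to demonstrate shapes and data flow.
    """
    def __init__(self, hidden=64, num_freqs=6, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 4 * 2 * num_freqs                # encoded (x, y, z, t)
        self.num_freqs = num_freqs
        self.W1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 7)) # quaternion (4) + translation (3)
        self.b2 = np.zeros(7)

    def __call__(self, p, t):
        x = np.concatenate([p, np.full((p.shape[0], 1), t)], axis=1)  # (N, 4)
        h = np.maximum(positional_encoding(x, self.num_freqs) @ self.W1 + self.b1, 0.0)
        out = h @ self.W2 + self.b2
        quat = out[:, :4] + np.array([1.0, 0.0, 0.0, 0.0])  # bias toward identity
        quat /= np.linalg.norm(quat, axis=1, keepdims=True) # unit quaternion
        return quat, out[:, 4:]                   # (N, 4), (N, 3)
```

In practice the quaternion output would be converted to $R_i^t$ for the blending step, and the network would be trained jointly with the Gaussians.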
3.2 Dynamic Scene Rendering
After defining the control points, we need to establish their connection to the Gaussians.
We use k-nearest neighbor (KNN) search to obtain, for each Gaussian $G_j$, its $K$ (= 4) neighboring control points, denoted $\{p_{k}\mid k\in\mathcal{N}_{j}\}$. The interpolation weights are then defined as:

$$w_{jk}=\frac{\hat w_{jk}}{\sum_{k\in\mathcal{N}_{j}}\hat w_{jk}},\quad\text{where }\hat w_{jk}=\exp\!\left(-\frac{d_{jk}^{2}}{2o_{k}^{2}}\right),$$

where $d_{jk}$ is the distance between the center of Gaussian $G_j$ and the neighboring control point $p_k$, and $o_k$ is the learned radius parameter of $p_k$.
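These weights can be sketched in a few lines of NumPy (the function name is mine; brute-force KNN is used for clarity, whereas a KD-tree would be used in practice):

```python
import numpy as np

def control_point_weights(mu, points, radii, K=4):
    """Gaussian-kernel interpolation weights between Gaussian centers and
    their K nearest control points.

    mu:     (M, 3) Gaussian centers
    points: (N, 3) control-point positions p_k
    radii:  (N,)   learned radius parameters o_k
    Returns (M, K) neighbor indices and (M, K) normalized weights.
    """
    # Squared distances from every Gaussian center to every control point.
    d2 = ((mu[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (M, N)
    nbr = np.argsort(d2, axis=1)[:, :K]                          # (M, K) KNN indices
    d2_nbr = np.take_along_axis(d2, nbr, axis=1)                 # (M, K)
    w_hat = np.exp(-d2_nbr / (2.0 * radii[nbr] ** 2))            # unnormalized weights
    w = w_hat / w_hat.sum(axis=1, keepdims=True)                 # normalize over K neighbors
    return nbr, w
```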
We then borrow the idea of Linear Blend Skinning (LBS): each vertex in a model is assigned a weight for each bone, indicating how much influence that bone has over the vertex's position. When a bone moves, each vertex moves according to the weighted average of the transformations (rotations and translations) of all the bones that influence it. In the same way, we can deform each Gaussian by:
$$\begin{aligned}\mu_j^t&=\sum_{k\in\mathcal{N}_j}w_{jk}\left(R_k^t(\mu_j-p_k)+p_k+T_k^t\right)\\ q_j^t&=\Big(\sum_{k\in\mathcal{N}_j}w_{jk}r_k^t\Big)\otimes q_j,\end{aligned}$$
The new rotation is the weighted average of the rotations of the neighboring control points, represented as quaternions $r_k^t$ (the same rotation as $R_k^t$, just in a different mathematical form). This averaged quaternion is then composed with the original rotation quaternion of the Gaussian via the quaternion product, which combines the two rotations in a way that is appropriate for 3D.
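A minimal NumPy sketch of this LBS-style blending for a single Gaussian (helper names are mine; quaternions are in (w, x, y, z) order, and the averaged quaternion is renormalized before composition):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product a ⊗ b of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_to_mat(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def blend_gaussian(mu_j, q_j, nbr_p, nbr_r, nbr_T, w):
    """Blend one Gaussian (center mu_j, orientation q_j) by its K
    neighboring control points (positions nbr_p, rotations nbr_r as
    quaternions, translations nbr_T) with weights w."""
    mu_t = np.zeros(3)
    for p_k, r_k, T_k, w_k in zip(nbr_p, nbr_r, nbr_T, w):
        # Rotate around the control point, then translate.
        mu_t += w_k * (quat_to_mat(r_k) @ (mu_j - p_k) + p_k + T_k)
    r_blend = (w[:, None] * nbr_r).sum(0)
    r_blend /= np.linalg.norm(r_blend)        # renormalize the averaged quaternion
    return mu_t, quat_mul(r_blend, q_j)
```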
3.3 Optimization
We now know how to control the Gaussians via the control points. Next, we introduce two strategies used during optimization.
- ARAP Loss
This loss is borrowed from the paper "As-rigid-as-possible surface modeling" and helps maintain local rigidity.
First, it defines the trajectory $p_i^{\text{traj}}$ of each point over the scene motion as:

$$p_i^{\text{traj}}=\frac{1}{N_t}\,p_i^{t_1}\oplus p_i^{t_2}\oplus\cdots\oplus p_i^{t_{N_t}},$$

where $p_i^{t}$ represents the position of point $i$ at time $t$.
Then, a local neighborhood for each control point is determined via a ball query, i.e., all control points within a predefined radius define its local area of influence. To compute $\mathcal{L}_\mathrm{arap}$, we randomly sample two time steps $t_1$ and $t_2$. For each point $p_k$ within the radius (i.e., $k\in\mathcal{N}_{c_i}$), its transformed locations under the learned translations $T_k^{t_1}$ and $T_k^{t_2}$ are $p_k^{t_1}=p_k+T_k^{t_1}$ and $p_k^{t_2}=p_k+T_k^{t_2}$, and the rotation matrix $\hat{R}_i$ can be estimated as:

$$\hat{R}_{i}=\arg\min_{R\in\mathbf{SO}(3)}\sum_{k\in\mathcal{N}_{c_{i}}}w_{ik}\,\|(p_{i}^{t_{1}}-p_{k}^{t_{1}})-R(p_{i}^{t_{2}}-p_{k}^{t_{2}})\|^{2}.$$
Finally, $\mathcal{L}_\mathrm{arap}$ can be calculated as:

$$\mathcal{L}_{\mathrm{arap}}(p_{i},t_{1},t_{2})=\sum_{k\in\mathcal{N}_{c_{i}}}w_{ik}\,\|(p_{i}^{t_{1}}-p_{k}^{t_{1}})-\hat{R}_{i}(p_{i}^{t_{2}}-p_{k}^{t_{2}})\|^{2}.$$
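The inner minimization over $R\in\mathbf{SO}(3)$ is a weighted Procrustes problem with a closed-form SVD solution (the classic Kabsch algorithm). A sketch, with the edge vectors precomputed by the caller:

```python
import numpy as np

def arap_loss(p1, p2, w):
    """ARAP residual for one control point's neighborhood.

    p1, p2: (K, 3) edge vectors (p_i^t - p_k^t) at two sampled time steps;
    w: (K,) neighbor weights. The optimal rotation R_hat is the weighted
    Procrustes solution via SVD (Kabsch algorithm).
    """
    # Weighted cross-covariance between the two edge sets.
    S = (p2 * w[:, None]).T @ p1                  # (3, 3)
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # fix reflections -> proper rotation
    R_hat = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    resid = p1 - p2 @ R_hat.T                     # (K, 3) residual edges
    return float((w * (resid ** 2).sum(axis=1)).sum())
```

If the two neighborhoods differ only by a rigid rotation, the loss is (numerically) zero; any non-rigid distortion is penalized.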
- Adaptive Control Points
The other strategy is very similar to the adaptive density control of 3D Gaussians.
- Prune: calculate the overall impact $W_i=\sum_{j\in\tilde{\mathcal{N}}_i}w_{ji}$ of $p_i$ on the set of Gaussians $j\in\tilde{\mathcal{N}}_i$ whose $K$ nearest neighbors include $p_i$. Then, prune $p_i$ if $W_i$ is close to zero, indicating little contribution to the motion of the 3D Gaussians.
- Clone: calculate the weighted sum of the Gaussian gradient norms as:

$$g_{i}=\sum_{j\in\tilde{\mathcal{N}}_{i}}\tilde{w}_{j}\left\|\frac{d\mathcal{L}}{d\mu_{j}}\right\|_{2}^{2},\quad\text{where }\tilde{w}_{j}=\frac{w_{ji}}{\sum_{j'\in\tilde{\mathcal{N}}_{i}}w_{j'i}}.$$
A large $g_i$ indicates poor reconstruction around $p_i$, in which case a new point $p_i^{\prime}$ is added as:

$$p_i^{\prime}=\sum_{j\in\tilde{\mathcal{N}}_i}\tilde{w}_j\mu_j;\quad o_i^{\prime}=o_i.$$
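A toy sketch of the prune/clone logic (the thresholds `prune_eps` and `clone_tau`, and reusing the parent radius for the cloned point, are my own assumptions):

```python
import numpy as np

def adapt_control_points(points, radii, mu, nbr, w, grad_mu,
                         prune_eps=1e-3, clone_tau=1.0):
    """Prune low-impact control points and clone points near poorly
    reconstructed Gaussians.

    points: (N, 3), radii: (N,) control points and radius parameters
    mu: (M, 3) Gaussian centers, nbr: (M, K) KNN indices, w: (M, K) weights
    grad_mu: (M,) squared gradient norms ||dL/dmu_j||^2
    Returns the updated (points, radii).
    """
    keep, new_p, new_o = [], [], []
    for i in range(points.shape[0]):
        rows, cols = np.nonzero(nbr == i)   # Gaussians whose KNN includes p_i
        w_ji = w[rows, cols]
        W_i = w_ji.sum()                    # overall impact of p_i
        if W_i <= prune_eps:                # prune: negligible influence
            continue
        keep.append(i)
        w_tilde = w_ji / W_i
        g_i = (w_tilde * grad_mu[rows]).sum()
        if g_i > clone_tau:                 # clone: poor reconstruction nearby
            # Place the new point at the weighted centroid of influenced Gaussians.
            new_p.append((w_tilde[:, None] * mu[rows]).sum(0))
            new_o.append(radii[i])          # new point inherits the radius
    pts = np.concatenate([points[keep], np.array(new_p).reshape(-1, 3)])
    rad = np.concatenate([radii[keep], np.array(new_o)])
    return pts, rad
```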
With all these components in place, we can now follow the whole pipeline end to end.
3.4 Motion Editing
Given a set of user-defined handle points $\{h_l\in\mathbb{R}^3\mid l\in\mathcal{H}\subset \{1,2,\cdots,N_p\}\}$, the control graph $\mathcal{P}^{\prime}$ can be deformed by minimizing the ARAP energy formulated as:
$$E(\mathcal{P}^{\prime})=\sum_{i=1}^{N_{p}}\sum_{j\in\mathcal{N}_{i}}w_{ij}\,\|(p_{i}^{\prime}-p_{j}^{\prime})-\hat{R}_{i}(p_{i}-p_{j})\|^{2},$$

subject to the fixed-position constraints $p_{l}^{\prime}=h_{l}$ for $l\in\mathcal{H}$.
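Evaluating this energy is straightforward; here is a sketch that takes the per-point rotations $\hat{R}_i$ as given (in a full local-global ARAP solve they would be re-estimated each iteration and the free positions solved via a sparse linear system):

```python
import numpy as np

def arap_energy(p_new, p_old, nbrs, weights, R_hat):
    """Evaluate the ARAP deformation energy E(P') of the control graph.

    p_new/p_old: (N, 3) deformed and rest positions
    nbrs:        list of N neighbor-index arrays
    weights:     list of N matching weight arrays w_ij
    R_hat:       (N, 3, 3) per-point rotations (assumed given here)
    """
    E = 0.0
    for i in range(p_new.shape[0]):
        e_new = p_new[i] - p_new[nbrs[i]]        # deformed edges (K, 3)
        e_old = p_old[i] - p_old[nbrs[i]]        # rest edges (K, 3)
        resid = e_new - e_old @ R_hat[i].T       # rigidity residual
        E += (weights[i] * (resid ** 2).sum(axis=1)).sum()
    return E
```

Handle constraints $p_l^{\prime}=h_l$ would be enforced by simply fixing those rows of `p_new` during optimization.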
4. Self-thoughts
- Only one camera is used; if we had six cameras as in autonomous driving (AD), how would we combine them?
- The editing of Gaussians is guided by physical rigid-body motion.
- ARAP could be replaced by ARAPReg.
- The method only suits small objects with meaningful control points. Could we use just a few control points, as in human posture representation?