[Paper Reading] SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

1. What

What does this paper do? (From the abstract and conclusion, summarized in one sentence.)

Given a monocular dynamic video as input, this paper drives 3D Gaussians with sparse control points, each carrying a time-varying 6-DoF transformation predicted by an MLP; the method enables dynamic novel-view synthesis and motion editing, but still has limitations under inaccurate camera poses or intense motion.

2. Why

Under what conditions or needs was this research proposed (from the Intro)? What core problems/deficiencies does it solve, what have others done, and what are the innovation points? (From the Introduction and Related Work.)

Covering background, problem, prior work, and innovation:

NeRF-based methods struggle with low rendering quality, slow speed, and high memory usage, while the original 3D-GS applies only to static scenes. An intuitive extension [47] learns a flow vector for each 3D Gaussian, but it incurs a significant time cost for training and inference (the author of [47] is a co-author of this paper).

Related work:

  • Dynamic NeRF

  • Dynamic Gaussian Splatting

  • 3D Deformation and Editing

    This part is relatively unfamiliar to me. It covers traditional editing methods in graphics, which focus on preserving the geometric details of 3D objects during deformation, using tools such as Laplacian coordinates, the Poisson equation, and cage-based approaches.

    Recently, there have been other approaches that aim to edit the scene geometry learned from 2D images. This paper belongs to this class.

3. How

3.1 Sparse Control Points

We will first introduce the definition of control points, which is the core concept used in this article.

There is a set of sparse control points $\mathcal{P}=\{(p_i\in\mathbb{R}^3,\,o_i\in\mathbb{R}^+)\},\ i\in\{1,2,\cdots,N_p\}$, where $o_i$ is a learnable radius parameter that controls how strongly a control point influences a Gaussian.

Meanwhile, for each control point $p_i$, we learn a time-varying 6-DoF transformation $[R_i^t\mid T_i^t]\in\mathbf{SE}(3)$, consisting of a local frame rotation matrix $R_i^t\in\mathbf{SO}(3)$ and a translation vector $T_i^t\in\mathbb{R}^3$. Instead of directly optimizing the transformation parameters, an MLP $\Psi$ is employed to learn a time-varying transformation field:

$$\Psi:(p_i,t)\rightarrow(R_i^t,T_i^t).$$
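To make this concrete, below is a minimal PyTorch sketch (my own reconstruction, not the authors' released code) of the learnable control points and the transformation field $\Psi$; the network width, the raw time input (the paper applies positional encoding), and the quaternion output are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlPoints(nn.Module):
    """Sparse control points: positions p_i and learnable radii o_i."""
    def __init__(self, num_points: int):
        super().__init__()
        self.positions = nn.Parameter(torch.randn(num_points, 3) * 0.1)  # p_i in R^3
        self.log_radius = nn.Parameter(torch.zeros(num_points))          # o_i, stored in log space

    @property
    def radius(self) -> torch.Tensor:
        return self.log_radius.exp()  # exp keeps o_i strictly positive

class DeformMLP(nn.Module):
    """Psi: (p_i, t) -> (r_i^t, T_i^t); rotation predicted as a unit quaternion."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4 + 3),  # 4 quaternion components + 3 translation
        )

    def forward(self, p: torch.Tensor, t: float):
        t_col = torch.full((p.shape[0], 1), t, device=p.device)  # broadcast time to all points
        out = self.net(torch.cat([p, t_col], dim=-1))
        quat = F.normalize(out[:, :4], dim=-1)  # normalizing keeps the output a valid rotation
        trans = out[:, 4:]
        return quat, trans
```

Predicting a quaternion and normalizing it is one common way to keep the MLP output on $\mathbf{SO}(3)$; the predicted $(r_i^t, T_i^t)$ are the per-control-point transformations used in the blending below.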

3.2 Dynamic Scene Rendering

Given the control points, we next need to establish their connection to the Gaussians.

For each Gaussian $G_j$, a k-nearest-neighbor (KNN) search obtains its $K\,(=4)$ neighboring control points, denoted $\{p_k\mid k\in\mathcal{N}_j\}$. Their interpolation weights are defined as:

$$w_{jk}=\frac{\hat w_{jk}}{\sum_{k\in\mathcal{N}_j}\hat w_{jk}},\quad\text{where}\quad \hat w_{jk}=\exp\!\left(-\frac{d_{jk}^2}{2o_k^2}\right),$$

where $d_{jk}$ is the distance between the center of Gaussian $G_j$ and the neighboring control point $p_k$, and $o_k$ is the learned radius parameter of control point $p_k$.
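Here is a sketch of this weight computation (shapes are assumptions of mine: Gaussian centers `mu` of shape (M, 3), control-point positions `p` of shape (N_p, 3)):

```python
import torch

def knn_weights(mu: torch.Tensor, p: torch.Tensor, radius: torch.Tensor, K: int = 4):
    """Gaussian-kernel interpolation weights w_jk between Gaussians and control points."""
    d = torch.cdist(mu, p)                         # d_jk: pairwise distances, (M, N_p)
    d_knn, idx = d.topk(K, dim=-1, largest=False)  # K nearest control points per Gaussian
    o = radius[idx]                                # learned radii o_k of those neighbors
    w_hat = torch.exp(-d_knn ** 2 / (2 * o ** 2))  # unnormalized weights \hat{w}_jk
    w = w_hat / w_hat.sum(-1, keepdim=True).clamp_min(1e-8)  # normalize over N_j
    return w, idx                                  # both (M, K)
```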

The paper then borrows the idea of Linear Blend Skinning (LBS): each vertex of a model is assigned a weight for each bone, indicating how much influence that bone has over the vertex's position; when a bone moves, each vertex moves according to the weighted average of the transformations (rotations and translations) of all the bones that influence it. Analogously, each Gaussian is adjusted by:

$$\begin{aligned}\mu_j^t&=\sum_{k\in\mathcal{N}_j}w_{jk}\left(R_k^t(\mu_j-p_k)+p_k+T_k^t\right),\\q_j^t&=\Big(\sum_{k\in\mathcal{N}_j}w_{jk}\,r_k^t\Big)\otimes q_j,\end{aligned}$$

The new rotation is the weighted average of the neighboring control points' rotations, represented as quaternions $r_k^t$ (the same rotation as $R_k^t$, just in a different mathematical form). This averaged quaternion is then composed with the Gaussian's original rotation quaternion via the quaternion product $\otimes$, which combines the two rotations in a way appropriate for 3D rotations.
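Below is a sketch of this LBS-style blending, reusing `w, idx` from the previous snippet; the (w, x, y, z) quaternion layout is my assumption:

```python
import torch
import torch.nn.functional as F

def quat_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hamilton product of quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], dim=-1)

def quat_rotate(q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Rotate vectors v by unit quaternions q via q * (0, v) * q^-1."""
    qv = torch.cat([torch.zeros_like(v[..., :1]), v], dim=-1)
    q_conj = q * torch.tensor([1.0, -1.0, -1.0, -1.0], device=q.device)
    return quat_mul(quat_mul(q, qv), q_conj)[..., 1:]

def blend_gaussians(mu, q, p, r_t, T_t, w, idx):
    """Deform Gaussian centers mu (M,3) and rotations q (M,4) by blended control-point motion."""
    p_k, r_k, T_k = p[idx], r_t[idx], T_t[idx]      # gather neighbors: (M,K,3), (M,K,4), (M,K,3)
    local = mu[:, None, :] - p_k                    # center expressed in each local frame
    moved = quat_rotate(r_k, local) + p_k + T_k     # R_k^t (mu_j - p_k) + p_k + T_k^t
    mu_t = (w[..., None] * moved).sum(1)            # weighted blend of candidate centers
    q_avg = F.normalize((w[..., None] * r_k).sum(1), dim=-1)  # averaged neighbor rotation
    q_t = quat_mul(q_avg, q)                        # compose with the Gaussian's own rotation
    return mu_t, q_t
```

Averaging quaternions by a weighted sum plus normalization is an approximation that works well when the neighboring rotations are close to each other, which the ARAP regularization below encourages.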

3.3 Optimization

We now know how to drive the Gaussians with the control points. Next, we introduce two strategies used during optimization.

  1. ARAP Loss

A loss function borrowed from the paper "As-Rigid-As-Possible Surface Modeling", which helps maintain local rigidity.

First, it defines the trajectory $p_i^{\text{traj}}$ of each point in the scene motion as:

$$p_i^{\text{traj}}=\frac{1}{N_t}\,p_i^{t_1}\oplus p_i^{t_2}\oplus\cdots\oplus p_i^{t_{N_t}},$$

where $p_i^t$ denotes the position of point $i$ at time $t$.

Then a local neighborhood for each control point is determined via a ball query, i.e., all control points within a predefined radius define a local area of influence. To compute $\mathcal{L}_{\mathrm{arap}}$, two time steps $t_1$ and $t_2$ are randomly sampled. For each point $p_k$ within the radius (i.e., $k\in\mathcal{N}_{c_i}$), its transformed locations under the learned translations $T_k^{t_1}$ and $T_k^{t_2}$ are $p_k^{t_1}=p_k+T_k^{t_1}$ and $p_k^{t_2}=p_k+T_k^{t_2}$; the best-fitting rotation matrix $\hat{R}_i$ can then be estimated as:

$$\hat{R}_i=\arg\min_{R\in\mathbf{SO}(3)}\sum_{k\in\mathcal{N}_{c_i}}w_{ik}\left\|(p_i^{t_1}-p_k^{t_1})-R\,(p_i^{t_2}-p_k^{t_2})\right\|^2.$$

Finally, $\mathcal{L}_{\mathrm{arap}}$ is computed as:

$$\mathcal{L}_{\mathrm{arap}}(p_i,t_1,t_2)=\sum_{k\in\mathcal{N}_{c_i}}w_{ik}\left\|(p_i^{t_1}-p_k^{t_1})-\hat{R}_i\,(p_i^{t_2}-p_k^{t_2})\right\|^2.$$
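Here is a sketch of the ARAP loss under a few assumptions of mine: ball-query neighborhoods arrive as a dense, padded index tensor `nbr`, and the argmin over rotations is solved in closed form with the Kabsch algorithm; $\hat{R}_i$ is detached so gradients treat it as a constant, mirroring the local step of classic local-global ARAP.

```python
import torch

def arap_loss(p_t1: torch.Tensor, p_t2: torch.Tensor, nbr: torch.Tensor, w: torch.Tensor):
    """p_t1, p_t2: (Nc, 3) control points at two times; nbr: (Nc, Kn) neighbor indices; w: (Nc, Kn)."""
    e1 = p_t1[:, None, :] - p_t1[nbr]   # edges p_i^{t1} - p_k^{t1}, (Nc, Kn, 3)
    e2 = p_t2[:, None, :] - p_t2[nbr]   # edges p_i^{t2} - p_k^{t2}
    # Kabsch: the R minimizing sum_k w ||e1 - R e2||^2 comes from the SVD of the covariance.
    S = torch.einsum('nk,nki,nkj->nij', w, e2, e1)          # weighted covariance, (Nc, 3, 3)
    U, _, Vt = torch.linalg.svd(S)
    V, Ut = Vt.transpose(-1, -2), U.transpose(-1, -2)
    det = torch.linalg.det(V @ Ut)                          # reflection correction
    D = torch.diag_embed(torch.stack([torch.ones_like(det),
                                      torch.ones_like(det), det], dim=-1))
    R = (V @ D @ Ut).detach()                               # \hat{R}_i, treated as constant
    resid = e1 - torch.einsum('nij,nkj->nki', R, e2)        # rigidity residual per edge
    return (w * resid.pow(2).sum(-1)).sum(-1).mean()
```

During training, $t_1$ and $t_2$ are resampled each iteration, so rigidity is encouraged across the whole sequence.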

  2. Adaptive Control Points

The second strategy closely mirrors the adaptive density control of 3D Gaussians: control points are pruned and cloned during training (a code sketch of both heuristics follows the list below).

  • Prune: calculate the overall impact $W_i=\sum_{j\in\tilde{\mathcal{N}}_i}w_{ji}$ over the set of Gaussians $j\in\tilde{\mathcal{N}}_i$ whose $K$ nearest neighbors include $p_i$. Then prune $p_i$ if $W_i$ is close to zero, indicating little contribution to the motion of the 3D Gaussians.

  • Clone: calculate the influence-weighted sum of the Gaussian gradient norms as:

    $$g_i=\sum_{j\in\tilde{\mathcal{N}}_i}\tilde{w}_j\left\|\frac{d\mathcal{L}}{d\mu_j}\right\|_2^2,\quad\text{where}\quad \tilde{w}_j=\frac{w_{ji}}{\sum_{j'\in\tilde{\mathcal{N}}_i}w_{j'i}}.$$

    A large $g_i$ indicates poor reconstruction around $p_i$, so a new control point $p_i^{\prime}$ is added as:

    $$p_i^{\prime}=\sum_{j\in\tilde{\mathcal{N}}_i}\tilde{w}_j\,\mu_j;\qquad o_i^{\prime}=o_i.$$
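Here is a sketch of the two heuristics, reusing `w, idx` from the KNN snippet; the scatter via `index_add_` and the thresholds in the comments are assumptions of mine:

```python
import torch

def control_point_stats(w: torch.Tensor, idx: torch.Tensor, grad_mu: torch.Tensor, n_pts: int):
    """w, idx: (M, K) weights/indices from the KNN step; grad_mu: dL/dmu per Gaussian, (M, 3)."""
    flat_idx, flat_w = idx.reshape(-1), w.reshape(-1)
    # W_i: total influence of control point i over Gaussians whose KNN include it.
    W = torch.zeros(n_pts).index_add_(0, flat_idx, flat_w)
    # g_i: influence-weighted average of squared gradient norms ||dL/dmu_j||_2^2.
    gnorm = grad_mu.pow(2).sum(-1)                                   # (M,)
    g = torch.zeros(n_pts).index_add_(
        0, flat_idx, flat_w * gnorm[:, None].expand_as(w).reshape(-1))
    g = g / W.clamp_min(1e-8)                                        # applies the w~_j normalization
    return W, g

def clone_position(i: int, w: torch.Tensor, idx: torch.Tensor, mu: torch.Tensor):
    """New point p_i' at the influence-weighted centroid of the Gaussians near p_i."""
    wi = torch.where(idx == i, w, torch.zeros_like(w))
    wi = wi / wi.sum().clamp_min(1e-8)
    return (wi[..., None] * mu[:, None, :]).sum(dim=(0, 1))

# Hypothetical usage: prune where W is tiny; clone where g is large (thresholds are mine).
# prune_mask = W < 1e-4
# clone_mask = g > g.mean() + 2 * g.std()
```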

Now we can use the pipeline figure below to tie all of these steps together.

[Figure: overview of the SC-GS pipeline]

3.4 Motion Editing

Given a set of user-defined handle points $\{h_l\in\mathbb{R}^3\mid l\in\mathcal{H}\subset\{1,2,\cdots,N_p\}\}$, the control graph $\mathcal{P}^{\prime}$ can be deformed by minimizing the ARAP energy, formulated as:

$$E(\mathcal{P}^{\prime})=\sum_{i=1}^{N_p}\sum_{j\in\mathcal{N}_i}w_{ij}\left\|(p_i^{\prime}-p_j^{\prime})-\hat{R}_i\,(p_i-p_j)\right\|^2,$$

subject to the fixed-position constraint $p_l^{\prime}=h_l$ for $l\in\mathcal{H}$.
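As a rough illustration, here is a gradient-descent sketch of the handle-based deformation: instead of the classic local-global ARAP solver typically used in such editing pipelines, it simply minimizes $E(\mathcal{P}^{\prime})$ with Adam while hard-pinning the handles, reusing the `arap_loss` sketch from Section 3.3; the step count and learning rate are arbitrary choices of mine.

```python
import torch

def deform_control_graph(p, nbr, w, handle_idx, handle_pos, steps=300, lr=1e-2):
    """Deform control points p (Np, 3) so that p'[handle_idx] = handle_pos, minimizing E(P')."""
    is_handle = torch.zeros(p.shape[0], dtype=torch.bool)
    is_handle[handle_idx] = True
    free_idx = (~is_handle).nonzero(as_tuple=True)[0]
    p_free = p[free_idx].clone().requires_grad_(True)   # only non-handle points are optimized
    opt = torch.optim.Adam([p_free], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        base = p.detach().clone()
        base[handle_idx] = handle_pos                   # hard constraint p'_l = h_l
        p_new = base.index_put((free_idx,), p_free)     # differentiable w.r.t. p_free
        loss = arap_loss(p_new, p, nbr, w)              # E(P') against the rest pose p
        loss.backward()
        opt.step()
    out = p.detach().clone()
    out[handle_idx] = handle_pos
    out[free_idx] = p_free.detach()
    return out
```

Once the control graph is deformed, the same LBS blending from Section 3.2 propagates the edit to every Gaussian.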

4. Self-thoughts

  1. Only one camera is used; if we had six cameras as in autonomous driving (AD), how would we fuse them?
  2. The editing of Gaussians relies on physically plausible rigid-body motion.
  3. ARAP could be replaced by ARAPReg.
  4. The method only suits small objects with meaningful control points. Could we use just a few control points, like a human-pose representation?