[Paper Reading] Gaussian Grouping: Segment and Edit Anything in 3D Scenes

1. What

What does this paper do? (A one-sentence summary drawn from the abstract and conclusion.)

Gaussian Grouping is the first 3D Gaussian-based approach to jointly reconstruct and segment anything in open-world 3D scenes.
Each Gaussian carries a compact Identity Encoding, supervised by 2D masks from SAM together with an introduced 3D spatial-consistency regularization; the resulting grouping can further be used for editing.

  • Explanation of Open-world

    An open-world scenario refers to an uncertain, dynamic and complex environment that contains a variety of objects, scenes and tasks.

    Alternatively, “open-world scene understanding” refers to the ability of a model to generalize to scenes or environments it has not been explicitly trained on. In this context, “open-world” implies that the model must adapt to and understand a wide range of scenes, including ones very different from those in its training data.

2. Why

Under what conditions or needs was this work proposed? What core problems or deficiencies does it address, what have others done, and what are its innovations? (From the Introduction and Related Work.)

Covering the background, the problem, prior work, and the innovations:

  • Existing methods [8, 37] rely on manually labeled datasets or require accurately scanned 3D point clouds [33, 42] as input.
  • Existing NeRF-based methods [14, 17, 25, 39] are computation-hungry and hard to adapt to downstream tasks, because the learned neural networks (such as MLPs) cannot easily decompose individual parts or modules of the 3D scene.
  • As for radiance-field-based open-world scene understanding: unlike Gaussian Grouping, most of these methods are designed for in-domain scene modeling and cannot generalize to open-world scenarios.

3. How

Following the pipeline in Figure 2, each stage is introduced in detail below.

[Figure 2: overview of the Gaussian Grouping pipeline]

3.1 Anything Mask Input and Consistency

As shown in Figure 2(a), the inputs are a set of multi-view captures, the 2D segmentations automatically generated by SAM, and the corresponding cameras calibrated via SfM.

As shown in Figure 2(b), to assign each 2D mask a unique ID in the 3D scene, a well-trained zero-shot tracker [7] is used to propagate and associate masks across views; different colors represent different segmentation labels. A simplified sketch of this association step follows below.
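
To make the association step concrete: the actual method relies on the zero-shot tracker [7], but the greedy IoU matching below is a hypothetical, simplified stand-in that only illustrates the goal, namely that each object keeps one global ID across all views. All function and variable names here are my own.

```python
# Hypothetical, simplified stand-in for the mask association step.
# The paper uses a well-trained zero-shot tracker [7]; greedy IoU
# matching only illustrates the goal: one global ID per object.
import numpy as np

def associate_masks(prev_masks: dict, curr_masks: list, iou_thresh: float = 0.5) -> dict:
    """Map each current-view SAM mask (boolean HxW array) to a global ID
    from the previous view when IoU is high enough, else open a new ID."""
    out = {}
    next_id = max(prev_masks, default=-1) + 1
    for m in curr_masks:
        best_id, best_iou = None, iou_thresh
        for gid, pm in prev_masks.items():
            union = np.logical_or(m, pm).sum()
            iou = np.logical_and(m, pm).sum() / union if union else 0.0
            if iou > best_iou:
                best_id, best_iou = gid, iou
        if best_id is None:               # unseen object: assign a new ID
            best_id, next_id = next_id, next_id + 1
        out[best_id] = m
    return out
```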

3.2 3D Gaussian Rendering and Grouping

As shown in Figure 2(c), this stage contains all of the paper's core concepts.

  1. Identity Encoding

    A new parameter, the Identity Encoding, is added to each Gaussian on top of its original attributes $S_{\Theta_i}=\{\mathbf{p}_i,\mathbf{s}_i,\mathbf{q}_i,\alpha_i,\mathbf{c}_i\}$. It is a compact vector of length 16; like the Spherical Harmonic (SH) coefficients used to represent color, it is differentiable and learnable.

  2. Grouping via Rendering

    In the process of rendering labels, similar to $\alpha$-blending:

    $$E_{\mathrm{id}}=\sum_{i\in\mathcal{N}}e_i\,\alpha_i'\prod_{j=1}^{i-1}(1-\alpha_j'),$$

    but the notation differs: $e_i$ is the Identity Encoding of length 16 for each Gaussian, and $\alpha_i'$ is a new weight obtained by multiplying the opacity $\alpha_i$ with the projected 2D covariance $\Sigma^{2\mathrm{D}}$, where $\Sigma^{2\mathrm{D}}=JW\Sigma^{3\mathrm{D}}W^{T}J^{T}$ according to [61]. A minimal sketch of the encoding and this rendering is given after this list.

  3. Grouping Loss

    • 2D Identity Loss: Given the rendered 2D feature map $E_{\mathrm{id}}$ as input, first apply a linear layer $f$ to restore the feature dimension to $K+1$ (the total number of mask IDs), then take $\mathrm{softmax}(f(E_{\mathrm{id}}))$ for identity classification, trained with a cross-entropy loss. A sketch of both grouping losses follows after this list.

    • 3D Regularization Loss:

      The 3D Regularization Loss leverages 3D spatial consistency: it enforces the Identity Encodings of the top $k$-nearest 3D Gaussians to be close in feature distance.

      $$\mathcal{L}_{\mathrm{3d}}=\frac{1}{m}\sum_{j=1}^{m}D_{\mathrm{KL}}(P\|Q)=\frac{1}{mk}\sum_{j=1}^{m}\sum_{i=1}^{k}F(e_{j})\log\left(\frac{F(e_{j})}{F(e_{i}^{\prime})}\right)$$

      where $P$ contains the sampled Identity Encoding $e$ of a 3D Gaussian, and the set $Q=\{e_1^{\prime},e_2^{\prime},\ldots,e_k^{\prime}\}$ consists of its $k$ nearest neighbors in 3D space.
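
To make the Identity Encoding and its rendering concrete, here is a minimal PyTorch sketch. The real method renders $E_{\mathrm{id}}$ inside the differentiable CUDA rasterizer alongside color; this sketch only blends the Gaussians covering a single pixel, assumed already sorted front-to-back, and all shapes, sizes, and names are my own assumptions.

```python
# Minimal PyTorch sketch of the Identity Encoding and its alpha-blended
# rendering. The paper renders E_id inside the CUDA rasterizer; here we
# blend the N Gaussians covering one pixel, sorted front-to-back.
import torch

ENC_DIM = 16                # length of each Identity Encoding (paper: 16)
num_gaussians = 100_000     # hypothetical scene size

# One learnable 16-d vector e_i per Gaussian, optimized like SH coefficients.
identity_enc = torch.nn.Parameter(0.01 * torch.randn(num_gaussians, ENC_DIM))

def render_identity(e: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """E_id = sum_i e_i * alpha'_i * prod_{j<i} (1 - alpha'_j).

    e:     (N, ENC_DIM) encodings of the Gaussians hitting this pixel.
    alpha: (N,) effective weights alpha'_i, i.e. the opacity alpha_i
           multiplied by the projected 2D Gaussian (Sigma2D from [61]).
    """
    # Transmittance T_i = prod_{j=1}^{i-1} (1 - alpha'_j), with T_1 = 1.
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    return (e * (alpha * trans).unsqueeze(-1)).sum(dim=0)   # shape (ENC_DIM,)
```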
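And a matching sketch of the two grouping losses. `NUM_IDS` (the $K+1$ classes), the sample size `m`, and the neighbor count `k` are placeholders rather than values from the paper; `classify` plays the role of the linear layer $f$, so $F(\cdot)=\mathrm{softmax}(f(\cdot))$.

```python
# Sketch of the 2D identity loss and the 3D regularization loss.
import torch
import torch.nn.functional as F

ENC_DIM, NUM_IDS = 16, 256                    # NUM_IDS = K+1, placeholder
classify = torch.nn.Linear(ENC_DIM, NUM_IDS)  # the linear layer f

def identity_loss_2d(E_id: torch.Tensor, gt_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on softmax(f(E_id)).
    E_id: (H, W, ENC_DIM) rendered features; gt_ids: (H, W) long mask IDs."""
    logits = classify(E_id)                                # (H, W, NUM_IDS)
    return F.cross_entropy(logits.permute(2, 0, 1).unsqueeze(0),
                           gt_ids.unsqueeze(0))

def reg_loss_3d(enc: torch.Tensor, xyz: torch.Tensor,
                m: int = 1000, k: int = 5) -> torch.Tensor:
    """KL term pulling each sampled Gaussian's ID distribution F(e_j)
    toward those of its k nearest spatial neighbors F(e'_i).
    enc: (N, ENC_DIM) Identity Encodings; xyz: (N, 3) Gaussian centers."""
    idx = torch.randperm(enc.shape[0])[:m]                 # sample m Gaussians
    dist = torch.cdist(xyz[idx], xyz)                      # (m, N)
    nn = dist.topk(k + 1, largest=False).indices[:, 1:]    # drop self match
    p = F.softmax(classify(enc[idx]), dim=-1)              # (m, NUM_IDS)
    q = F.softmax(classify(enc[nn]), dim=-1)               # (m, k, NUM_IDS)
    kl = (p.unsqueeze(1) * (p.unsqueeze(1).log() - q.log())).sum(-1)
    return kl.mean()
```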

3.3 Downstream: Local Gaussian Editing


Inpainting deserves the most attention: first delete the relevant 3D Gaussians, then add a small number of new Gaussians, which are supervised during rendering by the 2D inpainting results from LAMA [41]. A sketch of the deletion step is shown below.
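
Since every Gaussian now carries a group identity, the deletion step reduces to a boolean mask over the Gaussians, as in this hedged sketch; `classify` is the hypothetical linear layer from the loss sketch above, and the new Gaussians added afterwards are optimized against LAMA's 2D inpainted renderings.

```python
# Hedged sketch of object deletion for editing: classify each Gaussian
# by its Identity Encoding and drop the ones in the target group.
import torch

def delete_group(xyz: torch.Tensor, opacity: torch.Tensor,
                 enc: torch.Tensor, classify: torch.nn.Module,
                 target_id: int):
    """Return the scene with all Gaussians of `target_id` removed."""
    pred = classify(enc).argmax(dim=-1)   # (N,) per-Gaussian group ID
    keep = pred != target_id              # boolean mask over Gaussians
    return xyz[keep], opacity[keep], enc[keep]
```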
