[Paper Reading] [Quick Read] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

1. What

The paper proposes 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes, rather than applying 3D-GS to each individual frame.

It uses an encoder-decoder structure to predict the motion of each Gaussian over time. The core idea is to encode the 4D information (x,y,z,t) into 2D planes (a HexPlane) and then use an MLP and a decoder to extract how each Gaussian changes. This approach allows for efficient processing and storage of high-dimensional data while preserving the necessary spatiotemporal information.

2. Preliminary

NeRF offers two ways to handle dynamic scenes, and Gaussian splatting offers one; Fig. 2 of the paper compares them.

All dynamic NeRF algorithms can be formulated as:

$$c,\sigma=\mathcal{M}(\mathbf{x},t)$$

  1. In Fig. 2 (a), canonical-mapping volume rendering transforms each sampled point into a canonical space via the mapping $\phi_{t}:(\mathbf{x},t)\to\Delta\mathbf{x}$ and then computes the color and density along each ray:

    $$c,\sigma=\mathrm{NeRF}(\mathbf{x}+\Delta\mathbf{x}).$$

  2. In Fig. 2 (b), time-aware volume rendering leaves the rendering path unchanged; instead, it directly computes the features of each point at a given time:

    $$c,\sigma=\mathrm{NeRF}(\mathbf{x},t).$$
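
To make the difference between the two formulations concrete, here is a minimal sketch in plain Python. All three functions are hypothetical toy stand-ins (real methods use learned MLPs), and the "scene" is just a soft ball drifting along +x; the names `static_nerf`, `deformation`, `canonical_mapping_query`, and `time_aware_nerf` are made up for illustration.

```python
# Toy stand-ins only: real methods learn all of these functions as networks.

def static_nerf(x):
    """Hypothetical static field: returns (color, density) at a 3D point."""
    density = max(0.0, 1.0 - sum(v * v for v in x))  # a soft unit ball
    color = tuple(0.5 + 0.5 * v for v in x)
    return color, density

def deformation(x, t):
    """Hypothetical phi_t: offset that maps (x, t) back to canonical space."""
    return (-0.5 * t, 0.0, 0.0)  # undo a drift of 0.5 * t along +x

def canonical_mapping_query(x, t):
    """Fig. 2 (a): c, sigma = NeRF(x + delta_x)."""
    dx = deformation(x, t)
    x_canonical = tuple(a + b for a, b in zip(x, dx))
    return static_nerf(x_canonical)

def time_aware_nerf(x, t):
    """Fig. 2 (b): c, sigma = NeRF(x, t); time is simply an extra input."""
    # In this toy the time dependence is hard-coded instead of learned.
    return static_nerf((x[0] - 0.5 * t, x[1], x[2]))

point, time = (0.5, 0.0, 0.0), 1.0
print(canonical_mapping_query(point, time))
print(time_aware_nerf(point, time))
```

Both queries describe the same toy scene, so they return the same value here; the difference is where the time dependence lives (in a separate offset function versus inside the field itself).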

3. How


The network to learn the Gaussian deformation field includes an efficient spatial-temporal structure encoder $\mathcal{H}$ and a Gaussian deformation decoder $\mathcal{D}$ for predicting the deformation of each 3D Gaussian.

3.1 Spatial-Temporal Structure Encoder

The input is a 4D sample $(x,y,z,t)$. It is represented by six 2D planes, one per axis pair: $\{(x,y),(x,z),(y,z),(x,t),(y,t),(z,t)\}$. Each plane is a feature grid of fixed resolution covering the canonical space. A point on a space-time plane such as $(x,t)$ stores features describing how things change along the x-coordinate at different time points. Similarly, the $xy$ plane captures features at different spatial locations (x and y coordinates).
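
As a small illustration of the decomposition, the sketch below projects one 4D sample onto all six axis pairs. The helper name `plane_coordinates` is hypothetical, and the pairs come out in `itertools.combinations` order, which differs from the order listed above but covers the same set.

```python
from itertools import combinations

AXES = ("x", "y", "z", "t")

def plane_coordinates(point):
    """Project a 4D sample onto every axis pair; returns {(a, b): (u, v)}."""
    named = dict(zip(AXES, point))
    return {(a, b): (named[a], named[b]) for a, b in combinations(AXES, 2)}

planes = plane_coordinates((0.1, 0.2, 0.3, 0.9))
for pair, uv in planes.items():
    print(pair, uv)
```

Three of the pairs are purely spatial and three involve `t`, which is where the temporal information is stored.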

Each plane is also stored at several resolution levels $l$, much like mipmapping. To compute the feature, every plane $R_{l}(i,j)$ is sampled by interpolation at the query coordinates; the sampled values are multiplied across the six planes and the results from all levels are joined together:

$$f_{h}=\bigcup_{l}\prod\mathrm{interp}(R_{l}(i,j)).$$

The features sampled from the six planes are then combined into one vector and passed through an MLP, whose output $f_{d}$ is fed to the decoder.
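
The sampling formula above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: coordinates are normalized to [0, 1], `interp` is taken to be bilinear interpolation, the product is read as element-wise multiplication, the union as concatenation, and the planes hold constant values instead of learned features. The helper names (`make_plane`, `bilinear`, `hexplane_feature`) are hypothetical, and the final MLP is omitted.

```python
# Index pairs into (x, y, z, t): xy, xz, yz, xt, yt, zt.
PAIRS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]

def make_plane(res, dim, value):
    """Hypothetical constant plane of shape res x res x dim (stand-in for learned features)."""
    return [[[value] * dim for _ in range(res)] for _ in range(res)]

def bilinear(plane, u, v):
    """Bilinearly interpolate a square feature grid at (u, v) in [0, 1]^2."""
    res = len(plane)
    fu, fv = u * (res - 1), v * (res - 1)
    i, j = min(int(fu), res - 2), min(int(fv), res - 2)  # lower-left cell corner
    a, b = fu - i, fv - j                                # fractional offsets
    corners = zip(plane[i][j], plane[i + 1][j], plane[i][j + 1], plane[i + 1][j + 1])
    return [
        (1 - a) * (1 - b) * p00 + a * (1 - b) * p10 + (1 - a) * b * p01 + a * b * p11
        for p00, p10, p01, p11 in corners
    ]

def hexplane_feature(levels, point):
    """f_h: product over the six planes, concatenated over resolution levels."""
    feature = []
    for planes in levels:  # one list of six planes per level l
        level_feat = [1.0] * len(planes[0][0][0])
        for plane, (i, j) in zip(planes, PAIRS):
            sample = bilinear(plane, point[i], point[j])
            level_feat = [f * s for f, s in zip(level_feat, sample)]
        feature.extend(level_feat)
    return feature

# Two levels (4x4 and 8x8 grids), feature dimension 2, every entry equal to 2.0.
levels = [[make_plane(res, 2, 2.0) for _ in PAIRS] for res in (4, 8)]
f_h = hexplane_feature(levels, (0.1, 0.2, 0.3, 0.9))
print(len(f_h), f_h)
```

With two levels and a feature dimension of 2, the result has 2 x 2 = 4 entries, so its length depends on the number of levels and the feature dimension rather than on the number of planes.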

3.2 Multi-head Gaussian Deformation Decoder

When all the features of 3D Gaussians are encoded, we can compute any desired variable with a multi-head Gaussian deformation decoder $\mathcal{D}=\{\phi_{x},\phi_{r},\phi_{s}\}$:

$$\Delta\mathcal{X}=\phi_{x}(f_{d}),\quad\Delta r=\phi_{r}(f_{d}),\quad\Delta s=\phi_{s}(f_{d}).$$

Finally, we obtain the deformed 3D Gaussians:

$$(\mathcal{X}',r',s',\sigma,\mathcal{C})=(\mathcal{X}+\Delta\mathcal{X},\,r+\Delta r,\,s+\Delta s,\,\sigma,\mathcal{C}).$$
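
The update rule can be sketched as below: position, rotation, and scale each receive an offset predicted from $f_{d}$, while opacity $\sigma$ and color $\mathcal{C}$ pass through unchanged. Everything here is an assumption made for illustration: each head is a single bias-free linear layer (`linear_head`) rather than an MLP, the rotation is stored as a quaternion, and the dictionary layout and weights are invented.

```python
def linear_head(weights):
    """Hypothetical head: one bias-free linear layer standing in for an MLP."""
    def head(f_d):
        return [sum(w * f for w, f in zip(row, f_d)) for row in weights]
    return head

def deform(gaussian, f_d, heads):
    """Apply (X', r', s', sigma, C) = (X + dX, r + dr, s + ds, sigma, C)."""
    out = dict(gaussian)  # opacity and color are carried over unchanged
    for key in ("position", "rotation", "scale"):
        delta = heads[key](f_d)
        out[key] = [v + d for v, d in zip(gaussian[key], delta)]
    return out

gaussian = {
    "position": [0.0, 0.0, 0.0],
    "rotation": [1.0, 0.0, 0.0, 0.0],  # assumed quaternion layout
    "scale": [1.0, 1.0, 1.0],
    "opacity": 0.8,
    "color": [0.2, 0.4, 0.6],
}
f_d = [1.0, 2.0]  # toy encoded feature
heads = {
    "position": linear_head([[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]),
    "rotation": linear_head([[0.0, 0.0]] * 4),
    "scale": linear_head([[0.0, 0.5]] * 3),
}
deformed = deform(gaussian, f_d, heads)
print(deformed)
```

Because each head only produces an offset, a head that outputs zeros leaves its attribute at the canonical value, as the rotation head does in this toy example.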

