[Paper Reading] [NeRF] [SIGGRAPH 2022] Variable Bitrate Neural Fields

(Figure: paper cover)

(Figure: overview slide)

Abstract

Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure.

(Figure: comparison)
(Top left shows a baseline neural radiance field whose uncompressed feature grid weighs 15,207 kB. Our method, shown bottom right, compresses this by a factor of 60x with minimal visual impact (PSNR shown relative to training images). In a streaming setting, a coarse LOD can be displayed after receiving only the first 10 kB of data. All sizes are without any additional entropy encoding of the bit-stream.)

Motivations

Feature grid methods are a special class of neural fields which have enabled state-of-the-art signal reconstruction quality whilst being able to render and train at interactive rates.


Since $\psi_\theta(x, \mathrm{interp}(x, Z)) \approx u(x)$ is a non-linear function, this approach has the potential to reconstruct signals with frequencies above the usual Nyquist limit. Thus coarser grids can be used, motivating their use in signal compression.

Note: the Nyquist rate is a concept from signal processing. To recover a continuous signal perfectly from its samples, the sampling rate must exceed the Nyquist rate, which is twice the frequency of the signal's highest-frequency component.
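
To make the feature-grid formulation concrete, here is a minimal PyTorch sketch of querying a dense grid $Z$ with trilinear interpolation and decoding with a small MLP $\psi$; all shapes, layer sizes, and the `query` helper are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

k, r = 16, 32                                       # feature width k, grid resolution r (m = r**3)
Z = torch.randn(1, k, r, r, r, requires_grad=True)  # trainable feature grid
psi = torch.nn.Sequential(                          # small decoder MLP
    torch.nn.Linear(k + 3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def query(x):
    """psi(x, interp(x, Z)) for coordinates x of shape (n, 3) in [-1, 1]."""
    grid = x.view(1, -1, 1, 1, 3)                   # layout grid_sample expects
    z = F.grid_sample(Z, grid, mode='bilinear',     # 'bilinear' on a 5-D input
                      align_corners=True)           # performs trilinear interpolation
    z = z.reshape(k, -1).t()                        # (n, k) interpolated features
    return psi(torch.cat([x, z], dim=-1))

u_hat = query(torch.rand(8, 3) * 2 - 1)             # approximate field values u(x)
```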

Compression

The feature grid can be represented as a matrix $Z \in \mathbb{R}^{m \times k}$, where $m$ is the number of grid points and $k$ is the dimension of the feature vector at each grid point. Since $m \times k$ may be quite large compared to the size of the MLP, the feature vectors are by far the most memory-hungry component.


These methods require high-resolution feature grids to achieve good quality. This makes them less practical for graphics systems which must operate within tight memory, storage, and bandwidth budgets.


Multiresolution representation

Beyond compactness, it is also desirable for a shape representation to dynamically adapt to the spatially varying complexity of the data, the available bandwidth, and desired level of detail.


Contributions

  • In this paper, we propose the vector-quantized auto-decoder (VQAD) method to directly learn compressed feature-grids for signals without direct supervision.
  • Our representation enables progressive, variable-bitrate streaming of data by being able to scale the quality according to the available bandwidth or desired level of detail.
  • Our method enables end-to-end compression-aware optimization which results in significantly better results than typical vector quantization methods for discrete signal compression.
  • We evaluate our method by compressing feature-grids which represent neural radiance fields (NeRF) and show that our method is able to reduce the storage required by two orders of magnitude with relatively little visual quality loss without entropy encoding.

Method

Overview

We propose the vector-quantized auto-decoder method, which uses the auto-decoder framework with an extra focus on learning compressed representations. The key idea is to replace bulky feature vectors with indices into a learned codebook. (In prior work such as NGLOD [R2], the feature vectors consumed 512 bits each; the codebook indices that replace them in this work may be as small as 4 bits.) These indices, the codebook, and a decoder MLP network are all trained jointly.

(Figure: method overview)

(a) shows the baseline uncompressed version of our data structure, in which we store the bulky feature vectors at every grid vertex, of which there may be millions. In (b), we store a compact $b$-bit code per vertex, which indexes into a small codebook of feature vectors. This reduces the total storage size, and this representation is used directly at inference time. The indexing operation is not differentiable; at training time (c), we replace the indices with vectors $C_i$ of width $2^b$, to which a softmax $\sigma$ is applied before multiplying with the entire codebook. This 'soft-indexing' operation is differentiable, and can be converted back to the 'hard' indices used in (b) through an argmax operation.

Baseline: NGLOD [R2] & DeepSDF [R3]

Compressed Auto-decoder

In order to effectively apply discrete signal compression to feature grids, we leverage the auto-decoder [R3] framework, where only the decoder $f_\gamma^{-1}$ is explicitly constructed.

The auto-decoder is optimized via the error measured after decoding the transform coefficients $v_x$:
$$\hat{V},\,\hat{\gamma} \;=\; \operatorname*{arg\,min}_{V,\,\gamma}\ \sum_{x} \big\lVert f_\gamma^{-1}(v_x) - u_x \big\rVert_2^2 \qquad (4)$$

A strength of the auto-decoder is that it can reconstruct transform coefficients with respect to supervision in a domain different from the signal we wish to reconstruct. We define a differentiable forward map as an operator $F$ which lifts a signal onto another domain. For radiance field reconstruction, the signal of interest $u_x$ is volumetric density and plenoptic color, while the supervision is over 2D images. In this case, $F$ represents a differentiable renderer.


$$\hat{V},\,\hat{\gamma} \;=\; \operatorname*{arg\,min}_{V,\,\gamma}\ \sum_{x} \big\lVert F\big(f_\gamma^{-1}(V)\big)_x - y_x \big\rVert_2^2 \qquad (5)$$
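
The auto-decoder idea can be illustrated with a toy sketch: only the decoder is constructed explicitly, and the coefficients $V$ are free variables optimized by backpropagating through a forward map. Here `F_map` is a stand-in linear operator and all sizes are assumptions; in the paper, $F$ is a differentiable renderer:

```python
import torch

f = torch.nn.Linear(8, 32)                     # decoder f_gamma^{-1}
V = torch.zeros(100, 8, requires_grad=True)    # latent coefficients, one row per block
F_map = torch.randn(32, 16)                    # stand-in differentiable forward map
y = torch.randn(100, 16)                       # supervision in the lifted domain
opt = torch.optim.Adam([V] + list(f.parameters()), lr=1e-2)
for _ in range(200):
    loss = ((f(V) @ F_map - y) ** 2).mean()    # error measured after F, as in Eq. (5)
    opt.zero_grad(); loss.backward(); opt.step()
```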

Feature-Grid Compression

The feature grid is a matrix $Z \in \mathbb{R}^{m \times k}$, where $m$ is the size of the grid and $k$ is the feature vector dimension. Local embeddings are queried from the feature grid with interpolation at a coordinate $x$ and fed to an MLP $\psi$ to reconstruct continuous signals. The feature grid is learned by optimizing equation (6), where interp represents trilinear interpolation of the 8 feature grid points surrounding $x$. The forward map $F$ is applied to the output of the MLP $\psi$; in our experiments, it is a differentiable renderer [R1] and $y$ are the training image pixels.

The optimization problem for general feature-grid methods:
$$\theta^{*},\, Z^{*} \;=\; \operatorname*{arg\,min}_{\theta,\, Z}\ \sum_{x} \big\lVert F\big(\psi(x, \theta, \mathrm{interp}(x, Z))\big) - y_x \big\rVert_2^2 \qquad (6)$$

The feature grid 𝑍 can be treated as a block-based decomposition of the signal where each row vector (block) of size 𝑘 controls the local spatial region.

特征网格𝑍可以被视为信号的基于块的分解,其中大小为𝑘的每个行向量(块)控制局部空间区域。

Hence, we consider block-based inverse transforms $f_\gamma^{-1}$ with block coefficients $V$. Since we want to learn the compressed features $Z = f_\gamma^{-1}(V)$, we substitute $Z$ in equation (6):

$$\theta^{*},\, \gamma^{*},\, V^{*} \;=\; \operatorname*{arg\,min}_{\theta,\, \gamma,\, V}\ \sum_{x} \big\lVert F\big(\psi(x, \theta, \mathrm{interp}(x, f_\gamma^{-1}(V)))\big) - y_x \big\rVert_2^2 \qquad (7)$$

Considering $F(\psi(x, \theta, \mathrm{interp}(x, Z)))$ as a map which lifts the discrete signal $Z$ to a continuous signal where the supervision (and other operations) are applied, we can see that this is equivalent to a block-based compressed auto-decoder.


This allows us to work only with the discrete signal $Z$ when designing a compressive inverse transform $f_\gamma^{-1}$ for the feature grid; in our case, the vector-quantized inverse transform directly learns compressed representations.

In other words, the discrete signal $Z$ can be vector-quantized to learn a compressed representation, from which the scene is reconstructed via the compressive inverse transform.

Vector-Quantization

We define our compressed representation $V$ as an integer vector $V \in \mathbb{Z}^m$ with entries in the range $[0, 2^b - 1]$. This is used as an index into a codebook matrix $D \in \mathbb{R}^{2^b \times k}$, where $m$ is the number of grid points, $k$ is the feature vector dimension, and $b$ is the bitwidth. Concretely, we define our decoder function $f_D^{-1}(V) = D[V]$, where $[\cdot]$ is the indexing operation.


$$\theta^{*},\, D^{*},\, V^{*} \;=\; \operatorname*{arg\,min}_{\theta,\, D,\, V}\ \sum_{x} \big\lVert F\big(\psi(x, \theta, \mathrm{interp}(x, D[V]))\big) - y_x \big\rVert_2^2 \qquad (8)$$

Solving this optimization problem is difficult because indexing is a non-differentiable operation with respect to the integer index $V$. (This corresponds to stage (b) in the overview figure.)
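
A small sketch (shapes illustrative) of why equation (8) resists direct optimization: the integer indices enter only through an indexing operation, so gradients reach the codebook $D$ but can never reach $V$:

```python
import torch

b, k, m = 4, 16, 1000
D = torch.randn(2**b, k, requires_grad=True)  # codebook
V = torch.randint(0, 2**b, (m,))              # b-bit integer indices: no gradient
Z = D[V]                                      # decoded feature grid, shape (m, k)
Z.sum().backward()
print(D.grad is not None, V.requires_grad)    # True False
```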

As a solution, in training we propose to represent the integer indices with a softened matrix $C \in \mathbb{R}^{m \times 2^b}$, from which the index vector $V$ can be obtained by a row-wise argmax. We can then replace our index lookup with a simple matrix product and obtain the following optimization problem, where the softmax function $\sigma$ is applied row-wise to the matrix $C$. This optimization problem is now differentiable.

(This corresponds to stage (c) in the overview figure.)

$$\theta^{*},\, D^{*},\, C^{*} \;=\; \operatorname*{arg\,min}_{\theta,\, D,\, C}\ \sum_{x} \big\lVert F\big(\psi(x, \theta, \mathrm{interp}(x, \sigma(C)\,D))\big) - y_x \big\rVert_2^2 \qquad (9)$$

In practice, we adopt a straight-through estimator to make the loss aware of the hard indexing during training; that is, we use Equation 8 in the forward pass and Equation 9 in the backward pass.

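A minimal sketch of the soft indexing and the straight-through trick, with illustrative shapes; the `detach` pattern below is a common way to realize 'hard values in the forward pass, soft gradients in the backward pass':

```python
import torch

m, b, k = 1000, 4, 16
C = torch.randn(m, 2**b, requires_grad=True)   # softened index matrix (training only)
D = torch.randn(2**b, k, requires_grad=True)   # codebook

soft = torch.softmax(C, dim=-1) @ D            # differentiable lookup, Eq. (9)
hard = D[C.argmax(dim=-1)]                     # hard lookup, Eq. (8)
Z = soft + (hard - soft).detach()              # forward: hard; backward: soft
```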

At storage and inference time, we discard the softened matrix $C$ and only store the integer vector $V$. Even without entropy coding, this gives us a compression ratio of $16mk / (mb + k\,2^b)$, which can be orders of magnitude when $b$ is small and $m$ is large. We generally observe $m$ to be on the order of millions, and evaluate $b \in \{4, 6\}$ in our experiments. In contrast to using a hash function [R6] for indexing, we need to store $b$-bit integers in the feature grid, but we are able to use a much smaller codebook (table) due to the learned adaptivity of the indices.

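Plugging illustrative numbers into the stated ratio shows the scale of the savings (the values of $m$ and $k$ here are assumptions, not the paper's configurations):

```python
m, k, b = 10_000_000, 16, 4          # grid points, feature width, bitwidth
ratio = 16 * m * k / (m * b + k * 2**b)
print(round(ratio))                  # ~64x before any entropy coding
```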

Streaming Level of Detail

Rather than a single-resolution feature grid, we arrange $V$ in a multiresolution sparse octree as in NGLOD [R2] to facilitate streaming level of detail. Thus, for a given coordinate, multiple feature vectors $z$ are obtained, one from each tree level, which can then be summed (i.e. in a Laplacian pyramid fashion) or concatenated before being passed to the MLP. We train a separate codebook for each level of the tree. Similarly to NGLOD [R2], we also train multiple levels of detail jointly; see the sketch below.

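Below is a hypothetical sketch of the per-level codebook lookup and Laplacian-pyramid-style summation; the `query_feature` helper and the way corner indices and trilinear weights are supplied are assumptions, since a real implementation walks a sparse octree:

```python
import torch

k, b, num_levels = 16, 4, 3
codebooks = [torch.randn(2**b, k) for _ in range(num_levels)]  # one codebook per level

def query_feature(corner_idx, weights, max_level):
    # corner_idx[l]: (8,) codebook indices at the corners of the enclosing cell at level l
    # weights[l]:    (8,) trilinear weights of the query point within that cell
    z = torch.zeros(k)
    for l in range(max_level + 1):           # streaming more levels refines z
        z = z + weights[l] @ codebooks[l][corner_idx[l]]
    return z                                 # summed feature, fed to the MLP

idx = [torch.randint(0, 2**b, (8,)) for _ in range(num_levels)]
w = [torch.softmax(torch.rand(8), 0) for _ in range(num_levels)]
coarse = query_feature(idx, w, 0)            # usable preview from level 0 alone
fine = query_feature(idx, w, num_levels - 1) # full detail once all levels arrive
```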

(Figure: octree level-of-detail structure, from NGLOD [R2])

Experiments

Dataset: the RTMV dataset.

Baseline

Uncompressed, the feature-grid baseline achieves the highest quality, but its storage is much larger than that of pure-MLP methods.

Feature Grid Compression

The baseline is compressed in three ways: low-rank approximation, post-hoc k-means vector quantization (kmVQ), and the learned vector quantization of this paper. Compared with low-rank approximation, the vector-quantization approaches achieve a markedly higher compression ratio in storage.
Comparing the two vector-quantization variants, the post-hoc approach causes visible discoloration and a substantial PSNR drop.

The method also applies to other kinds of compression, such as truncated signed distance functions (TSDFs): some artifacts are introduced relative to the uncompressed version, but storage drops significantly.

Random vs. Learned Indices

Hash-based compression [R6] assigns indices in a more random fashion; in that case the index vector $V$ need not be stored, but the table that must be stored is much larger.
At similar compression ratios, the learned indices of this paper reconstruct with less noise.

Streaming Level of Detail

Mip-NeRF [R4] adapts level of detail through different cone widths, but its bitrate is constant. VQ-AD achieves compression and level-of-detail adaptation at the same time, making it better suited to progressive streaming.

VQ-AD reaches bitrates that are orders of magnitude smaller without sacrificing quality as severely as post-hoc methods such as kmVQ. The figure shows that the VQ-AD representation has a variable bitrate and encodes multiple resolutions, which can be streamed progressively at different levels of detail. However, the method's training memory overhead prevented evaluating higher bitrates.

Paper Notes

  1. What problem is addressed in the paper?
    ANS: Streamable, compressive representation for feature-grid NeRF.
  2. Is it a new problem? If so, why does it matter? If not, why does it still matter?
    ANS: No. It still matters because feature grids dominate storage; the novelty is in formulating the dictionary optimization as a vector-quantized auto-decoder problem.
  3. What is the key to the solution? What is the main contribution?
    ANS:
    (1) Vector quantization. Reduce the storage required by two orders of magnitude with relatively little visual quality loss.
    (2) Variable bitrate streaming of data. Scale the quality according to the available bandwidth or desired level of detail.
  4. How the experiments sufficiently support the claims?
    ANS: Achieves comparable quality while reducing bitrate significantly through learned vector quantization.
  5. What can we learn from ablation studies?
    ANS:
    (1) K-means quantization has visible discoloration.
    (2) Learned indices are able to reconstruct with less noise.
    (3) Can also be applied in contexts other than fitting radiance fields, like truncated signed distance functions (TSDF).
    (4) Suitable for progressive streaming and level of detail.
  6. Potential fundamental flaws; how this work can be improved?
    ANS:
    (1) Needs RGB-D input: NGLOD's octree is initialized from depth maps.
    (2) Needs large memory and compute resources at training time.

References

Paper: https://arxiv.org/abs/2206.07707
Project Page: https://nv-tlabs.github.io/vqad/
Code / NVIDIA Kaolin Wisp: https://github.com/NVIDIAGameWorks/kaolin-wisp
(A PyTorch library powered by NVIDIA Kaolin Core for working with neural fields, including NeRFs, NGLOD, Instant-NGP, and VQAD.)

Related works
[R1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[R2] Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes
[R3] DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
[R4] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[R5] Plenoxels: Radiance Fields without Neural Networks
[R6] Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
[R7] Compressing Volumetric Radiance Fields to 1 MB


