[Paper Reading] [NeRF] GeCoNeRF: Few-shot Neural Radiance Fields via Geometric Consistency


Abstract

We present a novel framework to regularize Neural Radiance Field (NeRF) [R1] in a few-shot setting with a geometry-aware consistency regularization. The proposed approach leverages a rendered depth map at an unobserved viewpoint to warp sparse input images to that viewpoint and impose them as pseudo ground truths to facilitate learning of NeRF. By encouraging such geometry-aware consistency at a feature level instead of using a pixel-level reconstruction loss, we regularize the NeRF at semantic and structural levels while allowing for modeling view-dependent radiance to account for color variations across viewpoints. We also propose an effective method to filter out erroneous warped solutions, along with training strategies to stabilize training during optimization. We show that our model achieves competitive results compared to state-of-the-art few-shot NeRF models.

fig1

Given an image $I_i$ and an estimated depth map $D_j$ of the $j$-th unobserved viewpoint, we warp the image $I_i$ to that novel viewpoint as $I_{i \to j}$ by establishing a geometric correspondence between the two viewpoints. Using the warped image as a pseudo ground truth, we cause the rendered image at the unseen viewpoint, $I_j$, to be consistent in structure with the warped image, with occlusions taken into consideration.


Motivations

However, despite its impressive performance, NeRF requires a large number of dense, well-distributed calibrated images for optimization, which limits its applicability. When limited to sparse observations, NeRF easily overfits to the input view images and is unable to reconstruct correct geometry.


The task that directly addresses this problem, also called few-shot NeRF, aims to optimize a high-fidelity neural radiance field in such sparse scenarios, countering the under-constrained nature of the problem by introducing additional priors. Specifically, previous works attempted to solve this by utilizing semantic features [R2], entropy minimization [R3], SfM depth priors [R4] or normalizing flows [R5], but their reliance on hand-crafted methods or their inability to capture local, fine structures limits their performance.


To alleviate these issues, we propose a novel regularization technique that enforces a geometric consistency across different views with a depth-guided warping and a geometry-aware consistency modeling. Based on these, we propose a novel framework, called Neural Radiance Fields with Geometric Consistency (GeCoNeRF), for training neural radiance fields in a few-shot setting.


Contributions

  • We can leverage a depth rendered by NeRF to warp sparse input images to novel viewpoints, and use them as pseudo ground truths to facilitate learning of fine details and high-frequency features by NeRF.
  • By encouraging images rendered at novel views to model warped images with a consistency loss, we can successfully constrain both geometry and appearance to boost the fidelity of neural radiance fields even in a highly under-constrained few-shot setting.
  • We also present a method to generate a consistency mask to prevent inconsistently warped information from harming the network.
  • Finally, we provide coarse-to-fine training strategies for sampling and pose generation to stabilize optimization of the model.

Methodology

Overview

NeRF inherently renders not only a color image but a depth map as well. Combined with the known viewpoint difference, the rendered depths can be used to define a geometric correspondence between two arbitrary views.


Specifically, we consider a depth image $D_j$ rendered by the NeRF model at the unseen viewpoint $j$. By formulating a warping function $\psi(I_i; D_j, R_{i \to j})$ that warps an image $I_i$ according to the depth $D_j$ and the viewpoint difference $R_{i \to j}$, we can encourage a consistency between the warped image $I_{i \to j} = \psi(I_i; D_j, R_{i \to j})$ and the rendered image $I_j$ at the $j$-th unseen viewpoint, which in turn improves the few-shot novel view synthesis performance.


fig2

GeCoNeRF regularizes the networks with consistency modeling. The consistency loss $L^M_{cons}$ is applied between the unobserved-viewpoint image and the warped observed-viewpoint image, while the disparity regularization loss $L_{reg}$ regularizes depth at the seen viewpoints.


Preliminaries

NeRF [R1]
eq1
eq2
eq3
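
For reference (a reminder of the standard NeRF formulation rather than a verbatim copy of the paper's eq. 1–3), volume rendering composites per-sample colors $c_k$ and densities $\sigma_k$ along a ray $r_p(t) = o + t d_p$:

$$
\hat{C}(r_p) = \sum_{k} T_k \left(1 - \exp(-\sigma_k \delta_k)\right) c_k,
\qquad
T_k = \exp\Big(-\sum_{k' < k} \sigma_{k'} \delta_{k'}\Big),
$$

where $\delta_k$ is the distance between adjacent samples; NeRF is trained by minimizing the photometric error $L_{obs} = \sum_p \| \hat{C}(r_p) - C(r_p) \|_2^2$ over rays of the observed views.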

Rendered Depth-Guided Warping

To render an image at novel viewpoints, we first sample a random camera viewpoint, from which corresponding ray vectors are generated in a patch-wise manner. As NeRF outputs density and color values of sampled points along the novel rays, we use the recovered density values to render a consistent depth map. Following [R1], we formulate per-ray depth values as a weighted composition of the distances traveled from the origin. Since the ray $r_p$ corresponding to pixel $p$ is parameterized as $r_p(t) = o + t d_p$, the depth rendering is defined similarly to the color rendering,

eq4

where $D(r_p)$ is the predicted depth along the ray $r_p$. As described in Figure 1, we use the rendered depth map $D_j$ to warp the input ground-truth image $I_i$ to the $j$-th unseen viewpoint and acquire a warped image $I_{i \to j}$, defined by the process $I_{i \to j} = \psi(I_i; D_j, R_{i \to j})$. More specifically, a pixel location $p_j$ in the target unseen-viewpoint image is transformed to $p_{j \to i}$ in the source-viewpoint image by the viewpoint difference $R_{j \to i}$ and the camera intrinsic parameters $K$,


eq5

where $\sim$ indicates approximate equality and the projected coordinate $p_{j \to i}$ is a continuous value. With a differentiable sampler, we extract the color values of $I_i$ at $p_{j \to i}$. More formally, the transformation process can be written as follows

eq6

where $\mathrm{sampler}(\cdot)$ is a bilinear sampling operator.

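To make the warping step concrete, below is a minimal PyTorch-style sketch of depth-guided inverse warping (an illustration, not the authors' released code; tensor shapes, the function name `inverse_warp`, and the pinhole conventions are assumptions). It back-projects each target pixel $p_j$ with the rendered depth $D_j$, transforms it into the source view with the relative pose, projects it with the intrinsics $K$, and bilinearly samples the source image:

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_tgt, T_tgt_to_src, K):
    """Warp a source image into a target view using the target-view depth.

    img_src:      (1, 3, H, W) source-view image I_i
    depth_tgt:    (1, 1, H, W) depth D_j rendered at the target (unseen) view
    T_tgt_to_src: (4, 4) relative pose mapping target-view points to the source frame
    K:            (3, 3) camera intrinsics
    Returns the warped image I_{i->j} of shape (1, 3, H, W).
    """
    _, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid of the target view in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, H*W)

    # Back-project to 3D camera coordinates of the target view.
    cam_tgt = torch.linalg.inv(K) @ pix * depth_tgt.reshape(1, -1)          # (3, H*W)

    # Move the points into the source camera frame.
    cam_tgt_h = torch.cat([cam_tgt, torch.ones(1, H * W, device=device)], dim=0)
    cam_src = (T_tgt_to_src @ cam_tgt_h)[:3]                                # (3, H*W)

    # Project into the source image plane.
    proj = K @ cam_src
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                # (2, H*W)

    # Normalize to [-1, 1] for grid_sample and bilinearly sample source colors.
    u = 2.0 * uv[0] / (W - 1) - 1.0
    v = 2.0 * uv[1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(1, H, W, 2)
    return F.grid_sample(img_src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```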

Acceleration. Rendering a full image with NeRF volumetric rendering is computationally heavy and extremely time-consuming, requiring tens of seconds for a single iteration. To overcome the computational bottleneck of full-image rendering and warping, rays are sampled on a strided grid to form a patch with stride $s$, which we set to 2. After the rays undergo volumetric rendering, we upsample the low-resolution depth map back to the original resolution with bilinear interpolation. This full-resolution depth map is used for the inverse warping. In this way, detailed full-resolution warped patches can be generated at only a fraction of the computational cost that would be required when rendering the original-sized ray batch.

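A sketch of this acceleration under the same assumptions as above (the `render_depth_patch` call standing in for the NeRF rendering routine is a placeholder): depth is rendered on a grid strided by $s = 2$, then bilinearly upsampled before warping.

```python
import torch.nn.functional as F

def render_depth_strided(render_depth_patch, H, W, stride=2):
    """Render depth on a strided ray grid and upsample to full resolution.

    render_depth_patch: callable that volume-renders a (1, 1, H//stride, W//stride)
                        depth map from rays placed every `stride` pixels
                        (placeholder for the NeRF rendering routine).
    """
    depth_low = render_depth_patch(H // stride, W // stride)   # (1, 1, H/s, W/s)
    depth_full = F.interpolate(depth_low, size=(H, W),
                               mode="bilinear", align_corners=True)
    return depth_full  # used as depth_tgt in inverse_warp above
```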

Consistency Modeling

Given the rendered patch $I_j$ at the $j$-th viewpoint and the patch $I_{i \to j}$ warped with depth $D_j$ and viewpoint difference $R_{i \to j}$, we define the consistency between the two to encourage additional regularization for globally consistent rendering.


One viable option is to naïvely apply a pixel-wise image reconstruction loss $L_{pix}$,

eq7
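
A plausible instantiation of this photometric baseline (an assumption on our part; the exact norm and normalization in eq. 7 may differ) is

$$
L_{pix} = \sum_{p} \big\| I_j(p) - I_{i \to j}(p) \big\|_1 .
$$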

However, we observe that this simple strategy is prone to failure on reflective, non-Lambertian surfaces whose appearance changes greatly across viewpoints. In addition, geometry-related problems such as self-occlusion and artifacts prohibit naïve use of the pixel-wise image reconstruction loss for regularization at unseen viewpoints.


Feature-level consistency modeling

To overcome these issues, we propose a masked feature-level regularization loss that encourages structural consistency while ignoring view-dependent radiance effects.


Given an image $I$ as input, we use a convolutional network to extract multi-level feature maps $f_{\varphi,l}(I) \in \mathbb{R}^{H_l \times W_l \times C_l}$, with channel depth $C_l$ for the $l$-th layer. To measure the feature-level consistency between the warped image $I_{i \to j}$ and the rendered image $I_j$, we extract their feature maps from $L$ layers and compute the difference within each feature-map pair extracted from the same layer.


eq8

In accordance with the idea of using the warped image $I_{i \to j}$ as a pseudo ground truth, we allow gradient backpropagation to pass only through the rendered image and block it for the warped image. By applying the consistency loss at multiple levels of feature maps, we cause $I_j$ to model after $I_{i \to j}$ on both the semantic and the structural level.


For this loss function $L_{cons}$, we find the $\ell_1$ distance most suited to our task and use it to measure consistency across the feature difference maps. Empirically, we have found that the VGG-19 network yields the best performance in modeling consistency, likely due to the absence of normalization layers that would scale down the absolute values of feature differences. Therefore, we employ the VGG-19 network as our feature extractor $f_\varphi$ throughout all of our models.

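A minimal sketch of this masked, multi-level feature consistency loss in PyTorch (an illustration, not the authors' implementation; the chosen VGG-19 layer indices, the assumption that inputs are already ImageNet-normalized, and the mask layout are our own). The warped pseudo ground truth is detached so gradients flow only through the rendered patch, and per-layer masks are applied to the feature differences before the $\ell_1$ reduction:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Slices of VGG-19 features used as multi-level extractors (layer choices are assumptions).
_vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYER_IDS = [3, 8, 17, 26]  # relu1_2, relu2_2, relu3_4, relu4_4

def vgg_features(x):
    """Return the list of multi-level feature maps f_{phi,l}(x)."""
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _LAYER_IDS:
            feats.append(h)
    return feats

def masked_consistency_loss(rendered, warped, mask):
    """L^M_cons between rendered patch I_j and warped pseudo-GT I_{i->j}.

    rendered, warped: (1, 3, H, W) patches (assumed ImageNet-normalized)
    mask:             (1, 1, H, W) binary consistency mask M
    """
    feats_r = vgg_features(rendered)
    feats_w = vgg_features(warped.detach())      # block gradients through the pseudo-GT
    loss = 0.0
    for fr, fw in zip(feats_r, feats_w):
        # Resize the mask to the l-th layer resolution with nearest-neighbor sampling.
        m = F.interpolate(mask, size=fr.shape[-2:], mode="nearest")
        diff = (fr - fw).abs() * m               # masked l1 feature difference
        loss = loss + diff.sum() / (m.sum() * fr.shape[1]).clamp(min=1.0)
    return loss
```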

It should be noted that our loss function differs from that of DietNeRF [R2]: while DietNeRF's consistency loss is limited to regularizing the radiance field at a globally semantic level, our loss, combined with the warping module, also gives the network rich information at a local, structural level. In other words, contrary to DietNeRF enforcing only high-level feature consistency, our use of multiple levels of a convolutional network for the feature difference calculation can be interpreted as enforcing a mixture of all levels, from high-level semantic consistency to low-level structural consistency.


Occlusion handling

In order to prevent imperfect and distorted warpings caused by erroneous geometry from influencing the model and degrading the overall reconstruction quality, we construct a consistency mask $M_l$ that lets NeRF ignore regions with geometric inconsistencies.


fig3

Instead of applying the mask to the images before feeding them into the feature extractor network, we apply resized masks $M_l$ directly to the feature maps, after nearest-neighbor down-sampling to match the dimensions of the $l$-th layer outputs.


We generate $M$ by measuring the consistency between depth values rendered from the target viewpoint and the source viewpoint,


eq9

where $[\cdot]$ is the Iverson bracket, and $p_{j \to i}$ refers to the corresponding pixel in the source viewpoint $i$ for the reprojected target pixel $p_j$ of the $j$-th viewpoint. Here we measure the Euclidean distance between depth points rendered from the target and source viewpoints as the criterion for threshold masking.


fig4

Mask generation by comparing geometry between the novel view $j$ and the source view $i$, with $I_{i \to j}$ being the warped patch generated for view $j$. In (a) and (b), warping does not occur correctly due to artifacts and self-occlusion, respectively. Such pixels are masked out by $M_l$, allowing only (c), with accurate warping, as a training signal for the rendered image $I_j$.


If the distance between the two points is greater than a given threshold value $\tau$, we determine the two rays to be rendering depths of separate surfaces and mask out the corresponding pixel in the rendered image $I_j$. The process takes place over every pixel of $I_j$ to generate a mask $M$ of the same size as the rendered patch. Through this technique, we filter out problematic solutions at the feature level and regularize NeRF with only high-confidence image features.

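A sketch of this depth-consistency check (an illustration under the same assumptions as the warping sketch above; the exact comparison in the paper's eq. 9 may differ). The target-view depth is back-projected into the source frame, compared against the 3D point implied by the source-view depth sampled at the reprojected pixel, and pixels whose distance exceeds $\tau$ are masked out:

```python
import torch
import torch.nn.functional as F

def consistency_mask(depth_tgt, depth_src, T_tgt_to_src, K, tau=0.05):
    """Binary mask M over target pixels whose geometry agrees between the two views.

    depth_tgt: (1, 1, H, W) depth rendered at the unseen target view j
    depth_src: (1, 1, H, W) depth rendered at the observed source view i
    T_tgt_to_src: (4, 4) relative pose; K: (3, 3) intrinsics; tau: distance threshold.
    """
    _, _, H, W = depth_tgt.shape
    device = depth_tgt.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Target pixels back-projected to 3D, then expressed in the source camera frame.
    cam_tgt = torch.linalg.inv(K) @ pix * depth_tgt.reshape(1, -1)
    cam_tgt_h = torch.cat([cam_tgt, torch.ones(1, H * W, device=device)], dim=0)
    pts_in_src = (T_tgt_to_src @ cam_tgt_h)[:3]                       # (3, H*W)

    # Reproject into the source image and sample the source-view depth there.
    proj = K @ pts_in_src
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[0] / (W - 1) - 1, 2 * uv[1] / (H - 1) - 1],
                       dim=-1).reshape(1, H, W, 2)
    d_src = F.grid_sample(depth_src, grid, mode="bilinear", align_corners=True)

    # 3D point implied by the source-view depth at the reprojected pixel.
    pix_src = torch.cat([uv, torch.ones(1, H * W, device=device)], dim=0)
    pts_from_src = torch.linalg.inv(K) @ pix_src * d_src.reshape(1, -1)

    # Iverson bracket: keep pixels where the two surface points coincide within tau.
    dist = (pts_in_src - pts_from_src).norm(dim=0).reshape(1, 1, H, W)
    return (dist < tau).float()
```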

Based on this, the consistency loss $L_{cons}$ is extended as,

eq10

where $m_l$ is the sum of non-zero values in the mask $M_l$.


Edge-aware disparity regularization

Since our method depends on the quality of the depth rendered by NeRF, we directly impose additional regularization on the rendered depth to facilitate optimization. We further encourage local depth smoothness on rendered scenes by imposing an $\ell_1$ penalty on the disparity gradients within randomly sampled patches of the input views. In addition, inspired by [R7], we take into account the fact that depth discontinuities in depth maps are likely to be aligned with gradients of the corresponding color image, and introduce an edge-aware term with image gradients $\partial I$ to weight the disparity values. Specifically, following [R7], we regularize for edge-aware depth smoothness,

eq11

where $D^*_i = D_i / \overline{D_i}$ is the mean-normalized inverse depth from [R7], used to discourage shrinking of the estimated depth.

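As a reference, the edge-aware smoothness term from [R7] (and monodepth-style follow-ups) is commonly implemented as below; this is a sketch of that standard formulation, not necessarily identical in every detail to the paper's eq. 11:

```python
import torch

def edge_aware_smoothness(disp, img):
    """Edge-aware l1 smoothness on mean-normalized disparity.

    disp: (1, 1, H, W) disparity (inverse depth) of a sampled patch
    img:  (1, 3, H, W) corresponding color patch
    """
    disp_n = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)   # mean-normalize

    grad_disp_x = (disp_n[..., :, :-1] - disp_n[..., :, 1:]).abs()
    grad_disp_y = (disp_n[..., :-1, :] - disp_n[..., 1:, :]).abs()

    # Down-weight disparity gradients at image edges via exp(-|dI|).
    grad_img_x = (img[..., :, :-1] - img[..., :, 1:]).abs().mean(dim=1, keepdim=True)
    grad_img_y = (img[..., :-1, :] - img[..., 1:, :]).abs().mean(dim=1, keepdim=True)

    loss_x = (grad_disp_x * torch.exp(-grad_img_x)).mean()
    loss_y = (grad_disp_y * torch.exp(-grad_img_y)).mean()
    return loss_x + loss_y
```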

Training Strategy

Total losses

We optimize our model with a final loss that combines the original NeRF pixel-wise reconstruction loss $L_{obs}$ with two regularization losses: $L^M_{cons}$ for unobserved-view consistency modeling and $L_{reg}$ for disparity regularization.

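In equation form, this amounts to a weighted sum along the lines of

$$
L_{total} = L_{obs} + \lambda_{cons}\, L^{M}_{cons} + \lambda_{reg}\, L_{reg},
$$

where the weights $\lambda_{cons}$ and $\lambda_{reg}$ are placeholders; their actual values are hyper-parameters of the paper not reproduced in this summary.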

Progressive camera pose generation

The difficulty of accurate warping grows the farther the target view is from the source view, which means that sampling distant camera poses right from the beginning of training may have negative effects on our model. Therefore, we first generate camera poses near the source views, then progressively farther away as training proceeds. We sample a noise value uniformly from the interval $[-\beta, +\beta]$ and add it to the original Euler rotation angles of the input view poses, with the parameter $\beta$ growing linearly from 3 to 9 degrees throughout the course of optimization. This design choice can be intuitively understood as stabilizing locations near observed viewpoints at the start and propagating this regularization to farther locations, where warping becomes progressively more difficult.

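A small sketch of this schedule (function and variable names are illustrative, not from the released code): $\beta$ is interpolated linearly over training, and a uniform perturbation in $[-\beta, +\beta]$ is added to the source view's Euler angles.

```python
import numpy as np

def sample_unseen_pose(euler_src, step, total_steps,
                       beta_start=3.0, beta_end=9.0):
    """Perturb a source view's Euler angles (degrees) to obtain an unseen pose.

    beta grows linearly from beta_start to beta_end over the course of training.
    """
    t = min(step / total_steps, 1.0)
    beta = beta_start + (beta_end - beta_start) * t
    noise = np.random.uniform(-beta, beta, size=3)   # one offset per Euler angle
    return euler_src + noise
```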

Positional encoding frequency annealing

We find that most of the artifacts that occur are high-frequency occlusions filling the space between the scene and the camera. This behaviour can be effectively suppressed by constraining the order of the Fourier positional encoding to low dimensions. For this reason, we adopt the coarse-to-fine frequency annealing strategy previously used by [R8] to regularize our optimization. This strategy forces our network to primarily optimize from coarse, low-frequency details, where self-occlusions and fine features are minimized, easing the difficulty of the warping process in the early stages of training. Following [R8], the annealing parameter is $\alpha(t) = m t / K$, with $m$ the number of encoding frequencies, $t$ the iteration step, and the hyper-parameter $K$ set to 15k.

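For reference, the coarse-to-fine annealing of [R8] windows the positional-encoding frequency bands with a per-band weight; below is a sketch of that windowing following the Nerfies formulation (how exactly it is plugged into GeCoNeRF's encoder is an assumption):

```python
import numpy as np

def frequency_window(alpha, num_freqs):
    """Per-frequency weights w_k in [0, 1] for coarse-to-fine annealing (Nerfies-style).

    alpha = m * t / K increases with training step t; bands with index k < alpha
    are fully enabled, the band around alpha is smoothly ramped in, higher bands stay off.
    """
    k = np.arange(num_freqs)
    return 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - k, 0.0, 1.0)))

# Example: m = 10 frequency bands, K = 15_000 iterations.
m, K = 10, 15_000
for t in [0, 5_000, 15_000, 30_000]:
    alpha = m * t / K
    print(t, frequency_window(alpha, m).round(2))
```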

Experiments

The experiments are conducted on two datasets, reconstructing scenes from three input views. Compared with the baseline methods, the proposed approach learns finer details, reconstructs smoother surfaces, and produces fewer artifacts in the background.
tab1
fig5
fig6
The ablation study shows that the feature-level consistency loss alone already brings a significant improvement over the baseline [R6].
tab2
fig8
tab4
Adding the consistency mask further improves both appearance and geometry.
The smoothness constraint on depth also benefits scene reconstruction.
fig7
The two progressive training strategies likewise improve the stability and performance of the model.
tab3
In addition, generating pseudo labels only by warping between the known views, instead of warping to arbitrary unseen viewpoints with the rendered depth and camera parameters as proposed, still causes a significant drop in results because of the large differences between the known viewpoints.
fig9

Paper Notes

  1. What problem is addressed in the paper?
    ANS: Few-shot NeRF/Sparse inputs.
  2. Is it a new problem? If so, why does it matter? If not, why does it still matter?
    ANS: No. It still matters because NeRF requires many dense, well-distributed calibrated images and overfits under sparse inputs, while prior few-shot methods rely on hand-crafted priors or fail to recover local, fine structures.
  3. What is the key to the solution? What is the main contribution?
    ANS:
    (1) Rendered depth-guided warping. The rendered depths can be used to define a geometric correspondence relationship between two arbitrary views.
    (2) Masked feature-level regularization loss. To encourage structural consistency while ignoring view-dependent radiance effects.
    (3) Edge-aware disparity regularization. To encourage local depth smoothness.
  4. How the experiments sufficiently support the claims?
    ANS: Qualitative and quantitative results show better fine details, smoother reconstructed surfaces, and fewer background artifacts than competing few-shot NeRF methods.
  5. What can we learn from ablation studies?
    ANS:
    (1) Feature-level consistency loss
    (2) Occlusion mask
    (3) Progressive training strategies
    (4) Edge-aware disparity regularization
    (5) Consistency between known views
  6. Potential fundamental flaws; how this work can be improved?
    ANS: Experiments are only reported for the 3-view setting.

References

Paper: https://arxiv.org/abs/2301.10941
Project Page: https://ku-cvlab.github.io/GeCoNeRF/
Code: https://github.com/KU-CVLAB/GeCoNeRF (Not released)

Related works
[R1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[R2] Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
[R3] InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering
[R4] Depth-supervised NeRF: Fewer Views and Faster Training for Free
[R5] RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
[R6] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[R7] Unsupervised Monocular Depth Estimation with Left-Right Consistency
[R8] Nerfies: Deformable Neural Radiance Fields
