GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time


Hao Li1*, Yuanyuan Gao1*, Chenming Wu2*, Dingwen Zhang1†, Yalun Dai3, Chen Zhao2, Haocheng Feng2, Errui Ding2, Jingdong Wang2, Junwei Han1

*Equal contribution. Work done while interning at Baidu VIS. †Corresponding author.
Abstract

This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, the complexity of processing high-resolution images, and lengthy optimization, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework can inherently estimate robust relative pose information from the image observations and thus primarily alleviate the requirement of real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To enhance speed and efficiency, we further introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at ≥ 5 FPS and real-time rendering at ≥ 100 FPS. Through extensive experimentation, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness. It also approaches the performance of 3D-GS methods that rely on real camera poses. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences. Project page: https://3d-aigc.github.io/GGRt.

Keywords: 
Pose-Free, Generalizable 3D-GS, Real-time Rendering

1 Introduction

The recently introduced Neural Radiance Fields (NeRF) [18] and 3D Gaussian Splatting (3D-GS) [11] bridge the gap between computer vision and computer graphics in the tasks of image-based novel view synthesis and 3D reconstruction. With a variety of follow-up variants, they are rapidly pushing the boundary towards revolutionizing many areas, such as virtual reality, film production, and immersive entertainment. To enhance generalization capabilities across previously unseen scenes, recent developments have introduced innovative approaches such as generalizable NeRF [27] and 3D-GS [2].


Figure 1: Our proposed GGRt is the first pose-free generalizable 3D Gaussian Splatting approach, capable of inference at over 5 FPS and real-time rendering.

Despite their ability to reconstruct new scenes without optimization, previous works usually rely on the actual camera pose for each image observation, which cannot always be captured accurately in real-world scenarios. Besides, these methods show unsatisfactory view synthesis performance and struggle to reconstruct at higher resolutions due to the large number of parameters used. Last but not least, for such methods, synthesizing each novel view demands a complete forward pass of the whole network, making real-time rendering intractable.

To tackle these challenges, this paper proposes GGRt, which brings the benefits of a primitive-based 3D representation, namely fast and memory-efficient rendering, to generalizable novel view synthesis under the pose-free condition. Specifically, we introduce a novel pipeline that jointly learns the IPO-Net and the G-3DG model. Such a pipeline can estimate relative camera pose information robustly and thus effectively alleviate the requirement for real camera poses. Subsequently, we develop a deferred back-propagation (DBP) mechanism, allowing our method to efficiently perform high-resolution training and inference, a capability that surpasses the low-resolution limitations of existing methods [12, 20, 9, 26]. Furthermore, we design an innovative Gaussians cache module built on the idea of reusing the relative pose information and image features of the reference views across two consecutive training or inference iterations. Thus, the Gaussians cache can progressively grow and shrink throughout training and inference, further accelerating both.

To the best of our knowledge, our work is the first pose-free generalizable 3D Gaussian Splatting framework, performing inference at ≥ 5 FPS and rendering in real time at ≥ 100 FPS. Extensive experiments demonstrate that our method surpasses existing NeRF-based pose-free approaches in inference speed and effectiveness. Compared to pose-based 3D-GS methods, our approach provides faster inference and competitive performance, even without a camera pose prior.

2 Related Work

2.1 Generalizable Novel View Synthesis

Pioneering approaches to novel view synthesis leverage image-based rendering techniques, such as light field rendering [23, 21] and view interpolation [27, 32]. The introduction of NeRF [18] marks a significant milestone that uses neural networks to model the volumetric scene function and demonstrates impressive results in this task, but it requires per-scene optimization and accurate camera poses. To address the problem of generalization, researchers have explored several directions. For instance, PixelNeRF [34] presents a NeRF architecture that is conditioned on image inputs in a fully convolutional fashion. NeuRay [15] enhances the NeRF framework by predicting the visibility of 3D points relative to input views, allowing the radiance field construction to concentrate on visible image features. Furthermore, GNT [27] integrates multi-view geometry into an attention-based representation, which is then decoded through an attention mechanism in the view transformer for rendering novel views.

A recent work, LRM [10], and its multi-view version [13] also adopt a transformer for generalizable scene reconstruction from either a single image or four posed images. However, these works only demonstrate capability in object-centric scenes, while our work targets the more ambitious goal of generalizing to both indoor and outdoor scenes. Fu et al. [4] propose to use a generalizable neural field built from posed RGB images and depth maps, eschewing a fusion module. Our work, in contrast, requires only camera input without pose information.

The aforementioned works use implicit representations inherited from NeRF and its variants, showing slow training and inference speed. Differently, pixelSplat [2] is the first generalizable 3D-GS work that tackles the problem of synthesizing novel views between a pair of images. However, it still requires accurate poses and only supports a pair of images as input. Instead, our work removes the need for image poses and supports large-scale scene inference with an unlimited number of reference views.

2.2 Pose-free Modeling for Novel View Synthesis

The first attempt towards pose-free novel view synthesis is iNeRF [33], which uses key-point matching to predict camera poses. NeRF-- [31] proposes to jointly optimize camera pose embeddings and NeRF. [14] proposes to learn neural 3D representations and register camera frames using coarse-to-fine positional encodings. [1] integrates scale- and shift-corrected monocular depth priors to train its model, enabling the joint acquisition of relative poses between successive frames and novel view synthesis of the scenes. [16] employs a strategy that synergizes pre-trained depth and optical-flow priors; this approach is used to progressively refine blockwise NeRFs, facilitating the frame-by-frame recovery of camera poses.

The implicit modeling inherent to NeRF complicates the simultaneous optimization of scene and camera poses. However, the recent innovation of 3D-GS provides an explicit point-based scene representation, enabling real-time rendering and highly efficient optimization. A recent work [5] pushes the boundary of simultaneous scene and pose optimization. However, such approaches still require substantial per-scene training and optimization effort.

In generalizable settings, SRT [19], VideoAE [12], RUST [20], MonoNeRF [26], DBARF [3], and FlowCam [22] learn a generalizable scene representation from unposed videos using NeRF's implicit representation. These works show unsatisfactory view synthesis performance without per-scene optimization and inherit the drawbacks NeRF originally had, such as the inability to render explicit primitives in real time. PF-LRM [28] extends LRM to pose-free scenes by using a differentiable PnP solver, but it shows the same limitations as LRM [10] mentioned above. To the best of our knowledge, our work is the first pose-free generalizable 3D-GS that enables efficient inference and real-time rendering, exhibiting SOTA performance on various metrics compared to previous approaches.

3 Our Approach

Given $N$ unposed images $\mathbb{I}=\{\mathbf{I}_i\in\mathbb{R}^{H\times W\times 3}\mid i=1\cdots N\}$ as references, our goal is to synthesize the target (or query) image $\mathbf{I}_t\in\mathbb{R}^{H\times W\times 3}$ from a novel view with corresponding relative poses $\mathbb{T}=\{\mathbf{T}_{i\rightarrow t}\mid i=1\cdots N\}$. Our GGRt is designed to train a generalizable Gaussian Splatting network in a self-supervised manner, without relying on any camera pose or depth acquired in advance.


Figure 2: An overview of our method, demonstrated using two consecutive training steps given $N$ selected nearby images. In the first training step, reference views are selected from nearby times $i\in\mathcal{N}(t)$, and the IPO-Net estimates the relative poses $\mathbf{T}_{i\rightarrow t}$ between reference and target images for 3D Gaussian prediction. Then $\mathbf{I}_{s_1}\cdots\mathbf{I}_{s_4}$ form three image pairs and are fed into the G-3DG model to predict Gaussians $\mathbf{G}_1\cdots\mathbf{G}_3$ for novel-view splatting, which are stored in the Gaussians cache. In the second step, since $\mathbf{I}_{s_2}\cdots\mathbf{I}_{s_4}$ were already used in the previous step, we directly query their image IDs in the Gaussians cache and pick up the corresponding Gaussian points $\mathbf{G}_2,\mathbf{G}_3$ together with the newly predicted $\mathbf{G}_4$ for novel-view splatting.

3.1 The Joint Learning Framework

3.1.1 Shared Image Encoder.

We utilize a ResNet backbone pre-trained with DINO and augmented with a feature pyramid network (FPN) to extract both geometric and semantic cues from each reference view $\mathbf{I}_i$ and the target view $\mathbf{I}_t$. The extracted features are denoted as $\mathbf{F}_i$ and $\mathbf{F}_t$.
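The paper does not detail the encoder beyond a DINO-pretrained ResNet with an FPN. The sketch below illustrates one way such a shared encoder could be assembled in PyTorch; the choice of ResNet-50, the 128-channel FPN output, and returning the highest-resolution pyramid level are assumptions, and the DINO weights would need to be loaded into the backbone separately.

```python
import torch
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class SharedImageEncoder(torch.nn.Module):
    """Hypothetical ResNet-50 + FPN encoder; DINO-pretrained weights would be
    loaded into the backbone separately (not shown)."""

    def __init__(self, out_channels=128):
        super().__init__()
        backbone = resnet50(weights=None)             # swap in DINO weights in practice
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.stages = torch.nn.ModuleList([backbone.layer1, backbone.layer2,
                                           backbone.layer3, backbone.layer4])
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], out_channels)

    def forward(self, images):                        # images: (B, 3, H, W)
        x, feats = self.stem(images), {}
        for idx, stage in enumerate(self.stages):
            x = stage(x)
            feats[f"c{idx + 2}"] = x                  # c2..c5 pyramid levels
        return self.fpn(feats)["c2"]                  # highest-resolution map, used as F_i / F_t

encoder = SharedImageEncoder()
feat = encoder(torch.randn(1, 3, 224, 224))           # -> (1, 128, 56, 56)
```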

3.1.2 Iterative Pose Optimization Network.

The goal of our pose estimation is to obtain the relative pose $\mathbf{T}_{t\rightarrow i}$ between the target and reference images, so that the target image can be aggregated from the reference images. To this end, an intuitive solution is to build a cost function $\mathcal{C}$ that enforces feature-metric consistency between the target view and all nearby views (i.e., minimizes the re-projection error):

$\mathcal{C}=\frac{1}{|\mathcal{N}(t)|}\sum_{i\in\mathcal{N}(t)}\left\|\phi\!\left(\mathbf{T}_{t\rightarrow i}\mathbf{D}_t,\,\mathbf{F}_i\right)-\phi\!\left(\mathbf{u}_t,\,\mathbf{F}_t\right)\right\|,$  (1)

where $\phi$ denotes the interpolation function (e.g., bilinear interpolation), and $\mathbf{D}_t$ is the predicted depth of the target image. Afterward, following the RAFT [25] architecture, we adopt a Conv-GRU module to iteratively update the camera pose $\mathbf{T}_{t\rightarrow i}$ and depth map $\mathbf{D}_t$. Specifically, at iteration step $k=0$, given camera poses $\mathbf{T}_{t\rightarrow i}^{k=0}$ and depth map $\mathbf{D}_t^{k=0}$, we compute an initial cost map $\mathcal{C}^{k=0}$ using Eq. 1. Then, the Conv-GRU module predicts the relative camera pose difference $\Delta\mathbf{T}_{t\rightarrow i}$ and depth difference $\Delta\mathbf{D}_t$ to update the camera poses and the depth map for a predefined maximum number of iterations $K$, such that:

$\mathbf{T}_{t\rightarrow i}^{(k)}=\mathbf{T}_{t\rightarrow i}^{(k-1)}+\Delta\mathbf{T}_{t\rightarrow i},\qquad \mathbf{D}_t^{(k)}=\mathbf{D}_t^{(k-1)}+\Delta\mathbf{D}_t.$  (2)

Then we convert the pre-estimated relative poses $\mathbf{T}_{t\rightarrow i}$ between the target view and the reference views into relative poses $\mathbf{T}_{i\rightarrow i+1}$ between nearby reference views for our Gaussian prediction. We refer to this network as IPO-Net in the following.
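To make the iterative update of Eqs. 1-2 concrete, here is a minimal sketch of the refinement loop. The Conv-GRU update block, the cost-map construction, the 6-DoF pose parameterization, and the choice of $K=8$ iterations are stand-ins (toy functions below), not the authors' implementation.

```python
import torch

def iterative_pose_depth_refinement(update_block, cost_fn, T_init, D_init, K=8):
    """Sketch of the IPO-Net refinement loop (Eqs. 1-2). `update_block` stands in
    for the Conv-GRU, `cost_fn` for the feature-metric cost map of Eq. 1, and the
    6-DoF pose vector and K=8 iterations are assumptions."""
    T, D, hidden = T_init, D_init, None
    for _ in range(K):
        cost = cost_fn(T, D)                          # re-projection cost map (Eq. 1)
        delta_T, delta_D, hidden = update_block(cost, T, D, hidden)
        T = T + delta_T                               # Eq. 2: additive pose update
        D = D + delta_D                               # Eq. 2: additive depth update
    return T, D

# Toy stand-ins so the sketch runs end to end.
def toy_cost(T, D):
    return D.mean(dim=1, keepdim=True) + T.abs().sum()
def toy_update(cost, T, D, hidden):
    return -0.1 * T, -0.01 * D, hidden                # shrink the residuals each iteration

T0 = torch.randn(1, 6) * 0.1                          # se(3)-style pose increment
D0 = torch.rand(1, 1, 48, 64) + 0.5
T_ref, D_ref = iterative_pose_depth_refinement(toy_update, toy_cost, T0, D0)
```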

3.1.3 Generalizable 3D-Gaussians.

Unlike previous generalizable methods that rely on implicit neural rendering and require per-target-view ray aggregation, our approach is based on 3D-GS, which employs an explicit representation. As a result, we can generate Gaussian points from reference views and combine them to render a larger scene. To accomplish this, we organize the image set $\mathbb{I}$ into several image pairs together with their relative poses $\{(\mathbf{I}_{s_1},\mathbf{I}_{s_2},\mathbf{T}_{1\rightarrow 2}),\dots,(\mathbf{I}_{s_{N-1}},\mathbf{I}_{s_N},\mathbf{T}_{N-1\rightarrow N})\}$ and perform pixel-aligned Gaussian prediction $\mathbf{G}_i=\{\mathbf{g}_k=(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k,\boldsymbol{\alpha}_k,\mathbf{S}_k)\}$ for each image pair $(\mathbf{I}_{s_i},\mathbf{I}_{s_{i+1}})$.

In particular, given an image pair $(\mathbf{I}_{s_i},\mathbf{I}_{s_{i+1}})$, we design a module dubbed Generalizable 3D-Gaussians (G-3DG) to predict Gaussian points, which consists of three parts: 1) an Epipolar Sampler, 2) a Cross-Attention module, and 3) a Local Self-Attention module. Let $\mathbf{u}_i$ be a pixel coordinate from $\mathbf{F}_{s_i}$ and $\ell$ be the epipolar line induced by its ray in $\mathbf{F}_{s_{i+1}}$. First, along $\ell$, we sample the features $\mathbf{F}_{s_{i+1}}[\mathbf{u}_\ell^{i+1}]$ and annotate the sampled points with their corresponding depths $\mathbf{D}_\ell^{i+1}$:

$\mathcal{S}(\mathbf{F}_{s_{i+1}})=\mathbf{F}_{s_{i+1}}[\mathbf{u}_\ell^{i+1}]\oplus\mathcal{PE}(\mathbf{D}_\ell^{i+1}),$  (3)

where $\oplus$ and $\mathcal{PE}(\cdot)$ denote concatenation and positional encoding, respectively, and $\mathcal{S}(\cdot)$ is the epipolar sampler. Subsequently, we employ the Cross-Attention module $\mathcal{CA}(\cdot)$ to determine per-pixel correspondences. The resulting feature $\hat{\mathbf{F}}_{s_i}$ incorporates a weighted sum of the depth positional encodings, with the expectation that the highest weight corresponds to the correct correspondence:

$\hat{\mathbf{F}}_{s_i}=\mathbf{F}_{s_i}+\mathcal{CA}\big(Q=\mathbf{F}_{s_i},\,K=\mathcal{S}(\mathbf{F}_{s_{i+1}}),\,V=\mathcal{S}(\mathbf{F}_{s_{i+1}})\big).$  (4)

Furthermore, we use a self-attention module to ensure that our network propagates scaled depth estimates to the parts of the $i$-th image feature maps that may not have any epipolar correspondences in the $(i+1)$-th image. For high-resolution novel view synthesis, we split a high-resolution image into several small crops and utilize a Local Self-Attention module $\mathcal{LSA}(\cdot)$. This way, we keep the training objective the same as global self-attention trained on full images. In detail, we split the features $\mathbf{F}_{s_i}$ into $m\times n$ patches and conduct self-attention ($\mathcal{SA}(\cdot)$) within every patch. Then, we add positional encoding $\mathcal{PE}(h,w)$ to retain image-wise positional information, where $h$ and $w$ denote the height and width of the features $\mathbf{F}_s$:

$\tilde{\mathbf{F}}_{s_i}=\mathcal{LSA}(\hat{\mathbf{F}}_{s_i})=\hat{\mathbf{F}}_{s_i}+\begin{bmatrix}\mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{1,1}) & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{1,2}) & \cdots & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{1,n})\\ \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{2,1}) & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{2,2}) & \cdots & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{2,n})\\ \vdots & \vdots & \ddots & \vdots\\ \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{m,1}) & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{m,2}) & \cdots & \mathcal{SA}(\hat{\mathbf{F}}_{s_i}^{m,n})\end{bmatrix}+\mathcal{PE}(h,w).$  (5)

From $\tilde{\mathbf{F}}_{s_i}$, we predict the corresponding Gaussian points $\mathbf{G}_i$ following the implementation of pixelSplat [2]. Our proposed training strategy naturally enables us to concatenate all the Gaussian points generated from image pairs for large-scene generalization:

$\mathbb{G}=\{\mathbf{G}_1,\mathbf{G}_2,\dots,\mathbf{G}_{N-1}\}.$  (6)

Moreover, our approach can achieve comparable performance while significantly reducing the required time for encoding reference images and therefore facilitating real-time rendering.
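As a rough illustration of the G-3DG computation (Eqs. 3-6), the sketch below runs cross-attention between per-pixel query features and pre-computed epipolar samples, applies a plain self-attention stand-in for the local self-attention of Eq. 5, and maps the result to pixel-aligned Gaussian parameters with a single linear head. Epipolar sampling, patching, and the exact pixelSplat-style parameterization are abstracted away; all sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class G3DGSketch(nn.Module):
    """Heavily simplified G-3DG head: cross-attention between per-pixel query
    features and epipolar samples (Eqs. 3-4), a self-attention stand-in for the
    local self-attention of Eq. 5, and a linear head producing pixel-aligned
    Gaussian parameters in the spirit of pixelSplat. Sizes are illustrative."""

    def __init__(self, dim=128, gauss_params=3 + 7 + 1 + 3):  # mean, rot+scale, opacity, colour
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gaussian_head = nn.Linear(dim, gauss_params)

    def forward(self, feat_i, epipolar_samples):
        # feat_i:           (B, P, C)    per-pixel features of view s_i
        # epipolar_samples: (B, P, S, C) features + depth PE sampled along epipolar lines in s_{i+1}
        B, P, S, C = epipolar_samples.shape
        q = feat_i.reshape(B * P, 1, C)
        kv = epipolar_samples.reshape(B * P, S, C)
        attended, _ = self.cross_attn(q, kv, kv)              # Eq. 4
        f_hat = feat_i + attended.reshape(B, P, C)
        refined, _ = self.self_attn(f_hat, f_hat, f_hat)      # stand-in for Eq. 5
        f_tilde = f_hat + refined
        return self.gaussian_head(f_tilde)                    # (B, P, 14) Gaussian parameters

model = G3DGSketch()
out = model(torch.randn(2, 64, 128), torch.randn(2, 64, 32, 128))  # -> (2, 64, 14)
```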

3.1.4 Gaussians Cache Mechanism.

As shown in Fig. 2, across two consecutive training/inference iterations, many reference views $\mathbf{I}_{s_i}$ of the current iteration were already used in the previous iteration. Re-inferencing them in the next iteration would be time-consuming and unnecessary, since they share the same relative poses and features. Therefore, instead of re-predicting Gaussian points for all image pairs, we propose a dynamic store, query, and release mechanism called the Gaussians Cache. In the $j$-th step, it stores the predicted Gaussian points with their corresponding image IDs $\{i:\mathbf{G}_i\}$ in the cache. In the $(j+1)$-th step, it queries the cache using image ID $i$ to retrieve the Gaussian points $\mathbf{G}_i$. Furthermore, after querying all the IDs needed in the current iteration, the cache releases the remaining unmatched Gaussians to optimize memory usage. This ensures that Gaussians that will not be used in the future are discarded, reducing the memory footprint without compromising performance.
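The store / query / release behaviour described above can be captured by a very small data structure. The sketch below is a minimal interpretation of the mechanism, not the released implementation; `predict_pair` in the usage comment is a hypothetical stand-in for a G-3DG forward pass.

```python
class GaussiansCache:
    """Minimal sketch of the store / query / release mechanism. Gaussian points are
    kept per reference-image ID; IDs not used in the current iteration are released
    so that memory stays bounded."""

    def __init__(self):
        self._store = {}

    def store(self, image_id, gaussians):
        self._store[image_id] = gaussians

    def query(self, image_id):
        return self._store.get(image_id)          # None -> this pair must be (re-)predicted

    def release_except(self, active_ids):
        for stale in set(self._store) - set(active_ids):
            del self._store[stale]                # Gaussians that will not be reused are dropped

# Intended usage inside one iteration (predict_pair is a hypothetical G-3DG call):
# cache = GaussiansCache()
# for img_id, pair in zip(ids, pairs):
#     gaussians = cache.query(img_id)
#     if gaussians is None:
#         gaussians = predict_pair(pair)          # run G-3DG only for uncached pairs
#         cache.store(img_id, gaussians)
# cache.release_except(ids)
```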

3.2 End-to-end Training with Deferred Optimization

3.2.1 Joint Training Strategy.

Learning scenes without camera poses is challenging due to the lack of 3D spatial priors for handling potential occlusions, varying lighting conditions, and the unknown camera intrinsics encountered in unstructured environments. To address this problem, as depicted in Figure 2(b), we activate the gradients of both the IPO-Net and the G-3DG model, allowing them to be learned simultaneously. By leveraging supervision from our rendered images, we can effectively optimize the IPO-Net.

Specifically, to strike a balance between pose estimation and training the generalizable Gaussian network, we employ joint training of our GGRt model using the loss function $\mathcal{L}_{\text{joint}}$, which incorporates a dynamically adjusted weight coefficient:

$\mathcal{L}_{\text{joint}}=2^{-\lambda\cdot s}\left(\mathcal{L}_{\text{depth}}+\mathcal{L}_{\text{photo}}\right)+\left(1-2^{-\lambda\cdot s}\right)\mathcal{L}_{\text{rgb}},$  (7)

where $s$ denotes the training step and $\lambda$ controls the decay of the weighting coefficient. For IPO-Net optimization, we adopt a photometric loss $\mathcal{L}_{\text{photo}}$ [8] and an edge-aware smoothness loss $\mathcal{L}_{\text{depth}}$ [7] on our target view:

$\mathcal{L}_{\text{photo}}=\frac{1}{|\mathcal{N}_t|}\sum_{i\in\mathcal{N}_t}\left(\alpha\,\frac{1-\operatorname{ssim}(\mathbf{I}_i',\mathbf{I}_t)}{2}+(1-\alpha)\,\big\|\mathbf{I}_i'-\mathbf{I}_t\big\|\right),$  (8)
$\mathcal{L}_{\text{depth}}=|\partial_x\mathbf{D}|\,e^{-|\partial_x\mathbf{I}|}+|\partial_y\mathbf{D}|\,e^{-|\partial_y\mathbf{I}|},$  (9)

where $\mathbf{I}_i'$ denotes the image warped from reference view $i$ to target view $t$, $\alpha$ balances the SSIM and L1 terms, and $\partial_x$ and $\partial_y$ are the image gradients. For training our Gaussian model, we simply apply an MSE loss between the rendered target image $\hat{\mathbf{C}}$ and the ground-truth target image $\mathbf{C}$, where $\mathbf{u}$ denotes the pixel coordinates of image $\mathbf{I}_t$:

$\mathcal{L}_{\text{rgb}}=\sum_{u\in\mathbf{u}}\big\|\hat{\mathbf{C}}(u)-\mathbf{C}(u)\big\|_2^2.$  (10)
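A sketch of how Eqs. 7-10 could be assembled is given below. The SSIM implementation, the SSIM/L1 balance `alpha`, and the decay rate `lam` of the dynamic weight are assumptions; per-pixel averaging replaces the explicit pixel sums for brevity.

```python
import torch
import torch.nn.functional as F

def edge_aware_smoothness(depth, image):
    """Eq. 9: depth gradients are down-weighted at image edges."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def joint_loss(step, warped_refs, target, rendered, depth, ssim_fn, alpha=0.85, lam=1e-3):
    """Sketch of Eqs. 7-10; `alpha` and the decay rate `lam` are assumptions."""
    photo = torch.stack([alpha * (1 - ssim_fn(w, target)) / 2 +
                         (1 - alpha) * (w - target).abs().mean()
                         for w in warped_refs]).mean()        # Eq. 8 (averaged over nearby views)
    smooth = edge_aware_smoothness(depth, target)             # Eq. 9
    rgb = F.mse_loss(rendered, target)                        # Eq. 10
    w = 2.0 ** (-lam * step)                                  # decaying weight on the pose branch
    return w * (smooth + photo) + (1 - w) * rgb               # Eq. 7

# Toy usage; a real SSIM implementation would replace `fake_ssim`.
fake_ssim = lambda a, b: 1 - (a - b).abs().mean()
tgt = torch.rand(1, 3, 32, 32)
loss = joint_loss(step=100, warped_refs=[torch.rand_like(tgt)], target=tgt,
                  rendered=torch.rand_like(tgt), depth=torch.rand(1, 1, 32, 32),
                  ssim_fn=fake_ssim)
```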
3.2.2 Deferred Back-propagation for Generalizable Gaussians.


Figure 3: Illustration of the deferred back-propagation pipeline of our G-3DG model (left column) and the procedure of our local self-attention module in deferred back-propagation (right column). Details are given in Sec. 3.

One of our goals is to render high-resolution images at high speed. However, as the image size grows, limited GPU memory becomes insufficient for training on a full high-resolution image. Inspired by [35], as shown in Fig. 3, we specifically design deferred back-propagation for the G-3DG model, which allows us to train GGRt at high resolution under memory constraints. Specifically, we render a full-resolution image during the initial stage without utilizing auto-differentiation. Subsequently, we compute the image loss and its per-pixel gradient with respect to the rendered image. During the second stage, we adopt a patch-wise approach to re-render pixels with auto-differentiation enabled. This allows us to back-propagate the cached gradients to the network parameters, accumulating them.

As mentioned earlier, the crucial aspect of ensuring successful training with this technique is maintaining consistency of the training objective across the two stages. Employing traditional self-attention for both the entire image and the patch images would result in an imbalance, as features in the whole image can attend to information globally, while patch images can only focus on one patch at a time. To solve this problem, for every image pair $(\mathbf{I}_{s_i},\mathbf{I}_{s_{i+1}})$ among the $N$ reference images, we split $\mathbf{I}_{s_i}$ into $m\times n$ patches $[\mathbf{I}_{s_i}^{1,1},\cdots,\mathbf{I}_{s_i}^{m,n}]$, where each patch has the same shape as in the Local Self-Attention module elaborated above. For each patch $\mathbf{I}_{s_i}^{a,b}$, we rewrite the epipolar sampler of Eq. 3 as:

$\mathcal{S}(\mathbf{F}_{s_{i+1}})=\mathbf{F}_{s_{i+1}}[\mathbf{u}_{\ell_{a,b}}^{i+1}]\oplus\mathcal{PE}(\mathbf{D}_{\ell_{a,b}}^{i+1}),$  (11)

where $\ell_{a,b}$ denotes the epipolar lines induced by patch $\mathbf{I}_{s_i}^{a,b}$. In the Cross-Attention module, we replace the query $Q$ from $\mathbf{F}_{s_i}$ to $\mathbf{F}_{s_i}^{a,b}$ in Eq. 4, which aggregates global features from $\mathbf{F}_{s_{i+1}}$ for each pixel:

$\mathbf{F}_{s_i}^{a,b}=\mathbf{F}_{s_i}^{a,b}+\mathcal{CA}\big(Q=\mathbf{F}_{s_i}^{a,b},\,K=\mathcal{S}(\mathbf{F}_{s_{i+1}}),\,V=\mathcal{S}(\mathbf{F}_{s_{i+1}})\big).$  (12)

As for the self-attention module, we first keep the positional encoding global (image-wise) and crop it using the patch position to retain positional information across different patches. After that, the self-attention block $\mathcal{SA}(\cdot)$ is applied to the patch feature $\mathbf{F}_{s_i}^{a,b}$:

$\mathbf{F}_{s_i}^{a,b}=\mathbf{F}_{s_i}^{a,b}+\mathcal{SA}(\mathbf{F}_{s_i}^{a,b})+\operatorname{Crop}\big(\mathcal{PE}(h,w),a,b\big).$  (13)

This design allows us to achieve consistent results between full image rendering and deferred rendering.
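The two-stage procedure can be sketched as follows: stage one renders the full image without autograd and caches the per-pixel gradient of the loss with respect to the rendered image; stage two re-renders patch by patch with autograd enabled and back-propagates the cached gradient slices. The `render_fn` patch renderer and the toy usage are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def deferred_backprop_step(render_fn, target, patch_size=128):
    """Stage 1: render the full image without autograd and cache the per-pixel
    gradient of the loss w.r.t. the rendered image. Stage 2: re-render patch by
    patch with autograd enabled and back-propagate the cached gradient slices,
    accumulating parameter gradients. `render_fn(y0, y1, x0, x1)` is an assumed
    patch renderer returning a (3, h, w) crop."""
    H, W = target.shape[-2:]
    with torch.no_grad():                                  # stage 1: forward only
        full = render_fn(0, H, 0, W)
    full = full.detach().requires_grad_(True)
    loss = F.mse_loss(full, target)
    loss.backward()                                        # gradient w.r.t. pixels only
    pixel_grad = full.grad                                 # cached per-pixel gradient

    for y in range(0, H, patch_size):                      # stage 2: patch-wise re-render
        for x in range(0, W, patch_size):
            y1, x1 = min(y + patch_size, H), min(x + patch_size, W)
            patch = render_fn(y, y1, x, x1)                # autograd enabled this time
            patch.backward(gradient=pixel_grad[..., y:y1, x:x1])
    return loss.item()

# Toy usage: "rendering" is a learnable per-pixel bias added to a fixed base image.
base = torch.rand(3, 256, 256)
bias = torch.zeros(3, 256, 256, requires_grad=True)
render = lambda y0, y1, x0, x1: base[:, y0:y1, x0:x1] + bias[:, y0:y1, x0:x1]
deferred_backprop_step(render, torch.rand(3, 256, 256))
assert bias.grad is not None                               # gradients accumulated patch by patch
```

Because the cached per-pixel gradients of the full-image loss are distributed over the patch renders, the accumulated parameter gradient matches what a single full-image backward pass would produce, while only one patch's activations are kept in memory at a time.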

3.3 Efficient Rendering

By generating pixel-aligned Gaussian points in the reference views without being constrained by a specific target view, we can decouple the Gaussian generation process from the novel-view splatting process. This decomposition enables real-time rendering at a high frame rate (≥ 100 FPS). In practice, we generate and cache all the Gaussian points in the scene (i.e., the Gaussians Cache). As a result, when presented with a query view within the scene, we can efficiently retrieve the corresponding Gaussians by using the IDs of the nearby reference images. This allows us to render a novel view quickly and accurately.
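At query time, rendering therefore reduces to a cache lookup followed by splatting. A minimal sketch, reusing the GaussiansCache sketch above and assuming a `rasterize_fn` stand-in for a 3D-GS rasterizer:

```python
import torch

def render_query_view(cache, nearby_ids, query_pose, rasterize_fn):
    """Sketch of query-time rendering: gather cached pixel-aligned Gaussians for the
    nearby reference IDs, concatenate them (Eq. 6), and splat them for the query
    pose. `rasterize_fn` stands in for a 3D-GS rasterizer; missing IDs would
    normally trigger a G-3DG forward pass instead of being skipped."""
    gaussians = [cache.query(i) for i in nearby_ids]
    scene = torch.cat([g for g in gaussians if g is not None], dim=0)
    return rasterize_fn(scene, query_pose)      # no network forward pass is needed here
```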

4 Experiments

4.1 Implementation Details

The experiments for novel view synthesis are conducted under two settings, i.e., the generalized and finetuned settings. First, we train our model on several scenes and directly evaluate it on test scenes (i.e., unseen scenes). Second, we finetune our generalized model on each unseen scene with a small number of steps and compare it with per-scene optimized NeRF methods.

4.1.1 Datasets.

Following [3], we train and evaluate our method on LLFF [17]. To further demonstrate the capability of our model in general settings, we evaluate our performance in forward-facing outdoor datasets (i.e. Waymo Open Dataset [24] and KITTI Dataset [6]).

4.1.2 Parameters.

We train our method end-to-end on datasets of multi-view unposed images using the Adam optimizer to minimize the overall loss $\mathcal{L}_{\text{joint}}$ in Eq. 7. The learning rates for the IPO-Net and the G-3DG model are set to $5\times10^{-4}$ and $2\times10^{-5}$, respectively, decaying exponentially over the course of training. For the LLFF dataset, our training and rendering resolutions are set to 378×504. The number of reference views is 5 for the generalized setting and 10 for the finetuning setting. For the Waymo dataset, the rendering resolution is set to 640×960 for both the generalized and finetuning settings, while the training resolutions are set to 196×288 and 504×760, respectively. For the KITTI dataset, we train and render at the same resolution of 176×612, and the number of reference views is set to 5 for both the generalized and finetuning settings. We split our training images into 4 patches during deferred back-propagation.
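A minimal sketch of the optimizer configuration implied by these settings is shown below; the sub-network modules are toy stand-ins and the exponential decay factor is an assumption, since the paper does not report it.

```python
import torch

# Toy stand-ins for the two sub-networks so the snippet is self-contained.
ipo_net, g3dg_model = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)

optimizer = torch.optim.Adam([
    {"params": ipo_net.parameters(), "lr": 5e-4},     # IPO-Net learning rate
    {"params": g3dg_model.parameters(), "lr": 2e-5},  # G-3DG learning rate
])
# Exponential decay over training; the decay factor gamma is an assumption.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

for step in range(3):                                 # skeleton of the training loop
    loss = (g3dg_model(ipo_net(torch.randn(4, 8))) ** 2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```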

4.1.3 Metrics.

For render quality evaluation, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [30], and the Learned Perceptual Image Patch Similarity (LPIPS) [36] are adopted.
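For reference, these three metrics could be computed as in the following sketch, assuming the scikit-image and lpips packages; the exact evaluation protocol (color space, data range, LPIPS backbone) is an assumption.

```python
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import lpips

def evaluate(pred, gt, lpips_fn):
    """pred / gt: float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1  # [-1, 1], NCHW
    return psnr, ssim, lpips_fn(to_t(pred), to_t(gt)).item()

lpips_fn = lpips.LPIPS(net="alex")                     # learned perceptual metric [36]
pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.random.rand(64, 64, 3).astype(np.float32)
print(evaluate(pred, gt, lpips_fn))
```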

Table 1: Quantitative performance of novel view synthesis on the Waymo [24] and LLFF [17] datasets under generalized conditions. Entries in bold indicate the best performance in a pose-free context, while highlighted entries represent the best overall.

Scene         | PSNR↑                          | SSIM↑                          | LPIPS↓
              | IBRNet PixelSplat DBARF  Ours  | IBRNet PixelSplat DBARF  Ours  | IBRNet PixelSplat DBARF  Ours
Waymo 003     | 31.40  31.45      25.17  31.20 | 0.917  0.920      0.834  0.912 | 0.127  0.101      0.225  0.137
Waymo 19      | 30.06  32.22      23.45  32.34 | 0.907  0.928      0.810  0.928 | 0.130  0.082      0.225  0.110
Waymo 36      | 29.88  33.58      22.59  31.64 | 0.904  0.942      0.807  0.919 | 0.150  0.087      0.254  0.120
Waymo 69      | 30.24  31.81      21.97  32.10 | 0.889  0.914      0.779  0.908 | 0.175  0.119      0.307  0.146
Waymo 81      | 31.30  31.02      24.17  31.30 | 0.901  0.904      0.785  0.892 | 0.147  0.108      0.298  0.159
Waymo 126     | 36.00  33.68      29.70  34.15 | 0.940  0.931      0.884  0.937 | 0.106  0.100      0.178  0.094
LLFF fern     | 23.61  22.90      23.12  24.66 | 0.743  0.734      0.724  0.764 | 0.240  0.121      0.277  0.174
LLFF flower   | 22.92  24.51      21.89  24.80 | 0.849  0.806      0.793  0.795 | 0.123  0.106      0.176  0.098
LLFF fortress | 29.05  26.36      28.13  28.30 | 0.850  0.781      0.820  0.817 | 0.087  0.080      0.126  0.109
LLFF horns    | 24.96  23.86      24.17  26.34 | 0.831  0.836      0.799  0.872 | 0.144  0.127      0.194  0.118
LLFF leaves   | 19.03  19.49      18.85  21.03 | 0.737  0.663      0.649  0.729 | 0.289  0.158      0.313  0.150
LLFF orchids  | 18.52  17.65      17.78  19.00 | 0.573  0.533      0.506  0.589 | 0.259  0.191      0.352  0.219
LLFF room     | 28.81  27.82      27.50  29.02 | 0.926  0.920      0.901  0.932 | 0.099  0.067      0.142  0.081
LLFF trex     | 23.51  22.75      22.70  24.22 | 0.818  0.788      0.783  0.804 | 0.160  0.107      0.207  0.166
(IBRNet and PixelSplat require camera poses, i.e. pose-free ✘; DBARF and Ours are pose-free ✔.)

Table 2: Results for novel view synthesis on the KITTI [6] dataset in generalized settings. Specifically, 'w/o ft' denotes applying our method directly on KITTI using the weights trained on the Waymo dataset without finetuning, whereas 'w/ ft' denotes the results from our model finetuned on KITTI scene '04'. The resolution is 176×612. Time includes inference and rendering.

Metric     | Pose-free ✘        | Pose-free ✔
           | IBRNet  PixelSplat | VideoAE  RUST   FlowCAM  DBARF  Ours w/o ft  Ours w/ ft
PSNR ↑     | 22.5    23.35      | 15.17    14.18  17.69    18.36  20.24        22.59
LPIPS ↓    | 0.44    0.129      | 0.462    0.654  0.405    0.425  0.388        0.327
Time (s) ↓ | 0.850   0.285      | ≈2       ≈1     4.170    0.850  0.295        0.295

4.2 Benchmarking

We first conduct experiments comparing our method with other methods, both pose-required and pose-free, on the light-field dataset LLFF [17] and the forward-facing autonomous driving dataset Waymo Open [24]. As shown in Tab. 1, our method achieves remarkable performance improvements compared to other approaches. Notably, our method comprehensively surpasses the best pose-free method, DBARF [3], across all scenarios in both datasets. Specifically, in the '69' scene of the Waymo dataset, our method's PSNR exceeds that of DBARF by up to 10.3 dB. Furthermore, compared to state-of-the-art pose-based methods such as IBRNet [29] and pixelSplat [2], our method also delivers highly competitive results. For instance, our method outperforms IBRNet and pixelSplat in most scenarios on the LLFF dataset. In the 'horns' scene, our approach achieves a PSNR 1.38 dB and 2.48 dB higher than IBRNet and pixelSplat, respectively.


Figure 4: Qualitative novel view synthesis results on the LLFF [17] dataset under generalized settings, with significant regions highlighted by red rectangles.


Figure 5: Qualitative results for novel view synthesis on the Waymo [24] dataset under generalized settings. Areas of distinction are marked with red rectangles.

We also conduct experiments on the KITTI [6] dataset to compare our method with other pose-free generalizable NeRF methods. As illustrated in Tab. 2, our approach outperforms VideoAE [12], RUST [20], and FlowCAM [22], even without specific training on the KITTI dataset. Notably, when we directly apply our Waymo-trained model to KITTI without any additional training, our method still surpasses those approaches that have been specifically trained on KITTI. By fine-tuning our method on the KITTI dataset, we observe even more significant improvements: the PSNR reaches 22.59 dB, surpassing the state-of-the-art method FlowCAM by a substantial margin of 4.9 dB.


Figure 6: Results of novel view synthesis on the KITTI [6] dataset under generalized conditions, with red rectangles indicating areas of note. Note that the ground truth (GT) is replicated for easier comparison.

4.2.1 Pose Accuracy Evaluation.

We assess the accuracy of our pose estimation both quantitatively and qualitatively. Since we focus on estimating relative poses rather than absolute poses, we compare our results exclusively with DBARF. The comparison results are presented in Tab. 3, where we achieve lower rotation and translation errors in most scenes. Notably, our rotation errors are significantly lower in the 'Flower' and 'Trex' scenes, which brings reconstruction quality improvements of 2.91 dB and 1.52 dB in PSNR, respectively.


Figure 7: Qualitative results for pose optimization on three scenes of the LLFF [17] dataset.

Table 3: Quantitative results of camera pose accuracy on the LLFF [17] dataset. Rotation is in degrees and translation is scaled by $10^2$.

Error         Method | Room   Trex    Flower  Fern   Fortress  Orchids  Leaves  Horns
Rotation↓     DBARF  | 9.590  126.93  14.650  8.314  2.740     14.431   3.927   8.455
              Ours   | 7.692  3.449   5.885   5.783  2.160     4.157    16.964  4.684
Translation↓  DBARF  | 0.060  0.070   0.010   0.020  0.009     0.046    0.022   0.027
              Ours   | 0.043  0.014   0.004   0.008  0.006     0.012    0.014   0.005

4.3 Ablation Study

4.3.1 Effectiveness of Gaussians Cache.

In this ablation study, we compare our method's training and inference time on a single RTX 3090 GPU and further evaluate the impact on PSNR. As shown in Tab. 4, the proposed Gaussians Cache avoids re-predicting Gaussians that were already processed in the previous iteration, resulting in a 2× speed-up during training and an 8× boost during inference compared to the baseline without the Gaussians cache. Additionally, the proposed caching technique is shown to have no negative impact on quality, underscoring its efficacy.

Table 4: Ablation study evaluating the proposed Gaussians Cache mechanism. Training and inference are conducted on the Waymo '019' scene at a resolution of 228×320.

Gaussians Cache | Training Time↓ | Inference Time↓ | PSNR↑
✘               | 4 s/iter       | 1.02 s/iter     | 32.34
✔               | 2 s/iter       | 0.13 s/iter     | 32.34
4.3.2 Effectiveness of Deferred Back-Propagation.

We evaluate the performance impact of the proposed DBP technique. As shown in Tab. 5, when using regular back-propagation, a single RTX 3090 GPU can only process 2 reference views at a resolution of 384×496. To use more reference views, we would need to reduce the resolution of the observed images, which brings undesired performance drops. However, when equipped with the proposed DBP technique, we can use 5 reference views at a resolution of 384×496 and gain 1.37 to 3.02 dB in PSNR compared to the traditional approach. The performance gains become even larger under the finetuning setting. Notably, when further increasing the number of reference views to 7, our approach reaches 31.5 dB in PSNR, setting a new state-of-the-art record.

Apart from DBP, a simple way to train a high-resolution model with limited hardware is to randomly crop images from the reference views and only render the cropped areas. We compare this crop-image training strategy with our DBP. After training with cropped images, the model fails to capture the matching information of the entire image, resulting in poor rendering quality when a full image is used directly. In contrast, our proposed DBP works well under this circumstance, as shown in Fig. 8.


Figure 8: Visualization of our method and crop-image training. (a) Individual image patches are rendered separately and stitched together. (b) The entire image is rendered using weights trained with the crop-image training approach. (c) The entire image is rendered using our deferred back-propagation technique.


Figure 9: Ablation of our Local Self-Attention for DBP. Left: the top and bottom rows visualize our DBP with and without local self-attention; 'Render w/o Grad' denotes the full image rendered in the first, gradient-free stage. Right: PSNR over the course of training; a significant drop is observed without our Local Self-Attention.

Table 5: Ablation study of our proposed deferred back-propagation technique on scene 'Room' from the LLFF dataset [17]. 'OOM' is an abbreviation for 'out of memory'.

Method          Ref. Views  Resolution | PSNR↑ Gen.  PSNR↑ Ft  | Mem. (GB) Gen.  Mem. (GB) Ft
Ours w/o Defer  2           384×496    | 27.65       27.77     | 28.48           29.03
Ours w/o Defer  3           384×496    | -           -         | OOM             OOM
Ours w/o Defer  5           192×248    | 26.00       28.51     | 31.05           31.05
Ours w/ Defer   5           384×496    | 29.02       29.85     | 29.35           29.35
Ours w/ Defer   7           384×496    | -           31.50     | -               34.61
4.3.3 Effectiveness of Local Self-Attention.

Here we conduct ablations on our proposed local self-attention. As shown in Fig. 9(a), without our local self-attention, the rendered images of the first and second stages of DBP are not consistent. Consequently, without Local Self-Attention, the PSNR drops after around 1,500 training steps, as illustrated in Fig. 9(b). In contrast, by utilizing Local Self-Attention, the results of both stages remain consistent. Otherwise, the loss calculated by the forward-only rendering stage no longer serves as an appropriate guide for the gradient back-propagation of the DBP rendering stage, resulting in a significant decrease in PSNR.

5 Conclusion

This paper introduces a novel method for generalizable novel view synthesis that removes the need for camera poses, enables high-resolution real-time rendering, and avoids lengthy per-scene optimization. Our method combines jointly trained IPO-Net and G-3DG models with a progressive Gaussians cache module, enabling robust relative pose estimation and fast scene reconstruction from image observations without prior poses. We incorporate a deferred back-propagation mechanism for high-resolution training and inference, overcoming GPU memory limitations. GGRt achieves impressive inference and real-time rendering speeds, outperforming existing pose-free techniques and approaching pose-based 3D-GS methods. Extensive experimentation on diverse datasets confirms its effectiveness.
