[Localization Paper Reading Series] InLoc: Indoor Visual Localization with Dense Matching and View Synthesis (Part 1)

Link to the follow-up post

Paper overview

Paper information

Title

Indoor Visual Localization with Dense Matching and View Synthesis

Source

A paper found on Papers with Code

Code

Code link

Overview

What does it study

Assessment

What this paper gave me:
1. A method worth borrowing: using image retrieval for indoor localization
2. A fairly complete experimental design
Rating: ⭐⭐⭐⭐⭐
Worth re-reading: √ (open-source code is available, and it matches the direction I want to study)

0. Abstract

0.1 Sentence-by-sentence reading

We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold.

First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders.

Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario.

Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.

0.2 Summary

  • Pose estimation uses dense matching (instead of the more common local features)
  • Virtual view synthesis is used to cope with occlusion

1. Introduction

Paragraph 1 (benefits of indoor navigation)

Autonomous navigation inside buildings is a key ability of robotic intelligent systems [24, 39]. Successful navigation requires both to localize a robot and to determine a path to its goal.

One approach to solving the localization problem is to build a 3D map of the building and then use a camera to estimate the current position and orientation of the robot (Figure 1).

Figure 1. Large-scale indoor visual localization. Given a database of geometrically registered RGBD images, we predict the 6DoF camera pose of a query RGB image by retrieving candidate images, estimating candidate camera poses, and selecting the best matching pose. To overcome the inherent difficulties of indoor visual localization, we introduce "InLoc", which performs a sequence of progressively stricter verification steps.

Imagine also the benefit of an intelligent indoor navigation system that helps you find your way, for example, at Chicago airport, Tokyo Metropolitan station or the CVPR conference center. Besides intelligent systems, the visual localization problem is also highly relevant for any type of Mixed Reality application, including Augmented Reality [16, 44, 72].

Paragraph 2 (challenges of indoor localization)

Due to the availability of datasets, e.g., obtained from Flickr [38] or captured from autonomous vehicles [19, 43], large-scale localization in urban environments has been an active field of research [6, 9, 14, 15, 19, 20, 27, 29, 34, 38, 44, 53–57, 65–67, 75, 79, 80]. In contrast, indoor localization [11, 12, 39, 58, 59, 64, 69, 74] has received less attention in the last years.

At the same time, indoor localization is, in many ways, a harder problem than urban localization: 1) Due to the short distance to the scene geometry, even small changes in viewpoint lead to large changes in image appearance. For the same reason, occluders such as humans or chairs often have a stronger impact compared to urban scenes. Thus, indoor localization approaches have to handle significantly larger changes in appearance between query and reference images.

2) Large parts of indoor scenes are textureless and textured areas are typically rather small. As a result, feature matches are often clustered in small regions of the images, resulting in unstable pose estimates [29].

3) To make matters worse, buildings are often highly symmetric with many repetitive elements, both on a large (similar corridors, rooms, etc.) and a small (similar chairs, tables, doors, etc.) scale. While structural ambiguities also cause problems in urban environments, they often only occur in larger scenes [9, 54, 67].

4) The appearance of indoor scenes changes considerably over the course of a day due to the complex illumination conditions (indirect light through windows and active illumination from lamps).

5) Indoor scenes are often highly dynamic over time as furniture and personal effects are moved through the environment. In contrast, the overall appearance of building facades does not change too much over time.

Paragraph 3 (the proposed method overcomes these problems)

This paper addresses these difficulties inherent to indoor visual localization by proposing a new localization method.

Our approach starts with an image retrieval step, using a compact image representation [6] that scales to large scenes. Given a shortlist of potentially relevant database images, we apply two progressively more discriminative geometric verification steps: (i) we use dense matching of CNN descriptors that capture spatial configurations of higher-level structures (rather than individual local features) to obtain the correspondences required for camera pose estimation; (ii) we then apply a novel pose verification step based on virtual view synthesis that can accurately verify whether the query image depicts the same place by dense pixel-level matching, again not relying on sparse local features.

Paragraph 4 (the new dataset)

Historically, the datasets used to evaluate indoor visual localization were restricted to small, often room-scale, scenes.

Driven by the interest in semantic scene understanding [10, 23, 78] and enabled by scalable reconstruction techniques [28, 47, 48], large-scale indoor datasets covering multiple rooms or even whole buildings are becoming available [10, 17, 23, 64, 74, 76–78]. However, most of these datasets focus on reconstruction [76, 77] and semantic scene understanding [10, 17, 23, 78] and are not suitable for localization.

To address this issue, we create a new dataset for indoor localization that, in contrast to other existing indoor localization datasets [10, 26, 64], has two important properties. First, the dataset is large-scale, capturing two university buildings. Second, the query images are acquired using a smartphone at a time months apart from the date of capture of the reference 3D model.

As a result, the query images and the reference 3D model often contain large changes in scene appearance due to the different layout of furniture, occluders (people), and illumination, representing a realistic and challenging indoor localization scenario.

Paragraph 5 (contributions)

Contributions. Our contributions are three-fold. First, we develop a novel visual localization approach suitable for large-scale indoor environments. The key novelty of our approach lies in carefully introducing dense feature extraction and matching in a sequence of progressively stricter verification steps. To the best of our knowledge, the present work is the first to clearly demonstrate the benefit of dense data association for indoor localization.

Second, we create a new dataset suitably designed for large-scale indoor localization that contains large variation in appearance between queries and the 3D database due to large viewpoint changes, moving furniture, occluders, or changing illumination. The query images are taken at a different time from the reference database, using a handheld device, and at different moments of the day, to capture enough variability, bridging the gap to realistic usage scenarios. The code and data are publicly available on the project page [1].

Third, the proposed method shows a solid improvement over existing state-of-the-art results, with an absolute improvement of 17–20% in the percentage of correctly localized queries within a 0.25–0.5 m error, which is of high importance for indoor localization.

2. Related work

We next review previous work on visual localization.

Paragraph 1 (image retrieval based localization)

Image retrieval based localization. Visual localization in large-scale urban environments is often approached as an image retrieval problem. The location of a given query image is predicted by transferring the geotag of the most similar image retrieved from a geotagged database [6, 9, 18, 35, 54, 66, 67].

This approach scales to entire cities thanks to compact image descriptors and efficient indexing techniques [7, 8, 22, 31, 33, 49, 63, 70] and can be further improved by spatial re-ranking [51], informative feature selection [21, 22], or feature weighting [27, 32, 54, 67].

Most of the above methods are based on image representations using sparsely sampled local invariant features. While these representations have been very successful, outdoor image-based localization has recently also been approached using densely sampled local descriptors [66] or (densely extracted) descriptors based on convolutional neural networks [6, 35, 40, 75].

However, the main shortcoming of all the above methods is that they output only an approximate location of the query, not an exact 6DoF pose.

Paragraph 2 (visual localization using 3D maps)

Visual localization using 3D maps. Another approach is to directly obtain the 6DoF camera pose with respect to a pre-built 3D map. The map is usually composed of a 3D point cloud constructed via Structure-from-Motion (SfM) [2], where each 3D point is associated with one or more local feature descriptors. The query pose is then obtained by feature matching and solving a Perspective-n-Point (PnP) problem [14, 15, 20, 29, 34, 38, 53, 55].

Alternatively, pose estimation can be formulated as a learning problem, where the goal is to train a regressor from the input RGB(D) space to camera pose parameters [11, 34, 59, 73]. While promising, scaling these methods to large-scale datasets is still an open challenge.

Paragraph 3 (indoor 3D maps)

Indoor 3D maps. Indoor scene datasets [50, 52, 62, 68] have been introduced for tasks such as scene recognition, classification, and object retrieval. With the increased availability of laser range scanners and time-of-flight (ToF) sensors, several datasets include depth data besides RGB images [5, 10, 23, 26, 36, 60, 78], and some of these datasets also provide reference camera poses registered into the 3D point cloud [10, 26, 78], though their focus is not on localization.

Datasets focused specifically on indoor localization [59, 64, 69] have so far captured fairly small spaces such as a single room (or a single floor at largest) and have been constructed from densely-captured sequences of RGBD images. More recent datasets [17, 76] provide larger-scale (multi-floor) indoor 3D maps containing RGBD images registered to a global floor map. However, they are designed for object retrieval, 3D reconstruction, or training deep-learning architectures. Most importantly, they do not contain query images taken from viewpoints far from database images, which are necessary for evaluating visual localization.

Paragraph 4 (the dataset used in this paper)

To address the shortcomings of the above datasets for large-scale indoor visual localization, we introduce a new dataset that includes query images captured at a different time than the database, taken from a wide range of viewpoints, with a considerably larger 3D database distributed across multiple floors of multiple buildings.

Furthermore, our dataset contains various situations that are difficult for visual localization, e.g., textureless and highly symmetric office scenes, repetitive tiles, and repetitive objects that confuse existing visual localization methods designed for outdoor scenes. The newly collected dataset is described next.

3. The InLoc dataset for visual localization

Paragraph 1 (the dataset)

Our dataset is composed of a database of RGBD images geometrically registered to the floor maps, augmented with a separate set of RGB query images taken by hand-held devices to make it suitable for the task of indoor localization (Figure 2). The provided query images are annotated with manually verified ground-truth 6DoF camera poses (reference poses) in the global coordinate system of the 3D map.

Figure 2. Example images from the InLoc dataset. (Top) Database images. (Bottom) Query images. The selected images illustrate the challenges encountered in indoor environments: even a small change in viewpoint causes large differences in appearance; large textureless surfaces (e.g., walls); self-repetitive structures (e.g., corridors); significant changes throughout the day due to different sources of illumination (e.g., active vs. indirect lighting).

Paragraph 2 (the database)

Database. The base indoor RGBD dataset [76] consists of 277 RGBD panoramic images obtained from scanning two buildings at Washington University in St. Louis with a Faro 3D scanner. Each RGBD panorama has about 40M colored 3D points.

The base images are divided into five scenes: DUC1, DUC2, CSE3, CSE4, and CSE5, representing five floors of the mentioned buildings, and are geometrically registered to a known floor plan [76].

The scenes are scanned sparsely on purpose, to cover a larger area with a small number of scans, both to reduce the required manual work and because of the long operating times of the high-end scanner used. The area per scan varies between 23.5 and 185.8 m². This inherently leads to critical view changes between query and database images when compared with other existing datasets [64, 69, 74].

Paragraph 3 (the perspective image database)

For creating an image database suitable for indoor visual localization evaluation, a set of perspective images is generated by following the best practices from outdoor visual localization [19, 66, 79]. We obtain 36 perspective RGBD images from each panorama by extracting standard perspective views (60° FoV) with a sampling stride of 30° in yaw and ±30° in pitch, resulting in 10K perspective images in total (Table 1).

Table 1. Statistics of the InLoc dataset.
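
To make the sampling concrete, here is a minimal Python/OpenCV sketch (not the authors' code) of cutting 60° FoV pinhole views out of an equirectangular panorama with a 30° yaw stride and pitches of -30°/0°/+30°, i.e. 12 × 3 = 36 views per panorama; the input file `panorama.jpg` and the 600 px output size are assumptions for illustration:

```python
import numpy as np
import cv2

def perspective_from_pano(pano, yaw_deg, pitch_deg, fov_deg=60.0, size=600):
    """Render one pinhole view (size x size) from an equirectangular panorama."""
    h, w = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    # Pixel grid of the virtual pinhole camera; unit ray per pixel in camera coords.
    u, v = np.meshgrid(np.arange(size), np.arange(size))
    rays = np.stack([(u - size / 2) / f, (v - size / 2) / f,
                     np.ones_like(u, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays by pitch (around x) then yaw (around y).
    p, y_ = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y_), 0, np.sin(y_)], [0, 1, 0], [-np.sin(y_), 0, np.cos(y_)]])
    d = rays @ (Ry @ Rx).T
    # Ray direction -> (longitude, latitude) -> panorama pixel coordinates.
    lon = np.arctan2(d[..., 0], d[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))   # [-pi/2, pi/2]
    map_x = ((lon / (2 * np.pi) + 0.5) * w).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * h).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)

pano = cv2.imread("panorama.jpg")  # hypothetical input file
views = [perspective_from_pano(pano, yaw, pitch)
         for yaw in range(0, 360, 30) for pitch in (-30, 0, 30)]
```

The same per-pixel mapping can be reused on the panorama's depth channel so that each perspective view keeps its RGBD structure.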

Our database contains significant challenges, such as repetitive patterns (stairs, pillars), frequently appearing building structures (doors, windows), furniture changing position, people moving across the scene, and textureless and highly symmetric areas (walls, floors, corridors, classrooms, open spaces).

Paragraph 4 (query images)

Query images. We captured 356 photos using a smartphone camera (iPhone 7), distributed only across two floors, DUC1 and DUC2. The other three floors in the database are not represented in the query images and play the role of confusers at search time, contributing to the building-scale localization scenario.

Note that these query photos are taken at different times of the day, to capture the variety of occluders and layouts (e.g., people, furniture) as well as illumination changes.

Paragraph 5 (reference pose generation)

Reference pose generation. For all query photos, we estimate 6DoF reference camera poses w.r.t. the 3D map. Each query camera reference pose is computed as follows:

(i) Selection of the visually most similar database images. For each query, we manually select the one panorama location which is visually most similar to the query image, using the perspective images generated from the panorama.

(ii) Automatic matching of query images to selected database images. We match the query and perspective images using affine covariant features [45] and nearest-neighbor search followed by Lowe's ratio test [42].

(iii) Computing the query camera pose and visually verifying the reprojection. All the panoramas (and perspective images) are already registered to the floor plan and have pixel-wise depth information. Therefore, we compute the query pose via P3P-RANSAC [25], followed by bundle adjustment [3], using correspondences between query image points and scene 3D points obtained by feature matching. We evaluate the obtained poses visually by inspecting the reprojection of edges detected in the corresponding RGB panorama into the query image (see examples in Figure 3).

Figure 3. Examples of verified query poses. As described in Section 3, we evaluate the quality of the reference camera poses both visually and quantitatively. The red dots are database 3D points projected into the query image using its estimated pose.
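
A minimal sketch of step (iii), assuming the 2D-3D matches and the intrinsics are already available (the `.npy` files and the K values below are hypothetical): OpenCV's `solvePnPRansac` plays the role of P3P-RANSAC [25], and a Levenberg-Marquardt refinement on the inliers stands in for the bundle adjustment [3] used in the paper.

```python
import numpy as np
import cv2

pts2d = np.load("query_keypoints.npy")  # (N, 2) matched query pixels, hypothetical file
pts3d = np.load("scene_points.npy")     # (N, 3) matched scene 3D points, hypothetical file
K = np.array([[1450.0, 0, 960], [0, 1450.0, 540], [0, 0, 1]])  # assumed intrinsics

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
    flags=cv2.SOLVEPNP_P3P, reprojectionError=8.0, iterationsCount=10000)

if ok:
    # Nonlinear refinement on the inlier set (a stand-in for bundle adjustment).
    rvec, tvec = cv2.solvePnPRefineLM(pts3d[inliers[:, 0]], pts2d[inliers[:, 0]],
                                      K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)  # world-to-camera rotation; tvec is the translation
```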

(iv) Manual matching of difficult queries to selected database images. Pose estimation from automatic matches often gives inaccurate poses for difficult queries that are, e.g., far from any database image. Hence, for queries with significant misalignment in the reprojected edges, we manually annotate 5 to 20 correspondences between image pixels and 3D points and apply step (iii) to the manual matches.

(v) Quantitative and visual inspection. For all estimated poses, we measure the median reprojection error, computed as the distance of the reprojected 3D database point to the nearest edge pixel detected in the query image, after removing correspondences with gross errors (distance over 20 pixels) due to, e.g., occlusions. For query images with a median reprojection error under 5 pixels, we manually inspect the reprojected edges in the query image and finally accept 329 reference poses out of the 356 query images.
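
The quantitative check in step (v) can be sketched as follows; this is an illustration, not the authors' code, and it assumes Canny as the edge detector (the paper does not specify one here):

```python
import numpy as np
import cv2

def median_edge_reproj_error(query_gray, pts3d, rvec, tvec, K):
    """Median distance of reprojected 3D points to the nearest query-image edge."""
    edges = cv2.Canny(query_gray, 50, 150)
    # Distance transform of the non-edge mask = distance to the nearest edge pixel.
    dist_to_edge = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
    proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
    proj = proj.reshape(-1, 2)
    h, w = query_gray.shape
    inside = ((proj[:, 0] >= 0) & (proj[:, 0] < w) &
              (proj[:, 1] >= 0) & (proj[:, 1] < h))
    d = dist_to_edge[proj[inside, 1].astype(int), proj[inside, 0].astype(int)]
    d = d[d <= 20.0]  # drop gross errors (> 20 px), e.g. due to occlusions
    return np.median(d) if d.size else np.inf

# A pose is kept for the final manual inspection only if this error is under 5 px.
```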

4. Indoor visual localization with dense matching and view synthesis

We propose a new method for large-scale indoor visual localization. We address the three main challenges of indoor environments:

Paragraph 1 (challenge 1: lack of sparse local features)

(1) Lack of sparse local features. Indoor environments are full of large textureless areas, e.g., walls, ceilings, floors, and windows, where sparse feature extraction methods detect very few features. To overcome this problem, we use multi-scale dense CNN features for both image description and feature matching. Our features are generic enough to be pre-trained beforehand on (outdoor) scenes, avoiding the costly re-training of the localization machine for each particular environment, as in, e.g., [11, 34, 73].

Paragraph 2 (challenge 2: large image changes)

(2) Large image changes. Indoor environments are cluttered with movable objects, e.g., furniture and people, and 3D structures, e.g., pillars and concave bays, causing severe occlusions when viewed from a close distance. The most similar images obtained by retrieval may therefore be visually very different from the query image.

To overcome this problem, we rely on dense feature matches to collect as much positive evidence as possible. We employ image descriptors extracted from a convolutional neural network that can match higher-level structures of the scene rather than relying on matching individual local features. In detail, our pose estimation step performs coarse-to-fine dense feature matching, followed by geometric verification and estimation of the camera pose using P3P-RANSAC.

Paragraph 3 (challenge 3: self-similarity)

(3) Self-similarity. Indoor environments are often very self-similar, e.g., due to many symmetric and repetitive elements on both large and small scales (corridors, rooms, tiles, windows, chairs, doors, etc.). Existing matching strategies count the positive evidence, i.e., how much of the image (or how many inliers) has been matched, to decide whether two images match. This is, however, problematic, as large textureless areas can be matched well, hence providing strong (incorrect) positive evidence.

To overcome this problem, we propose to also count the negative evidence, i.e., what portion of the image does not match, to decide whether two views are taken from the same location. To achieve this, we perform explicit pose-estimate verification based on view synthesis. In detail, we compare the query image with a virtual view of the 3D model rendered from the estimated camera pose of the query. This novel approach takes advantage of the high quality of the RGBD image database and incorporates both the positive and negative evidence by counting matching and non-matching pixels across the entire query image.

As shown by our experiments, this approach is orthogonal to the choice of local descriptors. The proposed verification by view synthesis consistently shows a significant improvement regardless of the choice of features used for estimating the pose.
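
The idea can be sketched as follows: splat the database RGBD points into the estimated query pose and score the pose by per-pixel agreement over the whole image, so that non-rendered or dissimilar pixels count as negative evidence. This is a simplified illustration (the paper compares dense local descriptors; here a plain color difference with an assumed threshold stands in, and z-buffering is omitted):

```python
import numpy as np
import cv2

def pose_similarity_score(query_bgr, db_points_xyz, db_colors_bgr, rvec, tvec, K):
    """Fraction of query pixels supported by the view synthesized at the pose."""
    h, w = query_bgr.shape[:2]
    synth = np.zeros_like(query_bgr)
    valid = np.zeros((h, w), bool)
    proj, _ = cv2.projectPoints(db_points_xyz, rvec, tvec, K, None)
    proj = proj.reshape(-1, 2).round().astype(int)
    inside = ((proj[:, 0] >= 0) & (proj[:, 0] < w) &
              (proj[:, 1] >= 0) & (proj[:, 1] < h))
    x, y = proj[inside, 0], proj[inside, 1]
    synth[y, x] = db_colors_bgr[inside]  # naive point splatting, no z-buffer
    valid[y, x] = True
    # Per-pixel agreement; pixels with no rendered content contribute nothing,
    # which is exactly the negative evidence described above.
    diff = np.linalg.norm(query_bgr.astype(float) - synth.astype(float), axis=2)
    agree = valid & (diff < 40.0)        # assumed color-agreement threshold
    return agree.sum() / float(h * w)
```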

Paragraph 4 (the three steps of the InLoc pipeline)

The InLoc pipeline has the following three steps. Given a query image, (1) we obtain a set of candidate images by finding the N best matching images from the reference image database registered to the map. (2) For these N retrieved candidate images, we compute the query poses using the associated 3D information that is stored together with the database images. (3) Finally, we re-rank the computed camera poses based on verification by view synthesis. The three steps are detailed next.
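
Schematically, the whole pipeline composes the three steps; the callables below are hypothetical placeholders for the components described in Sections 4.1-4.3:

```python
def inloc_localize(query, database, retrieve, estimate_pose, verify, top_n=100):
    """Compose the three InLoc steps; retrieve/estimate_pose/verify are placeholders."""
    candidates = retrieve(query, database, top_n)                    # step (1): retrieval
    poses = [estimate_pose(query, c, database) for c in candidates]  # step (2): dense matching
    scores = [verify(query, p, database) for p in poses]             # step (3): view synthesis
    best = max(range(len(poses)), key=lambda i: scores[i])
    return poses[best]
```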

4.1. Candidate pose retrieval

As demonstrated by existing work [6, 35, 66], aggregating feature descriptors computed densely on a regular grid mitigates issues such as the lack of repeatability of local features detected in textureless scenes, large illumination changes, and the lack of discriminability of image descriptions dominated by features from repetitive structures (burstiness). As already mentioned in Section 1, these problems also occur in large-scale indoor localization, which motivates our choice of an image descriptor based on dense feature aggregation.

Both query and database images are described by NetVLAD [6] (but other variants could also be used), L2 distances between the normalized descriptors are computed, and the poses of the N best matching database images are chosen as candidate poses.
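
With pre-computed NetVLAD descriptors, this step reduces to a nearest-neighbor search; a minimal sketch (the `.npy` files are hypothetical, and since the descriptors are L2-normalized, ranking by Euclidean distance is equivalent to ranking by dot-product similarity):

```python
import numpy as np

db = np.load("db_netvlad.npy")     # (M, D) database descriptors, L2-normalized
q = np.load("query_netvlad.npy")   # (D,)  query descriptor, L2-normalized

dists = np.linalg.norm(db - q, axis=1)       # ||d - q||^2 = 2 - 2 d.q for unit vectors
candidate_ids = np.argsort(dists)[:100]      # top-N shortlist (N is a parameter, e.g. 100)
```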

In Section 5, we compare our approach with state-of-the-art image descriptors based on local feature detection and show the benefits of our approach for indoor localization.

4.2. Pose estimation using dense matching

Paragraph 1 (pose estimation with dense features)

A severe problem in indoor localization is that standard geometric verification based on local feature detection [51, 54] does not work in textureless or self-repetitive scenes, such as corridors, where robots (and also humans) often get lost. Motivated by the improvements in candidate pose retrieval with dense feature aggregation (Section 4.1), we use features densely extracted on a regular grid for verifying and re-ranking the candidate images by feature matching and pose estimation.

A possible approach would be to match DenseSIFT [41] followed by RANSAC-based verification. Instead of tailoring the DenseSIFT description parameters (patch sizes, strides, scales) to match across images with significant viewpoint changes, we use an image representation extracted by a convolutional neural network (VGG-16 [61]) as a set of multi-scale features extracted on a regular grid, describing higher-level information with a larger receptive field (patch size).

We first find geometrically consistent sets of correspondences using the coarser conv5 layer containing high-level information. Then we refine the correspondences by searching the finer conv3 layer in the neighborhood of the coarse matches.
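
A minimal sketch of the coarse matching stage, assuming torchvision's pre-trained VGG-16 weights are available; it extracts L2-normalized conv3/conv5 activation grids and keeps mutually-nearest-neighbor matches on the coarse conv5 grid (the neighborhood-restricted conv3 refinement described above is left out for brevity). This is an illustration, not the paper's code:

```python
import torch
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

@torch.no_grad()
def feature_maps(img):
    """img: (1, 3, H, W), ImageNet-normalized. Returns (conv5, conv3) descriptor grids."""
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in (15, 29):  # ReLU outputs of conv3_3 and conv5_3 in torchvision's VGG-16
            feats[i] = torch.nn.functional.normalize(x, dim=1).squeeze(0)  # (C, H, W)
    return feats[29], feats[15]

def mutual_nn(fa, fb):
    """Mutually-nearest-neighbor matches between two (C, H, W) descriptor grids."""
    a = fa.flatten(1).t()  # (Ha*Wa, C)
    b = fb.flatten(1).t()  # (Hb*Wb, C)
    sim = a @ b.t()
    ab, ba = sim.argmax(1), sim.argmax(0)
    ia = torch.arange(a.shape[0])
    keep = ba[ab] == ia    # cross-check: keep only mutual best matches
    return ia[keep], ab[keep]  # paired cell indices on the two conv5 grids
```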
