[Localization Paper Reading Series] Indoor Visual Positioning Aided by CNN-Based Image Retrieval: Training-Free (Part 2)

3.4.3. Scale Determination

The main property of 2D-to-2D motion estimation is the epipolar constraint, which requires corresponding points in the two images to satisfy a homogeneous (zero-valued) equation: $x_2^\top E\, x_1 = 0$, where $E$ is the essential matrix.

Therefore, the equality remains valid when the essential matrix is multiplied by any non-zero scalar: if $x_2^\top E\, x_1 = 0$, then $x_2^\top (sE)\, x_1 = 0$ as well.

In other words, the essential matrix is determined only up to scale, so by itself it cannot correspond to the metric scale of the real scene. This section explains how the scale is determined from two reference images.

In the image retrieval stage, the two most similar images I1 and I2 are returned. With the camera calibrated, the scale can be computed from the transformation between the two images and their given poses.

By applying the epipolar constraint to images I1 and I2, we can obtain the transform matrix $T_{12}$ from I1 to I2:

$$
T_{12} = \begin{bmatrix} R_{12} & t_{12} \\ \mathbf{0}^\top & 1 \end{bmatrix}
$$

where $R_{12}$ and $t_{12}$ are the rotation and (up-to-scale) translation decomposed from the essential matrix.
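To make this step concrete, here is a minimal OpenCV sketch (not the paper's code): it recovers $R_{12}$ and the up-to-scale $t_{12}$, using OpenCV's five-point RANSAC solver rather than the eight-point algorithm mentioned in Section 4.2, and the matched points `pts1`/`pts2` and intrinsics `K` are assumed inputs.

```python
import cv2
import numpy as np

def relative_pose_2d2d(pts1: np.ndarray, pts2: np.ndarray, K: np.ndarray):
    """Recover T12 from matched 2D points via the essential matrix."""
    # E is only defined up to a non-zero scalar (the epipolar constraint is homogeneous).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E into R, t and resolve the fourfold ambiguity via the cheirality check.
    _, R12, t12, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    T12 = np.eye(4)
    T12[:3, :3], T12[:3, 3] = R12, t12.ravel()
    return T12  # ||t12|| = 1: the metric scale is still missing
```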
The poses of images I1 and I2 can be written in homogeneous-coordinate form.

Take image I1 as an example: its pose is $P_1 = \{x_1, y_1, z_1, q_{x1}, q_{y1}, q_{z1}, q_{w1}\}$.

Its translation is $t_1 = [x_1, y_1, z_1]^\top$ and its rotation $R_1$ can be denoted as in Equation (13). Image I1 can then be represented by the transform matrix $T_1$ of Equation (14).

$$
R_1 = \begin{bmatrix}
1-2(q_{y1}^2+q_{z1}^2) & 2(q_{x1}q_{y1}-q_{z1}q_{w1}) & 2(q_{x1}q_{z1}+q_{y1}q_{w1}) \\
2(q_{x1}q_{y1}+q_{z1}q_{w1}) & 1-2(q_{x1}^2+q_{z1}^2) & 2(q_{y1}q_{z1}-q_{x1}q_{w1}) \\
2(q_{x1}q_{z1}-q_{y1}q_{w1}) & 2(q_{y1}q_{z1}+q_{x1}q_{w1}) & 1-2(q_{x1}^2+q_{y1}^2)
\end{bmatrix} \tag{13}
$$

$$
T_1 = \begin{bmatrix} R_1 & t_1 \\ \mathbf{0}^\top & 1 \end{bmatrix} \tag{14}
$$
The transformation $T'_{12}$ from image I1 to image I2 can be computed as in Equation (15), where $\mathrm{inv}(T_1)$ is the inverse of matrix $T_1$.

The relationship between the transformation $T'_{12}$ and its corresponding rotation $R'_{12}$ and translation $t'_{12}$ is given in Equation (16).

$$
T'_{12} = \mathrm{inv}(T_1)\, T_2 \tag{15}
$$

$$
T'_{12} = \begin{bmatrix} R'_{12} & t'_{12} \\ \mathbf{0}^\top & 1 \end{bmatrix} \tag{16}
$$
Comparing $T'_{12}$ with the transform matrix $T_{12}$ computed from the 2D-to-2D feature correspondences, we find that the rotation $R_{12}$ is almost equal to $R'_{12}$; however, $t_{12}$ and $t'_{12}$ differ greatly and are related by a scale factor as in Equation (17), where $s$ is the scale.

$$
t'_{12} = s\, t_{12} \tag{17}
$$
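Putting Equations (13)-(17) together, the scale computation can be sketched in a few lines of numpy (not the paper's code; poses are assumed to be given as (x, y, z, qx, qy, qz, qw) in a common world frame):

```python
import numpy as np

def pose_to_T(pose):
    """Build the 4x4 homogeneous transform of Equations (13)-(14) from a pose."""
    x, y, z, qx, qy, qz, qw = pose
    R = np.array([  # standard quaternion-to-rotation-matrix conversion, Eq. (13)
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, (x, y, z)  # Eq. (14)
    return T

def scale_from_references(pose1, pose2, t12):
    """Compare the metric t'12 from the reference poses with the unit-norm t12."""
    T12_prime = np.linalg.inv(pose_to_T(pose1)) @ pose_to_T(pose2)  # Eq. (15)
    t12_prime = T12_prime[:3, 3]                                    # Eq. (16)
    return np.linalg.norm(t12_prime) / np.linalg.norm(t12)          # s in Eq. (17)
```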

4. Experimental Evaluation

4.1. Data Acquisition

Paragraph 1 (two visual odometry datasets are used)

For our experimental setup, we utilize images and their poses from visual odometry trajectories, since the task of visual positioning is similar to visual odometry tasks.

In this experiment, the ICL-NUIM dataset [60] and the TUM RGB-D dataset [61] are adopted.

Paragraph 2 (the first dataset)

ICL-NUIM: a dataset consisting of RGB-D images from camera trajectories in two indoor scenes, a living room and an office.

The images were collected by a handheld Kinect RGB-D camera, and the ground-truth trajectories were obtained using Kintinuous [62].

The images were captured at 640 × 480 resolution. Four trajectories were recorded in each scene, with images taken at different positions along each trajectory. Images obtained at different poses are shown in Figures 6 and 7.

[Figure 6: Images of the office scene from the ICL-NUIM dataset.]

[Figure 7: Images of the living-room scene from the ICL-NUIM dataset.]

Paragraph 3 (the second dataset)

TUM RGB-D: a dataset containing color and depth images from a Microsoft Kinect sensor along with ground-truth camera-pose trajectories, created to establish a benchmark for evaluating visual SLAM systems.

The images are at a resolution of 640 × 480, and the ground-truth trajectories were obtained from a high-accuracy motion-capture system.

The dataset consists of 89 sequences with different camera motions.

These datasets are employed to verify the performance of the proposed method; images covering different scales of captured area vary in their ability to represent the scenario at different levels.

Among these datasets, as shown in Figure 6, images of the office scene in ICL-NUIM can represent a larger area such as half a room, whereas, as shown in Figure 8, the area represented in TUM RGB-D varies from a corner to part of a room.
[Figure 8: Two scenes from the ICL-NUIM and TUM RGB-D datasets.]

As illustrated in Table 2, we choose a subset of images from the datasets to represent the scenarios; the numbers of training images and test images are also shown in the table.
[Table 2: Images selected from different scenes of ICL-NUIM and TUM RGB-D to compose our own experimental dataset.]

It should be noted that the database images are hand-picked to cover the application scenarios.

The intrinsic parameters of the RGB camera can be obtained from Reference [63].

4.2. Performance of Image Retrieval

Paragraph 1 (evaluating image retrieval performance with feature extraction methods)

In this section, we evaluate the image retrieval performance by means of feature extraction methods on our database.

It is important to note that evaluating image retrieval performance is time-consuming because the feature extraction stage is expensive; nevertheless, this evaluation is not needed when applying our visual positioning algorithm.

Paragraph 2 (using the number of matches to evaluate retrieval performance)

Generally, mean average precision (mean AP) is applied to evaluate the performance of image retrieval quantitatively; it compares the query image with the top retrieved images belonging to the same categories.

A comparison of traditional image retrieval methods with CNN-based image retrieval methods is illustrated in References [29,64].

However, in the proposed image-retrieval-based visual positioning procedure, mean AP cannot effectively characterize the retrieval result.

Since the feature extraction and matching stage heavily affects the pose estimation result, image pairs should share as many feature points as possible, which makes it essential that the query image and the retrieved images share some common area.

Therefore, we calculate the number of matched features between images to evaluate the result of image retrieval.

Paragraph 3 (computing matches with Hamming distance)

To evaluate the performance of image retrieval, we extract feature points and descriptors from each pair of images and count the number of good matches.

In our retrieval stage, the three most similar images are returned.

We extract ORB features from these retrieved images together with the query image, then match the features of the query image against each of its corresponding retrieved images.

Note that the Hamming distance is employed to compute the distance between ORB descriptors.

The minimal distance among matched descriptors is then computed for each image pair, and matched feature points whose distance is less than a threshold are labeled as good matches.

Moreover, when determining good matches among ORB descriptors, the threshold is defined as the larger of twice the minimal distance and a constant, since the minimal Hamming distance can sometimes be quite small.
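In OpenCV terms, this good-match criterion can be sketched as follows (not the paper's code; the constant, set here to 30, is not specified in the text and is an assumed value):

```python
import cv2

def count_good_matches(img_query, img_retrieved, const: int = 30) -> int:
    """Count matches whose Hamming distance is below max(2 * min_dist, const)."""
    orb = cv2.ORB_create()
    _, des1 = orb.detectAndCompute(img_query, None)
    _, des2 = orb.detectAndCompute(img_retrieved, None)
    # Brute-force matching with Hamming distance, suitable for binary ORB descriptors.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    min_dist = min(m.distance for m in matches)
    threshold = max(2 * min_dist, const)
    return sum(1 for m in matches if m.distance <= threshold)
```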

As shown in Table 3, a great number of good matches are detected, which is sufficient for the eight-point algorithm in the pose estimation stage.

[Table 3: Average number of matches of ORB (Oriented FAST and Rotated BRIEF) features in the two datasets.]

The experimental results show that the top-ranked similar images share more good matches.

Paragraph 4 (retrieval accuracy of the proposed method)

In Reference [65], Jason et al. also developed an image-based indoor localization scheme that uses FLANN search on SIFT features; their experiment successfully matched only 78 out of 83 images, achieving 94% retrieval accuracy.

In contrast, our proposed method, aided by CNN-based image retrieval, achieves a retrieval rate of more than 99% over 8267 images (the output images share a common area with the query image), owing to the more powerful image representations of CNN features.

4.3. Localization Results and Analysis

Paragraph 1 (performance analysis with figures)

Figures 9 and 10 summarize the performance of the pose estimation stage of the proposed scheme.

[Figure 9: Cumulative distribution function of the localization error.]

[Figure 10: Cumulative distribution function of the angular error.]

As shown in Figure 9a, our method localizes over 90% of the query images in both datasets with sub-meter accuracy.

Furthermore, more than 80% of the query images are successfully localized within 0.25 m of the ground-truth position.

As shown in Figure 9b, about 90% of the query images are localized within 3 degrees of the ground-truth orientation.
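For reference, error statistics of this kind can be computed as in the sketch below (not the paper's code; the angular metric $2\arccos|\langle q_{est}, q_{gt}\rangle|$ is an assumed, commonly used choice):

```python
import numpy as np

def error_statistics(t_est, t_gt, q_est, q_gt):
    """Median and 90th-percentile pose errors from Nx3 translations and Nx4 unit quaternions."""
    trans_err = np.linalg.norm(t_est - t_gt, axis=1)           # meters
    dots = np.clip(np.abs(np.sum(q_est * q_gt, axis=1)), 0.0, 1.0)
    ang_err = np.degrees(2.0 * np.arccos(dots))                # degrees
    return {
        "median_trans_m": float(np.median(trans_err)),
        "trans_90pct_m": float(np.percentile(trans_err, 90)),
        "median_ang_deg": float(np.median(ang_err)),
        "ang_90pct_deg": float(np.percentile(ang_err, 90)),
    }
```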

We report performance in terms of the median errors of translation and orientation for each scene in the datasets, as shown in Table 4. The median translation errors of our proposed method are around the sub-meter level; the 90th-percentile accuracy is 0.28 m on the ICL-NUIM dataset and 0.45 m on the TUM RGB-D dataset. The median orientation error is within 1°, and the 90th-percentile accuracy is 0.94° for ICL-NUIM and 2.03° for TUM RGB-D.

It is important to note that the statistics in Table 4 do not exclude outliers, which enlarges the mean localization error.
[Table 4: Localization performance in different scenes from the different datasets.]

Paragraph 2 (performance analysis with tables)

The proposed localization method combines CNN features and point features to estimate the pose.

We compare the accuracy of the proposed method with the average pose estimation errors of three different CNN-based localization methods: (i) PoseNet, which directly regresses the camera pose with a CNN; (ii) 4D PoseNet, modified from PoseNet to accommodate RGB-D input; and (iii) CNN+LSTM [42], which uses PoseNet as a baseline pose estimator with an LSTM acting as a temporal filter on the estimated pose sequence.

Table 5 summarizes statistics of the average pose estimation errors of these methods on the ICL-NUIM dataset.

[Table 5: Average pose estimation errors on the ICL-NUIM dataset.]

We achieved better position accuracy in both scenarios and comparable accuracy in orientation.

Paragraph 3 (the proposed model has a low storage cost)

More importantly, compared with state-of-the-art learning-based methods, the proposed scheme uses far fewer images in the database construction period, which is significant for generalizing the application of visual positioning.

As is known, learning-based methods need a large quantity of pose-tagged images to train a model (see Table 6).

[Table 6]

However, in the proposed visual localization scheme, far fewer images are required.

Less than 10% of the images are needed compared with learning-based methods, while still achieving comparable localization accuracy.

Furthermore, CNN features reduce the cost of model storage.

Our CNN feature model takes 1.8 MB of storage for 886 images from the TUM RGB-D database, whose raw images occupy 419.4 MB; 1.2 MB is needed for 593 images from the ICL-NUIM database, whose raw RGB images need 175.1 MB. In other words, the CNN representation amounts to roughly 2 KB per image, over two orders of magnitude smaller than the raw images.

Paragraph 4 (retrieval speed of the method)

We implemented the proposed localization scheme on an Intel Core i7-7700 CPU @ 3.60 GHz. It takes 324.5 ms on average to find the 10 best matches for a single image among the 593 images of the ICL-NUIM database, and 336.5 ms to find the 10 best matches among the 886 images of the TUM RGB-D database. Pose estimation costs 88.5 ms on average.

The whole procedure from image retrieval to pose estimation takes ~0.45 s to output a location for a single image. We also employed an NVIDIA TITAN XP GPU to accelerate the image retrieval computation; 20 ms and 12 ms are needed to find the 10 best matches on ICL-NUIM and TUM RGB-D, respectively, and the whole procedure then takes ~0.1 s per image.
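For intuition, finding the 10 best matches reduces to a nearest-neighbor search over the precomputed CNN descriptors; a minimal sketch (assuming L2-normalized global descriptors stored as rows of `db_feats`) is:

```python
import numpy as np

def top_k_matches(query_feat: np.ndarray, db_feats: np.ndarray, k: int = 10):
    """Return the indices of the k database images most similar to the query."""
    sims = db_feats @ query_feat   # cosine similarity for L2-normalized features
    return np.argsort(-sims)[:k]
```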

Computing and storing the CNN representations of the database images is done offline, and the retrieval evaluation stage illustrated in Section 4.2 is unnecessary in the localization procedure.

5. Discussion

Paragraph 1 (the principle of the proposed visual localization)

In this study, we presented an image-retrieval-aided approach for indoor visual localization.

A CNN-based image retrieval method was adopted to recognize a given query image by retrieving the matching geo-tagged images.

The CNN-based strategy not only provides output images with high spatial correlation, but also offers a new idea for scene representation.

To put it another way, we no longer have to represent the space by its 3D model; instead, we can design the spatial-model representation in line with how the spatial model is used.

For instance, a group of CNN features together with the original images and poses can represent a whole area for visual localization purposes.

A feature-point-correspondence strategy was then applied to estimate the precise location and pose of the query image. Experimental results demonstrated that our monocular visual-odometry-inspired pose estimation method yields high-precision localization results.

Paragraph 2 (advantages of the proposed method)

Our result clearly outperforms that of retrieval-based methods without CNN feature extraction in terms of robustness, given the complex and unstable indoor environments.

Compared with 2D-to-3D and 3D-to-3D pose estimation methods, our strategy depends only on a calibrated monocular camera during the online localization period.

Furthermore, our method does not require building 3D models, a process that is considered expensive.

Compared with end-to-end learning-based methods that directly regress the pose from input images, the advantage of our method lies in the offline preparation period.

As is known, deep learning has become an extremely powerful tool in computer vision tasks, but because it is a data-hungry approach, a massive amount of high-quality training data is required.

Experimental results have shown that, when achieving comparable localization accuracy, the number of database images our method requires is far smaller than that of learning-based methods.

Besides, learning-based strategies rely heavily on a training stage with a large quantity of geo-tagged images: expanding the application area would require retraining the whole model and would grow the model size, whereas in our method the growth of raw data has no effect on the extractor and only adds correspondingly to the database.

Paragraph 3 (advantages and future outlook)

Moreover, owing to the data acquisition scheme and the image-based localization process, our method has great potential to be extended to a crowdsourcing-based method.

Raw data from different sources can be integrated to compose the database.

In an indoor environment, images are captured by cameras on cellphones, robots, or other platforms, and pose information can be obtained through pose-measuring infrastructure.

In fact, our proposed scheme is not a data-hungry solution like DCNNs; a limited set of images with high-precision poses is the key.

In the future, the problems we need to address are the strategy for defining the space by a set of images and the approach for obtaining high-precision pose information for database images.

For the image retrieval phase, a more efficient and robust method, as well as more complicated and larger-scale environments, needs to be considered in future work.

6. Conclusions

Translated sentence by sentence

In summary, our solution is highly applicable to different and complex environments, and easily extends to changes in the raw data.

We utilize a CNN-based image retrieval strategy that represents the scene by CNN features and matches the query image with database images.

After that, the pose of the query image is recovered from ORB feature-point correspondences, which is both efficient and effective.

Based on the state-of-the-art studies of indoor visual localization systems, to the best of our knowledge, this work is the first to adopt both a CNN-based image retrieval strategy and RGB images alone for accurate localization, which is highly applicable to monocular vision positioning tasks.

We believe image-based localization methods may become mainstream, since the data acquisition scheme and the pose estimation algorithm accord with the current state of data expansion.

The from-coarse-to-accurate strategy can then be efficiently adopted over a much larger application range.
