Paper Translation and Notes -- "Fast loop-closure detection using visual-word-vectors from image sequences" (Part 1)

Abstract

In this paper, a novel pipeline for loop-closure detection is proposed. We base our work on a bag of binary feature words and we produce a description vector capable of characterizing a physical scene as a whole. Instead of relying on single camera measurements, the robot’s trajectory is dynamically segmented into image sequences according to its content. The visual word occurrences from each sequence are then combined to create sequence-visual-word-vectors and provide additional information to the matching functionality. In this way, scenes with considerable visual differences are firstly discarded, while the respective image-to-image associations are provided subsequently. With the purpose of further enhancing the system’s performance, a novel temporal consistency filter (trained offline) is also introduced to advance matches that persist over time. Evaluation results prove that the presented method compares favorably with other state-of-the-art techniques, while our algorithm is tested on a tablet device, verifying the computational efficiency of the approach.

Keywords
Loop-closure detection, image sequences, visual SLAM, mobile robotics, low-power embedded systems


1. Introduction

The problem of visual place recognition (vPR) refers to the ability of a system to recognize a scene based on visual sensing; it has been used during the last decade to address many challenges in mobile robotics. As part of a simultaneous localization and mapping (SLAM) system, vPR has been applied in a variety of forms and alterations, such as the loop-closure detection (LCD) and the relocalization procedures. An LCD engine is responsible for detecting revisited trajectory regions and creating additional edge constraints between the current and earlier pose nodes on graph-based SLAM systems (Folkesson and Christensen, 2004; Grisetti et al., 2010; Thrun and Montemerlo, 2006). Those additional edge constraints provide supplementary information regarding the measurements’ arrangement in the 3D space, and they can be used to further improve the SLAM output in an online or post-processing manner (Latif et al., 2013; Mei et al., 2009; Mur-Artal et al., 2015; Strasdat et al., 2010). Moreover, a relocalization system utilizes the visual information to recover the robot’s position in an already known environment (the problem of a kidnapped robot) or in localization-failure scenarios (Konolige et al., 2010; Mur-Artal and Tardós, 2014; Wolf et al., 2005). Even though these challenges refer to different applications, they share the same basic functionality of identifying a previously visited scene and thus can be addressed by common solutions.


During the past decade, a plethora of vPR techniques has been presented in the literature. Williams et al. (2009) distinguishes the approaches into three main categories with respect to the type of data they associate. In the first category, referred to as “map-to-map”, correspondences are found between features, taking into account both their appearance and their relative location inside the world. Furthermore, “image-to-map” methods aim to recognize places by associating features between the latest acquired frame and a retained spatial representation of the already-seen world. Finally, “image-to-image” matching approaches (or appearance-based techniques) detect correspondences between the images themselves and present better scaling capabilities in long-trajectory cases.


The most common approach for addressing appearance-based LCD tasks refers to the characterization of each individual frame by an aggregation of local image descriptors. As the robot moves, revisited places are detected by measuring content similarities between the current input frame (query) and all the previous ones (database). To provide efficiency in the implementation, the bag of visual words (BoVW) model can be utilized as a means of quantizing the extracted descriptors’ space. In the general case, every input frame is assigned with one image-visual-word-vector (I-VWV). The entries of this vector correspond to a weighted frequency of occurrence for every visual word in the given image (histogram). The created I-VWVs are treated as image descriptors, thus loop-closing pairs of camera poses are recognized by calculating similarity metrics between them. The aforementioned approach was initially inspired by image-retrieval techniques (Sivic and Zisserman, 2003), yet in some vPR algorithms, measurements obtained from close-in-time instances are summed to enhance the results. Finally, it has been proven that the representation of the created BoVW with a tree structure (vocabulary tree) significantly improves the computational efficiency (Nister and Stewenius, 2006).

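To make the vocabulary-tree lookup concrete, here is a minimal Python sketch (all class and function names are hypothetical, not from the paper's implementation) of how a binary descriptor is mapped to a visual word by descending such a tree: with $K$ branches per level and $L$ levels, a lookup costs $O(KL)$ distance computations instead of $O(K^L)$ against a flat vocabulary.

```python
import numpy as np

def hamming(a, b):
    # Hamming distance between two binary descriptors stored as uint8 arrays.
    return int(np.unpackbits(a ^ b).sum())

class VocabNode:
    def __init__(self, centroid, children=(), word_id=None):
        self.centroid = centroid       # representative binary descriptor
        self.children = list(children)
        self.word_id = word_id         # set only on leaves, i.e. the visual words

def descriptor_to_word(root, desc):
    """Greedily descend the tree, following the closest child at each level."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: hamming(c.centroid, desc))
    return node.word_id
```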

In this work, we present an improved pipeline for appearance-based LCD, which combines the visual information from multiple frames to describe a physical scene as a total. As the robot moves, the input camera stream is dynamically segmented into intervals (image sequences), based on the scene’s content variations. For each image sequence, the extracted feature descriptors are converted into visual words and combined to produce one global sequence-visual-word-vector (S-VWV) as well as the individual I-VWVs. Thus, the revisited trajectory regions are detected on a first level by measuring the similarities between all S-VWVs in the database, while the loop-closing frames are determined using the individual I-VWVs only for the associated sequences’ image-members. A typical example of the aforementioned procedure is illustrated in Figure 1. Note that, henceforth, the term “sequence” will refer to “sequence of images,” for brevity.


Notes: 1) An S-VWV is a single vector characterizing an entire image sequence, while an I-VWV is a single vector characterizing one image. 2) A visual word is a cluster of similar feature descriptors; distinct visual words differ from one another.
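As a road map for the pipeline just described, a minimal Python sketch of the two-level matching follows (hypothetical names and placeholder thresholds; the actual system uses the learned sequence classifier of Section 3.3 and the $r_i$ gate of Section 3.4):

```python
import numpy as np

SEQ_THRESHOLD = 0.3   # placeholder values; the paper derives its thresholds
IMG_THRESHOLD = 0.3   # from training (Section 3.3) and tuning (Section 3.4)

def l2_score(v1, v2):
    # Similarity in [0, 1]: higher means more similar (cf. equation (5) later).
    v1, v2 = v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)
    return 1.0 - 0.5 * np.linalg.norm(v1 - v2)

def detect_loop_closures(query_seq, database):
    """query_seq and database entries carry .svwv and .images[*].ivwv arrays."""
    matches = []
    for db_seq in database:                       # level 1: sequence matching
        if l2_score(query_seq.svwv, db_seq.svwv) < SEQ_THRESHOLD:
            continue                              # visually different scene: discarded
        for q_img in query_seq.images:            # level 2: image associations
            best = max(db_seq.images, key=lambda d: l2_score(q_img.ivwv, d.ivwv))
            if l2_score(q_img.ivwv, best.ivwv) >= IMG_THRESHOLD:
                matches.append((q_img, best))
    return matches
```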


Fig. 1. 3D representation of proposed loop-closure detection pipeline tested on Malaga 2009 Parking 6L (Blanco et al., 2009) dataset. As the robot moves, the executed trajectory is segmented into intervals or sequences (illustrated with different colors). The formulated S-VWVs are used to detect sequence matches (marked with the magenta plane) and signal the existence of loop-closing frames. The individual image-to-image associations (marked with green links) are provided via the individual I-VWVs.


The main contributions of the paper in hand can be summarized as follows:

  • Using a description vector capable of characterizing an image sequence as a whole, our method provides more information to the matching functionality advancing the LCD performance. Additionally, since our pipeline relies on such a descriptor, rather than accumulating the similarities between multiple I-VWVs, the system’s performance is not restricted by a per-frame perception of the environment.
  • With the view of further enhancing the produced sequence similarity measurements, our algorithm introduces a temporal consistency filter over the similarity matrix entries. The corresponding kernel’s coefficients are calculated using a cost-function minimization scheme on a set of training samples.
  • The proposed methodology entails a reduced computational complexity, as compared with other vPR techniques, since our first level of sequence-to-sequence matching excludes the trajectory regions that are absolutely different in the general view. In addition, the nature of our pipeline provides an efficient way to further reduce the visual word votes by considering only the entries that persist during the sequence formulation. An implementation of the proposed algorithm is tested on a mobile device utilizing the parallel execution capabilities of the ARM-NEON coprocessor and proving its ability to run in real time (in the sense of processing the input faster or in equal time with the execution frequency of a modern key-frame SLAM system), even for a less powerful machine.


A preliminary version of the presented work appeared in Bampis et al. (2016). In this paper, we advance the system’s performance by adopting a rotation- and scale-invariant local feature descriptor and a dynamic sequence identification technique, while, additionally, we address the temporal consistency filtering as a classification problem operating on the sequence similarity scores. Furthermore, we provide a complete justification of the benefits offered by a unified visual-word-vector and extend our experiments to fully evaluate the performance of our algorithm. Finally, extensive comparative results are presented against other state-of-the-art sequence-based vPR techniques, proving the capabilities of the S-VWV based description.


The following section contains a discussion of related work on the field of vPR and subsequently introduces the advantageous matching properties of the introduced S-VWVs. Section 3 describes in detail our online pipeline, together with the preprocessing steps for the vocabulary tree formulation and filter training. In Section 4, our experimental evaluation and comparative results against other state-of-the-art approaches are presented. The computational benefits of the proposed approach, together with the employed parallelization techniques and implementation details of the tested mobile device application, are summarized and assessed in Section 5. Finally, Section 6 draws our final conclusions by describing our algorithm’s potentials and contributions.


2. From image to sequence description

In this section, we discuss some of the most representative techniques in the field of appearance-based vPR with the aim of leading our reader to the comprehension of the proposed sequence description method. For an extended survey of vPR, the reader can refer to the work of Lowry et al. (2016).


2.1. Single-image-based visual place recognition

Probably the most acknowledged method in the field of vPR is FAB-MAP (Cummins and Newman, 2008). According to that method, co-occurrence probabilities between observed visual words are used to perform appearance-based vPR. Although FAB-MAP constitutes the foundation of a plethora of later methodologies, it suffers in terms of performance, when repetitive patterns are accounted for (Piniés et al., 2010), and execution time, owing to the expensive extraction and matching of SURF features (Bay et al., 2006). In a later work, the same authors introduced an improved sparse approximation of their original technique, called FAB-MAP 2.0 (Cummins and Newman, 2011), allowing their system to scale by more than two orders of magnitude. Another representative approach was proposed by Angeli et al. (2008), where the description relied on two visual vocabularies (one from SIFT descriptors (Lowe, 2004) and another from local color histograms). Using a Bayesian filter, the detection was enhanced, taking into account the matching probability of previously obtained measurements. Schindler et al. (2007) provided a more sophisticated representation of the visual vocabulary, with a tree structure addressing city-scale vPR challenges. In their work, the Greedy N-Best Paths algorithm was used so as to cluster the feature descriptors incrementally.


More recent techniques have deviated from the aforementioned probabilistic approach of detecting loop closures with floating-point descriptors, like SIFT or SURF, offering faster but still competitive results (Gálvez-López and Tardós, 2012; Khan and Wollherr, 2015; Mur-Artal and Tardós, 2014). More specifically, visual words from binary features, found in every camera measurement, are used to create I-VWVs. Thus, the detection of revisited places is achieved by obtaining similarity metrics, based on the L1/L2 norm, between the individual I-VWVs. Gálvez-López and Tardós (2012) proposed a typical technique for this approach with the DBoW2 algorithm. Since in their case the Bayesian filtering was not included, the matching candidates were forced to follow a temporal consistency constraint. Mur-Artal and Tardós (2014) enforced DBoW2 by exploring the usage of a more sophisticated binary descriptor (“Oriented FAST and Rotated BRIEF”, or “ORB” (Rublee et al., 2011)) and provided a real-time vPR, relocalization, and LCD system.


Additionally, since the offline formulation of a visual vocabulary is not suitable for every application, some researchers have suggested the online development of a BoVW by estimating an average representation of repetitive descriptors. For instance, Labbé and Michaud (2013) proposed the formulation of an online vocabulary based on a randomized forest of k-d-trees, achieving exquisite performance for large-scale environments. Although their technique is capable of recognizing revisited places in constant time, independently of the traversed trajectory’s length, the computationally expensive SURF feature extraction and the constant updates of their vocabulary render the approach less appealing for normal scale scenarios (such as 20k–30k input frames). Aiming for an immediate reduction of the execution time, Khan and Wollherr (2015) proposed the online creation of a binary vocabulary based on the insertion of new visual words whenever an unfamiliar descriptor is obtained. Since their method (“Incremental Bag of Binary Words for Appearance-Based Loop-Closure Detection” or “IBuILD”) utilizes efficient binary operations, it constitutes a more attractive solution in terms of computational complexity.

Recently, the concept of sequence-to-sequence matching has also been introduced in the literature. Newman et al. (2006), in their work for outdoor SLAM applications, pointed out the advantages of matching sequences instead of individual frames using an accumulative version of similarity matrices. The same notion appears (even on some abstract level) in other techniques as well (e.g. Gálvez-López and Tardós, 2012; Mur-Artal and Tardós, 2014), proving that LCD performance can be strengthened when visual information from more than one camera measurement is considered. Even though these techniques aim to take advantage of the additional information from the entire scene, they treat each sequence as an aggregation of image description vectors rather than visual words, subjecting their matching procedure to a per-frame view of the environment (visual words are redundantly clustered by camera measurements). On the contrary, our method reformulates the process of creating visual-word-vectors and considers the whole sequence as a single “super-frame”. This approach offers invariance to the visual words’ distribution over the camera measurements, and will be further analyzed in the following subsection.

Finally, sequence-based techniques have been reported, addressing the vPR task under extreme environmental changes originated from different lighting conditions (day and night) or year seasons (Arroyo et al., 2015; Milford and Wyeth, 2012). Even though the choice of more traditional local feature descriptors is avoided (owing to the inability of matching under such intense environmental changes (Valgren and Lilienthal, 2010)), the use of global sequence descriptors is proven to be crucial for the achieved performance. Most recently, condition-invariant vPR techniques have been presented based on the classification characteristics of convolution neural networks (CNNs). Methods like those presented by Sünderhauf et al. (2015a,b) and Arroyo et al. (2016) treat the output of particular CNN layers, initially trained for object detection tasks, as image descriptors and address the vPR problem by measuring the distances between them. Even though CNN-based techniques offer superior retrieval performances, they are still decoupled from the LCD and SLAM functionalities. Sizikova et al. (2016) and Fei et al. (2016), in their respective works, accurately pointed out the CNN’s dependence on viewpoint-invariant surface appearances and the lack of topological information at the higher network levels, which characterize them as suboptimal for LCD tasks. On the contrary, local feature-based techniques are widely used in visual SLAM applications (Cieslewski and Scaramuzza, 2017; Davison et al., 2007; Klein and Murray, 2007; Lim et al., 2014; Mur-Artal et al., 2015) and can be efficiently combined with an illumination-invariant image representation technique (e.g. Maddern et al., 2014; Shakeri and Zhang, 2016) to further improve their robustness over potential environmental changes. However, such an application is beyond the scope of this paper and thus it is not further discussed.

2.2. Establishing the necessity of sequence-visual-word-vectors

Given an actual pair of loop-closing images, there is no guarantee that a sufficient subset of common visual words will be detected in every case, since a single image can be subject to aliasing, contain noise or moving objects, etc. Thus, it is expected that an absolute thresholding, over the similarity scores between single instances, would fail in detecting some of the trajectory’s loops, or would also result in many false-positive detections (when a tolerant thresholding is applied). Many existing techniques (e.g. Gálvez-López and Tardós, 2012; Newman et al., 2006; Milford and Wyeth, 2012; Mur-Artal and Tardós, 2014) choose to support their detection by accumulating similarity metrics $F_s(S_1, S_2)$ from many images acquired close together in time. In the general case, succeeding frames are treated as sequences of multiple I-VWVs. These groups of I-VWVs (for instance $S_1$ and $S_2$) are then compared with the database and assigned an additive score of $F_s(S_1, S_2) = \sum_{i \in S_1,\, j \in S_2} F_I(I_i, I_j)$. Although this approach produces effective results, it is limited to a per-frame representation of the environment rather than offering a description of the whole sequence. Considering a simple example, such as the one presented in Figure 2(a), each visual word of a given scene might not constantly be inside the camera’s frustum or, for any reason, might not be found by the feature detector (Figure 2(b)). This inconsistency entails I-VWVs with considerably uneven values even though they refer to the same scene. As a result, the $F_s(S_1, S_2)$ scores often lead to a false interpretation of the actual sequences’ similarity in a variety of operational scenarios.



Fig. 2. Efficiency of proposed loop-closure detection approach with simplified real-world scenario.
(a) Simplistic real-world example with robot passing through the same scene twice (camera pose sets $1.x$ and $2.x$). A different subset of the scene’s visual words is observed by each pose.
(b) The input images produce I-VWVs with considerably uneven structures between sequences $1.x$ and $2.x$.
(c) The proposed S-VWVs contain the vocabulary entries found during each sequence in a common description vector, as if they were observed by two “super-frames”.


Note: a "sequence" here refers to a time-ordered series of images.

With this notion in mind, the aforementioned approaches can be characterized as sequence-matching techniques rather than sequence-descriptive ones. As opposed to treating a sequence as the summation of individual matching scores, our method achieves a description vector that contains every visual word found in the scene. Using a computationally efficient approach, for a given sequence of images, the visual words found in every camera measurement are gathered and vote on the respective bin of a common description vector (S-VWV). Note that multiple instances corresponding to the same visual word (i.e. a visual word observed by multiple frames) are treated as one, since they refer to the same feature in the world. A realization worth noticing here is that the proposed S-VWV-to-S-VWV matching would present the same results as the earlier approaches only under the false assumption that the used similarity metrics preserved the additive property of linear mapping. As can be seen in Figure 2(c), our method produces description vectors with better matching properties, as confirmed by our experimental evaluation (see Section 4). A description going beyond this per-frame representation was also achieved by the work of Lynen et al. (2014). Although their system allowed for the detected features to be matched against the whole database (regardless of the image they belonged to), their method was restricted to operate offline, after the conclusion of the full trajectory, while the sequence formulation was performed at query time. On the contrary, here the sequence distinction and matching is achieved online, as the trajectory escalates, quantizing the searching space through the means of the BoVW model.

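This non-additivity is easy to verify numerically. In the toy sketch below (assumed numbers: a 4-word vocabulary, two frames per sequence, each frame seeing a different half of the scene, cf. Figure 2), the merged "super-frame" vectors match perfectly while every individual frame pair scores much lower:

```python
import numpy as np

def l2_score(a, b):
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return 1.0 - 0.5 * np.linalg.norm(a - b)

# Each frame observes only a subset of the scene's visual words.
seq1 = [np.array([2., 1., 0., 0.]), np.array([0., 0., 1., 2.])]
seq2 = [np.array([2., 0., 1., 0.]), np.array([0., 1., 0., 2.])]

pairwise = [l2_score(i, j) for i in seq1 for j in seq2]
merged = l2_score(sum(seq1), sum(seq2))   # S-VWV-like merged description
print([round(s, 2) for s in pairwise])    # uneven per-frame scores
print(round(merged, 2))                   # 1.0: the full scenes are identical
```

The additive score $\sum F_I$ of the earlier techniques would agree with the merged score only if the similarity metric were a linear map, which the normalized L2-score is not.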

3. Proposed methodology

Our online LCD algorithm is divided into two main steps, while the vocabulary and the filter’s kernel coefficients are learned offline through a training scheme. In the first step of the proposed online pipeline, sequence matches are detected, while the individual image associations are extracted in the second step.


3.1. Vocabulary training

To quantize the feature descriptors’ space, a visual vocabulary needs to be created. Aiming to offer a real-time implementation, we choose to utilize the binary description of ORB. In an offline step, a generic set of training descriptors is provided as input to a k-median hierarchical clustering, with k-means++ seeding (Arthur and Vassilvitskii, 2007) and Hamming as the distance metric. In accordance with the conclusions drawn by Nister and Stewenius (2006) and Gálvez-López and Tardós (2012), we formulate a vocabulary tree with $L = 6$ levels and $K = 10$ branches per level, leading to a total set of $W = K^L$ discrete visual words $w_i$ ($i \in [1, W]$). Two different kinds of multiset need to be defined here, namely $\mathbb{N}_i^D$ and $\mathbb{N}^D$, corresponding to the $i$th word’s occurrences and the occurrences of the total visual words in the training dataset, respectively.

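A minimal sketch of this offline step, under stated simplifications (random seeding instead of k-means++, a fixed number of Lloyd-style iterations, and the bitwise majority vote as the Hamming median of binary descriptors):

```python
import numpy as np

def hamming(a, b):
    return int(np.unpackbits(a ^ b).sum())

def bitwise_median(descs):
    # The Hamming-median of binary descriptors is the bitwise majority vote.
    bits = np.unpackbits(descs, axis=1)
    return np.packbits(bits.mean(axis=0) >= 0.5)

def build_tree(descs, K=10, L=6, rng=np.random.default_rng(0)):
    """Recursive k-median clustering of (N, 32) uint8 ORB descriptors."""
    if L == 0 or len(descs) <= K:
        # Leaf = visual word; ids would be assigned by a final tree walk.
        return {"leaf": True, "centroid": bitwise_median(descs)}
    centers = descs[rng.choice(len(descs), K, replace=False)]
    for _ in range(10):                       # a few Lloyd-style iterations
        labels = np.array([min(range(K), key=lambda k: hamming(d, centers[k]))
                           for d in descs])
        centers = np.stack([bitwise_median(descs[labels == k])
                            if np.any(labels == k) else centers[k]
                            for k in range(K)])
    return {"leaf": False, "centroid": bitwise_median(descs),
            "children": [build_tree(descs[labels == k], K, L - 1, rng)
                         for k in range(K) if np.any(labels == k)]}
```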

3.2. Creating sequence and image descriptors

The main objective of our sequence distinction functionality does not refer to the actual semantics of the observed environment, but rather to the identification of groups of frames that share common features. To achieve a dynamic partition of the image stream, we utilize the variance of the obtained visual words. At a time instant $t$, during the sequence’s $S_t$ escalation, an occupancy vector $V_{S_t}^O$ is retained to keep track of the already-seen visual words. This binary vector, represented as $V_{S_t}^O = [w_1^o, \ldots, w_i^o, \ldots, w_W^o]$, shares the same length as the vocabulary, while each value $w_i^o$ declares the existence ($w_i^o = 1$) or absence ($w_i^o = 0$) of the corresponding word $w_i$ in the current sequence. As the robot moves, the $N_f$ most prominent ORB features are extracted using the oFAST algorithm (the orientation-invariant alteration of FAST (Rosten and Drummond, 2006), proposed by Rublee et al. (2011)) from every input image. The descriptors are then mapped onto visual words through the created vocabulary and marked NEW (seen for the first time during $S_t$) or OLD (already seen during $S_t$) by checking their indexes with vector $V_{S_t}^O$. Thus, using a “visual word variance” metric, defined as $\sigma_v = N_{NEW} / (N_{NEW} + N_{OLD})$, we signal the completion of the current $S_t$ and the beginning of a new $S_{t+1}$ each time $\sigma_v > r_v$, with $r_v$ being a visual word variance threshold, above which the input frame does not share enough visual words with the rest of the sequence. $N_{NEW}$ and $N_{OLD}$ denote the number of visual words marked NEW or OLD, respectively. Using this metric, a new sequence is instigated when the percentage of NEW visual words dominates the entire set of the input image’s features. Then the new vector $V_{S_{t+1}}^O$ is initialized to zero, and the same procedure is repeated for the next sequence. In the case of $\sigma_v \leq r_v$, the $V_{S_t}^O$ vector is updated with the visual words marked NEW and the following input image is characterized as a member of the current $S_t$. We additionally force a maximum and minimum visual words’ capacity for the sequences, preventing their uncontrolled growth and allowing the $V_{S_t}^O$ vectors to initialize at least some elements. Finally, images that do not contain a minimum number of visual words are rejected as less informative.


Notes:
1) The vector $V_{S_t}^O = [w_1^o, \ldots, w_i^o, \ldots, w_W^o]$ characterizes a sequence: $w_i^o = 1$ means the word is present, $w_i^o = 0$ means it is absent.
2) This binary vector shares the same length as the vocabulary (one entry per visual word, i.e. $W$ entries).
3) The "visual word variance" metric is defined as $\sigma_v = N_{NEW} / (N_{NEW} + N_{OLD})$.
4) How is the threshold $r_v$ chosen? (See Section 4.1.2.)
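The segmentation rule reduces to a few lines. The sketch below (a hypothetical helper; the min/max sequence capacities from the paper are reduced to a single minimum-words check) keeps a sparse set in place of the occupancy vector $V_{S_t}^O$:

```python
def segment_stream(frames_words, r_v=0.75, min_words=20):
    """frames_words: iterable of sets of visual-word indexes, one set per frame.
    Yields lists of frame indexes, one list per produced sequence."""
    seen = set()                            # sparse stand-in for V_O
    current = []
    for t, words in enumerate(frames_words):
        if len(words) < min_words:
            continue                        # less informative frame: rejected
        n_new = len(words - seen)
        sigma_v = n_new / len(words)        # N_NEW / (N_NEW + N_OLD)
        if current and sigma_v > r_v:       # frame shares too few words with S_t
            yield current                   # ... so S_t is complete
            current, seen = [], set()       # and S_{t+1} begins
        current.append(t)
        seen |= words                       # mark the NEW words as seen
    if current:
        yield current
```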

Having a completed sequence $S$ with $M$ image-members $I_m$ ($m \in [1, M]$), we now proceed to its description. The following multisets of visual words need to be defined. Multisets $\mathbb{N}_i^S$ and $\mathbb{N}_i^{I_m}$ are defined as the $i$th visual word’s occurrences in sequence $S$ and image $I_m$, respectively. Additionally, $\mathbb{N}^S$ and $\mathbb{N}^{I_m}$ are defined as the total visual word occurrences in $S$ and $I_m$, respectively. The aforementioned multisets are governed by

$$\mathbb{N}_i^S = \bigcup_{m=1}^{M} \mathbb{N}_i^{I_m} \qquad (1)$$

$$\mathbb{N}^S = \bigcup_{m=1}^{M} \mathbb{N}^{I_m} \qquad (2)$$

Notes: $\mathbb{N}_i^S$: occurrences of the $i$th word in sequence $S$; $\mathbb{N}_i^{I_m}$: occurrences of the $i$th word in image $I_m$; $\mathbb{N}^{I_m}$: total word occurrences in image $I_m$; $\mathbb{N}^S$: total word occurrences in sequence $S$.

The widely used “term frequency-inverse document frequency” (tf-idf) (Sivic and Zisserman, 2003) was selected as a means of defining each visual word’s participation and creating the following visual-word-vectors: (i) one S-VWV ($\bar{v}^{(S)}$) describing the whole observed visual content of the sequence’s respective area, and (ii) $M$ I-VWVs ($\bar{v}^{(I_m)}$) for the individual image-members. These descriptors

$$\bar{v}^{(S)} = \left[v_1^{(S)}, \ldots, v_i^{(S)}, \ldots, v_W^{(S)}\right]$$

and

$$\bar{v}^{(I_m)} = \left[v_1^{(I_m)}, \ldots, v_i^{(I_m)}, \ldots, v_W^{(I_m)}\right]$$

are calculated via

$$v_i^{(S)} = \frac{N_i^S}{N^S} \log \frac{N^D}{N_i^D} \qquad (3)$$

$$v_i^{(I_m)} = \frac{N_i^{I_m}}{N^{I_m}} \log \frac{N^D}{N_i^D} \qquad (4)$$

where $N_i^S = |\mathbb{N}_i^S|$, $N_i^{I_m} = |\mathbb{N}_i^{I_m}|$, $N^S = |\mathbb{N}^S|$, $N^{I_m} = |\mathbb{N}^{I_m}|$, $N_i^D = |\mathbb{N}_i^D|$, and $N^D = |\mathbb{N}^D|$, with the notation $|\mathbb{X}|$ representing the cardinality of multiset $\mathbb{X}$. Equation (4) refers to the description of the individual frames, while using equation (3) we are able to create a global description with better sequence-matching capabilities, as described in Section 2.2. Note that the additional computational burden for producing two versions of visual-word-vectors is negligible, since the most time-consuming part of the process (the tree traversal) is executed only once per visual word.

Note: the tree traversal is the most time-consuming part of the process.
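Equations (1) to (4) map directly onto code. A sketch with Python `Counter`s standing in for the multisets (here the multiset union of equation (1) is realized by summing per-frame occurrences, and $N_i^D$, $N^D$ are assumed precomputed from the training set, with an idf entry for every encountered word):

```python
from collections import Counter
import math

def tfidf(counts, idf):
    # counts: Counter word_id -> N_i ; returns a sparse visual-word-vector.
    total = sum(counts.values())                     # N
    return {i: (n / total) * idf[i] for i, n in counts.items()}

def describe_sequence(frames, n_i_D, n_D):
    """frames: list of Counters with each image-member's word occurrences."""
    idf = {i: math.log(n_D / n) for i, n in n_i_D.items()}
    ivwvs = [tfidf(f, idf) for f in frames]          # equation (4), one per I_m
    seq_counts = sum(frames, Counter())              # equations (1)-(2)
    svwv = tfidf(seq_counts, idf)                    # equation (3)
    return svwv, ivwvs
```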

Finally, to restrict the matching search only between S-VWVs that include mutual visual information, inverted indexing is applied (Jegou et al., 2008). A set of $W$ lists (one for every visual word $w_i$) is retained, keeping track of sequence indexes whose S-VWVs contain common visual words. Thus, sequence similarity scores are calculated through the inverted indexing list, achieving a reduction in the computational complexity.


Note: how is the inverted indexing applied in practice? (See the sketch below.)
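Regarding the note above: a sketch of the inverted indexing, assuming the sparse dict representation of the previous snippet. One list is kept per visual word, so a query S-VWV is only ever scored against sequences that share at least one of its words:

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.lists = defaultdict(list)      # word_id -> sequence indexes

    def add(self, seq_id, svwv):
        for word_id in svwv:                # svwv: sparse dict word_id -> weight
            self.lists[word_id].append(seq_id)

    def candidates(self, query_svwv):
        # Sequences with no mutual visual words never appear here, which is
        # why the similarity matrices of Section 3.3 stay sparse.
        hits = set()
        for word_id in query_svwv:
            hits.update(self.lists[word_id])
        return hits
```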

3.3. Sequence-to-sequence matching

To match the individual sequences, we make use of a similarity metric based on the L2-norm. More specifically, using an L2-score similarity between a query ($S_q$) and a database ($S_d$) sequence that the inverted indexes indicate,

$$L2\left(\bar{v}_q^{(S)}, \bar{v}_d^{(S)}\right) = 1 - 0.5\left|\frac{\bar{v}_q^{(S)}}{\left|\bar{v}_q^{(S)}\right|} - \frac{\bar{v}_d^{(S)}}{\left|\bar{v}_d^{(S)}\right|}\right| \qquad (5)$$

we obtain a metric that produces higher values as the vectors become more similar. As the trajectory escalates, the calculated L2-scores can be arranged to incrementally formulate a similarity matrix $\mathbf{M}_S$, like the one presented in Figure 3(a). This matrix is symmetric, with each element containing a corresponding normalized (Gálvez-López and Tardós, 2012) $L2(\bar{v}_i^{(S)}, \bar{v}_j^{(S)})$ measurement.

Notes:
1) The similarity metric is based on the L2-norm.
2) The similarity matrix $\mathbf{M}_S$ is symmetric; each element contains a normalized $L2(\bar{v}_i^{(S)}, \bar{v}_j^{(S)})$ measurement.
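Equation (5) in code form, for the sparse representation used in the sketches above (assuming, consistently with the L2-score name, that $|\cdot|$ denotes the L2 norm):

```python
import math

def l2_score(v1, v2):
    """Equation (5): 1 - 0.5 * || v1/|v1| - v2/|v2| ||; higher is more similar."""
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    keys = v1.keys() | v2.keys()
    d = sum((v1.get(k, 0.0) / n1 - v2.get(k, 0.0) / n2) ** 2 for k in keys)
    return 1.0 - 0.5 * math.sqrt(d)
```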

A naive approach to the detection of loop-closing sequences would be to apply an absolute thresholding over the values of matrix $\mathbf{M}_S$. With a view to further enhancing the cases of sequence matches with indexes that advance concurrently along time ($S_{i \pm k}$-to-$S_{j \pm k}$, $k = 0, 1, 2, \ldots$), we propose a novel temporal consistency filtering, the coefficients of which are trained in an offline step. Quantitatively interpreting the temporal constraint, we expect this filter to advance a sequence similarity score $L2(\bar{v}_i^{(S)}, \bar{v}_j^{(S)})$ proportionally with the values of $L2(\bar{v}_{i \pm k}^{(S)}, \bar{v}_{j \pm k}^{(S)})$. In the same fashion, the filter should penalize $L2(\bar{v}_i^{(S)}, \bar{v}_j^{(S)})$ proportionally with the scores of $L2(\bar{v}_{i + k_1}^{(S)}, \bar{v}_{j + k_2}^{(S)})$, with $k_1 \neq k_2$. In other words, the resulting similarity measurement between sequences $S_i$ and $S_j$ will tend to become higher as the respective submatrix ($\mathbf{m}_S$) of $\mathbf{M}_S$, centered around the $(i, j)$ entry, comes closer to a diagonal view (e.g. Figure 3(b)) and by analogy lower in cases of temporal inconsistency (e.g. Figure 3(c)).


Fig. 3. Impact of the proposed consistency filter on the sequence similarity matrix. The filtered similarity entries corresponding to loop-closure events are easily separable from the non-loop-closing ones. Note that $\mathbf{M}_S$ and $\mathbf{M}_S^F$ are fully formulated only for visualization purposes. During the online algorithm execution, the matrices are only partially computed, owing to the incorporated inverse indexing.


Notes:
1) A novel temporal consistency filtering whose coefficients are trained in an offline step.
2) What does penalizing with $L2(\bar{v}_{i+k_1}^{(S)}, \bar{v}_{j+k_2}^{(S)})$, $k_1 \neq k_2$, mean? (Those off-diagonal neighbors of the $(i, j)$ entry correspond to temporally inconsistent matches, so high scores there lower the filtered value.)

Considering an example of window size $w_F = 3$ (corresponding to $k = 1$), those two notions can be efficiently combined into a filter kernel with the structure

$$F = \begin{bmatrix} \alpha_0 & -\alpha_1 & -\alpha_2 \\ -\alpha_3 & \alpha_4 & -\alpha_5 \\ -\alpha_6 & -\alpha_7 & \alpha_8 \end{bmatrix} \qquad (6)$$

with $\alpha_i \geq 0$. The correlation operation of $F$ with the $\mathbf{M}_S$ matrix results in a more intelligible interpretation ($\mathbf{M}_S^F$), as shown in Figure 3(d). To avoid the manual selection of the $F$ coefficients and its size, an offline supervised training scheme based on cost-function minimization is formulated. Another way to consider our consistency filter is as a classifier that separates the loop-closing (class LC) similarity measurements from the non-closing ones (class N-LC). For each tested sequence pair $\langle S_i, S_j \rangle$, this classifier uses the corresponding $\mathbf{m}_S$ neighborhood (i.e. a window around $\mathbf{M}_S(i, j)$ of size $w_F \times w_F$) as a descriptor and decides whether it should fall into category LC or N-LC. Thus, we adopt a logistic-regression approach and we search for a first-order multivariate polynomial, with coefficients $\bar{\theta} = [\theta_0, \theta_1, \ldots, \theta_n]^{\mathrm{T}}$, for which $\bar{x} \cdot \bar{\theta} \geq 0$ indicates the detection of a sequence loop-closure event. Note that $\bar{x} = [1, \hat{x}_1, \ldots, \hat{x}_n]$ denotes the rearrangement of an $\mathbf{m}_S$ submatrix’s entries into a normalized feature vector format and $n = w_F^2$. The normalization $\hat{x}_i = x_i / \max(x_i)$, $x_i \in \mathbf{m}_S$, provides the required invariance over any similarity scale. Consequently, the values of $\theta_1$ to $\theta_n$ correspond to the filter’s coefficients (equation (6)), while $r_s = -\theta_0$ can be characterized as a threshold that should be applied over the $\mathbf{M}_S^F$ entries to identify the loop-closing sequences. The final cost-function minimization scheme is governed by

$$\bar{\theta} = \operatorname*{argmin}_{\bar{\theta}} J(\bar{\theta}) \qquad (7)$$

$$J(\bar{\theta}) = -\frac{1}{l_{tr}} \sum_{i=1}^{l_{tr}} \left[ y_{tr}^{(i)} \log h_{\theta}\left(\bar{x}_{tr}^{(i)}, \bar{\theta}\right) + \left(1 - y_{tr}^{(i)}\right) \log\left(1 - h_{\theta}\left(\bar{x}_{tr}^{(i)}, \bar{\theta}\right)\right) \right] \qquad (8)$$

$$h_{\theta}(\bar{x}, \bar{\theta}) = \frac{1}{1 + e^{-\bar{x} \cdot \bar{\theta}}} \qquad (9)$$

In equation (8), $l_{tr}$ denotes the size of the learning set, while $\bar{x}_{tr}^{(i)}$ and $y_{tr}^{(i)}$ denote the individual training feature vectors and their corresponding loop-closure ground-truth, respectively. Since two classes are used, we assign $y_{tr}^{(i)} = 1$ to the training LC elements and $y_{tr}^{(i)} = 0$ to the N-LC ones. This set of equations corresponds to a standard binary logistic-regression formulation. Looking for a hypothesis vector $\bar{\theta}$ with the characteristics explained before, the sigmoid function of equation (9) maps the range $\mathbb{R}$ of the $\bar{x} \cdot \bar{\theta}$ output into the interval $(0, 1)$. Then, the first summation term of equation (8) quantifies the cost of $\bar{x} \cdot \bar{\theta} < 0$ for the ground-truth LC training samples, while the second one quantifies the cost of $\bar{x} \cdot \bar{\theta} \geq 0$ for the N-LC ground-truth cases. Finally, the hypothesis vector $\bar{\theta}$ can be achieved by minimizing the total cost (equation (7)) through gradient descent, with the training samples being already normalized into the interval $[0, 1]$. The selection of logistic regression as a classification technique is justified owing to its high tolerance over unbalanced training samples. As King and Zeng (2001) and Crone and Finlay (2012) pointed out, this effect can be accounted for when the training and testing data contain approximately the same amount of LC and N-LC events; this will be further considered in Section 4.1.3.
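A compact sketch of the offline training of equations (7) to (9) via batch gradient descent. Rows of `X_tr` are assumed to be the flattened, max-normalized $\mathbf{m}_S$ neighborhoods prepended with the bias term 1, and `y_tr` the LC/N-LC ground truth:

```python
import numpy as np

def sigmoid(z):                                  # equation (9)
    return 1.0 / (1.0 + np.exp(-z))

def train_filter(X_tr, y_tr, w_F, lr=0.1, iters=5000):
    """X_tr: (l_tr, 1 + w_F**2) features; y_tr: (l_tr,) labels in {0, 1}."""
    theta = np.zeros(X_tr.shape[1])
    for _ in range(iters):
        h = sigmoid(X_tr @ theta)
        grad = X_tr.T @ (h - y_tr) / len(y_tr)   # gradient of J, equation (8)
        theta -= lr * grad
    r_s = -theta[0]                              # sequence-matching threshold
    F = theta[1:].reshape(w_F, w_F)              # consistency kernel, equation (6)
    return F, r_s
```

At query time, correlating $F$ with an $\mathbf{m}_S$ neighborhood and comparing against $r_s$ is then equivalent to checking $\bar{x} \cdot \bar{\theta} \geq 0$.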

Moreover, to select the window size $w_F$, we formulate a cross-validation step using another stand-alone set of feature vectors, $\bar{x}_{cv}^{(i)}$. We assess multiple filter size scenarios ($w_F = 2, 3, 4, 5, 6, 7$) corresponding to multiple feature vector lengths $n = w_F^2$ and we create a $\bar{\theta}_h$ hypothesis for each one of them ($h \in [0, 5]$). Next, the cross-validation error for every $\bar{\theta}_h$ is evaluated using

$$J_{cv}\left(\bar{\theta}_h\right) = \frac{1}{2 l_{cv}} \sum_{i=1}^{l_{cv}} \left( h_{\theta}\left(\bar{x}_{cv}^{(i)}, \bar{\theta}_h\right) - y_{cv}^{(i)} \right)^2 \qquad (10)$$

while the hypothesis producing the lowest $J_{cv}(\bar{\theta}_h)$ is to be adopted for the final filter’s kernel. Similar attempts to influence the values of the $\mathbf{M}_S$ matrix can be found in other techniques as well (Milford and Wyeth, 2012; Newman et al., 2006), yet in our case, the filtering is interpreted as a classification approach. It should also be noted that, during the online execution of our algorithm, the only $\mathbf{m}_S$ sub-matrices that we need to formulate and filter are those indicated by the inverted indexing lists. Thus, the $\mathbf{M}_S$ and $\mathbf{M}_S^F$ matrices retain a sparse structure.

Filtered matching scores overpassing $r_s = -\theta_0$ (or, equivalently, matching scores with $\bar{x} \cdot \bar{\theta}_h \geq 0$) are considered to contain loop-closure frame candidates, and the next step of our method refers to their individual image-member associations. The sequence distinction technique described in Section 3.2 does not ensure that the produced trajectory intervals will be aligned between multiple traversals of the same area. Thus, some image-members of the query $S_q$ may actually need to be matched with the members of different neighboring database sequences. For this reason, we allow each $S_q$ to be associated with multiple $S_d$, as long as they are subsequent.

Notes:
1) The filter's purpose is roughly to boost the sequence-to-sequence similarity scores of temporally consistent matches, although I do not fully follow its details.
2) Another way to look at it: the consistency filter is effectively a classifier separating loop-closing (class LC) similarity measurements from non-loop-closing (class N-LC) ones; it is a logistic regression trained offline in a supervised manner (i.e. the N-LC similarity measurements are filtered out).

3.4. Image-to-image matching

To provide a typical LCD technique, our method should provide image-to-image pairs as a final output. Although we find our sequence matches sufficient for detecting revisited regions of the trajectory, it is possible for some camera poses to be associated without necessarily observing the same content. One can consider the example of two trajectory tracks for which, even though they remain parallel and spatially close to each other for the majority of their length, their respective courses slowly deviate until they observe significantly different views. The corresponding two sequences assigned to those tracks ($S_{m_1}$ and $S_{m_2}$) are naturally going to be matched, despite the fact that their last camera measurements might not correspond to loop-closure events. In such a scenario, during the slow deviation of the trajectories, the visual content from both sequences does not drastically change, preventing the activation of our visual word variance constraint and the further segmentation of the sequences. In those cases, a simple “one-to-one” pairing would fail, since the last image-members of $S_{m_1}$ and $S_{m_2}$ should not be considered as loop-closures. To address these cases, individual I-VWVs need to be considered. For a highly accurate SLAM system, it would be sufficient to detect a single pair of loop-closing camera poses per sequence match using the highest L2-scoring I-VWV pair. Yet, as a general rule (assuming an odometry with low accuracy), we need to seek as many detections as possible. More specifically, for every image-member of $S_{m_1}$ we seek in its paired sequence $S_{m_2}$ (or paired sequences, if more than one association was produced by the previous step) the image that produces the maximum I-VWV L2-score. Subsequently, in order to reject image pairs that cannot be visually associated, a loop-closure event is identified if the measured similarity is greater than a threshold $r_i$. A common practice for many LCD systems (e.g. Bampis et al., 2016; Gálvez-López and Tardós, 2012; Lynen et al., 2014; Mur-Artal et al., 2015) is to apply a final geometrical-verification test to accept a loop-closing pair of images. Since such tests are based on the computationally expensive estimation of a valid camera transformation matrix, the $r_i$ threshold must be decided so as to reduce the geometrical-verification steps to the minimum required by the SLAM and pose-graph optimization technique (Latif et al., 2013).

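A sketch of this final association step (reusing the `l2_score` helper from Section 3.3; `paired_db_seqs` holds the one or more subsequent database sequences matched to the query in the previous step):

```python
def image_associations(query_seq, paired_db_seqs, r_i):
    """One candidate loop-closing pair per query image-member, gated by r_i."""
    pairs = []
    db_images = [d for s in paired_db_seqs for d in s.images]
    for q in query_seq.images:
        best = max(db_images, key=lambda d: l2_score(q.ivwv, d.ivwv))
        if l2_score(q.ivwv, best.ivwv) > r_i:
            pairs.append((q, best))   # still subject to geometrical verification
    return pairs
```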

4. Results

In this section, we evaluate the individual components of our system and compare the achieved overall performance against other state-of-the-art methods. To measure the accuracy of an implementation we utilize precision–recall metrics. As a reminder, “precision” is defined as the ratio between accurately detected loop-closing frames (true-positive) and the total number of detections returned by the method (true-positive plus false-positive). Additionally, “recall” is defined as the number of true-positive detections found, over the total number of loop-closing frames that exist in the used dataset (true-positive plus false-negative). For our experiments, we consider a sequence match as true-positive if at least one loop-closing camera pose is contained. Nine different datasets (indoors and outdoors) were used for our experiments, namely Bovisa 2008-09-01 (BV) (Rawseeds, 2007), Bicocca 2009-02-25b (BC) (Rawseeds, 2007), New College (NC) (Smith et al., 2009), Lip6 Indoor (L6I) (Angeli et al., 2008), Lip6 Outdoor (L6O) (Angeli et al., 2008), Malaga 2009 Parking 6L (MG6L) (Blanco et al., 2009), City Center (CC) (Cummins and Newman, 2008), KITTI sequence 00 (KITTI00) (Geiger et al., 2013), and KITTI sequence 05 (KITTI05) (Geiger et al., 2013). Regarding the KITTI dataset, we considered only sequences 00 and 05, since, among the rest, they provide the most meaningful loop-closure events in urban and long-term operational conditions. Table 1 contains a brief description of every case. Datasets BC through L6O were used as training and cross-validation sets for our method’s parameters, while the remaining datasets (MG6L through KITTI05) were treated as testing cases, measuring the performance of our final system. In such a way, the achieved detection accuracy is not directly influenced by the algorithm’s optimization, thus offering a fair evaluation. Note that the loop-closure ground-truth information for the cases of BC, NC, MG6L, and CC was manually created within the work of Gálvez-López and Tardós (2012). The L6I and L6O datasets contain their own ground-truth information, as provided by Angeli et al. (2008), while for the KITTI sequences, this information was obtained through the corresponding odometry data.

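For reference, the two metrics in code form (TP, FP, FN counted over loop-closure detections):

```python
def precision_recall(tp, fp, fn):
    # A sequence match counts as true-positive here if it contains at
    # least one actual loop-closing camera pose (see above).
    precision = tp / (tp + fp)    # correct detections / returned detections
    recall = tp / (tp + fn)       # correct detections / existing loop closures
    return precision, recall
```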

Table 1. Properties of the datasets used.

4.1. Offline training and performance evaluation

4.1.1. Vocabulary training

Using a vocabulary training set corresponding to a specific environment with limited visual variations inevitably biases the system’s performance to the respective operational conditions. Within the scope of this work, we aim to create a vocabulary that is able to perform in a variety of indoor and outdoor conditions. In accordance with these terms, the BV dataset was selected as a stand-alone training sample in order to offer an objective evaluation. Using 10k frames, a set of 9M ORB descriptors was extracted and used as an input to our hierarchical clustering. Thus, a binary vocabulary tree was produced retaining a total of $10^6$ discrete visual words as leaf nodes.

Note: the vocabulary is trained offline on the BV dataset and contains $10^6$ visual words.

4.1.2. Trajectory segmentation

As described in Section 3.2, our algorithm dynamically separates the input image stream into sequences based on the observed visual words’ variance. Considering the system’s overall performance as a final objective, a validation test based on precision-recall metrics was formulated to measure the effect of different $r_v$ values. Multiple values were selected and assessed on the four training datasets. In this step, the production of precision-recall measurements is not straightforward, since our system does not create any loop-closure output during the sequences’ partition. To this end, we temporarily fixed the kernel of our consistency filter to have all its $\alpha_i$ elements equal to zero except for $\alpha_4 = 1$, canceling its effect during the detection and promoting $r_s$ as a means of alternating the precision-recall measurements. Figure 4(a) to 4(d) shows the most informative curves we obtained by considering sequence matches for every training dataset, respectively. The curves shown in Figure 4(e) were created accordingly by treating all the datasets as a unified environment (as if the same robot traveled through each dataset, one after another). Note that, for a range of $r_v$ values between 0.6 and 0.8, the achieved performance remains relatively stable. Considering the BC dataset, it appears that the best performance is achieved using a visual word variance threshold of $r_v = 0.6$ while, for the case of NC, the most beneficial case was $r_v = 0.75$. This is because the BC dataset corresponds to an indoor environment; therefore, visual changes tend to be more severe than in NC. Given a specified application scenario (indoors or outdoors, dynamic or non-dynamic, frontal or lateral camera view, etc.), the most appropriate $r_v$ value can be selected accordingly. However, in this paper we aim for a generic setup and thus the value of $r_v = 0.75$ was adopted. The resulting sequences for the most representative regions of BC and NC (containing turning points and sight changes) are shown in Figure 5. Note that, for the L6I and L6O datasets, no odometry ground-truth is provided by Angeli et al. (2008).

Summary: the input stream is segmented using the visual-word variance threshold $r_v$; to score the segmentation via precision-recall, the consistency filter is neutralized ($\alpha_4 = 1$, all other $\alpha_i = 0$) and $r_s$ is varied instead. Performance is stable for $r_v \in [0.6, 0.8]$; $r_v = 0.6$ suits the indoor BC dataset, $r_v = 0.75$ suits NC, and $r_v = 0.75$ is adopted as the generic setting.
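
The exact definition of the visual word variance $\sigma_v$ lives in Section 3.2 and is not reproduced here, so the sketch below substitutes an illustrative test in the same spirit: a new sequence is opened when the incoming frame shares less than a fraction $r_v$ of its visual words with the running sequence, which matches the behavior described in the Figure 5 caption. The second helper shows the aggregation of word occurrences into a sequence-visual-word-vector (S-VWV); all names are hypothetical.

```python
from collections import Counter

def segment_stream(frame_words, r_v=0.75):
    """Dynamically split a stream of frames into sequences.

    frame_words: iterable of sets, each holding the visual-word ids of
    one input frame. The splitting rule used here (too few shared words
    with the running sequence) is an illustrative stand-in for the
    paper's visual-word variance criterion of Section 3.2.
    """
    sequences, current, seq_words = [], [], set()
    for words in frame_words:
        if current and len(words & seq_words) < r_v * len(words):
            sequences.append(current)        # close the running sequence
            current, seq_words = [], set()
        current.append(words)
        seq_words |= words
    if current:
        sequences.append(current)
    return sequences

def svwv(sequence):
    """Combine per-frame word occurrences into a sequence-visual-word-
    vector (S-VWV): word id mapped to its occurrence count."""
    counts = Counter()
    for words in sequence:
        counts.update(words)
    return counts
```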

[figure not shown]
Fig. 4. Precision-recall curves for different $r_v$ values tested on the various training datasets. All instances from datasets (a) through (d) are considered to comprise a unified dataset (e). The best performance is obtained within the range $[0.6, 0.8]$; considering the unified dataset, $r_v = 0.75$ corresponds to the highest achieved recall rate.

[figure not shown]
Fig. 5. The resulting sequences for $r_v = 0.75$, marked in different colors. The trajectory is segmented whenever the input frame no longer shares a sufficient number of visual words with the rest of the sequence.

Notes:
1) The trajectory is partitioned into sequences according to the visual word variance metric; concretely, this subsection determines the best value of $r_v$.
2) The precision-recall curves of Figure 4 are obtained by varying $r_s$; how exactly is $r_s$ defined?
3) Vocabulary: "formulate" means to devise or draw up.

The proposed visual word variance metric is not the only kind of measurement considered for our methodology. Other approaches, capable of running in real time and online (while the robot is moving) without requiring access to the whole database beforehand, were also examined. Table 2 presents some of the evaluated techniques together with their respective best-case recall rates (for 100% precision) and average execution times, tested on the aforementioned unified training dataset. The "progressive L2-score" method marks the completion of a sequence whenever the L2-score between the current and the previously acquired input frame becomes smaller than a certain level, while the "visual word variance" method refers to the proposed approach evaluated above. The "windowed progressive L2-score" and "windowed visual word variance" methods apply an additional averaging sliding window over the two aforementioned metrics. In these methods, the mean values of the L2-score and $\sigma_v$, respectively, were calculated between the current and the last $p$ input frames. Thus, the most recent sequence is finalized whenever the average L2-score becomes less than a certain level ("windowed progressive L2-score"), or when the average $\sigma_v$ becomes greater than one ("windowed visual word variance"). These approaches were selected with the aim of preventing unnecessary partition of the input stream when an instantaneous change of the view occurs (e.g. momentarily looking sideways) while the robot is still located in the same scene. Interestingly, the two methods did not provide any considerable advantage to the system's performance. This is due to the proposed sequence matching scheme, in which, even if a continuous trajectory region is segmented without semantic meaning, the resulting additional sequences can still all be matched to a potentially loop-closing non-segmented database entry. The only disadvantage of such a division is the additional processing induced by the unnecessary segmentation of the search space. Yet, as can be seen in Table 2, the continuous calculation of the extra L2-score or $\sigma_v$ measurements renders the window-based approaches unfavorable compared with their straightforward counterparts. Finally, the "image islands" method is inspired by the techniques described by Gálvez-López and Tardós (2012) and Milford and Wyeth (2012). The procedure starts with a calculation of the L2-scores between the I-VWVs. Then, close-in-time sets of images that present considerable similarity scores with the database are grouped to create the required sequences; the remaining frames are not employed. The rest of the proposed steps (S-VWV formulation, comparisons, etc.) remain the same. Although this approach presents higher recall rates at 100% precision, it is the most costly in terms of execution time. In the case of a powerful processing architecture, the "image islands" technique would be the most beneficial approach for distinguishing the required sequences. Yet, within the scope of this paper, time efficiency is crucial, and thus the proposed visual word variance metric is adopted, since it achieves a good trade-off between performance and operational frequency.

Table 2. The evaluated sequence partitioning (segmentation) methods.
[table image not shown]

Notes:
1. Besides the visual word variance metric, other methods for segmenting the trajectory into sequences were evaluated; see Table 2.
2. Trading off performance against execution time, the visual word variance metric is the one adopted.
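
To make the "windowed" variants of Table 2 concrete, here is a minimal sketch of the sliding-window smoothing they apply, assuming a window over the last $p$ frames; the function name and the direction of the threshold test are illustrative, not taken from the paper.

```python
from collections import deque

def windowed_trigger(metric_stream, p=5, level=1.0):
    """Smooth a per-frame segmentation metric over the last p frames.

    Yields True whenever the window mean exceeds `level`, i.e. the point
    where a 'windowed visual word variance' style test would close the
    current sequence (the average sigma_v growing above the level). For a
    'windowed progressive L2-score' style test the comparison would be
    reversed: the mean score dropping below the level.
    """
    window = deque(maxlen=p)
    for m in metric_stream:
        window.append(m)
        yield len(window) == p and sum(window) / p > level

# Hypothetical per-frame sigma_v values; the sequence closes at frame 3.
flags = list(windowed_trigger([0.2, 0.4, 1.3, 1.5, 1.6, 1.8], p=3))
print(flags)  # [False, False, False, True, True, True]
```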
