NetVLAD: CNN architecture for weakly supervised place recognition

Latest recommended article published on 2023-04-22 19:35:52

64318@461


Categories: slam, feature extraction, perception. Tags: autonomous driving

Article link: https://blog.csdn.net/weixin_56836871/article/details/122159110


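This article presents NetVLAD, a CNN-based architecture for improving weakly supervised place recognition. Borrowing from and improving on the traditional VLAD representation, it designs a network that aggregates intermediate-layer features into a compact single-vector representation. It also introduces a weakly supervised ranking loss to address the shortage of training data. For the place recognition task, GPS information is used as the source of potential positive samples and definite negative samples, and the loss function is designed to raise the ranking of images that are geographically close.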


Background:

Vector of Locally Aggregated Descriptors (VLAD) for image retrieval.
【CC】VLAD is a widely used image descriptor; this paper builds its improvement on top of it. What it actually is gets explained below.

Weakly supervised ranking loss
【CC】The paper's other contribution is the design of a weakly supervised loss, covered later.

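Place recognition as an instance retrieval task: the query image location is estimated using the locations of the most visually similar images obtained by querying a large geotagged database. Each image is represented using local invariant features such as SIFT. The locations of the top-ranked images are used as suggestions for the location of the query.
【CC】Place recognition can be treated as an instance retrieval task: given a query image, its location is estimated from the most similar images in a stored image database, and that database is geotagged. Storing the raw images is clearly not an option; in general, features are first extracted with a local invariant feature extractor such as SIFT.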
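To make the retrieval view concrete, here is a minimal sketch in plain Python. It assumes each image has already been reduced to a fixed-size descriptor; the toy 3-D descriptors, the geotags, and the helper names `l2_distance` and `suggest_locations` are all hypothetical and only illustrate the ranking step, not the paper's actual pipeline.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def suggest_locations(query_desc, database, top_k=2):
    """Return the geotags of the top_k database images closest to the query.

    database is a list of (descriptor, (lat, lon)) pairs (hypothetical data).
    """
    ranked = sorted(database, key=lambda item: l2_distance(query_desc, item[0]))
    return [geotag for _, geotag in ranked[:top_k]]

# Toy database: 3-D descriptors standing in for real image representations.
database = [
    ([1.0, 0.0, 0.0], (48.8584, 2.2945)),
    ([0.0, 1.0, 0.0], (51.5007, -0.1246)),
    ([0.9, 0.1, 0.0], (48.8583, 2.2944)),
]
query = [0.92, 0.08, 0.0]

# Geotags of the two most similar images, used as location suggestions.
suggestions = suggest_locations(query, database, top_k=2)
print(suggestions)
```

A linear scan like this is only workable for a toy database; the compression and indexing of descriptors is what makes the same idea scale.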

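The representation is compressed and efficiently indexed. An image database augmented by 3D structure enables recovery of an accurate camera pose.
【CC】The extracted features must of course be compressed and must support efficient indexing (ideally with some ranking algorithm). The image database is augmented with 3D structure, so the camera's 3D pose can be recovered from the features in the database.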

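What is the appropriate representation of a place that is rich enough to distinguish similarly looking places?
【CC】At its core, place recognition is the question of how to design an operator or neural network that represents a place with enough information to measure similarity.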

Approach:

First, what is a good CNN architecture for place recognition?
【CC】Design a neural network that extracts features for place recognition.
Inspired by the Vector of Locally Aggregated Descriptors (VLAD) representation, we develop a convolutional neural network architecture that aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact single-vector representation. The resulting aggregated representation is then compressed using PCA.
【CC】The overall idea is an improvement on VLAD: start from a classic convolutional network (VGG16 and AlexNet are the two adapted later), aggregate its conv5 features with a VLAD-style layer, and then compress the result with PCA.
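The aggregation step can be sketched as follows. This is a plain-Python illustration of VLAD-style pooling with soft assignment, not the trained network layer: the descriptors stand in for conv5 feature columns, the cluster centers are made up, and the sharpness parameter `alpha` and the function name `soft_vlad` are assumptions for the example. In a real model the assignment weights and centers would be learned parameters inside a deep learning framework.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v, eps=1e-12):
    """Scale v to unit L2 norm (zero vectors are left as zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def soft_vlad(descriptors, centers, alpha=10.0):
    """Aggregate N local D-dim descriptors into one (K*D)-dim vector.

    Each descriptor contributes its residual to every cluster center,
    weighted by softmax(-alpha * squared distance to that center).
    """
    K, D = len(centers), len(centers[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        sq = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        weights = softmax([-alpha * d for d in sq])
        for k, c in enumerate(centers):
            for j in range(D):
                vlad[k][j] += weights[k] * (x[j] - c[j])
    # Normalize each cluster's residual sum, flatten, then normalize the whole vector.
    flat = [value for row in vlad for value in l2_normalize(row)]
    return l2_normalize(flat)

# Toy input: three 2-D local descriptors and two cluster centers (hypothetical values).
descriptors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.2]]
centers = [[0.0, 0.0], [1.0, 1.0]]
vlad_vector = soft_vlad(descriptors, centers)
print(len(vlad_vector))  # K * D entries
```

The output length is fixed at K * D regardless of how many local descriptors the image yields, which is what makes the result usable as a single image representation before any PCA compression.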

Second, how to gather a sufficient amount of annotated data?
【CC】How to obtain enough labeled data.
We know the two panoramas are captured at approximately similar positions based on their (noisy) GPS, but we don't know which parts of the panoramas depict the same parts of the scene.
【CC】The panorama data is noisy: we only know that two images were taken at nearby positions, not which parts of them show the same scene content.
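One way to turn noisy GPS tags into weak labels is to split database images by distance from the query: images within a small radius become potential positives, images beyond a larger radius become definite negatives, and anything in between is left unlabeled. The sketch below assumes that scheme; the radii of 10 m and 25 m, the coordinates, and the function names are illustrative assumptions, not values taken from the text above.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def split_by_gps(query_gps, database_gps, pos_radius_m=10.0, neg_radius_m=25.0):
    """Return (potential_positive_indices, definite_negative_indices).

    Images between the two radii are ambiguous and are assigned to neither set.
    """
    potential_pos, definite_neg = [], []
    for idx, gps in enumerate(database_gps):
        d = haversine_m(query_gps, gps)
        if d <= pos_radius_m:
            potential_pos.append(idx)
        elif d > neg_radius_m:
            definite_neg.append(idx)
    return potential_pos, definite_neg

# Hypothetical geotags: roughly 3 m, 17 m and 180 m away from the query.
query_gps = (48.85840, 2.29450)
database_gps = [(48.85842, 2.29452), (48.85855, 2.29450), (48.86000, 2.29450)]
potential_pos, definite_neg = split_by_gps(query_gps, database_gps)
print(potential_pos, definite_neg)
```

The positives are only "potential" because a nearby image may face a different direction and show none of the same scene, which is exactly the ambiguity the quoted passage describes.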

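Third, how can we train the developed architecture in a way tailored for the place recognition task?
【CC】As the design of the ranking loss later shows, training only needs a simple classification of the images (into potential positives and definite negatives).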
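A weakly supervised ranking loss of this kind can be sketched as a triplet-style hinge: among the potential positives, take the one closest to the query in descriptor space, and penalize every definite negative that is not at least a margin farther away. The code below is a minimal plain-Python illustration under that assumption; the margin value, the toy 2-D descriptors, and the names `sq_dist` and `weak_ranking_loss` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weak_ranking_loss(query, potential_pos, definite_neg, margin=0.1):
    """Hinge ranking loss using the best-matching potential positive.

    Taking the minimum over potential positives lets training pick the one
    positive that actually depicts the same scene, since GPS alone cannot.
    """
    best_pos = min(sq_dist(query, p) for p in potential_pos)
    return sum(max(0.0, best_pos + margin - sq_dist(query, n)) for n in definite_neg)

# Toy 2-D descriptors (hypothetical values).
query = [0.0, 0.0]
potential_pos = [[1.0, 0.0], [0.1, 0.0]]   # second one is the true match
definite_neg = [[0.2, 0.0], [2.0, 0.0]]    # first one violates the margin
loss = weak_ranking_loss(query, potential_pos, definite_neg)
print(loss)
```

Only the first negative contributes here, because it lies within the margin of the best positive; the far negative already satisfies the ranking constraint and adds nothing to the loss.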

a function f as the “image representation extractor”, given an image Ii it produces a fixed size vector f(Ii). the representations for th