In this work, we propose VLocNet, a new convolutional neural network architecture for 6-DoF global pose regression and odometry estimation from consecutive monocular images.
Our multitask model incorporates hard parameter sharing, thus being compact and enabling real-time inference, in addition to being end-to-end trainable.
We propose a novel loss function that utilizes auxiliary learning to leverage relative pose information during training, thereby constraining the search space to obtain consistent pose estimates.
Even our single-task model exceeds the performance of state-of-the-art deep architectures for global localization, while achieving competitive performance for visual odometry estimation.
Furthermore, we present extensive experimental evaluations utilizing our proposed Geometric Consistency Loss that show the effectiveness of multitask learning and demonstrate that our model is the first deep learning technique to be on par with, and in some cases outperform, state-of-the-art SIFT-based approaches.
From a robot’s learning perspective, it is uneconomical and unscalable to maintain multiple specialized single-task models, as they inhibit both inter-task and auxiliary learning. This has led to a recent surge in research on frameworks for learning unified models for a range of tasks across different domains.
An evident advantage is the resulting compact model size in comparison to having multiple task-specific models. Auxiliary learning approaches, on the other hand, aim at improving the prediction accuracy of a primary task by supervising the model to additionally learn a secondary task.
For instance, in the context of localization, humans often describe their location to each other with respect to some reference landmark in the scene, giving their position relative to it. Here, the primary task is to localize and the auxiliary task is to identify landmarks.
Similarly, we can leverage the complementary relative motion information from odometry to constrain the search space while training the global localization model.
In this work, we address the problem of global pose regression by simultaneously learning to estimate visual odometry as an auxiliary task. We propose the VLocNet architecture, consisting of a global pose regression sub-network and a Siamese-type relative pose estimation sub-network. Our network, based on the residual learning framework, takes two consecutive monocular images as input and jointly regresses the 6-DoF global pose as well as the 6-DoF relative pose between the images. We incorporate a hard parameter sharing scheme to learn inter-task correlations within the network and present a multitask alternating optimization strategy for learning shared features across the network. Furthermore, we devise a new loss function for global pose regression that incorporates the relative motion information during training and enforces the predicted poses to be geometrically consistent with respect to the true motion model.
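To make the hard parameter sharing and alternating optimization concrete, here is a minimal PyTorch-style sketch: both task heads reuse a shared trunk, and each training step alternates one gradient update per objective. The tiny trunk, head dimensions, and L1 pose loss are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real ResNet-50 streams; `shared` represents the
# early residual blocks reused by both tasks (hard parameter sharing).
shared = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, stride=4), nn.ELU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
global_head = nn.Linear(16, 7)   # regresses p_t = [x_t, q_t]
odom_head = nn.Linear(32, 7)     # regresses p_{t,t-1} from both frames

def global_forward(img_t):
    return global_head(shared(img_t))

def odom_forward(img_t, img_t1):
    # Siamese streams: the same shared weights encode both frames.
    return odom_head(torch.cat([shared(img_t), shared(img_t1)], dim=1))

def pose_loss(pred, target):
    # Placeholder L1 pose loss; the paper weights the translational and
    # rotational components separately.
    return (pred - target).abs().mean()

opt_g = torch.optim.Adam(
    list(shared.parameters()) + list(global_head.parameters()), lr=1e-4)
opt_o = torch.optim.Adam(
    list(shared.parameters()) + list(odom_head.parameters()), lr=1e-4)

# One alternating-optimization step on dummy data: each objective takes
# its own gradient step, and both update the shared blocks.
img_t, img_t1 = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
p_t_gt, p_rel_gt = torch.randn(2, 7), torch.randn(2, 7)

loss_g = pose_loss(global_forward(img_t), p_t_gt)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

loss_o = pose_loss(odom_forward(img_t, img_t1), p_rel_gt)
opt_o.zero_grad(); loss_o.backward(); opt_o.step()
```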
We present extensive experimental evaluations on both indoor and outdoor datasets comparing the proposed method to state-of-the-art approaches for global pose regression and visual odometry estimation. We empirically show that our proposed VLocNet architecture achieves state-of-the-art performance compared to existing CNN-based techniques. To the best of our knowledge, our presented approach is the first deep learning-based localization method to perform on par with local feature-based techniques. Moreover, our work is the first to show that a joint multitask model can outperform its task-specific counterparts in both accuracy and efficiency for global pose regression and visual odometry estimation.
The primary objective of the proposed framework is to accurately estimate the global pose by minimizing the proposed Geometric Consistency Loss function, while exploiting the relative motion between two consecutive frames to constrain the search space of global localization. We formulate this as an auxiliary learning problem, with estimating the global pose as the primary objective and estimating the relative motion as the secondary objective. The features learned through relative motion estimation are then used by the global pose regression part to learn descriptors that are more discriminative across different scenes.
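As a rough sketch of what such a Geometric Consistency Loss can look like, the function below combines an absolute-pose term with a term penalizing disagreement between the relative motion implied by consecutive predictions and the ground-truth relative motion. The fixed `beta`/`gamma` weights and the translation-only consistency term are simplifications of the paper's full formulation.

```python
import torch

def geometric_consistency_loss(x_t, q_t, x_gt, q_gt, x_prev, x_rel_gt,
                               beta=1.0, gamma=1.0):
    # Absolute-pose term: Euclidean error on translation and rotation.
    loss_abs = (x_t - x_gt).norm(dim=-1).mean() \
        + beta * (q_t - q_gt).norm(dim=-1).mean()

    # Consistency term: the relative motion implied by the predictions at
    # t and t-1 should agree with the ground-truth relative motion.  Shown
    # for translation only; expressing it in the frame of the previous
    # pose would additionally require rotating by the quaternion at t-1.
    loss_rel = gamma * ((x_t - x_prev) - x_rel_gt).norm(dim=-1).mean()

    return loss_abs + loss_rel

# Dummy usage with a batch of 2 predictions.
B = 2
loss = geometric_consistency_loss(
    torch.randn(B, 3), torch.randn(B, 4),   # predicted x_t, q_t
    torch.randn(B, 3), torch.randn(B, 4),   # ground-truth x_t, q_t
    torch.randn(B, 3), torch.randn(B, 3))   # predicted x_{t-1}, gt relative
```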
The proposed framework consists of a three-stream neural network: one stream performs global pose regression, and the other two form the Siamese-type odometry estimation network. The overall architecture is shown in Fig. 1. Given a pair of consecutive images $(I_t, I_{t-1})$, the network predicts the global poses of both images, $p_t = [x_t, q_t]$ and $p_{t-1} = [x_{t-1}, q_{t-1}]$, as well as the relative pose $p_{t,t-1} = [x_{t,t-1}, q_{t,t-1}]$. The global pose regression stream takes the image $I_t$ as input, while the Siamese odometry streams take the consecutive pair $(I_t, I_{t-1})$.
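To fix notation, the predictions can be summarized as follows, assuming (as in PoseNet-style formulations) that each pose consists of a translation vector and a unit quaternion:

$$
(\hat{p}_t,\ \hat{p}_{t-1},\ \hat{p}_{t,t-1}) = f_\theta(I_t, I_{t-1}), \qquad p = [x, q],\quad x \in \mathbb{R}^3,\ q \in \mathbb{R}^4,\ \lVert q \rVert = 1 .
$$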
A. Global Pose Regression
The global localization sub-network takes as input the image $I_t$ and the previously predicted pose $\hat{p}_{t-1}$, and outputs the new prediction of the current pose $\hat{p}_t$.
1) Sub-network architecture: The sub-network is built upon ResNet-50 and is identical to it up to the last average pooling layer, comprising five residual blocks with multiple residual units, where each unit has a bottleneck structure of three convolutional layers, each followed by batch normalization, a scale layer, and Rectified Linear Units (ReLUs). We modify the residual units by replacing the ReLUs with Exponential Linear Units (ELUs), which reduce the bias shift in the neurons, avoid the vanishing gradient problem, and lead to faster convergence. We replace the last average pooling layer with global average pooling and append three inner-product layers $fc_1$, $fc_2$, and $fc_3$ after it.
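A condensed PyTorch sketch of this sub-network is given below: a ResNet-50 trunk up to global average pooling with every ReLU swapped for an ELU, followed by three inner-product layers. The head dimensions and the split into translation/rotation outputs are assumptions in the PoseNet style, and the previous-pose input $\hat{p}_{t-1}$ is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalPoseRegressor(nn.Module):
    """Sketch of the global pose sub-network described above."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Drop the classifier; keep the conv stem, the five residual
        # stages, and the (now global) average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self._relu_to_elu(self.features)
        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, 3)   # assumed: translation x
        self.fc3 = nn.Linear(1024, 4)   # assumed: rotation q (quaternion)

    def _relu_to_elu(self, module):
        # Recursively replace ReLUs with ELUs, as described above.
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU):
                setattr(module, name, nn.ELU(inplace=True))
            else:
                self._relu_to_elu(child)

    def forward(self, img):
        feat = torch.flatten(self.features(img), 1)
        feat = nn.functional.elu(self.fc1(feat))
        return self.fc2(feat), self.fc3(feat)

model = GlobalPoseRegressor()
x, q = model(torch.randn(1, 3, 224, 224))  # x: (1, 3), q: (1, 4)
```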