Deep and Direct Visual SLAM
Abstract: A talk by Daniel Cremers on deep networks and direct-method SLAM, given in October 2021. The presentation covers three main topics: an introduction to direct SLAM, its advantages and applications; using deep networks to estimate depth from a single image for monocular visual SLAM and to estimate inter-frame pose and related quantities end-to-end; and monocular dense reconstruction. Most of the material comes from papers by the speaker's group, listed at the end.
Direct Visual SLAM
Keypoint-based frontend methods are bound to be suboptimal for several
reasons:
- They throw away potentially valuable brightness information from the
  sensor. You are not working on the raw sensory data, so from a
  statistical (say, Bayesian inference) point of view the solution can
  never be optimal: you create an intermediate abstraction, and at that
  point you discard potentially valuable information.
- Any mistake made in assigning correspondences will propagate and
  deteriorate your reconstruction.
Direct methods do not minimize a geometric reprojection error of points
in the image; instead they minimize a photometric (color-consistency)
error to infer camera motion and the 3D map.
Works:
- LSD-SLAM[1]
- DSO[2]
loss function:
$$\min_{\xi\in\mathbb{R}^6} \int_\Omega \big\lvert I_{KF}(x) - I\big(\pi(g_\xi(u \cdot x))\big) \big\rvert \, dx$$
For each pixel $x$ in the keyframe, the brightness $I_{KF}(x)$ of the
keyframe should equal the brightness of the corresponding point in the
new image $I$.
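The direct loss above can be sketched numerically. The following is a minimal, unoptimized sketch, not code from LSD-SLAM or DSO (the function name, nearest-neighbour lookup, and explicit $(R, t)$ parameterization of $g_\xi$ are illustrative assumptions): each keyframe pixel is back-projected with its depth, moved by the rigid-body motion, reprojected with $\pi$, and the absolute brightness difference is accumulated.

```python
import numpy as np

def photometric_residual(I_kf, I_new, depth, K, R, t):
    """Sum of absolute brightness differences between keyframe pixels
    and their reprojections into the new image (nearest-neighbour
    lookup; illustrative names, not from a specific codebase)."""
    h, w = I_kf.shape
    K_inv = np.linalg.inv(K)
    total = 0.0
    for v in range(h):
        for u in range(w):
            # back-project pixel (u, v) using its depth
            p = depth[v, u] * (K_inv @ np.array([u, v, 1.0]))
            # rigid-body motion g_xi into the new frame, then project pi(.)
            q = K @ (R @ p + t)
            if q[2] <= 0:
                continue  # point behind the camera
            u2, v2 = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= u2 < w and 0 <= v2 < h:
                total += abs(I_kf[v, u] - I_new[v2, u2])
    return total
```

Minimizing this quantity over the 6-DoF motion (in practice with Gauss-Newton style updates on $\xi$, and bilinear rather than nearest-neighbour interpolation) is the core of direct tracking.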
Deep Visual SLAM
Use a deep-learning approach to predict per-pixel depth from a single
image, and find a reconstruction such that the depth maps for each
keyframe are consistent with the deep-net predictions.
This method is semi-supervised or self-supervised in the sense that the
second camera is used only during training: depth is predicted so that
it is consistent with the second camera's intensities, and at test time
only one camera is needed.[3]
Deep learning is applied both in the frontend tracking, in terms of
non-linear factor graphs, and in the backend optimization, where the
classical loss function is extended by additional terms that enforce
consistency with these predictions: Deep Depth, Deep Pose, Deep
Uncertainty.[4]
Suppose two consecutive images $I_t$ and $I_{t'}$ are the input of the
pose net; the output is the relative transformation $T_{t'}^{t}$.
Brightness consistency is used as the loss function for self-supervised
learning:
$$L_{self} = r(I_t, I_{t' \to t})$$

where $I_{t' \to t}$ is $I_{t'}$ warped into frame $t$ using the
predicted pose.
Aperture and exposure can vary between frames, which breaks brightness
constancy in the warped images. To compensate, the network is trained
to additionally predict an affine brightness transformation:
$$\Rightarrow L_{self} = r\big({\color{lightgreen}a_t^{t'}} I_t + {\color{lightgreen}b_t^{t'}},\; I_{t' \to t}\big)$$
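As a small numerical check of the affine-compensation idea: if the brightness change between frames really is affine, parameters $(a, b)$ that cancel the residual are recoverable in closed form. A sketch under that assumption (the least-squares fit here stands in for the network prediction; all names are illustrative):

```python
import numpy as np

def fit_affine_brightness(I_t, I_warped):
    """Least-squares (a, b) minimising |a*I_t + b - I_{t'->t}|^2,
    a stand-in for the network-predicted affine parameters."""
    A = np.stack([I_t.ravel(), np.ones(I_t.size)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, I_warped.ravel(), rcond=None)
    return a, b

I_t = np.linspace(0.0, 1.0, 16).reshape(4, 4)
I_warped = 1.5 * I_t + 0.05      # simulated exposure/aperture change
a, b = fit_affine_brightness(I_t, I_warped)
# with a ~ 1.5 and b ~ 0.05, the residual r(a*I_t + b, I_warped) vanishes
```

In the actual system the network predicts $a_t^{t'}, b_t^{t'}$ jointly with pose and depth rather than solving this fit explicitly.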
It is difficult to model all the phenomena in the real world (moving
objects, glass or metallic structures). One solution is to down-weight
areas of the residual where the brightness is unlikely to be preserved.
This is the aleatoric uncertainty, which can also be predicted by the
deep network.
$$\Rightarrow L_{self} = \frac{r\big({\color{lightgreen}a_t^{t'}} I_t + {\color{lightgreen}b_t^{t'}},\; I_{t' \to t}\big)}{\color{red}\Sigma_t} + \log{\color{red}\Sigma_t}$$
It tells us how likely the brightness is to be preserved. Under a
Gaussian distribution we can then down-weight the residuals; to make
sure that not everything is down-weighted, the log term is added.
(The final log term results from taking the log-likelihood of the
probability distribution; see [5] for details.)
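The uncertainty-weighted loss can be written down directly. A minimal sketch, assuming a per-pixel uncertainty map (`sigma` stands for $\Sigma_t$; the function name is illustrative):

```python
import numpy as np

def self_supervised_loss(I_t, I_warped, a, b, sigma):
    """Mean over pixels of r/Sigma + log(Sigma), with the photometric
    residual r = |a*I_t + b - I_{t'->t}| from the text above."""
    r = np.abs(a * I_t + b - I_warped)   # affine-compensated residual
    return np.mean(r / sigma + np.log(sigma))
```

A pixel with large predicted $\Sigma_t$ contributes less through the first term, while the $\log\Sigma_t$ term penalizes inflating the uncertainty everywhere at once.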
Monocular Dense Reconstruction
MonoRec (Felix Wimbauer et al.): a neural network for dense
reconstruction. The network predicts depth not from a single frame but
from a sequence of consecutive frames, so that brightness consistency
across frames can be exploited for the prediction.
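The multi-frame cue can be illustrated with a plane-sweep style photometric cost volume, a simplified sketch of the brightness-consistency signal such a network can consume (the function, its nearest-neighbour warping, and the fixed invalid-pixel penalty are illustrative assumptions, not MonoRec's actual architecture):

```python
import numpy as np

def photometric_cost_volume(I_ref, frames, motions, K, depths):
    """For every depth hypothesis d, warp each neighbouring frame into
    the reference view and accumulate the absolute brightness
    difference (illustrative names)."""
    h, w = I_ref.shape
    K_inv = np.linalg.inv(K)
    cost = np.zeros((len(depths), h, w))
    vs, us = np.mgrid[0:h, 0:w]
    pix = np.stack([us, vs, np.ones_like(us)], axis=0).reshape(3, -1).astype(float)
    for di, d in enumerate(depths):
        rays = d * (K_inv @ pix)                 # back-project at depth d
        for I_src, (R, t) in zip(frames, motions):
            q = K @ (R @ rays + t[:, None])      # reproject into source view
            u2 = np.round(q[0] / q[2]).astype(int)
            v2 = np.round(q[1] / q[2]).astype(int)
            valid = (q[2] > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
            diff = np.full(h * w, 1.0)           # fixed penalty where invalid
            diff[valid] = np.abs(I_ref.ravel()[valid] - I_src[v2[valid], u2[valid]])
            cost[di] += diff.reshape(h, w)
    return cost  # argmin over axis 0 gives a per-pixel depth estimate
```

Taking the per-pixel argmin over the depth axis already yields a crude dense depth map; a learned network instead regularizes this noisy evidence and handles the moving-object regions where brightness consistency fails.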
References
[1] ENGEL J., SCHÖPS T., CREMERS D. LSD-SLAM:
Large-Scale Direct Monocular SLAM[C]//Proceedings of the European
Conference on Computer Vision (ECCV). Springer, Cham, 2014.
[2] ENGEL J., KOLTUN V., CREMERS D. Direct Sparse
odometry[J]. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2017, 40(3): 611-625.
[3] YANG N., WANG R., STÜCKLER J., et al. Deep Virtual
Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct
Sparse Odometry[C/OL]//Proceedings of the European Conference on
Computer Vision (ECCV). 2018: 817-833[2022-07-10].
https://openaccess.thecvf.com/content_ECCV_2018/html/Nan_Yang_Deep_Virtual_Stereo_ECCV_2018_paper.html.
[4] YANG N., VON STUMBERG L., WANG R., et al. D3VO: Deep
Depth, Deep Pose and Deep Uncertainty for Monocular Visual
Odometry[C/OL]//Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2020: 1281-1292[2022-07-10].
https://openaccess.thecvf.com/content_CVPR_2020/html/Yang_D3VO_Deep_Depth_Deep_Pose_and_Deep_Uncertainty_for_Monocular_CVPR_2020_paper.html.
[5] KLODT M., VEDALDI A. Supervising the new with
the old: learning SFM from SFM[C/OL]//Proceedings of the European
Conference on Computer Vision (ECCV). 2018: 698-713[2022-07-10].
https://openaccess.thecvf.com/content_ECCV_2018/html/Maria_Klodt_Supervising_the_new_ECCV_2018_paper.html.