Background: color constancy theory, automatic white balance.
Illuminant estimation (resources on various websites).
2019 RA-L: Learning Matchable Image Transformations for Long-term Metric Visual Localization
The paper learns a drop-in replacement for the standard RGB-to-grayscale colorspace mapping used to pre-process RGB images for conventional feature detection/matching algorithms. It builds upon prior work on color constancy theory.
Core idea: mapping the RGB colorspace onto a grayscale colorspace that explicitly maximizes a chosen performance metric of a vision-based localization pipeline. This requires an appropriate objective function, which should ideally be tied to the performance of the target localization pipeline.
We investigate two approaches to formulating such a mapping:
1. a single function, similar to [11], [13], [14]:
- Robust monocular visual teach and repeat aided by local ground planarity and color-constant imagery 2017
- Robust, long-term visual localisation using illumination invariance 2014
- Expanding the limits of vision-based localization for long-term route-following autonomy 2017
2. a parametrized function tailored to the specific image pair.
Additionally, the functional form of either mapping may be specified analytically (e.g., from physics) or learned from data using a function approximator such as a neural network.
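For context, the standard mapping being replaced is a fixed luma weighting (the BT.601 coefficients used by common libraries such as OpenCV); a minimal sketch:

```python
import numpy as np

def standard_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Fixed BT.601 luma weighting: the hand-specified mapping this work
    aims to replace with a matchability-maximizing one."""
    # rgb: (..., 3) array with channels in R, G, B order.
    return rgb @ np.array([0.299, 0.587, 0.114])
```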
In the absence of accurate ground truth data, we might instead choose to maximize the number or quality of feature matches in the front-end of a feature-based localization pipeline.
In this work we learn an objective function by training a deep convolutional neural network (CNN) to act as a differentiable proxy to the localization front-end.
This proxy network can then be used to define a fully differentiable objective function, allowing us to train a nonlinear colorspace mapping using gradient-based methods.
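A hedged sketch of this idea, assuming a pre-trained proxy `proxy_net` that regresses inlier counts for a grayscale image pair and a colorspace mapping `mapping` with trainable parameters (all names here are illustrative, not the paper's code):

```python
import torch

def matchability_loss(rgb_a, rgb_b, mapping, proxy_net):
    """Differentiable surrogate objective: maximize predicted inlier matches."""
    gray_a, gray_b = mapping(rgb_a), mapping(rgb_b)
    predicted_inliers = proxy_net(gray_a, gray_b)
    return -predicted_inliers.mean()  # minimize the negative => maximize matches

# Typical loop: the proxy stays frozen; gradients flow only into the mapping.
# for rgb_a, rgb_b in loader:
#     loss = matchability_loss(rgb_a, rgb_b, mapping, proxy_net)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```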
Related work
Appearance robustness in metric visual localization has previously been studied from the perspective of illumination invariance
[11]–[14]: hand-engineered image transformations to improve feature matching over time
- [12] Dealing with shadows: Capturing intrinsic scene appearance for image-based outdoor localisation 2013
[15], [16]: affine models and other simple analytical transformations
- [16] Illumination change robustness in direct visual SLAM
[17]–[20] have focused on learning feature descriptors that are robust to certain types of appearance change:
- Learning place-dependant features for long-term vision-based localisation
- Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera
- Image features for visual teach-and-repeat navigation in changing environments
- Learning place-and-time-dependent binary descriptors for long-term visual localization
Image-to-image translation [7], [8]:
- Image-to-image translation with conditional adversarial networks
- Unpaired image-to-image translation using Cycle-Consistent adversarial networks
In [21], the authors train a convolutional encoder-decoder network to enhance temporal consistency; here the main source of appearance change is the camera itself.
[5]: learning a many-to-one mapping onto a privileged appearance condition; and
[6]: learning multiple pairwise mappings between appearance categories such as day and night.
- How to train a CAT: Learning canonical appearance transformations for direct visual localization under illumination change
- Adversarial training for adverse conditions: Robust metric localisation using appearance transfer
Appearance-invariant place recognition [22], [23] typically relies on patch matching or whole-image statistics to identify images corresponding to nearby physical locations:
- Addressing challenging place recognition tasks using generative adversarial networks
- Night-to-day image translation for retrieval-based localization
Limitations: [5], [6], [21] require well-aligned training images exhibiting appearance variation, which are difficult to obtain at scale in the real world, and it is not clear how categorical appearance mappings such as [6], [22], [23] should be applied to continuous appearance change in long-term deployments.
A. Differentiable Matcher Proxy
We consider the task of training a CNN Mθ, with parameters θ, to predict the number of inlier feature matches returned by a non-differentiable feature detector/matcher M for a given image pair.
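A minimal sketch of such a proxy, assuming the image pair is stacked channel-wise and the target is an inlier count computed offline by the real matcher M (architecture and sizes are illustrative guesses, not the paper's exact model):

```python
import torch
import torch.nn as nn

class MatcherProxy(nn.Module):
    """CNN regressing the inlier match count for a grayscale image pair."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, gray_a, gray_b):  # each: (B, H, W)
        x = torch.stack([gray_a, gray_b], dim=1)  # (B, 2, H, W)
        return self.head(self.features(x).flatten(1)).squeeze(1)

# Training target: inlier counts from the non-differentiable matcher M
# (e.g., feature matching + RANSAC) precomputed for each training pair.
# loss = nn.functional.mse_loss(proxy(gray_a, gray_b), true_inlier_counts)
```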
B. Physically Motivated Transformation
Prior work in [9] has shown that, under the assumptions of a single black-body illuminant and an infinitely narrow sensor response function, an appropriately weighted linear combination of the log-responses of a three-channel (e.g., RGB) camera represents a projection onto an invariant one-dimensional chromaticity space that is independent of both the intensity and color temperature of the illuminant, and depends only on the imaging sensor and the materials in the scene.
- [9] Study of the photodetector characteristics of a camera for color constancy in natural scenes 2010
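For reference, the commonly used form of this projection (as popularized in [13]; treat the exact offset and channel ordering here as an assumption, with α determined by the camera's peak spectral responses):

```latex
% Illumination-invariant grayscale from log channel responses:
I = 0.5 + \log(G) - \alpha \log(B) - (1 - \alpha)\log(R)
```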
Grayscale images generated using this procedure are somewhat resistant to variations in lighting and shadow, and have been shown to improve stereo localization quality in the presence of shadows and changing daytime lighting conditions [11], [13], [14], but have not been successful in adapting to nighttime navigation with headlights.
The authors relax the constraints in equation (3) and generalize equation (2) to an unconstrained weighted combination of log channel responses (the SumLog transformation evaluated in Section IV).
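A minimal sketch of this relaxed form, assuming it matches the SumLog transformation named in Section IV: a weighted sum of log channel responses with the weights and offset left free to be learned (initial values below are illustrative):

```python
import torch
import torch.nn as nn

class SumLog(nn.Module):
    """Grayscale as a learnable weighted sum of log channel responses.

    Fixing the weights to (-(1 - alpha), 1, -alpha) with a 0.5 offset
    recovers the physically motivated invariant projection above."""
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.tensor([-0.4, 1.0, -0.6]))  # R, G, B
        self.offset = nn.Parameter(torch.tensor(0.5))

    def forward(self, rgb):  # rgb: (B, 3, H, W), values in (0, 1]
        log_rgb = torch.log(rgb.clamp(min=1e-4))  # guard against log(0)
        return self.offset + (self.weights.view(1, 3, 1, 1) * log_rgb).sum(dim=1)
```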
C. Learned Nonlinear Transformations
While the assumption of a single black-body illuminant in [9] is reasonable for daytime navigation where the dominant light source is the sun, it does not hold in many common navigation scenarios such as nighttime driving with headlights.
Moreover, the assumption of an infinitely narrow sensor response is unrealistic for real cameras.
We investigate the possibility of learning a bespoke nonlinear mapping that maximizes matchability for a particular combination of imaging sensor, estimator, and environment.
We consider two versions of this MLP-based transformation, with and without an additional pairwise context feature obtained from an encoder network Eφ.
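A hedged sketch of the two MLP variants, assuming a per-pixel mapping and a hypothetical pairwise context vector produced by Eφ (hidden sizes and the context dimension are guesses):

```python
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    """Per-pixel nonlinear colorspace mapping, optionally pair-conditioned."""
    def __init__(self, context_dim=0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + context_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),  # grayscale in [0, 1]
        )

    def forward(self, rgb, context=None):  # rgb: (H, W, 3)
        x = rgb.reshape(-1, 3)
        if context is not None:  # pairwise code from the encoder E_phi
            x = torch.cat([x, context.expand(x.shape[0], -1)], dim=1)
        return self.mlp(x).reshape(rgb.shape[0], rgb.shape[1])

# Without context: PixelMLP(); with context: PixelMLP(context_dim=8),
# applied with the 8-D code computed by E_phi from the image pair.
```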
IV. EXPERIMENTS
The greatest improvements were generally obtained from the SumLog and SumLog-E transformations.
We saw little improvement in match counts on the RobotCar/Overcast-Night experiment, which we attribute to motion blur in the nighttime images making feature matching exceptionally difficult.
Fig. 4 shows the outputs of each image transformation for sample RGB image pairs in the VKITTI/0020 Morning and Sunset sequences (Fig. 4(a)) and the challenging sequence InTheDark/0041 (Fig. 4(b)).
We see that each model produced image pairs that are visually more consistent than standard Gray images, and that local illumination variations such as shadows, uneven lighting, and specular reflections were minimized by optimizing equation (6).