Abstract
Solve optical flow estimation as a supervised learning task.
Compare two architectures: a generic architecture and one that includes a layer correlating feature vectors at different image locations.
Introduction
Train CNNs end-to-end to learn to predict the optical flow field from a pair of images:
- 1. Learning image feature representations
- 2. Matching them at different locations in the two images
Method
Develop an architecture with a correlation layer that explicitly provides matching capabilities.
Learn strong features at multiple levels of scale and abstraction
Find the actual correspondences based on these features (the layers on top of the correlation layer learn how to predict flow from these matches)
Related Work
Optical Flow
Variational approaches
- Variational approaches have dominated optical flow estimation since [19]
- Improvements [29,5,34]
- Large displacements and combinatorial matching were integrated into the variational approach [6,35]
- Feature information is aggregated from fine to coarse using sparse convolutions and max-pooling [35]
- [32] study statistics of optical flow and learn regularizers using Gaussian scale mixtures
- [31] model local statistics of optical flow with Gaussian mixture models.
- [4] compute principal components of a training set of flow fields.
- Train classifiers to select among different inertial estimates [21] or to obtain occlusion probabilities [27].
- [33] approach the task with factored gated restricted Boltzmann machines.
- [23] use a special autoencoder called ‘synchrony autoencoder’
Network Architectures
Convolutional Networks
Solution
- Apply a conventional CNN in a ‘sliding window’ fashion
Drawbacks:
- High computational cost
- Per-patch nature (cannot account for global output properties, e.g. sharp edges)
- Upsample all feature maps to the desired full resolution and stack them together
[10] Refine a coarse depth map by training an additional network (inputs: the coarse prediction and the input image)
[9] iteratively refine the coarse feature maps with the use of ‘upconvolutional’ layers
The authors’ approach
‘Upconvolve’ the whole coarse feature maps, allowing more high-level information to be transferred to the fine prediction
Proposed Architectures
Given a dataset consisting of image pairs and ground truth flows, train a network to predict the x–y flow fields directly from the images.
- Stack both input images together and feed them through a rather generic network, letting the network decide itself how to process the image pair to extract the motion information.
- Alternatively, create two separate yet identical processing streams for the two images and combine them at a later stage.
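The two input strategies can be sketched shape-wise in numpy (image dimensions here are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical image pair (H, W, 3); the sizes are illustrative only.
img1 = np.zeros((384, 512, 3))
img2 = np.zeros((384, 512, 3))

# (a) Generic network: stack the pair into one 6-channel input tensor
# and let the network figure out how to extract motion information.
stacked = np.concatenate([img1, img2], axis=-1)  # shape (384, 512, 6)

# (b) Two identical streams: each image is first processed separately
# into feature maps, which are combined at a later stage (e.g. by a
# correlation layer).
```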
Find correspondences
Introduce a ‘correlation layer’ that performs multiplicative patch comparisons between two feature maps
Consider first a single comparison of two patches. Given feature maps f1, f2 with c channels, the ‘correlation’ of a patch centered at x1 in the first map with a patch centered at x2 in the second map is
c(x1, x2) = Σ_{o ∈ [−k, k] × [−k, k]} ⟨f1(x1 + o), f2(x2 + o)⟩
for a square patch of size K := 2k + 1. This is analogous to one step of a convolution, but it convolves data with other data rather than with a filter, so it has no trainable weights; computing a single c(x1, x2) involves c · K² multiplications.
To keep this tractable, limit the maximum displacement for comparisons and introduce striding in both feature maps:
- Given a maximum displacement d, for each location x1 compute correlations c(x1, x2) only in a neighborhood of size D := 2d + 1, by limiting the range of x2.
- Use strides s1 and s2 to quantize x1 globally and to quantize x2 within the neighborhood centered around x1.
- In principle the result produced by the correlation is 4-dimensional; the relative displacements are organized in channels, giving an output of size w × h × D².
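A minimal numpy sketch of this layer, assuming 1×1 patches (k = 0, the setting later used for FlowNetC) and stride s1 = 1, so only the displacement stride s2 is modeled:

```python
import numpy as np

def correlation_layer(f1, f2, d=20, s2=2):
    """Sketch of the correlation layer for 1x1 patches (k = 0) and
    s1 = 1: for each position x1 in f1, take the dot product of the
    feature vectors at x1 and at x1 + displacement in f2, for all
    displacements up to d sampled with stride s2.  Output shape is
    (h, w, D_s**2) where D_s is the number of sampled offsets per axis."""
    h, w, c = f1.shape
    offsets = range(-d, d + 1, s2)      # quantized displacements
    D_s = len(offsets)
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))
    out = np.zeros((h, w, D_s * D_s))
    for i, dy in enumerate(offsets):
        for j, dx in enumerate(offsets):
            # Shift f2 by (dy, dx) and take the per-pixel dot product.
            shifted = f2p[d + dy: d + dy + h, d + dx: d + dx + w]
            out[:, :, i * D_s + j] = np.sum(f1 * shifted, axis=-1)
    return out
```

With the paper's d = 20 and s2 = 2 this yields 21² = 441 output channels per position; the nested loop is for clarity, not efficiency.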
Refinement
‘Upconvolutional’ layers, consisting of unpooling and a convolution [38,37,16,28,9]
Apply the ‘upconvolution’ to feature maps, and concatenate it with the corresponding feature maps from the ‘contractive’ part of the network and with an upsampled coarser flow prediction.
This preserves both the high-level information passed from coarser feature maps and the fine local information provided in lower-layer feature maps.
Further refinement
- Bilinear upsampling
- Variational approach from [6] (the ‘+v’ variants)
Start at the 4 times downsampled resolution
Use the coarse to fine scheme with 20 iterations to bring the flow field to the full resolution
Run 5 more iterations at the full image resolution
Additionally compute image boundaries with the approach from [26] and respect the detected boundaries by replacing the smoothness coefficient by α = exp(−λ b(x, y)^κ), where b(x, y) denotes the thin boundary strength resampled at the respective scale and between pixels
- More computationally expensive than simple bilinear upsampling
- Obtain a smooth and subpixel-accurate flow field
4. Training Data
4.1. Existing Datasets
Middlebury
KITTI
MPI Sintel
4.2. Flying Chairs
4.3. Data Augmentation
Advantage
Improve generalization
Avoid overfitting
Method
Variety of images
- geometric transformations
- Additive Gaussian noise
- Change in brightness, contrast, gamma and color
Variety of flow fields
Apply the same strong geometric transformation to both images of a pair, but additionally a smaller relative transformation between the two images.
- Sample translation from the range [−20%, 20%] of the image width for x and y;
- Rotation from [−17◦, 17◦];
- Scaling from [0.9, 2.0].
- The Gaussian noise has a sigma uniformly sampled from [0, 0.04];
- Contrast is sampled within [−0.8, 0.4];
- Multiplicative color changes to the RGB channels per image from [0.5, 2];
- Gamma values from [0.7, 1.5]
- Additive brightness changes using a Gaussian with a sigma of 0.2.
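The parameter sampling above can be sketched as follows. The transforms themselves are not applied here, and treating the contrast and color ranges as uniform draws is an assumption (the notes only give the ranges):

```python
import numpy as np

def sample_augmentation_params(rng):
    """Sample one set of augmentation parameters using the ranges from
    the notes.  In the actual pipeline the geometric part is applied
    identically to both images of a pair, plus a smaller relative
    transform between them (not sampled here)."""
    return {
        "translation": rng.uniform(-0.2, 0.2, size=2),  # fraction of image width, x and y
        "rotation_deg": rng.uniform(-17.0, 17.0),
        "scale": rng.uniform(0.9, 2.0),
        "noise_sigma": rng.uniform(0.0, 0.04),          # additive Gaussian noise
        "contrast": rng.uniform(-0.8, 0.4),
        "color_mult": rng.uniform(0.5, 2.0, size=3),    # multiplicative, per RGB channel
        "gamma": rng.uniform(0.7, 1.5),
        "brightness": rng.normal(0.0, 0.2),             # additive, Gaussian with sigma 0.2
    }
```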
5. Experiments
5.1. Network and Training Details
Nine convolutional layers, six of them with a stride of 2 (the simplest form of pooling), and a ReLU nonlinearity after each layer.
Without any fully connected layers, allowing the networks to take images of arbitrary size as input
CNNs
Choose Adam [22] as the optimization method (faster convergence than standard stochastic gradient descent with momentum), with β1 = 0.9 and β2 = 0.999
Use fairly small mini-batches of 8 image pairs.
Start with learning rate λ = 1e−4 and then divide it by 2 every 100k iterations after the first 300k
Upscaling the input images during testing may improve the performance
For the correlation layer in FlowNetC: parameters k = 0, d = 20, s1 = 1, s2 = 2
As training loss we use the endpoint error (EPE): the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels
Due to exploding gradients with λ = 1e−4, start by training with a very low learning rate λ = 1e−6, slowly increase it to reach λ = 1e−4 after 10k iterations and then follow the schedule just described.
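The full schedule (warm-up plus decay) might be sketched as below; the linear warm-up shape is an assumption, since the notes only say the rate is increased "slowly", and the reading that the first halving happens 100k iterations after the 300k mark is likewise one interpretation:

```python
def learning_rate(it, base=1e-4, warmup_end=10_000,
                  decay_start=300_000, decay_every=100_000):
    """Learning rate at iteration `it`: linear warm-up from 1e-6 to
    `base` over the first 10k iterations (to avoid exploding
    gradients), constant until 300k, then halved every 100k
    iterations."""
    if it < warmup_end:
        warm_start = 1e-6
        return warm_start + (base - warm_start) * it / warmup_end
    halvings = max(0, (it - decay_start) // decay_every)
    return base / (2 ** halvings)
```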
Upscale with a factor of 1.25
Fine-tuning
Fine-tune on the Sintel training set
1. Use images from the Clean and Final versions of Sintel together and fine-tune using a low learning rate λ = 1e−6 for several thousand iterations.
2. After determining the optimal number of iterations using a validation set, fine-tune on the whole training set for the same number of iterations.
5.3. Analysis
Training data
Aim
Check if we benefit from using the Flying Chairs dataset instead of Sintel
Method
Trained a network just on Sintel, leaving aside a validation set to control the performance
Result
The network trained exclusively on Sintel has EPE roughly 1 pixel higher than the net trained on Flying Chairs and fine-tuned on Sintel
Question: Is data augmentation still necessary?
Training a network without data augmentation on the Flying Chairs results in an EPE increase of roughly 2 pixels when testing on Sintel.
Comparing the architectures
- The FlowNetC adapts to the kind of data it is presented with during training.
- FlowNetC seems to have more problems with large displacements.
Endpoint error (EPE)
The endpoint error is calculated by comparing an estimated optical flow vector (V_est) with a ground-truth optical flow vector (V_gt).
Calculation
The endpoint error is defined as the length of the difference vector:
||V_est - V_gt||
For a given frame in the video, you will usually have many such vectors, and the common quality measure of an optical flow estimation is the average endpoint error.
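A minimal numpy version of the average EPE over a dense flow field of shape (H, W, 2):

```python
import numpy as np

def average_epe(flow_est, flow_gt):
    """Average endpoint error between two flow fields of shape
    (H, W, 2): the per-pixel Euclidean distance between the (u, v)
    vectors, averaged over all pixels."""
    return np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1))
```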
Interpolation error
Requires no ground truth.
Achieved by using the optical flow to extrapolate (“warp”) the current frame. The extrapolated image is then compared with the real next frame of the video.
Interpolation error can be a good measure of how well the optical flow can be used for video encoding, while endpoint error can be a good measure of how well it can be used for computer vision tasks, such as shape from motion and the like.
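One common way to compute such a warping-based error is to reconstruct the current frame by backward-warping the next frame with the forward flow and compare the two. This sketch assumes single-channel frames and bilinear sampling; the exact warping convention (which frame is warped toward which) varies between benchmarks:

```python
import numpy as np

def warp_backward(frame_next, flow):
    """Backward-warp the next frame toward the current one using the
    forward flow: output pixel (y, x) is bilinearly sampled from the
    next frame at (y + v, x + u), with coordinates clipped to the image."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0; wy = sy - y0
    f = frame_next
    return ((1 - wy) * ((1 - wx) * f[y0, x0] + wx * f[y0, x1])
            + wy * ((1 - wx) * f[y1, x0] + wx * f[y1, x1]))

def interpolation_error(frame_cur, frame_next, flow):
    """RMS difference between the warped next frame and the current
    frame; needs no ground-truth flow."""
    return np.sqrt(np.mean((warp_backward(frame_next, flow) - frame_cur) ** 2))
```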