Abstract
Solve optical flow estimation as a supervised learning task.
Compare two architectures: a generic architecture and one that includes a layer correlating feature vectors at different image locations.
Introduction
Train CNNs end-to-end to learn to predict the optical flow field from a pair of images:
- 1. Learning image feature representations
- 2. Matching them at different locations in the two images
Method
Develop an architecture with a correlation layer that explicitly provides matching capabilities.
Learn strong features at multiple levels of scale and abstraction
Find the actual correspondences based on these features (the layers on top of the correlation layer learn how to predict flow from these matches)
Related Work
Optical Flow
Variational approaches
- Variational approaches have dominated optical flow estimation since [19]
- Improvements [29,5,34]
- Large displacements and combinatorial matching were integrated into the variational approach [6,35]
- Feature information is aggregated from fine to coarse using sparse convolutions and max-pooling [35]
- [32] study statistics of optical flow and learn regularizers using Gaussian scale mixtures
- [31] model local statistics of optical flow with Gaussian mixture models.
- [4] compute principal components of a training set of flow fields.
- Train classifiers to select among different inertial estimates [21] or to obtain occlusion probabilities [27].
- [33] approach the task with factored gated restricted Boltzmann machines.
- [23] use a special autoencoder called ‘synchrony autoencoder’
Network Architectures
Convolutional Networks
Solution
- Apply a conventional CNN in a ‘sliding window’ fashion
Drawbacks:
- High computational cost
- Per-patch nature (cannot account for global output properties, e.g. sharp edges)
- Upsample all feature maps to the desired full resolution and stack them together
[10] Refine a coarse depth map by training an additional network (inputs: the coarse prediction and the input image)
[9] iteratively refine the coarse feature maps with the use of ‘upconvolutional’ layers
The authors’ approach
‘Upconvolve’ the whole coarse feature maps, allowing more high-level information to be transferred to the fine prediction
Proposed Architectures
Given a dataset consisting of image pairs and ground truth flows, train a network to predict the x–y flow fields directly from the images.
- Stack both input images together and feed them through a rather generic network, letting the network decide itself how to process the image pair to extract the motion information.
- Alternatively, create two separate yet identical processing streams for the two images and combine them at a later stage.
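The two input strategies can be sketched shape-wise in numpy (image dimensions here are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical image pair (H, W, 3); the sizes are illustrative only.
img1 = np.zeros((384, 512, 3))
img2 = np.zeros((384, 512, 3))

# (a) Generic network: stack the pair into one 6-channel input tensor
# and let the network figure out how to extract motion information.
stacked = np.concatenate([img1, img2], axis=-1)  # shape (384, 512, 6)

# (b) Two identical streams: each image is first processed separately
# into feature maps, which are combined at a later stage (e.g. by a
# correlation layer).
```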
Find correspondences
Introduce a ‘correlation layer’ that performs multiplicative patch comparisons between two feature maps
Consider first a single comparison of two patches. Given feature maps f1, f2 with c channels, the ‘correlation’ of a patch centered at x1 in the first map with a patch centered at x2 in the second map is
c(x1, x2) = Σ_{o ∈ [−k, k] × [−k, k]} ⟨f1(x1 + o), f2(x2 + o)⟩
for a square patch of size K := 2k + 1. This is analogous to one step of a convolution, but it convolves data with other data rather than with a filter, so it has no trainable weights; computing a single c(x1, x2) involves c · K² multiplications.
To keep this tractable, limit the maximum displacement for comparisons and introduce striding in both feature maps:
- Given a maximum displacement d, for each location x1 compute correlations c(x1, x2) only in a neighborhood of size D := 2d + 1, by limiting the range of x2.
- Use strides s1 and s2 to quantize x1 globally and to quantize x2 within the neighborhood centered around x1.
- In principle the result produced by the correlation is 4-dimensional; the relative displacements are organized in channels, giving an output of size w × h × D².
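A minimal numpy sketch of this layer, assuming 1×1 patches (k = 0, the setting later used for FlowNetC) and stride s1 = 1, so only the displacement stride s2 is modeled:

```python
import numpy as np

def correlation_layer(f1, f2, d=20, s2=2):
    """Sketch of the correlation layer for 1x1 patches (k = 0) and
    s1 = 1: for each position x1 in f1, take the dot product of the
    feature vectors at x1 and at x1 + displacement in f2, for all
    displacements up to d sampled with stride s2.  Output shape is
    (h, w, D_s**2) where D_s is the number of sampled offsets per axis."""
    h, w, c = f1.shape
    offsets = range(-d, d + 1, s2)      # quantized displacements
    D_s = len(offsets)
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))
    out = np.zeros((h, w, D_s * D_s))
    for i, dy in enumerate(offsets):
        for j, dx in enumerate(offsets):
            # Shift f2 by (dy, dx) and take the per-pixel dot product.
            shifted = f2p[d + dy: d + dy + h, d + dx: d + dx + w]
            out[:, :, i * D_s + j] = np.sum(f1 * shifted, axis=-1)
    return out
```

With the paper's d = 20 and s2 = 2 this yields 21² = 441 output channels per position; the nested loop is for clarity, not efficiency.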
Refinement
‘Upconvolutional’ layers, consisting of unpooling and a convolution [38,37,16,28,9]
Apply the ‘upconvolution’ to feature maps, and concatenate it with the corresponding feature maps from the ‘contractive’ part of the network and with an upsampled coarser flow prediction.
This preserves both the high-level information passed from coarser feature maps and the fine local information provided in lower-layer feature maps.
Further refinement
- Bilinear upsampling
- Variational approach from [6] (the ‘+v’ variants)
Start at the 4 times downsampled resolution
Use the coarse to fine scheme with 20 iterations to bring the flow field to the full resolution
Run 5 more iterations at the full image resolution
Additionally compute image boundaries with the approach from [26] and respect the detected boundaries by replacing the smoothness coefficient by α = exp(−λ b(x, y)^κ), where b(x, y) denotes the thin boundary strength resampled at the respective scale and between pixels
- More computationally expensive than simple bilinear upsampling
- Obtain a smooth and subpixel-accurate flow field
4. Training Data
4.1. Existing Datasets
Middlebury
KITTI
MPI Sintel
4.2. Flying Chairs
4.3. Data Augmentation
Advantage
Improve generalization
Avoid overfitting
Method
Variety of images
- geometric transformations
- Additive Gaussian noise
- Change in brightness, contrast, gamma and color
Variety of flow fields
Apply the same strong geometric transformation to both images of a pair, but additionally a smaller relative transformation between the two images.
- Sample translation from the range [−20%, 20%] of the image width for x and y;
- Rotation from [−17◦, 17◦];
- Scaling from [0.9, 2.0].
- The Gaussian noise has a sigma uniformly sampled from [0, 0.04];
- Contrast is sampled within [−0.8, 0.4];
- Multiplicative color changes to the RGB channels per image from [0.5, 2];
- Gamma values from [0.7, 1.5]
- Additive brightness changes using a Gaussian with a sigma of 0.2.
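The parameter sampling above can be sketched as follows. The transforms themselves are not applied here, and treating the contrast and color ranges as uniform draws is an assumption (the notes only give the ranges):

```python
import numpy as np

def sample_augmentation_params(rng):
    """Sample one set of augmentation parameters using the ranges from
    the notes.  In the actual pipeline the geometric part is applied
    identically to both images of a pair, plus a smaller relative
    transform between them (not sampled here)."""
    return {
        "translation": rng.uniform(-0.2, 0.2, size=2),  # fraction of image width, x and y
        "rotation_deg": rng.uniform(-17.0, 17.0),
        "scale": rng.uniform(0.9, 2.0),
        "noise_sigma": rng.uniform(0.0, 0.04),          # additive Gaussian noise
        "contrast": rng.uniform(-0.8, 0.4),
        "color_mult": rng.uniform(0.5, 2.0, size=3),    # multiplicative, per RGB channel
        "gamma": rng.uniform(0.7, 1.5),
        "brightness": rng.normal(0.0, 0.2),             # additive, Gaussian with sigma 0.2
    }
```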
5. Experiments
5.1. Network and Training Details
Nine convolutional layers, six of them with a stride of 2 (the simplest form of pooling), and a ReLU nonlinearity after each layer.
Without any fully connected layers, allowing the networks to take images of arbitrary size as input
CNNs
Choose Adam [22] as the optimization method (faster convergence than standard stochastic gradient descent with momentum), with β1 = 0.9 and β2 = 0.999
Use fairly small mini-batches of 8 image pairs.
Start with learning rate λ = 1e−4 and then divide it by 2 every 100k iterations after the first 300k
Upscaling the input images during testing may improve the performance
For the correlation layer in FlowNetC: parameters k = 0, d = 20, s1 = 1, s2 = 2
As training loss we use the endpoint error (EPE): the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels
Due to exploding gradients with λ = 1e−4, start by training with a very low learning rate λ = 1e−6, slowly increase it to reach λ = 1e−4 after 10k iterations and then follow the schedule just described.
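The full schedule (warm-up plus decay) might be sketched as below; the linear warm-up shape is an assumption, since the notes only say the rate is increased "slowly", and the reading that the first halving happens 100k iterations after the 300k mark is likewise one interpretation:

```python
def learning_rate(it, base=1e-4, warmup_end=10_000,
                  decay_start=300_000, decay_every=100_000):
    """Learning rate at iteration `it`: linear warm-up from 1e-6 to
    `base` over the first 10k iterations (to avoid exploding
    gradients), constant until 300k, then halved every 100k
    iterations."""
    if it < warmup_end:
        warm_start = 1e-6
        return warm_start + (base - warm_start) * it / warmup_end
    halvings = max(0, (it - decay_start) // decay_every)
    return base / (2 ** halvings)
```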
Upscale with a factor of 1.25
Fine-tuning
Fine-tune on the Sintel training set
1. Use images from the Clean and Final versions of Sintel together and fine-tune using a low learning rate λ = 1e−6 for several thousand iterations.
2. After determining the optimal number of iterations using a validation set, fine-tune on the whole training set for the same number of iterations.
5.3. Analysis
Training data
Aim
Check if we benefit from using the Flying Chairs dataset instead of Sintel
Method
Trained a network just on Sintel, leaving aside a validation set to control the performance
Result
The network trained exclusively on Sintel has EPE roughly 1 pixel higher than the net trained on Flying Chairs and fine-tuned on Sintel
Question: Is data augmentation still necessary?
Training a network without data augmentation on the Flying Chairs results in an EPE increase of roughly 2 pixels when testing on Sintel.
Comparing the architectures
- The FlowNetC adapts to the kind of data it is presented with during training.
- FlowNetC seems to have more problems with large displacements.
Endpoint error (EPE)
The endpoint error is calculated by comparing an estimated optical flow vector (V_est) with a ground-truth optical flow vector (V_gt).
Calculation
The endpoint error is defined as the length of the difference vector:
||V_est - V_gt||
For a given frame in the video, you will usually have many such vectors, and the common quality measure of an optical flow estimation is the average endpoint error.
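A minimal numpy version of the average EPE over a dense flow field of shape (H, W, 2):

```python
import numpy as np

def average_epe(flow_est, flow_gt):
    """Average endpoint error between two flow fields of shape
    (H, W, 2): the per-pixel Euclidean distance between the (u, v)
    vectors, averaged over all pixels."""
    return np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1))
```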
Interpolation error
Requires no ground truth.
Achieved by using the optical flow to extrapolate (“warp”) the current frame. The extrapolated image is then compared with the real next frame of the video.
Interpolation error can be a good measure of how well the optical flow can be used for video encoding, while endpoint error can be a good measure of how well it can be used for computer vision tasks, such as shape from motion and the like.
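One common way to compute such a warping-based error is to reconstruct the current frame by backward-warping the next frame with the forward flow and compare the two. This sketch assumes single-channel frames and bilinear sampling; the exact warping convention (which frame is warped toward which) varies between benchmarks:

```python
import numpy as np

def warp_backward(frame_next, flow):
    """Backward-warp the next frame toward the current one using the
    forward flow: output pixel (y, x) is bilinearly sampled from the
    next frame at (y + v, x + u), with coordinates clipped to the image."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0; wy = sy - y0
    f = frame_next
    return ((1 - wy) * ((1 - wx) * f[y0, x0] + wx * f[y0, x1])
            + wy * ((1 - wx) * f[y1, x0] + wx * f[y1, x1]))

def interpolation_error(frame_cur, frame_next, flow):
    """RMS difference between the warped next frame and the current
    frame; needs no ground-truth flow."""
    return np.sqrt(np.mean((warp_backward(frame_next, flow) - frame_cur) ** 2))
```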