Paper Notes: FlowNet: Learning Optical Flow with Convolutional Networks

Abstract

Solving the optical flow estimation problem as a supervised learning task
Compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations.

Introduction

Train CNNs end-to-end to learn to predict the optical flow field from a pair of images

  1. Learn image feature representations
  2. Match them at different locations in the two images

Method

Develop an architecture with a correlation layer that explicitly provides matching capabilities.

Learn strong features at multiple levels of scale and abstraction

Find the actual correspondences based on these features (the layers on top of the correlation layer learn how to predict flow from these matches)

Related Work

Optical Flow

Variational approaches

  1. Variational approaches have dominated optical flow estimation since [19]
  2. Improvements [29,5,34]
  3. Large displacements and combinatorial matching were integrated into the variational approach [6,35]
  4. In [35], feature information is aggregated from fine to coarse using sparse convolutions and max-pooling
  5. [32] study statistics of optical flow and learn regularizers using Gaussian scale mixtures
  6. [31] model local statistics of optical flow with Gaussian mixture models.
  7. [4] compute principal components of a training set of flow fields.
  8. Train classifiers to select among different inertial estimates [21] or to obtain occlusion probabilities [27].
  9. [33] approach the task with factored gated restricted Boltzmann machines.
  10. [23] use a special autoencoder called ‘synchrony autoencoder’

Network Architectures

Convolutional Networks

Solution

  1. Apply a conventional CNN in a ‘sliding window’ fashion

Drawback:

  1. High computational cost
  2. Per-patch nature (cannot account for global output properties, e.g. sharp edges)
  1. Upsample all feature maps to the desired full resolution and stack them together.

    [10] refine a coarse depth map by training an additional network (inputs: the coarse prediction and the input image).

    [9] iteratively refine the coarse feature maps with the use of ‘upconvolutional’ layers.

    The authors ‘upconvolve’ the whole coarse feature maps, allowing more high-level information to be transferred to the fine prediction.

    Given a dataset consisting of image pairs and ground truth flows, train a network to predict the x–y flow fields directly from the images.
  2. Stack both input images together and feed them through a rather generic network, allowing the network to decide itself how to process the image pair to extract the motion information (FlowNetSimple).
  3. Create two separate yet identical processing streams for the two images and combine them at a later stage (FlowNetCorr).

    Find correspondences

    Introduce a ‘correlation layer’ that performs multiplicative patch comparisons between two feature maps

    Consider first a single comparison of two patches. The ‘correlation’ of two patches centered at x1 in the first feature map f1 and at x2 in the second feature map f2 is defined as

    c(x1, x2) = Σ_{o ∈ [−k, k] × [−k, k]} ⟨f1(x1 + o), f2(x2 + o)⟩

    for a square patch of size K := 2k + 1.
    This is identical to one step of a convolution, except that it convolves data with other data rather than with a filter, so it has no trainable weights; computing c(x1, x2) involves c · K² multiplications, where c is the number of feature channels.
    To keep the computation tractable, limit the maximum displacement for comparisons and introduce striding in both feature maps:
    given a maximum displacement d, compute c(x1, x2) for each location x1 only in a neighborhood of size D := 2d + 1, using strides s1 and s2 (quantize x1 globally, and quantize x2 within the neighborhood centered around x1).
    The result produced by the correlation is in principle 4-dimensional; in practice it is organized as an output of size (w × h × D²).
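To make the operation concrete, here is a minimal NumPy sketch of the correlation for the FlowNetC setting k = 0, s1 = 1, where each ‘patch’ is a single feature vector and the sum reduces to a per-pixel dot product; the function name and the (H, W, C) array layout are my own choices, not from the paper.

```python
import numpy as np

def correlation(f1, f2, d=20, s2=2):
    """Correlate each location of f1 with displacements up to d in f2.

    Implements the k = 0, s1 = 1 case: each 'patch' is a single
    C-dimensional feature vector, so c(x1, x2) is a dot product.
    With d = 20 and s2 = 2 there are 21 displacements per axis,
    giving an output of shape (H, W, 441) as in FlowNetC.
    """
    H, W, C = f1.shape
    offsets = list(range(-d, d + 1, s2))        # x2 sampled with stride s2
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))  # zero-pad so every neighborhood exists
    out = np.zeros((H, W, len(offsets) ** 2), dtype=f1.dtype)
    for i, dy in enumerate(offsets):
        for j, dx in enumerate(offsets):
            # shift f2 by (dy, dx) and take per-pixel dot products with f1
            shifted = f2p[d + dy : d + dy + H, d + dx : d + dx + W]
            out[:, :, i * len(offsets) + j] = np.sum(f1 * shifted, axis=2)
    return out
```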

Refinement

‘Upconvolutional’ layers, consisting of unpooling and a convolution [38,37,16,28,9]
Apply the ‘upconvolution’ to feature maps, and concatenate it with corresponding feature maps from the ‘contractive’ part of the network and an upsampled coarser flow prediction.

This preserves both the high-level information passed from coarser feature maps and the fine local information provided in lower-layer feature maps.
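Below is a minimal PyTorch sketch of one such refinement step, assuming the layout just described; the module name, kernel sizes, and channel counts are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RefinementStep(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 'upconvolution': upsample the coarse feature maps by 2x
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        # predict flow at the coarse resolution, then upsample it by 2x as well
        self.predict_flow = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        self.upsample_flow = nn.ConvTranspose2d(2, 2, kernel_size=4, stride=2, padding=1)

    def forward(self, coarse_feat, skip_feat):
        flow = self.predict_flow(coarse_feat)
        up_feat = self.upconv(coarse_feat)
        up_flow = self.upsample_flow(flow)
        # concatenate upconvolved features, encoder skip features,
        # and the upsampled coarser flow prediction
        return torch.cat([up_feat, skip_feat, up_flow], dim=1), flow
```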
Further refinement

  1. Bilinear upsampling
  2. Variational approach from [6] (the variant denoted ‘+v’)
    Start at the 4 times downsampled resolution
    Use the coarse-to-fine scheme with 20 iterations to bring the flow field to the full resolution
    Run 5 more iterations at the full image resolution
    Additionally compute image boundaries with the approach from [26] and respect the detected boundaries by replacing the smoothness coefficient by
    α = exp(−λ b(x, y)^κ)
    where b(x, y) denotes the thin boundary strength, resampled at the respective scale and between pixels (a small sketch follows this list)
  • More computationally expensive than simple bilinear upsampling
  • Obtain smooth and subpixel-accurate flow field
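As a tiny sketch of that boundary-aware weight, assuming a per-pixel boundary-strength map b in [0, 1]; the λ and κ values are not given in these notes, so the defaults below are placeholders.

```python
import numpy as np

def smoothness_coefficient(b, lambda_=1.0, kappa=1.0):
    """alpha = exp(-lambda * b^kappa): strong boundaries get less smoothing."""
    return np.exp(-lambda_ * np.power(b, kappa))
```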

4.Training Data

4.1.Existing Datasets

Middlebury
KITTI
MPI Sintel

4.2.Flying Chairs

4.3.Data Augmentation

Advantage

Improve generalization

Avoid overfitting

Method

Variety of images

  • Geometric transformations
  • Additive Gaussian noise
  • Change in brightness, contrast, gamma and color

Variety of flow fields

Apply the same strong geometric transformation to both images of a pair, but additionally apply a smaller relative transformation between the two images (parameter ranges below; a sampling sketch follows the list).

  • Sample translation from the range [−20%, 20%] of the image width for x and y;
  • Rotation from [−17°, 17°];
  • Scaling from [0.9, 2.0];
  • The Gaussian noise has a sigma uniformly sampled from [0, 0.04];
  • Contrast is sampled within [−0.8, 0.4];
  • Multiplicative color changes to the RGB channels per image from [0.5, 2];
  • Gamma values from [0.7, 1.5];
  • Additive brightness changes using a Gaussian with a sigma of 0.2.
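A sketch of sampling these augmentation parameters with NumPy; beyond the stated ranges, the distribution shapes (uniform unless noted as Gaussian) and all names are my assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def sample_augmentation(image_width):
    """Draw one set of augmentation parameters from the ranges listed above."""
    return {
        # translation for x and y, as fractions of the image width
        "translation": rng.uniform(-0.2, 0.2, size=2) * image_width,
        "rotation_deg": rng.uniform(-17.0, 17.0),
        "scale": rng.uniform(0.9, 2.0),
        "noise_sigma": rng.uniform(0.0, 0.04),        # additive Gaussian noise
        "contrast": rng.uniform(-0.8, 0.4),
        "color_mult": rng.uniform(0.5, 2.0, size=3),  # per RGB channel
        "gamma": rng.uniform(0.7, 1.5),
        "brightness": rng.normal(0.0, 0.2),           # additive, Gaussian sigma 0.2
    }
```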

5.Experiments

5.1.Network and Training Details

Nine convolutional layers with stride of 2 (the simplest form of pooling) in six of them and a ReLU nonlinearity after each layer.
Without any fully connected layers, allowing the networks to take images of arbitrary size as input
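A contractive network matching this description might look as follows in PyTorch; the kernel sizes and channel widths are illustrative assumptions, not the paper's exact FlowNetS configuration.

```python
import torch.nn as nn

def conv(in_ch, out_ch, kernel_size, stride):
    pad = (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=pad),
        nn.ReLU(inplace=True),  # ReLU nonlinearity after each layer
    )

# input: two RGB images stacked along the channel axis (6 channels)
contractive = nn.Sequential(
    conv(6,    64, 7, 2),   # 1: strided
    conv(64,  128, 5, 2),   # 2: strided
    conv(128, 256, 5, 2),   # 3: strided
    conv(256, 256, 3, 1),   # 4
    conv(256, 512, 3, 2),   # 5: strided
    conv(512, 512, 3, 1),   # 6
    conv(512, 512, 3, 2),   # 7: strided
    conv(512, 512, 3, 1),   # 8
    conv(512, 1024, 3, 2),  # 9: strided (six strided layers in total)
)
```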

Choose Adam [22] as the optimization method (faster convergence than standard stochastic gradient descent with momentum), with β1 = 0.9 and β2 = 0.999

Use fairly small mini-batches of 8 image pairs.

Start with learning rate λ = 1e−4 and then divide it by 2 every 100k iterations after the first 300k

Upscaling the input images during testing may improve performance

For the correlation layer in FlowNetC

Parameters k = 0, d = 20, s1 = 1, s2 = 2

As training loss, use the endpoint error (EPE): the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels

For FlowNetC, gradients explode with λ = 1e−4; start by training with a very low learning rate λ = 1e−6, slowly increase it to reach λ = 1e−4 after 10k iterations, and then follow the schedule just described.
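Combining the base schedule with this FlowNetC warm-up, the per-iteration learning rate can be sketched as follows; the linear warm-up shape is my reading of "slowly increase".

```python
def learning_rate(iteration):
    """Warm up from 1e-6 to 1e-4 over 10k iterations, hold until 300k,
    then halve every further 100k iterations."""
    if iteration < 10_000:
        # linear warm-up from 1e-6 to 1e-4 (assumed linear)
        return 1e-6 + (1e-4 - 1e-6) * iteration / 10_000
    if iteration < 300_000:
        return 1e-4
    return 1e-4 * 0.5 ** ((iteration - 300_000) // 100_000 + 1)
```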

Upscale the input images with a factor of 1.25 (for FlowNetC)

Fine-tuning
Fine-tune on the Sintel training set

1.Use images from the Clean and Final versions of Sintel together and fine-tune using a low learning rate λ = 1e−6 for several thousand iterations.

2.After defining the optimal number of iterations using a validation set, we then fine-tune on the whole training set for the same number of iterations.

5.3.Analysis

Training data
Aim

Check if we benefit from using the Flying Chairs dataset instead of Sintel

Method

Trained a network just on Sintel, leaving aside a validation set to control the performance
Result
The network trained exclusively on Sintel has EPE roughly 1 pixel higher than the net trained on Flying Chairs and fine-tuned on Sintel

Question: Is data augmentation still necessary?
Training a network without data augmentation on the Flying Chairs results in an EPE increase of roughly 2 pixels when testing on Sintel.

Comparing the architectures

  1. FlowNetC adapts to the kind of data it is presented with during training.
  2. FlowNetC seems to have more problems with large displacements.

Endpoint error (EPE)

The endpoint error is calculated by comparing an estimated optical flow vector (V_est) with a ground-truth optical flow vector (V_gt).

Calculate

The endpoint error is defined as the Euclidean length of the difference vector:
||V_est − V_gt||

For a given frame in the video you will usually have many such vectors (one per pixel), and the common quality measure of an optical flow estimation is the average endpoint error.
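A minimal NumPy sketch of the average EPE for dense flow fields of shape (H, W, 2); the array names are my own.

```python
import numpy as np

def average_epe(flow_est, flow_gt):
    """Mean Euclidean distance between estimated and ground-truth flow vectors."""
    return np.linalg.norm(flow_est - flow_gt, axis=-1).mean()
```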

Interpolation error

Does not need any ground truth.

It is computed by using the optical flow to extrapolate ("warp") the current frame; the extrapolated image is then compared with the real next frame of the video.

Interpolation error can be a good measure of how well the optical flow can be used for video encoding, while endpoint error is a good measure of how well it can be used for computer vision tasks, such as structure from motion and the like.
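As a sketch of this idea, the following backward-warps the next frame with the flow and compares it with the current frame; backward warping is a common practical stand-in for the forward extrapolation described above, and all names here are my own.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def interpolation_error(frame_t, frame_t1, flow):
    """Backward-warp frame_t1 by the flow and compare with frame_t (RMSE).

    frame_t, frame_t1: grayscale images of shape (H, W)
    flow: (H, W, 2) with (dx, dy) mapping frame_t pixels into frame_t1
    """
    H, W = frame_t.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # sample frame_t1 at x + u(x); bilinear interpolation (order=1)
    coords = [ys + flow[..., 1], xs + flow[..., 0]]
    warped = map_coordinates(frame_t1, coords, order=1, mode='nearest')
    return np.sqrt(np.mean((warped - frame_t) ** 2))
```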

[Figure: flow field color coding]
