Paper notes (video): SDC-Net: Video prediction using spatially-displaced convolution

Abstract

Learn a motion vector and a kernel for each pixel, and synthesize each output pixel by applying the kernel at a displaced location in the source image, as defined by the predicted motion vector.

1. Introduction

Video prediction task

  1. Accurately capture not only how objects move,
  2. but also how their displacement affects the visibility and appearance of surrounding structures.
  3. Models can be trained on raw unlabeled video frames.
  4. Previous approaches for video prediction often focus on direct synthesis of pixels using generative models.
  5. [28] proposed to partition input sequences into a dictionary of image patch centroids and trained recurrent neural networks (RNN) to generate target images by indexing the dictionaries.
  6. [31] and [34] used a convolutional Long-Short-Term-Memory (LSTM) encoder-decoder architecture conditioned on previous frame data.
  7. [17] presented a predictive coding RNN architecture to model the motion dynamics of objects in the image for frame prediction.
  8. [21] proposed a multi-scale conditional generative adversarial network (GAN) architecture to alleviate the short range dependency of single-scale architectures.
  9. Learning to transform input frames: [14] proposed a generative adversarial network (GAN) approach with a joint future optical-flow and future-frame discriminator.
  10. [10] presented a model that learns offset vectors for sampling and performs frame synthesis using bilinear interpolation guided by the learned sampling vectors.
    However, sampling-vector-based synthesis results are often affected by speckled noise.
  11. Another approach to frame synthesis [24, 23, 36] is to learn to predict sampling kernels that adapt to each output pixel.
    This is effective but cannot model large motion, since the displacement is limited by the kernel size.

2. Methods

Given a sequence of frames $I_{1:t}$ (the immediate past $t$ frames), our work aims to predict the next future frame $I_{t+1}$.

Formulate the problem as a transformation learning problem:

$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}),\ I_t\big) \tag{1}$$

$\mathcal{G}$ is a learned function that predicts transformation parameters, and $\mathcal{T}$ is a transformation function.

One approach

$\mathcal{T}$ can be a bilinear sampling operation guided by a motion vector:

$$I_{t+1}(x, y) = f\big(I_t(x+u,\ y+v)\big) \tag{2}$$

$f$ is a bilinear interpolator, $(u, v)$ is the motion vector predicted by $\mathcal{G}$, and $I_t(x, y)$ is the pixel value at $(x, y)$ in the immediate past frame $I_t$.
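To make this concrete, here is a minimal PyTorch sketch of the vector-based resampling in equation (2); the tensor layout and function names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def warp_bilinear(img, flow):
    """Vector-based synthesis (eq. 2): bilinearly sample I_t at (x+u, y+v).

    img:  (B, C, H, W) source frame I_t
    flow: (B, 2, H, W) per-pixel motion vectors (u, v) in pixels
    """
    B, _, H, W = img.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=img.dtype, device=img.device),
        torch.arange(W, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]  # x + u
    y = ys.unsqueeze(0) + flow[:, 1]  # y + v
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack((2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```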
Another approach

Define $\mathcal{T}$ as a convolution module that combines motion or displacement learning and resampling into a single operation:

$$I_{t+1}(x, y) = K(x, y) * P_t(x, y) \tag{3}$$

$K(x, y) \in \mathbb{R}^{N \times N}$ is an $N \times N$ 2D kernel predicted by $\mathcal{G}$ at $(x, y)$, and $P_t(x, y)$ is an $N \times N$ patch centered at $(x, y)$ in $I_t$.
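A corresponding sketch of the kernel-based synthesis in equation (3), gathering each patch $P_t(x, y)$ with `unfold`; the flattened (B, N*N, H, W) kernel layout is an assumption, and N is assumed odd:

```python
import torch
import torch.nn.functional as F

def adaptive_conv(img, kernels):
    """Kernel-based synthesis (eq. 3): each output pixel has its own N x N kernel.

    img:     (B, C, H, W) source frame I_t
    kernels: (B, N*N, H, W) per-pixel kernels K(x, y), flattened row-major
    """
    B, C, H, W = img.shape
    N = int(kernels.shape[1] ** 0.5)
    # P_t(x, y): the N x N patch around every pixel -> (B, C, N*N, H, W).
    patches = F.unfold(img, kernel_size=N, padding=N // 2).view(B, C, N * N, H, W)
    # Weighted sum of each patch with its pixel-specific kernel.
    return (patches * kernels.unsqueeze(1)).sum(dim=2)
```

Because every tap stays inside the N×N window around (x, y), the reachable displacement is bounded by the kernel size, which is the limitation noted in the introduction.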

2.1 Spatially Displaced Convolution

The SDC applies the predicted kernel at the displaced location:

$$I_{t+1}(x, y) = K(x, y) * P_t(x+u,\ y+v) \tag{4}$$

The predicted pixel $I_{t+1}(x, y)$ is thus a weighted sampling of pixels in an $N \times N$ region centered at $(x+u, y+v)$ in $I_t$.

Setting $K(x, y)$ to a kernel of all zeros except for a one at the center reduces the SDC to equation (2); setting $u$ and $v$ to zero reduces it to equation (3).

Note that the SDC is not the same as applying equation (2) and equation (3) in succession, as the sketch below makes explicit.
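Combining the two gives a straightforward, if inefficient, sketch of the SDC in equation (4) with the paper's separable kernels; it reuses `warp_bilinear` from the sketch above, and the O(N²) loop over kernel taps is purely for clarity:

```python
import torch

def sdc(img, flow, ku, kv):
    """Spatially-displaced convolution (eq. 4) with separable kernels.

    img:    (B, C, H, W) source frame I_t
    flow:   (B, 2, H, W) per-pixel motion vectors (u, v)
    ku, kv: (B, N, H, W) per-pixel horizontal / vertical 1D kernels
    """
    N = ku.shape[1]
    out = torch.zeros_like(img)
    for i in range(N):        # vertical offset inside the N x N window
        for j in range(N):    # horizontal offset
            tap = flow.clone()
            tap[:, 0] += j - N // 2   # x + u + (j - N//2)
            tap[:, 1] += i - N // 2   # y + v + (i - N//2)
            # Separable kernel weight K(x, y)[i, j] = kv[i] * ku[j],
            # applied to the bilinearly sampled displaced tap.
            w = (kv[:, i] * ku[:, j]).unsqueeze(1)  # (B, 1, H, W)
            # Reuses warp_bilinear from the vector-based sketch above.
            out = out + w * warp_bilinear(img, tap)
    return out
```

Every tap is sampled from the original $I_t$ around the displaced center $(x+u, y+v)$, rather than from an already-warped intermediate frame, which is why this differs from applying equations (2) and (3) in succession.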

Formulate the model as:

$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t},\ F_{2:t}),\ I_t\big) \tag{5}$$

$\mathcal{T}$ is realized with the SDC and operates on the most recent input $I_t$, and $F_i$ is the backward optical flow between $I_i$ and $I_{i-1}$.

2.2 Network Architecture

Realize $\mathcal{G}$ using a fully convolutional network.

Input: a sequence of past frames $I_{1:t}$ and past optical flows $F_{2:t}$

Output: pixel-wise separable kernels $\{K_u, K_v\}$ and a motion vector $(u, v)$

Use 3D convolutions to convolve across width, height, and time.

We concatenate the RGB channels of each input image with its two optical flow channels to create 5 channels per frame.

The topology of the architecture takes inspiration from various V-net type topologies [7, 22, 29], with an encoder and a decoder.

Each layer of the encoder applies 3D convolutions followed by LeakyReLU [8] and a convolution with stride (1, 2, 2) that downsamples features to capture long-range spatial dependencies (a minimal sketch of one such stage follows the list below).

  1. Use 3×3×3 convolution kernels, except for the first and second layers, where 3×7×7 and 3×5×5 kernels are used to capture large displacements.
  2. Each decoder sub-part applies deconvolutions [16] followed by LeakyReLU.
  3. A convolution is applied after the corresponding features from the contracting part have been concatenated.
    The decoding part has several heads:
  4. one head for (u, v), and one each for $K_u$ and $K_v$.
  5. The last two decoding layers of $K_u$ and $K_v$ use trilinear upsampling instead of normal deconvolution to minimize the checkerboard effect [25].
  6. Apply repeated convolutions in each decoding head to reduce the time dimension to 1.
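As a rough illustration of one encoder stage and the 5-channel input assembly, here is a hedged PyTorch sketch; the channel widths, LeakyReLU slope, and flow alignment are assumptions rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class EncoderStage3D(nn.Module):
    """3D conv + LeakyReLU, then a strided conv with stride (1, 2, 2):
    spatial resolution is halved while the time dimension is kept."""

    def __init__(self, c_in, c_out, k=(3, 3, 3)):
        super().__init__()
        pad = tuple(s // 2 for s in k)
        self.conv = nn.Conv3d(c_in, c_out, k, padding=pad)
        self.down = nn.Conv3d(c_out, c_out, k, stride=(1, 2, 2), padding=pad)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.down(self.act(self.conv(x))))

# RGB (3) + optical flow (2) = 5 channels per frame, stacked along time.
frames = torch.randn(1, 3, 4, 128, 128)  # I_{1:t} with t = 4
flows = torch.randn(1, 2, 4, 128, 128)   # F_{2:t}, aligned to t frames (assumption)
x = torch.cat([frames, flows], dim=1)    # (1, 5, 4, 128, 128)
y = EncoderStage3D(5, 32)(x)             # (1, 32, 4, 64, 64)
```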

2.3 Optical Flow

A direct use of optical-flow for frame prediction leads to undesirable foreground stretching in dis-occluded pixels.

2.4 Loss Functions

The L1 loss is better at capturing small changes than the L2 loss:

$$L_1 = \| I_{t+1} - I^g_{t+1} \|_1$$

($I_i$ is a predicted frame, $I^g_i$ is its ground-truth target)

Use perceptual and style losses [11]:

Perceptual loss:

$$L_{perceptual} = \sum_{l=1}^{L} k_l \big\| \Psi_l(I_{t+1}) - \Psi_l(I^g_{t+1}) \big\|_1$$

Style loss:

$$L_{style} = \sum_{l=1}^{L} k_l \big\| \Psi_l(I_{t+1})^{\top} \Psi_l(I_{t+1}) - \Psi_l(I^g_{t+1})^{\top} \Psi_l(I^g_{t+1}) \big\|_1$$

$\Psi_l(I_i)$: the feature map from the $l$-th selected layer of an ImageNet-pretrained VGG-16, computed for $I_i$

L: the number of layers considered

$k_l$: a normalization factor $1/(C_l H_l W_l)$ (channel, height, width) for the $l$-th selected layer
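Below is a sketch of both losses on top of a pre-trained torchvision VGG-16. The layer selection and the Gram-matrix form of the style term follow common practice for these losses and are assumptions, not the paper's exact configuration; inputs are assumed to be ImageNet-normalized:

```python
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
LAYERS = {3, 8, 15}  # relu1_2, relu2_2, relu3_3 (assumed selection)

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            feats.append(x)
    return feats

def perceptual_and_style(pred, target):
    lp = ls = 0.0
    for f_p, f_t in zip(vgg_features(pred), vgg_features(target)):
        B, C, H, W = f_p.shape
        k = 1.0 / (C * H * W)                  # normalization factor k_l
        lp = lp + k * (f_p - f_t).abs().sum()  # L1 on feature maps Psi_l
        # Style: L1 between Gram matrices Psi^T Psi of flattened features.
        g_p = f_p.flatten(2) @ f_p.flatten(2).transpose(1, 2)
        g_t = f_t.flatten(2) @ f_t.flatten(2).transpose(1, 2)
        ls = ls + k * (g_p - g_t).abs().sum()
    return lp, ls
```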

Combining them:

$$L_{finetune} = w_l L_1 + w_s L_{style} + w_p L_{perceptual}$$
A loss used to initialize the adaptive kernels significantly speeds up training:

$$L_{kernel} = \big\| K_u(x, y) - 1^{N/2} \big\|_2 + \big\| K_v(x, y) - 1^{N/2} \big\|_2$$

Use the L2 norm to initialize the kernels $K_u$ and $K_v$ as a middle-one-hot vector each: all elements in each kernel are set very close to zero, except for the middle element, which is initialized close to one.

$1^{N/2}$ is the middle-one-hot vector.
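A minimal sketch of this initialization loss, assuming the separable kernels are stored as (B, N, H, W) tensors; the per-pixel L2 norm and the final averaging are illustrative choices:

```python
import torch

def kernel_init_loss(ku, kv):
    """Pull each 1D kernel toward a middle-one-hot vector so that the SDC
    initially behaves like pure (u, v) bilinear sampling."""
    N = ku.shape[1]
    one_hot = torch.zeros(1, N, 1, 1, dtype=ku.dtype, device=ku.device)
    one_hot[:, N // 2] = 1.0  # middle element ~1, everything else ~0
    l2 = lambda k: (k - one_hot).pow(2).sum(dim=1).sqrt().mean()
    return l2(ku) + l2(kv)
```

With this target, the kernel term in equation (4) starts out as an approximate identity, so early training reduces to learning the motion vectors.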

2.5 Training

Optimize with Adam using β1 = 0.9 and β2 = 0.999, with no weight decay.

  1. Optimize the model to learn (u, v) using the L1 loss with a learning rate of 1e−4 for 400 epochs.
    Optimizing for (u, v) alone allows the network to capture large and coarse motions faster.
  2. Fix all weights of the network except for the decoding heads of $K_u$ and $K_v$, and train them using the $L_{kernel}$ loss defined in equation (9) to initialize the kernels at each output pixel as middle-one-hot vectors.
  3. Optimize all weights in the model using the L1 loss and a learning rate of 1e−5 for 300 epochs to jointly fine-tune (u, v) and ($K_u$, $K_v$) at each pixel.
  4. Further fine-tune all weights using $L_{finetune}$ at a learning rate of 1e−5, with $w_l = 0.2$, $w_p = 0.06$, and $w_s = 36.0$ (a sketch of the staged schedule follows).
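The staged schedule can be sketched as follows; `TinyG` and its head names are illustrative stand-ins for the real network, and the stage-2 learning rate is an assumption since the notes do not state it:

```python
import torch
import torch.nn as nn

# Stand-in for G with separate decoding heads; names are illustrative.
class TinyG(nn.Module):
    def __init__(self, n=11):
        super().__init__()
        self.body = nn.Conv3d(5, 16, 3, padding=1)
        self.uv_head = nn.Conv3d(16, 2, 3, padding=1)
        self.ku_head = nn.Conv3d(16, n, 3, padding=1)
        self.kv_head = nn.Conv3d(16, n, 3, padding=1)

model = TinyG()

# Stage 1: learn (u, v) with L1, lr 1e-4 (400 epochs in the notes).
opt = torch.optim.Adam(model.parameters(), lr=1e-4,
                       betas=(0.9, 0.999), weight_decay=0)

# Stage 2: freeze everything but the kernel heads, train with L_kernel.
for p in model.parameters():
    p.requires_grad_(False)
for head in (model.ku_head, model.kv_head):
    for p in head.parameters():
        p.requires_grad_(True)
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-4)

# Stages 3-4: unfreeze all weights; fine-tune with L1, then L_finetune, at 1e-5.
for p in model.parameters():
    p.requires_grad_(True)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
w_l, w_p, w_s = 0.2, 0.06, 36.0  # loss weights from the notes
```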

3. Experiments

Evaluate the quality of predictions using:

  1. L1
  2. Mean Squared Error (MSE)

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

  3. Peak Signal-to-Noise Ratio (PSNR)

$$PSNR = 10 \log_{10} \frac{(2^{bits} - 1)^2}{MSE}$$

  4. Structural Similarity Index Metric (SSIM)

$$SSIM(x, y) = \big[l(x, y)\big]^{\alpha} \, \big[c(x, y)\big]^{\beta} \, \big[s(x, y)\big]^{\gamma}$$

$$l(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \qquad s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}$$

Setting the weights α, β, γ to 1 (with $c_3 = c_2/2$) reduces the formula to

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

l: luminance; c: contrast; s: structure

Small sketches of these metrics follow.
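Minimal NumPy sketches of the metrics; the SSIM shown is a simplified single-window version with 8-bit constants (the standard metric averages local windows):

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def psnr(pred, target, bits=8):
    peak = (2 ** bits - 1) ** 2
    return 10.0 * np.log10(peak / mse(pred, target))

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Reduced SSIM (alpha = beta = gamma = 1, c3 = c2 / 2) over one window."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    return num / den

a = 255 * np.random.rand(64, 64)
b = np.clip(a + np.random.randn(64, 64), 0, 255)
print(f"PSNR: {psnr(a, b):.2f} dB, SSIM: {ssim_global(a, b):.4f}")
```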