Paper notes (video): SDC-Net: Video prediction using spatially-displaced convolution

Abstract

Learn a motion vector and a kernel for each pixel, and synthesize each output pixel by applying the kernel at a displaced location in the source image, as defined by the predicted motion vector.

1. Introduction

Video prediction task

  1. Accurately capture not only how objects move,
  2. but also how their displacement affects the visibility and appearance of surrounding structures.
  3. Models can be trained on raw unlabeled video frames.
  4. Previous approaches for video prediction often focus on direct synthesis of pixels using generative models.
  5. [28] proposed to partition input sequences into a dictionary of image patch centroids and trained recurrent neural networks (RNN) to generate target images by indexing the dictionaries.
  6. [31] and [34] used a convolutional Long-Short-Term-Memory (LSTM) encoder-decoder architecture conditioned on previous frame data.
  7. [17] presented a predictive coding RNN architecture to model the motion dynamics of objects in the image for frame prediction.
  8. [21] proposed a multi-scale conditional generative adversarial network (GAN) architecture to alleviate the short range dependency of single-scale architectures.
  9. Learning to transform input frames: [14] proposed a generative adversarial network (GAN) approach with a joint future optical-flow and future-frame discriminator.
  10. [10] presented a model that learns offset vectors for sampling and performs frame synthesis using bilinear interpolation guided by the learned sampling vectors.
    However, sampling-vector-based synthesis results are often affected by speckled noise.
  11. Another approach to frame synthesis [24, 23, 36] is to learn to predict sampling kernels that adapt to each output pixel.
    This is effective but cannot model large motion, since the displacement is limited by the kernel size.

2. Methods

Given a sequence of frames $I_{1:t}$ (the immediate past $t$ frames), our work aims to predict the next future frame $I_{t+1}$.

Formulate the problem as a transformation learning problem:

$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}),\ I_t\big) \tag{1}$$

$\mathcal{G}$ is a learned function that predicts transformation parameters, and $\mathcal{T}$ is a transformation function.

One approach

$\mathcal{T}$ can be a bilinear sampling operation guided by a motion vector:

$$I_{t+1}(x, y) = f\big(I_t(x+u,\ y+v)\big) \tag{2}$$

$f$ is a bilinear interpolator, $(u, v)$ is the motion vector predicted by $\mathcal{G}$, and $I_t(x, y)$ is the pixel value at $(x, y)$ in the immediate past frame $I_t$.
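To make this concrete, here is a minimal PyTorch sketch of the vector-based resampling in equation (2); the tensor layout and function names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def warp_bilinear(img, flow):
    """Vector-based synthesis (eq. 2): bilinearly sample I_t at (x+u, y+v).

    img:  (B, C, H, W) source frame I_t
    flow: (B, 2, H, W) per-pixel motion vectors (u, v) in pixels
    """
    B, _, H, W = img.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=img.dtype, device=img.device),
        torch.arange(W, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]  # x + u
    y = ys.unsqueeze(0) + flow[:, 1]  # y + v
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack((2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```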
Another approach

Define $\mathcal{T}$ as a convolution module that combines motion or displacement learning and resampling into a single operation:

$$I_{t+1}(x, y) = K(x, y) * P_t(x, y) \tag{3}$$

$K(x, y) \in \mathbb{R}^{N \times N}$ is an $N \times N$ 2D kernel predicted by $\mathcal{G}$ at $(x, y)$, and $P_t(x, y)$ is an $N \times N$ patch centered at $(x, y)$ in $I_t$.
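A corresponding sketch of the kernel-based synthesis in equation (3), gathering each patch $P_t(x, y)$ with `unfold`; the flattened (B, N*N, H, W) kernel layout is an assumption, and N is assumed odd:

```python
import torch
import torch.nn.functional as F

def adaptive_conv(img, kernels):
    """Kernel-based synthesis (eq. 3): each output pixel has its own N x N kernel.

    img:     (B, C, H, W) source frame I_t
    kernels: (B, N*N, H, W) per-pixel kernels K(x, y), flattened row-major
    """
    B, C, H, W = img.shape
    N = int(kernels.shape[1] ** 0.5)
    # P_t(x, y): the N x N patch around every pixel -> (B, C, N*N, H, W).
    patches = F.unfold(img, kernel_size=N, padding=N // 2).view(B, C, N * N, H, W)
    # Weighted sum of each patch with its pixel-specific kernel.
    return (patches * kernels.unsqueeze(1)).sum(dim=2)
```

Because every tap stays inside the N×N window around (x, y), the reachable displacement is bounded by the kernel size, which is the limitation noted in the introduction.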

2.1 Spatially Displaced Convolution

The SDC applies the predicted kernel at the displaced location:

$$I_{t+1}(x, y) = K(x, y) * P_t(x+u,\ y+v) \tag{4}$$

The predicted pixel $I_{t+1}(x, y)$ is thus a weighted sampling of pixels in an $N \times N$ region centered at $(x+u, y+v)$ in $I_t$.

Setting $K(x, y)$ to a kernel of all zeros except for a one at the center reduces the SDC to equation (2); setting $u$ and $v$ to zero reduces it to equation (3).

Note that the SDC is not the same as applying equation (2) and equation (3) in succession, as the sketch below makes explicit.
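Combining the two gives a straightforward, if inefficient, sketch of the SDC in equation (4) with the paper's separable kernels; it reuses `warp_bilinear` from the sketch above, and the O(N²) loop over kernel taps is purely for clarity:

```python
import torch

def sdc(img, flow, ku, kv):
    """Spatially-displaced convolution (eq. 4) with separable kernels.

    img:    (B, C, H, W) source frame I_t
    flow:   (B, 2, H, W) per-pixel motion vectors (u, v)
    ku, kv: (B, N, H, W) per-pixel horizontal / vertical 1D kernels
    """
    N = ku.shape[1]
    out = torch.zeros_like(img)
    for i in range(N):        # vertical offset inside the N x N window
        for j in range(N):    # horizontal offset
            tap = flow.clone()
            tap[:, 0] += j - N // 2   # x + u + (j - N//2)
            tap[:, 1] += i - N // 2   # y + v + (i - N//2)
            # Separable kernel weight K(x, y)[i, j] = kv[i] * ku[j],
            # applied to the bilinearly sampled displaced tap.
            w = (kv[:, i] * ku[:, j]).unsqueeze(1)  # (B, 1, H, W)
            # Reuses warp_bilinear from the vector-based sketch above.
            out = out + w * warp_bilinear(img, tap)
    return out
```

Every tap is sampled from the original $I_t$ around the displaced center $(x+u, y+v)$, rather than from an already-warped intermediate frame, which is why this differs from applying equations (2) and (3) in succession.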

Formulate the model as:

$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t},\ F_{2:t}),\ I_t\big) \tag{5}$$

$\mathcal{T}$ is realized with the SDC and operates on the most recent input $I_t$, and $F_i$ is the backward optical flow between $I_i$ and $I_{i-1}$.

2.2 Network Architecture

Realize $\mathcal{G}$ using a fully convolutional network.

Input: a sequence of past frames $I_{1:t}$ and past optical flows $F_{2:t}$

Output: pixel-wise separable kernels $\{K_u, K_v\}$ and a motion vector $(u, v)$

Use 3D convolutions to convolve across width, height, and time.

We concatenate the RGB channels of each input image with its two optical flow channels to create 5 channels per frame.

The topology of the architecture takes inspiration from various V-net type topologies [7, 22, 29], with an encoder and a decoder.

Each layer of the encoder applies 3D convolutions followed by LeakyReLU [8] and a convolution with stride (1, 2, 2) that downsamples features to capture long-range spatial dependencies (a minimal sketch of one such stage follows the list below).

  1. Use 3×3×3 convolution kernels, except for the first and second layers, where 3×7×7 and 3×5×5 kernels are used to capture large displacements.
  2. Each decoder sub-part applies deconvolutions [16] followed by LeakyReLU.
  3. A convolution is applied after the corresponding features from the contracting part have been concatenated.
    The decoding part has several heads:
  4. one head for (u, v), and one each for $K_u$ and $K_v$.
  5. The last two decoding layers of $K_u$ and $K_v$ use trilinear upsampling instead of normal deconvolution to minimize the checkerboard effect [25].
  6. Apply repeated convolutions in each decoding head to reduce the time dimension to 1.
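As a rough illustration of one encoder stage and the 5-channel input assembly, here is a hedged PyTorch sketch; the channel widths, LeakyReLU slope, and flow alignment are assumptions rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class EncoderStage3D(nn.Module):
    """3D conv + LeakyReLU, then a strided conv with stride (1, 2, 2):
    spatial resolution is halved while the time dimension is kept."""

    def __init__(self, c_in, c_out, k=(3, 3, 3)):
        super().__init__()
        pad = tuple(s // 2 for s in k)
        self.conv = nn.Conv3d(c_in, c_out, k, padding=pad)
        self.down = nn.Conv3d(c_out, c_out, k, stride=(1, 2, 2), padding=pad)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.down(self.act(self.conv(x))))

# RGB (3) + optical flow (2) = 5 channels per frame, stacked along time.
frames = torch.randn(1, 3, 4, 128, 128)  # I_{1:t} with t = 4
flows = torch.randn(1, 2, 4, 128, 128)   # F_{2:t}, aligned to t frames (assumption)
x = torch.cat([frames, flows], dim=1)    # (1, 5, 4, 128, 128)
y = EncoderStage3D(5, 32)(x)             # (1, 32, 4, 64, 64)
```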

2.3 Optical Flow

A direct use of optical-flow for frame prediction leads to undesirable foreground stretching in dis-occluded pixels.

2.4 Loss Functions

The L1 loss is better at capturing small changes than the L2 loss:

$$L_1 = \| I_{t+1} - I^g_{t+1} \|_1$$

($I_i$ is a predicted frame, $I^g_i$ is its ground-truth target)

Use perceptual and style losses [11]:

Perceptual loss:

$$L_{perceptual} = \sum_{l=1}^{L} k_l \big\| \Psi_l(I_{t+1}) - \Psi_l(I^g_{t+1}) \big\|_1$$

Style loss:

$$L_{style} = \sum_{l=1}^{L} k_l \big\| \Psi_l(I_{t+1})^{\top} \Psi_l(I_{t+1}) - \Psi_l(I^g_{t+1})^{\top} \Psi_l(I^g_{t+1}) \big\|_1$$

$\Psi_l(I_i)$: the feature map from the $l$-th selected layer of an ImageNet-pretrained VGG-16, computed for $I_i$

L: the number of layers considered

$k_l$: a normalization factor $1/(C_l H_l W_l)$ (channel, height, width) for the $l$-th selected layer
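Below is a sketch of both losses on top of a pre-trained torchvision VGG-16. The layer selection and the Gram-matrix form of the style term follow common practice for these losses and are assumptions, not the paper's exact configuration; inputs are assumed to be ImageNet-normalized:

```python
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
LAYERS = {3, 8, 15}  # relu1_2, relu2_2, relu3_3 (assumed selection)

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            feats.append(x)
    return feats

def perceptual_and_style(pred, target):
    lp = ls = 0.0
    for f_p, f_t in zip(vgg_features(pred), vgg_features(target)):
        B, C, H, W = f_p.shape
        k = 1.0 / (C * H * W)                  # normalization factor k_l
        lp = lp + k * (f_p - f_t).abs().sum()  # L1 on feature maps Psi_l
        # Style: L1 between Gram matrices Psi^T Psi of flattened features.
        g_p = f_p.flatten(2) @ f_p.flatten(2).transpose(1, 2)
        g_t = f_t.flatten(2) @ f_t.flatten(2).transpose(1, 2)
        ls = ls + k * (g_p - g_t).abs().sum()
    return lp, ls
```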

Combining them:

$$L_{finetune} = w_l L_1 + w_s L_{style} + w_p L_{perceptual}$$
A loss used to initialize the adaptive kernels significantly speeds up training:

$$L_{kernel} = \big\| K_u(x, y) - 1^{N/2} \big\|_2 + \big\| K_v(x, y) - 1^{N/2} \big\|_2$$

Use the L2 norm to initialize the kernels $K_u$ and $K_v$ as a middle-one-hot vector each: all elements in each kernel are set very close to zero, except for the middle element, which is initialized close to one.

$1^{N/2}$ is the middle-one-hot vector.
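A minimal sketch of this initialization loss, assuming the separable kernels are stored as (B, N, H, W) tensors; the per-pixel L2 norm and the final averaging are illustrative choices:

```python
import torch

def kernel_init_loss(ku, kv):
    """Pull each 1D kernel toward a middle-one-hot vector so that the SDC
    initially behaves like pure (u, v) bilinear sampling."""
    N = ku.shape[1]
    one_hot = torch.zeros(1, N, 1, 1, dtype=ku.dtype, device=ku.device)
    one_hot[:, N // 2] = 1.0  # middle element ~1, everything else ~0
    l2 = lambda k: (k - one_hot).pow(2).sum(dim=1).sqrt().mean()
    return l2(ku) + l2(kv)
```

With this target, the kernel term in equation (4) starts out as an approximate identity, so early training reduces to learning the motion vectors.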

2.5 Training

Optimize with Adam using β1 = 0.9 and β2 = 0.999, with no weight decay.

  1. Optimize the model to learn (u, v) using the L1 loss with a learning rate of 1e−4 for 400 epochs.
    Optimizing for (u, v) alone allows the network to capture large and coarse motions faster.
  2. Fix all weights of the network except for the decoding heads of $K_u$ and $K_v$, and train them using the $L_{kernel}$ loss defined in equation (9) to initialize the kernels at each output pixel as middle-one-hot vectors.
  3. Optimize all weights in the model using the L1 loss and a learning rate of 1e−5 for 300 epochs to jointly fine-tune (u, v) and ($K_u$, $K_v$) at each pixel.
  4. Further fine-tune all weights using $L_{finetune}$ at a learning rate of 1e−5, with $w_l = 0.2$, $w_p = 0.06$, and $w_s = 36.0$ (a sketch of the staged schedule follows).
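The staged schedule can be sketched as follows; `TinyG` and its head names are illustrative stand-ins for the real network, and the stage-2 learning rate is an assumption since the notes do not state it:

```python
import torch
import torch.nn as nn

# Stand-in for G with separate decoding heads; names are illustrative.
class TinyG(nn.Module):
    def __init__(self, n=11):
        super().__init__()
        self.body = nn.Conv3d(5, 16, 3, padding=1)
        self.uv_head = nn.Conv3d(16, 2, 3, padding=1)
        self.ku_head = nn.Conv3d(16, n, 3, padding=1)
        self.kv_head = nn.Conv3d(16, n, 3, padding=1)

model = TinyG()

# Stage 1: learn (u, v) with L1, lr 1e-4 (400 epochs in the notes).
opt = torch.optim.Adam(model.parameters(), lr=1e-4,
                       betas=(0.9, 0.999), weight_decay=0)

# Stage 2: freeze everything but the kernel heads, train with L_kernel.
for p in model.parameters():
    p.requires_grad_(False)
for head in (model.ku_head, model.kv_head):
    for p in head.parameters():
        p.requires_grad_(True)
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-4)

# Stages 3-4: unfreeze all weights; fine-tune with L1, then L_finetune, at 1e-5.
for p in model.parameters():
    p.requires_grad_(True)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
w_l, w_p, w_s = 0.2, 0.06, 36.0  # loss weights from the notes
```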

3. Experiments

Evaluate the quality of predictions using:

  1. L1
  2. Mean Squared Error (MSE)

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

  3. Peak Signal-to-Noise Ratio (PSNR)

$$PSNR = 10 \log_{10} \frac{(2^{bits} - 1)^2}{MSE}$$

  4. Structural Similarity Index Metric (SSIM)

$$SSIM(x, y) = \big[l(x, y)\big]^{\alpha} \, \big[c(x, y)\big]^{\beta} \, \big[s(x, y)\big]^{\gamma}$$

$$l(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}, \qquad s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}$$

Setting the weights α, β, γ to 1 (with $c_3 = c_2/2$) reduces the formula to

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

l: luminance; c: contrast; s: structure

Small sketches of these metrics follow.
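Minimal NumPy sketches of the metrics; the SSIM shown is a simplified single-window version with 8-bit constants (the standard metric averages local windows):

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def psnr(pred, target, bits=8):
    peak = (2 ** bits - 1) ** 2
    return 10.0 * np.log10(peak / mse(pred, target))

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Reduced SSIM (alpha = beta = gamma = 1, c3 = c2 / 2) over one window."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    return num / den

a = 255 * np.random.rand(64, 64)
b = np.clip(a + np.random.randn(64, 64), 0, 255)
print(f"PSNR: {psnr(a, b):.2f} dB, SSIM: {ssim_global(a, b):.4f}")
```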