[2106] Video Super-Resolution Transformer

paper
code
mathematical reasoning mainly from this paper

Abstract

components in traditional Transformer design and their limitations

  1. the fully-connected self-attention layer (FCSA) neglects local information in videos
    ViTs split an image into several patches or tokens, which damages local spatial information since content (e.g. lines, edges, shapes, objects) is divided across different tokens
  2. the token-wise feed-forward layer misaligns features between video frames and ignores feature propagation across frames
    this layer independently processes each input token embedding without any interaction across frames

main contributions of VSR-Transformer

  1. spatial-temporal convolutional self-attention (STCSA) layer: exploits locality and spatial-temporal information of the data through different layers
  2. bidirectional optical flow-based feed-forward (BOFF) layer: uses interactions across the embeddings of all frames for feature propagation and alignment

Preliminary

notation
a calligraphic letter $\mathcal{X}$: a data sequence
a calligraphic letter $\mathcal{D}$: a distribution
a bold upper-case letter $\mathbf{X}$: a matrix
a bold lower-case letter $\mathbf{x}$: a vector
a lower-case letter $x$: an element of a matrix
$[T]$: the set $\{1, ..., T\}$
$\mathbf{1}\{\cdot\}$: an indicator function, where $\mathbf{1}\{A\}=1$ if $A$ is true and $\mathbf{1}\{A\}=0$ if $A$ is false
$\mathbb{E}_{\mathcal{D}}$: an empirical expectation with respect to the distribution $\mathcal{D}$

definition 1 (function distance) given a function $f: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$ and a target function $f^{\ast}: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$, define the distance between these two functions as
$\mathcal{L}_{f^{\ast}, \mathcal{D}}(f):=\mathbb{E}_{\mathbf{X}\sim\mathcal{D}}[\ell(f(\mathbf{X}), f^{\ast}(\mathbf{X}))]$

for the ground truth $Y=f^{\ast}(\mathcal{D})$, the loss is denoted by $\mathcal{L}_\mathcal{D}(f)$

definition 2 ($k$-pattern function) a function $f: \mathcal{X}\rightarrow\mathcal{Y}$ is a $k$-pattern if for some $g: \{\pm\}^k\rightarrow\mathcal{Y}$ and index $j^{\ast}$: $f(\mathbf{x})=g(x_{j^{\ast}, ..., j^{\ast}+k})$. we say a function $h_{\mathbf{u}, \mathbf{W}}(\mathbf{x})=\sum_{j}\langle \mathbf{u}^{(j)}, \mathbf{v}_{\mathbf{W}}^{(j)}\rangle$ can learn a $k$-pattern function from a feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ of the data $\mathbf{x}$ with a layer $\mathbf{u}^{(j)}\in\mathbb{R}^q$ if for $\epsilon>0$ we have
$\mathcal{L}_{f^{\ast}, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}})\leq\epsilon$

feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ learned by a convolutional attention network or a fully-connected attention network parameterized by $\mathbf{W}$

   $\implies$ any function that can capture the locality of the data should be able to learn a $k$-pattern function

video super-resolution (VSR)

given a LR video sequence $\{V_1, ..., V_T\}\sim\mathcal{D}$, where $V_t\in\mathbb{R}^{3\times H\times W}$ is the $t$-th LR frame and $\mathcal{D}$ is a distribution of videos
extract features $\mathcal{X}=\{X_1, ..., X_T\}$ from the LR video frames, where $X_t\in\mathbb{R}^{C\times H\times W}$ is the $t$-th feature
learn a non-linear mapping $F$ to reconstruct HR frames $\widehat{\mathcal{Y}}$ by utilizing spatial-temporal information across the sequence
$\widehat{\mathcal{Y}}\triangleq(\widehat{Y}_1, ..., \widehat{Y}_T)=F(X_1, ..., X_T)$

given ground-truth HR frames $\mathcal{Y}=\{Y_1, ..., Y_T\}$, where $Y_t$ is the $t$-th HR frame
minimize a loss function between the generated HR frame $\widehat{Y}_t$ and the ground-truth HR frame $Y_t$
$\widehat{F}=\underset{F}{\arg\min}\ \mathcal{L}_\mathcal{D}(F)\triangleq\widehat{\mathbb{E}}_{\mathcal{D}, t\in[T]}[d(\widehat{Y}_t, Y_t)]$

where, $d(\cdot, \cdot)$ is a distance metric, such as the L1 loss, L2 loss, or Charbonnier loss
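as a concrete example, a minimal sketch of the Charbonnier loss (the variant used for training later in these notes); the `eps` value is an assumed default for numerical stability, not a value taken from the paper

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Charbonnier loss: a smooth, differentiable approximation of the L1 distance d(pred, target)."""
    return torch.sqrt((pred - target) ** 2 + eps).mean()

y_hat, y = torch.rand(5, 3, 256, 256), torch.rand(5, 3, 256, 256)
print(charbonnier_loss(y_hat, y))
```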

for VSR tasks, a sequence model such as an RNN, LSTM, or Transformer can be used
note that the Transformer has gained particular interest since it avoids recursion and thus allows parallel computation in practice

transformer block

given an input feature $X\in\mathbb{R}^{d\times n}$ ($d$-dimensional embeddings of $n$ tokens)
a transformer block is a sequence-to-sequence function mapping a sequence in $\mathbb{R}^{d\times n}$ to another sequence in $\mathbb{R}^{d\times n}$
it consists of 2 parts; one is a self-attention layer with a skip connection
$f_1(X)=LN(X+\sum_{i=1}^hW_o^i(W_v^iX)SoftMax((W_k^iX)^T(W_q^iX)))$

where, $W_o^i\in\mathbb{R}^{d\times m}$ is a linear layer, $W_v^i, W_k^i, W_q^i\in\mathbb{R}^{m\times d}$ are linear layers mapping the feature to value, key, and query, $h$ is the number of heads, and $m$ is the head size
the other is a token-wise feed-forward layer with a skip connection
$f_2(X)=LN(f_1(X)+W_2ReLU(W_1f_1(X)+b_1\mathbf{1}_n^T)+b_2\mathbf{1}_n^T)$

where, $W_1\in\mathbb{R}^{r\times d}, W_2\in\mathbb{R}^{d\times r}$ are linear layers, $b_1\in\mathbb{R}^r, b_2\in\mathbb{R}^d$ are biases, and $r$ is the hidden size of the feed-forward layer
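for reference, a minimal PyTorch sketch of the vanilla transformer block defined by $f_1$ and $f_2$ above (single head, no dropout, no attention scaling, matching the equations); the tokens-first tensor layout and the example sizes are illustrative assumptions

```python
import torch
import torch.nn as nn

class VanillaTransformerBlock(nn.Module):
    """Self-attention + token-wise feed-forward, each with a skip connection and LayerNorm.
    Tokens are stored row-wise (n, d), i.e. the transpose of the X in the equations above."""

    def __init__(self, d: int, m: int, r: int):
        super().__init__()
        self.w_q = nn.Linear(d, m, bias=False)   # W_q
        self.w_k = nn.Linear(d, m, bias=False)   # W_k
        self.w_v = nn.Linear(d, m, bias=False)   # W_v
        self.w_o = nn.Linear(m, d, bias=False)   # W_o
        self.ffn = nn.Sequential(nn.Linear(d, r), nn.ReLU(), nn.Linear(r, d))  # W_1, b_1, W_2, b_2
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (n, d)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)     # (n, m)
        attn = torch.softmax(q @ k.T, dim=-1)                # (n, n) token-to-token attention
        f1 = self.ln1(x + self.w_o(attn @ v))                # self-attention + skip + LN
        return self.ln2(f1 + self.ffn(f1))                   # token-wise FFN + skip + LN

x = torch.randn(16, 64)                                      # n = 16 tokens, d = 64
block = VanillaTransformerBlock(d=64, m=64, r=256)
print(block(x).shape)                                        # torch.Size([16, 64])
```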

Method

model architecture

2106_vsrt_f1
The framework of video super-resolution Transformer. Given a low-resolution (LR) video, we first use an extractor to capture features of the LR videos. Then, a spatial-temporal convolutional self-attention and an optical flow-based feed-forward network model a sequence of continuous representations. Note that these two layers both have skip connections. Last, the reconstruction network restores a high-resolution video from the representations and the up-sampling frames.

feature extractor captures features from the LR input
transformer maps features to a sequence of continuous representations
reconstruction restores HR videos from the representations

loss function Charbonnier loss

2106_vsrt_t4
Network architecture of the VSR-Transformer.

2106_vsrt_t5
Network architecture of the feature extractor and reconstruction network.

T: number of frames, C: number of channels, H: image height, W: image width
I: number of input channels, O: number of output channels
CONV: convolution, with kernel size K, stride S, padding P, groups G
PixelShuffle: pixel shuffle with an upscale factor of 2
LeakyReLU: Leaky ReLU activation with a negative slope of 0.01
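the tables above list the exact layers; below is a rough PyTorch sketch of that kind of extractor/reconstruction pair (a conv followed by residual blocks, then two PixelShuffle stages for $4\times$ upscaling). the block count and channel width are placeholders rather than the paper's exact configuration, and the bicubically upsampled LR frame is assumed to be added to the output outside these modules

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, 1), nn.LeakyReLU(0.01, inplace=True), nn.Conv2d(c, c, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """First conv lifts the 3-channel LR frame to C channels, followed by residual blocks."""
    def __init__(self, c: int = 64, n_blocks: int = 5):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, c, 3, 1, 1), nn.LeakyReLU(0.01, inplace=True),
                                 *[ResidualBlock(c) for _ in range(n_blocks)])

    def forward(self, frames):            # frames: (T, 3, H, W)
        return self.net(frames)           # -> (T, C, H, W)

class Reconstruction(nn.Module):
    """Upscales C-channel features 4x with two PixelShuffle(2) stages and maps back to RGB."""
    def __init__(self, c: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, 4 * c, 3, 1, 1), nn.PixelShuffle(2), nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(c, 4 * c, 3, 1, 1), nn.PixelShuffle(2), nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(c, 3, 3, 1, 1))

    def forward(self, feat):              # feat: (T, C, H, W)
        return self.net(feat)             # -> (T, 3, 4H, 4W)

frames = torch.randn(5, 3, 64, 64)
feat = FeatureExtractor()(frames)
print(Reconstruction()(feat).shape)       # torch.Size([5, 3, 256, 256])
```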

spatial-temporal convolutional self-attention (STCSA)
drawbacks of FCSA

Q: can the FCSA layer learn $k$-pattern functions with gradient descent?
theorem 1 assume $m=1$ and $\vert u_i\vert\leq1$, that the weights are initialized from some permutation-invariant distribution $\mathcal{W}$ over $\mathbb{R}^n$, and that for all $\mathbf{x}$ we have $h_{\mathbf{u}, \mathbf{W}}^{FCSA}(\mathbf{x})\in[-1, 1]$ satisfying definition 2. then the following holds
$\mathbb{E}_{\mathbf{W}\sim\mathcal{W}}\left\Vert\frac{\partial}{\partial\mathbf{W}}\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}}^{FCSA})\right\Vert_2^2\leq qn\min\left\{\binom{n-1}{k}^{-1}, \binom{n-1}{k-1}^{-1}\right\}$

from theorem 1:

  • the initial gradient is small if $k=\Omega(\log{n})$ and the fully-connected attention layer is initialized from a permutation-invariant distribution
  • the fully-connected attention layer suffers from vanishing gradients if $q$ is not large enough
  • gradient descent will be “stuck” at initialization and thus unable to learn the $k$-pattern function

   $\implies$ the FCSA layer cannot use the spatial information of each frame since local information is not encoded in the embeddings of all tokens

detailed structure of the STCSA layer

2106_vsrt_f2
Illustration of the spatial-temporal convolutional self-attention. The unfold operation is to extract sliding local patches from a batched input feature map, while the fold operation is to combine an array of sliding local patches into a large feature map.

given feature maps of the input video frames $X\in\mathbb{R}^{T\times C\times H\times W}$
step 1 capture spatial information of each frame in $X$
$X\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_q, W_k, W_v}Q, K, V\in\mathbb{R}^{T\times C\times H\times W}$

where, $W_q, W_k, W_v$ are 3 independent conv layers
step 2 unfold features into sliding local $H_p\times W_p$-size patches in each frame, and reshape into query, key, and value matrices
$Q, K, V\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{unfold}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{reshape}\mathbb{R}^{n\_heads\times\frac{CH_pW_p}{n\_heads}\times T\frac{HW}{H_pW_p}}$

where, $n\_patches=\frac{HW}{H_pW_p}$ is the number of patches in each frame, $dim=CH_pW_p$ is the dimension of each patch, $n\_heads$ is the number of heads
step 3 calculate the similarity matrix and aggregate it with the value to obtain the attention output
$Attention(Q, K, V)=softmax(\frac{Q^TK}{\sqrt{d}})V^T\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}$

where, $d=\frac{CH_pW_p}{n\_heads}$ is the hidden dimension
note that the similarity matrix $Q^TK\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times T\frac{HW}{H_pW_p}}$ relates all embedding tokens of the whole video sequence
step 4 reshape the attention output, and fold the tensors of updated sliding local patches back into features
$Attention\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}\xrightarrow{reshape}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{fold}\mathbb{R}^{T\times C\times H\times W}$

step 5 obtain the final features, and produce the output with a skip connection and a normalization
$Attention\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_o}F\in\mathbb{R}^{T\times C\times H\times W}$

$f_1(X)=LN(X+F)\in\mathbb{R}^{T\times C\times H\times W}$

where, $W_o$ is a conv layer

step 2 to step 4 are inspired by COLA-Net
summarizing the steps above, STCSA is formulated as
$f_1(X)=LN(X+\sum_{i=1}^hW_o^i\kappa_2(\underbrace{\kappa_1(W_v^iX)}_\text{v}\,softmax({\underbrace{\kappa_1(W_k^iX)}_\text{k}}^T\underbrace{\kappa_1(W_q^iX)}_\text{q})))$

where, $\kappa_1(\cdot), \kappa_2(\cdot)$ are the unfold and fold operations, and $h$ is the number of heads, set to $h=1$ for good performance
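a minimal PyTorch-style sketch of steps 1 to 5 above, assuming a single head, non-overlapping $H_p\times W_p$ patches that evenly divide the frame, and a parameter-free layer normalization for brevity; it illustrates the data flow rather than reproducing the released implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STCSA(nn.Module):
    """Sketch of the spatial-temporal convolutional self-attention layer (single head)."""

    def __init__(self, c: int, hp: int = 8, wp: int = 8):
        super().__init__()
        self.hp, self.wp = hp, wp
        self.w_q = nn.Conv2d(c, c, 3, 1, 1)   # query conv
        self.w_k = nn.Conv2d(c, c, 3, 1, 1)   # key conv
        self.w_v = nn.Conv2d(c, c, 3, 1, 1)   # value conv
        self.w_o = nn.Conv2d(c, c, 3, 1, 1)   # output conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (T, C, H, W)
        t, c, h, w = x.shape
        unfold = lambda z: F.unfold(z, (self.hp, self.wp), stride=(self.hp, self.wp))
        # step 1: per-frame convolutions; step 2: unfold into non-overlapping patches
        q, k, v = (unfold(proj(x)) for proj in (self.w_q, self.w_k, self.w_v))  # (T, C*Hp*Wp, L)
        d, l = q.shape[1], q.shape[2]
        q, k, v = (z.permute(1, 0, 2).reshape(d, t * l) for z in (q, k, v))     # (d, T*L)
        # step 3: attention over all patches of all frames
        attn = torch.softmax(q.transpose(0, 1) @ k / d ** 0.5, dim=-1)          # (T*L, T*L)
        out = attn @ v.transpose(0, 1)                                          # (T*L, d)
        # step 4: reshape back and fold patches into feature maps
        out = out.transpose(0, 1).reshape(d, t, l).permute(1, 0, 2)             # (T, d, L)
        out = F.fold(out, (h, w), (self.hp, self.wp), stride=(self.hp, self.wp))  # (T, C, H, W)
        # step 5: output conv, skip connection, and layer normalization
        return F.layer_norm(x + self.w_o(out), x.shape[1:])

x = torch.randn(5, 64, 64, 64)            # T=5 frames, C=64, H=W=64
print(STCSA(64)(x).shape)                 # torch.Size([5, 64, 64, 64])
```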

why STCSA is suitable

Q: can the STCSA layer learn $k$-pattern functions with gradient descent?
theorem 2 assume each element of the weights is initialized uniformly from $\{\pm\frac1k\}$. fix some $\delta>0$, some $k$-pattern $f$, and some distribution $\mathcal{D}$. then, if $q>2^{k+3}\log(\frac{2^k}\delta)$ and $h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA}$ is a function satisfying definition 2, with probability at least $1-\delta$ over the initialization, when training a spatial-temporal convolutional self-attention layer using gradient descent with learning rate $\eta$, we have
$\frac{1}{S}\sum_{s=1}^S\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})\leq\eta^2S^2nk^{\frac52}2^{k+1}+\frac{k^22^{2k+1}}{q\eta S}+\eta nqk$

from theorem 2:

  • the loss $\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})$ becomes small within a finite number of optimization steps $S$, so the layer is able to learn a $k$-pattern function

   $\implies$ the STCSA layer trained with gradient descent can capture the locality of each frame

spatial-temporal position encoding

the VSR-Transformer is permutation-invariant, thus requiring precise spatial-temporal position information
3D fixed positional encoding: two spatial components (horizontal and vertical) and one temporal component
$PE(pos, i)=\begin{cases} \sin(pos\cdot{\alpha}_k) &\text{if } i=2k \\ \cos(pos\cdot{\alpha}_k) &\text{if } i=2k+1 \end{cases}$

where, ${\alpha}_k=1/1000^{2k/\frac{d}3}$, $k$ is an integer in $[0, \frac{d}6)$, $pos$ is the position in the corresponding dimension, and $d$ is the channel dimension size
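a sketch of a 3D fixed sinusoidal positional encoding consistent with the formula above: the channel dimension is split into three equal parts for the temporal, vertical, and horizontal positions ($C$ assumed divisible by 3, each part assumed even); the helper names are hypothetical and may differ from the released code

```python
import torch

def positional_encoding_1d(num_pos: int, dim: int, base: float = 1000.0) -> torch.Tensor:
    """Standard sinusoidal encoding of shape (num_pos, dim): sin on even channels, cos on odd."""
    pos = torch.arange(num_pos, dtype=torch.float32).unsqueeze(1)    # (num_pos, 1)
    k = torch.arange(dim // 2, dtype=torch.float32)                  # channel-pair index
    alpha = 1.0 / base ** (2 * k / dim)                              # alpha_k from the formula
    pe = torch.zeros(num_pos, dim)
    pe[:, 0::2] = torch.sin(pos * alpha)
    pe[:, 1::2] = torch.cos(pos * alpha)
    return pe

def spatial_temporal_pe(t: int, c: int, h: int, w: int) -> torch.Tensor:
    """3D fixed encoding of shape (T, C, H, W); each of the three parts uses C/3 channels."""
    d = c // 3
    pe_t = positional_encoding_1d(t, d).reshape(t, d, 1, 1).expand(t, d, h, w)
    pe_h = positional_encoding_1d(h, d).t().reshape(1, d, h, 1).expand(t, d, h, w)
    pe_w = positional_encoding_1d(w, d).t().reshape(1, d, 1, w).expand(t, d, h, w)
    return torch.cat([pe_t, pe_h, pe_w], dim=1)                      # (T, C, H, W)

print(spatial_temporal_pe(5, 96, 16, 16).shape)   # torch.Size([5, 96, 16, 16])
```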

bidirectional optical flow-based feed-forward (BOFF)

2106_vsrt_f3
Illustration of the bidirectional optical flow-based feed-forward layer. Given a video sequence, we first bidirectionally estimate the forward and backward optical flows and warp the feature maps with the corresponding optical flows. Then we learn forward and backward propagation networks to produce two sequences of features from the concatenated warped features and LR frames. Last, we fuse these two feature sequences into one feature sequence.

given features $X\in\mathbb{R}^{T\times C\times H\times W}$ output by the STCSA layer
step 1: learn bidirectional optical flows between neighboring frames
$\overleftarrow{O}_t=\begin{cases} spy(V_1, V_1) &\text{if } t=1 \\ spy(V_{t-1}, V_t) &\text{if } t\in(1, T] \end{cases},\quad \overrightarrow{O}_t=\begin{cases} spy(V_{t+1}, V_t) &\text{if } t\in[1, T) \\ spy(V_T, V_T) &\text{if } t=T \end{cases}$

where, $\overleftarrow{O}, \overrightarrow{O}\in\mathbb{R}^{T\times2\times H\times W}$ are the backward and forward optical flows; $spy(\cdot, \cdot)$ is a flow estimator such as SPyNet, which is pre-trained and further updated during training
step 2: obtain bidirectional features along the backward and forward propagation
$\overleftarrow{X}=warp(X, \overleftarrow{O}),\quad \overrightarrow{X}=warp(X, \overrightarrow{O})$

where, $\overleftarrow{X}, \overrightarrow{X}\in\mathbb{R}^{T\times C\times H\times W}$ are the backward and forward features
step 3 aggregate the frames and warped features, and feed them into 2-layer CNNs for backward and forward propagation
$f_2(X)=LN(f_1(X)+fusion(\overleftarrow{W_1}ReLU(\overleftarrow{W_2}[V, \overleftarrow{X}])+\overrightarrow{W_1}ReLU(\overrightarrow{W_2}[V, \overrightarrow{X}])))$

where, $[\cdot, \cdot]$ is an aggregation operator, and $\overleftarrow{W_1}, \overleftarrow{W_2}, \overrightarrow{W_1}, \overrightarrow{W_2}$ are the weights of the backward and forward networks
extending the 2-layer networks to multi-layer networks gives
$f_2(X)=LN(f_1(X)+fusion(R_1(V, \overleftarrow{X})+R_2(V, \overrightarrow{X})))$

where, $R_1, R_2$ are flexible networks
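a rough PyTorch sketch of the BOFF data flow, assuming a toy convolutional stand-in for the pre-trained SPyNet (`flow_net`), 2-layer propagation networks, and a $1\times1$ fusion conv; the names and the warping helper are illustrative assumptions, not the paper's implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (N, C, H, W) with optical flow (N, 2, H, W) via bilinear sampling."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)      # base coordinates, (x, y) order
    coords = grid.unsqueeze(0) + flow                                 # displaced sampling positions
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0               # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((coords_x, coords_y), dim=-1), align_corners=True)

class BOFF(nn.Module):
    """Sketch of the bidirectional optical flow-based feed-forward layer.
    `flow_net` stands in for the pre-trained SPyNet; here it is an untrained placeholder."""

    def __init__(self, c: int):
        super().__init__()
        self.flow_net = nn.Conv2d(6, 2, 3, 1, 1)                     # placeholder flow estimator
        self.backward_net = nn.Sequential(nn.Conv2d(c + 3, c, 3, 1, 1), nn.ReLU(), nn.Conv2d(c, c, 3, 1, 1))
        self.forward_net = nn.Sequential(nn.Conv2d(c + 3, c, 3, 1, 1), nn.ReLU(), nn.Conv2d(c, c, 3, 1, 1))
        self.fusion = nn.Conv2d(2 * c, c, 1)

    def forward(self, x: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # x: STCSA output (T, C, H, W); frames: LR frames V (T, 3, H, W)
        prev = torch.cat([frames[:1], frames[:-1]], dim=0)            # V_{t-1} (V_1 for t = 1)
        nxt = torch.cat([frames[1:], frames[-1:]], dim=0)             # V_{t+1} (V_T for t = T)
        flow_b = self.flow_net(torch.cat([prev, frames], dim=1))      # backward flows
        flow_f = self.flow_net(torch.cat([nxt, frames], dim=1))       # forward flows
        x_b, x_f = flow_warp(x, flow_b), flow_warp(x, flow_f)         # warped features
        prop = self.fusion(torch.cat([self.backward_net(torch.cat([frames, x_b], dim=1)),
                                      self.forward_net(torch.cat([frames, x_f], dim=1))], dim=1))
        return F.layer_norm(x + prop, x.shape[1:])                    # skip connection + LN

x, v = torch.randn(5, 64, 32, 32), torch.randn(5, 3, 32, 32)
print(BOFF(64)(x, v).shape)                                           # torch.Size([5, 64, 32, 32])
```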

Experiment

dataset

| dataset | resolution | training set | testing set |
| --- | --- | --- | --- |
| REDS | $1280\times720$ | 266 clips | REDS4 (4 clips) |
| Vimeo-90K | $448\times256$ | 64,612 clips | Vimeo-90K-T (7,824 clips) |
| Vid4 | $720\times480$ | – | 4 clips, each with 34 frames |
experiment details
  • degradation: bicubic down-sampling (BI)
  • input: $64\times64$ patches, 5 or 7 frames
  • data augmentation: random horizontal flipping, random $90^{\circ}$ rotation
  • frames normalized to $448\times256$ size
  • optimizer: Adam with $\beta_1=0.9, \beta_2=0.99$, batch size 2 per GPU, 600K iterations
  • learning rate: initial 2e-4, cosine decay to 1e-7 (see the sketch below)
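the optimization settings above can be expressed in PyTorch roughly as follows; the model is a hypothetical placeholder and a plain cosine schedule is assumed (the released code may use a restart variant)

```python
import torch

# hypothetical stand-ins for the model and data; only the optimizer and schedule follow the notes above
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
# cosine decay from 2e-4 down to 1e-7 over 600K iterations
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600_000, eta_min=1e-7)

for step in range(600_000):
    # forward pass, Charbonnier loss, and backward pass omitted in this sketch
    optimizer.step()
    scheduler.step()
```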
result on REDS

2106_vsrt_t1
Quantitative comparison (PSNR/SSIM) on REDS4 for $4\times$ VSR. The results are tested on RGB channels. Red and blue indicate the best and the second best performance, respectively. “$\dag$” means a method trained on 5 frames for a fair comparison.

2106_vsrt_f4
2106_vsrt_f8
Qualitative comparison on the REDS4 dataset for $4\times$ VSR. Zoom in for the best view.

key findings

  • the highest PSNR and comparable SSIM
  • when trained with 5 frames, BasicVSR and IconVSR perform worse than EDVR
       $\implies$ BasicVSR and IconVSR rely heavily on aggregating long-term sequence information
  • the 64-channel VSRT achieves better performance than the 128-channel EDVR-L
  • VSRT is able to recover finer details and sharper edges
result on Vimeo-90K

2106_vsrt_t2
Quantitative comparison (PSNR/SSIM) on Vimeo-90K-T for $4\times$ VSR. Red and blue indicate the best and the second best performance, respectively.

2106_vsrt_f5
2106_vsrt_f9
Qualitative comparison on Vimeo-90K-T for $4\times$ VSR. Zoom in for the best view.

key findings

  • the highest PSNR and SSIM
  • the generalization ability of VSRT on Vid4 is better than EDVR but worse than BasicVSR and IconVSR
       $\impliedby$ BasicVSR and IconVSR are tested on all frames, while VSRT and EDVR are tested on 7 frames
       $\impliedby$ a distribution bias between Vimeo-90K-T and Vid4
  • VSRT is able to generate sharp and realistic HR frames
result on Vid4

2106_vsrt_t3
Quantitative comparison (PSNR/SSIM) on Vid4 for $4\times$ VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels.

2106_vsrt_t7
Quantitative comparison (PSNR/SSIM) on Vid4 for $4\times$ VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels. “$\dag$” means a method trained and tested on 7 frames for a fair comparison.

2106_vsrt_f10
Qualitative comparison on Vid4 for $4\times$ VSR. Zoom in for the best view.

ablation study
optical flow

w/o optical flow: replace SPyNet in BOFF layer with a stack of Residual ReLU networks

2106_vsrt_f6
Ablation study on REDS for $4\times$ VSR. Here, “w/o” and “w/ optical flow” mean the VSR-Transformer without and with the optical flow, respectively. Zoom in for the best view.

optical flow is important in the BOFF layer and helps feature propagation and alignment

STCSA layer & BOFF layer

w/o STCSA: remove STCSA layer
w/o BOFF: replace BOFF layer with a stack of Residual ReLU networks

2106_vsrt_f7
Ablation study on REDS for $4\times$ VSR. Here, “w/o STCSA” and “w/o BOFF” mean the VSR-Transformer without the spatial-temporal convolutional self-attention (STCSA) layer and the bidirectional optical flow-based feed-forward (BOFF) layer, respectively.

the STCSA layer exploits the locality of the data and fuses information among different frames
the BOFF layer helps to perform feature propagation and alignment

number of frames

w/ 3 frames: train the model with 3 frames
training with more frames helps to restore missing information from neighboring frames
