[2111] [CVPR 2022] Restormer: Efficient Transformer for High-Resolution Image Restoration

paper
code

Abstract

  • propose Restormer, an encoder-decoder Transformer for multi-scale local-global representation learning
    no partition into local windows ⟹ exploits distant image context
  • propose a multi-Dconv head transposed attention (MDTA) module
    aggregates local and non-local pixel interactions ⟹ processes HR images efficiently
  • propose a gated-Dconv feed-forward network (GDFN)
    performs controlled feature transformation

2111_restormer_f1
Our Restormer achieves the state-of-the-art performance on image restoration tasks while being computationally efficient.

Method

model architecture

2111_restormer_f2
Architecture of Restormer for high-resolution image restoration. Our Restormer consists of a multi-scale hierarchical design incorporating efficient Transformer blocks. The core modules of the Transformer block are: (a) multi-Dconv head transposed attention (MDTA) that performs (spatially enriched) query-key feature interaction across channels rather than the spatial dimension, and (b) gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation, i.e., allowing useful information to propagate further.

given a degraded image $I\in\mathbb{R}^{H\times W\times 3}$

feature extraction
obtain low-level feature embeddings $F_0\in\mathbb{R}^{H\times W\times C}$

transformer
pass $F_0$ through a 4-level encoder-decoder and transform it into deep features $F_d\in\mathbb{R}^{H\times W\times 2C}$
encoder reduces spatial size and expands channel capacity
decoder recovers HR representations
up-, down-sampler pixel-shuffle and pixel-unshuffle
encoder features are concatenated with decoder features via skip connections for finer structural and textural detail

refinement
enrich $F_d$ at high spatial resolution, yielding $F_r\in\mathbb{R}^{H\times W\times 2C}$

reconstruction
generate a residual image $R\in\mathbb{R}^{H\times W\times 3}$ and add it to obtain the restored image $\hat{I}=I+R$
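The pixel-unshuffle/shuffle down- and up-samplers mentioned above are lossless rearrangements between space and channels. A minimal numpy sketch (assuming channel-last `H x W x C` tensors and a scale factor `r=2`; the function names are illustrative, not from the official code):

```python
import numpy as np

def pixel_unshuffle(x, r=2):
    """Downsample H x W x C -> (H/r) x (W/r) x (C*r^2) by moving
    each r x r spatial block into the channel dimension (lossless)."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

def pixel_shuffle(x, r=2):
    """Inverse: (H/r) x (W/r) x (C*r^2) -> H x W x C."""
    h, w, c = x.shape
    x = x.reshape(h, w, r, r, c // (r * r))
    return x.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c // (r * r))

x = np.random.rand(8, 8, 4)
y = pixel_unshuffle(x)   # shape (4, 4, 16): spatial size halved, channels x4
```

Because the operation is invertible, no detail is discarded while resampling between encoder levels.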

multi-Dconv head transposed attention (MDTA)

problem complexity of MHSA is $\mathcal{O}(H^2W^2)$ for an $H\times W$ input image
solution apply SA across the channel dimension instead of the spatial dimension ⟹ linear complexity $\mathcal{O}(HW)$

given layer-normalized features $Y\in\mathbb{R}^{H\times W\times C}$
step 1 generate query, key, value by convs
$$\begin{aligned} Q&=W_d^QW_p^QY \\ K&=W_d^KW_p^KY \\ V&=W_d^VW_p^VY \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ conv that aggregates pixel-wise cross-channel context, and $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv that encodes channel-wise spatial context
step 2 reshape $Q, K, V$ into $\hat{Q}, \hat{K}, \hat{V}$
$$\begin{aligned} Q\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{Q}\in\mathbb{R}^{HW\times C} \\ K\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{K}\in\mathbb{R}^{C\times HW} \\ V\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{V}\in\mathbb{R}^{HW\times C} \end{aligned}$$

step 3 calculate a transposed attention map $A\in\mathbb{R}^{C\times C}$ instead of $\mathbb{R}^{HW\times HW}$
$$Attention(\hat{Q}, \hat{K}, \hat{V})=\hat{V}\cdot softmax\left(\frac{\hat{K}\cdot\hat{Q}}{\alpha}\right)$$

where $\alpha$ is a learnable scaling parameter that controls the magnitude of the dot product before the softmax
step 4 define the overall MDTA process as
$$\hat{X}=W_p\,Attention(\hat{Q}, \hat{K}, \hat{V})+X$$

similarly to conventional multi-head SA, divide the number of channels into "heads" and learn separate attention maps in parallel
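Steps 2-3 can be sketched in numpy for a single head. This is an illustrative sketch, not the official implementation: the 1×1 and 3×3 depth-wise convs of step 1 are assumed to have already produced `Q`, `K`, `V`, and `alpha` stands in for the learnable scale:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(Q, K, V, alpha=1.0):
    """Single-head channel ("transposed") attention over H x W x C maps."""
    H, W, C = Q.shape
    Qh = Q.reshape(H * W, C)       # HW x C
    Kh = K.reshape(H * W, C).T     # C x HW
    Vh = V.reshape(H * W, C)       # HW x C
    # C x C attention map: cost O(HW * C^2), linear in pixel count,
    # instead of the O(H^2 W^2) of an HW x HW spatial attention map
    A = softmax(Kh @ Qh / alpha)
    return (Vh @ A).reshape(H, W, C)

H, W, C = 6, 6, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((H, W, C)) for _ in range(3))
out = transposed_attention(Q, K, V, alpha=np.sqrt(C))
```

The key point is that the attention map is $C\times C$, so memory and compute grow linearly with the number of pixels.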

gated-Dconv feed-forward network (GDFN)

FFN operates on each pixel location separately and identically
2 modifications: gating mechanism, depth-wise conv

advantages of GDFN

  • control information flow
  • allow each level to focus on details complementary to the other levels

given input features $X\in\mathbb{R}^{H\times W\times C}$
step 1 encode pixel- and channel-wise information
$$\begin{aligned} X_1&=W_d^1W_p^1LN(X) \\ X_2&=W_d^2W_p^2LN(X) \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ conv, $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv, and $LN$ is layer normalization
step 2 gating mechanism element-wise product of 2 parallel linear branches, one of which is activated with GELU
$$gate(X)=GELU(X_1)\odot X_2$$

step 3 define the overall GDFN as
$$\hat{X}=W_p^0\,gate(X)+X$$

where $\odot$ denotes element-wise multiplication
GDFN performs more operations than a conventional FFN ⟹ reduce the expansion ratio $\gamma$ to keep parameters and FLOPs comparable
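The gating in steps 1-3 can be sketched in numpy. A minimal sketch, assuming the layer norm and depth-wise convs are folded away so that the $1\times1$ convs become per-pixel matrix multiplies; `W1`, `W2`, `W0` and `gdfn` are illustrative names, not from the official code:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gdfn(x, W1, W2, W0):
    """Gated FFN sketch: two parallel expanding branches, one GELU-gated."""
    x1 = x @ W1            # branch to be passed through GELU
    x2 = x @ W2            # linear gate values
    gated = gelu(x1) * x2  # element-wise gating controls information flow
    return gated @ W0 + x  # project back to C channels, add residual

H, W, C, gamma = 4, 4, 8, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C, gamma * C))
W2 = rng.standard_normal((C, gamma * C))
W0 = rng.standard_normal((gamma * C, C))
y = gdfn(x, W1, W2, W0)    # same shape as x
```

The element-wise product lets one branch suppress or pass features computed by the other, which is the "controlled feature transformation" the text refers to.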

progressive learning

training a Transformer model on small cropped patches cannot encode global image statistics ⟹ sub-optimal performance on full-resolution images at test time
solution progressive learning: train the network on smaller patches in early epochs and on gradually larger patches in later epochs
reduce batch size as patch size increases ⟸ larger patches take longer per step, so shrinking the batch keeps training time bounded
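Such a schedule can be expressed as a simple lookup. The milestone numbers below are illustrative assumptions chosen so that patch size grows while the per-step pixel budget stays roughly constant; treat them as placeholders, not the paper's exact training configuration:

```python
# (iterations, patch_size, batch_size): patches grow, batches shrink
schedule = [
    (92_000, 128, 64),
    (64_000, 160, 40),
    (48_000, 192, 32),
    (36_000, 256, 16),
    (36_000, 320, 8),
    (24_000, 384, 8),
]

def phase_at(it):
    """Return (patch_size, batch_size) for global iteration `it`."""
    done = 0
    for iters, patch, batch in schedule:
        done += iters
        if it < done:
            return patch, batch
    return schedule[-1][1:]  # stay at the final phase afterwards
```

At each training step the data loader would crop patches of `patch_size` and assemble `batch_size` of them, switching phases as the iteration counter crosses each milestone.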

Experiment

deraining

2111_restormer_t1
Image deraining results. When averaged across all five datasets, our Restormer advances state-of-the-art by 1.05 dB.

2111_restormer_f3
Image deraining example. Our Restormer generates rain-free image with structural fidelity and without artifacts.

deblurring

2111_restormer_t2
Single-image motion deblurring results. Our Restormer is trained only on the GoPro dataset and directly applied to the HIDE and RealBlur benchmark datasets.

2111_restormer_f4
Single-image motion deblurring on GoPro. Restormer generates sharper and visually-faithful result.

2111_restormer_t3
Defocus deblurring comparisons on the DPDD testset (containing 37 indoor and 39 outdoor scenes). S: single-image defocus deblurring. D: dual-pixel defocus deblurring. Restormer sets new state-of-the-art for both single-image and dual pixel defocus deblurring.

2111_restormer_f5
Dual-pixel defocus deblurring comparison on the DPDD dataset. Compared to the other approaches, our Restormer more effectively removes blur while preserving the fine image details.

denoising

2111_restormer_t4
Gaussian grayscale image denoising comparisons for two categories of methods. Top super row: learning a single model to handle various noise levels. Bottom super row: training a separate model for each noise level.

2111_restormer_t5
Gaussian color image denoising. Our Restormer demonstrates favorable performance among both categories of methods. On the Urban100 dataset for noise level 50, Restormer yields a 0.41 dB gain over the CNN-based DRUNet, and 0.2 dB over the Transformer model SwinIR.

2111_restormer_t6
Real image denoising on SIDD and DND datasets. "∗" denotes methods using additional training data. Our Restormer is trained only on the SIDD images and directly tested on DND. Among competing approaches, only Restormer surpasses 40 dB PSNR.

2111_restormer_f6
Visual results on image denoising. Top row: Gaussian grayscale denoising. Middle row: Gaussian color denoising. Bottom row: real image denoising. The image reproduction quality of our Restormer is more faithful to the ground-truth than other methods.

ablation study

  • improvement in multi-head attention
  • improvement in feed-forward network
  • designs for decoder at level-1
  • progressive learning
  • deeper or wider Restormer