[2111] [CVPR 2022] Restormer: Efficient Transformer for High-Resolution Image Restoration

paper
code

Abstract

  • propose Restormer, an encoder-decoder Transformer for multi-scale local-global representation learning
    no partition into local windows ⟹ exploits distant image context
  • propose a multi-Dconv head transposed attention (MDTA) module
    aggregates local and non-local pixel interactions ⟹ processes HR images efficiently
  • propose a gated-Dconv feed-forward network (GDFN)
    performs controlled feature transformation

2111_restormer_f1
Our Restormer achieves the state-of-the-art performance on image restoration tasks while being computationally efficient.

Method

model architecture

2111_restormer_f2
Architecture of Restormer for high-resolution image restoration. Our Restormer consists of a multi-scale hierarchical design incorporating efficient Transformer blocks. The core modules of the Transformer block are: (a) multi-Dconv head transposed attention (MDTA) that performs (spatially enriched) query-key feature interaction across channels rather than the spatial dimension, and (b) gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation, i.e., allowing useful information to propagate further.

given a degraded image $I\in\mathbb{R}^{H\times W\times 3}$

feature extraction
obtain low-level feature embeddings $F_0\in\mathbb{R}^{H\times W\times C}$

transformer
pass $F_0$ through a 4-level encoder-decoder and transform it into deep features $F_d\in\mathbb{R}^{H\times W\times 2C}$
encoder reduces spatial size and expands channel capacity
decoder recovers HR representations
up-, down-sampler pixel-shuffle and pixel-unshuffle
encoder features are concatenated with decoder features via skip connections for finer structural and textural detail

refinement
enrich $F_d$ at high spatial resolution, yielding $F_r\in\mathbb{R}^{H\times W\times 2C}$

reconstruction
generate a residual image $R\in\mathbb{R}^{H\times W\times 3}$ and add it to obtain the restored image $\hat{I}=I+R$
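The pixel-unshuffle/shuffle down- and up-samplers mentioned above are lossless rearrangements between space and channels. A minimal numpy sketch (assuming channel-last `H x W x C` tensors and a scale factor `r=2`; the function names are illustrative, not from the official code):

```python
import numpy as np

def pixel_unshuffle(x, r=2):
    """Downsample H x W x C -> (H/r) x (W/r) x (C*r^2) by moving
    each r x r spatial block into the channel dimension (lossless)."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

def pixel_shuffle(x, r=2):
    """Inverse: (H/r) x (W/r) x (C*r^2) -> H x W x C."""
    h, w, c = x.shape
    x = x.reshape(h, w, r, r, c // (r * r))
    return x.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c // (r * r))

x = np.random.rand(8, 8, 4)
y = pixel_unshuffle(x)   # shape (4, 4, 16): spatial size halved, channels x4
```

Because the operation is invertible, no detail is discarded while resampling between encoder levels.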

multi-Dconv head transposed attention (MDTA)

problem complexity of MHSA is $\mathcal{O}(H^2W^2)$ for an $H\times W$ input image
solution apply SA across the channel dimension instead of the spatial dimension ⟹ linear complexity $\mathcal{O}(HW)$

given layer-normalized features $Y\in\mathbb{R}^{H\times W\times C}$
step 1 generate query, key, value by convs
$$\begin{aligned} Q&=W_d^QW_p^QY \\ K&=W_d^KW_p^KY \\ V&=W_d^VW_p^VY \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ conv that aggregates pixel-wise cross-channel context, and $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv that encodes channel-wise spatial context
step 2 reshape $Q, K, V$ into $\hat{Q}, \hat{K}, \hat{V}$
$$\begin{aligned} Q\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{Q}\in\mathbb{R}^{HW\times C} \\ K\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{K}\in\mathbb{R}^{C\times HW} \\ V\in\mathbb{R}^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{V}\in\mathbb{R}^{HW\times C} \end{aligned}$$

step 3 calculate a transposed attention map $A\in\mathbb{R}^{C\times C}$ instead of $\mathbb{R}^{HW\times HW}$
$$Attention(\hat{Q}, \hat{K}, \hat{V})=\hat{V}\cdot softmax\left(\frac{\hat{K}\cdot\hat{Q}}{\alpha}\right)$$

where $\alpha$ is a learnable scaling parameter that controls the magnitude of the dot product before the softmax
step 4 define the overall MDTA process as
$$\hat{X}=W_p\,Attention(\hat{Q}, \hat{K}, \hat{V})+X$$

similarly to conventional multi-head SA, divide the number of channels into "heads" and learn separate attention maps in parallel
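Steps 2-3 can be sketched in numpy for a single head. This is an illustrative sketch, not the official implementation: the 1×1 and 3×3 depth-wise convs of step 1 are assumed to have already produced `Q`, `K`, `V`, and `alpha` stands in for the learnable scale:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(Q, K, V, alpha=1.0):
    """Single-head channel ("transposed") attention over H x W x C maps."""
    H, W, C = Q.shape
    Qh = Q.reshape(H * W, C)       # HW x C
    Kh = K.reshape(H * W, C).T     # C x HW
    Vh = V.reshape(H * W, C)       # HW x C
    # C x C attention map: cost O(HW * C^2), linear in pixel count,
    # instead of the O(H^2 W^2) of an HW x HW spatial attention map
    A = softmax(Kh @ Qh / alpha)
    return (Vh @ A).reshape(H, W, C)

H, W, C = 6, 6, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((H, W, C)) for _ in range(3))
out = transposed_attention(Q, K, V, alpha=np.sqrt(C))
```

The key point is that the attention map is $C\times C$, so memory and compute grow linearly with the number of pixels.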

gated-Dconv feed-forward network (GDFN)

FFN operates on each pixel location separately and identically
2 modifications: gating mechanism, depth-wise conv

advantages of GDFN

  • control information flow
  • allow each level to focus on details complementary to the other levels

given input features $X\in\mathbb{R}^{H\times W\times C}$
step 1 encode pixel- and channel-wise information
$$\begin{aligned} X_1&=W_d^1W_p^1LN(X) \\ X_2&=W_d^2W_p^2LN(X) \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ conv, $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv, and $LN$ is layer normalization
step 2 gating mechanism element-wise product of 2 parallel linear branches, one of which is activated with GELU
$$gate(X)=GELU(X_1)\odot X_2$$

step 3 define the overall GDFN as
$$\hat{X}=W_p^0\,gate(X)+X$$

where $\odot$ denotes element-wise multiplication
GDFN performs more operations than a conventional FFN ⟹ reduce the expansion ratio $\gamma$ to keep parameters and FLOPs comparable
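The gating in steps 1-3 can be sketched in numpy. A minimal sketch, assuming the layer norm and depth-wise convs are folded away so that the $1\times1$ convs become per-pixel matrix multiplies; `W1`, `W2`, `W0` and `gdfn` are illustrative names, not from the official code:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gdfn(x, W1, W2, W0):
    """Gated FFN sketch: two parallel expanding branches, one GELU-gated."""
    x1 = x @ W1            # branch to be passed through GELU
    x2 = x @ W2            # linear gate values
    gated = gelu(x1) * x2  # element-wise gating controls information flow
    return gated @ W0 + x  # project back to C channels, add residual

H, W, C, gamma = 4, 4, 8, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C, gamma * C))
W2 = rng.standard_normal((C, gamma * C))
W0 = rng.standard_normal((gamma * C, C))
y = gdfn(x, W1, W2, W0)    # same shape as x
```

The element-wise product lets one branch suppress or pass features computed by the other, which is the "controlled feature transformation" the text refers to.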

progressive learning

training a Transformer model on small cropped patches cannot encode global image statistics ⟹ sub-optimal performance on full-resolution images at test time
solution progressive learning: train the network on smaller patches in early epochs and on gradually larger patches in later epochs
reduce batch size as patch size increases ⟸ larger patches take longer per step, so shrinking the batch keeps training time bounded
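Such a schedule can be expressed as a simple lookup. The milestone numbers below are illustrative assumptions chosen so that patch size grows while the per-step pixel budget stays roughly constant; treat them as placeholders, not the paper's exact training configuration:

```python
# (iterations, patch_size, batch_size): patches grow, batches shrink
schedule = [
    (92_000, 128, 64),
    (64_000, 160, 40),
    (48_000, 192, 32),
    (36_000, 256, 16),
    (36_000, 320, 8),
    (24_000, 384, 8),
]

def phase_at(it):
    """Return (patch_size, batch_size) for global iteration `it`."""
    done = 0
    for iters, patch, batch in schedule:
        done += iters
        if it < done:
            return patch, batch
    return schedule[-1][1:]  # stay at the final phase afterwards
```

At each training step the data loader would crop patches of `patch_size` and assemble `batch_size` of them, switching phases as the iteration counter crosses each milestone.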

Experiment

deraining

2111_restormer_t1
Image deraining results. When averaged across all five datasets, our Restormer advances state-of-the-art by 1.05 dB.

2111_restormer_f3
Image deraining example. Our Restormer generates rain-free image with structural fidelity and without artifacts.

deblurring

2111_restormer_t2
Single-image motion deblurring results. Our Restormer is trained only on the GoPro dataset and directly applied to the HIDE and RealBlur benchmark datasets.

2111_restormer_f4
Single-image motion deblurring on GoPro. Restormer generates sharper and visually-faithful result.

2111_restormer_t3
Defocus deblurring comparisons on the DPDD testset (containing 37 indoor and 39 outdoor scenes). S: single-image defocus deblurring. D: dual-pixel defocus deblurring. Restormer sets new state-of-the-art for both single-image and dual pixel defocus deblurring.

2111_restormer_f5
Dual-pixel defocus deblurring comparison on the DPDD dataset. Compared to the other approaches, our Restormer more effectively removes blur while preserving the fine image details.

denoising

2111_restormer_t4
Gaussian grayscale image denoising comparisons for two categories of methods. Top super row: learning a single model to handle various noise levels. Bottom super row: training a separate model for each noise level.

2111_restormer_t5
Gaussian color image denoising. Our Restormer demonstrates favorable performance among both categories of methods. On the Urban100 dataset for noise level 50, Restormer yields a 0.41 dB gain over the CNN-based DRUNet, and 0.2 dB over the Transformer model SwinIR.

2111_restormer_t6
Real image denoising on SIDD and DND datasets. "∗" denotes methods using additional training data. Our Restormer is trained only on the SIDD images and directly tested on DND. Among competing approaches, only Restormer surpasses 40 dB PSNR.

2111_restormer_f6
Visual results on image denoising. Top row: Gaussian grayscale denoising. Middle row: Gaussian color denoising. Bottom row: real image denoising. The image reproduction quality of our Restormer is more faithful to the ground-truth than other methods.

ablation study

  • improvement in multi-head attention
  • improvement in feed-forward network
  • designs for decoder at level-1
  • progressive learning
  • deeper or wider Restormer