[2108] [ICCV 2021] SwinIR: Image Restoration Using Swin Transformer

koukouvagia

Last modified: 2022-05-11 19:12:32

Tags: computer vision, deep learning

First published: 2022-04-13 16:25:52

Article link: https://blog.csdn.net/weixin_43355838/article/details/123742106

paper
supp
code

Content

Abstract

processes images with a local (window-based) attention mechanism
captures long-range dependencies with shifted-window MSA
better performance than SOTA methods with fewer parameters

PSNR results vs the total number of parameters of different methods for image SR ( $\times4$ ) on Set5

Method

model architecture

The architecture of the proposed SwinIR for image restoration.

shallow feature extraction
given LQ input $I_{LQ}\in\Reals^{H\times W\times C_{in}}$ , extract shallow features $F_0\in\Reals^{H\times W\times C}$
$F_0=H_{SF}(I_{LQ})$

where $C$ is the number of feature channels and $H_{SF}(\cdot)$ is a $3\times3$ conv layer

deep feature extraction
extract deep features $F_D\in\Reals^{H\times W\times C}$ from $F_0$
$F_D=H_{DF}(F_0)$

where $H_{DF}$ consists of $K$ RSTBs followed by a conv layer
specifically, intermediate features $F_1, F_2, ..., F_K$ and the output deep feature $F_D$ are extracted block by block as
$\begin{aligned} F_i&=H_{RSTB_i}(F_{i-1}), i=1, 2, ..., K \\ F_D&=H_{conv}(F_K) \end{aligned}$

where $H_{RSTB_i}(\cdot)$ is the $i$ -th RSTB and $H_{conv}$ is a $3\times3$ conv layer

reconstruction
aggregate shallow and deep features to reconstruct HQ image $I_{RHQ}$
$I_{RHQ}=H_{REC}(F_0+F_D)$

where, $H_{REC}(\cdot)$ is a reconstruction module

for super-resolution, a sub-pixel conv for up-sampling
for artifact reduction and denoising, a single conv
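
A sub-pixel conv is usually a convolution that outputs $C\cdot r^2$ channels, followed by a rearrangement of those channels into an $r$ times larger spatial grid. The sketch below shows only the rearrangement step in NumPy, using the $H\times W\times C$ layout of these notes. The function name `pixel_shuffle` and the channel ordering are assumptions for illustration, not taken from the SwinIR code.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange features of shape (H, W, C*r*r) into (H*r, W*r, C).

    Assumed channel ordering: input channel c*r*r + i*r + j goes to
    output channel c at sub-pixel offset (i, j).
    """
    H, W, Crr = x.shape
    assert Crr % (r * r) == 0, "channel count must be divisible by r*r"
    C = Crr // (r * r)
    x = x.reshape(H, W, C, r, r)      # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (H, r, W, r, C)
    return x.reshape(H * r, W * r, C)

# x4 up-sampling of a 2x2 feature map with 16 channels gives one 8x8 channel
features = np.random.rand(2, 2, 16)
print(pixel_shuffle(features, 4).shape)
```

The convolution that produces the $C\cdot r^2$ channels is left out, so this is only the up-sampling half of the reconstruction module.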

loss function
for super-resolution, use $L_1$ pixel loss
$\mathcal{L}=\Vert I_{RHQ}-I_{HQ}\Vert_1$

where, $I_{RHQ}$ is obtained by network from $I_{LQ}$ , $I_{HQ}$ is ground-truth HQ image

for artifact reduction and denoising, use Charbonnier loss
$\mathcal{L}=\sqrt{\Vert I_{RHQ}-I_{HQ}\Vert^2+{\epsilon}^2}$

where $I_{RHQ}$ is obtained by the network from $I_{LQ}$ , $I_{HQ}$ is the ground-truth HQ image, and $\epsilon$ is a constant set to $10^{-3}$
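
The two losses can be sketched in NumPy as follows. This is a minimal illustration, not the training code: it assumes a mean reduction over pixels, and it uses the per-pixel Charbonnier form $\sqrt{d^2+\epsilon^2}$ that many implementations adopt (note the plus sign, which keeps the square root well defined when the error $d$ is zero). The function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error (L1 pixel loss, mean reduction assumed)."""
    return np.abs(pred - target).mean()

def charbonnier_loss(pred, target, eps=1e-3):
    """Per-pixel Charbonnier loss sqrt(d^2 + eps^2), averaged over pixels.

    Behaves like L1 for large errors and stays smooth near zero error.
    """
    diff = pred - target
    return np.sqrt(diff ** 2 + eps ** 2).mean()

pred = np.array([[0.2, 0.5], [0.9, 0.1]])
target = np.array([[0.0, 0.5], [1.0, 0.4]])
print(l1_loss(pred, target), charbonnier_loss(pred, target))
```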

residual Swin transformer block (RSTB)

residual Swin transformer block (RSTB): $L$ Swin transformer layers (STL) followed by a convolutional layer, wrapped in a residual connection

given input features $F_{i, 0}$ of $i$ -th RSTB
extract intermediate features $F_{i, 1}, F_{i, 2}, ..., F_{i, L}$ by $L$ STL
$F_{i, j}=H_{STL_{i, j}}(F_{i, j-1}), j=1, 2, ..., L$

where, $H_{STL_{i, j}}(\cdot)$ is $j$ -th STL in $i$ -th RSTB

add a conv layer before residual connection
$F_{i, out}=H_{conv_i}(F_{i, L})+F_{i, 0}$

where, $H_{conv_i}(\cdot)$ is a conv layer in $i$ -th RSTB

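2 benefits of the design above

a convolution with spatially invariant filters enhances the translational equivariance of SwinIR
note that a transformer can be viewed as a specific instantiation of spatially varying convolution
the residual connection provides an identity-based path from each block to the reconstruction module, aggregating different levels of features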
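
The data flow of one RSTB, $F_{i, out}=H_{conv_i}(F_{i, L})+F_{i, 0}$ , can be sketched as below. The layers are passed in as plain callables, so `stl_layers` and `conv` are stand-ins (assumptions for illustration), not real Swin layers or convolutions.

```python
import numpy as np

def rstb(f_in, stl_layers, conv):
    """One residual Swin transformer block.

    f_in       : input features F_{i,0}
    stl_layers : list of L callables, standing in for the STLs
    conv       : callable standing in for the conv layer H_{conv_i}
    """
    f = f_in
    for stl in stl_layers:      # F_{i,j} = H_{STL_{i,j}}(F_{i,j-1})
        f = stl(f)
    return conv(f) + f_in       # conv first, then the residual connection

# toy stand-ins: each "STL" scales features, the "conv" adds a constant
features = np.ones((4, 4, 2))
out = rstb(features, [lambda f: 2.0 * f] * 3, lambda f: f + 1.0)
print(out[0, 0])
```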

Swin transformer layer (STL)

given an input $F\in\Reals^{H\times W\times C}$
reshape the input to $\Reals^{\frac{HW}{M^2}\times M^2\times C}$ by partitioning it into non-overlapping $M\times M$ windows
where $\frac{HW}{M^2}$ is the number of windows
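
The window partition is a pure reshape, which a small NumPy sketch makes concrete. It assumes $H$ and $W$ are multiples of $M$ (padding is not handled). The helper names `window_partition` and `window_reverse` are chosen here for illustration.

```python
import numpy as np

def window_partition(x, M):
    """Split (H, W, C) features into (H*W/M^2, M^2, C) non-overlapping windows."""
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "H and W must be multiples of M"
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: merge windows back into (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/M, M, W/M, M, C)
    return x.reshape(H, W, C)

x = np.arange(16).reshape(4, 4, 1)
w = window_partition(x, 2)
print(w.shape)          # 4 windows of 4 pixels each
print(w[0, :, 0])       # pixels of the top-left 2x2 window
```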

compute standard self-attention separately for each window
produce query, key, value matrices $Q, K, V$ , for a local window feature $X\in\Reals^{M^2\times C}$
$Q=XP_Q, K=XP_K, V=XP_V$

where, $P_Q, P_K, P_V$ are projection matrices shared across windows
compute attention matrix by self-attention in a local window
$\mathrm{Attention}(Q, K, V)=\mathrm{SoftMax}(\frac{QK^T}{\sqrt{d}}+B)V$

where $B$ is the learnable relative positional encoding and $d$ is the query/key dimension
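
A single-head version of the per-window attention can be sketched in NumPy. The same projection matrices are applied to every window through broadcasting. In this sketch the bias `B` is passed in directly as an $M^2\times M^2$ array, which is a simplification: how it is built from learned relative-position parameters is not covered here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, P_Q, P_K, P_V, B):
    """Self-attention inside each window (single head).

    X             : (num_windows, M^2, C) window features
    P_Q, P_K, P_V : (C, d) projections shared across windows
    B             : (M^2, M^2) positional bias added to the scores
    """
    Q, K, V = X @ P_Q, X @ P_K, X @ P_V
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d) + B
    return softmax(scores) @ V                # (num_windows, M^2, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16, 8))               # 4 windows, M = 4, C = 8
P_Q, P_K, P_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(window_attention(X, P_Q, P_K, P_V, np.zeros((16, 16))).shape)
```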

the $\mathrm{MLP}$ consists of 2 FC layers with a GELU non-linearity between them
an $\mathrm{LN}$ (LayerNorm) layer is added before both $\mathrm{MSA}$ and $\mathrm{MLP}$
a residual connection is employed for both modules

to sum up, the whole STL is formulated as
$\begin{aligned} X&=\mathrm{MSA}(\mathrm{LN}(X))+X \\ X&=\mathrm{MLP}(\mathrm{LN}(X))+X \end{aligned}$
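
The pre-norm residual structure of the two equations can be sketched as below. Some simplifications are assumed: `layer_norm` has no learnable scale or shift, `gelu` uses the tanh approximation, and the attention module is passed in as a callable `msa`. All names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (no learnable affine parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, W1, b1, W2, b2):
    """Two FC layers with GELU in between."""
    return gelu(x @ W1 + b1) @ W2 + b2

def stl(X, msa, mlp_fn):
    """X = MSA(LN(X)) + X, then X = MLP(LN(X)) + X."""
    X = msa(layer_norm(X)) + X
    X = mlp_fn(layer_norm(X)) + X
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))                      # M^2 = 16 tokens, C = 8
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
out = stl(X, lambda z: z, lambda z: mlp(z, W1, b1, W2, b2))
print(out.shape)
```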

shifted window partitioning is used alternately with regular window partitioning to enable cross-window connections
the feature is shifted by $(\lfloor\frac{M}2\rfloor, \lfloor\frac{M}2\rfloor)$ pixels before window partitioning
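
One common way to realize the shift is a cyclic roll of the feature map, undone after attention. The sketch below assumes that approach. The roll direction and the wrap-around at the borders follow common Swin Transformer implementations rather than anything stated in these notes, and the attention masking those implementations apply to wrapped-around pixels is omitted.

```python
import numpy as np

def cyclic_shift(x, M, reverse=False):
    """Cyclically shift (H, W, C) features by floor(M/2) along H and W.

    reverse=True undoes the shift after window attention.
    """
    s = M // 2
    shift = (s, s) if reverse else (-s, -s)
    return np.roll(x, shift=shift, axis=(0, 1))

x = np.arange(16).reshape(4, 4, 1)
shifted = cyclic_shift(x, M=2)
print(shifted[..., 0])
restored = cyclic_shift(shifted, M=2, reverse=True)
print(np.array_equal(restored, x))
```

After the shift, regular window partitioning groups pixels that previously sat in different windows, which is what lets information cross window boundaries.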

Experiment

datasets DIV2K and Flickr2K

super-resolution

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR ( $\times8$ ) on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of bicubic image SR ( $\times4$ ) methods. Best viewed by zooming.

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for lightweight image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of real-world image SR ( $\times4$ ) methods on real-world images.

artifact reduction

Quantitative comparison (average PSNR/SSIM/PSNR-B) with state-of-the-art methods for JPEG compression artifact reduction on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

denoising

Quantitative comparison (average PSNR) with state-of-the-art methods for grayscale image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of grayscale image denoising (noise level 50) methods on image “Monarch” from Set12.

Quantitative comparison (average PSNR) with state-of-the-art methods for color image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of color image denoising (noise level 50) methods on image “163085” from CBSD68.

ablation studies

Ablation study on RSTB design.

Ablation study on different settings of SwinIR. Results are tested on Manga109 for image SR ( $\times2$ ).

key findings

from (e), training data scale
- unlike IPT, which relies heavily on large training datasets, SwinIR achieves better results than RCAN using the same training data, even when the dataset is small
from (f), model convergence
- SwinIR converges faster and to a better result than RCAN, contradicting previous observations that transformers often suffer from slow convergence
