Content-Aware Transformer for All-in-one Image Restoration

Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel deformable sliding window self-attention that adaptively adjusts receptive fields based on image content, enabling the attention mechanism to focus on important regions and enhance feature extraction aligned with salient features. Additionally, we introduce a central ensemble pattern to reduce the inclusion of irrelevant content within attention windows. In this way, the proposed DSwinIR model integrates the deformable sliding window Transformer and the central ensemble pattern to amplify the strengths of both CNNs and Transformers while mitigating their limitations. Extensive experiments on various image restoration tasks demonstrate that DSwinIR achieves state-of-the-art performance. For example, in image deraining, DSwinIR achieves a 0.66 dB PSNR improvement over DRSformer on the SPA dataset. In all-in-one image restoration, DSwinIR outperforms PromptIR by over 0.66 dB and 1.04 dB on the three-task and five-task settings, respectively.

Figure 1. Comparative analysis of feature extraction mechanisms with an anchor token (marked by ⋆) as the reference point. (a) Vanilla convolution applies a fixed sampling pattern, leveraging neighborhood features. (b) Deformable convolution introduces adaptive sampling locations based on content, enabling more effective feature integration from relevant regions. (c) Window attention suffers from boundary constraints: anchor tokens near window edges (especially corners) have limited receptive fields, resulting in suboptimal feature extraction. (d) Our proposed Deformable Sliding Window (DSwin) attention extends window attention with a token-centric paradigm and a content-adaptive receptive field, ensuring robust feature aggregation for anchor tokens.

1. Introduction

Image restoration, a fundamental challenge in computer vision, aims to recover high-quality images from degraded observations. Deep learning approaches have revolutionized this field, delivering remarkable progress in specialized tasks such as image deraining, dehazing, and denoising [9, 22, 25]. Recently, the development of unified models capable of addressing multiple degradation types simultaneously has gained significant attention due to their practical value in real-world applications [29].

Transformer-based architectures have become the de facto models for image restoration owing to their dynamic and long-range modeling capabilities [2, 21, 41]. In particular, Swin Transformer-based methods have achieved widespread adoption in image restoration [37, 51, 57, 59], as their efficient local attention mechanism strikes an exceptional balance between computational cost and restoration quality for dense prediction problems. However, two challenges remain due to the limitations of local window partitioning: insufficient interaction among different windows and a limited receptive field. Subsequent works have attempted to address these challenges with ingenious window designs, such as cross-aggregation [11, 53], increased window overlap [10], or sparse token selection strategies [8, 71]. While these window designs have indeed extended the performance of local attention, they still rely on fixed prior patterns, such as stacking horizontal and vertical windows, and thus do not fully resolve the two challenges introduced by window partitioning.

In this work, we revisit the inductive biases of convolutional operations and introduce a novel Deformable Sliding Window (DSwin) attention mechanism, as illustrated in Fig. 1. Inspired by the proven effectiveness of sliding patterns in convolutional neural networks, we transform the conventional window-first paradigm into a token-centric approach. This fundamental shift enables smoother cross-window interaction through overlapping receptive fields.

To further enhance flexibility, we incorporate adaptive window partitioning inspired by deformable convolution [16]. Instead of fixed window regions, our DSwin attention dynamically reorganizes receptive fields based on content-aware offsets learned from center token features, resulting in more effective feature extraction tailored to image content.

Building upon this foundation, we present the Deformable Sliding Window Transformer for Image Restoration (DSwinIR). A key component of our architecture is the multiscale DSwin module (MSDSwin), which employs DSwin attention with varying kernel sizes across different attention heads to capture rich multiscale features, a crucial capability for effective image restoration.

We conduct extensive evaluations across diverse image restoration tasks, spanning both all-in-one multiple-degradation scenarios and specialized single-task settings. As demonstrated in Fig. 2, DSwinIR delivers substantial improvements of 2.1 dB and 1.3 dB on synthetic and real-world deweathering tasks, respectively. Moreover, our approach establishes new state-of-the-art performance on three-task and five-task degradation benchmarks, outperforming previous methods by approximately 0.7 dB and 0.9 dB. For single-task restoration, DSwinIR surpasses the current leading method DRSformer [8] by 0.62 dB on the challenging real-world deraining SPA dataset [55].

Figure 2. Quantitative comparison of the proposed DSwinIR against existing methods across diverse image restoration tasks, achieving consistent superior performance. All metrics are reported in PSNR (dB).

Our main contributions can be summarized as follows:

• We propose a novel Deformable Sliding Window Attention mechanism that transforms window-based attention into a token-centric paradigm with adaptive, content-aware receptive fields, significantly enhancing feature extraction capabilities and inter-window interaction.

• We develop DSwinIR, a comprehensive image restoration framework built upon our deformable sliding window attention. The architecture incorporates a multiscale attention module that leverages varying kernel sizes across attention heads to capture rich hierarchical features essential for high-quality image restoration.

• Through extensive experiments across multiple image restoration tasks, including both all-in-one settings and specialized single-task scenarios, we demonstrate that DSwinIR consistently outperforms existing methods, establishing new state-of-the-art results on numerous benchmarks.

2. Method

In this section, we present DSwinIR, a novel architecture for image restoration that introduces the Deformable Sliding Window (DSwin) attention mechanism. We first provide an architectural overview, followed by detailed descriptions of our key components: the DSwin attention module and its multi-scale extension.

2.1. Overview

DSwinIR adopts a U-shaped encoder-decoder architecture with our proposed DSwin attention module and MSG-FFN as core components, as shown in Figure 3. The network is optimized using the L1 loss between the restored output ŷ and the ground truth y:
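In standard form, this loss reads:

$$\mathcal{L} = \lVert \hat{y} - y \rVert_{1}$$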

Figure 3. Overview of the proposed DSwinIR architecture, illustrating the integration of the DSwin module and the MSG-FFN within a U-shaped network. (a) Detail implementation of the proposed DSwin. (b) Illustration of the proposed multi-scale DSwin attention module. (c) The improved FFN with multi-scale feature extraction.

2.2. Deformable Sliding Window Attention

Preliminaries

Given an input feature map X ∈ R^{H×W×C}, self-attention is computed by comparing each query feature x_{i,j} with the features within its receptive field. To incorporate local context, we define the attention weights at position (i, j) as:
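The original equation image did not survive; in a standard scaled dot-product form (with q and k the learned query/key projections of x, and d the per-head dimension) it can be written as:

$$\alpha_{i,j}^{(u,v)} = \frac{\exp\!\left(q_{i,j}^{\top} k_{i+u,\,j+v}/\sqrt{d}\right)}{\sum_{(u',v')\in\mathcal{N}_k}\exp\!\left(q_{i,j}^{\top} k_{i+u',\,j+v'}/\sqrt{d}\right)}$$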

where (u, v) ∈ N_k denotes the local neighborhood defined by the kernel size k (such as the window size). The output at position (i, j) is computed as:
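In standard form, with v the learned value projection:

$$y_{i,j} = \sum_{(u,v)\in\mathcal{N}_k} \alpha_{i,j}^{(u,v)}\; v_{i+u,\,j+v}$$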

This formulation limits the attention computation to a local neighborhood, similar to convolution, making it computationally efficient while also highlighting the crucial role of the receptive field.
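The local formulation above can be sketched in code. The following is a minimal NumPy illustration of per-token neighborhood attention (single head, identity query/key/value projections for brevity), not the authors' implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neighborhood_attention(X, k):
    """Per-token attention over a k x k local neighborhood.

    X: (H, W, C) feature map; identity projections stand in for the
    learned query/key/value projections to keep the sketch short.
    """
    H, W, C = X.shape
    r = k // 2
    Y = np.zeros_like(X)
    # Zero-pad so every anchor token has a full k x k neighborhood.
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)))
    for i in range(H):
        for j in range(W):
            q = X[i, j]                               # query feature x_{i,j}
            nb = Xp[i:i + k, j:j + k].reshape(-1, C)  # k*k neighbor features
            w = softmax(nb @ q / np.sqrt(C))          # scaled dot-product weights
            Y[i, j] = w @ nb                          # weighted aggregation
    return Y

X = np.random.default_rng(0).normal(size=(8, 8, 4))
Y = neighborhood_attention(X, k=3)
print(Y.shape)  # (8, 8, 4)
```

Note how an anchor at a corner still attends over a full k × k neighborhood (via padding here), which is the token-centric behavior that plain window partitioning lacks.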

Incorporating Deformable Offsets

To adaptively extend the receptive field, we introduce deformable offsets into the attention mechanism. Specifically, we learn offsets Δp^{(u,v)}_{i,j} for each position (i, j) and each location (u, v) in the local neighborhood:
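A form consistent with the surrounding text, with the offsets predicted from the center token feature:

$$\Delta p_{i,j}^{(u,v)} = f_{\theta}(x_{i,j})$$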

where f_θ is a lightweight module that predicts the offsets for the sampling locations.

Leveraging these offsets, we sample features at the deformed positions, where Δu^{(u,v)}_{i,j} and Δv^{(u,v)}_{i,j} are the components of Δp^{(u,v)}_{i,j}. The output feature is then assembled from the adaptively selected tokens as:
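In standard form, the deformed sampling positions and the assembled output are:

$$\tilde{x}_{i,j}^{(u,v)} = x\big(i+u+\Delta u_{i,j}^{(u,v)},\; j+v+\Delta v_{i,j}^{(u,v)}\big), \qquad y_{i,j} = \sum_{(u,v)\in\mathcal{N}_k} \alpha_{i,j}^{(u,v)}\; \tilde{v}_{i,j}^{(u,v)}$$

where $\tilde{v}$ denotes the value projection of the sampled feature $\tilde{x}$, and bilinear interpolation handles fractional sampling positions.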

By introducing deformable offsets, we adaptively adjust the receptive field, allowing the attention to focus on relevant regions beyond the fixed local window.
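The offset-shifted sampling step can be sketched as follows. This NumPy sketch uses nearest-neighbor rounding in place of bilinear interpolation and takes the offsets as an external input (standing in for the lightweight predictor f_θ); it is an illustration, not the authors' implementation:

```python
import numpy as np

def deformable_gather(X, offsets, k):
    """Gather neighbor features at offset-shifted sampling locations.

    X:       (H, W, C) feature map.
    offsets: (H, W, k*k, 2) per-token offsets Delta p, as would be
             predicted by a lightweight module f_theta.
    Nearest-neighbor rounding stands in for bilinear interpolation.
    """
    H, W, C = X.shape
    r = k // 2
    base = [(u, v) for u in range(-r, r + 1) for v in range(-r, r + 1)]
    out = np.zeros((H, W, k * k, C))
    for i in range(H):
        for j in range(W):
            for n, (u, v) in enumerate(base):
                du, dv = offsets[i, j, n]
                # Deformed sampling position, clamped to the feature map.
                pi = int(np.clip(np.rint(i + u + du), 0, H - 1))
                pj = int(np.clip(np.rint(j + v + dv), 0, W - 1))
                out[i, j, n] = X[pi, pj]
    return out

X = np.arange(16.0).reshape(4, 4, 1)
zero = np.zeros((4, 4, 9, 2))
nb = deformable_gather(X, zero, k=3)  # zero offsets -> plain sliding window
```

With zero offsets the gather reduces exactly to the fixed sliding-window neighborhood; learned nonzero offsets move each sampling point toward content-relevant regions.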

2.3. Multi-Scale DSwin Attention Module

We further extend the basic DSwin attention to a multi-scale variant (MS-DSwin). The key insight is to leverage different receptive fields within a single attention module.

Multi-Scale Design

In MS-DSwin, we assign different kernel sizes to different attention heads within the multi-head attention mechanism. Formally, given H attention heads, each head h ∈ {1, ..., H} is associated with a unique kernel size k_h. The attention computation for head h can be expressed as:
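A form consistent with the per-head notation used below:

$$y_{i,j}^{h} = \sum_{(u,v)\in\mathcal{N}_{k_h}} \alpha_{i,j}^{h,(u,v)}\; v\big(i+u+\Delta u_{i,j}^{h},\; j+v+\Delta v_{i,j}^{h}\big)$$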

where N_{k_h} denotes the local neighborhood of kernel size k_h for head h, and (Δu^h_{i,j}, Δv^h_{i,j}) are the learned deformable offsets specific to head h. The outputs from all heads are concatenated and fused through a linear projection:
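In standard multi-head form:

$$Y = \big[\, y^{1};\; y^{2};\; \dots;\; y^{H} \,\big]\, W_{o}$$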

where W_o ∈ R^{C×C} is the output projection matrix, and [ ; ] denotes concatenation.
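The head-wise kernel assignment can be sketched compactly. This NumPy sketch splits channels across heads, runs each head with its own kernel size, then concatenates and projects; deformable offsets are omitted for brevity and W_o is supplied by the caller, so it is an illustration rather than the authors' implementation:

```python
import numpy as np

def head_attention(X, k):
    """Single-head neighborhood attention over a k x k window."""
    H, W, C = X.shape
    r = k // 2
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)))
    Y = np.zeros_like(X)
    for i in range(H):
        for j in range(W):
            nb = Xp[i:i + k, j:j + k].reshape(-1, C)
            s = nb @ X[i, j] / np.sqrt(C)
            w = np.exp(s - s.max()); w /= w.sum()   # softmax weights
            Y[i, j] = w @ nb
    return Y

def ms_attention(X, kernel_sizes, W_o):
    """Multi-scale attention: one kernel size per head, then concat + project.

    X is split channel-wise across heads; W_o fuses the concatenated heads.
    """
    n = len(kernel_sizes)
    splits = np.split(X, n, axis=-1)            # one channel slice per head
    heads = [head_attention(s, k) for s, k in zip(splits, kernel_sizes)]
    Y = np.concatenate(heads, axis=-1)          # [y^1; ...; y^H]
    return Y @ W_o                              # output projection W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6, 8))
Y = ms_attention(X, kernel_sizes=[3, 5], W_o=np.eye(8))
print(Y.shape)  # (6, 6, 8)
```

Assigning kernel sizes per head (here 3 and 5) lets a single module mix fine and coarse receptive fields at no extra sequential cost.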

2.4. Feed-Forward Network

Multi-Scale Guided Feed-Forward Network

To enhance feature processing capabilities, we propose MSG-FFN, a multi-scale guided feed-forward network that extends the standard FFN design with parallel multi-scale convolution branches. Given an input feature map X, MSG-FFN processes it as follows:
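One plausible instantiation matching the description (parallel 3×3, 5×5, and dilated 3×3 branches fused by 1×1 convolutions, with φ a nonlinearity such as GELU; the exact branch layout is an assumption here):

$$\mathrm{MSGFFN}(X) = \mathrm{Conv}_{1\times 1}\Big(\phi\big(\big[\mathrm{Conv}_{3\times 3}(\hat{X});\ \mathrm{Conv}_{5\times 5}(\hat{X});\ \mathrm{Conv}_{3\times 3,\,d=2}(\hat{X})\big]\big)\Big), \qquad \hat{X} = \mathrm{Conv}_{1\times 1}(X)$$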

where Conv_{k×k} denotes a convolution with kernel size k, and Conv_{3×3, d=2} denotes a dilated convolution with kernel size 3 and dilation rate 2. This multi-scale fusion improves the model's adaptability to complex degradation patterns.

3. Results

Figure 4. Ablation studies demonstrating the effectiveness of our key components. We evaluate (1) different attention mechanisms, showing improvements from Sliding Window (31.79 dB) and Deformable Window (32.11 dB) over baselines; (2) DSwin configurations with various kernel sizes (K=5,7,9) and multi-scale enhancement (32.69 dB); and (3) the FFN module with MSG enhancement, achieving the best performance (32.73 dB). All experiments report the average performance over three distinct degradation tasks, with PSNR values in dB.

Table 1. Comprehensive evaluation of the proposed DSwinIR across diverse experimental settings in existing all-in-one image restoration research.

Figure 5. Visual comparison of restoration results across three degradation tasks: noise removal (top row), rain streak removal (middle row), and dehazing (bottom row). Zoom-in regions (shown in colored boxes) demonstrate that our method achieves superior detail preservation and degradation removal.

Table 5. Quantitative comparison on setting 4: real-world deweathering following [78]. Results of our DSwinIR are in bold.

Table 6. Quantitative comparison of different methods on the single image deraining task, evaluated on Rain100L [60] and SPAData [55]. The results are reported in terms of PSNR/SSIM. The best results are highlighted in bold.
