[Attention Mechanism Improvements] A collection of efficient attention variants


Editor: NewBeeNLP

A few days ago, while browsing GitHub, I came across the "awesome-fast-attention" list, which collects a series of papers on efficient attention improvements, including the papers themselves, their citation counts, source-code implementations, algorithmic complexity, and key ideas. Some of these papers have already been covered in our earlier "Transformer Assemble" series.

Efficient Attention

| Paper (citations) | Implementation | Main Idea |
| --- | --- | --- |
| Generating Wikipedia by Summarizing Long Sequences[1] (208) | memory-compressed-attention[2] | compresses keys and values + blocked attention |
| CBAM: Convolutional Block Attention Module[3] (677) | attention-module[4] | combines SE attention with a per-pixel (local) weight |
| CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149) | CCNet[6] | each pixel attends to its row and column simultaneously |
| Efficient Attention: Attention with Linear Complexities[7] (2) | efficient-attention[8] | `Softmax(Q)*(Softmax(K^T)*V)`; see the linear-attention sketch after the table |
| Star-Transformer[9] (24) | fastNLP[10] | uses a relay (global) node and attends to/from that node |
| Generating Long Sequences with Sparse Transformers[11] (139) | torch-blocksparse[12] | sparse block-based attention |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96) | GCNet[14] | squeeze-and-excitation with attention pooling (instead of a GAP) |
| SCRAM: Spatially Coherent Randomized Attention Maps[15] (1) | - | uses PatchMatch to find close keys |
| Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13) | IN_PAPER | combination of short-range and then long-range (dilated) attention |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2) | Permutohedral_attention_module[18] | uses a permutohedral lattice approximation algorithm to approximate the attention output |
| Large Memory Layers with Product Keys[19] (28) | XLM[20] | searches for nearest-neighbor keys |
| Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38) | EMANet[22] | applies expectation maximization to cluster keys into k clusters |
| Compressive Transformers for Long-Range Sequence Modelling[23] (20) | compressive-transformer-pytorch[24] | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8) | BPT[26] | attends to distant tokens coarsely and to close tokens in a more fine-grained manner |
| Axial Attention in Multidimensional Transformers[27] (5) | axial-attention[28] | applies attention on each axis separately |
| Reformer: The Efficient Transformer[29] (69) | trax[30] | uses LSH to find close keys |
| Transformer on a Diet[31] (2) | transformer-on-diet[32] | dilated transformer, similar to WaveNet |
| Sparse Sinkhorn Attention[33] (4) | sinkhorn-transformer[34] | uses a cost matrix to limit attention between buckets |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1) | - | learns the q, k connections, i.e. dynamically creates a sparse attention matrix |
| Efficient Content-Based Sparse Attention with Routing Transformers[36] (11) | routing-transformer[37] | computes attention with same-cluster tokens (clusters computed by online k-means) |
| Longformer: The Long-Document Transformer[38] (15) | longformer[39] | global + blocked attention; see the local + global mask sketch after the table |
| Neural Architecture Search for Lightweight Non-Local Networks[40] (2) | AutoNL[41] | computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions |
| ETC: Encoding Long and Structured Data in Transformers[42] (2) | - | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models[43] (1) | IN_PAPER | UNet-like + retina attention; something close to BP-Transformer |
| Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5) | - | does not compute pairwise interactions |
| Jukebox: A Generative Model for Music[45] (9) | jukebox[46] | better attention patterns from Sparse Transformer |
| GMAT: Global Memory Augmentation for Transformers[47] (0) | gmat[48] | adds global tokens |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0) | google-research[50] | calculates an unbiased stochastic approximation of the attention matrix |
| Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0) | - | does not compute pairwise interactions and uses fixed mask patterns |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1) | fast-transformers[53] | uses `phi(q)(phi(k)v)` and also improves the sequential sampling step |
| Linformer: Self-Attention with Linear Complexity[54] (3) | linformer-pytorch[55] | projects keys and values from sequence length n down to a fixed k |
| Real-time Semantic Segmentation with Fast Attention[56] (0) | - | `l2_norm(q)*(l2_norm(k)*v)` |
| Fast Transformers with Clustered Attention[57] (0) | fast-transformers[58] | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences[59] (0) | - | ETC with random connections |
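
Several rows above (Efficient Attention, "Transformers are RNNs", and the fast attention used for real-time segmentation) share the same factorization trick: instead of materializing the n×n matrix softmax(QK^T) and multiplying it by V, they apply a feature map phi to queries and keys and compute phi(Q)(phi(K)^T V), which is linear in sequence length. Below is a minimal PyTorch sketch of this kernelized linear attention; the choice phi(x) = elu(x) + 1 follows "Transformers are RNNs", while the tensor layout, function name, and eps constant are illustrative assumptions rather than code from any of the listed repositories.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: phi(Q) @ (phi(K)^T @ V), normalized per query.
    q, k, v: (batch, heads, seq_len, dim). Cost is O(n * d^2) instead of O(n^2 * d)."""
    phi = lambda x: F.elu(x) + 1                               # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum('bhnd,bhne->bhde', k, v)                 # key/value summary, (d, d_v)
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)  # per-query normalizer
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)


# Shape sanity check.
q = k = v = torch.randn(2, 4, 128, 32)
out = linear_attention(q, k, v)                                # -> (2, 4, 128, 32)
```

For causal (autoregressive) use, the same idea is applied with running prefix sums over the key/value summaries instead of a single global sum.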
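
Another recurring pattern in the table (Longformer, ETC, GMAT, Big Bird) is sparsifying the attention pattern itself: each token attends to a local sliding window, a few designated global tokens attend everywhere and are attended to by everyone, and Big Bird additionally adds random connections. The sketch below builds such a boolean mask for a dense implementation; the window size, number of global tokens, and function name are illustrative assumptions, and production implementations compute only the allowed blocks rather than masking a full n×n score matrix.

```python
import torch


def local_global_mask(seq_len, window=4, n_global=2):
    """Boolean attention mask where True means attention is allowed.
    Each token sees a local window of +/- `window` positions; the first
    `n_global` tokens attend to, and are attended by, every position."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window      # banded local window
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    glob[:n_global, :] = True                                  # global tokens attend everywhere
    glob[:, :n_global] = True                                  # everyone attends to global tokens
    return local | glob


mask = local_global_mask(512)
scores = torch.randn(512, 512)
scores = scores.masked_fill(~mask, float('-inf'))              # block disallowed pairs pre-softmax
attn = scores.softmax(dim=-1)
```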


References

[1] Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198v1
[2] memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention
[3] CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2
[4] attention-module: https://github.com/Jongchan/attention-module
[5] CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2
[6] CCNet: https://github.com/speedinghzl/CCNet
[7] Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8
[8] efficient-attention: https://github.com/cmsflash/efficient-attention
[9] Star-Transformer: https://arxiv.org/abs/1902.09113v2
[10] fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py
[11] Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1
[12] torch-blocksparse: https://github.com/ptillet/torch-blocksparse
[13] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1
[14] GCNet: https://github.com/xvjiarui/GCNet
[15] SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1
[16] Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2
[17] Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2
[18] Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module
[19] Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2
[20] XLM: https://github.com/facebookresearch/XLM
[21] Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2
[22] EMANet: https://github.com/XiaLiPKU/EMANet
[23] Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1
[24] compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch
[25] BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1
[26] BPT: https://github.com/yzh119/BPT
[27] Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1
[28] axial-attention: https://github.com/lucidrains/axial-attention
[29] Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2
[30] trax: https://github.com/google/trax/tree/master/trax/models/reformer
[31] Transformer on a Diet: https://arxiv.org/abs/2002.06170v1
[32] transformer-on-diet: https://github.com/cgraywang/transformer-on-diet
[33] Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1
[34] sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer
[35] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2
[36] Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1
[37] routing-transformer: https://github.com/lucidrains/routing-transformer
[38] Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1
[39] longformer: https://github.com/allenai/longformer
[40] Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1
[41] AutoNL: https://github.com/LiYingwei/AutoNL
[42] ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2
[43] Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1
[44] Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1
[45] Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1
[46] jukebox: https://github.com/openai/jukebox
[47] GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1
[48] gmat: https://github.com/ag1988/gmat
[49] Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1
[50] google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention
[51] Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1
[52] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2
[53] fast-transformers: https://github.com/idiap/fast-transformers
[54] Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3
[55] linformer-pytorch: https://github.com/tatp22/linformer-pytorch
[56] Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2
[57] Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1
[58] fast-transformers: https://github.com/idiap/fast-transformers
[59] Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1
[60] A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/

### YOLOv11 Lightweight Attention Improvements

#### Design Rationale and Background

To make YOLOv11 run efficiently in resource-constrained environments such as mobile devices and tablets, lightweight design is essential. Traditional object-detection pipelines usually rely on computationally expensive feature extraction, so reducing model complexity while maintaining high performance has become an active research topic[^1].

#### Role of the Attention Mechanism

Introducing an attention mechanism helps the network focus on the most representative and discriminative regions, which improves small-object recognition and overall performance. For the YOLO family, this not only mitigates the difficulty of effectively capturing small objects but also strengthens the understanding and localization accuracy of objects at different scales[^3].

#### Implementation Strategies

Within YOLOv11, lightweight attention modules can be integrated through the following strategies:

- **Channel-wise squeeze**: build an efficient channel-attention unit with Squeeze-and-Excitation (SE) or one of its variants. These structures reweight the importance of each convolutional layer's output channels at low cost and promote effective information flow.

```python
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: global average pool, bottleneck MLP, sigmoid gate."""

    def __init__(self, channel, reduction=16):
        super(SEBlock, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)          # squeeze: (b, c)
        y = self.fc(y).view(b, c, 1, 1)          # excite: per-channel weights
        return x * y.expand_as(x)                # rescale the input channels
```

- **Spatial attention**: use a Spatial Attention Module (SAM) to highlight informative spatial locations without adding many parameters. SAM does this by applying an adaptive mask to the input feature map.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention: pool across channels (avg + max), convolve, sigmoid gate per position."""

    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), 'kernel size must be 3 or 7'
        padding = 3 if kernel_size == 7 else 1
        self.conv1 = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = torch.mean(x, dim=1, keepdim=True)      # (b, 1, h, w)
        max_out, _ = torch.max(x, dim=1, keepdim=True)    # (b, 1, h, w)
        out = torch.cat([avg_out, max_out], dim=1)        # (b, 2, h, w)
        out = self.conv1(out)
        return self.sigmoid(out) * x                      # rescale spatial positions
```

- **Hybrid attention**: combine the two ideas above into CBAM (Convolutional Block Attention Module), i.e. perform channel-level selection first and then spatial refinement. This accounts for both inter-channel relationships and the correlations between pixels in local regions, striking a good balance; a minimal composition sketch follows this section.

With these measures, YOLOv11 can further reduce its computational load while keeping its existing strengths, and the carefully designed attention components help it better handle the challenges of diverse scenarios[^2].
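
As mentioned in the last bullet, the two modules above can be chained into a CBAM-style block: channel attention first, then spatial attention. The sketch below is one minimal way to compose them, assuming the `SEBlock` and `SpatialAttention` classes from the snippets above are in scope; the class name `CBAMBlock` is illustrative, and the reference CBAM channel branch uses both average and max pooling rather than the plain SE block reused here.

```python
import torch
import torch.nn as nn


class CBAMBlock(nn.Module):
    """Illustrative CBAM-style block: channel reweighting followed by spatial reweighting.
    Assumes SEBlock and SpatialAttention from the snippets above are in scope."""

    def __init__(self, channel, reduction=16, kernel_size=7):
        super(CBAMBlock, self).__init__()
        self.channel_att = SEBlock(channel, reduction)
        self.spatial_att = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.channel_att(x)   # emphasize informative channels first
        x = self.spatial_att(x)   # then emphasize informative spatial locations
        return x


# Usage sketch: refine a backbone feature map without changing its shape.
feats = torch.randn(2, 64, 80, 80)
refined = CBAMBlock(64)(feats)    # -> torch.Size([2, 64, 80, 80])
```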