CoAtNet: Marrying Convolution and Attention for All Data Sizes, explained in detail (with code)



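One of the main limitations of ViT noted in the paper is its heavy appetite for data. ViT shows exciting results on the huge JFT-300M dataset, but with little data it still underperforms classic CNNs. This suggests that Transformers may lack the generalization ability (the inductive biases) that CNNs have and therefore need large amounts of data to compensate. Compared with CNNs, however, attention models have higher model capacity.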

CoAtNet aims to fuse the strengths of CNNs and Transformers into a single architecture. But what is the right way to mix the two?

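The first idea is to use the MBConv block discussed earlier, which applies depthwise convolution with inverted residuals; its expand-then-compress scheme is the same as that of the Transformer FFN module. Beyond this similarity, both depthwise convolution and self-attention can be written as a per-dimension weighted sum of values within a predefined receptive field. Depthwise convolution can be expressed as:

$$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j$$

where $x_i$ and $y_i$ are the input and output at position $i$, $w_{i-j}$ is the kernel weight at relative offset $(i - j)$, and $\mathcal{L}(i)$ is the local neighborhood of $i$.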
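To make the weighted-sum view concrete, here is a minimal sketch of a 1D depthwise convolution in plain Python (no PyTorch). The function name `depthwise_conv1d`, the dictionary layout of the kernel, and the zero padding at the borders are assumptions of this example, not something taken from the paper or from the listing later in this post.

```python
def depthwise_conv1d(x, w):
    """Depthwise convolution as a per-channel weighted sum.

    x: list of positions, each a list of D channel values.
    w: dict mapping a relative offset (i - j) to a list of D per-channel weights.
    Computes y[i][d] = sum over j in L(i) of w[i - j][d] * x[j][d].
    """
    length, dim = len(x), len(x[0])
    y = [[0.0] * dim for _ in range(length)]
    for i in range(length):
        for offset, weights in w.items():
            j = i - offset
            if 0 <= j < length:  # neighbours outside the sequence act as zero padding
                for d in range(dim):
                    y[i][d] += weights[d] * x[j][d]
    return y


# Two channels: channel 0 uses a smoothing kernel, channel 1 an identity kernel.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
w = {-1: [0.5, 0.0], 0: [1.0, 1.0], 1: [0.5, 0.0]}
y = depthwise_conv1d(x, w)
print(y)  # [[2.0, 10.0], [4.0, 20.0], [6.0, 30.0], [5.5, 40.0]]
```

Each channel is filtered independently, and the weight applied to a neighbour depends only on the offset `i - j`, never on the input values themselves.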

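Self-attention, by contrast, lets the receptive field extend beyond a local neighborhood, and computes the weights from pairwise similarity followed by a softmax:

$$y_i = \sum_{j \in \mathcal{G}} \underbrace{\frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)}}_{A_{i,j}} \, x_j$$

where $\mathcal{G}$ denotes the global spatial space and $(x_i, x_j)$ is a pair of positions (for example, two patches of an image). This is a simplified version for ease of understanding, with the multi-head Q, K and V projections omitted: each patch is compared with every other patch in the same image to produce a self-attention matrix.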
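The simplified self-attention described above can be sketched in a few lines of plain Python. This is only an illustration: `simple_self_attention` is a hypothetical helper name, there are no Q, K, V projections and no heads, and subtracting the row maximum before exponentiating is a standard numerical-stability step that does not change the softmax result.

```python
import math


def simple_self_attention(x):
    """y_i = sum_j A_ij * x_j with A_ij = softmax over j of (x_i . x_j).

    x: list of N token vectors (lists of floats).
    Returns (outputs, attention_matrix).
    """
    n, dim = len(x), len(x[0])
    outputs, attn = [], []
    for i in range(n):
        # Pairwise similarity of token i with every token j (global receptive field).
        logits = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        row = [e / total for e in exps]
        attn.append(row)
        outputs.append([sum(row[j] * x[j][d] for j in range(n)) for d in range(dim)])
    return outputs, attn


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = simple_self_attention(tokens)
for row in attn:
    print([round(a, 3) for a in row])
```

Every row of the attention matrix sums to 1, and the weights change whenever the tokens change, which is the input-adaptive behaviour contrasted with the static convolution kernel.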


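Input-Adaptive Weighting: the kernel $w_{i-j}$ is a static value that does not depend on the input, whereas the attention weight $A_{i,j}$ depends on the representation of the input. This makes it easier for self-attention to capture relationships between different elements of the input, at the cost of a risk of overfitting when data is limited.

Translation Equivariance: the convolution weight $w_{i-j}$ depends only on the relative offset between $i$ and $j$, not on their absolute values. This translation equivariance improves generalization on datasets of limited size.

Global Receptive Field: compared with the local receptive field of a CNN, the larger receptive field used in self-attention provides more contextual information.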
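The translation-equivariance property listed above can be checked numerically: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below uses a circular (wrap-around) 1D convolution so that the equality is exact at the borders; the helper names `circular_conv1d` and `shift` are made up for this example.

```python
def circular_conv1d(x, kernel):
    """y[i] = sum over k of kernel[k] * x[(i - k) mod n], for offsets k = -r..r."""
    n, r = len(x), len(kernel) // 2
    return [
        sum(kernel[k + r] * x[(i - k) % n] for k in range(-r, r + 1))
        for i in range(n)
    ]


def shift(x, s):
    """Cyclically shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]  # weights for offsets -1, 0, +1

conv_then_shift = shift(circular_conv1d(signal, kernel), 2)
shift_then_conv = circular_conv1d(shift(signal, 2), kernel)
print(conv_then_shift == shift_then_conv)  # True
```

The equality holds because the kernel is indexed by the offset alone. A weight that depended on absolute positions would generally break it.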

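In summary, the ideal architecture would combine the input-adaptive weighting and global receptive field of self-attention with the translation equivariance of CNNs. The authors therefore propose adding a global static convolution kernel to the adaptive attention matrix, either after or before the softmax normalization:

$$y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)} + w_{i-j} \right) x_j, \qquad y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp(x_i^\top x_j + w_{i-j})}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k + w_{i-k})} \, x_j$$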
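As a rough illustration of the "before the softmax" variant, the sketch below adds a static bias indexed by the relative offset `i - j` to the content logits before normalizing. It is plain Python with assumed names (`relative_attention_pre`, and a dictionary `w` holding one scalar per offset, defaulting to 0 for offsets that are not listed); it is not the implementation used in the paper or in the listing later in this post.

```python
import math


def relative_attention_pre(x, w):
    """A_ij = softmax over j of (x_i . x_j + w[i - j]); y_i = sum_j A_ij * x_j.

    x: list of N token vectors.
    w: dict mapping relative offset (i - j) to a scalar bias (assumed layout).
    """
    n, dim = len(x), len(x[0])
    outputs, attn = [], []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(x[i], x[j])) + w.get(i - j, 0.0)
            for j in range(n)
        ]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        row = [e / total for e in exps]
        attn.append(row)
        outputs.append([sum(row[j] * x[j][d] for j in range(n)) for d in range(dim)])
    return outputs, attn


# With identical tokens the content term is the same for every j,
# so the attention pattern is determined by the static bias alone.
same_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
bias = {0: 2.0}  # favour the token's own position
_, rel_attn = relative_attention_pre(same_tokens, bias)
print([round(a, 3) for a in rel_attn[0]])
```

When the tokens differ, both terms contribute: the content logits adapt to the input while the bias keeps a convolution-like preference tied to relative position.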


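With this theoretical basis in place, the next step is to work out how to stack convolution and attention blocks. The authors choose to downsample first and to apply global relative attention only once the feature map is small enough to handle. There are two ways to perform the downsampling:

1. Split the image into patches as in the ViT model and stack relative self-attention blocks. This model serves as the comparison against the original ViT.
2. Use a multi-stage design with gradual pooling. This approach has 5 stages, but the first two, a classic convolutional layer and an MBConv block used to reduce dimensionality, are merged here into a single stage called S0 for simplicity. The remaining three stages can each be a convolution (C) or a Transformer (T) block, giving 4 combinations: S0-CCC, S0-CCT, S0-CTT and S0-TTT.

The resulting 5 models are compared on generalization (the gap between training loss and evaluation accuracy) and on model capacity (the ability to fit a large training dataset), using ImageNet-1K (1.3M images) and JFT (more than 300M images).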
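The four multi-stage layouts named above can be enumerated mechanically. The sketch assumes that convolution stages always come before Transformer stages, which is what the four names suggest; `stage_layouts` is a hypothetical helper written only for this illustration.

```python
def stage_layouts(num_stages=3):
    """List layouts where k trailing stages are Transformer (T) and the rest conv (C)."""
    return [
        "S0-" + "C" * (num_stages - k) + "T" * k
        for k in range(num_stages + 1)
    ]


layouts = stage_layouts()
print(layouts)  # ['S0-CCC', 'S0-CCT', 'S0-CTT', 'S0-TTT']
```

Together with the ViT-style patch model, these four layouts make up the 5 models being compared.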

Generalization: S0-CCC ≈ S0-CCT ≥ S0-CTT > S0-TTT ≫ ViT



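Model capacity: simply adding more Transformer blocks does not necessarily mean higher capacity. S0-CTT, shown in the figure below, was chosen as the best trade-off between the two properties.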


import math

import torch
from torch import nn

# Project-local modules: these import paths must resolve in your environment.
from model.conv.MBConv import MBConvBlock
from model.attention.SelfAttention import ScaledDotProductAttention


class CoAtNet(nn.Module):
    def __init__(self, in_ch, image_size, out_chs=(64, 96, 192, 384, 768)):
        super().__init__()  # must run before submodules are registered
        self.out_chs = out_chs
        # Max-pooling downsampling (2D for feature maps, 1D for token sequences)
        self.maxpool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.maxpool1d = nn.MaxPool1d(kernel_size=2, stride=2)
        # Stage 0: convolutional feature extraction
        self.s0 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
        )
        # Increase the channel dimension
        self.mlp0 = nn.Sequential(
            nn.Conv2d(in_ch, out_chs[0], kernel_size=1),
            nn.Conv2d(out_chs[0], out_chs[0], kernel_size=1),
        )
        # Convolution stages: inverted-residual (MBConv) blocks
        self.s1 = MBConvBlock(ksize=3, input_filters=out_chs[0], output_filters=out_chs[0], image_size=image_size // 2)
        self.mlp1 = nn.Sequential(
            nn.Conv2d(out_chs[0], out_chs[1], kernel_size=1),
            nn.Conv2d(out_chs[1], out_chs[1], kernel_size=1),
        )

        self.s2 = MBConvBlock(ksize=3, input_filters=out_chs[1], output_filters=out_chs[1], image_size=image_size // 4)
        self.mlp2 = nn.Sequential(
            nn.Conv2d(out_chs[1], out_chs[2], kernel_size=1),
            nn.Conv2d(out_chs[2], out_chs[2], kernel_size=1),
        )
        # Self-attention stages
        # The four arguments are d_model, d_k, d_v, h:
        # :param d_model: Output dimensionality of the model
        # :param d_k: Dimensionality of queries and keys
        # :param d_v: Dimensionality of values
        # :param h: Number of heads
        self.s3 = ScaledDotProductAttention(out_chs[2], out_chs[2] // 8, out_chs[2] // 8, 8)
        self.mlp3 = nn.Sequential(
            nn.Linear(out_chs[2], out_chs[3]),
            nn.Linear(out_chs[3], out_chs[3]),
        )

        self.s4 = ScaledDotProductAttention(out_chs[3], out_chs[3] // 8, out_chs[3] // 8, 8)
        self.mlp4 = nn.Sequential(
            nn.Linear(out_chs[3], out_chs[4]),
            nn.Linear(out_chs[4], out_chs[4]),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        # stage0: convolutional stem
        y = self.mlp0(self.s0(x))
        y = self.maxpool2d(y)
        # stage1: inverted-residual block
        y = self.mlp1(self.s1(y))
        y = self.maxpool2d(y)
        # stage2: inverted-residual block
        y = self.mlp2(self.s2(y))
        y = self.maxpool2d(y)  # [1,192,28,28]
        # stage3: flatten the feature map into a token sequence
        y = y.reshape(B, self.out_chs[2], -1).permute(0, 2, 1)  # B,N,C [1,784,192]
        y = self.mlp3(self.s3(y, y, y))  # [1,784,384]
        y = self.maxpool1d(y.permute(0, 2, 1)).permute(0, 2, 1)  # [1,392,384]
        # stage4
        y = self.mlp4(self.s4(y, y, y))  # [1,392,768]
        y = self.maxpool1d(y.permute(0, 2, 1))  # B,C,N [1,768,196]
        N = y.shape[-1]
        side = math.isqrt(N)  # N must be a perfect square for the reshape below
        y = y.reshape(B, self.out_chs[4], side, side)  # [1,768,14,14]

        return y


if __name__ == '__main__':
    x = torch.randn(1, 3, 224, 224)
    coatnet = CoAtNet(3, 224)
    y = coatnet(x)
    print(y.shape)
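To follow how the tensor shapes evolve through the listing without installing PyTorch, the arithmetic can be traced in plain Python. This is a sketch under assumptions: `trace_shapes` is a hypothetical helper, the batch size is fixed to 1, and the MBConv and attention blocks are assumed to preserve spatial size and token count, as the shape comments in the listing imply.

```python
import math


def trace_shapes(image_size=224, out_chs=(64, 96, 192, 384, 768)):
    """Return (label, shape) pairs after each pooling step, for batch size 1."""
    shapes = []
    side = image_size
    for stage in range(3):
        side //= 2  # MaxPool2d(kernel_size=2, stride=2) halves height and width
        shapes.append((f"stage{stage}", (1, out_chs[stage], side, side)))
    tokens = side * side  # feature map flattened into a token sequence
    tokens //= 2  # MaxPool1d halves the token axis
    shapes.append(("stage3", (1, tokens, out_chs[3])))  # layout B, N, C
    tokens //= 2
    shapes.append(("stage4", (1, out_chs[4], tokens)))  # layout B, C, N
    final_side = math.isqrt(tokens)
    shapes.append(("output", (1, out_chs[4], final_side, final_side)))
    return shapes


for label, shape in trace_shapes():
    print(label, shape)
```

For a 224×224 input this ends at `(1, 768, 14, 14)`. One thing to keep in mind is that pooling the flattened token axis merges horizontally adjacent positions, so the 28×28 grid becomes 28×14 and then 28×7 before the final reshape to 14×14; the last reshape is therefore not a uniform 2D downsample.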



