[Paper Companion Reading] [TimeSformer] Is Space-Time Attention All You Need for Video Understanding? - Part 1

Mark White

Category: Paper Sharing. Tags: video, computer vision

Article link: https://blog.csdn.net/crazyjinks/article/details/144720301

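Preface

Paper title: Is Space-Time Attention All You Need for Video Understanding?
Paper link: https://arxiv.org/pdf/2102.05095.pdf
GitHub: https://github.com/lucidrains/TimeSformer-pytorch (a third-party PyTorch implementation; the official code cited in the abstract is at https://github.com/facebookresearch/TimeSformer)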

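0 Abstract

Original: “We present a convolution-free approach to video classification built exclusively on self-attention over space and time.”
Paraphrase: The paper proposes a video classification method that contains no convolutions and is built entirely on self-attention over space and time.
Explanation: The key claim is that convolution is dropped altogether; self-attention is the only operator used to mix information.

Original: “Our method, named ‘TimeSformer,’ adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches.”
Paraphrase: The method, TimeSformer, adapts the standard Transformer to video by learning spatiotemporal features directly from a sequence of patches cut from the individual frames.
Explanation: Names the model and its core mechanism: the Transformer, designed for sequences of word tokens, is applied to sequences of frame patches.

Original: “Our experimental study compares different self-attention schemes and suggests that ‘divided attention,’ where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.”
Paraphrase: Experiments compare several self-attention schemes. "Divided attention", which applies temporal attention and spatial attention separately within each block, gives the best classification accuracy among the designs considered.
Explanation: This is the main technical finding: handling time and space in two separate attention steps works better than the alternatives the authors tried.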

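Original: “Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600.”
Paraphrase: Although the design is radically new, TimeSformer reaches state-of-the-art results on several action recognition benchmarks, including the best accuracy reported on Kinetics-400 and Kinetics-600.
Explanation: Stresses both novelty and results, singling out two widely used datasets.

Original: “Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long).”
Paraphrase: Compared with 3D convolutional networks, the model trains faster, can be far more efficient at test time (at the cost of a small drop in accuracy), and can handle much longer clips (over one minute).
Explanation: Lists three advantages over 3D CNNs: training speed, test efficiency, and the ability to process long videos. The efficiency gain comes with a small accuracy drop.

Original: “Code and models are available at: https://github.com/facebookresearch/TimeSformer.”
Paraphrase: Code and models are released at https://github.com/facebookresearch/TimeSformer.
Explanation: Gives the link to the open-source code, which supports reproducibility.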

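Overall understanding:
The abstract makes five points:
It proposes TimeSformer, a new video classification model
The model is built entirely on self-attention and uses no convolutions
The key design choice is to apply temporal and spatial attention separately
It reaches state-of-the-art results on several benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600
Compared with 3D convolutional networks it trains faster, is more efficient at test time (with a small accuracy drop), and handles longer videos

Additional notes:
Terminology:
self-attention: the mechanism at the core of the Transformer architecture, in which every element of a sequence is compared with every other element
spatiotemporal: covering both the spatial and the temporal dimensions
frame-level patches: small blocks obtained by cutting each video frame into a grid
Kinetics-400/600: large-scale video action recognition datasets with 400 and 600 action classes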
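
To make the term "self-attention" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes a single head and identity query/key/value projections (a real Transformer learns these projections and uses several heads), and the function names are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (illustrative only).

    tokens: list of equal-length float vectors. Each output vector is a
    weighted average of ALL tokens, weighted by scaled dot-product similarity.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Compare this token with every token in the sequence.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)  # one output vector per input token
```

The inner loop over `tokens` for every `query` is the all-pairs comparison that later sections of the paper try to make cheaper for video.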

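1 Introduction

[Sentence 1]

Original: “Over the last few years, the field of natural language processing (NLP) has been revolutionized by the emergence of methods based on self-attention (Vaswani et al., 2017a).”
Paraphrase: In the last few years, methods based on self-attention have transformed natural language processing (NLP).
Explanation:
Key terms: revolutionized (fundamentally changed), emergence (appearance)
Sentence structure: present perfect passive ("has been revolutionized"), stressing a change that spans recent years and still holds
Relation to context: the opening sentence of the paper; it introduces the topic
Note: the citation "Vaswani et al., 2017a" is the Transformer paper, "Attention Is All You Need"
[Sentence 2]

Original: “Because of their excellent capabilities at capturing long-range dependencies among words as well as their training scalability, self-attention architectures, such as the Transformer model, represent the current state-of-the-art across a wide range of language tasks”
Paraphrase: Because they capture long-range dependencies between words well and scale well in training, self-attention architectures such as the Transformer are the state of the art across many language tasks.
Explanation:
Key terms: long-range dependencies, scalability, state-of-the-art
Sentence structure: cause-and-effect sentence (because ..., ...)
Relation to context: explains why self-attention has been so successful
Note: two advantages are named, long-range dependency modeling and training scalability
[Sentence 3]

Original: “including machine translation (Ott et al., 2018; Chen et al., 2018a), question answering (Devlin et al., 2019; Dai et al., 2019), and autoregressive word generation (Radford et al., 2019; Brown et al., 2020).”
Paraphrase: These tasks include machine translation, question answering, and autoregressive word generation.
Explanation:
Key terms: machine translation, question answering, autoregressive word generation
Sentence structure: a list; this is the continuation of Sentence 2, not a separate sentence
Relation to context: gives concrete application areas
Note: each application is backed by citations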
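[Sentence 1]

Original: “Video understanding shares several high-level similarities with NLP.”
Paraphrase: Video understanding resembles NLP in several high-level ways.
Explanation:
Key terms: high-level similarities
Sentence structure: simple declarative sentence
Relation to context: transition sentence that opens the new topic
Note: sets up the argument that follows

[Sentence 2]

Original: “First of all, videos and sentences are both sequential.”
Paraphrase: First, both videos and sentences are sequences.
Explanation:
Key terms: sequential
Sentence structure: simple sentence stating the first similarity
Relation to context: starts the list of similarities
Note: the shared property is the structure of the data

[Sentence 3]

Original: “Furthermore, precisely as the meaning of a word can often be understood only by relating it to the other words in the sentence, it may be argued that atomic actions in short-term segments need to be contextualized with the rest of the video in order to be fully disambiguated.”
Paraphrase: Second, just as a word often makes sense only in relation to the other words of the sentence, an atomic action in a short segment may need the rest of the video as context before it can be identified unambiguously.
Explanation:
Key terms: atomic actions (the smallest action units), contextualized (placed in context), disambiguated (made unambiguous)
Sentence structure: a long analogy with parallel structure, comparing language and video
Relation to context: develops the second similarity
Note: video understanding is explained by analogy with language understanding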

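[Sentence 4]

Original: “Thus, one would expect the long-range self-attention models from NLP to be highly effective for video modeling as well.”
Paraphrase: One would therefore expect long-range self-attention models from NLP to work well for video too.
Explanation:
Key terms: long-range self-attention models
Sentence structure: an inference ("thus")
Relation to context: a reasonable conjecture drawn from the similarities above
Note: raises the possibility of transferring the technique from NLP to video

[Sentence 5]

Original: “However, in the video domain, 2D or 3D convolutions still represent the core operators for spatiotemporal feature learning across different video tasks (Feichtenhofer et al., 2019a; Teed & Deng, 2020; Bertasius & Torresani, 2020).”
Paraphrase: In video, however, 2D or 3D convolutions are still the core operators for spatiotemporal feature learning across tasks.
Explanation:
Key terms: 2D/3D convolutions, spatiotemporal feature learning
Sentence structure: contrast ("however") describing the status quo
Relation to context: reality does not match the expectation in Sentence 4
Note: names the mainstream technique in video processing at the time of writing

[Sentence 6]

Original: “While self-attention has shown benefits when applied on top of convolutional layers (Wang et al., 2018a), to the best of our knowledge, no attempt to use self-attention as the exclusive building block for video recognition models has been reported.”
Paraphrase: Self-attention has helped when added on top of convolutional layers, but to the authors' knowledge nobody has reported a video recognition model that uses self-attention as its only building block.
Explanation:
Key terms: exclusive building block
Sentence structure: concessive sentence ("while ...")
Relation to context: identifies the research gap
Note: this gap is exactly what the paper sets out to fill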

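B. Overall understanding of the paragraphs

Main idea:
First paragraph: the revolutionary impact of self-attention on NLP
Second paragraph: the similarities between video understanding and NLP, and the current state of video models

Key points:
Self-attention brought major breakthroughs in NLP
Video and language share high-level properties: both are sequential and both need context
Video models still rely mainly on convolution
A research gap remains: no video recognition model is built on self-attention alone
Relation between the paragraphs: the first gives background, the second gives the motivation and the status quo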
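Sentence 1:

Original: “In this work we pose the question of whether it may be possible to build a performant convolution-free video architecture by replacing altogether the convolution operator with self-attention.”
Paraphrase: The paper asks whether a high-performing video architecture without convolutions can be built by replacing the convolution operator entirely with self-attention.
Explanation:
Key terms:
performant: high-performing
convolution-free: using no convolutions
self-attention
Sentence structure: complex sentence whose main clause is "we pose the question"
This sentence opens the paragraph and states the research question

Sentence 2:

Original: “We argue that such a design has the potential to overcome a few inherent limitations of convolutional models for video analysis.”
Paraphrase: The authors argue that such a design could overcome some inherent limitations of convolutional models for video analysis.
Explanation:
Key terms:
inherent limitations
convolutional models
Sentence structure: simple declarative sentence
Follows from Sentence 1 and gives the motivation
Introduces the specific limitations discussed next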

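Sentence 3:

Original: “First, while their strong inductive biases (e.g., local connectivity and translation equivariance) are undoubtedly beneficial on small training sets, they may excessively limit the expressivity of the model in settings where there is ample availability of data and “all” can be learned from examples.”
Paraphrase: First, the strong inductive biases of convolutional models (for example local connectivity and translation equivariance) clearly help on small training sets, but they may limit the model's expressivity too much when data is plentiful and "everything" can be learned from examples.
Explanation:
Key terms:
inductive biases
local connectivity
translation equivariance
expressivity: the range of functions a model can represent
Sentence structure: long concessive sentence ("while ..., they may ...")
States the first limitation of convolutional models
Difficulty: the technical concepts and the logic of the long sentence

Sentence 4:

Original: “Compared to CNNs, Transformers impose less restrictive inductive biases. This broadens the family of functions they can represent (Cordonnier et al., 2020; Zhao et al., 2020), and renders them better suited to modern big-data regimes where there is less need for strong inductive priors.”
Paraphrase: Compared with CNNs, Transformers impose less restrictive inductive biases. This widens the family of functions they can represent (Cordonnier et al., 2020; Zhao et al., 2020) and makes them a better fit for modern big-data settings, where strong inductive priors are needed less.
Explanation:
Key terms:
CNNs: convolutional neural networks
Transformers
inductive priors: prior assumptions built into the model
Sentence structure: a comparison followed by its consequences
Presents the alternative, answering the limitation in Sentence 3
Backs the claim with citations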

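B. Summary of this part of the paragraph

Main idea: the motivation for replacing CNNs with Transformers, and the expected benefit
Key points:
The inductive biases of convolutional models can become a restriction when data is plentiful
The flexibility of Transformers suits modern big-data settings better
The claim is supported by citations and a theoretical argument
The passage forms a complete argument: research question, analysis of a limitation, proposed alternative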
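C. Background knowledge

Terminology:
Inductive bias: the prior assumptions or preferences a model brings to learning; for a fuller discussion see https://zhuanlan.zhihu.com/p/38861547
Inductive bias is an important concept in deep learning. It refers to the assumptions a model makes before seeing any data, and these assumptions determine how it generalizes from the training set. A CNN has two typical inductive biases: local connectivity and translation equivariance. Local connectivity assumes that nearby pixels are more strongly related than distant ones, so each unit looks only at a local region; this is loosely reminiscent of receptive fields in biological vision. Translation equivariance means that shifting the input shifts the feature map by the same amount, so the same filter detects a pattern wherever it appears. (Equivariance is not invariance: the output moves with the input rather than staying unchanged.)

An inductive bias is like a rule of thumb given to a child learning to recognize things, for example "something with four legs is probably a dog". With little data, such a rule speeds up learning and improves generalization, but it also has a cost: the child may struggle to recognize a kangaroo standing upright. The same holds in deep learning. On small datasets the built-in assumptions help the model learn and generalize faster. With abundant data they can become a restriction: the model could have learned more complex patterns on its own, and the fixed rules limit what it can express. This is the argument for why, in big-data settings, a Transformer with weaker built-in assumptions may have an advantage over a CNN with strong inductive biases.

CNN (Convolutional Neural Network): a neural network built from convolutions, widely used in computer vision
Transformer: a neural network architecture based on self-attention
Translation equivariance: the property that when the input is shifted, the output shifts correspondingly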
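
The difference between translation equivariance and translation invariance can be checked numerically. The sketch below is a toy illustration rather than anything from the paper: it assumes a 1D signal with circular (wrap-around) boundaries so that shifts are exact, and the helper names are hypothetical.

```python
def circular_conv1d(signal, kernel):
    """1D circular cross-correlation: out[i] = sum_j kernel[j] * signal[(i + j) % n]."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]

def shift(signal, s):
    """Cyclically shift a list to the right by s positions."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0]
kernel = [1.0, -2.0, 1.0]

# Equivariance: filtering a shifted input gives the shifted feature map.
conv_then_shift = shift(circular_conv1d(signal, kernel), 3)
shift_then_conv = circular_conv1d(shift(signal, 3), kernel)
equivariant = conv_then_shift == shift_then_conv

# Invariance appears only after pooling over positions (here: a global max).
invariant = (max(circular_conv1d(signal, kernel))
             == max(circular_conv1d(shift(signal, 3), kernel)))
```

The feature map itself moves with the input (equivariance); only a position-independent summary such as the global maximum stays the same (invariance).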
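Sentence 1:

Original: “Second, while convolutional kernels are specifically designed to capture short-range spatiotemporal information, they cannot model dependencies that extend beyond the receptive field.”
Paraphrase: Second, convolutional kernels are designed to capture short-range spatiotemporal information, and they cannot model dependencies that reach beyond the receptive field.
Explanation:
Key terms:
convolutional kernels
spatiotemporal
receptive field
Sentence structure: concessive sentence introduced by "while"
States the second limitation of CNNs
Difficulty: the technical terms

Sentence 2:

Original: “While deep stacks of convolutions (Simonyan & Zisserman, 2015; Szegedy et al., 2015; Carreira & Zisserman, 2017) naturally extend the receptive field, these strategies are inherently limited in capturing long-range dependencies by means of aggregation of shorter-range information.”
Paraphrase: Stacking many convolutions does enlarge the receptive field, but this way of reaching long-range dependencies, by aggregating shorter-range information, is inherently limited.
Explanation:
Key terms:
deep stacks
long-range dependencies
Sentence structure: concessive sentence
Explains why the existing workaround falls short
Supported by several citations

Sentence 3:

Original: “Conversely, the self-attention mechanism can be applied to capture both local as well as global long-range dependencies by directly comparing feature activations at all space-time locations, much beyond the receptive field of traditional convolutional filters.”
Paraphrase: Self-attention, in contrast, captures both local and global long-range dependencies by directly comparing feature activations at all space-time locations, far beyond the receptive field of conventional convolutional filters.
Explanation:
Key terms:
feature activations
global long-range dependencies
Sentence structure: contrast ("conversely")
Presents the alternative
Stresses the advantage of self-attention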

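B. Overall understanding of the paragraph

Main idea: contrast the limited reach of convolutions with the global reach of self-attention
Key points:
A single convolution sees only a local neighborhood, its receptive field
Stacking layers enlarges the receptive field but reaches long-range dependencies only indirectly, by aggregating short-range information
Self-attention compares all space-time locations directly and so handles global dependencies
Relation to the preceding text: continues the argument for replacing convolution with self-attention

C. Background knowledge

Terminology:
Receptive field: the region of the input that a given unit in the network can "see"
Convolutional kernel: the filter used to extract features; it defines the local connectivity pattern
Self-attention: a mechanism that computes the relation between any two positions in a sequence
Spatiotemporal information: information that spans both the spatial and the temporal dimensions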
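
How slowly a stack of convolutions extends its reach can be worked out with the standard receptive-field recurrence. The sketch below considers one axis only (for example time) and uses hypothetical helper names; the 96-frame clip is an arbitrary illustrative length, not a figure from the paper.

```python
def receptive_field(layers):
    """Receptive field along one axis of a stack of convolutions.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    Recurrence: rf += (kernel_size - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

def layers_to_cover(span, kernel_size=3):
    """Smallest number of stride-1 layers whose receptive field reaches `span` positions."""
    depth = 0
    while receptive_field([(kernel_size, 1)] * depth) < span:
        depth += 1
    return depth

three_layers = receptive_field([(3, 1)] * 3)   # 7 positions
ten_layers = receptive_field([(3, 1)] * 10)    # 21 positions
depth_for_96_frames = layers_to_cover(96)      # 48 stride-1 layers of kernel size 3
```

With stride 1 the receptive field grows by only `kernel_size - 1` per layer, whereas a single self-attention layer relates any two positions directly.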
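Sentence 1:

Original: “Finally, despite the advances in GPU hardware acceleration, training deep CNNs remains very costly, especially when applied to high-resolution and long videos.”
Paraphrase: Finally, despite progress in GPU acceleration, training deep CNNs is still very expensive, especially on high-resolution and long videos.
Explanation:
Key terms:
GPU hardware acceleration
high-resolution
Sentence structure: concessive sentence ("despite ...")
Introduces the third limitation of CNNs
Stresses computational cost

Sentence 2:

Original: “Recent work in the still image domain (Dosovitskiy et al., 2020; Carion et al., 2020; Zhao et al., 2020) has demonstrated that Transformers enjoy faster training and inference compared to CNNs, making it possible to construct models with larger learning capacity for the same computational budget.”
Paraphrase: Recent work on still images has shown that Transformers train and run faster than CNNs, so a model with larger learning capacity can be built within the same computational budget.
Explanation:
Key terms:
inference: running a trained model to make predictions
learning capacity
computational budget
Sentence structure: complex sentence with citations and a cause-and-effect relation
Presents the alternative
Supported by several citations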

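B. Overall understanding of the paragraph

Main idea: compare CNNs and Transformers in terms of computational efficiency
Key points:
Training CNNs is expensive
Transformers have been shown to train and run faster on still images
With the same compute, a Transformer can have larger capacity
Relation to the other paragraphs: this is the last of the three arguments, supporting Transformers from the efficiency side

C. Background knowledge

Terminology:
GPU hardware acceleration: using graphics processors to speed up deep learning computation
Learning capacity: a model's ability to learn and represent complex patterns
Computational budget: the limit on compute available for training and running a model
Inference: the process of making predictions with a trained model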
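Sentence 1:

Original: “Motivated by these observations, we propose a video architecture built exclusively on self-attention.”
Paraphrase: Motivated by these observations, the authors propose a video architecture built exclusively on self-attention.
Explanation:
Key terms:
exclusively: solely, with nothing else
self-attention
Sentence structure: simple declarative sentence
Links the preceding analysis to the proposed solution
Stresses what is distinctive about the proposal: self-attention only

Sentence 2:

Original: “We adapt the image model “Vision Transformer” (ViT) (Dosovitskiy et al., 2020) to video by extending the self-attention mechanism from the image space to the space-time 3D volume.”
Paraphrase: The image model Vision Transformer (ViT) is adapted to video by extending self-attention from the 2D image to the 3D space-time volume.
Explanation:
Key terms:
Vision Transformer (ViT)
space-time 3D volume: the volume spanned by the two spatial axes and time
Sentence structure: main clause plus a "by ..." phrase describing the method
Introduces the core technical approach

Sentence 3:

Original: “Our proposed model, named “TimeSformer” (from Time-Space Transformer), views the video as a sequence of patches extracted from the individual frames.”
Paraphrase: The proposed model, TimeSformer (from Time-Space Transformer), treats a video as a sequence of patches extracted from the individual frames.
Explanation:
Key terms:
TimeSformer: Time-Space Transformer
patches: image blocks
Sentence structure: declarative sentence with an appositive
Gives the model's name and basic principle

Sentences 4 and 5:

Original: “As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings which can be fed to a Transformer encoder, analogously to the token features computed from words in NLP.”
Paraphrase: As in ViT, each patch is linearly mapped to an embedding and augmented with positional information. The resulting sequence of vectors can then be treated as token embeddings and fed to a Transformer encoder, just like the token features computed from words in NLP.
Explanation:
Key terms:
embedding
positional information
token features
Sentence structure: explanation by analogy
Describes the technical realization in detail
Draws the analogy with NLP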

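B. Overall understanding of the paragraph

Main idea: the basic architecture and working principle of TimeSformer
Key points:
Built exclusively on self-attention
Extends ViT to video
Processes a video as a sequence of patches
Borrows the token-based processing of NLP
Role in the text: moves from the analysis of the problem to the concrete solution

C. Background knowledge

Terminology:
Vision Transformer: a Transformer model for vision tasks that takes image patches as input tokens
patch: a small block of an image or video frame
token embedding: the vector that represents one input unit (a word in NLP, a patch here)
Transformer encoder: the encoder part of the Transformer architecture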
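
The patch-to-token step described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' code: it assumes single-channel frames, tiny hypothetical sizes, random (untrained) weights, and it leaves out details such as a classification token.

```python
import random

def extract_patches(frame, patch):
    """Split an H x W frame (list of rows) into non-overlapping patch x patch
    blocks, each flattened to a vector, in row-major order."""
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([frame[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def embed(vec, weight, pos):
    """Linear map (weight has one row per output dimension) plus an additive
    positional embedding."""
    return [sum(w * x for w, x in zip(row, vec)) + p
            for row, p in zip(weight, pos)]

random.seed(0)
F, H, W, P, D = 2, 4, 4, 2, 3   # frames, height, width, patch size, embedding dim (toy values)
video = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(F)]
N = (H // P) * (W // P)         # patches per frame

# Untrained stand-ins for the learned projection and positional embeddings.
weight = [[random.gauss(0, 0.1) for _ in range(P * P)] for _ in range(D)]
pos = {(t, p): [random.gauss(0, 0.1) for _ in range(D)]
       for t in range(F) for p in range(N)}

# One token per (frame, patch) pair: F * N tokens of dimension D.
tokens = [embed(patch_vec, weight, pos[(t, p)])
          for t in range(F)
          for p, patch_vec in enumerate(extract_patches(video[t], P))]
```

The resulting list plays the same role as a sequence of word embeddings in NLP, which is what lets a standard Transformer encoder consume a video.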
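Sentences 1-2:

Original: “One downside of self-attention in standard Transformer is that it requires computing a similarity measure for all pairs of tokens. In our setting, this is computationally costly due to the large number of patches in the video.”
Paraphrase: A drawback of standard self-attention is that it computes a similarity for every pair of tokens. For video this is expensive, because a video contains a large number of patches.
Explanation:
Key terms: similarity measure
Sentence structure: states a problem and its cause
Identifies the central technical challenge
The challenge is specific to video, where the token count is high
Sentences 3-4:

Original: “To address these challenges, we propose several scalable self-attention designs over the space-time volume and empirically evaluate them over large-scale action classification datasets. Among the proposed schemes, we found that the best design is represented by a “divided attention” architecture which separately applies temporal attention and spatial attention within each block of the network.”
Paraphrase: To address this, the authors propose several scalable self-attention designs over the space-time volume and evaluate them empirically on large-scale action classification datasets. The best design is "divided attention", which applies temporal attention and spatial attention separately within each block of the network.
Explanation:
Key terms:
divided attention
temporal/spatial attention
Presents the solution
The choice of design is based on experiments, not only on argument
Sentences 5-6:

Original: “Compared to the established paradigm of convolution-based video architecture, TimeSformer follows a radically different design. Yet, it achieves accuracy comparable, and in some cases superior, to the state-of-the-art in this field.”
Paraphrase: TimeSformer departs radically from the established convolution-based paradigm, yet its accuracy is comparable to the state of the art and in some cases better.
Explanation:
Contrasts the model with existing methods
Stresses both novelty and effectiveness
The claim is measured: "comparable, and in some cases superior"
Sentence 7:

Original: “We also show that our model can be used for long-range modeling of videos spanning many minutes.”
Paraphrase: The authors also show that the model can be used for long-range modeling of videos that span many minutes.
Explanation:
Adds a further advantage
Points to practical value for long videos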
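B. Overall understanding of the paragraph

Main idea: the technical challenge TimeSformer faces, its solution, and its advantages
Key points:
Standard self-attention is costly because it compares all pairs of tokens
Divided attention is the best of the proposed designs
Accuracy is comparable or superior to existing methods
The model can handle long videos
C. Background knowledge

Terminology:
Divided attention: computing attention over the temporal dimension and over the spatial dimension in two separate steps
Action classification: a basic task in video understanding, assigning an action label to a clip
Long-range modeling: relating events that are far apart in a long sequence
Scalability: here, how the cost of attention grows as the number of tokens increases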
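
The saving from divided attention can be illustrated with a toy implementation. The sketch below is a simplification under stated assumptions, not the paper's block: it uses identity projections, a single head, no residual connections or MLP, and no classification token, and every function name is hypothetical. Tokens are stored in a dict keyed by `(frame, patch)`.

```python
import math

def attend(z, keys_for):
    """One attention pass with identity projections: the token at `pos` is
    replaced by the softmax-weighted average of the tokens at keys_for(pos)."""
    dim = len(next(iter(z.values())))
    out = {}
    for pos, query in z.items():
        keys = keys_for(pos)
        scores = [sum(a * b for a, b in zip(query, z[k])) / math.sqrt(dim)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out[pos] = [sum(e / total * z[k][i] for e, k in zip(exps, keys))
                    for i in range(dim)]
    return out

def joint_attention(z, F, N):
    """Every token attends to every token in the space-time volume."""
    everything = [(t, p) for t in range(F) for p in range(N)]
    return attend(z, lambda pos: everything)

def divided_attention(z, F, N):
    """Temporal attention first, then spatial attention on the result."""
    # Temporal step: same patch index, all frames.
    z = attend(z, lambda pos: [(t, pos[1]) for t in range(F)])
    # Spatial step: same frame, all patch indices.
    return attend(z, lambda pos: [(pos[0], p) for p in range(N)])

def comparisons_per_token(F, N):
    """Query-key comparisons per token (classification token not counted)."""
    return {"joint": F * N, "divided": F + N}

# Illustrative clip: 8 frames, 196 patches each (e.g. 224x224 frames, 16x16 patches).
cost = comparisons_per_token(8, 196)   # {'joint': 1568, 'divided': 204}

z = {(t, p): [float(t + 1), float(p)] for t in range(2) for p in range(3)}
out = divided_attention(z, 2, 3)
```

Each token still receives information from the whole clip after both steps, but the number of comparisons per token grows with `F + N` instead of `F * N`.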

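2 Related Work

Sentence 1:

Original: “Our approach is influenced by recent works that use self-attention for image classification, either in combination with the convolution operator or even as a full replacement for it.”
Paraphrase: The approach is influenced by recent work that uses self-attention for image classification, either together with convolution or as a complete replacement for it.
Explanation:
Key terms:
self-attention
convolution operator
Sentence structure: declarative sentence naming the sources of influence
Connects the paper to existing research
Distinguishes two lines of work: attention combined with convolution, and attention replacing convolution

Sentence 2:

Original: “Within the former class, Non-Local Networks (Wang et al., 2018b) employ a non-local mean that effectively generalizes the self-attention function of Transformers (Vaswani et al., 2017b).”
Paraphrase: In the first class, Non-Local Networks use a non-local mean operation that effectively generalizes the self-attention function of Transformers.
Explanation:
Key terms:
Non-Local Networks
non-local mean
Sentence structure: declarative sentence with citations
Describes the first line of work
Relates it to the Transformer

Sentence 3:

Original: “Bello et al. (Bello et al., 2019) propose a 2D self-attention mechanism that is competitive as a replacement of 2D convolution but gives even stronger results when used to augment convolutional features with self-attention features.”
Paraphrase: Bello et al. propose a 2D self-attention mechanism that is competitive as a replacement for 2D convolution, but gives even stronger results when self-attention features are used to augment convolutional features.
Explanation:
Key terms:
2D self-attention
augment: add to, strengthen
Sentence structure: contrast ("but")
Describes another way of combining the two operators
Notes that the combination outperformed pure replacement in that work

Sentence 4:

Original: “Beyond image categorization, Relation Networks (Hu et al., 2018) and DETR (Carion et al., 2020) use self-attention on top of convolutional feature maps for object detection.”
Paraphrase: Beyond image classification, Relation Networks and DETR apply self-attention on top of convolutional feature maps for object detection.
Explanation:
Key terms:
Relation Networks
DETR (DEtection TRansformer): an end-to-end object detector
feature maps
Sentence structure: two parallel examples
Extends the survey to another task
Shows that the technique is general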

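B. Overall understanding of the paragraph

Main idea: a review of how self-attention has been used in computer vision
Key points:
Two lines of work: self-attention combined with convolution, and self-attention replacing convolution
Concrete examples from different research groups
Applications to several tasks (classification, detection)
How the variants compare in effectiveness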

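Main technical terms: self-attention, queries, key-value sampling, axial computation, spatiotemporal volume, linear embeddings, Vision Transformers (ViT)

A. Sentence-by-sentence analysis

Sentences 1-2:

Original: “Our method is more closely related to image networks leveraging self-attention as a substitute for convolution. Since these works use individual pixels as queries, in order to maintain a manageable computational cost and a small memory consumption, they must restrict the scope of self-attention to local neighborhoods or use global self-attention on heavily downsized versions of the image.”
Paraphrase: The method is closer to image networks that use self-attention instead of convolution. Because those networks use individual pixels as queries, they must keep cost and memory manageable by restricting self-attention to local neighborhoods, or by applying global self-attention to heavily downsized images.
Explanation:
Key terms:
queries
local neighborhoods
global self-attention
Describes the limitations of existing methods
Explains where the computational cost comes from: one query per pixel

Sentences 3-4:

Original: “Alternative strategies for scalability to full images include sparse key-value sampling or constraining the self-attention to be calculated along the spatial axes. A few of the self-attention operators considered in our experiments adopt similar sparse and axial computation, although generalized to the spatiotemporal volume.”
Paraphrase: Other ways to scale to full images are sparse key-value sampling and restricting self-attention to the spatial axes. Some of the attention operators studied in the paper use similar sparse and axial computation, generalized to the spatiotemporal volume.
Explanation:
Key terms:
sparse key-value sampling
axial computation
Introduces alternative solutions
States how the paper builds on them

Sentences 5-6:

Original: “However, the efficiency of our approach stems mainly from decomposing the video into a sequence of frame-level patches and then feeding linear embeddings of these patches as input token embeddings to a Transformer. This strategy was recently introduced in Vision Transformers (ViT) which were shown to deliver impressive performance on image categorization.”
Paraphrase: The efficiency of the approach, however, comes mainly from splitting the video into a sequence of frame-level patches and feeding linear embeddings of these patches to a Transformer as input tokens. This strategy was introduced by Vision Transformers (ViT), which perform impressively on image classification.
Explanation:
Key terms:
frame-level patches
linear embeddings
Identifies the main source of efficiency: patches, not pixels, are the tokens
Credits ViT for the strategy

Sentence 7:

Original: “In this work, we build on the ViT design, and extend it to video by proposing and empirically comparing several scalable schemes for space-time self-attention over videos.”
Paraphrase: The paper builds on the ViT design and extends it to video by proposing and empirically comparing several scalable schemes for space-time self-attention.
Explanation:
States the contribution
Summarizes the method
Stresses empirical comparison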

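B. Overall understanding of the paragraph

Main idea: the technical lineage of the method and what it adds
Key points:
Limitations of existing pixel-level attention methods
Alternative strategies for scaling attention
Building on ViT and extending it to video
Empirical comparison of the proposed schemes

C. Background knowledge

Terminology:
Sparse key-value sampling: each query attends to a sampled subset of keys and values instead of all of them, which reduces the cost of attention
Axial computation: attention computed along one axis at a time (for example along rows, then along columns)
Spatiotemporal volume: the frames x height x width grid of positions in a video
Vision Transformers: Transformer models for vision tasks that operate on image patches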
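
The strategies mentioned in this paragraph differ mainly in which positions a query is allowed to attend to. The sketch below counts those positions for a single pixel query on a 2D image. It is only an illustration: the function names are hypothetical, the window radius is arbitrary, and the axial variant is simplified to the union of the query's row and column.

```python
def global_keys(query, H, W):
    """Global attention: every position in the image."""
    return [(y, x) for y in range(H) for x in range(W)]

def local_keys(query, H, W, radius):
    """Local attention: a square window around the query, clipped at the borders."""
    qy, qx = query
    return [(y, x)
            for y in range(max(0, qy - radius), min(H, qy + radius + 1))
            for x in range(max(0, qx - radius), min(W, qx + radius + 1))]

def axial_keys(query, H, W):
    """Axial attention (simplified): the query's row plus the query's column."""
    qy, qx = query
    row = [(qy, x) for x in range(W)]
    column = [(y, qx) for y in range(H) if y != qy]
    return row + column

H = W = 224
query = (100, 100)
counts = {
    "global": len(global_keys(query, H, W)),    # 50176
    "local": len(local_keys(query, H, W, 3)),   # 49 (7 x 7 window)
    "axial": len(axial_keys(query, H, W)),      # 447
}
```

With one query per pixel, the global variant compares 50176 positions for each of 50176 queries, which is why pixel-level methods restrict the scope of attention or shrink the image.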
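Sentence 1:

Original: “While Transformers have been recently used for video generation, we are not aware of prior video recognition architectures using self-attention as the exclusive building block.”
Paraphrase: Transformers have recently been used for video generation, but the authors know of no earlier video recognition architecture that uses self-attention as its only building block.
Explanation:
Key terms:
video generation
exclusive building block
Sentence structure: concessive sentence ("while ...")
Stresses the novelty of the work
Separates it from existing work
Sentence 2:

Original: “However, we note that Transformers have been adopted on top of convolutional feature maps for action localization and recognition, video classification, and group activity recognition.”
Paraphrase: Transformers have, however, been applied on top of convolutional feature maps for action localization and recognition, video classification, and group activity recognition.
Explanation:
Key terms:
action localization
group activity recognition
Surveys existing application areas
These are hybrid uses: a Transformer on top of a CNN
Sentence 3:

Original: “We also note that there is a wide literature based on the use of text Transformers combined with video CNNs to address various video-language tasks, such as captioning, question-answering and dialog.”
Paraphrase: There is also a large literature that combines text Transformers with video CNNs for video-language tasks such as captioning, question answering, and dialog.
Explanation:
Key terms:
video-language tasks
captioning: generating a text description of a video
Extends the survey to multimodal applications
Lists specific task types
Sentence 4:

Original: “Finally, multimodal video-text transformers have also been trained or pretrained in unsupervised fashion by adopting masked-token pretext tasks adapted from the language domain.”
Paraphrase: Finally, multimodal video-text Transformers have been trained or pretrained without supervision, using masked-token pretext tasks adapted from the language domain.
Explanation:
Key terms:
multimodal
masked-token pretext tasks
unsupervised
Describes the training method
Shows the link to language modeling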
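B. Overall understanding of the paragraph

Main idea: a review of how Transformers have been used in video-related tasks
Key points:
Transformers in video generation
Hybrid use on top of CNN features
Multimodal video-language tasks
Pretraining methods borrowed from NLP
C. Background knowledge

Terminology:
Multimodal: involving more than one type of data (for example video and text)
Pretext task: a self-supervised proxy task whose labels come from the data itself; it is used to learn representations, not as the end goal
Masked token: an input token that is hidden so that the model must predict it from the remaining context
Unsupervised learning: learning that does not require labeled data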
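
A masked-token pretext task can be sketched as a data-preparation step: hide some tokens and keep the hidden values as prediction targets. The code below is a generic illustration, not the procedure of any specific cited work; the `[MASK]` symbol, the 25% ratio, and the function name are assumptions.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real models use a dedicated mask embedding

def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens.

    Returns the corrupted sequence and a dict {position: original token}
    that the model would be trained to recover. No human labels are needed:
    the targets come from the input itself.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = corrupted[i]
        corrupted[i] = MASK
    return corrupted, targets

rng = random.Random(0)
tokens = ["a", "man", "is", "riding", "a", "horse", "outdoors", "."]
corrupted, targets = mask_tokens(tokens, 0.25, rng)

# Putting the targets back recovers the original sequence.
restored = [targets.get(i, tok) for i, tok in enumerate(corrupted)]
```

In a video-text model the same idea applies to a mixed sequence of word tokens and visual tokens.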