[Paper Translation] SlowFast Networks for Video Recognition

SlowFast Networks for Video Recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.

The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition.

Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept.

We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at:

https://github.com/facebookresearch/SlowFast

1. Introduction

It is customary in the recognition of images I(x, y) to treat the two spatial dimensions x and y symmetrically.

This is justified by the statistics of natural images, which are to a first approximation isotropic—all orientations are equally likely—and shift-invariant [41, 26].

But what about video signals I(x, y, t)? Motion is the spatiotemporal counterpart of orientation [2], but all spatiotemporal orientations are not equally likely.

Slow motions are more likely than fast motions (indeed most of the world we see is at rest at a given moment) and this has been exploited in Bayesian accounts of how humans perceive motion stimuli [58].

For example, if we see a moving edge in isolation, we perceive it as moving perpendicular to itself, even though in principle it could also have an arbitrary component of movement tangential to itself (the aperture problem in optical flow).

This percept is rational if the prior favors slow movements.

If all spatiotemporal orientations are not equally likely, then there is no reason for us to treat space and time symmetrically, as is implicit in approaches to video recognition based on spatiotemporal convolutions [49, 5].

We might instead “factor” the architecture to treat spatial structures and temporal events separately.

For concreteness, let us study this in the context of recognition.

The categorical spatial semantics of the visual content often evolve slowly.

Figure 1. A SlowFast network has a low frame rate, low temporal resolution Slow pathway and a high frame rate, α× higher temporal resolution Fast pathway. The Fast pathway is lightweight by using a fraction (β, e.g., 1/8) of channels. Lateral connections fuse them.

For example, waving hands do not change their identity as “hands” over the span of the waving action, and a person is always in the “person” category even though he/she can transit from walking to running.

So the recognition of the categorical semantics (as well as their colors, textures, lighting etc.) can be refreshed relatively slowly.

On the other hand, the motion being performed can evolve much faster than their subject identities, such as clapping, waving, shaking, walking, or jumping.

It can be desired to use fast refreshing frames (high temporal resolution) to effectively model the potentially fast changing motion.

Based on this intuition, we present a two-pathway SlowFast model for video recognition (Fig. 1).

One pathway is designed to capture semantic information that can be given by images or a few sparse frames, and it operates at low frame rates and slow refreshing speed.

In contrast, the other pathway is responsible for capturing rapidly changing motion, by operating at fast refreshing speed and high temporal resolution.

Despite its high temporal rate, this pathway is made very lightweight, e.g., ∼20% of total computation.

This is because this pathway is designed to have fewer channels and weaker ability to process spatial information, while such information can be provided by the first pathway in a less redundant manner.

We call the first a Slow pathway and the second a Fast pathway, driven by their different temporal speeds. The two pathways are fused by lateral connections.

Our conceptual idea leads to flexible and effective designs for video models. The Fast pathway, due to its lightweight nature, does not need to perform any temporal pooling—it can operate on high frame rates for all intermediate layers and maintain temporal fidelity.

Meanwhile, thanks to the lower temporal rate, the Slow pathway can be more focused on the spatial domain and semantics.

By treating the raw video at different temporal rates, our method allows the two pathways to have their own expertise on video modeling.

There is another well known architecture for video recognition which has a two-stream design [44], but provides conceptually different perspectives.

The Two-Stream method [44] has not explored the potential of different temporal speeds, a key concept in our method.

The two-stream method adopts the same backbone structure to both streams, whereas our Fast pathway is more lightweight.

Our method does not compute optical flow, and therefore, our models are learned end-to-end from the raw data.

In our experiments we observe that the SlowFast network is empirically more effective.

Our method is partially inspired by biological studies on the retinal ganglion cells in the primate visual system [27, 37, 8, 14, 51], though admittedly the analogy is rough and premature.

These studies found that in these cells, ∼80% are Parvocellular (P-cells) and ∼15-20% are Magnocellular (M-cells).

The M-cells operate at high temporal frequency and are responsive to fast temporal changes, but not sensitive to spatial detail or color.

P-cells provide fine spatial detail and color, but lower temporal resolution, responding slowly to stimuli.

Our framework is analogous in that:

(i) our model has two pathways separately working at low and high temporal resolutions;

(ii) our Fast pathway is designed to capture fast changing motion but fewer spatial details, analogous to M-cells;

and (iii) our Fast pathway is lightweight, similar to the small ratio of M-cells.

We hope these relations will inspire more computer vision models for video recognition.

We evaluate our method on the Kinetics-400 [30],Kinetics-600 [3], Charades [43] and AVA [20] datasets.

Our comprehensive ablation experiments on Kinetics action classification demonstrate the efficacy contributed by SlowFast.

SlowFast networks set a new state-of-the-art on all datasets with significant gains to previous systems in the literature.

2. Related Work

Spatiotemporal filtering. Actions can be formulated as spatiotemporal objects and captured by oriented filtering in spacetime, as done by HOG3D [31] and cuboids [10].

3D ConvNets [48, 49, 5] extend 2D image models [32, 45, 47, 24] to the spatiotemporal domain, handling both spatial and temporal dimensions similarly.

There are also related methods focusing on long-term filtering and pooling using temporal strides [52, 13, 55, 62], as well as decomposing the convolutions into separate 2D spatial and 1D temporal filters [12, 50, 61, 39].

Beyond spatiotemporal filtering or their separable versions, our work pursues a more thorough separation of modeling expertise by using two different temporal speeds.

Optical flow for video recognition.

There is a classical branch of research focusing on hand-crafted spatiotemporal features based on optical flow.

These methods, including histograms of flow [33], motion boundary histograms [6], and trajectories [53], had shown competitive performance for action recognition before the prevalence of deep learning.

In the context of deep neural networks, the two-stream method [44] exploits optical flow by viewing it as another input modality.

This method has been a foundation of many competitive results in the literature [12, 13, 55].

However, it is methodologically unsatisfactory given that optical flow is a hand-designed representation, and two-stream methods are often not learned end-to-end jointly with the flow.

3. SlowFast Networks

SlowFast networks can be described as a single stream architecture that operates at two different framerates, but we use the concept of pathways to reflect analogy with the biological Parvo- and Magnocellular counterparts.

Our generic architecture has a Slow pathway (Sec. 3.1) and a Fast pathway (Sec. 3.2), which are fused by lateral connections to a SlowFast network (Sec. 3.3). Fig. 1 illustrates our concept.

3.1. Slow pathway

The Slow pathway can be any convolutional model (e.g.,[12, 49, 5, 56]) that works on a clip of video as a spatiotemporal volume.

The key concept in our Slow pathway is a large temporal stride τ on input frames, i.e., it processes only one out of τ frames.

A typical value of τ we studied is 16—this refreshing speed is roughly 2 frames sampled per second for 30-fps videos. Denoting the number of frames sampled by the Slow pathway as T, the raw clip length is T × τ frames.

3.2. Fast pathway

In parallel to the Slow pathway, the Fast pathway is another convolutional model with the following properties.

High frame rate.

Our goal here is to have a fine representation along the temporal dimension. Our Fast pathway works with a small temporal stride of τ/α, where α > 1 is the frame rate ratio between the Fast and Slow pathways.

The two pathways operate on the same raw clip, so the Fast pathway samples αT frames, α times denser than the Slow pathway. A typical value is α = 8 in our experiments.
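To make the two sampling rates concrete, here is a minimal sketch (not the authors' code) of how the two pathways could pick frame indices from the same raw clip, assuming the typical values from the text: τ = 16, T = 4, α = 8, so the raw clip has T × τ = 64 frames.

```python
import numpy as np

def sample_pathway_indices(clip_len=64, tau=16, alpha=8):
    # Slow pathway: one frame out of every tau frames -> T = clip_len / tau frames.
    slow_idx = np.arange(0, clip_len, tau)           # [0, 16, 32, 48] -> T = 4 frames
    # Fast pathway: temporal stride tau / alpha -> alpha * T frames, alpha x denser.
    fast_idx = np.arange(0, clip_len, tau // alpha)  # [0, 2, 4, ..., 62] -> 32 frames
    return slow_idx, fast_idx

slow_idx, fast_idx = sample_pathway_indices()
print(len(slow_idx), len(fast_idx))  # 4 32
```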

The presence of α is in the key of the SlowFast concept (Fig. 1, time axis).

It explicitly indicates that the two pathways work on different temporal speeds, and thus drives the expertise of the two subnets instantiating the two pathways.

High temporal resolution features.

Our Fast pathway not only has a high input resolution, but also pursues high-resolution features throughout the network hierarchy.

In our instantiations, we use no temporal downsampling layers (neither temporal pooling nor time-strided convolutions) throughout the Fast pathway, until the global pooling layer before classification.

As such, our feature tensors always have αT frames along the temporal dimension, maintaining temporal fidelity as much as possible.

Low channel capacity.

Our Fast pathway also distinguishes with existing models in that it can use significantly lower channel capacity to achieve good accuracy for the SlowFast model. This makes it lightweight.

In a nutshell, our Fast pathway is a convolutional network analogous to the Slow pathway, but has a ratio of β (β < 1) channels of the Slow pathway. The typical value is β = 1/8 in our experiments.

Notice that the computation (floating number operations, or FLOPs) of a common layer is often quadratic in terms of its channel scaling ratio.
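As a rough illustration of this quadratic scaling (the layer sizes below are illustrative, not the paper's exact ones): the FLOPs of a k×k convolution scale with the product of input and output channels, so shrinking both by β reduces that layer's cost by roughly β².

```python
def conv_flops(c_in, c_out, k=3, h=56, w=56):
    # FLOPs of one k x k convolution over an h x w feature map, ignoring bias.
    return c_in * c_out * k * k * h * w

beta = 1 / 8
slow_cost = conv_flops(256, 256)
fast_cost = conv_flops(int(256 * beta), int(256 * beta))
print(fast_cost / slow_cost)  # 0.015625, i.e. beta**2 = 1/64 for this layer
```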

This is what makes the Fast pathway more computation-effective than the Slow pathway. In our instantiations, the Fast pathway typically takes ∼20% of the total computation.

Interestingly, as mentioned in Sec. 1, evidence suggests that ∼15-20% of the retinal cells in the primate visual system are M-cells (that are sensitive to fast motion but not color or spatial detail).

The low channel capacity can also be interpreted as a weaker ability of representing spatial semantics.

Technically, our Fast pathway has no special treatment on the spatial dimension, so its spatial modeling capacity should be lower than the Slow pathway because of fewer channels.

The good results of our model suggest that it is a desired tradeoff for the Fast pathway to weaken its spatial modeling ability while strengthening its temporal modeling ability.

Motivated by this interpretation, we also explore different ways of weakening spatial capacity in the Fast pathway, including reducing input spatial resolution and removing color information.

As we will show by experiments, these versions can all give good accuracy, suggesting that a lightweight Fast pathway with less spatial capacity can be made beneficial.

3.3. Lateral connections

The information of the two pathways is fused, so one pathway is not unaware of the representation learned by the other pathway.

We implement this by lateral connections, which have been used to fuse optical flow-based, two-stream networks [12, 13].

In image object detection, lateral connections [35] are a popular technique for merging different levels of spatial resolution and semantics.

Similar to [12, 35], we attach one lateral connection between the two pathways for every “stage” (Fig. 1).

Specifically for ResNets [24], these connections are right after pool1, res2, res3, and res4.

The two pathways have different temporal dimensions, so the lateral connections perform a transformation to match them (detailed in Sec. 3.4).

We use unidirectional connections that fuse features of the Fast pathway into the Slow one (Fig. 1). We have experimented with bidirectional fusion and found similar results.

Table 1. An example instantiation of the SlowFast network. The dimensions of kernels are denoted by {T×S², C} for temporal, spatial, and channel sizes. Strides are denoted as {temporal stride, spatial stride²}. Here the speed ratio is α = 8 and the channel ratio is β = 1/8. τ is 16. The green colors mark higher temporal resolution, and orange colors mark fewer channels, for the Fast pathway. Non-degenerate temporal filters are underlined. Residual blocks are shown by brackets. The backbone is ResNet-50.

Finally, a global average pooling is performed on each pathway’s output. Then two pooled feature vectors are concatenated as the input to the fully-connected classifier layer.
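A minimal PyTorch sketch of this classification head, assuming a ResNet-50 backbone so the Slow pathway ends with 2048 channels and the Fast pathway with βC = 256 channels; the class count and module names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SlowFastHead(nn.Module):
    def __init__(self, slow_channels=2048, fast_channels=256, num_classes=400):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)       # global average pool over (T, H, W)
        self.fc = nn.Linear(slow_channels + fast_channels, num_classes)

    def forward(self, slow_feat, fast_feat):
        # slow_feat: (N, C, T, S, S), fast_feat: (N, beta*C, alpha*T, S, S)
        slow = self.pool(slow_feat).flatten(1)    # (N, slow_channels)
        fast = self.pool(fast_feat).flatten(1)    # (N, fast_channels)
        return self.fc(torch.cat([slow, fast], dim=1))
```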

3.4. Instantiations

Our idea of SlowFast is generic, and it can be instantiated with different backbones (e.g., [45, 47, 24]) and implementation specifics.

In this subsection, we describe our instantiations of the network architectures.

An example SlowFast model is specified in Table 1. We denote spatiotemporal size by T×S², where T is the temporal length and S is the height and width of a square spatial crop. The details are described next.

Slow pathway.
The Slow pathway in Table 1 is a temporally strided 3D ResNet, modified from [12].

It has T = 4 frames as the network input, sparsely sampled from a 64-frame raw clip with a temporal stride τ = 16.

We opt to not perform temporal downsampling in this instantiation, as doing so would be detrimental when the input stride is large.

Unlike typical C3D / I3D models, we use non-degenerate temporal convolutions (temporal kernel size > 1, underlined in Table 1) only in res4 and res5; all filters from conv1 to res3 are essentially 2D convolution kernels in this pathway.

This is motivated by our experimental observation that using temporal convolutions in earlier layers degrades accuracy.

We argue that this is because when objects move fast and the temporal stride is large, there is little correlation within a temporal receptive field unless the spatial receptive field is large enough (i.e., in later layers).
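The following is an illustrative 3D bottleneck block (not the exact SlowFast residual block; the shortcut path is omitted) showing what a degenerate versus non-degenerate temporal kernel means here: res2/res3 of the Slow pathway would use temporal_kernel=1, i.e. effectively 2D filters, while res4/res5 would use temporal_kernel=3.

```python
import torch.nn as nn

def bottleneck(c_in, c_mid, c_out, temporal_kernel=1):
    pad_t = temporal_kernel // 2
    return nn.Sequential(
        # T x 1^2 conv: degenerate (temporal_kernel=1) or non-degenerate (e.g. 3).
        nn.Conv3d(c_in, c_mid, kernel_size=(temporal_kernel, 1, 1),
                  padding=(pad_t, 0, 0), bias=False),
        nn.BatchNorm3d(c_mid), nn.ReLU(inplace=True),
        # 1 x 3^2 spatial conv.
        nn.Conv3d(c_mid, c_mid, kernel_size=(1, 3, 3),
                  padding=(0, 1, 1), bias=False),
        nn.BatchNorm3d(c_mid), nn.ReLU(inplace=True),
        # 1 x 1^2 conv restoring the output width.
        nn.Conv3d(c_mid, c_out, kernel_size=1, bias=False),
        nn.BatchNorm3d(c_out),
    )
```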

Fast pathway.
Table 1 shows an example of the Fast pathway with α = 8 and β = 1/8. It has a much higher temporal resolution (green) and lower channel capacity (orange).

The Fast pathway has non-degenerate temporal convolutions in every block.

This is motivated by the observation that this pathway holds fine temporal resolution for the temporal convolutions to capture detailed motion.

Further, the Fast pathway has no temporal downsampling layers by design.

Lateral connections. Our lateral connections fuse from the Fast to the Slow pathway. This requires matching the sizes of the features before fusing.

Denoting the feature shape of the Slow pathway as {T, S², C}, the feature shape of the Fast pathway is {αT, S², βC}. We experiment with the following transformations in the lateral connections:

(i) Time-to-channel: We reshape and transpose {αT, S², βC} into {T, S², αβC}, meaning that we pack all α frames into the channels of one frame.

(ii) Time-strided sampling: We simply sample one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.

(iii) Time-strided convolution: We perform a 3D convolution of a 5×1² kernel with 2βC output channels and stride = α.

The output of the lateral connections is fused into the Slow pathway by summation or concatenation.
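A minimal sketch of option (iii), the time-strided convolution, fused by concatenation into the Slow pathway; the module name and the example channel/temporal sizes are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class LateralConnection(nn.Module):
    def __init__(self, c, alpha=8, beta=1 / 8):
        super().__init__()
        fast_c = int(beta * c)
        # 5 x 1^2 kernel, 2*beta*C output channels, temporal stride alpha.
        self.conv = nn.Conv3d(fast_c, 2 * fast_c, kernel_size=(5, 1, 1),
                              stride=(alpha, 1, 1), padding=(2, 0, 0), bias=False)

    def forward(self, slow_feat, fast_feat):
        fused = self.conv(fast_feat)                 # (N, 2*beta*C, T, S, S)
        return torch.cat([slow_feat, fused], dim=1)  # concatenate along channels

# Example shapes for one stage, assuming C = 256, T = 4, S = 56:
slow = torch.randn(1, 256, 4, 56, 56)
fast = torch.randn(1, 32, 32, 56, 56)
out = LateralConnection(256)(slow, fast)             # (1, 256 + 64, 4, 56, 56)
```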

4. Experiments: Action Classification

We evaluate our approach on four video recognition datasets using standard evaluation protocols.

For the action classification experiments presented in this section, we consider the widely used Kinetics-400 [30], the recent Kinetics-600 [3], and Charades [43].

For action detection experiments in Sec. 5, we use the challenging AVA dataset [20].

Training. Our models on Kinetics are trained from random initialization (“from scratch”), without using ImageNet [7] or any pre-training. We use synchronized SGD training following the recipe in [19]. See details in Appendix.

For the temporal domain, we randomly sample a clip (of αT×τ frames) from the full-length video, and the input to the Slow and Fast pathways are respectively T and αT frames; for the spatial domain, we randomly crop 224×224 pixels from a video, or its horizontal flip, with a shorter side randomly sampled in [256, 320] pixels [45, 56].
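A rough sketch of the spatial part of this training-time augmentation; the function name and the (C, T, H, W) tensor layout are assumptions, and the temporal clip sampling is omitted.

```python
import random
import torch
import torch.nn.functional as F

def spatial_augment(clip, crop_size=224, scale_range=(256, 320)):
    # clip: float tensor of shape (C, T, H, W).
    c, t, h, w = clip.shape
    # Rescale so the shorter side is a random value in [256, 320].
    short = random.randint(*scale_range)
    clip = F.interpolate(clip, scale_factor=short / min(h, w),
                         mode="bilinear", align_corners=False)
    _, _, h, w = clip.shape
    # Random 224 x 224 crop.
    y = random.randint(0, h - crop_size)
    x = random.randint(0, w - crop_size)
    clip = clip[:, :, y:y + crop_size, x:x + crop_size]
    # Random horizontal flip.
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[-1])
    return clip
```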

Inference.

Following common practice, we uniformly sample 10 clips from a video along its temporal axis.

For each clip, we scale the shorter spatial side to 256 pixels and take 3 crops of 256×256 to cover the spatial dimensions, as an approximation of fully-convolutional testing, following the code of [56]. We average the softmax scores for prediction.
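A sketch of this 30-view testing protocol; `sample_clip` and `three_crops` are hypothetical helpers standing in for the temporal sampling and spatial cropping described above.

```python
import torch

@torch.no_grad()
def multi_view_inference(model, video, num_clips=10, num_crops=3):
    scores = []
    for clip_idx in range(num_clips):
        clip = sample_clip(video, clip_idx, num_clips)   # one uniformly spaced temporal clip
        for crop in three_crops(clip, size=256):          # 3 spatial crops of 256 x 256
            logits = model(crop.unsqueeze(0))              # (1, num_classes)
            scores.append(logits.softmax(dim=1))
    return torch.stack(scores).mean(dim=0)                 # average over the 30 views
```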

We report the actual inference-time computation, as existing papers differ in their inference strategy for cropping/clipping in space and in time.

When comparing to previous work, we report the FLOPs per spacetime “view” (temporal clip with spatial crop) at inference and the number of views used.

Recall that in our case, the inference-time spatial size is 256² (instead of 224² for training) and 10 temporal clips each with 3 spatial crops are used (30 views).

Datasets. Kinetics-400 [30] consists of ∼240k training videos and 20k validation videos in 400 human action categories.

Kinetics-600 [3] has ∼392k training videos and 30k validation videos in 600 classes. We report top-1 and top-5 classification accuracy (%).

We report the computational cost (in FLOPs) of a single, spatially center-cropped clip.

Charades [43] has ∼9.8k training videos and 1.8k validation videos in 157 classes in a multi-label classification setting of longer activities spanning ∼30 seconds on average. Performance is measured in mean Average Precision (mAP).

4.1. Main Results

Kinetics-400. Table 2 shows the comparison with state-of-the-art results for our SlowFast instantiations using various input samplings (T×τ) and backbones: ResNet-50/101 (R50/101) [24] and Nonlocal (NL) [56].

In comparison to the previous state-of-the-art [56] our best model provides 2.1% higher top-1 accuracy.

Notably, all our results are substantially better than existing results that are also without ImageNet pre-training.

In particular, our model (79.8%) is 5.9% absolutely better than the previous best result of this kind (73.9%).

We have experimented with ImageNet pretraining for SlowFast networks and found that they perform similar (±0.3%) for both the pre-trained and the train from scratch (random initialization) variants.

Our results are achieved at low inference-time cost. We notice that many existing works (if reported) use extremely dense sampling of clips along the temporal axis, which can lead to >100 views at inference time.

This cost has been largely overlooked. In contrast, our method does not require many temporal clips, due to the high temporal resolution yet lightweight Fast pathway. Our cost per spacetime view can be low (e.g., 36.1 GFLOPs), while still being accurate.

The SlowFast variants from Table 2 (with different backbones and sample rates) are compared in Fig. 2 with their corresponding Slow-only pathway to assess the improvement brought by the Fast pathway. The horizontal axis measures model capacity for a single input clip of 256² spatial size, which is proportional to 1/30 of the overall inference cost.

Table 2. Comparison with the state-of-the-art on Kinetics-400. In the last column, we report the inference cost with a single “view” (temporal clip with spatial crop) × the number of such views used. The SlowFast models are with different input sampling (T×τ) and backbones (R-50, R-101, NL). “N/A” indicates the numbers are not available for us.

Figure 2. Accuracy/complexity tradeoff on Kinetics-400 for the SlowFast (green) vs. Slow-only (blue) architectures. SlowFast is consistently better than its Slow-only counterpart in all cases (green arrows). SlowFast provides higher accuracy and lower cost than temporally heavy Slow-only (e.g. red arrow). The complexity is for a single 256² view, and accuracy is obtained by 30-view testing.

Fig. 2 shows that for all variants the Fast pathway is able to consistently improve the performance of the Slow counterpart at comparatively low cost. The next subsection provides a more detailed analysis on Kinetics-400.

Kinetics-600 is relatively new, and existing results are limited. So our goal is mainly to provide results for future reference in Table 3.

Note that the Kinetics-600 validation set overlaps with the Kinetics-400 training set [3], and therefore we do not pre-train on Kinetics-400.

The winning entry [21] of the latest ActivityNet Challenge 2018 [15] reports a best single-model, single-modality accuracy of 79.0%.

Our variants show good performance with the best model at 81.8%. SlowFast results on the recent Kinetics-700 [4] are in [11].

Table 3. Comparison with the state-of-the-art on Kinetics-600. SlowFast models the same as in Table 2.

Table 4. Comparison with the state-of-the-art on Charades. All our variants are based on T×τ = 16×8, R-101.

Charades [43] is a dataset with longer range activities. Table 4 shows our SlowFast results on it. For fair comparison, our baseline is the Slow-only counterpart that has 39.0 mAP.

SlowFast increases over this baseline by 3.1 mAP (to 42.1), while the extra NL leads to an additional 0.4 mAP.

We also achieve 45.2 mAP when pre-trained on Kinetics-600. Overall, our SlowFast models in Table 4 outperform the previous best number (STRG [57]) by solid margins, at lower cost.
