SlowFast Networks for Video Recognition


We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.


The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition.


Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept.


We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.




It is customary in the recognition of images I(x, y) to treat the two spatial dimensions x and y symmetrically.


This is justified by the statistics of natural images, which are to a first approximation isotropic—all orientations are equally likely—and shift-invariant [41, 26].


But what about video signals I(x, y, t)? Motion is the spatiotemporal counterpart of orientation [2], but all spatiotemporal orientations are not equally likely.


Slow motions are more likely than fast motions (indeed most of the world we see is at rest at a given moment) and this has been exploited in Bayesian accounts of how humans perceive motion stimuli [58].


For example, if we see a moving edge in isolation, we perceive it as moving perpendicular to itself, even though in principle it could also have an arbitrary component of movement tangential to itself (the aperture problem in optical flow).


This percept is rational if the prior favors slow movements.


If all spatiotemporal orientations are not equally likely, then there is no reason for us to treat space and time symmetrically, as is implicit in approaches to video recognition based on spatiotemporal convolutions [49, 5].


We might instead “factor” the architecture to treat spatial structures and temporal events separately.


For concreteness, let us study this in the context of recognition.


The categorical spatial semantics of the visual content often evolve slowly.



Figure 1. A SlowFast network has a low frame rate, low temporal resolution Slow pathway and a high frame rate, α× higher temporal resolution Fast pathway. The Fast pathway is lightweight by using a fraction (β, e.g., 1/8) of channels. Lateral connections fuse them.


For example, waving hands do not change their identity as “hands” over the span of the waving action, and a person is always in the “person” category even though he/she can transition from walking to running.


So the recognition of the categorical semantics (as well as their colors, textures, lighting etc.) can be refreshed relatively slowly.


On the other hand, the motion being performed can evolve much faster than their subject identities, such as clapping, waving, shaking, walking, or jumping.


It can be desirable to use fast refreshing frames (high temporal resolution) to effectively model the potentially fast changing motion.


Based on this intuition, we present a two-pathway SlowFast model for video recognition (Fig. 1).


One pathway is designed to capture semantic information that can be given by images or a few sparse frames, and it operates at low frame rates and slow refreshing speed.


In contrast, the other pathway is responsible for capturing rapidly changing motion, by operating at fast refreshing speed and high temporal resolution.


Despite its high temporal rate, this pathway is made very lightweight, e.g., ∼20% of total computation.


This is because this pathway is designed to have fewer channels and weaker ability to process spatial information, while such information can be provided by the first pathway in a less redundant manner.


We call the first a Slow pathway and the second a Fast pathway, driven by their different temporal speeds. The two pathways are fused by lateral connections.


Our conceptual idea leads to flexible and effective designs for video models. The Fast pathway, due to its lightweight nature, does not need to perform any temporal pooling—it can operate on high frame rates for all intermediate layers and maintain temporal fidelity.


Meanwhile, thanks to the lower temporal rate, the Slow pathway can be more focused on the spatial domain and semantics.


By treating the raw video at different temporal rates, our method allows the two pathways to have their own expertise on video modeling.


There is another well known architecture for video recognition which has a two-stream design [44], but provides conceptually different perspectives.


The Two-Stream method [44] has not explored the potential of different temporal speeds, a key concept in our method.


The two-stream method adopts the same backbone structure to both streams, whereas our Fast pathway is more lightweight.


Our method does not compute optical flow, and therefore, our models are learned end-to-end from the raw data.


In our experiments we observe that the SlowFast network is empirically more effective.


Our method is partially inspired by biological studies on the retinal ganglion cells in the primate visual system [27, 37, 8, 14, 51], though admittedly the analogy is rough and premature.


These studies found that in these cells, ∼80% are Parvocellular (P-cells) and ∼15-20% are Magnocellular (M-cells).


The M-cells operate at high temporal frequency and are responsive to fast temporal changes, but not sensitive to spatial detail or color.


P-cells provide fine spatial detail and color, but lower temporal resolution, responding slowly to stimuli.


Our framework is analogous in that:


(i) our model has two pathways separately working at low and high temporal resolutions;


(ii) our Fast pathway is designed to capture fast changing motion but fewer spatial details, analogous to M-cells;


and (iii) our Fast pathway is lightweight, similar to the small ratio of M-cells.


We hope these relations will inspire more computer vision models for video recognition.


We evaluate our method on the Kinetics-400 [30], Kinetics-600 [3], Charades [43] and AVA [20] datasets.


Our comprehensive ablation experiments on Kinetics action classification demonstrate the efficacy contributed by SlowFast.


SlowFast networks set a new state-of-the-art on all datasets with significant gains over previous systems in the literature.


2. Related Work

Spatiotemporal filtering. Actions can be formulated as spatiotemporal objects and captured by oriented filtering in spacetime, as done by HOG3D [31] and cuboids [10].


3D ConvNets [48, 49, 5] extend 2D image models [32, 45, 47, 24] to the spatiotemporal domain, handling both spatial and temporal dimensions similarly.


There are also related methods focusing on long-term filtering and pooling using temporal strides [52, 13, 55, 62], as well as decomposing the convolutions into separate 2D spatial and 1D temporal filters [12, 50, 61, 39].


Beyond spatiotemporal filtering or their separable versions, our work pursues a more thorough separation of modeling expertise by using two different temporal speeds.


Optical flow for video recognition.


There is a classical branch of research focusing on hand-crafted spatiotemporal features based on optical flow.


These methods, including histograms of flow [33], motion boundary histograms [6], and trajectories [53], had shown competitive performance for action recognition before the prevalence of deep learning.


In the context of deep neural networks, the two-stream method [44] exploits optical flow by viewing it as another input modality.


This method has been a foundation of many competitive results in the literature [12, 13, 55].


However, it is methodologically unsatisfactory given that optical flow is a hand-designed representation, and two-stream methods are often not learned end-to-end jointly with the flow.


3. SlowFast Networks

SlowFast networks can be described as a single stream architecture that operates at two different framerates, but we use the concept of pathways to reflect analogy with the biological Parvo- and Magnocellular counterparts.


Our generic architecture has a Slow pathway (Sec. 3.1) and a Fast pathway (Sec. 3.2), which are fused by lateral connections to a SlowFast network (Sec. 3.3). Fig. 1 illustrates our concept.


3.1. Slow pathway

The Slow pathway can be any convolutional model (e.g., [12, 49, 5, 56]) that works on a clip of video as a spatiotemporal volume.


The key concept in our Slow pathway is a large temporal stride τ on input frames, i.e., it processes only one out of τ frames.


A typical value of τ we studied is 16—this refreshing speed is roughly 2 frames sampled per second for 30-fps videos. Denoting the number of frames sampled by the Slow pathway as T, the raw clip length is T × τ frames.


3.2. Fast pathway

In parallel to the Slow pathway, the Fast pathway is another convolutional model with the following properties.


High frame rate.


Our goal here is to have a fine representation along the temporal dimension. Our Fast pathway works with a small temporal stride of τ/α, where α > 1 is the frame rate ratio between the Fast and Slow pathways.


The two pathways operate on the same raw clip, so the Fast pathway samples αT frames, α times denser than the Slow pathway. A typical value is α = 8 in our experiments.
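The sampling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sample_indices` is hypothetical, sampling is assumed to start at the first frame of the raw clip, and α is assumed to divide τ so that the Fast stride τ/α is an integer.

```python
def sample_indices(num_slow_frames, tau, alpha):
    """Return (slow_idx, fast_idx): frame indices into a raw clip of T * tau frames.

    Assumptions of this sketch: sampling starts at frame 0 of the clip,
    and alpha divides tau so the Fast stride tau / alpha is an integer.
    """
    if tau % alpha != 0:
        raise ValueError("this sketch assumes alpha divides tau")
    clip_len = num_slow_frames * tau                    # raw clip length T * tau
    slow_idx = list(range(0, clip_len, tau))            # T frames, stride tau
    fast_idx = list(range(0, clip_len, tau // alpha))   # alpha * T frames, stride tau / alpha
    return slow_idx, fast_idx


# Typical values from the text: T = 4, tau = 16, alpha = 8 (a 64-frame raw clip).
slow_idx, fast_idx = sample_indices(num_slow_frames=4, tau=16, alpha=8)
print(len(slow_idx), slow_idx)      # 4 [0, 16, 32, 48]
print(len(fast_idx), fast_idx[:5])  # 32 [0, 2, 4, 6, 8]
```

With these values, the Slow pathway sees 4 frames and the Fast pathway sees 32 frames of the same 64-frame clip, so every Slow frame is also a Fast frame under the assumed common starting offset.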


The presence of α is key to the SlowFast concept (Fig. 1, time axis).


It explicitly indicates that the two pathways work on different temporal speeds, and thus drives the expertise of the two subnets instantiating the two pathways.


High temporal resolution features.


Our Fast pathway not only has a high input resolution, but also pursues high-resolution features throughout the network hierarchy.


In our instantiations, we use no temporal downsampling layers (neither temporal pooling nor time-strided convolutions) throughout the Fast pathway, until the global pooling layer before classification.


As such, our feature tensors always have αT frames along the temporal dimension, maintaining temporal fidelity as much as possible.


Low channel capacity.


Our Fast pathway also differs from existing models in that it can use significantly lower channel capacity to achieve good accuracy for the SlowFast model. This makes it lightweight.


In a nutshell, our Fast pathway is a convolutional network analogous to the Slow pathway, but with β (β < 1) times as many channels as the Slow pathway. The typical value is β = 1/8 in our experiments.


Notice that the computation (floating-point operations, or FLOPs) of a common layer is often quadratic in terms of its channel scaling ratio.


This is what makes the Fast pathway more computation-effective than the Slow pathway. In our instantiations, the Fast pathway typically takes ∼20% of the total computation.
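As a back-of-the-envelope check of the quadratic scaling, consider a layer whose input and output channels are both scaled by β and whose cost grows linearly with the number of frames. Under these assumptions the Fast-to-Slow cost ratio of that layer is roughly α·β². The function name below is hypothetical, and the estimate deliberately ignores differences in temporal kernel sizes and the cost of lateral connections.

```python
from fractions import Fraction


def fast_to_slow_flops_ratio(alpha, beta):
    """First-order per-layer cost ratio: alpha times more frames, beta^2 from channel scaling."""
    return alpha * beta ** 2


ratio = fast_to_slow_flops_ratio(Fraction(8), Fraction(1, 8))
share = ratio / (1 + ratio)  # Fast pathway share of the combined per-layer cost

print(ratio)         # 1/8
print(float(share))  # roughly 0.11
```

This first-order estimate (about 11%) is lower than the ∼20% reported for the instantiations. One plausible contributor is that the Fast pathway uses temporal convolutions in every block while the Slow pathway uses them only in later stages, which the sketch does not model.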


Interestingly, as mentioned in Sec. 1, evidence suggests that ∼15-20% of the retinal cells in the primate visual system are M-cells (that are sensitive to fast motion but not color or spatial detail).


The low channel capacity can also be interpreted as a weaker ability of representing spatial semantics.


Technically, our Fast pathway has no special treatment on the spatial dimension, so its spatial modeling capacity should be lower than the Slow pathway because of fewer channels.


The good results of our model suggest that it is a desired tradeoff for the Fast pathway to weaken its spatial modeling ability while strengthening its temporal modeling ability.


Motivated by this interpretation, we also explore different ways of weakening spatial capacity in the Fast pathway, including reducing input spatial resolution and removing color information.


As we will show by experiments, these versions can all give good accuracy, suggesting that a lightweight Fast pathway with less spatial capacity can be made beneficial.


3.3. Lateral connections


The information of the two pathways is fused, so one pathway is not unaware of the representation learned by the other pathway.


We implement this by lateral connections, which have been used to fuse optical flow-based, two-stream networks [12, 13].


In image object detection, lateral connections [35] are a popular technique for merging different levels of spatial resolution and semantics.


Similar to [12, 35], we attach one lateral connection between the two pathways for every “stage” (Fig. 1).


Specifically for ResNets [24], these connections are right after pool1, res2, res3, and res4.


The two pathways have different temporal dimensions, so the lateral connections perform a transformation to match them (detailed in Sec. 3.4).


We use unidirectional connections that fuse features of the Fast pathway into the Slow one (Fig. 1). We have experimented with bidirectional fusion and found similar results.


Table 1. An example instantiation of the SlowFast network. The dimensions of kernels are denoted by {T×S², C} for temporal, spatial, and channel sizes. Strides are denoted as {temporal stride, spatial stride²}. Here the speed ratio is α = 8 and the channel ratio is β = 1/8. τ is 16. The green colors mark higher temporal resolution, and orange colors mark fewer channels, for the Fast pathway. Non-degenerate temporal filters are underlined. Residual blocks are shown by brackets. The backbone is ResNet-50.


Finally, a global average pooling is performed on each pathway’s output. Then the two pooled feature vectors are concatenated as the input to the fully-connected classifier layer.
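The pooling and concatenation step can be illustrated with toy tensors stored as nested lists. This is only a sketch: the helper names and the tiny sizes (16 Slow channels, 2 Fast channels, 2×2 spatial maps) are illustrative assumptions, chosen to respect α = 8 and β = 1/8.

```python
def global_average_pool(features):
    """Average each channel over time, height and width.

    features is a nested list indexed as [channel][time][height][width].
    """
    pooled = []
    for channel in features:
        values = [v for frame in channel for row in frame for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def make_features(channels, frames, size, fill):
    """Build a constant toy feature tensor (hypothetical helper for this example)."""
    return [[[[fill] * size for _ in range(size)] for _ in range(frames)]
            for _ in range(channels)]


# Toy sizes: Slow has C = 16 channels and T = 4 frames;
# Fast has beta * C = 2 channels and alpha * T = 32 frames.
slow = make_features(channels=16, frames=4, size=2, fill=1.0)
fast = make_features(channels=2, frames=32, size=2, fill=3.0)

# One pooled vector per pathway, concatenated along the channel axis.
classifier_input = global_average_pool(slow) + global_average_pool(fast)
print(len(classifier_input))  # 18
```

Because pooling removes the temporal axis, the two pathways can be concatenated at this point even though they carry different numbers of frames.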


3.4. Instantiations


Our idea of SlowFast is generic, and it can be instantiated with different backbones (e.g., [45, 47, 24]) and implementation specifics.


In this subsection, we describe our instantiations of the network architectures.


An example SlowFast model is specified in Table 1. We denote spatiotemporal size by T×S² where T is the temporal length and S is the height and width of a square spatial crop. The details are described next.


Slow pathway.
The Slow pathway in Table 1 is a temporally strided 3D ResNet, modified from [12].


It has T = 4 frames as the network input, sparsely sampled from a 64-frame raw clip with a temporal stride τ = 16.


We opt to not perform temporal downsampling in this instantiation, as doing so would be detrimental when the input stride is large.


Unlike typical C3D / I3D models, we use non-degenerate temporal convolutions (temporal kernel size > 1, underlined in Table 1) only in res4 and res5; all filters from conv1 to res3 are essentially 2D convolution kernels in this pathway.


This is motivated by our experimental observation that using temporal convolutions in earlier layers degrades accuracy.


We argue that this is because when objects move fast and the temporal stride is large, there is little correlation within a temporal receptive field unless the spatial receptive field is large enough (i.e., in later layers).


Fast pathway.
Table 1 shows an example of the Fast pathway with α = 8 and β = 1/8. It has a much higher temporal resolution (green) and lower channel capacity (orange).


The Fast pathway has non-degenerate temporal convolutions in every block.


This is motivated by the observation that this pathway holds fine temporal resolution for the temporal convolutions to capture detailed motion.


Further, the Fast pathway has no temporal downsampling layers by design.


Lateral connections. Our lateral connections fuse from the Fast to the Slow pathway. This requires matching the sizes of the features before fusing.


Denoting the feature shape of the Slow pathway as {T, S², C}, the feature shape of the Fast pathway is {αT, S², βC}. We experiment with the following transformations in the lateral connections:


(i) Time-to-channel: We reshape and transpose {αT, S², βC} into {T, S², αβC}, meaning that we pack all α frames into the channels of one frame.


(ii) Time-strided sampling: We simply sample one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.


(iii) Time-strided convolution: We perform a 3D convolution of a 5×1² kernel with 2βC output channels and stride = α.


The output of the lateral connections is fused into the Slow pathway by summation or concatenation.
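The shape bookkeeping of the three transformations can be checked with a short sketch. Shapes are written as (frames, spatial side, channels), mirroring {T, S², C}. The function names are hypothetical, the temporal padding of 2 in the convolution is an assumption (the text does not state it) chosen so that the output has T frames, and the example sizes (S = 56, C = 256) are illustrative.

```python
def time_to_channel(shape, alpha):
    """(alpha*T, S, beta*C) -> (T, S, alpha*beta*C): pack alpha frames into channels."""
    t, s, c = shape
    if t % alpha != 0:
        raise ValueError("frame count must be a multiple of alpha")
    return (t // alpha, s, alpha * c)


def time_strided_sampling(shape, alpha):
    """(alpha*T, S, beta*C) -> (T, S, beta*C): keep one out of every alpha frames."""
    t, s, c = shape
    return (len(range(0, t, alpha)), s, c)


def time_strided_conv(shape, alpha, kernel=5, padding=2, channel_multiplier=2):
    """(alpha*T, S, beta*C) -> (T, S, 2*beta*C) using the standard conv output-length formula.

    padding=2 is an assumption of this sketch, not a value given in the text.
    """
    t, s, c = shape
    out_t = (t + 2 * padding - kernel) // alpha + 1
    return (out_t, s, channel_multiplier * c)


# Illustrative sizes: T = 4, alpha = 8, beta = 1/8, C = 256, S = 56.
slow_shape = (4, 56, 256)
fast_shape = (8 * 4, 56, 256 // 8)  # (32, 56, 32)

print(time_to_channel(fast_shape, alpha=8))        # (4, 56, 256)
print(time_strided_sampling(fast_shape, alpha=8))  # (4, 56, 32)
print(time_strided_conv(fast_shape, alpha=8))      # (4, 56, 64)

# Fusing by concatenation adds the channel counts, e.g. for the convolution variant:
fused_channels = slow_shape[2] + time_strided_conv(fast_shape, alpha=8)[2]
print(fused_channels)  # 320
```

All three variants return T frames, so the result is aligned with the Slow features in time. In this example αβ = 1, so time-to-channel yields exactly C channels and could be fused by summation, whereas the other two variants change the channel count and fit concatenation more naturally.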


4. Experiments: Action Classification

We evaluate our approach on four video recognition datasets using standard evaluation protocols.


For the action classification experiments presented in this section, we consider the widely used Kinetics-400 [30], the recent Kinetics-600 [3], and Charades [43].


For action detection experiments in Sec. 5, we use the challenging AVA dataset [20].


Training. Our models on Kinetics are trained from random initialization (“from scratch”), without using ImageNet [7] or any pre-training. We use synchronized SGD training following the recipe in [19]. See details in Appendix.


For the temporal domain, we randomly sample a clip (of αT×τ frames) from the full-length video, and the inputs to the Slow and Fast pathways are respectively T and αT frames; for the spatial domain, we randomly crop 224×224 pixels from a video, or its horizontal flip, with a shorter side randomly sampled in [256, 320] pixels [45, 56].




Inference. Following common practice, we uniformly sample 10 clips from a video along its temporal axis.


For each clip, we scale the shorter spatial side to 256 pixels and take 3 crops of 256×256 to cover the spatial dimensions, as an approximation of fully-convolutional testing, following the code of [56]. We average the softmax scores for prediction.


We report the actual inference-time computation. As existing papers differ in their inference strategy for cropping/clipping in space and in time, when comparing to previous work we report the FLOPs per spacetime “view” (temporal clip with spatial crop) at inference and the number of views used.


Recall that in our case, the inference-time spatial size is 256² (instead of 224² for training) and 10 temporal clips each with 3 spatial crops are used (30 views).


Datasets. Kinetics-400 [30] consists of ∼240k training videos and 20k validation videos in 400 human action categories.


Kinetics-600 [3] has ∼392k training videos and 30k validation videos in 600 classes. We report top-1 and top-5 classification accuracy (%).


We report the computational cost (in FLOPs) of a single, spatially center-cropped clip.


Charades [43] has ∼9.8k training videos and 1.8k validation videos in 157 classes in a multi-label classification setting of longer activities spanning ∼30 seconds on average. Performance is measured in mean Average Precision (mAP).


4.1. Main Results

Kinetics-400. Table 2 shows the comparison with state-of-the-art results for our SlowFast instantiations using various input samplings (T×τ) and backbones: ResNet-50/101 (R50/101) [24] and Nonlocal (NL) [56].


In comparison to the previous state-of-the-art [56] our best model provides 2.1% higher top-1 accuracy.


Notably, all our results are substantially better than existing results that are also without ImageNet pre-training.


In particular, our model (79.8%) is 5.9% better in absolute terms than the previous best result of this kind (73.9%).


We have experimented with ImageNet pre-training for SlowFast networks and found that the pre-trained and the trained-from-scratch (random initialization) variants perform similarly (±0.3%).


Our results are achieved at low inference-time cost. We notice that many existing works (if reported) use extremely dense sampling of clips along the temporal axis, which can lead to >100 views at inference time.


This cost has been largely overlooked. In contrast, our method does not require many temporal clips, due to the high temporal resolution yet lightweight Fast pathway. Our cost per spacetime view can be low (e.g., 36.1 GFLOPs), while still being accurate.
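To make the cost accounting concrete, the total inference cost per video is the per-view cost multiplied by the number of views. The sketch below uses the 36.1 GFLOPs per-view example and assumes the 10-clip, 3-crop protocol described earlier applies to that model; the function name is hypothetical.

```python
def inference_cost(gflops_per_view, num_clips=10, num_crops=3):
    """Return (number of views, total GFLOPs per video) for clip x crop testing."""
    views = num_clips * num_crops
    return views, gflops_per_view * views


views, total_gflops = inference_cost(36.1)
print(views)                   # 30
print(round(total_gflops, 1))  # 1083.0
```

Under the same accounting, a method evaluated with more than 100 views pays more than 100 times its per-view cost, which is why the per-view FLOPs and the number of views are reported together.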


The SlowFast variants from Table 2 (with different backbones and sample rates) are compared in Fig. 2 with their corresponding Slow-only pathway to assess the improvement brought by the Fast pathway. The horizontal axis measures model capacity for a single input clip of 256² spatial size, which is proportional to 1/30 of the overall inference cost.


Table 2. Comparison with the state-of-the-art on Kinetics-400. In the last column, we report the inference cost with a single “view” (temporal clip with spatial crop) × the number of such views used. The SlowFast models use different input samplings (T×τ) and backbones (R-50, R-101, NL). “N/A” indicates the numbers are not available to us.


Figure 2. Accuracy/complexity tradeoff on Kinetics-400 for the SlowFast (green) vs. Slow-only (blue) architectures. SlowFast is consistently better than its Slow-only counterpart in all cases (green arrows). SlowFast provides higher accuracy and lower cost than temporally heavy Slow-only (e.g. red arrow). The complexity is for a single 256² view, and accuracy is obtained by 30-view testing.


Fig. 2 shows that for all variants the Fast pathway is able to consistently improve the performance of the Slow counterpart at comparatively low cost. The next subsection provides a more detailed analysis on Kinetics-400.


Kinetics-600 is relatively new, and existing results are limited. So our goal is mainly to provide results for future reference in Table 3.


Note that the Kinetics-600 validation set overlaps with the Kinetics-400 training set [3], and therefore we do not pre-train on Kinetics-400.


The winning entry [21] of the latest ActivityNet Challenge 2018 [15] reports a best single-model, single-modality accuracy of 79.0%.


Our variants show good performance with the best model at 81.8%. SlowFast results on the recent Kinetics-700 [4] are in [11].


Table 3. Comparison with the state-of-the-art on Kinetics-600. SlowFast models are the same as in Table 2.


Table 4. Comparison with the state-of-the-art on Charades. All our variants are based on T×τ = 16×8, R-101.


Charades [43] is a dataset with longer range activities. Table 4 shows our SlowFast results on it. For fair comparison, our baseline is the Slow-only counterpart that has 39.0 mAP.


SlowFast increases over this baseline by 3.1 mAP (to 42.1), while the extra NL leads to an additional 0.4 mAP.


We also achieve 45.2 mAP when pre-trained on Kinetics-600. Overall, our SlowFast models in Table 4 outperform the previous best number (STRG [57]) by solid margins, at lower cost.






