SlowFast Networks for Video Recognition
(2019 ICCV)
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
Notes
Contributions
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.
Method
SlowFast networks can be described as a single stream architecture that operates at two different framerates.
Slow Pathway
The key concept in our Slow pathway is a large temporal stride τ on input frames, i.e., it processes only one out of every τ frames. A typical value we studied is τ = 16; for 30-fps video, this refresh rate corresponds to roughly 2 sampled frames per second. Denoting the number of frames sampled by the Slow pathway as T, the raw clip length is T × τ frames.
Fast Pathway
Our Fast pathway works with a small temporal stride of τ/α, where α > 1 is the frame-rate ratio between the Fast and Slow pathways. The two pathways operate on the same raw clip, so the Fast pathway samples αT frames, α times denser than the Slow pathway. A typical value is α = 8 in our experiments. The Fast pathway is a convolutional network analogous to the Slow pathway, but with only a fraction β (β < 1) of the Slow pathway's channels; the typical value is β = 1/8 in our experiments.
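The two sampling schedules can be sketched as follows. This is a minimal illustration using the typical values from the notes (τ = 16, α = 8, T = 4); the variable names are my own, not from the paper.

```python
import numpy as np

# Typical settings from the notes: stride tau = 16, speed ratio alpha = 8,
# T = 4 Slow-pathway frames, so the raw clip spans T * tau = 64 frames.
tau, alpha, T = 16, 8, 4
raw_clip_len = T * tau  # 64 raw frames

# Slow pathway: one frame out of every tau.
slow_idx = np.arange(0, raw_clip_len, tau)           # 4 frames: [0, 16, 32, 48]

# Fast pathway: stride tau / alpha, i.e. alpha times denser.
fast_idx = np.arange(0, raw_clip_len, tau // alpha)  # alpha * T = 32 frames

print(len(slow_idx), len(fast_idx))
```

Note that both index lists cover the same 64-frame raw clip; only the sampling density differs.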
Lateral Connections
Similar to [12, 35], we attach one lateral connection between the two pathways for every “stage” (Fig. 1). Specifically for ResNets [24], these connections are right after pool1, res2, res3, and res4. The two pathways have different temporal dimensions, so the lateral connections perform a transformation to match them:
- Time-to-channel: We reshape and transpose {αT, S², βC} into {T, S², αβC}, meaning that we pack all α frames into the channels of one frame.
- Time-strided sampling: We simply sample one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.
- Time-strided convolution: We perform a 3D convolution with a 5×1² kernel, 2βC output channels, and stride = α.
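The three transformations above can be sketched with NumPy on a channels-last Fast-pathway feature map of shape (αT, S, S, βC). This is an illustrative shape exercise under assumed values (α = 8, T = 4, S = 7, βC = 8); the helper `time_strided_conv` is my own naive implementation of a 5×1² strided temporal convolution, not the paper's code.

```python
import numpy as np

alpha, T, S, bC = 8, 4, 7, 8            # assumed illustrative sizes
x = np.random.randn(alpha * T, S, S, bC)  # Fast-pathway feature map (channels last)

# (a) Time-to-channel: pack each group of alpha consecutive frames into channels.
t2c = (x.reshape(T, alpha, S, S, bC)
        .transpose(0, 2, 3, 1, 4)
        .reshape(T, S, S, alpha * bC))   # -> (T, S, S, alpha*bC)

# (b) Time-strided sampling: keep one frame out of every alpha.
tss = x[::alpha]                         # -> (T, S, S, bC)

# (c) Time-strided convolution: 5x1x1 kernel, 2*bC output channels, stride alpha.
def time_strided_conv(x, w, stride, pad=2):
    # x: (Tin, S, S, Cin); w: (5, Cin, Cout). The 1x1 spatial extent of the
    # kernel means only the time axis is convolved.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out_t = (x.shape[0] + 2 * pad - 5) // stride + 1
    return np.stack([
        np.tensordot(xp[t * stride: t * stride + 5], w, axes=([0, 3], [0, 1]))
        for t in range(out_t)
    ])                                   # -> (out_t, S, S, Cout)

w = np.random.randn(5, bC, 2 * bC)
tsc = time_strided_conv(x, w, stride=alpha)  # -> (T, S, S, 2*bC)
```

All three variants end up with T frames on the time axis, matching the Slow pathway, which is exactly what the lateral connection needs.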
The output of the lateral connections is fused into the Slow pathway by summation or concatenation. We use unidirectional connections that fuse Fast-pathway features into the Slow pathway (Fig. 1). Finally, global average pooling is performed on each pathway's output; the two pooled feature vectors are then concatenated as the input to the fully-connected classifier layer. The network architecture details are shown in the table.
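The classifier head described above (per-pathway global average pooling, concatenation, then a fully-connected layer) can be sketched as below. The sizes (C = 64, βC = 8, 10 classes) and the random weight matrix are placeholders for illustration only.

```python
import numpy as np

T, alpha, S = 4, 8, 7
C, bC, num_classes = 64, 8, 10           # assumed illustrative sizes

slow_feat = np.random.randn(T, S, S, C)          # Slow pathway output
fast_feat = np.random.randn(alpha * T, S, S, bC)  # Fast pathway output

# Global average pooling over time and space, then concatenate the pathways.
pooled = np.concatenate([slow_feat.mean(axis=(0, 1, 2)),
                         fast_feat.mean(axis=(0, 1, 2))])  # (C + bC,)

# Fully-connected classifier layer (random placeholder weights).
W = np.random.randn(C + bC, num_classes)
logits = pooled @ W                               # (num_classes,)
```

Pooling each pathway separately before concatenation lets the two feature vectors keep their different channel widths (C vs. βC) while still feeding a single classifier.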
Results