[Paper Reading Notes] (2019 ICCV) SlowFast Networks for Video Recognition

SlowFast Networks for Video Recognition

(2019 ICCV)

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

Notes

Contributions

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.


Method

SlowFast networks can be described as a single stream architecture that operates at two different framerates.

Slow Pathway

The key concept in our Slow pathway is a large temporal stride τ on input frames, i.e., it processes only one out of τ frames. A typical value of τ we studied is 16—this refreshing speed is roughly 2 frames sampled per second for 30-fps videos. Denoting the number of frames sampled by the Slow pathway as T, the raw clip length is T × τ frames.
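This sampling can be sketched in a few lines (the helper name is mine, not from the paper): with τ = 16 and T = 4, a raw clip of T × τ = 64 frames yields 4 slow frames.

```python
# Hypothetical sketch of Slow-pathway frame sampling (function name is illustrative).
def slow_indices(num_raw_frames, tau=16):
    """Sample one frame out of every `tau` raw frames."""
    return list(range(0, num_raw_frames, tau))

# A raw clip of T * tau = 4 * 16 = 64 frames yields T = 4 slow frames.
print(slow_indices(64))  # [0, 16, 32, 48]
```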

Fast Pathway

Our Fast pathway works with a small temporal stride of τ/α, where α > 1 is the frame rate ratio between the Fast and Slow pathways. The two pathways operate on the same raw clip, so the Fast pathway samples αT frames, α times denser than the Slow pathway. A typical value is α = 8 in our experiments. Our Fast pathway is a convolutional network analogous to the Slow pathway, but has only a fraction β (β < 1) of the Slow pathway's channels. The typical value is β = 1/8 in our experiments.
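Putting the two pathways side by side (typical values τ = 16, α = 8 from the paper; the helper name and return format are my own), the Fast pathway's stride becomes τ/α = 2, so it sees α times as many frames of the same clip:

```python
# Illustrative sketch of both pathways sampling the same raw clip.
def pathway_frames(T=4, tau=16, alpha=8):
    clip_len = T * tau                              # raw clip covers T * tau frames
    slow = list(range(0, clip_len, tau))            # stride tau       -> T frames
    fast = list(range(0, clip_len, tau // alpha))   # stride tau/alpha -> alpha*T frames
    return slow, fast

slow, fast = pathway_frames()
print(len(slow), len(fast))  # 4 32

# Channel ratio: with beta = 1/8, a Slow stage with C = 256 channels pairs
# with a Fast stage of beta*C = 32 channels.
```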

Lateral Connections

Similar to [12, 35], we attach one lateral connection between the two pathways for every “stage” (Fig. 1). Specifically for ResNets [24], these connections are right after pool1, res2, res3, and res4. The two pathways have different temporal dimensions, so the lateral connections perform a transformation to match them:

  1. Time-to-channel: We reshape and transpose {αT, S², βC} into {T, S², αβC}, meaning that we pack all α frames into the channels of one frame.
  2. Time-strided sampling: We simply sample one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.
  3. Time-strided convolution: We perform a 3D convolution with a 5×1² kernel, 2βC output channels, and stride = α.
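The three options above can be sketched in PyTorch. All sizes here are illustrative (they are not tied to any specific stage of the network), and feature maps are assumed to be laid out as (batch, channels, time, height, width):

```python
import torch
import torch.nn as nn

# Illustrative shapes; C is the Slow pathway's channel count at this stage.
B, C, T, S = 2, 64, 4, 8           # batch, slow channels, slow frames, spatial size
alpha, beta = 8, 1 / 8
bC = int(beta * C)                 # Fast pathway channels (beta*C = 8)

fast = torch.randn(B, bC, alpha * T, S, S)   # the Fast feature map {alpha*T, S^2, beta*C}

# 1. Time-to-channel: pack each group of alpha consecutive frames into channels
#    -> {T, S^2, alpha*beta*C}
t2c = (fast.reshape(B, bC, T, alpha, S, S)
           .permute(0, 1, 3, 2, 4, 5)
           .reshape(B, alpha * bC, T, S, S))

# 2. Time-strided sampling: keep one of every alpha frames -> {T, S^2, beta*C}
sampled = fast[:, :, ::alpha]

# 3. Time-strided convolution: 5x1^2 kernel, 2*beta*C output channels,
#    temporal stride alpha -> {T, S^2, 2*beta*C}
conv = nn.Conv3d(bC, 2 * bC, kernel_size=(5, 1, 1),
                 stride=(alpha, 1, 1), padding=(2, 0, 0))
convd = conv(fast)

print(t2c.shape, sampled.shape, convd.shape)
```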

The output of the lateral connections is fused into the Slow pathway by summation or concatenation. We use unidirectional connections that fuse features of the Fast pathway into the Slow one (Fig. 1). Finally, global average pooling is performed on each pathway’s output, and the two pooled feature vectors are concatenated as the input to the fully-connected classifier layer. The network architecture details are given in the paper's architecture table.
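The pool-concatenate-classify head can be sketched as follows (a minimal sketch; the class name, feature dimensions, and class count are illustrative, with 400 chosen to match Kinetics-400):

```python
import torch
import torch.nn as nn

# Hypothetical classifier head: global-average-pool each pathway,
# concatenate the pooled vectors, then classify.
class SlowFastHead(nn.Module):
    def __init__(self, slow_dim=2048, fast_dim=256, num_classes=400):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)           # global average pool over T, H, W
        self.fc = nn.Linear(slow_dim + fast_dim, num_classes)

    def forward(self, slow_feat, fast_feat):
        s = self.pool(slow_feat).flatten(1)           # (B, slow_dim)
        f = self.pool(fast_feat).flatten(1)           # (B, fast_dim)
        return self.fc(torch.cat([s, f], dim=1))      # (B, num_classes)

head = SlowFastHead()
logits = head(torch.randn(2, 2048, 4, 7, 7),          # Slow output: T frames
              torch.randn(2, 256, 32, 7, 7))          # Fast output: alpha*T frames
print(logits.shape)  # torch.Size([2, 400])
```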


Results
