SlowFast Networks for Video Recognition
(2019 ICCV)
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
Notes
Contributions
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.
Method
SlowFast networks can be described as a single stream architecture that operates at two different framerates.
Slow Pathway
The key concept in our Slow pathway is a large temporal stride τ on input frames, i.e., it processes only one out of every τ frames. A typical value we studied is τ = 16; for 30-fps video, this refresh rate corresponds to roughly 2 sampled frames per second. Denoting the number of frames sampled by the Slow pathway as T, the raw clip length is T × τ frames.
Fast Pathway
Our Fast pathway works with a small temporal stride of τ/α, where α > 1 is the frame-rate ratio between the Fast and Slow pathways. The two pathways operate on the same raw clip, so the Fast pathway samples αT frames, α times denser than the Slow pathway. A typical value is α = 8 in our experiments. The Fast pathway is a convolutional network analogous to the Slow pathway, but with only a fraction β (β < 1) of the Slow pathway's channels; the typical value is β = 1/8 in our experiments.
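The two sampling schedules can be sketched as follows. This is a minimal illustration using the typical values from the notes (τ = 16, α = 8, T = 4); the variable names are my own, not from the paper.

```python
import numpy as np

# Typical settings from the notes: stride tau = 16, speed ratio alpha = 8,
# T = 4 Slow-pathway frames, so the raw clip spans T * tau = 64 frames.
tau, alpha, T = 16, 8, 4
raw_clip_len = T * tau  # 64 raw frames

# Slow pathway: one frame out of every tau.
slow_idx = np.arange(0, raw_clip_len, tau)           # 4 frames: [0, 16, 32, 48]

# Fast pathway: stride tau / alpha, i.e. alpha times denser.
fast_idx = np.arange(0, raw_clip_len, tau // alpha)  # alpha * T = 32 frames

print(len(slow_idx), len(fast_idx))
```

Note that both index lists cover the same 64-frame raw clip; only the sampling density differs.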
Lateral Connections
Similar to [12, 35], we attach one lateral connection between the two pathways for every “stage” (Fig. 1). Specifically for ResNets [24], these connections are right after pool1, res2, res3, and res4. The two pathways have different temporal dimensions, so the lateral connections perform a transformation to match them:
- Time-to-channel: We reshape and transpose {αT, S², βC} into {T, S², αβC}, meaning that we pack all α frames into the channels of one frame.
- Time-strided sampling: We simply sample one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.
- Time-strided convolution: We perform a 3D convolution with a 5×1² kernel, 2βC output channels, and stride = α.
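The three transformations above can be sketched with NumPy on a channels-last Fast-pathway feature map of shape (αT, S, S, βC). This is an illustrative shape exercise under assumed values (α = 8, T = 4, S = 7, βC = 8); the helper `time_strided_conv` is my own naive implementation of a 5×1² strided temporal convolution, not the paper's code.

```python
import numpy as np

alpha, T, S, bC = 8, 4, 7, 8            # assumed illustrative sizes
x = np.random.randn(alpha * T, S, S, bC)  # Fast-pathway feature map (channels last)

# (a) Time-to-channel: pack each group of alpha consecutive frames into channels.
t2c = (x.reshape(T, alpha, S, S, bC)
        .transpose(0, 2, 3, 1, 4)
        .reshape(T, S, S, alpha * bC))   # -> (T, S, S, alpha*bC)

# (b) Time-strided sampling: keep one frame out of every alpha.
tss = x[::alpha]                         # -> (T, S, S, bC)

# (c) Time-strided convolution: 5x1x1 kernel, 2*bC output channels, stride alpha.
def time_strided_conv(x, w, stride, pad=2):
    # x: (Tin, S, S, Cin); w: (5, Cin, Cout). The 1x1 spatial extent of the
    # kernel means only the time axis is convolved.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out_t = (x.shape[0] + 2 * pad - 5) // stride + 1
    return np.stack([
        np.tensordot(xp[t * stride: t * stride + 5], w, axes=([0, 3], [0, 1]))
        for t in range(out_t)
    ])                                   # -> (out_t, S, S, Cout)

w = np.random.randn(5, bC, 2 * bC)
tsc = time_strided_conv(x, w, stride=alpha)  # -> (T, S, S, 2*bC)
```

All three variants end up with T frames on the time axis, matching the Slow pathway, which is exactly what the lateral connection needs.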
The output of the lateral connections is fused into the Slow pathway by summation or concatenation. We use unidirectional connections that fuse Fast-pathway features into the Slow pathway (Fig. 1). Finally, global average pooling is performed on each pathway's output; the two pooled feature vectors are then concatenated as the input to the fully-connected classifier layer. The network architecture details are shown in the table.
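The classifier head described above (per-pathway global average pooling, concatenation, then a fully-connected layer) can be sketched as below. The sizes (C = 64, βC = 8, 10 classes) and the random weight matrix are placeholders for illustration only.

```python
import numpy as np

T, alpha, S = 4, 8, 7
C, bC, num_classes = 64, 8, 10           # assumed illustrative sizes

slow_feat = np.random.randn(T, S, S, C)          # Slow pathway output
fast_feat = np.random.randn(alpha * T, S, S, bC)  # Fast pathway output

# Global average pooling over time and space, then concatenate the pathways.
pooled = np.concatenate([slow_feat.mean(axis=(0, 1, 2)),
                         fast_feat.mean(axis=(0, 1, 2))])  # (C + bC,)

# Fully-connected classifier layer (random placeholder weights).
W = np.random.randn(C + bC, num_classes)
logits = pooled @ W                               # (num_classes,)
```

Pooling each pathway separately before concatenation lets the two feature vectors keep their different channel widths (C vs. βC) while still feeding a single classifier.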
Results