Action Recognition Paper Notes | SlowFast | SlowFast Networks for Video Recognition
Feichtenhofer, Christoph, et al. “SlowFast Networks for Video Recognition.” Proceedings of the IEEE International Conference on Computer Vision. 2019. (FAIR)
Motivations
The categorical spatial semantics of the visual content often evolve slowly.
The motion being performed can evolve much faster than the identity of its subject.
Biological studies of retinal ganglion cells in the primate visual system: ∼80% are Parvocellular (P-cells) and ∼15-20% are Magnocellular (M-cells).
Solutions
- Slow pathway: uses a large temporal stride (16) on the input frames, i.e. roughly 2 frames sampled per second for a 30-fps video.
- Fast pathway: uses a small temporal stride (16/8 = 2) and no temporal downsampling layers, so its features keep a high temporal resolution (8×T frames) throughout.
Why 1/8? With the channel ratio β = 1/8, the Fast pathway typically takes ∼20% of the total computation. Interestingly, as mentioned in Sec. 1 of the paper, evidence suggests that ∼15-20% of the retinal cells in the primate visual system are M-cells (which are sensitive to fast motion but not to color or spatial detail).
- Lateral connections: these are placed right after pool1, res2, res3, and res4. The feature shapes of the two pathways are:
- Slow: $\{T, S^2, C\}$
- Fast: $\{\alpha T, S^2, \beta C\}$
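The sampling strides and feature shapes above can be made concrete with a small sketch. This is only an illustration, not the authors' code: the helper names `sample_indices` and `feature_shape` are hypothetical, and the 64-frame clip, the 56×56 spatial size, and the 256 channels are assumed values chosen for readability.

```python
# Illustrative sketch only; all constants below are assumptions for the example.
RAW_FRAMES = 64   # assumed length of the raw clip
TAU = 16          # Slow pathway temporal stride
ALPHA = 8         # Fast pathway sees ALPHA times more frames than Slow
BETA = 1 / 8      # Fast pathway channel ratio


def sample_indices(num_frames, stride):
    """Return the raw-frame indices kept when taking every `stride`-th frame."""
    return list(range(0, num_frames, stride))


def feature_shape(T, S, C, alpha=1, beta=1.0):
    """Return (frames, height, width, channels) for a pathway.

    Slow uses alpha=1, beta=1, giving {T, S^2, C};
    Fast uses alpha and beta, giving {alpha*T, S^2, beta*C}.
    """
    return (alpha * T, S, S, int(beta * C))


slow_idx = sample_indices(RAW_FRAMES, TAU)           # stride 16
fast_idx = sample_indices(RAW_FRAMES, TAU // ALPHA)  # stride 16/8 = 2

T = len(slow_idx)
slow_shape = feature_shape(T, S=56, C=256)
fast_shape = feature_shape(T, S=56, C=256, alpha=ALPHA, beta=BETA)

# Frames sampled per second by the Slow pathway for a 30-fps video.
slow_fps = 30 / TAU

if __name__ == "__main__":
    print("Slow frames:", slow_idx)
    print("Fast frame count:", len(fast_idx))
    print("Slow shape:", slow_shape)
    print("Fast shape:", fast_shape)
    print("Slow frames per second at 30 fps:", slow_fps)
```

Under these assumptions the Slow pathway keeps 4 frames and the Fast pathway keeps 32. Both share the same spatial size, while the Fast pathway carries one eighth of the channels, which is why a lateral connection has to reconcile the temporal and channel dimensions before fusing the two feature maps.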