[Deep Learning Paper Notes][Video Classification] Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in Neural Information Processing Systems. 2014.
(Citations: 425).


1 Motivation

The features learnt by spatio-temporal CNNs do not capture motion well. The idea is to use two separate CNN streams, one for appearance from still frames and one for motion between frames, and to combine them by late fusion. Decoupling the spatial and temporal nets also makes it possible to exploit the large amounts of annotated image data available, by pre-training the spatial net on the ImageNet challenge dataset.


2 Architecture

See Fig.


The spatial stream performs action recognition from still frames. This is the standard image classification task, so a CNN pre-trained on ImageNet can be used.


The temporal stream performs action recognition from motion. The input to this model is formed by stacking optical flow displacement fields between several consecutive frames; such input explicitly describes the motion between video frames. The stacked fields are treated as a multi-channel image, so the network applies standard 2D convolutions over the 2T flow channels (not 3D convolutions over a frame volume).


The final fusion is done either by averaging the class scores or by training a linear SVM on the l2-normalized softmax scores used as features.
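The two fusion options can be sketched as follows; the score values here are illustrative, and the SVM training step is omitted.

```python
import numpy as np

# Sketch of the two fusion options, assuming per-video softmax scores from
# each stream over the action classes (the values below are made up).
spatial_scores = np.array([0.7, 0.2, 0.1])   # spatial-stream softmax
temporal_scores = np.array([0.5, 0.4, 0.1])  # temporal-stream softmax

# Option 1: average the class scores and take the argmax.
fused = (spatial_scores + temporal_scores) / 2
print(fused.argmax())  # -> 0

# Option 2: l2-normalize the concatenated softmax scores and use them as
# the feature vector for a linear SVM (SVM training itself omitted here).
feature = np.concatenate([spatial_scores, temporal_scores])
feature = feature / np.linalg.norm(feature)
print(np.linalg.norm(feature))  # -> 1.0
```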


3 Temporal Stream
There are several variations of the temporal stream part.


3.1 Optical Flow Stacking

The input is a set of displacement vector fields d_t between pairs of consecutive frames t and t + 1. By d_t(i, j) we denote the displacement vector at the point (i, j) in frame t, which moves the point to the corresponding point in the following frame t + 1. To represent the motion across a sequence of frames, we stack the horizontal and vertical components d_t^x(i, j) and d_t^y(i, j) of T consecutive fields to form a total of 2T input channels.
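The channel construction above can be sketched directly; the flow fields here are random stand-ins for real optical flow, with each d_t stored as an (H, W, 2) array of x and y components.

```python
import numpy as np

# Sketch of optical-flow stacking: given T displacement fields d_t of
# shape (H, W, 2), build the 2T-channel network input, with one
# horizontal and one vertical channel per frame pair.
H, W, T = 4, 4, 10
flows = [np.random.randn(H, W, 2) for _ in range(T)]  # d_t, t = 1..T

def stack_flows(flows):
    channels = []
    for d in flows:
        channels.append(d[..., 0])  # d_t^x: horizontal displacements
        channels.append(d[..., 1])  # d_t^y: vertical displacements
    return np.stack(channels, axis=0)  # shape (2T, H, W)

volume = stack_flows(flows)
print(volume.shape)  # (20, 4, 4)
```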

3.2 Trajectory Stacking
Replaces the optical flow sampled at the same locations across several frames with flow sampled along the motion trajectories of some anchor points.
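A minimal sketch of the trajectory sampling for one anchor point: starting at p_1 = (i, j), each field moves the point to the next frame, p_{k+1} = p_k + d_{t+k-1}(p_k). Rounding the point to the nearest grid cell is an assumption made here for simplicity (the displacement is assumed stored as (row, col) offsets).

```python
import numpy as np

# Sketch of trajectory stacking: instead of sampling every field d_t at
# the same location (i, j), follow the point along its motion trajectory.
H, W, T = 8, 8, 5
flows = [np.random.uniform(-1, 1, (H, W, 2)) for _ in range(T)]

def trajectory_samples(flows, i, j):
    samples, p = [], np.array([i, j], dtype=float)
    for d in flows:
        r = int(np.clip(round(p[0]), 0, H - 1))  # nearest grid row
        c = int(np.clip(round(p[1]), 0, W - 1))  # nearest grid column
        samples.append(d[r, c])  # flow sampled along the trajectory
        p = p + d[r, c]          # move the point to the next frame
    return np.stack(samples)     # shape (T, 2): 2T input values per point

print(trajectory_samples(flows, 3, 3).shape)  # (5, 2)
```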

See Fig. for illustration.


3.3 Bi-directional Optical Flow
We can construct an input volume by stacking T/2 forward flows between frames t and t + T/2 and T/2 backward flows between frames t - T/2 and t. The input thus has the same number of channels (2T) as before. The flows can be represented using either optical flow stacking or trajectory stacking.
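The channel bookkeeping can be sketched as follows, reusing the same x/y interleaving as uni-directional stacking; the fields are again random stand-ins for real flow.

```python
import numpy as np

# Sketch of the bi-directional input: T/2 forward fields (frames t..t+T/2)
# plus T/2 backward fields (frames t-T/2..t), stacked into 2T channels
# exactly as in uni-directional optical flow stacking.
H, W, T = 4, 4, 10
forward = [np.random.randn(H, W, 2) for _ in range(T // 2)]
backward = [np.random.randn(H, W, 2) for _ in range(T // 2)]

def stack(fields):
    # two channels (x, y) per field, in field order
    return np.concatenate([f.transpose(2, 0, 1) for f in fields], axis=0)

volume = np.concatenate([stack(forward), stack(backward)], axis=0)
print(volume.shape)  # (20, 4, 4): the same 2T channels as before
```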


3.4 Training Details
It is generally beneficial to zero-center the network input, as it allows the model to better exploit the rectification non-linearities. In our case, the displacement vector field components can be dominated by a particular displacement, e.g. one caused by camera movement. Rather than estimating the camera motion explicitly, a simpler approach is used: from each displacement field d we subtract its mean vector. Because the datasets are small, multi-task learning is used to combat overfitting: the CNN architecture is modified to have two softmax classification layers on top of the last fully-connected layer, one for each dataset.
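The per-field mean subtraction is a one-liner; here a constant global shift stands in for camera motion.

```python
import numpy as np

# Sketch of the per-field mean subtraction: from each displacement field d
# (shape (H, W, 2)) subtract its mean displacement vector, which suppresses
# a global component such as camera motion.
d = np.random.randn(6, 6, 2) + np.array([3.0, -1.0])  # add a global shift

d_centered = d - d.mean(axis=(0, 1))  # subtract the mean vector
print(np.allclose(d_centered.mean(axis=(0, 1)), 0.0))  # -> True
```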


4 Results
Stacking multiple (T > 1) displacement fields in the input is highly beneficial, as it provides the network with long-term motion information. Mean subtraction is helpful, as it reduces the effect of global motion between the frames. Optical flow stacking performs better than trajectory stacking, and bi-directional optical flow is only slightly better than uni-directional forward flow. The temporal CNN significantly outperforms the spatial CNN, which confirms the importance of motion information for action recognition. The temporal and spatial recognition streams are complementary, as their fusion significantly improves on both.



