【ML】Two-Stream Convolutional Networks for Action Recognition in Videos

Two-Stream Convolutional Networks for Action Recognition in Videos

&

Towards Good Practices for Very Deep Two-Stream ConvNets

 

Note: this is a learning note on the topic of video representations. It covers two papers about the popular two-stream architecture.

Links: http://arxiv.org/pdf/1406.2199v2.pdf

       http://arxiv.org/pdf/1507.02159v1.pdf

 

Motivation: CNNs have significantly boosted the performance of object recognition in still images. However, applying them to video recognition by stacking frames does not outperform running them on individual frames (work by Karpathy et al.), which indicates that this traditional way of adapting CNNs to video clips does not capture motion well.

 

Proposed Model:

To learn spatio-temporal features well, this paper proposes a two-stream architecture for video recognition. A spatial stream passes appearance information (a single static RGB frame) through one ConvNet, while a temporal stream passes motion information (stacked optical flow over multiple frames) through another. The parallel outputs of the two streams are then fused (e.g., by averaging their softmax scores) to produce the final class prediction.
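As a rough illustration (a minimal sketch, not the authors' code; the layer sizes, the 101-class head, and the 10-frame flow stack are assumptions for the example), a PyTorch version of the score-level fusion could look like this:

```python
import torch
import torch.nn as nn

class StreamConvNet(nn.Module):
    """Stand-in ConvNet body; the paper uses a CNN-M-2048-style network."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_len=10):
        super().__init__()
        self.spatial = StreamConvNet(3, num_classes)               # one RGB frame
        self.temporal = StreamConvNet(2 * flow_len, num_classes)   # stacked (dx, dy) flow

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream softmax scores.
        return (self.spatial(rgb).softmax(1) + self.temporal(flow).softmax(1)) / 2
```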

The overall pipeline is shown below:

[Figure: the two-stream architecture; a spatial ConvNet on a single RGB frame and a temporal ConvNet on stacked optical flow, fused at the class-score level.]

-      ConvNet input configurations:

There are several options for the input to the temporal stream. The authors discuss optical flow stacking and trajectory stacking as ways of encoding motion. The former stacks the displacement of every point between consecutive frames, while the latter tracks each point of the initial frame along its motion trajectory through the entire sequence and stacks the displacements sampled along that trajectory.
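To make the optical-flow-stacking input concrete, here is a minimal numpy sketch (my own illustration; the interleaved dx/dy channel ordering follows the paper's description, but the function name is an assumption):

```python
import numpy as np

def stack_optical_flow(flows):
    """Stack L consecutive flow fields into a (2L, H, W) input volume.

    flows: (L, H, W, 2) array of horizontal (dx) and vertical (dy)
    displacements between consecutive frames. Channels come out
    interleaved as dx_1, dy_1, dx_2, dy_2, ..., dx_L, dy_L.
    """
    L, H, W, _ = flows.shape
    volume = np.empty((2 * L, H, W), dtype=flows.dtype)
    volume[0::2] = flows[..., 0]  # horizontal components
    volume[1::2] = flows[..., 1]  # vertical components
    return volume

# Example: 10 flow fields at 224x224 give a (20, 224, 224) network input.
flows = np.random.randn(10, 224, 224, 2).astype(np.float32)
print(stack_optical_flow(flows).shape)  # (20, 224, 224)
```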

 

They also mention bi-directional optical flow (stacking both forward and backward flow) to enhance the capacity of the video representation, and mean flow subtraction to suppress the influence of global camera motion.
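Mean flow subtraction is simple enough to state in code; continuing the sketch above, each displacement channel's mean is subtracted so that a roughly constant camera translation cancels out:

```python
def subtract_mean_flow(volume):
    """Subtract each displacement channel's mean value, suppressing a
    global motion component that is roughly constant across the frame."""
    return volume - volume.mean(axis=(1, 2), keepdims=True)
```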

 

Visualization:

The visualization of the learned filters in this architecture is shown below.

[Figure: learned filters; each column corresponds to a filter, each row to an input channel.]

As we can see from the image, a single filter that is half black and half white computes a spatial derivative, while a column of filters whose intensity shifts gradually from black to white across the input channels computes a temporal derivative.

With this intuition, we can see how the two-stream architecture captures spatio-temporal features well.
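As a toy check of that intuition (my own example, not from the paper): a [-1, +1] filter within one channel responds to spatial edges, while opposite-signed weights across consecutive input channels respond to temporal change:

```python
import numpy as np

row = np.array([0., 0., 1., 1.])      # one image row containing a spatial edge
spatial = np.array([-1., 1.])         # "half black / half white" filter
print(np.correlate(row, spatial, mode='valid'))  # [0. 1. 0.] -> fires at the edge

pixel_over_time = np.array([0., 1.])  # one pixel across input channels t, t+1
temporal = np.array([-1., 1.])        # weights turning from black into white
print(temporal @ pixel_over_time)     # 1.0 -> fires on temporal change
```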

 

 

Improvements:

There is another paper, Towards Good Practices for Very Deep Two-Stream ConvNets, which improves the two-stream model's performance in practice.

They argue that the original two-stream model did not significantly outperform hand-crafted features for mainly two reasons: first, the network is not as deep as VGGNet or GoogLeNet; second, the limited amount of training data constrains its performance.

Thus, they propose several suggestions for learning a more powerful two-stream model (a rough sketch of how these combine follows the list):

-      Pre-training for Two-stream ConvNets: pre-train both spatial and temporal nets on ImageNet.

-      Smaller Learning Rate.

-      More Data Augmentation Techniques.

-      High Dropout Ratio: makes it easier to train a deep network on a small amount of data.

-      Multi-GPU training.
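A hypothetical PyTorch sketch of how these practices might combine for the spatial stream (the 0.9 dropout ratio, the learning-rate schedule, and the 101-way UCF101 head are illustrative choices in the spirit of the paper, not its exact settings):

```python
import torch
import torch.nn as nn
import torchvision

# Pre-training: start from an ImageNet-pre-trained very deep network.
model = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)

# High dropout ratio: raise the FC dropout to ease training on a small
# video dataset.
for m in model.classifier:
    if isinstance(m, nn.Dropout):
        m.p = 0.9

# Replace the 1000-way ImageNet head with a 101-way action head.
model.classifier[-1] = nn.Linear(4096, 101)

# Smaller learning rate than typical from-scratch training, with decay.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)

# Multi-GPU training: parallelize across available devices.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())
```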

 

Reposted from: https://www.cnblogs.com/kanelim/p/5381759.html
