Learning Spatiotemporal Features with 3D Convolutional Networks

Abstract

We introduce a simple and effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets).

  1. 3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.
  2. A homogeneous architecture with 3 × 3 × 3 convolution kernels in all layers is the best-performing architecture.
  3. C3D (Convolutional 3D) features are better than state-of-the-art methods, and they are compact: they perform well even when reduced to 10 dimensions.

Introduction

For the remaining content, refer to the original paper.
Contribution:

  1. C3D is a good feature-learning machine.
  2. The architecture with 3 × 3 × 3 convolution kernels in all layers works best among the explored architectures.
  3. The proposed features, with a simple linear model, outperform or approach the best methods on 4 different tasks and 6 benchmarks, and they are compact and efficient to compute.

Comparison with other methods:
[figure omitted]

Related Work

Refer to the original paper.

Learning features with 3D ConvNets

3D ConvNets can model spatiotemporal information owing to their 3D convolution and 3D pooling operations. In 3D ConvNets, convolution and pooling are performed spatio-temporally, while in 2D ConvNets they are done only spatially. Hence, 2D ConvNets lose the temporal information of the input signal right after every convolution operation. Only 3D convolution preserves the temporal information of the input signal, producing an output volume.
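To make this concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper): a 3D convolution keeps a temporal axis in its output volume, while a 2D convolution applied to a stack of frames collapses the temporal dimension after a single operation.

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip at the paper's input size: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 16, 128, 171)

# 3D convolution: the kernel also slides over time, so the output
# still has a temporal axis (an output volume per sample).
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(clip).shape)  # torch.Size([1, 64, 16, 128, 171])

# 2D convolution: frames must be folded into the channel axis, so the
# temporal information is collapsed right after one convolution.
frames_as_channels = clip.reshape(1, 3 * 16, 128, 171)
conv2d = nn.Conv2d(3 * 16, 64, kernel_size=3, padding=1)
print(conv2d(frames_as_channels).shape)  # torch.Size([1, 64, 128, 171])
```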
We train on a medium-scale dataset (UCF101) to find the best architecture, then verify the findings on a large-scale dataset with a smaller number of network experiments.
We first fix common settings for all the networks we train, including the input size and the number of input frames, and then vary the kernel temporal depth d of the convolutional layers to search for the best 3D architecture. All convolutional layers are applied with appropriate padding (both spatial and temporal) and stride 1, so there is no change in size from the input to the output of these convolution layers.

Labels belong to 101 different action classes. All video frames are resized to 128 × 171. Videos are split into non-overlapping 16-frame clips, which are used as input to the networks, so the input dimensions are 3 × 16 × 128 × 171, where 3 is the number of channels. Each conv layer is immediately followed by a max-pooling layer, and two fully connected layers plus a softmax loss layer predict the action labels. The numbers of filters for the 5 conv layers, from 1 to 5, are 64, 128, 256, 256, 256, respectively (note: each filter produces one feature map and operates over all feature maps of the previous layer). Pooling kernels are 2 × 2 × 2 with stride 2, except in the first layer, where the temporal kernel size and stride are 1. This avoids merging the temporal signal too early: with 16-frame clips we can temporally pool by a factor of 2 at most 4 times (16 → 8 → 4 → 2 → 1) before completely collapsing the temporal signal. The two fully connected layers have 2048 outputs each. We train with SGD and a decaying learning rate. Our focus is on how to aggregate temporal information through a deep network.
We experiment with two types of architectures (a sketch of these settings appears after the list):

  1. Homogeneous temporal depth: all conv layers have the same kernel temporal depth, using depths 1, 3, 5, and 7.
  2. Varying temporal depth: the kernel temporal depth changes across layers, using 3-3-5-5-7 (increasing) and 7-5-5-3-3 (decreasing).
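As a rough illustration of the common settings above, the following PyTorch sketch builds one of these search networks from a list of per-layer temporal depths. The ReLU activations and the use of LazyLinear (so the flattened size need not be hard-coded) are my assumptions, not details stated in the text above.

```python
import torch
import torch.nn as nn

def make_search_net(temporal_depths, num_classes=101):
    # 5 conv layers (64, 128, 256, 256, 256 filters), each followed by a
    # max-pooling layer, then two 2048-d fully connected layers and a
    # classifier, as in the common settings described above.
    filters = [64, 128, 256, 256, 256]
    layers, in_ch = [], 3
    for i, (out_ch, d) in enumerate(zip(filters, temporal_depths)):
        layers += [
            # "Appropriate" padding and stride 1 keep input/output sizes equal.
            nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3), padding=(d // 2, 1, 1)),
            nn.ReLU(inplace=True),  # assumed activation
            # pool1 keeps the temporal size so the temporal signal is not
            # merged too early; the remaining pools halve every dimension.
            nn.MaxPool3d((1, 2, 2) if i == 0 else (2, 2, 2)),
        ]
        in_ch = out_ch
    return nn.Sequential(
        *layers, nn.Flatten(),
        nn.LazyLinear(2048), nn.ReLU(inplace=True),
        nn.LazyLinear(2048), nn.ReLU(inplace=True),
        nn.LazyLinear(num_classes),
    )

# Homogeneous depth-3 net on a batch of 16-frame clips at 3 x 16 x 128 x 171.
net = make_search_net([3, 3, 3, 3, 3])                # or e.g. [3, 3, 5, 5, 7]
print(net(torch.randn(2, 3, 16, 128, 171)).shape)     # torch.Size([2, 101])
```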

The networks' parameter counts differ only at the conv layers, due to the different temporal depths. These differences are quite minute compared to the millions of parameters in the fully connected layers, so the learning capacities of the networks are comparable and the differences in parameter count should not affect the results of the architecture search.
Results on using UCF101 train split-1 to train networks:
[figure omitted]
Our findings above indicate that the homogeneous setting with 3 × 3 × 3 convolution kernels is the best option for 3D ConvNets. This is consistent with a similar finding in 2D ConvNets [37] (K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015).
The architecture of the C3D ConvNet is as follows:
[figure omitted]
Description:
All of 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1. All 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2 except for pool1 which has kernel size of 1 × 2 × 2 and stride 1 × 2 × 2 with the intention of preserving the temporal information in the early phase. Each fully connected layer has 4096 output units.
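Putting this description together, here is a sketch of C3D in PyTorch. The per-layer filter counts (64, 128, 256, 256, 512, 512, 512, 512), the 16 × 112 × 112 input crops, the dropout on the fully connected layers, and the spatial padding on pool5 (so a crop flattens to 8192 inputs for fc6) follow the paper's architecture figure and common implementations; treat them as assumptions rather than details from the text above.

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    # Sketch of C3D: 8 conv layers, 5 pooling layers, two 4096-d fully
    # connected layers, and a softmax classifier.
    def __init__(self, num_classes=101):
        super().__init__()
        def conv(cin, cout):
            # All 3D convolution filters are 3x3x3 with stride 1x1x1.
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 64),
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),  # pool1: preserve time
            conv(64, 128),
            nn.MaxPool3d(2, stride=2),                  # pool2
            conv(128, 256), conv(256, 256),
            nn.MaxPool3d(2, stride=2),                  # pool3
            conv(256, 512), conv(512, 512),
            nn.MaxPool3d(2, stride=2),                  # pool4
            conv(512, 512), conv(512, 512),
            # Spatial padding on pool5 (assumption, as in common
            # implementations) so 3x16x112x112 flattens to 512*1*4*4 = 8192.
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8192, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):  # x: (batch, 3, 16, 112, 112)
        return self.classifier(self.features(x))

logits = C3D()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```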
iDT: the best-performing method for action recognition before deep learning. Before iDT appeared, Dense Trajectories (DT) was the main method. DT obtains tracks through a video sequence using the optical flow field, then extracts HOF, HOG, MBH, and trajectory features along the tracks. The features are encoded with Fisher Vectors (FV), and finally an SVM is trained on the encoded result. iDT improves the optical flow computation by weakening the influence of camera motion. (See the blog post "Video Analysis相关领域介绍之Action Recognition (行为识别)".)
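For intuition, here is a schematic sketch of that encode-then-linear-SVM pipeline. The descriptor extraction is a random-data placeholder (a real implementation tracks points through the optical flow field), the Fisher Vector is simplified to its mean-gradient term, and the GMM is kept small for speed; only the overall structure mirrors the description above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def extract_idt_descriptors(video, dim=96):
    # Placeholder: a real implementation tracks points via optical flow
    # (iDT additionally suppresses camera motion) and computes
    # HOG/HOF/MBH/trajectory descriptors along each track.
    rng = np.random.default_rng(hash(video) % 2**32)
    return rng.normal(size=(200, dim))              # (num_tracks, dim)

def fisher_vector(descriptors, gmm):
    # Simplified Fisher Vector: gradient w.r.t. the GMM means only
    # (a full FV also stacks the weight and variance gradients).
    q = gmm.predict_proba(descriptors)              # (N, K) soft assignments
    diff = descriptors[:, None, :] - gmm.means_     # (N, K, D)
    fv = (q[..., None] * diff / np.sqrt(gmm.covariances_)[None]).mean(axis=0)
    return fv.ravel()                               # (K * D,)

videos, labels = list(range(20)), [i % 2 for i in range(20)]  # dummy data
descs = [extract_idt_descriptors(v) for v in videos]

# iDT pipelines typically use ~256 Gaussians; 8 keeps this sketch fast.
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(np.vstack(descs))
X = np.stack([fisher_vector(d, gmm) for d in descs])
clf = LinearSVC().fit(X, labels)                    # linear SVM on FV encodings
print(clf.score(X, labels))
```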
We conclude that C3D is effective and generalizes well.
The paper follows with several experiments demonstrating C3D's capabilities; see the original paper for details.
