Abstract
We introduce a simple and effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets).
- 3D ConvNets are better suited than 2D ConvNets for spatiotemporal feature learning.
- A homogeneous architecture with 3 × 3 × 3 convolution kernels in all layers is the best-performing architecture.
- C3D (Convolutional 3D) features outperform state-of-the-art methods, and they remain compact, performing well with only 10 dimensions.
Introduction
For the remaining content, refer to the original paper.
Contribution:
- C3D is a good feature-learning machine.
- The architecture with 3 × 3 × 3 convolution kernels in all layers works best among the explored architectures.
- The proposed features, combined with a simple linear model, outperform or approach the best methods on 4 different tasks and 6 benchmarks, and they are compact and efficient to compute.
Comparison:
Related Work
Refer to the original paper.
Learning features with 3D ConvNets
3D ConvNets can model spatiotemporal information owing to their 3D convolution and 3D pooling operations. In 3D ConvNets, convolution and pooling are performed spatiotemporally, while in 2D ConvNets they are done only spatially. Hence, 2D ConvNets lose the temporal information of the input signal right after every convolution operation; only 3D convolution preserves the temporal information, producing an output volume.
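The difference shows up directly in the output shapes. Below is a minimal sketch using PyTorch (an assumption for illustration; the paper's original implementation used Caffe), where a 2D convolution folds the frames into channels and collapses the temporal axis immediately, while a 3D convolution keeps time as a separate axis:

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

# 2D convolution: the frames must be folded into the channel axis,
# so the temporal dimension is collapsed right away.
conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64,
                   kernel_size=3, padding=1)
out2d = conv2d(clip.reshape(1, 3 * 16, 112, 112))
print(out2d.shape)  # torch.Size([1, 64, 112, 112]) -- no temporal axis left

# 3D convolution: time stays a separate axis, producing an output volume.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=3, padding=1)
out3d = conv3d(clip)
print(out3d.shape)  # torch.Size([1, 64, 16, 112, 112]) -- time preserved
```

After one 2D convolution the temporal information is gone for good, whereas the 3D output volume can be convolved spatiotemporally again in the next layer.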
We first train on a medium-scale dataset to find the best architecture, then verify the findings on a large-scale dataset with a smaller number of network experiments.
We fix common settings shared by all trained networks, including the input size and the number of input frames, and vary the kernel temporal depth d of the convolution layers to search for the best 3D architecture. All convolution layers are applied with appropriate padding (both spatial and temporal) and stride 1, so the convolution layers do not change the size from input to output. Labels belong to 101 different action classes. All video frames are resized to 128 × 171. Videos are split into non-overlapping 16-frame clips, which are used as input to the networks, so the input dimensions are 3 × 16 × 128 × 171, where 3 is the number of color channels. Each conv layer is immediately followed by a max-pooling layer, and two fully connected layers plus a softmax loss layer predict the action labels. The numbers of filters for the 5 conv layers, from conv1 to conv5, are 64, 128, 256, 256, and 256, respectively. (Note: each output feature map corresponds to one filter, and each filter operates on all feature maps of the previous layer.) All pooling kernels are 2 × 2 × 2 with stride 2, except that the first pooling layer has temporal dimension 1 (i.e., 1 × 2 × 2). This way we do not merge the temporal signal too early, and we can temporally pool by a factor of 2 at most 4 times before the temporal signal is completely collapsed. The two fully connected layers have 2048 outputs each. We train the models with SGD using a decaying learning rate. Our aim is to study how to aggregate temporal information through deep networks.
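The clip preparation described above can be sketched as follows (a minimal NumPy illustration; the function and variable names are hypothetical, not from the paper):

```python
import numpy as np

def split_into_clips(video, clip_len=16):
    """Split a video of shape (num_frames, H, W, 3) into non-overlapping
    clips of shape (3, clip_len, H, W); leftover frames are dropped."""
    num_clips = video.shape[0] // clip_len
    clips = []
    for i in range(num_clips):
        clip = video[i * clip_len:(i + 1) * clip_len]  # (16, H, W, 3)
        clips.append(clip.transpose(3, 0, 1, 2))       # -> (3, 16, H, W)
    return np.stack(clips)

# A fake 100-frame video whose frames were resized to 128 x 171
video = np.zeros((100, 128, 171, 3), dtype=np.float32)
clips = split_into_clips(video)
print(clips.shape)  # (6, 3, 16, 128, 171)
```

Each row of the resulting array is one network input of dimensions 3 × 16 × 128 × 171.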
We experiment with two types of architectures:
- homogeneous temporal depth: all conv layers have the same kernel temporal depth, using depths 1, 3, 5, and 7;
- varying temporal depth: the kernel temporal depth changes across layers, using 3-3-5-5-7 (increasing) and 7-5-5-3-3 (decreasing).
These networks differ in parameter count only at the conv layers, due to the different temporal depths. The differences are minute compared with the millions of parameters in the fully connected layers, so the learning capacities of the networks are comparable and the parameter differences should not affect the results of the architecture search.
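The search networks described above can be sketched in PyTorch (a sketch under the assumption of a PyTorch reimplementation; the paper used Caffe, and the helper name is hypothetical). Only the kernel temporal depths vary between the candidate networks:

```python
import torch
import torch.nn as nn

def make_search_net(temporal_depths, num_classes=101):
    """Build one architecture-search network: 5 conv layers with
    64, 128, 256, 256, 256 filters, each followed by max pooling.
    `temporal_depths` gives the kernel temporal depth d per conv layer,
    e.g. (3, 3, 3, 3, 3) for the homogeneous depth-3 net or
    (3, 3, 5, 5, 7) for a varying-depth net (odd depths only)."""
    filters = [64, 128, 256, 256, 256]
    layers, in_ch = [], 3
    for i, (d, out_ch) in enumerate(zip(temporal_depths, filters)):
        # stride 1 with "same" padding: conv never changes the size
        layers.append(nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3),
                                padding=(d // 2, 1, 1)))
        layers.append(nn.ReLU(inplace=True))
        # pool1 keeps the temporal dimension (kernel/stride 1 in time)
        t = 1 if i == 0 else 2
        layers.append(nn.MaxPool3d(kernel_size=(t, 2, 2), stride=(t, 2, 2)))
        in_ch = out_ch
    return nn.Sequential(
        *layers,
        nn.Flatten(),
        # a 3 x 16 x 128 x 171 input ends up as 256 x 1 x 4 x 5 after pooling
        nn.Linear(256 * 1 * 4 * 5, 2048), nn.ReLU(inplace=True),
        nn.Linear(2048, 2048), nn.ReLU(inplace=True),
        nn.Linear(2048, num_classes),
    )

net = make_search_net((3, 3, 3, 3, 3))
out = net(torch.randn(2, 3, 16, 128, 171))
print(out.shape)  # torch.Size([2, 101])
```

Because only the `(d, 3, 3)` kernel shapes differ between candidates, swapping `temporal_depths` changes almost nothing in parameter count relative to the fully connected layers, which is exactly why the comparison is fair.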
Results of training the networks on UCF101 train split 1:
Our findings in the previous section indicate that a homogeneous setting with 3 × 3 × 3 convolution kernels is the best option for 3D ConvNets. This is consistent with a similar finding for 2D ConvNets [37] (K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015).
The architecture of the C3D ConvNet is as follows:
Description:
All of 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1. All 3D pooling layers are 2 × 2 × 2 with stride 2 × 2 × 2 except for pool1 which has kernel size of 1 × 2 × 2 and stride 1 × 2 × 2 with the intention of preserving the temporal information in the early phase. Each fully connected layer has 4096 output units.
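Following that description, the C3D net can be sketched in PyTorch (a sketch, not the authors' Caffe implementation; the 8-conv layout with 64, 128, 256, 256, 512, 512, 512, 512 filters and 112 × 112 input crops is taken from the paper, while the extra spatial padding on pool5 is an implementation detail common in reimplementations):

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """Sketch of C3D: 8 conv layers, 5 pooling layers, 2 fully connected
    layers, and a softmax output. All conv kernels are 3x3x3 with stride
    1x1x1; all pools are 2x2x2 with stride 2x2x2 except pool1, which is
    1x2x2 to preserve temporal information in the early phase."""
    def __init__(self, num_classes=101):
        super().__init__()
        def conv(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 64),                                    # conv1a
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),      # pool1
            conv(64, 128),                                  # conv2a
            nn.MaxPool3d(2, stride=2),                      # pool2
            conv(128, 256), conv(256, 256),                 # conv3a, conv3b
            nn.MaxPool3d(2, stride=2),                      # pool3
            conv(256, 512), conv(512, 512),                 # conv4a, conv4b
            nn.MaxPool3d(2, stride=2),                      # pool4
            conv(512, 512), conv(512, 512),                 # conv5a, conv5b
            # spatial padding so the final map is 512 x 1 x 4 x 4
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),   # pool5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # fc7
            nn.Linear(4096, num_classes),                             # fc8
        )

    def forward(self, x):  # x: (N, 3, 16, 112, 112)
        return self.classifier(self.features(x))

logits = C3D()(torch.randn(1, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([1, 101])
```

The 4096-unit fc6 activations are what the paper extracts as the "C3D features" for the downstream linear models.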
iDT: the best method for action recognition before deep learning. Before iDT appeared, Dense Trajectories (DT) was the main method. DT obtains tracks through a video sequence using the optical flow field, then extracts HOF, HOG, MBH, and trajectory features along the tracks. The features are encoded with Fisher Vectors (FV), and finally an SVM is trained on the encoded result. iDT improves the optical flow computation by weakening the influence of camera motion. (Refer to the introductory article on Action Recognition in the Video Analysis series.)
We conclude that C3D has good regularization capability and is effective.
The following experiments demonstrate the capability of C3D; see the original paper for details.