Tran, Du, et al. “Learning spatiotemporal features with 3d convolutional networks.” 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015. (Citations: 101).
1 Architecture
This model is essentially a 3D version of VGGNet: it uses 3 × 3 × 3 convolutions and 2 × 2 × 2 pooling. An illustration of 3D convolution can be seen in Fig. 3D convolution preserves the temporal information of the input signal, resulting in an output volume rather than a single 2D feature map.
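To make the "output volume" point concrete, the following is a minimal sketch of the shape arithmetic for one 3 × 3 × 3 convolution followed by one 2 × 2 × 2 pooling step. The stride of 1, the padding of 1, and the 16 × 112 × 112 input clip are assumptions chosen for illustration and are not stated in these notes. The helper names are hypothetical.

```python
def conv3d_output_shape(shape, kernel=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1)):
    """Return (frames, height, width) after a 3D convolution.

    Uses the usual per-axis formula floor((n + 2p - k) / s) + 1.
    Stride and padding defaults are assumptions, not values from the notes.
    """
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


def pool3d_output_shape(shape, kernel=(2, 2, 2)):
    """Return (frames, height, width) after non-overlapping 3D pooling."""
    return tuple(n // k for n, k in zip(shape, kernel))


# Assumed input clip: 16 frames of 112 x 112 pixels (channels omitted).
clip = (16, 112, 112)
after_conv = conv3d_output_shape(clip)        # temporal axis is still present
after_pool = pool3d_output_shape(after_conv)  # every axis is halved

print(after_conv, after_pool)
```

Under these assumptions the convolution keeps all three axes unchanged, and the pooling step halves each of them, so the temporal dimension shrinks gradually instead of being collapsed after the first layer.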
2 Results
Using a deconvolution approach, we observe that C3D starts by focusing on appearance in the first few frames and then tracks the salient motion in the subsequent frames. Thus a 3D CNN differs from a standard 2D CNN in that it selectively attends to both motion and appearance. As with a standard 2D CNN, we can extract video features from a 3D CNN. We use fc6 features in our experiments.
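One common way to turn per-clip fc6 activations into a single video-level feature is to average them and L2-normalize the result. The sketch below assumes the fc6 vectors have already been extracted and are given as plain lists of floats. The function name and the toy 3-dimensional vectors are hypothetical, and these notes do not say which aggregation was actually used.

```python
import math


def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the mean."""
    if not clip_features:
        raise ValueError("need at least one clip feature vector")
    dim = len(clip_features[0])
    count = len(clip_features)
    # Element-wise mean over all clips of the video.
    mean = [sum(f[i] for f in clip_features) / count for i in range(dim)]
    # Scale to unit length; leave an all-zero vector untouched.
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean


# Toy stand-ins for fc6 activations (real ones are much longer vectors).
toy_clips = [[3.0, 0.0, 4.0], [3.0, 0.0, 4.0]]
descriptor = video_descriptor(toy_clips)
print(descriptor)
```

The resulting unit-length vector can then be fed to a simple classifier such as a linear SVM.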
3 References
[1]. http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bil722/slides/w07-conv3d.pdf.