A Closer Look at Spatiotemporal Convolutions for Action Recognition
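Main contributions
The paper compares several forms of spatiotemporal convolution for video analysis and proposes the "R(2+1)D" architecture.
Forms of spatiotemporal convolution for video analysis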
Residual network architectures for video classification considered in this work.
(2+1)D vs 3D convolution
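Replace the $N_i$ 3D convolution kernels of size $N_{i-1} \times t \times d \times d$ with $M_i$ 2D (spatial) convolution kernels of size $N_{i-1} \times 1 \times d \times d$, followed by $N_i$ temporal convolution kernels of size $M_i \times t \times 1 \times 1$.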
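To make the factorization concrete, the following is a minimal sketch that traces tensor shapes through a single 3D convolution and through the equivalent (2+1)D pair. It assumes stride 1 and no padding, and the function names (`conv_out`, `shape_3d`, `shape_2plus1d`) are hypothetical helpers introduced only for illustration; they are not from the paper or any library.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def shape_3d(in_shape, n_out, t, d):
    """Output shape of one full 3D convolution with a t x d x d kernel.

    in_shape is (channels, frames, height, width).
    """
    _, frames, h, w = in_shape
    return (n_out, conv_out(frames, t), conv_out(h, d), conv_out(w, d))


def shape_2plus1d(in_shape, m_mid, n_out, t, d):
    """Shapes after the spatial (1 x d x d) and temporal (t x 1 x 1) steps.

    m_mid is the number of intermediate channels (M_i in the text).
    """
    _, frames, h, w = in_shape
    # Spatial convolution: the frame axis is untouched, channels become M_i.
    mid = (m_mid, frames, conv_out(h, d), conv_out(w, d))
    # Temporal convolution: spatial axes are untouched, channels become N_i.
    out = (n_out, conv_out(frames, t), mid[2], mid[3])
    return mid, out


if __name__ == "__main__":
    x = (64, 8, 56, 56)  # assumed example input: 64 channels, 8 frames, 56 x 56
    print(shape_3d(x, n_out=64, t=3, d=3))
    print(shape_2plus1d(x, m_mid=144, n_out=64, t=3, d=3))
```

Under these assumptions both paths end with the same output shape; the only difference is the intermediate tensor with $M_i$ channels, which is where the extra nonlinearity is applied.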
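To keep the number of parameters in the factorized (2+1)D block roughly equal to that of the original 3D convolution, the paper sets $M_i$ in the figure above to $M_i = \left\lfloor \frac{t d^2 N_{i-1} N_i}{d^2 N_{i-1} + t N_i} \right\rfloor$.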
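The parameter matching can be checked numerically. The sketch below assumes bias-free convolutions and the floor-based choice of $M_i$ obtained by equating the two parameter counts, $N_i N_{i-1} t d^2 = M_i (N_{i-1} d^2 + N_i t)$. The function names are hypothetical and used only for illustration.

```python
def params_3d(n_in, n_out, t, d):
    """Weights in n_out 3D kernels of size n_in x t x d x d (no bias)."""
    return n_out * n_in * t * d * d


def matched_mid_channels(n_in, n_out, t, d):
    """Largest integer M_i whose (2+1)D block has at most as many weights
    as the full 3D convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_2plus1d(n_in, n_out, t, d, m_mid):
    """Weights in the spatial plus temporal convolutions (no bias)."""
    spatial = m_mid * n_in * d * d   # M_i kernels of size n_in x 1 x d x d
    temporal = n_out * m_mid * t     # N_i kernels of size M_i x t x 1 x 1
    return spatial + temporal


if __name__ == "__main__":
    for n_in, n_out in [(64, 64), (64, 128)]:
        m = matched_mid_channels(n_in, n_out, t=3, d=3)
        full = params_3d(n_in, n_out, t=3, d=3)
        factored = params_2plus1d(n_in, n_out, t=3, d=3, m_mid=m)
        print(n_in, n_out, m, full, factored)
```

Because of the floor, the factorized block never has more weights than the 3D convolution it replaces, and the difference is smaller than the weight cost of one additional intermediate channel.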
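Using factorized convolution kernels has two advantages:
① It adds an extra nonlinearity between the spatial and temporal convolutions, which increases the representational power of the network.
② It makes the network easier to optimize: with the same number of parameters, R(2+1)D reaches lower training and test loss. The deeper the network, the larger this gap becomes.
Experiments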
R3D architectures considered in our experiments.
Action recognition accuracy for different forms of convolution on the Kinetics validation set.
Comparison with the state-of-the-art on Sports-1M.
Comparison with the state-of-the-art on Kinetics.
Comparison with the state-of-the-art on UCF101 and HMDB51.