Action Recognition Paper Notes: I3D, T3D, S3D, R(2+1)D, P3D, CSN
I3D
Carreira, Joao, and Andrew Zisserman. “Quo vadis, action recognition? A new model and the Kinetics dataset.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
T3D
Diba, Ali, et al. “Temporal 3D ConvNets: New architecture and transfer learning for video classification.” arXiv preprint arXiv:1711.08200 (2017).
S3D
Xie, Saining, et al. “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
R(2+1)D
Tran, Du, et al. “A closer look at spatiotemporal convolutions for action recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
P3D
Qiu, Zhaofan, Ting Yao, and Tao Mei. “Learning spatio-temporal representation with pseudo-3D residual networks.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
CSN
Tran, Du, et al. “Video classification with channel-separated convolutional networks.” Proceedings of the IEEE International Conference on Computer Vision. 2019.
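I3D is Zisserman's work.
T3D follows the I3D idea and uses transfer learning to obtain the parameters of the 3D network.
R(2+1)D is the work of Du Tran (author of C3D), Heng Wang (author of iDT), and Yann LeCun at FAIR.
S3D, R(2+1)D, and P3D all factorize the spatiotemporal convolution.
CSN uses 3D group convolution.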
I3D Contributions:
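- Kinetics-400 dataset, with 400+ YouTube videos per class
Architecture (e) is the I3D model proposed by the authors:
- Inflating 2D ConvNets into 3D: architecture design. The 2D ConvNet gains one extra (temporal) dimension, as shown in Figure 3.
- Bootstrapping 3D filters from 2D Filters: parameter initialization for the 3D network. An NxN 2D kernel (an ImageNet-pretrained kernel; pooling kernels are handled the same way) is copied N times, normalized by dividing by N, and stacked into an NxNxN 3D kernel. The motivation is the assumption of temporal invariance over short time spans.
- Pacing receptive field growth in space, time and network depth: the strides also follow the Inception-v1 design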
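The kernel bootstrapping step described above can be illustrated with a small NumPy sketch. The random 3x3 kernel is only a stand-in for an ImageNet-pretrained filter, and `inflate_2d_kernel` is a hypothetical helper name, not code from the paper. The check at the end shows why the division by N matters: on a video made of one frame repeated N times, the inflated filter gives the same response as the 2D filter gives on that frame.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Repeat a (k, k) kernel t times along a new leading time axis, then divide by t."""
    w2d = np.asarray(w2d, dtype=np.float64)
    return np.repeat(w2d[np.newaxis, :, :], t, axis=0) / t

rng = np.random.default_rng(0)
k = 3
w2d = rng.standard_normal((k, k))      # stand-in for a pretrained 2D kernel
w3d = inflate_2d_kernel(w2d, t=k)      # N x N -> N x N x N

patch = rng.standard_normal((k, k))    # one image patch
static_video = np.stack([patch] * k)   # the same frame repeated k times

resp_2d = float(np.sum(w2d * patch))          # 2D filter response on the frame
resp_3d = float(np.sum(w3d * static_video))   # 3D filter response on the static clip
print(np.isclose(resp_2d, resp_3d))
```

Without the division by N the 3D response would be N times larger, so the pretrained activations would no longer be in the range the later layers expect.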
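T3D Contributions:
State of the art on Kinetics at the time
- The parameters of a 2D-pretrained ConvNet are transferred to a 3D ConvNet. The authors designed the following knowledge-transfer architecture:
Blue: the pretrained 2D network (teacher), which processes single RGB frames
Green: the 3D network whose parameters are to be learned (student), which processes the whole video
If the blue and green inputs come from the same video they form a positive pair, otherwise a negative pair. Many positive and negative pairs are fed through the whole network to optimize its ability to tell them apart.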
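A minimal sketch of the positive/negative pairing described above, in plain Python. Everything here is an assumption for illustration: the helper names `make_pairs` and `bce`, the 50/50 sampling ratio, and the use of binary cross-entropy on a matching probability are not taken from the paper's code, and the networks themselves are omitted.

```python
import math
import random

def make_pairs(video_ids, num_pairs, seed=0):
    """Return (frame_video, clip_video, label) triples; label is 1 for the same video."""
    rng = random.Random(seed)
    pairs = []
    for i in range(num_pairs):
        clip_vid = rng.choice(video_ids)           # input to the 3D student
        if i % 2 == 0:
            frame_vid = clip_vid                   # positive pair: frames from the same video
        else:
            others = [v for v in video_ids if v != clip_vid]
            frame_vid = rng.choice(others)         # negative pair: frames from another video
        pairs.append((frame_vid, clip_vid, int(frame_vid == clip_vid)))
    return pairs

def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted matching probability."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

pairs = make_pairs(["v0", "v1", "v2"], num_pairs=6)
for frame_vid, clip_vid, label in pairs:
    print(frame_vid, clip_vid, label)
```

In a real setup the matching probability would come from a small classifier on top of the teacher and student features, with the teacher frozen so that only the 3D network is updated.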
S3D Contributions:
- A kt x k x k convolution is factorized into a 1 x k x k convolution followed by a kt x 1 x 1 convolution
Sep-Conv:
Sep-Inc:
- Compared all-3D, all-2D, 3D-then-2D, and 2D-then-3D designs; 2D-then-3D (2D in the lower layers, 3D in the upper layers) works best
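To get a feel for what the kt x k x k factorization saves, here is a rough weight count in plain Python. The function names are hypothetical, biases are ignored, and the intermediate channel width is assumed to equal the output width, which the note does not specify.

```python
def conv3d_params(c_in, c_out, kt, k):
    """Weights in a full kt x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * kt * k * k

def separable_params(c_in, c_out, kt, k, c_mid=None):
    """Weights in a 1 x k x k conv followed by a kt x 1 x 1 conv (bias ignored)."""
    if c_mid is None:
        c_mid = c_out  # assumption: intermediate width equals output width
    spatial = c_in * c_mid * k * k    # 1 x k x k part
    temporal = c_mid * c_out * kt     # kt x 1 x 1 part
    return spatial + temporal

full = conv3d_params(64, 64, kt=3, k=3)
sep = separable_params(64, 64, kt=3, k=3)
print(full, sep, full / sep)  # 110592 49152 2.25
```

For a 3x3x3 kernel with equal channel widths the factorized form needs 9 + 3 = 12 weights per channel pair instead of 27.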
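R(2+1)D Contributions:
- The same factorized convolution, spatial first and then temporal. After factorization the loss drops faster, and it drops faster on the validation set than on the training set
- The R stands for Residual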
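One detail not covered in the note: the R(2+1)D paper picks the number of intermediate channels M between the spatial and temporal convolutions so that the factorized block has roughly as many parameters as the full 3D convolution, using M = floor(t d^2 N_in N_out / (d^2 N_in + t N_out)). The sketch below evaluates that formula; the function names are hypothetical and biases are ignored.

```python
def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate width M that roughly matches a t x d x d 3D conv in parameters."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def r2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d conv (n_in -> M) plus a t x 1 x 1 conv (M -> n_out)."""
    m = r2plus1d_mid_channels(n_in, n_out, t, d)
    return n_in * m * d * d + m * n_out * t

def r3d_params(n_in, n_out, t=3, d=3):
    """Weights in the full t x d x d 3D convolution."""
    return n_in * n_out * t * d * d

print(r2plus1d_mid_channels(64, 64))              # 144
print(r2plus1d_params(64, 64), r3d_params(64, 64))  # 110592 110592
```

Because the parameter budget is matched, any accuracy difference between the factorized and full 3D blocks comes from the factorization itself rather than from model size.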
P3D Contributions:
- The same spatial/temporal factorization. The paper compares the following three structures:
P3D-A is the most efficient
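Since the figure is not reproduced here, the three block variants can be summarized by how they wire the spatial operator S and the temporal operator T around the residual connection, as given in the P3D paper: serial (A), parallel (B), and a mix of both (C). In the sketch below S and T are toy placeholder functions, not real convolutions.

```python
def p3d_a(x, S, T):
    """P3D-A: serial, spatial then temporal: x + T(S(x))."""
    return x + T(S(x))

def p3d_b(x, S, T):
    """P3D-B: parallel branches: x + S(x) + T(x)."""
    return x + S(x) + T(x)

def p3d_c(x, S, T):
    """P3D-C: spatial output is both added directly and passed through T: x + S(x) + T(S(x))."""
    s = S(x)
    return x + s + T(s)

# Toy stand-ins for the 1x3x3 (spatial) and 3x1x1 (temporal) convolutions.
S = lambda v: 2 * v
T = lambda v: v + 1

print(p3d_a(1.0, S, T), p3d_b(1.0, S, T), p3d_c(1.0, S, T))  # 4.0 5.0 6.0
```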
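Across all of these methods, the more the features are augmented (optical flow, iDT), the higher the final accuracy.
All of these models are trained on large datasets such as Kinetics and Sports-1M, then fine-tuned on small datasets such as UCF101 and HMDB51.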
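CSN Contributions:
- Group convolution: channels are split into groups, which greatly reduces computation
- Channel separation acts as a regularizer, giving lower test error and avoiding overfitting
- Interaction-preserved CSN (ip-CSN): a 1x1x1 convolution (full connection across channels) first, then a depthwise convolution
- Interaction-reduced CSN (ir-CSN): a depthwise convolution directly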
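The savings from channel separation can be estimated with a short weight count. This is a sketch under stated assumptions: `group_conv3d_params` is a hypothetical helper, biases are ignored, the kernel is cubic, and only the layer that replaces the 3x3x3 convolution is counted, not the rest of the block.

```python
def group_conv3d_params(c_in, c_out, k, groups=1):
    """Weights in a k x k x k 3D convolution with the given number of channel groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k ** 3

c = 64
standard = group_conv3d_params(c, c, k=3)            # ordinary 3x3x3 conv
depthwise = group_conv3d_params(c, c, k=3, groups=c)  # one group per channel
pointwise = group_conv3d_params(c, c, k=1)            # 1x1x1 conv across all channels

ip_csn = pointwise + depthwise   # interaction-preserved: 1x1x1 then depthwise
ir_csn = depthwise               # interaction-reduced: depthwise only

print(standard, ip_csn, ir_csn)  # 110592 5824 1728
```

The depthwise layer handles only spatiotemporal filtering within each channel; in ip-CSN the added 1x1x1 layer restores the cross-channel mixing that the depthwise layer drops.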