Video Classification: The S3D (Separable 3D Convolutions) Model and Code Analysis

Latest recommended article published on 2025-01-28 11:01:58

老光头_ME2CS

Categories: Video Classification, Pytorch. Tags: deep learning

Article link: https://blog.csdn.net/Forrest97/article/details/107913630

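This article introduces the S3D (separable 3D convolutions) model, an efficient approach to video classification. By factoring each 3D convolution into separate temporal and spatial convolutions, S3D reduces model complexity while improving performance. Experiments show that S3D outperforms I3D in some cases, and analyses of weight distributions and feature clustering support its effectiveness and the importance of temporal information. The article also analyzes the code, explaining the model structure and implementation in detail.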

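S3D (separable 3D CNN) is a video classification model published at ECCV 2018. Its core idea is to replace the 3D convolutions of the I3D network with convolutions that are separated into a spatial part and a temporal part. Compared with I3D, the resulting S3D network has far fewer parameters and also performs better.

Original paper: Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

The paper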

Introduction

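Paragraph 1 gives an overview of the video classification problem and the existing datasets: Sports-1M [5], Kinetics [6], Something-something [7], ActivityNet [8], Charades [9].
Paragraph 2 names the three key factors in video classification, namely space, time, and computational cost: (1) how best to represent spatial information (i.e., recognizing the appearances of objects); (2) how best to represent temporal information (i.e., recognizing context, correlation and causation through time); and (3) how best to tradeoff model complexity with speed, both at training and testing time.
Paragraph 3 introduces the I3D network, which is accurate but computationally expensive, and raises three questions:

Is a 3D network needed at all?
Is it necessary to learn spatial and temporal features jointly, or can the two be separated?
How can efficiency and accuracy be improved?

To answer question 1, the authors build the following set of networks on top of I3D and compare 3D and 2D structures within the same framework.
[Figure: I3D and the network variants derived from it]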

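To answer question 2, the authors propose the spatio-temporally separable S3D network, in which each $k_{t} \times k \times k$ convolution kernel is replaced by two kernels: $1 \times k \times k$ (spatial) and $k_{t} \times 1 \times 1$ (temporal).
To answer question 3, building on the first two findings, the authors propose a spatio-temporal gating mechanism that further improves accuracy, giving the S3D-G model.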
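To see where the parameter savings can come from, the factorization above can be checked with a little arithmetic. The sketch below counts weights for one full $k_t \times k \times k$ convolution versus the $1 \times k \times k$ plus $k_t \times 1 \times 1$ pair. The function names and the example layer sizes are made up for illustration, biases and BatchNorm parameters are ignored, and the number of channels between the two separable convolutions is assumed to equal the output channel count.

```python
def conv3d_params(c_in, c_out, kt, k):
    """Weights in a full kt x k x k 3D convolution (no bias)."""
    return c_in * c_out * kt * k * k


def sep_conv3d_params(c_in, c_out, kt, k):
    """Weights in a 1 x k x k spatial conv followed by a kt x 1 x 1 temporal conv.

    Assumption: the spatial conv maps c_in -> c_out and the temporal conv
    maps c_out -> c_out (no bias in either).
    """
    spatial = c_in * c_out * 1 * k * k
    temporal = c_out * c_out * kt * 1 * 1
    return spatial + temporal


# Illustrative layer sizes (hypothetical): 64 -> 192 channels, kt = k = 3.
full = conv3d_params(64, 192, 3, 3)
separable = sep_conv3d_params(64, 192, 3, 3)
print(full, separable)  # 331776 221184
```

Under these assumptions the saving is not automatic: it depends on the channel counts and kernel size, since the temporal convolution costs `c_out * c_out * kt` weights regardless of `c_in`.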

Related work

Covers the I3D network and its heavy computational cost, earlier attempts at question 1 (mixed convolutional models), separable convolutions, and feature gating.

Experiment Setup

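Introduces the two datasets, Kinetics and Something-something.
The 224x224 network input is cropped from 256x256 frames.
For Kinetics, the learning rate is 0.1 for the first 60K steps, 0.01 up to 70K steps, and 0.001 up to 80K steps.
For Something-something: 10K steps, lr = 0.1.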
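The Kinetics schedule above is a plain step schedule. A minimal sketch follows, assuming the three values apply to the step ranges [0, 60K), [60K, 70K) and [70K, 80K); the function name and the exact boundary handling are assumptions, not taken from the paper or the referenced code.

```python
def kinetics_lr(step):
    """Step learning-rate schedule as read from the notes above.

    Assumption: boundaries are half-open, i.e. the rate drops exactly
    at step 60_000 and again at step 70_000.
    """
    if step < 60_000:
        return 0.1
    if step < 70_000:
        return 0.01
    return 0.001  # used up to the final step (80K in the notes above)


for s in (0, 59_999, 60_000, 75_000):
    print(s, kinetics_lr(s))
```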

Network surgery

Replacing all 3D convolutions with 2D

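I3D and I2D are each run on both datasets with the frames fed in normal order and in reversed order. The results are shown below. I2D performs about the same in both orders, which indicates that its predictions depend little on temporal order (they come mainly from static scene information). For I3D, the large gap between the two orders on Something-something is probably because that dataset requires fine-grained distinctions between visually similar action categories (for example, "Pushing something from left to right" and "Pushing something from right to left").
[Figure: I3D vs. I2D results on both datasets, normal and reversed frame order]

Feeding the frames in reverse order shows how much the temporal order of the input matters for the prediction.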
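The reasoning behind the reversed-order test can be illustrated with a toy example. This is not the paper's experiment: the two scoring functions below are hypothetical stand-ins, one that treats frames independently and averages over time (in the spirit of I2D), and one that applies a temporal filter across neighboring frames (in the spirit of a 3D convolution).

```python
def framewise_score(frames):
    """I2D-style toy: score each frame on its own, then average over time."""
    return sum(frames) / len(frames)


def temporal_score(frames):
    """3D-style toy: a temporal filter with weights (-1, +1), summed over time."""
    return sum(b - a for a, b in zip(frames, frames[1:]))


# Hypothetical clip: the x-position of an object moving left to right.
clip = [0.0, 1.0, 2.0, 3.0]
reversed_clip = clip[::-1]

print(framewise_score(clip), framewise_score(reversed_clip))  # same value
print(temporal_score(clip), temporal_score(reversed_clip))    # opposite signs
```

The frame-wise score cannot tell the two directions apart, while the temporal score flips sign, which mirrors why a left-to-right versus right-to-left distinction needs temporal modeling.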

Replacing some 3D convolutions with 2D

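This part compares the Bottom-Heavy-I3D model with the Top-Heavy-I3D models. The results below show that the Top-Heavy-I3D models perform much better, which indicates that replacing the 3D convolutions in the lower layers hurts the network less, while also cutting the computation substantially.
[Figure: results for Bottom-Heavy-I3D and Top-Heavy-I3D]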

Analysis of weight distribution of learned filters

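This part collects statistics on the weight distribution of the 3D convolution filters at each temporal offset. In the lower layers the variance is concentrated in the weights at the middle time step, whereas in the top layers the weights at the other temporal offsets also have large variance. This suggests that 3D convolutions are more meaningful the closer they are to the top of the network.
[Figure: weight distributions of the learned 3D filters by temporal offset]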

Separating temporal convolution from spatial convolutions

Comparing S3D with I3D shows an improvement in performance.
[Figure: S3D vs. I3D results]

tSNE analysis of the features

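tSNE is used to visualize how well the features extracted by each network separate into clusters. Looking across layers, the higher a convolutional layer sits in S3D, the more clearly its features separate.
Comparing across models, S3D also gives the cleanest feature separation.
[Figure: tSNE plots of the features]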

Spatio-temporal feature gating

$y=\sigma(W x+b) \odot x$
where $x \in \mathbb{R}^{n}$ is usually learned at final embedding layers close to the logit output, and $\odot$ is elementwise multiplication.
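The gating formula above can be written out directly. The following is a minimal pure-Python sketch of the vector form $y=\sigma(W x+b) \odot x$ only; the function names and the example values are assumptions for illustration, and it does not reproduce the gating layers of the S3D-G code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature_gating(x, W, b):
    """Compute y = sigmoid(W x + b) * x elementwise.

    x: list of n floats, W: n x n nested list, b: list of n floats.
    """
    n = len(x)
    gates = [
        sigmoid(sum(W[i][j] * x[j] for j in range(n)) + b[i])
        for i in range(n)
    ]
    return [g * xi for g, xi in zip(gates, x)]


x = [1.0, -2.0, 0.5]
W = [[0.0] * 3 for _ in range(3)]  # zero weights: every gate is sigmoid(b_i)
b = [0.0, 0.0, 0.0]

y = feature_gating(x, W, b)
print(y)  # every gate is sigmoid(0) = 0.5, so each feature is halved
```

Each gate lies in (0, 1), so the layer can only rescale a feature toward zero or pass it through nearly unchanged; it never changes the feature's sign.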

Code

model.py

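Overall, the S3D model is a fairly simple, straightforward feed-forward convolutional network.

Overall structure

This code is based on a PyTorch implementation on GitHub. A nice feature of the referenced code is that it comes with a pretrained weight file. The paper's main network is based on the I3D model, shown in the figure below, on top of which it proposes the S3D idea.
[Figure: the I3D-based network architecture]

As usual, start by printing the model structure. Comparing against the figure above, look first at the top-level modules (marked in red), each of which corresponds to a module of the backbone in the figure. The key point is the use of the S3D module and the 3D temporal separable Inception block: the mixed_* modules replace the original 3D convolutions, while the rest of the structure stays unchanged. Only part of the output is listed here.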

S3D(
  (base): Sequential(
    (0): SepConv3d(
      (conv_s): Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
      (bn_s): BatchNorm3d(64, eps=0.001, momentum=0.001, affine=True, track_running_stats=True)
      (relu_s): ReLU()
      (conv_t): Conv3d(64, 64, kernel_size=(7, 1, 1), stride=(2, 1, 1), padding=(3, 0, 0), bias=False)
      (bn_t): BatchNorm3d(64, eps=0.001, momentum=0.001, affine=True, track_running_stats=True)
      (relu_t): ReLU()
    )
    (1): MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), dilation=1, ceil_mode=False)
    (2): BasicConv3d(
      (conv): Conv3d(64, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
      (bn): BatchNorm3d(64, eps=0.001, momentum=0.001, affine=True, track_running_stats=True)
      (relu): ReLU()
    )
    (3): SepConv3d(
      (conv_s): Conv3d(64, 192, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
      (bn_s): BatchNorm3d(192, eps=0.001, momentum=0.001, affine=True, track_running_stats=True)
      (relu_s): ReLU()
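The kernel, stride and padding values in the printout above are enough to trace how the feature map shrinks through these first layers. The sketch below applies the standard output-size formula, floor((n + 2p - k) / s) + 1, to each listed layer. The input size of 64 frames at 224x224 is an assumption for illustration, the helper names are made up, and BatchNorm and ReLU are skipped because they do not change the shape.

```python
def out_size(n, kernel, stride, padding):
    """Output length along one axis for a conv or pooling layer (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def out_shape(shape, kernel, stride, padding):
    """Apply out_size to each of the (T, H, W) axes."""
    return tuple(
        out_size(n, k, s, p) for n, k, s, p in zip(shape, kernel, stride, padding)
    )


# (name, out_channels, kernel_size, stride, padding), copied from the printout.
layers = [
    ("base.0.conv_s", 64, (1, 7, 7), (1, 2, 2), (0, 3, 3)),
    ("base.0.conv_t", 64, (7, 1, 1), (2, 1, 1), (3, 0, 0)),
    ("base.1 MaxPool3d", 64, (1, 3, 3), (1, 2, 2), (0, 1, 1)),
    ("base.2.conv", 64, (1, 1, 1), (1, 1, 1), (0, 0, 0)),
    ("base.3.conv_s", 192, (1, 3, 3), (1, 1, 1), (0, 1, 1)),
]

shape = (64, 224, 224)  # assumed input: T=64 frames, H=W=224
trace = []
for name, channels, kernel, stride, padding in layers:
    shape = out_shape(shape, kernel, stride, padding)
    trace.append((name, channels, shape))
    print(f"{name:18s} C={channels:3d} (T, H, W)={shape}")
```

The trace makes the division of labor visible: `conv_s` only reduces the spatial size, `conv_t` only reduces the number of frames.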
