Action Recognition Reading Notes (3): Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

(Note: to avoid misunderstandings caused by inaccurate translation, key claims are quoted verbatim from the paper.)

Paper: Wang L., Xiong Y., Wang Z., et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. ECCV 2016.

Link: https://arxiv.org/abs/1608.00859

  This paper, published at ECCV 2016, proposes TSN (Temporal Segment Network) for video action recognition. TSN can be viewed as an improvement over the two-stream approach (see my earlier post on two-stream).
  The authors first identify two major difficulties in using ConvNets for video action recognition:

  1. First, long-range temporal structure plays an important role in understanding the dynamics in action videos. However, mainstream ConvNet frameworks usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length.
  2. Second, in practice, training deep ConvNets requires a large volume of training samples to achieve optimal performance. However, due to the difficulty in data collection and annotation, publicly available action recognition datasets (e.g. UCF101, HMDB51) remain limited, in both size and diversity. Consequently, very deep ConvNets, which have attained remarkable success in image classification, are confronted with high risk of over-fitting.

  In short:
1. Long-range temporal structure is important for understanding the dynamics in action videos, but most existing networks cannot capture it. The few approaches that try mostly rely on dense temporal sampling at a pre-defined interval, which incurs a heavy computational cost on long videos and still risks missing information in videos longer than the span the samples cover.
2. Training data was scarce (as of 2016), so very deep ConvNets are prone to over-fitting.

  To address these two difficulties, the authors propose solutions, which are also the paper's contributions:

  1. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. (From the Abstract; the third-to-last paragraph of the Introduction makes the same point.)
  2. Our second contribution is a set of good practices to overcome the aforementioned difficulties caused by the limited number of training samples, including 1) cross-modality pre-training; 2) regularization; 3) enhanced data augmentation. Meanwhile, to fully utilize visual content from videos, we empirically study four types of input modalities to two-stream ConvNets, namely a single RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field. (From the second-to-last paragraph of the Introduction.)

  In short:
1. The TSN network uses sparse sampling, which keeps computation efficient while still effectively covering the whole video, unlike two-stream, which uses only a single frame plus one short run of consecutive optical flow (redundant information).
2. A set of training practices: 1) cross-modality pre-training; 2) regularization; 3) enhanced data augmentation. To fully exploit the visual content of videos, four input modalities are studied: a single RGB image, stacked RGB difference, optical flow, and warped optical flow.


The following sections explain some of the concepts behind these contributions.

The TSN Network

[Figure: TSN architecture — spatial stream (green) and temporal stream (blue) over K video segments]
  The overall architecture, like the two-stream approach, is split into a spatial stream ConvNet (green blocks in the figure) and a temporal stream ConvNet (blue blocks), except that a deeper BN-Inception backbone is used and the recognition results of multiple sampled snippets are fused at the end.
  At training time, samples are drawn randomly: the video is divided into K equal segments (K = 3 in the figure). From each segment, one random RGB frame is taken as input to the spatial ConvNet, along with a short stack of consecutive optical flow frames (I could not find the exact number in the paper) as input to the temporal ConvNet. Each segment yields a classification score — three spatial and three temporal scores in the figure. These scores are then fused by some consensus function (the options are evenly averaging, maximum, and weighted averaging) to obtain the class scores of each stream. The spatial and temporal streams are trained separately, so each can be used on its own; when both are used for prediction, their scores are combined with a weight.
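The sampling-and-fusion procedure above can be sketched as follows. This is a minimal NumPy sketch, not code from the paper: the function names are mine, and the 1:1.5 spatial-to-temporal fusion weight is my recollection of the paper's setting, so treat it as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_snippet_indices(num_frames: int, k: int = 3):
    """Sparse sampling: split the video into k equal segments and draw
    one random frame index from each (training-time behaviour)."""
    seg_len = num_frames // k
    return [i * seg_len + int(rng.integers(seg_len)) for i in range(k)]

def segmental_consensus(snippet_scores: np.ndarray, mode: str = "avg"):
    """Fuse per-snippet class scores (shape (k, num_classes)) into one
    video-level score vector."""
    if mode == "avg":
        return snippet_scores.mean(axis=0)
    if mode == "max":
        return snippet_scores.max(axis=0)
    raise ValueError(f"unknown consensus mode: {mode}")

def fuse_streams(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Test-time fusion of the two streams as a weighted average."""
    return (w_spatial * spatial + w_temporal * temporal) / (w_spatial + w_temporal)
```

For example, a 30-frame video with k = 3 yields one index from each of the ranges [0, 10), [10, 20), and [20, 30).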

Training Techniques

1. When training the spatial ConvNet, initialize from a model pre-trained on ImageNet, then fine-tune.
2. When training the temporal ConvNet, the authors propose cross-modality pre-training: initialize the weights of the other-modality networks from the RGB network's weights. The weights cannot simply be copied over, because the data distributions differ. The fix is to linearly rescale the other modalities into the 0–255 range (the distribution of RGB data). The change in input distribution only affects the network's first convolutional layer, so the authors modify that layer's weights by hand: average the pre-trained RGB kernels across the input-channel dimension, then replicate the average across the new input channels.
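The first-layer weight transfer just described can be sketched like this (assuming the common (out_channels, in_channels, kH, kW) weight layout; the function name is mine):

```python
import numpy as np

def cross_modality_conv1(rgb_weights: np.ndarray, new_in_channels: int) -> np.ndarray:
    """Adapt ImageNet-pretrained first-conv weights of shape
    (out, 3, kH, kW) to a new input modality with `new_in_channels`
    channels: average over the RGB channel axis, then replicate the
    average across the new input channels."""
    avg = rgb_weights.mean(axis=1, keepdims=True)   # (out, 1, kH, kW)
    return np.repeat(avg, new_in_channels, axis=1)  # (out, new_in, kH, kW)
```

For a 10-frame optical flow stack (x and y components), `new_in_channels` would be 20 while the rest of the network keeps its RGB-pretrained weights unchanged.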
3. Modifying BN: batch normalization transforms each batch toward a standard Gaussian distribution, which speeds up convergence but carries a risk of over-fitting when fine-tuning on a small dataset. The paper therefore chooses not to update the mean and variance estimates of the BN layers during fine-tuning, except for the first BN layer, which still needs to adapt to the new input distribution. This modification is called Partial BN.
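A minimal sketch of the Partial BN idea, using a toy 1-D batch-norm class of my own (the real method operates on the BN layers inside BN-Inception):

```python
import numpy as np

class BatchNorm:
    """Toy 1-D batch norm. When `frozen` is True, the stored running
    statistics are used and never updated."""
    def __init__(self, dim: int, momentum: float = 0.9):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.momentum = momentum
        self.frozen = False

    def __call__(self, x: np.ndarray) -> np.ndarray:
        if self.frozen:
            m, v = self.mean, self.var
        else:
            m, v = x.mean(axis=0), x.var(axis=0)
            self.mean = self.momentum * self.mean + (1 - self.momentum) * m
            self.var = self.momentum * self.var + (1 - self.momentum) * v
        return (x - m) / np.sqrt(v + 1e-5)

def apply_partial_bn(bn_layers):
    """Partial BN: freeze the mean/variance of every BN layer except
    the first, which must adapt to the new input distribution."""
    for bn in bn_layers[1:]:
        bn.frozen = True
```

In PyTorch this is typically achieved by putting all BN modules except the first into `eval()` mode during fine-tuning, so their running statistics stop updating.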
4. Data augmentation: besides random cropping and horizontal flipping, the paper adds corner cropping (crops taken only from the corners or the center of the frame) and scale jittering.
5. Dropout: higher dropout ratios are used to further reduce over-fitting.
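Corner cropping can be sketched as five fixed crop positions, the four corners plus the center (the image and crop sizes below are illustrative, not the paper's code):

```python
def corner_crop_offsets(img_w: int, img_h: int, crop_w: int, crop_h: int):
    """Return the (x, y) top-left offsets of TSN-style corner cropping:
    the four corners of the frame plus the center crop."""
    return [
        (0, 0),                                              # top-left
        (img_w - crop_w, 0),                                 # top-right
        (0, img_h - crop_h),                                 # bottom-left
        (img_w - crop_w, img_h - crop_h),                    # bottom-right
        ((img_w - crop_w) // 2, (img_h - crop_h) // 2),      # center
    ]
```

For a 340×256 frame and a 224×224 crop, this yields the corners plus the centered offset (58, 16).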


Personal Thoughts

  I mainly wanted to study the network structure. TSN is an enhanced version of two-stream whose key architectural idea is the sparse sampling strategy. It uses multiple input modalities (RGB, RGB difference, optical flow, warped optical flow), which I find somewhat complex; in my view, a good algorithm uses only the raw RGB input (simple), still achieves high recognition accuracy, and keeps the computational cost low. The training employs many tricks, all of which are well worth learning from.


