TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation (Translation)

nclgsj1028

This post is a translation of the paper. If anything is inaccurate, please point it out in the comments.

TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation

Keywords: hybrid temporal convolutional and recurrent network

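The paper presents a hybrid temporal convolutional and recurrent network with an encoder-decoder structure. The encoder consists of a hierarchy of temporal convolutional kernels that capture the local motion changes of different actions. The decoder is a hierarchy of recurrent neural networks that can learn and memorize long-term action dependencies after the encoding stage. Its performance exceeds that of existing networks. (The paper was published in May 2017.)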

1 Introduction

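Most current action segmentation methods [27, 20, 7] use features extracted by convolutional neural networks, for example two-stream CNNs [19] or local 3D ConvNets [24].

Strengths of previous methods: in an encoder-decoder network, they use a hierarchy of 1D temporal convolutional kernels for the encoder and deconvolutional kernels for the decoder. These models are effective at capturing local motion and achieve state-of-the-art performance on various action segmentation datasets.

Weaknesses of previous methods: one obvious drawback is that, because their receptive fields have a fixed size, they cannot capture long-term dependencies between different actions in a video. For example, in a typical video of making a hot dog, squeezing ketchup usually happens after taking both the bun and the sausage.

In addition, a dilated temporal convolutional network, similar to WaveNet [25] for speech processing, was also tested in [10]. It performed worse, which further suggests that video and speech data differ even though both are represented as sequential features.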

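To overcome these limitations, we propose a new hybrid temporal convolutional and recurrent network (TricorNet) that models video action segmentation by attending to both local motion changes and long-term action dependencies. TricorNet takes frame-level features as the input to an encoder-decoder structure. In our case, the encoder is a temporal convolutional network made of 1D convolutional kernels, based on the observation that convolutional kernels are good at encoding local motion changes. The decoder is a hierarchy of recurrent neural networks, here bi-directional long short-term memory networks (Bi-LSTMs) [6], which can learn and memorize long-term action dependencies after encoding. The network is simple but very effective: it handles actions of different durations and models the dependencies between different actions.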

2 Related Work

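For action segmentation, many existing works take frame-level features as input and then build a temporal model over the whole video sequence.

Yeung et al. [27] propose an attention LSTM network to model the dependencies of the input frame features within a fixed-length window.

Huang et al. [7] consider the weakly supervised action labeling problem.

Singh et al. [20] propose a multi-stream bi-directional recurrent neural network for fine-grained action detection.

Lea et al. [10] propose two temporal convolutional networks for action segmentation and detection.

Our model design is inspired by [20] and [10], and we compare against both in the experiments.

Lea et al. [11] introduce a spatial CNN and a spatiotemporal CNN; the latter is an end-to-end approach that models the whole sequence from frames. Here, we use the output features of their spatial CNN as the input to our TricorNet and compare against the results of the spatial CNN.

Richard and Gall [18] use a statistical language model, focusing on localizing and classifying video segments of different lengths.

Kuehne et al. [9] propose an end-to-end generative approach to action segmentation that uses Hidden Markov Models on dense trajectory features.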

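Another related area is action detection.

Peng and Schmid [17] propose a two-stream R-CNN to detect actions.

Yeung et al. [28] introduce an action detection framework based on reinforcement learning.

Li et al. [14] propose a joint classification-regression recurrent neural network for detecting human actions from 3D skeleton data.

These methods mainly target short videos that contain a single action.

Recent work [29] considers action detection and dependencies in untrimmed and unconstrained YouTube videos.

That topic, however, is outside the scope of this paper.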

3 Model

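The input to TricorNet is a set of frame-level video features, for example the output of a CNN for each frame of a given video.

Figure 2: The overall framework of TricorNet. The encoder network consists of a hierarchy of temporal convolutional kernels that effectively capture local motion changes. The decoder network consists of a hierarchy of Bi-LSTMs that model long-term action dependencies.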

3.1 Temporal Convolutional and Recurrent Network

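Figure 2 sketches the general framework of TricorNet. TricorNet has an encoder-decoder structure, and both the encoder and the decoder consist of K layers. We denote the encoding layers by $L_E^{(i)}$ and the decoding layers by $L_D^{(i)}$, for $i = 1, 2, \dots, K$. Between the encoder and the decoder sits the middle layer $L_{mid}$. Here, K is a hyperparameter that can be tuned according to the size and appearance of the video data in the dataset. A large K means a deep network, which usually needs more data to train. Empirically, we set K = 2 in all experiments.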

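In the encoder, each layer consists of a (1D) temporal convolution, a non-linear activation function $E = f(\cdot)$, and max pooling across time. With $F_i$ denoting the number of convolutional filters in encoding layer $L_E^{(i)}$, we define the collection of filters as $W^{(i)} = \{W^{(j)}\}_{j=1}^{F_i}$ with a corresponding bias vector $b^{(i)} \in \mathbb{R}^{F_i}$. Given the output of the previous encoding layer after pooling, $E^{(i-1)}$, we compute the activations of the current layer:

$$E^{(i)} = f\left(W^{(i)} * E^{(i-1)} + b^{(i)}\right)$$

where $*$ denotes the 1D convolution operator. Note that $E^{(0)} = (X_1, \dots, X_T)$ is the collection of input frame-level feature vectors. The length of the convolutional kernels is another hyperparameter. A longer kernel gives a larger receptive field but also reduces the discrimination between adjacent time steps. Best practices are reported in Section 4.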
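The encoder layer described above can be sketched in plain Python. This is only an illustration under assumptions that are not in the paper: the function names are hypothetical, the convolution is zero-padded so the sequence length is preserved, a plain ReLU stands in for the activation $f(\cdot)$, and the convolution is written as cross-correlation, as is common in deep learning libraries.

```python
def temporal_conv1d(x, weights, bias):
    """Zero-padded 1D convolution over time (hypothetical helper).

    x:       list of T frames, each a list of C_in floats
    weights: weights[f][k][c] for filter f, kernel tap k, input channel c
    bias:    list of F floats
    Returns a list of T frames, each with F channels.
    """
    T, d = len(x), len(weights[0])
    left = (d - 1) // 2            # padding so the output keeps length T
    out = []
    for t in range(T):
        frame = []
        for w_f, b_f in zip(weights, bias):
            acc = b_f
            for k in range(d):
                s = t + k - left
                if 0 <= s < T:     # taps outside the sequence see zeros
                    acc += sum(w * v for w, v in zip(w_f[k], x[s]))
            frame.append(acc)
        out.append(frame)
    return out


def relu(frames):
    """Plain ReLU, used here as a stand-in for f(.)."""
    return [[max(0.0, v) for v in frame] for frame in frames]


def max_pool_time(frames, width=2):
    """Max pooling across time; a trailing incomplete window is dropped."""
    pooled = []
    for start in range(0, len(frames) - width + 1, width):
        window = frames[start:start + width]
        pooled.append([max(col) for col in zip(*window)])
    return pooled


def encoder_layer(x, weights, bias, activation=relu):
    """Convolution, then activation, then pooling across time."""
    return max_pool_time(activation(temporal_conv1d(x, weights, bias)))
```

With one filter of length 3 and all-ones weights, a sequence of four one-channel frames is reduced to two frames, which shows how pooling halves the temporal resolution at each encoder layer.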

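Here, the middle layer $L_{mid}$ is the output of the last encoding layer $E^{(K)}$ after pooling, and it is fed to the decoder network as input. The decoder has a reversed hierarchy compared to the encoder and also consists of K layers. We use Bi-LSTM [6] units to model long-range action dependencies and upsampling to decode frame-level labels. Each layer $L_D^{(i)}$ in the decoder is therefore a combination of upsampling and a Bi-LSTM.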

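In general, a recurrent neural network uses a hidden state representation $h = (h_1, h_2, \dots, h_t)$ to map an input vector $x = (x_1, x_2, \dots, x_t)$ to an output sequence $y = (y_1, y_2, \dots, y_t)$. An LSTM unit updates its hidden state as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma(\cdot)$ is the sigmoid activation function, $\tanh(\cdot)$ is the hyperbolic tangent activation function, and $i_t$, $f_t$, $o_t$, and $c_t$ are the input gate, forget gate, output gate, and memory cell activation vectors, respectively. The $W$ and $b$ terms are the weights and biases.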
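A single LSTM update can be written out in a few lines, which makes the role of each gate concrete. The sketch below assumes the standard LSTM formulation with separate input and recurrent weight matrices per gate; the names `lstm_step`, `affine`, and the `params` layout are hypothetical and chosen only for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h):
    """Row-wise W_x @ x + W_h @ h + b for plain Python lists."""
    return [
        sum(wx * xv for wx, xv in zip(row_x, x))
        + sum(wh * hv for wh, hv in zip(row_h, h))
        + b_j
        for row_x, row_h, b_j in zip(W_x, W_h, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update.

    params[name] = (W_x, W_h, b) for name in 'i', 'f', 'o', 'c'
    (an assumed layout, not taken from the paper).
    """
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]   # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]   # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]   # output gate
    g_t = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)] # candidate cell input
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate input to 0, so the cell state is simply halved at each step. This is a convenient sanity check for an implementation.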

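A Bi-LSTM layer contains two LSTMs: one runs forward in time and the other runs backward. The output is the concatenation of the results from the two directions.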
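The bidirectional wrapping itself is independent of the recurrent cell. As a minimal sketch, assume a hypothetical `step(x_t, h_prev) -> h_t` function for each direction (a full LSTM would also carry the cell state, which could be packed into the state passed along). The helper below runs one pass in each direction and concatenates the per-step results.

```python
def run_bidirectional(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """xs: list of T input vectors; step_*(x_t, h_prev) returns h_t as a list."""
    fwd, h = [], h0_fwd
    for x in xs:                       # forward in time
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0_bwd
    for x in reversed(xs):             # backward in time
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                      # re-align with forward time order
    # Concatenate both directions at every time step.
    return [f + b for f, b in zip(fwd, bwd)]
```

If each direction has H hidden units, every output vector has 2H entries, which is where the factor of two in the decoder output dimension comes from.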

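In TricorNet, we use the updated sequence of hidden states $H^{(i)}$ as the output of each decoding layer $L_D^{(i)}$. We use $H_i$ to denote the number of hidden states in a single LSTM layer. For layer $L_D^{(i)}$, the output dimension at each time step is therefore $2H_i$, as the concatenation of a forward and a backward LSTM. The output of the whole decoder is a matrix $D = H^{(K)} \in \mathbb{R}^{2H_K \times T}$, which means that at each time step $t = 1, 2, \dots, T$ we have a $2H_K$-dimensional vector $D_t$ (the output of the last decoding layer $L_D^{(K)}$).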

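Finally, a softmax layer across time computes the probability that the label of the frame at each time step t takes one of the c action classes:

$$\hat{Y}_t = \mathrm{softmax}(W_d D_t + b_d)$$

where $\hat{Y}_t \in [0, 1]^c$ is the output probability vector over the c classes at time step t, $D_t$ is the decoder output at time step t, $W_d$ is the weight matrix, and $b_d$ is the bias term.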
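The per-frame classification step can be illustrated as follows. The sketch applies a linear map followed by a softmax to each decoder vector and then takes the most probable class as the frame label. The function names are hypothetical, and the argmax decoding is an assumption made for illustration.

```python
import math


def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def frame_probabilities(D, W_d, b_d):
    """D: list of T decoder vectors; W_d: one weight row per class; b_d: one bias per class."""
    return [
        softmax([sum(w * d for w, d in zip(row, D_t)) + b for row, b in zip(W_d, b_d)])
        for D_t in D
    ]


def predict_labels(probs):
    """Index of the most probable class at every time step."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Each row of the result sums to one, so the output can be read directly as a distribution over the c action classes for that frame.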

3.2 Model Variants

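To find the best way of combining temporal convolutional layers and Bi-LSTM layers, we test three different model designs. Section 4 reports the results for each of them.

TricorNet: This model encodes with a hierarchy of temporal convolutional layers and decodes with a hierarchy of Bi-LSTMs, so that long-term action dependencies are learned at different levels.

TricorNet (high): TricorNet (high) deploys Bi-LSTM units only at the middle layer $L_{mid}$, the layer between the encoder and the decoder. Both the encoding and the decoding layers use temporal convolutional kernels. The intuition is to use the Bi-LSTM to model sequence dependencies at an abstract level, where the information is highly compressed, while keeping encoding and decoding local. This variant is expected to perform well when the action labels are coarse.

TricorNet (low): TricorNet (low) deploys Bi-LSTM units only at layer $L_D^{(K)}$, the last layer of the decoder. It uses temporal convolutional kernels for the encoding layers $L_E^{(i)}$ and for the decoding layers $L_D^{(i)}$ with $i < K$. The intuition is to use the Bi-LSTM only for decoding details. When the action labels are fine-grained, it is better to learn dependencies at the low level, where the information is less compressed.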
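The three designs differ only in where the Bi-LSTM units sit. A small helper makes the layout explicit. The function `layer_plan` and the labels `"conv"`, `"bilstm"`, and `"identity"` are hypothetical names for this sketch, and pooling and upsampling are left out for brevity.

```python
def layer_plan(variant, K=2):
    """Return (encoder, middle, decoder) layer types for a TricorNet variant."""
    encoder = ["conv"] * K                      # all variants encode with temporal convolutions
    if variant == "TricorNet":
        return encoder, "identity", ["bilstm"] * K
    if variant == "TricorNet (high)":
        return encoder, "bilstm", ["conv"] * K  # Bi-LSTM only at the middle layer
    if variant == "TricorNet (low)":
        # Bi-LSTM only at the last decoding layer, convolutions for i < K.
        return encoder, "identity", ["conv"] * (K - 1) + ["bilstm"]
    raise ValueError(f"unknown variant: {variant}")
```

For K = 2, `layer_plan("TricorNet (low)")` yields two convolutional encoder layers, a pass-through middle layer, and a decoder made of one convolutional layer followed by one Bi-LSTM layer.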

3.3 Implementation Details

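In this work, some hyperparameters are fixed for all TricorNets and used in all experiments. In the encoder, max pooling has a width of 2. Each temporal convolutional layer $L_E^{(i)}$ has $32 + 32i$ filters. In the decoder, upsampling is done by simply repeating each entry twice. The latent states of each LSTM layer are given by $2H_i$. We use the normalized rectified linear unit [10] as the activation function for all temporal convolutional layers, defined as:

$$\text{Norm.ReLU}(\cdot) = \frac{\text{ReLU}(\cdot)}{\max(\text{ReLU}(\cdot)) + \epsilon}$$

where $\max(\text{ReLU}(\cdot))$ is the maximal activation value in the layer and $\epsilon$ is a small positive constant.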
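The normalized ReLU can be sketched as below. Two points are assumptions rather than facts from the text: the maximum is taken over all time steps and channels of the layer output (a per-time-step maximum is another plausible reading), and the default `eps` is only an illustrative value for the small constant.

```python
def norm_relu(frames, eps=1e-5):
    """Normalized ReLU over one layer's output.

    frames: list of T activation vectors.
    eps:    assumed small constant that avoids division by zero.
    """
    rectified = [[max(0.0, v) for v in frame] for frame in frames]
    # Assumption: the peak is taken over the whole layer output.
    peak = max(max(frame) for frame in rectified)
    return [[v / (peak + eps) for v in frame] for frame in rectified]
```

After this activation all values lie between 0 and 1, with the strongest response in the layer mapped to approximately 1.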

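In our experiments, the models are trained from scratch using only the training set of the target dataset. The weights and parameters are learned with the categorical cross-entropy loss, using stochastic gradient descent with ADAM [8] step updates. We also add spatial dropout between the convolutional layers and between the Bi-LSTM layers. The models are implemented with Keras [1] and TensorFlow.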
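The fixed hyperparameters of this section (pooling width 2, $32 + 32i$ filters, upsampling by repetition) determine how the temporal length and channel count change through the network. The helpers below trace these quantities. Their names are hypothetical, and the trace assumes that the sequence length T is divisible by $2^K$ so that the decoder restores the original length.

```python
def upsample_time(frames, factor=2):
    """Upsample by repeating each time step `factor` times."""
    return [list(frame) for frame in frames for _ in range(factor)]


def trace_lengths(T, K=2, pool=2):
    """Temporal length after each encoder layer and each decoder layer."""
    enc, t = [], T
    for _ in range(K):
        t //= pool              # max pooling of width `pool`
        enc.append(t)
    dec = []
    for _ in range(K):
        t *= pool               # upsampling by repetition
        dec.append(t)
    return enc, dec


def encoder_filters(K=2):
    """Number of filters 32 + 32i for encoding layers i = 1..K."""
    return [32 + 32 * i for i in range(1, K + 1)]
```

For K = 2 and a 128-frame input, the lengths go 64, 32 in the encoder and 64, 128 in the decoder, and the two encoding layers use 64 and 96 filters.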

The experiments and evaluation sections are omitted.

References

[1] F. Chollet. keras. https://github.com/fchollet/keras, 2015.

[2] J. Cross and L. Huang. Incremental parsing with minimal features using bi-directional lstm. In Association for Computational Linguistics, 2016.

[3] P. Das, C. Xu, R. Doell, and J. J. Corso. A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[4] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference On, pages 3281–3288. IEEE, 2011.

[5] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al. Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, 2014.

[6] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional lstm networks for improved phoneme classification and recognition. Artificial Neural Networks: Formal Models and Their Applications–ICANN 2005, pages 753–753, 2005.

[7] D.-A. Huang, L. Fei-Fei, and J. C. Niebles. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, 2016.

[8] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] H. Kuehne, J. Gall, and T. Serre. An end-to-end generative framework for video segmentation and recognition. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–8. IEEE, 2016.

[10] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks for action segmentation and detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[11] C. Lea, A. Reiter, R. Vidal, and G. D. Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pages 36–52. Springer, 2016.

[12] C. Lea, R. Vidal, and G. D. Hager. Learning convolutional action primitives for fine-grained action recognition. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 1642–1649. IEEE, 2016.

[13] C. Lea, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks: A unified approach to action segmentation. In Computer Vision–ECCV 2016 Workshops, pages 47–54. Springer, 2016.

[14] Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu. Online human action detection using joint classification-regression recurrent neural networks. In European Conference on Computer Vision, 2016.

[15] P. Mettes, J. C. van Gemert, and C. G. Snoek. Spot on: Action localization from pointly-supervised proposals. In European Conference on Computer Vision, 2016.

[16] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, 2015.

[17] X. Peng and C. Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision, 2016.

[18] A. Richard and J. Gall. Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3131–3140, 2016.

[19] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.

[20] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[21] S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2620–2628, 2016.

[22] S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738. ACM, 2013.

[23] L. Tao, L. Zappella, G. D. Hager, and R. Vidal. Surgical gesture segmentation and recognition. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 339–346. Springer, 2013.

[24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision, 2015.

[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. Technical report, arXiv:1609.03499, 2016.

[26] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.

[27] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. Technical report, arXiv:1507.05738, 2015.

[28] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[29] L. Zhou, C. Xu, and J. J. Corso. Procnets: Learning to segment procedures in untrimmed and unconstrained videos. Technical report, arXiv:1703.09788, 2017.
