Discriminative Feature Learning for Unsupervised Video Summarization (Paper Translation)

This paper addresses unsupervised video summarization by strengthening feature learning with a variance loss, designing a Chunk and Stride Network (CSNet) to handle long videos, and introducing a difference attention mechanism for dynamic information. Experiments show that the approach improves model performance on benchmark datasets, and is particularly strong at summarizing long videos.

Discriminative Feature Learning for Unsupervised Video Summarization

Abstract


In this paper, we address the problem of unsupervised video summarization that automatically extracts key-shots from an input video. Specifically, we tackle two critical issues based on our empirical observations: (i) ineffective feature learning due to flat distributions of output importance scores for each frame, and (ii) training difficulty when dealing with long-length video inputs. To alleviate the first problem, we propose a simple yet effective regularization loss term called variance loss. The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance. For the second problem, we design a novel two-stream network named Chunk and Stride Network (CSNet) that utilizes local (chunk) and global (stride) temporal views of the video features. Our CSNet gives better summarization results for long-length videos compared to the existing methods. In addition, we introduce an attention mechanism to handle the dynamic information in videos. We demonstrate the effectiveness of the proposed methods by conducting extensive ablation studies and show that our final model achieves new state-of-the-art results on two benchmark datasets.

Introduction


Video has become a highly significant form of visual data, and the amount of video content uploaded to various online platforms has increased dramatically in recent years. In this regard, efficient ways of handling video have become increasingly important. One popular solution is to summarize videos into shorter ones without missing semantically important frames. Over the past few decades, many studies (Song et al. 2015; Ngo, Ma, and Zhang 2003; Lu and Grauman 2013; Kim and Xing 2014; Khosla et al. 2013) have attempted to solve this problem. Recently, Zhang et al. showed promising results using deep neural networks, and a lot of follow-up work has been conducted in areas of supervised (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) and unsupervised learning (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018).


Supervised learning methods (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) utilize ground truth labels that represent the importance score of each frame to train deep neural networks. Since human-annotated data is used, semantic features are faithfully learned. However, labeling many video frames is expensive, and overfitting problems frequently occur when there is insufficient labeled data. These limitations can be mitigated by using unsupervised learning methods (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018). However, since there is no human labeling in this setting, the way the network is supervised needs to be designed appropriately.


Our baseline method (Mahasseni, Lam, and Todorovic 2017) uses a variational autoencoder (VAE) (Kingma and Welling 2013) and generative adversarial networks (GANs) (Goodfellow et al. 2014) to learn video summarization without human labels. The key idea is that a good summary should reconstruct the original video seamlessly. The features of each input frame, obtained by a convolutional neural network (CNN), are multiplied by the predicted importance scores. These weighted features are then passed to a generator to restore the original features, and the discriminator is trained to distinguish between the generated (restored) features and the original ones.
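
The baseline pipeline can be made concrete with a minimal sketch. The PyTorch code below only illustrates the idea described above (score-weighted frame features passed to a generator, with a discriminator judging reconstructions); the module names, layer sizes, and the use of plain LSTMs in place of the full VAE/GAN training setup are assumptions made for readability, not the authors' exact architecture.

```python
# Minimal sketch (assumed names/shapes) of the baseline idea:
# frame features are weighted by predicted importance scores,
# a generator tries to reconstruct the original features, and
# a discriminator tells reconstructed features from real ones.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):          # hypothetical frame-score predictor
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):          # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.head(h).squeeze(-1)  # importance scores in (0, 1), shape (B, T)

class Generator(nn.Module):            # simplified stand-in for the VAE generator
    def __init__(self, feat_dim=1024, hidden=512):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.dec = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, weighted_feats):
        z, _ = self.enc(weighted_feats)
        recon, _ = self.dec(z)
        return recon                    # reconstructed frame features

class Discriminator(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.head(h[:, -1])      # probability that the sequence is "real"

# One forward pass on dummy CNN features (B=1, T=120 frames, 1024-D).
feats = torch.randn(1, 120, 1024)
scores = FrameScorer()(feats)                       # (1, 120)
recon = Generator()(feats * scores.unsqueeze(-1))   # score-weighted reconstruction
disc = Discriminator()
real_p, fake_p = disc(feats), disc(recon)           # adversarial training signals
```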


Although it is fair to say that a good summary can represent and restore the original video well, the original features can also be restored well from uniformly distributed frame-level importance scores. This trivial solution leads to difficulties in learning discriminative features for finding key-shots. Our approach aims to overcome this problem. When the output scores become flatter, their variance decreases dramatically. Building on this mathematically obvious fact, we propose a simple yet powerful way to increase the variance of the scores: the variance loss is simply defined as the reciprocal of the variance of the predicted scores.
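
Read literally, that definition gives a loss of the form L_var = 1 / (Var(p) + eps) over the predicted frame scores p. The sketch below assumes a small eps for numerical stability and per-video averaging, which may differ in detail from the paper's exact formulation, but it shows how a flat score distribution is penalized far more heavily than a spread-out one.

```python
# A minimal sketch of the variance loss described above:
# the reciprocal of the variance of the predicted frame scores.
# The epsilon term and the batch averaging are our assumptions,
# not necessarily the paper's exact formulation.
import torch

def variance_loss(scores: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """scores: (B, T) frame-level importance scores in [0, 1]."""
    var = scores.var(dim=1, unbiased=False)   # per-video variance over frames
    return (1.0 / (var + eps)).mean()         # large loss when scores are flat

# Flat scores give a huge loss; spread-out scores give a small one.
flat = torch.full((1, 100), 0.5)
spread = torch.rand(1, 100)
print(variance_loss(flat).item(), variance_loss(spread).item())
```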

In addition, to learn more discriminative features, we propose the Chunk and Stride Network (CSNet), which simultaneously utilizes local (chunk) and global (stride) temporal views of the video features.
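
To make the local/global split concrete, the snippet below shows one plausible way to form the two views, where chunks group consecutive frames and strides sample every k-th frame; the exact indexing is our assumption for illustration rather than the paper's verified implementation.

```python
# Illustrative (assumed) indexing for the chunk and stride views:
# "chunk" groups consecutive frames (local view), while "stride"
# samples every k-th frame (global view) of the same sequence.
import torch

def chunk_and_stride_views(feats: torch.Tensor, k: int):
    """feats: (T, D) frame features; k: number of chunks / stride length."""
    T, D = feats.shape
    assert T % k == 0, "for simplicity, assume T is divisible by k"
    chunks = feats.view(k, T // k, D)                        # k blocks of consecutive frames
    strides = torch.stack([feats[i::k] for i in range(k)])   # k subsequences of every k-th frame
    return chunks, strides

feats = torch.arange(12).float().view(12, 1)  # toy sequence of 12 "frames"
chunks, strides = chunk_and_stride_views(feats, k=4)
print(chunks.squeeze(-1))    # [[0,1,2], [3,4,5], [6,7,8], [9,10,11]]  -> local view
print(strides.squeeze(-1))   # [[0,4,8], [1,5,9], [2,6,10], [3,7,11]]  -> global view
```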
