Paper reading [TPAMI-2022] Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

This paper proposes a new framework, GLMGIR, for the automatic narration of fine-grained team-sports videos. Through multi-granularity interaction modeling and attention modules, the framework captures individual actions, spatio-temporal dependencies, and complex interactions, and generates detailed commentary. The paper also introduces a new Sports Video Narrative dataset and a Fine-grained Captioning Evaluation (FCE) metric to advance research in this area. Experimental results demonstrate the effectiveness of the method.

Paper search (studyai.com)

Search for the paper: Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Search URL: http://www.studyai.com/search/whole-site/?q=Fine-Grained+Video+Captioning+via+Graph-based+Multi-Granularity+Interaction+Learning

Keywords

Sports; Task analysis; Feature extraction; Linguistics; Games; Three-dimensional displays; Measurement; Video caption; representation learning; graphCNN; fine-grained; multiple granularity

Machine vision; natural language processing

Fine-grained vision; language representation learning; video captioning; time and space

Abstract

Learning to generate continuous linguistic descriptions for multi-subject interactive videos in great details has particular applications in team sports auto-narrative.

In contrast to traditional video caption, this task is more challenging as it requires simultaneous modeling of fine-grained individual actions, uncovering of spatio-temporal dependency structures of frequent group interactions, and then accurate mapping of these complex interaction details into long and detailed commentary.

To explicitly address these challenges, we propose a novel framework Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR) for fine-grained team sports auto-narrative task.

A multi-granular interaction modeling module is proposed to extract among-subjects’ interactive actions in a progressive way for encoding both intra- and inter-team interactions.
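The paper's module design is not reproduced in this summary, but since the keywords mention graph CNNs, the idea of progressively encoding intra-team and then inter-team interactions can be sketched roughly as follows. All shapes, adjacency matrices, and the single-layer graph convolution below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step: aggregate neighbor features via the
    degree-normalized adjacency, then apply a linear map and ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))  # row-normalize by degree
    return np.maximum(D_inv @ A_hat @ X @ W, 0.0)

# Toy scene: 4 players, each with an 8-dim appearance/motion feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Intra-team edges only (players 0-1 on one team, 2-3 on the other) ...
A_intra = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
# ... then a denser graph that also connects opposing players.
A_full = np.ones((4, 4)) - np.eye(4)

W = rng.normal(size=(8, 8))
h_intra = gcn_layer(X, A_intra, W)       # team-level granularity
h_full = gcn_layer(h_intra, A_full, W)   # game-level granularity
print(h_full.shape)  # (4, 8)
```

The two-stage propagation (sparse intra-team graph first, dense inter-team graph second) is what "progressive" is taken to mean here; the actual granularity hierarchy in the paper may differ.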

Based on the above multi-granular representations, a multi-granular attention module is developed to consider action/event descriptions of multiple spatio-temporal resolutions.
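A minimal sketch of what attending over multiple spatio-temporal resolutions could look like: a decoder query attends separately over frame-level and clip-level features, and the two context vectors are concatenated. The mean-pooling scheme and plain dot-product attention here are simplifying assumptions, not the paper's actual module:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: weight each feature row by its similarity
    to the query, then take the weighted average."""
    w = softmax(features @ query)
    return w @ features

rng = np.random.default_rng(1)
frame_feats = rng.normal(size=(16, 8))             # fine temporal resolution
clip_feats = frame_feats.reshape(4, 4, 8).mean(1)  # coarser 4-frame pooling

query = rng.normal(size=8)  # decoder state at one word-generation step
context = np.concatenate([attend(query, frame_feats),
                          attend(query, clip_feats)])  # multi-granular context
print(context.shape)  # (16,)
```

The concatenated context would then feed the caption decoder, letting one word draw on fine action detail while another draws on the coarser event level.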

Both modules are integrated seamlessly and work in a collaborative way to generate the final narrative.

In the meantime, to facilitate reproducible research, we collect a new video dataset from YouTube.com called Sports Video Narrative dataset (SVN).

It is a novel direction as it contains 6K team sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (e.g., sentences).

Furthermore, as previous metrics such as METEOR (i.e., used in coarse-grained video caption task) DO NOT cope with fine-grained sports narrative task well, we hence develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurate the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure.
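As a toy illustration of why a fine-grained metric differs from METEOR-style matching, the following hypothetical score counts agreement only over action keywords, ignoring function words that n-gram metrics reward. The vocabulary and F1 scoring below are invented for illustration; the real FCE metric is more involved and also measures the overall spatio-temporal interactional structure:

```python
def action_overlap_score(reference, candidate, action_vocab):
    """Toy fine-grained score: F1 over action keywords only."""
    ref = {w for w in reference.lower().split() if w in action_vocab}
    cand = {w for w in candidate.lower().split() if w in action_vocab}
    if not ref or not cand:
        return 0.0
    hits = len(ref & cand)
    p, r = hits / len(cand), hits / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

vocab = {"dribbles", "passes", "shoots", "dunks", "blocks"}
ref = "Curry dribbles past the defender and shoots a three"
good = "He dribbles and then shoots from deep"
bad = "He passes the ball"
print(action_overlap_score(ref, good, vocab))  # 1.0
print(action_overlap_score(ref, bad, vocab))   # 0.0
```

Note that a fluency-oriented metric could still score `bad` reasonably on word overlap ("He", "the", "ball"), while an action-aware score correctly gives it zero credit.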

Extensive experiments on our SVN dataset have demonstrated the effectiveness of the proposed framework for fine-grained team sports video auto-narrative…

Authors

Yichao Yan, Ning Zhuang, Bingbing Ni, Jian Zhang, Minghao Xu, Qiang Zhang, Zheng Zhang, Shuo Cheng, Qi Tian, Yi Xu, Xiaokang Yang, Wenjun Zhang
