iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

1. problems:

1.1 Most prior art in visual understanding relies solely on analyzing the “what” (e.g., event recognition) and the “where” (e.g., event localization), which, in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention.

1.2 Common-sense reasoning [43], which raises the interesting question of “why”, is a gap in today’s pattern-learning-based systems, which rely on the likelihood of observing object Y given object X, P(Y|X); a short note contrasting this with an interventional view follows.
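To make that gap concrete, here is a rough formal contrast in my own notation (the paper's exact formulation may differ): a purely observational model conditions on X, whereas a causal treatment intervenes on X via back-door adjustment over a confounder Z.

P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X)
\qquad \text{vs.} \qquad
P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, z)\, P(z)

The only difference is that P(z) replaces P(z|X), so spurious co-occurrences that enter through the confounder are averaged out rather than reinforced.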

2. contribution:

2.1 We present iPerceive, a framework that generates common-sense features by inferring the causal relationships between events in videos, using contextual losses as a self-supervised training mechanism.

2.2 iPerceive DVC and iPerceive VideoQA further the state-of-the-art on the ActivityNet Captions and TVQA datasets, respectively.

3. introduction:

Top: An example of a cognitive error in DVC. While the girl tries to block the boy’s dunking attempt, his jumping (event X) eventually leads to him dunking the basketball through the hoop (event Y).

Bottom: An example of incorrect attention, where conventional DVC approaches correlate a chef and steak with the activity of cooking without even attending to the nearby oven.

4. framework:

4.1 iPerceive DVC

iPerceive DVC generates common-sense vectors from the temporal events that the event proposal module localizes (left). Features from all modalities are sent to the corresponding encoder-decoder Transformers (middle). Upon fusing the processed features, we finally output the next word in the caption using the distribution over the vocabulary (right).
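To make this flow concrete, below is a minimal PyTorch sketch of the three stages (common-sense enrichment, per-modality encoder-decoder Transformers, fusion into a vocabulary distribution). The module names, dimensions, and the single linear layer standing in for the common-sense feature generator are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ModalityTransformer(nn.Module):
    """One encoder-decoder Transformer per modality (e.g., video, audio, speech)."""
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.transformer = nn.Transformer(d_model, nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)

    def forward(self, src, tgt):
        # src: event features for this modality, tgt: embedded caption prefix
        return self.transformer(src, tgt)                 # (B, T_caption, d_model)

class IPerceiveDVCSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        self.commonsense = nn.Linear(d_model, d_model)    # stand-in for the common-sense generator
        self.video_tf = ModalityTransformer(d_model)
        self.audio_tf = ModalityTransformer(d_model)
        self.speech_tf = ModalityTransformer(d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)
        self.generator = nn.Linear(d_model, vocab_size)   # distribution over the vocabulary

    def forward(self, video, audio, speech, caption_emb):
        # 1) Common-sense vectors for the localized event enrich the visual features.
        video = video + self.commonsense(video)
        # 2) Per-modality encoder-decoder Transformers, conditioned on the caption so far.
        v = self.video_tf(video, caption_emb)
        a = self.audio_tf(audio, caption_emb)
        s = self.speech_tf(speech, caption_emb)
        # 3) Fuse the processed features and predict the next word.
        fused = self.fuse(torch.cat([v, a, s], dim=-1))
        return self.generator(fused).softmax(dim=-1)      # (B, T_caption, vocab_size)

# toy shapes: 2 localized events, 30 feature steps per modality, 5-token caption prefix
model = IPerceiveDVCSketch()
probs = model(torch.randn(2, 30, 512), torch.randn(2, 30, 512),
              torch.randn(2, 30, 512), torch.randn(2, 5, 512))
print(probs.shape)                                        # torch.Size([2, 5, 10000])

In the actual framework the common-sense vectors come from a dedicated module trained with the contextual, self-supervised losses mentioned above rather than a single linear layer; the sketch only shows where they enter the pipeline.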

 

4.2 iPerceive VideoQA

iPerceive VideoQA consists of two main components: feature fusion and frame selection.

For feature fusion, we encode features using a convolutional encoder, generate common-sense vectors from the input video sequence, and extract dense captions using iPerceive DVC (left). Features from all modalities (video, dense captions, QA and subtitles) are then fed to dual-layer attention: word/object and frame-level (middle). Upon fusing the attended features, we calculate frame-relevance scores (right).
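Below is a similar minimal PyTorch sketch of the feature-fusion and frame-selection flow: convolutional encoding plus common-sense enrichment, word/object-level and frame-level attention, then per-frame relevance scores. The encoder choices, the stacking of the two attention layers, and all dimensions are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class IPerceiveVideoQASketch(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.conv_enc = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)  # convolutional encoder
        self.commonsense = nn.Linear(d_model, d_model)        # stand-in common-sense generator
        # dual-layer attention: word/object level, then frame level
        self.word_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.frame_score = nn.Linear(d_model, 1)              # frame-relevance score per frame

    def forward(self, video, captions, qa, subtitles):
        # 1) Encode per-frame video features and add common-sense vectors.
        v = self.conv_enc(video.transpose(1, 2)).transpose(1, 2)
        v = v + self.commonsense(v)
        # 2) Word/object-level attention: frames attend over the text tokens
        #    from dense captions, QA, and subtitles.
        text = torch.cat([captions, qa, subtitles], dim=1)
        w, _ = self.word_attn(v, text, text)
        # 3) Frame-level attention over the text-aware frame features.
        f, _ = self.frame_attn(w, w, w)
        # 4) Fuse the attended features and score each frame's relevance.
        fused = self.fuse(torch.cat([w, f], dim=-1))
        return self.frame_score(fused).squeeze(-1)            # (B, num_frames)

# toy shapes: 40 video frames, plus dense-caption / QA / subtitle token embeddings
model = IPerceiveVideoQASketch()
scores = model(torch.randn(2, 40, 256), torch.randn(2, 12, 256),
               torch.randn(2, 20, 256), torch.randn(2, 30, 256))
print(scores.shape)                                           # torch.Size([2, 40])

Presumably the relevance scores then gate which frames feed the answer-prediction head; that head is left out here to keep the sketch short.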

 

5. results:

5.1 comparison with the SoTA on the TVQA validation set:

5.2 iPerceive VideoQA on TVQA:

We can see that iPerceive VideoQA furthers the SoTA across all TV shows in the TVQA dataset.

5.3 ablation analysis:

Using ablation studies, we showed that these common-sense features help the model better perceive relationships between events in videos, leading to improved performance on challenging video tasks that need cognition.

 

 

 
