VQA Paper Reading Notes
Notes on the papers I have read
cheetah023
Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA
AAAI 2021. Abstract: In this paper, we propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage and supervised contrastive learning in the main stage as auxiliary learning. … (posted 2020-12-21)
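To make the auxiliary-loss idea concrete, here is a minimal supervised contrastive loss in NumPy. This is a generic sketch of the technique, not the paper's exact objective; the two-cluster demo at the end is purely illustrative:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Toy supervised contrastive loss: pull same-label embeddings
    together and push differently-labeled ones apart."""
    # Cosine similarities via L2-normalized embeddings
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n, loss = len(labels), 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        # Average negative log-likelihood of anchor i's positives
        loss += -sum(sim[i, j] - log_denom for j in positives) / len(positives)
    return loss / n

# Tight clusters with matching labels should score lower than mismatched labels
emb = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
tight = supervised_contrastive_loss(emb, [0, 0, 1, 1])
mixed = supervised_contrastive_loss(emb, [0, 1, 0, 1])
```

With aligned labels the loss is near zero; shuffling the labels so that each anchor's "positive" points in an orthogonal direction makes it large.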
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question
Problems: Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and the "where" (e.g., event localization), which in some cases fails to describe correct contextual relationships between events or leads… (posted 2020-11-23)
Paper reading: MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Abstract: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring individual and combined processing of multiple input modalities. Our approach benefits from processing multimodal data… (posted 2020-11-09)
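A dimensional sketch of the "individual and combined processing" idea: answers are scored from each modality alone and from a fused representation. The tanh projections stand in for the BERT encoders the paper actually uses, and all shapes and names are hypothetical:

```python
import numpy as np

def mmft_style_scores(text, vision, Wt, Wv, Wf, answer_vecs):
    """Score answer candidates from each modality alone and from a
    fused vector (toy stand-in for per-modality BERT encoders)."""
    ht = np.tanh(text @ Wt)                      # text-only representation
    hv = np.tanh(vision @ Wv)                    # vision-only representation
    hf = np.tanh(np.concatenate([ht, hv]) @ Wf)  # combined (fused) representation
    return answer_vecs @ ht, answer_vecs @ hv, answer_vecs @ hf

rng = np.random.default_rng(0)
text, vision = rng.normal(size=5), rng.normal(size=7)
Wt = rng.normal(size=(5, 4))
Wv = rng.normal(size=(7, 4))
Wf = rng.normal(size=(8, 4))
answer_vecs = rng.normal(size=(3, 4))            # 3 candidate answers
s_text, s_vision, s_fused = mmft_style_scores(text, vision, Wt, Wv, Wf, answer_vecs)
```

Each of the three score vectors can carry its own loss, which is one way to enforce both individual and joint processing of the modalities.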
2020 CVPR: Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
Abstract: To understand a movie, people often reason over the dialogue and actions of a specific scene and connect them to the overall storyline they have already seen. Inspired by this behavior, we design the ROLL (Read, Observe, and Recall) model, which exploits three key aspects of movie understanding: (1) dialogue comprehension, (2) scene reasoning, and (3) storyline recall. In ROLL, each task extracts rich and diverse information by: (1) processing scene dialogue, (2) generating unsupervised video scene descriptions, and (3) obtaining external knowledge in a weakly supervised way. The information produced by each task is encoded with Transformers and finally combined by modality weighting… (posted 2020-10-31)
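The final modality-weighting step can be illustrated with a tiny sketch: softmax-normalized weights blend the answer scores of the three branches. The weights here are supplied directly, whereas the paper learns them; branch names and values are made up:

```python
import numpy as np

def modality_weighting(branch_scores, w):
    """Blend per-branch answer scores with softmax-normalized weights."""
    alpha = np.exp(w - w.max())
    alpha /= alpha.sum()
    return sum(a * s for a, s in zip(alpha, branch_scores))

# Three branches (dialogue, scene description, external knowledge), 4 answers
s_dial = np.array([2.0, 0.1, 0.0, 0.0])
s_desc = np.array([0.5, 1.5, 0.0, 0.0])
s_know = np.array([0.2, 0.2, 0.9, 0.0])
# Equal weights reduce to a plain average of the branch scores
fused = modality_weighting([s_dial, s_desc, s_know], np.array([0.0, 0.0, 0.0]))
```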
2020 CVPR: Modality Shifting Attention Network for Multi-modal Video Question Answering
Abstract: For the multi-modal video question answering task, this paper proposes a network called the Modality Shifting Attention Network (MSAN). MSAN decomposes the task into two subtasks: (1) localization of the temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The model requires… (posted 2020-10-29)
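The two-subtask decomposition can be sketched as question-guided temporal attention followed by answer scoring. This is a toy dot-product version, not the paper's actual attention mechanism; the feature vectors below are constructed by hand:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def localize_and_answer(video_feats, q_vec, answer_vecs):
    """(1) Attend over frames using the question, then
    (2) score answer candidates against the attended moment."""
    att = softmax(video_feats @ q_vec)   # (T,) attention over time steps
    moment = att @ video_feats           # (D,) localized moment feature
    scores = answer_vecs @ moment        # (num_answers,) candidate scores
    return att, scores

# Frame 0 matches the question; answer 0 matches the localized moment
video_feats = np.array([[1., 0., 0., 0.],
                        [0., 1., 0., 0.],
                        [0., 0., 1., 0.]])
q_vec = np.array([10., 0., 0., 0.])
answer_vecs = np.array([[1., 0., 0., 0.],
                        [0., 0., 1., 0.]])
att, scores = localize_and_answer(video_feats, q_vec, answer_vecs)
```

The attention concentrates on the question-relevant frame, and the answer aligned with that frame wins the scoring step.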
2020 CVPR: Hierarchical Conditional Relation Networks for Video Question Answering
Abstract: Problems: Video question answering (VideoQA) is challenging, as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts… (posted 2020-10-21)
2020 CVPR paper reading: On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering
Abstract: Current VQA methods generalize poorly; they tend to learn coincidental correlations in the data rather than the deeper relationships between image and question. The authors therefore propose a dataset that addresses this problem. Its questions come in two languages, Chinese and English, and it provides an image-grounded, comprehension-oriented evaluation metric that reflects a method's reasoning ability. Measuring reasoning by penalizing answers that are correct merely by chance can improve a model's reasoning ability. Introduction: Focusing on coincidental correlations in the data hurts generalization. These correlations are not stable across datasets: once the test distribution differs from the training distribution, methods that exploit them stop working. In contrast, the underlying reasoning is stable across datasets and… (posted 2020-10-20)
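One way such an evidence-based metric could be operationalized (a hypothetical simplification, not the paper's exact definition): an answer counts as correct only if the model's cited evidence box sufficiently overlaps the ground-truth evidence, so lucky but ungrounded answers earn nothing:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def evidence_gated_accuracy(preds, gts, iou_thresh=0.5):
    """Count a prediction only if the answer matches AND its evidence
    box overlaps the ground-truth evidence (penalizes lucky guesses)."""
    hits = sum(1 for (ans, box), (gt_ans, gt_box) in zip(preds, gts)
               if ans == gt_ans and iou(box, gt_box) >= iou_thresh)
    return hits / len(gts)

preds = [("cat", (0, 0, 10, 10)), ("dog", (0, 0, 1, 1))]
gts = [("cat", (0, 0, 10, 10)), ("dog", (50, 50, 60, 60))]
acc = evidence_gated_accuracy(preds, gts)   # second answer is right but ungrounded
```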
Counterfactual Samples Synthesizing for Robust Visual Question Answering (2020 CVPR paper reading)
Abstract: Current methods cannot make globally-grounded models effectively exploit two indispensable properties at the same time: (1) visual explainability — the model should rely on the correct image regions when producing an answer; and (2) question sensitivity — the model should be sensitive to linguistic variations of the question. The authors therefore propose a model-agnostic training strategy called Counterfactual Samples Synthesizing… (posted 2020-10-10)
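The counterfactual idea can be illustrated by masking the critical evidence, so the original answer should no longer be predictable from the altered sample. In the real method the critical words/regions come from attribution scores; in this sketch they are supplied directly:

```python
def synthesize_counterfactuals(question_tokens, image_regions,
                               critical_words, critical_regions):
    """Build a counterfactual sample by masking critical question words
    and removing critical image regions (critical sets given directly;
    the real method derives them from attribution scores)."""
    cf_question = ["[MASK]" if t in critical_words else t
                   for t in question_tokens]
    cf_image = [r for i, r in enumerate(image_regions)
                if i not in critical_regions]
    return cf_question, cf_image

q = ["what", "color", "is", "the", "dog"]
regions = ["region0", "region1", "region2"]
cf_q, cf_img = synthesize_counterfactuals(q, regions, {"dog"}, {1})
```

Training on such pairs pushes the model to change its answer when the true evidence is removed, which is exactly the sensitivity the abstract asks for.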