Abstract
1. Prior approaches:
(1)retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question;
(2)use an implicit knowledge engine to acquire the necessary knowledge for answering.
2. This paper's framework:
Prophet, a conceptually simple framework.
3. Approach:
(1)first train a vanilla VQA model on a specific knowledge-based VQA dataset, without external knowledge;
(2)extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples;
(3)encode the two types of answer heuristics into prompts so that GPT-3 can better comprehend the task.
1. Introduction
1. Prior work:
(1)Early knowledge-based VQA benchmarks additionally provide structured knowledge bases (KBs) and annotate the required knowledge facts for all questions;
(2)retrieve knowledge entries from explicit KBs (e.g., Wikipedia and ConceptNet).
Limitations:
① the required knowledge may not be successfully retrieved from the KBs;
② plenty of irrelevant knowledge is inevitably introduced.
(3)use pretrained large language models, e.g., GPT-3 [3], as implicit knowledge engines for knowledge acquisition.
Limitations:
① the generated captions cannot cover all the necessary information in the image;
② GPT-3 employs a few-shot learning paradigm that requires a few in-context examples to adapt to new tasks.
2. This paper's method:
(1)introduces two types of answer heuristics: answer candidates (a list of promising answers to the testing input, where each answer is associated with a confidence score) and answer-aware examples (a list of in-context examples, where each example has an answer similar to that of the testing input).
2. Related Work
Visual Question Answering (VQA)
Knowledge-based VQA
In-context learning
3. The Prophet Framework
3.1 Preliminaries
GPT-3: an autoregressive language model pretrained on a massive text corpus.
In-context learning formulates a new downstream task as a text sequence generation task on the frozen GPT-3 model.
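As a minimal illustration of this formulation (the function name and template are ours, not the paper's), a few-shot prompt simply concatenates in-context examples before the test query and lets the frozen model continue the text:

```python
# Illustrative sketch: cast a downstream task as text generation by
# concatenating in-context examples ahead of the test input.
def build_prompt(instruction: str, examples: list, test_input: str) -> str:
    """Format a few-shot prompt: instruction, example Q/A pairs, then the query."""
    parts = [instruction]
    for ex in examples:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    parts.append(f"Q: {test_input}\nA:")  # the model completes the final "A:"
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer each question.",
    [{"question": "2+2?", "answer": "4"}],
    "3+3?",
)
```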
3.2 Stage-1. Answer Heuristics Generation
answer candidates: answers predicted by the VQA model from the image and question, each with a prediction score.
answer-aware examples: training examples whose answers are similar to that of the testing input; intuitively, this amounts to finding questions with related answers.
Answer candidates
Select the top-K answers with the highest prediction scores.
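A minimal sketch of this step (our reconstruction, not the authors' code): softmax-normalize the VQA model's answer-vocabulary logits and keep the K answers with the highest confidence:

```python
import math

def top_k_candidates(logits, vocab, k=3):
    """Return the k (answer, confidence) pairs with highest softmax probability."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# e.g. logits over a 4-answer vocabulary
cands = top_k_candidates([2.0, 0.5, 1.0, -1.0], ["dog", "cat", "bird", "car"], k=2)
```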
Answer-aware examples
Here the authors assume that the fused feature of question q and image i lies in a latent answer space: if the fused feature of a testing question-image pair is very similar to the fused feature of a question-image pair in the training set, the two pairs are likely to have closely related answers.
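This retrieval step can be sketched as a nearest-neighbor search in the fused-feature space (a simplified reading, assuming cosine similarity; function names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_aware_examples(test_feat, train_feats, train_ids, n=4):
    """Return ids of the n training examples closest in fused-feature space."""
    sims = [(cosine(test_feat, f), i) for f, i in zip(train_feats, train_ids)]
    sims.sort(reverse=True)                       # most similar first
    return [i for _, i in sims[:n]]
```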
3.3 Stage-2. Heuristics-enhanced Prompting
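A hedged sketch of how both heuristics might be encoded into the prompt: each in-context example carries a caption, question, candidate list with scores, and its answer, while the test entry ends with an open "Answer:" for GPT-3 to complete. The field names and layout are illustrative, not the paper's exact template:

```python
def format_entry(caption, question, candidates, answer=None):
    """Format one prompt entry; omit the answer for the testing input."""
    cand_str = ", ".join(f"{a} ({s:.2f})" for a, s in candidates)
    lines = [
        f"Context: {caption}",
        f"Question: {question}",
        f"Candidates: {cand_str}",
        f"Answer: {answer if answer is not None else ''}".rstrip(),
    ]
    return "\n".join(lines)

def build_prophet_prompt(examples, test):
    """Concatenate answer-aware examples, then the test entry, into one prompt."""
    blocks = [format_entry(**ex) for ex in examples]
    blocks.append(format_entry(**test))
    return "\n\n".join(blocks)
```

GPT-3 can then either pick one of the listed candidates or generate a different answer, using the candidates and their confidences as soft guidance.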
4. Experiments
4.1 Datasets
4.2 Implementation Details
Image model: grid-based features extracted from CLIP's visual encoder with a RN50×64 backbone
Language model: BERT-large
Base VQA model: MCAN-large