Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering

Abstract

1. Prior work:
retrieve required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question;
use an implicit knowledge engine to acquire the necessary knowledge for answering.
2. This paper's framework:
Prophet, a conceptually simple framework.
3. Approach:
(1) first train a vanilla VQA model on a specific knowledge-based VQA dataset, without external knowledge;
(2) extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples;
(3) encode the two types of answer heuristics into the prompts to enable GPT-3 to better comprehend the task.

1. Introduction

1. Early work:
(1) Early knowledge-based VQA benchmarks additionally provide structured knowledge bases (KBs) and annotate the required knowledge facts for all questions;
(2) retrieve knowledge entries from explicit KBs (e.g., Wikipedia and ConceptNet):

Limitations:

① the required knowledge may not be successfully retrieved from the KBs;

② plenty of irrelevant knowledge is inevitably introduced.
(3) use pretrained large language models, e.g., GPT-3 [3], as implicit knowledge engines for knowledge acquisition:
Limitations:
① the generated captions cannot cover all the necessary information in the image;
② GPT-3 employs a few-shot learning paradigm that requires a few in-context examples to adapt to new tasks.
2. This paper's method:
(1) introduces two types of answer heuristics: answer candidates (a list of promising answers to the testing input, where each answer is associated with a confidence score) and answer-aware examples (a list of in-context examples, where each example has an answer similar to that of the testing input).

2. Related Work

Visual Question Answering (VQA)

Knowledge-based VQA

In-context learning

3. The Prophet Framework

3.1 Preliminaries

GPT-3: an autoregressive language model trained on a massive text corpus.
A new downstream task is formulated as a text sequence generation task on the frozen GPT-3 model.

3.2 Stage-1. Answer Heuristics Generation

answer candidates (answers predicted from the image and question, each associated with a confidence score)
answer-aware examples (examples whose answers are similar to the testing input's true answer; in effect, this retrieves questions related by their answers)

Answer candidates

Select the top-K answers with the highest confidence scores.
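The top-K selection above can be sketched as follows. This is a minimal illustration, not the paper's code; the variable names and the toy 4-answer vocabulary are my own assumptions.

```python
import numpy as np

def topk_answer_candidates(probs, answer_vocab, k=10):
    """Select the top-K answers with the highest confidence scores.

    probs:        1-D array of softmax probabilities over the answer vocabulary.
    answer_vocab: list mapping each index to an answer string.
    Returns a list of (answer, confidence) pairs, highest confidence first.
    """
    top_idx = np.argsort(probs)[::-1][:k]  # indices sorted by descending score
    return [(answer_vocab[i], float(probs[i])) for i in top_idx]

# Toy example with a 4-answer vocabulary.
probs = np.array([0.05, 0.60, 0.25, 0.10])
vocab = ["red", "blue", "green", "yellow"]
print(topk_answer_candidates(probs, vocab, k=2))
```

Each returned pair keeps the confidence score, since Stage-2 encodes the scores into the prompt alongside the answers.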

Answer-aware examples

The authors' assumption here: the fused feature of question q and image i lies in a latent answer space. Therefore, if the fused feature of a testing question-image pair is very similar to that of a question-image pair in the training set, the two pairs' answers are likely to be closely related.
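Under this assumption, answer-aware examples can be retrieved by nearest-neighbor search in the fused feature space. A minimal sketch (cosine similarity is my assumption for the similarity measure; the synthetic features are for illustration only):

```python
import numpy as np

def retrieve_answer_aware_examples(test_feat, train_feats, n=16):
    """Retrieve the N training examples whose fused question-image features
    are most similar to the testing input's fused feature.

    test_feat:   (d,) fused feature of the testing question-image pair.
    train_feats: (M, d) fused features of all training pairs.
    Returns indices of the N nearest training examples, most similar first.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = test_feat / np.linalg.norm(test_feat)
    T = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = T @ t
    return np.argsort(sims)[::-1][:n]

# Synthetic demo: the test feature is a slightly perturbed copy of
# training example 42, so index 42 should be retrieved first.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))
test = train[42] + 0.01 * rng.normal(size=8)
print(retrieve_answer_aware_examples(test, train, n=3))
```

The retrieved examples then serve as the in-context examples in the Stage-2 prompt.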

3.3 Stage-2. Heuristics-enhanced Prompting
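In Stage-2, both heuristics are encoded into the prompt fed to GPT-3. A minimal sketch of such a prompt builder; the template wording, field names, and example content below are my assumptions, not the paper's verbatim format:

```python
def build_prompt(instruction, examples, test_entry):
    """Assemble a heuristics-enhanced few-shot prompt for GPT-3.

    examples:   answer-aware in-context examples, each a dict with keys
                'context' (image caption), 'question', 'candidates'
                (list of (answer, confidence) pairs), and 'answer'.
    test_entry: same fields, but the answer is left for GPT-3 to generate.
    """
    def fmt(entry, with_answer):
        cands = ", ".join(f"{a} ({c:.2f})" for a, c in entry["candidates"])
        block = (f"Context: {entry['context']}\n"
                 f"Question: {entry['question']}\n"
                 f"Candidates: {cands}\n"
                 f"Answer:")
        return block + (f" {entry['answer']}" if with_answer else "")

    parts = [instruction]
    parts += [fmt(e, True) for e in examples]
    parts.append(fmt(test_entry, False))  # ends at "Answer:" for generation
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer the question according to the context and the answer candidates. "
    "Each candidate is associated with a confidence score.",
    examples=[{"context": "A man riding a wave on a surfboard.",
               "question": "What sport is this?",
               "candidates": [("surfing", 0.97), ("skateboarding", 0.02)],
               "answer": "surfing"}],
    test_entry={"context": "A plate of pasta with a fork.",
                "question": "What utensil is shown?",
                "candidates": [("fork", 0.91), ("spoon", 0.05)],
                "answer": None})
print(prompt)
```

The prompt deliberately ends with a bare "Answer:" so the frozen language model completes it with the predicted answer.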

4. Experiments

4.1 Datasets

4.2 Implementation Details

Image features: grid-based features extracted from CLIP's visual encoder with a RN50×64 backbone
Language model: BERT-large
Base VQA model: MCAN-large
