Abstract
1. Motivation:
(1) pre-trained language models and/or unstructured text are incomplete and noisy;
(2) knowledge graphs (KGs) contain dense, structured knowledge.
2. Proposed method:
LaKo, a knowledge-driven VQA method via Late Knowledge-to-text Injection;
transfers triples into text and proposes a late injection mechanism (encoder-decoder architecture);
treats the VQA task as a text generation task.
1 Introduction
According to how knowledge is incorporated, current works can be divided into two categories:
Current state of research
1. Directly exploiting the knowledge in the language model's parameters to answer questions
Approach: common-sense knowledge becomes part of the model's parameters
Limitation: the knowledge in the language model is sometimes insufficient for the VQA scenario, and such models are likely to fail when the question requires new knowledge outside the original training corpus.
2. The knowledge retrieval strategy
Limitations:
(1) network delay might become a bottleneck;
(2) retrieving relevant corpus from encyclopedia articles introduces lots of irrelevant information and interferes with the model's judgment.
Proposed method
A retriever-reader VQA architecture, with a knowledge retriever and a late injection mechanism between the knowledge and the input corpus in the reader.
Specifically:
① Construct a common-sense knowledge graph (KG) for VQA: the retriever queries this KG according to the vision-language input to recall the target triples, i.e., the knowledge relevant to answering the question;
② Modality unification: transform images into captions, and triples into sentences with a Knowledge-to-Text strategy;
③ Task conversion: convert VQA into a text generation task via an encoder-decoder paradigm (a bird's-eye sketch follows this list);
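A bird's-eye sketch of how these three steps compose; every function below is a hypothetical toy stand-in, not the paper's implementation (the real retriever and reader are sketched under Methodology):

```python
# Toy end-to-end pipeline for steps ①-③; all components are
# hypothetical stand-ins for illustration only.
def caption_image(image):
    """② Vision-to-text: stubbed caption for illustration."""
    return "a cat sitting on a mat"

def retrieve_facts(kg, query):
    """① KG retriever: toy substring match, then Knowledge-to-Text."""
    return [f"{h} {r} {t}" for h, r, t in kg if h in query]

def generate_answer(question, caption, facts):
    """③ Generative reader: stubbed as 'read off the fact tail'."""
    return facts[0].split()[-1] if facts else "unknown"

kg = [("cat", "IsA", "felid")]
caption = caption_image(None)
print(generate_answer("what genus are cats", caption,
                      retrieve_facts(kg, caption)))  # -> "felid"
```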
Main contributions
① a new KG retrieval paradigm;
② a large-scale common-sense KG built specifically for knowledge-based VQA;
③ evidence that using a high-quality KG as the external knowledge is better than using unstructured text or pure language model parameters.
2 Related Work
2.2 Knowledge-based VQA
Exploit Knowledge in Language Model
1. Pre-trained language models (PLMs) serve as implicit knowledge, with no extra knowledge given as input;
2. Letting the common-sense knowledge become part of the reader's parameters:
(1) encourage the entity representation to align with the corresponding graph embedding in a KG;
(2) learn a joint Concept-Vision-Language embedding;
(3) combine the implicit reasoning of transformer models with symbolic representations from a knowledge graph (KG) through a relational graph convolutional network (RGCN).
Knowledge Retrieval Strategy
Approach:
add a separate retrieval module (a.k.a. retriever) to recall the required explicit knowledge as external input for the downstream reader
Limitations:
network delay might become a bottleneck for all these policies when a search engine is taken as the retriever, and unstructured knowledge probably decreases the knowledge density
Proposed method:
a vision-language KG retriever
3 Methodology
3.1 Vision-Language KG Retriever
① Image representation 1: transform the visual content into textual format. Prior works that retrieve knowledge based on visually detected objects bring in a lot of irrelevant noise and make the model lose focus; image captioning, which carries human-like attention, is used instead.
② Image representation 2: optical character recognition (OCR), which extracts text appearing in the image to improve information integrity.
③ Stem corpus: build a stem corpus to narrow down the scope of knowledge.
④ Knowledge-to-Text Transformation: verbalize each retrieved triple into a natural-language sentence (see the retriever sketch after this list).
⑤ Stem-based BM25: defines a word stem as the smallest semantic unit rather than an entire word; first obtain the stem-based sentences, then calculate the BM25 score for each factual triple sentence.
IDF means inverse document frequency, a measure of a word's general importance: divide the total number of documents by the number of documents containing the word, then take the logarithm of the quotient, i.e., IDF(w) = log(N / n_w). It is commonly used to extract keywords from articles.
The Top-K fact sentences s_f are concatenated to get s_fact as the external knowledge for each (v, q) pair, which contributes to the late knowledge injection within the reader (the K most relevant fact sentences are fed to the model together with the question and image).
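A minimal sketch of steps ④-⑤ and the Top-K concatenation, assuming NLTK's Porter stemmer and the rank_bm25 package as stand-ins; the verbalization template and the exact BM25 variant used by LaKo are assumptions here:

```python
# Sketch of the vision-language KG retriever (section 3.1).
from nltk.stem import PorterStemmer
from rank_bm25 import BM25Okapi

stemmer = PorterStemmer()

def triple_to_sentence(head, relation, tail):
    """Knowledge-to-Text: verbalize one triple as a plain sentence
    (hypothetical template; the paper may use relation-specific ones)."""
    return f"{head} {relation.replace('_', ' ')} {tail}"

def stems(text):
    """Stem-based tokenization: a word stem is the smallest semantic unit."""
    return [stemmer.stem(tok) for tok in text.lower().split()]

# Toy KG; the query is built from caption + OCR text + question.
triples = [("dog", "has_part", "whiskers"), ("paper", "made_of", "wood")]
fact_sentences = [triple_to_sentence(*t) for t in triples]

bm25 = BM25Okapi([stems(s) for s in fact_sentences])
query = stems("a sheet of paper on a desk" + " " + "what are paper made of")

# Concatenate the Top-K fact sentences s_f into s_fact.
top_k = 1
ranked = sorted(zip(bm25.get_scores(query), fact_sentences), reverse=True)
s_fact = " ".join(s for _, s in ranked[:top_k])
print(s_fact)  # -> "paper made of wood"
```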
3.2 Late Knowledge Injection
1. Input modality representation:
unify all data into text to fully exploit the semantic understanding capability of a text-only PLM; special prefixes are added first: question:, context:, and fact:;
2. Network structure:
an encoder-decoder transformer architecture as the reader;
answer tokens are decoded one by one (autoregressively);
3. Task:
converts the answer classification task into an answer generation task;
4. Objective function: the standard teacher-forced sequence-to-sequence cross-entropy (negative log-likelihood) over the answer tokens, as in the sketch below.
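A minimal sketch of the prefixed input format and generative decoding, assuming a Hugging Face T5 checkpoint as the reader; LaKo's actual late-injection fusion inside the reader is not reproduced here:

```python
# Sketch of the reader's input format and generative answering (section 3.2).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Unify all modalities into text with the special prefixes.
source = ("question: what are paper made of "
          "context: a sheet of paper on a desk "
          "fact: paper made of wood")
inputs = tokenizer(source, return_tensors="pt")

# Answer tokens are decoded one by one (autoregressive generation).
output_ids = model.generate(**inputs, max_length=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Training objective: teacher-forced cross-entropy over the gold
# answer tokens, i.e. the standard seq2seq negative log-likelihood.
labels = tokenizer("wood", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
```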
4 Experiments
4.1 Dataset
VQA 2.0: about 1.1 million questions over 204,721 images; each question has ten answers from different annotators
OK-VQA: all images come from the COCO 2014 validation set
4.2 Knowledge Graph Construction
Covers both scientific knowledge (e.g., "what genus are cats") and common-sense knowledge (e.g., "what are paper made of")
Knowledge sources:
ConceptNet: contains human common-sense knowledge about the world
WebChild: contains triples that connect nouns with adjectives via more fine-grained relations (e.g., "hasShape", "Faster")
DBpedia: includes knowledge extracted from Wikipedia
hasPart KB: collects "has_part" relationships between common objects such as <dog, has_part, whiskers> or scientific ones like <molecules, has_part, atoms>
In total: 300,559 triples, 96,191 entities, and 2,198 relations
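A minimal sketch of unifying the four sources under one (head, relation, tail) schema; the source lists and normalization rules here are hypothetical:

```python
# Sketch of merging the knowledge sources into one KG (section 4.2).
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def merge_sources(*sources: Iterable[Triple]) -> Set[Triple]:
    """Normalize surface forms and deduplicate triples across sources."""
    kg: Set[Triple] = set()
    for source in sources:
        for h, r, t in source:
            kg.add((h.strip().lower(), r.strip(), t.strip().lower()))
    return kg

conceptnet = [("cat", "IsA", "felid")]
haspart_kb = [("dog", "has_part", "whiskers"),
              ("molecules", "has_part", "atoms")]
kg = merge_sources(conceptnet, haspart_kb)
print(len(kg), "triples")  # the full LaKo KG has 300,559 triples
```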
4.3 Metrics
1. Acc: answer prediction accuracy
2. EM: exact match between the predicted and ground-truth answers
3. Inc: Inclusion-based Acc metric, which counts a prediction as correct when it includes the ground-truth answer (or vice versa)
4. Stem: Stem-based Acc metric, which checks whether the stems of the predicted and ground-truth answers overlap; e.g., the stems of "happy" and "happiness" are both "happi"
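A minimal sketch of the EM, Inclusion, and Stem metrics, assuming simple lower-cased matching and NLTK's Porter stemmer (which reproduces the happy/happiness -> "happi" example); the papers' exact normalization rules may differ:

```python
# Sketch of the answer metrics (section 4.3).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def exact_match(pred: str, gold: str) -> bool:
    """EM: strings identical after trivial normalization."""
    return pred.strip().lower() == gold.strip().lower()

def inclusion(pred: str, gold: str) -> bool:
    """Inc: one answer string contains the other."""
    p, g = pred.strip().lower(), gold.strip().lower()
    return g in p or p in g

def stem_match(pred: str, gold: str) -> bool:
    """Stem: any overlap between the stem sets of the two answers."""
    p = {stemmer.stem(w) for w in pred.lower().split()}
    g = {stemmer.stem(w) for w in gold.lower().split()}
    return bool(p & g)

assert stemmer.stem("happy") == stemmer.stem("happiness") == "happi"
assert stem_match("happiness", "happy")
```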