Paper notes: Learning Visual Knowledge Memory Networks for Visual Question Answering (CVPR 2018)

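This paper also starts from the goal of linking the visual content in VQA with structured human knowledge. It proposes a visual knowledge memory network (VKMN) that fuses structured knowledge with visual features in an end-to-end learning framework. On the classic VQA v1.0 and v2.0 datasets, it achieves good results on questions that involve knowledge reasoning.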

[Figure: VQA example]

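For a VQA example like the one in the figure above, the image contains no visual object such as a monkey, so answering the question requires external knowledge for deduction or reasoning.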

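Method:

(1) Encoding of Image/Question Inputs

What is worth noting here is that the paper uses the low-rank bilinear pooling from MLB to fuse the visual and textual features. The same fusion approach generalizes to other cross-media problems.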
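To make the fusion step concrete, here is a minimal sketch of low-rank bilinear pooling in the MLB form `P^T (tanh(U^T q) * tanh(V^T v))`, written in plain Python. The matrices `U`, `V`, `P`, the toy dimensions, and the helper names are assumptions for illustration only; the bias term is omitted, and in a real model these weights would be learned.

```python
import math

def matvec_t(mat, vec):
    """Compute mat^T @ vec for a matrix given as a list of rows."""
    cols = len(mat[0])
    return [sum(mat[i][j] * vec[i] for i in range(len(vec))) for j in range(cols)]

def mlb_fuse(q, v, U, V, P):
    """Low-rank bilinear pooling: P^T (tanh(U^T q) * tanh(V^T v))."""
    q_proj = [math.tanh(x) for x in matvec_t(U, q)]  # question -> joint space
    v_proj = [math.tanh(x) for x in matvec_t(V, v)]  # image -> joint space
    joint = [a * b for a, b in zip(q_proj, v_proj)]  # Hadamard product
    return matvec_t(P, joint)                        # joint -> output space

# Toy sizes (hypothetical): question dim 3, image dim 2, joint dim 2, output dim 1
q = [1.0, 0.0, -1.0]
v = [0.5, 0.5]
U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 x 2
V = [[1.0, 0.0], [0.0, 1.0]]              # 2 x 2
P = [[1.0], [1.0]]                        # 2 x 1
fused = mlb_fuse(q, v, U, V, P)
```

The element-wise product in the shared low-dimensional space is what keeps the parameter count far below that of a full bilinear form over both inputs.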

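(2) Knowledge Spotting and Sub-graph Hashing

[Figure: example triplet relations]

Given a question, the method extracts the relevant entities and attributes, then expands them against the constructed knowledge base to form a group of triplet relations like those shown above. These triplets serve as the knowledge facts.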
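The spotting step can be sketched roughly as follows. This is not the paper's sub-graph hashing; it is a simplified stand-in that matches question words against the subject and object of each triplet. The function name `spot_knowledge`, the whitespace tokenization, and the toy knowledge base are all assumptions for illustration.

```python
def spot_knowledge(question, knowledge_base):
    """Return the triplets whose subject or object appears in the question."""
    # Naive tokenization: lowercase, drop the question mark, split on spaces.
    words = set(question.lower().replace("?", "").split())
    return [t for t in knowledge_base if t[0] in words or t[2] in words]

# Hypothetical toy knowledge base of (subject, relation, object) triplets
kb = [
    ("monkey", "eat", "banana"),
    ("banana", "is", "yellow"),
    ("dog", "chase", "cat"),
]
facts = spot_knowledge("What animal likes to eat the banana?", kb)
```

Here `facts` keeps the two triplets that mention "banana" and drops the unrelated one, which is the kind of question-driven filtering the section describes.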

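(3) Visual Knowledge Memory Network

For each triplet $\langle s, r, t \rangle$ (that is, <subject, relation, target>), the method builds key-value pairs. Because it is not known in advance which of $s$, $r$, $t$ a VQA question asks about, each of $(s, r)$, $(s, t)$, and $(r, t)$ can serve as the key, with the remaining element as the value. The keys and values are then encoded separately.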
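The key-value construction described above can be sketched as a small function. The name `triplet_to_key_values` and the example triplet are hypothetical; the sketch only shows how one triplet yields three pairs and leaves out the encoding of keys and values.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]
KeyValue = Tuple[Tuple[str, str], str]

def triplet_to_key_values(triplet: Triplet) -> List[KeyValue]:
    """Expand <s, r, t> into three key-value pairs, one per possible unknown."""
    s, r, t = triplet
    return [
        ((s, r), t),  # question asks about the target
        ((s, t), r),  # question asks about the relation
        ((r, t), s),  # question asks about the subject
    ]

# Illustrative triplet, not taken from the paper's knowledge base
pairs = triplet_to_key_values(("monkey", "eat", "banana"))
```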

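The memory mechanism in this paper is fairly intuitive. In brief:

A query is compared for similarity against the keys in memory, and the corresponding values are then read out to answer the question. Defining the query is a crucial point in any memory mechanism: many non-QA problems could also adopt one if a good query can be defined, for example cross-media retrieval.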
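The read step can be sketched in the style of a generic key-value memory network: score each key against the query, normalize the scores with a softmax, and return the weighted sum of the values. The dot-product similarity, the function name `read_memory`, and the toy vectors are assumptions here; the paper's exact addressing and embedding details may differ.

```python
import math

def read_memory(query, keys, values):
    """Soft key addressing followed by a weighted value read."""
    # Similarity between the query and each key (dot product, assumed).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim = len(values[0])
    out = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
    return out, weights

# Toy memory with two slots (hypothetical embeddings)
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out, weights = read_memory([1.0, 0.0], keys, values)
```

With this query, almost all of the attention weight falls on the first slot, so the readout is dominated by the first value vector.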

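Knowledge base:

The paper builds its own visual knowledge base. The knowledge triplets come from two main sources:

(1) Triplets extracted from the question-answer pairs in VQA v1.0.

(2) Knowledge triplets taken directly from the existing Visual Genome Relationship (VGR) data.

Together, these two sources make up the paper's visual knowledge base.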

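Some shortcomings of this paper:

(1) Although the method reads from memory, I found no write mechanism.

(2) Triplet expansion. The paper starts from the entities and attributes in the question and then expands them through associations in the knowledge base. On the visual side of VQA, it only extracts global image features. One could additionally extract attributes and objects from the image and use them, together with the concepts parsed from the question, as the query, which would yield richer and more complete knowledge entries.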

Experimental results

Baseline

[Figure not preserved]

VQA 1.0

[Figure not preserved]

VQA 2.0

[Figure not preserved]

Visualization

[Figure not preserved]

Reference: Learning Visual Knowledge Memory Networks for Visual Question Answering
