A “Visual Turing Test” for Modern AI Systems

Visual Question Answering (VQA) is a fascinating research field at the intersection of computer vision and language understanding.

In this post we will elaborate on existing datasets, examine potential approaches and applications, and present a prototype in which the user can choose images the algorithm has not seen before and ask questions about them.

What is VQA?

Visual Question Answering approaches are designed to handle the following task: given an image and a natural language question about the image, the VQA model needs to provide an accurate natural language answer.

This is by nature a multi-discipline research problem. It consists of the following sub-tasks:
· Computer Vision (CV)
· Natural Language Processing (NLP)
· Knowledge Representation & Reasoning

That’s why some authors refer to Visual Question Answering as a “Visual Turing Test” for modern AI systems.

This screenshot from my prototype illustrates how a VQA system works. Note that the user has chosen an image the algorithm has not seen during training and asks questions accordingly.

Figure: Prototype screenshot

Datasets

Most of the existing datasets contain triples made of an image, a question and its correct answer. Some publicly available datasets, on the other hand, provide extra information like image captions, image regions represented as bounding boxes, or multiple-choice candidate answers.

The available VQA datasets can be categorized based on three factors:
· type of images (natural, clip-art, synthetic)
· question–answer format (open-ended, multiple-choice)
· use of external knowledge

The following table shows an overview of the available datasets:

Figure: Overview of the available datasets. Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

For our prototype we make use of the VQA dataset with natural images and open-ended questions. It is one of the most popular ones and is also used for the annual VQA competition. The dataset we use consists of 443,757 image-question pairs for training and 214,354 pairs for validation. It can be downloaded here.

Figure: Example of an annotated image-question pair

One special characteristic of the VQA dataset is that the annotations, i.e. the answers provided for a specific image-question pair, are not unique. The answers have been collected via Amazon Mechanical Turk, and for each image-question pair ten answers are supplied, which may all be identical but may also differ. The screenshot above shows an example.
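
For illustration, a single annotation record looks roughly as follows. This is a minimal sketch: the field names follow the published VQA annotation files, but the values are invented.

    # Sketch of one VQA annotation record (invented values)
    annotation = {
        "image_id": 458752,
        "question_id": 4587520,
        "question_type": "what color is the",
        "multiple_choice_answer": "red",   # most frequent of the ten answers
        "answers": [
            {"answer_id": 1, "answer": "red",      "answer_confidence": "yes"},
            {"answer_id": 2, "answer": "red",      "answer_confidence": "yes"},
            {"answer_id": 3, "answer": "dark red", "answer_confidence": "maybe"},
            # ... seven more answers in the same form
        ],
    }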

Approaches & Architectures

The basic architecture as shown below consists of three main elements:
· Image feature extraction
· Question feature extraction
· Fusion model + classifier to merge the features

Figure: Basic VQA architecture. Source: https://arxiv.org/abs/1610.01465

Image feature extraction

Image feature extraction describes the method of transforming an image into a numerical vector to enable further computational processing.

Figure: Utilization rates of CNN architectures in VQA research papers. Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

Convolutional neural networks (CNNs) have established themselves as the state-of-the-art approach. VQA architectures generally use already pre-trained CNN models by applying transfer learning. The chart above shows an evaluation of the utilization rates of different architectures in several VQA research papers.

In the prototype we use the VGG16 architecture, which takes 224 × 224 pixel images as input and outputs a 4096-dimensional vector.
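
A minimal sketch of this feature extraction step with Keras is shown below. It assumes a TensorFlow/Keras environment; "fc2" is the name of the second fully-connected layer in Keras' pre-trained VGG16, which outputs the 4096-dimensional vector.

    # Extract 4096-d image features from a pre-trained VGG16
    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.models import Model
    from tensorflow.keras.preprocessing import image

    base = VGG16(weights="imagenet")
    extractor = Model(inputs=base.input,
                      outputs=base.get_layer("fc2").output)  # 4096-d output

    def image_features(path):
        img = image.load_img(path, target_size=(224, 224))   # VGG16 input size
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return extractor.predict(x)[0]                       # shape: (4096,)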

Question feature extraction

Figure: Utilization rates of question feature extraction approaches in VQA research. Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)

To extract question features, multiple approaches have been developed, ranging from count-based methods like one-hot encoding and bag-of-words to sequence models like long short-term memory (LSTM) networks or gated recurrent units (GRU). The diagram above illustrates the utilization rate of these approaches in the research.

For our prototype we use the most popular approach, an LSTM fed with Word2Vec representations of the individual words. The LSTM model outputs a 512-dimensional vector.
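
A minimal sketch of such a question encoder is shown below. The maximum question length and the 300-dimensional Word2Vec vectors are assumptions; the prototype's exact preprocessing may differ.

    # Question encoder: sequence of word vectors in, 512-d feature vector out
    from tensorflow.keras.layers import Input, LSTM
    from tensorflow.keras.models import Model

    MAX_LEN, EMB_DIM = 25, 300    # assumed padding length / Word2Vec size

    q_in = Input(shape=(MAX_LEN, EMB_DIM))  # pre-computed Word2Vec vectors
    q_feat = LSTM(512)(q_in)                # final hidden state: (512,)
    question_encoder = Model(q_in, q_feat)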

Fusion model + classifier

To fuse the two feature vectors, several basic approaches exist, including point-wise multiplication or addition, and concatenation. More advanced architectures use Canonical Correlation Analysis (CCA) or end-to-end models with a Multimodal Compact Bilinear Pooling (MCB) layer.
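
The basic fusion operations can be sketched as follows (illustrative shapes only; the projection matrices are assumptions needed to make the point-wise variants dimensionally compatible):

    # Basic fusion operations on the two feature vectors
    import numpy as np

    img_feat = np.random.rand(4096)    # stand-in VGG16 image features
    q_feat = np.random.rand(512)       # stand-in LSTM question features

    concat = np.concatenate([img_feat, q_feat])        # (4608,)

    # Point-wise multiplication/addition require equal dimensions, so both
    # vectors are first projected to a common size (assumed: 1024)
    W_i = np.random.rand(1024, 4096)
    W_q = np.random.rand(1024, 512)
    pointwise_mul = (W_i @ img_feat) * (W_q @ q_feat)  # (1024,)
    pointwise_add = (W_i @ img_feat) + (W_q @ q_feat)  # (1024,)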

Figure: Coverage of questions by most frequent answers

In our prototype we use simple concatenation followed by a softmax classifier over the 1,000 most common answers. This approach is suitable, as more than 95% of the questions have at least one annotation that is covered by the 1,000 most common answers (see the graph above).
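
Put together, the prototype's fusion model can be sketched as below. The 4096/512/1,000 sizes come from the text; the hidden layer is an assumption, not necessarily the prototype's exact configuration.

    # Fusion model: concatenation + softmax over the 1,000 most common answers
    from tensorflow.keras.layers import Input, Concatenate, Dense
    from tensorflow.keras.models import Model

    img_in = Input(shape=(4096,))               # VGG16 image features
    q_in = Input(shape=(512,))                  # LSTM question features

    x = Concatenate()([img_in, q_in])           # simple concatenation: (4608,)
    x = Dense(1024, activation="tanh")(x)       # assumed hidden layer
    out = Dense(1000, activation="softmax")(x)  # 1,000 most common answers

    vqa_model = Model([img_in, q_in], out)
    vqa_model.compile(optimizer="adam", loss="categorical_crossentropy")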

More advanced approaches

In the recent past, more sophisticated architectures have been developed, with attention-based approaches being the most popular. Here, the idea is to set the focus of the algorithm on the most relevant parts of the input. For example, if the question is “What is the color of the ball?”, the region of the image containing the ball is more relevant than the others. Concerning the question, “color” and “ball” are more informative than the rest of the words.

The most common choice in VQA is to use spatial attention to generate region-specific features to train the convolutional neural network.

Two common methods to obtain spatial attention are to either project a grid over the image and determine the relevance of each region with respect to the specific question, or to automatically generate bounding boxes in the image and utilize the question to determine the relevance of the features of each box.

The use of an attention-based approach goes beyond the scope of our prototype.

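For illustration only, a grid-based spatial attention step could be sketched as follows (tensor shapes are assumptions; this is not part of the prototype):

    # Grid-based spatial attention: score each image region by the question
    from tensorflow.keras.layers import (Input, Dense, Reshape, RepeatVector,
                                         Concatenate, Softmax, Dot)
    from tensorflow.keras.models import Model

    regions_in = Input(shape=(196, 512))   # e.g. a 14x14 conv grid, flattened
    q_in = Input(shape=(512,))             # question feature vector

    q_tiled = RepeatVector(196)(q_in)                  # copy question per region
    scores = Dense(1)(Concatenate()([regions_in, q_tiled]))  # relevance scores
    weights = Softmax(axis=1)(scores)                  # attention distribution
    attended = Dot(axes=1)([weights, regions_in])      # weighted sum: (1, 512)
    attended = Reshape((512,))(attended)

    attention = Model([regions_in, q_in], attended)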

Evaluation

Due to the variety of datasets, it is not surprising that multiple approaches to evaluate the performance of the algorithms exist. In a multiple-choice setting, there is just a single right answer for every question, so the assessment can be easily quantified by the mean accuracy over test questions. In an open-ended setting, though, several answers for a particular question could be correct due to synonyms and paraphrasing.

In such cases, metrics that measure how much a predicted answer differs from the ground truth based on their semantic meaning can be used. The Wu-Palmer Similarity (WUPS) is one such example.
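
As a minimal sketch, the underlying Wu-Palmer similarity between two single-word answers can be computed with NLTK's WordNet interface (taking the first synset of each word is a simplification; the full WUPS metric adds thresholding on top):

    # Wu-Palmer similarity between two single-word answers
    # (requires: import nltk; nltk.download("wordnet"))
    from nltk.corpus import wordnet as wn

    def wup(word_a, word_b):
        syn_a, syn_b = wn.synsets(word_a), wn.synsets(word_b)
        if not syn_a or not syn_b:
            return 0.0
        return syn_a[0].wup_similarity(syn_b[0]) or 0.0

    print(wup("dog", "cat"))  # high: both are close in the WordNet hierarchy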

As the VQA datasets work with very short answers, a consensus metric defined as Accuracy_VQA = min(n/3, 1) is used, where n is the number of annotated answers that match the predicted one; i.e., 100% accuracy is achieved when the predicted answer matches at least 3 out of the 10 annotated answers.
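
In code, this consensus metric is a one-liner:

    # VQA consensus accuracy: fully correct once >= 3 of 10 annotations match
    def vqa_accuracy(predicted, annotated_answers):
        n = sum(1 for a in annotated_answers if a == predicted)
        return min(n / 3.0, 1.0)

    # 2 of 10 annotations match -> accuracy 0.67
    print(vqa_accuracy("red", ["red", "red", "dark red"] + ["brown"] * 7))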

The diagram shows the accuracy as defined above for the different question types:

Figure: Evaluation results on the validation set

Potential applications of VQA

VQA systems offer a vast number of potential applications. One of the most socially relevant and direct applications is to help blind and visually impaired users to communicate with pictures. Furthermore, VQA can be integrated into image retrieval systems, which can be used commercially on e-commerce sites to attract customers by giving more exact results to their search queries. Incorporation of VQA may also increase the popularity of online educational services by allowing learners to interact with images. Another application of VQA is in the field of data analysis, where VQA can help the analyst to summarize the available visual data.

Closing thoughts

VQA is a research field that requires the understanding of both text and vision. The current performance of the systems still lags behind human performance, but since deep learning techniques are improving significantly in both Natural Language Processing and Computer Vision, we can reasonably expect VQA to achieve higher and higher accuracy. Progress will be further driven by contests like the VQA challenge hosted on visualqa.org.

If you would like to dive deeper into this topic, you can find the code of the prototype in my GitHub repo here. Any feedback on the approach or the code is highly appreciated.

Further recommended readings include:
· Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)
· VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Translated from: https://medium.com/@frank.merwerth/a-visual-turing-test-for-modern-ai-systems-de7530416e57
