A Brief Introduction to GQA



Official site: GQA: Visual Reasoning in the Real World

The questions exhibit much less language bias, since most are constructed from the images' scene graphs.

The dataset also comes with multiple evaluation metrics:

consistency, validity, plausibility, grounding and distribution scores

In addition, each question carries a step-by-step annotation that decomposes it into a path over the scene graph leading to the answer.

Full-sentence answers are provided as well.

 

Roughly speaking, language bias means that a question can be answered without looking at the image at all; the question modality carries most of the weight. In particular, if ten questions about bananas all have the answer "yellow", then given a green banana the answer will, in most cases, still come out as "yellow".
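To make the bias concrete, here is a minimal sketch (toy data, hypothetical question strings) of a language-prior baseline that answers purely from question statistics:

```python
from collections import Counter, defaultdict

def majority_answer(train_qa):
    """Answer every question from its most frequent training answer,
    ignoring the image entirely -- a pure language-prior baseline."""
    counts = defaultdict(Counter)
    for question, answer in train_qa:
        counts[question][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}

# Hypothetical toy data: ten yellow bananas dominate the prior,
# so the baseline answers "yellow" even for an image of a green banana.
prior = majority_answer([("what color is the banana?", "yellow")] * 10
                        + [("what color is the banana?", "green")])
print(prior["what color is the banana?"])  # yellow
```

A model that beats this baseline by a wide margin is actually using the image.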

Our starting point in creating the GQA dataset is the Visual Genome Scene Graph annotations [20] that cover 113k images from COCO [23] and Flickr [36]. The scene graph serves as a formalized representation of the image: each node denotes an object, a visual entity within the image, like a person, an apple, grass or clouds. It is linked to a bounding box specifying its position and size, and is marked up with about 1–3 attributes, properties of the object: e.g., its color, shape, material or activity. The objects are connected by relation edges, representing actions (verbs), spatial relations (prepositions), and comparatives.
Each image has human-annotated object boxes with categories and coordinates; every object carries 1–3 attributes describing color, shape and so on, and the objects are linked by relations such as actions, spatial prepositions, and comparatives.
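A minimal, hypothetical scene-graph entry in this style could look like the following (object ids, names and coordinates are invented for illustration):

```python
# Toy scene graph: objects with a box, 1-3 attributes, and relation edges.
scene_graph = {
    "width": 640, "height": 480,
    "objects": {
        "1001": {"name": "apple", "x": 120, "y": 80, "w": 60, "h": 55,
                 "attributes": ["red", "round"],
                 "relations": [{"name": "on", "object": "1002"}]},
        "1002": {"name": "plate", "x": 90, "y": 70, "w": 200, "h": 120,
                 "attributes": ["white"],
                 "relations": []},
    },
}

# Enumerate the relation triples (subject, predicate, object).
triples = [(obj["name"], rel["name"], scene_graph["objects"][rel["object"]]["name"])
           for obj in scene_graph["objects"].values()
           for rel in obj["relations"]]
print(triples)  # [('apple', 'on', 'plate')]
```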
The GQA dataset consists of 22,669,678 questions over 113,018 images, which cover a wide range of reasoning skills and vary in length and number of required inference-steps (figure 6). The dataset has a vocabulary size of 3097 words and 1878 possible answers. While smaller than natural language datasets, further investigation reveals that it covers 88.8% and 70.6% of VQA questions and answers respectively, corroborating its wide diversity. A wide selection of dataset visualizations is provided in the supplementary.
That is roughly 22 million questions over 113k images, with a 3,097-word vocabulary and 1,878 possible answers.
We associate each question with two types: structural and semantic. The structural type is derived from the final operation in the question's functional program. It can be (1) verify for yes/no questions, (2) query for all open questions, (3) choose for questions that present two alternatives to choose from, e.g. "Is it red or blue?"; (4) logical which involve logical inference, and (5) compare for comparison questions between two or more objects. The semantic type refers to the main subject of the question: (1) object: for existence questions, (2) attribute: consider the properties or position of an object, (3) category: related to object identification within some class, (4) relation: for questions asking about the subject or object of a described relation (e.g. "what is the girl wearing?"), and (5) global: about overall properties of the scene such as weather or place. As shown in figure 6, the questions' types vary at both the semantic and structural levels.
Questions are typed both structurally and semantically.
Structural types: (1) verify (yes/no), (2) query (open-ended), (3) choose (pick between two alternatives), (4) logical (involving logical connectives), (5) compare (comparisons between objects).
Semantic types: (1) object (existence of an object), (2) attribute (an object's properties or position), (3) category (identifying an object within some class), (4) relation (questions about the subject or object of a described relation), (5) global (overall scene properties such as weather or place).

New evaluation metrics

Since accuracy alone cannot tell whether a model truly reasons, merely guesses, exploits bias, or infers from irrelevant parts of the image, we need to evaluate from several angles rather than rely on that single criterion. The paper proposes five new metrics; of course these five are not exhaustive, and many more could be defined:
consistency, validity, plausibility, grounding and distribution scores

Consistency

Consistency measures whether the answers to related questions agree with each other. As I understand it, the point is that the many rephrasings of a question about one image must receive consistent answers: if a picture shows a red apple on a white plate, then every question touching those two objects should get answers with the same facts, never a gray plate or a green apple. The score averages correctness over each question's related questions.
Humans reach about 98% on this metric, while models only reach around 80%.
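A rough sketch of the idea, using the `entailed` lists that appear in the question JSON (the exact formula is defined by the official evaluation script, so treat this as an approximation):

```python
def consistency_score(predictions, entailed, answers):
    """For each correctly answered question, measure the fraction of its
    entailed questions also answered correctly; average over those questions.
    A sketch only -- the official GQA script defines the exact metric."""
    scores = []
    for qid, pred in predictions.items():
        if pred != answers[qid]:
            continue  # consistency is only checked for correct answers
        related = [e for e in entailed.get(qid, []) if e in predictions]
        if not related:
            continue
        scores.append(sum(predictions[e] == answers[e] for e in related) / len(related))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: q1 entails q2 and q3; the model contradicts itself on q3.
answers = {"q1": "yes", "q2": "yes", "q3": "no"}
entailed = {"q1": ["q2", "q3"]}
predictions = {"q1": "yes", "q2": "yes", "q3": "yes"}
print(consistency_score(predictions, entailed, answers))  # 0.5
```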

Validity and Plausibility

validity requires that the answer stay within the scope of the question type, i.e. no answering off-topic, such as replying yes/no to a color question.

plausibility demands more: the answer must also be consistent with common-sense knowledge. Answers that contradict common sense, such as an elephant talking or eating pizza, are implausible.
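A toy sketch of both checks, with invented answer spaces and observation tables (the real metrics are computed by the official script from dataset statistics):

```python
# Hypothetical sketch: validity checks the answer lies in the question type's
# answer space; plausibility additionally checks the answer was ever observed
# for the subject in the data. Both tables below are invented for illustration.
VALID = {"yes/no": {"yes", "no"},
         "color": {"red", "yellow", "green", "white", "gray"}}
OBSERVED = {"banana": {"yellow", "green"}}  # colors seen for bananas

def is_valid(question_kind, answer):
    return answer in VALID.get(question_kind, set())

def is_plausible(subject, answer):
    # Unknown subjects are given the benefit of the doubt.
    return answer in OBSERVED.get(subject, {answer})

print(is_valid("color", "yes"))        # False: color question answered yes/no
print(is_plausible("banana", "gray"))  # False: no gray bananas in the data
```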

Parsing train sceneGraphs for useful information

Build a dictionary mapping category names to ids

This is needed because GQA's sceneGraphs files contain only object names, with no dictionary mapping each name to an id.

So we first have to build one. We take the VG dictionary as the base and organize GQA's scene-graph JSON on top of it.
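A sketch of the dictionary-building step, assuming the sceneGraphs JSON maps image ids to graphs with an `objects` dict (the file path and base label list below are placeholders):

```python
import json

def build_label2id(scene_graph_path, base_labels=()):
    """Assign an id to every object name in a sceneGraphs json, starting
    from a base label list (e.g. the VG dictionary)."""
    label2id = {name: i for i, name in enumerate(base_labels)}
    with open(scene_graph_path) as f:
        graphs = json.load(f)
    for graph in graphs.values():
        for obj in graph["objects"].values():
            # Only names absent from the base dictionary get new ids.
            label2id.setdefault(obj["name"], len(label2id))
    return label2id
```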

We split the scene graphs following LXMERT's image splits, because the GQA website offers no pre-split files for download!

Extract the image ids from train.json, valid.json and testdev.json in the LXMERT GitHub repo and compare them against the sceneGraphs keys, to check whether every image has a human-annotated scene graph.
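The comparison can be sketched with set operations; the split files are assumed here to be lists of dicts with an `img_id` key, which may differ from the actual LXMERT layout:

```python
import json

def split_coverage(split_path, scene_graph_path):
    """Compare image ids in an LXMERT-style split file against the keys of
    a sceneGraphs json. The split layout is an assumption for this sketch."""
    with open(split_path) as f:
        split_ids = {str(item["img_id"]) for item in json.load(f)}
    with open(scene_graph_path) as f:
        graph_ids = set(json.load(f))
    missing = split_ids - graph_ids   # split images without a scene graph
    extra = graph_ids - split_ids     # scene graphs outside the split
    return missing, extra
```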

The check shows that all train graphs are present in sceneGraphs, but the 2,000-odd extra scene graphs cannot be accounted for.

Next, consider the val_sceneGraphs.json set of sceneGraphs.

It turns out that testdev has no scene graphs at all.

Tally the distribution of object categories in train:

The distribution is indeed imbalanced.
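The tally itself is a simple Counter over object names; toy data shown:

```python
from collections import Counter

def category_counts(scene_graphs):
    """Frequency of each object category across a sceneGraphs dict."""
    counts = Counter()
    for graph in scene_graphs.values():
        for obj in graph["objects"].values():
            counts[obj["name"]] += 1
    return counts

# Toy graphs standing in for the real train sceneGraphs.
toy = {"1": {"objects": {"a": {"name": "window"}, "b": {"name": "window"},
                         "c": {"name": "zebra"}}}}
print(category_counts(toy).most_common(1))  # [('window', 2)]
```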


['stop sign,stopsign', 'microwave,microwave oven', 'refrigerator,fridge', 'television,tv', 'sailboat,sail boat', 'racket,racquet', 'headboard,head board', 'tennis racket,tennis racquet', 'skateboard,skate board', 'hot dog,hotdog', 'surfboard,surf board', 'fire hydrant,hydrant', 'suitcase,suit case', 'donut,doughnut', 'sidewalk,side walk', 'stove top,stovetop', 'nightstand,night stand', 'donuts,doughnuts', 'lamp post,lamppost', 'fire truck,firetruck', 'tail light,taillight', 'hot dogs,hotdogs', 'tshirt,t-shirt,t shirt', 'streetlight,street light']
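Each entry in the list above groups spelling variants of one category. A small sketch that maps every variant to the first (canonical) spelling, shown on a subset:

```python
# Subset of the alias list above; each comma-separated entry groups variants.
ALIASES = ['stop sign,stopsign', 'donut,doughnut', 'tshirt,t-shirt,t shirt']

def build_alias_map(alias_list):
    """Map every spelling variant to the first (canonical) spelling."""
    alias_map = {}
    for entry in alias_list:
        variants = entry.split(',')
        for variant in variants:
            alias_map[variant] = variants[0]
    return alias_map

canon = build_alias_map(ALIASES)
print(canon['doughnut'], canon['t shirt'])  # donut tshirt
```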

Next, try to extract the per-image information:

Count the number of objects in each image, in order to choose a suitable per-image box count.

That is, just take the length of each objects array.

Extract the image ids and relation names.

Of course, the official train sceneGraphs file was generated from exactly these images.

Surprisingly, some images contain as many as 126 objects.

Filter the per-image counts in objnumtrain down to the same train image set as LXMERT.

Re-tally the number of objects per image.

The average is 16 objects per image; 180 images have more than 50 objects, and 766 have more than 40.

So capping at 50 boxes per image seems reasonable.
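The statistics that motivate the cap can be computed as below (toy counts shown; the real numbers come from the full train set):

```python
def box_count_stats(counts, thresholds=(40, 50)):
    """Summarize per-image object counts to help choose a per-image box cap."""
    mean = sum(counts) / len(counts)
    over = {t: sum(c > t for c in counts) for t in thresholds}
    return mean, over

# Toy per-image object counts standing in for the real distribution.
mean, over = box_count_stats([10, 16, 22, 45, 55])
print(mean, over)  # 29.6 {40: 2, 50: 1}
```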

Next, convert each object's coordinates to x1, y1, x2, y2 form, in preparation for extracting the relations.
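The conversion is a one-liner per box, assuming GQA's `x`, `y`, `w`, `h` fields:

```python
def to_corners(obj):
    """Convert a GQA object's x, y, w, h box to [x1, y1, x2, y2]."""
    return [obj["x"], obj["y"], obj["x"] + obj["w"], obj["y"] + obj["h"]]

print(to_corners({"x": 120, "y": 80, "w": 60, "h": 55}))  # [120, 80, 180, 135]
```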

Organize the result into JSON format; only the final state is shown here:

This is the dict form to be saved below.

The boxes have been drawn on the image above!


The final result is organized into the following dict.

If you generate a relation matrix, you will likely want an initial matrix with 1s on the diagonal (each object related to itself):

import numpy as np

np.diag([1, 1, 1])  # 3x3 matrix with 1s on the diagonal; np.eye(3, dtype=int) is equivalent

val sceneGraphs contains 295 relation types in total.

train sceneGraphs has 296, but as far as I can see most of them overlap; only around 150 are effectively distinct.

 

The VG Faster R-CNN, by contrast, provides only 20 types in total.

The following are not among VG's 20 classes:

Let's see how many types of questions there are in total:

 {'categoryThis', 'existAttrNotC', 'activityWho', 'verifyAttrAnd', 'weatherChoose', 'materialVerify', 'materialChoose', 'companyVerifyC', 'existAttrOr', 'existAnd', 'categoryThisChoose', 'directOf', 'typeVerifyC', 'typeVerify', 'weather', 'sameRelate', 'activityChoose', 'positionQuery', 'companyVerify', 'weatherVerifyC', 'existRelSC', 'objThisChoose', 'typeChoose', 'company', 'weatherVerify', 'sameGender', 'verifyAttrKC', 'categoryAttr', 'sameAnimalsC', 'chooseAttr', 'categoryThatChoose', 'categoryRelO', 'existThatOr', 'relS', 'categoryThat', 'existAttrOrC', 'categoryRelS', 'diffAnimalsC', 'diffAnimals', 'twoSameMaterial', 'verifyMaterialAnd', 'placeChoose', 'sameAnimals', 'materialVerifyC', 'existRelS', 'existThatNotC', 'stateChoose', 'dir', 'existOrC', 'relVerifyCop', 'relVerify', 'positionVerifyC', 'comparativeChoose', 'twoSameC', 'twoDifferent', 'existMaterialC', 'existAndC', 'twoCommon', 'diffGender', 'locationVerifyC', 'sameGenderC', 'positionVerify', 'material', 'locationChoose', 'sameMaterialRelate', 'place', 'twoSameMaterialC', 'existMaterialNot', 'existAttr', 'relChooser', 'relVerifyCo', 'verifyAttr', 'how', 'existOr', 'verifyAttrs', 'verifyAttrsC', 'verifyAttrC', 'placeVerifyC', 'companyChoose', 'existC', 'existAttrNot', 'existMaterialNotC', 'categoryRelOChoose', 'twoDifferentC', 'existThatOrC', 'category', 'existThat', 'verifyAttrThis', 'twoSame', 'existThatNot', 'activity', 'relVerifyCr', 'verifyAttrCThis', 'state', 'existMaterial', 'exist', 'existAttrC', 'positionChoose', 'relO', 'directWhich', 'existRelSRC', 'existThatC', 'placeVerify', 'locationVerify', 'verifyAttrK'}

Question JSON:

{"02930152": {"semantic": [{"operation": "select", "dependencies": [], "argument": "sky (2486325)"}, {"operation": "verify color", "dependencies": [0], "argument": "dark"}], "entailed": ["02930160", "02930158", "02930159", "02930154", "02930155", "02930156", "02930153"], "equivalent": ["02930152"], "question": "Is the sky dark?", "imageId": "2354786", "isBalanced": true, "groups": {"global": null, "local": "06-sky_dark"}, "answer": "yes", "semanticStr": "select: sky (2486325)->verify color: dark [0]", "annotations": {"answer": {}, "question": {"2": "2486325"}, "fullAnswer": {"2": "2486325"}}, "types": {"detailed": "verifyAttr", "semantic": "attr", "structural": "verify"}, "fullAnswer": "Yes, the sky is dark."},
"07333408": {"semantic": [{"operation": "select", "dependencies": [], "argument": "wall (722332)"}, {"operation": "filter color", "dependencies": [0], "argument": "white"}, {"operation": "relate", "dependencies": [1], "argument": "_,on,s (722335)"}, {"operation": "query", "dependencies": [2], "argument": "name"}], "entailed": [], "equivalent": ["07333408"], "question": "What is on the white wall?", "imageId": "2375429", "isBalanced": true, "groups": {"global": "", "local": "14-wall_on,s"}, "answer": "pipe", "semanticStr": "select: wall (722332)->filter color: white [0]->relate: _,on,s (722335) [1]->query: name [2]", "annotations": {"answer": {"0": "722335"}, "question": {"4:6": "722332"}, "fullAnswer": {"1": "722335", "5": "722332"}}, "types": {"detailed": "relS", "semantic": "rel", "structural": "query"}, "fullAnswer": "The pipe is on the wall."},
"07333405": {"semantic": [{"operation": "select", "dependencies": [], "argument": "pipe (722335)"}, {"operation": "verify color", "dependencies": [0], "argument": "red"}], "entailed": ["07333406"], "equivalent": ["07333405"], "question": "Is that pipe red?", "imageId": "2375429", "isBalanced": true, "groups": {"global": null, "local": "06-pipe_red"}, "answer": "no", "semanticStr": "select: pipe (722335)->verify color: red [0]", "annotations": {"answer": {}, "question": {"2": "722335"}, "fullAnswer": {"2": "722335"}}, "types": {"detailed": "verifyAttrC", "semantic": "attr", "structural": "verify"}, "fullAnswer": "No, the pipe is white."},
"15736264": {"semantic": [{"operation": "select", "dependencies": [], "argument": "clock (746851)"}, {"operation": "filter height", "dependencies": [0], "argument": "tall"}, {"operation": "choose size", "dependencies": [1], "argument": "large|small"}], "entailed": ["15736259", "15736258", "15736267", "15736253", "15736252", "15736251", "15736257", "15736256", "15736255", "15736254", "15736291", "15736249"], "equivalent": ["15736264"], "question": "Is the tall clock small or large?", "imageId": "2368326", "isBalanced": true, "groups": {"global": "size", "local": "10c-clock_size"}, "answer": "large", "semanticStr": "select: clock (746851)->filter height: tall [0]->choose size: large|small [1]", "annotations": {"answer": {}, "question": {"2:4": "746851"}, "fullAnswer": {"1": "746851"}}, "types": {"detailed": "chooseAttr", "semantic": "attr", "structural": "choose"}, "fullAnswer": "The clock is large."}}
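Given such a question dict, the set of question types can be collected in one pass over the `types` field (toy subset shown):

```python
def detailed_types(questions):
    """Collect the distinct 'detailed' question types from a GQA question
    dict -- the field from which the type set listed earlier is derived."""
    return {q["types"]["detailed"] for q in questions.values()}

# Two entries trimmed down from the example JSON above.
toy = {"02930152": {"types": {"detailed": "verifyAttr", "semantic": "attr",
                              "structural": "verify"}},
       "07333408": {"types": {"detailed": "relS", "semantic": "rel",
                              "structural": "query"}}}
print(sorted(detailed_types(toy)))  # ['relS', 'verifyAttr']
```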


【Asides】

Could everything be composited into one large image?

For leaderboard submissions on the official site, the code-based submission option is a much better experience than the web button, which mostly just stays grayed out with no status change!

 

 
