2020_ICML_Neuro-Symbolic Visual Reasoning: Disentangling “Visual” from “Reasoning”

Article link: https://blog.csdn.net/a1228136188/article/details/115947152

Neuro-Symbolic Visual Reasoning: Disentangling “Visual” from “Reasoning”

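Preface

These are notes from a quick skim of the paper. If I have misunderstood anything, corrections are welcome.
Markers used in this post:
- Red, blue, and bold: emphasis
- Green: things I did not understand
- Purple: cited papers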

Title and source

Title: Neuro-Symbolic Visual Reasoning: Disentangling “Visual” from “Reasoning”

(that is, separating the “visual” part of neuro-symbolic visual reasoning from the “reasoning” part)
Source: ICML 2020
- Paper link: https://arxiv.org/abs/2006.11524
- GitHub link: https://github.com/microsoft/DFOL-VQA
- Authors: Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida
- Research teams: Microsoft Applied Sciences Group (ASG), Microsoft Research AI

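Quick read

Motivation
- a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception,
  
  That is, the paper proposes a framework that separates the reasoning part of VQA from its perception part, so that reasoning can be isolated and evaluated on its own.
  
  (Perception here can be understood as the image's feature vectors.)
- a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception
  
  That is, a new top-down calibration technique
  
  that lets the model answer reasoning questions even when perception is imperfect.
  
  (I do not fully understand the top-down calibration technique.)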
Model

[Figure: model overview (image not included)]

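Takeaways & insights
- Visual Reasoning (VR)
  - Common language-driven tasks: (Introduction)
    - Visual Question Answering (VQA)
    - Visual Commonsense Reasoning (VCR)
  - Visual reasoning is an interaction between a perception system and a reasoning system (Introduction)
    - Perception system
      
      i.e., object detection and scene representation learning
    - Reasoning system
      
      i.e., question interpretation and inference grounded in the scene
- The GQA dataset mainly evaluates visual perception rather than a model's reasoning (Introduction)
  - A neuro-symbolic VQA model with access to the ground-truth scene graphs reaches 96% accuracy on GQA
    
    (here: trained on the train split, tested on the validation split)
    
    (A caveat: the GQA questions were themselves generated from scene graphs.)
  - But higher GQA accuracy does not necessarily mean better reasoning ability
    
    (With a strong enough perception system, accuracy can be very high even if the reasoning system is weak.)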
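- Neuro-symbolic VR models (Related work)
  - A simple classification by how operators are obtained:
    - MAC-like: the model learns its operators itself (no explicit symbolic operators)
    - NMN-like: the operators are predefined
  - The main benefit of neuro-symbolic reasoning is compositionality
    - All questions share the same learnable parameters for each module
      
      (the learnable parameters of individual operators are shared for all questions)
    - The intermediate representations of the modules are likely composable with each other
      
      (the intermediate representations produced by each module are likely composable with each other)
    (How should this be understood?)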
- Sources of visual features (Section 3)
  - Object detection: e.g., Faster-RCNN
  - Scene graph generation: Neural Motifs, Graph-RCNN
    
    (Neural Motifs: Neural motifs: Scene graph parsing with global context, CVPR 2018)
    
    (Graph-RCNN: Graph R-CNN for scene graph generation, ECCV 2018)
  - Features of the relations between objects can also be included
    
    (I do not fully understand this.)
    
    Relation features have been shown to be useful in tasks such as image captioning and information retrieval (Lee et al., 2019)
    
    (Learning visual relation priors for image-text matching and image captioning with neural scene graph generators, 2019)
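- How is "perception is not strong enough" defined? (Introduction)
  - The learned object representations do not contain enough information to determine certain properties of the objects
- When visual perception has gaps, pre-trained knowledge of this kind helps the reasoning process
- Conclusions
  - Current Faster-RCNN features are themselves imperfect representations
    
    These representations do not contain enough information to answer questions through straightforward sequential processing
  - There is still a lot of room for improvement in visual perception
  - The method in this paper can be used to evaluate the reasoning ability and sensitivity of VQA models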
- Code
  
  JSON files related to the GQA dataset: https://github.com/microsoft/DFOL-VQA/tree/main/src/nsvqa/data/metadata
  - gqa_all_attribute.json
  - gqa_all_class.json
  - gqa_relation.json
  - gqa_vocab.json
  - op_map.json
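To make the compositionality point from the Related work notes more concrete, here is a minimal sketch of an NMN-like setup. It is not taken from the paper or from the DFOL-VQA repository: the names `FilterModule`, `run_program`, `SCENE`, the `temperature` parameter, and the use of `max` as an existence score are all hypothetical choices made for illustration. The sketch shows the two properties noted above: one module instance (and hence one set of parameters) serves every question, and every intermediate result is an attention vector over objects, so any module's output can be fed into the next module.

```python
from typing import Dict, List, Optional, Tuple

Scene = List[Dict[str, float]]             # per-object concept scores in [0, 1] (toy perception output)
Attention = List[float]                    # one weight per object
Program = List[Tuple[str, Optional[str]]]  # sequence of (operator name, argument)


class FilterModule:
    """Predefined operator whose single parameter is shared by every question."""

    def __init__(self, temperature: float = 1.0) -> None:
        # Stand-in for learnable weights (assumption, purely illustrative).
        self.temperature = temperature

    def __call__(self, scene: Scene, attention: Attention, concept: str) -> Attention:
        # Down-weight objects that do not match the concept.
        return [a * obj.get(concept, 0.0) ** self.temperature
                for a, obj in zip(attention, scene)]


def exists(attention: Attention) -> float:
    """Score for 'some object is still attended' (max used as a simple proxy)."""
    return max(attention, default=0.0)


# One module instance reused by all programs: this is the parameter sharing.
FILTER = FilterModule()


def run_program(scene: Scene, program: Program) -> float:
    attention: Attention = [1.0] * len(scene)  # start by attending to every object
    for op, arg in program:
        if op == "filter":
            # Input and output are both attention vectors, so filters chain freely.
            attention = FILTER(scene, attention, arg)
        elif op == "exists":
            return exists(attention)
        else:
            raise ValueError(f"unknown operator: {op}")
    raise ValueError("program must end with a terminal operator such as 'exists'")


# Hypothetical toy scene with three detected objects.
SCENE: Scene = [
    {"apple": 0.9, "red": 0.8},
    {"apple": 0.1, "red": 0.9},
    {"apple": 0.95, "green": 0.7},
]

# Two different questions built from the same modules.
RED_APPLE: Program = [("filter", "apple"), ("filter", "red"), ("exists", None)]
GREEN_APPLE: Program = [("filter", "apple"), ("filter", "green"), ("exists", None)]

if __name__ == "__main__":
    print("Is there a red apple?  ", run_program(SCENE, RED_APPLE))
    print("Is there a green apple?", run_program(SCENE, GREEN_APPLE))
```

Changing `FILTER.temperature` changes the answer to both questions at once, which is what sharing parameters across questions means in practice. A MAC-like model would instead learn what each reasoning step does rather than dispatching on named operators.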
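Experiments
- Conclusion 1: Faster-RCNN features are informative enough on their own for only about half of the questions
  - Faster-RCNN features yield the correct answer for 51.86% of the instances in GQA (without relying on the reasoning of ∇-FOL)
    
    For the remaining questions, the visual features alone are not enough to answer the question
    
    (I did not follow how exactly this is implemented.)
  - For roughly 2/3 of the binary questions, the visual features are enough to answer the question
    
    This also explains why early classifier-based models work fairly well on binary questions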
- The parameter-less ∇-FOL inference from Section 3 achieves 96% accuracy on the GQA validation split using the golden oracle $O^*$ and the golden programs.
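As a rough illustration of why inference with a perfect oracle needs no learned parameters, the sketch below evaluates an existential question by multiplying per-object probabilities. It rests on assumptions that are mine, not confirmed details of the repository: filtering is modeled as multiplying the attention by the oracle's probability, and the existential quantifier is aggregated as $1 - \prod_i (1 - \alpha_i)$. The names `golden_oracle`, `noisy_oracle`, `answer_exists`, and the toy scenes are hypothetical.

```python
from math import prod
from typing import Any, Callable, Dict, List, Set

Oracle = Callable[[Any, str], float]  # (object, concept) -> probability in [0, 1]


def golden_oracle(obj: Set[str], concept: str) -> float:
    """Perfect perception: reads the answer from ground-truth labels (0 or 1)."""
    return 1.0 if concept in obj else 0.0


def noisy_oracle(obj: Dict[str, float], concept: str) -> float:
    """Imperfect perception: returns a predicted probability for the concept."""
    return obj.get(concept, 0.0)


def answer_exists(objects: List[Any], concepts: List[str], oracle: Oracle) -> float:
    """Probability that some object satisfies all concepts (assumed aggregation)."""
    attention = [1.0] * len(objects)
    for concept in concepts:
        # Filter step: keep an object only to the degree the oracle says it matches.
        attention = [a * oracle(o, concept) for a, o in zip(attention, objects)]
    # Soft existential quantifier: 1 - product of (1 - alpha_i).
    return 1.0 - prod(1.0 - a for a in attention)


# Hypothetical ground-truth scene graph: each object is a set of true labels.
GOLD_SCENE: List[Set[str]] = [{"apple", "red"}, {"apple", "green"}, {"table", "brown"}]

# Hypothetical detector output for a two-object scene.
NOISY_SCENE: List[Dict[str, float]] = [
    {"apple": 0.9, "red": 0.6},
    {"apple": 0.8, "red": 0.1},
]

if __name__ == "__main__":
    print(answer_exists(GOLD_SCENE, ["apple", "red"], golden_oracle))
    print(answer_exists(GOLD_SCENE, ["apple", "blue"], golden_oracle))
    print(answer_exists(NOISY_SCENE, ["apple", "red"], noisy_oracle))
```

With the golden oracle every probability is exactly 0 or 1, so the computation collapses to ordinary logical evaluation of the program; any remaining errors must then come from something other than perception. With the noisy oracle, the same code returns a graded confidence instead, so the reasoning procedure stays fixed while only the quality of perception changes.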
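Open questions
- I did not understand most of the model
  - In particular the parts on first-order logic reasoning and ∇-FOL (the paper's differentiable first-order logic, i.e., its neuro-symbolic reasoning model)
    
    (Background reading on neural-symbolic methods: Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning, 2019)
- What are the golden programs provided by GQA?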