[IEEE TPAMI 2024] Robust Visual Question Answering: Datasets, Methods, and Future Challenges

Paper link: Robust Visual Question Answering: Datasets, Methods, and Future Challenges | IEEE Journals & Magazine | IEEE Xplore

The English was typed entirely by hand! It is my summarizing and paraphrasing of the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is more of a set of notes, so read with care.

Table of Contents

1. Thoughts

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Preliminaries

2.3.1. Task Formulation

2.3.2. VQA Paradigm

2.4. Datasets

2.4.1. In-Distribution Setting

2.4.2. Out-of-Distribution Setting

2.5. Evaluations

2.6. Debiasing Methods

2.6.1. Ensemble Learning

2.6.2. Data Augmentation

2.6.3. Self-Supervised Contrastive Learning

2.6.4. Answer Re-Ranking

2.7. Exploring the Robustness of Vision-Language Models for VQA

2.8. Discussions and Future Directions

2.9. Conclusion

3. Reference

1. Thoughts

(1) Sweetheart, you made it into TPAMI; what reason could I possibly have not to read you?

(2) The paper reviews how datasets and models each tackle the problem of learned bias / fitting to one fixed distribution.

(3) The summarization is fine overall, but it is not the kind of paper that stuns you at first glance (one always has extra expectations for TPAMI).

2. Section-by-Section Reading

2.1. Abstract

        ①Existing visual question answering (VQA) methods lack generalization ability

2.2. Introduction

        ①Significance of VQA: image retrieval, intelligent virtual assistants, visual recommendation systems, autonomous driving (oh, really??)

        ②Methods of VQA: joint embedding, attention mechanism, and external knowledge

        ③In-distribution (ID) performance and out-of-distribution (OOD) performance:

(What the authors mean: in some dataset, very many questions asking "what sports" have the answer "tennis", so the model links the two and answers "tennis" even for other sports.) (This is learned bias, a very common problem.)

2.3. Preliminaries

2.3.1. Task Formulation

        ①Discriminative VQA: for dataset \mathcal{D} with n triplets \{(v_i,q_i,a_i)\}_{i=1}^n, where v_{i}\in \mathcal{V} denotes the image, q_{i}\in\mathcal{Q} denotes the question, and a_{i}\in\mathcal{A} denotes the answer. By optimizing parameters \theta^{(\mathrm{d})}, the model predicts the answer:

\hat{a}_i^{(\mathrm{d})}=\arg\max_{a_i\in\mathcal{A}}p(a_i|v_i,q_i;\theta^{(\mathrm{d})})

        ②Generative VQA: for dataset \mathcal{D}, the model predicts the answer token by token. Thus the optimization goal is to maximize the conditional probability p(y_j|(\hat{y}_1,\ldots,\hat{y}_{j-1}),v_i,q_i;\theta^{(\mathrm{g})}):

\hat{a}_i^{(\mathrm{g})}=(\hat{y}_1,\ldots,\hat{y}_k),\quad\hat{y}_j=\arg\max_{y_j\in\mathcal{Y}}p(y_j|(\hat{y}_1,\ldots,\hat{y}_{j-1}),v_i,q_i;\theta^{(\mathrm{g})})

where \mathcal{Y} is the set of all tokens in the corpus and k denotes the number of tokens in \hat{a}_i
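As a quick illustration of the two formulations, here is a minimal Python sketch contrasting discriminative answer selection (one argmax over the answer set \mathcal{A}) with generative greedy decoding over the token set \mathcal{Y}; the model interfaces, vocabularies, and max_len are my own illustrative assumptions, not from the paper:

```python
import torch

def predict_discriminative(model, v, q, answer_vocab):
    # One forward pass scores every candidate answer in A; take the argmax.
    logits = model(v, q)                          # shape: [|A|]
    return answer_vocab[int(torch.argmax(logits))]

def predict_generative(model, v, q, token_vocab, eos_id, max_len=10):
    # Greedy decoding: choose the most probable next token y_j conditioned on
    # the image, the question, and the previously generated tokens.
    generated = []
    for _ in range(max_len):
        logits = model(v, q, generated)           # shape: [|Y|]
        y_j = int(torch.argmax(logits))
        if y_j == eos_id:
            break
        generated.append(y_j)
    return " ".join(token_vocab[t] for t in generated)
```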

2.3.2. VQA Paradigm

(1)Non-Debiasing

        ①Discriminative VQA: combines a patch-level image encoder E_\mathrm{v}:\mathcal{V}\to\mathbb{R}^{n_\mathrm{v}\times d_\mathrm{v}}, a word-level question encoder E_\mathrm{q}:\mathcal{Q}\to\mathbb{R}^{n_\mathrm{q}\times d_\mathrm{q}}, a multi-modality encoder E_\mathrm{m}:\mathbb{R}^{n_\mathrm{q}\times d_\mathrm{q}}\times\mathbb{R}^{n_\mathrm{v}\times d_\mathrm{v}}\to\mathbb{R}^{d_\mathrm{m}} and a predictor E_\mathrm{c}:\mathbb{R}^{d_\mathrm{m}}\to\mathbb{R}^{|\mathcal{A}|}:

\hat{a}_i^{(\mathrm{d})}=f^{(\mathrm{d})}(v_i,q_i;\theta^{(\mathrm{d})})=E_\mathrm{c}(E_\mathrm{m}(E_\mathrm{v}(v_i),E_\mathrm{q}(q_i)))

Replacing the encoders with different / shared Transformers is what the authors call the dual-stream / single-stream approach:

\hat{a}_i^{(\mathrm{d})}=f^{(\mathrm{d})}(v_i,q_i;\theta^{(\mathrm{d})})=E_\mathrm{c}(E_\mathrm{t}(\boldsymbol{v}_i||\boldsymbol{q}_i))

        ②Generative VQA: with a patch-level image encoder E_\mathrm{v}:\mathcal{V}\to\mathbb{R}^{n_\mathrm{v}\times d_\mathrm{v}} and a word-level question encoder E_\mathrm{q}:\mathcal{Q}\to\mathbb{R}^{n_\mathrm{q}\times d_\mathrm{q}}, the decoder directly maps \mathbb{R}^{d_{\mathrm{q}}}\times\mathbb{R}^{d_{\mathrm{v}}}\to\mathbb{R}^{|\mathcal{Y}|}. The paradigm is:

\hat{a}_i^{(\mathrm{g})}=f^{(\mathrm{g})}(q_i,v_i;\theta^{(\mathrm{g})})=\mathrm{Decoder}(E_\mathrm{q}(q_i),E_\mathrm{v}(v_i))
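A minimal PyTorch-style sketch of the discriminative non-debiasing pipeline E_\mathrm{c}(E_\mathrm{m}(E_\mathrm{v}(v_i),E_\mathrm{q}(q_i))); the encoder internals, feature dimensions, and answer-set size below are placeholders chosen for illustration, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class NonDebiasingVQA(nn.Module):
    def __init__(self, d_v=1024, d_q=768, d_m=1024, num_answers=3129):
        super().__init__()
        self.E_v = nn.Linear(2048, d_v)         # stand-in for a patch/region image encoder
        self.E_q = nn.Linear(300, d_q)          # stand-in for a word-level question encoder
        self.E_m = nn.Linear(d_v + d_q, d_m)    # stand-in multi-modality fusion encoder
        self.E_c = nn.Linear(d_m, num_answers)  # answer classifier over |A| candidates

    def forward(self, v_feat, q_feat):
        # v_feat: [B, n_v, 2048] region features; q_feat: [B, n_q, 300] word embeddings.
        v = self.E_v(v_feat).mean(dim=1)        # pool the n_v patches -> [B, d_v]
        q = self.E_q(q_feat).mean(dim=1)        # pool the n_q words   -> [B, d_q]
        m = torch.relu(self.E_m(torch.cat([v, q], dim=-1)))
        return self.E_c(m)                      # logits over the answer set A
```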

(2)Debiasing

        ①Spurious ("fake") connections: between question key words and answers

        ②Without debiasing, a biased model effectively predicts answers from a single modality (language-only or vision-only):

\begin{gathered} \hat{a}_{i}^{(\mathrm{l})}=f^{(\mathrm{l})}(v_i,q_i;\theta_\mathrm{l})=E_\mathrm{c}(E_\mathrm{q}(q_i)), \\ \hat{a}_{i}^{(\mathrm{v})}=f^{(\mathrm{v})}(v_i,q_i;\theta_\mathrm{v})=E_\mathrm{c}(E_\mathrm{v}(v_i)), \end{gathered}

⭐language bias learning is more prevalent than vision bias learning

        ③Multimodal bias learning:

\hat{a}_i^{(\mathrm{m})}=f^{(\mathrm{m})}(v_i,q_i;\theta^{(\mathrm{m})})

2.4. Datasets

2.4.1. In-Distribution Setting

        ①VQA v1: each question is answered by 10 human subjects:

answer types: "Number", "Yes/No", and "Other" (the "Other" type may have multiple true answers)

        ②VQA v2 (built to reduce the bias in v1): pairs each question with similar images that lead to different answers:

but the answer distribution still has an effect

        ③TDIUC: COCO-based; adds 12 question types such as positional reasoning and activity recognition

        ④GQA: more complex questions

        ⑤COVR: only one true answer, and adds logical operators such as quantifiers and aggregations

        ⑥CRIC: contains annotation triples such as the commonsense knowledge "(subject: fork, relation: is used for, object: moving food)"

2.4.2. Out-of-Distribution Setting

        ①VQA-CP v1 & VQA-CP v2: VQA-CP v1 has 370 K questions with 205 K images, and VQA-CP v2 has 658 K questions with 219 K images. ⭐There are significant distribution differences between the train and test sets:

        ②GQA-OOD: evaluates ID and OOD performance by answer groups:

which contains 53 K questions with 9.7 K images

        ③VQA-Rephrasings: 162.0 K questions with 40.5 K images, used to measure how answers change when questions are rephrased (only the wording changes, yet it can cause wrong predictions)

        ④VQA-CE: built around shortcut counterexamples:

        ⑤VQA-VS: provides more specific shortcuts

        ⑥AVQA: 243.0 K questions accompanied by 37.9 K images; collects Q&A pairs on which models fail

        ⑦AdVQA: also adversarial; contains 46.8 K questions accompanied by 41.8 K images

       ⑧Overview of VQA datasets:

2.5. Evaluations

        ①Open-Ended Accuracy:

\text{open-ended accuracy}=\min\left\{\frac{n_a}{3},1\right\}

where n_a denotes the number of human-provided answers that are identical to the predicted answer, and 3 is the minimum number of consensuses (an answer counts as fully correct when at least three annotators gave it)
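A direct implementation of the formula above (the standard VQA accuracy); the names are illustrative:

```python
def open_ended_accuracy(predicted, human_answers):
    """Accuracy of one predicted answer against the (typically 10) human answers."""
    n_a = sum(1 for a in human_answers if a == predicted)
    return min(n_a / 3.0, 1.0)

# e.g. open_ended_accuracy("tennis", ["tennis"] * 4 + ["baseball"] * 6) -> 1.0
```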

        ②Composite Metrics: Consistency, Validity/Plausibility, Distribution, Grounding

        ③Arithmetic MPT (mean per-type accuracy): proposed along with TDIUC

        ④Consensus Score: proposed by VQA-Rephrasings:

cs(k)=\sum_{\mathcal{Q}^{\prime}\subset\mathcal{Q},|\mathcal{Q}^{\prime}|=k}\frac{s(\mathcal{Q}^{\prime})}{^nC_k}

\mathrm{with}\quad s(\mathcal{Q}^{\prime})=\left\{\begin{array}{ll}1 & \mathrm{if}\quad\forall q\in\mathcal{Q}^{\prime}\quad\phi(q)>0, \\ 0 & \text{otherwise,}\end{array}\right.

where {}^nC_k is the number of subsets with size k sampled from a set with size n, \mathcal{Q}^{\prime} is a group of questions contained in \mathcal{Q} that consists of n rephrasings, and \phi(q) is the open-ended accuracy.
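A small sketch of cs(k) following the formula above; phi maps each rephrasing to its open-ended accuracy, and the function names are my own:

```python
from itertools import combinations
from math import comb

def consensus_score(phi, questions, k):
    """cs(k): fraction of size-k subsets of rephrasings that are *all* answered correctly.

    phi: dict mapping each question (rephrasing) to its open-ended accuracy.
    """
    n = len(questions)
    hits = sum(1 for subset in combinations(questions, k)
               if all(phi[q] > 0 for q in subset))
    return hits / comb(n, k)
```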

2.6. Debiasing Methods

        ①Categories: ensemble learning, data augmentation, self-supervised contrastive learning, and answer re-ranking

        ②Summary of existing methods:

        ③Methods on other datasets:

2.6.1. Ensemble Learning

        ①Combines, via a fusion function \Phi, a bias branch E_\mathrm{b} with the main model f:

\hat{a}_i=\Phi\left(E_\mathrm{b}\left(E_\mathrm{v}(v_i),E_\mathrm{q}(q_i)\right),f(v_i,q_i)\right)

(Normally both the question and the image are fed in at test time. My first impression was that this ensemble learning uses two branches during training but only one modality at test time, which looked odd. The resolution: the bias-only branch (often question-only) is used only during training to absorb the bias and is dropped at inference, so the test-time model still uses both modalities. ↓)
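As a concrete illustration (not the paper's own code), here is a minimal RUBi-style sketch in which a question-only bias branch E_b modulates the main model's logits during training and is simply skipped at inference; the module names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class EnsembleDebiasVQA(nn.Module):
    def __init__(self, main_model, d_q=768, num_answers=3129):
        super().__init__()
        self.main = main_model                   # any f(v, q) -> logits over A
        self.E_b = nn.Linear(d_q, num_answers)   # question-only bias branch

    def forward(self, v_feat, q_feat):
        # q_feat: [B, n_q, d_q] word-level question features.
        logits = self.main(v_feat, q_feat)
        if self.training:
            # Phi: let the bias branch absorb the language prior via a sigmoid mask.
            bias_mask = torch.sigmoid(self.E_b(q_feat.mean(dim=1)))
            return logits * bias_mask            # biased logits, used only for the training loss
        return logits                            # inference: the bias branch is ignored
```

The intent of this kind of fusion is that answers predictable from the question alone get absorbed by the bias branch, so the main model is pushed to rely more on the image.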

2.6.2. Data Augmentation

        ①Generate additional augmented triplets \left(v_i^\prime,q_i^\prime,a_i^\prime\right)

        ②Answer prediction:

\begin{aligned} & \hat{a}_{i}=E_{\mathrm{c}}(E_{\mathrm{m}}(E_{\mathrm{v}}(v_{i}),E_{\mathrm{q}}(q_{i}))), \\ & (v_i,q_i,a_i)\in\mathcal{D}\cup\{(v_i^{\prime},q_i^{\prime},a_i^{\prime})|i\in[1,n]\}. \end{aligned}

        ③Main methods: 1) synthetic-based: generate new training samples by modifying regions or words of the original images or questions; 2) pairing-based: generate new samples by re-matching relevant questions for images:
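A toy sketch of the pairing-based idea only: re-match an existing question to a different image that plausibly supports it. Matching by shared object labels and keeping the original answer is my own crude simplification; real pairing-based methods verify or re-derive the answer for the new image:

```python
def pairing_based_augment(dataset, image_objects):
    """dataset: list of (image_id, question, answer) triplets.
    image_objects: dict mapping image_id -> set of object labels detected in that image."""
    augmented = []
    for image_id, question, answer in dataset:
        needed = image_objects[image_id]
        for other_id, objects in image_objects.items():
            # Re-pair the question with another image containing the same objects.
            if other_id != image_id and needed <= objects:
                augmented.append((other_id, question, answer))
                break
    return dataset + augmented
```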

2.6.3. Self-Supervised Contrastive Learning

        ①Pull similar sample pairs together and push dissimilar ones apart

        ②Loss:

\begin{gathered} \mathcal{L}=\lambda_\mathrm{C}\mathcal{L}_\mathrm{C}+\lambda_\mathrm{V}\mathcal{L}_\mathrm{V}, \\ \mathcal{L}_{\mathrm{C}}=\mathbb{E}_{o,p,n\in\mathcal{D}^*}\left[-\log\left(\frac{e^{s(o,p)}}{e^{s(o,p)}+e^{s(o,n)}}\right)\right], \\ \mathcal{L}_{\mathrm{V}}=-\frac{1}{|\mathcal{D}^*|}\sum_{i=1}^{|\mathcal{D}^*|}[a_i]\log\hat{a}_i, \end{gathered}

where the \lambda's are weights, s(o,p) and s(o,n) are the scores between the anchor o and the positive sample p / negative sample n, \mathcal{D}^* is the augmented data, and [a_i] is the index of the answer a_i
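A minimal sketch of the two loss terms above, assuming the anchor/positive/negative representations are already extracted and taking a dot product as the score function s(\cdot,\cdot) (my choice; actual methods use their own scoring and sampling schemes):

```python
import torch
import torch.nn.functional as F

def contrastive_debias_loss(anchor, positive, negative, logits, answer_idx,
                            lambda_c=1.0, lambda_v=1.0):
    """anchor/positive/negative: [B, d] sample representations; logits: [B, |A|]."""
    s_op = (anchor * positive).sum(dim=-1)        # s(o, p), dot-product score
    s_on = (anchor * negative).sum(dim=-1)        # s(o, n)
    # L_C: -log( e^{s(o,p)} / (e^{s(o,p)} + e^{s(o,n)}) ), i.e. softplus(s_on - s_op).
    l_c = F.softplus(s_on - s_op).mean()
    # L_V: cross-entropy over the (augmented) VQA samples.
    l_v = F.cross_entropy(logits, answer_idx)
    return lambda_c * l_c + lambda_v * l_v
```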

        ③Schematic of this method:

2.6.4. Answer Re-Ranking

        ①Process:

\hat{a}_i=E_\mathrm{r}(E_\mathrm{c}(E_\mathrm{m}(E_\mathrm{v}(v_i),E_\mathrm{q}(q_i))))
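A minimal sketch of the re-ranking step E_r: take the classifier's top-k candidate answers and re-score them with an extra module (here a hypothetical relevance scorer over answer embeddings that I made up for illustration), keeping only the best one:

```python
import torch
import torch.nn as nn

class AnswerReRanker(nn.Module):
    """E_r: re-score the classifier's top-k candidates using the fused features."""
    def __init__(self, d_m=1024, d_a=300, top_k=5):
        super().__init__()
        self.top_k = top_k
        self.scorer = nn.Sequential(
            nn.Linear(d_m + d_a, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, logits, fused, answer_embeddings):
        # logits: [B, |A|] from E_c; fused: [B, d_m] from E_m; answer_embeddings: [|A|, d_a].
        _, top_idx = logits.topk(self.top_k, dim=-1)                 # candidate indices [B, k]
        cand_emb = answer_embeddings[top_idx]                        # [B, k, d_a]
        fused_rep = fused.unsqueeze(1).expand(-1, self.top_k, -1)    # [B, k, d_m]
        rerank = self.scorer(torch.cat([fused_rep, cand_emb], dim=-1)).squeeze(-1)  # [B, k]
        best = rerank.argmax(dim=-1, keepdim=True)                   # best candidate per sample
        return top_idx.gather(1, best).squeeze(1)                    # re-ranked answer indices [B]
```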

2.7. Exploring the Robustness of Vision-Language Models for VQA

        ①Single-stream and dual-stream VLM:

        ②VQA Results of VLMs in ID and OOD Situations:

2.8. Discussions and Future Directions

        ①Limitations and future directions: datasets with more accurate annotations, benchmarks containing both ID and OOD tasks, metrics for multi-hop inference, more accurate evaluation, and more robust debiasing methods

        ②Robust evaluation:

2.9. Conclusion

        ~

3. Reference

@ARTICLE{10438044,
  author={Ma, Jie and Wang, Pinghui and Kong, Dechen and Wang, Zewei and Liu, Jun and Pei, Hongbin and Zhao, Junzhou},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Robust Visual Question Answering: Datasets, Methods, and Future Challenges}, 
  year={2024},
  volume={46},
  number={8},
  pages={5575-5594},
  keywords={Sports;Task analysis;Robustness;Transformers;Training;Question answering (information retrieval);Knowledge engineering;Vision-and-language pre-training;bias learning;debiasing;multi-modality learning;visual question answering},
  doi={10.1109/TPAMI.2024.3366154}}
