VQA Paper Collection


Awesome Text VQA

Text-related VQA is a fine-grained direction of the VQA task that focuses only on questions requiring the model to read the textual content shown in the input image.

Datasets

| Dataset | #Train+Val Img | #Train+Val Que | #Test Img | #Test Que | Image Source | Language |
|---|---|---|---|---|---|---|
| Text-VQA | 25,119 | 39,602 | 3,353 | 5,734 | [1] | EN |
| ST-VQA | 19,027 | 26,308 | 2,993 | 4,163 | [2, 3, 4, 5, 6, 7, 8] | EN |
| OCR-VQA | 186,775 | 901,717 | 20,797 | 100,429 | [9] | EN |
| EST-VQA | 17,047 | 19,362 | 4,000 | 4,525 | [4, 5, 8, 10, 11, 12, 13] | EN+CH |
| DOC-VQA | 11,480 | 44,812 | 1,287 | 5,188 | [14] | EN |
| VisualMRC | 7,960 | 23,854 | 2,237 | 6,708 | self-collected webpage screenshots | EN |

Image Source:

[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]

[2] Imagenet: A large-scale hierarchical image database [dataset]

[3] Vizwiz grand challenge: Answering visual questions from blind people [dataset]

[4] ICDAR 2013 robust reading competition [dataset]

[5] ICDAR 2015 competition on robust reading [dataset]

[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]

[7] Image retrieval using textual cues [dataset]

[8] Coco-text: Dataset and benchmark for text detection and recognition in natural images [dataset]

[9] Judging a book by its cover [dataset]

[10] Total Text [dataset]

[11] SCUT-CTW1500 [dataset]

[12] MLT [dataset]

[13] Chinese Street View Text [dataset]

[14] UCSF Industry Document Library [dataset]

Related Challenges

ICDAR 2021 Competition on Document Visual Question Answering (DocVQA). Submission Deadline: 31 March 2021 [Challenge]

Document Visual Question Answering, CVPR 2020 Workshop on Text and Documents in the Deep Learning Era. Submission Deadline: 30 April 2020 [Challenge]

Papers

2021

  • [VisualMRC] VisualMRC: Machine Reading Comprehension on Document Images (AAAI) [Paper][Project]
  • [SSBaseline] Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps (AAAI) [Paper][code]

2020

  • [SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper][Project][Code]
  • [EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
  • [M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
  • [LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]
  • [CRN] Cascade Reasoning Network for Text-based Visual Question Answering (ACM MM) [Paper][Project]

2019

  • [Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
  • [ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
  • [Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
  • [OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]

Technical Reports

  • [TAP] TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [Report]
  • [RUArt] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [Report]
  • [SMA] Structured Multimodal Attentions for TextVQA [Report][Slides][Video]
  • [DiagNet] DiagNet: Bridging Text and Image [Report][Code]
  • [DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
  • [Schwail] Runner-up of 2019 Text-VQA challenge [Slides]

Benchmark

Acc. : Accuracy
I. E. : Image Encoder
Q. E. : Question Encoder
O. E. : OCR Token Encoder
Ensem. : Ensemble
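
Text-VQA reports the soft VQA-style accuracy, which scores a prediction by its agreement with the 10 human-annotated answers. A minimal sketch of that metric, assuming the standard leave-one-out formulation (the official evaluation also applies answer normalization such as punctuation and article stripping, omitted here):

```python
def vqa_accuracy(pred: str, gt_answers: list[str]) -> float:
    """Soft VQA accuracy: for each annotator, a prediction counts as fully
    correct if at least 3 of the *other* annotators gave the same answer;
    the per-annotator scores min(#matches/3, 1) are then averaged."""
    pred = pred.strip().lower()
    gts = [a.strip().lower() for a in gt_answers]
    scores = []
    for i in range(len(gts)):
        others = gts[:i] + gts[i + 1:]          # leave annotator i out
        matches = sum(1 for a in others if a == pred)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)
```

For example, a prediction matching 3 of 10 annotators scores 0.9 rather than 0.3, since the 7 non-matching annotators each still see 3 agreeing answers among the remaining 9.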

Text-VQA

[official leaderboard(2019)]
[official leaderboard(2020)]

| Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2019–CVPR | LoRRA | 26.64 | Faster R-CNN | GloVe | Rosetta-ml | FastText | Classification | N |
| 2019–N/A | DCD_ZJU | 31.44 | Faster R-CNN | BERT | Rosetta-ml | FastText | Classification | Y |
| 2020–CVPR | M4C | 40.46 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
| 2020–Challenge | Xiangpeng | 40.77 | | | | | | |
| 2020–Challenge | colab_buaa | 44.73 | | | | | | |
| 2020–Challenge | CVMLP(SAM) | 44.80 | | | | | | |
| 2020–Challenge | NWPU_Adelaide_Team(SMA) | 45.51 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
| 2020–ECCV | SA-M4C | 44.6* | Faster R-CNN (ResNext-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
| 2020–arXiv | TAP | 53.97* | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |

* Using external data for training.

ST-VQA

[official leaderboard]

T1 : Strongly Contextualised Task
T2 : Weakly Contextualised Task
T3 : Open Dictionary
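
The ST-VQA evaluation protocol scores answers with Average Normalized Levenshtein Similarity (ANLS) rather than exact match, so near-miss OCR readings still receive partial credit. A minimal sketch, assuming the standard definition with threshold τ = 0.5 (scores below the threshold are zeroed to penalize wrong answers that happen to be lexically close):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[len(b)]

def anls(pred: str, gt_answers: list[str], tau: float = 0.5) -> float:
    """Best normalized similarity against any ground-truth answer,
    zeroed when it falls below the threshold tau."""
    best = 0.0
    for gt in gt_answers:
        p, g = pred.strip().lower(), gt.strip().lower()
        if not p and not g:
            s = 1.0
        else:
            s = 1.0 - levenshtein(p, g) / max(len(p), len(g))
        best = max(best, s if s >= tau else 0.0)
    return best
```

The dataset-level score is the mean of `anls` over all questions; e.g. predicting "helo" against ground truth "hello" scores 0.8 instead of 0.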

| Y-C./J. | Methods | Acc. (T1/T2/T3) | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2020–CVPR | M4C | na/na/0.4621 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
| 2020–Challenge | SMA | 0.5081/0.3104/0.4659 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
| 2020–ECCV | SA-M4C | na/na/0.5042 | Faster R-CNN (ResNext-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
| 2020–arXiv | TAP | na/na/0.5967 | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |

OCR-VQA

| Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2020–CVPR | M4C | 63.9 | Faster R-CNN | BERT | Rosetta-en | FastText | Decoder | N |