weixin_45154287 - CSDN Blog

[Original] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering (paper)

The Transformer has a superior ability for global dependency modeling. How to dynamically schedule global and local dependency modeling in the Transformer has become an emerging issue. An example-dependent routing scheme called TRAnsformer Routing (TRAR) is proposed to address this issue.

2023-10-11 19:35:38 622 1

[Original] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering

(1) First train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. GPT-3: an autoregressive language model trained on a large corpus. Answer-aware examples: examples similar to the ground-truth answer; this seems to amount to finding questions related to the answer.

2023-10-11 10:06:34 419 1

[Original] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks (code reproduction)

Fix: change "extended_attention_mask = extended_attention_mask.to(dtype = next(self.parameters()).dtype)" to "extended_attention_mask = extended_attention_mask.to(dtype = torch.float32)". Fix: change "self.apply(self.init_weights)" to "self.init_weights()".

2023-10-10 20:48:32 217 2

[Original] REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering (paper)

(1) Visual features are extracted either from the whole image or in a sliding-window manner for retrieving knowledge, and the important relationships within and among object regions are neglected.

2023-10-08 21:24:53 277 1

[Original] KAT: A Knowledge Augmented Transformer for Vision-and-Language (paper)

(1) Knowledge extraction: ① for implicit knowledge, we design new prompts to extract both tentative answers and supporting evidence from a frozen GPT-3 model; [...], Wikidata). Explicit knowledge here means Wikipedia knowledge, and implicit knowledge means commonsense knowledge.

2023-10-08 11:07:37 521 1

[Original] KAT: A Knowledge Augmented Transformer for Vision-and-Language (code)

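3 function1: load_okvqa_data (loads the OKVQA dataset and returns examples, a list in which each element is one image-text pair stored as a dict with the keys id (image id#question id), question (the question), answers (the answers), entities (explicit knowledge, i.e. Wikipedia knowledge), and gpt3 (implicit knowledge)). 3 gpt3_okvqa_train2014_answers.pkl: a dictionary with 9009 key-value pairs, where each key is an image name and each value is a list of tuples, each giving an entity and its corresponding knowledge description.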

2023-10-08 11:06:07 304 2
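The data shapes described in the KAT code entry above can be pictured with a small sketch. Only the key names (id, question, answers, entities, gpt3) and the "image id#question id" convention come from the notes; every concrete value, the file name, the helper names make_example_id and split_example_id, and the tuple layout inside entities and gpt3 are assumptions made up for illustration, not taken from the KAT repository.

```python
# Hypothetical illustration of the data shapes described in the notes.
# All concrete values are invented; only the key names come from the notes.

def make_example_id(image_id: str, question_id: str) -> str:
    """Join an image id and a question id with '#', as the notes describe."""
    return f"{image_id}#{question_id}"


def split_example_id(example_id: str) -> tuple:
    """Recover (image_id, question_id) from an 'image id#question id' string."""
    image_id, question_id = example_id.split("#", 1)
    return image_id, question_id


# One element of `examples`: a dict describing a single image-text pair.
example = {
    "id": make_example_id("000000123456", "1234565"),
    "question": "What sport is being played?",
    "answers": ["tennis", "tennis", "badminton"],
    # Assumed layout: (entity, description) tuples for explicit knowledge.
    "entities": [("tennis", "racket sport played on a court")],
    # Assumed layout: (tentative answer, supporting evidence) for implicit knowledge.
    "gpt3": [("tennis", "the player is holding a racket")],
}
examples = [example]

# Shape of the pickled dictionary: image name -> list of tuples.
knowledge_by_image = {
    "example_image_000000123456.jpg": [
        ("tennis", "racket sport played on a court"),
    ],
}
```

Keeping the image id and question id in one string means a single key identifies the pair, and split_example_id recovers both parts when the image-level dictionary has to be consulted.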

[Original] Multi-Modal Transformer with Global-Local Alignment for Composed Query Image Retrieval (paper)

1. Goal: composed query image retrieval. 2. Current problem: the architecture discrepancy in feature encoders would restrict the vision-language plenitudinous interaction. 1. Main challenges: 2. Proposed method: components: the vision Transformer, the language Transformer, and th

2023-10-05 13:13:38 357

[Original] LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection (paper)

(1) Pre-trained language models or/and unstructured text are incomplete and noisy; (2) knowledge graphs (KGs) have intensive structured knowledge. LaKo (Late Knowledge-to-Text Injection): a knowledge-driven VQA method.

2023-09-30 12:18:29 230

[Original] WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRoundin

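Motivation: dense boundary and bounding box annotations (I take the boundaries to be the video segments relevant to the text, and the bounding box annotations to be the object boxes in each frame). Hence the proposal: if the decomposed structure of video and language can be captured, spurious connections between video-language components can be avoided. New framework: WINNER for hierarchical video-text understanding.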

2023-09-20 17:59:48 400 1

[Original] Abstract Meaning Representation

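1. Semantic relations and concepts, which aim to describe "who did what to whom" and follow a predicate-argument structure. At every level a node follows the pattern "the instance is the main part (for example a verb), and the subject and object act as subject and object (that is, the initiator and the recipient of the action)", which yields a subject-predicate-object triple. 2. Each sentence is represented as a rooted, directed, acyclic graph (expanding from the root node toward the leaf nodes until it is complete).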

2023-09-19 10:56:31 886
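The two points in the Abstract Meaning Representation entry above (triples, and a rooted directed acyclic graph) can be made concrete with a small sketch. It encodes the commonly used AMR example sentence "The boy wants to go" by hand; the function names to_triples and is_rooted_dag and the plain-dict representation are assumptions chosen for illustration, not part of any AMR toolkit.

```python
from collections import defaultdict

# AMR for "The boy wants to go":
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
instances = {"w": "want-01", "b": "boy", "g": "go-01"}
edges = [("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")]
root = "w"


def to_triples(instances, edges):
    """Flatten the graph into (source, relation, target) triples."""
    triples = [(var, "instance", concept) for var, concept in instances.items()]
    triples.extend(edges)
    return triples


def is_rooted_dag(root, nodes, edges):
    """True if the graph has no cycle and every node is reachable from root."""
    children = defaultdict(list)
    for src, _, dst in edges:
        children[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in nodes}

    def visit(node):
        color[node] = GRAY
        for child in children[node]:
            if color[child] == GRAY:  # back edge: a cycle exists
                return False
            if color[child] == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    if not visit(root):
        return False
    return all(color[n] == BLACK for n in nodes)


triples = to_triples(instances, edges)
valid = is_rooted_dag(root, instances.keys(), edges)
```

The variable b appears as the ARG0 of both want-01 and go-01, which is why the structure is a graph rather than a tree: the boy is both the one who wants and the one who goes.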

[Original] Zero-shot Visual Question Answering using Knowledge Graph (paper study)

1. Pipeline approaches with different components for knowledge matching and extraction, feature learning (a single model is made up of different components); (the answer bias problem).

2023-09-13 18:29:19 307

[Original] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks (code study, part 1)

Oscar code study

2023-09-13 17:36:07 501 1

[Original] Declaration-based Prompt Tuning for Visual Question Answering (paper study)

Declaration-based Prompt Tuning for Visual Question Answering (paper study)

2023-09-11 17:17:04 194
