[Original] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering (paper notes)
Transformers have a superior ability to model global dependencies. How to dynamically schedule global and local dependency modeling in a Transformer has become an emerging issue. The paper proposes an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue.
2023-10-11 19:35:38 226 1
[Original] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
(1) First, train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. GPT-3 is an autoregressive language model trained on a large corpus. Answer-aware examples are examples similar to the ground-truth answer; as I read it, this amounts to finding questions related to the answer.
2023-10-11 10:06:34 208 1
[Original] Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks (code reproduction)
Fix: change "extended_attention_mask = extended_attention_mask.to(dtype = next(self.parameters()).dtype)" to "extended_attention_mask = extended_attention_mask.to(dtype = torch.float32)". Fix: change "self.apply(self.init_weights)" to "self.init_weights()".
2023-10-10 20:48:32 105 2
[Original] REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering (paper notes)
(1) Visual features are extracted either from the whole image or in a sliding-window manner for retrieving knowledge, and the important relationships within and among object regions are neglected.
2023-10-08 21:24:53 103 1
[Original] KAT: A Knowledge Augmented Transformer for Vision-and-Language (paper notes)
(1) Knowledge extraction: ① for implicit knowledge, we design new prompts to extract both tentative answers and supporting evidence from a frozen GPT-3 model; [...], Wikidata). Explicit knowledge here means Wikipedia knowledge, and implicit knowledge means commonsense knowledge.
2023-10-08 11:07:37 77 1
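[Original] KAT: A Knowledge Augmented Transformer for Vision-and-Language (code notes)
3. function1: load_okvqa_data loads the OKVQA dataset and produces examples, a list in which each element is one image-text pair stored as a dict with the keys id (image id#question id), question, answers, entities (explicit knowledge, i.e. Wikipedia knowledge), and gpt3 (implicit knowledge). 3. gpt3_okvqa_train2014_answers.pkl: a dict with 9009 key-value pairs, where each key is an image name and each value is a list of tuples, each giving an entity and its knowledge description.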
2023-10-08 11:06:07 71 2
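The data layout described in the KAT code notes above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `make_example`, the field types of `entities` and `gpt3`, and every value below are hypothetical placeholders, not the repository's actual code or OKVQA data. Only the key names and the "image id#question id" format come from the notes.

```python
import pickle


def make_example(image_id, question_id, question, answers, entities, gpt3):
    """Build one image-text pair in the dict layout described in the notes.

    Hypothetical helper: the key names follow the notes, everything else
    (argument types, value shapes) is an assumption for illustration.
    """
    return {
        "id": f"{image_id}#{question_id}",  # image id#question id
        "question": question,
        "answers": answers,
        "entities": entities,  # explicit knowledge
        "gpt3": gpt3,          # implicit knowledge
    }


# Placeholder values, not taken from OKVQA.
examples = [
    make_example(
        image_id="img_0001",
        question_id="q_0001",
        question="What sport is being played?",
        answers=["tennis"],
        entities=[("tennis", "racket sport played on a court")],
        gpt3=[("tennis", "the players are holding rackets")],
    )
]

# A dict shaped like the .pkl file in the notes:
# image name -> list of (entity, knowledge description) tuples.
knowledge = {"img_0001": [("tennis", "racket sport played on a court")]}

# Round-trip through pickle in memory to show the structure survives.
restored = pickle.loads(pickle.dumps(knowledge))
```

Keeping each pair as a flat dict makes it easy to check which knowledge source a field came from when debugging a data loader.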
[Original] Multi-Modal Transformer with Global-Local Alignment for Composed Query Image Retrieval (paper notes)
1. Goal: composed query image retrieval. 2. Existing problem: the architecture discrepancy in feature encoders would restrict the vision-language plenitudinous interaction (differences in encoder structure limit the interaction between vision and language). 1. Main challenges: 2. Proposed method, components: the vision Transformer, the language Transformer, and th
2023-10-05 13:13:38 109
[Original] LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection (paper notes)
(1) Pre-trained language models and/or unstructured text are incomplete and noisy; (2) knowledge graphs (KGs) have intensive structured knowledge. LaKo (Late Knowledge-to-Text Injection): a knowledge-driven VQA method.
2023-09-30 12:18:29 87
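[Original] WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRoundin
Motivation: dense boundary and bounding box annotations (as I read it, "boundary" refers to the video segment that matches the text, and the bounding box annotations are the object boxes on each frame). Hence the proposal: if the decomposed structure of video and language can be captured, spurious connections between video-language components can be avoided. New framework: WINNER for hierarchical video-text understanding.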
2023-09-20 17:59:48 170 1
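[Original] Abstract Meaning Representation
1. Semantic relations and concepts: the goal is to state "who did what to whom", following a predicate-argument structure. At every level, a node follows the pattern "the instance is the main part (for example, a verb), with the subject and object as the initiator and the recipient of the action", which yields a subject-predicate-object triple. 2. Each sentence is represented as a rooted, directed, acyclic graph (expanding from the root node toward the leaf nodes until the sentence is fully covered).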
2023-09-19 10:56:31 339
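The two points in the Abstract Meaning Representation notes above (triples, and a rooted directed acyclic graph) can be made concrete with a small sketch. It uses the standard textbook AMR for "The boy wants to go"; the triple encoding and the helper names `build_graph` and `is_rooted_dag` are assumptions made for illustration, not part of any AMR toolkit.

```python
# AMR for "The boy wants to go", written as (source, relation, target) triples.
# Variables w, b, g are instances of the concepts want-01, boy, go-01.
TRIPLES = [
    ("w", ":instance", "want-01"),
    ("w", ":ARG0", "b"),   # the one who wants
    ("w", ":ARG1", "g"),   # the thing wanted
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("g", ":ARG0", "b"),   # re-entrancy: the boy is also the one who goes
]


def build_graph(triples):
    """Return an adjacency dict over variables, ignoring :instance edges."""
    graph = {}
    for source, relation, target in triples:
        graph.setdefault(source, [])
        if relation != ":instance":
            graph[source].append(target)
            graph.setdefault(target, [])
    return graph


def is_rooted_dag(graph, root):
    """True if every node is reachable from root and no cycle is reachable."""
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node):
        state[node] = in_progress
        for child in graph[node]:
            if state[child] == in_progress:  # back edge, so there is a cycle
                return False
            if state[child] == unvisited and not visit(child):
                return False
        state[node] = done
        return True

    return visit(root) and all(s == done for s in state.values())


graph = build_graph(TRIPLES)
rooted_at_w = is_rooted_dag(graph, "w")
rooted_at_g = is_rooted_dag(graph, "g")
cyclic = is_rooted_dag(build_graph(TRIPLES + [("b", ":ARG0", "w")]), "w")
```

Note that node b has two parents (w and g), which is why the structure is a graph rather than a tree while still being acyclic.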
[Original] Zero-shot Visual Question Answering using Knowledge Graph (paper study)
1. Pipeline approaches with different components for knowledge matching and extraction, feature learning (one model made up of different components); (the answer bias problem).
2023-09-13 18:29:19 115
[Original] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, code study (part 1)
Oscar code study
2023-09-13 17:36:07 194 1
[Original] Declaration-based Prompt Tuning for Visual Question Answering (paper study)
Declaration-based Prompt Tuning for Visual Question Answering (paper study)
2023-09-11 17:17:04 44