Multimodal
[ICML 2021] CLIP: Learning Transferable Visual Models From Natural Language Supervision
Original · posted 2022-02-19 · 3449 reads · 1 comment
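Below is a minimal sketch of the symmetric contrastive objective CLIP is trained with: matching image-text pairs in a batch are pulled together and all other pairings pushed apart. The encoder outputs are stubbed with random tensors, and the temperature is fixed here rather than learned as in the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (N, D) outputs of the two encoders.
    Matching pairs share the same row index; all other rows act as negatives.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix scaled by a (fixed here, learned in the paper) temperature
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# toy usage with random features standing in for encoder outputs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```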
Follow-up Work on CLIP
Various models that build on ideas from CLIP. Original · posted 2022-09-22 · 2991 reads · 0 comments
X-VLM: Multi-Grained Vision Language Pre-Training
X-VLM, a multimodal model proposed by ByteDance. Original · posted 2022-09-10 · 1522 reads · 0 comments
SOHO: Seeing Out of tHe bOx
Seeing Out of the Box: End-to-End Pre-training for Vision-Language Representation Learning. Original · posted 2022-06-01 · 472 reads · 2 comments
KD-VLP: Knowledge Distillation Vision-and-Language Pretraining
Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multimodal features from large-scale image-text data … Original · posted 2022-05-31 · 451 reads · 0 comments
UNITER: UNiversal Image-TExt Representation Learning
UNITER: a UNiversal Image-TExt Representation that can power heterogeneous downstream V+L tasks with joint multimodal embeddings. Original · posted 2021-12-11 · 2178 reads · 0 comments
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
VL-BERT is essentially a multimodal BERT with visual inputs added; it fuses and aligns visual and linguistic information efficiently, and can serve as a pre-trained model for most vision-language downstream tasks. Original · posted 2021-12-09 · 5787 reads · 0 comments
VILLA: Large-Scale Adversarial Training for Vision-and-Language Representation Learning
VILLA is the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. It consists of two training stages: (i) task-agnostic adversarial pre-training, followed by (ii) task-specific adversarial fine-tuning. Original · posted 2021-12-16 · 1668 reads · 0 comments
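As a rough illustration of adversarial training in embedding space, here is a simplified single-step (FGM-style) sketch. VILLA's actual recipe uses the "free" large-batch adversarial training algorithm plus a KL-divergence consistency term, so the function below is only an approximation under stated assumptions; `model`, `epsilon`, and the toy classifier are placeholders.

```python
import torch
import torch.nn.functional as F

def adversarial_embedding_loss(model, embeddings, labels, epsilon=1e-3):
    """One simplified adversarial step in embedding space (FGM-style).

    `model` is assumed to map embeddings -> logits; the clean loss and the
    loss on perturbed embeddings are summed as the training objective.
    """
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(embeddings), labels)

    # gradient of the loss w.r.t. the (word or region) embeddings
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

    # loss on the perturbed embeddings
    adv_loss = F.cross_entropy(model(embeddings + delta), labels)
    return clean_loss + adv_loss

# toy usage: a linear "model" over flattened token embeddings
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16 * 32, 3))
emb = torch.randn(4, 16, 32)            # (batch, tokens, dim)
labels = torch.randint(0, 3, (4,))
print(adversarial_embedding_loss(model, emb, labels))
```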
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Scene graphs contain structured knowledge of visual scenes, including the present objects, their attributes, and the relationships between them. Original · posted 2021-12-17 · 629 reads · 0 comments
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
LXMERT: Learning Cross-Modality Encoder Representations from Transformers. The input embedding layers convert the sentence and the image into word-level sentence embeddings and object-level image embeddings, respectively. Original · posted 2022-01-03 · 1142 reads · 0 comments
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
ViLBERT (Vision-and-Language BERT) extends BERT to jointly represent images and text. It adopts a two-stream architecture: two parallel BERT-style streams, one per modality, that interact through co-attentional transformer layers. Original · posted 2022-01-02 · 976 reads · 0 comments
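The co-attention idea can be sketched roughly as follows: queries come from one stream while keys and values come from the other. Dimensions, head counts, and the class name below are placeholders, and the real co-TRM block also includes per-stream self-attention and feed-forward sublayers.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attentional exchange: each stream queries the other stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # queries come from one modality, keys/values from the other
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return txt + txt_out, img + img_out   # residual connections

# toy usage: 12 text tokens and 36 image regions in a batch of 2
block = CoAttentionBlock()
txt, img = block(torch.randn(2, 12, 256), torch.randn(2, 36, 256))
```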
VisualBERT: A Simple and Performant Baseline for Vision and Language
The network architecture is the same as BERT, and the text side is processed exactly as in BERT; what differs is how the visual input is handled. Let F be the set of visual embeddings, and let f ∈ F be the feature for one bounding region; f is the sum of three embeddings: (1) f_o, the visual feature representation of that bounding region … Original · posted 2022-01-01 · 1066 reads · 0 comments
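A minimal sketch of that sum of three embeddings, assuming (per the VisualBERT paper) that the other two components are a segment embedding marking the token as visual and a position embedding. The dimensions, the region-order positions, and the class name are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Build one visual token f as the sum of three embeddings.

    f_o: projected detector feature of the bounding region;
    f_s: segment embedding marking the token as visual (id 1 here, 0 for text);
    f_p: position embedding (indexed by region order in this sketch).
    """
    def __init__(self, region_feat_dim=2048, hidden_dim=768, max_regions=100):
        super().__init__()
        self.visual_proj = nn.Linear(region_feat_dim, hidden_dim)   # f_o
        self.segment = nn.Embedding(2, hidden_dim)                  # f_s
        self.position = nn.Embedding(max_regions, hidden_dim)       # f_p

    def forward(self, region_feats):                 # (batch, n_regions, 2048)
        n = region_feats.size(1)
        f_o = self.visual_proj(region_feats)
        f_s = self.segment(torch.ones(n, dtype=torch.long, device=region_feats.device))
        f_p = self.position(torch.arange(n, device=region_feats.device))
        return f_o + f_s + f_p                       # broadcast over the batch

# toy usage: 36 detected regions for a batch of 2 images
emb = VisualEmbedding()(torch.randn(2, 36, 2048))    # -> (2, 36, 768)
```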
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
OSCAR (Object-SemantiCs Aligned pRe-training) represents each input image-text pair as a Word-Tag-Image triple … Original · posted 2022-01-01 · 810 reads · 0 comments
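A minimal sketch of how such a Word-Tag-Image triple could be assembled into a single transformer input sequence: object tags are embedded with the same word embedding table as the caption, and region features are linearly projected to the hidden size. Vocabulary size, dimensions, and the class name are placeholders, and the real model also adds position and segment embeddings.

```python
import torch
import torch.nn as nn

class OscarInput(nn.Module):
    """Assemble a Word-Tag-Image triple into one input sequence."""
    def __init__(self, vocab_size=30522, hidden_dim=768, region_feat_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.region_proj = nn.Linear(region_feat_dim, hidden_dim)

    def forward(self, word_ids, tag_ids, region_feats):
        w = self.word_emb(word_ids)           # caption tokens
        q = self.word_emb(tag_ids)            # detected object tags, treated as words
        v = self.region_proj(region_feats)    # region visual features
        return torch.cat([w, q, v], dim=1)    # one sequence for the transformer

# toy usage: 16 caption tokens, 5 tags, 36 regions, batch of 2
seq = OscarInput()(
    torch.randint(0, 30522, (2, 16)),
    torch.randint(0, 30522, (2, 5)),
    torch.randn(2, 36, 2048),
)   # -> (2, 57, 768)
```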