0. Multimodal Transformers
0.1 《Multimodal Transformer for Unaligned Multimodal Language Sequences》
0.2 《Low Rank Fusion based Transformers for Multimodal Sequences》
0.3 《Perceiver: General Perception with Iterative Attention》
0.4 《Integrating Multimodal Information in Large Pretrained Transformers》
1. Multimodal BERT
1.1 How to do multimodal tasks well with BERT
1.2 BERT-based multimodal applications: how images and video are processed with BERT: link
1.3 BERT cross-modal pre-training: link
2. AAAI 2021 | A review of the latest advances in multimodal learning: link
3. Cross-Modal Retrieval
3.1 《Temporal Context Aggregation for Video Retrieval with Contrastive Learning》
3.2 《Adversarial Cross-Modal Retrieval》
3.3 《T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval》
3.4 《Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval》
Key concepts:
3.4.1 Residual networks
3.4.2 Transformer
3.4.3 BERT
3.4.4 VideoBERT, ActBERT
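As a refresher on the residual-network concept listed above, here is a minimal NumPy sketch of a residual (skip) connection; the layer shapes and weight names are illustrative, not taken from any of the papers in this list:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)): the identity skip path lets gradients
    flow around the transform F, easing optimization of deep nets."""
    h = relu(x @ w1)          # inner transform F's first layer
    return relu(x + h @ w2)   # add the input back before the final activation

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # batch of 4 feature vectors
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 8)
```

With zero weights the block reduces to `relu(x)`, which makes the identity path easy to verify.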
3.5. 《Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval》
Key concept: an introduction to deep hashing algorithms, a workhorse of large-scale image retrieval
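To illustrate the deep-hashing idea noted above: a network maps features to a low-dimensional code that is binarized (e.g. by sign), so retrieval reduces to cheap Hamming-distance comparisons. The projection matrix below is a random stand-in for a learned hash layer; this is a generic sketch, not the SSAH architecture:

```python
import numpy as np

def hash_codes(features, proj):
    """Project continuous features to k dimensions, then threshold at 0
    to get compact k-bit binary codes."""
    return (features @ proj > 0).astype(np.uint8)

def hamming_retrieve(query_code, db_codes):
    """Rank database items by Hamming distance (count of differing bits)."""
    dists = (query_code != db_codes).sum(axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
proj = rng.normal(size=(128, 32))         # stand-in for a learned hash layer
db = rng.normal(size=(1000, 128))         # 1000 database feature vectors
db_codes = hash_codes(db, proj)
query_code = hash_codes(db[42:43], proj)  # query with item 42's own features
ranking = hamming_retrieve(query_code[0], db_codes)
print(ranking[0])  # nearest item: Hamming distance 0 to the query
```

Because codes are bit vectors, the distance computation vectorizes well and the index fits in memory even for millions of items.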
4. Multimodal Fusion
4.1 《Use What You Have: Video Retrieval Using Representations From Collaborative Experts》 (code: collaborative-experts-master)
4.2 《Learning a Text-Video Embedding from Incomplete and Heterogeneous Data》 (code: Mixture-of-Embedding-Experts-master)
4.3 《Multi-modal Transformer for Video Retrieval》 (code: mmt)
5. Video-Text Matching
5.1 《Learning Spatiotemporal Features via Video and Text Pair Discrimination》
Key concepts:
5.1.1 Contrastive learning
5.1.2 Curriculum learning
5.1.3 Noise Contrastive Estimation (NCE)
5.1.4 BERT
5.1.5 SlowFast Networks for Video Recognition
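The contrastive-learning and NCE concepts above can be sketched with an InfoNCE-style loss over paired video/text embeddings: each matching (video, text) pair is a positive, and the other texts in the batch act as negatives. The embeddings and temperature here are made up for illustration; this is not the exact loss used in the paper:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """InfoNCE loss: cross-entropy that pushes each video toward its own
    text (the diagonal) and away from the other texts in the batch."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature              # cosine-similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()         # diagonal = matching pairs

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 64))
text_matched = video + 0.01 * rng.normal(size=(8, 64))  # near-identical pairs
text_random = rng.normal(size=(8, 64))                  # unrelated pairs
print(info_nce(video, text_matched) < info_nce(video, text_random))  # True
```

The temperature controls how sharply the softmax concentrates on the hardest negatives; values around 0.05 to 0.1 are common in contrastive setups.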