Multimodal Self-Supervised Learning: Paper Notes

1. 《Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis》

  1. Multiple tasks: one multimodal task and three unimodal tasks. The multimodal task is trained with supervised learning, while the unimodal tasks are trained with self-supervised learning (see the loss-combination sketch after this list).
  2. Main contributions: the first two both target the unimodal learning tasks. The overall model appears to build on Yu et al. (2020a), which is likewise a multi-task model made up of one multimodal task and several unimodal tasks, but all of its tasks are supervised.
  3. Code is provided.
    Experiments: achieves the best results on three multimodal sentiment analysis datasets.
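
As a rough illustration of the task setup in item 1, here is a minimal sketch (not the authors' Self-MM code) of combining one supervised multimodal loss with three self-supervised unimodal losses. The encoders, heads, loss weight, and the use of L1 losses are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSentimentSketch(nn.Module):
    """One multimodal regression head plus one head per modality (placeholder encoders)."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoders = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ["text", "audio", "vision"]})
        self.fusion_head = nn.Linear(3 * dim, 1)                                # multimodal task
        self.uni_heads = nn.ModuleDict({m: nn.Linear(dim, 1) for m in ["text", "audio", "vision"]})

    def forward(self, feats):
        reps = {m: enc(feats[m]) for m, enc in self.encoders.items()}
        fused = torch.cat([reps["text"], reps["audio"], reps["vision"]], dim=-1)
        return self.fusion_head(fused), {m: head(reps[m]) for m, head in self.uni_heads.items()}

def total_loss(multi_pred, uni_preds, multi_label, uni_pseudo_labels, w=0.1):
    # Multimodal task: supervised by human labels. Unimodal tasks: trained against
    # pseudo-labels the method generates for itself (the self-supervised part).
    loss = F.l1_loss(multi_pred.squeeze(-1), multi_label)
    for m, pred in uni_preds.items():
        loss = loss + w * F.l1_loss(pred.squeeze(-1), uni_pseudo_labels[m])
    return loss
```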

2. 《Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning》
Main idea: the paper shows that the problem of noise estimation in multimodal data can be effectively reduced to a multimodal density estimation task. Based on this efficient noise estimation, it proposes a novel building block for noise-robust multimodal representation learning that can be integrated into many multimodal learning models and immediately improve their performance.
Experiments: VQA and Text-to-Video Retrieval, using two kinds of features (visual and text) learned separately by upstream models. VQA: a concatenation of the question and video is embedded into a single feature vector, alongside an answer embedding; the model is trained with a max-margin ranking loss so that an answer is embedded close to its question+video. Inference is simply a nearest-neighbor search over the set of predetermined answers in the joint video+question and answer space. Text-to-Video Retrieval: video clips are retrieved from a textual description; with a learned joint representation space (i.e., the two kinds of features share the same dimensionality), retrieval is performed by nearest-neighbor search over the joint embedding space.
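
To make the VQA training and inference described above concrete, here is a hedged sketch of a max-margin ranking loss with in-batch negatives plus nearest-neighbor answer selection. The margin value, the use of in-batch negatives, and all function names are my own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def max_margin_ranking_loss(qv_emb, ans_emb, margin=0.2):
    """qv_emb, ans_emb: (B, D); row i of each is a matching (question+video, answer) pair."""
    qv = F.normalize(qv_emb, dim=-1)
    ans = F.normalize(ans_emb, dim=-1)
    sims = qv @ ans.t()                                    # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)                         # similarity of each true pair
    mask = 1.0 - torch.eye(sims.size(0), device=sims.device)
    # Hinge on every mismatched pair in the batch: push positives above negatives by a margin.
    return (F.relu(margin + sims - pos) * mask).sum() / mask.sum()

def answer_by_nearest_neighbor(qv_emb, candidate_answer_embs):
    """Pick, for each question+video embedding, the closest of the predetermined answers."""
    sims = F.normalize(qv_emb, dim=-1) @ F.normalize(candidate_answer_embs, dim=-1).t()
    return sims.argmax(dim=-1)
```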

3. 《Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition》
Task: two already pre-trained self-supervised BERT-like models are used (no upstream task is run by the authors; publicly released pre-trained self-supervised models are used directly) as feature extractors, or with only some layers fine-tuned. The features extracted from the two BERT-like models are then fused (two fusion methods are tried: simple concatenation and co-attention), followed by an FC layer that produces the final output (emotion classification only).
Experiments: multimodal emotion classification on three datasets.
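
A minimal sketch of the simple-concatenation variant described above, assuming each pre-trained BERT-like model yields a pooled feature vector. The dimensions, hidden size, and number of emotion classes are placeholder assumptions, and the co-attention variant is not shown.

```python
import torch
import torch.nn as nn

class ConcatFusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=768, speech_dim=768, hidden=256, num_emotions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + speech_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_emotions),               # emotion classification only
        )

    def forward(self, text_feat, speech_feat):
        # text_feat / speech_feat: pooled outputs of the two frozen (or partially
        # fine-tuned) pre-trained BERT-like models.
        return self.head(torch.cat([text_feat, speech_feat], dim=-1))
```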

4. 《Towards Long-Form Video Understanding》

Goal: "In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets." In other words, the paper tackles long-form video understanding and builds both a framework and evaluation protocols.
Method: object features (or objects together with actions) are fed into the proposed model; the output features, plus a fine-tuned FC layer, are used for each downstream task (a rough sketch of this pipeline is given after the notes below).
Experiments: "Our experiments show that Object Transformers outperform existing state-of-the-art methods on most of the long-form tasks, and significantly outperform the current state-of-the-art on existing datasets, such as AVA 2.2."
Why a new dataset is introduced: "We introduce a new benchmark that contains 9 tasks for evaluating long-form video understanding. The benchmark is constructed on the publicly available MovieClips dataset" (i.e., precisely in order to build a new benchmark).

  1. Experiments 6.1 and 6.2 first pre-train with self-supervision on the MovieClips dataset (75% train split), then fine-tune with supervision for each task (15% validation split), and use the last 15% for testing. Labels are used for every task during fine-tuning.
  2. In experiment 6.3, Object Transformer (masked) and Object Transformer are likewise fine-tuned only on AVA; pre-training is on the MovieClips dataset, the same as in 6.1 and 6.2.
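
A rough, hedged sketch of the pipeline as described in these notes: a transformer over object feature tokens with a small per-task FC head fine-tuned on top. This is not the paper's actual Object Transformer; the layer count, dimensions, and mean-pooling readout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObjectTransformerSketch(nn.Module):
    def __init__(self, feat_dim=256, num_layers=2, num_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.task_head = nn.Linear(feat_dim, num_classes)  # fine-tuned FC, one per downstream task

    def forward(self, object_feats):
        # object_feats: (batch, num_objects, feat_dim) object (and/or action) features
        # collected over the long-form video.
        encoded = self.encoder(object_feats)
        return self.task_head(encoded.mean(dim=1))         # pool over tokens, then classify
```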

5. 《Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows》

Goal: "In this paper, we study the efficacy of learning from Movies and TV Shows as forms of uncurated data for audio-visual self-supervised learning." That is, self-supervised multimodal representations are learned from uncurated movies and TV shows.
Experiments: "To evaluate the efficacy of self-supervised audio-visual representation learning from movies and TV shows, we follow recent works [3,40,32,2] and benchmark UCF101 [43] and HMDB51 [24] for action recognition, along with ESC50 [41] for audio classification." The extracted visual features are used for action classification and the audio features for audio classification.
Training method: the two streams (visual + audio) each pass through their own network (an 18-layer deep R(2+1)D and a ResNet, plus a few fully connected and convolutional layers) and are trained with a contrastive loss (main novelty: a new negative-sampling method; a generic sketch of such a cross-modal contrastive loss follows the notes below).

  1. Upstream task: train and validate on their own collected Movie-TV dataset with a 0.9/0.1 split.
  2. Downstream tasks: fine-tune on UCF101 / HMDB51 / ESC50 (mentioned in the last sentence of the "Notations and Architecture" section) and test there.
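
A generic cross-modal contrastive (InfoNCE-style) loss between visual and audio clip embeddings, as a hedged illustration of the training signal described above. The paper's specific negative-sampling strategy is not reproduced; plain in-batch negatives are used instead, and the temperature is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(visual_emb, audio_emb, temperature=0.07):
    """visual_emb, audio_emb: (B, D); row i of each comes from the same clip."""
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # (B, B) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs lie on the diagonal
    # Symmetric loss: match video->audio and audio->video.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```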

6. 《Self-Supervised Learning by Cross-Modal Audio-Video Clustering》

Upstream experiment: a cross-modal clustering method is designed in which cluster assignments derived from the visual features serve as pseudo-labels to supervise the audio features, and vice versa, so that better audio and visual representations are learned.
Downstream experiments: the learned audio and visual features are used for action recognition and audio classification on different datasets.
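
A hedged, single-round sketch of the cross-modal pseudo-labeling idea as I read it from these notes: cluster one modality's features and use the cluster IDs as classification targets for the other modality. The clustering backend, cluster count, and the externally supplied classifier head are illustrative assumptions, and the paper's full alternating/iterative procedure is not reproduced.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cross_modal_pseudo_label_loss(visual_head, visual_emb, audio_emb, num_clusters=64):
    """visual_emb, audio_emb: (N, D) embeddings of the same N clips (N >= num_clusters).
    visual_head: e.g. an nn.Linear(D, num_clusters) that classifies visual embeddings."""
    # Pseudo-labels for the visual stream come from clustering the audio stream.
    audio_labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(
        audio_emb.detach().cpu().numpy()
    )
    audio_labels = torch.as_tensor(audio_labels, dtype=torch.long, device=visual_emb.device)
    # Train the visual classifier to predict the audio-derived cluster assignments;
    # the symmetric direction (visual clusters supervising audio) would mirror this.
    return F.cross_entropy(visual_head(visual_emb), audio_labels)
```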

7. 《Shot Contrastive Self-Supervised Learning for Scene Boundary Detection》
Upstream: extract visual or audio features and train them with a contrastive loss.
Downstream: use the visual or audio features for scene boundary detection (as a binary classification problem of determining whether a shot boundary is also a scene boundary).
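
A hedged sketch of the downstream binary classifier described above: given embeddings of the shots around a candidate shot boundary (produced by the contrastively pre-trained encoder), predict whether it is also a scene boundary. The window size, hidden size, and flatten-then-MLP design are my own illustrative choices.

```python
import torch
import torch.nn as nn

class SceneBoundaryClassifier(nn.Module):
    def __init__(self, shot_dim=512, window=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * shot_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # logit: is this shot boundary also a scene boundary?
        )

    def forward(self, shot_embs):
        # shot_embs: (batch, window, shot_dim) embeddings of the shots around the boundary.
        return self.net(shot_embs.flatten(1)).squeeze(-1)
```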

8. 《Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning》
Not read carefully; I have not fully understood it yet.

9. 《Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics》
Method: video features (motion and appearance?) are learned through the upstream task, and these two kinds of features are then applied (together?) to the downstream action recognition task (this seems to be the case, but I am not certain).

10. 《Self-Supervised Learning of Audio-Visual Objects from Video》
Method: "first, transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time." A model is trained with self-supervision that can associate the audio with the speaking object in the video, with optical flow additionally used for tracking. The end result is that the (visual and audio) embeddings produced by this model can be used successfully for four downstream tasks (the learned cosine-similarity score of Eq. (1) is reused as well).
Question: couldn't these two embeddings also be learned self-supervised directly through the downstream tasks? The upstream task and the four downstream tasks feel essentially the same; if one can be learned, the others should follow.
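
As a hedged illustration of how a cosine similarity between the audio embedding and per-location visual embeddings could yield a sound-source localization map (my reading of the role of Eq. (1) mentioned above), here is a small sketch; the tensor shapes and the softmax normalization are my own assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sound_source_map(visual_feat_map, audio_emb):
    """visual_feat_map: (B, D, H, W) per-location visual embeddings; audio_emb: (B, D)."""
    v = F.normalize(visual_feat_map, dim=1)
    a = F.normalize(audio_emb, dim=1)
    # Cosine similarity between the audio embedding and every spatial location.
    sim = torch.einsum("bdhw,bd->bhw", v, a)
    # Normalize to a per-frame attention map over spatial positions.
    b, h, w = sim.shape
    return F.softmax(sim.view(b, -1), dim=-1).view(b, h, w)
```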

11. 《Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning》
Upstream: audio-visual synchronization (dataset: AudioSet).
Downstream: sound source localization (dataset: AudioSet) and action recognition (concatenate the visual+audio features output by the upstream model and add two FC layers; datasets: UCF101 and HMDB51).
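
A hedged sketch of an audio-visual synchronization pretext task as I understand the upstream setup above: a binary classifier decides whether a video clip and an audio clip are temporally aligned, with misaligned pairs as negatives. The encoders are assumed external, and the negative-pair construction (rolling the batch) and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncPredictor(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, visual_emb, audio_emb):
        return self.head(torch.cat([visual_emb, audio_emb], dim=-1)).squeeze(-1)

def sync_loss(model, visual_emb, audio_emb):
    """visual_emb, audio_emb: (B, D) with B > 1; row i of each comes from the same aligned clip."""
    pos_logits = model(visual_emb, audio_emb)                   # aligned pairs
    neg_logits = model(visual_emb, audio_emb.roll(1, dims=0))   # misaligned pairs
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(torch.cat([pos_logits, neg_logits]), labels)
```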

12. 《Multi-modal Self-Supervision from Generalized Data Transformations》
Upstream: unsupervised training on the Kinetics-400 dataset (see the linked blog post for details).
Downstream: the downstream tasks are action recognition (UCF101, HMDB51; visual features only?) and audio event classification (ESC-50, DCASE2014; audio features only?).
