Multimodal Machine Learning: A Survey and Taxonomy

Abstract
Aim: build models that can process and relate information from multiple modalities.
New taxonomy: representation, translation, alignment, fusion, and co-learning.
INTRODUCTION
Three modalities:
(NLP) natural language, which can be written or spoken;
(CV) visual signals, often represented as images or videos;
(SR) vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.
challenges:
representation: exploit the complementarity and redundancy of multiple modalities
translation: map data from one modality to another (often open-ended or subjective)
alignment: measure similarity between different modalities and deal with possible long-range dependencies and ambiguities
fusion: join information from two or more modalities to perform a prediction
co-learning: transfer knowledge between modalities, e.g. co-training, conceptual grounding, and zero-shot learning
Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.

2 APPLICATIONS: A HISTORICAL PERSPECTIVE (omitted)
3 MULTIMODAL REPRESENTATIONS
Representing multiple modalities poses many difficulties:
how to combine the data from heterogeneous sources;
how to deal with different levels of noise;
and how to deal with missing data.
A number of properties of good representations:
smoothness (analogous to smoothing in NLP), temporal and spatial coherence, sparsity, and natural clustering, among others.
Additional desirable properties for multimodal representations:
similarity in the representation space should reflect the similarity of the corresponding concepts;
the representation should be easy to obtain even in the absence of some modalities;
it should be possible to fill in missing modalities given the observed ones.
3.1 Joint Representations (project unimodal representations together into a multimodal space)
Mathematically, the joint representation is expressed as:
x_m = f(x_1, ..., x_n)        (1)
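A minimal sketch of Equation (1), assuming the joint map f is simple concatenation of unimodal features followed by a projection into the joint space (the random weights here are illustrative stand-ins for a trained layer; in practice f is a deep network):

```python
import random

random.seed(0)

def linear(vec, weights):
    """Multiply a vector by a weight matrix (list of rows)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def joint_representation(x_modalities, dim_out=3):
    """x_m = f(x_1, ..., x_n): concatenate unimodal features, then project."""
    concat = [v for x in x_modalities for v in x]  # combine heterogeneous sources
    # Illustrative random projection standing in for a trained joint layer.
    W = [[random.uniform(-1, 1) for _ in concat] for _ in range(dim_out)]
    return linear(concat, W)

x_language = [0.2, 0.7]       # e.g. a text embedding
x_vision = [0.9, 0.1, 0.5]    # e.g. image features
x_m = joint_representation([x_language, x_vision])
print(len(x_m))  # 3: the joint vector lives in a shared multimodal space
```

The hypothetical `joint_representation` accepts any number of modalities, mirroring the n-ary form of Equation (1).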
Neural networks
To construct a multimodal representation using neural networks, each modality starts with several individual neural layers, followed by a hidden layer that projects the modalities into a joint space. The joint multimodal representation is then itself passed through multiple hidden layers or used directly for prediction. (The whole model is usually trained end to end.)
advantages:
neural-network-based joint representations often achieve superior performance, and the representations can be pre-trained in an unsupervised manner. The performance gain is, however, dependent on the amount of data available for training.
disadvantages:
the model is not able to handle missing data naturally; deep networks are often difficult to train.
Probabilistic graphical models
advantages:
their generative nature allows an easy way to deal with missing data (even if a whole modality is missing, the model has a natural way to cope);
they can generate samples of one modality in the presence of the other one, or of both modalities from the representation;
the representation can be trained in an unsupervised manner, enabling the use of unlabeled data.
disadvantages:
the difficulty of training them (high computational cost, and the need to use approximate variational training methods).
Sequential Representation (varying-length sequences) (omitted)
3.2 Coordinated Representations (learn separate representations for each modality but coordinate them through a constraint)
The coordinated representation, in contrast, is expressed as:
f(x_1) ∼ g(x_2)        (2)
where each modality has its own projection function (f and g) and ∼ denotes the coordination constraint, e.g. a similarity or structure constraint between the two projected representations.
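A toy sketch of a coordinated space: two separate, modality-specific maps f and g (hypothetical fixed functions here; in practice each is a learned network) project each modality, and a similarity model would be trained so that matched cross-modal pairs end up close in that space:

```python
import math

def cosine(u, v):
    """Cosine similarity in the coordinated space."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# f and g: separate maps per modality (illustrative, not learned).
def f_text(x):
    return [2 * x[0], x[1]]

def f_image(x):
    return [x[0] + x[1], x[2]]

text_vec = f_text([0.5, 1.0])         # -> [1.0, 1.0]
image_vec = f_image([0.4, 0.6, 1.0])  # -> [1.0, 1.0]

# Training would push matched pairs toward high similarity.
print(round(cosine(text_vec, image_vec), 3))  # close to 1.0 for a matched pair
```

Note that, unlike a joint representation, `text_vec` and `image_vec` remain separate vectors; only the similarity constraint relates them.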
Similarity models (minimize the distance between modalities in the coordinated space)
Structured coordinated space (structured coordinated spaces are commonly used in cross-modal hashing)
Hashing enforces certain constraints on the resulting multimodal space:
1) it has to be an N-dimensional Hamming space (a binary representation with a controllable number of bits);
2) the same object from different modalities has to have a similar hash code;
3) the space has to be similarity-preserving.
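Constraints 2) and 3) can be made concrete with Hamming distance over binary codes; a minimal sketch with hypothetical 8-bit hash codes:

```python
def hamming(code_a, code_b):
    """Number of differing bits between two binary hash codes."""
    return sum(a != b for a, b in zip(code_a, code_b))

# Hypothetical codes: the same object seen in two modalities should hash close
# (constraint 2), while dissimilar objects should hash far apart (constraint 3).
image_code = [1, 0, 1, 1, 0, 0, 1, 0]
text_code  = [1, 0, 1, 0, 0, 0, 1, 0]   # same object, different modality
other_code = [0, 1, 0, 0, 1, 1, 0, 1]   # unrelated object

print(hamming(image_code, text_code))   # 1: small cross-modal distance
print(hamming(image_code, other_code))  # 8: large distance to unrelated object
```

Because the space is binary (constraint 1), such comparisons reduce to fast bit operations, which is the point of cross-modal hashing for retrieval.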
4 TRANSLATION
Given an entity in one modality, the task is to generate the same entity in a different modality.
4.1 Example-based
Retrieval-based models (directly use the retrieved translation without modifying it)
Combination-based models (rely on more complex rules to create translations based on a number of retrieved instances)
4.2 Generative approaches
Grammar-based models
Encoder-decoder models
Continuous generation models
5 ALIGNMENT
5.1 Explicit alignment
Unsupervised
Supervised
5.2 Implicit alignment
Graphical models
Neural networks
6 FUSION
6.1 Model-agnostic approaches
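Model-agnostic fusion is commonly split into early (feature-level) and late (decision-level) fusion; a minimal sketch with hypothetical unimodal features and prediction scores:

```python
def early_fusion(features_a, features_b):
    """Feature-level: concatenate unimodal features before a single predictor."""
    return features_a + features_b

def late_fusion(pred_a, pred_b, w=0.5):
    """Decision-level: combine independent unimodal predictions, e.g. a
    weighted average of per-modality scores."""
    return w * pred_a + (1 - w) * pred_b

audio_feat = [0.3, 0.8]
video_feat = [0.1, 0.9, 0.4]
print(early_fusion(audio_feat, video_feat))   # [0.3, 0.8, 0.1, 0.9, 0.4]
print(round(late_fusion(0.7, 0.9), 2))        # 0.8
```

Early fusion lets one model exploit cross-modal interactions from the start; late fusion keeps per-modality models independent, which degrades more gracefully when a modality is missing.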
6.2 Model-based approaches
Multiple kernel learning (MKL)
Graphical models
Neural Networks
7 CO-LEARNING
7.1 Parallel data        
Co-training
Transfer learning
7.2 Non-parallel data
Transfer learning
Conceptual grounding
Zero-shot learning
7.3 Hybrid data
Note: the uploader does not hold the original copyright of these notes; all copyrights belong to CMU. This set of notes is from CMU course 11-777, Multimodal Machine Learning, offered every Fall semester; this version is from Fall 2019.
Course description: Multimodal machine learning (MMML) is a vibrant multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and more recently with language vision projects such as image and video captioning, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. The course will present the fundamental mathematical concepts in machine learning and deep learning relevant to the five main challenges in multimodal machine learning: (1) multimodal representation learning, (2) translation mapping, (3) modality alignment, (4) multimodal fusion and (5) co-learning. These include, but are not limited to, multimodal auto-encoders, deep canonical correlation analysis, multi-kernel learning, attention models and multimodal recurrent neural networks. We will also review recent papers describing state-of-the-art probabilistic models and computational algorithms for MMML and discuss the current and upcoming challenges. The course will discuss many of the recent applications of MMML including multimodal affect recognition, image and video captioning and cross-modal multimedia retrieval. This is a graduate course designed primarily for PhD and research master students at LTI, MLD, CSD, HCII and RI; others, for example (undergraduate) students of CS or from professional master programs, are advised to seek prior permission of the instructor. It is required for students to have taken an introductory machine learning course such as 10-401, 10-601, 10-701, 11-663, 11-441, 11-641 or 11-741. Prior knowledge of deep learning is recommended.
Deep multimodal learning is a research approach that fuses and learns from information in multiple modalities (e.g. speech, images, and video), and in recent years it has seen many important advances and trends.
In multimodal learning, deep neural networks play a central role in feature extraction and modality fusion. Through the hierarchical processing of deep networks, high-level semantic features can be effectively extracted from raw modality data. The fusion of multimodal data has also become a research hotspot: the correlations between different modalities can be learned and exploited by deep multimodal networks, improving model performance.
Deep multimodal learning has produced a series of important results across fields. In natural language processing, tasks such as multimodal question answering, image captioning, and visual question answering have been widely studied. In computer vision, fusing information from multiple modalities, such as images and speech, enables more accurate object recognition and behavior analysis. In speech recognition and synthesis, multimodal learning has likewise been used to improve performance.
Several trends are worth noting. First, applications of multimodal learning keep expanding into new areas such as healthcare, robotics, and intelligent transportation. Second, combining deep multimodal learning with other deep learning techniques is widely studied as a way to improve model performance and generalization. Finally, the demands of deep multimodal learning for large-scale data and computational resources also deserve attention.
In short, deep multimodal learning is a research direction full of potential and challenges; as the technology develops and application demands grow, it is likely to play an increasingly important role.