Paper: "Multimodal Machine Learning: A Survey and Taxonomy" (translation and commentary)




"Multimodal Machine Learning: A Survey and Taxonomy" (translation and commentary)

2 Applications: a historical perspective

3 Multimodal Representations

3.1 Joint Representations

3.2 Coordinated Representations

3.3 Discussion

4 Translation

4.1 Example-based

4.2 Generative approaches

4.3 Model evaluation and discussion

5 Alignment

5.1 Explicit alignment

5.2 Implicit alignment

5.3 Discussion

6 Fusion

6.1 Model-agnostic approaches

6.2 Model-based approaches

6.3 Discussion

7 Co-learning

7.1 Parallel data

7.2 Non-parallel data

7.3 Hybrid data

7.4 Discussion

8 Conclusion

"Multimodal Machine Learning: A Survey and Taxonomy" (translation and commentary)

Authors: Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency


Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.


Index Terms—Multimodal, machine learning, introductory, survey.



The world surrounding us involves multiple modalities: we see objects, hear sounds, feel texture, smell odors, and so on. In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities which represent our primary channels of communication and sensation, such as vision or touch. A research problem or dataset is therefore characterized as multimodal when it includes multiple such modalities. In this paper we focus primarily, but not exclusively, on three modalities: natural language, which can be either written or spoken; visual signals, which are often represented with images or videos; and vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.

In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multimodal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.



The research field of Multimodal Machine Learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In this paper we identify and explore five core technical challenges (and related sub-challenges) surrounding multimodal machine learning. They are central to the multimodal setting and need to be tackled in order to progress the field. Our taxonomy goes beyond the typical early and late fusion split, and consists of the five following challenges:

1) Representation. A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals.

2) Translation. A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image, and one perfect translation may not exist.

3) Alignment. A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.

4) Fusion. A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities.

5) Co-learning. A fifth challenge is to transfer knowledge between modalities, their representation, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data).







Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.


1. Speech recognition and synthesis: audio-visual speech recognition, (visual) speech synthesis

2. Event detection: action classification, multimedia event detection

3. Emotion and affect: recognition, synthesis

4. Media description: image description, video description, visual question-answering, media summarization

5. Multimedia retrieval: cross-modal retrieval, cross-modal hashing








For each of these five challenges, we define taxonomic classes and sub-classes to help structure the recent work in this emerging research field of multimodal machine learning. We start with a discussion of the main applications of multimodal machine learning (Section 2), followed by a discussion of the recent developments on all five core technical challenges facing multimodal machine learning: representation (Section 3), translation (Section 4), alignment (Section 5), fusion (Section 6), and co-learning (Section 7). We conclude with a discussion in Section 8.


2 Applications: a historical perspective

Multimodal machine learning enables a wide range of applications: from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to a recently renewed interest in language and vision applications.

One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [243]. It was motivated by the McGurk effect [138], an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [95], it is no surprise that many of the early models for AVSR were based on various HMM extensions [24], [25]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [151].

While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [75], [151], [243]. In other words, the captured interactions between modalities were supplementary rather than complementary. The same information was captured in both, improving the robustness of the multimodal models but not improving the speech recognition performance in noiseless scenarios.




A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval [11], [188]. With the advance of personal computers and the internet, the quantity of digitized multimedia content has increased dramatically [2]. While earlier approaches for indexing and searching these multimedia videos were keyword-based [188], new research problems emerged when trying to search the visual and multimodal content directly. This led to new research topics in multimedia content analysis such as automatic shot-boundary detection [123] and video summarization [53]. These research projects were supported by the TRECVID initiative from the National Institute of Standards and Technology, which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011 [1].


A third category of applications was established in the early 2000s around the emerging field of multimodal interaction, with the goal of understanding human multimodal behaviors during social interactions. One of the first landmark datasets collected in this field is the AMI Meeting Corpus, which contains more than 100 hours of video recordings of meetings, all fully transcribed and annotated [33]. Another important dataset is the SEMAINE corpus, which made it possible to study interpersonal dynamics between speakers and listeners [139]. This dataset formed the basis of the first audio-visual emotion challenge (AVEC), organized in 2011 [179]. The fields of emotion recognition and affective computing bloomed in the early 2010s thanks to strong technical advances in automatic face detection, facial landmark detection, and facial expression recognition [46]. The AVEC challenge continued annually afterward, with the later instantiations including healthcare applications such as automatic assessment of depression and anxiety [208]. A great summary of recent progress in multimodal affect recognition was published by D'Mello et al. [50]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition shows improvement when using more than one modality, but this improvement is reduced when recognizing naturally-occurring emotions.


Most recently, a new category of multimodal applications emerged with an emphasis on language and vision: media description. One of the most representative applications is image captioning, where the task is to generate a text description of the input image [83]. This is motivated by the ability of such systems to help the visually impaired in their daily tasks [20]. The main challenge in media description is evaluation: how to evaluate the quality of the predicted descriptions. The task of visual question-answering (VQA), where the goal is to answer a specific question about the image, was recently proposed to address some of the evaluation challenges [9].

In order to bring some of the mentioned applications to the real world we need to address a number of technical challenges facing multimodal machine learning. We summarize the relevant technical challenges for the above mentioned application areas in Table 1. One of the most important challenges is multimodal representation, the focus of our next section.



3 Multimodal Representations

Representing raw data in a format that a computational model can work with has always been a big challenge in machine learning. Following the work of Bengio et al. [18] we use the term feature and representation interchangeably, with each referring to a vector or tensor representation of an entity, be it an image, audio sample, individual word, or a sentence. A multimodal representation is a representation of data using information from multiple such entities. Representing multiple modalities poses many difficulties: how to combine the data from heterogeneous sources; how to deal with different levels of noise; and how to deal with missing data. The ability to represent data in a meaningful way is crucial to multimodal problems, and forms the backbone of any model.

Good representations are important for the performance of machine learning models, as evidenced by the recent leaps in performance of speech recognition [79] and visual object classification [109] systems. Bengio et al. [18] identify a number of properties for good representations: smoothness, temporal and spatial coherence, sparsity, and natural clustering amongst others. Srivastava and Salakhutdinov [198] identify additional desirable properties for multimodal representations: similarity in the representation space should reflect the similarity of the corresponding concepts, the representation should be easy to obtain even in the absence of some modalities, and finally, it should be possible to fill in missing modalities given the observed ones.



The development of unimodal representations has been extensively studied [5], [18], [122]. In the past decade there has been a shift from representations hand-designed for specific applications to data-driven ones. For example, one of the most famous image descriptors in the early 2000s, the scale invariant feature transform (SIFT), was hand designed [127], but currently most visual descriptions are learned from data using neural architectures such as convolutional neural networks (CNN) [109]. Similarly, in the audio domain, acoustic features such as Mel-frequency cepstral coefficients (MFCC) have been superseded by data-driven deep neural networks in speech recognition [79] and recurrent neural networks for para-linguistic analysis [207]. In natural language processing, the textual features initially relied on counting word occurrences in documents, but have been replaced by data-driven word embeddings that exploit the word context [141]. While there has been a huge amount of work on unimodal representation, up until recently most multimodal representations involved simple concatenation of unimodal ones [50], but this has been rapidly changing.


To help understand the breadth of work, we propose two categories of multimodal representation: joint and coordinated. Joint representations combine the unimodal signals into the same representation space, while coordinated representations process unimodal signals separately, but enforce certain similarity constraints on them to bring them to what we term a coordinated space. An illustration of different multimodal representation types can be seen in Figure 1.

Mathematically, the joint representation is expressed as:

$$\mathbf{x}_m = f(\mathbf{x}_1, \ldots, \mathbf{x}_n), \tag{1}$$

where the multimodal representation $\mathbf{x}_m$ is computed using a function $f$ (e.g., a deep neural network, restricted Boltzmann machine, or a recurrent neural network) that relies on the unimodal representations $\mathbf{x}_1, \ldots, \mathbf{x}_n$. The coordinated representation is expressed as:

$$f(\mathbf{x}_1) \sim g(\mathbf{x}_2), \tag{2}$$

where each modality has a corresponding projection function ($f$ and $g$ above) that maps it into a coordinated multimodal space. While the projection into the multimodal space is independent for each modality, the resulting space is coordinated between them (indicated as $\sim$). Examples of such coordination include minimizing cosine distance [61], maximizing correlation [7], and enforcing a partial order [212] between the resulting spaces.
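The two definitions above can be contrasted in a few lines of code. The following is a minimal sketch, not the authors' implementation: the feature values, the weight matrices `W_joint`, `F`, and `G`, and the choice of a single tanh layer for f are all hypothetical and chosen only for illustration.

```python
import math

def joint_representation(x1, x2, weights, bias):
    """x_m = f(x_1, x_2): one tanh layer over the concatenated unimodal features."""
    x = x1 + x2  # list concatenation of the two feature vectors
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def project(x, matrix):
    """Linear projection of one modality into the coordinated space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy unimodal features (hypothetical values): a 3-d "image" and a 2-d "text".
x_image = [0.2, -0.1, 0.7]
x_text = [1.0, 0.5]

# Joint: a single function sees both modalities at once.
W_joint = [[0.1, 0.2, 0.3, 0.4, 0.5],
           [-0.5, 0.4, -0.3, 0.2, -0.1]]
x_m = joint_representation(x_image, x_text, W_joint, bias=[0.0, 0.1])

# Coordinated: f and g are separate projections, and a constraint
# (here cosine distance) measures how well f(x_image) and g(x_text) agree.
F = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
G = [[0.5, 0.1], [0.2, -0.3]]
distance = cosine_distance(project(x_image, F), project(x_text, G))
print(len(x_m), round(distance, 3))
```

In the joint case neither modality can be dropped without changing the input to f, whereas in the coordinated case each projection can be evaluated on its own.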





3.1 Joint Representations

We start our discussion with joint representations that project unimodal representations together into a multimodal space (Equation 1). Joint representations are mostly (but not exclusively) used in tasks where multimodal data is present both during training and inference steps. The simplest example of a joint representation is a concatenation of individual modality features (also referred to as early fusion [50]). In this section we discuss more advanced methods for creating joint representations starting with neural networks, followed by graphical models and recurrent neural networks (representative works can be seen in Table 2).

Neural networks have become a very popular method for unimodal data representation [18]. They are used to represent visual, acoustic, and textual data, and are increasingly used in the multimodal domain [151], [156], [217]. In this section we describe how neural networks can be used to construct a joint multimodal representation, how to train them, and what advantages they offer.


In general, neural networks are made up of successive building blocks of inner products followed by non-linear activation functions. In order to use a neural network as a way to represent data, it is first trained to perform a specific task (e.g., recognizing objects in images). Due to the multilayer nature of deep neural networks each successive layer is hypothesized to represent the data in a more abstract way [18], hence it is common to use the final or penultimate neural layers as a form of data representation. To construct a multimodal representation using neural networks each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space [9], [145], [156], [227]. The joint multimodal representation is then passed through multiple hidden layers itself or used directly for prediction. Such models can be trained end-to-end, learning both to represent the data and to perform a particular task. This results in a close relationship between multimodal representation learning and multimodal fusion when using neural networks.


 Figure 1: Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input. Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g. Euclidean distance) or structure constraint (e.g. partial order).


As neural networks require a lot of labeled training data, it is common to pre-train such representations using an autoencoder on unsupervised data [80]. The model proposed by Ngiam et al. [151] extended the idea of using autoencoders to the multimodal domain. They used stacked denoising autoencoders to represent each modality individually and then fused them into a multimodal representation using another autoencoder layer. Similarly, Silberer and Lapata [184] proposed to use a multimodal autoencoder for the task of semantic concept grounding (see Section 7.2). In addition to using a reconstruction loss to train the representation they introduce a term into the loss function that uses the representation to predict object labels. It is also common to fine-tune the resulting representation on a particular task at hand as the representation constructed using an autoencoder is generic and not necessarily optimal for a specific task [217].
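The combination of a reconstruction loss with a label-prediction term can be sketched as a single forward pass. This is a simplified illustration under stated assumptions, not the architecture of [151] or [184]: it uses one shared encoding layer instead of stacked denoising autoencoders, and the dimensions, the random initialisation, the parameter names, and the `label_weight` value are hypothetical.

```python
import math
import random

def linear(x, W, b):
    """Affine map: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def autoencoder_loss(x_a, x_b, label, p, label_weight=0.5):
    """Reconstruction loss for both modalities plus a weighted label term."""
    code = sigmoid(linear(x_a + x_b, p["enc_W"], p["enc_b"]))  # shared code
    rec_a = linear(code, p["dec_a_W"], p["dec_a_b"])           # rebuild modality A
    rec_b = linear(code, p["dec_b_W"], p["dec_b_b"])           # rebuild modality B
    reconstruction = (sum((r - x) ** 2 for r, x in zip(rec_a, x_a))
                      + sum((r - x) ** 2 for r, x in zip(rec_b, x_b)))
    probs = softmax(linear(code, p["cls_W"], p["cls_b"]))
    classification = -math.log(probs[label])                   # cross-entropy
    return reconstruction + label_weight * classification, code

rng = random.Random(0)
dim_a, dim_b, dim_code, n_classes = 3, 2, 4, 3
params = {
    "enc_W": init(dim_code, dim_a + dim_b, rng), "enc_b": [0.0] * dim_code,
    "dec_a_W": init(dim_a, dim_code, rng), "dec_a_b": [0.0] * dim_a,
    "dec_b_W": init(dim_b, dim_code, rng), "dec_b_b": [0.0] * dim_b,
    "cls_W": init(n_classes, dim_code, rng), "cls_b": [0.0] * n_classes,
}
loss, code = autoencoder_loss([0.2, -0.1, 0.7], [1.0, 0.5], label=1, p=params)
print(round(loss, 3), len(code))
```

Setting `label_weight` to zero recovers a purely unsupervised objective, which is the setting used for pre-training; the label term corresponds to the supervised signal added on top.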

The major advantage of neural network based joint representations comes from their often superior performance and the ability to pre-train the representations in an unsupervised manner. The performance gain is, however, dependent on the amount of data available for training. One of the disadvantages comes from the model not being able to handle missing data naturally, although there are ways to alleviate this issue [151], [217]. Finally, deep networks are often difficult to train [69], but the field is making progress in better training techniques [196].

Probabilistic graphical models are another popular way to construct representations through the use of latent random variables [18]. In this section we describe how probabilistic graphical models are used to represent unimodal and multimodal data.




The most popular approaches for graphical-model based representation are deep Boltzmann machines (DBM) [176], which stack restricted Boltzmann machines (RBM) [81] as building blocks. Similar to neural networks, each successive layer of a DBM is expected to represent the data at a higher level of abstraction. The appeal of DBMs comes from the fact that they do not need supervised data for training [176]. As they are graphical models, the representation of data is probabilistic; however, it is possible to convert them to a deterministic neural network, although this loses the generative aspect of the model [176].

Work by Srivastava and Salakhutdinov [197] introduced multimodal deep belief networks as a multimodal representation. Kim et al. [104] used a deep belief network for each modality and then combined them into a joint representation for audiovisual emotion recognition. Huang and Kingsbury [86] used a similar model for AVSR, and Wu et al. [225] for audio and skeleton joint based gesture recognition.

Multimodal deep belief networks have been extended to multimodal DBMs by Srivastava and Salakhutdinov [198]. Multimodal DBMs are capable of learning joint representations from multiple modalities by merging two or more undirected graphs using a binary layer of hidden units on top of them. They allow for the low level representations of each modality to influence each other after the joint training due to the undirected nature of the model.

Ouyang et al. [156] explore the use of multimodal DBMs for the task of human pose estimation from multi-view data. They demonstrate that integrating the data at a later stage (after the unimodal data had undergone nonlinear transformations) was beneficial for the model. Similarly, Suk et al. [199] use a multimodal DBM representation to perform Alzheimer's disease classification from positron emission tomography and magnetic resonance imaging data.





One of the big advantages of using multimodal DBMs for learning multimodal representations is their generative nature, which allows for an easy way to deal with missing data: even if a whole modality is missing, the model has a natural way to cope. It can also be used to generate samples of one modality in the presence of the other one, or both modalities from the representation. Similar to autoencoders, the representation can be trained in an unsupervised manner, enabling the use of unlabeled data. The major disadvantage of DBMs is the difficulty of training them: high computational cost and the need to use approximate variational training methods [198].

Sequential Representation. So far we have discussed models that can represent fixed length data; however, we often need to represent varying length sequences such as sentences, videos, or audio streams. In this section we describe models that can be used to represent such sequences.



Table 2: A summary of multimodal representation techniques. We identify three subtypes of joint representations (Section 3.1) and two subtypes of coordinated ones (Section 3.2). For modalities + indicates the modalities combined.


Recurrent neural networks (RNNs), and their variants such as long short-term memory (LSTM) networks [82], have recently gained popularity due to their success in sequence modeling across various tasks [12], [213]. So far RNNs have mostly been used to represent unimodal sequences of words, audio, or images, with most success in the language domain. Similar to traditional neural networks, the hidden state of an RNN can be seen as a representation of the data, i.e., the hidden state of an RNN at timestep t can be seen as the summarization of the sequence up to that timestep. This is especially apparent in RNN encoder-decoder frameworks where the task of an encoder is to represent a sequence in the hidden state of an RNN in such a way that a decoder could reconstruct it [12].
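The idea that the hidden state at timestep t summarizes the sequence so far can be illustrated with a plain tanh RNN. This is a minimal sketch with hypothetical, hand-picked weights (`W_in`, `W_rec`, `bias`) and toy inputs; it omits training and the gating used by LSTMs.

```python
import math

def rnn_encode(sequence, W_in, W_rec, bias):
    """Return the hidden state after every timestep of a simple tanh RNN."""
    hidden = [0.0] * len(bias)
    states = []
    for x_t in sequence:
        # The new state depends on the current input and the previous state.
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(W_in[i], x_t))
                + sum(w * h for w, h in zip(W_rec[i], hidden))
                + bias[i]
            )
            for i in range(len(bias))
        ]
        states.append(hidden)
    return states

# Hypothetical parameters: 2-d inputs, 2-d hidden state.
W_in = [[0.5, -0.3], [0.1, 0.8]]
W_rec = [[0.2, 0.0], [-0.4, 0.3]]
bias = [0.0, 0.1]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq + [[0.5, 0.5], [-1.0, 0.2]]

# Sequences of different lengths map to representations of the same size.
short_repr = rnn_encode(short_seq, W_in, W_rec, bias)[-1]
long_states = rnn_encode(long_seq, W_in, W_rec, bias)
print(len(short_repr), len(long_states[-1]))
```

Because the state is updated causally, the state of the longer sequence after two steps equals the final state of the shorter one, which is what makes the state at timestep t a summary of the prefix.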

The use of RNN representations has not been limited to the unimodal domain. An early use of RNNs to construct a multimodal representation comes from work by Cosi et al. [43] on AVSR. They have also been used for representing audio-visual data for affect recognition [37], [152] and to represent multi-view data such as different visual cues for human behavior analysis [166].



3.2 Coordinated Representations

An alternative to a joint multimodal representation is a coordinated representation. Instead of projecting the modalities together into a joint space, we learn separate representations for each modality but coordinate them through a constraint. We start our discussion with coordinated representations that enforce similarity between representations, moving on to coordinated representations that enforce more structure on the resulting space (representative works of different coordinated representations can be seen in Table 2).

Similarity models minimize the distance between modalities in the coordinated space. For example, such models encourage the representation of the word dog and an image of a dog to have a smaller distance between them than the distance between the word dog and an image of a car [61]. One of the earliest examples of such a representation comes from the work by Weston et al. [221], [222] on the WSABIE (web scale annotation by image embedding) model, where a coordinated space was constructed for images and their annotations. WSABIE constructs a simple linear map from image and textual features such that corresponding annotation and image representations have a higher inner product (smaller cosine distance) between them than non-corresponding ones.



More recently, neural networks have become a popular way to construct coordinated representations, due to their ability to learn representations. Their advantage lies in the fact that they can jointly learn coordinated representations in an end-to-end manner. An example of such a coordinated representation is DeViSE, a deep visual-semantic embedding [61]. DeViSE uses a similar inner product and ranking loss function to WSABIE but uses more complex image and word embeddings. Kiros et al. [105] extended this to sentence and image coordinated representation by using an LSTM model and a pairwise ranking loss to coordinate the feature space. Socher et al. [191] tackle the same task, but extend the language model to a dependency tree RNN to incorporate compositional semantics. A similar model was also proposed by Pan et al. [159], but using videos instead of images. Xu et al. [231] also constructed a coordinated space between videos and sentences using a subject, verb, object compositional language model and a deep video model. This representation was then used for the task of cross-modal retrieval and video description.
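The inner-product scoring and ranking constraint shared by these similarity models can be sketched as a pairwise margin loss. This is an illustrative simplification rather than the exact objective of any cited model (WSABIE, for instance, uses a weighted rank-based variant); the linear maps `W_img` and `W_txt`, the `margin` value, and the toy features are hypothetical.

```python
def project(x, matrix):
    """Linear map of a feature vector into the coordinated space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ranking_loss(image, positive, negatives, W_img, W_txt, margin=0.1):
    """Hinge loss: the matching annotation should score higher than each
    non-matching one by at least `margin`."""
    z = project(image, W_img)
    positive_score = dot(z, project(positive, W_txt))
    return sum(
        max(0.0, margin - positive_score + dot(z, project(neg, W_txt)))
        for neg in negatives
    )

# Identity maps keep the arithmetic easy to follow (hypothetical parameters).
identity = [[1.0, 0.0], [0.0, 1.0]]
dog_image, dog_word, car_word = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]

# Correct pairing already satisfies the margin, so nothing is penalised.
good = ranking_loss(dog_image, dog_word, [car_word], identity, identity)
# Swapping the annotations violates the ordering and yields a positive loss.
bad = ranking_loss(dog_image, car_word, [dog_word], identity, identity)
print(good, bad)
```

Training a similarity model amounts to adjusting the two maps so that this loss is small over many (image, annotation) pairs.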

While the above models enforced similarity between representations, structured coordinated space models go beyond that and enforce additional constraints between the modality representations. The type of structure enforced is often based on the application, with different constraints for hashing, cross-modal retrieval, and image captioning.



Structured coordinated spaces are commonly used in cross-modal hashing: compression of high dimensional data into compact binary codes with similar binary codes for similar objects [218]. The idea of cross-modal hashing is to create such codes for cross-modal retrieval [27], [93], [113]. Hashing enforces certain constraints on the resulting multimodal space: 1) it has to be an N-dimensional Hamming space, that is, a binary representation with a controllable number of bits; 2) the same object from different modalities has to have a similar hash code; 3) the space has to be similarity-preserving. Learning how to represent the data as a hash function attempts to enforce all three of these requirements [27], [113]. For example, Jiang and Li [92] introduced a method to learn such a common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques. Cao et al. [32] extended the approach with a more complex LSTM sentence representation and introduced an outlier insensitive bit-wise margin loss and a relevance feedback based semantic similarity constraint. Similarly, Wang et al. [219] constructed a coordinated space in which images (and sentences) with similar meanings are closer to each other.
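The three hashing constraints can be made concrete with a toy example. In this sketch the per-modality projections `P_img` and `P_txt` are set by hand so that the arithmetic is easy to check; in the cited methods they would be learned, and all feature values here are hypothetical.

```python
def hash_code(x, projection):
    """Binary code: one bit per projection row, set when the projection is non-negative."""
    return tuple(
        1 if sum(w * v for w, v in zip(row, x)) >= 0 else 0
        for row in projection
    )

def hamming(a, b):
    """Number of differing bits between two codes of equal length."""
    return sum(p != q for p, q in zip(a, b))

# Hypothetical projections into a 4-bit Hamming space, one per modality.
P_img = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
P_txt = [[1, 0], [0, -1], [0, 1], [1, 0.5]]

dog_image = [1.0, 0.2, -0.5]
dog_text = [0.9, -0.4]
car_text = [-0.8, 0.6]

code_dog_image = hash_code(dog_image, P_img)
code_dog_text = hash_code(dog_text, P_txt)
code_car_text = hash_code(car_text, P_txt)

# Same object across modalities: identical codes. Different object: distant code.
print(code_dog_image, code_dog_text, code_car_text)
print(hamming(code_dog_image, code_dog_text), hamming(code_dog_image, code_car_text))
```

Retrieval then reduces to comparing short bit strings, which is what makes hashing attractive for large collections.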


Another example of a structured coordinated representation comes from order-embeddings of images and language [212], [249]. The model proposed by Vendrov et al. [212] enforces a dissimilarity metric that is asymmetric and implements the notion of partial order in the multimodal space. The idea is to capture a partial order of the language and image representations, enforcing a hierarchy on the space; for example, image of "a woman walking her dog" → text "woman walking her dog" → text "woman walking". A similar model using denotation graphs was also proposed by Young et al. [238], where denotation graphs are used to induce a partial ordering. Lastly, Zhang et al. present how exploiting structured representations of text and images can create concept taxonomies in an unsupervised manner [249].
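An asymmetric, order-enforcing penalty of this kind can be sketched as follows. The specific form used here, a squared hinge on coordinates where the more general item exceeds the more specific one, is an assumption based on how order-embeddings are commonly formulated and is not spelled out in the text above; the embedding values are hypothetical.

```python
def order_violation(specific, general):
    """Zero when every coordinate of `specific` is at least that of `general`;
    grows with the amount by which the ordering is violated."""
    return sum(max(0.0, g - s) ** 2 for s, g in zip(specific, general))

# Hypothetical non-negative embeddings: more specific items lie further from the origin.
image = [3.0, 2.0, 2.0]           # image of a woman walking her dog
caption = [2.0, 2.0, 1.0]         # "woman walking her dog"
short_caption = [1.0, 1.0, 0.0]   # "woman walking"

# The hierarchy image -> caption -> shorter caption incurs no penalty...
forward = order_violation(image, caption) + order_violation(caption, short_caption)
# ...while reversing a pair does, which is what makes the measure asymmetric.
backward = order_violation(caption, image)
print(forward, backward)
```

Unlike a distance, the penalty is not symmetric in its arguments, so it can encode that one item entails another but not the reverse.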

A special case of a structured coordinated space is one based on canonical correlation analysis (CCA) [84]. CCA computes a linear projection which maximizes the correlation between two random variables (in our case modalities) and enforces orthogonality of the new space. CCA models have been used extensively for cross-modal retrieval [76], [106], [169] and audiovisual signal analysis [177], [187]. Extensions to CCA attempt to construct a correlation maximizing nonlinear projection [7], [116]. Kernel canonical correlation analysis (KCCA) [116] uses reproducing kernel Hilbert spaces for projection. However, as the approach is nonparametric, it scales poorly with the size of the training set and has issues with very large real-world datasets. Deep canonical correlation analysis (DCCA) [7] was introduced as an alternative to KCCA and addresses the scalability issue; it was also shown to lead to a better correlated representation space. Similar correspondence autoencoder [58] and deep correspondence RBMs [57] have also been proposed for cross-modal retrieval.

CCA, KCCA, and DCCA are unsupervised techniques and only optimize the correlation over the representations, thus mostly capturing what is shared across the modalities. Deep canonically correlated autoencoders [220] also include an autoencoder based data reconstruction term. This encourages the representation to also capture modality specific information. The semantic correlation maximization method [248] also encourages semantic relevance, while retaining correlation maximization and orthogonality of the resulting space; this leads to a combination of CCA and cross-modal hashing techniques.
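The quantity that CCA maximizes, the correlation between linear projections of the two views, can be illustrated on toy data. The sketch below is not a CCA solver: it replaces the usual closed-form solution with a brute-force search over unit directions in two dimensions, finds only the first pair of directions, and does not enforce orthogonality. The synthetic data, the shared latent variable `z`, and the grid step are assumptions made for illustration.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation; returns 0.0 for a constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0

def projected_correlation(X, Y, w_x, w_y):
    """Correlation between the views after projecting each onto one direction."""
    a = [sum(w * v for w, v in zip(w_x, row)) for row in X]
    b = [sum(w * v for w, v in zip(w_y, row)) for row in Y]
    return pearson(a, b)

def first_canonical_pair(X, Y, step_degrees=10):
    """Grid search over 2-d unit directions for the most correlated pair."""
    best = (-2.0, None, None)
    for i in range(0, 360, step_degrees):
        w_x = [math.cos(math.radians(i)), math.sin(math.radians(i))]
        for j in range(0, 360, step_degrees):
            w_y = [math.cos(math.radians(j)), math.sin(math.radians(j))]
            corr = projected_correlation(X, Y, w_x, w_y)
            if corr > best[0]:
                best = (corr, w_x, w_y)
    return best

# Two synthetic views sharing one latent signal z, hidden in different coordinates.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(50)]
X = [[zi + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)] for zi in z]
Y = [[rng.gauss(0, 1), zi + 0.1 * rng.gauss(0, 1)] for zi in z]

best_corr, w_x, w_y = first_canonical_pair(X, Y)
print(round(best_corr, 3))
```

The search recovers directions that pick out the shared signal in each view, which is the sense in which correlation-based methods mostly capture what is common to the modalities.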




3.3 Discussion讨论

In this section we identified two major types of multimodal representations: joint and coordinated. Joint representations project multimodal data into a common space and are best suited for situations when all of the modalities are present during inference. They have been extensively used for AVSR, affect, and multimodal gesture recognition. Coordinated representations, on the other hand, project each modality into a separate but coordinated space, making them suitable for applications where only one modality is present at test time, such as multimodal retrieval and translation (Section 4), grounding (Section 7.2), and zero-shot learning (Section 7.2). Finally, while joint representations have been used to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two modalities.


 Table 3: Taxonomy of multimodal translation research. For each class and sub-class, we include example tasks with references. Our taxonomy also includes the directionality of the translation: unidirectional (⇒) and bidirectional (⇔).


4 Translation翻译

A big part of multimodal machine learning is concerned with translating (mapping) from one modality to another. Given an entity in one modality, the task is to generate the same entity in a different modality. For example, given an image we might want to generate a sentence describing it, or given a textual description generate an image matching it. Multimodal translation is a long-studied problem, with early work in speech synthesis [88], visual speech generation [136], video description [107], and cross-modal retrieval [169].

More recently, multimodal translation has seen renewed interest due to combined efforts of the computer vision and natural language processing (NLP) communities [19] and recent availability of large multimodal datasets [38], [205]. A particularly popular problem is visual scene description, also known as image [214] and video captioning [213], which acts as a great test bed for a number of computer vision and NLP problems. To solve it, we not only need to fully understand the visual scene and to identify its salient parts, but also to produce grammatically correct and comprehensive yet concise sentences describing it.



While the approaches to multimodal translation are very broad and are often modality specific, they share a number of unifying factors. We categorize them into two types: example-based and generative. Example-based models use a dictionary when translating between the modalities. Generative models, on the other hand, construct a model that is able to produce a translation. This distinction is similar to the one between non-parametric and parametric machine learning approaches and is illustrated in Figure 2, with representative examples summarized in Table 3.

Generative models are arguably more challenging to build as they require the ability to generate signals or sequences of symbols (e.g., sentences). This is difficult for any modality (visual, acoustic, or verbal), especially when temporally and structurally consistent sequences need to be generated. This led to many of the early multimodal translation systems relying on example-based translation. However, this has been changing with the advent of deep learning models that are capable of generating images [171], [210], sounds [157], [209], and text [12].




 Figure 2: Overview of example-based and generative multimodal translation. The former retrieves the best translation from a dictionary, while the latter first trains a translation model on the dictionary and then uses that model for translation.


4.1 Example-based 基于实例

Example-based algorithms are restricted by their training data, the dictionary (see Figure 2a). We identify two types of such algorithms: retrieval-based and combination-based. Retrieval-based models directly use the retrieved translation without modifying it, while combination-based models rely on more complex rules to create translations based on a number of retrieved instances.

Retrieval-based models are arguably the simplest form of multimodal translation. They rely on finding the closest sample in the dictionary and using that as the translated result. The retrieval can be done in a unimodal space or in an intermediate semantic space.

Given a source modality instance to be translated, unimodal retrieval finds the closest instances in the dictionary in the space of the source, for example, the visual feature space for images. Such approaches have been used for visual speech synthesis, by retrieving the closest matching visual example of the desired phoneme [26]. They have also been used in concatenative text-to-speech systems [88]. More recently, Ordonez et al. [155] used unimodal retrieval to generate image descriptions by using global image features to retrieve caption candidates. Yagcioglu et al. [232] used a CNN-based image representation to retrieve visually similar images using adaptive neighborhood selection. Devlin et al. [49] demonstrated that a simple k-nearest neighbor retrieval with consensus caption selection achieves competitive translation results when compared to more complex generative approaches. The advantage of such unimodal retrieval approaches is that they only require the representation of a single modality through which we are performing retrieval. However, they often require an extra processing step such as re-ranking of retrieved translations [135], [155], [232]. This indicates a major problem with this approach: similarity in unimodal space does not always imply a good translation.
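As a rough illustration of unimodal retrieval followed by a consensus re-ranking step, the sketch below retrieves the k nearest dictionary images in feature space and returns the candidate caption that agrees most with the other candidates. It is only in the spirit of the k-nearest-neighbor approach mentioned above: the cosine similarity on toy 2-D features and the Jaccard word overlap used as caption similarity are assumptions chosen for brevity, and all function names are hypothetical.

```python
import numpy as np

def cosine_sim(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

def jaccard(a, b):
    """Word-overlap similarity between two captions (a stand-in metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def consensus_caption(query_feat, dict_feats, dict_captions, k=3):
    """Retrieve k nearest images, then pick the caption closest to the others."""
    sims = cosine_sim(query_feat, dict_feats)
    nearest = np.argsort(-sims)[:k]
    candidates = [dict_captions[i] for i in nearest]

    def score(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        return sum(jaccard(candidates[idx], o) for o in others) / max(len(others), 1)

    best = max(range(len(candidates)), key=score)
    return candidates[best]

if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
    captions = [
        "a dog runs on grass",
        "a dog runs in a park",
        "a brown dog runs on grass",
        "a cat sleeps on a sofa",
    ]
    print(consensus_caption(np.array([1.0, 0.05]), feats, captions, k=3))
```

The consensus step is one form of the re-ranking the paragraph refers to: the nearest image in feature space does not necessarily carry the best caption, so the candidates are compared against each other before one is returned.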



给定要翻译的源模态实例,单模态检索在源空间(例如,图像的视觉特征空间)中找到字典中最近的实例。这种方法已经被用于视觉语音合成,通过检索最接近匹配的期望音素[26]的视觉示例。它们也被用于串联文本-语音系统[88]。最近,Ordonez等人[155]使用单模态检索,通过使用全局图像特征检索候选标题来生成图像描述[155]。Yagcioglu等人[232]使用了一种基于cnn的图像表示,使用自适应邻域选择来检索视觉上相似的图像。Devlin et al.[49]证明,与更复杂的生成方法相比,具有一致标题选择的简单k近邻检索可以获得有竞争力的翻译结果。这种单模态检索方法的优点是,它们只需要表示我们执行检索时所使用的单一模态。然而,它们通常需要额外的处理步骤,如对检索到的翻译进行重新排序[135]、[155]、[232]。这表明了这种方法的一个主要问题——单峰空间中的相似性并不总是意味着好的翻译。

An alternative is to use an intermediate semantic space for similarity comparison during retrieval. An early example of a hand-crafted semantic space is the one used by Farhadi et al. [56]. They map both sentences and images to a space of ⟨object, action, scene⟩; retrieval of a relevant caption for an image is then performed in that space. In contrast to hand-crafting a representation, Socher et al. [191] learn a coordinated representation of sentences and CNN visual features (see Section 3.2 for a description of coordinated spaces). They use the model both for translating from text to images and from images to text. Similarly, Xu et al. [231] used a coordinated space of videos and their descriptions for cross-modal retrieval. Jiang and Li [93] and Cao et al. [32] use cross-modal hashing to perform multimodal translation from images to sentences and back, while Hodosh et al. [83] use a multimodal KCCA space for image-sentence retrieval. Instead of aligning images and sentences globally in a common space, Karpathy et al. [99] propose a multimodal similarity metric that internally aligns image fragments (visual objects) together with sentence fragments (dependency tree relations).


Retrieval approaches in semantic space tend to perform better than their unimodal counterparts as they are retrieving examples in a more meaningful space that reflects both modalities and that is often optimized for retrieval. Furthermore, they allow for bi-directional translation, which is not straightforward with unimodal methods. However, they require manual construction or learning of such a semantic space, which often relies on the existence of large training dictionaries (datasets of paired samples).

Combination-based models take the retrieval-based approaches one step further. Instead of just retrieving examples from the dictionary, they combine them in a meaningful way to construct a better translation. Combination-based media description approaches are motivated by the fact that sentence descriptions of images share a common and simple structure that could be exploited. Most often the rules for combinations are hand-crafted or based on heuristics.

Kuznetsova et al. [114] first retrieve phrases that describe visually similar images and then combine them to generate novel descriptions of the query image by using Integer Linear Programming with a number of hand crafted rules. Gupta et al. [74] first find k images most similar to the source image, and then use the phrases extracted from their captions to generate a target sentence. Lebret et al. [119] use a CNN-based image representation to infer phrases that describe it. The predicted phrases are then combined using a trigram constrained language model.

A big problem facing example-based approaches for translation is that the model is the entire dictionary, making the model large and inference slow (although optimizations such as hashing alleviate this problem). Another issue facing example-based translation is that it is unrealistic to expect that a single comprehensive and accurate translation relevant to the source example will always exist in the dictionary, unless the task is simple or the dictionary is very large. This is partly addressed by combination models that are able to construct more complex structures. However, they are only able to perform translation in one direction, while semantic space retrieval-based models are able to perform it both ways.





4.2 Generative approaches生成方法

Generative approaches to multimodal translation construct models that can perform multimodal translation given a unimodal source instance. It is a challenging problem as it requires the ability both to understand the source modality and to generate the target sequence or signal. As discussed in Section 4.3, this also makes such methods much more difficult to evaluate, due to the large space of possible correct answers.

In this survey we focus on the generation of three modalities: language, vision, and sound. Language generation has been explored for a long time [170], with a lot of recent attention for tasks such as image and video description [19]. Speech and sound generation has also seen a lot of work, with a number of historical [88] and modern approaches [157], [209]. Photo-realistic image generation has been less explored and is still in its early stages [132], [171]. However, there have been a number of attempts at generating abstract scenes [253], computer graphics [45], and talking heads [6].



We identify three broad categories of generative models: grammar-based, encoder-decoder, and continuous generation models. Grammar-based models simplify the task by restricting the target domain using a grammar, e.g., by generating restricted sentences based on a ⟨subject, object, verb⟩ template. Encoder-decoder models first encode the source modality to a latent representation, which is then used by a decoder to generate the target modality. Continuous generation models generate the target modality continuously based on a stream of source modality inputs and are most suited for translating between temporal sequences, such as text-to-speech.

Grammar-based models rely on a pre-defined grammar for generating a particular modality. They start by detecting high level concepts from the source modality, such as objects in images and actions from videos. These detections are then incorporated together with a generation procedure based on a pre-defined grammar to result in a target modality.

Kojima et al. [107] proposed a system to describe human behavior in a video using the detected position of the person’s head and hands and rule-based natural language generation that incorporates a hierarchy of concepts and actions. Barbu et al. [14] proposed a video description model that generates sentences of the form: who did what to whom and where and how they did it. The system was based on handcrafted object and event classifiers and used a restricted grammar suitable for the task. Guadarrama et al. [73] predict ⟨subject, verb, object⟩ triplets describing a video using semantic hierarchies that use more general words in case of uncertainty. Together with a language model, their approach allows for translation of verbs and nouns not seen in the dictionary.
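The template-filling idea behind grammar-based generation can be illustrated with a toy sketch: detected ⟨subject, verb, object⟩ concepts with confidences are slotted into a fixed sentence pattern, and a low-confidence concept is replaced by a more general word from a small hierarchy. The hierarchy contents, the one-level back-off rule, the threshold, and all names below are hypothetical and chosen only for illustration; they are not taken from any of the cited systems.

```python
# Hypothetical one-level semantic hierarchy: specific word -> more general word.
HYPERNYM = {
    "slices": "cuts",
    "poodle": "dog",
    "onion": "vegetable",
}

def generalize(label, confidence, threshold=0.5):
    """Back off to a more general word when the detector is uncertain."""
    if confidence < threshold and label in HYPERNYM:
        return HYPERNYM[label]
    return label

def article(word):
    """Pick 'a' or 'an' with a simple vowel heuristic."""
    return "an" if word[0].lower() in "aeiou" else "a"

def describe(subject, verb, obj, threshold=0.5):
    """Fill a fixed <subject, verb, object> template.

    Each argument is a (label, confidence) pair from a concept detector.
    """
    s = generalize(*subject, threshold=threshold)
    v = generalize(*verb, threshold=threshold)
    o = generalize(*obj, threshold=threshold)
    return f"{article(s).capitalize()} {s} {v} {article(o)} {o}."

if __name__ == "__main__":
    print(describe(("person", 0.9), ("slices", 0.3), ("onion", 0.8)))
    print(describe(("person", 0.9), ("slices", 0.9), ("onion", 0.8)))
```

The fixed template keeps the output grammatical by construction, which is also why such systems tend to produce formulaic sentences.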



Kojima等人[107]提出了一种系统,利用检测到的人的头和手的位置,以及基于规则的自然语言生成(包含概念和行为的层次),来描述视频中的人类行为。Barbu et al.[14]提出了一个视频描述模型,该模型生成如下形式的句子:谁对谁做了什么,在哪里以及他们是如何做的。该系统基于手工制作的对象和事件分类器,并使用了适合该任务的限制性语法。guadarama等人[73]预测主语、动词、宾语三连词描述视频,使用语义层次结构,在不确定的情况下使用更一般的词汇。与语言模型一起,他们的方法允许翻译字典中没有的动词和名词。

To describe images, Yao et al. [235] propose to use an and-or graph-based model together with domain-specific lexicalized grammar rules, a targeted visual representation scheme, and a hierarchical knowledge ontology. Li et al. [121] first detect objects, visual attributes, and spatial relationships between objects. They then use an n-gram language model on the visually extracted phrases to generate ⟨subject, preposition, object⟩ style sentences. Mitchell et al. [142] use a more sophisticated tree-based language model to generate syntactic trees instead of filling in templates, leading to more diverse descriptions. A majority of approaches represent the whole image jointly as a bag of visual objects without capturing their spatial and semantic relationships. To address this, Elliott et al. [51] propose to explicitly model proximity relationships of objects for image description generation.

Some grammar-based approaches rely on graphical models to generate the target modality. An example is BabyTalk [112], which, given an image, generates ⟨object, preposition, object⟩ triplets that are used together with a conditional random field to construct the sentences. Yang et al. [233] predict a set of ⟨noun, verb, scene, preposition⟩ candidates using visual features extracted from an image and combine them into a sentence using a statistical language model and hidden Markov model style inference. A similar approach has been proposed by Thomason et al. [204], where a factor graph model is used for video description of the form ⟨subject, verb, object, place⟩. The factor model exploits language statistics to deal with noisy visual representations. Going the other way, Zitnick et al. [253] propose to use conditional random fields to generate abstract visual scenes based on language triplets extracted from sentences.

为了描述图像,Yao等人[235]提出使用基于和或图的模型,以及特定领域的词汇化语法规则、有针对性的视觉表示方案和层次知识本体。Li等人[121]首先检测对象、视觉属性和对象之间的空间关系。然后,他们在视觉提取的短语上使用一个n-gram语言模型,生成主语、介词、宾语式的句子。Mitchell等人[142]使用更复杂的基于树的语言模型来生成语法树,而不是填充模板,从而产生更多样化的描述。大多数方法将整个图像共同表示为一袋视觉对象,而没有捕捉它们的空间和语义关系。为了解决这个问题,Elliott et al.[51]提出明确地建模物体的接近关系,以生成图像描述。

一些基于语法的方法依赖于图形模型来生成目标模态。一个例子包括BabyTalk[112],它给出一个图像生成object,介词,object三连词,这些连词与条件随机场一起用来构造句子。Yang等人[233]利用从图像中提取的视觉特征预测一组的名词、动词、场景、介词候选人,并使用统计语言模型和隐马尔可夫模型风格推理将它们组合成一个句子。Thomason等人也提出了类似的方法[204],其中一个因子图模型用于subject, verb, object, place形式的视频描述。因子模型利用语言统计来处理嘈杂的视觉表示。Zitnick等人[253]则提出利用条件随机场从句子中提取语言三联体,生成抽象视觉场景。

An advantage of grammar-based methods is that they are more likely to generate syntactically (in the case of language) or logically correct target instances, as they use predefined templates and restricted grammars. However, this limits them to producing formulaic rather than creative translations. Furthermore, grammar-based methods rely on complex pipelines for concept detection, with each concept requiring a separate model and a separate training dataset.

Encoder-decoder models based on end-to-end trained neural networks are currently some of the most popular techniques for multimodal translation. The main idea behind the model is to first encode a source modality into a vectorial representation and then to use a decoder module to generate the target modality, all in a single-pass pipeline. Although first used for machine translation [97], such models have been successfully used for image captioning [134], [214] and video description [174], [213]. So far, encoder-decoder models have been mostly used to generate text, but they can also be used to generate images [132], [171] and for continuous generation of speech and sound [157], [209].

The first step of the encoder-decoder model is to encode the source object; this is done in a modality-specific way. Popular models to encode acoustic signals include RNNs [35] and DBNs [79]. Most of the work on encoding words and sentences uses distributional semantics [141] and variants of RNNs [12]. Images are most often encoded using convolutional neural networks (CNN) [109], [185]. While learned CNN representations are common for encoding images, this is not the case for videos, where hand-crafted features are still commonly used [174], [204]. While it is possible to use unimodal representations to encode the source modality, it has been shown that using a coordinated space (see Section 3.2) leads to better results [105], [159], [231].



Decoding is most often performed by an RNN or an LSTM using the encoded representation as the initial hidden state [54], [132], [214], [215]. A number of extensions have been proposed to traditional LSTM models to aid in the task of translation. A guide vector could be used to tightly couple the solutions in the image input [91]. Venugopalan et al. [213] demonstrate that it is beneficial to pre-train a decoder LSTM for image captioning before fine-tuning it to video description. Rohrbach et al. [174] explore the use of various LSTM architectures (single layer, multilayer, factored) and a number of training and regularization techniques for the task of video description.

A problem facing translation generation using an RNN is that the model has to generate a description from a single vectorial representation of the image, sentence, or video. This becomes especially difficult when generating long sequences as these models tend to forget the initial input. This has been partly addressed by neural attention models (see Section 5.2) that allow the network to focus on certain parts of an image [230], sentence [12], or video [236] during generation.
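The attention idea mentioned above can be sketched as a single decoding step: instead of relying on one fixed vector, the decoder scores every source element (for example, an image region) against its current state and forms a weighted average. The sketch below assumes plain dot-product scoring and toy feature values; the cited attention models use learned scoring networks, so this is an illustration of the mechanism only, and `soft_attention` is a hypothetical name.

```python
import numpy as np

def soft_attention(regions, query):
    """One soft-attention step.

    regions: (num_regions, d) source features, e.g. image region vectors.
    query:   (d,) current decoder state.
    Returns attention weights (summing to 1) and the context vector.
    """
    scores = regions @ query              # dot-product relevance (an assumption)
    scores = scores - scores.max()        # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ regions           # weighted average of region features
    return weights, context

if __name__ == "__main__":
    regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
    query = np.array([2.0, 0.0])
    weights, context = soft_attention(regions, query)
    print("weights:", weights)
    print("context:", context)
```

Recomputing the weights at every generation step lets the decoder revisit different parts of the input, which is what mitigates forgetting of the initial input on long sequences.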

Generative attention-based RNNs have also been used for the task of generating images from sentences [132]. While the results are still far from photo-realistic, they show a lot of promise. More recently, a large amount of progress has been made in generating images using generative adversarial networks [71], which have been used as an alternative to RNNs for image generation from text [171].




While neural network based encoder-decoder systems have been very successful, they still face a number of issues. Devlin et al. [49] suggest that it is possible that the network is memorizing the training data rather than learning how to understand the visual scene and generate it. This is based on the observation that k-nearest neighbor models perform very similarly to those based on generation. Furthermore, such models often require large quantities of data for training.

Continuous generation models are intended for sequence translation and produce outputs at every timestep in an online manner. These models are useful when translating from a sequence to a sequence such as text to speech, speech to text, and video to text. A number of different techniques have been proposed for such modeling — graphical models, continuous encoder-decoder approaches, and various other regression or classification techniques. The extra difficulty that needs to be tackled by these models is the requirement of temporal consistency between modalities.

A lot of early work on sequence-to-sequence translation used graphical or latent variable models. Deena and Galata [47] proposed to use a shared Gaussian process latent variable model for audio-based visual speech synthesis. The model creates a shared latent space between audio and visual features that can be used to generate one space from the other, while enforcing temporal consistency of visual speech at different timesteps. Hidden Markov models (HMM) have also been used for visual speech generation [203] and text-to-speech [245] tasks. They have also been extended to use cluster adaptive training to allow for training on multiple speakers, languages, and emotions, allowing for more control when generating the speech signal [244] or visual speech parameters [6].

虽然基于神经网络的编码器-解码器系统已经非常成功,但它们仍然面临一些问题。Devlin et al.[49]认为,网络可能是在记忆训练数据,而不是学习如何理解视觉场景并生成它。这是基于k近邻模型与基于生成的模型非常相似的观察得出的。此外,这种模型通常需要大量的数据进行训练。



Encoder-decoder models have recently become popular for sequence-to-sequence modeling. Owens et al. [157] used an LSTM to generate sounds resulting from drumsticks based on video. While their model is capable of generating sounds by predicting a cochleogram from CNN visual features, they found that retrieving the closest audio sample based on the predicted cochleogram led to the best results. Directly modeling the raw audio signal for speech and music generation has been proposed by van den Oord et al. [209]. The authors propose using hierarchical fully convolutional neural networks, which show a large improvement over the previous state of the art for the task of speech synthesis. RNNs have also been used for speech-to-text translation (speech recognition) [72]. More recently, an encoder-decoder based continuous approach was shown to be good at predicting letters from a speech signal represented as filter bank spectra [35], allowing for more accurate recognition of rare and out-of-vocabulary words. Collobert et al. [42] demonstrate how to use a raw audio signal directly for speech recognition, eliminating the need for audio features.

A lot of earlier work used graphical models for multimodal translation between continuous signals. However, these methods are being replaced by neural network encoder-decoder based techniques, especially as these have recently been shown to be able to represent and generate complex visual and acoustic signals.

编码器-解码器模型是近年来序列对序列建模的流行方法。Owens等人[157]使用LSTM来产生基于视频的鼓槌的声音。虽然他们的模型能够通过预测CNN视觉特征的耳蜗图来产生声音,但他们发现,根据预测的耳蜗图检索最近的音频样本会带来最好的结果。van den Oord等人提出直接对原始音频信号建模以生成语音和音乐[209]。作者建议使用分层全卷积神经网络,这表明在语音合成的任务中,比以前的最先进技术有了很大的改进。rnn也被用于语音到文本的翻译(语音识别)[72]。最近,基于编码器-解码器的连续方法被证明能够很好地从表示为滤波器组光谱[35]的语音信号中预测字母,从而能够更准确地识别罕见的和词汇之外的单词。Collobert等人的[42]演示了如何直接使用原始音频信号进行语音识别,消除了对音频特征的需求。


4.3 Model evaluation and discussion模型评价与讨论

A major challenge facing multimodal translation methods is that they are very difficult to evaluate. While some tasks such as speech recognition have a single correct translation, tasks such as speech synthesis and media description do not. Sometimes, as in language translation, multiple answers are correct and deciding which translation is better is often subjective. Fortunately, there are a number of approximate automatic metrics that aid in model evaluation.

Often the ideal way to evaluate a subjective task is through human judgment, that is, by having a group of people evaluate each translation. This can be done on a Likert scale where each translation is evaluated on a certain dimension: naturalness and mean opinion score for speech synthesis [209], [244], realism for visual speech synthesis [6], [203], and grammatical and semantic correctness, relevance, order, and detail for media description [38], [112], [142], [213]. Another option is to perform preference studies where two (or more) translations are presented to the participant for preference comparison [203], [244]. However, while user studies will result in the evaluation closest to human judgments, they are time consuming and costly. Furthermore, they require care when constructing and conducting them to avoid fluency, age, gender, and culture biases.



While human studies are a gold standard for evaluation, a number of automatic alternatives have been proposed for the task of media description: BLEU [160], ROUGE [124], Meteor [48], and CIDEr [211]. These metrics are directly taken from (or are based on) work in machine translation and compute a score that measures the similarity between the generated and ground truth text. However, their use has faced a lot of criticism. Elliott and Keller [52] showed that sentence-level unigram BLEU is only weakly correlated with human judgments. Huang et al. [87] demonstrated that the correlation between human judgments and BLEU and Meteor is very low for the visual storytelling task. Furthermore, the ordering of approaches based on human judgments did not match the ordering using automatic metrics on the MS COCO challenge [38], with a large number of algorithms outperforming humans on all the metrics. Finally, the metrics only work well when the number of reference translations is high [211], which is often not the case, especially for current video description datasets [205].
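To make the criticism of sentence-level unigram BLEU concrete, here is a minimal sketch of that score: clipped unigram precision multiplied by a brevity penalty. It follows the standard BLEU definition restricted to unigrams, but the whitespace tokenization, lowercasing, and the function name `unigram_bleu` are simplifying assumptions; it is not a replacement for an established evaluation toolkit.

```python
import math
from collections import Counter

def unigram_bleu(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # For each word, the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    # Clip candidate counts so repeated words cannot inflate precision.
    clipped = sum(min(count, max_ref[word]) for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

if __name__ == "__main__":
    refs = ["a dog runs on grass"]
    print(unigram_bleu("a dog runs on grass", refs))
    print(unigram_bleu("grass on runs dog a", refs))
```

Both calls in the usage example receive the same score because unigram matching ignores word order, which is one intuitive reason why such a score can diverge from human judgments of caption quality.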

虽然人类研究是评估的黄金标准,但人们提出了许多媒体描述任务的自动替代方案:BLEU[160]、ROUGE[124]、Meteor[48]和CIDEr[211]。这些指标是直接从(或基于)机器翻译的工作,并计算出一个分数,以衡量生成的文本和地面真实文本之间的相似性。然而,它们的使用面临着许多批评。Elliott和Keller[52]表明句子层面的ungram BLEU与人类判断只有弱相关。Huang等[87]研究表明,在视觉讲故事任务中,人类判断与BLEU和Meteor之间的相关性非常低。此外,基于人类判断的方法排序与在MS COCO挑战[38]上使用自动度量的排序并不匹配——大量算法在所有度量上都优于人类。最后,只有在大量参考翻译量高的情况下,指标才能很好地工作[211],而这通常是不可用的,特别是对于当前的视频描述数据集[205]。

These criticisms have led to Hodosh et al. [83] proposing to use retrieval as a proxy for image captioning evaluation, which they argue better reflects human judgments. Instead of generating captions, a retrieval-based system ranks the available captions based on their fit to the image, and is then evaluated by assessing if the correct captions are given a high rank. As a number of caption generation models are generative, they can be used directly to assess the likelihood of a caption given an image and are being adapted by the image captioning community [99], [105]. Such retrieval-based evaluation metrics have also been adopted by the video captioning community [175].
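Retrieval-based evaluation of this kind is commonly summarized with recall at k and the median rank of the correct caption. The sketch below assumes a square score matrix in which entry (i, j) is the model's score for caption j given image i and caption i is the single correct caption for image i; that one-to-one pairing, the toy scores, and the function name are assumptions for illustration.

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 5)):
    """Recall@k and median rank for image-to-caption retrieval.

    scores[i, j] is the fit of caption j to image i; caption i is assumed
    to be the single correct caption for image i.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.diag(scores)
    # Rank of the correct caption: 1 + number of captions scored strictly higher.
    ranks = 1 + (scores > correct[:, None]).sum(axis=1)
    recall = {k: float((ranks <= k).mean()) for k in ks}
    return recall, float(np.median(ranks))

if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.3, 0.6],
    ]
    recall, median_rank = retrieval_metrics(scores, ks=(1, 2))
    print(recall, median_rank)
```

A generative captioning model can supply the score matrix directly by using the likelihood it assigns to each caption given each image.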

The visual question-answering (VQA) [130] task was proposed partly due to the issues facing evaluation of image captioning. VQA is a task where, given an image and a question about its content, the system has to answer it. Evaluating such systems is easier due to the presence of a correct answer. However, it still faces issues such as ambiguity of certain questions and answers and question bias.

We believe that addressing the evaluation issue will be crucial for further success of multimodal translation systems. This will allow not only for better comparison between approaches, but also for better objectives to optimize.


视觉问答(Visual question-answer, VQA)[130]任务的提出,部分是由于图像字幕评价面临的问题。VQA是一个任务,在这个任务中,给定一个图像和一个关于其内容的问题,系统必须回答它。由于存在正确的答案,评估这些系统更容易。然而,它仍然面临一些问题,如某些问题和答案的模糊性和问题的偏见。


5 Alignment对齐

We define multimodal alignment as finding relationships and correspondences between sub-components of instances from two or more modalities. For example, given an image and a caption we want to find the areas of the image corresponding to the caption’s words or phrases [98]. Another example is, given a movie, aligning it to the script or the book chapters it was based on [252].

We categorize multimodal alignment into two types: implicit and explicit. In explicit alignment, we are explicitly interested in aligning sub-components between modalities, e.g., aligning recipe steps with the corresponding instructional video [131]. Implicit alignment is used as an intermediate (often latent) step for another task, e.g., image retrieval based on text description can include an alignment step between words and image regions [99]. An overview of such approaches can be seen in Table 4 and is presented in more detail in the following sections.



 Table 4: Summary of our taxonomy for the multimodal alignment challenge. For each sub-class of our taxonomy, we include reference citations and modalities aligned.


5.1 Explicit alignment显式对齐

We categorize papers as performing explicit alignment if their main modeling objective is alignment between sub-components of instances from two or more modalities. A very important part of explicit alignment is the similarity metric. Most approaches rely on measuring similarity between sub-components in different modalities as a basic building block. These similarities can be defined manually or learned from data.

We identify two types of algorithms that tackle explicit alignment: unsupervised and (weakly) supervised. The first type operates with no direct alignment labels (i.e., labeled correspondences) between instances from the different modalities. The second type has access to such (sometimes weak) labels.

Unsupervised multimodal alignment tackles modality alignment without requiring any direct alignment labels. Most of the approaches are inspired by early work on alignment for statistical machine translation [28] and genome sequences [3], [111]. To make the task easier, the approaches assume certain constraints on alignment, such as temporal ordering of the sequences or the existence of a similarity metric between the modalities.




Dynamic time warping (DTW) [3], [111] is a dynamic programming approach that has been extensively used to align multi-view time series. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames). It requires the timesteps in the two sequences to be comparable and requires a similarity measure between them. DTW can be used directly for multimodal alignment by hand-crafting similarity metrics between modalities; for example, Anguera et al. [8] use a manually defined similarity between graphemes and phonemes, and Tapaswi et al. [201] define a similarity between visual scenes and sentences based on the appearance of the same characters to align TV shows and plot synopses. DTW-like dynamic programming approaches have also been used for multimodal alignment of text to speech [77] and video [202].
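The dynamic program behind DTW can be sketched as follows. This is the textbook formulation with a user-supplied distance between timesteps; for multimodal use, `dist` would be the hand-crafted cross-modal similarity described above. The absolute-difference default and the toy sequences are assumptions for the example.

```python
import numpy as np

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Align sequences a and b; return (total cost, list of matched index pairs)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of: repeat a frame of a, repeat a frame of b, or advance both.
            cost[i, j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )

    # Backtrack from the end to recover the monotonic warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(cost[n, m]), path[::-1]

if __name__ == "__main__":
    total, path = dtw([1, 2, 3], [1, 2, 2, 3])
    print(total, path)
```

The recurrence only allows moves that advance in time, which is how DTW enforces the temporal-ordering constraint; its cost is quadratic in the sequence lengths.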

As the original DTW formulation requires a pre-defined similarity metric between modalities, it was extended using canonical correlation analysis (CCA) to map the modalities to a coordinated space. This allows for both aligning (through DTW) and learning the mapping (through CCA) between different modality streams jointly and in an unsupervised manner [180], [250], [251]. While CCA-based DTW models are able to find multimodal data alignment under a linear transformation, they are not able to model non-linear relationships. This has been addressed by the deep canonical time warping approach [206], which can be seen as a generalization of deep CCA and DTW.



Various graphical models have also been popular for multimodal sequence alignment in an unsupervised manner. Early work by Yu and Ballard [239] used a generative graphical model to align visual objects in images with spoken words. A similar approach was taken by Cour et al. [44] to align movie shots and scenes to the corresponding screenplay. Malmaud et al. [131] used a factored HMM to align recipes to cooking videos, while Noulas et al. [154] used a dynamic Bayesian network to align speakers to videos. Naim et al. [147] matched sentences with corresponding video frames using a hierarchical HMM model to align sentences with frames and a modified IBM [28] algorithm for word and object alignment [15]. This model was then extended to use latent conditional random fields for alignments [146] and to incorporate verb alignment to actions in addition to nouns and objects [195].

Both DTW and graphical model approaches for alignment allow for restrictions on alignment, e.g., temporal consistency, no large jumps in time, and monotonicity. While DTW extensions allow for learning both the similarity metric and alignment jointly, graphical model based approaches require expert knowledge for construction [44], [239].

Supervised alignment methods rely on labeled aligned instances. They are used to train similarity measures that are used for aligning modalities.

各种图形模型也流行于无监督方式的多模态序列比对。Yu和Ballard的早期工作[239]使用生成图形模型,将图像中的视觉对象与口语对齐。Cour et al.[44]采用了类似的方法,将电影镜头和场景与相应的剧本对齐。Malmaud等人[131]使用一种经过分解的HMM将食谱与烹饪视频进行对齐,而Noulas等人[154]使用动态贝叶斯网络将说话者与视频进行对齐。Naim等人[147]使用分层HMM模型对句子和帧进行对齐,并使用改进的IBM[28]算法对单词和对象进行对齐[15],将句子与相应的视频帧进行匹配。随后,该模型被扩展到使用潜在条件随机场进行对齐[146],并将动词对齐合并到动作中,除了名词和对象之外[195]。


A number of supervised sequence alignment techniques take inspiration from unsupervised ones. Bojanowski et al. [22], [23] proposed a method similar to canonical time warping, but also extended it to take advantage of existing (weak) supervisory alignment data for model training. Plummer et al. [161] used CCA to find a coordinated space between image regions and phrases for alignment. Gebru et al. [65] trained a Gaussian mixture model and performed semi-supervised clustering together with an unsupervised latent-variable graphical model to align speakers in an audio channel with their locations in a video. Kong et al. [108] trained a Markov random field to align objects in 3D scenes to nouns and pronouns in text descriptions.

Deep learning based approaches are becoming popular for explicit alignment (specifically for measuring similarity) due to the very recent availability of aligned datasets in the language and vision communities [133], [161]. Zhu et al. [252] aligned books with their corresponding movies/scripts by training a CNN to measure similarities between scenes and text. Mao et al. [133] used an LSTM language model and a CNN visual one to evaluate the quality of a match between a referring expression and an object in an image. Yu et al. [242] extended this model to include relative appearance and context information that allows for better disambiguation between objects of the same type. Finally, Hu et al. [85] used an LSTM based scoring function to find similarities between image regions and their descriptions.



5.2 Implicit alignment

In contrast to explicit alignment, implicit alignment is used as an intermediate (often latent) step for another task. This allows for better performance in a number of tasks including speech recognition, machine translation, media description, and visual question-answering. Such models do not explicitly align data and do not rely on supervised alignment examples, but learn how to latently align the data during model training. We identify two types of implicit alignment models: earlier work based on graphical models, and more modern neural network methods.

Graphical models were used in some early work to better align words between languages for machine translation [216] and to align speech phonemes with their transcriptions [186]. However, they require manual construction of a mapping between the modalities, for example a generative phone model that maps phonemes to acoustic features [186]. Constructing such models requires training data or human expertise to define them manually.

Neural networks. Translation (Section 4) is an example of a modeling task that can often be improved if alignment is performed as a latent intermediate step. As we mentioned before, neural networks are a popular way to address this translation problem, using either an encoder-decoder model or cross-modal retrieval. When translation is performed without implicit alignment, it ends up putting a lot of weight on the encoder module to be able to properly summarize the whole image, sentence or video with a single vectorial representation.




A very popular way to address this is through attention [12], which allows the decoder to focus on sub-components of the source instance. This is in contrast with encoding all source sub-components together, as is performed in a conventional encoder-decoder model. An attention module will tell the decoder to look more at targeted sub-components of the source to be translated: areas of an image [230], words of a sentence [12], segments of an audio sequence [35], [39], frames and regions in a video [236], [241], and even parts of an instruction [140]. For example, in image captioning, instead of encoding an entire image using a CNN, an attention mechanism will allow the decoder (typically an RNN) to focus on particular parts of the image when generating each successive word [230]. The attention module, which learns what part of the image to focus on, is typically a shallow neural network and is trained end-to-end together with a target task (e.g., translation).
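The idea of weighting source sub-components can be sketched with a few lines of Python. This is a simplified illustration, not the model of any cited paper: the scoring function is a plain dot product rather than a learned shallow network, and the names `attend`, `query`, and `sources` are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, sources):
    """Soft attention of one decoder state over source sub-components.

    query   -- current decoder state (list of floats)
    sources -- one vector per sub-component, e.g. per image region or word
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each sub-component against the decoder state (dot product here;
    # a trained model would typically use a small learned network instead).
    scores = [sum(q * s for q, s in zip(query, src)) for src in sources]
    weights = softmax(scores)
    dim = len(sources[0])
    # The context vector is what the decoder consumes at this time step.
    context = [sum(w * src[k] for w, src in zip(weights, sources))
               for k in range(dim)]
    return weights, context
```

Because the weights are recomputed for every decoding step, the decoder no longer depends on a single fixed summary of the source, and the weights themselves can be read as a latent alignment.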

Attention models have also been successfully applied to question answering tasks, as they allow for aligning the words in a question with sub-components of an information source such as a piece of text [228], an image [62], or a video sequence [246]. This both allows for better performance in question answering and leads to better model interpretability [4]. In particular, different types of attention models have been proposed to address this problem, including hierarchical [128], stacked [234], and episodic memory attention [228].



Another neural alternative for aligning images with captions for cross-modal retrieval was proposed by Karpathy et al. [98], [99]. Their proposed model aligns sentence fragments to image regions by using a dot product similarity measure between image region and word representations. While it does not use attention, it extracts a latent alignment between modalities through a similarity measure that is learned indirectly by training a retrieval model.


5.3 Discussion

Multimodal alignment faces a number of difficulties:

1) there are few datasets with explicitly annotated alignments;

2) it is difficult to design similarity metrics between modalities;

3) there may exist multiple possible alignments and not all elements in one modality have correspondences in another.

Earlier work on multimodal alignment focused on aligning multimodal sequences in an unsupervised manner using graphical models and dynamic programming techniques. It relied on hand-defined measures of similarity between the modalities or learnt them in an unsupervised manner. With the recent availability of labeled training data, supervised learning of similarities between modalities has become possible. However, unsupervised techniques of learning to jointly align and translate or fuse data have also become popular.




6 Fusion

Multimodal fusion is one of the original topics in multimodal machine learning, with previous surveys emphasizing early, late and hybrid fusion approaches [50], [247]. In technical terms, multimodal fusion is the concept of integrating information from multiple modalities with the goal of predicting an outcome measure: a class (e.g., happy vs. sad) through classification, or a continuous value (e.g., positivity of sentiment) through regression. It is one of the most researched aspects of multimodal machine learning, with work dating back 25 years [243].

The interest in multimodal fusion arises from three main benefits it can provide. First, having access to multiple modalities that observe the same phenomenon may allow for more robust predictions. This has been especially explored and exploited by the AVSR community [163]. Second, having access to multiple modalities might allow us to capture complementary information, something that is not visible in individual modalities on their own. Third, a multimodal system can still operate when one of the modalities is missing, for example recognizing emotions from the visual signal when the person is not speaking [50].



Multimodal fusion has a very broad range of applications, including audio-visual speech recognition (AVSR) [163], multimodal emotion recognition [192], medical image analysis [89], and multimedia event detection [117]. There are a number of reviews on the subject [11], [163], [188], [247]. Most of them concentrate on multimodal fusion for a particular task, such as multimedia analysis, information retrieval or emotion recognition. In contrast, we concentrate on the machine learning approaches themselves and the technical challenges associated with these approaches.

While some prior work used the term multimodal fusion to include all multimodal algorithms, in this survey paper we classify an approach as fusion when the multimodal integration is performed at the later prediction stages, with the goal of predicting outcome measures. In recent work, the line between multimodal representation and fusion has been blurred for models such as deep neural networks, where representation learning is interlaced with classification or regression objectives. As we will describe in this section, this line is clearer for other approaches such as graphical models and kernel-based methods.

We classify multimodal fusion into two main categories: model-agnostic approaches (Section 6.1) that are not directly dependent on a specific machine learning method; and model-based approaches (Section 6.2) that explicitly address fusion in their construction, such as kernel-based approaches, graphical models, and neural networks. An overview of such approaches can be seen in Table 5.




 Table 5: A summary of our taxonomy of multimodal fusion approaches. OUT — output type (class — classification or reg — regression), TEMP — is temporal modeling possible.


6.1 Model-agnostic approaches

Historically, the vast majority of multimodal fusion has been done using model-agnostic approaches [50]. Such approaches can be split into early (i.e., feature-based), late (i.e., decision-based) and hybrid fusion [11]. Early fusion integrates features immediately after they are extracted (often by simply concatenating their representations). Late fusion, on the other hand, performs integration after each of the modalities has made a decision (e.g., classification or regression). Finally, hybrid fusion combines outputs from early fusion and individual unimodal predictors. An advantage of model-agnostic approaches is that they can be implemented using almost any unimodal classifiers or regressors.

Early fusion could be seen as an initial attempt by multimodal researchers to perform multimodal representation learning, as it can learn to exploit the correlation and interactions between low level features of each modality. Furthermore, it only requires the training of a single model, making the training pipeline easier compared to late and hybrid fusion.



In contrast, late fusion uses unimodal decision values and fuses them using a fusion mechanism such as averaging [181], voting schemes [144], weighting based on channel noise [163] and signal variance [53], or a learned model [68], [168]. It allows for the use of different models for each modality as different predictors can model each individual modality better, allowing for more flexibility. Furthermore, it makes it easier to make predictions when one or more of the modalities is missing and even allows for training when no parallel data is available. However, late fusion ignores the low level interaction between the modalities.
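The difference between the two fusion points can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the function names are hypothetical, the unimodal predictors are assumed to already exist and output class probabilities, and the fusion weights are fixed by hand rather than derived from channel noise or learned.

```python
from collections import Counter

def early_fusion(features_by_modality):
    """Early (feature-level) fusion: concatenate per-modality feature vectors.

    The fused vector would then be fed to a single model.
    """
    fused = []
    for features in features_by_modality:
        fused.extend(features)
    return fused

def late_fusion_average(probs_by_modality, weights=None):
    """Late (decision-level) fusion: weighted average of class probabilities.

    A modality whose prediction is None (missing at test time) is skipped,
    which is one reason late fusion copes well with missing modalities.
    """
    if weights is None:
        weights = [1.0] * len(probs_by_modality)
    present = [(p, w) for p, w in zip(probs_by_modality, weights)
               if p is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    total = sum(w for _, w in present)
    num_classes = len(present[0][0])
    return [sum(w * p[c] for p, w in present) / total
            for c in range(num_classes)]

def late_fusion_vote(labels):
    """Late fusion by majority vote over per-modality predicted labels."""
    return Counter(labels).most_common(1)[0][0]
```

With audio probabilities `[0.6, 0.4]` and video probabilities `[0.2, 0.8]`, unweighted averaging favors the second class, while passing `None` for the video simply falls back to the audio decision. Note that none of the late fusion functions ever sees the raw features, which is exactly why low level cross-modal interactions are lost.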

Hybrid fusion attempts to exploit the advantages of both of the above described methods in a common framework. It has been used successfully for multimodal speaker identification [226] and multimedia event detection (MED) [117].



6.2 Model-based approaches

While model-agnostic approaches are easy to implement using unimodal machine learning methods, they end up using techniques that are not designed to cope with multimodal data. In this section we describe three categories of approaches that are designed to perform multimodal fusion: kernel-based methods, graphical models, and neural networks.

Multiple kernel learning (MKL) methods are an extension to kernel support vector machines (SVM) that allow for the use of different kernels for different modalities/views of the data [70]. As kernels can be seen as similarity functions between data points, modality-specific kernels in MKL allow for better fusion of heterogeneous data.
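As a rough illustration of modality-specific kernels, the sketch below combines one kernel per modality into a single similarity function as a non-negative weighted sum, which is itself a valid kernel. The weights `betas` are fixed by hand here, whereas an actual MKL method would learn them together with the classifier; the modality names, kernel choices, and function names are all assumptions.

```python
import math

def linear_kernel(a, b):
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) similarity; gamma is a hand-picked bandwidth."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

def combined_kernel(sample_a, sample_b, kernels, betas):
    """K(a, b) = sum over modalities m of betas[m] * K_m(a[m], b[m]).

    Each sample is a dict mapping a modality name to its feature vector.
    """
    return sum(betas[m] * kernels[m](sample_a[m], sample_b[m])
               for m in kernels)

def gram_matrix(samples, kernels, betas):
    """Pairwise combined-kernel matrix over a list of multimodal samples."""
    return [[combined_kernel(a, b, kernels, betas) for b in samples]
            for a in samples]
```

A Gram matrix built this way could be handed to any kernel SVM implementation that accepts precomputed kernels; choosing a different kernel per modality is what lets heterogeneous features be compared on their own terms.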

MKL approaches have been an especially popular method for fusing visual descriptors for object detection [31], [66] and only recently have been overtaken by deep learning methods for the task [109]. They have also seen use for multimodal affect recognition [36], [90], [182], multimodal sentiment analysis [162], and multimedia event detection (MED) [237]. Furthermore, McFee and Lanckriet [137] proposed to use MKL to perform musical artist similarity ranking from acoustic, semantic and social view data. Finally, Liu et al. [125] used MKL for multimodal fusion in Alzheimer’s disease classification. Their broad applicability demonstrates the strength of such approaches in various domains and across different modalities.




Besides flexibility in kernel selection, an advantage of MKL is the fact that the loss function is convex, allowing for model training using standard optimization packages and global optimum solutions [70]. Furthermore, MKL can be used to both perform regression and classification. One of the main disadvantages of MKL is the reliance on training data (support vectors) during test time, leading to slow inference and a large memory footprint.

Graphical models are another family of popular methods for multimodal fusion. In this section we overview work done on multimodal fusion using shallow graphical models. A description of deep graphical models such as deep belief networks can be found in Section 3.1.

The majority of graphical models can be classified into two main categories: generative, modeling joint probability; or discriminative, modeling conditional probability [200]. Some of the earliest approaches to use graphical models for multimodal fusion include generative models such as coupled [149] and factorial hidden Markov models [67] alongside dynamic Bayesian networks [64]. A more recently proposed multi-stream HMM method uses dynamic weighting of modalities for AVSR [75].

Arguably, generative models lost popularity to discriminative ones such as conditional random fields (CRF) [115], which sacrifice the modeling of joint probability for predictive power. A CRF model was used to better segment images by combining visual and textual information of image description [60]. CRF models have been extended to model latent states using hidden conditional random fields [165] and have been applied to multimodal meeting segmentation [173]. Other multimodal uses of latent variable discriminative graphical models include multi-view hidden CRF [194] and latent variable models [193]. More recently, Jiang et al. [93] have shown the benefits of multimodal hidden conditional random fields for the task of multimedia classification. While most graphical models are aimed at classification, CRF models have been extended to a continuous version for regression [164] and applied in multimodal settings [13] for audio-visual emotion recognition.





The benefit of graphical models is their ability to easily exploit spatial and temporal structure of the data, making them especially popular for temporal modeling tasks, such as AVSR and multimodal affect recognition. They also allow human expert knowledge to be built into the models, and often lead to interpretable models.

Neural networks have been used extensively for the task of multimodal fusion [151]. The earliest examples of using neural networks for multimodal fusion come from work on AVSR [163]. Nowadays they are being used to fuse information for visual and media question answering [63], [130], [229], gesture recognition [150], affect analysis [96], [153], and video description generation [94]. While the modalities used, architectures, and optimization techniques might differ, the general idea of fusing information in a joint hidden layer of a neural network remains the same.

Neural networks have also been used for fusing temporal multimodal information through the use of RNNs and LSTMs. One of the earlier such applications used a bidirectional LSTM to perform audio-visual emotion classification [224]. More recently, Wöllmer et al. [223] used LSTM models for continuous multimodal emotion recognition, demonstrating their advantage over graphical models and SVMs. Similarly, Nicolaou et al. [152] used LSTMs for continuous emotion prediction. Their proposed method used an LSTM to fuse the results from modality-specific (audio and facial expression) LSTMs.




Approaching modality fusion through recurrent neural networks has been used in various image captioning tasks. Example models include neural image captioning [214], where a CNN image representation is decoded using an LSTM language model, and gLSTM [91], which incorporates the image data together with sentence decoding at every time step, fusing the visual and sentence data in a joint representation. A more recent example is the multi-view LSTM (MV-LSTM) model proposed by Rajagopalan et al. [166]. The MV-LSTM model allows for flexible fusion of modalities in the LSTM framework by explicitly modeling the modality-specific and cross-modality interactions over time.

A big advantage of deep neural network approaches in data fusion is their capacity to learn from large amounts of data. Secondly, recent neural architectures allow for end-to-end training of both the multimodal representation component and the fusion component. Finally, they show good performance when compared to non-neural-network-based systems and are able to learn complex decision boundaries that other approaches struggle with.

The major disadvantage of neural network approaches is their lack of interpretability. It is difficult to tell what the prediction relies on, and which modalities or features play an important role. Furthermore, neural networks require large training datasets to be successful.




6.3 Discussion

Multimodal fusion has been a widely researched topic with a large number of approaches proposed to tackle it, including model-agnostic methods, graphical models, multiple kernel learning, and various types of neural networks. Each approach has its own strengths and weaknesses, with some more suited for smaller datasets and others performing better in noisy environments. Most recently, neural networks have become a very popular way to tackle multimodal fusion; however, graphical models and multiple kernel learning are still being used, especially in tasks with limited training data or where model interpretability is important.


Despite these advances multimodal fusion still faces the following challenges:

1) signals might not be temporally aligned (possibly dense continuous signal and a sparse event);

2) it is difficult to build models that exploit supplementary and not only complementary information;

3) each modality might exhibit different types and different levels of noise at different points in time.





7 Co-learning

The final multimodal challenge in our taxonomy is co-learning: aiding the modeling of a (resource poor) modality by exploiting knowledge from another (resource rich) modality. It is particularly relevant when one of the modalities has limited resources, such as a lack of annotated data, noisy input, or unreliable labels. We call this challenge co-learning as most often the helper modality is used only during model training and is not used during test time.

We identify three types of co-learning approaches based on their training resources: parallel, non-parallel, and hybrid. Parallel-data approaches require training datasets where the observations from one modality are directly linked to the observations from other modalities. In other words, the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples are from the same speaker. In contrast, non-parallel data approaches do not require direct links between observations from different modalities. These approaches usually achieve co-learning by using overlap in terms of categories. For example, in zero shot learning the conventional visual object recognition dataset is expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the modalities are bridged through a shared modality or a dataset. An overview of methods in co-learning can be seen in Table 6 and a summary of data parallelism in Figure 3.


7.1 Parallel data

In parallel data co-learning both modalities share a set of instances: audio recordings with the corresponding videos, images and their sentence descriptions. This allows for two types of algorithms to exploit that data to better model the modalities: co-training and representation learning.

Co-training is the process of creating more labeled training samples when we have few labeled samples in a multimodal problem [21]. The basic algorithm builds weak classifiers in each modality to bootstrap each other with labels for the unlabeled data. In the seminal work of Blum and Mitchell [21], it was shown to discover more training samples for web-page classification based on the web page itself and the hyperlinks leading to it. By definition this task requires parallel data as it relies on the overlap of multimodal samples.
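The bootstrapping loop can be sketched as follows. This is a deliberately simplified illustration, not the algorithm as specified in [21]: each view is a single number, the weak classifier is a nearest-centroid rule, confidence is a distance margin, and every name in the code is hypothetical.

```python
def centroid_fit(points, labels):
    """Weak classifier for one view: store the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def centroid_predict(model, x):
    """Return (label, confidence); confidence is the margin between the
    nearest and second-nearest class centroid."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    return best, abs(x - model[ranked[1]]) - abs(x - model[best])

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of
    (view_a, view_b). Each view's classifier labels its most confident
    unlabeled samples, which then become training data for both views."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = centroid_fit([x[view] for x, _ in labeled],
                                 [y for _, y in labeled])
            ranked = sorted(pool,
                            key=lambda x: centroid_predict(model, x[view])[1],
                            reverse=True)
            for x in ranked[:per_round]:
                label, _ = centroid_predict(model, x[view])
                labeled.append((x, label))
                pool.remove(x)
    return labeled, pool
```

The loop only works because every unlabeled sample carries both views: a label assigned from one view immediately becomes supervision for the other, which is why parallel data is required.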



Figure 3: Types of data parallelism used in co-learning: parallel (modalities are from the same dataset and there is a direct correspondence between instances); non-parallel (modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts); hybrid (the instances or concepts are bridged by a third modality or a dataset).


Co-training has been used for statistical parsing [178], to build better visual detectors [120], and for audio-visual speech recognition [40]. It has also been extended to deal with disagreement between modalities by filtering out unreliable samples [41]. While co-training is a powerful method for generating more labeled data, it can also lead to biased training samples, resulting in overfitting.

Transfer learning is another way to exploit co-learning with parallel data. Multimodal representation learning (Section 3.1) approaches such as multimodal deep Boltzmann machines [198] and multimodal autoencoders [151] transfer information from the representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time [151].

Moon et al. [143] show how to transfer information from a speech recognition neural network (based on audio) to a lip-reading one (based on images), leading to a better visual representation, and a model that can be used for lip-reading without need for audio information during test time. Similarly, Arora and Livescu [10] build better acoustic features using CCA on acoustic and articulatory (location of lips, tongue and jaw) data. They use articulatory data only during CCA construction and use only the resulting acoustic (unimodal) representation during test time.



7.2 Non-parallel data

Methods that rely on non-parallel data do not require the modalities to have shared instances, but only shared categories or concepts. Non-parallel co-learning approaches can help when learning representations, allow for better semantic concept understanding and even perform unseen object recognition.

 Table 6: A summary of co-learning taxonomy, based on data parallelism. Parallel data — multiple modalities can see the same instance. Non-parallel data — unimodal instances are independent of each other. Hybrid data — the modalities are pivoted through a shared modality or dataset.



Transfer learning is also possible on non-parallel data and allows better representations to be learned by transferring information from a representation built using a data rich or clean modality to a data scarce or noisy modality. This type of transfer learning is often achieved by using coordinated multimodal representations (see Section 3.2). For example, Frome et al. [61] used text to improve visual representations for image classification by coordinating CNN visual features with word2vec textual ones [141] trained on separate large datasets. Visual representations trained in such a way result in more meaningful errors, such as mistaking objects for ones of a similar category [61]. Mahasseni and Todorovic [129] demonstrated how to regularize a color video based LSTM using an autoencoder LSTM trained on 3D skeleton data by enforcing similarities between their hidden states. Such an approach is able to improve the original LSTM and lead to state-of-the-art performance in action recognition.

Conceptual grounding refers to learning semantic meanings or concepts not purely based on language but also on additional modalities such as vision, sound, or even smell [16]. While the majority of concept learning approaches are purely language-based, representations of meaning in humans are not merely a product of our linguistic exposure, but are also grounded through our sensorimotor experience and perceptual system [17], [126]. Human semantic knowledge relies heavily on perceptual information [126] and many concepts are grounded in the perceptual system and are not purely symbolic [17]. This implies that learning semantic meaning purely from textual information might not be optimal, and motivates the use of visual or acoustic cues to ground our linguistic representations.


Starting from work by Feng and Lapata [59], grounding is usually performed by finding a common latent space between the representations [59], [183] (in the case of parallel datasets) or by learning unimodal representations separately and then concatenating them to lead to a multimodal one [29], [101], [172], [181] (in the case of non-parallel data). Once a multimodal representation is constructed it can be used on purely linguistic tasks. Shutova et al. [181] and Bruni et al. [29] used grounded representations for better classification of metaphors and literal language. Such representations have also been useful for measuring conceptual similarity and relatedness, that is, identifying how semantically or conceptually related two words [30], [101], [183] or actions [172] are. Furthermore, concepts can be grounded not only using visual signals, but also acoustic ones, leading to better performance especially on words with auditory associations [103], or even olfactory signals [102] for words with smell associations. Finally, there is a lot of overlap between multimodal alignment and conceptual grounding, as aligning visual scenes to their descriptions leads to better textual or visual representations [108], [161], [172], [240].

Conceptual grounding has been found to be an effective way to improve performance on a number of tasks. It also shows that language and vision (or audio) are complementary sources of information and combining them in multimodal models often improves performance. However, one has to be careful as grounding does not always lead to better performance [102], [103], and only makes sense when grounding has relevance for the task, such as grounding using images for visually-related concepts.



Zero shot learning (ZSL) refers to recognizing a concept without having explicitly seen any examples of it, for example classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address, as in a number of tasks, such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest.

There are two main types of ZSL — unimodal and multimodal. The unimodal ZSL looks at component parts or attributes of the object, such as phonemes to recognize an unheard word or visual attributes such as color, size, and shape to predict an unseen visual class [55]. The multimodal ZSL recognizes the objects in the primary modality through the help of the secondary one — in which the object has been seen. The multimodal version of ZSL is a problem facing non-parallel data by definition as the overlap of seen classes is different between the modalities.

Socher et al. [190] map image features to a conceptual word space and are able to classify between seen and unseen concepts. The unseen concepts can then be assigned to a word that is close to the visual representation; this is enabled by the semantic space being trained on a separate dataset that has seen more concepts. Instead of learning a mapping from visual to concept space, Frome et al. [61] learn a coordinated multimodal representation between concepts and images that allows for ZSL. Palatucci et al. [158] predict the words people are thinking of based on functional magnetic resonance images, and show how it is possible to predict unseen words through the use of an intermediate semantic space. Lazaridou et al. [118] present a fast mapping method for ZSL by mapping extracted visual feature vectors to text-based vectors through a neural network.
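The common pattern in these methods, mapping a visual feature into a semantic word space and picking the nearest word, can be sketched as below. This is a toy illustration under stated assumptions: the projection matrix is given rather than learned, cosine similarity is used as the closeness measure, and all names and vectors are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def project(matrix, features):
    """Linear map from visual feature space to word space (rows = outputs).

    In practice this mapping would be trained on the seen classes only.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in matrix]

def zero_shot_classify(features, matrix, word_vectors):
    """Pick the class whose word vector is closest to the projected image.

    `word_vectors` may contain classes with no training images at all,
    because the word space was built from a separate text corpus.
    """
    embedded = project(matrix, features)
    return max(word_vectors,
               key=lambda name: cosine(embedded, word_vectors[name]))
```

If `word_vectors` holds `dog`, `truck`, and an unseen class `cat` whose vector lies near `dog`, an image that projects between them can still be labeled `cat`, even though the mapping was never trained on cat images.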




7.3 Hybrid data

In the hybrid data setting two non-parallel modalities are bridged by a shared modality or a dataset (see Figure 3c). The most notable example is the Bridge Correlational Neural Network [167], which uses a pivot modality to learn coordinated multimodal representations in presence of non-parallel data. For example, in the case of multilingual image captioning, the image modality would always be paired with at least one caption in any language. Such methods have also been used to bridge languages that might not have parallel corpora but have access to a shared pivot language, such as for machine translation [148], [167] and document transliteration [100].


Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to achieve better performance in a task that only contains limited annotated data. Socher and Fei-Fei [189] use the existence of large text corpora in order to guide image segmentation, while Hendricks et al. [78] use a separately trained visual model and a language model to build a better image and video description system, for which only limited data is available.


7.4 Discussion

Multimodal co-learning allows for one modality to influence the training of another, exploiting the complementary information across modalities. It is important to note that co-learning is task independent and could be used to create better fusion, translation, and alignment models. This challenge is exemplified by algorithms such as co-training, multimodal representation learning, conceptual grounding, and zero shot learning (ZSL) and has found many applications in visual classification, action recognition, audio-visual speech recognition, and semantic similarity estimation.


8 Conclusion

As part of this survey, we introduced a taxonomy of multimodal machine learning: representation, translation, fusion, alignment, and co-learning.

Some of them, such as fusion, have been studied for a long time, but more recent interest in representation and translation has led to a large number of new multimodal algorithms and exciting multimodal applications.

We believe that our taxonomy will help to catalog future research papers and also better understand the remaining unresolved problems facing multimodal machine learning.










