Multimodal Machine Learning: A Survey and Taxonomy

辉辉小学生

Article link: https://blog.csdn.net/huihuixiaoxue/article/details/125717509

Abstract

aim: build models that can process and relate information from multiple modalities

new taxonomy: representation, translation, alignment, fusion, and co-learning.

INTRODUCTION

three modalities:

(NLP) natural language, which can be either written or spoken;

(CV) visual signals, which are often represented as images or videos;

(SR) vocal signals, which encode sounds and para-verbal information such as prosody and vocal expressions.

challenges:

representation: represent and summarize multimodal data in a way that exploits the complementarity and redundancy of the modalities

translation: translate (map) data from one modality to another; the mapping is often open-ended or subjective

alignment: measure similarity between different modalities and deal with possible long-range dependencies and ambiguities

fusion: join information from two or more modalities to perform a prediction

co-learning: transfer knowledge between modalities, for example through co-training, conceptual grounding, and zero-shot learning

Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.

2 APPLICATIONS: A HISTORICAL PERSPECTIVE (omitted)

3 MULTIMODAL REPRESENTATIONS

Representing multiple modalities poses many difficulties:

how to combine the data from heterogeneous sources;

how to deal with different levels of noise;

and how to deal with missing data.

a number of properties of good representations:

smoothness (analogous to smoothing in NLP), temporal and spatial coherence, sparsity, and natural clustering, amongst others

additional desirable properties for multimodal representations:

similarity in the representation space should reflect the similarity of the corresponding concepts,

the representation should be easy to obtain even in the absence of some modalities

it should be possible to fill in missing modalities given the observed ones.

3.1 Joint Representations (project unimodal representations together into a multimodal space)

Mathematically, the joint representation is expressed as:

$$x_m = f(x_1, \ldots, x_n) \tag{1}$$

Neural networks

To construct a multimodal representation using neural networks, each modality starts with several individual neural layers, followed by a hidden layer that projects the modalities into a joint space. The joint multimodal representation is then passed through multiple hidden layers or used directly for prediction (such models are usually trained end to end).
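As a rough illustration of Eq. (1), the sketch below builds a joint representation in plain Python: each modality passes through its own layer, and the concatenated hidden vectors are projected into a shared space. The names `dense` and `joint_representation`, the layer sizes, and the weight values are hypothetical and chosen only for illustration; a real model would learn the weights end to end.

```python
import math

def dense(x, weights, bias):
    """One fully connected layer with tanh activation: tanh(W x + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

def joint_representation(x_text, x_image, params):
    """x_m = f(x_1, x_2): modality-specific layers, then a shared projection."""
    h_text = dense(x_text, params["W_text"], params["b_text"])
    h_image = dense(x_image, params["W_image"], params["b_image"])
    # Concatenate the unimodal hidden vectors and project them jointly.
    return dense(h_text + h_image, params["W_joint"], params["b_joint"])

# Hypothetical toy weights (not learned): 3-d text, 2-d image, 2-d joint space.
params = {
    "W_text": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    "b_text": [0.0, 0.1],
    "W_image": [[1.0, -1.0], [0.2, 0.4]],
    "b_image": [0.0, 0.0],
    "W_joint": [[0.6, -0.1, 0.3, 0.2], [-0.4, 0.7, 0.1, 0.5]],
    "b_joint": [0.0, 0.0],
}

x_m = joint_representation([1.0, 0.0, 2.0], [0.5, 0.5], params)
print(len(x_m))  # dimensionality of the joint space
```

Note that both inputs are required here, which mirrors the limitation that this kind of model does not handle a missing modality naturally.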

advantages:

the often superior performance of neural-network-based joint representations;

the ability to pre-train the representations in an unsupervised manner. The performance gain, however, depends on the amount of data available for training.

disadvantages:

the model cannot handle missing data naturally;

deep networks are often difficult to train.

Probabilistic graphical models

advantages:

generative nature, which allows for an easy way to deal with missing data (even if a whole modality is missing, the model has a natural way to cope);

the ability to generate samples of one modality in the presence of the other one, or of both modalities from the representation;

the representation can be trained in an unsupervised manner enabling the use of unlabeled data

disadvantages:

the difficulty of training them (high computational cost, and the need to use approximate variational training methods)

Sequential Representation (varying-length sequences) (omitted)

3.2 Coordinated Representations (learn separate representations for each modality but coordinate them through a constraint)

The coordinated representation is expressed as:

$$f(x_1) \sim g(x_2) \tag{2}$$

Similarity models (minimize the distance between modalities in the coordinated space)
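A minimal sketch of the similarity-model idea, assuming linear projections for f and g and squared Euclidean distance as the coordination constraint. The names `linear` and `coordination_loss` and the weight matrices are hypothetical; actual similarity models often use other distances or ranking losses.

```python
def linear(x, weights):
    """Project a vector with a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def coordination_loss(pairs, W_f, W_g):
    """Mean squared distance between f(x1) and g(x2) over paired samples."""
    total = 0.0
    for x1, x2 in pairs:
        total += squared_distance(linear(x1, W_f), linear(x2, W_g))
    return total / len(pairs)

# Hypothetical projections: f maps 3-d inputs and g maps 2-d inputs to a 2-d space.
W_f = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_g = [[1.0, 0.0], [0.0, 1.0]]

aligned = [([1.0, 2.0, 9.0], [1.0, 2.0])]
misaligned = [([1.0, 2.0, 9.0], [4.0, 6.0])]
print(coordination_loss(aligned, W_f, W_g))
print(coordination_loss(misaligned, W_f, W_g))
```

Training would adjust `W_f` and `W_g` to drive this loss down on paired data, so that each modality keeps its own projection while matching items land close together.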

Structured coordinated space (structured coordinated spaces are commonly used in cross-modal hashing)

Hashing enforces certain constraints on the resulting multimodal space:

1) it has to be an N-dimensional Hamming space, i.e. a binary representation with a controllable number of bits;

2) the same object from different modalities has to have a similar hash code;

3) the space has to be similarity-preserving.
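The three constraints can be made concrete with a small sketch, assuming sign-of-projection hashing (one simple way to produce binary codes, not necessarily the method used in any cited work). The projection matrices `P_image` and `P_text` and the feature vectors are hypothetical.

```python
def binary_code(x, projections):
    """Map a real vector to an N-bit code: bit i is 1 if projection i is positive."""
    return [
        1 if sum(w * xi for w, xi in zip(row, x)) > 0 else 0
        for row in projections
    ]

def hamming_distance(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

# Hypothetical per-modality projections into a shared 4-bit Hamming space.
P_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
P_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]

cat_image = binary_code([2.0, 1.0], P_image)      # image of a cat
cat_text = binary_code([1.5, 0.5, 3.0], P_text)   # text describing the same cat
car_text = binary_code([-1.0, 2.0, 0.0], P_text)  # text about a different object

print(hamming_distance(cat_image, cat_text))
print(hamming_distance(cat_image, car_text))
```

The number of rows in each projection matrix sets the number of bits (constraint 1), the two codes for the same object coincide (constraint 2), and the unrelated text falls at a larger Hamming distance (constraint 3).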

4 TRANSLATION

Given an entity in one modality the task is to generate the same entity in a different modality.

4.1 Example-based

Retrieval-based models (directly use the retrieved translation without modifying it)

Combination-based models (rely on more complex rules to create translations based on a number of retrieved instances)
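A retrieval-based model can be sketched as a nearest-neighbour lookup in a dictionary of paired examples. The version below assumes retrieval by squared Euclidean distance in the source feature space; the function `retrieve_translation`, the feature vectors, and the captions are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def retrieve_translation(query, dictionary):
    """Return the target-modality entry paired with the nearest source entry."""
    _, best_target = min(
        dictionary, key=lambda pair: squared_distance(query, pair[0])
    )
    return best_target

# Hypothetical dictionary of (image feature, caption) pairs.
dictionary = [
    ([0.9, 0.1], "a dog running on grass"),
    ([0.1, 0.9], "a plate of food on a table"),
    ([0.5, 0.5], "a person riding a bike"),
]

caption = retrieve_translation([0.8, 0.2], dictionary)
print(caption)
```

The retrieved caption is returned unchanged, which is what separates this from a combination-based model that would merge several retrieved instances using additional rules.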

4.2 Generative approaches

Grammar-based models

Encoder-decoder models

Continuous generation models

5 ALIGNMENT

5.1 Explicit alignment

Unsupervised

Supervised

5.2 Implicit alignment

Graphical models

Neural networks
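Neural approaches to implicit alignment commonly rely on attention, which learns soft correspondences between elements of two modalities as an intermediate step of another task. The sketch below assumes dot-product scoring followed by a softmax; the function names and the feature vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Soft attention: weight each value by its query-key dot-product score."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)
    ]
    return weights, context

# Hypothetical features: one word attending over three image regions.
word = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
weights, context = attend(word, regions, regions)
print(weights.index(max(weights)))  # index of the most strongly aligned region
```

The attention weights act as a latent alignment: no explicit word-to-region labels are provided, and the weights would be shaped only by the downstream training objective.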

6 FUSION

6.1 Model-agnostic approaches
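Model-agnostic fusion is usually divided into early (feature-level) and late (decision-level) fusion, with hybrid schemes combining the two. The sketch below contrasts the first two under simple assumptions: concatenation for early fusion and a weighted average of per-modality scores for late fusion. The function names, features, and scores are hypothetical.

```python
def early_fusion(features_a, features_b):
    """Feature-level fusion: concatenate unimodal features before any model."""
    return features_a + features_b

def late_fusion(scores, weights=None):
    """Decision-level fusion: weighted average of per-modality prediction scores."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

audio_features = [0.2, 0.7]
video_features = [0.9, 0.1, 0.4]
fused_features = early_fusion(audio_features, video_features)

# Hypothetical unimodal classifier outputs (probability of the positive class).
fused_score = late_fusion([0.5, 1.0], weights=[0.5, 0.5])
print(fused_features, fused_score)
```

Early fusion hands a single concatenated vector to one downstream model, whereas late fusion keeps a separate model per modality and merges only their outputs, so either step can be paired with any classifier or regressor.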

6.2 Model-based approaches

Multiple kernel learning (MKL)

Graphical models

Neural Networks

7 CO-LEARNING

7.1 Parallel data

Co-training

Transfer learning

7.2 Non-parallel data

Transfer learning

Conceptual grounding

Zero-shot learning

7.3 Hybrid data
