(Source: Adaptive Multimodal Fusion for Facial Action Units Recognition, Poster Session E2: Emotional and Social Signals in Multimedia & Media Interpretation, 2020)
Words:
- Multi-modal learning: 多模态学习
- Unimodal data: 单模态数据
- Multi-modal data: 多模态数据
- Data fusion: 数据融合
- Modality: 模态(如视觉、听觉等)
- Sensor: 传感器
- Feature: 特征
- Embedding: 嵌入
- Convolutional neural network (CNN): 卷积神经网络
- Recurrent neural network (RNN): 循环神经网络
- Transformer: 变换器
- Attention mechanism: 注意力机制
- Fusion method: 融合方法
- Cross-modal retrieval: 跨模态检索
- Multi-task learning: 多任务学习
Sentences:
- We propose a multi-modal learning framework that combines visual and textual information to improve performance on the task of object recognition.
  (我们提出了一个多模态学习框架,将视觉和文本信息相结合,以提高物体识别的性能。)
- Our approach utilizes data fusion techniques to merge different modalities and create a more comprehensive representation of the underlying data.
  (我们的方法利用数据融合技术来合并不同的模态,创建更全面的数据表示。)
- By incorporating both audio and video input, our model achieves state-of-the-art performance on the task of speech recognition.
  (通过同时加入音频和视频输入,我们的模型在语音识别的任务上达到了最先进的性能。)
- The attention mechanism is used to selectively focus on the relevant features across different modalities.
  (使用注意力机制有选择地关注不同模态中相关的特征。)
- Our fusion method combines visual and textual features using a weighted sum approach based on the importance of each modality.
  (我们的融合方法根据每种模态的重要性,使用加权求和的方式来结合视觉和文本特征。)
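The weighted-sum fusion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-modality importance scores would normally be learned, but here they are fixed toy values.

```python
import math

def softmax(xs):
    """Normalize raw importance scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum_fusion(features, scores):
    """Fuse per-modality feature vectors by a weighted sum.

    features: list of equal-length vectors, one per modality.
    scores:   raw importance score per modality (pre-softmax).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [0.0] * dim
    for w, vec in zip(weights, features):
        for i in range(dim):
            fused[i] += w * vec[i]
    return fused

visual = [1.0, 0.0, 2.0]   # toy visual feature vector
textual = [0.0, 1.0, 2.0]  # toy textual feature vector
# Equal scores give each modality a weight of 0.5:
fused = weighted_sum_fusion([visual, textual], scores=[0.0, 0.0])
# fused == [0.5, 0.5, 2.0]
```

The softmax keeps the modality weights positive and normalized, so the fused vector stays on the same scale as the inputs regardless of how many modalities are combined.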
- Multi-task learning is used to jointly train our model on multiple related tasks, which leads to improved performance on each individual task.
  (多任务学习用于在多个相关任务上联合训练我们的模型,从而提高每个单独任务的性能。)
- The encoder network converts the input data into a shared latent space where different modalities can be effectively integrated.
  (编码器网络将输入数据转换为共享的潜在空间,不同模态可以在其中有效地集成。)
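The "shared latent space" idea can be sketched as one linear projection per modality, each mapping an input of a different size into vectors of the same dimension. The matrices below are toy values standing in for learned encoder weights.

```python
def linear(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector."""
    return [sum(row[i] * vec[i] for i in range(len(vec))) for row in matrix]

# Visual input has 3 dims, text input has 4; both are projected
# into the same 2-dimensional shared latent space.
W_visual = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
W_text = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]

z_visual = linear(W_visual, [0.5, 0.25, 0.9])
z_text = linear(W_text, [0.5, 0.0, 0.25, 0.0])

# Once both modalities live in the same space, they can be
# compared or fused element-wise:
assert len(z_visual) == len(z_text)
```

The key point is that fusion operations such as the weighted sum only make sense after this step, since the raw visual and textual inputs have incompatible shapes.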
- Our model uses a cross-modal retrieval approach to match images with their corresponding captions.
  (我们的模型使用跨模态检索方法来匹配图像与其对应的文字描述。)
- The fusion layer combines the features from multiple modalities using a learned weighting scheme.
  (融合层使用学习到的加权方案来合并多种模态的特征。)
- Our proposed method achieves outstanding results on both image and speech recognition tasks, demonstrating the effectiveness of multi-modal learning.
  (我们提出的方法在图像和语音识别任务上都取得了优秀的结果,证明了多模态学习的有效性。)
- We explore the use of unsupervised pre-training for multi-modal learning, which leverages large amounts of unlabeled data to improve performance.
  (我们探讨了无监督预训练在多模态学习中的应用,利用大量未标注的数据来提高性能。)
- The attention-based fusion method combines the visual and textual information by selectively attending to the most informative parts of each modality.
  (基于注意力的融合方法通过选择性关注每种模态中最具信息量的部分来结合视觉和文本信息。)
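The attention step behind that sentence can be illustrated for one modality: scores come from dot products between a query and the modality's parts (e.g. image regions or text tokens), and the output is the attention-weighted sum of those parts. In a full fusion module, each modality's attended summary would then be combined; all vectors here are toy values, not learned.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, parts):
    """Return the attention-weighted combination of `parts` and the weights."""
    scores = [dot(query, p) for p in parts]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(parts[0])
    out = [sum(w * p[i] for w, p in zip(weights, parts)) for i in range(dim)]
    return out, weights

# Two "parts" of a modality; the query points toward the second
# part, so the second part receives the higher attention weight.
parts = [[1.0, 0.0], [0.0, 1.0]]
query = [0.0, 2.0]
fused, weights = attend(query, parts)
assert weights[1] > weights[0]
```

This is what "selectively attending to the most informative parts" means operationally: the softmax over dot-product scores concentrates weight on the parts most aligned with the query.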
- Our approach to multi-modal learning is motivated by the idea that different modalities provide complementary information that can improve performance on a given task.
  (我们的多模态学习方法基于这样一个想法,即不同的模态提供了互补信息,可以提高在给定任务上的性能。)
- The embedding layer transforms the raw input data into a low-dimensional space where the different modalities can be effectively compared and integrated.
  (嵌入层将原始输入数据转换为低维空间,不同模态可以在其中有效地比较和集成。)
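What "compared in a low-dimensional space" looks like in practice: items are mapped to short vectors, and similarity is measured geometrically, e.g. by cosine similarity. The embedding table below is a hand-made toy, standing in for a learned embedding layer.

```python
import math

# Toy 2-dimensional embedding table (a learned layer in practice).
EMBED = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb)

# In the embedded space, related items are closer than unrelated ones:
assert cosine(EMBED["cat"], EMBED["dog"]) > cosine(EMBED["cat"], EMBED["car"])
```

The same mechanism underlies the cross-modal retrieval sentence above: if images and captions are embedded into one space, matching a caption to an image reduces to a nearest-neighbor search under this similarity.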