Paper review: Multimodal Transformer for Unaligned Multimodal Language Sequences

This work addresses the multimodal nature of human language, which mixes natural language, facial gestures, and acoustic behaviors. The main challenges are the inherent misalignment of data across modalities and long-range dependencies between elements. To address them, the Multimodal Transformer (MulT) is proposed, which uses directional pairwise crossmodal attention to handle unaligned data. Experiments show that MulT outperforms existing methods on both aligned and unaligned sequence tasks and captures correlated crossmodal signals.

Multimodal Transformer for Unaligned Multimodal Language Sequences

Published at ACL, a CCF Class A conference.

Summary

Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent non-alignment of the data due to variable sampling rates across modalities; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to address both issues in an end-to-end manner without explicitly aligning the data. At the core of our model is directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism is able to capture correlated crossmodal signals.
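The crossmodal attention described in the abstract can be sketched as scaled dot-product attention in which the queries come from the target modality and the keys/values from the source modality, so the two sequences may have different lengths (sampling rates) without explicit alignment. A minimal NumPy sketch, not the authors' implementation (the `crossmodal_attention` helper, dimensions, and shared-projection simplification are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source, d_k):
    """Attend from a target modality (queries) to a source modality (keys/values).

    target: (T_t, d_k) sequence, e.g. language features
    source: (T_s, d_k) sequence, e.g. audio features at a different rate
    Because attention spans all time steps of both sequences, no explicit
    word-level alignment between the modalities is required.
    """
    Q, K, V = target, source, source        # learned projections omitted for brevity
    scores = Q @ K.T / np.sqrt(d_k)         # (T_t, T_s) crossmodal affinities
    return softmax(scores, axis=-1) @ V     # target steps enriched with source info

rng = np.random.default_rng(0)
lang = rng.standard_normal((5, 8))    # 5 language time steps
audio = rng.standard_normal((9, 8))   # 9 audio time steps (different sampling rate)
out = crossmodal_attention(lang, audio, d_k=8)
print(out.shape)  # (5, 8): one fused vector per language step
```

In MulT such attention is applied in both directions for every pair of modalities (hence "directional pairwise"), and the outputs are stacked into deeper crossmodal transformer layers.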

Research Objective

We examine whether faces and voices encode redundant identity information, and measure to what extent.

Background and Problems

  • Background

    • We humans often deduce various, albeit perhaps crude, pieces of information from the voices of others, such as gender, approximate age, and even personality.
  • Brief introduction to previous methods

    • Neuroscientists have observed that the multimodal associations of faces and voices play a role in perceptual tasks such as speaker recognition [19,14,44].
  • Problem Statement

    • Not explicitly stated in the introduction.

Main work

  1. We provide an extensive human-subject study, with both a larger participant pool and a larger dataset than in prior studies.
  2. We learn the co-embedding of modal representations of human faces and voices, and evaluate the learned representations extensively, revealing unsupervised correlations to demographic, prosodic, and facial features.
  3. We present a new dataset of audiovisual recordings of speeches by 181 individuals with diverse demographic backgrounds, totaling over 3 hours of recordings, with demographic annotations.

Work limitations: the authors' own dataset is not very large.

Related work

  • Human capability for face-voice association:

    • The study of Campanella and Belin [5] reveals that humans leverage the interface between facial and vocal information for both person recognition and identity processing.
  • Audiovisual cross-modal learning by machinery:

    • Nagrani et al. [25] recently presented a computational model for the face-voice matching task. While they treat it as a binary decision problem, this paper focuses more on the shared information between the two modalities and extracts it as a representation vector residing in a shared latent space, in which the task is modeled as a nearest-neighbor search.
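Framing the matching task as nearest-neighbor search in a shared latent space can be sketched as follows. The embeddings below are random stand-ins for learned face/voice representations, and `match_voice_to_face` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def l2_normalize(x):
    # scale vectors to unit length so the dot product equals cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_voice_to_face(voice_emb, face_embs):
    """Return the index of the candidate face embedding nearest to the voice embedding.

    With unit-normalized vectors, maximizing cosine similarity is equivalent to
    minimizing Euclidean distance, so a single dot product per candidate suffices.
    """
    v = l2_normalize(voice_emb)
    F = l2_normalize(face_embs)
    return int(np.argmax(F @ v))

rng = np.random.default_rng(1)
faces = rng.standard_normal((4, 16))               # 4 candidate face embeddings
voice = faces[2] + 0.05 * rng.standard_normal(16)  # voice embedding near face #2
print(match_voice_to_face(voice, faces))  # 2
```

The binary V → F task then reduces to running this search over just two candidates.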

Method(s)

  • Method 1: Study on Human Performance

      1. Participants were presented with photographs of two different models and a 10-second voice recording of one of the models. They were asked to choose exactly one of the two faces they thought would have a voice similar to the recorded voice (V → F).
      2. Dataset: Amazon Mechanical Turk (the participants fill out a survey about their gender, age, and so on).
      3. The results show that participants were able to match the voice of an unfamiliar person to a static facial image of the same person at better-than-chance levels.
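Whether accuracy in such a two-alternative forced-choice task really beats chance (50%) can be checked with a one-sided exact binomial test. A stdlib-only sketch; the trial counts here are made up for illustration and are not the paper's numbers:

```python
from math import comb

def binom_p_value(successes, trials, p=0.5):
    """One-sided exact binomial test: P(X >= successes) under chance level p."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical example: 60 correct choices out of 100 two-alternative trials.
p_val = binom_p_value(60, 100)
print(p_val)  # below 0.05 -> better than chance at the usual significance level
```

A study would additionally aggregate over participants, but the per-condition logic is the same.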
  • Method 2: Cross-modal Metric Learning on Faces and Voices

      1. The attempt to learn cross-modal representations between faces and voices is inspired by the significance of the overlapping information in certain cognitive tasks such as identity recognition, as discussed earlier.
      2. Dataset: the VoxCeleb dataset [26] is used to train the network. From each clip, the first frame and the first 10 seconds of audio are used, as the beginning of a clip is usually well aligned with the beginning of an utterance.
      3. Networks: VGG16 [33] and SoundNet [2], which have shown sufficient model capacity while allowing stable training in a variety of applications.
      4. Results: see the figures in the original paper (omitted here).
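Cross-modal metric learning of this kind is commonly trained with a triplet-style loss that pulls a matching face-voice pair together and pushes a mismatched pair apart in the shared space. A hedged NumPy sketch; the hinge form and margin value are generic conventions and not necessarily the paper's exact objective:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding distances.

    anchor:   voice embedding
    positive: face embedding of the same identity
    negative: face embedding of a different identity
    The loss is zero once the positive is closer than the negative by `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

voice = np.array([1.0, 0.0])
same_face = np.array([0.9, 0.1])    # close to the voice embedding
other_face = np.array([-1.0, 0.5])  # far from the voice embedding
print(triplet_loss(voice, same_face, other_face))  # 0.0 (constraint satisfied)
```

During training, gradients of this loss would update the VGG16 and SoundNet branches so that matched pairs cluster in the co-embedding space.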

Conclusion

  • Main contributions
  1. First, with human subjects, establishing a baseline for how well people perform such tasks.
  2. Then, with deep neural networks, demonstrating that machines perform on par with humans.
  • Weak points
  1. However, the authors emphasize that, similar to lie detectors, such associations should not be used for screening purposes or as hard evidence. The work suggests the possibility of learning the associations by referring to a part of the human cognitive process, but not their definitive nature, which is likely far more complicated than it is modeled as in this work.
  • Further work
  1. Not discussed.

Takeaways for me

  • The related work and introduction of this paper are worth studying: they are easy to read and well reasoned. However, I found the method and experiment sections hard to understand.
  • The paper introduces its own dataset, but the published version is not available for download, so I could not retrieve it.
  • Chinese commentary: https://blog.csdn.net/weixin_44390691/article/details/105182181?utm_medium=distribute.pc_relevant.none-task-blog-title-2&spm=1001.2101.3001.4242