Boosting emotion recognition performance in speech using Contrastive Predictive Coding (CPC)

In this article, I will take you through how I developed an emotion recognition system that uses speech as input, and how I then boosted its performance using self-supervised representations trained with Contrastive Predictive Coding (CPC). Accuracy improved from a baseline of 71% to 80% when using CPC, a significant relative reduction in error of 30%.

View the full code here.

In addition, I benchmarked various architectures for the model trained on these representations, including simple multilayer perceptrons (MLPs), recurrent neural networks (RNNs) and WaveNet-style models that use dilated convolutions.

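To make these benchmark architectures concrete, below is a minimal PyTorch sketch of the three classifier families, each taking a sequence of feature frames and producing frame-wise emotion logits. The feature dimension, hidden sizes and layer counts are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the three downstream classifier families benchmarked
# on top of the learned representations. The feature dimension (256) and layer
# sizes are assumptions, not the article's exact configuration.
FEATURE_DIM = 256   # e.g. size of a CPC context vector or Fbank frame
NUM_EMOTIONS = 8    # RAVDESS emotion classes

class MLPClassifier(nn.Module):
    """Frame-wise MLP: classifies each feature frame independently."""
    def __init__(self, dim=FEATURE_DIM, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):          # x: (batch, time, dim)
        return self.net(x)         # (batch, time, n_classes)

class BiRNNClassifier(nn.Module):
    """Bi-directional GRU over the feature sequence, frame-wise predictions."""
    def __init__(self, dim=FEATURE_DIM, hidden=256, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        h, _ = self.rnn(x)         # (batch, time, 2 * hidden)
        return self.out(h)

class DilatedConvClassifier(nn.Module):
    """WaveNet-style stack of dilated 1-D convolutions."""
    def __init__(self, dim=FEATURE_DIM, channels=128, n_classes=NUM_EMOTIONS):
        super().__init__()
        layers, in_ch = [], dim
        for dilation in (1, 2, 4, 8):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                    # x: (batch, time, dim)
        h = self.convs(x.transpose(1, 2))    # Conv1d expects (batch, dim, time)
        return self.out(h).transpose(1, 2)   # back to (batch, time, n_classes)
```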

I found that a bi-directional RNN model using pre-trained CPC representations as input features was the highest-performing setup, reaching 79.6% frame-wise accuracy when classifying the eight emotions in the RAVDESS dataset. To the best of my knowledge, this is a very competitive system compared to others trained on this data.

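For reference, the frame-wise accuracy reported here scores each output frame against the utterance's emotion label. A minimal sketch of how such a metric can be computed is shown below; the function name and the assumption that every frame inherits the utterance's label are mine, not necessarily the exact evaluation used here.

```python
import torch

def frame_wise_accuracy(logits, labels):
    """Fraction of frames whose predicted emotion matches the utterance label.

    logits: (batch, time, n_classes) frame-level predictions
    labels: (batch,) one emotion label per utterance, broadcast to every frame
    """
    preds = logits.argmax(dim=-1)                    # (batch, time)
    targets = labels.unsqueeze(1).expand_as(preds)   # repeat label per frame
    return (preds == targets).float().mean().item()
```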

Introduction

Emotion recognition from speech involves predicting someone’s emotion from a set of classes such as happy, sad, angry, etc. There are many potential applications in businesses such as call centres, health care and human resources [1]. For example, in a call centre, it would provide an automated way of gauging the sentiment of potential customers and guide a sales representative towards a better sales approach.

Predicting an emotion from audio is challenging, since emotions are perceived differently from person to person and can often be difficult to interpret. In addition, many emotional cues come from areas unrelated to speech, such as facial expressions, the person’s particular mentality and the context of the interaction. As humans, we naturally take all of these signals into account, as well as our past communication experiences, before making a final judgment. Some authors improve performance using multimodal approaches where audio is combined with text [3,4] or video [5]. Ideally, a world model that understands the links between these areas and social interactions (see World Scopes in [2]) would be trained for this task. However, this is an ongoing area of research, and it is currently unclear how to learn from social interactions rather than just learning trends from the data itself. In this work, I boost performance by using self-supervised representations trained with a Contrastive Predictive Coding (CPC) [8] framework rather than multimodal training.

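To make the CPC objective concrete before it comes up again, here is a minimal PyTorch sketch of the framework from [8]: a convolutional encoder produces latent frames, an autoregressive network summarises them into context vectors, and prediction heads are trained with the InfoNCE loss to pick the true future latent out of a set of negatives. All dimensions, strides and the use of within-batch negatives are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPC(nn.Module):
    """Minimal Contrastive Predictive Coding sketch (after van den Oord et al. [8])."""
    def __init__(self, z_dim=256, c_dim=256, n_steps=4):
        super().__init__()
        # Strided conv encoder: raw waveform -> one latent frame per window
        self.encoder = nn.Sequential(
            nn.Conv1d(1, z_dim, kernel_size=10, stride=5, padding=3), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Autoregressive summariser producing the context vectors c_t
        self.ar = nn.GRU(z_dim, c_dim, batch_first=True)
        # One linear prediction head per future step k = 1 .. n_steps
        self.predictors = nn.ModuleList(nn.Linear(c_dim, z_dim) for _ in range(n_steps))

    def forward(self, wav):                       # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))        # (batch, z_dim, T)
        z = z.transpose(1, 2)                     # (batch, T, z_dim)
        c, _ = self.ar(z)                         # (batch, T, c_dim)
        return z, c

    def infonce_loss(self, z, c):
        batch, T, _ = z.shape
        loss = 0.0
        for k, head in enumerate(self.predictors, start=1):
            pred = head(c[:, :T - k])             # predict z_{t+k} from c_t
            target = z[:, k:]                     # true future latents
            # Score every prediction against every candidate latent; the
            # matching frame is the positive, all others act as negatives.
            p = pred.reshape(-1, pred.size(-1))
            t = target.reshape(-1, target.size(-1))
            logits = p @ t.t()                    # (N, N) similarity scores
            labels = torch.arange(p.size(0), device=p.device)
            loss = loss + F.cross_entropy(logits, labels)
        return loss / len(self.predictors)
```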

In the field of representation learning for speech, phone and speaker identification are widely used to evaluate features generated by self-supervised learning techniques, since they assess local and global structure in the audio respectively. In this article, I demonstrate that emotion recognition can also be used as a downstream task for gauging the quality of representations. Furthermore, classifying emotions complements phone and speaker identification when benchmarking how good representations are, since emotions only loosely depend on the words being said or how a person’s voice sounds.

Related work

Emotion recognition

The majority of emotion recognition systems [3,4,6] have been trained using Mel-Frequency Cepstral Coefficients (MFCCs), which are popular audio features based on a frequency spectrogram. Fbanks, also known as Mel spectrograms, are similar to MFCCs and are also widely used. Both capture the frequency content that humans are sensitive to. There has been little work showing the performance gains from using machine-learned features obtained through self-supervised learning on the emotion recognition task. It is worth noting that MFCCs and Fbanks can still be used as the input to the self-supervised task instead of raw audio, and can often be a good starting point when extracting richer representations. I will talk more about that later.

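For readers who want to see what these features look like in code, here is a small torchaudio sketch that computes both Fbanks (Mel spectrograms) and MFCCs from a waveform. The file name, sample rate handling and all frame settings are illustrative assumptions rather than the settings used in this work.

```python
import torch
import torchaudio

# Hypothetical input file; any mono or multi-channel WAV works the same way.
waveform, sample_rate = torchaudio.load("speech.wav")

# Fbanks (Mel spectrogram): energy in Mel-spaced frequency bands per frame.
fbank = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)                                              # (channels, 80, frames)

# MFCCs: a DCT applied on top of the log-Mel energies, keeping the first
# few coefficients as a compact description of the spectral envelope.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)(waveform)                                              # (channels, 13, frames)
```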
