Boosting emotion recognition performance in speech using Contrastive Predictive Coding (CPC)

In this article, I will take you through how I developed an emotion recognition system that uses speech as input, and how I then boosted its performance using self-supervised representations trained with Contrastive Predictive Coding (CPC). Accuracy improved from a baseline of 71% to 80% when using CPC, a significant relative reduction in error of 30%.

View the full code here.

In addition, I benchmarked various architectures for the model trained on these representations, including simple multilayer perceptrons (MLPs), recurrent neural networks (RNNs) and WaveNet-style models that use dilated convolutions.

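To make these benchmark architectures concrete, below is a minimal PyTorch sketch of the three classifier families, each taking a sequence of feature frames and producing frame-wise emotion logits. The feature dimension, hidden sizes and layer counts are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the three downstream classifier families benchmarked
# on top of the learned representations. The feature dimension (256) and layer
# sizes are assumptions, not the article's exact configuration.
FEATURE_DIM = 256   # e.g. size of a CPC context vector or Fbank frame
NUM_EMOTIONS = 8    # RAVDESS emotion classes

class MLPClassifier(nn.Module):
    """Frame-wise MLP: classifies each feature frame independently."""
    def __init__(self, dim=FEATURE_DIM, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):          # x: (batch, time, dim)
        return self.net(x)         # (batch, time, n_classes)

class BiRNNClassifier(nn.Module):
    """Bi-directional GRU over the feature sequence, frame-wise predictions."""
    def __init__(self, dim=FEATURE_DIM, hidden=256, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        h, _ = self.rnn(x)         # (batch, time, 2 * hidden)
        return self.out(h)

class DilatedConvClassifier(nn.Module):
    """WaveNet-style stack of dilated 1-D convolutions."""
    def __init__(self, dim=FEATURE_DIM, channels=128, n_classes=NUM_EMOTIONS):
        super().__init__()
        layers, in_ch = [], dim
        for dilation in (1, 2, 4, 8):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                    # x: (batch, time, dim)
        h = self.convs(x.transpose(1, 2))    # Conv1d expects (batch, dim, time)
        return self.out(h).transpose(1, 2)   # back to (batch, time, n_classes)
```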

I found that a bi-directional RNN model using pre-trained CPC representations as input features was the highest-performing setup, reaching 79.6% frame-wise accuracy when classifying the eight emotions in the RAVDESS dataset. To the best of my knowledge, this is a very competitive system compared to others trained on this data.

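For reference, the frame-wise accuracy reported here scores each output frame against the utterance's emotion label. A minimal sketch of how such a metric can be computed is shown below; the function name and the assumption that every frame inherits the utterance's label are mine, not necessarily the exact evaluation used here.

```python
import torch

def frame_wise_accuracy(logits, labels):
    """Fraction of frames whose predicted emotion matches the utterance label.

    logits: (batch, time, n_classes) frame-level predictions
    labels: (batch,) one emotion label per utterance, broadcast to every frame
    """
    preds = logits.argmax(dim=-1)                    # (batch, time)
    targets = labels.unsqueeze(1).expand_as(preds)   # repeat label per frame
    return (preds == targets).float().mean().item()
```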

Introduction

Emotion recognition from speech involves predicting someone’s emotion from a set of classes such as happy, sad, angry, etc. There are many potential applications in businesses such as call centres, health care and human resources [1]. For example, in a call centre, it would provide an automated way of gauging the sentiment of potential customers and guide a sales representative towards a better sales approach.

Predicting an emotion from audio is challenging, since emotions are perceived differently from person to person and can often be difficult to interpret. In addition, many emotional cues come from areas unrelated to speech, such as facial expressions, the person’s particular mentality and the context of the interaction. As humans, we naturally take all of these signals into account, as well as our past communication experiences, before making a final judgment. Some authors improve performance using multimodal approaches where audio is combined with text [3,4] or video [5]. Ideally, a world model that understands the links between these areas and social interactions (see World Scopes in [2]) would be trained for this task. However, this is an ongoing area of research, and it is currently unclear how to learn from social interactions rather than just learning trends from the data itself. In this work, I boost performance by using self-supervised representations trained with a Contrastive Predictive Coding (CPC) [8] framework rather than multimodal training.

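To make the CPC objective concrete before it comes up again, here is a minimal PyTorch sketch of the framework from [8]: a convolutional encoder produces latent frames, an autoregressive network summarises them into context vectors, and prediction heads are trained with the InfoNCE loss to pick the true future latent out of a set of negatives. All dimensions, strides and the use of within-batch negatives are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPC(nn.Module):
    """Minimal Contrastive Predictive Coding sketch (after van den Oord et al. [8])."""
    def __init__(self, z_dim=256, c_dim=256, n_steps=4):
        super().__init__()
        # Strided conv encoder: raw waveform -> one latent frame per window
        self.encoder = nn.Sequential(
            nn.Conv1d(1, z_dim, kernel_size=10, stride=5, padding=3), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Autoregressive summariser producing the context vectors c_t
        self.ar = nn.GRU(z_dim, c_dim, batch_first=True)
        # One linear prediction head per future step k = 1 .. n_steps
        self.predictors = nn.ModuleList(nn.Linear(c_dim, z_dim) for _ in range(n_steps))

    def forward(self, wav):                       # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))        # (batch, z_dim, T)
        z = z.transpose(1, 2)                     # (batch, T, z_dim)
        c, _ = self.ar(z)                         # (batch, T, c_dim)
        return z, c

    def infonce_loss(self, z, c):
        batch, T, _ = z.shape
        loss = 0.0
        for k, head in enumerate(self.predictors, start=1):
            pred = head(c[:, :T - k])             # predict z_{t+k} from c_t
            target = z[:, k:]                     # true future latents
            # Score every prediction against every candidate latent; the
            # matching frame is the positive, all others act as negatives.
            p = pred.reshape(-1, pred.size(-1))
            t = target.reshape(-1, target.size(-1))
            logits = p @ t.t()                    # (N, N) similarity scores
            labels = torch.arange(p.size(0), device=p.device)
            loss = loss + F.cross_entropy(logits, labels)
        return loss / len(self.predictors)
```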

In the field of representation learning for speech, phone and speaker identification are widely used to evaluate features generated by self-supervised learning techniques, since they assess local and global structure in the audio respectively. In this article, I demonstrate that emotion recognition can also be used as a downstream task for gauging the quality of representations. Furthermore, classifying emotions complements phone and speaker identification when benchmarking how good representations are, since emotions only loosely depend on the words being said or how a person’s voice sounds.

Related work

Emotion recognition

The majority of emotion recognition systems [3,4,6] have been trained using Mel-Frequency Cepstral Coefficients (MFCCs), which are popular audio features based on a frequency spectrogram. Fbanks, also known as Mel spectrograms, are similar to MFCCs and are also widely used. Both capture the frequency content that humans are sensitive to. There has been little work showing the performance gains from using machine-learned features obtained through self-supervised learning on the emotion recognition task. It is worth noting that MFCCs and Fbanks can still be used as the input to the self-supervised task instead of raw audio, and can often be a good starting point when extracting richer representations. I will talk more about that later.

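For readers who want to see what these features look like in code, here is a small torchaudio sketch that computes both Fbanks (Mel spectrograms) and MFCCs from a waveform. The file name, sample rate handling and all frame settings are illustrative assumptions rather than the settings used in this work.

```python
import torch
import torchaudio

# Hypothetical input file; any mono or multi-channel WAV works the same way.
waveform, sample_rate = torchaudio.load("speech.wav")

# Fbanks (Mel spectrogram): energy in Mel-spaced frequency bands per frame.
fbank = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)                                              # (channels, 80, frames)

# MFCCs: a DCT applied on top of the log-Mel energies, keeping the first
# few coefficients as a compact description of the spectral envelope.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)(waveform)                                              # (channels, 13, frames)
```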
