Deep-Learning-Based Speech Emotion Recognition (SER)

1. Introduction

Speech is the most natural way for us as humans to express ourselves, so it is natural to extend this communication medium to computer applications. We define a Speech Emotion Recognition (SER) system as a collection of methods that process and classify speech signals in order to detect the emotions embedded in them. SER is not a new field: it has existed for more than two decades and has regained attention thanks to recent advances. These newer studies draw on progress across computing and technology, so it is worth reviewing the current methods and techniques that make SER possible.

2. Datasets

Here are the four most popular English datasets: Crema, Ravdess, Savee and Tess. Each of them contains audio in .wav format together with the main labels.

Ravdess:

Here are the filename identifiers as per the official RAVDESS website:

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
  • Vocal channel (01 = speech, 02 = song).
  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
  • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
  • Repetition (01 = 1st repetition, 02 = 2nd repetition).
  • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

So, here's an example of an audio filename: 02-01-06-01-02-01-12.wav. This means the metadata for the audio file is:

  • Video-only (02)
  • Speech (01)
  • Fearful (06)
  • Normal intensity (01)
  • Statement "dogs" (02)
  • 1st Repetition (01)
  • 12th Actor (12) - Female (as the actor ID number is even)
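As a quick illustration (a hypothetical helper, not part of the notebook; the actual parsing used in this project appears in Step 3), the filename fields can be decoded like this:

# Hypothetical helper: decode a RAVDESS filename into its seven metadata fields.
RAVDESS_EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
                    5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename):
    """Split e.g. '02-01-06-01-02-01-12.wav' into its metadata fields."""
    parts = [int(p) for p in filename.split(".")[0].split("-")]
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": modality,                      # 01 full-AV, 02 video-only, 03 audio-only
        "vocal_channel": "speech" if channel == 1 else "song",
        "emotion": RAVDESS_EMOTIONS[emotion],
        "intensity": "strong" if intensity == 2 else "normal",
        "statement": statement,
        "repetition": repetition,
        "actor": actor,
        "sex": "female" if actor % 2 == 0 else "male",
    }

print(parse_ravdess_name("02-01-06-01-02-01-12.wav"))
# -> emotion 'fearful', normal intensity, actor 12 (female)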

Crema:

The third component of the filename encodes the emotion label:

  • SAD - sadness;
  • ANG - angry;
  • DIS - disgust;
  • FEA - fear;
  • HAP - happy;
  • NEU - neutral.

Tess:

Very similar to Crema: the emotion label is contained in the file name.

Savee:

The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows:

  • 'a' = 'anger'
  • 'd' = 'disgust'
  • 'f' = 'fear'
  • 'h' = 'happiness'
  • 'n' = 'neutral'
  • 'sa' = 'sadness'
  • 'su' = 'surprise'


Speech Emotion Recognition with a 1D Convolutional Neural Network

In this experiment I try to recognize the emotion in short voice messages (< 3 seconds). I use four datasets of short English phrases voiced by professional actors: Ravdess, Crema, Savee and Tess.

First, let's define SER. Speech Emotion Recognition (SER) is the attempt to recognize a person's emotions and affective state from their speech. It exploits the fact that the voice often reflects the underlying emotion through tone and pitch. A similar mechanism exists in animals such as dogs and horses, which use it to understand human emotions.

The datasets used in this project cover seven main emotions: happy, fear, angry, disgust, surprise, sad and neutral.

Importing libraries

import os
import re

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Audio
# from entropy import spectral_entropy
from keras import layers
from keras import models
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
import keras
import itertools

Step 2: Dataset paths

# Paths to the four datasets
Ravdess = "../input/speech-emotion-recognition-en/Ravdess/audio_speech_actors_01-24"
Crema = "../input/speech-emotion-recognition-en/Crema"
Savee = "../input/speech-emotion-recognition-en/Savee"
Tess = "../input/speech-emotion-recognition-en/Tess"

Data preparation

Ravdess dataset

The filename identifiers follow the official RAVDESS convention described in the dataset overview above. For example, 02-01-06-01-02-01-12.wav decodes to: video-only (02), speech (01), fearful (06), normal intensity (01), statement "dogs" (02), 1st repetition (01), actor 12, female (the actor ID is even).

Step 3: Ravdess dataframe

ravdess_directory_list = os.listdir(Ravdess)

emotion_df = []

# Each actor has their own sub-directory; the emotion code is the third field of the filename.
for dir in ravdess_directory_list:
    actor = os.listdir(os.path.join(Ravdess, dir))

    for wav in actor:
        info = wav.partition(".wav")[0].split("-")
        emotion = int(info[2])
        emotion_df.append((emotion, os.path.join(Ravdess, dir, wav)))

Ravdess_df = pd.DataFrame.from_dict(emotion_df)
Ravdess_df.rename(columns={1 : "Path", 0 : "Emotion"}, inplace=True)
Ravdess_df.head()

# Map the numeric codes to emotion names; 'calm' (02) is merged into 'neutral'.
Ravdess_df.Emotion.replace({1:'neutral', 2:'neutral', 3:'happy', 4:'sad', 5:'angry', 6:'fear', 7:'disgust', 8:'surprise'}, inplace=True)
Ravdess_df.head()

Step 4: Crema dataset

emotion_df = []

for wav in os.listdir(Crema):
    info = wav.partition(".wav")[0].split("_")
    if info[2] == 'SAD':
        emotion_df.append(("sad", Crema + "/" + wav))
    elif info[2] == 'ANG':
        emotion_df.append(("angry", Crema + "/" + wav))
    elif info[2] == 'DIS':
        emotion_df.append(("disgust", Crema + "/" + wav))
    elif info[2] == 'FEA':
        emotion_df.append(("fear", Crema + "/" + wav))
    elif info[2] == 'HAP':
        emotion_df.append(("happy", Crema + "/" + wav))
    elif info[2] == 'NEU':
        emotion_df.append(("neutral", Crema + "/" + wav))
    else:
        emotion_df.append(("unknown", Crema + "/" + wav))


Crema_df = pd.DataFrame.from_dict(emotion_df)
Crema_df.rename(columns={1 : "Path", 0 : "Emotion"}, inplace=True)

Crema_df.head()

This reads the audio files in the directory, maps the emotion tag in each file name to an emotion class, stores (emotion, path) pairs in a list, and finally converts the list into a dataframe for later analysis and processing.

Step 5: Tess dataset

tess_directory_list = os.listdir(Tess)

emotion_df = []

for dir in tess_directory_list:
    for wav in os.listdir(os.path.join(Tess, dir)):
        info = wav.partition(".wav")[0].split("_")
        emo = info[2]
        if emo == "ps":
            # "ps" stands for "pleasant surprise" in Tess.
            emotion_df.append(("surprise", os.path.join(Tess, dir, wav)))
        else:
            emotion_df.append((emo, os.path.join(Tess, dir, wav)))


Tess_df = pd.DataFrame.from_dict(emotion_df)
Tess_df.rename(columns={1 : "Path", 0 : "Emotion"}, inplace=True)

Tess_df.head()

Savee dataset

The emotion class is encoded in the prefix letters of each file name (see the list in the dataset overview above).
savee_directory_list = os.listdir(Savee)

emotion_df = []

for wav in savee_directory_list:
    # Keep only the letter prefix of the emotion code (e.g. "sa01" -> "sa").
    info = wav.partition(".wav")[0].split("_")[1]
    emotion = re.split(r"[0-9]", info)[0]
    if emotion=='a':
        emotion_df.append(("angry", Savee + "/" + wav))
    elif emotion=='d':
        emotion_df.append(("disgust", Savee + "/" + wav))
    elif emotion=='f':
        emotion_df.append(("fear", Savee + "/" + wav))
    elif emotion=='h':
        emotion_df.append(("happy", Savee + "/" + wav))
    elif emotion=='n':
        emotion_df.append(("neutral", Savee + "/" + wav))
    elif emotion=='sa':
        emotion_df.append(("sad", Savee + "/" + wav))
    else:
        emotion_df.append(("surprise", Savee + "/" + wav))


Savee_df = pd.DataFrame.from_dict(emotion_df)
Savee_df.rename(columns={1 : "Path", 0 : "Emotion"}, inplace=True)

Savee_df.head()
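The plotting cell below and the feature-extraction loop in Step 8 refer to a combined dataframe df, which is not built in the cells shown here. A minimal sketch of the presumably intended concatenation (an assumption, not part of the original cells):

# Assumed step: combine the four per-dataset tables into the single `df` used below.
df = pd.concat([Ravdess_df, Crema_df, Tess_df, Savee_df], axis=0, ignore_index=True)
df.head()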

# Split the augmented feature table (built in Step 8, four rows per original clip)
# back into per-dataset blocks; these are used for the per-dataset evaluations below.
ravdess_final_data = augmented_data.iloc[0:5760,]
crema_final_data = augmented_data.iloc[5760:35528,]
tess_final_data = augmented_data.iloc[35528:46728,]
savee_final_data = augmented_data.iloc[46728:,]

# Class balance of the combined dataframe.
%matplotlib inline
plt.style.use("ggplot")
plt.title("Count of emotions:")
sns.countplot(x=df["Emotion"])
sns.despine(top=True, right=True, left=False, bottom=False)

 

def create_waveplot(data, sr, e):
    plt.figure(figsize=(10, 3))
    plt.title(f'Waveplot for audio with {e} emotion', size=15)
    librosa.display.waveplot(data, sr=sr)
    plt.show()

def create_spectrogram(data, sr, e):
    # stft function converts the data into short term fourier transform
    X = librosa.stft(data)
    Xdb = librosa.amplitude_to_db(abs(X))
    plt.figure(figsize=(12, 3))
    plt.title('Spectrogram for audio with {} emotion'.format(e), size=15)
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    #librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar()
emotion='fear'
path = np.array(df.Path[df.Emotion==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

emotion='angry'
path = np.array(df.Path[df.Emotion==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

 

emotion='sad'
path = np.array(df.Path[df.Emotion==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

 

 

Step 6: Data augmentation

There are several common ways to augment audio data:

  1. Noise injection
  2. Stretching
  3. Shifting
  4. Pitching
def noise(data, random=False, rate=0.035, threshold=0.075):
    """Add white noise to a sound sample. With random=True the noise rate is drawn
    uniformly below `threshold`; otherwise the fixed `rate` is used."""
    if random:
        rate = np.random.random() * threshold
    noise_amp = rate*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(0, 1, size=data.size)
    return data

def stretch(data, rate=0.8):
    """Time-stretch the sample by the given rate."""
    return librosa.effects.time_stretch(data, rate=rate)

def shift(data, rate=1000):
    """Shift the sample by a random offset of up to +/-5*rate samples."""
    shift_range = int(np.random.uniform(low=-5, high=5)*rate)
    return np.roll(data, shift_range)

def pitch(data, sampling_rate, pitch_factor=0.7, random=False):
    """Pitch-shift the sample. With random=True the shift is drawn uniformly below
    `pitch_factor`; otherwise the fixed `pitch_factor` is used."""
    if random:
        pitch_factor = np.random.random() * pitch_factor
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_factor)
df.head()

Step 7: Listening to augmented samples

path = df[df["Emotion"] == "happy"]["Path"].iloc[0]
data, sampling_rate = librosa.load(path)

Adding white noise

white_noised_data = noise(data, rate=0.1)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=white_noised_data, sr=sampling_rate)
Audio(white_noised_data, rate=sampling_rate)

plt.figure(figsize=(14,4))
librosa.display.waveplot(data, sampling_rate)
Audio(path)
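A similar quick check can be made for the pitch-shifted variant (a short sketch using the pitch() helper defined above; only the noise example is shown in the original):

# Listen to and plot a pitch-shifted version of the same clip.
pitched_data = pitch(data, sampling_rate, random=True)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=pitched_data, sr=sampling_rate)
Audio(pitched_data, rate=sampling_rate)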

 

For our data augmentation we will use noise injection, pitch shifting, and a combination of the two.

Step 8: Feature extraction

Here are some features that may be useful:

1. **Zero Crossing Rate (ZCR)**
   The rate of sign changes of the signal within a given frame.

2. **Energy**
   The sum of squares of the signal values, normalized by the frame length.

3. **Entropy of Energy**
   The entropy of the sub-frames' normalized energies; it can be used as a measure of abrupt changes.

4. **Spectral Centroid**
   The center of gravity of the spectrum.

5. **Spectral Spread**
   The second central moment of the spectrum.

6. **Spectral Entropy**
   The entropy of the normalized spectral energies of a set of sub-frames.

7. **Spectral Flux**
   The squared difference between the spectral magnitudes of two successive frames.

8. **Spectral Rolloff**
   The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.

9. **Mel-Frequency Cepstral Coefficients (MFCCs)**
   A cepstral representation in which the frequency bands are not linear but distributed according to the mel scale.

n_fft = 2048
hop_length = 512
def chunks(data, frame_length, hop_length):
    for i in range(0, len(data), hop_length):
        yield data[i:i+frame_length]

# Zero Crossing Rate
def zcr(data, frame_length=2048, hop_length=512):
    zcr = librosa.feature.zero_crossing_rate(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(zcr)


def energy(data, frame_length=2048, hop_length=512):
    en = np.array([np.sum(np.power(np.abs(data[hop:hop+frame_length]), 2)) for hop in range(0, data.shape[0], hop_length)])
    return en / frame_length


def rmse(data, frame_length=2048, hop_length=512):
    rmse = librosa.feature.rms(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(rmse)


def entropy_of_energy(data, frame_length=2048, hop_length=512):
    energies = energy(data, frame_length, hop_length)
    energies /= np.sum(energies)

    # Shannon entropy of the per-frame energy distribution (epsilon avoids log(0)).
    entropy = -np.sum(energies * np.log2(energies + 1e-12))
    return entropy


def spc(data, sr, frame_length=2048, hop_length=512):
    spectral_centroid = librosa.feature.spectral_centroid(y=data, sr=sr, n_fft=frame_length, hop_length=hop_length)
    return np.squeeze(spectral_centroid)


# def spc_entropy(data, sr):
#     spc_en = spectral_entropy(data, sf=sr, method="fft")
#     return spc_en

def spc_flux(data):
    isSpectrum = data.ndim == 1
    if isSpectrum:
        data = np.expand_dims(data, axis=1)

    X = np.c_[data[:, 0], data]
    af_Delta_X = np.diff(X, 1, axis=1)
    vsf = np.sqrt((np.power(af_Delta_X, 2).sum(axis=0))) / X.shape[0]

    return np.squeeze(vsf) if isSpectrum else vsf


def spc_rollof(data, sr, frame_length=2048, hop_length=512):
    spcrollof = librosa.feature.spectral_rolloff(y=data, sr=sr, n_fft=frame_length, hop_length=hop_length)
    return np.squeeze(spcrollof)


def chroma_stft(data, sr, frame_length=2048, hop_length=512, flatten: bool = True):
    stft = np.abs(librosa.stft(data))
    chroma_stft = librosa.feature.chroma_stft(S=stft, sr=sr)
    return np.squeeze(chroma_stft.T) if not flatten else np.ravel(chroma_stft.T)


def mel_spc(data, sr, frame_length=2048, hop_length=512, flatten: bool = True):
    mel = librosa.feature.melspectrogram(y=data, sr=sr)
    return np.squeeze(mel.T) if not flatten else np.ravel(mel.T)

def mfcc(data, sr, frame_length=2048, hop_length=512, flatten: bool = True):
    mfcc_feature = librosa.feature.mfcc(y=data, sr=sr)
    return np.squeeze(mfcc_feature.T) if not flatten else np.ravel(mfcc_feature.T)

print("ZCR: ", zcr(data).shape)
print("Energy: ", energy(data).shape)
print("Entropy of Energy :", entropy_of_energy(data).shape)
print("RMS :", rmse(data).shape)
print("Spectral Centroid :", spc(data, sampling_rate).shape)
# print("Spectral Entropy: ", spc_entropy(data, sampling_rate).shape)
print("Spectral Flux: ", spc_flux(data).shape)
print("Spectral Rollof: ", spc_rollof(data, sampling_rate).shape)
print("Chroma STFT: ", chroma_stft(data, sampling_rate).shape)
print("MelSpectrogram: ", mel_spc(data, sampling_rate).shape)
print("MFCC: ", mfcc(data, sampling_rate).shape)

For this task it was decided to use only three main features: zero crossing rate (ZCR), root mean square energy (RMS) and mel-frequency cepstral coefficients (MFCC).

In addition, only 2.5 seconds of each clip are used, with an offset of 0.6 seconds: in these datasets the first 0.6 seconds carry no information about the emotion, and most samples are shorter than 3 seconds.
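As a rough sanity check on the resulting feature length (assuming librosa's default sampling rate of 22,050 Hz and the frame length of 2048 / hop length of 512 used above): a clip that spans the full 2.5 s has about 2.5 × 22050 ≈ 55,125 samples, i.e. roughly 1 + floor(55125 / 512) = 108 frames, so the concatenated vector of ZCR (108 values) + RMS (108 values) + 20 flattened MFCC coefficients (20 × 108 = 2160 values) comes to about 2376 values per clip; shorter clips yield shorter vectors.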

The core code is as follows:

def extract_features(data, sr, frame_length=2048, hop_length=512):
    result = np.array([])
    result = np.hstack((result,
                        zcr(data, frame_length, hop_length),
                        # np.mean(energy(data, frame_length, hop_length),axis=0),
                        # np.mean(entropy_of_energy(data, frame_length, hop_length), axis=0),
                        rmse(data, frame_length, hop_length),
                        # spc(data, sr, frame_length, hop_length),
                        # spc_entropy(data, sr),
                        # spc_flux(data),
                        # spc_rollof(data, sr, frame_length, hop_length),
                        # chroma_stft(data, sr, frame_length, hop_length),
                        # mel_spc(data, sr, frame_length, hop_length, flatten=True)
                        mfcc(data, sr, frame_length, hop_length)
                                    ))
    return result
def get_features_with_augmentation(path, duration=2.5, offset=0.6):
    # duration and offset skip the silent start of each clip and cap its length at 2.5 s, as discussed above.
    data, sample_rate = librosa.load(path, duration=duration, offset=offset)

     # without augmentation
    res1 = extract_features(data, sample_rate)
    result = np.array(res1)

    # data with noise
    noise_data = noise(data, random=True)
    res2 = extract_features(noise_data, sample_rate)
    result = np.vstack((result, res2)) # stacking vertically

    # data with pitching
    pitched_data = pitch(data, sample_rate, random=True)
    res3 = extract_features(pitched_data, sample_rate)
    result = np.vstack((result, res3)) # stacking vertically

    # data with pitching and white noise
    new_data = pitch(data, sample_rate, random=True)
    data_noise_pitch = noise(new_data, random=True)
    res4 = extract_features(data_noise_pitch, sample_rate)
    result = np.vstack((result, res4)) # stacking vertically

    return result

def get_features_without_augmentation(path, duration=2.5, offset=0.6):
    # duration and offset skip the silent start of each clip and cap its length at 2.5 s, as discussed above.
    data, sample_rate = librosa.load(path, duration=duration, offset=offset)

     # without augmentation
    res1 = extract_features(data, sample_rate)
    
    
    return res1
X, Y = [], []
print("Feature processing...")
for path, emotion, ind in zip(df.Path, df.Emotion, range(df.Path.shape[0])):
    features = get_features_with_augmentation(path)
    if ind % 100 == 0:
        print(f"{ind} samples have been processed...")
    for ele in features:
        X.append(ele)
        # append the emotion once per row: the original clip plus 3 augmented versions give 4 rows per file.
        Y.append(emotion)
print("Done.")
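Step 9 below works with a dataframe called extracted_df, which is not constructed in the cells shown here. A minimal sketch of the presumably intended step (an assumption; shorter clips produce shorter feature rows, which pandas pads with NaN, hence the fillna):

# Assumed step: turn the feature rows and labels into one table with a "labels" column.
extracted_df = pd.DataFrame(X)
extracted_df["labels"] = Y
extracted_df = extracted_df.fillna(0)  # zero-pad the missing columns of shorter clips
extracted_df.head()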

 

Step 9: Data preparation

Now that the features have been extracted, the data needs to be normalized and split into training and test sets.

X = extracted_df.drop(labels="labels", axis=1)
Y = extracted_df["labels"]

# One-hot encode the emotion labels.
lb = LabelEncoder()
Y = np_utils.to_categorical(lb.fit_transform(Y))
print(lb.classes_)
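The model in Step 10 expects X_train, X_val, X_test and the matching one-hot labels, but the split and scaling are not shown in the original cells. A minimal sketch of the presumably intended preparation with the imported train_test_split and StandardScaler (the variable names and split sizes are assumptions):

# Assumed step: split, standardize, and add the channel axis expected by Conv1D.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, shuffle=True)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1, random_state=42, shuffle=True)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)

# Conv1D expects input of shape (samples, timesteps, channels).
X_train = np.expand_dims(x_train, axis=2)
X_val = np.expand_dims(x_val, axis=2)
X_test = np.expand_dims(x_test, axis=2)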

 

Step 10: Let's define our model:

earlystopping = EarlyStopping(monitor ="val_acc",
                              mode = 'auto', patience = 5,
                              restore_best_weights = True)
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc',
                                            patience=3,
                                            verbose=1,
                                            factor=0.5,
                                            min_lr=0.00001)
model = models.Sequential()
model.add(layers.Conv1D(512, kernel_size=5, strides=1,
                        padding="same", activation="relu",
                        input_shape=(X_train.shape[1], 1)))
model.add(layers.BatchNormalization())
model.add(layers.MaxPool1D(pool_size=5, strides=2, padding="same"))

model.add(layers.Conv1D(512, kernel_size=5, strides=1,
                        padding="same", activation="relu"))
model.add(layers.BatchNormalization())
model.add(layers.MaxPool1D(pool_size=5, strides=2, padding="same"))

model.add(layers.Conv1D(256, kernel_size=5, strides=1,
                        padding="same", activation="relu"))
model.add(layers.BatchNormalization())
model.add(layers.MaxPool1D(pool_size=5, strides=2, padding="same"))

model.add(layers.Conv1D(256, kernel_size=3, strides=1, padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling1D(pool_size=5, strides = 2, padding = 'same'))

model.add(layers.Conv1D(128, kernel_size=3, strides=1, padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling1D(pool_size=3, strides = 2, padding = 'same'))

model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dense(7, activation="softmax"))

model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["acc",keras.metrics.Recall(),keras.metrics.Precision()])
model.summary()

EPOCHS = 50
batch_size = 64
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=EPOCHS, batch_size=batch_size,
                    callbacks=[earlystopping, learning_rate_reduction])

print("Accuracy of our model on test data : " , model.evaluate(X_test,y_test)[1]*100 , "%")

fig , ax = plt.subplots(1,2)
train_acc = history.history['acc']
train_loss = history.history['loss']
test_acc = history.history['val_acc']
test_loss = history.history['val_loss']

fig.set_size_inches(20,6)
ax[0].plot(train_loss, label = 'Training Loss')
ax[0].plot(test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(train_acc, label = 'Training Accuracy')
ax[1].plot(test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
77/77 [==============================] - 2s 23ms/step - loss: 3.4681 - acc: 0.5783 - recall_1: 0.5734 - precision_1: 0.5827
Accuracy of our model on test data :  57.82983899116516 %

Step 11: Crema accuracy

print("Accuracy of our model on test data : " , model.evaluate(X_test,y_test)[1]*100 , "%")

fig , ax = plt.subplots(1,2)
train_acc = history.history['acc']
train_loss = history.history['loss']
test_acc = history.history['val_acc']
test_loss = history.history['val_loss']

fig.set_size_inches(20,6)
ax[0].plot(train_loss, label = 'Training Loss')
ax[0].plot(test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(train_acc, label = 'Training Accuracy')
ax[1].plot(test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
187/187 [==============================] - 4s 23ms/step - loss: 0.2776 - acc: 0.9483 - f1_m: 0.9489 - recall_6: 0.9476 - precision_6: 0.9498
Accuracy of our model on test data :  94.82700824737549 %

Step 12: Savee accuracy

print("Accuracy of our model on test data : " , model.evaluate(X_test,y_test)[1]*100 , "%")

fig , ax = plt.subplots(1,2)
train_acc = history.history['acc']
train_loss = history.history['loss']
test_acc = history.history['val_acc']
test_loss = history.history['val_loss']

fig.set_size_inches(20,6)
ax[0].plot(train_loss, label = 'Training Loss')
ax[0].plot(test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(train_acc, label = 'Training Accuracy')
ax[1].plot(test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
12/12 [==============================] - 0s 23ms/step - loss: 0.3354 - acc: 0.9245 - f1_m: 0.9279 - recall_1: 0.9193 - precision_1: 0.9363
Accuracy of our model on test data :  92.44791865348816 %

Step 13: Tess accuracy

print("Accuracy of our model on test data : " , model.evaluate(X_test,y_test)[1]*100 , "%")

fig , ax = plt.subplots(1,2)
train_acc = history.history['acc']
train_loss = history.history['loss']
test_acc = history.history['val_acc']
test_loss = history.history['val_loss']

fig.set_size_inches(20,6)
ax[0].plot(train_loss, label = 'Training Loss')
ax[0].plot(test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(train_acc, label = 'Training Accuracy')
ax[1].plot(test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
70/70 [==============================] - 2s 23ms/step - loss: 5.8600e-04 - acc: 1.0000 - f1_m: 1.0000 - recall_4: 1.0000 - precision_4: 1.0000
Accuracy of our model on test data :  100.0 %

Step 14: Crema confusion matrix
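plot_confusion_matrix and cm are not defined in the cells shown here. A minimal sketch of how they are presumably obtained, using sklearn's confusion_matrix and the classic itertools-based plotting helper (the function body is an assumption; only its name and call match the cell below). Note that the Crema subset has six emotion classes (no 'surprise'), which matches the 6×6 matrix printed below.

from sklearn.metrics import confusion_matrix

# Assumed step: predicted vs. true classes on the (Crema) test split.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
cm = confusion_matrix(y_true, y_pred)

def plot_confusion_matrix(cm, classes, title="Confusion Matrix", cmap=plt.cm.Blues):
    """Print and plot a (non-normalized) confusion matrix."""
    print("Confusion matrix, without normalization")
    print(cm)
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], "d"),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    plt.show()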

cm_plot_labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
plot_confusion_matrix(cm=cm, classes=cm_plot_labels, title='Confusion Matrix')
Confusion matrix, without normalization
[[953   9   7  13   4   0]
 [ 12 979   9  15  13  19]
 [ 11  10 930  11   6  22]
 [ 21  13  15 946   6   4]
 [  0   5  11  10 863  18]
 [  1   8   8   2  25 975]]
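As a quick check, the diagonal of this matrix sums to 5646 correct predictions out of 5954 test rows, i.e. about 94.8%, which is consistent with the Crema accuracy reported in Step 11.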

 

Savee confusion matrix

Tess confusion matrix

3. Contact

Due to limited space and time, I will continue the study and update this post next time (feel free to add me as a friend to discuss and learn together).

QQ / WeChat

 
