[Speech Signal Processing] 2. Hands-on with speech signals: LSTM (hidden, output), Attention, and speech visualization

1. LSTM-hidden implementation details

Reference: the official documentation for class torch.utils.data.Dataset.
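For orientation, here is a minimal sketch of the Dataset interface that the WaveDataset in Section 5 fills in (the class and variable names here are illustrative only):

import torch
from torch.utils.data import Dataset

class ToyWaveDataset(Dataset):
    """Minimal Dataset: index into pre-extracted features and labels."""
    def __init__(self, features, labels):
        self.features = features   # list of (seq_len, 26) filterbank feature arrays
        self.labels = labels       # list of 0/1 emotion labels

    def __getitem__(self, item):
        # one sample: (feature tensor, label tensor), same convention as Section 5
        return torch.Tensor(self.features[item]), torch.LongTensor([self.labels[item]])

    def __len__(self):
        return len(self.labels)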

With BATCH_SIZE = 128, HIDDEN_SIZE = 64, and a maximum of 5 epochs:

Epoch: 0, batch: 0, average training loss: 0.6244831681251526, validation accuracy: 0.7070707070707071, Total time: 0m 26s
Epoch: 1, batch: 0, average training loss: 0.5765572786331177, validation accuracy: 0.6868686868686869, Total time: 0m 55s
Epoch: 2, batch: 0, average training loss: 0.5467721819877625, validation accuracy: 0.7744107744107744, Total time: 1m 25s
Epoch: 3, batch: 0, average training loss: 0.5237111449241638, validation accuracy: 0.7811447811447811, Total time: 1m 55s
Epoch: 4, batch: 0, average training loss: 0.5050662159919739, validation accuracy: 0.797979797979798, Total time: 2m 25s
Validation accuracy: 0.797979797979798

With BATCH_SIZE = 128, HIDDEN_SIZE = 64, and a maximum of 50 epochs:

Dataset size: 2961, POS: 2059, NEG: 902
True
Dataset loaded! length of dataset is 2961
Length of train set is 2664, Length of valid set is 297
Cuda is available!
Epoch: 0, batch: 0, average training loss: 0.5977479219436646, validation accuracy: 0.7205387205387206, Total time: 0m 26s
Epoch: 1, batch: 0, average training loss: 0.558993935585022, validation accuracy: 0.7676767676767676, Total time: 0m 55s
Epoch: 2, batch: 0, average training loss: 0.5424327850341797, validation accuracy: 0.734006734006734, Total time: 1m 24s
Epoch: 3, batch: 0, average training loss: 0.5222371220588684, validation accuracy: 0.7811447811447811, Total time: 1m 59s
Epoch: 4, batch: 0, average training loss: 0.5041710138320923, validation accuracy: 0.7946127946127947, Total time: 2m 31s
Epoch: 5, batch: 0, average training loss: 0.48880547285079956, validation accuracy: 0.8080808080808081, Total time: 3m 1s
...
Epoch: 39, batch: 0, average training loss: 0.3070719838142395, validation accuracy: 0.8653198653198653, Total time: 21m 36s
Epoch: 40, batch: 0, average training loss: 0.3038052022457123, validation accuracy: 0.8821548821548821, Total time: 22m 9s
Epoch: 41, batch: 0, average training loss: 0.3010295331478119, validation accuracy: 0.8518518518518519, Total time: 22m 42s
Epoch: 42, batch: 0, average training loss: 0.2985186278820038, validation accuracy: 0.8754208754208754, Total time: 23m 13s
Epoch: 43, batch: 0, average training loss: 0.2955150306224823, validation accuracy: 0.8686868686868687, Total time: 23m 44s
Epoch: 44, batch: 0, average training loss: 0.29303693771362305, validation accuracy: 0.8114478114478114, Total time: 24m 14s
Epoch: 45, batch: 0, average training loss: 0.29053598642349243, validation accuracy: 0.8518518518518519, Total time: 24m 45s
Epoch: 46, batch: 0, average training loss: 0.28803855180740356, validation accuracy: 0.8686868686868687, Total time: 25m 16s
Epoch: 47, batch: 0, average training loss: 0.2856229841709137, validation accuracy: 0.8686868686868687, Total time: 25m 46s
Epoch: 48, batch: 0, average training loss: 0.2830147445201874, validation accuracy: 0.8619528619528619, Total time: 26m 17s
Epoch: 49, batch: 0, average training loss: 0.280977725982666, validation accuracy: 0.8720538720538721, Total time: 26m 50s
Validation accuracy: 0.8720538720538721

For the CASIA Mandarin emotional speech corpus, the five emotion categories are mapped to binary labels as follows:

category2emotion = {
    'angry':0,
    'fear':0,
    'sad':0,
    'happy':1,
    'surprise':1
}

This gives 1000 utterances in total (4 speakers x 5 emotion categories), of which 400 are positive and 600 negative.
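A condensed sketch of how these labels are collected by walking the CASIA directory tree with the category2emotion mapping above (this mirrors the loop in the full code in Section 5; CASIA_path is a placeholder):

import os

CASIA_path = '... your path/CASIA database/'   # placeholder corpus root
speakers = ['liuchanhg', 'wangzhe', 'zhaoquanyin', 'ZhaoZuoxiang']
categories = ['angry', 'fear', 'happy', 'sad', 'surprise']

CASIA_wave_path, CASIA_label_list = [], []
for speaker in speakers:
    for category in categories:
        for name in os.listdir(os.path.join(CASIA_path, speaker, category)):
            if name.endswith('wav'):
                CASIA_wave_path.append(speaker + '/' + category + '/' + name)
                CASIA_label_list.append(category2emotion[category])

print(len(CASIA_label_list), sum(CASIA_label_list))   # 1000 items, 400 of them positive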

The combined dataset size is 3961 (POS: 2459, NEG: 1502). The validation split is increased from 10% to 15%, giving a train set of 3366 and a validation set of 595, and the model with the best validation accuracy is saved:

Cuda is available!
Epoch: 0, batch: 0, average training loss: 0.6201511025428772, validation accuracy: 0.6823529411764706, Total time: 0m 44s
Epoch: 1, batch: 0, average training loss: 0.5891127586364746, validation accuracy: 0.7327731092436974, Total time: 1m 36s
Epoch: 2, batch: 0, average training loss: 0.569821298122406, validation accuracy: 0.7260504201680672, Total time: 2m 28s
Epoch: 3, batch: 0, average training loss: 0.553916871547699, validation accuracy: 0.7747899159663866, Total time: 3m 19s
Epoch: 4, batch: 0, average training loss: 0.5395360589027405, validation accuracy: 0.7815126050420168, Total time: 4m 11s
Epoch: 5, batch: 0, average training loss: 0.5276969075202942, validation accuracy: 0.788235294117647, Total time: 5m 2s
Epoch: 6, batch: 0, average training loss: 0.5157127976417542, validation accuracy: 0.8084033613445378, Total time: 5m 54s
Epoch: 7, batch: 0, average training loss: 0.5061213374137878, validation accuracy: 0.7899159663865546, Total time: 6m 45s
Epoch: 8, batch: 0, average training loss: 0.49973732233047485, validation accuracy: 0.7966386554621848, Total time: 7m 37s
Epoch: 9, batch: 0, average training loss: 0.4922909140586853, validation accuracy: 0.788235294117647, Total time: 8m 29s
Epoch: 10, batch: 0, average training loss: 0.4837799072265625, validation accuracy: 0.8369747899159664, Total time: 9m 20s
...
Epoch: 41, batch: 0, average training loss: 0.35956424474716187, validation accuracy: 0.880672268907563, Total time: 27m 22s
Epoch: 42, batch: 0, average training loss: 0.3565613031387329, validation accuracy: 0.853781512605042, Total time: 27m 55s
Epoch: 43, batch: 0, average training loss: 0.3547583222389221, validation accuracy: 0.8756302521008403, Total time: 28m 28s
Epoch: 44, batch: 0, average training loss: 0.35203972458839417, validation accuracy: 0.8588235294117647, Total time: 29m 1s
Epoch: 45, batch: 0, average training loss: 0.34941181540489197, validation accuracy: 0.8722689075630252, Total time: 29m 34s
Epoch: 46, batch: 0, average training loss: 0.3464643359184265, validation accuracy: 0.8554621848739495, Total time: 30m 7s
Epoch: 47, batch: 0, average training loss: 0.3436892628669739, validation accuracy: 0.8705882352941177, Total time: 30m 41s
Epoch: 48, batch: 0, average training loss: 0.34104934334754944, validation accuracy: 0.8336134453781513, Total time: 31m 14s
Epoch: 49, batch: 0, average training loss: 0.33854228258132935, validation accuracy: 0.8689075630252101, Total time: 31m 48s
Validation accuracy: 0.8689075630252101

For some reason the best model does not seem to have been saved…
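For reference, a minimal sketch of the intended save-the-best-checkpoint logic (the training loop in Section 5 does essentially this; evaluate() and the dataloaders are as defined there, and the filename is illustrative):

best_acc = 0.5          # floor: only save models that beat this accuracy
for epoch in range(max_epochs):
    for wave, label in train_dataloader:
        wave, label = wave.cuda(), label.cuda()
        out = model(wave)
        loss = criterion(out, label.squeeze())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    acc_valid = evaluate(valid_dataloader)
    if acc_valid > best_acc:            # new best: remember it and write it to disk
        best_acc = acc_valid
        best_model = model
        torch.save(model.state_dict(), 'model_wave_best.pth')   # illustrative filename
        print('Save model!')
    else:
        scheduler.step()                # otherwise decay the learning rate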

Define best_model = None ahead of time, change HIDDEN_SIZE to 96, and retrain:

...
Epoch: 42, batch: 0, average training loss: 0.35273611545562744, validation accuracy: 0.8521008403361344, Total time: 24m 4s
Epoch: 43, batch: 0, average training loss: 0.3498222827911377, validation accuracy: 0.8470588235294118, Total time: 24m 38s
Epoch: 44, batch: 0, average training loss: 0.3466912806034088, validation accuracy: 0.8554621848739495, Total time: 25m 12s
Epoch: 45, batch: 0, average training loss: 0.34298786520957947, validation accuracy: 0.8521008403361344, Total time: 25m 46s
Epoch: 46, batch: 0, average training loss: 0.3392830789089203, validation accuracy: 0.8487394957983193, Total time: 26m 19s
Epoch: 47, batch: 0, average training loss: 0.33573663234710693, validation accuracy: 0.8521008403361344, Total time: 26m 53s
Epoch: 48, batch: 0, average training loss: 0.33253514766693115, validation accuracy: 0.8487394957983193, Total time: 27m 27s
Epoch: 49, batch: 0, average training loss: 0.3291744291782379, validation accuracy: 0.838655462184874, Total time: 28m 0s
Validation accuracy: 0.838655462184874

The best model still fails to save…

Epoch: 51, batch: 0, average training loss: 0.3056916892528534, validation accuracy: 0.8504201680672269, Total time: 29m 27s
Epoch: 52, batch: 0, average training loss: 0.3024693727493286, validation accuracy: 0.8907563025210085, Total time: 30m 1s
Save model!
Epoch: 53, batch: 0, average training loss: 0.2991935610771179, validation accuracy: 0.8773109243697479, Total time: 30m 35s
Epoch: 54, batch: 0, average training loss: 0.2951126992702484, validation accuracy: 0.8605042016806723, Total time: 31m 9s
Epoch: 55, batch: 0, average training loss: 0.2914368212223053, validation accuracy: 0.8739495798319328, Total time: 31m 43s
Epoch: 56, batch: 0, average training loss: 0.2875944972038269, validation accuracy: 0.8689075630252101, Total time: 32m 18s
Epoch: 57, batch: 0, average training loss: 0.28370267152786255, validation accuracy: 0.8571428571428571, Total time: 32m 53s
Epoch: 58, batch: 0, average training loss: 0.2803336977958679, validation accuracy: 0.8588235294117647, Total time: 33m 27s
Epoch: 59, batch: 0, average training loss: 0.27704960107803345, validation accuracy: 0.853781512605042, Total time: 34m 2s
Validation accuracy: 0.8907563025210085

This time the model was saved.

Opening a file with the .pkl extension:

import pickle
with open(path+'Processed/unaligned_39.pkl', 'rb') as data_file:
    data = pickle.load(data_file)
    print(data)

On zero-padding a NumPy array with np.pad:

import numpy as np
array = np.array([[1, 1], [2, 2]])

"""
((1,1),(2,2)) pads the first axis of array (here: rows) with 1 row before and 1 row after,
              and the second axis (here: columns) with 2 columns before and 2 columns after.
constant_values=(0,3) means the "before" padding is filled with 0 and the "after" padding with 3.
"""
ndarray = np.pad(array, ((1, 1), (2, 2)), 'constant', constant_values=(0, 3))

print("array", array)
print("ndarray=", ndarray)
array [[1 1]
       [2 2]]

ndarray= [[0 0 0 0 3 3]
          [0 0 1 1 3 3]
          [0 0 2 2 3 3]
          [0 0 3 3 3 3]]
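Note that constant_values=(0, 3) is a single (before, after) pair applied to every axis; to use a different fill value per axis, pass one pair per axis. A small sketch (the values 9 and 8 are chosen purely for illustration):

# rows (axis 0) are padded with 9, columns (axis 1) with 8;
# the corners take the column value, just as the corners above ended up as 3
ndarray2 = np.pad(array, ((1, 1), (2, 2)), 'constant',
                  constant_values=((9, 9), (8, 8)))
print(ndarray2)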

Applied to this project:

import numpy as np
x = np.random.rand(1200,26)
y = np.random.rand(500,26)
print(x.shape, y.shape)
x = x[:999]
y = y[:999]
print(x.shape, y.shape)
x = np.pad(x, ((0, 999 - len(x)),(0, 0)), mode='constant')
y = np.pad(y, ((0, 999 - len(y)),(0, 0)), mode='constant')
print(x.shape, y.shape)
print(x,y)
(1200, 26) (500, 26)
(999, 26) (500, 26)
(999, 26) (999, 26)
[[0.36276419 0.12025306 0.74485184 ... 0.07995154 0.87567182 0.29699818]
 [0.7622248  0.03426568 0.26884526 ... 0.02622873 0.90374401 0.01375409]
 [0.68813536 0.7224915  0.48943753 ... 0.19687244 0.62412416 0.16496784]
 ...
 [0.93325474 0.79646476 0.79513437 ... 0.81376934 0.62457521 0.01225579]
 [0.28987227 0.58696689 0.46909485 ... 0.77735102 0.38161923 0.00966942]
 [0.35099698 0.04347501 0.57235593 ... 0.58973257 0.7156334  0.94827799]] [[0.78924269 0.60543369 0.66701066 ... 0.56727321 0.24234751 0.99821209]
 [0.50008997 0.5139892  0.909358   ... 0.5260091  0.60402981 0.7940462 ]
 [0.57316599 0.258695   0.96163063 ... 0.204388   0.00456991 0.44561223]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
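In the full code in Section 5, the fixed-length padding to 999 frames is actually commented out; instead, each batch is padded only up to its own longest sequence by the DataLoader's collate function. A condensed, commented version of that collate function:

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def generator_batch(data_batch):
    # data_batch: list of (feature tensor of shape (seq_len_i, 26), label tensor) pairs
    wave_batch, label_batch = [], []
    for wave, label in data_batch:
        wave_batch.append(wave)
        label_batch.append(label)
    wave_batch = pad_sequence(wave_batch, batch_first=True)   # (batch, max_len_in_batch, 26)
    label_batch = torch.tensor(label_batch)                   # one label per sequence
    return wave_batch, label_batch

# used as: DataLoader(split_train, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generator_batch)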

2. LSTM-output implementation details

Epoch: 53, batch: 0, average training loss: 0.2106393724679947, validation accuracy: 0.7815126050420168, Total time: 31m 44s
Epoch: 54, batch: 0, average training loss: 0.20743916928768158, validation accuracy: 0.7865546218487395, Total time: 32m 35s
Epoch: 55, batch: 0, average training loss: 0.3545965254306793, validation accuracy: 0.7142857142857143, Total time: 33m 25s
Epoch: 56, batch: 0, average training loss: 1.031476378440857, validation accuracy: 0.7126050420168067, Total time: 34m 16s
Epoch: 57, batch: 0, average training loss: 1.3142709732055664, validation accuracy: 0.6554621848739496, Total time: 35m 7s
Epoch: 58, batch: 0, average training loss: 1.3166556358337402, validation accuracy: 0.7327731092436974, Total time: 35m 57s
Epoch: 59, batch: 0, average training loss: 1.3262178897857666, validation accuracy: 0.7478991596638656, Total time: 36m 49s
Validation accuracy: 0.8201680672268907

The results suddenly collapse near the end of training. To address this, add gradient clipping and a learning-rate scheduler:

torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # cap the gradient norm at max_norm
scheduler = torch.optim.lr_scheduler.StepLR(optimizer_model, 1, gamma=0.9)
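For context, this is where the two calls sit in the training loop (a sketch matching the loop in Section 5: clipping happens between backward() and the optimizer step):

# inside the batch loop
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)   # rescale gradients whose norm exceeds 0.25
optimizer.step()
optimizer.zero_grad()
# after evaluating on the validation set each epoch, scheduler.step() is called
# on epochs where validation accuracy did not improve (see the loop in Section 5)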

Run again:

Epoch: 54, batch: 0, average training loss: 0.6103, validation accuracy: 0.8084, Total time: 32m 23s
Epoch: 55, batch: 0, average training loss: 0.5997, validation accuracy: 0.8050, Total time: 32m 59s
Epoch: 56, batch: 0, average training loss: 0.5893, validation accuracy: 0.8101, Total time: 33m 35s
Save model!
Epoch: 57, batch: 0, average training loss: 0.5794, validation accuracy: 0.8084, Total time: 34m 11s
Epoch: 58, batch: 0, average training loss: 0.5698, validation accuracy: 0.8067, Total time: 34m 48s
Epoch: 59, batch: 0, average training loss: 0.5605, validation accuracy: 0.8084, Total time: 35m 24s
Validation accuracy: 0.8100840336134454

The loss fluctuates a lot early on and the final result is mediocre; perhaps there are too many parameters? Change HIDDEN_SIZE to 64 and look at the results again:

Epoch: 56, batch: 0, average training loss: 0.2949, validation accuracy: 0.7882, Total time: 32m 55s
Epoch: 57, batch: 0, average training loss: 0.2902, validation accuracy: 0.7933, Total time: 33m 31s
Epoch: 58, batch: 0, average training loss: 0.2857, validation accuracy: 0.7882, Total time: 34m 5s
Epoch: 59, batch: 0, average training loss: 0.2814, validation accuracy: 0.7849, Total time: 34m 40s
Validation accuracy: 0.7983193277310925

Next, add weight initialization for the linear layer (the GRU is left with its default initialization): the bias is set to 0 and the weights are drawn uniformly from [-0.1, 0.1], as sketched below. Results:
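A minimal sketch of this initialization, mirroring the init_weights method of the LSTM-output Net in Section 5 (it is called once from __init__):

def init_weights(self):
    initrange = 0.1
    self.linear.bias.data.zero_()                              # bias = 0
    self.linear.weight.data.uniform_(-initrange, initrange)    # weights uniform in [-0.1, 0.1]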

Epoch: 55, batch: 0, average training loss: 0.4795, validation accuracy: 0.8269, Total time: 32m 14s
Save model!
Epoch: 56, batch: 0, average training loss: 0.4729, validation accuracy: 0.8151, Total time: 32m 49s
Epoch: 57, batch: 0, average training loss: 0.4665, validation accuracy: 0.8118, Total time: 33m 24s
Epoch: 58, batch: 0, average training loss: 0.4602, validation accuracy: 0.8134, Total time: 33m 59s
Epoch: 59, batch: 0, average training loss: 0.4542, validation accuracy: 0.8067, Total time: 34m 33s
Validation accuracy: 0.826890756302521

On reflection, flattening the whole LSTM output sequence and using it directly as the utterance representation for classification is not very reasonable: the result then hinges entirely on how good the extracted features happen to be. So an Attention layer should be added on top.

3. Attention

On the .repeat function:

import torch
x = torch.rand(2,1024)
print(x)
x = x.repeat(35,1,1).permute(1,0,2)
print(x)
print(x.size())
tensor([[0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
        [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430]])
tensor([[[0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
         [0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
         [0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
         ...,
         [0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
         [0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311],
         [0.0709, 0.6496, 0.3133,  ..., 0.9218, 0.6278, 0.5311]],

        [[0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430],
         [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430],
         [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430],
         ...,
         [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430],
         [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430],
         [0.9497, 0.6602, 0.8397,  ..., 0.5423, 0.7244, 0.3430]]])
torch.Size([2, 35, 1024])
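This repeat-then-permute trick is exactly how the final GRU hidden state is broadcast along the time axis inside the attention model. A condensed, commented version of the forward pass of the Net class in Section 5 (H is the hidden size):

output, hidden = self.gru(input, None)         # output: (batch, seq_len, 2H); hidden: (4, batch, H)
hidden = hidden.permute(1, 0, 2).contiguous()  # (batch, 4, H)
hidden = hidden.view(hidden.size(0), -1)       # (batch, 4H)
hidden = hidden.repeat(output.size(1), 1, 1).permute(1, 0, 2)   # (batch, seq_len, 4H)
combine = torch.cat((output, hidden), dim=2)   # (batch, seq_len, 6H)
score = torch.sum(torch.tanh(self.fc1(combine)), dim=2)         # (batch, seq_len)
attention = F.softmax(score, dim=1)            # one weight per time step
context = attention.unsqueeze(1).bmm(output)   # (batch, 1, 2H): weighted sum of the outputs
emo = self.fc2(context.squeeze(1))             # (batch, 2) class logits
return attention, emo                          # both the weights and the logits are returned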

Swap the model for the Attention version (note that the network now also returns the attention weights, so the forward pass returns two values and the training and evaluation code must unpack both). Results:

Epoch: 56, batch: 0, average training loss: 0.3452, validation accuracy: 0.8790, Total time: 32m 49s
Epoch: 57, batch: 0, average training loss: 0.3440, validation accuracy: 0.8824, Total time: 33m 23s
Save model!
Epoch: 58, batch: 0, average training loss: 0.3429, validation accuracy: 0.8807, Total time: 33m 58s
Epoch: 59, batch: 0, average training loss: 0.3419, validation accuracy: 0.8756, Total time: 34m 32s
Validation accuracy: 0.8823529411764706

It seems a larger maximum number of epochs could do even better; try setting it to 100:

Epoch: 97, batch: 0, average training loss: 0.3042, validation accuracy: 0.8387, Total time: 60m 16s
Epoch: 98, batch: 0, average training loss: 0.3039, validation accuracy: 0.8387, Total time: 60m 51s
Epoch: 99, batch: 0, average training loss: 0.3035, validation accuracy: 0.8387, Total time: 61m 26s
Validation accuracy: 0.8453781512605042

This is actually worse than the previous run, and toward the end the loss barely decreases any further. Change HIDDEN_SIZE to 96 and try again:

Epoch: 97, batch: 0, average training loss: 0.2911, validation accuracy: 0.8723, Total time: 58m 10s
Epoch: 98, batch: 0, average training loss: 0.2906, validation accuracy: 0.8706, Total time: 58m 46s
Epoch: 99, batch: 0, average training loss: 0.2900, validation accuracy: 0.8706, Total time: 59m 22s
Validation accuracy: 0.8773109243697479


4. Speech visualization

The goal is an animated plot. matplotlib provides several examples of this; the relevant function is matplotlib.animation.FuncAnimation (see the official documentation, a plot example, and a second example). You must store the created animation in a variable that lives as long as the animation runs; otherwise the animation object is garbage-collected and the animation stops. Its parameters:

  • fig
  • func: the function called at each frame. Its first argument is the value of the next frame; any additional positional arguments can be supplied via the fargs parameter. If blit == True, func must return an iterable of all artists that were modified or created, which the blitting algorithm uses to decide which parts of the figure need redrawing. If blit == False, the return value is unused and may be omitted.

Problem: drawing is far too slow. Even with the interval parameter set very small (e.g. interval=0.0001; note that interval is measured in milliseconds) it still renders slowly, so this approach is only practical for small amounts of data. The code:

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.lines import Line2D


class Scope:
    def __init__(self, ax):
        self.ax = ax
        self.tdata = [0]
        self.ydata = [0]
        self.line = Line2D(self.tdata, self.ydata)
        self.ax.add_line(self.line)
        self.ax.set_ylim(-1, 1)

    def update(self, i):
        # time and data are the waveform arrays produced by read_wav() further below
        self.tdata.append(time[i])
        self.ydata.append(data[i])
        self.line.set_data(self.tdata, self.ydata)
        return self.line,

i = 0
def emitter():
    # generator for the frame index; because of the global counter,
    # i keeps increasing each time the animation restarts the generator
    global i
    i += 1
    yield i

fig, ax = plt.subplots()
scope = Scope(ax)

# pass a generator in "emitter" to produce data for the update func
ani = animation.FuncAnimation(fig, scope.update, emitter, interval=0.0001,
                              blit=True)

plt.show()

Plotting can also be done directly with plt.ion(). How to save the result? Following a reference, using with writer.saving(fig, "... your path/writer_test.mp4", 100): (the 100 is the dpi, dots per inch, i.e. the resolution) throws an error:

FileNotFoundError: [WinError 2] The system cannot find the file specified

ffmpeg must be installed on Windows. Download ffmpeg-N-101947-gc5ca18fd1b-win64-gpl.zip (95.1 MB) from the official site, add its bin directory to the PATH environment variable, and finally run at the command line:

ffmpeg -version

The installation succeeds (ffmpeg -version prints the version information).

However, the original error persists at first; restarting PyCharm resolves it. If that does not help, other workarounds are described in the linked reference.
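If matplotlib still cannot find ffmpeg even after the PATH is set, one workaround worth knowing (not used in the run above, listed here only as a suggestion) is to point matplotlib at the executable explicitly:

import matplotlib
# the path below is an assumption; replace it with the actual location of ffmpeg.exe
matplotlib.rcParams['animation.ffmpeg_path'] = r'C:\ffmpeg\bin\ffmpeg.exe'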

Problem: how to make the duration of the rendered time-domain video match the length of the original audio as closely as possible? Read the generated mp4 back in with OpenCV and compute its total duration (see the referenced article), as sketched below.
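A small sketch of that duration check (the full code below queries the same properties via their numeric indices, cap.get(5) and cap.get(7)):

import cv2

cap = cv2.VideoCapture(mp4_path)                    # the generated .mp4
if cap.isOpened():
    fps = cap.get(cv2.CAP_PROP_FPS)                 # property index 5
    frame_num = cap.get(cv2.CAP_PROP_FRAME_COUNT)   # property index 7
    print('video duration: %.3f s' % (frame_num / fps))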

Converting audio data into a time-domain visualization

The code is as follows:

import wave
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FFMpegWriter
import cv2

CASIA_path = '... your path/11/CASIA database/'
example1 = '232. 噪音产生污染 '        # sentence 232: "Noise causes pollution"
happy1 = 'liuchanhg/happy/232.wav'
happy2 = 'wangzhe/happy/232.wav'
sad1 = 'liuchanhg/sad/232.wav'
sad2 = 'wangzhe/sad/232.wav'

example2 = '249. 笑话逗得大家开心'      # sentence 249: "The joke made everyone happy"
happy3 = 'liuchanhg/happy/249.wav'
happy4 = 'wangzhe/happy/249.wav'
sad3 = 'liuchanhg/sad/249.wav'
sad4 = 'wangzhe/sad/249.wav'


def read_wav(path_wav):
    f = wave.open(path_wav, 'rb')
    params = f.getparams()
    nchannels, sampwidth, framerate, nframes = params[:4]  # channels, sample width (bytes), sample rate, number of frames
    voiceStrData = f.readframes(nframes)
    waveData = np.frombuffer(voiceStrData, dtype=np.short)  # convert the raw bytes to integers (frombuffer instead of the deprecated .fromstring)
    waveData = waveData * 1.0 / max(abs(waveData))  # normalize the audio
    waveData = np.reshape(waveData, [nframes, nchannels]).T  # .T transposes so that each row holds one channel's samples (nchannels rows in total)
    f.close()
    return waveData, nframes, framerate

def draw_time_domain_image(waveData, nframes, framerate):       # time-domain waveform
    time = np.arange(0,nframes) * (1.0/framerate)
    # plt.plot(time,waveData[0,:],c='b')
    # plt.xlabel('time')
    # plt.ylabel('am')
    # plt.show()
    return time,waveData[0,:]



metadata = dict(title='Video to mp4', artist='Matplotlib')
writer = FFMpegWriter(fps=30, metadata=metadata)

waveData, nframes, framerate = read_wav(CASIA_path + happy2)
time, data = draw_time_domain_image(waveData, nframes, framerate)
real_time = time[-1]
print(real_time)
fig = plt.figure(1)
INTERVAL = 537      # 537 - 2.166s(2.1506875)
mp4_path = "... your path/happy2.mp4"

def wavedata2mp4(maxepoch):
    global INTERVAL
    for i in range(maxepoch):
        with writer.saving(fig, mp4_path, 100):  # 100 is the dpi (dots per inch), i.e. the resolution
            # plt.ion()
            x = []
            y = []
            # plt.xlim(0, time[-1])
            # plt.ylim(0, 1)
            for i in range(0, len(time), INTERVAL):
                plt.clf()  # clear everything currently on the canvas
                # x = np.append(x, time[i:i+2000])
                # y = np.append(y, data[i:i+2000])
                x = time[:i+INTERVAL]
                y = data[:i+INTERVAL]
                plt.axis([0, time[-1], -1, 1])
                plt.plot(x, y)
                # plt.draw()
                plt.pause(0.001)
                writer.grab_frame()
            # plt.ioff()
            # plt.show()
        cap = cv2.VideoCapture(mp4_path)
        if cap.isOpened():
            rate = cap.get(5)
            frame_num = cap.get(7)
            video_time = frame_num / rate
            print(video_time, INTERVAL)
            if video_time > real_time:INTERVAL += 1
            else:
                INTERVAL -= 1

wavedata2mp4(1)

5. Full code

# <editor-fold desc="Voice">
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
from torch.utils.data import Dataset
import torch
from torch.utils.data.dataset import random_split
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as F
import time
from torch.nn.utils.rnn import pad_sequence
import matplotlib.ticker as ticker

category2emotion = {
    'angry': 0,
    'fear': 0,
    'sad': 0,
    'happy': 1,
    'surprise': 1
}

import os

CASIA_path = '/mnt/Data1/ysc/CASIA database/'
CASIA_wave_path = []
CASIA_label_list = []
CASIA_dirs = ['liuchanhg', 'wangzhe', 'zhaoquanyin', 'ZhaoZuoxiang']
categories = ['angry', 'fear', 'happy', 'sad', 'surprise']
for dir in CASIA_dirs:
    for category in categories:
        label = category2emotion[category]
        for dataname in os.listdir(CASIA_path + dir + '/' + category):
            if dataname.endswith('wav'):
                CASIA_wave_path.append(dir + '/' + category + '/' + dataname)
                CASIA_label_list.append(label)

dic = {}

path = '/mnt/Data1/ysc/TAL_SER/TAL-SER/'
path_label = path + 'label/' + 'label'  # emotion labels
path_utt2gen = path + 'label/' + 'utt2gen'  # gender
path_utt2spk = path + 'label/' + 'utt2spk'  # speaker
path_wav = path + 'label/' + 'wav.scp'  # wav file paths


class WaveDataset(Dataset):
    def __init__(self):
        total = -1  # 4188
        with open(path_label, 'r', encoding='utf-8') as file:
            for line in file.readlines():
                total += 1
                if total == 0:
                    continue
                tmp = line.strip().split(' ')
                dic[eval(tmp[0])] = {'P': eval(tmp[1]), 'A': eval(tmp[2])}

        with open(path_utt2gen, 'r', encoding='utf-8') as file:
            for line in file.readlines():
                tmp = line.strip().split(' ')
                dic[eval(tmp[0])].update({'Gender': tmp[1]})

        with open(path_utt2spk, 'r', encoding='utf-8') as file:
            for line in file.readlines():
                tmp = line.strip().split(' ')
                dic[eval(tmp[0])].update({'Speaker': tmp[1]})

        with open(path_wav, 'r', encoding='utf-8') as file:
            for line in file.readlines():
                tmp = line.strip().split(' ')
                dic[eval(tmp[0])].update({'Wav': tmp[1][1:]})

        self.path = []
        self.label = []
        self.n_pos = 0
        self.n_neg = 0

        for key in dic:
            if dic[key]['A'] >= 0.5:  # 2961
                self.path.append(dic[key]['Wav'])
                self.label.append(1)
                self.n_pos += 1
            elif dic[key]['A'] <= -0.5:
                self.path.append(dic[key]['Wav'])
                self.label.append(0)
                self.n_neg += 1

        self.path = self.path + CASIA_wave_path
        self.label = self.label + CASIA_label_list
        self.n_pos = self.n_pos + 400
        self.n_neg = self.n_neg + 600

        print('Dataset size: %d, POS: %d, NEG: %d' % (len(self.path), self.n_pos, self.n_neg))
        print(len(self.label) == len(self.path))

    def __getitem__(self, item):
        try:
            path_wav = path + self.path[item]
            filterbank_features = mfcc_python_speech_features(path_wav)
        except:
            path_wav = CASIA_path + self.path[item]
            filterbank_features = mfcc_python_speech_features(path_wav)
        # return torch.from_numpy(filterbank_features), torch.Tensor([self.label[item]])
        return torch.Tensor(filterbank_features), torch.LongTensor([self.label[item]])

    def __len__(self):
        return self.n_pos + self.n_neg


def mfcc_python_speech_features(path):
    sampling_freq, audio = wavfile.read(path)  # read the input audio file
    # mfcc_features = mfcc(audio, sampling_freq)      # extract MFCC features
    filterbank_features = logfbank(audio, sampling_freq)  # filterbank features: numpy.ndarray, (999, 26)
    # pad/truncate every output to MAX_LENGTH (disabled; padding is done per batch in generator_batch)
    # filterbank_features = filterbank_features[:MAX_LENGTH]
    # filterbank_features = np.pad(filterbank_features, ((0, MAX_LENGTH - len(filterbank_features)), (0, 0)),
    #                              mode='constant')
    return filterbank_features


# class Net(nn.Module):       # LSTM hidden
#     def __init__(self, hidden_size):
#         super(Net, self).__init__()
#         self.hidden_size = hidden_size
#         self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True, bidirectional=True, dropout=0.1)
#         self.linear = nn.Linear(2*2*hidden_size, 2)
#
#     def forward(self, input, hidden=None):
#         output, hidden = self.gru(input, hidden)        # batch * seq_len * (2*256), (2*2) * batch * 256
#         hidden = hidden.permute(1,0,2).contiguous()      # batch * (2*2) * 256
#         hidden = hidden.view(hidden.size(0), -1)        # batch * (2*2*256)
#         hidden = self.linear(hidden)        # batch * 2
#         # return F.softmax(hidden, dim=1)
#         return hidden


# class Net(nn.Module):  # LSTM output
#     def __init__(self, hidden_size):
#         super(Net, self).__init__()
#         self.hidden_size = hidden_size
#         self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True,
#                           bidirectional=True, dropout=0.1)
#         self.linear = nn.Linear(MAX_LENGTH * 2 * HIDDEN_SIZE, 2)
#         self.init_weights()
#
#     def init_weights(self):
#         initrange = 0.1
#         self.linear.bias.data.zero_()
#         self.linear.weight.data.uniform_(-initrange, initrange)
#
#     def forward(self, input, hidden=None):
#         output, hidden = self.gru(input, hidden)  # batch * seq_len * (2*256), (2*2) * batch * 256
#         output = output.contiguous().view(output.size(0), -1)
#         output = self.linear(output)
#         return output


class Net(nn.Module):       # LSTM output Attention
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True, bidirectional=True, dropout=0.1)
        self.fc1 = nn.Linear((2 * 2 * hidden_size + 2 * hidden_size), 8)       # Attention dim = 8
        self.fc2 = nn.Linear(2 * hidden_size, 2)

    def forward(self, input, hidden=None):
        output, hidden = self.gru(input, hidden)  # batch * seq_len * (2*256), (2*2) * batch * 256
        hidden = hidden.permute(1,0,2).contiguous()     # batch * (2*2) * 256
        hidden = hidden.view(hidden.size()[0], -1)      # batch * ((2*2) * 256)
        hidden = hidden.repeat(output.size()[1], 1, 1).permute(1,0,2)      # batch * seq_len * ((2*2) * 256)
        combine = torch.cat((output, hidden), dim=2)        # batch * seq_len * ((2*2) * 256 + (2 * 256))
        combine = self.fc1(combine)     # batch * seq_len * 8
        combine = torch.tanh(combine)
        combine = torch.sum(combine, dim=2)     # batch * seq_len
        attention = F.softmax(combine, dim=1)

        a = attention.unsqueeze(1)
        a_apply = a.bmm(output)
        emo = self.fc2(a_apply.squeeze(1))
        return attention, emo       # batch * seq_len, batch * 2


def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for i, (wave, label) in enumerate(dataloader):
            wave, label = wave.cuda(), label.cuda()
            # predicted_label = model(wave)
            _, predicted_label = model(wave)
            # print(predicted_label.argmax(1))
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    model.train()
    return total_acc / total_count


def showAttention(input_sentence, output_words, attentions, title):
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
    plt.rcParams['axes.unicode_minus'] = False
    plt.rcParams['figure.dpi'] = 300  # resolution
    # set up the figure with a colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    mat = np.repeat(attentions.numpy(), 20, axis=0)
    cax = ax.matshow(mat)       # colormap, cmap='bone'
    # fig.colorbar(cax)
    # plt.title(title,verticalalignment='bottom')
    # set up the axes
    ax.set_xticklabels([''] + input_sentence, rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    # ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    # ax.yaxis.set_major_locator(ticker.MultipleLocator(20))

    plt.show()


def test(model):
    model.eval()
    with torch.no_grad():
        happy3_wave = '/mnt/Data1/ysc/CASIA database/wangzhe/sad/249.wav'
        wave = mfcc_python_speech_features(happy3_wave)
        wave = torch.Tensor(wave)
        wave = wave.unsqueeze(0)
        wave = wave.cuda()
        model = model.cuda()
        attention, predicted_label = model(wave)
        text = 'x'
        if predicted_label.argmax(1).data.item() == 1:
            text = 'POS √'
        if predicted_label.argmax(1).data.item() == 0:
            text = 'NEG √'
        # print(attention)
        showAttention([], [text], attention.cpu(),'')


def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


def generator_batch(data_batch):
    wave_batch = []
    label_batch = []
    for wave, label in data_batch:
        wave_batch.append(wave)
        label_batch.append(label)
    wave_batch = pad_sequence(wave_batch, batch_first=True)
    label_batch = torch.tensor(label_batch)
    return wave_batch, label_batch


BATCH_SIZE = 128
HIDDEN_SIZE = 96  # 96
MAX_LENGTH = 999
if __name__ == '__main__':
    # datafile = WaveDataset()
    # print('Dataset loaded! length of dataset is {0}'.format(len(datafile)))
    # n_train = int(len(datafile) * 0.85)
    # split_train, split_valid = random_split(dataset=datafile, lengths=[n_train, len(datafile) - n_train])
    # train_dataloader = DataLoader(split_train, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generator_batch)
    # valid_dataloader = DataLoader(split_valid, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generator_batch)
    # print('Length of train set is {}, Length of valid set is {}'.format(len(split_train), len(split_valid)))
    model = Net(HIDDEN_SIZE)
    acc_min = 0.5
    if torch.cuda.is_available() == True:
    #     print('Cuda is available!')
    #     model = model.cuda()
    #     optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate
    #     scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
    #     criterion = nn.CrossEntropyLoss()
    #     losses = []
    #     start_time = time.time()
    #     for epoch in range(100):  # maximum number of epochs
    #         cnt = 0
    #         for wave, label in train_dataloader:  # batch * seq_len * input_size
    #             wave, label = wave.cuda(), label.cuda()
    #             # out = model(wave)  # the plain LSTM nets return a single value
    #             _, out = model(wave)        # the Attention net returns two values
    #             loss = criterion(out, label.squeeze())
    #             losses.append(loss)
    #             loss.backward()
    #             torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # cap the gradient norm at max_norm
    #             optimizer.step()
    #             optimizer.zero_grad()
    #             # cnt += 1
    #         end_time = time.time()
    #         epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    #         acc_valid = evaluate(valid_dataloader)
    #         print('Epoch: {}, batch: {}, average training loss: {:.4f}, validation accuracy: {:.4f}, Total time: {}m {}s'
    #               .format(epoch, cnt, sum(losses) / len(losses), acc_valid, epoch_mins, epoch_secs))
    #         if acc_valid > acc_min:
    #             acc_min = acc_valid
    #             torch.save(model.state_dict(), 'model_wave_Attention.pth')
    #             print('Save model!')
    #         else:
    #             scheduler.step()

        model.load_state_dict(torch.load('model_wave_Attention.pth'))
        # model_wave_best.pth: hidden acc:89%
        # model_wave_LSTM_output.pth: output acc:82%
        # model_wave_Attention.pth: Attention acc:87%

        test(model)

        # print('Validation accuracy: {}'.format(evaluate(valid_dataloader)))
        # plt.plot(losses)
        # plt.show()
# </editor-fold>

Summary

LSTM hidden state, LSTM output, and Attention were each used to build a speech emotion classifier; among them, the hidden-state variant classifies best. Future work: read papers on multimodal Chinese datasets.
