1. LSTM-hidden Implementation Details
Regarding class torch.utils.data.Dataset, see the official documentation. A minimal map-style dataset only needs to implement __getitem__ and __len__; a sketch follows.
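As a quick illustration, here is a minimal sketch of that interface with toy data (the class name and shapes are hypothetical; only __getitem__ and __len__ are required, mirroring the WaveDataset in Section 5):
import torch
from torch.utils.data import Dataset, DataLoader

class ToyWaveDataset(Dataset):
    """Hypothetical stand-in for the WaveDataset defined in Section 5."""
    def __init__(self, n=8, seq_len=10, n_feats=26):
        # fake "log-fbank" features and binary labels
        self.features = [torch.rand(seq_len, n_feats) for _ in range(n)]
        self.labels = [torch.LongTensor([k % 2]) for k in range(n)]

    def __getitem__(self, item):
        # the DataLoader calls this once per sampled index
        return self.features[item], self.labels[item]

    def __len__(self):
        return len(self.features)

loader = DataLoader(ToyWaveDataset(), batch_size=4, shuffle=True)
for wave, label in loader:
    print(wave.shape, label.shape)  # torch.Size([4, 10, 26]) torch.Size([4, 1])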
With BATCH_SIZE = 128, HIDDEN_SIZE = 64, and max epochs = 5:
Epoch: 0, batch: 0, avg train loss: 0.6244831681251526, valid accuracy: 0.7070707070707071, total time: 0m 26s
Epoch: 1, batch: 0, avg train loss: 0.5765572786331177, valid accuracy: 0.6868686868686869, total time: 0m 55s
Epoch: 2, batch: 0, avg train loss: 0.5467721819877625, valid accuracy: 0.7744107744107744, total time: 1m 25s
Epoch: 3, batch: 0, avg train loss: 0.5237111449241638, valid accuracy: 0.7811447811447811, total time: 1m 55s
Epoch: 4, batch: 0, avg train loss: 0.5050662159919739, valid accuracy: 0.797979797979798, total time: 2m 25s
Validation accuracy: 0.797979797979798
With BATCH_SIZE = 128, HIDDEN_SIZE = 64, and max epochs = 50:
Dataset size: 2961, POS: 2059, NEG: 902
True
Dataset loaded! length of dataset is 2961
Length of train set is 2664, Length of valid set is 297
Cuda is available!
Epoch: 0, batch: 0, avg train loss: 0.5977479219436646, valid accuracy: 0.7205387205387206, total time: 0m 26s
Epoch: 1, batch: 0, avg train loss: 0.558993935585022, valid accuracy: 0.7676767676767676, total time: 0m 55s
Epoch: 2, batch: 0, avg train loss: 0.5424327850341797, valid accuracy: 0.734006734006734, total time: 1m 24s
Epoch: 3, batch: 0, avg train loss: 0.5222371220588684, valid accuracy: 0.7811447811447811, total time: 1m 59s
Epoch: 4, batch: 0, avg train loss: 0.5041710138320923, valid accuracy: 0.7946127946127947, total time: 2m 31s
Epoch: 5, batch: 0, avg train loss: 0.48880547285079956, valid accuracy: 0.8080808080808081, total time: 3m 1s
...
Epoch: 39, batch: 0, avg train loss: 0.3070719838142395, valid accuracy: 0.8653198653198653, total time: 21m 36s
Epoch: 40, batch: 0, avg train loss: 0.3038052022457123, valid accuracy: 0.8821548821548821, total time: 22m 9s
Epoch: 41, batch: 0, avg train loss: 0.3010295331478119, valid accuracy: 0.8518518518518519, total time: 22m 42s
Epoch: 42, batch: 0, avg train loss: 0.2985186278820038, valid accuracy: 0.8754208754208754, total time: 23m 13s
Epoch: 43, batch: 0, avg train loss: 0.2955150306224823, valid accuracy: 0.8686868686868687, total time: 23m 44s
Epoch: 44, batch: 0, avg train loss: 0.29303693771362305, valid accuracy: 0.8114478114478114, total time: 24m 14s
Epoch: 45, batch: 0, avg train loss: 0.29053598642349243, valid accuracy: 0.8518518518518519, total time: 24m 45s
Epoch: 46, batch: 0, avg train loss: 0.28803855180740356, valid accuracy: 0.8686868686868687, total time: 25m 16s
Epoch: 47, batch: 0, avg train loss: 0.2856229841709137, valid accuracy: 0.8686868686868687, total time: 25m 46s
Epoch: 48, batch: 0, avg train loss: 0.2830147445201874, valid accuracy: 0.8619528619528619, total time: 26m 17s
Epoch: 49, batch: 0, avg train loss: 0.280977725982666, valid accuracy: 0.8720538720538721, total time: 26m 50s
Validation accuracy: 0.8720538720538721
For the CASIA Chinese emotional speech corpus, the five emotion categories are mapped to binary sentiment as follows:
category2emotion = {
'angry':0,
'fear':0,
'sad':0,
'happy':1,
'surprise':1
}
This gives 1000 CASIA samples in total (4 speakers × 5 categories × 50 utterances), of which 400 are positive (happy, surprise) and 600 negative (angry, fear, sad).
The combined dataset size is then 3961 (POS: 2459, NEG: 1502). After changing the validation split from 10% to 15%: Length of train set is 3366, Length of valid set is 595. The model is saved whenever a new best result is reached:
Cuda is available!
Epoch: 0, batch: 0, avg train loss: 0.6201511025428772, valid accuracy: 0.6823529411764706, total time: 0m 44s
Epoch: 1, batch: 0, avg train loss: 0.5891127586364746, valid accuracy: 0.7327731092436974, total time: 1m 36s
Epoch: 2, batch: 0, avg train loss: 0.569821298122406, valid accuracy: 0.7260504201680672, total time: 2m 28s
Epoch: 3, batch: 0, avg train loss: 0.553916871547699, valid accuracy: 0.7747899159663866, total time: 3m 19s
Epoch: 4, batch: 0, avg train loss: 0.5395360589027405, valid accuracy: 0.7815126050420168, total time: 4m 11s
Epoch: 5, batch: 0, avg train loss: 0.5276969075202942, valid accuracy: 0.788235294117647, total time: 5m 2s
Epoch: 6, batch: 0, avg train loss: 0.5157127976417542, valid accuracy: 0.8084033613445378, total time: 5m 54s
Epoch: 7, batch: 0, avg train loss: 0.5061213374137878, valid accuracy: 0.7899159663865546, total time: 6m 45s
Epoch: 8, batch: 0, avg train loss: 0.49973732233047485, valid accuracy: 0.7966386554621848, total time: 7m 37s
Epoch: 9, batch: 0, avg train loss: 0.4922909140586853, valid accuracy: 0.788235294117647, total time: 8m 29s
Epoch: 10, batch: 0, avg train loss: 0.4837799072265625, valid accuracy: 0.8369747899159664, total time: 9m 20s
...
Epoch: 41, batch: 0, avg train loss: 0.35956424474716187, valid accuracy: 0.880672268907563, total time: 27m 22s
Epoch: 42, batch: 0, avg train loss: 0.3565613031387329, valid accuracy: 0.853781512605042, total time: 27m 55s
Epoch: 43, batch: 0, avg train loss: 0.3547583222389221, valid accuracy: 0.8756302521008403, total time: 28m 28s
Epoch: 44, batch: 0, avg train loss: 0.35203972458839417, valid accuracy: 0.8588235294117647, total time: 29m 1s
Epoch: 45, batch: 0, avg train loss: 0.34941181540489197, valid accuracy: 0.8722689075630252, total time: 29m 34s
Epoch: 46, batch: 0, avg train loss: 0.3464643359184265, valid accuracy: 0.8554621848739495, total time: 30m 7s
Epoch: 47, batch: 0, avg train loss: 0.3436892628669739, valid accuracy: 0.8705882352941177, total time: 30m 41s
Epoch: 48, batch: 0, avg train loss: 0.34104934334754944, valid accuracy: 0.8336134453781513, total time: 31m 14s
Epoch: 49, batch: 0, avg train loss: 0.33854228258132935, valid accuracy: 0.8689075630252101, total time: 31m 48s
Validation accuracy: 0.8689075630252101
For some reason the best model doesn't seem to have been saved…
Define best_model = None up front, change hidden_size to 96, and retrain:
...
Epoch: 42, batch: 0, avg train loss: 0.35273611545562744, valid accuracy: 0.8521008403361344, total time: 24m 4s
Epoch: 43, batch: 0, avg train loss: 0.3498222827911377, valid accuracy: 0.8470588235294118, total time: 24m 38s
Epoch: 44, batch: 0, avg train loss: 0.3466912806034088, valid accuracy: 0.8554621848739495, total time: 25m 12s
Epoch: 45, batch: 0, avg train loss: 0.34298786520957947, valid accuracy: 0.8521008403361344, total time: 25m 46s
Epoch: 46, batch: 0, avg train loss: 0.3392830789089203, valid accuracy: 0.8487394957983193, total time: 26m 19s
Epoch: 47, batch: 0, avg train loss: 0.33573663234710693, valid accuracy: 0.8521008403361344, total time: 26m 53s
Epoch: 48, batch: 0, avg train loss: 0.33253514766693115, valid accuracy: 0.8487394957983193, total time: 27m 27s
Epoch: 49, batch: 0, avg train loss: 0.3291744291782379, valid accuracy: 0.838655462184874, total time: 28m 0s
Validation accuracy: 0.838655462184874
The best model still isn't being saved…
Epoch: 51, batch: 0, avg train loss: 0.3056916892528534, valid accuracy: 0.8504201680672269, total time: 29m 27s
Epoch: 52, batch: 0, avg train loss: 0.3024693727493286, valid accuracy: 0.8907563025210085, total time: 30m 1s
Save model!
Epoch: 53, batch: 0, avg train loss: 0.2991935610771179, valid accuracy: 0.8773109243697479, total time: 30m 35s
Epoch: 54, batch: 0, avg train loss: 0.2951126992702484, valid accuracy: 0.8605042016806723, total time: 31m 9s
Epoch: 55, batch: 0, avg train loss: 0.2914368212223053, valid accuracy: 0.8739495798319328, total time: 31m 43s
Epoch: 56, batch: 0, avg train loss: 0.2875944972038269, valid accuracy: 0.8689075630252101, total time: 32m 18s
Epoch: 57, batch: 0, avg train loss: 0.28370267152786255, valid accuracy: 0.8571428571428571, total time: 32m 53s
Epoch: 58, batch: 0, avg train loss: 0.2803336977958679, valid accuracy: 0.8588235294117647, total time: 33m 27s
Epoch: 59, batch: 0, avg train loss: 0.27704960107803345, valid accuracy: 0.853781512605042, total time: 34m 2s
Validation accuracy: 0.8907563025210085
This time it was saved.
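For reference, a minimal sketch of the save-best pattern that finally worked (mirroring the acc_min / torch.save logic in Section 5; the model and accuracy values here are stand-ins):
import torch
import torch.nn as nn

model = nn.Linear(26, 2)   # stand-in for Net(HIDDEN_SIZE)
best_acc = 0.5             # initialize *before* the epoch loop

for epoch in range(3):
    acc_valid = 0.6 + 0.1 * epoch   # stand-in for evaluate(valid_dataloader)
    if acc_valid > best_acc:
        best_acc = acc_valid
        # save the parameters only; saving inside the loop keeps the best checkpoint on disk
        torch.save(model.state_dict(), 'model_wave_best.pth')
        print('Save model!')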
Opening a file with the .pkl extension:
import pickle

# assumes `path` (a dataset root) is already defined
with open(path+'Processed/unaligned_39.pkl', 'rb') as data_file:
    data = pickle.load(data_file)
print(data)
On zero-padding a NumPy array:
import numpy as np
array = np.array([[1, 1], [2, 2]])
"""
((1,1),(2,2)) pads the first axis of array (here: rows) with 1 row before and
1 row after, and the second axis (here: columns) with 2 columns before and 2 after.
constant_values=(0, 3) gives the (before, after) fill values, broadcast over both
axes: padding added before an axis is filled with 0, padding added after with 3.
"""
ndarray = np.pad(array, ((1, 1), (2, 2)), 'constant', constant_values=(0, 3))
print("array", array)
print("ndarray=", ndarray)
array [[1 1]
[2 2]]
ndarray= [[0 0 0 0 3 3]
[0 0 1 1 3 3]
[0 0 2 2 3 3]
[0 0 3 3 3 3]]
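To make the (before, after) broadcasting of constant_values explicit, here is a small sketch (a 1-D check plus the fully spelled-out per-axis form of the example above):
import numpy as np

a = np.array([1, 2, 3])
# (before, after) fill values: 0 on the left, 9 on the right
print(np.pad(a, (2, 2), 'constant', constant_values=(0, 9)))  # [0 0 1 2 3 9 9]

# equivalent fully explicit per-axis form of the 2-D example above
b = np.array([[1, 1], [2, 2]])
print(np.pad(b, ((1, 1), (2, 2)), 'constant',
             constant_values=((0, 3), (0, 3))))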
In this project:
import numpy as np
x = np.random.rand(1200,26)
y = np.random.rand(500,26)
print(x.shape, y.shape)
x = x[:999]
y = y[:999]
print(x.shape, y.shape)
x = np.pad(x, ((0, 999 - len(x)),(0, 0)), mode='constant')
y = np.pad(y, ((0, 999 - len(y)),(0, 0)), mode='constant')
print(x.shape, y.shape)
print(x,y)
(1200, 26) (500, 26)
(999, 26) (500, 26)
(999, 26) (999, 26)
[[0.36276419 0.12025306 0.74485184 ... 0.07995154 0.87567182 0.29699818]
[0.7622248 0.03426568 0.26884526 ... 0.02622873 0.90374401 0.01375409]
[0.68813536 0.7224915 0.48943753 ... 0.19687244 0.62412416 0.16496784]
...
[0.93325474 0.79646476 0.79513437 ... 0.81376934 0.62457521 0.01225579]
[0.28987227 0.58696689 0.46909485 ... 0.77735102 0.38161923 0.00966942]
[0.35099698 0.04347501 0.57235593 ... 0.58973257 0.7156334 0.94827799]] [[0.78924269 0.60543369 0.66701066 ... 0.56727321 0.24234751 0.99821209]
[0.50008997 0.5139892 0.909358 ... 0.5260091 0.60402981 0.7940462 ]
[0.57316599 0.258695 0.96163063 ... 0.204388 0.00456991 0.44561223]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]
2. LSTM-output Implementation Details
Epoch: 53, batch: 0, avg train loss: 0.2106393724679947, valid accuracy: 0.7815126050420168, total time: 31m 44s
Epoch: 54, batch: 0, avg train loss: 0.20743916928768158, valid accuracy: 0.7865546218487395, total time: 32m 35s
Epoch: 55, batch: 0, avg train loss: 0.3545965254306793, valid accuracy: 0.7142857142857143, total time: 33m 25s
Epoch: 56, batch: 0, avg train loss: 1.031476378440857, valid accuracy: 0.7126050420168067, total time: 34m 16s
Epoch: 57, batch: 0, avg train loss: 1.3142709732055664, valid accuracy: 0.6554621848739496, total time: 35m 7s
Epoch: 58, batch: 0, avg train loss: 1.3166556358337402, valid accuracy: 0.7327731092436974, total time: 35m 57s
Epoch: 59, batch: 0, avg train loss: 1.3262178897857666, valid accuracy: 0.7478991596638656, total time: 36m 49s
Validation accuracy: 0.8201680672268907
The final results suddenly fall apart: the loss jumps from ~0.21 to ~1.3, which looks like exploding gradients. As before, add some safeguards:
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # caps the gradient norm at max_norm
scheduler = torch.optim.lr_scheduler.StepLR(optimizer_model, 1, gamma=0.9)
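A minimal sketch of where these two pieces sit in the training loop (mirroring the loop in Section 5; the network, data, and loss here are stand-ins, and the optimizer is simply named optimizer rather than optimizer_model):
import torch
import torch.nn as nn

model = nn.GRU(26, 64, batch_first=True)   # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)  # lr *= 0.9 per scheduler.step()
criterion = nn.MSELoss()

for epoch in range(3):
    wave = torch.rand(8, 10, 26)           # fake batch: batch * seq_len * input_size
    target = torch.rand(8, 10, 64)
    out, _ = model(wave)
    loss = criterion(out, target)
    loss.backward()
    # clip AFTER backward and BEFORE optimizer.step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # in Section 5 this is only called when validation accuracy stalls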
Run it again:
Epoch: 54, batch: 0, avg train loss: 0.6103, valid accuracy: 0.8084, total time: 32m 23s
Epoch: 55, batch: 0, avg train loss: 0.5997, valid accuracy: 0.8050, total time: 32m 59s
Epoch: 56, batch: 0, avg train loss: 0.5893, valid accuracy: 0.8101, total time: 33m 35s
Save model!
Epoch: 57, batch: 0, avg train loss: 0.5794, valid accuracy: 0.8084, total time: 34m 11s
Epoch: 58, batch: 0, avg train loss: 0.5698, valid accuracy: 0.8067, total time: 34m 48s
Epoch: 59, batch: 0, avg train loss: 0.5605, valid accuracy: 0.8084, total time: 35m 24s
Validation accuracy: 0.8100840336134454
There is a lot of fluctuation early on, and the final result is mediocre; perhaps there are too many parameters? Set HIDDEN_SIZE to 64 and observe:
Epoch: 56, batch: 0, avg train loss: 0.2949, valid accuracy: 0.7882, total time: 32m 55s
Epoch: 57, batch: 0, avg train loss: 0.2902, valid accuracy: 0.7933, total time: 33m 31s
Epoch: 58, batch: 0, avg train loss: 0.2857, valid accuracy: 0.7882, total time: 34m 5s
Epoch: 59, batch: 0, avg train loss: 0.2814, valid accuracy: 0.7849, total time: 34m 40s
Validation accuracy: 0.7983193277310925
Next, add weight initialization for the linear layer (the GRU's weights are left at their defaults): the bias is zeroed and the weights are drawn uniformly from [-0.1, 0.1], as sketched below.
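A minimal sketch of that initialization (it mirrors the init_weights method of the LSTM-output Net in Section 5):
import torch.nn as nn

linear = nn.Linear(2 * 64, 2)   # stand-in for the classifier head

def init_weights(layer, initrange=0.1):
    # bias -> 0, weights -> Uniform(-initrange, initrange)
    layer.bias.data.zero_()
    layer.weight.data.uniform_(-initrange, initrange)

init_weights(linear)
Retraining with this initialization: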
Epoch: 55, batch: 0, avg train loss: 0.4795, valid accuracy: 0.8269, total time: 32m 14s
Save model!
Epoch: 56, batch: 0, avg train loss: 0.4729, valid accuracy: 0.8151, total time: 32m 49s
Epoch: 57, batch: 0, avg train loss: 0.4665, valid accuracy: 0.8118, total time: 33m 24s
Epoch: 58, batch: 0, avg train loss: 0.4602, valid accuracy: 0.8134, total time: 33m 59s
Epoch: 59, batch: 0, avg train loss: 0.4542, valid accuracy: 0.8067, total time: 34m 33s
Validation accuracy: 0.826890756302521
Thinking about it more, using the raw LSTM output as the representation of the whole utterance (or sentence) for classification really is unreasonable: everything then hinges on how good the extracted features happen to be. So attention should be added.
3. Attention
On the .repeat function:
import torch
x = torch.rand(2,1024)
print(x)
x = x.repeat(35,1,1).permute(1,0,2)
print(x)
print(x.size())
tensor([[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430]])
tensor([[[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
...,
[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311],
[0.0709, 0.6496, 0.3133, ..., 0.9218, 0.6278, 0.5311]],
[[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430],
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430],
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430],
...,
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430],
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430],
[0.9497, 0.6602, 0.8397, ..., 0.5423, 0.7244, 0.3430]]])
torch.Size([2, 35, 1024])
Now swap the code above for the attention version. Note that the network also returns the attention values, so it returns two things. A sketch of the forward pass follows; the full class is in Section 5.
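A minimal runnable sketch of the attention forward pass (shapes follow the Net class in Section 5; the sizes here are toy values):
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, hidden_size = 2, 35, 64
gru = nn.GRU(26, hidden_size, num_layers=2, batch_first=True, bidirectional=True)
fc1 = nn.Linear(2 * 2 * hidden_size + 2 * hidden_size, 8)  # attention dim = 8
fc2 = nn.Linear(2 * hidden_size, 2)

wave = torch.rand(batch, seq_len, 26)
output, hidden = gru(wave)                               # (B, T, 2H), (2*2, B, H)
hidden = hidden.permute(1, 0, 2).contiguous()            # (B, 2*2, H)
hidden = hidden.view(hidden.size(0), -1)                 # (B, 4H)
hidden = hidden.repeat(seq_len, 1, 1).permute(1, 0, 2)   # (B, T, 4H), via .repeat as above
combine = torch.cat((output, hidden), dim=2)             # (B, T, 6H)
scores = torch.sum(torch.tanh(fc1(combine)), dim=2)      # (B, T)
attention = F.softmax(scores, dim=1)                     # weights over time steps
context = attention.unsqueeze(1).bmm(output)             # (B, 1, 2H) weighted sum
emo = fc2(context.squeeze(1))                            # (B, 2)
print(attention.shape, emo.shape)
The results: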
Epoch: 56, batch: 0, avg train loss: 0.3452, valid accuracy: 0.8790, total time: 32m 49s
Epoch: 57, batch: 0, avg train loss: 0.3440, valid accuracy: 0.8824, total time: 33m 23s
Save model!
Epoch: 58, batch: 0, avg train loss: 0.3429, valid accuracy: 0.8807, total time: 33m 58s
Epoch: 59, batch: 0, avg train loss: 0.3419, valid accuracy: 0.8756, total time: 34m 32s
Validation accuracy: 0.8823529411764706
It feels like raising the max epochs could do better; try 100:
Epoch: 97, batch: 0, avg train loss: 0.3042, valid accuracy: 0.8387, total time: 60m 16s
Epoch: 98, batch: 0, avg train loss: 0.3039, valid accuracy: 0.8387, total time: 60m 51s
Epoch: 99, batch: 0, avg train loss: 0.3035, valid accuracy: 0.8387, total time: 61m 26s
Validation accuracy: 0.8453781512605042
That is worse than the previous run, and toward the end the loss seems unable to drop any further. Try HIDDEN_SIZE = 96:
Epoch: 97, batch: 0, avg train loss: 0.2911, valid accuracy: 0.8723, total time: 58m 10s
Epoch: 98, batch: 0, avg train loss: 0.2906, valid accuracy: 0.8706, total time: 58m 46s
Epoch: 99, batch: 0, avg train loss: 0.2900, valid accuracy: 0.8706, total time: 59m 22s
Validation accuracy: 0.8773109243697479
4. Speech Visualization
The goal is an animated plot. matplotlib has some examples built on matplotlib.animation.FuncAnimation (see the official documentation, a plot example, and another example). You must store the created animation in a variable that lives as long as the animation runs; otherwise the animation object is garbage-collected and the animation stops. Parameters:
- fig
- func, the function called for each frame. Its first argument is the next frame's value; any extra positional arguments can be supplied via fargs. If blit == True, func must return an iterable of all artists that were modified or created; the blitting algorithm uses this to decide which parts of the figure need redrawing. If blit == False the return value is unused and may be omitted.
Problem: drawing is far too slow. Even setting interval=0.0001 (interval is in milliseconds, so this is effectively zero) barely helps, so this approach is only usable for small amounts of data. The code:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.lines import Line2D

# `time` and `data` are assumed already loaded: the time axis and waveform
# returned by read_wav/draw_time_domain_image further below.

class Scope:
    def __init__(self, ax):
        self.ax = ax
        self.tdata = [0]
        self.ydata = [0]
        self.line = Line2D(self.tdata, self.ydata)
        self.ax.add_line(self.line)
        self.ax.set_ylim(-1, 1)

    def update(self, i):
        # append the next sample and redraw the line
        self.tdata.append(time[i])
        self.ydata.append(data[i])
        self.line.set_data(self.tdata, self.ydata)
        return self.line,

i = 0
def emitter():
    global i
    i += 1
    yield i

fig, ax = plt.subplots()
scope = Scope(ax)
# pass a generator in "emitter" to produce data for the update func
ani = animation.FuncAnimation(fig, scope.update, emitter, interval=0.0001,
                              blit=True)
plt.show()
Plotting can also be done directly with plt.ion(). How to save the result? Following this article, use with writer.saving(fig, "... your path/writer_test.mp4", 100): (the 100 is the dpi, dots per inch, i.e. the resolution).
This raised an error:
FileNotFoundError: [WinError 2] The system cannot find the file specified
ffmpeg has to be installed on Windows: from the official download page, grab ffmpeg-N-101947-gc5ca18fd1b-win64-gpl.zip (95.1 MB), add its bin directory to the PATH environment variable, then verify at the command line:
ffmpeg -version
Installation succeeded:
But the error persisted; restarting PyCharm solved it. If that doesn't work, this article may also help.
Problem: how do we keep the duration of the rendered time-domain video as close as possible to the original audio length? Read the mp4 back in and compute its total duration with OpenCV (see this article); a minimal sketch follows, and the same check appears in the code below.
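A sketch of that duration check, using OpenCV's named constants (equivalent to the raw indices cap.get(5) and cap.get(7) used in the code below; the mp4 path is a placeholder):
import cv2

cap = cv2.VideoCapture('writer_test.mp4')  # placeholder path
if cap.isOpened():
    fps = cap.get(cv2.CAP_PROP_FPS)                  # property index 5
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # property index 7
    print('duration (s):', frame_count / fps)
cap.release()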
Converting the speech data into a time-domain visualization. The code:
import wave
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FFMpegWriter
import cv2
CASIA_path = '... your path/11/CASIA database/'
example1 = '232. 噪音产生污染 '
happy1 = 'liuchanhg/happy/232.wav'
happy2 = 'wangzhe/happy/232.wav'
sad1 = 'liuchanhg/sad/232.wav'
sad2 = 'wangzhe/sad/232.wav'
example2 = '249. 笑话逗得大家开心'
happy3 = 'liuchanhg/happy/249.wav'
happy4 = 'wangzhe/happy/249.wav'
sad3 = 'liuchanhg/sad/249.wav'
sad4 = 'wangzhe/sad/249.wav'
def read_wav(path_wav):
    f = wave.open(path_wav, 'rb')
    params = f.getparams()
    nchannels, sampwidth, framerate, nframes = params[:4]  # channels, sample width (bytes), sample rate, frame count
    voiceStrData = f.readframes(nframes)
    waveData = np.frombuffer(voiceStrData, dtype=np.short)  # raw bytes -> integers (frombuffer instead of the deprecated fromstring)
    waveData = waveData * 1.0 / max(abs(waveData))  # normalize the audio
    waveData = np.reshape(waveData, [nframes, nchannels]).T  # .T transposes so that each of the nchannels rows holds one channel's samples
    f.close()
    return waveData, nframes, framerate
def draw_time_domain_image(waveData, nframes, framerate):  # time-domain signal
time = np.arange(0,nframes) * (1.0/framerate)
# plt.plot(time,waveData[0,:],c='b')
# plt.xlabel('time')
# plt.ylabel('am')
# plt.show()
return time,waveData[0,:]
metadata = dict(title='Video to mp4', artist='Matplotlib')
writer = FFMpegWriter(fps=30, metadata=metadata)
waveData, nframes, framerate = read_wav(CASIA_path + happy2)
time, data = draw_time_domain_image(waveData, nframes, framerate)
real_time = time[-1]
print(real_time)
fig = plt.figure(1)
INTERVAL = 537  # samples per frame; with 537 the video runs 2.166 s vs. the real 2.1506875 s
mp4_path = "... your path/happy2.mp4"
def wavedata2mp4(maxepoch):
    global INTERVAL
    for epoch in range(maxepoch):  # renamed from i to avoid shadowing the inner loop variable
        with writer.saving(fig, mp4_path, 100):  # 100 is the dpi (dots per inch), i.e. the resolution
            # plt.ion()
            x = []
            y = []
            # plt.xlim(0, time[-1])
            # plt.ylim(0, 1)
            for i in range(0, len(time), INTERVAL):
                plt.clf()  # clear everything on the canvas
                # x = np.append(x, time[i:i+2000])
                # y = np.append(y, data[i:i+2000])
                x = time[:i+INTERVAL]
                y = data[:i+INTERVAL]
                plt.axis([0, time[-1], -1, 1])
                plt.plot(x, y)
                # plt.draw()
                plt.pause(0.001)
                writer.grab_frame()
            # plt.ioff()
            # plt.show()
        cap = cv2.VideoCapture(mp4_path)
        if cap.isOpened():
            rate = cap.get(5)       # fps
            frame_num = cap.get(7)  # frame count
            video_time = frame_num / rate
            print(video_time, INTERVAL)
            # nudge INTERVAL so the video duration converges toward the audio duration
            if video_time > real_time:
                INTERVAL += 1
            else:
                INTERVAL -= 1
wavedata2mp4(1)
5. Full Code
# <editor-fold desc="Voice">
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
from torch.utils.data import Dataset
import torch
from torch.utils.data.dataset import random_split
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as F
import time
from torch.nn.utils.rnn import pad_sequence
import matplotlib.ticker as ticker
category2emotion = {
'angry': 0,
'fear': 0,
'sad': 0,
'happy': 1,
'surprise': 1
}
import os
CASIA_path = '/mnt/Data1/ysc/CASIA database/'
CASIA_wave_path = []
CASIA_label_list = []
CASIA_dirs = ['liuchanhg', 'wangzhe', 'zhaoquanyin', 'ZhaoZuoxiang']
categories = ['angry', 'fear', 'happy', 'sad', 'surprise']
for dir in CASIA_dirs:
for category in categories:
label = category2emotion[category]
for dataname in os.listdir(CASIA_path + dir + '/' + category):
if dataname.endswith('wav'):
CASIA_wave_path.append(dir + '/' + category + '/' + dataname)
CASIA_label_list.append(label)
dic = {}
path = '/mnt/Data1/ysc/TAL_SER/TAL-SER/'
path_label = path + 'label/' + 'label'      # emotion labels
path_utt2gen = path + 'label/' + 'utt2gen'  # gender
path_utt2spk = path + 'label/' + 'utt2spk'  # speaker
path_wav = path + 'label/' + 'wav.scp'      # wav file paths
class WaveDataset(Dataset):
def __init__(self):
total = -1 # 4188
with open(path_label, 'r', encoding='utf-8') as file:
for line in file.readlines():
total += 1
if total == 0:
continue
tmp = line.strip().split(' ')
dic[eval(tmp[0])] = {'P': eval(tmp[1]), 'A': eval(tmp[2])}
with open(path_utt2gen, 'r', encoding='utf-8') as file:
for line in file.readlines():
tmp = line.strip().split(' ')
dic[eval(tmp[0])].update({'Gender': tmp[1]})
with open(path_utt2spk, 'r', encoding='utf-8') as file:
for line in file.readlines():
tmp = line.strip().split(' ')
dic[eval(tmp[0])].update({'Speaker': tmp[1]})
with open(path_wav, 'r', encoding='utf-8') as file:
for line in file.readlines():
tmp = line.strip().split(' ')
dic[eval(tmp[0])].update({'Wav': tmp[1][1:]})
self.path = []
self.label = []
self.n_pos = 0
self.n_neg = 0
for key in dic:
if dic[key]['A'] >= 0.5: # 2961
self.path.append(dic[key]['Wav'])
self.label.append(1)
self.n_pos += 1
elif dic[key]['A'] <= -0.5:
self.path.append(dic[key]['Wav'])
self.label.append(0)
self.n_neg += 1
self.path = self.path + CASIA_wave_path
self.label = self.label + CASIA_label_list
self.n_pos = self.n_pos + 400
self.n_neg = self.n_neg + 600
        print('Dataset size: %d, POS: %d, NEG: %d' % (len(self.path), self.n_pos, self.n_neg))
print(len(self.label) == len(self.path))
def __getitem__(self, item):
try:
path_wav = path + self.path[item]
filterbank_features = mfcc_python_speech_features(path_wav)
except:
path_wav = CASIA_path + self.path[item]
filterbank_features = mfcc_python_speech_features(path_wav)
# return torch.from_numpy(filterbank_features), torch.Tensor([self.label[item]])
return torch.Tensor(filterbank_features), torch.LongTensor([self.label[item]])
def __len__(self):
return self.n_pos + self.n_neg
def mfcc_python_speech_features(path):
    sampling_freq, audio = wavfile.read(path)  # read the input audio file
    # mfcc_features = mfcc(audio, sampling_freq)  # extract MFCC features
    filterbank_features = logfbank(audio, sampling_freq)  # filter-bank features, numpy.ndarray, (999, 26)
    # optionally force every output to MAX_LENGTH:
    # filterbank_features = filterbank_features[:MAX_LENGTH]
    # filterbank_features = np.pad(filterbank_features, ((0, MAX_LENGTH - len(filterbank_features)), (0, 0)),
    #                              mode='constant')
    return filterbank_features
# class Net(nn.Module): # LSTM hidden
# def __init__(self, hidden_size):
# super(Net, self).__init__()
# self.hidden_size = hidden_size
# self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True, bidirectional=True, dropout=0.1)
# self.linear = nn.Linear(2*2*hidden_size, 2)
#
# def forward(self, input, hidden=None):
# output, hidden = self.gru(input, hidden) # batch * seq_len * (2*256), (2*2) * batch * 256
# hidden = hidden.permute(1,0,2).contiguous() # batch * (2*2) * 256
# hidden = hidden.view(hidden.size(0), -1) # batch * (2*2*256)
# hidden = self.linear(hidden) # batch * 2
# # return F.softmax(hidden, dim=1)
# return hidden
# class Net(nn.Module): # LSTM output
# def __init__(self, hidden_size):
# super(Net, self).__init__()
# self.hidden_size = hidden_size
# self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True,
# bidirectional=True, dropout=0.1)
# self.linear = nn.Linear(MAX_LENGTH * 2 * HIDDEN_SIZE, 2)
# self.init_weights()
#
# def init_weights(self):
# initrange = 0.1
# self.linear.bias.data.zero_()
# self.linear.weight.data.uniform_(-initrange, initrange)
#
# def forward(self, input, hidden=None):
# output, hidden = self.gru(input, hidden) # batch * seq_len * (2*256), (2*2) * batch * 256
# output = output.contiguous().view(output.size(0), -1)
# output = self.linear(output)
# return output
class Net(nn.Module): # LSTM output Attention
def __init__(self, hidden_size):
super(Net, self).__init__()
self.hidden_size = hidden_size
self.gru = nn.GRU(input_size=26, hidden_size=self.hidden_size, num_layers=2, batch_first=True, bidirectional=True, dropout=0.1)
self.fc1 = nn.Linear((2 * 2 * hidden_size + 2 * hidden_size), 8) # Attention dim = 8
self.fc2 = nn.Linear(2 * hidden_size, 2)
def forward(self, input, hidden=None):
output, hidden = self.gru(input, hidden) # batch * seq_len * (2*256), (2*2) * batch * 256
hidden = hidden.permute(1,0,2).contiguous() # batch * (2*2) * 256
hidden = hidden.view(hidden.size()[0], -1) # batch * ((2*2) * 256)
hidden = hidden.repeat(output.size()[1], 1, 1).permute(1,0,2) # batch * seq_len * ((2*2) * 256)
combine = torch.cat((output, hidden), dim=2) # batch * seq_len * ((2*2) * 256 + (2 * 256))
combine = self.fc1(combine) # batch * seq_len * 8
combine = torch.tanh(combine)
combine = torch.sum(combine, dim=2) # batch * seq_len
attention = F.softmax(combine, dim=1)
a = attention.unsqueeze(1)
a_apply = a.bmm(output)
emo = self.fc2(a_apply.squeeze(1))
return attention, emo # batch * seq_len, batch * 2
def evaluate(dataloader):
model.eval()
total_acc, total_count = 0, 0
with torch.no_grad():
for i, (wave, label) in enumerate(dataloader):
wave, label = wave.cuda(), label.cuda()
# predicted_label = model(wave)
_, predicted_label = model(wave)
# print(predicted_label.argmax(1))
total_acc += (predicted_label.argmax(1) == label).sum().item()
total_count += label.size(0)
model.train()
return total_acc / total_count
def showAttention(input_sentence, output_words, attentions, title):
    plt.rcParams['font.sans-serif'] = ['SimHei']  # so Chinese labels render correctly
    plt.rcParams['axes.unicode_minus'] = False
    plt.rcParams['figure.dpi'] = 300  # resolution
    # set up the figure with a colorbar
fig = plt.figure()
ax = fig.add_subplot(111)
mat = np.repeat(attentions.numpy(), 20, axis=0)
cax = ax.matshow(mat) # colormap, cmap='bone'
# fig.colorbar(cax)
# plt.title(title,verticalalignment='bottom')
    # set up the axes
ax.set_xticklabels([''] + input_sentence, rotation=90)
ax.set_yticklabels([''] + output_words)
# Show label at every tick
# ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
# ax.yaxis.set_major_locator(ticker.MultipleLocator(20))
plt.show()
def test(model):
model.eval()
with torch.no_grad():
happy3_wave = '/mnt/Data1/ysc/CASIA database/wangzhe/sad/249.wav'
wave = mfcc_python_speech_features(happy3_wave)
wave = torch.Tensor(wave)
wave = wave.unsqueeze(0)
wave = wave.cuda()
model = model.cuda()
attention, predicted_label = model(wave)
text = 'x'
if predicted_label.argmax(1).data.item() == 1:
text = 'POS √'
if predicted_label.argmax(1).data.item() == 0:
text = 'NEG √'
# print(attention)
showAttention([], [text], attention.cpu(),'')
def epoch_time(start_time, end_time):
elapsed_time = end_time - start_time
elapsed_mins = int(elapsed_time / 60)
elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
return elapsed_mins, elapsed_secs
def generator_batch(data_batch):
wave_batch = []
label_batch = []
for wave, label in data_batch:
wave_batch.append(wave)
label_batch.append(label)
wave_batch = pad_sequence(wave_batch, batch_first=True)
label_batch = torch.tensor(label_batch)
return wave_batch, label_batch
BATCH_SIZE = 128
HIDDEN_SIZE = 96 # 96
MAX_LENGTH = 999
if __name__ == '__main__':
# datafile = WaveDataset()
# print('Dataset loaded! length of dataset is {0}'.format(len(datafile)))
# n_train = int(len(datafile) * 0.85)
# split_train, split_valid = random_split(dataset=datafile, lengths=[n_train, len(datafile) - n_train])
# train_dataloader = DataLoader(split_train, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generator_batch)
# valid_dataloader = DataLoader(split_valid, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generator_batch)
# print('Length of train set is {}, Length of valid set is {}'.format(len(split_train), len(split_valid)))
model = Net(HIDDEN_SIZE)
acc_min = 0.5
    if torch.cuda.is_available():
# print('Cuda is available!')
# model = model.cuda()
        # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
# criterion = nn.CrossEntropyLoss()
# losses = []
# start_time = time.time()
        # for epoch in range(100):  # max epochs
# cnt = 0
# for wave, label in train_dataloader: # batch * seq_len * input_size
# wave, label = wave.cuda(), label.cuda()
        #         # out = model(wave)  # the plain LSTM nets return one value
        #         _, out = model(wave)  # the Attention net returns two values
# loss = criterion(out, label.squeeze())
        #         losses.append(loss.item())  # store a float, not the graph-holding tensor
# loss.backward()
        #         torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # caps the gradient norm at max_norm
# optimizer.step()
# optimizer.zero_grad()
# # cnt += 1
# end_time = time.time()
# epoch_mins, epoch_secs = epoch_time(start_time, end_time)
# acc_valid = evaluate(valid_dataloader)
        #     print('Epoch: {}, batch: {}, avg train loss: {:.4f}, valid accuracy: {:.4f}, total time: {}m {}s'
        #           .format(epoch, cnt, sum(losses) / len(losses), acc_valid, epoch_mins, epoch_secs))
# if acc_valid > acc_min:
# acc_min = acc_valid
# torch.save(model.state_dict(), 'model_wave_Attention.pth')
# print('Save model!')
# else:
# scheduler.step()
model.load_state_dict(torch.load('model_wave_Attention.pth'))
# model_wave_best.pth: hidden acc:89%
# model_wave_LSTM_output.pth: output acc:82%
# model_wave_Attention.pth: Attention acc:87%
test(model)
    # print('Validation accuracy: {}'.format(evaluate(valid_dataloader)))
# plt.plot(losses)
# plt.show()
# </editor-fold>
Summary
Speech emotion classification was implemented three ways: from the LSTM hidden state, from the LSTM output, and with attention. The hidden-state variant classified best. Future work: read papers on multimodal Chinese datasets.