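Overview
These are notes on the points I found worth paying attention to while building a simple LSTM + CTC model in Keras.
The theory behind LSTM and CTC is not repeated here; the following two links are provided for reference.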
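Layer input and output shapes
Sometimes the theory seems more or less clear, yet it is still hard to know where to start in practice. Having a clear picture of the input and output shape of every layer in the network makes the implementation much easier.
LSTM layer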
lstm = LSTM(units=40, return_sequences=True)
Input shape: (batch_size, time_steps, step_length)
Output shape: (batch_size, time_steps, units)
Here time_steps can be the number of MFCC frames extracted from the audio, and step_length is the number of MFCC features in one frame.
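As a rough check on these shapes, the sketch below estimates time_steps and step_length for a single clip. It assumes 25 ms windows with a 10 ms step, 13 MFCC coefficients stacked with two delta blocks (as in the full code at the end of this post), and the framing rule that python_speech_features appears to use; the helper name `expected_input_shape` is hypothetical, not part of any library.

```python
import math

def expected_input_shape(n_samples, fs, winlen=0.025, winstep=0.01,
                         numcep=13, n_blocks=3):
    """Estimate (batch_size, time_steps, step_length) for one audio clip.

    Assumption: one frame if the clip is shorter than a window, otherwise
    1 + ceil((n_samples - frame_len) / frame_step) frames.
    """
    frame_len = int(round(winlen * fs))    # samples per analysis window
    frame_step = int(round(winstep * fs))  # samples between window starts
    if n_samples <= frame_len:
        time_steps = 1
    else:
        time_steps = 1 + math.ceil((n_samples - frame_len) / frame_step)
    step_length = numcep * n_blocks        # MFCC + two delta blocks
    return (1, time_steps, step_length)

# One second of 16 kHz audio
shape = expected_input_shape(n_samples=16000, fs=16000)
print(shape)
```

Comparing this estimate with the printed feature.shape of a real file is a quick way to confirm that the features were extracted with the expected settings.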
Dense layer
dense = Dense(n_classes, activation='softmax')(lstm)
Input shape: (batch_size, time_steps, units)
Output shape: (batch_size, time_steps, n_classes)
Here n_classes is the number of output symbols (phonemes or characters), e.g. 26 English letters + 1 space + 1 blank.
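To make the shape change concrete, here is a minimal NumPy sketch of what a Dense layer with softmax does to a 3-D input: the same weight matrix is applied along the last axis at every time step. The weights are random stand-ins rather than trained values, and `dense_softmax` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def dense_softmax(x, kernel, bias):
    """Apply Dense + softmax along the last axis of a 3-D tensor."""
    # (batch, time, units) @ (units, n_classes) -> (batch, time, n_classes)
    logits = x @ kernel + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch_size, time_steps, units, n_classes = 2, 5, 40, 28
x = rng.normal(size=(batch_size, time_steps, units))   # stand-in for the LSTM output
kernel = rng.normal(size=(units, n_classes)) * 0.1     # random, untrained weights
bias = np.zeros(n_classes)

probs = dense_softmax(x, kernel, bias)
print(probs.shape)  # one probability distribution per time step
```

Each time step therefore gets its own distribution over the n_classes symbols, which is the form CTC expects as y_pred.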
CTC loss
Keras ships a CTC loss function, keras.backend.ctc_batch_cost, which has to be wrapped in a Lambda layer.
import keras.backend as K
from keras.layers import Lambda
def ctc_lambda_func(args):
    # The Lambda layer passes a list; note that ctc_batch_cost takes labels first
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
    ([dense, label_true, input_length, label_length])
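Here input_length has shape (batch_size, 1); its elements are the time_steps of the training data.
label_length has shape (batch_size, 1); its elements are the label length of each training sample (max_string_length when the labels are not padded).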
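To show how these length arguments are used, here is a plain NumPy sketch of the CTC negative log-likelihood for a single sample, computed with the standard forward recursion. It is an illustration only, not the Keras implementation: it assumes the blank is the last class, works in probability space without the log-space tricks a real implementation needs, and `ctc_neg_log_likelihood` is a hypothetical name.

```python
import numpy as np

def ctc_neg_log_likelihood(y_pred, label, input_length=None, blank=None):
    """y_pred: (time_steps, n_classes) softmax outputs; label: list of ints."""
    if input_length is not None:
        y_pred = y_pred[:input_length]      # only the valid frames are scored
    T, n_classes = y_pred.shape
    if blank is None:
        blank = n_classes - 1               # assumption: blank is the last class
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y_pred[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y_pred[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                     # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]            # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]            # skip the blank in between
            alpha[t, s] = total * y_pred[t, ext[s]]
    prob = alpha[T - 1, S - 1]
    if S > 1:
        prob += alpha[T - 1, S - 2]
    return -np.log(prob)

# Two frames, three classes (class 2 is the blank), uniform predictions
y = np.full((2, 3), 1.0 / 3.0)
loss_value = ctc_neg_log_likelihood(y, label=[0])
print(loss_value)
```

In this toy case three alignments produce the label (the symbol twice, symbol then blank, blank then symbol), each with probability 1/9, so the loss is -log(1/3).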
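Building the model
We need to build two models, base_model and model.
base_model uses dense as its output and is used for prediction after training.
model uses loss as its output and is used to train the parameters.
The model below uses GRU, which works much like LSTM.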
input = Input(shape=(time_steps, step_length))
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)
label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
    ([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)
Model training
First, model.compile has to be given our own CTC loss function.
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
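A Keras loss function receives the true labels y_true and the model output y_pred. Since the output of our model is already the CTC loss, we simply return y_pred as the loss.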
fittedModel = model.fit([input, labels, input_length, label_length], np.ones(1), batch_size=1, epochs=100,
                        verbose=2)
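Because the true labels are already fed in as an input and take part in the layer computation, the y passed to model.fit can hold arbitrary values, as long as its size matches batch_size.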
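For more than one sample, the four inputs and the dummy y have to be assembled per batch. The sketch below is one possible way to do that with NumPy. The zero-padding of features and labels, the helper names `encode_text` and `build_fit_inputs`, and the restriction to lowercase letters and spaces are all assumptions for illustration; the label scheme (space = 0, 'a'..'z' = 1..26) follows the full code later in this post.

```python
import numpy as np

def encode_text(text):
    """Map ' ' to 0 and 'a'..'z' to 1..26."""
    return [0 if ch == ' ' else ord(ch) - ord('a') + 1 for ch in text]

def build_fit_inputs(features, texts, pad_label=0):
    """features: list of (frames, step_length) arrays; texts: list of str."""
    batch_size = len(features)
    time_steps = max(f.shape[0] for f in features)
    step_length = features[0].shape[1]
    encoded = [encode_text(t) for t in texts]
    max_label_length = max(len(e) for e in encoded)

    x = np.zeros((batch_size, time_steps, step_length), dtype=np.float32)
    labels = np.full((batch_size, max_label_length), pad_label, dtype=np.float32)
    input_length = np.zeros((batch_size, 1), dtype=np.int64)
    label_length = np.zeros((batch_size, 1), dtype=np.int64)
    for i, (f, e) in enumerate(zip(features, encoded)):
        x[i, :f.shape[0]] = f               # zero-pad short clips
        labels[i, :len(e)] = e              # pad short labels
        input_length[i, 0] = f.shape[0]     # valid frames of this sample
        label_length[i, 0] = len(e)         # valid label symbols of this sample
    dummy_y = np.zeros((batch_size,), dtype=np.float32)  # ignored by the loss
    return [x, labels, input_length, label_length], dummy_y

feats = [np.ones((7, 39)), np.ones((5, 39))]
inputs, dummy_y = build_fit_inputs(feats, ['hi there', 'ok'])
print([a.shape for a in inputs], dummy_y.shape)
```

The returned list and dummy_y could then be passed as the first two arguments of model.fit, provided the model was built with matching time_steps and max_label_length.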
Model testing
After training model, use base_model for prediction.
y_pred = base_model.predict(input_test)
Use ctc_decode to decode y_pred.
decode = K.get_value(K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)[0][0])
Here decode holds the indices of the predicted classes; map each index back to its actual class.
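As an illustration of what greedy decoding and the index-to-character step amount to, here is a small NumPy sketch of best-path decoding: take the argmax at each time step, merge repeated indices, then drop blanks. It is not the Keras implementation; it assumes the blank is the last class, uses the space = 0, 'a'..'z' = 1..26 scheme from the full code, and the function names are hypothetical.

```python
import numpy as np

def greedy_ctc_decode(y_pred, blank=None):
    """y_pred: (batch, time, n_classes). Returns one list of indices per sample."""
    if blank is None:
        blank = y_pred.shape[-1] - 1        # assumption: blank is the last class
    best = y_pred.argmax(axis=-1)           # most likely class at each time step
    results = []
    for path in best:
        decoded, prev = [], None
        for idx in path:
            # keep an index only when it differs from the previous step
            # and is not the blank
            if idx != prev and idx != blank:
                decoded.append(int(idx))
            prev = idx
        results.append(decoded)
    return results

def indices_to_text(indices):
    """Map 0 to ' ' and 1..26 to 'a'..'z'."""
    return ''.join(' ' if i == 0 else chr(i + ord('a') - 1) for i in indices)

# A hand-made path "h h - i i -" with 27 as the blank, as one-hot predictions
path = [8, 8, 27, 9, 9, 27]
y_pred = np.eye(28)[path][np.newaxis]       # shape (1, 6, 28)
decoded = greedy_ctc_decode(y_pred)
texts = [indices_to_text(d) for d in decoded]
print(texts)
```

A blank between two identical indices keeps them separate, which is how a double letter such as "ll" survives the merge step.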
A simple code implementation
from keras.models import Model
from keras.layers import GRU, Dense, Bidirectional, Input, Lambda
from python_speech_features import mfcc, delta
import keras.backend as K
import numpy as np
import scipy.io.wavfile as wav
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
def get_audio_feature(audio_path):
    fs, audio = wav.read(audio_path)
    print(fs)
    print(audio.shape)
    # extract MFCC features
    wav_feature = mfcc(audio, fs, nfft=int(0.025*fs), winfunc=np.hamming)
    # delta features (computed with N=1 and N=2)
    d_mfcc_feat1 = delta(wav_feature, 1)
    d_mfcc_feat2 = delta(wav_feature, 2)
    # stack along the feature axis: shape (frames, 3 * n_mfcc)
    feature = np.hstack((wav_feature, d_mfcc_feat1, d_mfcc_feat2))
    return feature
def get_audio_label(filepath):
    SPACE_TOKEN = '<space>'
    SPACE_INDEX = 0
    FIRST_INDEX = ord('a') - 1
    with open(filepath, 'r') as f:
        line = f.readlines()[0].strip()
    # replace each space with two spaces
    targets = line.replace(' ', '  ')
    # split on single spaces; the gap between two spaces becomes ''
    targets = targets.split(' ')
    # convert each '' to the space token and each word to a list of characters
    targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
    print(targets)
    # convert tokens to integers: space -> 0, 'a'..'z' -> 1..26
    targets = np.hstack([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
                         for x in targets])
    return targets
def decode_ctc(out):
    batch_size, decode_len = out.shape[0], out.shape[1]
    for i in range(batch_size):
        # 0 is the space, 1..26 map to 'a'..'z'; skip the -1 padding returned by ctc_decode
        pre = ''.join([' ' if x == 0 else chr(x + ord('a') - 1) for x in out[i] if x >= 0])
        print(pre)
feature = get_audio_feature('001.wav')
feature = feature[np.newaxis, :]
print(feature.shape)
labels = get_audio_label('label.txt')
labels = labels[np.newaxis, :]
print(labels.shape)
max_label_length = labels.shape[1]
il = np.ones(1) * feature.shape[1]
print(il.shape)
ll = np.ones(1) * max_label_length
print(ll.shape)
time_step, step_length = feature.shape[1], feature.shape[2]
n_classes = 26 + 1 + 1
input = Input(shape=(time_step, step_length))
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)
label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
    ([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
# model.summary()
fittedModel = model.fit([feature, labels, il, ll], np.ones(1), batch_size=1, epochs=100,
                        verbose=2)
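model.save('lstm_ctc.h5')
# base_model shares its layers with model, so it already holds the trained weights;
# reloading from the file shows how a saved model would be restored for prediction
base_model.load_weights('lstm_ctc.h5')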
y_pred = base_model.predict(feature)
decode = K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)
out = K.get_value(decode[0][0])
decode_ctc(out)
Parts of the code are adapted from
https://blog.csdn.net/yifen4234/article/details/80334516