Keras LSTM + CTC: Lessons Learned

№这般颜色

Article link: https://blog.csdn.net/chaowb/article/details/106550815

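Overview

These notes record the points I found worth watching out for while putting together a simple LSTM + CTC model in Keras.

The theory behind LSTM and CTC is not repeated here; the two articles below are good references.

人人都能看懂的LSTM (LSTM that anyone can understand)

一文读懂CRNN+CTC文字识别 (CRNN + CTC text recognition explained in one article)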

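Layer input and output shapes

Even with a rough grasp of the theory, it can be hard to know where to start when writing the code. A clear picture of the input and output shape of every layer makes the implementation much easier.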

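LSTM layer

# `input` is the Keras tensor defined later as Input(shape=(time_steps, step_length))
lstm = LSTM(units=40, return_sequences=True)(input)

Input shape: (batch_size, time_steps, step_length)

Output shape: (batch_size, time_steps, units)

For speech, time_steps can be the number of MFCC frames extracted from the audio, and step_length the number of MFCC features in one frame.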
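
As a rough check on what time_steps and step_length will be for a given recording, the sketch below estimates the MFCC frame count. It assumes the framing rule of python_speech_features as I understand it (25 ms window, 10 ms step, last frame zero-padded) and 13 cepstral coefficients plus two sets of deltas, as in the full listing at the end of this post. `mfcc_shape` is a hypothetical helper, not part of any library.

```python
import math

def mfcc_shape(num_samples, fs, winlen=0.025, winstep=0.01, numcep=13, n_deltas=2):
    """Estimate (time_steps, step_length) for MFCC + delta features."""
    frame_len = int(round(winlen * fs))    # samples per analysis window
    frame_step = int(round(winstep * fs))  # hop between consecutive windows
    if num_samples <= frame_len:
        time_steps = 1
    else:
        time_steps = 1 + math.ceil((num_samples - frame_len) / frame_step)
    step_length = numcep * (1 + n_deltas)  # MFCCs stacked with their deltas
    return time_steps, step_length

print(mfcc_shape(16000, 16000))  # one second of 16 kHz audio
```

The real feature matrix returned by the library is the authority; this is only a way to sanity-check the numbers before building the model.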

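Dense layer

dense = Dense(n_classes, activation='softmax')(lstm)

Input shape: (batch_size, time_steps, units)

Output shape: (batch_size, time_steps, n_classes)

Here n_classes is the number of output symbols (phonemes, or characters as in this post): 26 English letters + 1 space + 1 blank = 28.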
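
The 28 classes can be written out as an explicit index table. The sketch below follows the layout used in the full listing of this post (0 = space, 1..26 = 'a'..'z') and puts the blank at the last index, which is where the TensorFlow backend of ctc_batch_cost expects it. `SYMBOLS`, `encode` and `decode` are names made up for this illustration.

```python
import string

# Assumed index layout: 0 = space, 1..26 = 'a'..'z', last index = CTC blank.
SYMBOLS = [' '] + list(string.ascii_lowercase)
BLANK_INDEX = len(SYMBOLS)      # 27, the last class
n_classes = len(SYMBOLS) + 1    # 26 letters + space + blank = 28

def encode(text):
    """Map lowercase text to class indices."""
    return [SYMBOLS.index(ch) for ch in text.lower()]

def decode(indices):
    """Map class indices back to text, skipping blank and -1 padding."""
    return ''.join(SYMBOLS[i] for i in indices if 0 <= i < BLANK_INDEX)

print(n_classes, encode('ab c'), decode([8, 9, 27, -1]))
```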

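CTC loss

Keras ships a CTC loss as keras.backend.ctc_batch_cost. It is a backend function rather than a layer, so it has to be wrapped in a Lambda layer.

import keras.backend as K
from keras.layers import Lambda

def ctc_lambda_func(args):
    # The order of args follows the list passed to the Lambda layer below.
    y_pred, labels, input_length, label_length = args
    # ctc_batch_cost expects the labels first, then the predictions.
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
             ([dense, label_true, input_length, label_length])

input_length has shape (batch_size, 1); each element is the number of time steps (time_steps) of the corresponding training sample.

label_length has shape (batch_size, 1); each element is the actual length of that sample's label sequence. With the single training sample used in this post, that equals max_label_length.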
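
To make the roles of the four arguments concrete, here is a NumPy sketch of the CTC forward algorithm for a single sample. It is not how Keras implements ctc_batch_cost internally; it works in plain probabilities (so it would underflow on long sequences) and assumes the blank is the last class. `ctc_loss_single` is a hypothetical name.

```python
import numpy as np

def ctc_loss_single(y_pred, labels, input_length, label_length):
    """Negative log-likelihood of one label sequence under CTC.

    y_pred: (time_steps, n_classes) softmax outputs, blank = last class.
    labels: (max_label_length,) padded label indices.
    """
    blank = y_pred.shape[1] - 1
    probs = y_pred[:input_length]          # drop padded frames
    target = list(labels[:label_length])   # drop label padding
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for lab in target:
        ext += [int(lab), blank]
    T, S = len(probs), len(ext)
    alpha = np.zeros((T, S))               # alpha[t, s]: prob. of reaching ext[s] at frame t
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping over a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    likelihood = alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)
    return -np.log(likelihood)

# Two frames, classes {0: 'a', 1: blank}, uniform predictions, label "a" padded to length 3.
y_pred = np.full((2, 2), 0.5)
labels = np.array([0, 0, 0])
loss = ctc_loss_single(y_pred, labels, input_length=2, label_length=1)
print(loss)
```

In this toy case three alignments ("aa", "a-", "-a") collapse to "a", each with probability 0.25, so the loss is -log(0.75). Only the first label_length entries of labels are read, which is why label_length must hold the real length rather than the padded one.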

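Building the models

We need two models, base_model and model:

base_model outputs dense and is used for prediction once training is done.

model outputs loss and is used to train the parameters.

The two models share the same layers, so training model also trains base_model. The code below uses a GRU, which is used the same way as an LSTM.

input = Input(shape=(time_steps, step_length))
# Bidirectional with merge_mode='concat' yields 2 * units features per time step
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)

# Extra inputs that are only needed to compute the CTC loss during training
label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
             ([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)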
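
The shapes flowing through the two models described above can be traced with a small helper. This is plain Python bookkeeping, not a Keras call; `trace_shapes` is a hypothetical function, and it does not cover merge_mode=None, where Bidirectional returns two separate outputs.

```python
def trace_shapes(batch_size, time_steps, step_length, units, n_classes, merge_mode='concat'):
    """Return the output shape after each stage of the network."""
    # 'concat' stacks forward and backward states; 'sum', 'mul', 'ave' keep `units`.
    rnn_features = 2 * units if merge_mode == 'concat' else units
    return {
        'input': (batch_size, time_steps, step_length),
        'bidirectional_gru': (batch_size, time_steps, rnn_features),
        'dense_softmax': (batch_size, time_steps, n_classes),  # base_model output
        'ctc': (batch_size, 1),                                # model output
    }

for name, shape in trace_shapes(1, 99, 39, units=40, n_classes=28).items():
    print(name, shape)
```

Comparing this table with model.summary() is a quick way to spot a wrong axis before training.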

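Training

First, model.compile needs a custom loss function for the 'ctc' output:

model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')

A Keras loss function takes the true labels y_true and the model output y_pred. Because the output of model is already the CTC loss, the loss function simply returns y_pred.

# feature, labels, il and ll are the NumPy arrays built in the full listing below
fittedModel = model.fit([feature, labels, il, ll], np.ones(1), batch_size=1, epochs=100,
                        verbose=2)

The true labels already enter the computation as a model input, so the y passed to model.fit can hold arbitrary values; it only needs one entry per training sample.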
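
For more than one training sample, the four input arrays and the dummy y can be assembled as in the sketch below. It assumes every sample has already been brought to the same number of frames and that labels are padded with 0 at the end; `make_ctc_batch` is a hypothetical helper written for this post.

```python
import numpy as np

def make_ctc_batch(features, label_seqs, pad_value=0):
    """features: (n_samples, time_steps, step_length); label_seqs: list of index lists."""
    n_samples, time_steps = features.shape[0], features.shape[1]
    max_label_length = max(len(seq) for seq in label_seqs)
    # Pad every label sequence to the same length
    labels = np.full((n_samples, max_label_length), pad_value, dtype='float32')
    for i, seq in enumerate(label_seqs):
        labels[i, :len(seq)] = seq
    input_length = np.full((n_samples, 1), time_steps, dtype='int64')
    # Real (unpadded) length of each label sequence
    label_length = np.array([[len(seq)] for seq in label_seqs], dtype='int64')
    dummy_y = np.zeros((n_samples, 1))  # never read by the identity loss
    return [features, labels, input_length, label_length], dummy_y

features = np.zeros((2, 99, 39), dtype='float32')
inputs, dummy_y = make_ctc_batch(features, [[8, 9], [8, 5, 12, 12, 15]])
print([a.shape for a in inputs], dummy_y.shape)
```

The returned pair would then be passed as model.fit(inputs, dummy_y, ...). If samples have different real frame counts, input_length should hold those counts instead of the padded time_steps.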

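Testing

Once model is trained, use base_model for prediction:

y_pred = base_model.predict(input_test)

Then decode y_pred with ctc_decode:

decode = K.get_value(K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)[0][0])

decode holds class indices (padded with -1 when the decoded sequences in a batch differ in length); map each index back to its symbol to get the text.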
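
What greedy decoding does can be sketched in a few lines of NumPy: take the most likely class in every frame, collapse consecutive repeats, then drop the blanks. This is an illustration for a single sample, assuming the blank is the last class; it is not the Keras implementation, and `greedy_ctc_decode` and `indices_to_text` are hypothetical names.

```python
import numpy as np

def greedy_ctc_decode(y_pred, blank=None):
    """y_pred: (time_steps, n_classes) softmax outputs for one sample."""
    if blank is None:
        blank = y_pred.shape[1] - 1           # assume blank is the last class
    best_path = np.argmax(y_pred, axis=1)     # most likely class per frame
    decoded, previous = [], None
    for idx in best_path:
        if idx != previous and idx != blank:  # collapse repeats, then drop blanks
            decoded.append(int(idx))
        previous = idx
    return decoded

def indices_to_text(indices):
    # Same mapping as in this post: 0 = space, 1..26 = 'a'..'z'.
    return ''.join(' ' if i == 0 else chr(i + ord('a') - 1) for i in indices)

# Frame-wise best path: h h <blank> i i
path = [8, 8, 27, 9, 9]
y_pred = np.eye(28)[path]                     # one-hot "predictions" for the demo
print(indices_to_text(greedy_ctc_decode(y_pred)))
```

A blank between two identical letters is what lets CTC output doubled letters: the path l, blank, l decodes to "ll", while l, l decodes to a single "l".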

A simple implementation

from keras.models import Model
from keras.layers import GRU, Dense, Bidirectional, Input, Lambda
from python_speech_features import mfcc, delta
import keras.backend as K
import numpy as np
import scipy.io.wavfile as wav



def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)


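def get_audio_feature(audio_path):
    """Return MFCC features stacked with two sets of delta features."""
    fs, audio = wav.read(audio_path)
    print(fs)
    print(audio.shape)

    # Extract MFCC features (13 coefficients per frame by default)
    wav_feature = mfcc(audio, fs, nfft=int(0.025*fs), winfunc=np.hamming)
    # Delta features computed over N=1 and N=2 neighbouring frames
    d_mfcc_feat1 = delta(wav_feature, 1)
    d_mfcc_feat2 = delta(wav_feature, 2)
    # Stack along the feature axis: (num_frames, 3 * num_mfcc)
    feature = np.hstack((wav_feature, d_mfcc_feat1, d_mfcc_feat2))

    return feature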


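def get_audio_label(filepath):
    """Read the first line of a transcript file and map it to class indices."""
    SPACE_TOKEN = '<space>'
    SPACE_INDEX = 0
    FIRST_INDEX = ord('a') - 1  # so that 'a' -> 1, ..., 'z' -> 26

    with open(filepath, 'r') as f:
        line = f.readlines()[0].strip()
        # Double every space so that splitting on ' ' leaves one '' per original space
        targets = line.replace(' ', '  ')
        # Split on single spaces; each original space shows up as ''
        targets = targets.split(' ')
        # Replace '' with the space token and split words into characters
        targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
        print(targets)
        # Convert tokens to integer indices
        targets = np.hstack([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
                             for x in targets])
        return targets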

def decode_ctc(out):
    # out: (batch_size, decode_len) class indices; ctc_decode pads short rows with -1
    for row in out:
        pre = ''.join([' ' if x == 0 else chr(x + ord('a') - 1) for x in row if x >= 0])
        print(pre)

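feature = get_audio_feature('001.wav')
feature = feature[np.newaxis, :]         # add batch axis: (1, time_steps, step_length)
print(feature.shape)
labels = get_audio_label('label.txt')
labels = labels[np.newaxis, :]           # add batch axis: (1, max_label_length)
print(labels.shape)
max_label_length = labels.shape[1]
il = np.ones((1, 1)) * feature.shape[1]  # input_length, shape (batch_size, 1)
print(il.shape)
ll = np.ones((1, 1)) * max_label_length  # label_length, shape (batch_size, 1)
print(ll.shape)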

time_step, step_length = feature.shape[1], feature.shape[2]
n_classes = 26 + 1 + 1

input = Input(shape=(time_step, step_length))
gru = Bidirectional(GRU(units=40, return_sequences=True), merge_mode='concat')(input)
dense = Dense(n_classes, activation='softmax')(gru)
base_model = Model(inputs=input, outputs=dense)

label_true = Input(shape=[max_label_length])
input_length = Input(shape=[1])
label_length = Input(shape=[1])
loss = Lambda(ctc_lambda_func, output_shape=(1, ), name='ctc')\
             ([dense, label_true, input_length, label_length])
model = Model(inputs=[input, label_true, input_length, label_length], outputs=loss)

model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
# model.summary()

fittedModel = model.fit([feature, labels, il, ll], np.ones(1), batch_size=1, epochs=100,
                        verbose=2)
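model.save('lstm_ctc.h5')


# base_model shares its layers with model, so it is already trained at this point;
# reloading the weights only matters when predicting in a separate session.
base_model.load_weights('lstm_ctc.h5')
y_pred = base_model.predict(feature)
# Every sample uses all of its time steps as the input length
decode = K.ctc_decode(y_pred, input_length=np.ones(y_pred.shape[0]) * y_pred.shape[1], greedy=True)
out = K.get_value(decode[0][0])
decode_ctc(out)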

Parts of the code are adapted from:

https://blog.csdn.net/yifen4234/article/details/80334516