First off, I have to vent: it took me almost two weeks to download this dataset (ps: yes, you read that right, nearly two weeks; either the network dropped and the download failed, or it failed for some inexplicable reason, and at one point the site even seemed to go offline while I was mid-download!!!). Thanks here to my senior classmate, who asked another senior to finish the download with Xunlei. Yes, just last night, I finally got to see the complete thchs-30 dataset (haha).
OK, let's get started. First, a summary of the project:
Speech recognition: a Chinese speech recognition system based on deep learning
The system uses two models, an acoustic model and a language model, to turn the input audio signal into Chinese characters. The RNNs used here can alleviate the long-dependency problem (ps: this is copied from my notes, so some remarks may pop up out of nowhere, but they are still useful knowledge).
Acoustic and language modeling for deep-learning-based speech recognition
Acoustic models: CNN-CTC, GRU-CTC, CNN-RNN-CTC
Language models: transformer, CBHG, n-gram (the n-gram model is only used for a small test file)
About the language models:
The CBHG module is a building block of state-of-the-art seq2seq models, used in Google's machine translation and speech synthesis systems; it is implemented in cbhg.py.
transformer.py implements a language model based on the Transformer, a structure that has been shown to have stronger language modeling power than other architectures. It has been the hottest model in NLP over the past two years; this year's BERT uses the same structure.
The datasets are free Chinese corpora: thchs-30, aishell, primewords, and st-cmd, about 450 hours in total.
Acoustic models:
About the models:
GRU-CTC
A recurrent network can use the contextual information in speech to produce more accurate results; a GRU can selectively keep the long-term information it needs, and a bidirectional RNN makes full use of both past and future context.
The drawback of this approach is that recognition can only begin once the whole utterance has been spoken, and training is slower than with a CNN. The model is built with Python/Keras; every system in this post is built in Python.
The network structure is as follows:
def _model_init(self):
    self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
    x = Reshape((-1, 200))(self.inputs)
    x = dense(512, x)
    x = dense(512, x)
    x = bi_gru(512, x)
    x = bi_gru(512, x)
    x = bi_gru(512, x)
    x = dense(512, x)
    self.outputs = dense(self.vocab_size, x, activation='softmax')
    self.model = Model(inputs=self.inputs, outputs=self.outputs)
    self.model.summary()
DFCNN
Using GRUs for speech recognition runs into problems, for the following reasons:
On one hand, a bidirectional recurrent network is usually needed for good accuracy, which hurts real-time decoding.
On the other hand, as the network grows more complex, a GRU layer carries roughly six times the parameters of a fully connected layer with the same number of nodes (and a bidirectional one doubles that again), which makes training very slow.
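The factor can be checked with back-of-the-envelope arithmetic. This is just illustrative counting (bias terms ignored, helper names are mine): a GRU has three gates, each carrying an input weight matrix and a recurrent weight matrix, and a bidirectional layer doubles everything again.

```python
def dense_weights(n_in, n_out):
    # a fully connected layer: one weight matrix (biases ignored)
    return n_in * n_out

def gru_weights(n_in, n_units):
    # 3 gates (update, reset, candidate), each with an input weight
    # matrix and a recurrent weight matrix
    return 3 * (n_in * n_units + n_units * n_units)

n = 512
print(gru_weights(n, n) / dense_weights(n, n))      # 6.0 for one direction
print(2 * gru_weights(n, n) / dense_weights(n, n))  # 12.0 for a bidirectional layer
```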
Thanks to the parameter sharing of a CNN, the parameter count drops by several orders of magnitude; deep stacks of convolution and pooling layers still capture the contextual information of the speech signal, and a result is available in a short time, so real-time behavior is good.
The model is in cnn_ctc.py; in my experiments it gave the best results of all the networks and currently generalizes the best.
def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x
class Am():
    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        self.h5 = cnn_cell(128, self.h4, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h5)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)
        self.model.summary()
DFSMN
The feed-forward sequential memory network addresses both drawbacks of the bidirectional GRU, the parameter count and the poor real-time behavior. It uses a memory module that covers a few frames of past and future context, and it matches the recognition quality of bidirectional GRU-CTC. Alibaba's latest open-source system is based on a DFSMN acoustic model, though implemented on the Kaldi framework; we plan to implement a DFSMN+CTC structure in Python. The network can in fact be reproduced with a specially shaped CNN: set the convolution width to the memory size, the height to the feature dimension, and the channels to the hidden units, and one CNN layer then mimics an FSMN layer.
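The mapping to a convolution can be sketched in NumPy. This is a toy scalar-FSMN-style memory block, not the paper's exact formulation; `fsmn_memory` and the 3-tap filter are illustrative names and values of my own.

```python
import numpy as np

def fsmn_memory(h, filt):
    """FSMN-style memory block as a 1-D convolution along time.

    h:    (T, D) hidden activations for one utterance
    filt: (K,)   learned taps covering K frames of left/right context
    Returns (T, D): each frame plus a weighted sum of its neighbours,
    applied per feature dimension (depthwise, i.e. like a CNN whose
    width is the memory size and whose height is the feature dim).
    """
    T, D = h.shape
    K = len(filt)
    pad = K // 2
    padded = np.pad(h, ((pad, pad), (0, 0)))
    out = np.zeros_like(h)
    for t in range(T):
        # weighted sum over the K-frame context window
        out[t] = filt @ padded[t:t + K]
    return h + out  # residual connection back to the current frame

h = np.random.randn(10, 4)
y = fsmn_memory(h, np.array([0.1, 0.5, 0.1]))  # 3 frames of context
```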
1. Feature extraction
2. Data processing
Download the data, generate the lists of audio files and label files, process the labels, process the audio, and build the data generator (which determines batch_size and batch_num).
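The real generator also extracts features and pads each batch; this toy sketch (names are mine, not the project's) only shows the batch_size/batch_num bookkeeping the step above refers to.

```python
def batch_generator(inputs, labels, batch_size):
    # batch_num: how many full batches one epoch contains
    batch_num = len(inputs) // batch_size
    while True:  # loop forever, as Keras-style fit_generator expects
        for i in range(batch_num):
            s = i * batch_size
            yield inputs[s:s + batch_size], labels[s:s + batch_size]

gen = batch_generator(list(range(10)), list('abcdefghij'), batch_size=4)
x, y = next(gen)
print(x, y)  # [0, 1, 2, 3] ['a', 'b', 'c', 'd']
```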
This part includes the DFCNN model (an approach that recognizes speech from its time-frequency image with a deep convolutional network); there is a published paper on it.
3. Model building
Build the model components and assemble the cnn+dnn+ctc acoustic model.
4. Model training and inference
A few functions matter most in this model:
source_get(): collects the audio files.
read_label(): reads the pinyin label for each audio file and builds the pinyin-to-id mapping, i.e. the dictionary.
For the audio data, we only need the file name of each recording and then extract the required time-frequency image from it.
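A time-frequency image is just a framed, windowed magnitude spectrum. A rough sketch (my own names and values: 25 ms Hamming frames every 10 ms at 16 kHz, log magnitude, trimmed to the 200 bins the models above take as input width; the project's exact feature code may differ):

```python
import numpy as np

def compute_spectrogram_image(signal, frame_len=400, frame_shift=160):
    """Waveform -> log-magnitude spectrogram, one row per frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)
    return np.log(spec[:, :200] + 1e-10)        # keep 200 bins as the model input width

sig = np.random.randn(16000)            # 1 second of fake 16 kHz audio
img = compute_spectrogram_image(sig)    # shape (98, 200)
```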
Model building: the training input is the time-frequency image, and the target is the corresponding pinyin label sequence.
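At inference time, the acoustic model's frame-wise CTC output has to be collapsed into a pinyin sequence. A minimal greedy-decoding sketch (blank id 0 is my assumption; the project may use a different convention): merge repeated symbols, then drop blanks.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated symbols, then remove blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# blanks between two identical symbols keep them distinct
print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```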
Language model (1): a language model based on self-attention (pinyin to Chinese characters)
1. Data processing
2. Model building
Build the modeling components:
The layer norm layer
def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.
    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A small floating point number for preventing ZeroDivision errors.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A tensor with the same shape and dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
        outputs = gamma * normalized + beta
    return outputs
The embedding layer
def embedding(inputs,
              vocab_size,
              num_units,
              zero_pad=True,
              scale=True,
              scope="embedding",
              reuse=None):
    '''Embeds a given tensor.
    Args:
      inputs: A `Tensor` with type `int32` or `int64` containing the ids
        to be looked up in the `lookup table`.
      vocab_size: An int. Vocabulary size.
      num_units: An int. Number of embedding hidden units.
      zero_pad: A boolean. If True, all the values of the first row (id 0)
        should be constant zeros.
      scale: A boolean. If True, the outputs are multiplied by sqrt(num_units).
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A `Tensor` with one more rank than the input's. The last dimensionality
      should be `num_units`.
    For example,
    ```
    import tensorflow as tf
    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(outputs))
    >>
    [[[ 0.          0.        ]
      [ 0.09754146  0.67385566]
      [ 0.37864095 -0.35689294]]
     [[-1.01329422 -1.09939694]
      [ 0.7521342   0.38203377]
      [-0.04973143 -0.06210355]]]
    ```
    ```
    import tensorflow as tf
    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=False)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(outputs))
    >>
    [[[-0.19172323 -0.39159766]
      [-0.43212751 -0.66207761]
      [ 1.03452027 -0.26704335]]
     [[-0.11634696 -0.35983452]
      [ 0.50208133  0.53509563]
      [ 1.22204471 -0.96587461]]]
    ```
    '''
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.contrib.layers.xavier_initializer())
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)
        if scale:
            outputs = outputs * (num_units ** 0.5)
    return outputs
The multi-head attention layer
def multihead_attention(emb,
                        queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None):
    '''Applies multihead attention.
    Args:
      emb: A 3d tensor of the token embeddings, used to build the padding masks.
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fallback option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]
        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)     # (N, T_k, C)
        # Split into heads and concatenate along the batch axis
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)
        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
        # Key masking: forbid attention to padding positions
        key_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)
        paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)
        # Causality = future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)
            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)
        # Activation
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)
        # Query masking: zero out rows belonging to padding queries
        query_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))  # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
        outputs *= query_masks  # broadcasting. (h*N, T_q, T_k)
        # Dropout
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # (h*N, T_q, C/h)
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)
        # Residual connection
        outputs += queries
        # Normalize
        outputs = normalize(outputs)  # (N, T_q, C)
    return outputs
The feedforward layer
Two fully connected layers; 1-D convolutions are used here to simulate them and speed up the computation, though dense layers would also work.
def feedforward(inputs,
                num_units=[2048, 512],
                scope="feedforward",
                reuse=None):
    '''Point-wise feed-forward net.
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with the same shape and dtype as inputs.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        # Residual connection
        outputs += inputs
        # Normalize
        outputs = normalize(outputs)
    return outputs
label_smoothing
Label smoothing helps training: hard 0s become small values near zero, and hard 1s become values slightly below one.
def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing.
    Args:
      inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary.
      epsilon: Smoothing rate.
    For example,
    ```
    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1],
                                    [0, 1, 0],
                                    [1, 0, 0]],
                                   [[1, 0, 0],
                                    [1, 0, 0],
                                    [0, 1, 0]]], tf.float32)
    outputs = label_smoothing(inputs)
    with tf.Session() as sess:
        print(sess.run([outputs]))
    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
             [ 0.03333334,  0.93333334,  0.03333334],
             [ 0.93333334,  0.03333334,  0.03333334]],
            [[ 0.93333334,  0.03333334,  0.03333334],
             [ 0.93333334,  0.03333334,  0.03333334],
             [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]
    ```
    '''
    K = inputs.get_shape().as_list()[-1]  # number of channels
    return ((1 - epsilon) * inputs) + (epsilon / K)
Assembling the model
class Graph():
    def __init__(self, arg):
        tf.reset_default_graph()
        self.is_training = arg.is_training
        self.hidden_units = arg.hidden_units
        self.input_vocab_size = arg.input_vocab_size
        self.label_vocab_size = arg.label_vocab_size
        self.num_heads = arg.num_heads
        self.num_blocks = arg.num_blocks
        self.max_length = arg.max_length
        self.lr = arg.lr
        self.dropout_rate = arg.dropout_rate
        # input placeholders
        self.x = tf.placeholder(tf.int32, shape=(None, None))
        self.y = tf.placeholder(tf.int32, shape=(None, None))
        # embedding: token embedding plus learned positional embedding
        self.emb = embedding(self.x, vocab_size=self.input_vocab_size,
                             num_units=self.hidden_units, scale=True, scope="enc_embed")
        self.enc = self.emb + embedding(
            tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
            vocab_size=self.max_length, num_units=self.hidden_units,
            zero_pad=False, scale=False, scope="enc_pe")
        ## Dropout
        self.enc = tf.layers.dropout(self.enc,
                                     rate=self.dropout_rate,
                                     training=tf.convert_to_tensor(self.is_training))
        ## Blocks
        for i in range(self.num_blocks):
            with tf.variable_scope("num_blocks_{}".format(i)):
                ### Multihead attention
                self.enc = multihead_attention(emb=self.emb,
                                               queries=self.enc,
                                               keys=self.enc,
                                               num_units=self.hidden_units,
                                               num_heads=self.num_heads,
                                               dropout_rate=self.dropout_rate,
                                               is_training=self.is_training,
                                               causality=False)
                ### Feed forward (the output feeds the next block)
                self.enc = feedforward(self.enc, num_units=[4 * self.hidden_units, self.hidden_units])
        self.outputs = self.enc
        # Final linear projection
        self.logits = tf.layers.dense(self.outputs, self.label_vocab_size)
        self.preds = tf.to_int32(tf.argmax(self.logits, axis=-1))
        self.istarget = tf.to_float(tf.not_equal(self.y, 0))
        self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y)) * self.istarget) / tf.reduce_sum(self.istarget)
        tf.summary.scalar('acc', self.acc)
        if arg.is_training:
            # Loss with label smoothing
            self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=self.label_vocab_size))
            self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
            self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / tf.reduce_sum(self.istarget)
            # Training scheme
            self.global_step = tf.Variable(0, name='global_step', trainable=False)
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
            self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)
            # Summary
            tf.summary.scalar('mean_loss', self.mean_loss)
            self.merged = tf.summary.merge_all()
3. Training the model
Parameter settings
def create_hparams():
    params = tf.contrib.training.HParams(
        num_heads=8,
        num_blocks=6,
        # vocab
        input_vocab_size=50,
        label_vocab_size=50,
        # embedding size
        max_length=100,
        hidden_units=512,
        dropout_rate=0.2,
        lr=0.0003,
        is_training=True)
    return params

arg = create_hparams()
arg.input_vocab_size = len(pny2id)
arg.label_vocab_size = len(han2id)
Model training
import os

epochs = 25
batch_size = 4

g = Graph(arg)
saver = tf.train.Saver()
with tf.Session() as sess:
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    if os.path.exists('logs/model.meta'):
        saver.restore(sess, 'logs/model')
    writer = tf.summary.FileWriter('tensorboard/lm', tf.get_default_graph())
    for k in range(epochs):
        total_loss = 0
        batch_num = len(input_num) // batch_size
        batch = get_batch(input_num, label_num, batch_size)
        for i in range(batch_num):
            input_batch, label_batch = next(batch)
            feed = {g.x: input_batch, g.y: label_batch}
            cost, _ = sess.run([g.mean_loss, g.train_op], feed_dict=feed)
            total_loss += cost
            if (k * batch_num + i) % 10 == 0:
                rs = sess.run(merged, feed_dict=feed)
                writer.add_summary(rs, k * batch_num + i)
        if (k + 1) % 5 == 0:
            print('epochs', k + 1, ': average loss = ', total_loss / batch_num)
    saver.save(sess, 'logs/model')
    writer.close()
Model inference
arg.is_training = False

g = Graph(arg)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'logs/model')
    while True:
        line = input('输入测试拼音: ')
        if line == 'exit':
            break
        line = line.strip('\n').split(' ')
        x = np.array([pny2id.index(pny) for pny in line])
        x = x.reshape(1, -1)
        preds = sess.run(g.preds, {g.x: x})
        got = ''.join(han2id[idx] for idx in preds[0])
        print(got)
An aside: applications of the n-gram language model
(1) Given a corpus, an n-gram model can be used to predict or evaluate how plausible a sentence is.
(2) It can measure the degree of difference between two strings, a common technique in fuzzy matching.
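Point (1) can be made concrete with a tiny add-k smoothed bigram model. The corpus and the `score` helper below are toy examples of mine, purely for illustration:

```python
import math
from collections import Counter

# toy corpus, purely for illustration
corpus = ["我 爱 北京", "我 爱 学习", "北京 天气 好"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ['<s>'] + sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence, k=1.0):
    """Add-k smoothed bigram log-probability: higher = more plausible."""
    words = ['<s>'] + sentence.split()
    V = len(unigrams)
    return sum(math.log((bigrams[(a, b)] + k) / (unigrams[a] + k * V))
               for a, b in zip(words, words[1:]))

print(score("我 爱 北京") > score("北京 爱 我"))  # True: the seen word order scores higher
```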
--------------------------------------end of aside------------------------------------
Language model (2): a pinyin-to-character model based on the CBHG module
1. Data processing
2. Model building
Build the components
The embedding layer
Ids alone cannot represent text well, which is where word vectors come in. I won't explain word embeddings here, there is plenty of material online. The model uses an embedding layer and keeps learning a vector for every character in the corpus during training; the code is as follows:
import tensorflow as tf

def embed(inputs, vocab_size, num_units, zero_pad=True, scope="embedding", reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.01))
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
    return tf.nn.embedding_lookup(lookup_table, inputs)
Encoder pre-net module
After the embedding layer comes an encoder pre-net module with two fully connected hidden layers:
the first layer has as many hidden units as the input,
and the second has half as many. Both use ReLU activations and keep a dropout of 0.5 to improve generalization.
def prenet(inputs, num_units=None, is_training=True, scope="prenet", reuse=None, dropout_rate=0.2):
    '''Prenet for Encoder and Decoder.
    Args:
      inputs: A 2D or 3D tensor.
      num_units: A list of two integers, or None.
      is_training: A python boolean.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3D tensor of shape [N, T, num_units[1]].
    '''
    with tf.variable_scope(scope, reuse=reuse):
        outputs = tf.layers.dense(inputs, units=num_units[0], activation=tf.nn.relu, name="dense1")
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout1")
        outputs = tf.layers.dense(outputs, units=num_units[1], activation=tf.nn.relu, name="dense2")
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout2")
    return outputs  # (N, ..., num_units[1])
Assembling the model
3. Model training and inference
Parameter settings
def create_hparams():
    params = tf.contrib.training.HParams(
        # vocab
        pny_size=50,
        han_size=50,
        # embedding size
        embed_size=300,
        num_highwaynet_blocks=4,
        encoder_num_banks=8,
        lr=0.001,
        is_training=True)
    return params

arg = create_hparams()
arg.pny_size = len(pny2id)
arg.han_size = len(han2id)
Model training
import os

epochs = 25
batch_size = 4

g = Graph(arg)
saver = tf.train.Saver()
with tf.Session() as sess:
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    if os.path.exists('logs/model.meta'):
        saver.restore(sess, 'logs/model')
    writer = tf.summary.FileWriter('tensorboard/lm', tf.get_default_graph())
    for k in range(epochs):
        total_loss = 0
        batch_num = len(input_num) // batch_size
        batch = get_batch(input_num, label_num, batch_size)
        for i in range(batch_num):
            input_batch, label_batch = next(batch)
            feed = {g.x: input_batch, g.y: label_batch}
            cost, _ = sess.run([g.mean_loss, g.train_op], feed_dict=feed)
            total_loss += cost
            if (k * batch_num + i) % 10 == 0:
                rs = sess.run(merged, feed_dict=feed)
                writer.add_summary(rs, k * batch_num + i)
        if (k + 1) % 5 == 0:
            print('epochs', k + 1, ': average loss = ', total_loss / batch_num)
    saver.save(sess, 'logs/model')
    writer.close()
Model inference
arg.is_training = False

g = Graph(arg)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'logs/model')
    while True:
        line = input('输入测试拼音: ')
        if line == 'exit':
            break
        line = line.strip('\n').split(' ')
        x = np.array([pny2id.index(pny) for pny in line])
        x = x.reshape(1, -1)
        preds = sess.run(g.preds, {g.x: x})
        got = ''.join(han2id[idx] for idx in preds[0])
        print(got)
I ran into many problems while debugging this project, and I came to appreciate one truth: for configuration-related problems, the reinstall hammer works best. Uninstall all the previous configuration and install everything again from scratch; that usually solves it. That was exactly my experience: the same error showed up countless times, and none of the fixes I found online applied.
All four datasets can be used: thchs-30, aishell, primewords, and st-cmd.
The data labels are organized under the data path; primewords and st-cmd are currently not split into training and test sets.
To use all the datasets, just unpack them into a common directory and set datapath in utils.py.
The project ships with a mini speech recognition system already trained: unpack the dataset into data and run test.py.
To build your own model, delete the existing one and retrain with your own parameters.
The data-related parameters are in utils.py:
data_type: train, test, or dev
data_path: path to the unpacked data
thchs30, aishell, prime, stcmd: whether to use each dataset
batch_size: the batch size
data_length: I set this to a small value for quick experiments; set it to None for normal use
shuffle: whether to shuffle the training order; set True for normal training
The code settings are as follows:
def data_hparams():
    params = tf.contrib.training.HParams(
        # vocab
        data_type='train',
        data_path='data/',
        thchs30=True,
        aishell=True,
        prime=False,
        stcmd=False,
        batch_size=1,
        data_length=None,
        shuffle=False)
    return params
Use train.py to train the models.
The acoustic model can be cnn-ctc or gru-ctc; just change which module is imported:
from model_speech.cnn_ctc import Am, am_hparams
from model_speech.gru_ctc import Am, am_hparams
The language model can be transformer or cbhg:
from model_language.transformer import Lm, lm_hparams
from model_language.cbhg import Lm, lm_hparams
Finally, some screenshots of successful runs:
Acoustic model
The project's generalization is still not great; I will keep studying and hope to improve it. Fighting!!!