语音识别项目(数据集用的是thchs-30)

首选,我必须吐槽一下,这个数据集我下了快两个星期(ps:没错,你没有看错,我真的下了快两个星期,中途要么是网络断了,然后下载失败,要么是不知道是啥莫名其妙的原因导致下载失败,对了,中途那个网站好像还关闭过,当时我正在下载!!!)。在这里感谢一下师姐,她帮我请另一个师兄用迅雷最后下载好了,对,就是昨晚,我终于见到了完整的thchs-30数据集(哈哈哈)。

OK,正式开始,首先我把这个项目总结一下:
语音识别–基于深度学习的中文语音识别系统
这里采用了两个模型,一个是声学模型,一个是语言模型,将输入的音频信号识别为汉字。其中用的RNN可以长依赖问题(ps:嗯…,这里是从我的笔记本上摘抄下来,所以可能有些话会突然崩出来,但是也是知识)。

基于深度学习的语音识别中的声学模型和语言模型建模
声学模型:CNN-CTC,GRU-CTC,CNN-RNN-CTC
语言模型:transformer,CBHG,n-gram(这个模型仅仅是用来测试一个小文件)
介绍一下语言模型:
CBHG模块是state-of-art的seq2seq模型,用在Google的机器翻译和语音合成中,该模型放在cbhg.py中
基于transformer结构的语言模型transformer.py,该模型已经被证明有强于其他框架的语言表达能力。该模型是自然语言处理这两年最火的模型,今年的bert就是使用的该结构。
数据集用的是中文免费数据集:thchs-30,aishell,prime words,st-cmd,总计450小时。

声学模型
介绍一下模型:
GRU-CTC
利用循环神经网络可以利用语音上下文相关的信息,得到更加准确地信息,而GUR又能选择性的保留需要的长时信息,使用双向rnn又能够充分的利用上下文信号。
但该方法缺点是一句话说完之后才能进行识别,且训练相对cnn较慢。该模型使用python/keras进行搭建,本文系统都使用python搭建。
网络结构如下:

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        x = Reshape((-1, 200))(self.inputs)
        x = dense(512, x)
        x = dense(512, x)
        x = bi_gru(512, x)
        x = bi_gru(512, x)
        x = bi_gru(512, x)
        x = dense(512, x)
        self.outputs = dense(self.vocab_size, x, activation='softmax')
        self.model = Model(inputs=self.inputs, outputs=self.outputs)
        self.model.summary()

DFCNN
使用GRU作为语音识别的时候我们会遇到问题,原因如下:
一方面是我们常常使用双向循环神经网络才能取得更好的识别效果,这样会影响解码实时性。
另一方面随着网络结构复杂性增加,双向GRU的参数是相同节点数全连接层的6倍,这样会导致训练速度非常缓慢。
利用CNN参数共享机制,可以将参数数量下降几个数量级别,且深层次的卷积和池化层能够充分考虑语音信号的上下文信息,且可以在较短的时间内就可以得到识别结果,具有较好的实时性。
该模型在cnn_ctc.py中,实验中该模型是所有网络中结果最好的模型,目前能够取得较好的泛化能力。

def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x

class Am():
    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        self.h5 = cnn_cell(128, self.h4, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h5)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)
        self.model.summary()

DFSMN
前馈记忆神经网络解决了双向GRU的参数过多和实时性较差的缺点,它利用一个记忆模块,包含了上下几帧信息,能够得到不输于双向GRU-CTC的识别结果,阿里最新的开源系统就是基于DFSMN的声学模型,只不过在kaldi的框架上实现的。我们将考虑使用DFSMN+CTC的结构在python上实现。该网络实质上是用一个特殊的CNN就可以取得相同的效果,我们将CNN的宽设置为memory size,将高度设置为feature dim,将channel设置为hidden units,这样一个cnn的层就可以模仿fsmn的实现了。

一:特征提取
二:数据处理
下载数据,生成音频文件和标签文件列表,label数据处理,音频数据处理,数据生成器(确定batch_size和batch_num)
其中有DFCNN模型(一种使用深度卷积神经网络来对时频图进行识别的方法),有论文。
三:模型搭建
构建模型组件,搭建cnn+dnn+ctc的声学模型
四:模型训练,模型推断
在这个模型中有一些比较重要的函数
source_get(),这个函数是用来获取音频文件
read_label() 读取音频文件对应的拼音label,为label建立拼音到id的映射即词典

音频数据处理,只需要获得对应的音频文件名,然后提取所需时频图
模型搭建:训练输入为时频图,标签为对应的拼音标签。

语言模型(1):基于自注意力机制的语言模型(拼音到汉字)
一:数据处理
二:模型搭建
构造建模组件:
layer norm层

def normalize(inputs, 
              epsilon = 1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.

    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta

    return outputs

embedding层

def embedding(inputs, 
              vocab_size, 
              num_units, 
              zero_pad=True, 
              scale=True,
              scope="embedding", 
              reuse=None):
    '''Embeds a given tensor.
    Args:
      inputs: A `Tensor` with type `int32` or `int64` containing the ids
         to be looked up in `lookup table`.
      vocab_size: An int. Vocabulary size.
      num_units: An int. Number of embedding hidden units.
      zero_pad: A boolean. If True, all the values of the fist row (id 0)
        should be constant zeros.
      scale: A boolean. If True. the outputs is multiplied by sqrt num_units.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A `Tensor` with one more rank than inputs's. The last dimensionality
        should be `num_units`.

    For example,

    ```
    import tensorflow as tf

    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print sess.run(outputs)
    >>
    [[[ 0.          0.        ]
      [ 0.09754146  0.67385566]
      [ 0.37864095 -0.35689294]]
     [[-1.01329422 -1.09939694]
      [ 0.7521342   0.38203377]
      [-0.04973143 -0.06210355]]]
    ```

    ```
    import tensorflow as tf

    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=False)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print sess.run(outputs)
    >>
    [[[-0.19172323 -0.39159766]
      [-0.43212751 -0.66207761]
      [ 1.03452027 -0.26704335]]
     [[-0.11634696 -0.35983452]
      [ 0.50208133  0.53509563]
      [ 1.22204471 -0.96587461]]]
    ```
    '''
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.contrib.layers.xavier_initializer())
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)

        if scale:
            outputs = outputs * (num_units ** 0.5) 

    return outputs

multihead层

def multihead_attention(emb,
                        queries, 
                        keys, 
                        num_units=None, 
                        num_heads=8, 
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention", 
                        reuse=None):
    '''Applies multihead attention.
    
    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked. 
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns
      A 3d tensor with shape of (N, T_q, C)  
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list[-1]
        
        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k)
        
        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
        
        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1))) # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
        
        paddings = tf.ones_like(outputs)*(-2**32+1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
  
        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k)
            tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense() # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k)
   
            paddings = tf.ones_like(masks)*(-2**32+1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k)
  
        # Activation
        outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k)
         
        # Query Masking
        query_masks = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1))) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        outputs *= query_masks # broadcasting. (N, T_q, C)
          
        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
               
        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)
        
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)
              
        # Residual connection
        outputs += queries
              
        # Normalize
        outputs = normalize(outputs) # (N, T_q, C)
 
    return outputs

feedforward
两层全连接,用卷积模拟加速运算,也可以使用dense层。

def feedforward(inputs, 
                num_units=[2048, 512],
                scope="multihead_attention", 
                reuse=None):
    '''Point-wise feed forward net.
    
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Residual connection
        outputs += inputs
        
        # Normalize
        outputs = normalize(outputs)
    
    return outputs

label_smoothing
对于训练有好处,将0变为接近零的小数,1变为接近1的数

Args:
  inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary.
  epsilon: Smoothing rate.

For example,

```
import tensorflow as tf
inputs = tf.convert_to_tensor([[[0, 0, 1], 
   [0, 1, 0],
   [1, 0, 0]],
  [[1, 0, 0],
   [1, 0, 0],
   [0, 1, 0]]], tf.float32)
   
outputs = label_smoothing(inputs)

with tf.Session() as sess:
    print(sess.run([outputs]))

>>
[array([[[ 0.03333334,  0.03333334,  0.93333334],
    [ 0.03333334,  0.93333334,  0.03333334],
    [ 0.93333334,  0.03333334,  0.03333334]],
   [[ 0.93333334,  0.03333334,  0.03333334],
    [ 0.93333334,  0.03333334,  0.03333334],
    [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]   
```
'''
K = inputs.get_shape().as_list()[-1] # number of channels
return ((1-epsilon) * inputs) + (epsilon / K)

搭建模型

class Graph():
    def __init__(self, is_training=True):
        tf.reset_default_graph()
        self.is_training = arg.is_training
        self.hidden_units = arg.hidden_units
        self.input_vocab_size = arg.input_vocab_size
        self.label_vocab_size = arg.label_vocab_size
        self.num_heads = arg.num_heads
        self.num_blocks = arg.num_blocks
        self.max_length = arg.max_length
        self.lr = arg.lr
        self.dropout_rate = arg.dropout_rate
        
        # input
        self.x = tf.placeholder(tf.int32, shape=(None, None))
        self.y = tf.placeholder(tf.int32, shape=(None, None))
        # embedding
        self.emb = embedding(self.x, vocab_size=self.input_vocab_size, num_units=self.hidden_units, scale=True, scope="enc_embed")
        self.enc = self.emb + embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
                                      vocab_size=self.max_length,num_units=self.hidden_units, zero_pad=False, scale=False,scope="enc_pe")
        ## Dropout
        self.enc = tf.layers.dropout(self.enc, 
                                    rate=self.dropout_rate, 
                                    training=tf.convert_to_tensor(self.is_training))
                
        ## Blocks
        for i in range(self.num_blocks):
            with tf.variable_scope("num_blocks_{}".format(i)):
                ### Multihead Attention
                self.enc = multihead_attention(emb = self.emb,
                                               queries=self.enc, 
                                                keys=self.enc, 
                                                num_units=self.hidden_units, 
                                                num_heads=self.num_heads, 
                                                dropout_rate=self.dropout_rate,
                                                is_training=self.is_training,
                                                causality=False)
                        
        ### Feed Forward
        self.outputs = feedforward(self.enc, num_units=[4*self.hidden_units, self.hidden_units])
            
                
        # Final linear projection
        self.logits = tf.layers.dense(self.outputs, self.label_vocab_size)
        self.preds = tf.to_int32(tf.argmax(self.logits, axis=-1))
        self.istarget = tf.to_float(tf.not_equal(self.y, 0))
        self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y))*self.istarget)/ (tf.reduce_sum(self.istarget))
        tf.summary.scalar('acc', self.acc)
                
        if is_training:  
            # Loss
            self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=self.label_vocab_size))
            self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
            self.mean_loss = tf.reduce_sum(self.loss*self.istarget) / (tf.reduce_sum(self.istarget))
               
            # Training Scheme
            self.global_step = tf.Variable(0, name='global_step', trainable=False)
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
            self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)
                   
            # Summary 
            tf.summary.scalar('mean_loss', self.mean_loss)
            self.merged = tf.summary.merge_all()

三:训练模型
参数设定

def create_hparams():
    params = tf.contrib.training.HParams(
        num_heads = 8,
        num_blocks = 6,
        # vocab
        input_vocab_size = 50,
        label_vocab_size = 50,
        # embedding size
        max_length = 100,
        hidden_units = 512,
        dropout_rate = 0.2,
        lr = 0.0003,
        is_training = True)
    return params

        
arg = create_hparams()
arg.input_vocab_size = len(pny2id)
arg.label_vocab_size = len(han2id)

模型训练

import os

epochs = 25
batch_size = 4

g = Graph(arg)

saver =tf.train.Saver()
with tf.Session() as sess:
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    if os.path.exists('logs/model.meta'):
        saver.restore(sess, 'logs/model')
    writer = tf.summary.FileWriter('tensorboard/lm', tf.get_default_graph())
    for k in range(epochs):
        total_loss = 0
        batch_num = len(input_num) // batch_size
        batch = get_batch(input_num, label_num, batch_size)
        for i in range(batch_num):
            input_batch, label_batch = next(batch)
            feed = {g.x: input_batch, g.y: label_batch}
            cost,_ = sess.run([g.mean_loss,g.train_op], feed_dict=feed)
            total_loss += cost
            if (k * batch_num + i) % 10 == 0:
                rs=sess.run(merged, feed_dict=feed)
                writer.add_summary(rs, k * batch_num + i)
        if (k+1) % 5 == 0:
            print('epochs', k+1, ': average loss = ', total_loss/batch_num)
    saver.save(sess, 'logs/model')
    writer.close()

模型推断

arg.is_training = False

g = Graph(arg)

saver =tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, 'logs/model')
    while True:
        line = input('输入测试拼音: ')
        if line == 'exit': break
        line = line.strip('\n').split(' ')
        x = np.array([pny2id.index(pny) for pny in line])
        x = x.reshape(1, -1)
        preds = sess.run(g.preds, {g.x: x})
        got = ''.join(han2id[idx] for idx in preds[0])
        print(got)

小插曲:语言模型n-gram的应用
(1)人们基于一定的语料库,可以利用这个来预计或者评估一个句子是否合理
(2)评估两个字符串之间的差异程度。这是模糊匹配中常用的一种手段。
--------------------------------------小插曲结束------------------------------------
语言模型(2):基于CBHG模块的拼音到汉字模型
一:数据处理
二:模型搭建
构建组件
embedding层
光有对应的id,没法很好的表征文本信息,这里就涉及到构造词向量,关于词向量不在说明,网上有很多资料,模型中使用词嵌入层,通过训练不断的学习到语料库中的每个字的词向量,代码如下:

import tensorflow as tf

def embed(inputs, vocab_size, num_units, zero_pad=True, scope="embedding", reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],
                                       initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.01))
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
    return tf.nn.embedding_lookup(lookup_table, inputs)

Encoder pre-net module
embeding layer之后是一个encoder pre-net模块,它有两个隐藏层,层与层之间的连接均是全连接;
第一层的隐藏单元数目与输入单元数目一致,
第二层的隐藏单元数目为第一层的一半;两个隐藏层采用的激活函数均为ReLu,并保持0.5的dropout来提高泛化能力

def prenet(inputs, num_units=None, is_training=True, scope="prenet", reuse=None, dropout_rate=0.2):
    '''Prenet for Encoder and Decoder1.
    Args:
      inputs: A 2D or 3D tensor.
      num_units: A list of two integers. or None.
      is_training: A python boolean.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3D tensor of shape [N, T, num_units/2].
    '''

    with tf.variable_scope(scope, reuse=reuse):
        outputs = tf.layers.dense(inputs, units=num_units[0], activation=tf.nn.relu, name="dense1")
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout1")
        outputs = tf.layers.dense(outputs, units=num_units[1], activation=tf.nn.relu, name="dense2")
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout2")
    return outputs  # (N, ..., num_units[1])

搭建模型

三:模型训练及推断
参数设定

def create_hparams():
    params = tf.contrib.training.HParams(
        
        # vocab
        pny_size = 50,
        han_size = 50,
        # embedding size
        embed_size = 300,
        num_highwaynet_blocks = 4,
        encoder_num_banks = 8,
        lr = 0.001,
        is_training = True)
    return params

arg = create_hparams()
arg.pny_size = len(pny2id)
arg.han_size = len(han2id)

模型训练

import os

epochs = 25
batch_size = 4

g = Graph(arg)

saver =tf.train.Saver()
with tf.Session() as sess:
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    if os.path.exists('logs/model.meta'):
        saver.restore(sess, 'logs/model')
    writer = tf.summary.FileWriter('tensorboard/lm', tf.get_default_graph())
    for k in range(epochs):
        total_loss = 0
        batch_num = len(input_num) // batch_size
        batch = get_batch(input_num, label_num, batch_size)
        for i in range(batch_num):
            input_batch, label_batch = next(batch)
            feed = {g.x: input_batch, g.y: label_batch}
            cost,_ = sess.run([g.mean_loss,g.train_op], feed_dict=feed)
            total_loss += cost
            if (k * batch_num + i) % 10 == 0:
                rs=sess.run(merged, feed_dict=feed)
                writer.add_summary(rs, k * batch_num + i)
        if (k+1) % 5 == 0:
            print('epochs', k+1, ': average loss = ', total_loss/batch_num)
    saver.save(sess, 'logs/model')
    writer.close()

模型推断

arg.is_training = False

g = Graph(arg)

saver =tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, 'logs/model')
    while True:
        line = input('输入测试拼音: ')
        if line == 'exit': break
        line = line.strip('\n').split(' ')
        x = np.array([pny2id.index(pny) for pny in line])
        x = x.reshape(1, -1)
        preds = sess.run(g.preds, {g.x: x})
        got = ''.join(han2id[idx] for idx in preds[0])
        print(got)

在调试这个项目的时候遇到了很多问题,深刻理解到一个道理,就是与配置相关的问题还是重装大法好,将之前的所有配置全部卸载,然后再重新安装,这样一般就会解决这个问题了。在调试的过程中就是这样,无数次出现同一个问题,在网上找的方法都不适用。

可将这个四个数据集用进去:
thchs-30、aishell、primewords、st-cmd四个数据集
数据标签整理在data路径下,其中primewords、st-cmd目前未区分训练集测试集。
若需要使用所有数据集,只需解压到统一路径下,然后设置utils.py中datapath的路径即可。

项目现已训练了一个迷你的语音识别系统,将数据集解压至data,运行test.py
自己建立模型则需要删除现有模型,重新配置参数训练。
与数据相关参数在utils.py中:
data_type: train, test, dev
data_path: 对应解压数据的路径
thchs30, aishell, prime, stcmd: 是否使用该数据集
batch_size: batch_size
data_length: 我自己做实验时写小一些看效果用的,正常使用设为None即可
shuffle:正常训练设为True,是否打乱训练顺序
代码设置如下:

def data_hparams():
    params = tf.contrib.training.HParams(
        # vocab
        data_type = 'train',
        data_path = 'data/',
        thchs30 = True,
        aishell = True,
        prime = False,
        stcmd = False,
        batch_size = 1,
        data_length = None,
        shuffle = False)
      return params

使用train.py文件进行模型的训练。
声学模型可选cnn-ctc、gru-ctc,只需修改导入路径即可:
from model_speech.cnn_ctc import Am, am_hparams

from model_speech.gru_ctc import Am, am_hparams
语言模型可选transformer和cbhg:
from model_language.transformer import Lm, lm_hparams

from model_language.cbhg import Lm, lm_hparams
最后贴上一些成功的效果图:
声学模型
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
此项目目前的泛化能力还不是很好,后续我会通过继续学习,希望能够改进,Fighting!!!

  • 7
    点赞
  • 66
    收藏
    觉得还不错? 一键收藏
  • 21
    评论
评论 21
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值