TensorFlow Beginner Tutorial (30): Speech Recognition (Part 2)

远洋之帆

------韦访 20181126

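6. Extracting MFCC features from audio data

The previous part spent a lot of space explaining MFCC features; now let's extract them. One of Python's strengths is its mature ecosystem of tools for all kinds of tasks, so we don't need to write this from scratch: we can use the python_speech_features package.

First, install python_speech_features by running the following command:

sudo pip install python_speech_features

We convert the speech data into MFCCs with 13 or 26 cepstral coefficients and use them as the model input. After conversion, the data is stored in a matrix with one row per time step and one column per cepstral coefficient, for example (955, 26).

Because sounds are not produced in isolation and do not map one-to-one to characters, we can capture coarticulation effects (one sound influencing the pronunciation of another) by training the network on overlapping windows that cover the audio before and after the current time index.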

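Before going further, a quick aside on two speech-processing concepts: framing and windowing.

Framing:

As shown in the figure above, the Fourier transform requires a stationary input signal. A speech signal is not stationary at the macro level, but at the micro level it is short-time stationary (within 10-30 ms it can be treated as approximately unchanged). The signal is therefore split into short segments for processing, and each segment is called a frame.

Windowing:

After a frame is extracted and before the Fourier transform is applied, it must first be windowed. Windowing simply means multiplying the frame by a window function, as shown in the figure below.

The purpose of windowing is to make the amplitude of a frame taper gradually to 0 at both ends, which improves the resolution of the transform result. Windowing has a cost, however: both ends of the frame are attenuated. To compensate, adjacent frames are made to overlap rather than being cut back to back, as shown in the figure below.

As shown in the figure above, adjacent frames overlap. The frame length is 25 ms, and the time difference between the start positions of two consecutive frames is called the frame shift, typically 10 ms or half the frame length.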
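The framing and windowing steps described above can be sketched in a few lines of plain Python. This is only an illustration, not what python_speech_features does internally: the function name frame_and_window, the choice of a Hamming window, and the decision to drop a trailing incomplete frame are all assumptions made for this example.

```python
import math

def frame_and_window(signal, sample_rate, frame_len_s=0.025, frame_shift_s=0.010):
    """Split a signal into overlapping frames and apply a Hamming window.

    Hypothetical helper for illustration; a trailing partial frame is dropped.
    """
    frame_len = int(round(frame_len_s * sample_rate))      # samples per frame
    frame_shift = int(round(frame_shift_s * sample_rate))  # samples between frame starts
    if len(signal) < frame_len:
        return []
    # Hamming window: tapers towards (but not exactly to) zero at both ends
    denom = max(frame_len - 1, 1)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / denom)
              for n in range(frame_len)]
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = []
    for i in range(num_frames):
        start = i * frame_shift
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# One second of a constant signal sampled at 16 kHz
signal = [1.0] * 16000
frames = frame_and_window(signal, 16000)
print(len(frames), len(frames[0]))
```

With a 25 ms frame and a 10 ms shift at 16 kHz, each frame has 400 samples and consecutive frames start 160 samples apart, so neighbouring frames share 240 samples.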

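For the RNN, we use the 9 time slices before and the 9 time slices after the current one, so each window contains 19 time slices in total. With 26 cepstral coefficients, each time slice therefore carries 19 × 26 = 494 MFCC features. The figure below shows an example window with 13 cepstral coefficients.

When fewer than 9 time slices are available before or after the current one (for the 2nd time slice, say), the missing slices are padded with zeros to make up 9. Finally, the data is standardized by subtracting the mean and dividing by the standard deviation. Here is the code: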


# Imports used by the code in this part (wav is assumed to be scipy.io.wavfile)
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Convert an audio file to MFCC features.
# Parameters: audio_filename: path to the audio file; numcep: number of mel cepstral coefficients;
#             numcontext: number of context time steps to include on each side of every time step
def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    # Load the audio file
    fs, audio = wav.read(audio_filename)
    # Compute the MFCC coefficients
    orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)
    # Print the shape of the MFCC matrix, e.g. (955, 26):
    # 955 is the number of time steps and 26 is the number of MFCC features per time step.
    # The number of time steps varies from file to file, but the number of features per step is always the same.
    print(np.shape(orig_inputs))

    # We train a bidirectional RNN, whose output contains both the forward and
    # the backward results, so in effect each time step is doubled. To keep the
    # total number of time steps unchanged, orig_inputs = orig_inputs[::2] keeps
    # every other row. The skipped steps can be replaced by the output of the
    # backward RNN later on, which preserves the total sequence length.
    orig_inputs = orig_inputs[::2]  # e.g. (478, 26)
    print(np.shape(orig_inputs))

    # The notes below assume numcontext=9, the value used throughout this tutorial.
    # train_inputs holds the data to return. Each time step combines the 9 previous
    # and the 9 following steps with the current one, i.e. 19*26=494 MFCC features.
    train_inputs = np.zeros((orig_inputs.shape[0], numcep + 2 * numcep * numcontext), dtype=np.float32)
    print(np.shape(train_inputs))  # e.g. (478, 494)

    # Zero vector used to pad the context before the start and after the end
    empty_mfcc = np.zeros(numcep)

    # Fill train_inputs with past and future context.
    # time_slices holds the time-step indices
    time_slices = range(train_inputs.shape[0])

    # context_past_min and context_future_max determine which steps need zero padding
    context_past_min = time_slices[0] + numcontext
    context_future_max = time_slices[-1] - numcontext

    # Iterate over all time steps
    for time_slice in time_slices:
        # Past context: pad with zeros where fewer than 9 earlier steps exist;
        # otherwise take the 9 previous steps directly
        need_empty_past = max(0, (context_past_min - time_slice))
        empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
        data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
        assert(len(empty_source_past) + len(data_source_past) == numcontext)

        # Future context: pad with zeros where fewer than 9 later steps exist;
        # otherwise take the 9 following steps directly
        need_empty_future = max(0, (time_slice - context_future_max))
        empty_source_future = list(empty_mfcc for empty_slots in range(need_empty_future))
        data_source_future = orig_inputs[time_slice + 1:time_slice + numcontext + 1]
        assert(len(empty_source_future) + len(data_source_future) == numcontext)

        # Features of the 9 previous time steps
        if need_empty_past:
            past = np.concatenate((empty_source_past, data_source_past))
        else:
            past = data_source_past

        # Features of the 9 following time steps
        if need_empty_future:
            future = np.concatenate((data_source_future, empty_source_future))
        else:
            future = data_source_future

        # Combine the 9 previous steps, the current step, and the 9 following steps
        past = np.reshape(past, numcontext * numcep)
        now = orig_inputs[time_slice]
        future = np.reshape(future, numcontext * numcep)

        train_inputs[time_slice] = np.concatenate((past, now, future))
        assert(len(train_inputs[time_slice]) == numcep + 2 * numcep * numcontext)

    # Standardize the data: subtract the mean, then divide by the standard deviation
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)

    return train_inputs
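
To see the context stacking and normalization in isolation, here is a minimal plain-Python sketch that works on a small list of feature vectors instead of a WAV file. The helper names stack_context and standardize are made up for this example; the logic is meant to mirror the zero padding and the global mean/standard-deviation scaling used in audiofile_to_input_vector.

```python
import statistics

def stack_context(frames, numcontext):
    """For each time step, concatenate numcontext past frames, the current frame,
    and numcontext future frames, zero-padding beyond the sequence boundaries."""
    numcep = len(frames[0])
    zero = [0.0] * numcep
    padded = [zero] * numcontext + [list(f) for f in frames] + [zero] * numcontext
    stacked = []
    for t in range(len(frames)):
        window = padded[t:t + 2 * numcontext + 1]
        stacked.append([x for frame in window for x in frame])
    return stacked

def standardize(rows):
    """Subtract the global mean and divide by the global (population) standard deviation."""
    flat = [x for row in rows for x in row]
    mean = statistics.mean(flat)
    std = statistics.pstdev(flat)
    return [[(x - mean) / std for x in row] for row in rows]

# Toy input: 4 time steps with 2 features each, 1 context frame on each side
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
stacked = stack_context(frames, numcontext=1)
print(stacked[0])   # past is zero-padded: [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(stacked[-1])  # future is zero-padded: [5.0, 6.0, 7.0, 8.0, 0.0, 0.0]
normalized = standardize(stacked)
```

Each stacked row has (2 * numcontext + 1) * numcep values, which gives 19 * 26 = 494 for the settings used in this tutorial.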

7. Converting text samples to vectors

The text samples need to be converted into concrete vectors. The code is as follows:


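# Convert characters to a vector, i.e. look up each character's index in word_num_map
def get_ch_lable_v(txt_file, word_num_map, txt_label=None):
    words_size = len(word_num_map)

    # Characters missing from word_num_map are mapped to the index words_size
    to_num = lambda word: word_num_map.get(word, words_size)

    # If a file is given, read the label text from it
    # (get_ch_lable is defined outside this section)
    if txt_file is not None:
        txt_label = get_ch_lable(txt_file)

    print(txt_label)
    labels_vector = list(map(to_num, txt_label))
    print(labels_vector)
    return labels_vector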

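How should we understand the function above? Running the code makes it clear. In the previous part, we called get_wav_files_and_tran_texts to get all the WAV files and their transcripts. Now let's process the transcripts: collect all the characters, use Counter from the collections module to count how often each character occurs, and then put the characters into a dictionary. The code is as follows: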


from collections import Counter

# Character table
all_words = []
for label in labels:
    #print(label)
    all_words += [word for word in label]

# Counter returns a Counter object with each element as key and its count as value
counter = Counter(all_words)
# Sort the characters
words = sorted(counter)
words_size = len(words)
word_num_map = dict(zip(words, range(words_size)))

print(word_num_map)

The output is as follows:

Then call the get_ch_lable_v function defined above:


get_ch_lable_v(None, word_num_map, labels[0])

exit()

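Output:

As you can see, get_ch_lable_v printed the transcript we passed in along with a list of numbers. Let's search for 2490 to check whether it corresponds to the character “闪”, and whether 40 corresponds to “乌”.

Just as expected. This is how the text is converted into a vector.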
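
The same mapping can be tried on a tiny, made-up corpus. The sketch below is a self-contained variant under stated assumptions: the two sample transcripts are invented, and encode and decode are hypothetical helper names, not functions from this tutorial. Like get_ch_lable_v, encode maps any character that is not in the table to the index len(word_num_map).

```python
from collections import Counter

# Invented sample transcripts, for illustration only
labels = ["今天天气", "天气很好"]

# Build the character table: count characters, sort them, number them
counter = Counter(ch for label in labels for ch in label)
words = sorted(counter)
word_num_map = dict(zip(words, range(len(words))))

def encode(text, word_num_map):
    """Map each character to its index; unknown characters get len(word_num_map)."""
    unknown = len(word_num_map)
    return [word_num_map.get(ch, unknown) for ch in text]

def decode(vector, words):
    """Map indices back to characters (assumes every index is in range)."""
    return "".join(words[i] for i in vector)

vector = encode(labels[1], word_num_map)
print(vector)
print(decode(vector, words))
print(encode("雨", word_num_map))  # "雨" is not in the table
```

Decoding an encoded transcript returns the original string as long as every character is in the table; the out-of-table index has no matching entry in words, so it cannot be decoded.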

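8. Converting audio to MFCCs and transcripts to vectors

Now we combine the two functions above: the audio data becomes a matrix of time steps (rows) by MFCC features (columns), and the corresponding transcript becomes a character vector. The code is as follows: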


# Convert audio data into a matrix of time steps (rows) by MFCC features (columns),
# and convert the corresponding transcripts into character vectors
def get_audio_and_transcriptch(txt_files, wav_files, n_input, n_context, word_num_map, txt_labels=None):

    audio = []
    audio_len = []
    transcript = []
    transcript_len = []
    if txt_files is not None:
        txt_labels = txt_files

    for txt_obj, wav_file in zip(txt_labels, wav_files):
        # Load the audio and convert it to features
        audio_data = audiofile_to_input_vector(wav_file, n_input, n_context)
        audio_data = audio_data.astype('float32')
        # print(word_num_map)
        audio.append(audio_data)
        audio_len.append(np.int32(len(audio_data)))

        # Load the transcript and convert it to a list of indices
        target = []
        if txt_files is not None:  # txt_obj is a file
            target = get_ch_lable_v(txt_obj, word_num_map)
        else:
            target = get_ch_lable_v(None, word_num_map, txt_obj)  # txt_obj is a label string
        #target = text_to_char_array(target)
        transcript.append(target)
        transcript_len.append(len(target))

    # audio and transcript stay Python lists: their elements differ in length,
    # and recent NumPy versions refuse to build such ragged arrays implicitly
    audio_len = np.asarray(audio_len)
    transcript_len = np.asarray(transcript_len)
    return audio, audio_len, transcript, transcript_len

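9. Aligning audio data within a batch

The zero padding above applies to the features of a single audio file. During training, files are fetched and trained on batch by batch, which requires all audio in a batch to have the same number of time steps, so we need the alignment step below.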


# Alignment (padding)
def pad_sequences(sequences, maxlen=None, dtype=np.float32,
                  padding='post', truncating='post', value=0.):
    # e.g. [478 512 503 406 481 509 422 465]
    lengths = np.asarray([len(s) for s in sequences], dtype=np.int64)

    nb_samples = len(sequences)

    # maxlen: the longest sequence length in this batch
    if maxlen is None:
        maxlen = np.max(lengths)

    # Take the sample shape from the first non-empty sequence;
    # the main loop below checks every sequence against it
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty sequence, skip it

        # truncating='pre' drops steps from the front of an over-long sequence,
        # 'post' drops them from the end
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        # padding='post' pads at the end, 'pre' pads at the front
        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)

    return x, lengths
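
As a quick illustration of what post-padding does, the plain-Python sketch below pads a batch of variable-length sequences with a fill value and returns the original lengths. pad_post is a simplified stand-in written for this example; it handles only flat (one-dimensional) sequences and only padding at the end, unlike pad_sequences above.

```python
def pad_post(sequences, value=0.0):
    """Pad every sequence at the end so that all have the length of the longest one."""
    lengths = [len(s) for s in sequences]
    maxlen = max(lengths) if lengths else 0
    padded = [list(s) + [value] * (maxlen - len(s)) for s in sequences]
    return padded, lengths

batch = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded, lengths = pad_post(batch)
print(padded)   # [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0], [5.0, 6.0, 0.0]]
print(lengths)  # [3, 1, 2]
```

Returning the original lengths alongside the padded batch is what lets later code tell real time steps apart from padding.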

10. Creating a sparse representation of sequences

The following function creates a sparse representation of a list of sequences:


# Create a sparse representation of the sequences
def sparse_tuple_from(sequences, dtype=np.int32):
    indices = []
    values = []

    for n, seq in enumerate(sequences):
        # One (sequence number, position) pair per element
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    # Dense shape: number of sequences x length of the longest sequence
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)
    # return tf.SparseTensor(indices=indices, values=values, shape=shape)
    return indices, values, shape

What does this function do? A small demo makes it clear:


sq = [[0,1,2,3,4], [5,6,7,8,]]

indices, values, shape = sparse_tuple_from(sq)

print(indices)

print(values)

print(shape)

Output:
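
To make the three return values concrete, here is a plain-Python re-implementation sketch applied to the same input. It uses lists and tuples instead of NumPy arrays, and the names sparse_from_lists and to_dense are made up for this example.

```python
def sparse_from_lists(sequences):
    """Return (indices, values, shape) describing the sequences as a sparse 2-D matrix."""
    indices, values = [], []
    for n, seq in enumerate(sequences):
        for t, v in enumerate(seq):
            indices.append((n, t))   # (row = sequence number, column = position)
            values.append(v)
    shape = (len(sequences), max(len(seq) for seq in sequences))
    return indices, values, shape

def to_dense(indices, values, shape, fill=0):
    """Rebuild the dense matrix; positions without an entry get the fill value."""
    dense = [[fill] * shape[1] for _ in range(shape[0])]
    for (n, t), v in zip(indices, values):
        dense[n][t] = v
    return dense

sq = [[0, 1, 2, 3, 4], [5, 6, 7, 8]]
indices, values, shape = sparse_from_lists(sq)
print(indices[:3], indices[-1])  # [(0, 0), (0, 1), (0, 2)] (1, 3)
print(values)                    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(shape)                     # (2, 5)
print(to_dense(indices, values, shape, fill=-1))
```

In other words, indices records where each label sits (which sequence, which position), values holds the labels themselves, and shape is the size of the dense matrix that would hold all sequences after padding.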

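11. Converting character vectors back to text

We have a function that converts text to character vectors, so there should also be one that converts character vectors back to text. The code is as follows:

# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # 0 is reserved to space

# Convert character vectors in sparse form back to text
# tuple is the return value of sparse_tuple_from
def sparse_tuple_to_texts_ch(tuple, words):
    indices = tuple[0]
    values = tuple[1]
    results = [''] * tuple[2][0]
    for i in range(len(indices)):
        index = indices[i][0]
        c = values[i]

        # Index SPACE_INDEX is rendered as a space; everything else is looked up in words
        c = ' ' if c == SPACE_INDEX else words[c]
        results[index] = results[index] + c
    return results

# Convert character vectors in dense form back to text
def ndarray_to_text_ch(value, words):
    results = ''
    for i in range(len(value)):
        results += words[value[i]]#chr(value[i] + FIRST_INDEX)
    return results.replace('`', ' ')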
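
A small example shows how a sparse triple maps back to text. The sketch below is self-contained and uses an invented three-character vocabulary; sparse_to_texts is a simplified, hypothetical variant of sparse_tuple_to_texts_ch that skips the special handling of SPACE_INDEX.

```python
def sparse_to_texts(indices, values, num_rows, words):
    """Rebuild one string per row from (row, position) indices and character indices.

    Assumes the entries are listed in position order within each row,
    which is how sparse_tuple_from produces them.
    """
    results = [""] * num_rows
    for (row, _pos), value in zip(indices, values):
        results[row] += words[value]
    return results

# Invented vocabulary and two encoded sequences: "你好" and "好吗"
words = ["你", "吗", "好"]
indices = [(0, 0), (0, 1), (1, 0), (1, 1)]
values = [0, 2, 2, 1]
print(sparse_to_texts(indices, values, 2, words))  # ['你好', '好吗']
```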

12. The next_batch function

Next, we implement next_batch, which fetches the next batch of training data:


# Number of mel cepstral coefficients
n_input = 26
# Number of context time steps to include on each side of every time step
n_context = 9
# Batch size
batch_size = 8
def next_batch(wav_files, labels, start_idx=0, batch_size=1):
    filesize = len(labels)
    # Compute the start and end indices of the samples to fetch
    end_idx = min(filesize, start_idx + batch_size)
    idx_list = range(start_idx, end_idx)
    # Get the audio file paths to train on and their corresponding transcripts
    txt_labels = [labels[i] for i in idx_list]
    wav_files = [wav_files[i] for i in idx_list]
    # Convert the audio files into training data
    (source, audio_len, target, transcript_len) = get_audio_and_transcriptch(None,
                                                      wav_files,
                                                      n_input,
                                                      n_context, word_num_map, txt_labels)

    start_idx += batch_size
    # If start_idx has reached the total number of samples, return -1 to signal the end
    if start_idx >= filesize:
        start_idx = -1

    # Pad input to max_time_step of this batch
    # With several files, unify their lengths (truncating to maxlen or zero padding)
    source, source_lengths = pad_sequences(source)
    # Build the sparse representation of the label sequences
    sparse_labels = sparse_tuple_from(target)

    return start_idx, source, source_lengths, sparse_labels
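
The returned start_idx doubles as an end-of-data signal: it becomes -1 once the last batch has been served. The sketch below isolates just that index logic so the loop pattern can be tried without audio files; batch_bounds is a hypothetical helper, not part of the tutorial code.

```python
def batch_bounds(num_samples, start_idx, batch_size):
    """Return (next_start_idx, start, end) using the same index logic as next_batch."""
    end_idx = min(num_samples, start_idx + batch_size)
    next_idx = start_idx + batch_size
    if next_idx >= num_samples:
        next_idx = -1  # no more data after this batch
    return next_idx, start_idx, end_idx

# Walk once over 20 samples in batches of 8
num_samples, batch_size = 20, 8
start_idx = 0
batches = []
while start_idx != -1:
    start_idx, lo, hi = batch_bounds(num_samples, start_idx, batch_size)
    batches.append((lo, hi))
print(batches)  # [(0, 8), (8, 16), (16, 20)]
```

Note that the last batch can be smaller than batch_size when the number of samples is not a multiple of it.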

We can write a small demo to check that the function above behaves as expected:


print('Audio file:  ' + wav_files[0])
print('Transcript:  ' + labels[0])
# Fetch one batch of data
next_idx, source, source_len, sparse_lab = next_batch(wav_files, labels, 0, batch_size)
print(np.shape(source))
# Convert the character vectors back to text
t = sparse_tuple_to_texts_ch(sparse_lab, words)
print(t[0])

Output: