Stanford CS230 Deep Learning (9): Attention Mechanism and Speech Recognition

Lecture 9 of CS230 mainly covers, at a high level, how deep reinforcement learning (Deep Reinforcement Learning) is carried out, along with some application scenarios. What remains on Coursera is the final C5M3 module and its programming assignments, which cover beam search and attention in sequence models, plus trigger word detection in speech recognition.

Review of Key Concepts

Beam Search

A seq2seq (sequence to sequence) model generally consists of two parts: an encoder and a decoder. The encoder processes the input sequence into a representation, and the decoder turns that representation into an output sequence. Such a model allows input and output sequences of different lengths.
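To make the encoder/decoder split concrete, here is a minimal Keras sketch (not the course's model; all layer and vocabulary sizes below are made-up illustration values): the encoder LSTM compresses the input sequence into its final states, which initialize a decoder LSTM that outputs a distribution over the next token at every step (with teacher forcing, i.e. the decoder sees the previous target tokens as input during training).

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

Tx_, Ty_, in_vocab, out_vocab, units = 30, 10, 37, 11, 64    # hypothetical sizes

enc_in = Input(shape=(Tx_, in_vocab))                        # one-hot input sequence
_, h, c = LSTM(units, return_state=True)(enc_in)             # encoder: keep only its final states
dec_in = Input(shape=(Ty_, out_vocab))                       # previous target tokens (teacher forcing)
dec_seq = LSTM(units, return_sequences=True)(dec_in, initial_state=[h, c])
y_probs = Dense(out_vocab, activation='softmax')(dec_seq)    # P(y^(t) | x, y^(<t)) at each step
seq2seq = Model([enc_in, dec_in], y_probs)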

In the many-to-many machine translation model presented in the course, the model is ultimately choosing the output sequence with the largest conditional probability, i.e. $\arg\max_{y^{(1)},\cdots,y^{(T_y)}} P(y^{(1)},\cdots,y^{(T_y)} \mid x^{(1)},\cdots,x^{(T_x)})$. Since the output space is an enormous high-dimensional space, it is impossible to enumerate every possible length-$T_y$ output sequence (their number grows exponentially with $T_y$), so a heuristic search is used to find a sequence that approximately maximizes this probability.

In principle one could use greedy search, picking at each step $t=1,\cdots,T_y$ the $y^{(t)}$ that maximizes the current conditional probability $P(y^{(t)} \mid x^{(1)},\cdots,x^{(T_x)},y^{(1)},\cdots,y^{(t-1)})$. In practice this usually works poorly, because greedy search easily gets stuck in a local optimum: a choice that looks best right now may, one or two steps later in the sequence, turn out not to be part of the globally best output. Beam search does much better, even though it is not guaranteed to converge to the global optimum $\max_y P(y\mid x)$ either.
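As a sketch, greedy decoding simply takes the arg max at every step. Here `next_token_log_probs` is a hypothetical scorer that, given the tokens decoded so far, returns $\log P(y^{(t)}\mid x,y^{(1)},\cdots,y^{(t-1)})$ for every token in the vocabulary (the same scorer is reused in the beam search sketch below):

import numpy as np

def greedy_decode(next_token_log_probs, max_len=10, eos=0):
    """Pick the single most likely token at every step; no lookahead."""
    seq = []
    for _ in range(max_len):
        log_probs = next_token_log_probs(seq)   # shape (vocab_size,)
        tok = int(np.argmax(log_probs))
        seq.append(tok)
        if tok == eos:                          # stop once end-of-sequence is emitted
            break
    return seq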

Beam search introduces a hyperparameter, the beam width $B$. At each step it keeps the $B$ values of $y^{(t)}$ with the largest current conditional probability $P(y^{(t)} \mid x^{(1)},\cdots,x^{(T_x)},y^{(1)},\cdots,y^{(t-1)})$. Each of these $B$ partial sequences is then extended by its $B$ most probable continuations $y^{(t+1)}$, which yields $B \times B$ candidates; of these, the $B$ with the highest conditional probability move on to the next round of the search.

The sequence $y$ found by beam search is therefore (approximately)
$$\begin{aligned}&\arg\max_{y}\prod_{t=1}^{T_y} P(y^{(t)}\mid x,y^{(1)},\cdots,y^{(t-1)})\\ &=\arg\max_{y}P(y^{(1)}\mid x)\,P(y^{(2)}\mid x,y^{(1)})\cdots P(y^{(T_y)}\mid x,y^{(1)},\cdots,y^{(T_y-1)})\\ &=\arg\max_{y}P(y^{(1)},\cdots,y^{(T_y)}\mid x)\end{aligned}$$

Because the logarithm is monotone, this optimization is equivalent to
$$\arg\max_{y}\sum_{t=1}^{T_y}\log P(y^{(t)}\mid x,y^{(1)},\cdots,y^{(t-1)})$$

This objective, however, biases the model toward very short output sequences: the smaller $T_y$ is, the fewer probability factors (each at most 1) are multiplied, so the resulting score is larger. A length-based normalization is therefore added so that the model is less inclined to output short sequences, and the objective becomes
$$\arg\max_{y}\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y}\log P(y^{(t)}\mid x,y^{(1)},\cdots,y^{(t-1)})$$
where $\alpha\in[0,1]$ is treated as another hyperparameter to tune.
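A simplified beam search with this length-normalized score might look as follows (a sketch rather than a production implementation; it reuses the hypothetical `next_token_log_probs` scorer from the greedy sketch above, and for brevity expands only the top $B$ tokens of each beam instead of the whole vocabulary):

import numpy as np

def beam_search(next_token_log_probs, B=3, max_len=10, alpha=0.7, eos=0):
    """Keep the B highest-scoring partial sequences at every step."""
    beams = [([], 0.0)]                                   # (token list, sum of log probs)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == eos:                    # finished beams are carried over unchanged
                candidates.append((seq, logp))
                continue
            log_probs = next_token_log_probs(seq)         # shape (vocab_size,)
            for tok in np.argsort(log_probs)[-B:]:        # top-B continuations of this beam
                candidates.append((seq + [int(tok)], logp + float(log_probs[tok])))
        # keep the B best of the (up to) B*B candidates under the length-normalized score
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** alpha), reverse=True)
        beams = candidates[:B]
    return beams[0][0]                                    # best sequence found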

The beam width $B$ can be tuned by analyzing the model's predictions. Suppose the ground-truth target sequence is $y^*$, so the model assigns it probability $P(y^*\mid x)$, and let $y$ be the sequence returned by beam search (the sketch after this list shows the comparison in code):

  • If $P(y\mid x) < P(y^*\mid x)$, the sequence chosen by beam search does not attain the highest available probability. Most likely the beam width $B$ is too small: the search explores too little of the space and gets stuck in a local optimum, so increasing $B$ may help.
  • If $P(y\mid x) > P(y^*\mid x)$, then even though $y^*$ looks like the better sequence to us, the model assigns a higher probability to $y$. In that case the seq2seq model itself is most likely at fault and needs to be improved.
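In code the check is just a comparison of the two scores under the trained model (a sketch; `sequence_log_prob` is a hypothetical function that sums $\log P(y^{(t)}\mid x,y^{(<t)})$ over a given sequence):

def attribute_error(sequence_log_prob, x, y_hat, y_star):
    """Decide whether beam search or the seq2seq model is at fault for a mistranslation."""
    score_hat = sequence_log_prob(x, y_hat)     # log P(y_hat | x): what beam search returned
    score_star = sequence_log_prob(x, y_star)   # log P(y*    | x): the human reference
    if score_star > score_hat:
        return "beam search at fault: it missed a higher-probability sequence, try a larger B"
    return "model at fault: it prefers y_hat over y*, improve the seq2seq network itself"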

Attention Mechanism

In a seq2seq model the encoder captures the information of the entire input sequence, and the decoder then predicts the output sequence one step at a time. For tasks like machine translation there is a rough correspondence between parts of the input and parts of the output, so at each output step the model should focus on the input information relevant to that step. An attention mechanism lets the model extract exactly that information and achieve better results.

Concretely, attention shows up as the decoder's input being an attention-weighted information vector (called the context vector in the code below). Suppose the decoder uses $T_y$ LSTM cells. The vector $c^{(t)}$ fed into the LSTM cell at step $t$ is a linear combination of all the encoder outputs:
$$c^{(t)}=\sum_{t^\prime=1}^{T_x}\alpha^{(t,t^\prime)}a^{(t^\prime)}$$
where $a^{(t^\prime)}$ is the encoder output at step $t^\prime$ and $\alpha^{(t,t^\prime)}$ is the attention weight on $a^{(t^\prime)}$, with $\sum_{t^\prime}\alpha^{(t,t^\prime)}=1$. Each weight expresses how much attention the current prediction $y^{(t)}$ should pay to the encoder output $a^{(t^\prime)}$: the closer to 1, the more important that output is.

The attention weights $\alpha^{(t,t^\prime)}$ have to be learned by a separate small network. Because the weights sum to 1, they can be treated as the output of a softmax, so a fully connected layer with a softmax activation can learn them. Moreover, the attention at each step depends on what has been decoded so far, so the decoder's previous hidden state must be fed into this layer as well; the inputs for learning the attention weights are therefore the decoder's previous hidden state $s^{(t-1)}$ (called `s_prev` in the code below) together with all the encoder outputs $a^{(t^\prime)}$.
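Written out in NumPy, one step of this computation looks roughly like the following (a sketch of the same idea as `one_step_attention` in the code below, with the Dense(1, relu) scoring layer replaced by hypothetical weights `W`, `b` that would normally be learned):

import numpy as np

def np_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_vector(a, s_prev, W, b):
    """
    a:      (Tx, 2*n_a)  encoder (Bi-LSTM) outputs
    s_prev: (n_s,)       previous decoder hidden state
    W, b:   parameters of the small scoring network (hypothetical stand-in for the Dense layer)
    """
    Tx = a.shape[0]
    s_rep = np.repeat(s_prev[None, :], Tx, axis=0)     # RepeatVector: (Tx, n_s)
    concat = np.concatenate([a, s_rep], axis=-1)       # (Tx, 2*n_a + n_s)
    e = np.maximum(concat @ W + b, 0.0)                # Dense(1, relu): energies, (Tx, 1)
    alphas = np_softmax(e, axis=0)                     # attention weights, sum to 1 over t'
    return (alphas * a).sum(axis=0)                    # weighted sum of encoder outputs: (2*n_a,)

# toy usage with random parameters
Tx, n_a, n_s = 30, 64, 128
a = np.random.randn(Tx, 2 * n_a)
s_prev = np.random.randn(n_s)
W = np.random.randn(2 * n_a + n_s, 1) * 0.01
b = np.zeros(1)
print(context_vector(a, s_prev, W, b).shape)   # (128,)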

Once the model is trained, visualizing these attention weights shows clearly which parts of the input each part of the output focuses on, i.e. what the correspondence between input and output looks like.

Trigger Word Detection

In speech recognition the input example is an audio clip, and the model's input data are the spectrogram features of that clip; the horizontal axis of the spectrogram is time and the vertical axis is frequency. For trigger word detection, the label for each example is a 0-1 vector of length $T_y$ that is set to 1 for a short window right after the trigger word finishes being spoken and 0 everywhere else.

Since the model outputs the probability that the current step's label is 1, all that remains is to set a threshold: whenever the predicted probability exceeds it, the trigger word is marked as detected.
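That last step is a one-liner once the per-step probabilities are available (a sketch; the array shape matches the Keras model further down, which outputs one sigmoid probability per output time step):

import numpy as np

def flag_detections(predictions, threshold=0.5):
    """predictions: array of shape (1, Ty, 1) with P(trigger word just ended) per output step."""
    probs = predictions[0, :, 0]
    return np.where(probs > threshold)[0]     # indices of output steps flagged as detections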

Assignment Code

1. Neural Machine Translation with Attention

Use a sequence model with attention to translate human-readable dates into a standard output format.

from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt

# Load the dataset (random dates from 1970-01-01 onward, generated with Faker and rendered in random human-readable formats)
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
# Inspect the data
dataset[:10]

# Data preprocessing
Tx = 30 # maximum length of the input X (shorter inputs are padded)
Ty = 10 # fixed length of the output Y ("YYYY-MM-DD")
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

# Inspect a preprocessed example
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])


# Neural machine translation with attention
# Two key steps: first, given the Bi-LSTM hidden states a and the previous decoder state s,
# compute the attention weights at step t and hence the context vector;
# then assemble the full model from these pieces.

# Define shared (global) layers used to compute the context vector
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor = Dense(1, activation="relu")
activator = Activation(softmax, name='attention_weights')
dotor = Dot(axes=1)

# Compute the context vector at decoder step t
def one_step_attention(a, s_prev):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
    
    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    # repeat s_(t-1) Tx times so it can be concatenated with every a^(t')
    s_prev = repeator(s_prev)
    # concatenate the encoder outputs with the repeated decoder state
    concat = concatenator([a, s_prev])
    # fully connected layer produces the "energies" e
    e = densor(concat)
    # softmax activation gives the attention weights alpha
    alphas = activator(e)
    # attention-weighted sum of the encoder outputs = context vector
    context = dotor([alphas, a])
    return context

# Define shared (global) layers for the model
n_a = 64
n_s = 128
post_attention_LSTM_cell = LSTM(n_s, return_state=True)
output_layer = Dense(len(machine_vocab), activation=softmax)

# Build the NMT model
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    # input to the pre-attention Bi-LSTM
    X = Input(shape=(Tx, human_vocab_size))
    # initial hidden/cell states of the post-attention LSTM
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    # initialize the list of outputs
    outputs = []
    # all hidden states a of the pre-attention Bi-LSTM (the encoder outputs)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    # loop over the output steps: compute the context vector, then feed it to the post-attention LSTM
    for t in range(Ty):
        # context vector at step t
        context = one_step_attention(a, s)
        # one step of the post-attention LSTM
        s, _, c = post_attention_LSTM_cell(context, initial_state=[s, c])
        # predicted distribution y_hat_t
        y_t = output_layer(s)
        # append to the list of outputs
        outputs.append(y_t)
    # create the model
    model = Model(inputs=[X, s0, c0], outputs=outputs)
    return model

# Create the model
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
# Summary of the architecture
model.summary()
# Compile
out = model.compile(optimizer=Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01),
                    metrics=['accuracy'],
                    loss='categorical_crossentropy')
# Initialize the decoder's initial states
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))

# Train
model.fit([Xoh, s0, c0], outputs, epochs=10, batch_size=100)

# Predict on some test examples
EXAMPLES = ['3 June 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))
    prediction = model.predict([source.reshape(1,30,37), s0[0].reshape(1,128), c0[0].reshape(1,128)])
    prediction = np.argmax(prediction, axis = -1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    print("source:", example)
    print("output:", ''.join(output))

# Visualize the attention weights
# (the map is more interpretable once the model has been trained well)
attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, 'Tue 10 Jul 2007', num=6, n_s=128)


2. Trigger Word Detection

import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import pygame
from td_utils import *
from scipy.io import wavfile          # used below to read wav files
import matplotlib.pyplot as plt       # used below to plot labels and predictions

'''======== Synthesize audio and create the dataset ======='''
pygame.mixer.init()
# Load one audio clip at a time
pygame.mixer.music.load("data/raw_data/activates/2.wav")
pygame.mixer.music.load("data/raw_data/negatives/3.wav")
pygame.mixer.music.load("data/raw_data/backgrounds/1.wav")
pygame.mixer.music.load("data/audio_examples/example_train.wav")
# Play
pygame.mixer.music.play()

# Plot the spectrogram
x = graph_spectrogram("data/audio_examples/example_train.wav")

# Read in the training example
_, data = wavfile.read("data/audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)

# Set hyperparameters
Tx = 5511 # number of spectrogram time steps fed into the model
n_freq = 101 # number of frequencies at each spectrogram time step
Ty = 1375 # number of time steps in the model's output

# Synthesize a single training example
# Load the raw audio (2 backgrounds, 10 "activate" clips, 10 negative clips)
activates, negatives, backgrounds = load_raw_audio()

# Backgrounds are 10,000 ms long; trigger word clips vary in length, around 1,000 ms
print("background len: " + str(len(backgrounds[0])))
print("activate[2] len: " + str(len(activates[2])))
print("activate[3] len: " + str(len(activates[3]))) 

# Get a random time segment of length segment_ms into which a clip can be inserted
def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.
    
    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")
    
    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    segment_start = np.random.randint(low=0, high=10000-segment_ms)
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)

# Check whether a time segment overlaps with existing segments
def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    
    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments
    
    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    overlap = False
    for (segment_start, segment_end) in previous_segments:
        if segment_time[0]<=segment_end and segment_time[1]>=segment_start:
            overlap = True
            break
    return overlap

# Test
overlap1 = is_overlapping((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print("Overlap 1 = ", overlap1)
print("Overlap 2 = ", overlap2)

# Insert one audio clip into the background audio
def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.
    
    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed
    
    Returns:
    new_background -- the updated background audio
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    segment_ms = len(audio_clip)
    segment_time = get_random_time_segment(segment_ms)
    # if the segment overlaps, pick a new random segment
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)
    previous_segments.append(segment_time)
    # overlay the clip onto the background audio
    new_background = background.overlay(audio_clip, position=segment_time[0])
    return new_background, segment_time
    
# Synthesize insert_test.wav
np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
audio_clip.export("data/insert_test.wav", format="wav")
print("Segment Time: ", segment_time)
# Play
pygame.mixer.music.load("data/insert_test.wav")
pygame.mixer.music.play()

# Update the labels: mark the steps right after the inserted clip as 1
def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment 
    should be set to 1. By strictly we mean that the label of segment_end_y should be 0, while the
    50 following labels should be ones.
    
    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms
    
    Returns:
    y -- updated labels
    """
    segment_end_y = int(Ty * segment_end_ms / 10000.0)
    start_y = segment_end_y + 1
    end_y = min(start_y + 50, y.shape[1])
    y[0, start_y : end_y] = 1
    return y
    
# Test
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
# plt.plot(arr1[0,:])
plt.plot(insert_ones(arr1, 4251)[0,:])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635])

# Synthesize audio to build a training example
def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    
    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"
    
    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    np.random.seed(18)
    # make the background quieter
    background = background - 20
    # initialize the labels y
    y = np.zeros((1, Ty))
    # keep track of segments that have already been inserted
    previous_segments = []
    # insert 0-4 randomly chosen "activate" clips
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(10, size=number_of_activates)
    for i in range(number_of_activates):
        background, segment_time = insert_audio_clip(background, activates[random_indices[i]], previous_segments)
        y = insert_ones(y, segment_time[1])
    # insert 0-2 randomly chosen negative clips
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(10, size=number_of_negatives)
    for i in range(number_of_negatives):
        background, _ = insert_audio_clip(background, negatives[random_indices[i]], previous_segments)
    # standardize the volume of the synthesized audio
    background = match_target_amplitude(background, -20.0)
    # export
    background.export("data/train.wav", format="wav")
    print("File (train.wav) was saved in your directory.")
    # get the spectrogram of x
    x = graph_spectrogram("data/train.wav")
    return x, y

# Synthesize
x, y = create_training_example(backgrounds[0], activates, negatives)
# Play
pygame.mixer.music.load("data/train.wav")
pygame.mixer.music.play()

plt.plot(y[0])

# Load the preprocessed training and dev sets
X = np.load("data/XY_train/X.npy")
Y = np.load("data/XY_train/Y.npy")
# the dev set consists of real (not synthesized) audio
X_dev = np.load("data/XY_dev/X_dev.npy")
Y_dev = np.load("data/XY_dev/Y_dev.npy")


'''======== Build and train the model ======='''
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Model, load_model, Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from tensorflow.keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from tensorflow.keras.optimizers import Adam


# Create the model
def model(input_shape):
    """
    Function creating the model's graph in Keras.
    
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape=input_shape)
    
    X = Conv1D(196, 15, strides=4)(X_input)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Dropout(0.2)(X)
    
    X = GRU(128, return_sequences=True)(X)
    X = Dropout(0.2)(X)
    X = BatchNormalization()(X)

    X = GRU(128, return_sequences=True)(X)
    X = Dropout(0.2)(X)
    X = BatchNormalization()(X)
    X = Dropout(0.2)(X)
    # time-distributed dense layer
    X = TimeDistributed(Dense(1, activation='sigmoid'))(X)
    # create the model
    model = Model(inputs=X_input, outputs=X)
    return model

# Create the model
model = model(input_shape=(Tx, n_freq))
model.summary()
# Total params: 523,329

# Training from scratch is slow, so load the assignment's model, already trained for about three hours on 4,000 examples
model = load_model('models/tr_model.h5')

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size=5, epochs=1)
# accuracy: 97.91%

# Performance on the dev set
loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)
# 0.94839275

# Prediction
# Get the predicted labels for an audio clip
def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # the spectrogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x  = x.swapaxes(0,1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)
    
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions

chime_file = "data/audio_examples/chime.wav"
# Overlay a chime at each detection, but chime at most once every 75 output time steps
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0
        
    audio_clip.export("data/chime_output.wav", format='wav')

# Try it on a dev example
filename = "data/raw_data/dev/2.wav"
pygame.mixer.music.load(filename)
pygame.mixer.music.play()

prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)

# Play the output audio (a chime follows each "activate")
pygame.mixer.music.load('data/chime_output.wav')
pygame.mixer.music.play()


# Try it on my own recorded audio
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')
    
your_filename = "data/audio_examples/my_audio1.wav"
preprocess_audio(your_filename)
pygame.mixer.music.load(your_filename) 
pygame.mixer.music.play()

prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, 0.1)
pygame.mixer.music.load('data/chime_output.wav')
pygame.mixer.music.play()


Finally, I tried the model on audio I recorded myself, and the results were not particularly good. I said the trigger word twice in the clip, but only the second occurrence was detected, and with a low predicted probability: it was only picked up when the threshold was set to around 0.1.
(Figure: model predictions on my own recording.)
