Trigger_word_detection

Experiment date

  • 2021.12.11

Experiment environment

# Keras==2.2.5 tensorflow==1.15.0

Experiment contents

  • Build an audio dataset and implement a trigger word (wake word) detection algorithm
  • The trigger word in this experiment is "activate"; every time the algorithm hears an "activate", it triggers a chime
  • A spoken "activate" counts as "positive"; everything else counts as "negative"

Data synthesis: Creating a speech dataset

Build a dataset of "activate" and other words spoken in a variety of environments.

Available data

  • Background noise recorded in different environments
  • Audio clips containing "positive / negative" words (spoken in a variety of accents)
  • In short, there are three kinds of audio clips
    • "background noise"
    • "positive words"
    • "negative words"
  • We will synthesize the audio dataset from these three kinds of clips

From audio recordings to spectrograms

  • An audio recording is produced by a microphone recording changes in air pressure

  • You can think of an audio clip as a long sequence of numbers measuring those air-pressure changes

  • The audio we use is sampled at 44100 numbers per second

  • It is hard to tell from this "raw" representation whether the word "activate" was said. To help the sequence model learn to detect the trigger word more easily, we compute a spectrogram of the audio. The spectrogram tells us how much of each frequency is present in the clip at each moment in time

    x = graph_spectrogram("audio_examples/example_train.wav")
    

[Figure: raw waveform and spectrogram of example_train.wav]

In the spectrogram, blue means a frequency is less active (quieter) at that moment, and green means it is more active (louder).

The spectrogram will be the input $x$ to the network, with $T_x = 5511$ (each spectrogram has 5511 time steps).

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)

>>Time steps in audio recording before spectrogram (441000,)
>>Time steps in input after spectrogram (101, 5511)

So we can now define the number of time steps in the spectrogram, and the number of frequencies at each time step.

Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram
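
graph_spectrogram is a helper supplied with the assignment. Purely as an illustration, here is a minimal sketch of how such a helper could be built with matplotlib; the window parameters (NFFT=200, noverlap=120, and the nominal Fs) are assumptions chosen so that a 10 s stereo clip sampled at 44100 Hz yields the (101, 5511) shape above.

import matplotlib.pyplot as plt
from scipy.io import wavfile

def my_graph_spectrogram(wav_file, nfft=200, noverlap=120):
    # Hypothetical stand-in for the provided helper, not the assignment's implementation
    _, data = wavfile.read(wav_file)
    channel = data[:, 0] if data.ndim > 1 else data          # use one channel of a stereo clip
    # windows: (441000 - 200) / (200 - 120) + 1 = 5511; frequency bins: 200 / 2 + 1 = 101
    pxx, freqs, bins, im = plt.specgram(channel, NFFT=nfft, Fs=8000, noverlap=noverlap)
    return pxx

x = my_graph_spectrogram("audio_examples/example_train.wav")
print(x.shape)   # expected: (101, 5511)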

We define the output of the GRU model to have $T_y = 1375$ time steps. This means the GRU splits the ten-second audio into 1375 intervals and, for each interval, tries to predict whether "activate" has just been said.

Ty = 1375 # The number of time steps in the output of our model

Generating a single training example

Because speech data is hard to collect and label, we synthesize training examples from the three kinds of clips above. Synthesizing one training example takes three steps

  • Pick a ten-second background audio clip
  • Randomly insert 0–4 "activate" audio clips into it
  • Randomly insert 0–2 "negative words" audio clips into it

Because we insert the clips ourselves, we know exactly where each "activate" clip is, which makes labeling easy.

We use the pydub package to manipulate the audio. pydub treats 1 ms as one discrete time step (10 s = 10,000 ms), which is why a ten-second clip is represented as 10,000 steps. (A minimal sketch of how such a loader might look follows the output below.)

# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

>>background len: 10000
>>activate[0] len: 721
>>activate[1] len: 731
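
load_raw_audio is also provided with the assignment. As a rough sketch only, a loader of this kind might look like the following; the raw_data/{activates,negatives,backgrounds} directory layout and file naming are assumptions.

import os
from pydub import AudioSegment

def my_load_raw_audio(raw_dir="raw_data"):
    # Hypothetical stand-in for the provided helper
    def load_dir(sub):
        folder = os.path.join(raw_dir, sub)
        return [AudioSegment.from_wav(os.path.join(folder, f))
                for f in sorted(os.listdir(folder)) if f.endswith(".wav")]
    # len(AudioSegment) is the clip duration in milliseconds, hence the lengths printed above
    return load_dir("activates"), load_dir("negatives"), load_dir("backgrounds")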

Note: when we overlay a clip onto the background noise, its position must not overlap with any clip that has already been inserted.

The labels for the background audio are all 0; whenever an "activate" clip is inserted, the labels of the 50 output steps immediately after its end are set to 1.

We will use the following four helper functions.

Function 1

Randomly generates the start and end position for a clip

  • Input: the length of the clip to be inserted
  • Output: a randomly generated (start, end) position for the clip
def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.
    
    Arguments:
    segment_ms -- the duration of the audio clip to insert, in ms
    
    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    
    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1
    
    return (segment_start, segment_end)
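
A quick usage check (the seed is only for reproducibility): with a 721 ms clip, the returned window always lies inside the 10,000 ms background.

np.random.seed(1)                     # reproducibility only
start, end = get_random_time_segment(721)
print(start, end, end - start + 1)    # end - start + 1 == 721, and end < 10000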
Function 2

Checks whether the clip about to be inserted overlaps with any clip that has already been inserted

  • Input
    • the start and end position of the clip to insert
    • the positions of the clips that already exist
def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    
    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments
    
    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    
    segment_start, segment_end = segment_time
    
    ### START CODE HERE ### (≈ 4 line)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False
    
    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    if overlap == False:
        for previous_start, previous_end in previous_segments:
            if previous_start <= segment_end and previous_end >= segment_start:
                overlap = True
    ### END CODE HERE ###

    return overlap
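
Two hand-made sanity checks: the first candidate window ends before the existing segment starts (no overlap), while the second shares the range 300–350 with an existing segment.

print(is_overlapping((100, 200), [(300, 400)]))               # False
print(is_overlapping((150, 350), [(300, 400), (500, 600)]))   # True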
Function 3

Inserts another clip into the background audio. The process can be summarized in four steps

  • Input

    • the background audio
    • the audio clip to insert
    • the segments that already exist
  • Output

    • the new background audio with the clip overlaid on it
    • the start and end position of the newly inserted clip
  • First, use the previous helper to generate a random insertion position

  • Check whether this position overlaps with any existing segment, and keep drawing new random positions until there is no overlap

  • Add the new segment's position to the list of existing segments

  • Overlay the audio clip onto the background

# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.
    
    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed
    
    Returns:
    new_background -- the updated background audio
    segment_time -- the (segment_start, segment_end) tuple where the clip was inserted
    """
    
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)
    
    ### START CODE HERE ### 
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert 
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)
    
    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep 
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time , previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###
    
    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position = segment_time[0])
    
    return new_background, segment_time
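
An example call using the clips loaded earlier; the pre-existing segment (3790, 4400) is an arbitrary placeholder, and the seed only makes the random draw reproducible.

np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
print(segment_time)                                   # a (start, end) pair in ms that avoids (3790, 4400)
audio_clip.export("insert_test.wav", format="wav")    # listen to the result to verify the overlay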
Function 4

Sets the labels after an inserted "activate" segment to 1

  • Input:
    • the current label vector y
    • the end position of the inserted clip
# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly we mean that the label of segment_end_y itself stays 0, while the
    50 following labels are set to 1.
    
    
    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms
    
    Returns:
    y -- updated labels
    """
    
    # Convert the end of the segment from ms into output time-steps (Ty steps span 10,000 ms)
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    
    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y+1 , segment_end_y+51):
        if i < Ty:
            y[0, i] = 1.0
    ### END CODE HERE ###
    
    return y
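
A quick sanity check: a segment ending at 9700 ms maps to output step int(9700 * 1375 / 10000) = 1333, so that step stays 0 and the following 50 steps become 1 (clipped at Ty).

arr1 = insert_ones(np.zeros((1, Ty)), 9700)
print(arr1[0][1333], arr1[0][1334])   # 0.0 1.0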

Using the functions above, build a training example

Insert both "activate" and "negative" clips into the background

  • Initialize the label vector $y$ as a zero vector of shape $(1, T_y)$
  • Initialize the set of existing segments to be empty
  • Randomly insert 0–4 "activate" audio clips, and set the corresponding positions of $y$ to 1
  • Randomly insert 0–2 "negative" audio clips
# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    
    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"
    
    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    
    # Set the random seed
    np.random.seed(18)
    
    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1,Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###
    
    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]
    
    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###
    
    # Standardize the volume of the audio clip 
    background = match_target_amplitude(background, -20.0)

    # Export new training example 
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")
    
    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")
    
    return x, y
  • Calling the function

    x, y = create_training_example(backgrounds[0], activates, negatives)
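
The shapes of the synthesized example should match what was established above:

print(x.shape)   # (101, 5511) -- spectrogram of the synthesized clip
print(y.shape)   # (1, 1375)   -- one label per output time step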
    


Full training set

Load the training examples that have already been generated.

# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")

Development set

Load 25 real recorded ten-second audio clips as the dev set; their distribution is similar to the test set.

# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")
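
A quick shape check (the exact number of synthesized training examples depends on the files shipped with the assignment; the dev set described above has 25 recordings):

print(X.shape, Y.shape)           # e.g. (num_train, 5511, 101) and (num_train, 1375, 1)
print(X_dev.shape, Y_dev.shape)   # e.g. (25, 5511, 101) and (25, 1375, 1)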

Model

Import the required libraries.

Build the model

from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

[Figure: model architecture — Conv1D, two GRU layers with dropout and batch norm, then a time-distributed sigmoid Dense]
The Conv1D layer takes the 5511-step spectrogram as input and outputs 1375 time steps, matching $T_y = 1375$: the model then predicts, for each of these 1375 steps, whether "activate" has just been said.

Building the model can be broken into 4 steps

  • Implement the convolution with Conv1D(), using 196 filters, filter size = 15, stride = 4
  • Create the first GRU layer with X = GRU(units = 128, return_sequences = True)(X)
    • return_sequences = True means the GRU's hidden state at every time step is passed on to the next layer
  • Create the second GRU layer, similar to the previous one but followed by an extra Dropout layer
  • Create a time-distributed dense layer with X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)
# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """

    X_input = Input(shape = input_shape)

    ### START CODE HERE ###

    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(196, 15, strides=4)(X_input)             # CONV1D
    X = BatchNormalization()(X)                         # Batch normalization
    X = Activation('relu')(X)                           # ReLu activation
    X = Dropout(rate=0.8)(X)                                 # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization

    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    X = Dropout(rate=0.8)(X)                                 # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs = X_input, outputs = X)

    return model  

Create the model and print its summary

model = model(input_shape = (Tx, n_freq)) # create the model
model.summary() # print a summary of the layers and output shapes
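
A quick check that the time dimension works out: with filter size 15, stride 4 and no padding, the Conv1D turns the 5511 input steps into 1375 output steps, so the summary should end with an output shape of (None, 1375, 1).

print((Tx - 15) // 4 + 1)     # floor((5511 - 15) / 4) + 1 = 1375
print(model.output_shape)     # expected: (None, 1375, 1)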

Fit the model

Load a model that has already been trained.

model = load_model('./models/tr_model.h5')

Train the model

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)

Test the model

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)

requirements — conda

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: win-64
_tflow_select=2.2.0=eigen
absl-py=0.13.0=py36haa95532_0
astor=0.8.1=py36haa95532_0
blas=1.0=mkl
ca-certificates=2021.10.26=haa95532_2
certifi=2016.2.28=py36_0
colorama=0.3.9=py36_0
coverage=4.4.1=py36_0
cycler=0.10.0=py36_0
cython=0.26=py36_0
decorator=4.1.2=py36_0
freetype=2.10.4=hd328e21_0
gast=0.2.2=py36_0
google-pasta=0.2.0=pyhd3eb1b0_0
grpcio=1.36.1=py36hc60d5dd_1
h5py=2.10.0=py36h5e291fa_0
hdf5=1.10.4=h7ebc959_0
icc_rt=2019.0.0=h0cc432a_1
icu=58.2=ha925a31_3
importlib-metadata=4.8.1=py36haa95532_0
intel-openmp=2021.4.0=haa95532_3556
ipykernel=4.6.1=py36_0
ipython=6.1.0=py36_0
ipython_genutils=0.2.0=py36_0
jedi=0.10.2=py36_2
jpeg=9d=h2bbff1b_0
jupyter_client=5.1.0=py36_0
jupyter_core=4.3.0=py36_0
keras=2.2.5=py36_1
keras-applications=1.0.8=py_1
keras-preprocessing=1.1.2=pyhd3eb1b0_0
kiwisolver=1.3.1=py36hd77b12b_0
libgpuarray=0.7.6=hfa6e2cd_0
libpng=1.6.37=h2a8f88b_0
libprotobuf=3.17.2=h23ce68f_1
libpython=2.0=py36_0
m2w64-binutils=2.25.1=5
m2w64-bzip2=1.0.6=6
m2w64-crt-git=5.0.0.4636.2595836=2
m2w64-gcc=5.3.0=6
m2w64-gcc-ada=5.3.0=6
m2w64-gcc-fortran=5.3.0=6
m2w64-gcc-libgfortran=5.3.0=6
m2w64-gcc-libs=5.3.0=7
m2w64-gcc-libs-core=5.3.0=7
m2w64-gcc-objc=5.3.0=6
m2w64-gmp=6.1.0=2
m2w64-headers-git=5.0.0.4636.c0ad18a=2
m2w64-isl=0.16.1=2
m2w64-libiconv=1.14=6
m2w64-libmangle-git=5.0.0.4509.2e5a9a2=2
m2w64-libwinpthread-git=5.0.0.4634.697f757=2
m2w64-make=4.1.2351.a80a8b8=2
m2w64-mpc=1.0.3=3
m2w64-mpfr=3.1.4=4
m2w64-pkg-config=0.29.1=2
m2w64-toolchain=5.3.0=7
m2w64-tools-git=5.0.0.4592.90b8472=2
m2w64-windows-default-manifest=6.4=3
m2w64-winpthreads-git=5.0.0.4634.697f757=2
m2w64-zlib=1.2.8=10
mako=1.0.6=py36_0
markdown=3.3.4=py36haa95532_0
markupsafe=1.0=py36_0
matplotlib=3.2.2=1
matplotlib-base=3.2.2=py36hfa737b6_1
mkl=2020.2=256
mkl-service=2.3.0=py36h196d8e1_0
mkl_fft=1.3.0=py36h46781fe_0
mkl_random=1.1.1=py36h47e9c7a_0
msys2-conda-epoch=20160418=1
numpy=1.19.2=py36hadc3359_0
numpy-base=1.19.2=py36ha3acd2a_0
openssl=1.1.1l=h2bbff1b_0
opt_einsum=3.3.0=pyhd3eb1b0_1
path.py=10.3.1=py36_0
pickleshare=0.7.4=py36_0
pip=9.0.1=py36_1
prompt_toolkit=1.0.15=py36_0
protobuf=3.17.2=py36hd77b12b_0
pydub=0.25.1=pyhd8ed1ab_0
pygments=2.2.0=py36_0
pygpu=0.7.6=py36h2a96729_0
pyparsing=2.2.0=py36_0
pyqt=5.9.2=py36h6538335_2
pyreadline=2.1=py36_0
python=3.6.13=h3758d61_0
python-dateutil=2.6.1=py36_0
python_abi=3.6=2_cp36m
pyyaml=3.12=py36_0
pyzmq=16.0.2=py36_0
qt=5.9.7=vc14h73c81de_0
scipy=1.5.2=py36h9439919_0
setuptools=36.4.0=py36_1
simplegeneric=0.8.1=py36_1
sip=4.19.8=py36h6538335_0
six=1.16.0=pyhd3eb1b0_0
sqlite=3.36.0=h2bbff1b_0
tensorboard=1.15.0=pyhb230dea_0
tensorflow=1.15.0=eigen_py36h932cce6_0
tensorflow-base=1.15.0=eigen_py36h07d2309_0
tensorflow-estimator=2.6.0=pyh7b7c402_0
termcolor=1.1.0=py36_0
theano=0.9.0=py36_0
tornado=4.5.2=py36_0
traitlets=4.3.2=py36_0
typing_extensions=3.10.0.2=pyh06a4308_0
vc=14.2=h21ff451_1
vs2015_runtime=14.27.29016=h5e58377_2
wcwidth=0.1.7=py36_0
webencodings=0.5.1=py36_1
werkzeug=0.16.1=py_0
wheel=0.29.0=py36_0
wincertstore=0.2=py36_0
wrapt=1.12.1=py36he774522_1
zipp=3.6.0=pyhd3eb1b0_0
zlib=1.2.11=h62dcd97_4