Paper Notes: Attention-based End-to-End Models for Small-Footprint Keyword Spotting

"Attention-based End-to-End Models for Small-Footprint Keyword Spotting"

Xiaomi team + Northwestern Polytechnical University, Interspeech 2018

1. Abstract

we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS)

The authors propose an attention-based end-to-end model for the small-footprint keyword spotting (KWS) task.

Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high level representation using RNNs. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection

1. The model consists of an encoder and an attention mechanism;

2. the encoder is an RNN that transforms the input into a high-level representation;

3. the attention mechanism weights the encoder features and generates a fixed-length vector;

4. finally, a linear layer and a softmax map this vector to a posterior probability used for keyword detection.

 We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN.

They also evaluate three encoder architectures: LSTM, GRU, and CRNN.

Experiments on real-world wake-up data show that our approach outperforms the recent Deep KWS approach by a large margin and the best performance is achieved by CRNN

The evaluation is done on real-world wake-up-word data, which is nice.

Key result: To be more specific, with ~84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

An ~84K-parameter model is very practical; at 1.0 false alarm per hour, the false rejection rate is 1.02%.

Introduction

The first paragraph defines the terms: keyword spotting (KWS) = spoken term detection (STD) = wake-up word detection.

It then lays out why the KWS task matters and what makes it hard.

Why it matters: Specifically, as a typical application of KWS, wake-up word detection has become an indispensable function on various devices, in order to enable users to have a fully hands-free experience.

The challenge: A practical on-device KWS module must minimize the false rejection rate at a low false alarm rate to make it easy to use, while keeping the memory footprint, latency, and computational cost as small as possible.

The second paragraph splits KWS algorithms into two classes: (1) large vocabulary continuous speech recognition (LVCSR) based systems, e.g., using end-to-end acoustic models; (2) keyword/filler hidden Markov model (HMM) approaches.

HMM approach: (1) both keyword and filler (non-keyword) speech segments must be modeled, and Viterbi decoding is used at runtime; (2) Gaussian mixture models (GMMs) originally served as the acoustic model and were later replaced by DNNs; (3) the HMM sequence modeling can also be replaced by an RNN+CTC structure or an attention-based structure.

In this paper,...,By saying end-to-end, we mean that: (1) a simple model that directly outputs keyword detection; (2) no complicated searching involved; (3) no alignments needed beforehand to train the model.

(1) The model directly outputs the keyword detection result; (2) no search (decoding) process is involved; (3) no frame-level alignment labels are needed to train the model.

It is intuitive to use attention mechanism in KWS: humans are able to focus on a certain region of an audio stream with “high resolution” (e.g., the listener’s name) while perceiving the surrounding audio in “low resolution”, and then adjusting the focal point over time.

This passage highlights the intuition behind using attention for KWS.

In terms of end-to-end and small-footprint, the closest approach to ours is the one proposed by Kliegl et al. [18], where a convolutional recurrent neural network (CRNN) architecture is used. However, the latency introduced by its long decoding window (T=1.5 secs) makes the system difficult to use in real applications.

The authors note that the CRNN approach of Kliegl et al. [18] is the closest to theirs, but its 1.5 s decoding window adds too much latency to be practical.

2. Attention-based KWS

2.1. End-to-end architecture

Figure 1 shows the overall architecture: (1) the encoder is an RNN with T frames in and T feature vectors out; (2) a single attention layer re-weights these features into one fixed-length vector; (3) a linear layer and a softmax then produce the probability score. A minimal sketch of this pipeline follows.
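To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline. The layer sizes, class names, and the two-class output are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    """Encoder RNN -> attention pooling -> linear layer + softmax score."""
    def __init__(self, feat_dim=40, hidden=128, attention=None):
        super().__init__()
        # (1) RNN encoder: T frames in, T feature vectors out
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # (2) attention pooling over time; defaults to plain averaging,
        #     the variants from Sec. 2.2 can be plugged in here
        self.attention = attention or (lambda h: h.mean(dim=1))
        # (3) linear layer with two outputs: keyword vs. non-keyword
        self.fc = nn.Linear(hidden, 2)

    def forward(self, x):                          # x: (batch, T, feat_dim)
        h, _ = self.encoder(x)                     # h: (batch, T, hidden)
        c = self.attention(h)                      # c: (batch, hidden)
        return torch.softmax(self.fc(c), dim=-1)   # probability scores
```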

2.2. Attention mechanism

The authors try two attention mechanisms: (1) average attention and (2) soft attention.

Variant (1) introduces no parameters and simply averages the features over all frames; variant (2) is the usual learned soft attention that re-weights frames. A sketch of both follows.
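Here is a sketch of the two variants, assuming the common scoring function e_t = v^T tanh(W h_t + b) for soft attention (which matches the standard formulation this line of work builds on; treat the exact parameterization as an assumption):

```python
import torch
import torch.nn as nn

class AverageAttention(nn.Module):
    """Variant (1): no parameters, uniform 1/T weight per frame."""
    def forward(self, h):                          # h: (batch, T, hidden)
        return h.mean(dim=1)

class SoftAttention(nn.Module):
    """Variant (2): e_t = v^T tanh(W h_t + b), alpha = softmax(e)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)      # W, b
        self.v = nn.Linear(hidden, 1, bias=False)  # v

    def forward(self, h):
        e = self.v(torch.tanh(self.proj(h)))       # per-frame scores (batch, T, 1)
        alpha = torch.softmax(e, dim=1)            # weights sum to 1 over time
        return (alpha * h).sum(dim=1)              # re-weighted fixed-length vector
```

Usage with the earlier sketch: `model = AttentionKWS(attention=SoftAttention(128))`.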

2.3. Decoding

A schematic of the streaming decoding process (Figure 2):

Similar to the Deep KWS system, our system is triggered when the p(y = 1) exceeds a preset threshold. During decoding, in Fig. 2, the input is a sliding window of speech features, which has a preset length and contains the entire keyword. Meanwhile, a frame shift is employed....For a sliding window, we only need to feed one frame into the network for computation and the rest frames have been already computed in the previous sliding window

The system is triggered when p(y = 1) exceeds a preset threshold;

a window of preset length, chosen to contain the entire keyword, slides over the speech features, and keyword detection runs on every window;

when the window moves right, only the newly added frame needs fresh computation, since the remaining frames were already processed in the previous window. A sketch of this decoding loop follows.
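A naive version of the decoding loop might look as follows (the 0.5 threshold is a placeholder; also note this naive loop recomputes each window from scratch, whereas the paper reuses the previous window's RNN computations so each shift only costs one new frame):

```python
import torch

@torch.no_grad()
def stream_detect(model, feats, win=100, shift=1, threshold=0.5):
    """Slide a fixed-length window over the feature stream and fire
    as soon as p(y = 1) for a window exceeds the threshold."""
    n_frames = feats.size(0)                       # feats: (T_total, feat_dim)
    for start in range(0, n_frames - win + 1, shift):
        window = feats[start:start + win].unsqueeze(0)   # (1, win, feat_dim)
        p_keyword = model(window)[0, 1].item()           # p(y = 1)
        if p_keyword > threshold:
            return start                           # trigger at this window
    return None                                    # no keyword detected
```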

3. Experiments

3.1. Datasets

The dataset preparation deserves a close look:

We evaluated the proposed approach using real-world wake-up data collected from Mi AI Speaker(https://www.mi.com/aispeaker/). The wake-up word is a four-syllable Mandarin Chinese term (“xiao-ai-tong-xue”). We collected ~188.9K positive examples (~99.8h) and ~1007.4K negative examples (~1581.8h) as the training set. The held out validation set has ~9.9K positive examples and ~53.0K negative examples. The test data set has ~28.8K positive examples (~15.2h) and ~32.8K negative examples (~37h).

The dataset is real-world wake-up audio collected from the Mi AI Speaker; the wake-up word is the four-syllable Mandarin term "xiao-ai-tong-xue".

(1) Training set: ~188.9K positive examples (~99.8 h) and ~1007.4K negative examples (~1581.8 h).

(2) Validation set: ~9.9K positive and ~53.0K negative examples.

(3) Test set: ~28.8K positive examples (~15.2 h) and ~32.8K negative examples (~37 h).

Feature extraction: Each audio frame was computed based on a 40-channel Mel-filterbank with 25ms windowing and 10ms frame shift. Then the filterbank feature was converted to per-channel energy normalized (PCEN) Mel-spectrograms.

The window (frame) length is 25 ms with a 10 ms frame shift; each frame is a 40-dimensional Mel filterbank vector, normalized with per-channel energy normalization (PCEN). A sketch of this front end follows.
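A librosa-based sketch of the front end, assuming 16 kHz audio and librosa's default PCEN settings (the paper does not give its exact PCEN parameters):

```python
import librosa

def pcen_features(wav_path):
    """40-channel Mel filterbank, 25 ms window / 10 ms shift, then PCEN."""
    y, sr = librosa.load(wav_path, sr=16000)       # assumed sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),                     # 25 ms window = 400 samples
        hop_length=int(0.010 * sr),                # 10 ms shift  = 160 samples
        n_mels=40)                                 # 40 Mel channels
    # scale up as librosa's docs suggest when applying PCEN to float input
    return librosa.pcen(mel * (2 ** 31)).T         # (T, 40): one row per frame
```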

3.2. Baseline

We reimplemented the Deep KWS system [9] as the baseline, in which the network predicts the posteriors for the four Chinese syllables in the wake-up word and a filler. The "filler" here means any voice that does not contain the keyword.

The baseline is a re-implementation of the Deep KWS system from a 2014 paper [9].

Baseline architecture details: The feed-forward DNN model had 3 hidden layers and 64 hidden nodes per layer with rectified linear unit (ReLU) non-linearity. An input window with 15 left frames and 5 right frames was used. The LSTM and GRU models were built with 2 hidden layers and 64 hidden nodes per layer. For the GRU KWS model, the final GRU layer was followed by a fully connected layer with ReLU non-linearity. There were no stacked frames in the input for the LSTM and GRU models. The smoothing window for Deep KWS was set to 20 frames.

(1) The feed-forward DNN has 3 hidden layers with 64 nodes (units) per layer and ReLU activations;

(2) the input window spans 15 frames to the left and 5 frames to the right of the current frame;

(3) the LSTM and GRU models have 2 hidden layers with 64 nodes per layer, with no stacked frames at the input;

(4) for the GRU model only, the final GRU layer is followed by a fully connected layer with ReLU;

(5) Deep KWS posteriors are smoothed over a 20-frame window, as sketched below.
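Point (5) refers to the posterior smoothing rule from the Deep KWS paper: each frame's posterior is averaged over the previous w_smooth frames. A minimal numpy version:

```python
import numpy as np

def smooth_posteriors(post, w_smooth=20):
    """Average posteriors over a trailing window of w_smooth frames,
    as in Deep KWS, to suppress noisy per-frame estimates.
    post: (T, n_labels) array of per-frame posteriors."""
    smoothed = np.zeros_like(post)
    for j in range(post.shape[0]):
        h = max(0, j - w_smooth + 1)               # window start index
        smoothed[j] = post[h:j + 1].mean(axis=0)
    return smoothed
```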

 We also trained a TDNN-based acoustic model using ~3000 hours of speech data to perform frame-level alignment before KWS model training.

They also trained a TDNN acoustic model on ~3000 hours of speech to produce the frame-level alignments needed to train the Deep KWS baseline.

3.3. Experimental Setup

We used ADAM [23] as the optimization method while we decayed the learning rate from 1e-3 to 1e-4 after it converged. Gradient norm clipping to 1 was applied, together with L2 weight decay 1e-5. The positive training sample has a frame length of T = 1.9 seconds which ensures the entire wake-up word is included. Accordingly, in the attention models, the input window was set to 189 frames to cover the length of the wake-up word. We randomly selected 189 contiguous frames from the negative example set to train the attention models. At runtime, the sliding window was set to 100 frames and frame shift was set to 1.

(1) Adam optimizer, with the learning rate decayed from 1e-3 to 1e-4 after convergence;

(2) gradient norm clipping with max_norm = 1;

(3) L2 regularization with weight decay = 1e-5;

(4) positive samples are T = 1.9 s long, i.e., a 189-frame input window (25 ms for the first frame + 188 shifts x 10 ms = 1905 ms), enough to cover the entire wake-up word;

(5) presumably the 100-frame runtime window with a 1-frame shift keeps the per-step cost low, and once a window's score exceeds the threshold, the window is extended to 189 frames to confirm the keyword. A sketch of the training setup follows.
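A minimal sketch wiring the stated hyperparameters together, reusing the hypothetical AttentionKWS / SoftAttention classes sketched above (here the optimizer's weight_decay stands in for the L2 term, and the learning-rate decay is triggered manually):

```python
import torch
import torch.nn.functional as F

model = AttentionKWS(attention=SoftAttention(128))           # from Sec. 2 sketches
opt = torch.optim.Adam(model.parameters(), lr=1e-3,          # Adam, lr 1e-3
                       weight_decay=1e-5)                    # L2 weight decay

def train_step(x, y):
    """One update on a batch: x is (batch, 189, 40) windows, y in {0, 1}."""
    opt.zero_grad()
    probs = model(x)                                         # softmax outputs
    loss = F.nll_loss(torch.log(probs + 1e-8), y)            # cross-entropy
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad norm clipping
    opt.step()
    return loss.item()

# once the loss has converged, decay the learning rate 1e-3 -> 1e-4:
# for g in opt.param_groups:
#     g["lr"] = 1e-4
```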

3.4. Results

 

From Table 1 and Figure 3, the authors conclude: (1) the proposed attention models have small parameter counts; (2) they outperform Deep KWS by a clear margin; (3) GRU encoders beat LSTM encoders; (4) soft attention beats average attention. The best model in this round is GRU + soft attention, at 53.4K parameters and 1.93% FRR.

The last part compares hyperparameter choices (it feels a bit like padding, and Figures 4 to 6 are hard on the eyes). Two takeaways: (1) comparing Table 2 with Figures 4 and 5, a single-layer GRU with 128 hidden nodes works best; (2) adding a CRNN layer to extract invariant features raises the parameter count to 84.1K and improves performance to FRR = 1.02%.
