Paper notes: Attention-based End-to-End Models for Small-Footprint Keyword Spotting

We propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS).


Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high level representation using RNNs. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection.





We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN.


Experiments on real-world wake-up data show that our approach outperforms the recent Deep KWS approach by a large margin, and the best performance is achieved by CRNN.


Results: To be more specific, with ~84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.
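To make the two metrics concrete, here is a minimal sketch of how FRR and FA per hour could be computed at one threshold. The function name, the made-up scores, and the rule that each negative example counts as at most one false alarm are assumptions for illustration; the notes do not say how the paper counts false alarms.

```python
import numpy as np

def frr_and_fa_per_hour(pos_scores, neg_scores, neg_hours, threshold):
    """Return (false rejection rate, false alarms per hour) at one threshold."""
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    # A positive example is rejected when its score does not exceed the threshold.
    frr = float(np.mean(pos_scores <= threshold))
    # Assumption: each negative example produces at most one false alarm.
    false_alarms = int(np.sum(neg_scores > threshold))
    return frr, false_alarms / neg_hours

# Made-up scores, only to show the arithmetic.
pos = [0.95, 0.80, 0.40, 0.99]
neg = [0.10, 0.70, 0.05, 0.20, 0.30]
frr, fa_per_hour = frr_and_fa_per_hour(pos, neg, neg_hours=2.0, threshold=0.5)
print(frr, fa_per_hour)
```

Sweeping the threshold and reading off the FRR where FA per hour reaches 1.0 would give a number comparable in kind to the one quoted above.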



The first paragraph starts with definitions: keyword spotting (KWS) = spoken term detection (STD) = wake-up word detection.


Motivation: Specifically, as a typical application of KWS, wake-up word detection has become an indispensable function on various devices, in order to enable users to have a fully hands-free experience.

Challenge: A practical on-device KWS module must minimize the false rejection rate at a low false alarm rate to make it easy to use, while keeping the memory footprint, latency and computational cost as small as possible.

In the second paragraph the authors divide KWS algorithms into two categories: the first is large vocabulary continuous speech recognition (LVCSR) based systems, e.g., using end-to-end based acoustic models; the second is the keyword/filler hidden Markov model (HMM) approach.


In this paper, ... By saying end-to-end, we mean that: (1) a simple model that directly outputs keyword detection; (2) no complicated searching involved; (3) no alignments needed beforehand to train the model.

It is intuitive to use attention mechanism in KWS: humans are able to focus on a certain region of an audio stream with “high resolution” (e.g., the listener’s name) while perceiving the surrounding audio in “low resolution”, and then adjusting the focal point over time.


In terms of end-to-end and small-footprint, the closest approach to ours is the one proposed by Kliegl et al. [18], where a convolutional recurrent neural network (CRNN) architecture is used. However, the latency introduced by its long decoding window (T=1.5 secs) makes the system difficult to use in real applications.


2. Attention-based KWS

2.1. End-to-end architecture


2.2. Attention mechanism

The authors tried two kinds of attention: (1) average attention; (2) soft attention.
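The two variants can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the notes do not give the formulas, so the additive scoring function e_t = v^T tanh(W h_t + b), the two-class output layer, and all names and shapes (h, W, b, v, U) are assumed here, and the weights are random.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def average_attention(h):
    """Uniform weights: every encoder frame contributes equally."""
    T = h.shape[0]
    return np.full(T, 1.0 / T)

def soft_attention(h, W, b, v):
    """Learned weights (assumed additive form): e_t = v^T tanh(W h_t + b)."""
    e = np.tanh(h @ W.T + b) @ v  # one scalar score per frame, shape (T,)
    return softmax(e)

def keyword_score(h, alpha, U, u_bias):
    """Weighted sum of encoder outputs, then linear layer + softmax."""
    c = alpha @ h                  # fixed-length vector, shape (D,)
    return softmax(U @ c + u_bias) # [p(y = 0), p(y = 1)]

rng = np.random.default_rng(0)
T, D = 189, 64                     # frames x encoder output size (illustrative)
h = rng.standard_normal((T, D))    # stand-in for the encoder outputs
W = rng.standard_normal((D, D)) * 0.1
b = np.zeros(D)
v = rng.standard_normal(D) * 0.1
U = rng.standard_normal((2, D)) * 0.1

alpha = soft_attention(h, W, b, v)
p = keyword_score(h, alpha, U, np.zeros(2))
```

Swapping `soft_attention(...)` for `average_attention(h)` changes only the weights; the rest of the pipeline is shared, which is why the two variants are easy to compare.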


2.3. Decoding


Similar to the Deep KWS system, our system is triggered when p(y = 1) exceeds a preset threshold. During decoding, in Fig. 2, the input is a sliding window of speech features, which has a preset length and contains the entire keyword. Meanwhile, a frame shift is employed. ... For a sliding window, we only need to feed one frame into the network for computation, and the remaining frames have already been computed in the previous sliding window.
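The caching idea in this passage can be sketched as below. Everything here is hypothetical scaffolding: `StreamingDetector`, `toy_step` and `toy_score` are invented names, and the toy functions stand in for a real recurrent encoder step and the attention-plus-softmax scorer.

```python
from collections import deque
import numpy as np

class StreamingDetector:
    """Keeps the last `window` encoder outputs so each new frame costs one encoder step."""

    def __init__(self, step_fn, score_fn, init_state, window=100, threshold=0.5):
        self.step_fn = step_fn      # (state, frame) -> new state
        self.score_fn = score_fn    # (window, D) array -> scalar score for p(y = 1)
        self.state = init_state
        self.outputs = deque(maxlen=window)  # the oldest output drops out automatically
        self.threshold = threshold

    def push(self, frame):
        # Only the newest frame goes through the encoder; older outputs are reused.
        self.state = self.step_fn(self.state, frame)
        self.outputs.append(self.state)
        if len(self.outputs) < self.outputs.maxlen:
            return False            # window not full yet
        score = self.score_fn(np.stack(list(self.outputs)))
        return bool(score > self.threshold)

# Toy stand-ins (assumptions): a leaky average as the "encoder", a mean as the "scorer".
def toy_step(state, frame):
    return 0.5 * state + 0.5 * frame

def toy_score(outputs):
    return float(outputs.mean())

detector = StreamingDetector(toy_step, toy_score, init_state=np.zeros(1),
                             window=3, threshold=0.5)
frames = [np.array([x]) for x in (0.0, 0.0, 1.0, 1.0, 1.0, 1.0)]
fires = [detector.push(f) for f in frames]
print(fires)
```

With a frame shift of 1, the attention and output layers are still recomputed over the whole buffer at every step; only the encoder work is saved.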

3. Experiments

3.1. Datasets


We evaluated the proposed approach using real-world wake-up data collected from Mi AI Speaker (https://www.mi.com/aispeaker/). The wake-up word is a four-syllable Mandarin Chinese term (“xiao-ai-tong-xue”). We collected ~188.9K positive examples (~99.8h) and ~1007.4K negative examples (~1581.8h) as the training set. The held-out validation set has ~9.9K positive examples and ~53.0K negative examples. The test data set has ~28.8K positive examples (~15.2h) and ~32.8K negative examples (~37h).





Spectral feature processing: Each audio frame was computed based on a 40-channel Mel-filterbank with 25ms windowing and 10ms frame shift. Then the filterbank feature was converted to per-channel energy normalized (PCEN) Mel-spectrograms.
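For reference, a minimal NumPy sketch of the PCEN step, applied to an already computed Mel filterbank energy matrix. The parameter values (s, alpha, delta, r, eps) and the initialisation of the smoother with the first frame are assumptions; the notes do not state which settings the paper used.

```python
import numpy as np

def pcen(mel_energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization.

    mel_energy: non-negative array of shape (num_frames, num_channels).
    """
    E = np.asarray(mel_energy, dtype=float)
    M = np.empty_like(E)
    M[0] = E[0]  # assumption: start the smoother at the first frame
    for t in range(1, len(E)):
        # First-order IIR smoothing along time, independently per channel.
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # Divide by the smoothed energy (gain control), then compress the range.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

rng = np.random.default_rng(0)
mel = rng.random((189, 40))   # stand-in for 189 frames of 40-channel Mel energies
features = pcen(mel)
```

Because each channel is normalised by its own running average, the output is much less sensitive to overall loudness than a plain log-Mel feature.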

3.2. Baseline

We reimplemented the Deep KWS system [9] as the baseline, in which the network predicts the posteriors for the four Chinese syllables in the wake-up word and a filler. The “filler” here means any speech that does not contain the keyword.


Architecture details given by the authors: The feed-forward DNN model had 3 hidden layers and 64 hidden nodes per layer with rectified linear unit (ReLU) non-linearity. An input window with 15 left frames and 5 right frames was used. The LSTM and GRU models were built with 2 hidden layers and 64 hidden nodes per layer. For the GRU KWS model, the final GRU layer was followed by a fully connected layer with ReLU non-linearity. There were no stacked frames in the input for the LSTM and GRU models. The smoothing window for Deep KWS was set to 20 frames.
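As a sanity check on the size of the feed-forward baseline, its parameter count can be worked out from the numbers in these notes. The sketch assumes that the 21 context frames are simply concatenated, that every layer has a bias, and that the output layer has 5 units (four syllables plus filler).

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

feat_dim = 40            # Mel channels per frame
context = 15 + 1 + 5     # left frames + current frame + right frames
hidden = 64
n_outputs = 5            # 4 syllables + filler

sizes = [feat_dim * context, hidden, hidden, hidden, n_outputs]
total = sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
print(total)  # 62469
```

Most of the parameters sit in the first layer (840 x 64), which is the cost of stacking context frames at the input.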






We also trained a TDNN-based acoustic model using ~3000 hours of speech data to perform frame-level alignment before KWS model training.


3.3. Experimental Setup

We used ADAM [23] as the optimization method while we decayed the learning rate from 1e-3 to 1e-4 after it converged. Gradient norm clipping to 1 was applied, together with L2 weight decay 1e-5. The positive training sample has a frame length of T = 1.9 seconds which ensures the entire wake-up word is included. Accordingly, in the attention models, the input window was set to 189 frames to cover the length of the wake-up word. We randomly selected 189 contiguous frames from the negative example set to train the attention models. At runtime, the sliding window was set to 100 frames and frame shift was set to 1.
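The negative-sample selection can be sketched as a random crop over a feature matrix. The function name and the error handling for short utterances are assumptions; the notes do not say how negatives shorter than 189 frames were treated.

```python
import numpy as np

def random_crop(features, length=189, rng=None):
    """Pick `length` contiguous frames at a random offset.

    features: array of shape (num_frames, num_channels).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = features.shape[0]
    if num_frames < length:
        # Assumption: short utterances are rejected here; padding is another option.
        raise ValueError("utterance is shorter than the crop length")
    start = int(rng.integers(0, num_frames - length + 1))  # upper bound is exclusive
    return features[start:start + length]

rng = np.random.default_rng(0)
# Stand-in negative example: 500 frames, first column holds the frame index.
negative = np.zeros((500, 40))
negative[:, 0] = np.arange(500)
crop = random_crop(negative, rng=rng)
```

Drawing a new offset each epoch lets one long negative recording supply many different training windows.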


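From Table 1 and Figure 3 the authors conclude: (1) the proposed attention models have a small number of parameters; (2) they outperform Deep KWS; (3) GRU performs better than LSTM; (4) soft attention performs better than average attention. In the end, the GRU soft-attention model performs best, with 53.4K parameters and an FRR of 1.93%.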

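I suspect this part is padding, and Figures 4-6 are really hard on the eyes. The authors draw two conclusions: (1) comparing Table 2 with Figures 4 and 5, a single-layer GRU with 128 hidden nodes gives the best result; (2) adding an extra CRNN layer to extract invariant features raises the parameter count to 84.1K and improves performance to FRR = 1.02%.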









