Paper Notes: Attention-based End-to-End Models for Small-Footprint Keyword Spotting

"Attention-based End-to-End Models for Small-Footprint Keyword Spotting"

Xiaomi team + Northwestern Polytechnical University, Interspeech 2018

1. Abstract

we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS)

The authors propose an attention-based end-to-end model for the small-footprint keyword spotting (KWS) task.

Our model consists of an encoder and an attention mechanism. The encoder transforms the input signal into a high level representation using RNNs. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection

1. The model consists of an encoder and an attention mechanism;

2. the encoder is an RNN that transforms the input into a high-level representation;

3. the attention mechanism weights the encoder features and generates a fixed-length vector;

4. finally, a linear layer and a softmax map this vector to a posterior probability used for keyword detection.

 We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN.

They also evaluate three encoder architectures: LSTM, GRU, and CRNN.

Experiments on real-world wake-up data show that our approach outperforms the recent Deep KWS approach by a large margin and the best performance is achieved by CRNN

The evaluation is done on real-world wake-up-word data, which is nice.

Key result: To be more specific, with ~84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

An ~84K-parameter model is very practical; at 1.0 false alarm per hour, the false rejection rate is 1.02%.

Introduction

The first paragraph defines the terms: keyword spotting (KWS) = spoken term detection (STD) = wake-up word detection.

It then lays out why the KWS task matters and what makes it hard.

Why it matters: Specifically, as a typical application of KWS, wake-up word detection has become an indispensable function on various devices, in order to enable users to have a fully hands-free experience.

The challenge: A practical on-device KWS module must minimize the false rejection rate at a low false alarm rate to make it easy to use, while keeping the memory footprint, latency, and computational cost as small as possible.

The second paragraph splits KWS algorithms into two classes: (1) large vocabulary continuous speech recognition (LVCSR) based systems, e.g., using end-to-end acoustic models; (2) keyword/filler hidden Markov model (HMM) approaches.

HMM approach: (1) both keyword and filler (non-keyword) speech segments must be modeled, and Viterbi decoding is used at runtime; (2) Gaussian mixture models (GMMs) originally served as the acoustic model and were later replaced by DNNs; (3) the HMM sequence modeling can also be replaced by an RNN+CTC structure or an attention-based structure.

In this paper,...,By saying end-to-end, we mean that: (1) a simple model that directly outputs keyword detection; (2) no complicated searching involved; (3) no alignments needed beforehand to train the model.

(1) The model directly outputs the keyword detection result; (2) no search (decoding) process is involved; (3) no frame-level alignment labels are needed to train the model.

It is intuitive to use attention mechanism in KWS: humans are able to focus on a certain region of an audio stream with “high resolution” (e.g., the listener’s name) while perceiving the surrounding audio in “low resolution”, and then adjusting the focal point over time.

This passage highlights the intuition behind using attention for KWS.

In terms of end-to-end and small-footprint, the closest approach to ours is the one proposed by Kliegl et al. [18], where a convolutional recurrent neural network (CRNN) architecture is used. However, the latency introduced by its long decoding window (T=1.5 secs) makes the system difficult to use in real applications.

The authors note that the CRNN approach of Kliegl et al. [18] is the closest to theirs, but its 1.5 s decoding window adds too much latency to be practical.

2. Attention-based KWS

2.1. End-to-end architecture

Figure 1 shows the overall architecture: (1) the encoder is an RNN with T frames in and T feature vectors out; (2) a single attention layer re-weights these features into one fixed-length vector; (3) a linear layer and a softmax then produce the probability score. A minimal sketch of this pipeline follows.
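To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline. The layer sizes, class names, and the two-class output are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    """Encoder RNN -> attention pooling -> linear layer + softmax score."""
    def __init__(self, feat_dim=40, hidden=128, attention=None):
        super().__init__()
        # (1) RNN encoder: T frames in, T feature vectors out
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # (2) attention pooling over time; defaults to plain averaging,
        #     the variants from Sec. 2.2 can be plugged in here
        self.attention = attention or (lambda h: h.mean(dim=1))
        # (3) linear layer with two outputs: keyword vs. non-keyword
        self.fc = nn.Linear(hidden, 2)

    def forward(self, x):                          # x: (batch, T, feat_dim)
        h, _ = self.encoder(x)                     # h: (batch, T, hidden)
        c = self.attention(h)                      # c: (batch, hidden)
        return torch.softmax(self.fc(c), dim=-1)   # probability scores
```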

2.2. Attention mechanism

The authors try two attention mechanisms: (1) average attention and (2) soft attention.

Variant (1) introduces no parameters and simply averages the features over all frames; variant (2) is the usual learned soft attention that re-weights frames. A sketch of both follows.
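Here is a sketch of the two variants, assuming the common scoring function e_t = v^T tanh(W h_t + b) for soft attention (which matches the standard formulation this line of work builds on; treat the exact parameterization as an assumption):

```python
import torch
import torch.nn as nn

class AverageAttention(nn.Module):
    """Variant (1): no parameters, uniform 1/T weight per frame."""
    def forward(self, h):                          # h: (batch, T, hidden)
        return h.mean(dim=1)

class SoftAttention(nn.Module):
    """Variant (2): e_t = v^T tanh(W h_t + b), alpha = softmax(e)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)      # W, b
        self.v = nn.Linear(hidden, 1, bias=False)  # v

    def forward(self, h):
        e = self.v(torch.tanh(self.proj(h)))       # per-frame scores (batch, T, 1)
        alpha = torch.softmax(e, dim=1)            # weights sum to 1 over time
        return (alpha * h).sum(dim=1)              # re-weighted fixed-length vector
```

Usage with the earlier sketch: `model = AttentionKWS(attention=SoftAttention(128))`.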

2.3. Decoding

A schematic of the streaming decoding process (Figure 2):

Similar to the Deep KWS system, our system is triggered when the p(y = 1) exceeds a preset threshold. During decoding, in Fig. 2, the input is a sliding window of speech features, which has a preset length and contains the entire keyword. Meanwhile, a frame shift is employed....For a sliding window, we only need to feed one frame into the network for computation and the rest frames have been already computed in the previous sliding window

The system is triggered when p(y = 1) exceeds a preset threshold;

a window of preset length, chosen to contain the entire keyword, slides over the speech features, and keyword detection runs on every window;

when the window moves right, only the newly added frame needs fresh computation, since the remaining frames were already processed in the previous window. A sketch of this decoding loop follows.
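A naive version of the decoding loop might look as follows (the 0.5 threshold is a placeholder; also note this naive loop recomputes each window from scratch, whereas the paper reuses the previous window's RNN computations so each shift only costs one new frame):

```python
import torch

@torch.no_grad()
def stream_detect(model, feats, win=100, shift=1, threshold=0.5):
    """Slide a fixed-length window over the feature stream and fire
    as soon as p(y = 1) for a window exceeds the threshold."""
    n_frames = feats.size(0)                       # feats: (T_total, feat_dim)
    for start in range(0, n_frames - win + 1, shift):
        window = feats[start:start + win].unsqueeze(0)   # (1, win, feat_dim)
        p_keyword = model(window)[0, 1].item()           # p(y = 1)
        if p_keyword > threshold:
            return start                           # trigger at this window
    return None                                    # no keyword detected
```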

3. Experiments

3.1. Datasets

The dataset preparation deserves a close look:

We evaluated the proposed approach using real-world wake-up data collected from Mi AI Speaker(https://www.mi.com/aispeaker/). The wake-up word is a four-syllable Mandarin Chinese term (“xiao-ai-tong-xue”). We collected ~188.9K positive examples (~99.8h) and ~1007.4K negative examples (~1581.8h) as the training set. The held out validation set has ~9.9K positive examples and ~53.0K negative examples. The test data set has ~28.8K positive examples (~15.2h) and ~32.8K negative examples (~37h).

The dataset is real-world wake-up audio collected from the Mi AI Speaker; the wake-up word is the four-syllable Mandarin term "xiao-ai-tong-xue".

(1) Training set: ~188.9K positive examples (~99.8 h) and ~1007.4K negative examples (~1581.8 h).

(2) Validation set: ~9.9K positive and ~53.0K negative examples.

(3) Test set: ~28.8K positive examples (~15.2 h) and ~32.8K negative examples (~37 h).

Feature extraction: Each audio frame was computed based on a 40-channel Mel-filterbank with 25ms windowing and 10ms frame shift. Then the filterbank feature was converted to per-channel energy normalized (PCEN) Mel-spectrograms.

The window (frame) length is 25 ms with a 10 ms frame shift; each frame is a 40-dimensional Mel filterbank vector, normalized with per-channel energy normalization (PCEN). A sketch of this front end follows.
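A librosa-based sketch of the front end, assuming 16 kHz audio and librosa's default PCEN settings (the paper does not give its exact PCEN parameters):

```python
import librosa

def pcen_features(wav_path):
    """40-channel Mel filterbank, 25 ms window / 10 ms shift, then PCEN."""
    y, sr = librosa.load(wav_path, sr=16000)       # assumed sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),                     # 25 ms window = 400 samples
        hop_length=int(0.010 * sr),                # 10 ms shift  = 160 samples
        n_mels=40)                                 # 40 Mel channels
    # scale up as librosa's docs suggest when applying PCEN to float input
    return librosa.pcen(mel * (2 ** 31)).T         # (T, 40): one row per frame
```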

3.2. Baseline

We reimplemented the Deep KWS system [9] as the baseline, in which the network predicts the posteriors for the four Chinese syllables in the wake-up word and a filler. The "filler" here means any voice that does not contain the keyword.

The baseline is a re-implementation of the Deep KWS system from a 2014 paper [9].

Baseline architecture details: The feed-forward DNN model had 3 hidden layers and 64 hidden nodes per layer with rectified linear unit (ReLU) non-linearity. An input window with 15 left frames and 5 right frames was used. The LSTM and GRU models were built with 2 hidden layers and 64 hidden nodes per layer. For the GRU KWS model, the final GRU layer was followed by a fully connected layer with ReLU non-linearity. There were no stacked frames in the input for the LSTM and GRU models. The smoothing window for Deep KWS was set to 20 frames.

(1) The feed-forward DNN has 3 hidden layers with 64 nodes (units) per layer and ReLU activations;

(2) the input window spans 15 frames to the left and 5 frames to the right of the current frame;

(3) the LSTM and GRU models have 2 hidden layers with 64 nodes per layer, with no stacked frames at the input;

(4) for the GRU model only, the final GRU layer is followed by a fully connected layer with ReLU;

(5) Deep KWS posteriors are smoothed over a 20-frame window, as sketched below.
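Point (5) refers to the posterior smoothing rule from the Deep KWS paper: each frame's posterior is averaged over the previous w_smooth frames. A minimal numpy version:

```python
import numpy as np

def smooth_posteriors(post, w_smooth=20):
    """Average posteriors over a trailing window of w_smooth frames,
    as in Deep KWS, to suppress noisy per-frame estimates.
    post: (T, n_labels) array of per-frame posteriors."""
    smoothed = np.zeros_like(post)
    for j in range(post.shape[0]):
        h = max(0, j - w_smooth + 1)               # window start index
        smoothed[j] = post[h:j + 1].mean(axis=0)
    return smoothed
```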

 We also trained a TDNN-based acoustic model using ~3000 hours of speech data to perform frame-level alignment before KWS model training.

They also trained a TDNN acoustic model on ~3000 hours of speech to produce the frame-level alignments needed to train the Deep KWS baseline.

3.3. Experimental Setup

We used ADAM [23] as the optimization method while we decayed the learning rate from 1e-3 to 1e-4 after it converged. Gradient norm clipping to 1 was applied, together with L2 weight decay 1e-5. The positive training sample has a frame length of T = 1.9 seconds which ensures the entire wake-up word is included. Accordingly, in the attention models, the input window was set to 189 frames to cover the length of the wake-up word. We randomly selected 189 contiguous frames from the negative example set to train the attention models. At runtime, the sliding window was set to 100 frames and frame shift was set to 1.

(1) Adam optimizer, with the learning rate decayed from 1e-3 to 1e-4 after convergence;

(2) gradient norm clipping with max_norm = 1;

(3) L2 regularization with weight decay = 1e-5;

(4) positive samples are T = 1.9 s long, i.e., a 189-frame input window (25 ms for the first frame + 188 shifts x 10 ms = 1905 ms), enough to cover the entire wake-up word;

(5) presumably the 100-frame runtime window with a 1-frame shift keeps the per-step cost low, and once a window's score exceeds the threshold, the window is extended to 189 frames to confirm the keyword. A sketch of the training setup follows.
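A minimal sketch wiring the stated hyperparameters together, reusing the hypothetical AttentionKWS / SoftAttention classes sketched above (here the optimizer's weight_decay stands in for the L2 term, and the learning-rate decay is triggered manually):

```python
import torch
import torch.nn.functional as F

model = AttentionKWS(attention=SoftAttention(128))           # from Sec. 2 sketches
opt = torch.optim.Adam(model.parameters(), lr=1e-3,          # Adam, lr 1e-3
                       weight_decay=1e-5)                    # L2 weight decay

def train_step(x, y):
    """One update on a batch: x is (batch, 189, 40) windows, y in {0, 1}."""
    opt.zero_grad()
    probs = model(x)                                         # softmax outputs
    loss = F.nll_loss(torch.log(probs + 1e-8), y)            # cross-entropy
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad norm clipping
    opt.step()
    return loss.item()

# once the loss has converged, decay the learning rate 1e-3 -> 1e-4:
# for g in opt.param_groups:
#     g["lr"] = 1e-4
```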

3.4. Results

 

From Table 1 and Figure 3, the authors conclude: (1) the proposed attention models have small parameter counts; (2) they outperform Deep KWS by a clear margin; (3) GRU encoders beat LSTM encoders; (4) soft attention beats average attention. The best model in this round is GRU + soft attention, at 53.4K parameters and 1.93% FRR.

The last part compares hyperparameter choices (it feels a bit like padding, and Figures 4 to 6 are hard on the eyes). Two takeaways: (1) comparing Table 2 with Figures 4 and 5, a single-layer GRU with 128 hidden nodes works best; (2) adding a CRNN layer to extract invariant features raises the parameter count to 84.1K and improves performance to FRR = 1.02%.
