DNN keyword spotting - summary
Paper link
https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/42537.pdf
1. Problem description
Targets the keyword spotting (KWS) problem in speech recognition.
1.1 Previous approaches
- Large vocabulary continuous speech recognition (LVCSR) -> rich lattices [2,3,4]
  - Offline processing
  - Indexing and search for keywords
  - Often used to search large audio databases
- Keyword/Filler HMM [5,6,7,8,9]
  - Online Viterbi decoding (computationally expensive)
  - One HMM is trained for each keyword, and another HMM serves as a filler model (for non-keyword segments)
- Large-margin formulation [10,11]
- RNN [12,13]
  - Needs a longer time span to identify keywords, leading to long latency
2. Proposed model
2.1 Deep KWS
- Pros:
  - Models entire keywords as well as sub-word units
  - No need for a sequence search algorithm (decoding)
  - Shorter run time and latency
  - Smaller memory footprint
2.1.1 Components
2.1.1.1 Feature extraction module
The same as in the baseline HMM system.
- Rate: one feature vector every 10 ms
  - Computed every 10 ms over a 25 ms window
- Procedure:
  - Use an RNN-based voice activity detector [14] to identify speech regions
  - Generate log-filterbank features, stacking sufficient left and right context frames (see the sketch after this list)
    - Deep KWS uses 10 future frames and 30 past frames
      - Why asymmetric? Using fewer future frames reduces latency
    - The HMM baseline uses 5 future frames and 10 past frames
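A minimal sketch of the context-stacking step described above, assuming per-frame log-filterbank features have already been computed at a 10 ms frame rate. The function name `stack_context`, the 40-dimensional dummy features, and the NumPy edge-padding at utterance boundaries are illustrative assumptions, not details from the paper:

```python
import numpy as np

def stack_context(frames, left=30, right=10):
    """Stack each frame with `left` past and `right` future frames.

    frames: (T, D) array of per-frame log-filterbank features,
            one row every 10 ms (25 ms analysis window).
    Returns: (T, (left + 1 + right) * D) array, one stacked input
             vector per original frame.
    """
    T, D = frames.shape
    # Repeat edge frames so every frame has a full context window.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    stacked = np.empty((T, (left + 1 + right) * D), dtype=frames.dtype)
    for t in range(T):
        # Window covers original frames [t - left, t + right].
        stacked[t] = padded[t : t + left + 1 + right].reshape(-1)
    return stacked

# Deep KWS: 30 past + 10 future frames; the HMM baseline would use left=10, right=5.
feats = np.random.randn(200, 40).astype(np.float32)  # dummy 40-dim log-fbank
x = stack_context(feats, left=30, right=10)
print(x.shape)  # (200, 1640): 41 stacked frames x 40 dims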
2.1.1.2 DNN
Predicts posterior probabilities of the output labels (keywords/sub-word units and filler).
- Structure:
  - (FC → ReLU)* → Softmax
- Labeling:
  - Labels represent entire words or sub-word units
    - Computationally efficient
    - Simpler posterior handling
- Output: trained on pairs $\{x_j, i_j\}_j$, where $x_j$ denotes the feature vector of the $j$-th frame and $i_j$ its label index (see below)
- Training:
  - Learning-rate decay: exponential
  - Maximize the cross-entropy training criterion (a minimal training sketch follows below):
    $$F(\theta) = \sum_j \log p_{i_j j}$$
    where $p_{i_j j}$ is the DNN posterior of label $i_j$ for the $j$-th frame $x_j$
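A minimal PyTorch sketch of the network and training criterion above: stacked fully-connected ReLU layers with a softmax output, trained by cross-entropy. The input dimension 1640 matches the 41 stacked 40-dim frames from the feature sketch; the hidden size, layer count, label count, SGD optimizer, and decay rate are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DeepKWSNet(nn.Module):
    """(FC -> ReLU)* -> Softmax over {filler, keyword/sub-word labels}."""
    def __init__(self, in_dim=1640, hidden=128, n_layers=3, n_labels=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, n_labels))  # softmax is folded into the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # unnormalized logits per frame

model = DeepKWSNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)  # exponential LR decay
# nn.CrossEntropyLoss returns the mean of -log p_{i_j j}; minimizing it
# maximizes the criterion F(theta) = sum_j log p_{i_j j} above.
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1640)        # a batch of stacked frames x_j
i = torch.randint(0, 4, (32,))   # their labels i_j
opt.zero_grad()
loss_fn(model(x), i).backward()
opt.step()
sched.step()
```

Note the design choice this reflects: because each stacked frame is classified independently, a single forward pass per frame replaces Viterbi decoding, which is what gives the Deep KWS approach its low latency and small footprint.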