DNN keyword spotting - summary
Paper link
https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/42537.pdf
1. Problem description
Targets the keyword spotting (KWS) problem in speech recognition.
1.1 Previous approaches
- Large vocabulary continuous speech recognition (LVCSR) -> rich lattices [2,3,4]
  - Offline processing
  - Indexing and search for keywords
  - Often used to search large audio databases
- Keyword/Filler HMM [5,6,7,8,9]
  - Online Viterbi decoding (computationally expensive)
  - One HMM is trained for each keyword, and another HMM serves as a filler model (for non-keyword segments)
- Large-margin formulation [10,11]
- RNN [12,13]
  - Needs a longer time span to identify keywords, leading to long latency
2. Proposed model
2.1 Deep KWS
- Pros:
  - Models entire keywords as well as sub-word units
  - No need for a sequence search algorithm (decoding)
  - Shorter run time and latency
  - Smaller memory footprint
2.1.1 Components
2.1.1.1 Feature extraction module
The same as in the baseline HMM system.
- Rate: one feature vector every 10 ms
  - Computed every 10 ms over a 25 ms window
- Procedure:
  - Use an RNN-based voice activity detector [14] to identify speech regions
  - Generate log-filterbank features, stacking sufficient left and right context frames (see the sketch after this list)
    - Deep KWS uses 10 future frames and 30 past frames
      - Why asymmetric? Using fewer future frames reduces latency
    - The HMM baseline uses 5 future frames and 10 past frames
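A minimal sketch of the context-stacking step described above, assuming per-frame log-filterbank features have already been computed at a 10 ms frame rate. The function name `stack_context`, the 40-dimensional dummy features, and the NumPy edge-padding at utterance boundaries are illustrative assumptions, not details from the paper:

```python
import numpy as np

def stack_context(frames, left=30, right=10):
    """Stack each frame with `left` past and `right` future frames.

    frames: (T, D) array of per-frame log-filterbank features,
            one row every 10 ms (25 ms analysis window).
    Returns: (T, (left + 1 + right) * D) array, one stacked input
             vector per original frame.
    """
    T, D = frames.shape
    # Repeat edge frames so every frame has a full context window.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    stacked = np.empty((T, (left + 1 + right) * D), dtype=frames.dtype)
    for t in range(T):
        # Window covers original frames [t - left, t + right].
        stacked[t] = padded[t : t + left + 1 + right].reshape(-1)
    return stacked

# Deep KWS: 30 past + 10 future frames; the HMM baseline would use left=10, right=5.
feats = np.random.randn(200, 40).astype(np.float32)  # dummy 40-dim log-fbank
x = stack_context(feats, left=30, right=10)
print(x.shape)  # (200, 1640): 41 stacked frames x 40 dims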
2.1.1.2 DNN
Predicts posterior probabilities of the output labels (keywords/sub-word units and filler).
- Structure:
  - (FC → ReLU)* → Softmax
- Labeling:
  - Labels represent entire words or sub-word units
    - Computationally efficient
    - Simpler posterior handling
- Output: trained on pairs $\{x_j, i_j\}_j$, where $x_j$ denotes the feature vector of the $j$-th frame and $i_j$ its label index (see below)
- Training:
  - Learning-rate decay: exponential
  - Maximize the cross-entropy training criterion (a minimal training sketch follows below):
    $$F(\theta) = \sum_j \log p_{i_j j}$$
    where $p_{i_j j}$ is the DNN posterior of label $i_j$ for the $j$-th frame $x_j$
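A minimal PyTorch sketch of the network and training criterion above: stacked fully-connected ReLU layers with a softmax output, trained by cross-entropy. The input dimension 1640 matches the 41 stacked 40-dim frames from the feature sketch; the hidden size, layer count, label count, SGD optimizer, and decay rate are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DeepKWSNet(nn.Module):
    """(FC -> ReLU)* -> Softmax over {filler, keyword/sub-word labels}."""
    def __init__(self, in_dim=1640, hidden=128, n_layers=3, n_labels=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, n_labels))  # softmax is folded into the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # unnormalized logits per frame

model = DeepKWSNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)  # exponential LR decay
# nn.CrossEntropyLoss returns the mean of -log p_{i_j j}; minimizing it
# maximizes the criterion F(theta) = sum_j log p_{i_j j} above.
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1640)        # a batch of stacked frames x_j
i = torch.randint(0, 4, (32,))   # their labels i_j
opt.zero_grad()
loss_fn(model(x), i).backward()
opt.step()
sched.step()
```

Note the design choice this reflects: because each stacked frame is classified independently, a single forward pass per frame replaces Viterbi decoding, which is what gives the Deep KWS approach its low latency and small footprint.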