The paper opens by attributing the progress of deep learning to the development of GPUs and to the availability of many well-suited algorithms, and then outlines the structure of the paper.
In my view, the most important parts of this paper are its treatment of decoding and of alignment:
(1) On decoding: Once the acoustic and language models are trained, a decoding process can be applied to unknown speech to generate transcripts. Formally, we want to find the most likely sequence of words w* for a set of observations o: w* = argmax_w p(o|w)p(w). Decoding is not limited to words; phone decoding can be performed in the same manner.
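The argmax above can be sketched with a toy example. This is a minimal illustration, not the paper's decoder: the candidate transcripts and their scores are invented, and real systems search over lattices rather than an explicit list, but the scoring rule (acoustic score plus language-model score, in log space) is the same.

```python
# Hypothetical toy scores: log p(o|w) from the acoustic model and
# log p(w) from the language model, for a few candidate transcripts.
acoustic_logp = {"the cat": -12.0, "the cap": -11.5, "a cat": -13.0}
lm_logp = {"the cat": -2.0, "the cap": -5.0, "a cat": -3.5}

def decode(candidates):
    # w* = argmax_w p(o|w) p(w), computed in log space for stability.
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])

best = decode(acoustic_logp.keys())
print(best)  # "the cat": -14.0 beats both alternatives at -16.5
```

Note that although "the cap" has the best acoustic score, the language model pulls the combined score toward "the cat"; this interplay is exactly why the two models are combined at decode time.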
(2) On alignment (from the "Applying DNNs to ASR" section): Human transcriptions consist of words, as opposed to the targets (such as phone-states) needed to train a DNN. Furthermore, training requires a target for each frame (once every 10 ms in our system), while transcriptions typically only delineate word or segment boundaries. For these reasons, an alignment is performed prior to training a DNN. During alignment, words are replaced with sequences of phones, which are further divided into phone-states. The phone-states are then time-aligned with the audio frames. An external dictionary is used to provide possible word-to-phone mappings, and an existing GMM-HMM system is used for selecting the best phone sequence and determining the time boundaries of the phone-states. (This passage is long and takes some background to follow. The purpose of alignment is to map each speech frame to a phone-state. In Kaldi, training usually goes through repeated alignment passes: align, retrain, then re-align the speech frames against the newly generated model, which yields better frame-level labels.)
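The word-to-phone-state expansion described above can be sketched as follows. The lexicon entry, the 3-states-per-phone split, and the uniform "flat start" time allocation are all assumptions for illustration; in a real system the existing GMM-HMM model, not an even split, determines the state boundaries.

```python
# Hypothetical lexicon: word -> phone mapping (the "external dictionary").
lexicon = {"cat": ["k", "ae", "t"]}

def word_to_states(word, states_per_phone=3):
    # Each phone is split into ordered HMM states, e.g. k_0, k_1, k_2.
    return [f"{p}_{s}" for p in lexicon[word] for s in range(states_per_phone)]

def flat_align(states, n_frames):
    # Uniform "flat start" alignment: every state gets an equal share of
    # frames; a trained GMM-HMM would instead pick the best boundaries.
    per_state = n_frames // len(states)
    targets = []
    for st in states:
        targets.extend([st] * per_state)
    targets.extend([states[-1]] * (n_frames - len(targets)))  # pad remainder
    return targets

states = word_to_states("cat")   # 9 phone-states for a 3-phone word
targets = flat_align(states, 30) # one target per 10 ms frame (300 ms of audio)
print(len(targets), targets[:4]) # 30 ['k_0', 'k_0', 'k_0', 'k_1']
```

The output `targets` is exactly the shape of supervision a frame-level DNN needs: one phone-state label per 10 ms frame.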
The paper also explains the DNN input: DNNs were trained using 52-component feature vectors consisting of 12 Perceptual Linear Prediction coefficients, along with the zeroth and first, second, and third order differentials. Feature extraction was performed using HTK. Feature vectors from 13 successive frames were combined into a single DNN input vector of length 676. (My understanding: each frame yields a 52-dimensional PLP-based vector, i.e. 13 static values times 4 once the differentials are included, and 13 consecutive frames are concatenated into the 676-dimensional DNN input. Note that 52 is the per-frame feature dimension, not a minibatch size, and the features are PLP rather than MFCC.)
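The frame-stacking step can be sketched as below. The random features and edge-padding at the utterance boundaries are assumptions for illustration; the dimensions follow the paper (52 features per frame, 13-frame context, 52 × 13 = 676).

```python
import numpy as np

# 52-dim per-frame features: 13 static values (12 PLP coefficients plus
# the zeroth) times 4 (statics + first/second/third differentials).
n_frames, feat_dim, context = 100, 52, 13
feats = np.random.randn(n_frames, feat_dim)  # placeholder features

def stack_frames(feats, context):
    # Pad the edges so every frame has a full context window, then
    # concatenate each window of `context` frames into one long vector.
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + context].reshape(-1)
                     for i in range(len(feats))])

inputs = stack_frames(feats, context)
print(inputs.shape)  # (100, 676): one 676-dim DNN input per frame
```

Each row of `inputs` is one 676-dimensional DNN input vector, matching the length quoted from the paper.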
To sum up: in my view, the most impressive thing about this paper is not how novel the ideas are, but the authors' hardware and engineering skill. They cut nearly 17,000 lines of Kaldi code down to about 800 lines, with better performance to boot.