Abstract
We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%—a 33% relative improvement.
CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relatively when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited.
On four less well-matched transcription tasks, we observe relative error reductions of 22–28%.
Index Terms: speech recognition, deep belief networks, deep neural networks
1. Introduction
Since the early 90’s, artificial neural networks (ANNs) have been used to model the state emission probabilities of HMM speech recognizers [1]. While traditional Gaussian mixture model (GMM)-HMMs model context dependency through tied context-dependent states (e.g. CART-clustered crossword triphones [2]), ANN-HMMs were never used to do so directly. Instead, networks were often factorized, e.g. into a monophone and a context-dependent part [3], or hierarchically decomposed [4]. It has been commonly assumed that hundreds or thousands of triphone states were just too many to be accurately modeled or trained in a neural network. Only recently did Yu et al. discover that doing so is not only feasible but works very well [5].
Context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [5, 6], apply the classical ANN-HMMs of the 90’s to traditional tied-state triphones directly, exploiting Hinton’s deep-belief-network (DBN) pre-training procedure. This was shown to lead to a very promising and possibly disruptive acoustic model, as indicated by a 16% relative recognition-error reduction over discriminatively trained GMM-HMMs on a business-search task [5, 6], which features short query utterances, tens of hours of training data, and hundreds of tied states.
This paper takes this model a step further and serves several purposes. First, we show that the exact same CD-DNN-HMM can be effectively scaled up in terms of training-data size (from 24 hours to over 300), model complexity (from 761 tied triphone states to over 9000), depth (from 5 to 9 hidden layers), and task (from voice queries to speech-to-text transcription). This is demonstrated on a publicly available benchmark, the Switchboard phone-call transcription task (2000 NIST Hub5 and RT03S sets). We should note here that ANNs have been trained on up to 2000 hours of speech before [7], but with far fewer output units (monophones) and fewer hidden layers.
Second, we advance the CD-DNN-HMMs by introducing weight sparseness and the related learning strategy, and demonstrate that this can reduce recognition error or model size.
Third, we present the statistical view of the multi-layer perceptron (MLP) and DBN and provide empirical evidence for understanding which factors contribute most to the accuracy improvements achieved by the CD-DNN-HMMs.
2. The Context-Dependent Deep Neural Network HMM
A deep neural network (DNN) is a conventional multi-layer perceptron (MLP, [8]) with many hidden layers, optionally initialized using the DBN pre-training algorithm. In the following, we recap the DNN from a statistical viewpoint and describe its integration with context-dependent HMMs for speech recognition. For a more detailed description, please refer to [6].
2.1. Multi-Layer Perceptron—A Statistical View
An MLP as used in this paper models the posterior probability $P_{s|o}(s|o)$ of a class $s$ given an observation vector $o$ as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $\ell = 0, \ldots, L-1$, model posterior probabilities of hidden binary vectors $h^\ell$ given input vectors $v^\ell$, while the top layer $L$ models the desired class posterior as

$$P_{h^\ell|v^\ell}(h^\ell|v^\ell) = \prod_j \frac{e^{z_j^\ell(v^\ell)\, h_j^\ell}}{e^{z_j^\ell(v^\ell)\cdot 1} + e^{z_j^\ell(v^\ell)\cdot 0}}, \quad 0 \le \ell < L$$

$$P_{s|o}(s|o) = \frac{e^{z_s^L(v^L)}}{\sum_{s'} e^{z_{s'}^L(v^L)}} = \mathrm{softmax}_s(z^L(v^L))$$

with $z^\ell(v^\ell) = (W^\ell)^T v^\ell + a^\ell$, weight matrices $W^\ell$ and bias vectors $a^\ell$, where $h_j^\ell$ and $z_j^\ell(v^\ell)$ are the $j$-th components of $h^\ell$ and $z^\ell(v^\ell)$, respectively.
The precise modeling of $P_{s|o}(s|o)$ requires integration over all possible values of $h^\ell$ across all layers, which is infeasible. An effective practical trick is to replace the marginalization with the “mean-field approximation” [9]. Given observation $o$, we set $v^0 = o$ and choose the conditional expectation as input to the next layer,

$$v^{\ell+1} = E\{h^\ell \,|\, v^\ell\} = \sigma(z^\ell(v^\ell)), \quad 0 \le \ell < L,$$

where $\sigma_j(z) = 1/(1 + e^{-z_j})$.

Copyright © 2011 ISCA, 28–31 August 2011, Florence, Italy
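The mean-field forward pass described above can be sketched in a few lines of numpy. This is a minimal illustration; variable names and layer sizes are our assumptions, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def dnn_forward(o, weights, biases):
    """Mean-field forward pass: v^0 = o, v^{l+1} = sigmoid(z^l(v^l));
    the top layer applies softmax to yield the class posterior P(s|o)."""
    v = o
    activations = [v]
    for W, a in zip(weights[:-1], biases[:-1]):
        v = sigmoid(W.T @ v + a)     # conditional expectation E{h^l | v^l}
        activations.append(v)
    z_top = weights[-1].T @ v + biases[-1]
    return softmax(z_top), activations
```

The returned vector sums to one by construction, as a proper posterior over the classes should.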
MLPs are often trained with the error back-propagation procedure (BP) [10] with stochastic gradient ascent

$$(W^\ell, a^\ell) \leftarrow (W^\ell, a^\ell) + \epsilon \frac{\partial D}{\partial (W^\ell, a^\ell)}, \quad 0 \le \ell \le L,$$

for an objective function $D$ and learning rate $\epsilon$. If the objective is to maximize the total log posterior probability over the $T$ training samples $o(t)$ with ground-truth labels $s(t)$, i.e.

$$D = \sum_{t=1}^{T} \log P_{s|o}(s(t)\,|\,o(t)), \quad (1)$$

then the gradients are

$$\frac{\partial D}{\partial W^\ell} = \sum_t v^\ell(t)\,\big(\omega^\ell(t)\, e^\ell(t)\big)^T; \qquad \frac{\partial D}{\partial a^\ell} = \sum_t \omega^\ell(t)\, e^\ell(t)$$

$$e^{L}(t) = (\log \mathrm{softmax})'\big(z^L(v^L(t))\big); \qquad e^{\ell-1}(t) = W^\ell \cdot \omega^\ell(t)\, e^\ell(t) \quad \text{for } 0 \le \ell < L$$

$$\omega^\ell(t) = \begin{cases} \mathrm{diag}\big(\sigma'(z^\ell(v^\ell(t)))\big) & \text{for } 0 \le \ell < L \\ 1 & \text{for } \ell = L \end{cases}$$

with error signals $e^\ell(t)$, the component-wise derivatives $\sigma_j'(z) = \sigma_j(z)\cdot(1 - \sigma_j(z))$ and $(\log \mathrm{softmax})_j'(z) = \delta_{s(t),j} - \mathrm{softmax}_j(z)$, and Kronecker delta $\delta$.
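As a concrete rendering of these gradient formulas, here is a single-sample numpy sketch of the backward pass for $D = \log P_{s|o}(s|o)$ (the equations sum over $t$; we show one sample). Shapes and names are our assumptions:

```python
import numpy as np

def backprop_grads(o, s, weights, biases):
    """One-sample gradients of D = log P(s|o), for gradient *ascent*.
    Error signals: e^L = delta_s - softmax(z^L);
    stepping down, e is multiplied by sigma'(z) (the omega^l factor)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # forward pass, keeping each layer's input v^l
    vs = [o]
    v = o
    for W, a in zip(weights[:-1], biases[:-1]):
        v = sigmoid(W.T @ v + a)
        vs.append(v)
    z_top = weights[-1].T @ v + biases[-1]
    p = np.exp(z_top - z_top.max()); p /= p.sum()
    # top-layer error signal: (log softmax)' = delta_{s,j} - softmax_j
    e = -p.copy(); e[s] += 1.0
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.append((np.outer(vs[l], e), e.copy()))  # (dD/dW^l, dD/da^l)
        if l > 0:
            # e^{l-1} = W^l (omega^l e^l); sigma(z^{l-1}) = vs[l], so
            # sigma'(z^{l-1}) = vs[l] * (1 - vs[l])
            e = (weights[l] @ e) * vs[l] * (1.0 - vs[l])
    return list(reversed(grads))
```

A finite-difference check on a tiny network confirms the analytic gradients match the numerical ones.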
BP, however, can easily get trapped in poor local optima for deep networks. This can be somewhat alleviated by growing the model layer by layer, or, more effectively, by using the DBN pre-training procedure described next.
2.2. DBN Pre-Training
The deep belief network (DBN), proposed by Hinton [11], provides a new way to train deep generative models. The layer-wise greedy pre-training algorithm developed for DBNs was later found to be effective in training DNNs as well.
The DBN pre-training procedure treats each consecutive pair of layers in the MLP as a restricted Boltzmann machine (RBM) [11] whose joint probability is defined as

$$P_{h,v}(h,v) = \frac{1}{Z_{h,v}}\, e^{\,v^T W h + v^T b + a^T h}$$

for the Bernoulli-Bernoulli RBM applied to binary $v$, with a second bias vector $b$ and normalization term $Z_{h,v}$, and

$$P_{h,v}(h,v) = \frac{1}{Z_{h,v}}\, e^{\,-\frac{1}{2} v^T v + v^T W h + v^T b + a^T h}$$

for the Gaussian-Bernoulli RBM applied to continuous $v$. In both cases the conditional probability $P_{h|v}(h|v)$ has the same form as that in an MLP layer.
The RBM parameters can be efficiently trained in an unsupervised fashion by maximizing the likelihood $\mathcal{L} = \prod_t P_v(v(t))$ over training samples $v(t)$ with the approximate contrastive-divergence algorithm [11, 12]. We use the specific form given in [12]:

$$\frac{\partial \log P_v(v(t))}{\partial W} \approx v(t)\,\sigma\big(z(v(t))\big)^T - \hat{v}(t)\,\sigma\big(z(\hat{v}(t))\big)^T$$

with $\hat{v}(t) = \sigma(W \hat{h}(t) + b)$, where $\hat{h}(t)$ is a binary random sample from $P_{h|v}(\cdot\,|\,v(t))$.
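A single CD-1 update for the Bernoulli-Bernoulli case might look as follows. This is a sketch under our own assumptions: the learning rate, the biased-update form for $a$ and $b$, and the use of probabilities rather than samples in the negative phase are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

def cd1_step(v, W, a, b, eps=0.1, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM.
    Hidden activation z(v) = W^T v + a; reconstruction v_hat = sigmoid(W h_hat + b)
    with h_hat a binary sample, as in Sec. 2.2."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h_prob = sigmoid(W.T @ v + a)                               # P(h_j = 1 | v)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)  # binary sample h_hat
    v_hat = sigmoid(W @ h_samp + b)                             # reconstruction
    h_hat_prob = sigmoid(W.T @ v_hat + a)
    # positive phase minus negative phase
    W += eps * (np.outer(v, h_prob) - np.outer(v_hat, h_hat_prob))
    a += eps * (h_prob - h_hat_prob)
    b += eps * (v - v_hat)
    return W, a, b
```

In practice this step is applied over mini-batches and repeated for several sweeps through the data.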
To train multiple layers, one trains the first layer, freezes it, uses the conditional expectation of its output as the input to the next layer, and continues in the same fashion for the remaining layers. Hinton and many others have found that initializing MLPs with pre-trained parameters never hurts and often helps [11].
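The greedy stacking procedure can be outlined as follows, assuming a hypothetical `train_rbm` helper (not from the paper) that runs contrastive-divergence sweeps and returns the learned parameters:

```python
import numpy as np

def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training: train an RBM on the current inputs,
    freeze it, and feed the conditional expectations sigmoid(v W + a)
    to the next layer. `train_rbm` is a hypothetical helper returning (W, a, b)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    weights, biases = [], []
    v = data                              # rows are training samples v(t)
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(v, n_hidden)  # e.g. repeated CD-1 sweeps
        weights.append(W); biases.append(a)
        v = sigmoid(v @ W + a)            # E{h | v} becomes the next layer's input
    return weights, biases                # used to initialize the MLP before BP
```

The resulting `weights` and `biases` initialize the MLP, which is then fine-tuned with back-propagation as in Sec. 2.1.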
2.3. Integrating DNNs with CD-HMMs
Following the traditional ANN-HMMs of the 90’s [1], we replace the acoustic model’s Gaussian mixtures with an MLP and compute the HMM’s state emission likelihoods $p_{o|s}(o|s)$ by converting state posteriors obtained from the MLP to likelihoods:

$$p_{o|s}(o|s) = \frac{P_{s|o}(s|o)}{P_s(s)} \cdot \mathrm{const}(s). \quad (2)$$
Here, classes s correspond to HMM states, and observation vectors o are regular acoustic feature vectors augmented with neighbor frames (5 on each side in our case). Ps(s) is the prior probability of state s.
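In decoding, Eq. (2) is conveniently applied in the log domain, and the neighbor-frame augmentation is a simple stacking operation. A sketch of both (the edge handling by frame replication is our assumption, not specified in the paper):

```python
import numpy as np

def log_emission_likelihoods(log_posteriors, log_state_priors):
    """Eq. (2) in the log domain: log p(o|s) = log P(s|o) - log P(s) + const.
    The constant is independent of s and can be dropped during decoding."""
    return log_posteriors - log_state_priors

def stack_context(features, n=5):
    """Augment each frame with n neighbor frames on each side (11 frames
    total for n=5), replicating the first/last frame at the edges."""
    T, d = features.shape
    padded = np.vstack([np.repeat(features[:1], n, axis=0),
                        features,
                        np.repeat(features[-1:], n, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * n + 1)])
```

With n = 5, a d-dimensional frame becomes an 11·d-dimensional input vector o to the DNN.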
However, unlike earlier ANN-HMM systems, we model tied triphone states directly. It had long been assumed that the thousands of triphone states were too many to be accurately modeled by an MLP, but [5] has shown that doing so is not only feasible but works very well. This is a critical factor in achieving the unusual accuracy improvements reported in this paper. The resulting model is called the Context-Dependent Deep Neural Network HMM, or CD-DNN-HMM.