# Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

## Abstract

We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

Index Terms—Artificial neural network–hidden Markov model(ANN-HMM), context-dependent phone, deep belief network,deep neural network hidden Markov model (DNN-HMM), speech recognition, large-vocabulary speech recognition (LVSR).

I. INTRODUCTION

EVEN after decades of research and many successfully deployed commercial products, the performance of automatic speech recognition (ASR) systems in real usage scenarios lags behind human level performance (e.g., [2], [3]). There have been some notable recent advances in discriminative training (see an overview in [4]; e.g., maximum mutual information (MMI) estimation [5], minimum classification error (MCE) training [6], [7], and minimum phone error (MPE) training [8], [9]), in large-margin techniques (such as large-margin estimation [10], [11], large-margin hidden Markov model (HMM) [12], large-margin MCE [13]–[16], and boosted MMI [17]), as well as in novel acoustic models (such as conditional random fields (CRFs) [18]–[20], hidden CRFs [21], [22], and segmental CRFs [23]). Despite these advances, the elusive goal of human level accuracy in real-world conditions requires continued,vibrant research.

Recently, a major advance has been made in training densely connected, directed belief nets with many hidden layers. The resulting deep belief nets learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in data. The deep belief net training algorithm suggested in [24] first initializes the weights of each layer individually in a purely unsupervised1 way and then fine-tunes the entire network using labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and classification for speech, audio, text, and image data ([25]–[29]). These advances triggered interest in developing acoustic models based on pre-trained neural networks and other deep learning techniques for ASR. For example, context-independent pre-trained, deep neural network HMM hybrid architectures have recently been proposed for phone recognition [30]–[32] and have achieved very competitive performance. Using pre-training to initialize the weights of a deep neural network has two main potential benefits that have been discussed in the literature. In [33], evidence was presented that is consistent with viewing pre-training as a peculiar sort of data-dependent regularizer whose effect on generalization error does not diminish with more data, even when the dataset is so vast that training cases are never repeated. The regularization effect from using information in the distribution of inputs can allow highly expressive models to be trained on comparably small quantities of labeled data. Additionally, [34], [33], and others have also reported experimental evidence consistent with pre-training aiding the subsequent optimization, typically performed by stochastic gradient descent. Thus, pre-trained neural networks often also achieve lower training error than neural networks that are not pre-trained (although this effect can often be confounded by the use of early stopping). These effects are especially pronounced in deep autoencoders.

Deep belief network pre-training was the first pre-trainingmethod to be widely studied, although many other techniques now exist in the literature (e.g., [35]). After [34] showed that deep auto-encoders could be trained effectively using deep belief net pre-training, there was a resurgence of interest in using deeper neural networks for applications. Although less pathological deep architectures than deep autoencoders can in some cases be trained without pre-training, for many problems and model architectures, researchers have reported pre-training to be helpful (even in some cases for large single hidden layer neural networks trained on massive datasets, as in [28]). We view the various unsupervised pre-training techniques as convenient and robust ways to help train neural networks with many hidden layers that are generally helpful, rarely hurtful, and sometimes essential.

In this paper, we propose a novel acoustic model, a hybrid between a pre-trained, deep neural network (DNN) and a context-dependent (CD) hidden Markov model. The pre-training algorithm we use is the deep belief network (DBN) pre-training algorithm of [24], but we will denote our model with the abbreviation DNN-HMM to help distinguish it from a dynamic Bayes net (which we will not abreviate in this article) and to make it clear that we abandon the deep belief network once pre-training is complete and only retain and continue training the recognition weights. CD-DNN-HMMs combine the representational power of deep neural networks and the sequential modeling ability of context-dependent hidden Markov models (HMMs). In this paper, we illustrate the key ingredients of the model, describe the procedure to learn the CD-DNN-HMMs’ parameters, analyze how various important design choices affect the recognition performance, and demonstrate that CD-DNN-HMMs can significantly outperform strong discriminatively-trained context-dependent Gaussian mixture model hidden Markov model (CD-GMM-HMM) baselines on the challenging business search dataset of [36], collected under actual usage conditions. To our best knowledge, this is the first time DNN-HMMs, which are formerly only used for phone recognition, are successfully applied to large-vocabulary speech recognition (LVSR) problems.

A. Previous Work Using Neural Network Acoustic Models

The combination of artificial neural networks (ANNs) and HMMs as an alternative paradigm for ASR started between the end of 1980s and the beginning of the 1990s. A variety of different architectures and training algorithms have been proposed in the literature (see the comprehensive survey in [37]). Among these techniques, the ones most relevant to this work are those that use the ANNs to estimate the HMM state-posterior probabilities [38]–[45], which have been referred to as ANN-HMM hybrid models in the literature. In these ANN-HMM hybrid architectures, each output unit of the ANN is trained to estimate the posterior probability of a continuous density HMMs’ state given the acoustic observations. ANN-HMM hybrid models were seen as a promising technique for LVSR in the mid-1990s. In addition to their inherently discriminative nature, ANN-HMMs have two additional advantages: the training can be performed using the embedded Viterbi algorithm and the decoding is generally quite efficient.

Most early work (e.g., [39] and [38]) on the hybrid approach used context-independent phone states as labels for ANN training and considered small vocabulary tasks. ANN-HMMs were later extended to model context-dependent phones and were applied to mid-vocabulary and some large-vocabulary ASR tasks (e.g., in [45], which also employed recurrent neural architectures). However, in earlier work on context dependent ANN-HMM hybrid architectures [46], the posterior probability of the context-dependent phone was modeled as either

(1)

(2)

where is the acoustic observation at time is one of the clustered context classes is either a context-independent phone or a state in a context-independent phone. ANNs were used to estimate and (alternatively and ). Note that although these types of context-dependent ANN-HMMs outperformed GMM-HMMs for some tasks, the improvements were small.

These earlier hybrid attempts had some important limitations.For example, using only backpropagation to train the ANN makes it challenging (although not impossible) to exploit more than two hidden layers well and the context-dependent model described above does not take advantage of the numerous effective techniques developed for GMM-HMMs. Around 1999, the desire to use HMM advances from the speech research community directly without developing replacement techniques and tools contributed to a shift from using neural nets to predict phonetic states to using neural nets to augment features for later use in a conventional GMM-HMM recognizer (e.g., [47]). In this work, however, we do not take that approach, but instead we try to improve the earlier hybrid approaches by replacing more traditional neural nets with deeper, pre-trained neural nets and by using the senones [48] (tied triphone states) of a GMM-HMM tri-phone model as the output units of the neural network, in line with state-of-the-art HMM systems.

Although this work uses the hybrid approach, as alluded to above, much recent work using neural networks in acoustic modeling uses the so-called TANDEM approach, first proposed in [49]. The TANDEM approach augments the input to a GMM-HMM system with features derived from the suitably transformed output of one or more neural networks, typically trained to produce distributions over monophone targets. In a similar vein, [50] uses features derived from an earlier “bottle-neck” hidden layer instead of using the neural network outputs directly. Many recent papers (e.g., [51]–[54]) train neural networks on LVSR datasets (often in excess of 1000 hours of data) and use variants of these approaches, either augmenting the input to a GMM-HMM system with features based on the neural network outputs or some earlier hidden layer. Although a neural network nominally containing three hidden layers (the largest number of layers investigated in [55]) might be used to create bottle-neck features, if the feature layer is the middle hidden layer then the resulting features are only produced by an encoder with a single hidden layer.

Neural networks for producing bottle-neck features are very similar architecturally to autoencoders since both typically have a small code layer. Deeper neural networks, especially deeper autoencoders, are known to be difficult to train with backpropagation alone. For example, [34] reports in one experiment that they are unable to get results nearly so good as those possible with deep belief network pre-training when training a deep (the encoder and decoder in their architecture both had three hidden layers) autoencoder with a nonlinear conjugate gradient algorithm. Both [56] and [57] investigate why training deep feed-forward neural networks can often be easier with some form of pre-training or a sophisticated optimizer of the sort used in [58].

B. Introduction to the DNN-HMM Approach

The primary contributions of this work are the development of a context-dependent, pre-trained, deep neural network HMM hybrid acoustic model (CD-DNN-HMM); a description of our recipe for applying this sort of model to LVSR problems; and an analysis of our results which show substantial improvements in recognition accuracy for a difficult LVSR task over discriminatively-trained pure CD-GMM-HMM systems. Our work differs from earlier context-dependent ANN-HMMs [42], [41] in two key respects. First, we used deeper, more expressive neural network architectures and thus employed the unsupervised DBN pre-training algorithm to make sure training would be effective. Second, we used posterior probabilities of senones (tied triphone HMM states) [48] as the output of the neural network, instead of the combination of context-independent phone and context class used previously in hybrid architectures. This second difference also distinguishes our work from earlier uses of DNN-HMM hybrids for phone recognition [30]–[32], [59]. Note that [59], which also appears in this issue, is the context-independent version of our approach and builds the foundation for our work. The work in this paper focuses on context-dependent DNN-HMMs using posterior probabilities of senones as network outputs and can be successfully applied to large vocabulary tasks. Training the neural network to predict a distribution over senones causes more bits of information to be present in the neural network training labels. It also incorporates context-dependence into the neural network outputs (which, since we are not using a Tandem approach, lets us use a decoder based on triphone HMMs), and it may have additional benefits. Our evaluation was done on LVSR instead of phoneme recognition tasks as was the case in [30]–[32], [59]. It represents the first large-vocabulary application of a pre-trained,deep neural network approach. Our results show that our CD-DNN-HMM system provides dramatic improvements over a discriminatively trained CD-GMM-HMM baseline.

The remainder of this paper is organized as follows. In Section II, we briefly introduce RBMs and deep belief nets, and outline the general pre-training strategy we use. In Section III, we describe the basic ideas, the key properties, and the training and decoding strategies of our CD-DNN-HMMs. In Section IV, we analyze experimental results on a 65 vocabulary business search dataset collected from the Bing mobile voice search application (formerly known as Live Search for mobile [36], [60]) under real usage scenarios. Section V offers conclusions and directions for future work.

## II. DEEP BELIEF NETWORKS

Deep belief networks (DBNs) are probabilistic generative models with multiple layers of stochastic hidden units above a single bottom layer of observed variables that represent a data vector. DBNs have undirected connections between the top two layers and directed connections to all other layers from the layer above. There is an efficient unsupervised algorithm, first described in [24], for learning the connection weights in a DBN that is equivalent to training each adjacent pair of layers as an restricted Boltzmann machine (RBM). There is also a fast, approximate, bottom-up inference algorithm to infer the states of all hidden units conditioned on a data vector. After the unsupervised pre-training phase, Hinton et al. [24] used the up-down algorithm to optimize all of the DBN weights jointly. During this fine-tuning phase, a supervised objective function could also be optimized.

In this paper, we use the DBN weights resulting from the unsupervised pre-training algorithm to initialize the weights of a deep, but otherwise standard, feed-forward neural network and then simply use the backpropagation algorithm [61] to fine-tune the network weights with respect to a supervised criterion. Pretraining followed by stochastic gradient descent is our method of choice for training deep neural networks because it often outperforms random initialization for the deeper architectures we are interested in training and provides results very robust to the initial random seed. The generative model learned during pre-training helps prevent overfitting, even when using models with very high capacity and can aid in the subsequent optimization of the recognition weights.

Although empirical results ultimately are the best reason for the use of a technique, our motivation for even trying to find and apply deeper models that might be capable of learning rich, distributed representations of their input is also based on formal and informal arguments by other researchers in the machine learning community. As argued in [62] and [63], insufficiently deep architectures can require an exponential blow-up in the number of computational elements needed to represent certain functions satisfactorily. Thus, one primary motivation for using deeper models such as neural networks with many layers is that they have the potential to be much more representationally ef- ficient for some problems than shallower models like GMMs. Furthermore, GMMs as used in speech recognition typically have a large number of Gaussians with independently parameterized means which may result in those Gaussians being highly localized and thus may result in such models only performing local generalization. In effect, such a GMM would partition the input space into regions each modeled by a single Gaussian.[64] proved that constant leaf decision trees require a number of training cases exponential in their input dimensionality to learn certain rapidly varying functions. [64] also makes more general and less formal arguments that models that create a single hard or soft partitioning of the input space and use separately parameterized simple models for each region are doomed to have similar generalization issues when trained on rapidly varying functions. In a related vein, [65] also proves an analogous “curse of rapidly-varying functions” for a large class of local kernel machines that include both supervised learning algorithms (e.g., SVMs with Gaussian kernels) and many semi-supervised algorithms and unsupervised manifold learning algorithms. It is our fear that functions important for solving difficult perceptual tasks in domains such as computer vision and computer audition will have a componential structure that makes them vary rapidly even though there is perhaps only a comparatively small number of factors that cause these variations. Although it remains to be seen to what extent these arguments about architectural depth and local generalization apply to speech recognition, one of our hopes in this work is to demonstrate that replacing GMMs with deeper models can reduce recognition error in a difficult LVSR task, even if we are unable to show that our proposed system performs well because of some sort of avoidance of the potential issues we discuss above.

A. Restricted Boltzmann Machines

RBMs [66] are a type of undirected graphical model constructed from a layer of binary stochastic hidden units and a layer of stochastic visible units that, for the purposes of this work, will either be Bernoulli or Gaussian distributed conditional on the hidden units. The visible and hidden units form a bipartite graph with no visible-visible or hidden-hidden connections. For concreteness, we will assume the visible units are binary for the moment (we always assume binary hidden units in this work) and describe how we deal with real-valued speech data at the end of this section. An RBM assigns an energy to every configuration of visible and hidden state vectors, denoted and respectively, according to

(3)

where is the matrix of visible/hidden connection weights, is a visible unit bias, and is a hidden unit bias. The probability of any particular setting of the visible and hidden units is given in terms of the energy of that configuration by

(4)

where the normalization factor is known as the partition function.