Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

Abstract

We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement.

CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relative when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited.

On four less well-matched transcription tasks, we observe relative error reductions of 22–28%.

Index Terms: speech recognition, deep belief networks, deep neural networks

1.    Introduction

Since the early 90’s, artificial neural networks (ANNs) have been used to model the state emission probabilities of HMM speech recognizers [1]. While traditional Gaussian mixture model (GMM)-HMMs model context dependency through tied context-dependent states (e.g. CART-clustered crossword triphones [2]), ANN-HMMs were never used to do so directly. Instead, networks were often factorized, e.g. into a monophone and a context-dependent part [3], or hierarchically decomposed [4]. It has been commonly assumed that hundreds or thousands of triphone states were just too many to be accurately modeled or trained in a neural network. Only recently did Yu et al. discover that doing so is not only feasible but works very well [5].

Context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [5, 6], apply the classical ANN-HMMs of the 90’s to traditional tied-state triphones directly, exploiting Hinton’s deep-belief-network (DBN) pre-training procedure. This was shown to lead to a very promising and possibly disruptive acoustic model, as indicated by a 16% relative recognition error reduction over discriminatively trained GMM-HMMs on a business search task [5, 6], which features short query utterances, tens of hours of training data, and hundreds of tied states.

This paper takes this model a step further and serves several purposes. First, we show that the exact same CD-DNN-HMM can be effectively scaled up in terms of training-data size (from 24 hours to over 300), model complexity (from 761 tied triphone states to over 9000), depth (from 5 to 9 hidden layers), and task (from voice queries to speech-to-text transcription). This is demonstrated on a publicly available benchmark, the Switchboard phone-call transcription task (2000 NIST Hub5 and RT03S sets). We should note here that ANNs have been trained on up to 2000 hours of speech before [7], but with much fewer output units (monophones) and fewer hidden layers.

Second, we advance the CD-DNN-HMMs by introducing weight sparseness and the related learning strategy, and demonstrate that this can reduce recognition error or model size.

Third, we present the statistical view of the multi-layer perceptron (MLP) and DBN and provide empirical evidence for understanding which factors contribute most to the accuracy improvements achieved by the CD-DNN-HMMs.

2. The Context-Dependent Deep Neural Network HMM

A deep neural network (DNN) is a conventional multi-layer perceptron (MLP, [8]) with many hidden layers, optionally initialized using the DBN pre-training algorithm. In the following, we want to recap the DNN from a statistical viewpoint and describe its integration with context-dependent HMMs for speech recognition. For a more detailed description, please refer to [6].

2.1. Multi-Layer Perceptron: A Statistical View

An MLP as used in this paper models the posterior probability $P_{s|o}(s|o)$ of a class $s$ given an observation vector $o$, as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $\ell = 0, \ldots, L-1$, model posterior probabilities of hidden binary vectors $h^\ell$ given input vectors $v^\ell$, while the top layer $L$ models the desired class posterior as

$$P^\ell_{h|v}(h^\ell \mid v^\ell) = \prod_{j=1}^{N^\ell} \frac{e^{z^\ell_j(v^\ell)\, h^\ell_j}}{e^{z^\ell_j(v^\ell)} + 1}, \qquad 0 \le \ell < L$$

$$P^L_{s|v}(s \mid v^L) = \frac{e^{z^L_s(v^L)}}{\sum_{s'} e^{z^L_{s'}(v^L)}} = \mathrm{softmax}_s\big(z^L(v^L)\big)$$

$$z^\ell(v^\ell) = (W^\ell)^{\mathsf T} v^\ell + a^\ell$$

with weight matrices $W^\ell$ and bias vectors $a^\ell$, where $h^\ell_j$ and $z^\ell_j(v^\ell)$ are the $j$-th component of $h^\ell$ and $z^\ell(v^\ell)$, respectively.

The precise modeling of $P_{s|o}(s|o)$ requires integration over all possible values of $h^\ell$ across all layers, which is infeasible. An effective practical trick is to replace the marginalization with the “mean-field approximation” [9]. Given observation $o$, we set $v^0 = o$ and choose the conditional expectation $E^\ell_{h|v}\{h^\ell \mid v^\ell\} = \sigma\big(z^\ell(v^\ell)\big)$ as input $v^{\ell+1}$ to the next layer,


where $\sigma_j(z) = 1/(1 + e^{-z_j})$.
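For concreteness, the mean-field forward pass can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the experiments; the layer sizes and parameter names below are placeholders.

```python
import numpy as np

def sigmoid(z):
    # sigma_j(z) = 1 / (1 + exp(-z_j)), applied component-wise
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_posteriors(o, weights, biases):
    """Mean-field forward pass: v^0 = o, v^(l+1) = sigma(z^l(v^l)) for the
    hidden layers, and softmax(z^L(v^L)) gives the class posteriors P_{s|o}(s|o)."""
    v = o
    for W, a in zip(weights[:-1], biases[:-1]):
        v = sigmoid(v @ W + a)            # expectation of the binary hidden vector h^l
    return softmax(v @ weights[-1] + biases[-1])

# Illustrative dimensions only: a stacked-frame input, a few 2048-unit hidden
# layers, and one output unit per tied triphone state.
rng = np.random.default_rng(0)
dims = [429, 2048, 2048, 2048, 9304]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
p = mlp_posteriors(rng.standard_normal(dims[0]), weights, biases)
print(p.shape, p.sum())                   # (9304,), sums to ~1
```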

MLPs are often trained with the error back-propagation procedure (BP) [10] with stochastic gradient ascent

$$(W^\ell, a^\ell) \;\leftarrow\; (W^\ell, a^\ell) + \epsilon \, \frac{\partial D}{\partial (W^\ell, a^\ell)}, \qquad 0 \le \ell \le L,$$

for an objective function $D$ and learning rate $\epsilon$. If the objective is to maximize the total log posterior probability over the $T$ training samples $o(t)$ with ground-truth labels $s(t)$, i.e.

$$D = \sum_{t=1}^{T} \log P_{s|o}\big(s(t) \mid o(t)\big), \qquad (1)$$

then the gradients are

$$\frac{\partial D}{\partial W^\ell} = v^\ell(t)\,\big(\omega^\ell(t)\, e^\ell(t)\big)^{\mathsf T}; \qquad \frac{\partial D}{\partial a^\ell} = \omega^\ell(t)\, e^\ell(t)$$

$$e^{\ell-1}(t) = W^\ell \cdot \omega^\ell(t)\, e^\ell(t), \qquad e^L(t) = (\log\mathrm{softmax})'\big(z^L(v^L(t))\big)$$

$$\omega^\ell(t) = \begin{cases} \mathrm{diag}\big(\sigma'(z^\ell(v^\ell(t)))\big) & \text{for } 0 \le \ell < L \\ I & \text{for } \ell = L \end{cases}$$

with error signals $e^\ell(t)$, the component-wise derivatives $\sigma'_j(z) = \sigma_j(z)\cdot(1 - \sigma_j(z))$ and $(\log\mathrm{softmax})'_j(z) = \delta_{s(t),j} - \mathrm{softmax}_j(z)$, and Kronecker delta $\delta$.
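As an illustration of these formulas, the sketch below computes the per-frame gradients by back-propagation, reusing the sigmoid and softmax helpers from the previous sketch; an SGD step would then add $\epsilon$ times these gradients to $(W^\ell, a^\ell)$. The names and structure are ours, not the paper's code.

```python
def bp_gradients(o_t, s_t, weights, biases):
    """Backpropagation for one frame o(t) with ground-truth state index s(t),
    following the equations above."""
    # Forward pass, keeping the layer inputs v^l and pre-activations z^l.
    vs, zs = [o_t], []
    L = len(weights) - 1
    for l, (W, a) in enumerate(zip(weights, biases)):
        z = vs[-1] @ W + a
        zs.append(z)
        vs.append(sigmoid(z) if l < L else softmax(z))

    # Top-layer error signal: (log softmax)'_j(z) = delta_{s(t),j} - softmax_j(z).
    e = -vs[-1].copy()
    e[s_t] += 1.0

    grads_W, grads_a = [None] * (L + 1), [None] * (L + 1)
    omega_e = e                                   # omega^L = I, so omega^L e^L = e^L
    for l in range(L, -1, -1):
        grads_W[l] = np.outer(vs[l], omega_e)     # dD/dW^l = v^l (omega^l e^l)^T
        grads_a[l] = omega_e                      # dD/da^l = omega^l e^l
        if l > 0:
            e = weights[l] @ omega_e              # e^(l-1) = W^l . omega^l e^l
            sig = sigmoid(zs[l - 1])
            omega_e = sig * (1.0 - sig) * e       # omega^(l-1) = diag(sigma'(z^(l-1)))
    return grads_W, grads_a
```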

BP, however, can easily get trapped in poor local optima for deep networks. This can be somewhat alleviated by growing the model layer by layer, or more effectively by using the DBN pre-training procedure described next.

2.2.   DBN Pre-Training

The deep belief network (DBN), proposed by Hinton [11], provides a new way to train deep generative models. The layer-wise greedy pre-training algorithm developed for DBNs was later found to also be effective in training DNNs.

The DBN pre-training procedure treats each consecutive pair of layers in the MLP as a restricted Boltzmann machine (RBM) [11] whose joint probability is defined as

$$P_{h,v}(h, v) = \frac{1}{Z_{h,v}}\, e^{\,v^{\mathsf T} W h + v^{\mathsf T} b + a^{\mathsf T} h}$$

for the Bernoulli-Bernoulli RBM applied to binary $v$, with a second bias vector $b$ and normalization term $Z_{h,v}$, and

$$P_{h,v}(h, v) = \frac{1}{Z_{h,v}}\, e^{\,v^{\mathsf T} W h - \frac{1}{2}(v-b)^{\mathsf T}(v-b) + a^{\mathsf T} h}$$

for the Gaussian-Bernoulli RBM applied to continuous $v$. In both cases the conditional probability $P_{h|v}(h|v)$ has the same form as that in an MLP layer.

The RBM parameters can be efficiently trained in an unsupervised fashion by maximizing the likelihood $\mathcal{L} = \prod_t P_v(v(t))$ over training samples $v(t)$ with the approximate contrastive-divergence algorithm [11, 12]. We use the specific form given in [12]:

$$\frac{\partial \log \mathcal{L}}{\partial W} \approx \sum_t \Big( v(t)\, E_{h|v}\{h \mid v(t)\}^{\mathsf T} - \hat v(t)\, E_{h|v}\{h \mid \hat v(t)\}^{\mathsf T} \Big)$$

with $\hat v(t) = \sigma(W \hat h(t) + b)$, where $\hat h(t)$ is a binary random sample from $P_{h|v}(\cdot \mid v(t))$.
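A minimal sketch of this update for the Bernoulli-Bernoulli case is shown below (one-step contrastive divergence on a mini-batch, reusing the sigmoid helper from the sketch in Section 2.1). The learning rate and batch handling are illustrative assumptions, not the paper's training recipe.

```python
def cd1_step(v, W, a, b, eps=0.01, rng=None):
    """One contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM.
    v: (T, n_vis) batch of binary visible vectors; W: (n_vis, n_hid) weights;
    a: hidden bias vector; b: visible bias vector."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: E_{h|v}{h | v(t)} = sigma(W^T v(t) + a).
    h_exp = sigmoid(v @ W + a)
    # Binary sample h_hat(t) from P_{h|v}(. | v(t)).
    h_hat = (rng.random(h_exp.shape) < h_exp).astype(v.dtype)
    # Reconstruction v_hat(t) = sigma(W h_hat(t) + b) and its hidden expectation.
    v_hat = sigmoid(h_hat @ W.T + b)
    h_hat_exp = sigmoid(v_hat @ W + a)
    # Approximate log-likelihood gradients, averaged over the batch, then ascend.
    T = v.shape[0]
    dW = (v.T @ h_exp - v_hat.T @ h_hat_exp) / T
    da = (h_exp - h_hat_exp).mean(axis=0)
    db = (v - v_hat).mean(axis=0)
    return W + eps * dW, a + eps * da, b + eps * db
```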

To train multiple layers, one trains the first layer, freezes it, uses the conditional expectation of its output as the input to the next layer, and continues training the subsequent layers; a sketch follows below. Hinton and many others have found that initializing MLPs with pre-trained parameters never hurts and often helps [11].
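Sketched with the cd1_step above, the greedy stacking might look as follows; the Gaussian-Bernoulli handling of the first layer and the supervised fine-tuning pass are omitted, and the epoch count is a placeholder.

```python
def pretrain_stack(data, hidden_sizes, epochs=10):
    """Greedy layer-wise pre-training: train an RBM, freeze it, and feed the
    conditional expectations sigma(W^T v + a) to the next RBM as its input."""
    rng = np.random.default_rng(0)
    v = data                                  # (T, n_vis) training vectors
    stack = []
    for n_hid in hidden_sizes:
        n_vis = v.shape[1]
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        a, b = np.zeros(n_hid), np.zeros(n_vis)
        for _ in range(epochs):
            W, a, b = cd1_step(v, W, a, b)
        stack.append((W, a))                  # initializes the corresponding MLP layer
        v = sigmoid(v @ W + a)                # expectations become the next layer's input
    return stack
```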

2.3.   Integrating DNNs with CD-HMMs

Following the traditional ANN-HMMs of the 90’s [1], we replace the acoustic model’s Gaussian mixtures with an MLP and compute the HMM’s state emission likelihoods $p_{o|s}(o|s)$ by converting state posteriors obtained from the MLP to likelihoods:

$$p_{o|s}(o \mid s) = \frac{P_{s|o}(s \mid o)}{P_s(s)} \cdot \mathrm{const}(s). \qquad (2)$$

Here, classes $s$ correspond to HMM states, and observation vectors $o$ are regular acoustic feature vectors augmented with neighbor frames (5 on each side in our case). $P_s(s)$ is the prior probability of state $s$.
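In decoding, Eq. (2) is applied in the log domain: each DNN state posterior is divided by the state prior, and the constant is dropped since it does not affect the best path. A minimal sketch, including the 5-frames-per-side augmentation, is given below; the prior vector is assumed to be estimated elsewhere (e.g. from state-level alignment counts), and the stacked-frame dimension must match the network's input layer.

```python
def stack_frames(features, context=5):
    """Augment each frame with `context` neighbor frames on each side
    (edges padded by repetition) to form the DNN input vectors o(t)."""
    T = features.shape[0]
    idx = np.arange(T)[:, None] + np.arange(-context, context + 1)[None, :]
    idx = np.clip(idx, 0, T - 1)
    return features[idx].reshape(T, -1)

def emission_log_likelihoods(features, weights, biases, log_state_priors):
    """Scaled log-likelihoods log p(o|s) = log P(s|o) - log P(s) (+ const),
    as in Eq. (2), for every frame of an utterance."""
    inputs = stack_frames(features)
    log_post = np.log(np.array([mlp_posteriors(x, weights, biases) for x in inputs]))
    return log_post - log_state_priors[None, :]
```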

However, unlike earlier ANN-HMM systems, we model tied triphone states directly. It had long been assumed that the thousands of triphone states were too many to be accurately modeled by an MLP, but [5] has shown that doing so is not only feasible but works very well. This is a critical factor in achieving the unusual accuracy improvements in this paper. The resulting model is called the Context-Dependent Deep Neural Network HMM, or CD-DNN-HMM.

