DeepSpeech: Accurate Speech Recognition with GPU-Accelerated Deep Learning

Posted by Bryan Catanzaro | Tagged cuBLAS, cuDNN, Deep Learning, speech recognition

Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is far away from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. In order to achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments. We set out to make progress towards this goal by applying Deep Learning in a new way to speech recognition.

Deep Learning has transformed many important tasks; it has been successful because it scales well: it can absorb large amounts of data to create highly accurate models. Indeed, most industrial speech recognition systems rely on Deep Neural Networks as a component, usually combined with other algorithms. Many researchers have long believed that Deep Neural Networks (DNNs) could provide even better accuracy for speech recognition if they were used for the entire system, rather than just as the acoustic modeling component. However, it has proven difficult to find an end-to-end speech recognition system based on Deep Learning that improves on the state of the art.

Model and Data Co-design

One of the reasons this has been difficult is that training these networks on large datasets is computationally very intensive. The process of training DNNs is iterative: we instantiate ideas about models in computer code that trains a model, then we train the model on a training set and test it, which gives us new ideas about how to improve the model or the training set. The latency of this loop is the rate-limiting step that gates progress. Our models are relatively large, containing billions of connections, and we train them on thousands of hours of data, which means that training our models takes a lot of computation.

We choose our model sizes so that our models have the right amount of capacity to match our training dataset. If our model has too many parameters for our dataset, it will overfit the data: essentially using the excess capacity of the network to memorize training examples, which leads to a brittle model that performs well on the training set but poorly on real-world data. Conversely, if our model has too few parameters for the dataset, it will underfit the data, which means the model fails to learn enough from the dataset. Therefore, choosing the model size and dataset is a co-design process, where we incrementally increase the model size and obtain more training data. Arbitrarily increasing either generally leads to poor results.

Maximizing Strong Scaling

This observation determines how we use parallelism. Although our models are large, we can’t just weakly scale them to larger numbers of GPUs. We care primarily about strong scalability, because if we can get training to scale strongly with more GPUs, we can reduce the latency of our training process on model sizes that are relevant to our datasets. This allows us to come up with new ideas more quickly, driving progress.
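
As a back-of-the-envelope illustration (the numbers below are hypothetical, not measurements from our system), strong scaling asks how much the wall-clock time of a fixed-size training run drops as GPUs are added:

```python
# Hypothetical illustration of strong scaling: the problem size stays fixed
# while GPUs are added, so any speedup directly reduces training latency.

def strong_scaling_report(t_one_gpu_hours, t_multi_gpu_hours, num_gpus):
    """Compute speedup and parallel efficiency for a fixed-size training run."""
    speedup = t_one_gpu_hours / t_multi_gpu_hours
    efficiency = speedup / num_gpus
    return speedup, efficiency

# Made-up timings, for illustration only.
speedup, efficiency = strong_scaling_report(t_one_gpu_hours=200.0,
                                            t_multi_gpu_hours=30.0,
                                            num_gpus=8)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```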

Accordingly, we use multiple GPUs, working together, to train our models. GPUs are especially well suited for training these networks for a couple of reasons. Of course, we rely on the high arithmetic throughput and memory bandwidth that GPUs provide. But there are some other important factors: firstly, the CUDA programming environment is quite mature, and well-integrated into other HPC infrastructure, such as MPI and InfiniBand, which makes us much more productive as we code our ideas into a system to train and test our models. CUDA libraries are also essential to this project: our system relies on both NVIDIA cuBLAS and cuDNN.

Secondly, because each GPU is quite powerful, we don’t have to over-partition our models to gain compute power. Finely partitioning our models across multiple processors is challenging, due to inefficiencies induced by communication latency as well as algorithmic challenges, which scale unfavorably with increased partitioning. Let me explain why this is the case.

When training models, we rely on two different types of parallelism, which are often called “model parallelism” and “data parallelism”. Model parallelism refers to parallelizing the model itself, distributing the neurons across different processors. Some models are easier to partition than others. As we employ model parallelism, the amount of work assigned to each processor decreases, which limits scalability because at some point the processors are under-occupied.
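
To make the idea concrete, here is a minimal sketch (in NumPy, with illustrative shapes rather than our actual layer sizes) of model parallelism applied to a single fully connected layer: the layer’s neurons are split across two workers, and each worker computes its slice of the output.

```python
import numpy as np

# Minimal sketch of model parallelism for one fully connected layer: the
# layer's neurons (columns of W) are split across workers, and each worker
# computes a slice of the output activations. In a real system each shard
# would live on its own GPU; here we simply loop over the shards.

def fc_forward_model_parallel(x, W, b, num_workers=2):
    """x: (batch, in_dim); W: (in_dim, out_dim); b: (out_dim,)."""
    W_shards = np.array_split(W, num_workers, axis=1)   # split the neurons
    b_shards = np.array_split(b, num_workers)
    partial_outputs = [x @ W_s + b_s for W_s, b_s in zip(W_shards, b_shards)]
    # Each worker produces part of every output vector; concatenating them
    # reconstructs the full layer output.
    return np.concatenate(partial_outputs, axis=1)

x = np.random.randn(4, 16)
W = np.random.randn(16, 8)
b = np.zeros(8)
assert np.allclose(fc_forward_model_parallel(x, W, b), x @ W + b)
```

The more finely the neurons are partitioned, the less work each shard carries, which is exactly the under-occupation limit described above.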

We also use “data parallelism”, which in this context refers to parallelizing the training process by partitioning the dataset across processors. Although we have large numbers of training examples, scalability when using data parallelism is limited due to the need to replicate the model for each partition of the dataset, and consequently to combine information learned from each partition of the dataset to produce a single model.
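
The sketch below (again with made-up shapes, and a toy linear model rather than our network) shows the essential structure of data parallelism: each replica computes a gradient on its shard of the minibatch, and the gradients are then combined into a single update.

```python
import numpy as np

# Minimal sketch of data parallelism for a toy linear least-squares model:
# each worker holds a replica of the parameters, computes a gradient on its
# shard of the minibatch, and the gradients are averaged before the update.

def shard_gradients(w, x_batch, y_batch, num_workers=4):
    x_shards = np.array_split(x_batch, num_workers)
    y_shards = np.array_split(y_batch, num_workers)
    grads = []
    for x_s, y_s in zip(x_shards, y_shards):
        err = x_s @ w - y_s                      # per-replica forward pass
        grads.append(x_s.T @ err / len(x_s))     # per-replica gradient
    # Combining the per-replica gradients is the communication step that
    # limits scalability as the number of replicas grows.
    return np.mean(grads, axis=0)

w = np.zeros(3)
x_batch = np.random.randn(512, 3)
y_batch = x_batch @ np.array([1.0, -2.0, 0.5])
w -= 0.1 * shard_gradients(w, x_batch, y_batch)  # one data-parallel step
```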

How We Parallelize Our Model

Training neural networks involves solving a highly non-convex numerical optimization problem. Much research has been conducted to find the best algorithm for solving this optimization problem, and the current state of the art is to use stochastic gradient descent (SGD) with momentum. Although effective, this algorithm is particularly difficult to scale, because it favors taking many small optimization steps in sequence, rather than taking a few large steps. This implies examining a relatively small group of training examples at a time. We trained our speech recognition system on minibatches of 512 examples, each subdivided into microbatches of 128 examples. We processed each training instance on 8 Tesla K40 GPUs, arranged as 4 separate pairs, using 4-way data parallelism and two-way model parallelism.
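
For reference, the following is a minimal sketch of minibatch SGD with momentum; the learning rate, momentum coefficient, and the gradient-accumulation treatment of microbatches are illustrative assumptions rather than our exact training configuration.

```python
import numpy as np

# Minimal sketch of minibatch SGD with momentum. The microbatch loop
# accumulates gradients so that a 512-example minibatch can be processed as
# four 128-example microbatches (how the subdivision maps onto the hardware
# is simplified away here).

def sgd_momentum_step(params, velocity, grad, lr=1e-3, momentum=0.99):
    velocity = momentum * velocity - lr * grad
    return params + velocity, velocity

def train_minibatch(params, velocity, minibatch, grad_fn,
                    microbatch_size=128):
    grads = []
    for start in range(0, len(minibatch), microbatch_size):
        microbatch = minibatch[start:start + microbatch_size]
        grads.append(grad_fn(params, microbatch))   # one microbatch at a time
    grad = np.mean(grads, axis=0)                   # combine microbatch grads
    return sgd_momentum_step(params, velocity, grad)
```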

As shown in Figure 1, our model has 5 layers. The first layer is a convolutional layer that operates on an input spectrogram (a 2-D signal where one dimension represents frequency and the other time) and produces many 1-D responses for each time sample. cuDNN’s ability to operate on non-square images with asymmetric padding made the implementation of this layer simple and efficient. The next two layers and the final layer are fully connected, implemented using cuBLAS.
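
The sketch below mirrors this structure with placeholder sizes (it is not our actual implementation, which runs on the GPU via cuDNN and cuBLAS): a convolution over the spectrogram that produces one response per filter for each time sample, and a fully connected layer expressed as the matrix multiply that cuBLAS would perform.

```python
import numpy as np

# Structural sketch of the first layers described above, with placeholder
# shapes. The spectrogram convolution is what cuDNN computes on the GPU;
# the fully connected layer is the GEMM that cuBLAS computes.

def relu(x):
    return np.maximum(x, 0.0)

def conv_over_time(spectrogram, filters):
    """spectrogram: (freq, time); filters: (num_filters, freq, width).
    Produces one response per filter for each time sample."""
    num_filters, freq, width = filters.shape
    pad = width // 2
    padded = np.pad(spectrogram, ((0, 0), (pad, pad)))  # pad the time axis
    T = spectrogram.shape[1]
    out = np.empty((num_filters, T))
    for t in range(T):
        window = padded[:, t:t + width]                 # (freq, width) patch
        out[:, t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return relu(out)

def fully_connected(x, W, b):
    """A fully connected layer is a matrix multiply plus bias and ReLU."""
    return relu(W @ x + b)
```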

If our model were made with only these 4 layers, model parallelism would be fairly straightforward, since these layers are independent in the time dimension. However, the 4th layer is responsible for propagating information along the time dimension. We use a bidirectional recurrent layer, where the forward and backward directions are independent. Because the recurrent layers require a sequential implementation of activations followed by a non-linearity, they are difficult to parallelize. Traditional approaches like prefix-sum do not work, because the implied operator (a matrix multiply followed by a non-linearity) is not associative.

The bidirectional layer accounts for about 40% of the total training time, so parallelizing it is essential. We gain two-way model parallelism despite the sequential nature of recurrent neural networks by exploiting the independence of the forward and backward directions in the bidirectional recurrent layer. We divide the neuron responses in half along the time dimension, assigning each half to a GPU. For the recurrent layer, we have the first GPU process the forward direction, while the second processes the backward direction, until they both reach the partition at the center of the time dimension. They then exchange activations and switch roles: the first GPU processes the backward direction, while the second processes the forward direction.
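
The following sketch shows this scheduling idea with a toy recurrence; the cell, the shapes, and the way the two directions are combined are placeholders, and “GPU 0” and “GPU 1” are only labels here, since the sketch runs on a single device.

```python
import numpy as np

# Scheduling sketch of the two-GPU partitioning described above: the time
# axis is split in half, each half is owned by one GPU, and the two GPUs
# swap the forward/backward roles after exchanging boundary activations.

def recurrent_direction(x_slice, h0, W, U, reverse=False):
    """Toy recurrence h_t = tanh(W x_t + U h_{t-1}) over one half of the
    time axis, optionally running in the reverse direction."""
    T = x_slice.shape[0]
    order = reversed(range(T)) if reverse else range(T)
    h, hs = h0, np.zeros((T, h0.shape[0]))
    for t in order:
        h = np.tanh(W @ x_slice[t] + U @ h)
        hs[t] = h
    return hs, h

def bidirectional_layer(x, h0, W_f, U_f, W_b, U_b):
    """x: (time, features). Splits the time axis between two 'GPUs'."""
    mid = x.shape[0] // 2
    first, second = x[:mid], x[mid:]
    # Phase 1: GPU 0 runs the forward direction on the first half while
    # GPU 1 runs the backward direction on the second half, meeting at the
    # center of the time dimension.
    fwd_first, h_fwd = recurrent_direction(first, h0, W_f, U_f)
    bwd_second, h_bwd = recurrent_direction(second, h0, W_b, U_b, reverse=True)
    # The GPUs exchange the activations at the boundary and switch roles.
    # Phase 2: GPU 0 continues the backward direction on the first half,
    # GPU 1 continues the forward direction on the second half.
    bwd_first, _ = recurrent_direction(first, h_bwd, W_b, U_b, reverse=True)
    fwd_second, _ = recurrent_direction(second, h_fwd, W_f, U_f)
    forward = np.concatenate([fwd_first, fwd_second])
    backward = np.concatenate([bwd_first, bwd_second])
    return forward + backward   # combining the directions is a placeholder
```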

Results: Advancing the State of the Art in Speech Recognition

Combining these techniques, we built a system that allows us to train our model on thousands of hours of speech data. Because we have been able to iterate quickly on our models, we were able to create a speech recognition system based on an end-to-end deep neural network that significantly improves on the state of the art, especially for noisy environments.

We attained a 16% word error rate on the full Switchboard dataset, a widely used standard dataset where the prior best result was 18.4%. On a noisy test set we developed, we achieved a 19.1% word error rate, which compares favorably to the Google API, which achieved a 30.5% error rate, Microsoft Bing, which achieved 36.1%, and Apple Dictation, which achieved a 43.8% error rate. Often, when commercial speech systems have low confidence in their transcription due to excessive noise, they refuse to provide any transcription at all. The above error rates were computed only for utterances on which all compared systems produced transcriptions, to give them the benefit of the doubt on more difficult utterances.
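
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference transcription, divided by the number of reference words; a minimal sketch:

```python
# Minimal sketch of word error rate (WER): the word-level edit distance
# between hypothesis and reference, divided by the reference length.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("switch the lights off", "switch the light off"))  # 0.25
```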

In summary, the computational resources afforded by GPUs, coupled with simple, scalable models, allowed us to iterate more quickly on our large datasets, leading to a significant improvement in speech recognition accuracy. We were able to show that speech recognition systems built on deep learning from input to output can outperform traditional systems built with more complicated algorithms. We believe that with more data and compute resources we will be able to improve speech recognition even further, working towards the goal of enabling ubiquitous, natural speech interfaces.

Learn more at GTC 2015

If you’re interested in learning more about this work, come see our GTC 2015 talk “Speech: The Next Generation” at 3PM on Tuesday, March 17 in room 210A of the San Jose Convention Center (session S5631), or read our paper.

With dozens of sessions on Machine Learning and Deep Learning, you’ll find that GTC is the place to learn about machine learning in 2015! Readers of Parallel Forall can use the discount code GM15PFAB to get 20% off any conference pass!

