The Ultimate Guide to Speech Recognition With Python

Have you ever wondered how to add speech recognition to your Python project? If so, then keep reading! It’s easier than you might think.

Far from being a fad, the overwhelming success of speech-enabled products like Amazon Alexa has proven that some degree of speech support will be an essential aspect of household tech for the foreseeable future. If you think about it, the reasons why are pretty obvious. Incorporating speech recognition into your Python application offers a level of interactivity and accessibility that few technologies can match.

The accessibility improvements alone are worth considering. Speech recognition allows the elderly and the physically and visually impaired to interact with state-of-the-art products and services quickly and naturally—no GUI needed!

Best of all, including speech recognition in a Python project is really simple. In this guide, you’ll find out how. You’ll learn:

How speech recognition works
What packages are available on PyPI
How to install and use the SpeechRecognition package, a full-featured and easy-to-use Python speech recognition library

In the end, you’ll apply what you’ve learned to a simple “Guess the Word” game and see how it all comes together.

How Speech Recognition Works – An Overview

Before we get to the nitty-gritty of doing speech recognition in Python, let’s take a moment to talk about how speech recognition works. A full discussion would fill a book, so I won’t bore you with all of the technical details here. In fact, this section is not a prerequisite for the rest of the tutorial. If you’d like to get straight to the point, then feel free to skip ahead.

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems were limited to a single speaker and had limited vocabularies of about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts. They can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.

The first component of speech recognition is, of course, speech. Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once digitized, several models can be used to transcribe the audio to text.

Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM). This approach works on the assumption that a speech signal, when viewed on a short enough timescale (say, ten milliseconds), can be reasonably approximated as a stationary process—that is, a process in which statistical properties do not change over time.

In a typical HMM, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small—sometimes as low as 10, although more accurate systems may have dimension 32 or more. The final output of the HMM is a sequence of these vectors.

To decode the speech into text, groups of vectors are matched to one or more phonemes—a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker. A special algorithm is then applied to determine the most likely word (or words) that produce the given sequence of phonemes.

One can imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal.
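
To make the framing and voice activity detection ideas concrete, here is a minimal sketch that cuts a signal into 10-millisecond fragments and flags the ones whose energy exceeds a threshold. It is a deliberate simplification: production VADs use far more robust features, and the function names, the threshold value, and the synthetic test signal are all assumptions made for illustration.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Any trailing partial frame is dropped.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def detect_speech_frames(samples, sample_rate, threshold, frame_ms=10):
    """Return one boolean per frame: True where the frame energy exceeds threshold."""
    return [frame_energy(frame) > threshold
            for frame in frame_signal(samples, sample_rate, frame_ms)]


# Synthetic input: 20 ms of silence followed by 20 ms of a 440 Hz tone at 8 kHz.
rate = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(160)]

flags = detect_speech_frames(silence + tone, rate, threshold=0.01)
print(flags)  # prints [False, False, True, True]
```

A recognizer built on this idea would only pass the frames flagged True on to the more expensive decoding stages.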

Fortunately, as a Python programmer, you don’t have to worry about any of this. A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs.

Picking a Python Speech Recognition Package

A handful of packages for speech recognition exist on PyPI. A few of them include apiai, google-cloud-speech, pocketsphinx, SpeechRecognition, and wit.

Some of these packages—such as wit and apiai—offer built-in features, like natural language processing for identifying a speaker’s intent, which go beyond basic speech recognition. Others, like google-cloud-speech, focus solely on speech-to-text conversion.

There is one package that stands out in terms of ease-of-use: SpeechRecognition.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving this input really easy. Instead of having to build scripts for accessing microphones and processing audio files from scratch, SpeechRecognition will have you up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs and is thus extremely flexible. One of these, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library. That means you can get off the ground without having to sign up for a service.

The flexibility and ease-of-use of the SpeechRecognition package make it an excellent choice for any Python project. However, support for every feature of each API it wraps is not guaranteed. You will need to spend some time researching the available options to find out if SpeechRecognition will work in your particular case.

So, now that you’re convinced you should try out SpeechRecognition, the next step is getting it installed in your environment.

Installing SpeechRecognition

SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but requires some additional installation steps for Python 2. For this tutorial, I’ll assume you are using Python 3.3+.

You can install SpeechRecognition from a terminal with pip:

$ pip install SpeechRecognition

Once installed, you should verify the installation by opening an interpreter session, importing the package with import speech_recognition as sr, and checking the value of sr.__version__.

Note: The version number you get might vary. Version 3.8.1 was the latest at the time of writing.

Go ahead and keep this session open. You’ll start to work with it in just a bit.

SpeechRecognition will work out of the box if all you need to do is work with existing audio files. Specific use cases, however, require a few dependencies. Notably, the PyAudio package is needed for capturing microphone input.

You’ll see which dependencies you need as you read further. For now, let’s dive in and explore the basics of the package.

The `Recognizer` Class

All of the magic in SpeechRecognition happens with the Recognizer class.

The primary purpose of a Recognizer instance is, of course, to recognize speech. Each instance comes with a variety of settings and functionality for recognizing speech from an audio source.

Creating a Recognizer instance is easy. In your current interpreter session, just type:

>>> r = sr.Recognizer()

Each Recognizer instance has seven methods for recognizing speech from an audio source using various APIs. These are:

recognize_bing(): Microsoft Bing Speech
recognize_google(): Google Web Speech API
recognize_google_cloud(): Google Cloud Speech – requires installation of the google-cloud-speech package
recognize_houndify(): Houndify by SoundHound
recognize_ibm(): IBM Speech to Text
recognize_sphinx(): CMU Sphinx – requires installing PocketSphinx
recognize_wit(): Wit.ai

Of the seven, only recognize_sphinx() works offline with the CMU Sphinx engine. The other six all require an internet connection.

A full discussion of the features and benefits of each API is beyond the scope of this tutorial. Since SpeechRecognition ships with a default API key for the Google Web Speech API, you can get started with it right away. For this reason, we’ll use the Web Speech API in this guide. The other six APIs all require authentication with either an API key or a username/password combination. For more information, consult the SpeechRecognition docs.

Caution: The default key provided by SpeechRecognition is for testing purposes only, and Google may revoke it at any time. It is not a good idea to use the Google Web Speech API in production. Even with a valid API key, you’ll be limited to only 50 requests per day, and there is no way to raise this quota. Fortunately, SpeechRecognition’s interface is nearly identical for each API, so what you learn today will be easy to translate to a real-world project.

Each recognize_*() method will throw a speech_recognition.RequestError exception if the API is unreachable. For recognize_sphinx(), this could happen as the result of a missing, corrupt or incompatible Sphinx installation. For the other six methods, RequestError may be thrown if quota limits are met, the server is unavailable, or there is no internet connection.

Ok, enough chit-chat. Let’s get our hands dirty. Go ahead and try to call recognize_google() in your interpreter session.

What happened?

You probably got something that looks like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

You might have guessed this would happen. How could something be recognized from nothing?

All seven recognize_*() methods of the Recognizer class require an audio_data argument. In each case, audio_data must be an instance of SpeechRecognition’s AudioData class.

There are two ways to create an AudioData instance: from an audio file or audio recorded by a microphone. Audio files are a little easier to get started with, so let’s take a look at that first.

Working With Audio Files

Before you continue, you’ll need to download an audio file. The one I used to get started, “harvard.wav,” can be found here. Make sure you save it to the same directory in which your Python interpreter session is running.

SpeechRecognition makes working with audio files easy thanks to its handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.

Supported File Types

Currently, SpeechRecognition supports the following file formats:

WAV: must be in PCM/LPCM format
AIFF
AIFF-C
FLAC: must be native FLAC format; OGG-FLAC is not supported

If you are working on x86-based Linux, macOS or Windows, you should be able to work with FLAC files without a problem. On other platforms, you will need to install a FLAC encoder and ensure you have access to the flac command line tool. You can find more information here if this applies to you.

Using `record()` to Capture Data From a File

Type the following into your interpreter session to process the contents of the “harvard.wav” file:
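
The listing that belongs here did not survive in this copy. As a hedged sketch of the step, the helper below opens an audio file as a context manager and records its full contents. The name `load_audio` is hypothetical, and the function assumes `recognizer` is an `sr.Recognizer` and `audio_file` an `sr.AudioFile`.

```python
def load_audio(recognizer, audio_file):
    """Read the entire contents of an audio file into an AudioData object.

    `recognizer` is expected to be an sr.Recognizer and `audio_file` an
    sr.AudioFile; the helper name itself is only illustrative.
    """
    # AudioFile is a context manager: entering it opens the file,
    # leaving it closes the file again.
    with audio_file as source:
        # With no duration or offset, record() reads to the end of the file.
        return recognizer.record(source)
```

In the interpreter session this would be used as `harvard = sr.AudioFile('harvard.wav')` followed by `audio = load_audio(r, harvard)`, which leaves the `harvard` and `audio` names that the rest of this section relies on.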

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. Then the record() method records the data from the entire file into an AudioData instance. You can confirm this by checking the type of audio:

>>> type(audio)
<class 'speech_recognition.AudioData'>

You can now invoke recognize_google() to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.

Congratulations! You’ve just transcribed your first audio file!

If you’re wondering where the phrases in the “harvard.wav” file come from, they are examples of Harvard Sentences. These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines. They are still used in VoIP and cellular testing today.

The Harvard Sentences comprise 72 lists of ten phrases each. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi. They provide an excellent source of free material for testing your code.

Capturing Segments With `offset` and `duration`

What if you only want to capture a portion of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after a specified number of seconds.

For example, the following captures any speech in the first four seconds of the file:

>>> with harvard as source:
...     audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'

The record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second time returns the four seconds of audio after the first four seconds.
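
A small sketch of that behavior, assuming the same `r` and `harvard` objects as before. The helper name `record_consecutive` is made up for illustration; it simply calls `record()` repeatedly inside a single `with` block.

```python
def record_consecutive(recognizer, audio_file, durations):
    """Record back-to-back segments from one open file stream.

    `durations` is a sequence of segment lengths in seconds. Because the
    file stays open for the whole loop, each record() call continues from
    the point where the previous call stopped.
    """
    segments = []
    with audio_file as source:
        for seconds in durations:
            segments.append(recognizer.record(source, duration=seconds))
    return segments
```

Calling `audio1, audio2 = record_consecutive(r, harvard, [4, 4])` would give the first four seconds of the file in `audio1` and the following four seconds in `audio2`.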

Notice that audio2 contains a portion of the third phrase in the file. When specifying a duration, the recording might stop mid-phrase—or even mid-word—which can hurt the accuracy of the transcription. More on this in a bit.

In addition to specifying a recording duration, the record() method can be given a specific starting point using the offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before starting to record.

To capture only the second phrase in the file, you could start with an offset of four seconds and record for, say, three seconds.

>>> with harvard as source:
...     audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'

The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your interpreter:
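
The listing for this experiment is missing from this copy. One way to run it is sketched below, using a hypothetical helper named `transcribe_segment` that records a slice of the file and sends it to the Google Web Speech API.

```python
def transcribe_segment(recognizer, audio_file, offset=None, duration=None):
    """Record part of an audio file and return its transcription.

    `offset` is the number of seconds to skip from the start of the file,
    `duration` the number of seconds to record; None keeps record()'s
    default for that argument. The recognize_google() call needs an
    internet connection.
    """
    with audio_file as source:
        audio = recognizer.record(source, offset=offset, duration=duration)
    return recognizer.recognize_google(audio)
```

Calling `transcribe_segment(r, harvard, offset=4.7, duration=2.8)` reproduces the kind of badly aligned segment discussed next. The 4.7-second offset comes from the text; the 2.8-second duration is an assumed value chosen so that the recording ends partway into the third phrase.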

By starting the recording at 4.7 seconds, you miss the “it t” portion at the beginning of the phrase “it takes heat to bring out the odor,” so the API only got “akes heat,” which it matched to “Mesquite.”

Similarly, at the end of the recording, you captured “a co,” which is the beginning of the third phrase “a cold dip restores health and zest.” This was matched to “Aiko” by the API.

There is another reason you may get inaccurate transcriptions. Noise! The above examples worked well because the audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you cannot expect the audio to be noise-free.

The Effect of Noise on Speech Recognition

Noise is a fact of life. All audio recordings have some degree of noise in them, and un-handled noise can wreck the accuracy of speech recognition apps.

To get a feel for how noise can affect speech recognition, download the “jackhammer.wav” file here. As always, make sure you save this to your interpreter session’s working directory.

This file has the phrase “the stale smell of old beer lingers” spoken with a loud jackhammer in the background.

What happens when you try to transcribe this file?

>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell of old gear vendors'

Way off!

So how do you deal with this? One thing you can try is using the adjust_for_ambient_noise() method of the Recognizer class.
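
The original listing is missing here. A minimal sketch of the idea, assuming the `r` and `jackhammer` objects created above, with `record_with_calibration` as a hypothetical helper name:

```python
def record_with_calibration(recognizer, audio_file, calibration_seconds=1):
    """Calibrate the recognizer on the start of a file, then record the rest.

    adjust_for_ambient_noise() reads `calibration_seconds` of audio from the
    stream to estimate the noise level, so that portion is no longer
    available to the record() call that follows.
    """
    with audio_file as source:
        recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
        return recognizer.record(source)
```

In the session, `audio = record_with_calibration(r, jackhammer)` followed by `r.recognize_google(audio)` would transcribe the file after calibration.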

That got you a little closer to the actual phrase, but it still isn’t perfect. Also, “the” is missing from the beginning of the phrase. Why is that?

The adjust_for_ambient_noise() method reads the first second of the file stream and calibrates the recognizer to the noise level of the audio. Hence, that portion of the stream is consumed before you call record() to capture the data.

You can adjust the time-frame that adjust_for_ambient_noise() uses for analysis with the duration keyword argument. This argument takes a numerical value in seconds and is set to 1 by default. Try lowering this value to 0.5.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source, duration=0.5)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell like old Beer Mongers'

Well, that got you “the” at the beginning of the phrase, but now you have some new issues! Sometimes it isn’t possible to remove the effect of the noise—the signal is just too noisy to be dealt with successfully. That’s the case with this file.

If you find yourself running up against these issues frequently, you may have to resort to some pre-processing of the audio. This can be done with audio editing software or a Python package (such as SciPy) that can apply filters to the files. A detailed discussion of this is beyond the scope of this tutorial—check out Allen Downey’s Think DSP book if you are interested. For now, just be aware that ambient noise in an audio file can cause problems and must be addressed in order to maximize the accuracy of speech recognition.

When working with noisy files, it can be helpful to see the actual API response. Most APIs return a JSON string containing many possible transcriptions. The recognize_google() method will always return the most likely transcription unless you force it to give you the full response.

You can do this by setting the show_all keyword argument of the recognize_google() method to True.
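
In the session, that call is `r.recognize_google(audio, show_all=True)`. The sketch below shows one way to pull the candidate transcripts out of such a response. The `'alternative'` key is the one this guide describes; the per-candidate `'transcript'` and `'confidence'` keys are an assumption about the response layout, and `list_alternatives` is a hypothetical helper name.

```python
def list_alternatives(response):
    """Return (transcript, confidence) pairs from a show_all=True response.

    Anything that is not a dictionary (for example an empty result when
    nothing was recognized) yields an empty list. Candidates without a
    confidence score get None for that field.
    """
    if not isinstance(response, dict):
        return []
    pairs = []
    for candidate in response.get("alternative", []):
        pairs.append((candidate.get("transcript", ""), candidate.get("confidence")))
    return pairs
```

For example, `list_alternatives(r.recognize_google(audio, show_all=True))` would list every candidate the API considered, which can make it easier to see how close the runner-up transcriptions were.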

As you can see, recognize_google() returns a dictionary with the key 'alternative' that points to a list of possible transcripts. The structure of this response may vary from API to API and is mainly useful for debugging.

By now, you have a pretty good idea of the basics of the SpeechRecognition package. You’ve seen how to create an AudioFile instance from an audio file and use the record() method to capture data from the file. You learned how to record segments of a file using the offset and duration keyword arguments of record(), and you experienced the detrimental effect noise can have on transcription accuracy.

Now for the fun part. Let’s transition from transcribing static audio files to making your project interactive by accepting input from a microphone.

Working With Microphones

To access your microphone with SpeechRecognition, you’ll have to install the PyAudio package. Go ahead and close your current interpreter session, and let’s do that.

Installing PyAudio

The process for installing PyAudio will vary depending on your operating system.

Debian Linux

If you’re on Debian-based Linux (like Ubuntu) you can install PyAudio with apt:

$ sudo apt-get install python-pyaudio python3-pyaudio

Once installed, you may still need to run pip install pyaudio, especially if you are working in a virtual environment.

macOS

For macOS, first you will need to install PortAudio with Homebrew (brew install portaudio), and then install PyAudio with pip (pip install pyaudio).

Windows

On Windows, you can install PyAudio with pip:

$ pip install pyaudio

Testing the Installation

Once you’ve got PyAudio installed, you can test the installation from the console by running python -m speech_recognition.

Make sure your default microphone is on and unmuted. If the installation worked, you should see something like this:

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Go ahead and play around with it a little bit by speaking into your microphone and seeing how well SpeechRecognition transcribes your speech.

Note: If you are on Ubuntu and get some funky output like ‘ALSA lib … Unknown PCM’, refer to this page for tips on suppressing these messages. This output comes from the ALSA package installed with Ubuntu, not from SpeechRecognition or PyAudio. In reality, these messages may indicate a problem with your ALSA configuration, but in my experience, they do not impact the functionality of your code. They are mostly a nuisance.

The `Microphone` Class

Open up another interpreter session, import the package again with import speech_recognition as sr, and create an instance of the Recognizer class with r = sr.Recognizer().

Now, instead of using an audio file as the source, you will use the default system microphone. You can access this by creating an instance of the Microphone class.

>>> mic = sr.Microphone()

If your system has no default microphone (such as on a Raspberry Pi), or you want to use a microphone other than the default, you will need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.

Note that your output may differ from the above example.

The device index of the microphone is the index of its name in the list returned by list_microphone_names(). For example, given the above output, if you want to use the microphone called “front,” which has index 3 in the list, you would create a microphone instance like this:

>>> # This is just an example; do not run
>>> mic = sr.Microphone(device_index=3)

For most projects, though, you’ll probably want to use the default system microphone.
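
If you do need a specific device, hard-coding its index is fragile because the order of the list can differ between machines. The sketch below looks the index up by name instead. `find_microphone_index` is a hypothetical helper, not part of SpeechRecognition, and the case-insensitive substring match is just one reasonable choice.

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device whose name contains keyword.

    `names` is a list such as the one returned by
    sr.Microphone.list_microphone_names(). The match is case-insensitive.
    Returns None when no device name matches.
    """
    keyword = keyword.lower()
    for index, name in enumerate(names):
        # Skip entries without a usable name.
        if name and keyword in name.lower():
            return index
    return None
```

With the package installed, `index = find_microphone_index(sr.Microphone.list_microphone_names(), "front")` followed by `mic = sr.Microphone(device_index=index)` would select the matching device, after checking that `index` is not None.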

Using `listen()` to Capture Microphone Input

Now that you’ve got a Microphone instance ready to go, it’s time to capture some input.

Just like the AudioFile class, Microphone is a context manager. You can capture input from the microphone using the listen() method of the Recognizer class inside of the with block. This method takes an audio source as its first argument and records input from the source until silence is detected.
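
The listing for this step is missing from this copy. A minimal sketch, assuming `r` is the Recognizer and `mic` the Microphone instance created earlier, with `capture_phrase` as an illustrative helper name:

```python
def capture_phrase(recognizer, microphone):
    """Record from the microphone until the recognizer detects silence.

    `microphone` is expected to be an sr.Microphone. It is used as a
    context manager so the audio device is released when recording ends.
    """
    with microphone as source:
        return recognizer.listen(source)
```

In the session this would be called as `audio = capture_phrase(r, mic)`. The call blocks while it waits for speech and returns once a pause is detected.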

Once you execute the with block, try speaking “hello” into your microphone. Wait a moment for the interpreter prompt to display again. Once the “>>>” prompt returns, you’re ready to recognize the speech.

>>> r.recognize_google(audio)
'hello'

If the prompt never returns, your microphone is most likely picking up too much ambient noise. You can interrupt the process with Ctrl + C to get your prompt back.

To handle ambient noise, you’ll need to use the adjust_for_ambient_noise() method of the Recognizer class, just like you did when trying to make sense of the noisy audio file. Since input from a microphone is far less predictable than input from an audio file, it is a good idea to do this anytime you listen for microphone input.
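
A hedged sketch of that pattern, again assuming the `r` and `mic` objects from earlier. The helper name and its `calibration_seconds` parameter are illustrative; the parameter is passed through to the `duration` argument of `adjust_for_ambient_noise()`.

```python
def capture_phrase_calibrated(recognizer, microphone, calibration_seconds=1):
    """Calibrate for ambient noise, then listen for a single phrase.

    The recognizer first samples the microphone for `calibration_seconds`
    to estimate the background noise level, and only then starts listening
    for speech.
    """
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
        return recognizer.listen(source)
```

In the session this would be `audio = capture_phrase_calibrated(r, mic)`, followed by `r.recognize_google(audio)` once the prompt returns.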

After running the above code, wait a second for adjust_for_ambient_noise() to do its thing, then try speaking “hello” into the microphone. Again, you will have to wait a moment for the interpreter prompt to return before trying to recognize the speech.

Recall that adjust_for_ambient_noise() analyzes the audio source for one second. If this seems too long to you, feel free to adjust this with the duration keyword argument.

The SpeechRecognition documentation recommends using a duration no less than 0.5 seconds. In some cases, you may find that durations longer than the default of one second generate better results. The minimum value you need depends on the microphone’s ambient environment. Unfortunately, this information is typically unknown during development. In my experience, the default duration of one second is adequate for most applications.
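
To make the duration trade-off concrete, here is a small sketch. The helper `calibrate` and the constant `MIN_CALIBRATION_SECONDS` are illustrative names, not library API; the 0.5 second floor simply encodes the recommendation quoted above, and `energy_threshold` is the Recognizer attribute that calibration updates.

```python
MIN_CALIBRATION_SECONDS = 0.5  # recommended lower bound, per the text above


def calibrate(recognizer, source, duration=1.0):
    """Calibrate `recognizer` against `source` and return the new energy threshold.

    `source` must already be open, i.e. this is meant to be called inside a
    `with microphone as source:` block.
    """
    if duration < MIN_CALIBRATION_SECONDS:
        raise ValueError(
            "duration should be at least {} seconds".format(MIN_CALIBRATION_SECONDS)
        )
    # duration is the number of seconds of audio analysed for background noise
    recognizer.adjust_for_ambient_noise(source, duration=duration)
    return recognizer.energy_threshold
```

Rejecting very short durations is a design choice of this sketch, not library behavior; printing the returned threshold for a few durations is one way to judge which value suits a given room.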

Handling Unrecognizable Speech

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get something like this in response:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/david/real_python/speech_recognition_primer/venv/lib/python3.5/site-packages/speech_recognition/__init__.py", line 858, in recognize_google
        if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
    speech_recognition.UnknownValueError

Audio that cannot be matched to text by the API raises an UnknownValueError exception. You should always wrap calls to the API with try and except blocks to handle this exception.
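
As a sketch of that advice, the wrapper below turns the exception into a `None` result. The name `transcribe_or_none` is hypothetical, and the `ImportError` fallback is an assumption added only so the example can be dry-run where the package is not installed; in a real project you would import UnknownValueError from speech_recognition directly.

```python
try:
    from speech_recognition import UnknownValueError
except ImportError:
    class UnknownValueError(Exception):
        """Stand-in used only when SpeechRecognition is not installed."""


def transcribe_or_none(recognizer, audio):
    """Return the transcription of `audio`, or None if the speech was unintelligible."""
    try:
        return recognizer.recognize_google(audio)
    except UnknownValueError:
        # The API responded but could not match the audio to any text
        return None
```

Only UnknownValueError is caught here, so other failures (such as the API being unreachable) still surface to the caller.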

NOTE: You may have to try harder than you expect to get the exception thrown. The API works very hard to transcribe any vocal sounds. Even short grunts were transcribed as words like “how” for me. Coughing, hand claps, and tongue clicks would consistently raise the exception.

Putting It All Together: A “Guess the Word” Game

Now that you’ve seen the basics of recognizing speech with the SpeechRecognition package, let’s put your newfound knowledge to use and write a small game that picks a random word from a list and gives the user three attempts to guess the word.

Here is the full script:

Let’s break that down a little bit.

The recognize_speech_from_mic() function takes a Recognizer and Microphone instance as arguments and returns a dictionary with three keys. The first key, "success", is a boolean that indicates whether or not the API request was successful. The second key, "error", is either None or an error message indicating that the API is unavailable or the speech was unintelligible. Finally, the "transcription" key contains the transcription of the audio recorded by the microphone.

The function first checks that the recognizer and microphone arguments are of the correct type, and raises a TypeError if either is invalid:

    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError('`recognizer` must be `Recognizer` instance')

    if not isinstance(microphone, sr.Microphone):
        raise TypeError('`microphone` must be a `Microphone` instance')

The listen() method is then used to record microphone input:

The adjust_for_ambient_noise() method is used to calibrate the recognizer for changing noise conditions each time the recognize_speech_from_mic() function is called.

Next, recognize_google() is called to transcribe any speech in the recording. A try...except block is used to catch the RequestError and UnknownValueError exceptions and handle them accordingly. The success of the API request, any error messages, and the transcribed speech are stored in the success, error and transcription keys of the response dictionary, which is returned by the recognize_speech_from_mic() function.

    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response

You can test the recognize_speech_from_mic() function by saving the above script to a file called “guessing_game.py” and running the following in an interpreter session:

The game itself is pretty simple. First, a list of words, a maximum number of allowed guesses and a prompt limit are declared:

    WORDS = ['apple', 'banana', 'grape', 'orange', 'mango', 'lemon']
    NUM_GUESSES = 3
    PROMPT_LIMIT = 5

Next, a Recognizer and Microphone instance is created and a random word is chosen from WORDS:
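
The word-selection step can be sketched with the standard library's `random.choice()`. The helper `choose_word` and its optional `rng` parameter are illustrative additions (handy for reproducible tests), and the Recognizer and Microphone instances appear only in comments because creating them requires the package and audio hardware.

```python
import random

WORDS = ['apple', 'banana', 'grape', 'orange', 'mango', 'lemon']


def choose_word(words, rng=random):
    """Pick the secret word; pass a seeded random.Random for repeatable runs."""
    return rng.choice(words)


# In the game script the instances would be created alongside the word:
#   recognizer = sr.Recognizer()
#   microphone = sr.Microphone()
word = choose_word(WORDS)
```

Passing `random.Random(0)` (or any seeded generator) as `rng` makes the chosen word deterministic, which is convenient when testing the rest of the game.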

After printing some instructions and waiting for three seconds, a for loop is used to manage each user attempt at guessing the chosen word. The first thing inside the for loop is another for loop that prompts the user at most PROMPT_LIMIT times for a guess, attempting to recognize the input each time with the recognize_speech_from_mic() function and storing the returned dictionary in the local variable guess.

If the "transcription" key of guess is not None, then the user’s speech was transcribed and the inner loop is terminated with break. If the speech was not transcribed and the "success" key is set to False, then an API error occurred and the loop is again terminated with break. Otherwise, the API request was successful but the speech was unrecognizable. The user is warned and the for loop repeats, giving the user another chance at the current attempt.

    for j in range(PROMPT_LIMIT):
        print('Guess {}. Speak!'.format(i+1))
        guess = recognize_speech_from_mic(recognizer, microphone)
        if guess["transcription"]:
            break
        if not guess["success"]:
            break
        print("I didn't catch that. What did you say?\n")

Once the inner for loop terminates, the guess dictionary is checked for errors. If any occurred, the error message is displayed and the outer for loop is terminated with break, which will end the program execution.

If there weren’t any errors, the transcription is compared to the randomly selected word. The lower() method for string objects is used to ensure better matching of the guess to the chosen word. The API may return speech matched to the word “apple” as “Apple” or “apple,” and either response should count as a correct answer.

If the guess was correct, the user wins and the game is terminated. If the user was incorrect and has any remaining attempts, the outer for loop repeats and a new guess is retrieved. Otherwise, the user loses the game.

    guess_is_correct = guess["transcription"].lower() == word.lower()
    user_has_more_attempts = i < NUM_GUESSES - 1

    if guess_is_correct:
        print('Correct! You win!'.format(word))
        break
    elif user_has_more_attempts:
        print('Incorrect. Try again.\n')
    else:
        print("Sorry, you lose!\nI was thinking of '{}'.".format(word))
        break

When run, the output will look something like this:
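
The sample output is not reproduced here. As a stand-in, the game loop can be dry-run without a microphone by feeding it scripted responses. In this sketch, `play` and `get_guess` are hypothetical names, the response dictionaries mimic what recognize_speech_from_mic() returns, and the "You said" and "ERROR" messages are assumed wording rather than quotes from the original script.

```python
NUM_GUESSES = 3
PROMPT_LIMIT = 5


def play(word, get_guess, num_guesses=NUM_GUESSES, prompt_limit=PROMPT_LIMIT):
    """Run the guessing loop; get_guess() returns a response dict. True means a win."""
    for i in range(num_guesses):
        # Re-prompt up to prompt_limit times while the speech is unintelligible
        for j in range(prompt_limit):
            print('Guess {}. Speak!'.format(i + 1))
            guess = get_guess()
            if guess["transcription"]:
                break
            if not guess["success"]:
                break
            print("I didn't catch that. What did you say?\n")

        # Any remaining error (API failure or repeated unintelligible speech) ends the game
        if guess["error"]:
            print("ERROR: {}".format(guess["error"]))
            return False

        print("You said: {}".format(guess["transcription"]))

        # Compare case-insensitively, so "Lemon" matches "lemon"
        if guess["transcription"].lower() == word.lower():
            print("Correct! You win!")
            return True
        elif i < num_guesses - 1:
            print("Incorrect. Try again.\n")
        else:
            print("Sorry, you lose!\nI was thinking of '{}'.".format(word))
    return False


# Scripted stand-ins for three microphone readings
scripted = iter([
    {"success": True, "error": "Unable to recognize speech", "transcription": None},
    {"success": True, "error": None, "transcription": "banana"},
    {"success": True, "error": None, "transcription": "Lemon"},
])
won = play("lemon", lambda: next(scripted))
```

With these scripted responses, the run prints one "I didn't catch that" warning, reports "banana" as incorrect, and then accepts "Lemon" as a match for "lemon". In the real game, `get_guess` would be a call to recognize_speech_from_mic() with the Recognizer and Microphone instances.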

Recap and Additional Resources

In this tutorial, you’ve seen how to install the SpeechRecognition package and use its Recognizer class to easily recognize speech from both a file—using record()—and microphone input—using listen(). You also saw how to process segments of an audio file using the offset and duration keyword arguments of the record() method.

You’ve seen the effect noise can have on the accuracy of transcriptions, and have learned how to adjust a Recognizer instance’s sensitivity to ambient noise with adjust_for_ambient_noise(). You have also learned which exceptions a Recognizer instance may throw (RequestError for bad API requests and UnknownValueError for unintelligible speech) and how to handle these with try...except blocks.

Speech recognition is a deep subject, and what you have learned here barely scratches the surface. If you’re interested in learning more, here are some additional resources.

For more information on the SpeechRecognition package:

A few interesting internet resources:

Behind the Mic: The Science of Talking with Computers. A short film about speech processing by Google.
A Historical Perspective of Speech Recognition by Huang, Baker and Reddy. Communications of the ACM (2014). This article provides an in-depth and scholarly look at the evolution of speech recognition technology.
The Past, Present and Future of Speech Recognition Technology by Clark Boyd at The Startup. This blog post presents an overview of speech recognition technology, with some thoughts about the future.

Some good books about speech recognition:

The Voice in the Machine: Building Computers That Understand Speech, Pieraccini, MIT Press (2012). An accessible general-audience book covering the history of, as well as modern advances in, speech processing.
Fundamentals of Speech Recognition, Rabiner and Juang, Prentice Hall (1993). Rabiner, a researcher at Bell Labs, was instrumental in designing some of the first commercially viable speech recognizers. This book is now over 20 years old, but a lot of the fundamentals remain the same.
Automatic Speech Recognition: A Deep Learning Approach, Yu and Deng, Springer (2014). Yu and Deng are researchers at Microsoft and both very active in the field of speech processing. This book covers a lot of modern approaches and cutting-edge research but is not for the mathematically faint-of-heart.

Appendix: Recognizing Speech in Languages Other Than English

Throughout this tutorial, we’ve been recognizing speech in English, which is the default language for each recognize_*() method of the SpeechRecognition package. However, it is entirely possible to recognize speech in other languages, and doing so is quite simple.

To recognize speech in a different language, set the language keyword argument of the recognize_*() method to a string corresponding to the desired language. Most of the methods accept a BCP-47 language tag, such as 'en-US' for American English, or 'fr-FR' for French. For example, the following recognizes French speech in an audio file:

    import speech_recognition as sr

    r = sr.Recognizer()

    with sr.AudioFile('path/to/audiofile.wav') as source:
        audio = r.record(source)

    r.recognize_google(audio, language='fr-FR')

Only the following methods accept a language keyword argument:

recognize_bing()
recognize_google()
recognize_google_cloud()
recognize_ibm()
recognize_sphinx()

To find out which language tags are supported by the API you are using, you’ll have to consult the corresponding documentation. A list of tags accepted by recognize_google() can be found in this Stack Overflow answer.
