20190509

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
Paper download link

Paper reading notes:

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. website
LDC was formed in 1992 to address the critical data shortage then facing language technology research and development.
The Advanced Research Projects Agency provided seed funding for the Consortium and the National Science Foundation provided additional support via Grant IRI-9528587 from the Information and Intelligent Systems division.
Initially, LDC’s primary role was as a repository and distribution point for language resources.
Since that time, and with the help of its members, LDC has grown into an organization that creates and distributes a wide array of language resources.
LDC also supports sponsored research programs and language-based technology evaluations by providing resources and contributing organizational expertise.
LDC is hosted by the University of Pennsylvania and is a center within the University’s School of Arts and Sciences.
LDC’s connection with Penn provides a strong foundation for the Consortium’s research and outreach to an active and diverse member community.

Creative Commons Attribution 4.0 International license
website (a Chinese version is available)
What we do
What is Creative Commons?

Creative Commons helps you legally share your knowledge and creativity to build a more equitable, accessible, and innovative world.
(Note: Creative Commons here is the name of the organization.)
We unlock the full potential of the internet to drive a new era of development, growth and productivity.
With a network of staff, board, and affiliates around the world, Creative Commons provides free, easy-to-use copyright licenses to make a simple and standardized way to give the public permission to share and use your creative work–on conditions of your choice.
One goal of Creative Commons is to increase the amount of openly licensed creativity in “the commons” — the body of work freely available for legal use, sharing, repurposing, and remixing.
Through the use of CC licenses, millions of people around the world have made their photos, videos, writing, music, and other creative content available for any member of the public to use.
Today CC Search comes out of beta, with over 300 million images indexed from multiple collections, a major redesign, and faster, more relevant search.
It’s the result of a huge amount of work from the engineering team at Creative Commons and our community of volunteer developers.
Last week the European Commission announced it has adopted CC BY 4.0 and CC0 to share published documents, including photos, videos, reports, peer-reviewed studies, and data.
The Commission joins other public institutions around the world that use standard, legally interoperable tools like Creative Commons licenses and public domain tools to share a wide range of …
From January to April 2019, Creative Commons hosted three CC Certificate courses and a Facilitators course to train the next cohort of Certificate instructors.
Participants from Australia, Qatar, South Africa, Egypt, Indonesia, Canada, Argentina, United Kingdom, Colombia, Spain, Mexico, Denmark, New Zealand, Sweden, Taiwan, Hong Kong, and United States engaged in rigorous readings, assignments, discussions …

Mozilla Common Voice
website
Chinese-language introduction
Chinese-language introduction 2

LibriSpeech
The paper introducing this dataset
This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems.
The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.
We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models.
We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
We are also releasing Kaldi scripts that make it easy to build these systems.

What is FLAC? here
FLAC stands for Free Lossless Audio Codec. It is a well-known free audio compression codec whose defining feature is lossless compression. Unlike lossy codecs such as MP3 and AAC, it discards none of the original audio information, so it can reproduce the audio quality of a music CD. Since 2012 it has been supported by many software and hardware audio products.

TIDIGITS
website
A large speech database has been collected for use in designing and evaluating algorithms for speaker-independent recognition of connected digit sequences.
This dialectically balanced database consists of more than 25 thousand digit sequences spoken by over 300 men, women, and children.
The data were collected in a quiet environment and digitized at 20 kHz.
Formal human listening tests on this database provided certification of the labelling of the digit sequences, and also provided information about human recognition performance and the inherent recognizability of the data.

NIST SPHERE is an audio file format that is difficult to decode with modern software.
NIST:
The National Institute of Standards and Technology (NIST) is an agency of the U.S. Department of Commerce. It carries out basic and applied research in physics, biology, and engineering, as well as research on measurement techniques and test methods, and provides standards, standard reference data, and related services. It enjoys a strong international reputation.

CHiME (Computational Hearing in Multisource Environments)
News: iFLYTEK sweeps all four tracks of the 5th CHiME Challenge (CHiME-5)
The organizing committee of the CHiME challenge announced the results of the latest edition, CHiME-5, at the Microsoft Hyderabad R&D center. The iFLYTEK team once again won all four tracks of the challenge and substantially improved on the previous best results for each track.
CHiME (Computational Hearing in Multisource Environments) is one of the harder international speech recognition evaluations. First held in 2011, it was founded by well-known research institutions including Inria in France, the University of Sheffield in the UK, and Mitsubishi Electric Research Laboratories in the US. The aim of the challenge is for academia and industry to propose new speech recognition solutions for realistic scenarios affected by heavy noise and reverberation, to further improve the practicality and generality of speech recognition. CHiME has now been held five times and has become the most influential, most-entered, and highest-level multichannel noise-robust speech recognition challenge.

Interspeech 2019 Computational Paralinguistics Challenge (ComParE)
website
The Interspeech 2019 Computational Paralinguistics ChallengE (ComParE) is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties.
There have so far been ten consecutive Challenges at INTERSPEECH since 2009 (cf. the repository), but there still exists a multiplicity of not yet covered, but highly relevant paralinguistic phenomena.
Thus, we introduce four new tasks by the Styrian Dialects Sub-Challenge, the Continuous Sleepiness Sub-Challenge, the Baby Sounds Sub-Challenge, and the Orca Activity Sub-Challenge.
For the tasks, the data are provided by the organisers.

Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant
website
The “Hey Siri” feature allows users to invoke Siri hands-free.
A very small speech recognizer runs all the time and listens for just those two words.
When it detects “Hey Siri”, the rest of Siri parses the following speech as a command or query.
The “Hey Siri” detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds.
It then uses a temporal integration process to compute a confidence score that the phrase you uttered was “Hey Siri”.
If the score is high enough, Siri wakes up.
This article takes a look at the underlying technology.
It is aimed primarily at readers who know something of machine learning but less about speech recognition.

Being able to use Siri without pressing buttons is particularly useful when hands are busy, such as when cooking or driving, or when using the Apple Watch.
As Figure 1 shows, the whole system has several parts.
Most of the implementation of Siri is “in the Cloud”, including the main automatic speech recognition, the natural language interpretation and the various information services.
There are also servers that can provide updates to the acoustic models used by the detector.
This article concentrates on the part that runs on your local device, such as an iPhone or Apple Watch.
In particular, it focusses on the detector: a specialized speech recognizer which is always listening just for its wake-up phrase (on a recent iPhone with the “Hey Siri” feature enabled).

The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second.
A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, plus silence and other speech, for a total of about 20 sound classes.
See Figure 2.
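
As a rough illustration of this front end, here is a minimal sketch in Python (using librosa; the hop length, FFT size, and number of mel channels are assumptions, since the article does not give exact settings):

```python
# Sketch of the front end: waveform -> ~0.01 s frames -> stacked DNN inputs.
# Parameter values here are illustrative assumptions, not Apple's settings.
import numpy as np
import librosa

SR = 16000      # 16,000 waveform samples per second, as stated above
HOP = 160       # one frame per ~0.01 s
N_MELS = 13     # assumed number of mel filter-bank channels
WINDOW = 20     # about twenty frames (~0.2 s) fed to the DNN at a time

def frames_from_waveform(y):
    """Waveform -> sequence of log mel filter-bank frames."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=400, hop_length=HOP, n_mels=N_MELS)
    return np.log(mel + 1e-6).T          # shape: (num_frames, N_MELS)

def dnn_inputs(frames):
    """Stack WINDOW consecutive frames into one input vector per step."""
    for t in range(len(frames) - WINDOW + 1):
        yield frames[t:t + WINDOW].reshape(-1)   # (WINDOW * N_MELS,)
```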

The DNN consists mostly of matrix multiplications and logistic nonlinearities.
Each “hidden” layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes.
The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.
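
A toy version of that forward pass, assuming five fully connected hidden layers with a logistic (sigmoid) nonlinearity and a log-softmax output so the scores come out as log probabilities:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_softmax(x):
    x = x - x.max()                # subtract max for numerical stability
    return x - np.log(np.exp(x).sum())

def dnn_forward(x, weights, biases):
    """weights/biases: one pair per hidden layer plus the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)     # matrix multiplication + logistic
    return log_softmax(weights[-1] @ h + biases[-1])   # ~20 log scores
```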

Figure 2.
The Deep Neural Network used to detect “Hey Siri.”
The hidden layers are actually fully connected.
The top layer performs temporal integration.
The actual DNN is indicated by the dashed box.
(figure omitted)
We choose the number of units in each hidden layer of the DNN to fit the computational resources available when the “Hey Siri” detector runs.
Networks we use typically have five hidden layers, all the same size: 32, 128, or 192 units depending on the memory and power constraints.
On iPhone we use two networks—one for initial detection and another as a secondary checker.
The initial detector uses fewer units than the secondary checker.

The output of the acoustic model provides a distribution of scores over phonetic classes for every frame.
A phonetic class is typically something like “the first part of an /s/ preceded by a high front vowel and followed by a front vowel.”

We want to detect “Hey Siri” if the outputs of the acoustic model are high in the right sequence for the target phrase.
To produce a single score for each frame we accumulate those local values in a valid sequence over time.
This is indicated in the final (top) layer of Figure 2 as a recurrent network with connections to the same unit and the next in sequence.
Inside each unit there is a maximum operation and an add:
F_{i,t} = max( F_{i,t-1} + s_i , F_{i-1,t-1} + m_{i-1} ) + q_{i,t}        (Equation 1)

F_{i,t} is the accumulated score for state i of the model
q_{i,t} is the output of the acoustic model—the log score for the phonetic class associated with the ith state given the acoustic pattern around time t
s_i is a cost associated with staying in state i
m_i is a cost for moving on from state i

Both s_i and m_i are based on analysis of durations of segments with the relevant labels in the training data.
(This procedure is an application of dynamic programming, and can be derived based on ideas about Hidden Markov Models—HMMs.)
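
A minimal sketch of this dynamic programming, using the names defined above (q holds one row of acoustic-model log scores per frame; handling of the silence and other-speech states is glossed over here):

```python
import numpy as np

def temporal_integration(q, s, m):
    """q: (T, N) log scores; s: stay costs; m: move-on costs."""
    T, N = q.shape
    F = np.full((T, N), -np.inf)
    F[0, 0] = q[0, 0]              # a path must begin in the first state
    for t in range(1, T):
        for i in range(N):
            stay = F[t - 1, i] + s[i]
            move = F[t - 1, i - 1] + m[i - 1] if i > 0 else -np.inf
            F[t, i] = max(stay, move) + q[t, i]
    return F[:, N - 1]             # F_{I,t}: score at the last state, per frame
```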

Figure 3.
Visual depiction of the equation
(figure omitted)
Each accumulated score F_{i,t} is associated with a labelling of previous frames with states, as given by the sequence of decisions by the maximum operation.
The final score at each frame is F_{I,t}, where the last state of the phrase is state I and there are N frames in the sequence of frames leading to that score.
(N could be found by tracing back through the sequence of max decisions, but is actually done by propagating forwards the number of frames since the path entered the first state of the phrase.)
Almost all the computation in the “Hey Siri” detector is in the acoustic model.
The temporal integration computation is relatively cheap, so we disregard it when assessing size or computational resources.

You may get a better idea of how the detector works by looking at Figure 4, which shows the acoustic signal at various stages, assuming that we are using the smallest DNN.
At the very bottom is a spectrogram of the waveform from the microphone.
In this case, someone is saying “Hey Siri what …” The brighter parts are the loudest parts of the phrase.
The Hey Siri pattern is between the vertical blue lines.

(Pasting in the figures is too much trouble; see the original article for the images.)
The second horizontal strip up from the bottom shows the result of analyzing the same waveform with a mel filter bank, which gives weight to frequencies based on perceptual measurements.
This conversion also smooths out the detail that is visible in the spectrogram and due to the fine-structure of the excitation of the vocal tract: either random, as in the /s/, or periodic, seen here as vertical striations.
The alternating green and blue horizontal strips labelled H1 to H5 show the numerical values (activations) of the units in each of the five hidden layers.
The 32 hidden units in each layer have been arranged for this figure so as to put units with similar outputs together.

The next strip up (with the yellow diagonal) shows the output of the acoustic model.
At each frame there is one output for each position in the phrase, plus others for silence and other speech sounds.
The final score, shown at the top, is obtained by adding up the local scores along the bright diagonal according to Equation 1.
Note that the score rises to a peak just after the whole phrase enters the system.
We compare the score with a threshold to decide whether to activate Siri.
In fact the threshold is not a fixed value.
We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations.
There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri.
If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event.
When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers.
This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time.
(We discuss testing and tuning for accuracy later.)
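
The second-chance logic might look roughly like the following; the threshold values and the length of the sensitive window are invented for illustration, since Apple does not publish them:

```python
import time

NORMAL, LOWER = 0.9, 0.6       # hypothetical normal and lower thresholds
SENSITIVE_SECS = 5.0           # hypothetical length of the sensitive state

sensitive_until = 0.0

def should_trigger(score):
    global sensitive_until
    now = time.monotonic()
    threshold = LOWER if now < sensitive_until else NORMAL
    if score >= threshold:
        return True                                # wake Siri
    if score >= LOWER:                             # possible missed trigger:
        sensitive_until = now + SENSITIVE_SECS     # lower the bar briefly
    return False
```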

Responsiveness and Power: Two Pass Detection

The “Hey Siri” detector not only has to be accurate, but it needs to be fast and not have a significant effect on battery life.
We also need to minimize memory use and processor demand—particularly peak processor demand.
To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later).
We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN).
When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN.
In the first versions with AOP support, the first detector used a DNN with 5 layers of 32 hidden units and the second detector had 5 layers of 192 hidden units.
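
In outline, the cascade behaves like this sketch, where small_dnn and large_dnn stand in for detectors built on the 5x32 and 5x192 networks (both threshold values are unspecified in the article):

```python
def two_pass_detect(audio_window, small_dnn, large_dnn,
                    first_threshold, second_threshold):
    first_score = small_dnn(audio_window)    # cheap, runs on the AOP
    if first_score < first_threshold:
        return False                         # main processor stays asleep
    second_score = large_dnn(audio_window)   # main processor re-checks
    return second_score >= second_threshold
```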

Apple Watch presents some special challenges because of the much smaller battery.
Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices.
The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on.
At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget.
It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.

“Hey Siri” Personalized
We designed the always-on “Hey Siri” detector to respond whenever anyone in the vicinity says the trigger phrase.
To reduce the annoyance of false triggers, we invite the user to go through a short enrollment session.
During enrollment, the user says five phrases that each begin with “Hey Siri.” We save these examples on the device.
We compare any possible new “Hey Siri” utterance with the stored examples as follows.
The (second-pass) detector produces timing information that is used to convert the acoustic pattern into a fixed-length vector, by taking the average over the frames aligned to each state.
A separate, specially trained DNN transforms this vector into a “speaker space” where, by design, patterns from the same speaker tend to be close, whereas patterns from different speakers tend to be further apart.
We compare the distances to the reference patterns created during enrollment with another threshold to decide whether the sound that triggered the detector is likely to be “Hey Siri” spoken by the enrolled user.
This process not only reduces the probability that “Hey Siri” spoken by another person will trigger the iPhone, but also reduces the rate at which other, similar-sounding phrases trigger Siri.
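
A sketch of that comparison, assuming a speaker_dnn function for the speaker-space transform and Euclidean distance (the article does not specify the distance measure):

```python
import numpy as np

def utterance_vector(frames, alignment, num_states):
    """Average the frames aligned to each state, then concatenate.
    Assumes every state has at least one aligned frame."""
    parts = [frames[alignment == i].mean(axis=0) for i in range(num_states)]
    return np.concatenate(parts)

def accept_speaker(frames, alignment, num_states,
                   speaker_dnn, enrolled_vectors, threshold):
    v = speaker_dnn(utterance_vector(frames, alignment, num_states))
    distances = [np.linalg.norm(v - e) for e in enrolled_vectors]
    return min(distances) <= threshold   # near an enrollment example?
```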

Further Checks
If the various stages on the iPhone pass it on, the waveform arrives at the Siri server.
If the main speech recognizer hears it as something other than “Hey Siri” (for example “Hey Seriously”) then the server sends a cancellation signal to the phone to put it back to sleep, as indicated in Fig 1.
On some systems we run a cut-down version of the main recognizer on the device to provide an extra check earlier.

The Acoustic Model: Training
The DNN acoustic model is at the heart of the “Hey Siri” detector.
So let’s take a look at how we trained it.
Well before there was a Hey Siri feature, a small proportion of users would say “Hey Siri” at the start of a request, having started by pressing the button.
We used such “Hey Siri” utterances for the initial training set for the US English detector model.
We also included general speech examples, as used for training the main speech recognizer.
In both cases, we used automatic transcription on the training phrases.
Siri team members checked a subset of the transcriptions for accuracy.
We created a language-specific phonetic specification of the “Hey Siri” phrase.
In US English, we had two variants, with different first vowels in “Siri”—one as in “serious” and the other as in “Syria.” We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: “Hey, Siri.” Each phonetic symbol results in three speech sound classes (beginning, middle and end), each of which has its own output from the acoustic model.
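
The expansion into sound classes can be pictured with a sketch like this (the phone list is a placeholder, not Apple's actual phonetic specification):

```python
PHONES = ["h", "ey", "s", "ih", "r", "iy"]    # illustrative only

def sound_classes(phones):
    classes = ["silence"]                      # initial silence
    for p in phones:                           # beginning, middle, end
        classes += [f"{p}_begin", f"{p}_mid", f"{p}_end"]
    classes.append("other_speech")             # one class for everything else
    return classes

# len(sound_classes(PHONES)) == 20, matching "about twenty" below.
```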

We used a corpus of speech to train the DNN for which the main Siri recognizer provided a sound class label for each frame.
There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), and one large class for everything else.
The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern.
The training process adjusts the weights using standard back-propagation and stochastic gradient descent.
We have used a variety of neural network training software toolkits, including Theano, Tensorflow, and Kaldi.
This training process produces estimates of the probabilities of the phones and states given the local acoustic observations, but those estimates include the frequencies of the phones in the training set (the priors), which may be very uneven, and have little to do with the circumstances in which the detector will be used, so we compensate for the priors before the acoustic model outputs are used.
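
In the log domain that compensation is just a subtraction; a sketch, assuming the priors are estimated from class frequencies in the training set:

```python
import numpy as np

def compensate_priors(log_posteriors, class_counts):
    """Divide posteriors by class priors (a subtraction of logs)."""
    log_priors = np.log(class_counts / class_counts.sum())
    return log_posteriors - log_priors   # scores fed to temporal integration
```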

Training one model takes about a day, and there are usually a few models in training at any one time.
We generally train three versions: a small model for the first pass on the motion coprocessor, a larger-size model for the second pass, and a medium-size model for Apple Watch.
“Hey Siri” works in all languages that Siri supports, but “Hey Siri” isn’t necessarily the phrase that starts Siri listening.
For instance, French-speaking users need to say “Dis Siri” while Korean-speaking users say “Siri 야” (Sounds like “Siri Ya.”) In Russian it is “привет Siri” (Sounds like “Privet Siri”), and in Thai “หวัดดี Siri”. (Sounds like “Wadi Siri”.)

Testing and Tuning
An ideal detector would fire whenever the user says “Hey Siri,” and not fire at other times.
We describe the accuracy of the detector in terms of two kinds of error: firing at the wrong time, and failing to fire at the right time.
The false-accept rate (FAR or false-alarm rate), is the number of false activations per hour (or mean hours between activations) and the false-reject rate (FRR) is the proportion of attempted activations that fail.
(Note that the units we use to measure FAR are not the same as those we use for FRR.
Even the dimensions are different.
So there is no notion of an equal error rate.)
For a given model we can change the balance between the two kinds of error by changing the activation threshold.
Figure 6 shows examples of this trade-off, for two sizes of early-development models.
Changing the threshold moves along the curve.
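
A sketch of how such a trade-off curve can be traced out, given detector scores for a positive test set and for many hours of negative recordings (the variable names are ours):

```python
import numpy as np

def trade_off_curve(positive_scores, negative_scores,
                    negative_hours, thresholds):
    points = []
    for th in thresholds:
        frr = np.mean(positive_scores < th)          # fraction of misses
        far = np.sum(negative_scores >= th) / negative_hours  # alarms/hour
        points.append((far, frr))                    # one point per threshold
    return points
```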

During development we try to estimate the accuracy of the system by using a large test set, which is quite expensive to collect and prepare, but essential.
There is “positive” data and “negative” data.
The “positive” data does contain the target phrase.
You might think that we could use utterances picked up by the “Hey Siri” system, but the system doesn’t capture the attempts that failed to trigger, and we want to improve the system to include as many of such failed attempts as possible.
At first we used the utterances of “Hey Siri” that some users said as they pressed the Home button, but these users are not attempting to catch Siri’s attention (the button does that), and the microphone is bound to be within arm’s reach, whereas we also want “Hey Siri” to work across a room.
We made recordings specially in various conditions, such as in the kitchen (both close and far), car, bedroom, and restaurant, by native speakers of each language.

We use the “negative” data to test for false activations (and false wakes).
The data represent thousands of hours of recordings, from various sources, including podcasts and non-“Hey Siri” inputs to Siri in many languages, to represent both background sounds (especially speech) and the kinds of phrases that a user might say to another person.
We need such a lot of data because we are trying to estimate false-alarm rates as low as one per week.
(If there are any occurrences of the target phrase in the negative data we label them as such, so that we do not count responses to them as errors.)

Figure 6.
Detector accuracy.
Trade-offs against detection threshold for small and larger DNNs

Tuning is largely a matter of deciding what thresholds to use.
In Figure 6, the two dots on the lower trade-off curve for the larger model show possible normal and second-chance thresholds.
The operating point for the smaller (first-pass) model would be at the right-hand side.
These curves are just for the two stages of the detector, and do not include the personalized stage or subsequent checks.
While we are confident that models that appear to perform better on the test set probably are really better, it is quite difficult to convert offline test results into useful predictions of the experience of users.
So in addition to the offline measurements described previously, we estimate false-alarm rates (when Siri turns on without the user saying “Hey Siri”) and imposter-accept rates (when Siri turns on when someone other than the user who trained the detector says “Hey Siri”) weekly by sampling from production data, on the latest iOS devices and Apple Watch.
This does not give us rejection rates (when the system fails to respond to a valid “Hey Siri”) but we can estimate rejection rates from the proportion of activations just above the threshold that are valid, and a sampling of just-below threshold events on devices carried by development staff.
We continually evaluate and improve “Hey Siri,” and the model that powers it, by training and testing using variations of the approach described here.
We train in many different languages and test under a wide range of conditions.
Next time you say “Hey Siri” you may think of all that goes on to make responding to that phrase happen, but we hope that it “just works!”
