The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories.
LDC was formed in 1992 to address the critical data shortage then facing language technology research and development.
The Advanced Research Projects Agency provided seed funding for the Consortium and the National Science Foundation provided additional support via Grant IRI-9528587 from the Information and Intelligent Systems division.
Initially, LDC’s primary role was as a repository and distribution point for language resources.
Since that time, and with the help of its members, LDC has grown into an organization that creates and distributes a wide array of language resources.
LDC also supports sponsored research programs and language-based technology evaluations by providing resources and contributing organizational expertise.
LDC is hosted by the University of Pennsylvania and is a center within the University’s School of Arts and Sciences.
LDC’s connection with Penn provides a strong foundation for the Consortium’s research and outreach to an active and diverse member community.

Creative Commons Attribution 4.0 International license
What we do
What is Creative Commons?

Creative Commons helps you legally share your knowledge and creativity to build a more equitable, accessible, and innovative world.
(Note: Creative Commons is the name of an organization.)
We unlock the full potential of the internet to drive a new era of development, growth and productivity.
With a network of staff, board, and affiliates around the world, Creative Commons provides free, easy-to-use copyright licenses to make a simple and standardized way to give the public permission to share and use your creative work–on conditions of your choice.
One goal of Creative Commons is to increase the amount of openly licensed creativity in “the commons” — the body of work freely available for legal use, sharing, repurposing, and remixing.
Through the use of CC licenses, millions of people around the world have made their photos, videos, writing, music, and other creative content available for any member of the public to use.
Today CC Search comes out of beta, with over 300 million images indexed from multiple collections, a major redesign, and faster, more relevant search.
It’s the result of a huge amount of work from the engineering team at Creative Commons and our community of volunteer developers.
Last week the European Commission announced it has adopted CC BY 4.0 and CC0 to share published documents, including photos, videos, reports, peer-reviewed studies, and data.
The Commission joins other public institutions around the world that use standard, legally interoperable tools like Creative Commons licenses and public domain tools to share a wide range of
From January to April 2019, Creative Commons hosted three CC Certificate courses and a Facilitators course to train the next cohort of Certificate instructors.
Participants from Australia, Qatar, South Africa, Egypt, Indonesia, Canada, Argentina, United Kingdom, Colombia, Spain, Mexico, Denmark, New Zealand, Sweden, Taiwan, Hong Kong, and United States engaged in rigorous readings, assignments, discussions

This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems.
The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.
We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models.
We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
We are also releasing Kaldi scripts that make it easy to build these systems.

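FLAC (Free Lossless Audio Codec) is a well-known free audio compression codec whose defining feature is lossless compression. Unlike lossy codecs such as MP3 and AAC, it does not discard any of the original audio information, so the original CD-quality audio can be restored exactly. Since 2012 it has been supported by many software and hardware audio products (such as CD players).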

A large speech database has been collected for use in designing and evaluating algorithms for speaker-independent recognition of connected digit sequences.
This dialectically balanced database consists of more than 25 thousand digit sequences spoken by over 300 men, women, and children.
The data were collected in a quiet environment and digitized at 20 kHz.
Formal human listening tests on this database provided certification of the labelling of the digit sequences, and also provided information about human recognition performance and the inherent recognizability of the data.

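NIST SPHERE is an audio file format that is relatively hard to decode with modern software.
The National Institute of Standards and Technology (NIST) is an agency of the U.S. Department of Commerce. It conducts basic and applied research in physics, biology, and engineering, as well as research on measurement techniques and test methods. It provides standards, standard reference data, and related services, and enjoys a strong international reputation.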
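Part of the difficulty with SPHERE is that the file begins with a plain-text header that general-purpose audio tools often do not recognize. The sketch below reads only that header. It assumes the commonly described layout: a `NIST_1A` line, a line giving the header size in bytes, then `name -type value` lines ending with `end_head`. The function name `parse_sphere_header` and the demo bytes are illustrative, and compressed sample data (for example shorten-encoded files) is not handled.

```python
import io

def parse_sphere_header(stream):
    """Parse the text header of a SPHERE file from a binary stream."""
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().strip())  # total header bytes, e.g. 1024
    stream.seek(0)
    header = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, type_tag, value = line.split(None, 2)
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return header_size, fields

# Synthetic header padded with spaces to 1024 bytes (not a real recording).
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\nchannel_count -i 1\n"
        b"sample_coding -s3 pcm\nend_head\n").ljust(1024)
size, fields = parse_sphere_header(io.BytesIO(demo))
print(size, fields["sample_rate"], fields["sample_coding"])
```

The audio samples start at byte offset `header_size`; how to decode them depends on fields such as `sample_coding`.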

CHiME(Computational Hearing in Multisource Environments)
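CHiME (Computational Hearing in Multisource Environments) is a highly challenging international speech recognition evaluation. It was first held in 2011 and was initiated by well-known research institutions including the French Institute for Research in Computer Science and Automation (Inria), the University of Sheffield in the UK, and Mitsubishi Electric Research Laboratories in the US. The competition aims to encourage academia and industry to propose new speech recognition solutions for real-world scenarios affected by high noise and reverberation, in order to further improve the practicality and general applicability of speech recognition. CHiME has been held five times so far and has become the most influential multi-channel noise-robust speech recognition competition in the field, with the most participating teams and the highest level of performance.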

Interspeech 2019 Computational Paralinguistics Challenge (ComParE)
The Interspeech 2019 Computational Paralinguistics ChallengE (ComParE) is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties.
There have so far been ten consecutive Challenges at INTERSPEECH since 2009 (cf. the repository), but there still exists a multiplicity of not yet covered, but highly relevant paralinguistic phenomena.
Thus, we introduce four new tasks: the Styrian Dialects Sub-Challenge, the Continuous Sleepiness Sub-Challenge, the Baby Sounds Sub-Challenge, and the Orca Activity Sub-Challenge.
For the tasks, the data are provided by the organisers.

Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant
The “Hey Siri” feature allows users to invoke Siri hands-free.
A very small speech recognizer runs all the time and listens for just those two words.
When it detects “Hey Siri”, the rest of Siri parses the following speech as a command or query.
The “Hey Siri” detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds.
It then uses a temporal integration process to compute a confidence score that the phrase you uttered was “Hey Siri”.
If the score is high enough, Siri wakes up.
This article takes a look at the underlying technology.
It is aimed primarily at readers who know something of machine learning but less about speech recognition.

Being able to use Siri without pressing buttons is particularly useful when hands are busy, such as when cooking or driving, or when using the Apple Watch.
As Figure 1 shows, the whole system has several parts.
Most of the implementation of Siri is “in the Cloud”, including the main automatic speech recognition, the natural language interpretation and the various information services.
There are also servers that can provide updates to the acoustic models used by the detector.
This article concentrates on the part that runs on your local device, such as an iPhone or Apple Watch.
In particular, it focusses on the detector: a specialized speech recognizer which is always listening just for its wake-up phrase (on a recent iPhone with the “Hey Siri” feature enabled).

The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second.
A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, plus silence and other speech, for a total of about 20 sound classes.
See Figure 2.
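The numbers in this paragraph can be checked with a little arithmetic. The sketch below is only an illustration of the bookkeeping: it cuts the sample stream into non-overlapping 0.01 s frames and groups 20 consecutive frames into one DNN input. The non-overlapping framing, the one-frame window step, and the helper names are assumptions, and the spectrum analysis itself is omitted.

```python
SAMPLE_RATE = 16000      # samples per second (from the text)
FRAME_SECONDS = 0.01     # approximate frame duration (from the text)
FRAMES_PER_WINDOW = 20   # about 0.2 s of audio per DNN input (from the text)

def split_into_frames(samples, frame_len):
    """Cut a sample list into consecutive non-overlapping frames, dropping any remainder."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def sliding_windows(frames, width):
    """Return every run of `width` consecutive frames, advancing one frame at a time."""
    return [frames[i:i + width] for i in range(len(frames) - width + 1)]

frame_len = round(SAMPLE_RATE * FRAME_SECONDS)   # samples per frame
one_second = [0.0] * SAMPLE_RATE                  # placeholder audio: one second of silence
frames = split_into_frames(one_second, frame_len)
windows = sliding_windows(frames, FRAMES_PER_WINDOW)
print(frame_len, len(frames), len(windows))
```

Under these assumptions, one second of audio gives 160-sample frames, 100 frames, and 81 windows of 20 frames each.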

The DNN consists mostly of matrix multiplications and logistic nonlinearities.
Each “hidden” layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes.
The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.
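One way to read the remark about log probabilities: the log of a softmax output is the input value minus a single shared normalizer, so there is no need to exponentiate and divide for every class. The sketch below is a generic log-softmax, not the article's implementation. The function name and input values are illustrative.

```python
import math

def log_softmax(logits):
    """Return log probabilities: z_k - log(sum_j exp(z_j)), computed stably."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

log_probs = log_softmax([2.0, 1.0, 0.1])
print(log_probs)
```

Exponentiating the results recovers probabilities that sum to one.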

Figure 2.
The Deep Neural Network used to detect “Hey Siri.”
The hidden layers are actually fully connected.
The top layer performs temporal integration.
The actual DNN is indicated by the dashed box.
We choose the number of units in each hidden layer of the DNN to fit the computational resources available when the “Hey Siri” detector runs.
Networks we use typically have five hidden layers, all the same size: 32, 128, or 192 units depending on the memory and power constraints.
On iPhone we use two networks—one for initial detection and another as a secondary checker.
The initial detector uses fewer units than the secondary checker.

The output of the acoustic model provides a distribution of scores over phonetic classes for every frame.
A phonetic class is typically something like “the first part of an /s/ preceded by a high front vowel and followed by a front vowel.”

We want to detect “Hey Siri” if the outputs of the acoustic model are high in the right sequence for the target phrase.
To produce a single score for each frame we accumulate those local values in a valid sequence over time.
This is indicated in the final (top) layer of Figure 2 as a recurrent network with connections to the same unit and the next in sequence.
Inside each unit there is a maximum operation and an add:

$$F_{i,t} = \max\{\, s_i + F_{i,t-1},\; m_{i-1} + F_{i-1,t-1} \,\} + q_{i,t}$$

where

$F_{i,t}$ is the accumulated score for state $i$ of the model
$q_{i,t}$ is the output of the acoustic model: the log score for the phonetic class associated with the $i$th state given the acoustic pattern around time $t$
$s_i$ is a cost associated with staying in state $i$
$m_i$ is a cost for moving on from state $i$

Both $s_i$ and $m_i$ are based on analysis of durations of segments with the relevant labels in the training data.
(This procedure is an application of dynamic programming, and can be derived based on ideas about Hidden Markov Models—HMMs.)
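The accumulation can be sketched as a small dynamic program. The code below assumes the recurrence has the form "best of staying in state i or arriving from state i-1, plus the current log score", with the stay and move costs added as non-positive terms. It also assumes a new path may enter the first state at any frame with a starting score of 0, which the text does not specify. All names and numbers are illustrative.

```python
NEG_INF = float("-inf")

def accumulate_scores(q, stay, move):
    """q[t][i] is the log score of state i at frame t; returns F[t][i] for every frame."""
    n_states = len(stay)
    prev = [NEG_INF] * n_states
    history = []
    for frame in q:
        cur = []
        for i in range(n_states):
            stay_path = stay[i] + prev[i]
            if i == 0:
                enter_path = 0.0  # assumption: a path may start in the first state at any frame
            else:
                enter_path = move[i - 1] + prev[i - 1]
            cur.append(max(stay_path, enter_path) + frame[i])
        history.append(cur)
        prev = cur
    return history

# Two states, three frames: state 0 fits the first two frames, state 1 the last.
q = [[-0.1, -3.0], [-0.2, -2.5], [-3.0, -0.1]]
F = accumulate_scores(q, stay=[-0.5, -0.5], move=[-1.0, -1.0])
print(F[-1][-1])  # accumulated score of the last state at the final frame
```

In this toy input the best path stays in state 0 for the first two frames and moves to state 1 on the third, giving a final score of about -1.3.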

Figure 3.
Visual depiction of the equation
Each accumulated score $F_{i,t}$ is associated with a labelling of previous frames with states, as given by the sequence of decisions by the maximum operation.
The final score at each frame is $F_{I,t}$, where the last state of the phrase is state $I$ and there are $N$ frames in the sequence of frames leading to that score.
(N could be found by tracing back through the sequence of max decisions, but is actually done by propagating forwards the number of frames since the path entered the first state of the phrase.)
Almost all the computation in the “Hey Siri” detector is in the acoustic model.
The temporal integration computation is relatively cheap, so we disregard it when assessing size or computational resources.

You may get a better idea of how the detector works by looking at Figure 4, which shows the acoustic signal at various stages, assuming that we are using the smallest DNN.
At the very bottom is a spectrogram of the waveform from the microphone.
In this case, someone is saying “Hey Siri what …” The brighter parts are the loudest parts of the phrase.
The Hey Siri pattern is between the vertical blue lines.

The second horizontal strip up from the bottom shows the result of analyzing the same waveform with a mel filter bank, which gives weight to frequencies based on perceptual measurements.
This conversion also smooths out the detail that is visible in the spectrogram and due to the fine-structure of the excitation of the vocal tract: either random, as in the /s/, or periodic, seen here as vertical striations.
The alternating green and blue horizontal strips labelled H1 to H5 show the numerical values (activations) of the units in each of the five hidden layers.
The 32 hidden units in each layer have been arranged for this figure so as to put units with similar outputs together.

The next strip up (with the yellow diagonal) shows the output of the acoustic model.
At each frame there is one output for each position in the phrase, plus others for silence and other speech sounds.
The final score, shown at the top, is obtained by adding up the local scores along the bright diagonal according to Equation 1.
Note that the score rises to a peak just after the whole phrase enters the system.
We compare the score with a threshold to decide whether to activate Siri.
In fact the threshold is not a fixed value.
We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations.
There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri.
If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event.
When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers.
This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time.
(We discuss testing and tuning for accuracy later.)
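The second-chance behaviour can be pictured as a small state machine. The sketch below is a guess at the logic, not Apple's implementation: it assumes that while the detector is in the sensitive state, the lower threshold is the one that triggers. The class name, threshold values, and window length are hypothetical.

```python
class SecondChanceTrigger:
    """Two-threshold trigger with a short, more sensitive window after a near miss."""

    def __init__(self, normal_threshold, lower_threshold, sensitive_seconds):
        self.normal = normal_threshold
        self.lower = lower_threshold
        self.window = sensitive_seconds
        self.sensitive_until = float("-inf")

    def update(self, score, now):
        """Return True if this score should trigger at time `now` (in seconds)."""
        threshold = self.lower if now < self.sensitive_until else self.normal
        if score >= threshold:
            self.sensitive_until = float("-inf")  # leave the sensitive state
            return True
        if score >= self.lower:
            self.sensitive_until = now + self.window  # near miss: be more sensitive briefly
        return False

trigger = SecondChanceTrigger(normal_threshold=0.8, lower_threshold=0.5, sensitive_seconds=3.0)
first_try = trigger.update(0.6, now=0.0)    # near miss, no trigger
second_try = trigger.update(0.6, now=1.0)   # same score repeated inside the window
print(first_try, second_try)
```

The same score fails on the first attempt and succeeds on the repeat, which is the effect the text describes.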

Responsiveness and Power: Two Pass Detection

The “Hey Siri” detector not only has to be accurate, but it needs to be fast and not have a significant effect on battery life.
We also need to minimize memory use and processor demand—particularly peak processor demand.
To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later).
We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN).
When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN.
In the first versions with AOP support, the first detector used a DNN with 5 layers of 32 hidden units and the second detector had 5 layers of 192 hidden units.
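The control flow of two-pass detection is simple to sketch: the cheap model runs on everything, and the expensive model runs only on candidates. The scoring functions below are stand-ins (a mean and a maximum), not acoustic models, and the thresholds are hypothetical.

```python
def two_pass_detect(window, first_pass, second_pass, first_threshold, second_threshold):
    """Run the cheap detector first; call the expensive one only on candidates."""
    if first_pass(window) < first_threshold:
        return False  # the main processor stays asleep
    return second_pass(window) >= second_threshold

calls = {"second": 0}  # counts how often the expensive pass runs

def small_model(window):
    return sum(window) / len(window)  # stand-in for the small DNN score

def large_model(window):
    calls["second"] += 1
    return max(window)  # stand-in for the larger DNN score

quiet = two_pass_detect([0.1, 0.2], small_model, large_model, 0.5, 0.9)
loud = two_pass_detect([0.7, 0.95], small_model, large_model, 0.5, 0.9)
print(quiet, loud, calls["second"])
```

Only the second input wakes the expensive pass, so the larger model runs once for the two inputs.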

Apple Watch presents some special challenges because of the much smaller battery.
Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices.
The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on.
At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget.
It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.

“Hey Siri” Personalized
We designed the always-on “Hey Siri” detector to respond whenever anyone in the vicinity says the trigger phrase.
To reduce the annoyance of false triggers, we invite the user to go through a short enrollment session.
During enrollment, the user says five phrases that each begin with “Hey Siri.”
We save these examples on the device.
We compare any possible new “Hey Siri” utterance with the stored examples as follows.
The (second-pass) detector produces timing information that is used to convert the acoustic pattern into a fixed-length vector, by taking the average over the frames aligned to each state.
A separate, specially trained DNN transforms this vector into a “speaker space” where, by design, patterns from the same speaker tend to be close, whereas patterns from different speakers tend to be further apart.
We compare the distances to the reference patterns created during enrollment with another threshold to decide whether the sound that triggered the detector is likely to be “Hey Siri” spoken by the enrolled user.
This process not only reduces the probability that “Hey Siri” spoken by another person will trigger the iPhone, but also reduces the rate at which other, similar-sounding phrases trigger Siri.
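The personalization steps can be sketched in a few lines. The code below averages the frames aligned to each state into one fixed-length vector and then compares it with enrollment vectors. It skips the speaker-space DNN entirely, and the choice of mean Euclidean distance is an assumption, since the text does not say which distance or combination rule is used. All names and values are illustrative.

```python
import math

def state_averages(frames, alignment, n_states):
    """Average the feature vectors aligned to each state and concatenate the means."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(n_states)]
    counts = [0] * n_states
    for vec, state in zip(frames, alignment):
        counts[state] += 1
        for d, v in enumerate(vec):
            sums[state][d] += v
    fixed = []
    for s in range(n_states):
        for d in range(dim):
            fixed.append(sums[s][d] / counts[s] if counts[s] else 0.0)
    return fixed

def is_enrolled_speaker(vector, references, threshold):
    """Accept if the mean Euclidean distance to the enrollment vectors is below threshold."""
    mean_dist = sum(math.dist(vector, ref) for ref in references) / len(references)
    return mean_dist < threshold

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three frames of 2-dimensional features
alignment = [0, 0, 1]                            # state index for each frame
vector = state_averages(frames, alignment, n_states=2)
references = [[2.0, 3.0, 5.0, 6.0], [2.0, 3.0, 5.0, 7.0]]
print(vector, is_enrolled_speaker(vector, references, threshold=1.0))
```

However many frames the utterance has, the vector length is always the number of states times the feature dimension, which is what makes utterances of different durations comparable.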

Further Checks
If the various stages on the iPhone pass it on, the waveform arrives at the Siri server.
If the main speech recognizer hears it as something other than “Hey Siri” (for example “Hey Seriously”) then the server sends a cancellation signal to the phone to put it back to sleep, as indicated in Fig 1.
On some systems we run a cut-down version of the main recognizer on the device to provide an extra check earlier.

The Acoustic Model: Training
The DNN acoustic model is at the heart of the “Hey Siri” detector.
So let’s take a look at how we trained it.
Well before there was a Hey Siri feature, a small proportion of users would say “Hey Siri” at the start of a request, having started by pressing the button.
We used such “Hey Siri” utterances for the initial training set for the US English detector model.
We also included general speech examples, as used for training the main speech recognizer.
In both cases, we used automatic transcription on the training phrases.
Siri team members checked a subset of the transcriptions for accuracy.
We created a language-specific phonetic specification of the “Hey Siri” phrase.
In US English, we had two variants, with different first vowels in “Siri”—one as in “serious” and the other as in “Syria.”
We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: “Hey, Siri.”
Each phonetic symbol results in three speech sound classes (beginning, middle and end) each of which has its own output from the acoustic model.

We used a corpus of speech to train the DNN for which the main Siri recognizer provided a sound class label for each frame.
There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), and one large class for everything else.
The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern.
The training process adjusts the weights using standard back-propagation and stochastic gradient descent.
We have used a variety of neural network training software toolkits, including Theano, TensorFlow, and Kaldi.
This training process produces estimates of the probabilities of the phones and states given the local acoustic observations. Those estimates include the frequencies of the phones in the training set (the priors), which may be very uneven and have little to do with the circumstances in which the detector will be used, so we compensate for the priors before the acoustic model outputs are used.
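The text does not say exactly how the priors are compensated. A common approach in hybrid DNN-HMM systems, assumed here, is to divide each posterior by the class prior, which in the log domain is a subtraction. The function name and the numbers below are illustrative.

```python
import math

def compensate_priors(log_posteriors, priors):
    """Subtract log priors from log posteriors, giving scores proportional to log likelihoods."""
    return [lp - math.log(p) for lp, p in zip(log_posteriors, priors)]

posteriors = [0.7, 0.2, 0.1]    # DNN outputs for one frame (hypothetical)
priors = [0.9, 0.05, 0.05]      # class frequencies in the training set (hypothetical)
scaled = compensate_priors([math.log(p) for p in posteriors], priors)
print(scaled)
```

In this example the first class has the largest posterior only because it dominates the training data. After compensation, the second class scores highest.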

Training one model takes about a day, and there are usually a few models in training at any one time.
We generally train three versions: a small model for the first pass on the motion coprocessor, a larger-size model for the second pass, and a medium-size model for Apple Watch.
“Hey Siri” works in all languages that Siri supports, but “Hey Siri” isn’t necessarily the phrase that starts Siri listening.
For instance, French-speaking users need to say “Dis Siri” while Korean-speaking users say “Siri 야” (sounds like “Siri Ya”). In Russian it is “привет Siri” (sounds like “Privet Siri”), and in Thai “หวัดดี Siri” (sounds like “Wadi Siri”).

Testing and Tuning
An ideal detector would fire whenever the user says “Hey Siri,” and not fire at other times.
We describe the accuracy of the detector in terms of two kinds of error: firing at the wrong time, and failing to fire at the right time.
The false-accept rate (FAR, or false-alarm rate) is the number of false activations per hour (or mean hours between activations), and the false-reject rate (FRR) is the proportion of attempted activations that fail.
(Note that the units we use to measure FAR are not the same as those we use for FRR.
Even the dimensions are different.
So there is no notion of an equal error rate.)
For a given model we can change the balance between the two kinds of error by changing the activation threshold.
Figure 6 shows examples of this trade-off, for two sizes of early-development models.
Changing the threshold moves along the curve.
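The two error measures and the effect of the threshold can be shown with a toy example. The scores, the number of hours, and the function names below are made up for illustration. Real test sets are far larger.

```python
def far_per_hour(negative_scores, negative_hours, threshold):
    """False activations per hour of negative audio."""
    return sum(s >= threshold for s in negative_scores) / negative_hours

def frr(positive_scores, threshold):
    """Fraction of genuine attempts that fail to trigger."""
    return sum(s < threshold for s in positive_scores) / len(positive_scores)

positive = [0.9, 0.8, 0.6, 0.4]   # detector scores for genuine "Hey Siri" attempts
negative = [0.7, 0.3, 0.2]        # peak scores found in 10 hours of other audio
for th in (0.5, 0.75):
    print(th, far_per_hour(negative, 10.0, th), frr(positive, th))
```

Raising the threshold from 0.5 to 0.75 lowers the false-alarm rate from 0.1 per hour to 0 and raises the false-reject rate from 0.25 to 0.5. Note that the first value is a rate per hour and the second is a proportion.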

During development we try to estimate the accuracy of the system by using a large test set, which is quite expensive to collect and prepare, but essential.
There is “positive” data and “negative” data.
The “positive” data does contain the target phrase.
You might think that we could use utterances picked up by the “Hey Siri” system, but the system doesn’t capture the attempts that failed to trigger, and we want to improve the system to include as many of such failed attempts as possible.
At first we used the utterances of “Hey Siri” that some users said as they pressed the Home button, but these users are not attempting to catch Siri’s attention (the button does that), and the microphone is bound to be within arm’s reach, whereas we also want “Hey Siri” to work across a room.
We made recordings specially in various conditions, such as in the kitchen (both close and far), car, bedroom, and restaurant, by native speakers of each language.

We use the “negative” data to test for false activations (and false wakes).
The data represent thousands of hours of recordings, from various sources, including podcasts and non-“Hey Siri” inputs to Siri in many languages, to represent both background sounds (especially speech) and the kinds of phrases that a user might say to another person.
We need such a lot of data because we are trying to estimate false-alarm rates as low as one per week.
(If there are any occurrences of the target phrase in the negative data we label them as such, so that we do not count responses to them as errors.)
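A rough calculation shows why thousands of hours are needed. The sketch below assumes, hypothetically, that about ten false alarms should be expected in the test set before the estimate is taken seriously. That figure is an illustration, not a number from the article.

```python
HOURS_PER_WEEK = 7 * 24

def hours_of_audio_needed(hours_between_false_alarms, expected_events):
    """Audio duration in which `expected_events` false alarms are expected on average."""
    return hours_between_false_alarms * expected_events

# Target rate from the text: one false alarm per week.
needed = hours_of_audio_needed(HOURS_PER_WEEK, expected_events=10)
print(needed)
```

At one false alarm per week, expecting ten events already takes 1680 hours of negative audio.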

Figure 6.
Detector accuracy.
Trade-offs against detection threshold for small and larger DNNs

Tuning is largely a matter of deciding what thresholds to use.
In Figure 6, the two dots on the lower trade-off curve for the larger model show possible normal and second-chance thresholds.
The operating point for the smaller (first-pass) model would be at the right-hand side.
These curves are just for the two stages of the detector, and do not include the personalized stage or subsequent checks.
While we are confident that models that appear to perform better on the test set probably are really better, it is quite difficult to convert offline test results into useful predictions of the experience of users.
So in addition to the offline measurements described previously, we estimate false-alarm rates (when Siri turns on without the user saying “Hey Siri”) and imposter-accept rates (when Siri turns on when someone other than the user who trained the detector says “Hey Siri”) weekly by sampling from production data, on the latest iOS devices and Apple Watch.
This does not give us rejection rates (when the system fails to respond to a valid “Hey Siri”) but we can estimate rejection rates from the proportion of activations just above the threshold that are valid, and a sampling of just-below threshold events on devices carried by development staff.
We continually evaluate and improve “Hey Siri,” and the model that powers it, by training and testing using variations of the approach described here.
We train in many different languages and test under a wide range of conditions.
Next time you say “Hey Siri” you may think of all that goes on to make responding to that phrase happen, but we hope that it “just works!”





