SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition


Daniel S. Park∗, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le
Google Brain
{danielspark, williamchan, ngyuzh, chungchengc, barretzoph, cubuk, qvl}@google.com

https://arxiv.org/pdf/1904.08779v1.pdf

Abstract
We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.

  1. Introduction
    Deep Learning has been applied successfully to Automatic Speech Recognition (ASR) [1], where the main focus of research has been designing better network architectures, for example, DNNs [2], CNNs [3], RNNs [4] and end-to-end models [5, 6, 7]. However, these models tend to overfit easily and require large amounts of training data [8].
    Data augmentation has been proposed as a method to generate additional training data for ASR. For example, in [9, 10], artificial data was augmented for low resource speech recognition tasks. Vocal Tract Length Normalization has been adapted for data augmentation in [11]. Noisy audio has been synthesised via superimposing clean audio with a noisy audio signal in [12]. Speed perturbation has been applied on raw audio for LVCSR tasks in [13]. The use of an acoustic room simulator has been explored in [14]. Data augmentation for keyword spotting has been studied in [15, 16]. More generally, learned augmentation techniques have explored different sequences of augmentation transformations that have achieved state-of-the-art performance in the image domain [17].
    Inspired by the recent success of augmentation in the speech and vision domains, we propose SpecAugment, an augmentation method that operates on the log mel spectrogram of the input audio, rather than the raw audio itself. This method is simple and computationally cheap to apply, as it directly acts on the log mel spectrogram as if it were an image, and does not require any additional data. We are thus able to apply SpecAugment online during training. SpecAugment consists of three kinds of deformations of the log mel spectrogram. The first is time warping, a deformation of the time-series in the time direction. The other two augmentations, inspired by “Cutout”, proposed in computer vision [18], are time and frequency masking, where we mask a block of consecutive time steps or mel frequency channels.
    This approach, while rudimentary, is remarkably effective and allows us to train end-to-end ASR networks, called Listen, Attend and Spell (LAS) [6], to surpass more complicated hybrid systems, and achieve state-of-the-art results even without the use of Language Models (LMs). On LibriSpeech [19], we achieve 2.8% Word Error Rate (WER) on the test-clean set and 6.8% WER on the test-other set, without the use of an LM. Upon shallow fusion [20] with an LM trained on the LibriSpeech LM corpus, we are able to better our performance (2.5% WER on test-clean and 5.8% WER on test-other), improving the current state of the art on test-other by 22% relatively. On Switchboard 300h (LDC97S62) [21], we obtain 7.2% WER on the Switchboard portion of the Hub5’00 (LDC2002S09, LDC2003T02) test set, and 14.6% on the CallHome portion, without using an LM. Upon shallow fusion with an LM trained on the combined transcript of the Switchboard and Fisher (LDC200{4,5}T19) [22] corpora, we obtain 6.8%/14.1% WER on the Switchboard/CallHome portion.

  2. Augmentation Policy

We aim to construct an augmentation policy that acts on the log mel spectrogram directly, which helps the network learn useful features. Motivated by the goal that these features should be robust to deformations in the time direction, partial loss of frequency information and partial loss of small segments of speech, we have chosen the following deformations to make up a policy:

  1. Time warping is applied via the function sparse_image_warp of tensorflow. Given a log mel spectrogram with τ time steps, we view it as an image where the time axis is horizontal and the frequency axis is vertical. A random point along the horizontal line passing through the center of the image within the time steps (W, τ − W) is to be warped either to the left or right by a distance w chosen from a uniform distribution from 0 to the time warp parameter W along that line.
  2. Frequency masking is applied so that f consecutive mel frequency channels [f0, f0 + f) are masked, where f is first chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f). ν is the number of mel frequency channels.
  3. Time masking is applied so that t consecutive time steps [t0, t0 + t) are masked, where t is first chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t).
    • We introduce an upper bound on the time mask so that a time mask cannot be wider than p times the number of time steps.
    Figure 1 shows examples of the individual augmentations applied to a single input. The log mel spectrograms are normalized to have zero mean value, and thus setting the masked value to zero is equivalent to setting it to the mean value.
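The masking steps above amount to a few array operations on the spectrogram. The sketch below is a minimal illustration (not the authors' implementation), assuming the log mel spectrogram is a NumPy array of shape (τ, ν) with time on the first axis; the function names and the cap parameter p are ours, following the description above.

```python
import numpy as np

def freq_mask(spec, F):
    """Mask f consecutive mel channels [f0, f0 + f), with f drawn
    uniformly from [0, F] and f0 from [0, nu - f)."""
    nu = spec.shape[1]                 # number of mel frequency channels
    f = np.random.randint(0, F + 1)
    f0 = np.random.randint(0, nu - f)
    spec[:, f0:f0 + f] = 0.0           # zero equals the mean after normalization
    return spec

def time_mask(spec, T, p=1.0):
    """Mask t consecutive time steps [t0, t0 + t), with t drawn uniformly
    from [0, T] and capped at p times the number of time steps."""
    tau = spec.shape[0]                # number of time steps
    t = np.random.randint(0, T + 1)
    t = min(t, int(p * tau))           # upper bound on the mask width
    t0 = np.random.randint(0, tau - t)
    spec[t0:t0 + t, :] = 0.0
    return spec
```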
    We can consider policies where multiple frequency and time masks are applied. The multiple masks may overlap. In this work, we mainly consider a series of hand-crafted policies, LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS) whose parameters are summarized in Table 1. In Figure 2, we show an example of a log mel spectrogram augmented with policies LB and LD.
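A hand-crafted policy is then just a composition of these operations. The snippet below sketches how a policy with multiple frequency and time masks could be applied online to each training example; it reuses the functions above, omits time warping (done in the paper with TensorFlow's sparse_image_warp), and the parameter values shown are illustrative rather than the ones of Table 1.

```python
def apply_policy(spec, F=27, num_freq_masks=2, T=100, num_time_masks=2, p=1.0):
    """Apply SpecAugment-style masking to one log mel spectrogram.
    Masks are drawn independently and may overlap."""
    for _ in range(num_freq_masks):
        spec = freq_mask(spec, F)
    for _ in range(num_time_masks):
        spec = time_mask(spec, T, p)
    return spec

# Example: augment one normalized spectrogram with tau=1000 steps, nu=80 channels.
spec = np.random.randn(1000, 80).astype(np.float32)
augmented = apply_policy(spec)
```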
  3. Model

We use Listen, Attend and Spell (LAS) networks [6] for our ASR tasks. These models, being end-to-end, are simple to train and have the added benefit of having well-documented benchmarks [23, 24] that we are able to build upon to get our results. In this section, we review LAS networks and introduce some notation to parameterize them. We also introduce the learning rate schedules we use to train the networks, as they turn out to be an important factor in determining performance. We end with reviewing shallow fusion [20], which we have used to incorporate language models for further gains in performance.

3.1. LAS Network Architectures

We use Listen, Attend and Spell (LAS) networks [6] for end-to-end ASR as studied in [24], for which we use the notation LAS-d-w. The input log mel spectrogram is passed in to a 2-layer Convolutional Neural Network (CNN) with max-pooling and stride of 2. The output of the CNN passes through an encoder consisting of d stacked bi-directional LSTMs with cell size w to yield a series of attention vectors. The attention vectors are fed into a 2-layer RNN decoder of cell dimension w, which yields the tokens for the transcript. The text is tokenized using a Word Piece Model (WPM) [25] of vocabulary size 16k for LibriSpeech and 1k for Switchboard. The WPM for LibriSpeech 960h is constructed using the training set transcripts. For the Switchboard 300h task, transcripts from the training set are combined with those of the Fisher corpus to construct the WPM. The final transcripts are obtained by a beam search with beam size 8. For comparison with [24], we note that their “large model” in our notation is LAS-4-1024.
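A rough idea of the LAS-d-w listener can be conveyed in a few lines of TensorFlow. The sketch below is an assumption-laden illustration rather than the authors' model: the number of convolutional filters, the demo cell size, and the exact subsampling layout are our choices, and the attention-based 2-layer RNN decoder, WPM tokenization and beam search are omitted.

```python
import tensorflow as tf

class LASEncoder(tf.keras.Model):
    """Sketch of the LAS-d-w listener: a 2-layer CNN front end with
    max-pooling (stride 2), followed by d stacked bidirectional LSTMs
    of cell size w that produce the attention vectors."""

    def __init__(self, d=4, w=1024, filters=32):
        super().__init__()
        self.convs = [tf.keras.layers.Conv2D(filters, 3, padding="same",
                                             activation="relu") for _ in range(2)]
        self.pools = [tf.keras.layers.MaxPool2D(pool_size=2) for _ in range(2)]
        self.blstms = [tf.keras.layers.Bidirectional(
                           tf.keras.layers.LSTM(w, return_sequences=True))
                       for _ in range(d)]

    def call(self, spec):                       # spec: (batch, time, mel, 1)
        x = spec
        for conv, pool in zip(self.convs, self.pools):
            x = pool(conv(x))                   # subsample time and frequency by 2
        # Flatten the frequency and channel axes so each subsampled time step
        # becomes a single feature vector for the recurrent stack.
        x = tf.reshape(x, [tf.shape(x)[0], tf.shape(x)[1], x.shape[2] * x.shape[3]])
        for blstm in self.blstms:
            x = blstm(x)
        return x                                # (batch, time / 4, 2 * w)

# Example: encode two 1000-frame, 80-channel log mel spectrograms with a small
# demo cell size (the paper's "large model" LAS-4-1024 uses w = 1024).
encoder = LASEncoder(d=4, w=256)
attention_vectors = encoder(tf.random.normal([2, 1000, 80, 1]))  # (2, 250, 512)
```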
