Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis: Translation (excluding the experiments section)

Reading notes

(1) Bracketed numbers like [xxx] in the text are the reference numbers cited in the original paper.
(2) Most of the text was first machine-translated with a Transformer model; I then read through it with my own experience and made some corrections. If you find problems, feel free to discuss them with me.

Abstract

Original text

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Translation

We describe a neural-network-based text-to-speech (TTS) synthesis system that can generate speech audio in the voices of different speakers, including speakers unseen during training. The system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of untranscribed, noisy speech from thousands of speakers, which produces a fixed-dimensional embedding vector from only a few seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; and (3) an autoregressive WaveNet-based vocoder network that converts the mel spectrogram into time-domain waveform samples. We demonstrate that the proposed model can transfer the knowledge of speaker variability learned by the discriminatively trained speaker encoder to the multispeaker TTS task and can synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse set of speakers in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.

English terms you may not know

synthesis: to combine or fuse; here, generating speech

Thoughts on the code

1. The paper notes that matched, high-quality data is scarce. We can partly work around this with data augmentation and sampling, but if this model works well, could we also use it to generate speech data for other tasks? Quite a few bloggers and video creators have already achieved results that sound extremely close to the real person.
2. Tacotron 2, mentioned in the paper, is an end-to-end speech synthesis framework proposed by Google Brain in 2017. From bottom to top, the model can be seen as two parts (a minimal sketch of this two-stage pipeline follows below):
Spectrogram prediction network: an encoder-attention-decoder network that predicts a sequence of mel-spectrogram frames from the input character sequence.
Vocoder: a modified WaveNet that turns the predicted mel-spectrogram frames into a time-domain waveform.
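To make the two-stage structure concrete, here is a minimal Python sketch of the interface between the two parts. The class and function names are hypothetical (not taken from the paper or any released Tacotron 2 code), and the bodies are stand-ins that only illustrate the shapes flowing between the stages.

```python
import numpy as np

class SpectrogramPredictionNetwork:
    """Stage 1 stand-in: character/phoneme sequence -> mel-spectrogram frames."""
    def predict(self, text: str, frames_per_char: int = 5, n_mels: int = 80) -> np.ndarray:
        # A real model runs an encoder, attention, and an autoregressive decoder.
        # Here we only return an array of the right shape: (frames, mel channels).
        return np.zeros((max(1, len(text)) * frames_per_char, n_mels), dtype=np.float32)

class WaveNetVocoder:
    """Stage 2 stand-in: mel-spectrogram frames -> time-domain waveform samples."""
    def invert(self, mel: np.ndarray, hop_samples: int = 200) -> np.ndarray:
        # A real WaveNet generates audio sample by sample, conditioned on the mel frames.
        return np.zeros(mel.shape[0] * hop_samples, dtype=np.float32)

def tacotron2_tts(text: str) -> np.ndarray:
    mel = SpectrogramPredictionNetwork().predict(text)  # text -> mel spectrogram
    return WaveNetVocoder().invert(mel)                 # mel spectrogram -> waveform
```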

Introduction

Original text

The goal of this work is to build a TTS system which can generate natural speech for a variety of speakers in a data efficient manner. We specifically address a zero-shot learning setting, where a few seconds of untranscribed reference audio from a target speaker is used to synthesize new speech in that speaker's voice, without updating any model parameters. Such systems have accessibility applications, such as restoring the ability to communicate naturally to users who have lost their voice and are therefore unable to provide many new training examples. They could also enable new applications, such as transferring a voice across languages for more natural speech-to-speech translation, or generating realistic speech from text in low resource settings. However, it is also important to note the potential for misuse of this technology, for example impersonating someone's voice without their consent. In order to address safety concerns consistent with principles such as [1], we verify that voices generated by the proposed model can easily be distinguished from real voices.

Synthesizing natural speech requires training on a large number of high quality speech-transcript pairs, and supporting many speakers usually uses tens of minutes of training data per speaker [8]. Recording a large amount of high quality data for many speakers is impractical. Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network. Decoupling the networks enables them to be trained on independent data, which reduces the need to obtain high quality multispeaker training data. We train the speaker embedding network on a speaker verification task to determine if two different utterances were spoken by the same speaker. In contrast to the subsequent TTS model, this network is trained on untranscribed speech containing reverberation and background noise from a large number of speakers. We demonstrate that the speaker encoder and synthesis networks can be trained on unbalanced and disjoint sets of speakers and still generalize well. We train the synthesis network on 1.2K speakers and show that training the encoder on a much larger set of 18K speakers improves adaptation quality, and further enables synthesis of completely novel speakers by sampling from the embedding prior.

There has been significant interest in end-to-end training of TTS models, which are trained directly from text-audio pairs, without depending on hand crafted intermediate representations [17, 23]. Tacotron 2 [15] used WaveNet [19] as a vocoder to invert spectrograms generated by an encoder-decoder architecture with attention [3], obtaining naturalness approaching that of human speech by combining Tacotron's [23] prosody with WaveNet's audio quality. It only supported a single speaker. Gibiansky et al. [8] introduced a multispeaker variation of Tacotron which learned low-dimensional speaker embedding for each training speaker. Deep Voice 3 [13] proposed a fully convolutional encoder-decoder architecture which scaled up to support over 2,400 speakers from LibriSpeech [12]. These systems learn a fixed set of speaker embeddings and therefore only support synthesis of voices seen during training. In contrast, VoiceLoop [18] proposed a novel architecture based on a fixed size memory buffer which can generate speech from voices unseen during training. Obtaining good results required tens of minutes of enrollment speech and transcripts for a new speaker.

Recent extensions have enabled few-shot speaker adaptation where only a few seconds of speech per speaker (without transcripts) can be used to generate new speech in that speaker's voice. [2] extends Deep Voice 3, comparing a speaker adaptation method similar to [18] where the model parameters (including speaker embedding) are fine-tuned on a small amount of adaptation data to a speaker encoding method which uses a neural network to predict speaker embedding directly from a spectrogram. The latter approach is significantly more data efficient, obtaining higher naturalness using small amounts of adaptation data, in as few as one or two utterances. It is also significantly more computationally efficient since it does not require hundreds of backpropagation iterations. Nachmani et al. [10] similarly extended VoiceLoop to utilize a target speaker encoding network to predict a speaker embedding. This network is trained jointly with the synthesis network using a contrastive triplet loss to ensure that embeddings predicted from utterances by the same speaker are closer than embeddings computed from different speakers. In addition, a cycle-consistency loss is used to ensure that the synthesized speech encodes to a similar embedding as the adaptation utterance. A similar spectrogram encoder network, trained without a triplet loss, was shown to work for transferring target prosody to synthesized speech [16].

In this paper we demonstrate that training a similar encoder to discriminate between speakers leads to reliable transfer of speaker characteristics. Our work is most similar to the speaker encoding models in [2, 10], except that we utilize a network independently-trained for a speaker verification task on a large dataset of untranscribed audio from tens of thousands of speakers, using a state-of-the-art generalized end-to-end loss [22]. [10] incorporated a similar speaker-discriminative representation into their model, however all components were trained jointly. In contrast, we explore transfer learning from a pre-trained speaker verification model. Doddipatla et al. [7] used a similar transfer learning configuration where a speaker embedding computed from a pre-trained speaker classifier was used to condition a TTS system. In this paper we utilize an end-to-end synthesis network which does not rely on intermediate linguistic features, and a substantially different speaker embedding network which is not limited to a closed set of speakers. Furthermore, we analyze how quality varies with the number of speakers in the training set, and find that zero-shot transfer requires training on thousands of speakers, many more than were used in [7].

Translation

The goal of this work is to build a TTS system that can generate natural speech for a wide variety of speakers in a data-efficient way. We specifically address a zero-shot learning setting, in which a few seconds of untranscribed reference audio from a target speaker are used to synthesize new speech in that speaker's voice, without updating any model parameters. Such a system has accessibility applications, for example restoring the ability to communicate naturally to users who have lost their voice and therefore cannot provide many new training examples. It could also enable new applications, such as transferring a voice across languages for more natural speech-to-speech translation, or generating realistic speech from text in low-resource settings. It is also important, however, to note the potential for misuse of this technology, for example impersonating someone's voice without their consent. To address safety concerns consistent with principles such as [1], we verify that the voices generated by the proposed model can easily be distinguished from real voices. Synthesizing natural speech requires training on a large number of high-quality speech-transcript pairs, and supporting many speakers usually requires tens of minutes of training data per speaker [8]. Recording large amounts of high-quality data for many speakers is impractical. Our approach is to decouple speaker modeling from speech synthesis: we independently train a speaker-discriminative embedding network that captures the space of speaker characteristics, and train a high-quality TTS model on a smaller dataset conditioned on the representation learned by the first network. Decoupling the networks lets them be trained on independent data, which reduces the need for high-quality multispeaker training data. We train the speaker embedding network on a speaker verification task: deciding whether two different utterances were spoken by the same speaker. In contrast to the subsequent TTS model, this network is trained on untranscribed speech from a large number of speakers that contains reverberation and background noise.
We show that the speaker encoder and the synthesis network can be trained on unbalanced and disjoint sets of speakers and still generalize well. We train the synthesis network on 1.2K speakers and show that training the encoder on a much larger set of 18K speakers improves adaptation quality, and further enables synthesis of completely novel speakers by sampling from the embedding prior.
There has been significant interest in end-to-end training of TTS models, which are trained directly from text-audio pairs without relying on hand-crafted intermediate representations [17, 23]. Tacotron 2 [15] used WaveNet [19] as a vocoder to invert the spectrograms generated by an encoder-decoder architecture with attention [3], and by combining Tacotron's [23] prosody with WaveNet's audio quality it obtained naturalness approaching that of human speech. It supported only a single speaker. Gibiansky et al. [8] introduced a multispeaker variant of Tacotron that learns a low-dimensional speaker embedding for each training speaker. Deep Voice 3 [13] proposed a fully convolutional encoder-decoder architecture that scales up to more than 2,400 speakers from LibriSpeech [12]. These systems learn a fixed set of speaker embeddings and therefore only support synthesizing voices seen during training. In contrast, VoiceLoop [18] proposed a novel architecture based on a fixed-size memory buffer that can generate speech for voices unseen during training. Obtaining good results, however, required tens of minutes of enrollment speech and transcripts for each new speaker.
Recent extensions have enabled few-shot speaker adaptation, in which only a few seconds of speech per speaker (without transcripts) are needed to generate new speech in that speaker's voice. [2] extends Deep Voice 3, comparing a speaker adaptation method similar to [18], where the model parameters (including the speaker embedding) are fine-tuned on a small amount of adaptation data, against a speaker encoding method that uses a neural network to predict the speaker embedding directly from a spectrogram. The latter approach is significantly more data-efficient, reaching higher naturalness with small amounts of adaptation data, in as few as one or two utterances. It is also much more computationally efficient, since it does not require hundreds of backpropagation iterations. Nachmani et al. [10] similarly extended VoiceLoop with a target speaker encoding network that predicts a speaker embedding. That network is trained jointly with the synthesis network using a contrastive triplet loss, so that embeddings predicted from utterances of the same speaker are closer than embeddings computed from different speakers. In addition, a cycle-consistency loss ensures that the synthesized speech encodes to an embedding similar to that of the adaptation utterance.
A similar spectrogram encoder network, trained without a triplet loss, has been shown to work for transferring a target prosody to synthesized speech [16]. In this paper we show that training a similar encoder to discriminate between speakers leads to reliable transfer of speaker characteristics. Our work is most similar to the speaker encoding models in [2, 10], except that we use a network independently trained for a speaker verification task on a large dataset of untranscribed audio from tens of thousands of speakers, using a state-of-the-art generalized end-to-end loss [22]. [10] incorporated a similar speaker-discriminative representation into their model, but all components were trained jointly. In contrast, we explore transfer learning from a pre-trained speaker verification model. Doddipatla et al. [7] used a similar transfer-learning configuration, in which a speaker embedding computed by a pre-trained speaker classifier conditions a TTS system. In this paper we use an end-to-end synthesis network that does not rely on intermediate linguistic features, and a substantially different speaker embedding network that is not limited to a closed set of speakers. Furthermore, we analyze how quality varies with the number of speakers in the training set, and find that zero-shot transfer requires training on thousands of speakers, many more than were used in [7].

Terms you may not know

spectrogram: a time-frequency representation of a signal's spectrum
et al.: Latin for "and others"; used in citations when a paper has several authors

Multispeaker speech synthesis model

Original text

Our system is composed of three independently trained neural networks, illustrated in Figure 1: (1) a recurrent speaker encoder, based on [22], which computes a fixed dimensional vector from a speech signal, (2) a sequence-to-sequence synthesizer, based on [15], which predicts a mel spectrogram from a sequence of grapheme or phoneme inputs, conditioned on the speaker embedding vector, and (3) an autoregressive WaveNet [19] vocoder, which converts the spectrogram into time domain waveforms.

[Figure 1: model overview; the speaker encoder, synthesizer, and vocoder are trained independently.]
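For quick reference, the dataclass below collects the key hyperparameters the paper text gives for the three components in Sections 2.1-2.3. The dataclass itself is just an organizational device of this post, not something defined in the paper; only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SV2TTSConfig:
    # Speaker encoder (Section 2.1)
    encoder_mel_channels: int = 40     # 40-channel log-mel input
    encoder_lstm_layers: int = 3       # stack of 3 LSTM layers
    encoder_lstm_units: int = 768      # 768 cells per layer
    speaker_embedding_dim: int = 256   # per-layer projection / d-vector size
    inference_window_ms: int = 800     # sliding windows used at inference time
    inference_window_overlap: float = 0.5

    # Synthesizer (Section 2.2)
    synth_mel_channels: int = 80       # 80-channel mel-scale filterbank
    synth_win_ms: float = 50.0         # 50 ms analysis window
    synth_hop_ms: float = 12.5         # 12.5 ms step

    # Vocoder (Section 2.3)
    wavenet_dilated_layers: int = 30   # 30 dilated convolution layers

config = SV2TTSConfig()
```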

2.1 Speaker encoder

The speaker encoder is used to condition the synthesis network on a reference speech signal from the desired target speaker. Critical to good generalization is the use of a representation which captures the characteristics of different speakers, and the ability to identify these characteristics using only a short adaptation signal, independent of its phonetic content and background noise. These requirements are satisfied using a speaker-discriminative model trained on a text-independent speaker verification task. We follow [22], which proposed a highly scalable and accurate neural network framework for speaker verification. The network maps a sequence of log-mel spectrogram frames computed from a speech utterance of arbitrary length, to a fixed-dimensional embedding vector, known as d-vector [20, 9]. The network is trained to optimize a generalized end-to-end speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity, while those of utterances from different speakers are far apart in the embedding space. The training dataset consists of speech audio examples segmented into 1.6 seconds and associated speaker identity labels; no transcripts are used.
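The generalized end-to-end (GE2E) loss of [22] pushes each utterance embedding toward the centroid of its own speaker and away from other speakers' centroids, using scaled cosine similarity. The sketch below (assuming PyTorch) is a simplified version: it uses the softmax variant of the loss, keeps the similarity scale and bias fixed instead of learned, and omits the exclusive-centroid refinement of the original formulation.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeds: torch.Tensor, w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """embeds: (n_speakers, n_utterances, embed_dim), already L2-normalized.
    Simplified GE2E: every utterance should be most similar to its own speaker centroid."""
    n_spk, n_utt, dim = embeds.shape
    centroids = F.normalize(embeds.mean(dim=1), dim=-1)       # (n_spk, dim) speaker centroids
    flat = embeds.reshape(n_spk * n_utt, dim)                 # one row per utterance
    sim = w * (flat @ centroids.T) + b                        # scaled cosine similarity matrix
    target = torch.arange(n_spk).repeat_interleave(n_utt)     # true speaker index per utterance
    return F.cross_entropy(sim, target)

# Example: a training batch of 4 speakers x 5 utterances x 256-dim embeddings
dummy = F.normalize(torch.randn(4, 5, 256), dim=-1)
loss = ge2e_softmax_loss(dummy)
```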
Input 40-channel log-mel spectrograms are passed to a network consisting of a stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. The final embedding is created by L2-normalizing the output of the top layer at the final frame. During inference, an arbitrary length utterance is broken into 800ms windows, overlapped by 50%. The network is run independently on each window, and the outputs are averaged and normalized to create the final utterance embedding. Although the network is not optimized directly to learn a representation which captures speaker characteristics relevant to synthesis, we find that training on a speaker discrimination task leads to an embedding which is directly suitable for conditioning the synthesis network on speaker identity.
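A minimal PyTorch sketch of an encoder with this shape (3 LSTM layers of 768 units, 256-dimensional projections, L2 normalization of the top layer's output at the final frame) is shown below. It follows the description in the text but is not the original implementation; the use of nn.LSTM's proj_size argument (available in recent PyTorch) for the per-layer projections, the 10 ms frame hop assumed when converting 800 ms to about 80 frames, and the windowing helper are all my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """3-layer LSTM (768 units) with per-layer 256-dim projections, as described in the text."""
    def __init__(self, n_mels: int = 40, hidden: int = 768, proj: int = 256, layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=layers, proj_size=proj, batch_first=True)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        """mels: (batch, frames, 40) log-mel frames -> (batch, 256) L2-normalized d-vectors."""
        out, _ = self.lstm(mels)
        return F.normalize(out[:, -1, :], dim=-1)  # top layer output at the final frame

def embed_utterance(encoder: SpeakerEncoder, mels: torch.Tensor,
                    win: int = 80, overlap: float = 0.5) -> torch.Tensor:
    """Split an utterance into 800 ms windows (about 80 frames at an assumed 10 ms hop),
    overlapped by 50%, embed each window, then average and re-normalize."""
    hop = max(1, int(win * (1 - overlap)))
    starts = range(0, max(1, mels.shape[0] - win + 1), hop)
    windows = [mels[s:s + win] for s in starts]
    batch = torch.stack([F.pad(w, (0, 0, 0, win - w.shape[0])) for w in windows])  # pad short tail
    with torch.no_grad():
        d_vectors = encoder(batch)
    return F.normalize(d_vectors.mean(dim=0), dim=-1)

# Example: embed a 3-second utterance (300 frames of 40-channel log-mels)
enc = SpeakerEncoder()
embedding = embed_utterance(enc, torch.randn(300, 40))  # -> shape (256,)
```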

2.2 Synthesizer

We extend the recurrent sequence-to-sequence with attention Tacotron 2 architecture [15] to support multiple speakers following a scheme similar to [8]. An embedding vector for the target speaker is concatenated with the synthesizer encoder output at each time step. In contrast to [8], we find that simply passing embeddings to the attention layer, as in Figure 1, converges across different speakers. We compare two variants of this model, one which computes the embedding using the speaker encoder, and a baseline which optimizes a fixed embedding for each speaker in the training set, essentially learning a lookup table of speaker embeddings similar to [8, 13]. The synthesizer is trained on pairs of text transcript and target audio. At the input, we map the text to a sequence of phonemes, which leads to faster convergence and improved pronunciation of rare words and proper nouns. The network is trained in a transfer learning configuration, using a pretrained speaker encoder (whose parameters are frozen) to extract a speaker embedding from the target audio, i.e. the speaker reference signal is the same as the target speech during training. No explicit speaker identifier labels are used during training. Target spectrogram features are computed from 50ms windows computed with a 12.5ms step, passed through an 80-channel mel-scale filterbank followed by log dynamic range compression. We extend [15] by augmenting the L2 loss on the predicted spectrogram with an additional L1 loss. In practice, we found this combined loss to be more robust on noisy training data. In contrast to [10], we don't introduce additional loss terms based on the speaker embedding.
[Figure 2: Example synthesis of a sentence in different voices using the proposed system. Mel spectrograms are visualized for reference utterances used to generate speaker embeddings (left), and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment is shown in red. Three speakers held out of the train sets are used: one male (top) and two female (center and bottom).]
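Two details here translate directly into code: the speaker embedding is broadcast over time and concatenated with the synthesizer encoder output at each step, and the spectrogram loss is an L2 term plus an L1 term. The PyTorch sketch below shows only these two pieces with hypothetical function names; it is not the full Tacotron 2 synthesizer.

```python
import torch
import torch.nn.functional as F

def condition_on_speaker(encoder_out: torch.Tensor, spk_embed: torch.Tensor) -> torch.Tensor:
    """encoder_out: (batch, text_steps, enc_dim); spk_embed: (batch, 256).
    Broadcast the speaker embedding over time and concatenate it with every encoder frame,
    so the attention/decoder sees speaker identity at each time step."""
    expanded = spk_embed.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, expanded], dim=-1)  # (batch, text_steps, enc_dim + 256)

def spectrogram_loss(pred_mel: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    """L2 loss augmented with an L1 term, which the paper found more robust to noisy data."""
    return F.mse_loss(pred_mel, target_mel) + F.l1_loss(pred_mel, target_mel)

# Example shapes: 2 sentences, 60 phoneme steps, 512-dim encoder outputs, 256-dim embeddings
cond = condition_on_speaker(torch.randn(2, 60, 512), torch.randn(2, 256))   # -> (2, 60, 768)
loss = spectrogram_loss(torch.randn(2, 400, 80), torch.randn(2, 400, 80))
```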

2.3 Neural vocoder

We use the sample-by-sample autoregressive WaveNet [19] as a vocoder to invert synthesized mel spectrograms emitted by the synthesis network into time-domain waveforms. The architecture is the same as that described in [15], composed of 30 dilated convolution layers. The network is not directly conditioned on the output of the speaker encoder. The mel spectrogram predicted by the synthesizer network captures all of the relevant detail needed for high quality synthesis of a variety of voices, allowing a multispeaker vocoder to be constructed by simply training on data from many speakers.
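The vocoder itself is the 30-layer dilated-convolution WaveNet of [15]. The PyTorch sketch below only illustrates how such a dilated stack is typically built, with dilations doubling inside each cycle and 3 cycles of 10 layers giving 30 in total; the channel width, the plain tanh residual connections, and the absence of gating, skip outputs, and mel conditioning are all simplifications of mine, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Simplified stand-in for a 30-layer dilated convolution stack in a WaveNet-style vocoder."""
    def __init__(self, channels: int = 64, layers_per_cycle: int = 10, cycles: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(cycles):
            for i in range(layers_per_cycle):
                # Dilation doubles within a cycle: 1, 2, 4, ..., 512, then resets.
                self.layers.append(nn.Conv1d(channels, channels, kernel_size=2,
                                             dilation=2 ** i, padding=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, channels, samples). Residual connections only; no gating or skip outputs."""
        for conv in self.layers:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            y = conv(nn.functional.pad(x, (pad, 0)))  # left-pad so the convolution stays causal
            x = x + torch.tanh(y)                     # residual connection
        return x

out = DilatedConvStack()(torch.randn(1, 64, 2000))  # a short dummy feature sequence
```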

2.4 Inference and zero-shot speaker adaptation

During inference the model is conditioned using arbitrary untranscribed speech audio, which does not need to match the text to be synthesized. Since the speaker characteristics to use for synthesis are inferred from audio, it can be conditioned on audio from speakers that are outside the training set. In practice we find that using a single audio clip of a few seconds duration is sufficient to synthesize new speech with the corresponding speaker characteristics, representing zero-shot adaptation to novel speakers. In Section 3 we evaluate how well this process generalizes to previously unseen speakers. An example of the inference process is visualized in Figure 2, which shows spectrograms synthesized using several different 5 second speaker reference utterances. Compared to those of the female (center and bottom) speakers, the synthesized male (top) speaker spectrogram has noticeably lower fundamental frequency, visible in the denser harmonic spacing (horizontal stripes) in low frequencies, as well as formants, visible in the mid-frequency peaks present during vowel sounds such as the ‘i’ at 0.3 seconds – the top male F2 is in mel channel 35, whereas the F2 of the middle speaker appears closer to channel 40. Similar differences are also visible in sibilant sounds, e.g. the ‘s’ at 0.4 seconds contains more energy in lower frequencies in the male voice than in the female voices. Finally, the characteristic speaking rate is also captured to some extent by the speaker embedding, as can be seen by the longer signal duration in the bottom row compared to the top two. Similar observations can be made about the corresponding reference utterance spectrograms in the right column.
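Putting the three components together, zero-shot inference reduces to: embed a few seconds of untranscribed reference audio with the frozen speaker encoder, then run the synthesizer and vocoder conditioned on that embedding. The sketch below shows only this control flow; the component interfaces are hypothetical, and the callables at the bottom are stand-ins so the snippet runs without the real models.

```python
import numpy as np

def synthesize_zero_shot(text: str, reference_mels: np.ndarray,
                         speaker_encoder, synthesizer, vocoder) -> np.ndarray:
    """reference_mels: log-mel frames from ~5 s of untranscribed audio of the target speaker.
    The reference audio's content never needs to match `text`; only the voice is transferred."""
    d_vector = speaker_encoder(reference_mels)  # (256,) speaker embedding, encoder stays frozen
    mel = synthesizer(text, d_vector)           # (frames, 80) mel spectrogram in that voice
    return vocoder(mel)                         # time-domain waveform samples

# Stand-in components so the sketch runs; real trained models would replace these callables.
fake_encoder = lambda mels: np.zeros(256, dtype=np.float32)
fake_synth = lambda text, emb: np.zeros((len(text) * 5, 80), dtype=np.float32)
fake_vocoder = lambda mel: np.zeros(mel.shape[0] * 200, dtype=np.float32)

wave = synthesize_zero_shot("Hello from an unseen speaker.",
                            np.random.randn(500, 40).astype(np.float32),
                            fake_encoder, fake_synth, fake_vocoder)
```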

Translation

Our system consists of three independently trained neural networks, shown in Figure 1: (1) a recurrent speaker encoder, based on [22], which computes a fixed-dimensional vector from a speech signal; (2) a sequence-to-sequence synthesizer, based on [15], which predicts a mel spectrogram from a sequence of grapheme or phoneme inputs, conditioned on the speaker embedding vector; and (3) an autoregressive WaveNet [19] vocoder, which converts the spectrogram into time-domain waveforms.

2.1 Speaker encoder

The speaker encoder is used to condition the synthesis network on a reference speech signal from the desired target speaker. The key to good generalization is a representation that captures the characteristics of different speakers, and the ability to identify those characteristics from only a short adaptation signal, independently of its phonetic content and background noise. These requirements are met with a speaker-discriminative model trained on a text-independent speaker verification task. We follow [22], which proposed a highly scalable and accurate neural network framework for speaker verification. The network maps a sequence of log-mel spectrogram frames, computed from a speech utterance of arbitrary length, to a fixed-dimensional embedding vector known as a d-vector [20, 9]. The network is trained to optimize a generalized end-to-end speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity, while embeddings of utterances from different speakers are far apart in the embedding space. The training dataset consists of speech audio examples segmented into 1.6-second pieces together with speaker identity labels; no transcripts are used.
The input 40-channel log-mel spectrograms are fed to a network consisting of a stack of 3 LSTM layers with 768 cells, each followed by a projection to 256 dimensions. The final embedding is created by L2-normalizing the output of the top layer at the final frame. During inference, an utterance of arbitrary length is split into 800 ms windows with 50% overlap. The network is run on each window independently, and the outputs are averaged and normalized to produce the final utterance embedding.
Although the network is not optimized directly to learn a representation that captures the speaker characteristics relevant to synthesis, we find that training on a speaker discrimination task yields an embedding that is directly suitable for conditioning the synthesis network on speaker identity.

2.2 Synthesizer

We extend the recurrent sequence-to-sequence-with-attention Tacotron 2 architecture [15] to support multiple speakers, following a scheme similar to [8]. An embedding vector for the target speaker is concatenated with the synthesizer encoder output at each time step. In contrast to [8], we find that simply passing the embeddings to the attention layer, as in Figure 1, converges across different speakers. We compare two variants of this model: one that computes the embedding with the speaker encoder, and a baseline that optimizes a fixed embedding for each speaker in the training set, essentially learning a lookup table of speaker embeddings similar to [8, 13]. The synthesizer is trained on pairs of text transcript and target audio. At the input, we map the text to a sequence of phonemes, which leads to faster convergence and better pronunciation of rare words and proper nouns. The network is trained in a transfer-learning configuration, using a pretrained speaker encoder (with frozen parameters) to extract a speaker embedding from the target audio; that is, during training the speaker reference signal is the same as the target speech. No explicit speaker identity labels are used during training. Target spectrogram features are computed from 50 ms windows with a 12.5 ms step, passed through an 80-channel mel-scale filterbank followed by log dynamic-range compression. We extend [15] by augmenting the L2 loss on the predicted spectrogram with an additional L1 loss; in practice we found this combined loss to be more robust on noisy training data. In contrast to [10], we do not introduce additional loss terms based on the speaker embedding.

2.3 Neural vocoder

We use the sample-by-sample autoregressive WaveNet [19] as a vocoder to invert the synthesized mel spectrograms emitted by the synthesis network into time-domain waveforms. The architecture is the same as described in [15] and consists of 30 dilated convolution layers. The network is not conditioned directly on the output of the speaker encoder. The mel spectrogram predicted by the synthesizer network captures all of the relevant detail needed for high-quality synthesis of a variety of voices, so a multispeaker vocoder can be built simply by training on data from many speakers.

2.4 Inference and zero-shot speaker adaptation

During inference, the model is conditioned on arbitrary untranscribed speech audio, which does not need to match the text to be synthesized. Because the speaker characteristics used for synthesis are inferred from audio, the model can be conditioned on audio from speakers outside the training set. In practice, we find that a single audio clip a few seconds long is sufficient to synthesize new speech with the corresponding speaker characteristics, which amounts to zero-shot adaptation to novel speakers. In Section 3 we evaluate how well this process generalizes to previously unseen speakers. An example of the inference process is visualized in Figure 2, which shows spectrograms synthesized using several different 5-second speaker reference utterances. Compared with the female speakers (center and bottom), the synthesized male speaker's spectrogram (top) has a noticeably lower fundamental frequency, visible in the denser harmonic spacing (horizontal stripes) at low frequencies, as well as in the formants, visible as mid-frequency peaks during vowels such as the 'i' at 0.3 seconds: the top male F2 lies in mel channel 35, whereas the F2 of the middle speaker appears closer to channel 40. Similar differences are also visible in sibilants; for example, the 's' at 0.4 seconds contains more energy at lower frequencies in the male voice than in the female voices. Finally, the characteristic speaking rate is also captured to some extent by the speaker embedding, as can be seen from the longer signal duration in the bottom row compared with the top two. Similar observations can be made about the corresponding reference utterance spectrograms in the right column.

Thoughts on the code

1. For the speaker encoder, we could cap the amount of speech used as input: zero-pad inputs that are too short and truncate inputs that are too long, so the input always has a fixed size (see the sketch below).
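A minimal sketch of that idea: cap the reference at a maximum number of mel frames, zero-pad shorter inputs, and truncate longer ones, so every utterance enters the encoder with the same shape. Note that this is the blogger's alternative; the paper itself handles arbitrary lengths with overlapping 800 ms windows instead. The 300-frame cap below is an arbitrary choice for illustration.

```python
import numpy as np

def fix_length(mels: np.ndarray, max_frames: int = 300) -> np.ndarray:
    """mels: (frames, n_mels). Truncate if too long, zero-pad at the end if too short."""
    frames, n_mels = mels.shape
    if frames >= max_frames:
        return mels[:max_frames]
    padded = np.zeros((max_frames, n_mels), dtype=mels.dtype)
    padded[:frames] = mels
    return padded

print(fix_length(np.random.randn(120, 40)).shape)  # (300, 40): short input, zero-padded
print(fix_length(np.random.randn(450, 40)).shape)  # (300, 40): long input, truncated
```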
