ICASSP 2019----Adapting End-to-end Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

https://ieeexplore.ieee.org/document/8682611

Adapting End-to-end Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training
Abstract:
In this article we propose a novel approach for adapting speaker embeddings to new domains based on adversarial training of neural networks. We apply our embeddings to the task of text-independent speaker verification, a challenging, real-world problem in biometric security. We further the development of end-to-end speaker embedding models by combining a novel 1-dimensional, self-attentive residual network, an angular margin loss function and an adversarial training strategy. Our model is able to learn extremely compact, 64-dimensional speaker embeddings that deliver competitive performance on a number of popular datasets using simple cosine distance scoring. On the NIST-SRE 2016 task we are able to beat a strong i-vector baseline, while on the Speakers in the Wild task our model was able to outperform both i-vector and x-vector baselines, showing an absolute improvement of 2.19% over the latter. Additionally, we show that the integration of adversarial training consistently leads to a significant improvement over an unadapted model.

SECTION 1. INTRODUCTION
Text-independent speaker verification systems are binary classifiers that, given two recordings, answer the question: are the people speaking in the two recordings the same person? The answer is typically delivered in the form of a scalar value or verification score. Verification scores can be formulated as a likelihood ratio, as in the popular i-vector/PLDA approach [1], [2]. An alternative approach is to use simple distance metrics such as mean-squared error or cosine distance. Verification models that can be scored in this way typically need to optimize the distance metric itself, i.e. they are optimized end-to-end.
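For concreteness, the following is a minimal sketch of cosine-distance scoring, assuming the two speaker embeddings have already been extracted as NumPy vectors; the 64-dimensional size and the decision threshold below are illustrative values, not ones prescribed by the paper.

```python
import numpy as np

def cosine_score(emb_enroll: np.ndarray, emb_test: np.ndarray) -> float:
    """Verification score: cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_enroll, emb_test) /
                 (np.linalg.norm(emb_enroll) * np.linalg.norm(emb_test)))

# Example with two hypothetical 64-dimensional embeddings.
rng = np.random.default_rng(0)
e_enroll, e_test = rng.standard_normal(64), rng.standard_normal(64)
score = cosine_score(e_enroll, e_test)

# Accept/reject by thresholding; the threshold is tuned on a development set
# (the value below is purely illustrative).
same_speaker = score > 0.5
```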
While contrastive-loss-based end-to-end face verification models have shown state-of-the-art performance [3], their adoption in the speaker verification community has not been widespread due to the difficulties associated with training such models. State-of-the-art speaker verification systems follow the same recipe as i-vector systems by using an LDA/PLDA classifier, but replace the i-vector extractor with a Deep Neural Network (DNN) feature extractor [4]. The DNN embedding model is trained by minimizing the cross-entropy loss over speakers in the training data. While cross-entropy minimization is simpler than optimizing contrastive losses, the nature of the verification problem makes learning a good DNN embedding model challenging. This is evidenced by the Kaldi x-vector recipe, which we use as one of the baseline systems in this work. The recipe involves extensive data preparation, followed by a multi-GPU training strategy that combines a sophisticated model averaging technique with a natural gradient variant of SGD [4]. Replicating the performance of x-vectors with conventional first-order optimizers is nontrivial [5].
In this article we present Domain Adversarial Neural Speaker Embeddings (DANSE) for text-independent speaker verification. We make the following contributions:

We propose a novel architecture for extracting neural speaker embeddings based on a 1-dimensional residual network and a self-attention model; the model can be trained using a simple data sampling strategy and traditional first-order optimizers (a minimal sketch of such an architecture is given below).

We show that the DANSE model can be optimized end-to-end to learn extremely compact (64-dimensional) embeddings that deliver competitive speaker verification performance using simple cosine scoring.

Finally, we propose to integrate adversarial training into the process of learning the speaker embedding model, in order to learn domain-invariant features [6].
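The paper's exact layer configuration is not reproduced in this summary, but the PyTorch sketch below illustrates the general shape of such a model: a 1-D convolutional residual front-end over frame-level features (e.g. MFCCs), self-attentive pooling across time, and a projection to a compact, length-normalised 64-dimensional embedding. All layer sizes (feature dimension, channel counts, kernel sizes, number of blocks) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1d(nn.Module):
    """A basic 1-D residual block over the time axis of frame-level features."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):                       # x: (batch, channels, frames)
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)

class SelfAttentivePooling(nn.Module):
    """Collapse variable-length frame features into a single utterance vector
    using learned attention weights over frames."""
    def __init__(self, channels: int):
        super().__init__()
        self.attention = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (batch, channels, frames)
        w = torch.softmax(self.attention(x), dim=-1)   # (batch, 1, frames)
        return torch.sum(x * w, dim=-1)                # (batch, channels)

class SpeakerEmbedder(nn.Module):
    """1-D residual front-end + self-attentive pooling + 64-dim embedding (illustrative sizes)."""
    def __init__(self, feat_dim: int = 23, channels: int = 256, emb_dim: int = 64):
        super().__init__()
        self.frontend = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[ResBlock1d(channels) for _ in range(3)])
        self.pool = SelfAttentivePooling(channels)
        self.proj = nn.Linear(channels, emb_dim)

    def forward(self, feats):                    # feats: (batch, feat_dim, frames)
        x = F.relu(self.frontend(feats))
        x = self.blocks(x)
        x = self.pool(x)
        return F.normalize(self.proj(x), dim=-1) # length-normalised 64-dim embedding
```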
Modern speaker verification datasets like NIST-SRE 2016 and Speakers in the Wild (SITW) are challenging because in-domain or target data is not available for training verification systems [7], [8]. This leads to a domain shift between the training and test datasets, which in turn degrades performance. Our key insight in this work is that verification performance can be improved significantly by encouraging the speaker embedding model to learn domain-invariant features. We achieve this through Domain Adversarial Training (DAT) using the framework of Gradient Reversal [9]. This allows us to learn domain-invariant speaker embeddings using a small amount of unlabelled, target-domain data. The gradient reversal model has previously been applied to the speaker verification problem in the i-vector domain [6], and our work extends this model to work directly with speech features (MFCCs, spectrograms, etc.).
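A gradient reversal layer is straightforward to implement in a modern autodiff framework: it acts as the identity in the forward pass and multiplies gradients by a negative factor in the backward pass, so the embedding network is trained to confuse a small domain classifier. The sketch below is a generic PyTorch implementation of this idea; the classifier size and the lambda value are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

class DomainClassifier(nn.Module):
    """Predicts the domain (e.g. source vs. target language/channel) from an embedding."""
    def __init__(self, emb_dim: int = 64, num_domains: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_domains))

    def forward(self, emb, lam: float = 1.0):
        # The reversed gradient pushes the embedding model to make domains indistinguishable.
        return self.net(grad_reverse(emb, lam))
```

During training, the total objective would combine the speaker classification loss on labelled source data with the domain classification loss computed on both source and unlabelled target batches; the reversed gradient is what makes the two objectives adversarial.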
One of the main objectives of this work is to show that domain-robust speaker recognizers can be learned end-to-end, a task which has proved especially challenging on the NIST datasets. Our experiments suggest that the main requirement for learning such models is to optimize for similarity, which we achieve by using a margin-based loss function. Additionally, we make use of a data sampling strategy that is easy to implement. We believe that this is an important factor, as it encourages replication and further improvements to the model.
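This excerpt does not spell out the exact margin formulation or the sampling scheme, so the sketch below shows one common additive angular-margin (ArcFace-style) softmax over length-normalised embeddings, purely as an illustration of what a margin-based loss that optimizes for similarity can look like; the scale and margin values are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginSoftmax(nn.Module):
    """Additive angular-margin softmax over normalised embeddings and speaker prototypes,
    shown as one common instantiation of a margin-based speaker classification loss."""
    def __init__(self, emb_dim: int, num_speakers: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.scale, self.margin = scale, margin

    def forward(self, emb, labels):
        # Cosine similarity between length-normalised embeddings and speaker prototypes.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the margin only to the target-speaker angle, then rescale and apply cross-entropy.
        target = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos) * self.scale
        return F.cross_entropy(logits, labels)
```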
One advantage of end-to-end optimization is that the learned speaker embeddings are inherently more discriminative. Intuitively, such embeddings are likely to be more beneficial when used as components or conditional inputs to other speech processing applications [10], [11].
