ICASSP 2019: A Denoising Autoencoder for Speaker Recognition. Results on the MCE 2018 Challenge

https://ieeexplore.ieee.org/document/8683525

Abstract:
We propose a Denoising Autoencoder (DAE) for speaker recognition, trained to map each individual ivector to the mean of all ivectors belonging to that particular speaker.
The aim of this DAE is to compensate for inter-session variability and increase the discriminative power of the ivectors prior to PLDA scoring.
We test the proposed approach on the MCE 2018 1st Multi-target speaker detection and identification Challenge Evaluation.
This evaluation presents a call-center fraud detection scenario: given a speech segment, detect if it belongs to any of the speakers in a blacklist.
We show that our DAE system consistently outperforms the usual LDA + PLDA pipeline, achieving a Top-S EER of 4.33% and Top-1 EER of 6.11% on the evaluation set, which represents a 45.6% error reduction with respect to the baseline system provided by organizers.

SECTION 1. INTRODUCTION
Since the introduction of total variability modeling and ivectors [1] as a fixed, low-dimensional representation of speech segments, the GMM-ivector and, more recently, DNN-ivector [2], [3] paradigms, followed by a discriminative back-end, have become the de facto standard in speaker recognition.
This backend usually consists of Linear Discriminant Analysis (LDA) to project ivectors to a lower dimension while increasing their discriminative power, length-normalization, and Probabilistic Linear Discriminant Analysis (PLDA) to account for inter-session variability [4], [5], [6].
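Length-normalization, the step shared by the baseline and the proposed backend, is simply a projection of each ivector onto the unit sphere. A minimal NumPy sketch (the dimensionality and toy data are illustrative, not from the paper):

```python
import numpy as np

def length_normalize(ivectors):
    """Scale each row ivector to unit Euclidean norm."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / norms

# toy batch: five 400-dimensional ivectors (dimension chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 400))
X_norm = length_normalize(X)
```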
More recently, alternative representations, such as x-vectors [7], [8], have been proposed.
However, LDA followed by PLDA remains the state-of-the-art backend.
We propose the use of a Denoising Autoencoder (DAE) [9] to increase the discriminative power of ivectors and compensate for inter-session variability.
An autoencoder is a neural network architecture that learns an internal representation that allows it to reconstruct its inputs.
A denoising autoencoder is a particular type of autoencoder that learns to reconstruct a “clean” version of its inputs.
In our case, the DAE takes as input an ivector and tries to map it to the mean of all the ivectors of that particular speaker.
To this end, the DAE is trained to maximize the cosine similarity between its output and the mean ivector for that speaker.
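The training setup just described can be sketched in NumPy: the regression target for every ivector is its speaker's mean ivector, and the loss is the negative mean cosine similarity, so minimizing it maximizes the similarity the paper optimizes. The function names are ours, and the network and optimizer themselves are not shown:

```python
import numpy as np

def speaker_mean_targets(ivectors, labels):
    """DAE target for each ivector: the mean of all ivectors
    from the same speaker."""
    targets = np.empty_like(ivectors)
    for spk in np.unique(labels):
        mask = labels == spk
        targets[mask] = ivectors[mask].mean(axis=0)
    return targets

def cosine_loss(outputs, targets):
    """Negative mean cosine similarity between DAE outputs and targets."""
    o = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return -float(np.mean(np.sum(o * t, axis=1)))
```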
Our proposed backend consists of: length normalization, DAE transformation and PLDA scoring.
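The three stages of the proposed backend can be chained as below. Here `dae_transform` and `plda_score` are hypothetical stand-ins for the trained DAE and PLDA model, which this sketch does not implement:

```python
import numpy as np

def backend_scores(ivector, dae_transform, plda_score, enrolled_models):
    """Proposed backend, step by step: length-normalize the input ivector,
    pass it through the trained DAE, then PLDA-score it against each
    enrolled (blacklist) model."""
    x = ivector / np.linalg.norm(ivector)
    x = dae_transform(x)
    return np.array([plda_score(x, m) for m in enrolled_models])
```

With an identity function for the DAE and an inner product standing in for PLDA, this degenerates to cosine scoring against the enrolled models.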
We test this approach in the MCE 2018 1st Multi-target speaker detection and identification Challenge Evaluation.
The task for the MCE 2018 Evaluation is to detect if a given speech segment belongs to any of the speakers in a blacklist.
The challenge is divided into two related subtasks: Top-S detection, i.e. detecting if the segment belongs to any of the blacklist speakers;
and Top-1 detection, i.e. detecting which specific blacklist speaker (if any) is speaking in the segment.
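Given a vector of backend scores for one test segment against the N blacklist speakers, the two subtasks reduce to a max and an argmax. A small illustrative sketch (the thresholding convention is ours, not prescribed by the challenge):

```python
import numpy as np

def top_s_statistic(scores):
    """Top-S detection: is the segment from *any* blacklist speaker?
    Decided by thresholding the best score against the blacklist."""
    return float(np.max(scores))

def top_1_identity(scores, threshold):
    """Top-1 identification: index of the best-scoring blacklist
    speaker, or None if even the best score falls below threshold."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```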
We refer the reader to [10] for a detailed description of the challenge.
In this paper, we describe in detail our submission for the challenge and show that the proposed DAE + PLDA backend outperforms the conventional LDA + PLDA approach.
Our best system achieves a Top-S EER of 4.33% and Top-1 EER of 6.11% on the evaluation set, which represents a 45.6% error reduction with respect to the baseline system provided by the organizers.
We have released source code for DAE training and testing.
Previous work has proposed a number of alternatives to LDA [11], [12] to account for the multimodal, non-Gaussian distribution of ivectors.
Our approach differs from these alternatives in the sense that it is not designed to replace LDA but to attack the problem from a different angle.
In fact, as we show in Section 4, both techniques can be combined, either by applying LDA to the DAE-transformed ivectors or by DAE-transforming LDA-projected ivectors.
The use of denoising autoencoders for speaker recognition has been previously proposed for tasks such as denoising ivectors [13] or domain adaptation [14].
In [15], [16] an approach similar to ours is proposed.
First, a Restricted Boltzmann Machine (RBM) is trained and then a DAE is fine-tuned.
In contrast, our approach is much simpler.
We show that even the simplest DAE can outperform the traditional LDA-PLDA backend.
The rest of the paper is organized as follows.
Section 2 provides an overview of the MCE 2018 evaluation, Section 3 describes both the baseline LDA-PLDA system and the proposed DAE-PLDA system and Section 4 presents the results.
Finally, conclusions are drawn in Section 5.
