Kai Liu, Huan Zhou
Artificial Intelligence Application Research Center, Huawei Technologies Shenzhen, PRC

Text-independent speaker verification systems suffer severe performance degradation under short-utterance conditions. To address this problem, we propose an adversarially learned embedding mapping model that directly maps a short embedding to an enhanced embedding with increased discriminability. In particular, a Wasserstein GAN with a range of loss criteria is investigated. These loss functions have distinct optimization objectives, and some of them are less favoured in speaker verification research. Different from most prior studies, our main objective is to investigate the effectiveness of those loss criteria by conducting numerous ablation studies. Experiments on the Voxceleb dataset show that some criteria benefit verification performance while others have trivial effects. Lastly, a Wasserstein GAN with the chosen loss criteria, without fine-tuning, achieves meaningful advancements over the baseline, with 4% relative improvement in EER and 7% in minDCF in the challenging scenario of short 2-second utterances.

Text-independent Speaker Verification (SV) aims to automatically verify the identity of a speaker, given an enrolled speaker record and a test speech signal (with no special constraint on phonetic content). The most important step in the SV pipeline is to map speech of arbitrary duration into a speaker representation of fixed dimension. It is desirable for such a speaker representation to be compact, discriminative and robust to extrinsic and intrinsic variations.
Several types of speaker representations have been developed over the past decades. The well-known i-vector [3] has been the state-of-the-art speaker representation, usually paired with a simple cosine-scoring strategy or the more powerful probabilistic linear discriminant analysis (PLDA) [12, 4] as verifier. With the advent of deep neural networks (DNNs), a variety of DNN frameworks and loss functions have been developed to learn deep speaker representations, known as embeddings. By training these networks with either the cross-entropy loss or some form of contrastive loss on large amounts of data, the resulting embeddings are speaker-discriminative. Compared to the i-vector, embeddings such as the x-vector [2] and the GhostVLAD-aggregated embedding [18] (G-vector for short) are promising, demonstrating competitive performance on long speech and a distinct advantage on short speech. Furthermore, the recently developed G-vector shows considerable gains over the x-vector under noisy test conditions, which makes it more favorable for a practical SV system.

[Note: from Oxford.]

However, the performance of an SV system usually degrades in real scenarios, due to prevalent mismatches between development and test conditions, such as channel, domain or duration mismatch [11, 5, 18]. For instance, it has been observed [5] that on the NIST-SRE 2010 test set (female part), the error rate of an i-vector/PLDA system rises from 2.48% to 24.78% when the verification trials are shortened from full duration to 5 seconds.
Numerous studies have sought to mitigate the short-duration effect. An early family of work aimed to modify different aspects of i-vector based SV systems, e.g., feature extraction techniques, intermediate parameter estimation, speaker model generation and score normalization techniques, as summarized in [11]. More recently, deep learning techniques have been explored. For instance, insufficient phonetic information is compensated by a teacher-student learning framework [17], and the scoring scheme is calibrated by transfer learning [13]. Another research strategy is to design duration-robust speaker embeddings to deal with utterances of arbitrary duration. By applying different neural network architectures and alternative loss functions, the discriminability of embeddings is further enhanced. For example, Inception Net with triplet loss is deployed in [20], Inception-ResNet with joint softmax and center loss in [8], and ResCNN with a novel speaker identity subspace loss in [14].







Generative Adversarial Networks (GANs) [6] are among the most popular deep learning algorithms developed recently. GANs have the potential to generate realistic instances and provide a solution to problems that require a generative solution, most notably in various image-to-image translation tasks.

[Note: this is the paper that introduced GANs, authored by Ian Goodfellow.]

In this study, we investigate the short-duration issue in a practical SV system. In contrast to most of the techniques mentioned above, our proposed approach works directly on the speaker embeddings. In particular, given pairs of short and long embeddings extracted from the same speaker and session, we propose to use adversarial learning with a Wasserstein GAN to learn a new embedding with enhanced discriminability. To test our approach, the G-vector is chosen as the embedding benchmark in our experiments due to its promising performance on short speech. This makes our study more challenging than prior studies that benchmarked against i-vectors.
The remainder of this paper is organized as follows: Section 2 briefly introduces work related to our method. Section 3 details our proposed Wasserstein GAN based approach. Section 4 presents experimental results and discussion. Finally, our conclusions are given in Section 5.


2.1. Wasserstein-GAN

GANs [6] are deep generative models comprising two networks, a generator and a discriminator. The discriminator D tries to learn the difference between a real sample y and a fake sample g generated from noise η, while the generator G tries to fool the discriminator. That is, the following minimax loss function is optimized through alternating optimization until equilibrium is reached.
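For reference, the standard minimax objective of [6], written with the symbols used above (real sample y, noise η), takes the following form. The distribution names P_y and P_η are notational assumptions introduced here.

```latex
\min_{G}\max_{D}\;
  \mathbb{E}_{y \sim P_y}\big[\log D(y)\big]
  + \mathbb{E}_{\eta \sim P_\eta}\big[\log\big(1 - D(G(\eta))\big)\big]
```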
However, training a GAN model is difficult due to the well-known issue of vanishing or exploding gradients. These issues have been addressed by the Wasserstein GAN (WGAN) [1], where the discriminator is designed to find a good function fw and a new loss function is configured to measure the Wasserstein distance:
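For reference, the WGAN objective of [1] is commonly written as below, where f_w ranges over a family of Lipschitz-constrained functions parameterized by w (the constraint is enforced in [1] by weight clipping). As before, P_y and P_η are notational assumptions.

```latex
\min_{G}\max_{w \in \mathcal{W}}\;
  \mathbb{E}_{y \sim P_y}\big[f_w(y)\big]
  - \mathbb{E}_{\eta \sim P_\eta}\big[f_w(G(\eta))\big]
```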


2.2. Deployments of GANs in SV

Motivated by their remarkable success in image-to-image translation, GANs have been actively deployed in the SV research community, mainly to handle the domain-mismatch issue, for example by transforming i-vectors [15] and x-vectors [19]. In contrast, few works use GANs to handle the short-duration issue. To the best of the authors' knowledge, the only published work proposes compensating i-vectors via a conditional GAN [7]. However, only limited performance improvements were observed: the proposed system alone failed to outperform the baseline system, and only the system based on score-wise fusion showed better performance than the baseline.




In the authors' opinion, training a GAN is non-trivial, and the reason behind such results might be that the effects of the conditional GAN's loss functions were overlooked. As such, in this study, we investigate the problem and seek guidelines for choosing beneficial loss functions that make the model perform better.


The architecture of our proposed approach is illustrated in Fig. 1. Here x and y are D-dimensional G-vectors corresponding to short and long utterance embeddings from the same speaker session, and z denotes the speaker identity labels. Given x, y and z, the proposed system is trained to learn a D-dimensional embedding g, with the expectation that the g-based SV system outperforms the one based on x.
Overall, the proposed architecture can be decomposed into four core components: the embedding generator Gf, the speaker label predictor Gc, the distance calculator Gd and the Wasserstein discriminator Dw. All components are jointly trained to generate enhanced embeddings with carefully handcrafted optimization objectives, described as follows.

3.1. Proposed Discriminator-Related Loss Functions

As mentioned above, the primary task of the proposed approach is to learn embeddings with enhanced discriminability. Let P denote a data distribution. We propose to achieve this task by driving Pg from the initial Px towards the target Py through adversarial learning with WGAN. To this end, several loss criteria with different optimization objectives are investigated in the discriminative model.
Following the conventional definition of the min-max function, the loss function of WGAN is:
Inspired by the idea of the conditional GAN [10], in this study we investigate a novel loss function that optimizes the Wasserstein distance between joint data distributions. That is, the data to be discriminated is controlled by concatenating the short embedding x with the conventional discriminator input. The corresponding min-max function is updated as:
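A form consistent with the two descriptions above, using the symbols of this section, would be the following. This is a reconstruction under assumptions (in particular, that [x, ·] denotes concatenation and that the pairs (x, y) are drawn jointly), and it is not necessarily the authors' exact expression.

```latex
% WGAN objective with the generator fed by short embeddings x
\min_{G_f}\max_{D_w}\;
  \mathbb{E}_{y \sim P_y}\big[D_w(y)\big]
  - \mathbb{E}_{x \sim P_x}\big[D_w(G_f(x))\big]

% Conditional variant: the critic sees x alongside its usual input
\min_{G_f}\max_{D_w}\;
  \mathbb{E}_{(x,y) \sim P_{x,y}}\big[D_w([x, y])\big]
  - \mathbb{E}_{x \sim P_x}\big[D_w([x, G_f(x)])\big]
```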



In addition, to seek more discriminability, the Fréchet Inception Distance (FID) [9], a popular metric for the distance between feature vectors of real and generated images, is also explored herein. Assuming Py and Pg are normal distributions with means μy, μg and covariance matrices Cy, Cg, the FID loss can be calculated by:
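The Fréchet distance between two Gaussians has the well-known closed form ||μy − μg||² + Tr(Cy + Cg − 2(Cy Cg)^(1/2)). The NumPy sketch below evaluates it from two batches of embeddings. It is an illustrative evaluation routine only: the function names are hypothetical, and a loss used in training would need a differentiable implementation in a deep learning framework.

```python
import numpy as np

def frechet_distance(mu_y, cov_y, mu_g, cov_g):
    """Frechet distance between N(mu_y, cov_y) and N(mu_g, cov_g).

    d^2 = ||mu_y - mu_g||^2 + Tr(cov_y + cov_g - 2 * (cov_y cov_g)^(1/2))
    """
    diff = mu_y - mu_g
    # Tr((cov_y cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_y @ cov_g, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_y @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_y) + np.trace(cov_g) - 2.0 * tr_sqrt)

def fid_from_batches(y, g):
    """Estimate the distance from two batches of embeddings of shape (N, D)."""
    return frechet_distance(y.mean(axis=0), np.cov(y, rowvar=False),
                            g.mean(axis=0), np.cov(g, rowvar=False))
```

With identity covariances the trace term vanishes and the distance reduces to the squared distance between the means.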
3.2. Proposed Generator-Related Loss Functions

To guide GAN training towards feature discriminability, four additional loss criteria are investigated herein.
To verify the speaker label, the widely adopted multiclass cross-entropy (CE) loss is investigated, formulated as:
where N is the batch size and c is the number of classes. gi denotes the i-th generated embedding sample and zi is the corresponding label index. W ∈ R^(D×c) and b ∈ R^c denote the weight matrix and bias of the projection layer.
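Under these definitions, the usual softmax cross-entropy averages, over the batch, the negative log-probability assigned to the true label zi after projecting gi with W and b. The NumPy sketch below illustrates that standard form. The function name is hypothetical, and averaging over N (rather than summing) is an assumption.

```python
import numpy as np

def softmax_cross_entropy(g, z, W, b):
    """Mean multiclass cross-entropy.

    g: (N, D) generated embeddings, z: (N,) integer label indices,
    W: (D, c) projection weights, b: (c,) bias.
    """
    logits = g @ W + b                                   # shape (N, c)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for every sample.
    return float(-log_probs[np.arange(len(z)), z].mean())
```

With all-zero weights and bias, every class is equally likely and the loss equals log(c).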
To explicitly penalize class-related classification errors, triplet loss is deployed as well, where a baseline (anchor) input is compared to a positive (truthy) input and a negative (falsy) input. Let Γ be the set of all possible embedding triplets γ = (ga, gp, gn) in the training set. The loss is defined as:
where ga is an anchor input, gp is a positive input from the same class, gn is a negative input from a different class, and Ψ ∈ R+ is a safety margin between positive and negative pairs.
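A common hinge form of the triplet loss can be sketched as follows. The squared Euclidean distance and the batch mean are assumptions made here for illustration, since the distance measure and the normalization are not specified in the text; `margin` plays the role of Ψ.

```python
import numpy as np

def triplet_loss(g_a, g_p, g_n, margin):
    """Mean hinge triplet loss over a batch of (anchor, positive, negative) rows."""
    d_pos = np.sum((g_a - g_p) ** 2, axis=1)  # squared distance anchor-positive
    d_neg = np.sum((g_a - g_n) ** 2, axis=1)  # squared distance anchor-negative
    # A triplet contributes only while the negative is not at least
    # `margin` farther from the anchor than the positive.
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```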
Apart from the above, to minimize intra-class variation, center loss [16] is also adopted. It can be formulated as:
where c denotes the center of the yi-th class, to which the i-th deep feature belongs, and m is the size of the mini-batch.
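In [16], the center loss is half the sum of squared distances between each deep feature in the mini-batch and the center of its class. The sketch below computes that quantity for fixed centers. The function and argument names are hypothetical, and the per-batch center update described in [16] is omitted.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the summed squared distance of each feature to its class center.

    features: (m, D), labels: (m,) integer class indices,
    centers: (num_classes, D).
    """
    diff = features - centers[labels]  # each row minus the center of its class
    return float(0.5 * np.sum(diff ** 2))
```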
To better guide the training process, the similarity between the enhanced embedding and its target is explicitly considered. It is measured by the cosine distance and evaluated as a dot product as follows:
where ḡ and ȳ are the normalized versions of embeddings g and y, respectively.
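The dot product of the normalized embeddings can be sketched as below. How the similarity is turned into a quantity to minimize is not recoverable from the text, so the `1 - similarity` convention in `cosine_loss` is an assumption, as are the function names and the small `eps` guard against zero-length vectors.

```python
import numpy as np

def cosine_similarity(g, y, eps=1e-12):
    """Row-wise dot product of L2-normalized embeddings g and y, shape (N, D)."""
    g_bar = g / np.maximum(np.linalg.norm(g, axis=1, keepdims=True), eps)
    y_bar = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return np.sum(g_bar * y_bar, axis=1)

def cosine_loss(g, y):
    """Assumed sign convention: 1 - similarity, so identical directions give 0."""
    return float(np.mean(1.0 - cosine_similarity(g, y)))
```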
In all, we propose to train the generator Gf with the total loss defined as:
and discriminator Dw with:
After the training of WGAN, the generative model Gf is retained. At the SV test stage, the short embedding x of any given short test utterance can be mapped to its enhanced version g by directly applying the feed-forward model Gf to x.

This section details our experimental setup and our investigation of the effectiveness of the loss criteria proposed above.

4.1. Experimental Setup

We use a subset of Voxceleb2 to train our proposed system, in which 1,057 speakers are chosen with 164,716 utterances in total. Those utterances are randomly cut to 2 seconds to serve as short utterances. Similarly, a subset of Voxceleb1 with 40 speakers is sampled, and a total of 13,265 utterance pairs are used for testing.
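One way to obtain such short utterances is to take a random contiguous segment from each recording, as in the sketch below. The text does not say whether the cut is made on the waveform or on extracted features; cropping the waveform, the 16 kHz sample rate and the function name are all assumptions.

```python
import numpy as np

def random_crop(waveform, sample_rate=16000, duration_s=2.0, rng=None):
    """Return a random contiguous segment of duration_s seconds from a 1-D waveform."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(sample_rate * duration_s))
    if len(waveform) <= n:
        return waveform  # too short to crop; returned unchanged
    # The upper bound is exclusive, so the last valid start is len - n.
    start = rng.integers(0, len(waveform) - n + 1)
    return waveform[start:start + n]
```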
The VGG-ResNet34s network is used to extract G-vectors as our baseline system. For GAN training, the learning rates for both Gf and Dw are 0.0001; Adam optimization is adopted; weight clipping is employed for Dw with thresholds from -0.01 to 0.01; and the batch size is set to 128.
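Weight clipping, as introduced in [1], clamps every critic parameter into a fixed interval after each update. The sketch below shows the clamp with the stated threshold of 0.01, together with the quantity a WGAN critic minimizes. The function names and the representation of parameters as a list of NumPy arrays are illustrative assumptions, and the optimizer step itself is omitted.

```python
import numpy as np

def clip_critic_weights(params, limit=0.01):
    """Clamp every critic parameter array into [-limit, limit] in place."""
    for p in params:
        np.clip(p, -limit, limit, out=p)
    return params

def critic_loss(d_real, d_fake):
    """Loss minimized by the critic: the negative Wasserstein estimate."""
    return float(np.mean(d_fake) - np.mean(d_real))
```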

4.2. Ablation Studies on Various Loss Functions
To verify the importance of the proposed loss criteria, a series of ablation studies is conducted by choosing different combinations of them. The overall results are given in Tab. 1, where Lc and Lt denote Lcenter and Ltriplet, respectively. Triplet a means that inputs are sampled from both y and g, and triplet b means that they are sampled from g only.
In our study, a total of 8 systems (v1 − v8), combining different loss criteria with the Wasserstein GAN, are evaluated. Their corresponding detection error trade-off (DET) curves are plotted in Fig. 2.
From the above experimental results, the following conclusions can be drawn:

• FID loss has a positive effect (v1 vs. v2);
• Conditional WGAN outperforms WGAN (v3 vs. v4);
• Triplet loss is preferred (v7 vs. v2);
• Triplet a greatly outperforms triplet b (v3 vs. v8);
• Softmax has a positive effect (v3 vs. v5);
• Center loss has a negative effect (v6 vs. v7);
• Cosine loss has a significant positive effect (v6 vs. v8).

The above findings are interesting, with a twofold outcome. Firstly, they demonstrate that the additional training functions (e.g. traditional softmax, cosine loss and triplet loss) all contribute positively to performance, which supports our earlier statement that extra training guides might be helpful for feature discriminability. Secondly, some loss criteria that are less favoured in typical SV systems (e.g. FID loss and conditional WGAN loss) are surprisingly helpful; these unusual findings might be worthy of further investigation.

4.3. Comparison with the Baseline System

Finally, we compare the performance of our best system (v3) with the G-vector baseline system. The comparison is measured in terms of equal error rate (EER) and minDCF. The results are reported in Tab. 2.
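For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by scanning the observed scores as thresholds. It is a simplified illustration with a hypothetical function name, not the evaluation tool used in the experiments; minDCF is omitted because its cost parameters are not stated here.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from trial scores (higher = more likely same speaker).

    labels: 1 for target trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    tgt, non = scores[labels == 1], scores[labels == 0]
    # False rejection: a target trial scored below the threshold.
    frr = np.array([(tgt < t).mean() for t in thresholds])
    # False acceptance: a non-target trial scored at or above the threshold.
    far = np.array([(non >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2.0)
```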
From the table, we can see that our proposed system also generalizes, behaving consistently better than the baseline system across different short durations. In detail, for verification with 2-second enroll-test utterances, our proposed system shows a 4.2% relative EER improvement and a 7.2% relative minDCF improvement. For shorter utterances with a duration of 1 second, it shows a comparable EER improvement (3.8%).
It is worth noting that, due to time constraints, the FID loss function has not been added to our final system; moreover, no fine-tuning has been performed on the hyper-parameters, namely the loss weights α, β, γ, λ, ε and the triplet margin η. This means there is still considerable room for improvement in our system.

In this paper, we have successfully applied WGAN to learn enhanced embeddings for speaker verification with short utterances. Our main contributions are twofold: we proposed a WGAN-based kernel system and, on top of it, validated the effectiveness of a range of loss criteria for GAN training. Our final proposed system outperforms the baseline system in challenging short-utterance speaker verification scenarios. In all, our experiments show both a decent advancement and a potential direction for our further research.

