
XMU-TS SYSTEMS FOR NIST SRE19 CTS CHALLENGE

Hao Lu1, Jianfeng Zhou2, Miao Zhao1, Wendian Lei3, Qingyang Hong∗1, Lin Li∗2
1School of Informatics, Xiamen University, China
2School of Electronic Science and Engineering, Xiamen University, China
3Xiamen Talentedsoft Co., Ltd., China {qyhong,lilin}@xmu.edu.cn

ABSTRACT

In this paper, we present our submitted XMU-TS systems for the NIST SRE19 CTS Challenge. This year's evaluation only offers the open training condition. With large amounts of data assimilated into the training set, the diversity of training data sources inevitably leads to domain mismatch, which becomes a key factor affecting system performance. In order to solve this problem, we made a number of attempts. Based on the x-vector framework, we used different network structures and tried to improve the performance of the factorized time delay deep neural network (F-TDNN) and the residual network (ResNet). In addition, in the back-end classifier, we used domain adaptation to eliminate the impact of domain mismatch. Finally, we employed Adaptive Symmetric Score Normalization (AS-Norm) to adjust the score distribution. These attempts enriched the diversity of our systems, allowing each subsystem to complement the others in the fusion system and improving the final submission performance.

1. INTRODUCTION

The Speaker Recognition Evaluation, sponsored by the US National Institute of Standards and Technology (NIST), has been one of the most representative contests in speaker recognition since 1996. Research teams from all over the world constantly explore new algorithms and state-of-the-art technologies for speaker recognition. The 2019 NIST Speaker Recognition Evaluation (SRE19) includes two separate activities:

  • CTS: The evaluation data is conversational telephone speech obtained from the Call My Net 2 (CMN2) corpus.
  • Multimedia: The evaluation data includes audio and visual data obtained from the Video Annotation for Speech Technology (VAST) corpus.

The goal of the two tasks above is to determine whether the enrollment speaker is present in the test segment. Our systems only address the CTS task.

Since this evaluation provides the open training condition, it inevitably leads to the introduction of large-scale publicly available data sets for system development. It is conceivable that domain mismatch between individual data sets and the test data will arise due to the different collection environments of the data sets. We started the system development work for this challenge and tried to eliminate the performance degradation caused by the domain mismatch.

The first thing we thought of was to increase the diversity of subsystems, and the most convenient way is to extract different acoustic features for training. In our experiments, three types of features were employed for training. It is also necessary to find a robust training framework. In the field of speaker recognition, one of the most representative architectures at present is the time delay deep neural network (TDNN) based x-vector [1, 2]. In [3], D. Snyder used data augmentation to improve the robustness of the x-vector system, which also demonstrates that the x-vector is data driven. This characteristic of the x-vector and its excellent performance are perfectly in line with our needs. Therefore, we mainly explored speaker recognition architectures based on the x-vector. In terms of network structure, we mainly explored TDNN, extended TDNN (E-TDNN) [4], F-TDNN [5] and ResNet [6]. E-TDNN has a deeper network structure than TDNN to learn more information. F-TDNN factorizes the parameter matrices into smaller matrices, which makes training more efficient. And ResNet can learn detailed temporal information. A minimal sketch of the TDNN building blocks is given below.
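As a rough illustration of these building blocks, the PyTorch sketch below implements a time-delay layer as a dilated 1-D convolution over frames, followed by the statistics pooling that turns frame-level activations into a fixed-length segment embedding. The layer widths and temporal contexts are illustrative assumptions, not the exact configurations of [2] or [4].

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One time-delay layer: a dilated 1-D convolution over frames."""
    def __init__(self, in_dim, out_dim, context, dilation):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context, dilation=dilation)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):            # x: (batch, feat_dim, frames)
        return torch.relu(self.bn(self.conv(x)))

class StatsPooling(nn.Module):
    """Aggregate frame-level outputs into a segment-level vector (mean + std)."""
    def forward(self, x):            # x: (batch, feat_dim, frames)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

# Illustrative stack: 23-dim MFCC in, 512-dim embedding out.
frame_layers = nn.Sequential(
    TDNNLayer(23, 512, context=5, dilation=1),    # frames t-2 .. t+2
    TDNNLayer(512, 512, context=3, dilation=2),   # frames t-2, t, t+2
    TDNNLayer(512, 512, context=3, dilation=3),   # frames t-3, t, t+3
    TDNNLayer(512, 1500, context=1, dilation=1),
)
pool = StatsPooling()
embed = nn.Linear(3000, 512)         # the "x-vector" layer

feats = torch.randn(4, 23, 300)      # 4 utterances, 300 frames each
xvec = embed(pool(frame_layers(feats)))
print(xvec.shape)                    # torch.Size([4, 512])
```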

Following the extraction of x-vectors, we used probabilistic linear discriminant analysis (PLDA) [7] for back-end scoring. We also employed centering, whitening, LDA, domain adaptation and length normalization on the x-vectors before scoring. These played an important role in eliminating domain mismatches. After scoring, we used AS-Norm [8] to optimize the distribution of scores. In this process, the impact of domain mismatch is gradually weakened, which further improves our system performance.

The rest of the paper is organized as follows: Section 2 gives the description of the datasets and acoustic feature extraction. In Section 3, we describe the details of the subsystems we developed for SRE19. Section 4 illustrates the back-end and score normalization. In Section 5, we report the results of our subsystems on the SRE18 evaluation. Finally, we conclude our work in Section 6.

2. DATA PREPARATION

2.1. Datasets

For the open training condition, the publicly available data sets, including the corpora of the speaker recognition evaluations (SREs), Switchboard and VoxCeleb [9, 10], were used for system training and development.
By combining the above individual datasets, three different training sets were obtained:
(i) Train-v1: This set includes the corpora of NIST SRE04, 05, 08, 10, SRE12-tel, MIXER6 and Switchboard. It contains 84,287 recordings from 5,238 speakers.
(ii) Train-v2: This set consists of VoxCeleb1 and VoxCeleb2. It contains 2,040,479 recordings from 7,205 speakers.
(iii) Train-v3: This set includes the corpora of NIST SRE04, 05, 08, 10, MIXER6, VoxCeleb1, VoxCeleb2 and Switchboard. This provides 2,124,766 recordings from 12,443 speakers.

We also employed additive noise and reverberation (i.e., Babble, Noise and Music from MUSAN [11], and Reverb using the room impulse responses from [12]) as described in [3] to augment the training data. This operation makes the systems more robust and alleviates the problem of training data domain mismatch.
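The additive-noise branch of this recipe can be sketched as follows; the noise clip, speech signal and SNR range in the example are placeholder assumptions, and the reverberation branch (convolution with a room impulse response from [12]) is omitted for brevity.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into speech at the requested SNR (in dB)."""
    # Tile or crop the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Typical recipe: pick a MUSAN noise file and an SNR from a small range.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)      # placeholder: 1 s of audio
noise = rng.standard_normal(8000)        # placeholder noise clip
augmented = add_noise(speech, noise, snr_db=rng.uniform(5, 15))
```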

2.2. Acoustic feature extraction

2.2.1. Mel frequency cepstral coefficient
For Mel frequency cepstral coefficient (MFCC) feature extraction, all audio was converted to 23-dimensional MFCC cepstral features with a frame length of 25 ms and a frame shift of 10 ms. The cepstral filter banks were selected within the range of 20 to 3700 Hz. Then, frame-level energy-based voice activity detection (VAD) was applied to the features, followed by local cepstral mean and variance normalization (CMVN) over a 3-second sliding window. All feature extraction operations were based on the Kaldi toolkit [13].
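For illustration, a Kaldi-compatible extraction with these settings can be written via torchaudio; the file path and the mel-bin count are assumptions here, since the paper does not state the number of mel bins.

```python
import torchaudio

waveform, sr = torchaudio.load("utt.wav")   # placeholder path; 8 kHz telephone audio expected
mfcc = torchaudio.compliance.kaldi.mfcc(
    waveform,
    sample_frequency=sr,
    frame_length=25.0,      # ms, as in the paper
    frame_shift=10.0,       # ms
    num_ceps=23,
    num_mel_bins=30,        # assumption: must be >= num_ceps; not stated in the paper
    low_freq=20.0,
    high_freq=3700.0,
)
print(mfcc.shape)           # (num_frames, 23)
```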

2.2.2. Perceptual linear predictive features
Perceptual linear prediction (PLP) is also a common acoustic feature. Compared to linear prediction coefficients (LPC), it is more in line with the auditory mechanism of the human ear. 20-dimensional PLP features with 3-dimensional pitch (PLP-Pitch) parameters were also adopted for performance comparison in this work. As with MFCC, VAD and CMVN were applied in sequence.

2.2.3. Filter bank features
The other subsystems were based on filter bank (FB) features. FB features retain much of the raw information, which makes it possible for the neural network to learn more useful information; of course, this also requires the neural network itself to have strong modeling capability. The FB feature vectors include 40-dimensional FBs and an energy value extracted from the raw signal with a 25 ms frame length. As with MFCC, VAD and CMVN were applied in sequence.
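The VAD and CMVN steps are shared across all three feature types. The numpy sketch below shows the sliding-window normalization; 300 frames corresponds to the 3-second window at a 10 ms shift, and whether variance normalization is enabled in the Kaldi recipe is an assumption exposed as a flag.

```python
import numpy as np

def sliding_cmvn(feats: np.ndarray, window: int = 300, norm_var: bool = True) -> np.ndarray:
    """Normalize each frame over a centered sliding window (Kaldi-style CMVN).

    feats: (num_frames, feat_dim); window: frames (300 = 3 s at a 10 ms shift).
    """
    num_frames = len(feats)
    out = np.empty_like(feats)
    half = window // 2
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half)
        ctx = feats[lo:hi]                      # local context around frame t
        out[t] = feats[t] - ctx.mean(axis=0)    # mean normalization
        if norm_var:
            out[t] /= ctx.std(axis=0) + 1e-8    # variance normalization
    return out

feats = np.random.randn(1000, 23)               # e.g., MFCCs for a 10 s utterance
normed = sliding_cmvn(feats)
```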

3. SUBSYSTEMS

The final submitted system is based on the fusion of several x-vector systems with different datasets and features. In this section we will introduce the details of each subsystem.

3.1. TDNN x-vector systems

• tdnn-v1: TDNN x-vector with the architecture proposed in [2], trained on 2-fold Train-v1 with 23-dimension PLP-Pitch features.

• tdnn-v2: TDNN x-vector trained on 2-fold Train-v1 with 23-dimension MFCC features.

3.2. Extended TDNN x-vector systems
• e-tdnn-v1: Extended TDNN x-vector trained on 2-fold Train-v3 with 26-dimension MFCC-Pitch features (23-dimension MFCC and 3-dimension Pitch are concatenated together). The configuration of the extended TDNN x-vector can be found in [4].

• e-tdnn-v2: Extended TDNN x-vector trained on 2-fold Train-v3 with 23-dimension PLP-Pitch features.

• e-tdnn-v3: Extended TDNN x-vector trained on 5-fold Train-v2 with 23-dimension MFCC features.

• e-tdnn-v4: Extended TDNN x-vector trained on 2-fold Train-v3 with 23-dimension PLP-Pitch features, which are the same as in e-tdnn-v2 but differ in the VAD procedure: the VAD threshold is relaxed, allowing some of the noise to be retained.

• e-tdnn-v5: Extended TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features.

• e-tdnn-v6: Extended TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features and focal loss [14], which was proposed to address the imbalance of samples across classes. In this system, focal loss was used as the training objective (a minimal sketch follows this list).
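A minimal PyTorch sketch of focal loss for multi-class speaker classification is given below. The focusing parameter γ = 2 is the typical value from [14] and an assumption here, since the paper does not report its setting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss [14]: cross-entropy scaled by (1 - p_t)^gamma, so easy
    (high-confidence) examples contribute less than hard ones.

    logits: (batch, num_speakers); targets: (batch,) class indices.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()

logits = torch.randn(8, 12443, requires_grad=True)  # e.g., Train-v3 has 12,443 speakers
targets = torch.randint(0, 12443, (8,))
loss = focal_loss(logits, targets)
loss.backward()
```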

3.3. Factorized TDNN x-vector systems
The core trick of F-TDNN is factorizing the weight matrices under a semi-orthogonal constraint. This significantly reduces the number of parameters while, as argued via singular value decomposition (SVD) in [5], losing little modeling capability. The configuration of the first two factorized TDNN x-vector systems can be found in [15]. We modified the architecture of the factorized TDNN to make it deeper in the third system; the architecture configuration is shown in Table 1. A simplified sketch of the factorization follows below.
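The PyTorch sketch below illustrates the idea: each weight matrix is split into two smaller factors, and one factor is periodically pushed back toward semi-orthogonality with a simplified version of the update from [5] (a fixed step size is used here instead of Kaldi's adaptive scaling, which is an assumption of this sketch).

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """W (out x in) factored as B (out x bottleneck) @ A (bottleneck x in),
    with A kept approximately semi-orthogonal (A @ A.T ~= I)."""
    def __init__(self, in_dim: int, out_dim: int, bottleneck: int):
        super().__init__()
        self.A = nn.Linear(in_dim, bottleneck, bias=False)  # constrained factor
        self.B = nn.Linear(bottleneck, out_dim)

    def forward(self, x):
        return self.B(self.A(x))

    @torch.no_grad()
    def orthogonalize(self, alpha: float = 0.125):
        """One step of M <- M - alpha * (M M^T - I) M, after [5] (simplified)."""
        M = self.A.weight                    # (bottleneck, in_dim)
        P = M @ M.t()
        I = torch.eye(P.size(0), device=M.device)
        M -= alpha * (P - I) @ M

layer = FactorizedLinear(512, 512, bottleneck=256)
# During training, call layer.orthogonalize() every few optimizer steps.
for _ in range(100):
    layer.orthogonalize()
err = (layer.A.weight @ layer.A.weight.t() - torch.eye(256)).abs().max()
print(f"max |AA^T - I| after updates: {err.item():.2e}")
```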

[Table 1. Architecture configuration of the deeper F-TDNN x-vector system.]

• f-tdnn-v1: Factorized TDNN x-vector trained on 5-fold Train-v2 with 23-dimension MFCC features.

• f-tdnn-v2: Factorized TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features.

• f-tdnn-v3: Factorized TDNN x-vector trained on 2-fold Train-v3 with 44-dimension FB-Pitch features (41-dimension FB and 3-dimension Pitch are concatenated together).

3.4. ResNet x-vector systems

• res-v1: ResNet trained on 2-fold Train-v3 with 44-dimension FB-Pitch features. The architecture configuration of this system is shown in Table 2.

[Table 2. Architecture configuration of the ResNet x-vector system.]
Most of the subsystems were implemented with the Kaldi toolkit [13], except e-tdnn-v5 and e-tdnn-v6, which were implemented in PyTorch [16].

4. BACK-END

4.1. Scoring

For all the systems above, the PLDA was trained using embeddings of the 2-fold Train-v1 (Switchboard excluded), since PLDA is sensitive to the domain. For post-processing of the embeddings extracted from the embedding extractors, length normalization, centering, whitening and an LDA transformation for feature dimensionality reduction were applied to the embeddings in sequence, followed by PLDA training. Furthermore, the PLDA parameters were adapted on the in-domain development data (SRE18). All scores of the subsystems were estimated using the adapted PLDA. A minimal sketch of the post-processing chain is given below.
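For illustration, the post-processing chain (excluding PLDA itself, which was trained and adapted in Kaldi) can be sketched as follows; the embedding dimension, LDA output dimension and the training data in the example are placeholder assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 512))   # placeholder training x-vectors
y = rng.integers(0, 200, 5000)         # placeholder speaker labels

# 1) Centering and whitening, estimated on the training embeddings.
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec / np.sqrt(eigval + 1e-8)    # whitening transform: V * Lambda^{-1/2}

# 2) LDA for dimensionality reduction (target dim is illustrative).
lda = LinearDiscriminantAnalysis(n_components=150).fit((X - mean) @ W, y)

def preprocess(x: np.ndarray) -> np.ndarray:
    """Center -> whiten -> LDA -> length-normalize one embedding."""
    z = lda.transform(((x - mean) @ W).reshape(1, -1))[0]
    return z / np.linalg.norm(z)

xvec = rng.standard_normal(512)
print(preprocess(xvec).shape)          # (150,)
```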

4.2. Score normalization and fusion
It is worth noting that we applied AS-Norm to the scores. For a given score $s_{ij}$ (a score for speaker model $i$ and test segment $j$), the normalized score can be formulated as follows:
$$\tilde{s}_{ij} = \frac{1}{2}\left(\frac{s_{ij}-\mu_i(N_j)}{\sigma_i(N_j)} + \frac{s_{ij}-\mu_j(N_i)}{\sigma_j(N_i)}\right)$$
where $\mu_i(N_j)$ and $\sigma_i(N_j)$ are the mean and standard deviation of the scores between speaker model $i$ and the segments from a subset of the cohort set, $N_j$, which consists of the cohort segments with the top $N$ scores against test segment $j$; $\mu_j(N_i)$ and $\sigma_j(N_i)$ are defined symmetrically.
In our experiments, the combination of the unlabeled and enrollment parts of the SRE18 dev set was used as the cohort set, and the top 2,000 sorted cohort scores were used for calculating the normalization statistics.
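A direct numpy sketch of this computation is given below, following the definition of $\mu_i(N_j)$ and $\sigma_i(N_j)$ above; the trial counts and cohort size in the example are placeholders.

```python
import numpy as np

def as_norm(scores, enroll_cohort, test_cohort, top_n=2000):
    """Adaptive symmetric score normalization (AS-Norm) [8].

    scores:        (E, T) raw PLDA scores, enrollment i vs. test j
    enroll_cohort: (E, C) scores of each enrollment model vs. the cohort
    test_cohort:   (T, C) scores of each test segment vs. the cohort
    """
    E, T = scores.shape
    # Indices of the top-N cohort segments for each test segment / enrollment model.
    n_j = np.argsort(test_cohort, axis=1)[:, -top_n:]     # (T, top_n)
    n_i = np.argsort(enroll_cohort, axis=1)[:, -top_n:]   # (E, top_n)
    out = np.empty_like(scores)
    for i in range(E):
        for j in range(T):
            e_sel = enroll_cohort[i, n_j[j]]   # scores of model i vs. cohort subset N_j
            t_sel = test_cohort[j, n_i[i]]     # scores of segment j vs. cohort subset N_i
            out[i, j] = 0.5 * ((scores[i, j] - e_sel.mean()) / e_sel.std()
                               + (scores[i, j] - t_sel.mean()) / t_sel.std())
    return out

rng = np.random.default_rng(0)
raw = rng.standard_normal((10, 20))            # 10 models x 20 test segments
normed = as_norm(raw, rng.standard_normal((10, 3000)),
                 rng.standard_normal((20, 3000)), top_n=2000)
```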

5. RESULTS

We present the experimental results on the evaluation part of SRE18, since not every subsystem obtained results on the progress set of SRE19. All the results were calculated with the official scoring software. The results of all subsystems are shown in Table 3. We used AS-Norm to adjust the score distribution in order to improve system performance. However, the results after AS-Norm only improved the EER and still did not help the primary metric (actual detection cost); those results are not listed due to the limited length of the paper. We thus fused the original score and the score after AS-Norm, which further improves the performance of the subsystems. The score-level fusion results of all subsystems can be seen in Table 4.
[Table 3. Results of all subsystems on the SRE18 evaluation set.]
[Table 4. Results of the subsystems after score-level fusion of the original and AS-Norm scores.]
From the above results, it can be clearly seen that the F-TDNN x-vector performs better given the same data, and the best single system is f-tdnn-v2. It can also be seen that the structure of f-tdnn-v3 has great potential, as it takes advantage of the deeper configuration of the F-TDNN network. On the feature side, PLP has better robustness to noise and seems to be more suitable for this data augmentation mode. We did a lot of engineering optimization on the PyTorch system, which gave e-tdnn-v5 the best performance among the systems with the same extended TDNN architecture.

Finally, all twelve subsystems are fused to generate the scores of the primary system submitted to the CTS challenge. The BOSARIS toolkit [17] was used to perform the fusion by learning the weights from the scores of the development set; in our experiments, we used the eval part of SRE18 as the development set. Results of our primary system on the progress set of NIST SRE19 and the SRE19 baseline are shown in Table 5. System fusion allows the usage of complementary information from different subsystems so that the performance can be improved. A sketch of this fusion scheme follows below.
[Table 5. Results of the primary system on the SRE19 progress set, compared with the SRE19 baseline.]
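As a simplified stand-in for the BOSARIS procedure, the sketch below learns a linear logistic-regression combination of the subsystem scores on the development trials; BOSARIS additionally trains with an effective target prior, which is omitted here, and the score matrices are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder dev scores: rows = trials, columns = the twelve subsystems.
dev_scores = rng.standard_normal((10000, 12))
dev_labels = rng.integers(0, 2, 10000)       # 1 = target trial, 0 = non-target

# Learn per-subsystem weights and an offset on the development set.
fuser = LogisticRegression().fit(dev_scores, dev_labels)

def fuse(scores: np.ndarray) -> np.ndarray:
    """Fused log-likelihood-ratio-like score: w . s + b."""
    return scores @ fuser.coef_[0] + fuser.intercept_[0]

eval_scores = rng.standard_normal((5000, 12))
fused = fuse(eval_scores)
```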

6. CONCLUSIONS

This paper presented the description of the XMU-TS submission to the SRE19 CTS challenge. In view of the large amount of training data and the domain mismatch problem, we made various attempts in network structures, back-end scoring and score normalization. Different acoustic features greatly enhance the diversity and complementarity of our systems. These attempts eliminated the impact of domain mismatch to some extent at different stages, allowing our final fusion system to achieve a great improvement in comparison with the baseline.

7. ACKNOWLEDGEMENT

This work is supported by the National Natural Science Foundation of China (Grant No. 61876160).

8. REFERENCES

[1] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH, 2015.
[2] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in INTERSPEECH, 2017, pp. 999–1003.
[3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.
[5] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in INTERSPEECH, 2018, pp. 3743–3747.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[7] Simon J. D. Prince and James H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[8] Sandro Cumani, Pier Domenico Batzu, Daniele Colibro, Claudio Vair, Pietro Laface, and Vasileios Vasilakakis, “Comparison of speaker recognition approaches for real applications,” in INTERSPEECH, 2011.
[9] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[10] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[11] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[12] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
[13] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[15] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Fred Richardson, Suwon Shon, François Grondin, et al., “The JHU-MIT system description for NIST SRE18,” Johns Hopkins University, Baltimore, MD, Tech. Rep., 2018.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” 2017.
[17] Niko Brümmer and Edward De Villiers, “The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF,” arXiv preprint arXiv:1304.2865, 2013.
