2016--Analysis of the DNN-based SRE systems in multi-language conditions

Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Brno, Czech Republic


This work was supported by the DARPA RATS Program under Contract No. HR0011-15-C-0038. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
This work was also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.
The work was also supported by the Czech Ministry of Interior project No. VI20152020025 “DRAPAK” and the European Union’s Horizon 2020 programme under grant agreement No. 645523 BISON.

This paper analyzes the behavior of our state-of-the-art Deep Neural Network/i-vector/PLDA-based speaker recognition systems in multi-language conditions. On the “Language Pack” of the PRISM set, we evaluate the systems’ performance using NIST’s standard metrics. We show that the gain from using DNNs vanishes, that using DNNs dedicated to the target conditions does not help, and that the DNN-based systems tend to produce de-calibrated scores under the studied conditions. This work gives suggestions for directions of future research rather than any particular solutions to these issues.


During the last decade, neural networks have experienced a renaissance as a powerful machine learning tool. Deep Neural Networks (DNN) have also been successfully applied to the field of speech processing. After their great success in automatic speech recognition (ASR) [1], DNNs were also found very useful in other fields of speech processing such as speaker [2, 3, 4] or language recognition [5, 6, 7]. In speech recognition, DNNs are often directly trained for the “target” task of frame-by-frame classification of speech sounds (e.g. tied tri-phone states). Similarly, a DNN directly trained for frame-by-frame classification of languages was successfully used for language recognition in [7]. However, this system provided competitive performance only for speech utterances of short durations.
In the field of speaker recognition, DNNs are usually used in a more elaborate and indirect way: one approach is to use DNNs for extracting frame-by-frame speech features. Such features are then used in the usual way (e.g. as input to an i-vector based system [8]).
These features can be directly derived from the DNN output posterior probabilities [9] and combined with the conventional features (PLP or MFCC) [10]. More commonly, however, bottleneck (BN) DNNs are trained for a specific task, and the features are taken from a narrow hidden layer that compresses the relevant information into low-dimensional feature vectors [6, 5, 11]. Alternatively, a standard DNN (with no bottleneck) can be used, where the high-dimensional outputs of one of the hidden layers are converted to features using a dimensionality reduction technique such as PCA [12].
In [13], we analyzed various DNN approaches to speaker recognition (similar studies were conducted e.g. in [14, 15]). We used two different DNNs: a mono-lingual DNN trained on the Fisher English data corpus, and a multi-lingual DNN trained on 11 languages of the Babel data collection. The rest of the system was trained on the PRISM set, i.e. mainly on English data. We reported our results only on the NIST SRE 2010 telephone condition (i.e. only on English speech) via the Equal Error Rates (EERs) and the minimum DCF NIST metrics.

[13] Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of DNN approaches to speaker identification,” in Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5100–5104, IEEE Signal Processing Society.
[14] Yao Tian, Meng Cai, Liang He, and Jia Liu, “Investigation of bottleneck features and multilingual deep neural networks,” in Interspeech, 2015.
[15] Sandro Cumani, Oldřich Plchot, and Pietro Laface, “Comparison of hybrid DNN-GMM architectures for speaker recognition,” in ICASSP, 2016, IEEE Signal Processing Society.

However, when tested on non-English test sets, we observed that the benefit of using the DNNs degraded dramatically. We used the “lan” Language Pack of the PRISM set (described later in the paper) and its Chinese subset, the “chn” pack, in comparison with the originally used NIST SRE 2010 telephone condition. Not only did we see performance degradation in terms of EER and the minimum DCFs, but even more so in terms of the actual DCFs, i.e. the systems produce heavily de-calibrated scores.
Note: “de-calibrated” means that the scores are less reliable as decision scores, so a fixed decision threshold yields an actual DCF well above the minimum DCF. Well-calibrated scores would give an actual DCF close to the minimum DCF.
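To make the difference between the minimum and the actual DCF concrete, the following is a minimal Python sketch. It assumes the scores are meant to be log-likelihood ratios, and the cost parameters (`p_target`, `c_miss`, `c_fa`) and toy scores are illustrative assumptions, not the values used in the paper. The actual DCF applies the theoretical Bayes threshold, while the minimum DCF picks the best threshold in hindsight, so a large gap between the two signals de-calibration.

```python
import math

def dcf(tar, non, threshold, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost at a fixed decision threshold."""
    p_miss = sum(s < threshold for s in tar) / len(tar)
    p_fa = sum(s >= threshold for s in non) / len(non)
    cost = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    # Normalize by the cost of the best trivial (always accept/reject) system.
    return cost / min(c_miss * p_target, c_fa * (1 - p_target))

def actual_dcf(tar, non, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """DCF at the Bayes threshold, which is optimal only for calibrated LLRs."""
    bayes_thr = math.log(c_fa * (1 - p_target) / (c_miss * p_target))
    return dcf(tar, non, bayes_thr, p_target, c_miss, c_fa)

def min_dcf(tar, non, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Lowest DCF over all thresholds, i.e. the cost under ideal calibration."""
    candidates = sorted(tar + non) + [float("inf")]
    return min(dcf(tar, non, t, p_target, c_miss, c_fa) for t in candidates)

# Toy scores (made up): perfectly separable, but shifted away from the
# Bayes threshold, so discrimination is fine while calibration is poor.
tar = [2.0, 2.5, 3.0, 3.5]
non = [-1.0, -0.5, 0.0, 0.5]
print(min_dcf(tar, non), actual_dcf(tar, non))
```

On these toy scores the minimum DCF is zero while the actual DCF is as bad as a trivial system, which is the pattern the paragraph above describes in a much milder form.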

Our hypothesis was that using a DNN trained for the target language would decrease the error rates. To match the sre10, “lan”, and “chn” test conditions, we chose three DNNs, trained on i) Fisher English, ii) the multilingual set, and iii) Mandarin, respectively. However, it turned out that, apart from the Fisher English DNN being optimal for the NIST SRE 2010 test, there was no clear correlation between the test language and the DNN training language.
Note: in other words, the hypothesis was that error rates drop when the DNN training language matches the test language. The “lan” condition contains multilingual test speech, “chn” contains Mandarin test speech, and sre10 contains English test speech, hence the three DNNs trained on English, multilingual, and Mandarin data. The result contradicts the hypothesis: matching the DNN training language to the test language does not reliably lower the error rates.

This paper analyzes the problems that emerged when applying the current state-of-the-art SRE systems to non-English domains, and provides directions for future research. This work is an extension of our previous analysis, available as a technical report [16].


[16] Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký, “DNN-based SRE systems in multi-language conditions,” 2016, BUT Technical Report, http://www.fit.vutbr.cz/research/pubs/report.php?id=11235, also being submitted to IEEE Signal Processing Letters.

2.1. i-vector Systems
The i-vectors [8] provide an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors s is modeled as
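The equation itself did not survive in these notes. For reference, the standard i-vector formulation, which is presumably what the sentence above introduces, reads as follows. Only the symbol s comes from the text; m, T, and w follow the usual convention and are assumptions here.

```latex
\mathbf{s} = \mathbf{m} + \mathbf{T}\mathbf{w}
```

In this convention, m is the mean supervector of the Universal Background Model (UBM), T is a low-rank matrix spanning the total variability subspace, and w is a latent vector with a standard normal prior. The point estimate of w for a given utterance is the i-vector.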

We experimented with monolingual (English and Mandarin) and multilingual BN features. In the case of multilingual training, we adopted a training scheme with block-softmax, which divides the output layer into parts according to the individual languages. During training, only the part of the output layer that corresponds to the language of the given target is activated. See [20, 21] for a detailed description.
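The block-softmax idea can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the `blocks` layout, and the toy activations are hypothetical, and a real system would compute this inside a neural network toolkit. The sketch only shows that the softmax is normalized over the active language's block, so outputs belonging to other languages receive no probability mass and no gradient.

```python
import math

def block_softmax(logits, blocks, lang):
    """Softmax restricted to the output block of one language.

    logits: output-layer activations for one frame
    blocks: dict mapping language -> (start, end) slice of the output layer
    lang:   language of the current training frame
    Outputs outside the active block get probability 0.
    """
    start, end = blocks[lang]
    peak = max(logits[start:end])  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits[start:end]]
    total = sum(exps)
    probs = [0.0] * len(logits)
    probs[start:end] = [e / total for e in exps]
    return probs

def block_cross_entropy(logits, blocks, lang, target):
    """Cross-entropy for a target index lying inside the language's block."""
    return -math.log(block_softmax(logits, blocks, lang)[target])

# Hypothetical layout: 3 outputs for language "A", 2 for language "B".
blocks = {"A": (0, 3), "B": (3, 5)}
probs = block_softmax([1.0, 2.0, 3.0, 10.0, -4.0], blocks, "A")
loss = block_cross_entropy([1.0, 2.0, 3.0, 10.0, -4.0], blocks, "A", target=2)
```

Note that the large activation at index 3 belongs to language "B" and therefore has no influence on the probabilities computed for a frame of language "A".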

A Bottleneck Neural Network (BN-NN) is a NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding layers. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck Features.
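The “stacked in time” step can be sketched as below. The context offsets and the edge handling (repeating the first and last frame) are assumptions chosen for illustration, since the text above does not specify them, and the first-stage outputs are replaced by toy values.

```python
def stack_in_time(frames, offsets=(-10, -5, 0, 5, 10)):
    """Concatenate each frame with its temporal neighbours.

    frames:  list of per-frame vectors (lists of floats), e.g. the
             bottleneck outputs of the first network
    offsets: relative frame positions to stack (assumed values)
    Edge frames are handled by repeating the first or last frame.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        context = []
        for off in offsets:
            idx = min(max(t + off, 0), last)  # clamp to the valid range
            context.extend(frames[idx])
        stacked.append(context)
    return stacked

# Toy first-stage bottleneck outputs: 30 frames of dimension 2.
bn1 = [[float(t), float(-t)] for t in range(30)]
bn2_input = stack_in_time(bn1)  # 30 frames of dimension 2 * 5 = 10
```

Each stacked vector would then be forwarded through the second network, whose bottleneck layer provides the final features.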

However, it was shown that DNNs can be used directly for posterior computation [2].

In other words, we show the utility of the trained DNNs as both feature- and posterior-extractors.

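SBN feature extraction involves two NNs: the bottleneck (BN) outputs of the first NN are stacked, down-sampled (optionally), and used as the input vector for the second NN. The second NN again has a BN layer, whose outputs serve as input features for a conventional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) speech recognition system. Fundamental frequency (f0) related features are important in speech recognition systems for both tonal and non-tonal languages, so we experimented with different f0 features as additional inputs to the SBN. The final bottleneck features output by the SBN are 80-dimensional and are subsequently used as input features for the conventional GMM/UBM i-vector speaker recognition system.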


English SBN
Mandarin SBN
Multilang SBN
English DNN
Mandarin DNN

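Note: the DNN structure of the baseline is unclear, as the paper does not state it; the baseline is presumably a standard Kaldi i-vector system. The baseline feature extraction is set up as follows: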
19 MFCC coefficients + energy augmented with their delta and double delta coefficients, resulting in 60-dimensional feature vectors. The analysis window was 20 ms long with a shift of 10 ms.
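The dimensionality works out as (19 + 1) static coefficients times three streams (static, delta, double delta), i.e. 60. A minimal Python sketch of the augmentation is given below. The regression window of ±2 frames and the edge handling are assumptions, since the text does not state how the deltas were computed, and the static features are toy values rather than real MFCCs.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients over +/- `window` frames (assumed)."""
    last = len(frames) - 1
    norm = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, last)][d]  # repeat the last frame at the end
                prv = frames[max(t - k, 0)][d]     # repeat the first frame at the start
                acc += k * (nxt - prv)
            row.append(acc / norm)
        out.append(row)
    return out

def add_deltas(static):
    """Append delta and double-delta coefficients to every frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy static features: 50 frames of 20 coefficients (19 MFCC + energy),
# each growing linearly by 0.1 per frame.
static = [[0.1 * t] * 20 for t in range(50)]
features = add_deltas(static)  # 50 frames of dimension 60
```

With a 10 ms frame shift, the 50 toy frames correspond to roughly half a second of speech.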

24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different f0 estimators (Kaldi, Snack, and two others according to [17] and [18]). Together, we have 13 f0-related features; see [19] for more details.
The resulting SBN features are 80-dimensional and are subsequently used as input features for the conventional GMM/UBM i-vector speaker recognition system.

Note: the inputs and outputs of the English DNN and the Mandarin DNN are also unclear; the paper does not seem to specify them.

In Tab. 3, we show the effect of a linear calibration on the English SBN system. Because of the lack of an independent held-out set, we performed a cheating (gender-independent) calibration trained using the “lan” trial set, which contains both English and Chinese trials.
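A linear calibration maps every raw score s to a·s + b. The sketch below fits a and b by prior-weighted logistic regression with plain gradient descent. This is a common way to train such a mapping, but it is an assumption here: the text does not say which training procedure was used, and the learning rate, step count, prior, and toy scores are purely illustrative.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_calibration(tar, non, p_target=0.5, lr=0.1, steps=2000):
    """Fit s' = a * s + b by prior-weighted logistic regression."""
    a, b = 1.0, 0.0
    prior_logit = math.log(p_target / (1 - p_target))
    groups = ((tar, 1.0, p_target / len(tar)),
              (non, 0.0, (1 - p_target) / len(non)))
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for scores, label, weight in groups:
            for s in scores:
                err = sigmoid(a * s + b + prior_logit) - label
                grad_a += weight * err * s
                grad_b += weight * err
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy raw scores (made up), overlapping and shifted to the positive side.
tar = [1.0, 2.0, 2.5, 3.0, 3.5]
non = [-1.0, -0.5, 0.0, 0.5, 1.5]
a, b = train_linear_calibration(tar, non)
calibrated = [a * s + b for s in tar + non]
```

Training and evaluating the mapping on the same trials, as in the “cheating” setup described above, gives an optimistic estimate of what calibration can achieve; with an independent held-out set the gain would likely be smaller.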

In this work, we have studied the behavior of the DNN techniques in SRE i-vector/PLDA systems, currently considered to be state-of-the-art, as evaluated on the most common NIST SRE English test sets, such as the NIST SRE 2010, condition 5. We have shown that when applied to non-English test sets, these techniques stop being effective and are susceptible to de-calibration of the scores produced by the traditional i-vector/PLDA systems. We have also observed that selecting a DNN to match the test condition does not solve the issues mentioned above.
This work therefore leaves more questions than answers, and suggests that we focus on the analysis of the DNN acoustic space clustering with regard to multiple languages and other types of variability, and that we study the behavior of clustering with regard to the available SRE training data.
