Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification


Interspeech 2018 - Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

http://www.danielpovey.com/files/2018_interspeech_xvector_attention.pdf

Yingke Zhu1, Tom Ko2, David Snyder3, Brian Mak1, Daniel Povey3
1Department of Computer Science & Engineering
The Hong Kong University of Science & Technology
2Huawei Noah's Ark Research Lab, Hong Kong, China
3Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, USA
{yzhuav,mak}@cse.ust.hk, {tomkocse, david.ryan.snyder, dpovey}@gmail.com

Abstract
This paper introduces a new method to extract speaker embeddings from a deep neural network (DNN) for text-independent speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over the frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights automatically determined by a self-attention mechanism. The effect of multiple attention heads is also investigated to capture different aspects of a speaker's input speech. Finally, a PLDA classifier is used to compare pairs of embeddings. The proposed self-attentive speaker embedding system is compared with a strong DNN embedding baseline on NIST SRE 2016. We find that the self-attentive embeddings achieve superior performance. Moreover, the improvement produced by the self-attentive speaker embeddings is consistent for both short and long test utterances.

1. Introduction

Speaker verification (SV) is the task of accepting or rejecting the identity claim of a speaker based on some given speech. There are two broad categories of SV systems: text-dependent and text-independent SV systems. Text-dependent SV systems require the content of input speech to be fixed, while text-independent SV systems do not.
Over the years, the combination of i-vectors [1] and probabilistic linear discriminant analysis (PLDA) [2] has been the dominant approach for text-independent SV tasks [3, 4, 5]. Also, hybrid approaches that incorporate deep neural networks (DNNs) trained for automatic speech recognition (ASR) into the i-vector system have proved to be beneficial in some conditions [6, 7, 8, 9, 10]. However, the ASR DNN adds significant computational complexity to the i-vector system and also requires transcribed data for training. Moreover, the success of this approach has been primarily isolated to English-language datasets [11]. On the other hand, recent work demonstrates that more powerful SV systems can be built by directly training a speaker-discriminative DNN [12, 13, 14, 15, 16, 17]. Heigold et al. introduced an end-to-end system for a text-dependent SV task that was jointly trained to map frame-level features to speaker embeddings and to learn a similarity metric to compare embedding pairs [13]. The system was then adapted to the more general task of text-independent SV in [15]. The work in [16] divided the end-to-end system into two components: a DNN to produce speaker embeddings and a separately trained PLDA classifier to compare embedding pairs. Compared to the end-to-end approach, this method requires less data to be effective and has the added benefit of facilitating reuse of the methods developed over the years for processing and comparing i-vectors. We continue to use this two-stage approach in this work.
Most DNN-based SV systems use a pooling mechanism to map variable-length utterances to fixed-dimensional embeddings. In a feed-forward architecture, this is usually enabled by a pooling layer that averages some frame-level DNN features over the whole input utterance. In early systems, such as the d-vector in [12], the DNN was trained at the frame level, and pooling is performed by averaging activation vectors of the last hidden layer over all frames of an input utterance. The work in [15, 16, 17] proposed adding a statistics pooling layer that aggregates DNN hidden vectors over the whole utterance of a speaker, and computes their mean and standard deviation. The statistics vectors are then concatenated together to form a fixed-length representation of the input utterance at the segment level. Speaker embeddings are derived from further processing of these segment-level representations. However, in most prior work, this pooling mechanism assigns equal weight to each frame-level feature. Zhang et al. proposed using an attention model to combine the frame-level features for a text-dependent SV application [14]. The attention model takes phonetic posterior features and phonetic bottleneck features as extra sources, and learns the combination weights for the frame-level features.
This paper proposes an extension of the x-vector architecture described in [17]. In order to better utilize the speaker information in the input speech, we propose using frame-level weights that are learned by a structured self-attention mechanism and incorporated into a weighted statistics pooling layer. In contrast to the work in [14], our task is text-independent and there is a language mismatch between the training and testing data, so the phonetic information may not be helpful or even available. The self-attention mechanism was originally proposed for extracting sentence embeddings for natural language processing tasks [18]. We adapt the self-attention mechanism in [18] to text-independent SV based on the system in [17].

2. Speaker verification systems

We compare the proposed methods with two x-vector-based SV baseline systems. All systems are built using the Kaldi speech recognition toolkit [19].

2.1. The x-vector baseline system

The x-vector baselines are based on the systems described in [17]. A speaker discriminative DNN is trained to produce speaker embeddings called x-vectors, and a PLDA backend is used to compare pairs of speaker embeddings.
The input acoustic features are 23-dimensional MFCCs with a frame-length of 25ms that are mean-normalized over a sliding window of up to 3 seconds. An energy-based VAD is employed to filter out non-speech frames from the utterances.
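To make the feature preparation concrete, here is a minimal NumPy sketch of sliding-window mean normalization over roughly 3 seconds of 10 ms frames. The function name, window length, and symmetric window handling are illustrative assumptions; the actual system uses Kaldi's feature pipeline.

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Sliding-window cepstral mean normalization (illustrative sketch).

    feats:  (T, 23) matrix of MFCC frames (one row per 10 ms frame).
    window: window length in frames; 300 frames is roughly 3 seconds.
    """
    T = len(feats)
    half = window // 2
    normalized = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        normalized[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return normalized
```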
The DNN used in the x-vector baseline system is depicted in Figure 1. The first five layers l1 to l5 are constructed with a time-delay architecture and they work at the frame level. Suppose t is the current time step. Frames from (t-2) to (t+2) are spliced together at the input layer. The next two layers splice the output of the previous layer at time steps {t-2, t, t+2} and {t-3, t, t+3}, respectively. No temporal contexts are added to the fourth and fifth layers. Thus, the total temporal context after the third layer is 15 frames.
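The temporal-context arithmetic can be checked with a few lines of Python. The per-layer offsets below follow the splicing description above; the script is only an illustrative sanity check, not part of the actual recipe.

```python
# Sanity check of the total temporal context of the frame-level layers.
layer_offsets = [
    [-2, -1, 0, 1, 2],  # l1: frames (t-2) .. (t+2) spliced at the input
    [-2, 0, 2],         # l2
    [-3, 0, 3],         # l3
    [0],                # l4: no added temporal context
    [0],                # l5
]

left = sum(-min(o) for o in layer_offsets)
right = sum(max(o) for o in layer_offsets)
print(f"total temporal context: {left + right + 1} frames")  # prints 15
```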
The statistics pooling layer aggregates over the frame-level output vectors of the DNN, and computes their mean and standard deviation. This pooling mechanism enables the DNN to produce a fixed-length representation from variable-length speech segments. The mean and standard deviation are concatenated together and forwarded to two additional hidden layers l6 and l7, and finally a softmax output layer. The DNN is trained to classify the speakers in the training set. After training, the softmax output layer and the last hidden layer are discarded, and speaker embeddings are extracted from the affine component of l6. The system uses a PLDA backend for scoring, which is described in Section 2.3. All neural units are rectified linear units (ReLUs).
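A minimal NumPy sketch of this (unweighted) statistics pooling step is shown below, assuming one utterance's frame-level outputs are stacked into a matrix. The function name and the variance flooring constant are illustrative assumptions; the real layer is part of the Kaldi-based DNN.

```python
import numpy as np

def statistics_pooling(h, eps=1e-10):
    """Unweighted statistics pooling over one utterance (illustrative sketch).

    h: (T, d_h) frame-level outputs of the last frame-level layer.
    Returns a 2*d_h segment-level vector [mean, stddev].
    """
    mean = h.mean(axis=0)
    std = np.sqrt(np.maximum(h.var(axis=0), eps))  # floor to avoid sqrt(0)
    return np.concatenate([mean, std])
```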
[Figure 1: The embedding DNN used in the x-vector baseline system]
2.2. Self-attentive speaker embeddings

A self-attention mechanism can be used effectively to encode a variable-length sequence into a fixed-length embedding. Inspired by the structured self-attention mechanism proposed in [18] for sentence embedding, we adapt it to improve the speaker embeddings in the x-vector baseline system shown in Fig. 1.
In the current x-vector system, the statistics pooling layer treats all the frame-level outputs from its previous hidden layer equally. However, not all frames provide 'equal' speaker-discriminative information to the upper layers. For instance, non-speech frames that unfortunately pass the VAD and short pauses are not useful, and some phonetic contents can be more speaker-discriminative. In this paper, the statistics pooling layer is replaced by a self-attention layer as shown in Figure 2 to derive a weighted mean and a standard deviation vector from the outputs of the previous hidden layer over each speech segment. The weights are learned with the self-attention mechanism to maximize speaker classification performance for the whole system.
[Figure 2: The self-attention layer that replaces the statistics pooling layer]
Suppose a speech segment of duration T produces a sequence of T output vectors H = {h_1, h_2, ..., h_T}, where h_t is the hidden representation of input frame x_t captured by the hidden layer below the self-attention layer. Let the dimension of h_t be d_h. Thus, the size of H is d_h × T. The self-attention mechanism takes the whole hidden representation H as input, and outputs an annotation matrix A as follows:
A = softmax(g(H^T W_1) W_2)    (1)
where W_1 is a matrix of size d_h × d_a; W_2 is a matrix of size d_a × d_r, and d_r is a hyperparameter that represents the number of attention heads; g(·) is some activation function, and ReLU is chosen here. The softmax(·) is performed column-wise.
Each column vector of A is an annotation vector that represents the weights for the different h_t. Finally, the weighted means E are obtained by
E = HA. (2)
When the number of attention heads d_r = 1, E is simply a weighted mean vector computed from H, and it is expected to reflect an aspect of discriminative speaker characteristics in the given speech segment. Apparently, speakers can be discriminated along multiple aspects, especially when a speech segment is long. By increasing d_r, we can easily have multiple attention heads to learn different aspects from a speaker's speech. To encourage diversity in the annotation vectors so that each attention head can extract dissimilar information from the same speech segment, a penalty term P is introduced when d_r > 1:
P = ||A^T A − I||_F^2    (3)
where I is the identity matrix and ||·||_F represents the Frobenius norm of a matrix. P is similar to L2 regularization and is minimized together with the original cost of the whole system.
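The following NumPy sketch puts Eqs. (1)-(3) together for a single speech segment. Variable names, the numerical-stability shift inside the softmax, and the flooring in the weighted standard deviation are assumptions made for illustration; the actual layer is trained jointly with the rest of the Kaldi-based DNN.

```python
import numpy as np

def self_attentive_pooling(H, W1, W2, eps=1e-10):
    """Structured self-attentive pooling following Eqs. (1)-(3) (illustrative sketch).

    H:  (d_h, T) hidden vectors of one speech segment, one column per frame.
    W1: (d_h, d_a) and W2: (d_a, d_r), where d_r is the number of heads.
    Returns weighted means E (d_h, d_r), weighted stddevs S (d_h, d_r),
    the annotation matrix A (T, d_r), and the penalty term P.
    """
    scores = np.maximum(H.T @ W1, 0.0) @ W2       # g(H^T W1) W2 with g = ReLU
    scores -= scores.max(axis=0, keepdims=True)   # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)             # column-wise softmax over frames
    E = H @ A                                     # Eq. (2): one weighted mean per head
    S = np.sqrt(np.maximum((H ** 2) @ A - E ** 2, eps))  # weighted stddev per head
    P = np.linalg.norm(A.T @ A - np.eye(A.shape[1]), 'fro') ** 2  # Eq. (3)
    return E, S, A, P
```

With d_r = 1 the annotation matrix collapses to a single weight vector over the frames, which recovers the weighted-mean case described above.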

2.3. PLDA backend

We use the same type of PLDA backend as [16, 17] for comparing pairs of embeddings. The embeddings are centered, and projected using LDA, which reduces the dimension from 512 to 150. After dimensionality reduction, the representations are length-normalized and modeled by PLDA. The scores are normalized using adaptive s-norm [20].
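For illustration, a rough sketch of the backend preprocessing (centering, LDA to 150 dimensions, length normalization) is given below, using scikit-learn's LDA in place of Kaldi's; PLDA training, scoring, and adaptive s-norm are omitted. Function and variable names are assumptions, not the actual recipe.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def backend_preprocess(train_x, train_spk, eval_x, lda_dim=150):
    """Centering, LDA to 150 dims, and length normalization (illustrative sketch).

    train_x: (N, 512) training x-vectors, train_spk: speaker labels,
    eval_x:  x-vectors to be scored. PLDA and adaptive s-norm are not shown;
    the paper uses Kaldi's implementations for those steps.
    """
    mean = train_x.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_x - mean, train_spk)

    def transform(x):
        y = lda.transform(x - mean)
        return y / np.linalg.norm(y, axis=1, keepdims=True)  # length-normalize

    return transform(train_x), transform(eval_x)
```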

3. Experimental setup

3.1. Model configuration
In the x-vector baseline system, the input size is 115 including context, and there are 512 nodes in each of the first four frame-level hidden layers l1 to l4, while the last frame-level layer l5 has d_h = 1500 hidden nodes. Each of the two segment-level layers l6 and l7 has 512 nodes. For the self-attention layer, d_a is set to 500.

3.2. Training data

The training data consists primarily of English telephone speech (with a smaller amount of non-English and microphone speech), taken from Switchboard datasets, past NIST speaker recognition evaluations (SRE) and Mixer 6. The Switchboard portion consists of Switchboard 2 Phases 1, 2, and 3 and Switchboard Cellular, and it contains about 28k recordings from 2.6k speakers. The SRE portion consists of NIST SRE data from 2004 to 2010 along with Mixer 6, for a total of about 63k recordings from 4.4k speakers. The four data augmentation techniques described in [17], namely babble, music, noise, and reverb [21], are applied to increase the amount of training data and to improve the robustness of the system. The clean data, together with the augmented data, are used to train the speaker embedding DNN system, and only the clean and augmented SRE subset is used to train the PLDA classifier.

3.3. Evaluation
System performance is assessed on the NIST 2016 speaker recognition evaluation (SRE16) [22]. SRE16 consists of Cantonese and Tagalog telephone speech. The length of the enrollment segments is about 60 seconds, and the length of the test segments varies from 10 to 60 seconds. Performance is reported in terms of equal error rate (EER) as well as the official evaluation metric for SRE16, DCF16 [22], which is computed from a normalized detection cost function (DCF) averaged over two operating points with Ptarget = 0.01 and Ptarget = 0.005, respectively.
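The sketch below shows one rough way to compute EER and an averaged minimum normalized DCF from a list of trial scores, assuming unit miss and false-alarm costs as in the SRE16 evaluation plan. It is an approximation for illustration only; the official NIST scoring tool defines the actual metric.

```python
import numpy as np

def eer_and_dcf16(scores, labels):
    """Approximate EER and averaged minimum normalized DCF (illustrative sketch).

    scores: array of trial scores; labels: 1 for target, 0 for non-target trials.
    DCF16 averages the normalized minimum DCF at P_target = 0.01 and 0.005
    with unit miss/false-alarm costs.
    """
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    n_tgt = labels.sum()
    n_non = len(labels) - n_tgt
    p_fa = np.cumsum(1 - labels) / n_non          # false-alarm rate per threshold
    p_miss = 1.0 - np.cumsum(labels) / n_tgt      # miss rate per threshold
    eer = p_miss[np.argmin(np.abs(p_miss - p_fa))]

    def min_norm_dcf(p_target):
        dcf = p_target * p_miss + (1 - p_target) * p_fa
        return dcf.min() / min(p_target, 1 - p_target)

    dcf16 = 0.5 * (min_norm_dcf(0.01) + min_norm_dcf(0.005))
    return eer, dcf16
```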

4. Results

In the following results, ‘baseline’ refers to the x-vector baseline described in Section 2.1. The label ‘attn-k’ denotes the self-attentive embedding systems described in Section 2.2 with k attention heads.

4.1. Overall results

The SRE16 results are summarized in Table 1. In producing the mean-only results, the various systems only utilize 1st-order statistics to generate speaker embeddings. That is, ‘baseline’ computes a simple unweighted mean from all the frames of an input utterance, whereas ‘attn-k’ computes a weighted mean using a self-attention layer. On the other hand, both the 1st- and 2nd-order statistics are used in the mean+stddev results.
[Table 1: SRE16 results of the baseline and self-attentive embedding systems]
In general, the self-attention systems outperform the baseline systems that derive speaker embeddings from a simple averaging pooling layer, and more attention heads achieve greater improvement. For example, when only mean vectors are used, the single-head attention system is 16% better in EER and 3% better in DCF16 on Cantonese, and 12% better in EER and 4% better in DCF16 on Tagalog. On the other hand, the 5-head system outperforms the baseline by 21% in EER and 13% in DCF16 on Cantonese, and is 15% better in EER and 3% better in DCF16 on Tagalog. When the performances are pooled across the two languages, the best self-attentive system outperforms the baseline by 16% in EER and 6% in DCF16. Figure 3 shows the detection error tradeoff (DET) curves for mean-only systems when the performance is pooled across Cantonese and Tagalog.
[Figure 3: DET curves of the mean-only systems, pooled across Cantonese and Tagalog]
With the incorporation of standard deviation information, the performance of the self-attentive embedding systems is more stable and they continue to outperform the respective baselines. We see that the single-head attention system achieves 5% improvement in EER on both Cantonese and Tagalog. The best multi-head system with 5 heads is 14% better in EER and 10% better in DCF16 on Cantonese, and 7% better in EER and 5% better in DCF16 on Tagalog. Figure 4 reports the DET curve for systems using both mean and standard deviation.
[Figure 4: DET curves of the systems using both mean and standard deviation]
We also compare the performance of the self-attentive speaker embedding systems with a traditional i-vector system reported in Snyder et al. [16]. Pooled across languages, the best 5-head self-attentive embedding system performs better than the i-vector system by 25% in EER and 22% in DCF16.

4.2. Results on test utterances of different durations

We also investigate the interplay between performance and utterance duration. Test utterances are divided into 3 groups according to their speech durations. Table 2 and Table 3 report the mean+stddev performance of various systems on the three different duration groups. We can see that with few exceptions, (a) self-attentive embeddings bring improvement across all the different duration groups; (b) as expected, the SV performance is better with longer utterances; (c) in general, the self-attentive systems perform better with more heads. For instance, on Cantonese, the single-head system achieves 2% improvement in EER for utterances in the first two groups and 9% in the last group; the improvement in DCF16 is fairly consistent among all duration groups and it is about 10% for the single-head system. Larger gains are made by multi-head systems. The 5-head self-attentive system achieves 13-16% improvement in EER and about 11% in DCF16 for utterances in all the duration groups. On Tagalog, the largest improvement is obtained by the single-head system: it is around 5% better in EER and 2-6% better in DCF16 for all duration groups.
Notice that in our current experiments, in order to provide enough training examples per speaker and to increase diversity in the training examples, we have chunked the training utterances into segments of 200–400 frames. After DNN training, the speaker embeddings are extracted from the entire recording. Therefore, there may be a mismatch between training and testing duration. If more data were available so that we did not need to chunk the data, even better performance might be achieved.
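A tiny sketch of how such fixed-length training chunks might be sampled is shown below; the function name and the use of NumPy's random generator are assumptions for illustration, not the actual Kaldi example-generation code.

```python
import numpy as np

def sample_chunk(feats, min_len=200, max_len=400, rng=np.random):
    """Randomly sample one training chunk of 200-400 frames (illustrative sketch)."""
    length = rng.randint(min_len, max_len + 1)
    if len(feats) <= length:
        return feats
    start = rng.randint(0, len(feats) - length + 1)
    return feats[start:start + length]
```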

[Tables 2 and 3: mean+stddev performance on test utterances of different durations]
5. Conclusion

We propose a new method to extract speaker embeddings for text-independent speaker verification by introducing a self-attention mechanism into DNN embeddings. The new speaker embeddings are evaluated on SRE16, which is a challenging task since there is a language mismatch between the predominantly English training data and the Cantonese or Tagalog evaluation data. We find that the proposed self-attentive speaker embedding outperforms a traditional i-vector system and a strong DNN embedding baseline when tested on utterances of different lengths. By increasing the number of attention heads, consistent improvement is further obtained. We believe the training strategy with chunks of speech segments may not be optimal for the self-attention mechanism. In future work, we will modify the training strategy and try a larger training corpus. We will also investigate different penalty terms for multi-head attention.

6. Acknowledgements

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKUST16215816).

7. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[2] S. Ioffe, "Probabilistic linear discriminant analysis," in European Conference on Computer Vision. Springer, 2006, pp. 531–542.
[3] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proceedings of the Odyssey, 2010, p. 14.
[4] N. Brümmer and E. De Villiers, "The speaker partitioning problem," in Proceedings of the Odyssey, 2010, p. 34.
[5] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proceedings of Interspeech, 2011.
[6] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Proceedings of the Odyssey, 2014, pp. 293–298.
[7] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2014, pp. 1695–1699.
[8] D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, "Improving speaker recognition performance in the domain adaptation challenge using deep neural networks," in Proceedings of the IEEE Workshop on Spoken Language Technology, 2014, pp. 378–383.
[9] D. Snyder, D. Garcia-Romero, and D. Povey, "Time delay deep neural network-based universal background models for speaker recognition," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2015, pp. 92–97.
[10] F. Richardson, D. Reynolds, and N. Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
[11] O. Novotný, P. Matějka, O. Glembek, O. Plchot, F. Grézl, L. Burget, and J. Černocký, "Analysis of the DNN-based SRE systems in multi-language conditions," in Spoken Language Technology Workshop (SLT). IEEE, 2016.
[12] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
[13] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016, pp. 5115–5119.
[14] S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 171–178.
[15] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Proceedings of the IEEE Workshop on Spoken Language Technology, 2016, pp. 165–170.
[16] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proceedings of Interspeech, pp. 999–1003, 2017.
[17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2018.
[18] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," Proceedings of the International Conference on Learning Representations, 2017.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2011.
[20] D. E. Sturim and D. A. Reynolds, "Speaker adaptive cohort selection for Tnorm in text-independent speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2005, pp. I–741.
[21] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2017, pp. 5220–5224.
[22] "NIST speaker recognition evaluation 2016," https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016, 2016.