ICASSP 2019
Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

https://ieeexplore.ieee.org/document/8682712
https://arxiv.org/pdf/1902.07821.pdf

Yun Tang; JD AI Research
Guohong Ding; JD AI Research
Jing Huang; JD AI Research
Xiaodong He; JD AI Research
Bowen Zhou; JD AI Research

Abstract:
This paper aims to improve the widely used deep speaker embedding x-vector model.
We propose the following improvements: (1) a hybrid neural network structure using both a time delay neural network (TDNN) and a long short-term memory network (LSTM) to generate complementary speaker information at different levels;
(2) a multi-level pooling strategy to collect speaker information from both the TDNN and LSTM layers;
(3) a regularization scheme on the speaker embedding extraction layer to make the extracted embeddings suitable for the following fusion step.
The synergy of these improvements is shown on the NIST SRE 2016 eval test (a 19% EER reduction) and the SRE 2018 dev test (a 9% EER reduction), as well as a more than 10% DCF reduction on these two test sets over the x-vector baseline.

SECTION 1. INTRODUCTION
Speaker verification (SV) [1] is one of the key components of human-machine interfaces and was initially used widely in person identification for security purposes.
Nowadays, with intelligent speech assistants such as Alexa, Google Home, Siri and Cortana being used in home environments as well as on smartphones, the demand for SV technology is rising.
This is especially true for robust speaker verification under challenging acoustic conditions and across different populations of speakers.
The speaker verification problem usually falls into two categories: text-dependent (TD) SV and text-independent (TI) SV.
In a TD SV system, the transcriptions of the test utterances and enrollment utterances are the same and are usually limited to a small set. In a TI SV system, there is no constraint on the transcriptions of either the test or the enrollment utterances.
Hence the TI case is more difficult than the TD case due to the larger variations introduced by different utterance transcriptions and durations.
In this study, we focus on the more challenging TI speaker verification system.
Recently, more attention has been drawn to the use of deep neural networks (DNN) to generate speaker embedding representations [2] [3] [4].
These deep speaker embedding systems have shown large improvements over i-vector [5] based methods, especially the recently proposed x-vector system with data augmentation [4].
A DNN based speaker embedding extraction system usually consists of three components: frame-level feature processing, utterance (speaker) level feature processing, and the training loss.
Frame-level processing deals with local short spans of acoustic features.
It can be done via recurrent neural networks [2] or convolutional neural networks [3] [6].
Utterance-level processing forms the speaker representation based on the frame-level output.
A pooling layer is used to gather frame-level information to form the utterance-level representation.
Methods such as statistical pooling [3], max pooling [7], attentive statistical pooling [8] and multi-headed attentive statistical pooling [9] are popular choices.
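To make the pooling step concrete, here is a minimal sketch of statistical pooling (the mean and standard deviation of frame-level features concatenated into one utterance-level vector); the PyTorch tensor layout is an illustrative assumption, not taken from any of the cited papers.

```python
import torch

def statistical_pooling(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Collapse frame-level features into one utterance-level vector.

    frames: (batch, num_frames, feat_dim) frame-level network outputs.
    Returns a (batch, 2 * feat_dim) concatenation of mean and std vectors.
    """
    mean = frames.mean(dim=1)
    std = frames.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)
```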
Cross entropy and triplet loss are two widely used training losses.
Cross entropy based methods focus on reducing the confusion among all speakers in the training data [3], while triplet loss based methods [10] [11] [12] [13] focus on increasing the margin between similar speakers.
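As a point of contrast with cross entropy, the following is a generic sketch of the triplet loss idea (enforcing a margin between same-speaker and different-speaker distances); the cosine-distance choice and margin value are illustrative assumptions rather than the exact formulations of [10]-[13].

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Push same-speaker pairs closer than different-speaker pairs by a margin.

    anchor and positive are embeddings of the same speaker; negative comes
    from a different speaker. All are (batch, emb_dim) tensors.
    """
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)  # same-speaker distance
    d_an = 1.0 - F.cosine_similarity(anchor, negative)  # cross-speaker distance
    return F.relu(d_ap - d_an + margin).mean()
```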
In this paper, we propose a novel deep speaker embedding learning framework that improves upon the x-vector system.
We first add LSTM layers to the x-vector's TDNN structure, because sequential modeling of an utterance generates speaker information different from that of the TDNN.
In order to aggregate this complementary information, we propose a multi-level pooling strategy that fuses information from the different frame-level models, one from the TDNN and one from the LSTM.
We also add a regularization term on the speaker embedding extraction layer in order to make the system output more suitable for the back-end processing.
Overall, the proposed new speaker embedding system reduces the equal error rate (EER) by 19% and the detection cost function (DCF) [1] score by 12%, compared to the previous x-vector baseline on the NIST SRE 2016 eval test.
To our knowledge, these are the lowest EERs (9.2% on Tagalog and 3.1% on Cantonese) published on the SRE 2016 eval test so far.
The rest of this paper is organized as follows.
Section 2 briefly describes prior work, including the baseline x-vector system and related work.
Section 3 introduces the proposed new system, which gathers speaker information from different modeling levels.
The experimental setup, results and analysis are described in Section 4.
Finally, the conclusion is given in Section 5.

SECTION 2. PRIOR WORK

2.1. Baseline x-vector System
Figure 1 depicts the deep neural network configuration of the x-vector system [3] [4].
The blue, yellow and green blocks represent the frame-level processing, the utterance-level processing and the training loss used in system training.
Black arrows indicate a ReLU activation function.
Batch normalization layers are added between adjacent layers, while red arrows indicate that no nonlinear mapping is added between layers.
[Figure 1: DNN configuration of the x-vector system. Image: https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8671773/8682151/8682712/tang1-p5-tang-large.gif]
The frame-level model consists of three TDNN layers that extract speaker information at the frame level.
X-Y-Z in each block represents the number of filters, the filter window size and the dilation number of that layer.
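A TDNN layer of this kind can be read as a dilated 1-D convolution over time. As a minimal illustration, a hypothetical 512-5-1 block over 30-dimensional input features could be written as follows (all sizes are illustrative assumptions, not the paper's exact configuration):

```python
import torch.nn as nn

# X-Y-Z = (filters, window size, dilation) maps onto Conv1d arguments:
# out_channels=X, kernel_size=Y, dilation=Z. Input is (batch, feat_dim, time).
tdnn_512_5_1 = nn.Sequential(
    nn.Conv1d(in_channels=30, out_channels=512, kernel_size=5, dilation=1),
    nn.ReLU(),
    nn.BatchNorm1d(512),
)
```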
At the utterance level, higher-order statistics pooling methods such as statistical pooling or attentive statistical pooling can be used.
The pooling output (the concatenation of the mean and standard deviation vectors) is fed into another three fully connected layers.
Speaker embeddings are then extracted from the output of the first linear projection layer.
Cross entropy loss is used to train the system and reduce the confusion among all speakers in the training set. Once speaker embeddings are extracted, LDA and PLDA [14] are used for back-end scoring, as in the standard Kaldi x-vector recipe.
2.2. Related Work
Combining CNN or TDNN layers with LSTM layers has proved effective in automatic speech recognition (ASR) tasks.
Sainath et al. [15] proposed stacking CNN, LSTM and DNN layers sequentially for speech recognition and showed superior results to using a CNN, LSTM or DNN alone.
Peddinti et al. [16] conducted experiments comparing the stacking order of TDNN and LSTM layers and found that interleaving TDNN layers with LSTM layers could be more effective than a simple stacking strategy.
The pooling module is the key component bridging frame-level features to the speaker representation.
High order statistical pooling [3] shows better performance than simple pooling methods such as average or maximum pooling.
Okabe et al. [8] proposed parametric attentive pooling to give a different importance to each frame feature.
Zhu et al. [9] further extended it to multiple heads to generate utterance representations from different views.
The multi-level pooling proposed in this work is focused on the way pooling features are gathered, rather than on the pooling module itself.
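For reference, a minimal single-head sketch in the spirit of the attentive statistical pooling of [8] is shown below; the scoring network and hidden size are illustrative assumptions, not the exact setup of [8] or [9].

```python
import torch
import torch.nn as nn

class AttentiveStatPooling(nn.Module):
    """Weight each frame before computing mean/std statistics."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small scoring network producing one scalar score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        w = torch.softmax(self.score(frames), dim=1)   # per-frame weights
        mean = (w * frames).sum(dim=1)
        var = (w * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = var.clamp(min=1e-6).sqrt()
        return torch.cat([mean, std], dim=1)           # (batch, 2 * feat_dim)
```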
SECTION 3. DEEP SPEAKER EMBEDDING WITH MULTI-LEVEL POOLING
Figure 2 shows the proposed new deep speaker embedding training framework used in this study.
[Figure 2: proposed deep speaker embedding framework with multi-level pooling. Image: https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8671773/8682151/8682712/tang2-p5-tang-large.gif]
In order to generate complementary information, a hybrid structure with both TDNN and LSTM layers is proposed in this framework.
The TDNN layers focus on local feature representations, while the LSTM layer considers global and sequential information from the whole utterance.
The multi-level pooling over the TDNN and LSTM layers generates different representations at the utterance level.
This helps gather speaker information from different spaces and form a comprehensive speaker representation.
It also helps pass error signals back to earlier layers, which in turn alleviates the vanishing gradient problem.
The outputs of the different pooling layers are concatenated and fused by the following fully connected layers.
Even though residual networks with very deep structures [17] [6] are popular for computer vision tasks, we have not seen good improvements from very deep structures, as shown in [7].
Therefore we use the same number of TDNN layers as in the x-vector model. A minimal sketch of the hybrid structure is given below.
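The following PyTorch sketch illustrates the overall idea of Figure 2: statistics pooled from both the TDNN output and the LSTM output are concatenated before the embedding layer. All layer counts, dimensions and the use of plain statistical pooling are illustrative assumptions; the paper's exact topology may differ.

```python
import torch
import torch.nn as nn

def stat_pool(x: torch.Tensor) -> torch.Tensor:
    """Mean + std over the time axis; x is (batch, time, dim)."""
    mean = x.mean(dim=1)
    std = x.var(dim=1, unbiased=False).clamp(min=1e-6).sqrt()
    return torch.cat([mean, std], dim=1)

class MultiLevelPoolingNet(nn.Module):
    def __init__(self, feat_dim=30, hidden=512, emb_dim=512, num_speakers=5000):
        super().__init__()
        # TDNN branch: dilated 1-D convolutions over time (local patterns).
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 5, dilation=1), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, dilation=2), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, dilation=3), nn.ReLU(), nn.BatchNorm1d(hidden),
        )
        # LSTM on top of the TDNN output (global, sequential information).
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Fusion: two pooled branches of size 2*hidden each -> 4*hidden.
        self.embedding = nn.Linear(4 * hidden, emb_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.BatchNorm1d(emb_dim), nn.Linear(emb_dim, num_speakers),
        )

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, feat_dim) acoustic features.
        t = self.tdnn(feats.transpose(1, 2)).transpose(1, 2)  # (B, T', hidden)
        l, _ = self.lstm(t)                                   # (B, T', hidden)
        pooled = torch.cat([stat_pool(t), stat_pool(l)], dim=1)
        emb = self.embedding(pooled)       # speaker embedding extraction layer
        return emb, self.classifier(emb)   # embedding + speaker logits
```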
The x-vector system is not an end-to-end SV system.
The extracted speaker embeddings are passed to the LDA and PLDA back-end classifiers, which are both linear transforms trained with different objective functions.
Thus there is a mismatch between the training loss and the LDA/PLDA training objectives, as noticed in [18].
In order to make the embedding output suitable for the back end, a regularization term is introduced by applying a constraint on the norm of the embedding layer output.
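One plausible realization of such a norm constraint is an L2 penalty on the embedding-layer output added to the cross-entropy loss, as in the sketch below; the penalty form and weight are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch.nn.functional as F

def regularized_loss(logits, labels, emb, reg_weight: float = 1e-4):
    """Cross entropy plus a norm penalty on the embedding-layer output.

    emb: (batch, emb_dim) output of the speaker embedding extraction layer.
    reg_weight is a hypothetical tuning constant.
    """
    ce = F.cross_entropy(logits, labels)
    norm_penalty = emb.pow(2).sum(dim=1).mean()  # mean squared L2 norm
    return ce + reg_weight * norm_penalty
```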
