Far-Field End-to-End Text-Dependent Speaker Verification based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation
Xiaoyi Qin¹,², Danwei Cai¹, Ming Li¹
¹Data Science Research Center, Duke Kunshan University, Kunshan, China
²School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1542.pdf

Abstract

In this paper, we focus on the far-field end-to-end text-dependent speaker verification task, using a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database for training. First, we show that simulating far-field text-independent data from the existing large-scale clean database for data augmentation can reduce the mismatch. Second, using a small far-field text-dependent dataset to fine-tune the deep speaker embedding model pre-trained on the simulated far-field as well as the original clean text-independent data can significantly improve system performance. Third, in special applications where close-talking clean utterances are used for enrollment and real far-field noisy utterances are employed for testing, adding reverberation and noise to the clean enrollment data can further enhance system performance. We evaluate our methods on the AISHELL ASR0009 and AISHELL 2019B-eval databases and achieve an equal error rate (EER) of 5.75% for far-field text-dependent speaker verification under noisy environments.
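To make the first step concrete, the Python sketch below illustrates one common way to simulate far-field speech from clean close-talking recordings: convolve the waveform with a room impulse response (RIR) and then add noise at a target signal-to-noise ratio. The random RIR and noise here are placeholders for real measured or simulated ones; this is a minimal sketch of the general technique, not the paper's exact implementation.

```python
# Minimal far-field simulation sketch: reverberate clean speech with an
# RIR, then mix in noise at a requested SNR. The RIR and noise below are
# random placeholders (assumptions); real RIRs and noise recordings would
# be used in practice.
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean, rir, noise, snr_db):
    """Reverberate `clean` with `rir`, then add `noise` at `snr_db` dB SNR."""
    # Reverberation: linear convolution with the room impulse response.
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    noise = noise[: len(reverberant)]
    # Scale the noise so the reverberant-speech-to-noise ratio matches snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverberant + scale * noise

# Toy usage with random signals standing in for real audio.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                         # 1 s at 16 kHz
rir = np.exp(-np.arange(4000) / 800.0) * rng.standard_normal(4000)
noise = rng.standard_normal(16000)
noisy_far = simulate_far_field(clean, rir, noise, snr_db=10.0)
```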

  1. Introduction

In the past decade, the performance of automatic speaker verification (ASV) has improved dramatically. The i-vector based method [1] and deep neural network (DNN) based methods [2, 3] have been widely used in telephone-channel and close-talking scenarios. Recently, smartphones and virtual assistants have become very popular, and people use pre-defined words to wake up these systems. To enhance the security level and provide personalized service, wake-up-word based text-dependent speaker verification is adopted to determine whether the wake-up speech is indeed uttered by the claimed speaker [4, 5, 6]. However, in many Internet of Things (IoT) applications, e.g., smart speakers and smart home devices, text-dependent speaker verification under far-field and complex environmental settings is still challenging due to the effects of room reverberation and various kinds of noises and distortions.

To reduce the effects of room reverberation and environmental noise, various approaches with a single-channel microphone or a multi-channel microphone array have been proposed at different levels of the text-independent ASV system. At the signal level, linear prediction inverse modulation transfer function [7] and weighted prediction error (WPE) [8, 9] methods have been used for dereverberation. DNN-based denoising methods for single-channel speech enhancement [10, 11, 12, 13] and beamforming for multi-channel speech enhancement [8, 14, 15] have also been explored for ASV systems under complex environments. At the feature level, sub-band Hilbert envelope based features [16, 17, 18], warped minimum variance distortionless response (MVDR) cepstral coefficients [19], blind spectral weighting (BSW) based features [17], power-normalized cepstral coefficients (PNCC) [20] and DNN bottleneck features [21] have been applied to ASV systems to suppress the adverse impacts of reverberation and noise. At the model level, reverberation matching with multi-condition training models has been successfully employed within the universal background model (UBM) or i-vector based front-end systems [22, 23]. Multi-channel i-vector combination for far-field speaker recognition is also explored in [24]. In backend modeling, multi-condition training of probabilistic linear discriminant analysis (PLDA) models was employed in the i-vector system [25]. The robustness of deep speaker embeddings for far-field text-independent speech has also been investigated in [26, 27]. Finally, at the score level, score normalization [22] and multi-channel score fusion [28] have been applied in far-field ASV systems to improve robustness.

In this work, we focus on the far-field end-to-end text-dependent speaker verification task at the model level. Previous studies [4, 5, 6] on end-to-end deep neural network based text-dependent speaker verification directly use large-scale text-dependent databases to train the systems. However, in real-world applications, people may want to use customized wake-up words for speaker verification, and different smart home devices may have different wake-up words, even for products from the same company. Hence, collecting a large-scale far-field text-dependent speech database for each new or customized wake-up word may not be possible. This motivates us to explore the transfer learning concept and use a small far-field text-dependent speech dataset to fine-tune an existing deep speaker embedding network trained on large-scale text-independent speech databases, such as the NIST SRE databases or VoxCeleb [29, 30].
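As a rough illustration of this transfer-learning step, the PyTorch sketch below fine-tunes a pre-trained speaker embedding network on a small text-dependent speaker set with a small learning rate. The architecture, checkpoint path, and speaker counts are illustrative assumptions, not the paper's actual configuration.

```python
# Transfer-learning sketch: load an embedding network pre-trained on large
# text-independent data, swap in a classifier for the small far-field
# text-dependent speaker set, and fine-tune with a small learning rate.
import torch
import torch.nn as nn

class SpeakerEmbeddingNet(nn.Module):
    """Toy stand-in for a deep speaker embedding network (assumption)."""
    def __init__(self, n_mels=64, emb_dim=128, n_speakers=1000):
        super().__init__()
        self.frontend = nn.Sequential(                 # frame-level layers
            nn.Conv1d(n_mels, 256, 5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, 3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)       # utterance embedding
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):                          # (batch, n_mels, frames)
        x = self.frontend(feats).mean(dim=2)           # temporal average pooling
        emb = self.embedding(x)
        return self.classifier(emb), emb

model = SpeakerEmbeddingNet()
# Load weights pre-trained on clean + simulated far-field text-independent
# data; the checkpoint path is a placeholder.
model.load_state_dict(torch.load("pretrained_ti.pt"), strict=False)
# Re-initialize the classifier for the small text-dependent speaker set
# (200 speakers here is illustrative).
model.classifier = nn.Linear(model.embedding.out_features, 200)
# Fine-tune all layers with a small learning rate so the pre-trained
# representation is adapted rather than overwritten.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```

A common design choice here is to keep the learning rate one or two orders of magnitude smaller than in pre-training, so the frame-level layers retain what they learned from the large text-independent corpus.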
Furthermore, we propose a new research topic for far-field text-dependent speaker verification: using close-talking clean data for enrollment while employing real far-field noisy utterances for testing. This scenario corresponds to the case where only one clean utterance recorded by a cell phone is used to enroll the speaker on the smart home device. In this work, we investigate an enrollment data augmentation scheme to reduce this mismatch and improve ASV performance.
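One plausible realization of this enrollment augmentation, sketched below under our own assumptions (the paper does not prescribe this exact recipe), is to corrupt the single clean enrollment utterance with several reverberation/noise conditions, extract a deep speaker embedding from each copy, and average the embeddings to form the enrollment model. It reuses `simulate_far_field` from the earlier sketch; `extract_embedding` stands in for the embedding extractor.

```python
# Enrollment augmentation sketch: average embeddings over augmented copies
# of one clean enrollment utterance. Whether the clean copy is kept and
# whether averaging happens at the embedding (vs. score) level are
# assumptions of this sketch.
import numpy as np

def enroll_with_augmentation(clean, rirs, noises, snrs_db, extract_embedding):
    """Build an enrollment embedding from one clean utterance."""
    embeddings = [extract_embedding(clean)]            # keep the clean copy too
    for rir, noise, snr in zip(rirs, noises, snrs_db):
        augmented = simulate_far_field(clean, rir, noise, snr)
        embeddings.append(extract_embedding(augmented))
    emb = np.mean(embeddings, axis=0)
    return emb / (np.linalg.norm(emb) + 1e-12)         # length-normalize
```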
