ICASSP 2019: Analysis and Mitigation of Vocal Effort Variations in Speaker Recognition

Mahesh Kumar Nandwana (1), Mitchell McLaren (1), Luciana Ferrer (2), Diego Castan (1), Aaron Lawson (1)

(1) Speech Technology and Research Laboratory, SRI International, Menlo Park, California, USA
(2) Instituto de Investigación en Ciencias de la Computación, UBA-CONICET, Argentina

http://150.162.46.34:8080/icassp2019/ICASSP2019/pdfs/0006001.pdf

Abstract:
In this work, we assess the impact of vocal effort on the discrimination and calibration performance of a state-of-the-art speaker recognition system.
We analyze three levels of vocal effort (low, normal, and high) from the SRI-FRTIV corpus.
We use a deep neural network (DNN) speaker embeddings system with probabilistic linear discriminant analysis (PLDA) and find that vocal effort variation significantly degrades system performance.
We apply both mixture PLDA (mix-PLDA) and trial-based calibration with condition PLDA similarity (TBC-CPLDA) to improve system robustness.
Our proposed approaches resulted in 18% and 33% relative improvement in discrimination and calibration performance, respectively, on the SRI-FRTIV corpus.
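A note on reading the 18% and 33% figures: a relative improvement is the reduction in an error-type metric divided by its baseline value. The sketch below shows only that arithmetic. The function name and the metric values are hypothetical, chosen so that the numbers come out to 18% and 33%; they are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of an error-type metric (lower is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical metric values, NOT taken from the paper.
print(f"{relative_improvement(10.0, 8.2):.0%}")    # e.g. a discrimination error rate in percent
print(f"{relative_improvement(0.30, 0.201):.0%}")  # e.g. a calibration cost
```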

From Wikipedia:
Vocal effort is a quantity varied by speakers when adjusting to an increase or decrease in the communication distance.
The communication distance is the distance between the speaker and the listener.
Vocal effort is a subjective physiological quantity, and is mainly dependent on subglottal pressure, vocal fold tension and jaw opening.
Vocal effort is different from sound pressure.
To measure vocal effort, listeners are asked to rate the distance between speaker and addressee.

SECTION 1. INTRODUCTION
Variability in the acoustic signal is a persistent challenge for speaker recognition systems operating under real-world conditions.
Such variability is caused by either intrinsic or extrinsic factors.
Intrinsic factors are associated with the speaker rather than the recording environment.
These factors include changes in vocal effort, speaking style [1], non-speech sounds [2], [3], [4], emotion, language [5], and aging across recordings of the same speaker.
Extrinsic factors are associated with differences in the recording environment between recordings.
These factors include changes in background noise, microphone, room acoustics, distance from the microphone [6], transmission channel, and codec [7].
Intrinsic factors are also known as speaker-dependent factors, whereas extrinsic factors are called speaker-independent factors [8].
During recent decades, US government evaluations and programs (such as the NIST Speaker Recognition Evaluations (SRE), the IARPA BEST program, and the DARPA RATS program) have motivated particular research directions in the speaker recognition community.
Those research programs have primarily focused on the problem of extrinsic variability, including channel effects, transmission noise, and environmental noise.
Intrinsic variability, in contrast, has received comparatively little research attention.
Yet intrinsic variability is a key factor for unconstrained applications, such as forensic speaker recognition.
This work focuses specifically on vocal effort variation, which is one class of intrinsic variability.
Vocal effort has been shown to impact the performance of speaker recognition systems [9].
In the past, a number of studies focused on different levels of vocal effort, such as whisper [10], shouts [11], and screams [4].
The impact of Lombard speech on the performance of speaker verification systems was considered in [12], [13].
The main contributions of this work are as follows.
First, we use a state-of-the-art speaker recognition system based on DNN speaker embeddings rather than a classical GMM-UBM or i-vector system.
Second, rather than focusing on just one type of vocal effort level such as whisper or shouts, we develop our mitigation approaches for a range of vocal efforts from low to high.
Third, we use a relatively large number of speakers with sufficient audio data per speaker to get significant results.
Also, to the best of our knowledge, this study is the first to consider calibration of a speaker recognition system across a range of vocal efforts.
In this study, we first assess the impact of vocal effort on the discrimination and calibration performance of a speaker recognition system based on DNN speaker embeddings.
We then apply mixture PLDA (mix-PLDA) using meta information and the recently proposed trial-based calibration with condition PLDA similarity (TBC-CPLDA) to mitigate the impact of vocal effort.
We used the SRI-FRTIV corpus for all experiments.
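
The introduction separates discrimination performance from calibration performance. A small sketch may help make that distinction concrete. This excerpt does not say which metrics the paper uses, so the code assumes two that are common in the speaker recognition literature: equal error rate (EER) for discrimination and the log-likelihood-ratio cost (Cllr) for calibration. The calibration step is a plain global affine (logistic regression) mapping, shown only as a generic baseline. It is not mix-PLDA or TBC-CPLDA, and all function names and the synthetic scores are hypothetical.

```python
import numpy as np


def eer(target_scores, nontarget_scores):
    """Approximate equal error rate by sweeping a threshold over all observed scores."""
    tar = np.sort(np.asarray(target_scores, dtype=float))
    non = np.sort(np.asarray(nontarget_scores, dtype=float))
    thresholds = np.concatenate([tar, non])
    # Miss: target scoring below the threshold. False alarm: non-target at or above it.
    p_miss = np.searchsorted(tar, thresholds, side="left") / tar.size
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / non.size
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits; scores are interpreted as natural-log LLRs."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    # log2(1 + exp(x)) computed stably as logaddexp(0, x) / ln(2)
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2.0)
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2.0)
    return 0.5 * (c_tar + c_non)


def fit_affine_calibration(scores, labels, lr=0.5, n_iter=3000):
    """Fit llr = a * score + b by class-balanced logistic regression (gradient descent)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mean, std = scores.mean(), scores.std()
    z = (scores - mean) / std  # standardize so a fixed step size is well behaved
    # Each class gets total weight 0.5, so the fitted log-odds can be read as an LLR.
    w = np.where(labels == 1, 0.5 / labels.sum(), 0.5 / (1 - labels).sum())
    a_z, b_z = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a_z * z + b_z)))
        err = w * (p - labels)
        a_z -= lr * np.sum(err * z)
        b_z -= lr * np.sum(err)
    # Map the parameters back to the original score scale.
    return a_z / std, b_z - a_z * mean / std


# Synthetic raw scores (hypothetical data): they separate the two classes,
# but they are not valid log-likelihood ratios.
rng = np.random.default_rng(0)
tar_raw = rng.normal(3.0, 1.0, 2000)
non_raw = rng.normal(1.0, 1.0, 2000)
scores = np.concatenate([tar_raw, non_raw])
labels = np.concatenate([np.ones(2000), np.zeros(2000)])

a, b = fit_affine_calibration(scores, labels)
tar_cal, non_cal = a * tar_raw + b, a * non_raw + b

print(f"EER  raw={eer(tar_raw, non_raw):.3f}  calibrated={eer(tar_cal, non_cal):.3f}")
print(f"Cllr raw={cllr(tar_raw, non_raw):.3f}  calibrated={cllr(tar_cal, non_cal):.3f}")
```

Because the mapping is monotonic, the EER stays essentially the same while Cllr drops: calibration changes how trustworthy the scores are as likelihood ratios, not how well they rank trials. For simplicity the sketch fits and evaluates on the same synthetic scores; a real setup would fit the mapping on held-out data.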
