
Introduction to the special issue “Speaker and language characterization and recognition: Voice modeling, conversion, synthesis and ethical aspects”

Welcome to this special issue on Speaker and Language Characterization, which features, among other contributions, some of the most remarkable ideas presented and discussed at Odyssey 2018: The Speaker and Language Recognition Workshop, held in Les Sables d’Olonne, France, in June 2018. This issue continues the series proposed by the ISCA Speaker and Language Characterization Special Interest Group (SIG) in coordination with the ISCA Speaker Odyssey workshops (Campbell, 2000; Berkling et al., 2007; Lleida and Rodriguez-Fuentes, 2018).
Voice is one of the most natural and intuitive modalities for interaction between humans, as well as between humans and machines. Voice is also a central part of our identity. Voice-based solutions are currently deployed in a growing variety of applications, including person authentication through automatic speaker verification (ASV).
A related technology concerns the digital cloning of personal voice characteristics for text-to-speech (TTS) and voice conversion (VC). In recent years, impressive advances in the VC/TTS field have opened the way for numerous new consumer applications. In particular, VC offers new solutions for privacy protection. However, VC/TTS also brings the possibility of misusing the technology to spoof ASV systems (for example, presentation attacks implemented using voice conversion). As a direct consequence, spoofing countermeasures have attracted growing interest in recent years.
Moreover, voice is a central part of our identity and also conveys personal characteristics beyond identity, which could be extracted with or without the consent of the speaker. This raises the need to address, in ASV and VC/TTS, not only the technical challenges but also specific ethical considerations, as illustrated, for example, by the recent EU General Data Protection Regulation (GDPR).
Time has passed since the previous Computer Speech and Language (CSL) special issue that focused on speaker and language recognition and summarized contributions originating from the 2016 edition of the Odyssey workshop (Lleida and Rodriguez-Fuentes, 2018). This special issue presents the latest progress in speaker and language characterization, but it also extends the topic to voice modeling, conversion, synthesis and ethical aspects, in order to reflect the relations of these themes with speaker and language characterization. As dedicated Editors of this special issue, we wished to provide information that is both high in quality and timely. To achieve this objective, we accepted a loss in terms of coverage and selected only eight high-quality and (quite) press-ready articles, presented below.
The article entitled Vocoder-Free Text-to-Speech Synthesis Incorporating Generative Adversarial Networks Using Low-/Multi-Frequency STFT Amplitude Spectra by Saito et al. addresses quality degradation in text-to-speech (TTS) synthesis. The authors devise a vocoder-free approach that predicts high-dimensional amplitude spectra from linguistic features. Because the high-dimensional amplitude spectra have a complicated distribution, the resulting speech has degraded quality. This led the authors to propose a novel training loss that combines a prediction error (the squared error between the target and predicted amplitude spectra) with an adversarial loss term. The latter enforces the distribution of broad amplitude spectra of generated speech to closely match the distribution of natural speech. The study represents an interesting example of knowledge transfer from ASV spoofing research to TTS: besides the conventional mel-frequency scale (with a low-frequency focus), the authors incorporated the inverse mel scale (with a high-frequency focus), originally used in detecting TTS and voice conversion attacks. Further, as part of their objective evaluation, the authors used a specific metric, the spoofing rate, defined as the percentage of generated spectra mistakenly classified as human speech by the discriminator network.
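To make the combined objective concrete, the following is a minimal sketch in PyTorch, assuming a simple feed-forward generator and discriminator rather than the authors' actual architecture; the module names, layer sizes and the weight lambda_adv are illustrative. The generator loss adds an adversarial term, pushing generated amplitude spectra toward the distribution of natural speech, to the squared prediction error.

```python
# Illustrative sketch only: not the authors' implementation.
import torch
import torch.nn as nn

class SpectrumGenerator(nn.Module):
    """Maps linguistic features to high-dimensional amplitude spectra."""
    def __init__(self, in_dim=300, spec_dim=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, spec_dim), nn.Softplus())  # non-negative amplitudes

    def forward(self, x):
        return self.net(x)

class SpectrumDiscriminator(nn.Module):
    """Scores spectra as natural (1) or generated (0)."""
    def __init__(self, spec_dim=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))  # logits

    def forward(self, s):
        return self.net(s)

def generator_loss(gen, disc, ling_feats, target_spec, lambda_adv=1.0):
    pred_spec = gen(ling_feats)
    # Prediction error: squared error between target and predicted spectra.
    mse = nn.functional.mse_loss(pred_spec, target_spec)
    # Adversarial term: the generator tries to make the discriminator
    # label its spectra as natural speech.
    logits = disc(pred_spec)
    adv = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))
    return mse + lambda_adv * adv
```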
The article entitled Voice Mimicry Attacks Assisted by Automatic Speaker Verification by Vestman et al. addresses a potential security threat caused by the malicious use of automatic speaker verification (ASV) technology. The authors used one ASV system (i-vector) to assist mimicry attacks by naive speakers against another ASV system (x-vector). For each of the mimickers, they selected close target speakers from a large, publicly available database. The authors additionally included perceptual experiments and studies of changes in prosody. Although the results from this simulated attack scenario reveal that the malicious use of the technology improves the chances of breaking a speaker verification system, it was generally not enough to break the attacked system.
The article entitled Deep Domain Adaptation for Anti-Spoofing in Speaker Verification Systems by Himawan et al. describes domain adaptation techniques for anti-spoofing countermeasure models for ASV. Ideally, countermeasure models trained on one database should remain usable on a different database; in practice, this is not the case, due to mismatched acoustic conditions between the two databases. To address this issue, the authors propose several interesting networks for supervised and unsupervised settings. For the supervised setting, they use a Siamese-like network with two outputs, spoofing/genuine classification and adversarial domain classification, which transforms the input spectrum into a new domain-invariant feature that can still be used for spoofing/genuine classification. For the unsupervised setting, they further constrain the network so that the weights handling the two databases are linearly correlated, and they additionally add a CORAL loss. They present very detailed analyses of the proposed domain adaptation techniques using the ASVspoof 2015 and AVspoof databases.
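The CORAL term mentioned above penalizes the distance between the second-order statistics (feature covariances) of the two acoustic domains. Below is a minimal sketch of that loss, written here as an illustration rather than the authors' implementation; how the source and target batches are constructed is an assumption.

```python
# Illustrative sketch of the CORAL (correlation alignment) loss.
import torch

def coral_loss(source_feats, target_feats):
    """source_feats, target_feats: (batch, dim) feature matrices from two domains."""
    d = source_feats.size(1)

    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)

    c_s = covariance(source_feats)
    c_t = covariance(target_feats)
    # Squared Frobenius distance between the two covariance matrices.
    return ((c_s - c_t) ** 2).sum() / (4.0 * d * d)

# In training, this term is added (with a tunable weight) to the
# spoofing/genuine classification loss so that the shared encoder
# learns domain-invariant representations.
```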
The survey article Preserving Privacy in Speaker and Speech Characterisation by Nautsch et al. is recommended reading for speech, biometrics and applied cryptography researchers, and for readers who work on speaker and speech privacy, since privacy preservation is mandated by recent European and international data protection regulations. It covers a legal perspective on the privacy preservation of speech data, the requirements for effective privacy preservation, and cryptography-based solutions applicable to speaker characterization and speech characterization, respectively.
In their article End-to-End DNN-Based Text-Independent Speaker Recognition for Long and Short Utterances, Rohdin et al. propose to mimic an i-vector/PLDA system using an end-to-end neural network in order to address the over-fitting problems of usual end-to-end systems. Each part of a classical i-vector/PLDA system, including sufficient-statistics computation, i-vector extraction and PLDA scoring, is replaced by a neural network. Training this system in an end-to-end manner makes the training and evaluation tasks the same, which is beneficial compared to standard x-vector systems. The article additionally describes the entire training process and details the optimization required to limit the memory footprint. The proposed solution performs similarly to an x-vector system, but without requiring data augmentation.
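To illustrate the general idea of turning a PLDA back-end into trainable layers, the sketch below implements the standard PLDA verification score as a bilinear-plus-quadratic function of two embeddings whose parameters are ordinary network weights, so that the whole pipeline can be trained with a single objective. This is an illustrative assumption, not the exact architecture of the article.

```python
# Illustrative sketch of a discriminatively trainable PLDA-style scorer.
import torch
import torch.nn as nn

class DiscriminativePLDAScorer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.Lambda = nn.Parameter(torch.eye(dim))        # across-pair term
        self.Gamma = nn.Parameter(torch.zeros(dim, dim))  # within-embedding term
        self.c = nn.Parameter(torch.zeros(dim))
        self.k = nn.Parameter(torch.zeros(1))

    def forward(self, e1, e2):
        """e1, e2: (batch, dim) enrollment and test embeddings; returns scores."""
        cross = (e1 @ self.Lambda * e2).sum(dim=1)
        quad = (e1 @ self.Gamma * e1).sum(dim=1) + (e2 @ self.Gamma * e2).sum(dim=1)
        lin = (e1 + e2) @ self.c
        return cross + quad + lin + self.k
```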
An adversarial approach is proposed by Chien and Peng in their article Neural Adversarial Learning for Speaker Recognition. Their method can be used for two tasks. First, adversarial training can be used to construct a manifold PLDA that preserves the neighbor embedding of i-vectors in a low-dimensional space to benefit speaker recognition. Second, the generative network can be used to tackle the problem of imbalanced and insufficient data in PLDA speaker recognition by generating artificial examples. To train the pair of networks, they propose to perform multi-objective learning for minimax optimization and introduce regularization terms based on Gaussianity and cosine similarity.
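As a rough illustration of such multi-objective adversarial training, the sketch below combines an adversarial generator objective with simple penalties standing in for the Gaussianity and cosine-similarity regularizers; all functions, weights and shapes are hypothetical and are not the authors' formulation.

```python
# Illustrative sketch only: hypothetical regularized generator objective.
import torch
import torch.nn.functional as F

def gaussianity_penalty(z):
    """Penalize deviation of batch mean/variance from a standard Gaussian."""
    mean_pen = z.mean(dim=0).pow(2).mean()
    var_pen = (z.var(dim=0) - 1.0).pow(2).mean()
    return mean_pen + var_pen

def cosine_penalty(z, centroids, labels):
    """Encourage generated embeddings to align with their class centroids."""
    return 1.0 - F.cosine_similarity(z, centroids[labels]).mean()

def generator_objective(disc, fake, centroids, labels, w_gauss=0.1, w_cos=0.1):
    # Adversarial part of the minimax game: fool the discriminator ...
    adv = F.binary_cross_entropy_with_logits(
        disc(fake), torch.ones(fake.size(0), 1))
    # ... while keeping the generated embeddings well behaved.
    return (adv
            + w_gauss * gaussianity_penalty(fake)
            + w_cos * cosine_penalty(fake, centroids, labels))
```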
In their article Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition, Novotny et al. report the results of a detailed analysis of the noise robustness of speaker verification. They study the use of deep neural networks for audio enhancement and analyze the performance of a standard i-vector system as well as of x-vectors, considered the current state of the art in speaker verification. Their experiments cover a large number of standard corpora, and datasets derived from them, in order to span multiple acoustic domains. This work demonstrates the effectiveness of some methods while warning of the degradation that denoising may induce in clean conditions. The methods compared include denoising as well as data augmentation, robust features and multi-condition training.
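A minimal sketch of such an enhancement front-end, under the assumption of a simple feed-forward mapping from noisy to clean log-magnitude features (the paper's actual networks, features and training data differ), could look as follows.

```python
# Illustrative sketch of a DNN enhancement front-end for speaker recognition.
import torch
import torch.nn as nn

class SpeechEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins))

    def forward(self, noisy):
        """noisy: (batch, time, n_bins) noisy log-magnitude features."""
        return self.net(noisy)

def enhancement_loss(enhancer, noisy, clean):
    # Trained on parallel noisy/clean data; at test time the (frozen) enhancer
    # preprocesses features before i-vector or x-vector extraction.
    return nn.functional.mse_loss(enhancer(noisy), clean)
```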
Finally, in their article entitled Residual Convolutional Neural Network with Attentive Feature Pooling for End-to-End Language Identification from Short-Duration Speech, Monteiro et al. propose a solution for end-to-end language identification. They propose to use residual convolutional neural networks because of their ability to take large contextual segments into account, and they associate this architecture with different attention mechanisms. They demonstrate that, on standard benchmarks, their approach improves the average cost of classical methods by 30% to 40%.
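As an illustration of attentive feature pooling, the sketch below shows one common form, assumed here rather than taken from the article: a small attention head scores each frame-level feature, and the utterance-level representation is the attention-weighted average.

```python
# Illustrative sketch of attentive feature pooling over frame-level features.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1))

    def forward(self, frames):
        """frames: (batch, time, feat_dim) frame-level CNN outputs."""
        weights = torch.softmax(self.attention(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)  # (batch, feat_dim) utterance vector
```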
We express our gratitude to the authors of the submissions to this special issue and present our apologies to the authors of the submissions postponed due to our strict selection rules. We thank the reviewers for the considerable time they invested in reviewing the submissions, and particularly for their fruitful comments aimed at improving the papers. We also wish to thank Prof. Roger K. Moore, the Editor-in-Chief, who supported our project, including our “rush” planning and the associated selection rules, and advised us throughout the (long and complex) editing process.

Jean-François Bonastre*
LIA, Avignon University, Avignon, France

Tomi Kinnunen
University of Eastern Finland, Joensuu, Finland

Anthony Larcher
LIUM, Le Mans Université, Le Mans, France

Junichi Yamagishi
National Institute of Informatics, Tokyo, Japan
*Corresponding author. E-mail address: jean-francois.bonastre@univ-avignon.fr (J.-F. Bonastre).

References
Berkling, K., Bonastre, J., Campbell, J.P., 2007. Introduction to the special section on speaker and language recognition. IEEE Trans. Audio Speech Lang. Process. 15 (7), 1949–1950. doi: 10.1109/TASL.2007.905038.
Campbell, J., 2000. Introduction to the issue. Digit. Signal Process. 10 (1), xi–xv. doi: 10.1006/dspr.2000.0372.
Lleida, E., Rodriguez-Fuentes, L.J., 2018. Speaker and language recognition and characterization: introduction to the CSL special issue. Comput. Speech Lang. 49, 107–120. doi: 10.1016/j.csl.2017.12.001.
