ICASSP 2019----Phoneme Specific Modelling and Scoring Techniques for Anti Spoofing System

https://ieeexplore.ieee.org/document/8682411

Gajan Suthokumar; The University of New South Wales
Kaavya Sriskandaraja; The University of New South Wales
Vidhyasaharan Sethu; The University of New South Wales
Chamith Wijenayake; The University of New South Wales
Eliathamby Ambikairajah; The University of New South Wales

Phoneme Specific Modelling and Scoring Techniques for Anti Spoofing System
Abstract:
A replay attack is the use of recorded speech in an attempt to spoof an automatic speaker verification system, and the development of countermeasures that can detect these attacks is an active area of research.
This paper investigates the effect of phoneme specific information on replay attack detection.
It then develops a replay detection system that employs phoneme specific genuine and spoof models, and compares novel scoring methods that take into account phonetic information obtained from a suitable phoneme recogniser.
Experimental results on the ASVSpoof 2017 V2.0 corpus indicate that replayed speech may be easier to detect in speech corresponding to some phonemes than in others, and consequently that judicious use of phoneme specific models can improve replay detection systems.

SECTION 1. INTRODUCTION
Replay attacks are a simple yet effective means by which an automatic speaker verification (ASV) system can be spoofed using simple audio recording and playback devices [1].
Most current approaches to replay detection rely on the observation that the speech signal involved in a replay attack must pass through both a recording and a playback channel, which in turn may result in some spectral distortion.
Replay detection may then be cast as the problem of detecting this channel distortion, while taking into consideration that there is a myriad of recording and playback channels and that these cannot be known a priori.
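One way to make the channel view concrete is to write the replayed signal as the genuine signal passed through a chain of filters. The formulation below is only an illustrative assumption (linear, time-invariant and noise-free components); the symbols h_rec, h_env and h_play are introduced here for illustration and do not come from the paper.

```latex
% x(t): genuine speech, y(t): replayed speech, * denotes convolution
y(t) = x(t) * h_{\mathrm{rec}}(t) * h_{\mathrm{env}}(t) * h_{\mathrm{play}}(t)

% In the log-magnitude spectral domain the unknown channel becomes an additive term
\log |Y(f)| = \log |X(f)| + \log |H(f)|,
\qquad
H(f) = H_{\mathrm{rec}}(f)\, H_{\mathrm{env}}(f)\, H_{\mathrm{play}}(f)
```

Under these assumptions, detecting a replay amounts to detecting the additive term log |H(f)| without knowing H(f) in advance.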
Generally, a spoofing detection system comprises a front-end and a back-end. Most anti-spoofing research on replay attacks has focused on feature engineering, while the classification blocks are often built on traditional classification techniques such as the Gaussian mixture model (GMM) and the support vector machine (SVM) [2].
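To illustrate what a traditional GMM back-end typically does, the following minimal sketch scores an utterance by the average log-likelihood ratio between a "genuine" and a "spoof" diagonal-covariance GMM. It is not the system described in the paper: the function names, the toy model parameters and the two-dimensional features are all hypothetical, and real systems would train the models on features such as CQCC.

```python
import math

def diag_gmm_loglik(frame, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    gmm is a list of (weight, mean_vector, variance_vector) components.
    """
    log_terms = []
    for weight, mean, var in gmm:
        log_p = math.log(weight)
        for x, m, v in zip(frame, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(log_p)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive values favour 'genuine'."""
    ratios = [diag_gmm_loglik(f, genuine_gmm) - diag_gmm_loglik(f, spoof_gmm)
              for f in frames]
    return sum(ratios) / len(ratios)

# Toy, hand-picked models (hypothetical values, not trained on any corpus).
genuine_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
spoof_gmm = [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])]

utterance = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
print(llr_score(utterance, genuine_gmm, spoof_gmm))  # positive for this toy input
```

Note that this kind of back-end pools all frames of an utterance regardless of which phoneme they belong to.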
Front-ends based on variants of spectral features, long-term spectral statistics [3], voice source [4], phase based features [5] and different variants of deep neural network based systems [5][6][7][8][9] have been investigated extensively.
Features indicative of spectral cues, including the spectral centroid magnitude coefficient (SCMC) [10], constant-Q cepstral coefficient (CQCC) [11], rectangular filter cepstral coefficients (RFCC) [10], scattering coefficients [12], spectral energy slope [13] and spectro-temporal modulation feature (STMF) [14], [15], often use the spectrogram to extract information.
It has been suggested that replayed signals would include noise and reverberation, leading to a flatter and altered spectrogram [14].
Each region of the spectrogram tends to be affected differently, which in turn could mean that different phonemes are affected differently by the replay channel.
Furthermore, it has also been suggested that different phonemes vary in their robustness to reverberation in the context of automatic speech recognition [16].
Motivated by the above findings, we aim to investigate how phoneme related information can be incorporated into replay attack detection systems.
In addition, the features employed in any spoofing detection system will incorporate variability due to a number of factors, such as channel effects, differences between speakers, and phonetic variability arising from the linguistic content.
Previous work has shown that replay detection can be improved by making use of speaker specific models and in turn implicitly compensating for speaker variability [15].
Phonetic variability is generally not explicitly taken into consideration.
Instead, most back-ends model the statistical distribution of the features for replayed and genuine speech and rely on the back-ends capturing the differences across all phonemes.
However, in other areas of speech processing, such as emotion recognition [17] and speaker verification [18], explicit modelling of phonetic information has been shown to be beneficial.
This paper makes three key contributions:
firstly, we investigate whether some phonemes are more conducive to replay detection than others;
secondly, we propose a novel framework to incorporate phoneme specific models into a replay detection system;
and finally, we compare four scoring methods developed to incorporate phonetic information.
To the best of the authors’ knowledge, this is the first study on the effect of phonetic variation in replay detection.

SECTION 2. DATABASE
The original ASVSpoof 2017 challenge corpus [19], comprising genuine recordings and their replayed versions, is used in all the experiments outlined in this paper.
The RedDots text dependent corpus is used directly for the genuine utterances.
Replayed speech utterances are created by recording the playback of the genuine speech through different playback and recording devices in various acoustic environments.
Three non-overlapping subsets (train, development and evaluation) are provided.
As this is a text dependent corpus, 10 phrases have been used in all subsets.
Anomalies identified in the original ASVSpoof 2017 corpus prompted the organisers to release, in 2018, an updated version referred to as the ASVSpoof 2017 Version 2.0 (V2.0) corpus, together with a new enhanced CQCC baseline [11]; all experimental results reported here are on the V2.0 corpus.
It should be noted that results reported using the original ASVSpoof 2017 (V1.0) corpus are not directly comparable with V2.0 results.

SECTION 3. PHONETIC VARIABILITY ANALYSIS
As previously mentioned, replay detection may be cast as the problem of detecting an a priori unknown channel, where the channel comprises the recording and playback devices in addition to the acoustic environment.
This in turn is typically implemented as the detection of the spectral distortion introduced by the channel.
Consequently, since different phonemes have different spectral characteristics (for instance, fricatives have more of their energy content in the high frequency regions, while vowels contain more of their energy in the lower frequency regions), the ease of detecting the spectral characteristics of an unknown channel may vary across different phonemes.
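The following toy sketch illustrates the intuition that one and the same channel can leave a very different footprint on sounds whose energy sits in different frequency regions. Everything in it is an assumption made for illustration: the 16 kHz sampling rate, the sinusoid mixtures standing in for a vowel and a fricative (real fricatives are noise-like), and the two-tap moving average standing in for a replay channel, which in practice is unknown.

```python
import math

FS = 16000  # sampling rate in Hz (assumed)

def tone_mix(freqs_hz, n_samples=1600):
    """Sum of unit-amplitude sinusoids: a crude stand-in for a phoneme's spectrum."""
    return [sum(math.sin(2.0 * math.pi * f * n / FS) for f in freqs_hz)
            for n in range(n_samples)]

def toy_channel(signal):
    """Hypothetical replay channel: two-tap moving average (a mild low-pass filter)."""
    return [0.5 * (signal[n] + (signal[n - 1] if n > 0 else 0.0))
            for n in range(len(signal))]

def energy_change_db(signal):
    """Energy of the channel output relative to its input, in dB."""
    e_in = sum(s * s for s in signal)
    e_out = sum(s * s for s in toy_channel(signal))
    return 10.0 * math.log10(e_out / e_in)

vowel_like = tone_mix([200, 400, 600])         # energy at low frequencies
fricative_like = tone_mix([5000, 6000, 7000])  # energy at high frequencies

print("vowel-like change (dB):    ", round(energy_change_db(vowel_like), 2))
print("fricative-like change (dB):", round(energy_change_db(fricative_like), 2))
```

With this particular low-pass channel the high-frequency mixture loses far more energy than the low-frequency one; a channel with a different response could reverse the picture, which is why the effect has to be measured per phoneme rather than assumed.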
Our main aim in this work is to determine whether some phonemes allow easier detection of spoofed speech than others.
Specifically, we investigate whether each phoneme is affected differently during the replay process.
To analyse which phonemes have more discriminative ability, each phoneme was examined separately.
First, the phoneme present in each frame is detected, and then the discriminative power between the genuine and spoof classes for different phonemes is estimated using two approaches: model-level and classification-level comparison.
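As a rough sketch of what a per-phoneme, model-level comparison could look like, the code below groups frames by their detected phoneme, fits one diagonal Gaussian to the genuine frames and one to the spoofed frames of each phoneme, and ranks phonemes by the symmetric KL divergence between the two. The text above does not specify the models or the distance measure, so the single Gaussians, the symmetric KL divergence, the function names and the toy data are all assumptions.

```python
import math
from collections import defaultdict

def fit_diag_gaussian(frames):
    """Per-dimension mean and variance (with a small floor) of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def symmetric_kl(gauss_a, gauss_b):
    """Symmetric KL divergence, KL(a||b) + KL(b||a), between two diagonal Gaussians."""
    (m_a, v_a), (m_b, v_b) = gauss_a, gauss_b
    total = 0.0
    for ma, va, mb, vb in zip(m_a, v_a, m_b, v_b):
        d2 = (ma - mb) ** 2
        total += 0.5 * ((va + d2) / vb + (vb + d2) / va) - 1.0
    return total

def phoneme_separability(labelled_frames):
    """labelled_frames: iterable of (phoneme, is_genuine, feature_vector) triples."""
    groups = defaultdict(lambda: {True: [], False: []})
    for phoneme, is_genuine, frame in labelled_frames:
        groups[phoneme][is_genuine].append(frame)
    return {
        phoneme: symmetric_kl(fit_diag_gaussian(g[True]), fit_diag_gaussian(g[False]))
        for phoneme, g in groups.items()
        if g[True] and g[False]  # both classes are needed for a comparison
    }

# Toy frames (made-up numbers): "s" differs a lot between classes, "aa" barely.
toy = [
    ("s", True, [1.0]), ("s", True, [1.2]), ("s", False, [3.0]), ("s", False, [3.2]),
    ("aa", True, [0.0]), ("aa", True, [0.2]), ("aa", False, [0.1]), ("aa", False, [0.3]),
]
scores = phoneme_separability(toy)
print(sorted(scores, key=scores.get, reverse=True))  # 's' ranks above 'aa' here
```

A larger divergence suggests that the genuine and spoof distributions for that phoneme are further apart; the frame-level phoneme labels would come from a phoneme recogniser, which is outside the scope of this sketch.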
