【audio evaluation】objective evaluation

MCD

Mel cepstral distortion (MCD):
$$MCD_{K}=\frac{1}{T} \sum_{t=0}^{T-1} \sqrt{\sum_{k=1}^{K}\left(c_{t, k}-c_{t, k}^{\prime}\right)^{2}}$$
where $c_{t, k}$ and $c_{t, k}^{\prime}$ are the k-th mel frequency cepstral coefficients (MFCCs) of the t-th frame from the ground-truth and converted utterances, respectively. $T$ is the number of frames and $K$ is the number of coefficients.
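
The MCD formula can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `mel_cepstral_distortion` is hypothetical, and it assumes the two MFCC sequences are already time-aligned (for example by dynamic time warping) and have the same shape `(T, K)`. Many published implementations also scale the result by a constant to report decibels or skip the 0th coefficient; neither is part of the formula given here, so neither is done below.

```python
import numpy as np


def mel_cepstral_distortion(ref_mfcc, conv_mfcc):
    """Unscaled MCD between two aligned MFCC sequences of shape (T, K).

    Hypothetical helper: assumes the frames are already time-aligned.
    """
    ref = np.asarray(ref_mfcc, dtype=float)
    conv = np.asarray(conv_mfcc, dtype=float)
    if ref.shape != conv.shape or ref.ndim != 2:
        raise ValueError("expected two arrays with the same (T, K) shape")
    # Euclidean distance between coefficient vectors, one value per frame.
    per_frame = np.sqrt(np.sum((ref - conv) ** 2, axis=1))
    # Average over the T frames.
    return float(np.mean(per_frame))


# Two frames with two coefficients each: frame distances are 5 and 0.
print(mel_cepstral_distortion([[0, 0], [0, 0]], [[3, 4], [0, 0]]))  # 2.5
```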

F0 RMSE

F0 root mean squared error (RMSE):
$$F_{0}RMSE=\sqrt{\frac{1}{T} \sum_{i=1}^{T}\left(f_{i}-f_{i}^{\prime}\right)^{2}}$$
where $f_{i}$ and $f_{i}^{\prime}$ are the fundamental frequencies of the i-th frame from the ground-truth and converted utterances, respectively, and $T$ is the number of frames.
Lower MCD and $F_{0}$ RMSE values indicate less distortion between the two utterances.
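
A minimal NumPy sketch of the F0 RMSE formula is shown below. The function name `f0_rmse` and the `voiced_only` option are assumptions added for illustration: F0 trackers often output 0 for unvoiced frames, and whether to exclude those frames is a choice the formula itself does not specify. The two contours are assumed to be time-aligned and of equal length.

```python
import numpy as np


def f0_rmse(f0_ref, f0_conv, voiced_only=False):
    """RMSE between two aligned F0 contours (hypothetical helper)."""
    ref = np.asarray(f0_ref, dtype=float)
    conv = np.asarray(f0_conv, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("F0 contours must have the same length")
    if voiced_only:
        # Assumption: unvoiced frames are marked with F0 == 0.
        mask = (ref > 0) & (conv > 0)
        ref, conv = ref[mask], conv[mask]
    if ref.size == 0:
        raise ValueError("no frames left to compare")
    return float(np.sqrt(np.mean((ref - conv) ** 2)))


print(f0_rmse([0, 100], [50, 110], voiced_only=True))  # 10.0
```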

Inference Speed

Measure how long the model takes to run inference on a fixed number of samples.
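
One way to time inference is sketched below with the standard-library `time.perf_counter`. The names `measure_inference_time` and `infer_fn` are hypothetical; `infer_fn` stands in for whatever callable runs the model on one sample. A warm-up pass is included on the assumption that the first calls may include one-off setup cost. If the model runs asynchronously on a GPU, the device would also need to be synchronized before reading the clock, which this sketch does not do.

```python
import time


def measure_inference_time(infer_fn, samples, warmup=1):
    """Return (total_seconds, seconds_per_sample) for running infer_fn on samples."""
    if not samples:
        raise ValueError("need at least one sample")
    # Warm-up calls are not timed.
    for sample in samples[:warmup]:
        infer_fn(sample)
    start = time.perf_counter()
    for sample in samples:
        infer_fn(sample)
    total = time.perf_counter() - start
    return total, total / len(samples)


# Placeholder "model" that just doubles its input.
total, per_sample = measure_inference_time(lambda x: x * 2, [1, 2, 3])
print(f"total={total:.6f}s per_sample={per_sample:.6f}s")
```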

speaker classification accuracy

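A separately trained speaker classification network predicts the speaker of each converted utterance; the accuracy is the fraction of predictions that match the intended speaker.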
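
Once the classifier has produced a predicted speaker label for each utterance, the accuracy is a simple ratio. The sketch below assumes the predictions and the expected labels are available as two equal-length lists; the function name `classification_accuracy` is hypothetical and the classifier itself is out of scope.

```python
def classification_accuracy(predicted, expected):
    """Fraction of predicted speaker labels that equal the expected labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


print(classification_accuracy(["a", "b", "c", "a"], ["a", "b", "b", "a"]))  # 0.75
```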

Word Error Rate (WER)

Run an automatic speech recognition (ASR) model on the converted speech and calculate the word error rate (WER) of the resulting transcripts. The Google Speech-to-Text API can be used as the ASR model.
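
Given a reference transcript and an ASR transcript, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes plain whitespace tokenization with no text normalization; the function names are hypothetical, and obtaining the transcript from an ASR service is not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because `edit_distance` works on any sequence, calling it on the two strings directly (character by character) and dividing by the reference length gives a character-level error rate in the same way.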

character error rate (CER)

The character error rate (CER) is used to evaluate linguistic consistency.

automatic speaker verification (ASV)

An automatic speaker verification (ASV) system is used to compute the equal error rate (EER) from sample pairs of target and converted speech.
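
The EER is the operating point at which the false acceptance rate equals the false rejection rate. A minimal NumPy sketch is given below, assuming the ASV system has already produced similarity scores for pairs that should be accepted (`genuine_scores`) and pairs that should be rejected (`impostor_scores`); these names, and the simple threshold sweep over the observed scores, are illustrative assumptions rather than any particular toolkit's method.

```python
import numpy as np


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping thresholds over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # A pair is accepted when its score is >= the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    # With finitely many scores the two rates may not cross exactly,
    # so report their average at the closest threshold.
    return float((far[idx] + frr[idx]) / 2)


print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6]))  # 0.25
```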

PCC (Pearson correlation coefficient)

The Pearson correlation coefficient (PCC) is used to measure prosody consistency.
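
The coefficient itself can be computed as below. The sketch assumes prosody is represented by two time-aligned numeric contours of equal length (for instance F0 values of the source and converted utterances); that choice of feature is an assumption, and the function name `pearson_correlation` is hypothetical.

```python
import numpy as np


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("sequences must have the same length")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return float(np.sum(xc * yc) / denom)


# A contour and a scaled copy of it are perfectly correlated.
print(round(pearson_correlation([100, 120, 140], [200, 240, 280]), 6))  # 1.0
```

Values close to 1 indicate that the two contours rise and fall together, regardless of their absolute level or scale.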

Speaker Similarity score

Use a speaker verification system based on i-vectors, which formulates the speaker verification problem in the total variability space.
After length normalization, cosine distance scoring is applied to the i-vectors.
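
The scoring step can be sketched as follows, assuming each utterance has already been reduced to a fixed-length speaker vector (i-vector extraction itself is not shown). The function names are hypothetical. The value returned is the cosine similarity of the two length-normalized vectors; a cosine distance, if needed, is one minus this value.

```python
import numpy as np


def length_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    v = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm


def cosine_score(vec_a, vec_b):
    """Cosine similarity between two speaker vectors after length normalization."""
    return float(np.dot(length_normalize(vec_a), length_normalize(vec_b)))


print(cosine_score([1, 0], [2, 0]))  # 1.0, same direction
print(cosine_score([1, 0], [0, 3]))  # 0.0, orthogonal
```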

Cosine distance

A speaker verification (SV) system is used to evaluate the similarity between the converted speech and the target speakers' speech.

Reference:
Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
Overview of Voice Conversion Methods Based on Deep Learning
