【audio evaluation】objective evaluation

eggplant323

Categories: TTS, accent conversion. Tags: speech recognition, natural language processing, learning

Article link: https://blog.csdn.net/eggplant323/article/details/132655052


objective evaluation

MCD
F0RMSE
Inference Speed
speaker classification accuracy
Word Error Rate (WER)
character error rate (CER)
PCC (Pearson correlation coefficient)
Speaker Similarity score
Cosine distance

MCD

Mel cepstral distortion (MCD):
$D_{K}=\frac{1}{T} \sum_{t=0}^{T-1} \sqrt{\sum_{k=1}^{K}\left(c_{t, k}-c_{t, k}^{\prime}\right)^{2}}$
where $c_{t, k}$ and $c_{t, k}^{\prime}$ are the $k$-th mel-frequency cepstral coefficients (MFCCs) of the $t$-th frame of the ground-truth and converted utterances, respectively, and $T$ and $K$ are the number of frames and the number of coefficients, respectively.
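The formula above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the two utterances have already been converted to MFCC arrays of shape (T, K) and time-aligned frame by frame (for example with dynamic time warping); the function name `mel_cepstral_distortion` is chosen here for illustration. Many papers additionally scale the result by $\frac{10}{\ln 10}\sqrt{2}$ to report it in dB; that factor is not part of the formula above and is left out.

```python
import numpy as np

def mel_cepstral_distortion(ref, conv):
    """Mean per-frame Euclidean distance between two (T, K) coefficient arrays."""
    ref = np.asarray(ref, dtype=float)
    conv = np.asarray(conv, dtype=float)
    # Assumption: both arrays are already time-aligned and have the same shape.
    if ref.ndim != 2 or ref.shape != conv.shape:
        raise ValueError("expected two time-aligned arrays of shape (T, K)")
    # Inner sum over coefficients k, square root per frame, then mean over frames t.
    per_frame = np.sqrt(np.sum((ref - conv) ** 2, axis=1))
    return float(per_frame.mean())
```

Identical inputs give a distortion of 0, and the value grows with the per-frame distance between the two coefficient sequences.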

F0RMSE

F0 root mean squared error (RMSE):
$F_{0}\,\mathrm{RMSE}=\sqrt{\frac{1}{T} \sum_{i=1}^{T}\left(f_{i}-f_{i}^{\prime}\right)^{2}}$
where $f_{i}$ and $f_{i}^{\prime}$ are the fundamental frequencies of the $i$-th frame of the ground-truth and converted utterances, respectively.
Lower MCD and $F_{0}$ RMSE values indicate smaller distortion between the two utterances.
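The F0 RMSE formula translates directly into code. The sketch below assumes two frame-aligned F0 contours of equal length in the same unit (for example Hz); the function name `f0_rmse` is illustrative. The text does not say how unvoiced frames are treated, so any masking of such frames is left to the caller.

```python
import numpy as np

def f0_rmse(f0_ref, f0_conv):
    """Root mean squared error between two frame-aligned F0 contours."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_conv = np.asarray(f0_conv, dtype=float)
    # Assumption: the contours are already aligned, one value per frame.
    if f0_ref.ndim != 1 or f0_ref.shape != f0_conv.shape:
        raise ValueError("expected two 1-D contours of equal length")
    return float(np.sqrt(np.mean((f0_ref - f0_conv) ** 2)))
```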

Inference Speed

Measure how long the model takes to run inference on a fixed number of samples.
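A timing harness for this measurement might look like the following. It is a minimal sketch: `infer_fn` stands in for whatever callable runs the model on one sample, and the `warmup` parameter is an assumption added here so that one-off start-up costs are not counted.

```python
import time

def time_inference(infer_fn, samples, warmup=1):
    """Return (total_seconds, seconds_per_sample) for running infer_fn over samples."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    # Warm-up runs are executed but not timed.
    for sample in samples[:warmup]:
        infer_fn(sample)
    start = time.perf_counter()
    for sample in samples:
        infer_fn(sample)
    total = time.perf_counter() - start
    return total, total / len(samples)
```

For models running on a GPU, execution is usually asynchronous, so the device should be synchronized before the clock is read; that step depends on the framework and is not shown here.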

speaker classification accuracy

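Speaker classification accuracy is measured with a network trained to predict the speaker: the larger the share of converted utterances it assigns to the target speaker, the better the conversion.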

Word Error Rate (WER)

Transcribe the converted speech with an automatic speech recognition (ASR) model and compute the word error rate (WER) of the transcript. The Google Speech-to-Text API can be used as the ASR model.
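Once the ASR transcript is available, WER is the word-level edit distance to the reference text divided by the number of reference words. The sketch below covers only that scoring step; the ASR call itself is not shown, and text normalization (case, punctuation) is assumed to have been done beforehand. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Calling `edit_distance` on lists of characters instead of lists of words, and dividing by the number of reference characters, gives the character error rate in the same way.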

character error rate (CER)

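CER is used to evaluate linguistic consistency, that is, how well the converted speech preserves the spoken content.

automatic speaker verification (ASV)

An automatic speaker verification (ASV) system is used to compute the equal error rate (EER) over sample pairs drawn from the target and converted speech.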
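The EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it from a list of verification scores; it assumes that higher scores mean "more likely the same speaker" and that each pair carries a binary label (1 for a same-speaker pair, 0 otherwise). How pairs are formed and labeled is not specified in the text, so that part is an assumption, as is the function name.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores and binary labels (1 = same speaker)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("need both same-speaker and different-speaker pairs")
    best_gap, eer = np.inf, 1.0
    # Sweep every observed score as a candidate decision threshold.
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # different-speaker pairs wrongly accepted
        frr = np.mean(~accept[labels])   # same-speaker pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

With few pairs the two error rates may never be exactly equal, so the sketch reports their average at the threshold where they are closest.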

PCC (Pearson correlation coefficient)

The Pearson correlation coefficient (PCC) is used to measure prosody consistency.
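The Pearson correlation coefficient can be computed directly from its definition. The text does not say which prosodic feature is compared; the sketch below assumes two frame-aligned sequences of equal length, for example the F0 contours of the source and converted speech. The function name is illustrative.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("expected two 1-D sequences of equal length >= 2")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return float(np.sum(xc * yc) / denom)
```

The result lies between -1 and 1; values close to 1 indicate that the two contours rise and fall together.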

Speaker Similarity score

A speaker verification system based on i-vectors is used, which formulates the speaker verification problem in the total variability space.
After length normalization, the i-vectors are scored with cosine distance.

Cosine distance

A speaker verification (SV) system is used to evaluate the similarity between the converted speech and the target speakers' speech, measured as the cosine distance between their speaker embeddings.
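Cosine scoring of two speaker embeddings can be sketched as follows. The embeddings are assumed to come from some speaker verification model, which is not shown; the function names are illustrative, and "cosine distance" is taken here as one minus cosine similarity, which is one common convention rather than something the text specifies.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    # Length normalization: scale both vectors to unit norm before the dot product.
    return float(np.dot(a / norm_a, b / norm_b))

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)
```

A similarity near 1 (distance near 0) means the converted speech is close to the target speaker in the embedding space.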

References:
Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck
FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
Overview of Voice Conversion Methods Based on Deep Learning