2021-08-13 【翻译】Mel Frequency Cepstral Coefficient (MFCC) tutorial

原文http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
Mel Frequency Cepstral Coefficient (MFCC) tutorial
梅尔倒谱系数教程

P1.The first step in any automatic speech recognition system is to extract features i.e. identify the components of the audio signal that are good for identifying the linguistic content and discarding all the other stuff which carries information like background noise, emotion etc.
任何自动语音识别系统的第一步都是提取特征,即确定出音频信号中有利于识别语言内容的组分,并丢弃携带背景噪声、情感等信息的所有其他内容。

P2.The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract including tongue, teeth etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to accurately represent this envelope. This page will provide a short tutorial on MFCCs.
了解发音的一个要点是:人类产生的声音是通过声道包括舌头、牙齿等“滤波器”形成的。声道的形状决定了发出什么样的声音。如果我们能准确地确定形状,就能让我们准确地表示正在发出的音素。声道的形状表现在短时间功率谱的包络中,而MFCCs的工作就是准确地表示这个包络。此页面将提供关于 MFCCs 的简短教程。

P3.Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s, and have been state-of-the-art ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) (click here for a tutorial on cepstrum and LPCCs) were the main feature type for automatic speech recognition (ASR), especially with HMM classifiers. This page will go over the main aspects of MFCCs, why they make a good feature for ASR, and how to implement them.
梅尔频率倒谱系数(Mel Frequency Cepstral Coefficients, MFCCs)是一种广泛应用于自动语音识别和说话人识别的特征。它们由Davis和Mermelstein在20世纪80年代提出,此后一直是最先进的特征。在MFCCs被引入之前,线性预测系数(LPCs)和线性预测倒谱系数(LPCCs)是自动语音识别(ASR)的主要特征类型,特别是在使用HMM分类器的情况下。本页面将介绍MFCCs的主要方面、它们为什么是ASR的好特征,以及如何实现它们。

Steps at a Glance 步骤一览

P4.We will give a high level intro to the implementation steps, then go in depth why we do the things we do. Towards the end we will go into a more detailed description of how to calculate MFCCs.
我们将对实现步骤做一个高度概括的介绍,然后深入解释我们为什么要做这些事情。最后,我们将更详细地描述如何计算MFCCs。

  1. Frame the signal into short frames.将信号分成短帧。
  2. For each frame calculate the periodogram estimate of the power spectrum.对每一帧计算功率谱的周期图估计。
  3. Apply the mel filterbank to the power spectra, sum the energy in each filter.将mel滤波器组应用于功率谱,对每个滤波器的能量进行求和。
  4. Take the logarithm of all filterbank energies.取所有滤波器组能量的对数。
  5. Take the DCT of the log filterbank energies.取对数滤波器组能量的DCT。
  6. Keep DCT coefficients 2-13, discard the rest.保留DCT系数2-13,其他的舍去。
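The six steps above can be sketched end-to-end in NumPy. This is an illustrative sketch rather than the article's reference implementation: the helper math is inlined, and the parameter values (25 ms frames, 10 ms step, 512-point FFT, 26 filters, 12 kept coefficients) are the common defaults this page discusses.

```python
import numpy as np

def mfcc_sketch(signal, sample_rate=16000, nfft=512, nfilt=26, numcep=12):
    """End-to-end sketch of the six MFCC steps (illustrative defaults)."""
    # 1. frame into 25 ms frames with a 10 ms step, zero-padding the tail
    flen, fstep = int(0.025 * sample_rate), int(0.010 * sample_rate)
    nframes = max(1, 1 + (len(signal) - flen + fstep - 1) // fstep)
    padded = np.append(signal, np.zeros((nframes - 1) * fstep + flen - len(signal)))
    frames = np.stack([padded[i * fstep:i * fstep + flen] for i in range(nframes)])
    # 2. periodogram estimate of the power spectrum
    pspec = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 3. triangular mel filterbank, summed energy per filter
    mel = lambda f: 1125 * np.log(1 + f / 700)
    imel = lambda m: 700 * (np.exp(m / 1125) - 1)
    pts = np.floor((nfft + 1) * imel(np.linspace(0, mel(sample_rate / 2),
                                                 nfilt + 2)) / sample_rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        l, c, r = pts[i], pts[i + 1], pts[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    energies = pspec @ fbank.T
    # 4. log of the filterbank energies (small floor avoids log(0))
    log_e = np.log(energies + 1e-10)
    # 5. DCT-II of the log energies, basis written out explicitly
    n = np.arange(nfilt)
    dct = np.cos(np.pi * np.outer(np.arange(nfilt), 2 * n + 1) / (2 * nfilt))
    cepstra = log_e @ dct.T
    # 6. keep coefficients 2-13 (drop the 0th, keep the next 12)
    return cepstra[:, 1:numcep + 1]
```

For one second of 16 kHz audio this yields 99 frames of 12 coefficients each.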

P5.There are a few more things commonly done: sometimes the frame energy is appended to each feature vector, and Delta and Delta-Delta features are usually also appended. Liftering is also commonly applied to the final features.
还有一些常见的附加处理:有时会把帧能量附加到每个特征向量上,通常还会附加Delta和Delta-Delta(差分)特征。升倒谱(liftering)也通常应用于最终特征。

Why do we do these things? 为什么我们做这些?

P6.We will now go a little more slowly through the steps and explain why each of the steps is necessary.
现在,我们将稍微慢一些地完成这些步骤,并解释为什么每个步骤都是必要的。

P7.An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn’t change much (when we say it doesn’t change, we mean statistically i.e. statistically stationary, obviously the samples are constantly changing on even short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don’t have enough samples to get a reliable spectral estimate, if it is longer the signal changes too much throughout the frame.
音频信号是不断变化的,所以为了简化问题,我们假设音频信号在短时间尺度上变化不大(当我们说它没有变化时,我们的意思是统计上的,也就是统计上的平稳,显然样本在短时间尺度上也在不断变化)。这就是为什么我们把信号分割成20-40毫秒的帧。如果帧很短,我们就没有足够的采样样本来得到可靠的谱估计,如果帧很长,则信号在整个帧中变化太多。

P8.The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear) which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.
下一步是计算每帧的功率谱。这一步的动机来自人类耳蜗(耳朵里的一个器官):耳蜗会根据传入声音的频率在不同位置振动。根据耳蜗中发生振动的位置(振动会带动细小的毛发),不同的神经会向大脑发出信号,告知某些频率的存在。我们的周期图估计为我们完成了类似的工作,识别出帧中出现了哪些频率。

P9.The periodogram spectral estimate still contains a lot of information not required for Automatic Speech Recognition (ASR). In particular the cochlea can not discern the difference between two closely spaced frequencies. This effect becomes more pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions. This is performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations. We are only interested in roughly how much energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and how wide to make them. See below for how to calculate the spacing.
周期图谱估计中仍然包含许多自动语音识别(ASR)不需要的信息。特别是,耳蜗无法分辨两个间隔很近的频率之间的差异,而且频率越高,这种效应越明显。因此,我们把周期图的频点分成若干簇并将其累加,以了解各个频率区域中存在多少能量。这由梅尔滤波器组完成:第一个滤波器非常窄,指示0赫兹附近存在多少能量;随着频率升高,滤波器变得越来越宽,因为我们对细微变化的关心程度降低了。我们只关心每个位置大致有多少能量。梅尔尺度准确地告诉我们如何设置滤波器组的间距以及滤波器的宽度。关于如何计算间距,请参阅下文。
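One way to build such a filterbank is sketched below, assuming 26 triangular filters over a 512-point FFT at 16 kHz; these are common choices, not requirements, and the function name is illustrative:

```python
import numpy as np

def mel_filterbank(nfilt=26, nfft=512, sample_rate=16000):
    """Triangular filters spaced uniformly on the Mel scale."""
    hz_to_mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
    # nfilt filters need nfilt + 2 boundary points, equally spaced in Mels,
    # then mapped back to Hz and rounded down to FFT bin numbers
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising edge of triangle i
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of triangle i
            fbank[i, k] = (right - k) / (right - center)
    return fbank
```

Multiplying a frame's power spectrum by each row and summing gives the 26 filterbank energies; note how the rows get wider as frequency increases.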

P10.Once we have the filterbank energies, we take the logarithm of them. This is also motivated by human hearing: we don’t hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear. Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.
有了滤波器组的能量后,我们对它们取对数。这同样受人类听觉的启发:我们对响度的感知不是线性的。一般来说,要将感知到的音量增加一倍,我们需要投入8倍的能量。这意味着,如果声音一开始就很响,能量的巨大变化听起来可能并没有那么不同。这种压缩操作使我们的特征与人类实际听到的更接近。为什么用对数而不是立方根?因为对数允许我们使用倒谱均值减法(cepstral mean subtraction),这是一种信道归一化技术。

P11.The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons this is performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies which means diagonal covariance matrices can be used to model the features in e.g. a HMM classifier. But notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.
最后一步是计算对数滤波器组能量的DCT。这样做有两个主要原因。其一,由于滤波器组是相互重叠的,各滤波器组能量之间相关性很强,DCT可以对这些能量去相关,这样就可以在例如HMM分类器中使用对角协方差矩阵来建模特征。其二,请注意26个DCT系数中只保留了12个,这是因为较高阶的DCT系数代表滤波器组能量的快速变化,而事实证明这些快速变化实际上会降低ASR的性能,所以舍弃它们能带来小幅提升。
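The DCT-and-truncate step can be sketched as follows. The DCT-II basis is written out explicitly rather than taken from a library, and `dct2_keep` is a hypothetical helper name:

```python
import numpy as np

def dct2_keep(log_energies, keep=12):
    """DCT-II of the log filterbank energies, keeping coefficients 2-13
    (indices 1..12 when counting from 0, i.e. the 0th is dropped)."""
    n = len(log_energies)
    k = np.arange(n)[:, None]
    # basis[k, m] = cos(pi * k * (2m + 1) / (2n)), the DCT-II kernel
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    cepstra = basis @ log_energies
    return cepstra[1:keep + 1]
```

A constant energy vector lands entirely in the discarded 0th coefficient, which is one way to see that the kept coefficients describe the spectral shape rather than the overall level.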

What is the Mel scale? 什么是梅尔尺度?

P12.The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear.
梅尔尺度将人对纯音的感知频率(即音高)与其实际测量频率联系起来。人类在低频处分辨音高细微变化的能力远强于高频处。引入这种尺度可以使我们的特征与人类听到的更接近。

The formula for converting from frequency to Mel scale is:
频率转换为梅尔尺度的公式为:
M(f) = 1125 ln(1 + f/700)
To go from Mels back to frequency:
从梅尔尺度回到频率:
M^{-1}(m) = 700(exp(m/1125) - 1)
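These two formulas translate directly into code (the function names are illustrative):

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale: M(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Invert the Mel mapping: M^{-1}(m) = 700 (exp(m/1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

The two functions are exact inverses of each other, so converting to Mels and back recovers the original frequency.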

Implementation steps 实施步骤

We start with a speech signal, we’ll assume sampled at 16kHz.
我们从一个语音信号开始,假设采样频率是16kHz。

1 Frame the signal into 20-40 ms frames. 25ms is standard. This means the frame length for a 16kHz signal is 0.025 × 16000 = 400 samples. Frame step is usually something like 10ms (160 samples), which allows some overlap to the frames. The first 400 sample frame starts at sample 0, the next 400 sample frame starts at sample 160 etc. until the end of the speech file is reached. If the speech file does not divide into an even number of frames, pad it with zeros so that it does.
将信号分成20-40毫秒的帧,25毫秒是标准值。这意味着16kHz信号的帧长为 0.025 × 16000 = 400 个样本。帧步长通常为10毫秒(160个样本),这样相邻帧之间有一定重叠。第一个400样本的帧从样本0开始,下一个400样本的帧从样本160开始,以此类推,直到语音文件结束。如果语音文件不能恰好分成整数个帧,就用零填充补齐。
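A sketch of this framing step in NumPy (the function name and default parameters are illustrative):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_len_s=0.025, frame_step_s=0.010):
    """Slice a 1-D signal into overlapping frames, zero-padding the tail."""
    frame_len = int(round(frame_len_s * sample_rate))    # 400 samples at 16 kHz
    frame_step = int(round(frame_step_s * sample_rate))  # 160 samples at 16 kHz
    num_frames = 1 + int(np.ceil(max(len(signal) - frame_len, 0) / frame_step))
    pad_len = (num_frames - 1) * frame_step + frame_len
    padded = np.append(signal, np.zeros(pad_len - len(signal)))
    # index matrix: row i selects samples i*frame_step .. i*frame_step+frame_len-1
    idx = (np.arange(frame_len)[None, :]
           + frame_step * np.arange(num_frames)[:, None])
    return padded[idx]
```

For a 1000-sample signal this gives 5 frames of 400 samples, with the last frame padded by zeros.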

2 To take the Discrete Fourier Transform of the frame, perform the following:
对每一帧做离散傅里叶变换,执行以下步骤:
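A common way to carry out this step, consistent with the periodogram estimate described earlier, is sketched below: an N-point FFT of each frame (N = 512 is assumed here), squared magnitude divided by N, keeping only the first N/2 + 1 bins since the spectrum of a real signal is symmetric.

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """Periodogram estimate: P(k) = |FFT(frame)(k)|^2 / NFFT,
    keeping the first NFFT/2 + 1 bins of the symmetric spectrum."""
    spectrum = np.fft.rfft(frames, n=nfft)   # shape (num_frames, nfft//2 + 1)
    return (np.abs(spectrum) ** 2) / nfft
```

The result is one 257-point power spectrum per 400-sample frame, ready to be passed through the Mel filterbank.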

Computing the Mel filterbank 计算梅尔滤波器组

Deltas and Delta-Deltas 差分与二阶差分特征

Implementations 实现

References 参考文献

未完待续…
翻译by有道tty
