Mel-Frequency Cepstral Coefficients

Latest recommended article published on 2023-11-20 20:12:46

科学辰辰

1. Overview

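MFCC stands for Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient). The mel is a unit of subjective (perceived) pitch, whereas the hertz (Hz) is a unit of objective frequency. The mel scale was derived from the hearing characteristics of the human ear and maps nonlinearly to frequency in Hz. MFCCs exploit this relationship: they are spectral features computed from the Hz spectrum after it has been mapped onto the mel scale.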
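
One common form of the Hz-to-mel mapping, and the one used by the Python example later in this post, is mel = 1127.01048 * ln(1 + f / 700). The sketch below is only an illustration of that formula; the function names hz_to_mel and mel_to_hz are chosen here and do not come from any library.

```python
import math

MEL_SCALE = 1127.01048  # constant used by the Python example later in this post


def hz_to_mel(freq_hz):
    """Map a frequency in Hz to mels using the natural-log form of the scale."""
    return MEL_SCALE * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping: mels back to Hz."""
    return 700.0 * (math.exp(mel / MEL_SCALE) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz shrink on the mel scale as frequency rises.
    for low, high in ((1000.0, 2000.0), (3000.0, 4000.0)):
        print(low, high, hz_to_mel(high) - hz_to_mel(low))
```

With this constant, 1000 Hz lands at about 1000 mel, and a 1000 Hz step covers fewer mels at high frequencies than at low ones, which is the nonlinearity described above.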

2. Applications

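MFCCs are widely used in speech recognition. Because the mel scale is a nonlinear function of frequency in Hz, the precision of the MFCC computation falls as frequency rises. In practice, therefore, only the low-frequency MFCCs are usually kept, and the mid- and high-frequency ones are discarded.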

3. Extraction pipeline

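MFCC parameter extraction consists of the following steps:
 Pre-filtering: an anti-aliasing filter with a 300-3400 Hz passband at the CODEC front end.
 A/D conversion: 8 kHz sampling rate with 12-bit linear quantization.
 Pre-emphasis: the signal is passed through a first-order finite impulse response (FIR) high-pass filter, which flattens its spectrum and makes it less susceptible to finite word-length effects.
 Framing: because speech is stationary over short intervals, it can be processed frame by frame. The experiments use a frame length of 32 ms with a 16 ms overlap between frames.
 Windowing: each frame is multiplied by a Hamming window to reduce the Gibbs effect.
 Fast Fourier transform (FFT): converts the time-domain signal into its power spectrum (the squared magnitude of the FFT).
 Triangular filter bank: the power spectrum is filtered by a bank of triangular filters (24 in total) spaced linearly on the mel scale. Each filter covers roughly one critical band of the human ear, which models the ear's masking effect.
 Logarithm: taking the logarithm of the filter bank outputs gives a result that approximates a homomorphic transform.
 Discrete cosine transform (DCT): decorrelates the dimensions of the signal and maps it into a lower-dimensional space.
 Spectral weighting: the low-order cepstral parameters are easily affected by speaker and channel characteristics, while the high-order parameters have relatively low discriminative power, so the cepstrum is weighted to suppress both.
 Cepstral mean subtraction (CMS): effectively reduces the influence of the speech input channel on the feature parameters.
 Delta parameters: extensive experiments show that adding delta parameters, which capture the dynamics of speech, improves recognition performance. This system also uses the first- and second-order deltas of the MFCCs.
 Short-time energy: the short-time energy of speech is another important feature. This system uses the normalized short-time log energy together with its first- and second-order deltas.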
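
The front-end steps above (pre-emphasis, framing, windowing, FFT) and two of the post-processing steps (CMS and delta parameters) can be sketched with NumPy as follows. This is a minimal illustration, not the system described in the text: the 8 kHz rate, 32 ms frame and 16 ms overlap come from the list above, while the pre-emphasis coefficient 0.97, the two-point delta formula and all function names are assumptions made here.

```python
import numpy as np

SAMPLE_RATE = 8000   # Hz, from the A/D step above
FRAME_LEN = 256      # 32 ms at 8 kHz
HOP_LEN = 128        # 32 ms frame with 16 ms overlap gives a 16 ms hop
PRE_EMPHASIS = 0.97  # assumed value; the text does not specify it


def pre_emphasize(signal, coeff=PRE_EMPHASIS):
    """First-order FIR high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    return np.concatenate(([signal[0]], signal[1:] - coeff * signal[:-1]))


def frame_power_spectra(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Split into overlapping frames, apply a Hamming window, return |FFT|^2.

    Assumes the signal holds at least one full frame.
    """
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    starts = hop_len * np.arange(num_frames)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    windowed = frames * np.hamming(frame_len)
    return np.abs(np.fft.rfft(windowed, axis=1)) ** 2


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean taken over all frames (rows)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)


def delta(features):
    """First-order delta as a two-point central difference along time."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0
```

Under these assumptions, a second-order delta is simply delta(delta(features)), and the power spectra returned by frame_power_spectra are what the triangular filter bank step would consume.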

=================

Mel-frequency cepstrum

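The mel-frequency cepstrum is an application of the cepstrum that is widely used in audio signal processing, where it matches the way the human ear analyzes sound more closely than the plain cepstrum does. The two differ as follows:

The frequency bands of the mel-frequency cepstrum are designed around human hearing. The ear's ability to resolve frequencies depends on their ratio, so the difference between 200 Hz and 300 Hz is perceived as the same as the difference between 2000 Hz and 3000 Hz.
The mel-frequency cepstrum takes the logarithm of the signal's energy, whereas the cepstrum takes the logarithm of the raw spectral values.
The mel-frequency cepstrum uses the discrete cosine transform, whereas the cepstrum uses the discrete Fourier transform.
Mel-frequency cepstral coefficients are sufficient to describe the characteristics of speech.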

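Steps for deriving the mel-frequency cepstral coefficients (MFCCs):

Take the Fourier transform of the signal.
Square the magnitude of each spectral value to obtain energy, then multiply by the coefficients of the corresponding overlapping triangular mel windows, summing within each window to get one energy per mel band.
Take the logarithm of the energy at each mel frequency.
Take the discrete cosine transform.
The result is the mel-frequency cepstral coefficients.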
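
Written as formulas, the steps above take roughly the following form. The notation is introduced here for illustration and is not from the original: X[k] is the Fourier transform of one frame, H_m[k] is the m-th triangular window, M is the number of windows, and the DCT is shown in its unnormalized type-II form (libraries often add a constant scale factor).

```latex
E_m = \sum_{k} \lvert X[k] \rvert^{2} \, H_m[k], \qquad m = 1, \dots, M

c_n = \sum_{m=1}^{M} \log(E_m) \, \cos\!\left[ \frac{\pi n}{M} \left( m - \frac{1}{2} \right) \right], \qquad n = 0, 1, \dots, M - 1
```

Here c_n are the mel-frequency cepstral coefficients; in practice only the first few are usually kept.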

Applications of the mel-frequency cepstrum

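MFCCs are commonly used in speech-based recognition, for example to identify a speaker over the telephone.
Because each musical style or instrument has distinct characteristics in the mel-frequency domain, MFCCs can also be used to analyze and classify kinds of music.

Noise sensitivity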

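MFCCs are easily corrupted by additive noise. Some studies therefore suggest raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the discrete cosine transform, which reduces the influence of noise on the low-energy components.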
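
Reading the suggestion above as "raise each log-mel amplitude to a power of about 2 or 3, then take the DCT", a rough NumPy sketch could look like this. The function names, the default of 13 coefficients, the sign-preserving power and the unnormalized DCT-II are all assumptions made for illustration; the text does not specify any of them.

```python
import numpy as np


def dct2(x):
    """Unnormalized DCT-II along the last axis (differs from library versions by a constant factor)."""
    n = x.shape[-1]
    k = np.arange(n)
    # basis[q, m] = cos(pi * q * (m + 1/2) / n)
    basis = np.cos(np.pi * np.outer(k, k + 0.5) / n)
    return x @ basis.T


def power_law_cepstrum(log_mel, power=2.0, num_coeffs=13):
    """Raise log-mel amplitudes to `power`, then take the DCT.

    Assumption: the sign is preserved so that negative log values stay
    negative; the source text does not say how they should be handled.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    raised = np.sign(log_mel) * np.abs(log_mel) ** power
    return dct2(raised)[..., :num_coeffs]
```

With power=1.0 this reduces to the ordinary log-then-DCT step, which makes it easy to compare the two variants on the same filter bank outputs.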

===============

http://stackoverflow.com/questions/5835568/how-to-get-mfcc-from-an-fft-on-a-signal#5846459

First, split the signal into small frames of 10 to 30 ms, apply a windowing function (Hamming is recommended for sound applications), and compute the Fourier transform of each frame. Given the DFT, compute the Mel Frequency Cepstral Coefficients with these steps:

Get the power spectrum: |DFT|^2
Apply a triangular filter bank to map the Hz scale onto the mel scale
Take the log of the filtered spectrum
Apply the discrete cosine transform

A Python code example:

import math

import numpy
from scipy.fftpack import dct
from scipy.io import wavfile

numCoefficients = 13  # choose the size of the MFCC array
minHz = 0.0
maxHz = 22000.0  # the original "22.000" is read here as 22 kHz


def freqToMel(freq):
    return 1127.01048 * math.log(1 + freq / 700.0)


def melToFreq(mel):
    return 700 * (math.exp(mel / 1127.01048) - 1)


def melFilterBank(blockSize, nyquist):
    # blockSize: number of spectrum bins covering 0 Hz up to nyquist
    numBands = int(numCoefficients)
    maxMel = int(freqToMel(min(maxHz, nyquist)))
    minMel = int(freqToMel(minHz))

    # Create a matrix for triangular filters, one row per filter
    filterMatrix = numpy.zeros((numBands, blockSize))

    # numBands + 2 points equally spaced on the mel scale
    melRange = numpy.arange(numBands + 2)
    melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

    # Convert each mel point back to Hz and then to a spectrum bin index;
    # each array index represents the center of one triangular filter
    aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0  # equals 1 / 1127.01048
    aux = (numpy.exp(melCenterFilters * aux) - 1) / nyquist
    aux = 0.5 + 700 * blockSize * aux
    aux = numpy.floor(aux)  # round down
    centerIndex = numpy.array(aux, int)  # get int values

    for i in range(numBands):
        start, centre, end = centerIndex[i:i + 3]
        k1 = numpy.float32(centre - start)
        k2 = numpy.float32(end - centre)
        up = (numpy.arange(start, centre) - start) / k1  # rising edge
        down = (end - numpy.arange(centre, end)) / k2  # falling edge

        filterMatrix[i][start:centre] = up
        filterMatrix[i][centre:end] = down

    return filterMatrix.transpose()


# "file.wav" is expected to hold a single mono frame of 10 to 30 ms
sampleRate, signal = wavfile.read("file.wav")
signal = signal * numpy.hamming(len(signal))  # windowing, as described above

complexSpectrum = numpy.fft.rfft(signal)  # bins from 0 Hz to Nyquist
powerSpectrum = abs(complexSpectrum) ** 2
filterBank = melFilterBank(len(powerSpectrum), sampleRate / 2.0)
filteredSpectrum = numpy.dot(powerSpectrum, filterBank)
# the small offset avoids log(0) when a band is empty
logSpectrum = numpy.log(filteredSpectrum + numpy.finfo(float).eps)
dctSpectrum = dct(logSpectrum, type=2)  # MFCC :)

This code is based on the MFCC Vamp example. I hope this helps you!

edited Oct 4 '13 at 20:43

answered May 1 '11 at 4:35

alf.

Hi, do you mean for "file.wav" to be a frame (10 ms to 30 ms)? If not, you need to split the signal into small frames and then apply the operations you did to each frame. For each frame, you should get out 13 coefficients. – engineerchuan May 3 '11 at 3:36

... I was confused by that too. I assumed he was talking about the size of the window. It's where we grab the values and then compute the FFT on them. Please confirm – Pavan May 3 '11 at 7:06

But what happens once I have the coefficients? What do I do with them? I'm assuming I get the coefficients of sound 1 and then the coefficients of sound 2... then what? – Pavan May 3 '11 at 23:34
Sorry! I extracted it from my research code and then forgot to make that clear. Treat "file.wav" as a single sound frame of 10 ms to 30 ms! I think just having the coefficients isn't enough: you need to pass the MFCCs to an algorithm that classifies them. I'm using a back-propagation neural network here to classify percussive sounds. An interesting project that uses MFCCs to classify drum and other percussive sounds is William Brent's timbreID – alf. May 5 '11 at 4:52