DCT and MFCC

wodownload2

Article link: https://blog.csdn.net/wodownload2/article/details/112241382

https://blog.csdn.net/joycesunny/article/details/85218271
https://zhuanlan.zhihu.com/p/85299446
https://penghuailiang.gitee.io/blog/2020/mouth/
https://www.yisu.com/zixun/165691.html
https://www.codenong.com/cs106864196/
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
https://zhuanlan.zhihu.com/p/43102193
https://www.cnblogs.com/xingshansi/p/6815217.html
https://www.cnblogs.com/xingshansi/p/6621914.html
https://blog.csdn.net/xmdxcsj/article/details/51228791
https://zhuanlan.zhihu.com/p/110766305
http://fourier.eng.hmc.edu/e161/lectures/dct/node1.html
https://blog.csdn.net/jojozhangju/article/details/22730663 (this one is fairly detailed)
https://github.com/weedwind/MFCC.git

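The DFT formula

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$$

Expanding the exponential with Euler's formula:

$$X[k] = \sum_{n=0}^{N-1} x[n] \left( \cos\frac{2\pi k n}{N} - j \sin\frac{2\pi k n}{N} \right)$$

Real part:

$$\mathrm{Re}[k] = \sum_{n=0}^{N-1} x[n] \cos\frac{2\pi k n}{N}$$

Imaginary part:

$$\mathrm{Im}[k] = -\sum_{n=0}^{N-1} x[n] \sin\frac{2\pi k n}{N}$$

Let $t = 2\pi n / N$. The real part becomes:

$$\mathrm{Re}[k] = \sum_{n=0}^{N-1} x[n] \cos(kt)$$

and the coefficient of the imaginary part becomes:

$$\mathrm{Im}[k] = -\sum_{n=0}^{N-1} x[n] \sin(kt)$$

cos is an even function and sin is an odd function, so:

$$\mathrm{Re}[-k] = \mathrm{Re}[k], \qquad \mathrm{Im}[-k] = -\mathrm{Im}[k]$$

Key observation:
when x[n] is a real signal, the real part of its spectrum is an even function of k and the imaginary part is an odd function of k.

Now suppose the original signal x[n] is both real and even. Then x[n]sin(kt) is an odd function of n, and an odd function sums to zero over a symmetric range:

$$\mathrm{Im}[k] = -\sum x[n] \sin(kt) = 0$$

So when the time-domain signal is real and even, the DFT can be written as:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\frac{2\pi k n}{N}$$

This is the core idea behind the DCT.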
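The claim that a real, even input leaves no imaginary part can be checked numerically. The sketch below is only an illustration, not part of the original notes: `naive_dft` and `max_abs_imag` are hypothetical helper names, the DFT is the direct O(N^2) sum, and "even" is taken in the periodic sense x[n] = x[N-n].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N).
std::vector<std::complex<double>> naive_dft(const std::vector<double>& x) {
    assert(!x.empty());
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = 2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            acc += x[n] * std::complex<double>(std::cos(angle), -std::sin(angle));
        }
        X[k] = acc;
    }
    return X;
}

// Largest |Im X[k]| over all bins; close to zero when x is real and even.
double max_abs_imag(const std::vector<double>& x) {
    double worst = 0.0;
    for (const std::complex<double>& v : naive_dft(x))
        worst = std::max(worst, std::fabs(v.imag()));
    return worst;
}
```

Calling `max_abs_imag` on a periodically even sequence such as {1, 2, 3, 4, 3, 2} should give a value at rounding-error level, while a sequence without that symmetry, such as {1, 2, 3, 4, 5, 6}, gives a clearly nonzero result.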

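The DCT formula in common use:

$$F(u) = c(u) \sum_{x=0}^{N-1} f(x) \cos\frac{(2x+1)u\pi}{2N}$$

where $c(u) = \sqrt{1/N}$ when u = 0,
and $c(u) = \sqrt{2/N}$ otherwise.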
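As a hedged illustration of the common DCT form (a cosine sum over (2x+1)uπ/(2N), scaled by sqrt(1/N) for u = 0 and sqrt(2/N) otherwise), here is a direct O(N^2) sketch. The names `dct2_ortho` and `energy` are hypothetical, and a production implementation would normally use an FFT-based routine instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Direct evaluation of F(u) = c(u) * sum_x f(x) * cos((2x+1)*u*pi/(2N)),
// with c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u > 0.
std::vector<double> dct2_ortho(const std::vector<double>& f) {
    assert(!f.empty());
    const double pi = std::acos(-1.0);
    const std::size_t N = f.size();
    const double Nd = static_cast<double>(N);
    std::vector<double> F(N, 0.0);
    for (std::size_t u = 0; u < N; ++u) {
        double acc = 0.0;
        for (std::size_t x = 0; x < N; ++x)
            acc += f[x] * std::cos((2.0 * static_cast<double>(x) + 1.0) * static_cast<double>(u) * pi / (2.0 * Nd));
        const double c = (u == 0) ? std::sqrt(1.0 / Nd) : std::sqrt(2.0 / Nd);
        F[u] = c * acc;
    }
    return F;
}

// Sum of squares, used to check that this scaling preserves energy.
double energy(const std::vector<double>& v) {
    double e = 0.0;
    for (double s : v) e += s * s;
    return e;
}
```

With this scaling a constant input puts all of its energy into F(0), and the sum of squares of the output equals that of the input, which is a convenient sanity check.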

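The DCT is indeed a special case of the DFT; what makes it special is that its input is a real, even signal. In practice, though, signals are rarely conveniently real and even, so to make the transform widely applicable we construct a real even signal from an arbitrary real one.

Take a real discrete signal of length N, {x[0], x[1], …, x[N-1]}. First double its length to 2N by defining a new signal x'[m]:

$$x'[m] = \begin{cases} x[m], & 0 \le m \le N-1 \\ x[-m-1], & -N \le m \le -1 \end{cases}$$

For example, m = -2 falls under the second case: x'[-2] = x[-(-2)-1] = x[2-1] = x[1].

[Figure: the original signal (blue) and its mirrored extension (red)]

Blue is the original signal and red is the extension; together they turn a real signal into a real even signal.
Applying the DFT to the extended signal changes the index range from [0, N-1] to [-N, N-1], so the DFT formula becomes:

$$X[k] = \sum_{m=-N}^{N-1} x'[m]\, e^{-j 2\pi m k / (2N)}$$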
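The text stops at the DFT of the extended signal. As a hedged numerical check of where that leads, the sketch below builds the mirrored extension and evaluates its DFT over m = -N..N-1 with period 2N. The extended signal is symmetric about m = -1/2 rather than m = 0, so the sketch removes a half-sample phase factor before comparing with a pure cosine sum; that step and all function names here are assumptions added for illustration, not taken from the original notes.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Mirror extension: x'[m] = x[m] for 0 <= m <= N-1, x'[m] = x[-m-1] for -N <= m <= -1.
// Element i of the result holds x'[i - N].
std::vector<double> mirror_extend(const std::vector<double>& x) {
    assert(!x.empty());
    const std::size_t N = x.size();
    std::vector<double> ext(2 * N);
    for (std::size_t n = 0; n < N; ++n) {
        ext[N + n] = x[n];      // m = n       (0 <= m <= N-1)
        ext[N - 1 - n] = x[n];  // m = -n - 1  (-N <= m <= -1)
    }
    return ext;
}

// DFT of the extended signal at bin k: sum over m = -N..N-1 of x'[m] * exp(-j*2*pi*m*k/(2N)).
std::complex<double> extended_dft_bin(const std::vector<double>& x, int k) {
    const std::vector<double> ext = mirror_extend(x);
    const int N = static_cast<int>(x.size());
    const double pi = std::acos(-1.0);
    std::complex<double> acc(0.0, 0.0);
    for (int m = -N; m < N; ++m) {
        const double angle = 2.0 * pi * m * k / (2.0 * N);
        acc += ext[static_cast<std::size_t>(m + N)] * std::complex<double>(std::cos(angle), -std::sin(angle));
    }
    return acc;
}

// The same bin multiplied by exp(-j*pi*k/(2N)), which removes the half-sample offset.
std::complex<double> half_sample_corrected(const std::vector<double>& x, int k) {
    const double pi = std::acos(-1.0);
    const double phase = -pi * k / (2.0 * static_cast<double>(x.size()));
    return extended_dft_bin(x, k) * std::complex<double>(std::cos(phase), std::sin(phase));
}

// Cosine sum 2 * sum_n x[n] * cos((2n+1)*k*pi/(2N)).
double cosine_sum(const std::vector<double>& x, int k) {
    const double pi = std::acos(-1.0);
    const double N = static_cast<double>(x.size());
    double acc = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n)
        acc += x[n] * std::cos((2.0 * static_cast<double>(n) + 1.0) * k * pi / (2.0 * N));
    return 2.0 * acc;
}
```

For any real input, `half_sample_corrected(x, k)` and `cosine_sum(x, k)` should agree up to rounding error, which is how the mirrored DFT turns into the cosine-only form with the (2n+1) term.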

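Mel frequency analysis
Human hearing concentrates on certain regions of the spectrum rather than on the whole of it.
The ear behaves like a bank of filters, each attending to particular frequency components. These filters are not uniformly spaced along the frequency axis: there are many of them in the low-frequency region,
where they are densely packed, and fewer in the high-frequency region, where they are sparse.

MFCC stands for Mel-Frequency Cepstral Coefficients. MFCC feature extraction has two key steps: conversion to the mel frequency scale, followed by cepstral analysis.

Relation between the mel scale and frequency:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
So points that are uniformly spaced on the mel scale lie at frequencies whose spacing grows wider and wider.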
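The widening spacing can be seen with a few lines of code. This sketch uses the constants 2595 and 700 that also appear in the C++ program later in this post; the function names `hz_to_mel`, `mel_to_hz` and `mel_spaced_hz` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Frequency in Hz to mel: m = 2595 * log10(1 + f/700).
double hz_to_mel(double hz) {
    assert(hz >= 0.0);
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// Inverse mapping: f = 700 * (10^(m/2595) - 1).
double mel_to_hz(double mel) {
    return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}

// Frequencies (Hz) of `count` points spaced uniformly on the mel scale between low_hz and high_hz.
std::vector<double> mel_spaced_hz(double low_hz, double high_hz, std::size_t count) {
    assert(count >= 2 && low_hz < high_hz);
    const double mel_low = hz_to_mel(low_hz);
    const double mel_high = hz_to_mel(high_hz);
    std::vector<double> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const double mel = mel_low + (mel_high - mel_low) * static_cast<double>(i) / static_cast<double>(count - 1);
        out[i] = mel_to_hz(mel);
    }
    return out;
}
```

For example, `mel_spaced_hz(0.0, 4000.0, 5)` returns five frequencies whose gaps in Hz increase from one pair to the next, even though the points are equally spaced in mels.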


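Cepstral analysis
In the spectrum of a speech signal, the peaks mark the dominant frequency components of the speech and are called formants.
Formants carry the identifying properties of a sound.
In speech recognition we need to extract the positions of the formants and the way they transition from one to the next.
That transition is a smooth curve connecting the formant peaks, called the spectral envelope.

The original spectrum can be viewed as two parts, the envelope and the spectral detail; if we separate the two, we obtain the envelope.
That is, starting from log X[k], we look for log H[k] and log E[k] such that:
log X[k] = log H[k] + log E[k]

This is where cepstral analysis comes in.
The cepstrum is the spectrum obtained by taking the Fourier transform of a signal, applying the logarithm, and then taking the inverse Fourier transform.
Cepstral analysis can be used to decompose a signal: it turns the convolution of two signals into the sum of two signals.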
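The convolution-to-sum property can be checked on a toy example. The sketch below is an illustration under stated assumptions: it computes the real cepstrum (log magnitude only, phase discarded) with a direct O(N^2) DFT, it uses circular convolution so that the DFT product relation holds exactly, and the names `real_cepstrum` and `circular_convolve` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Real cepstrum: inverse DFT of log|DFT(x)|. Every DFT bin must be nonzero.
std::vector<double> real_cepstrum(const std::vector<double>& x) {
    assert(!x.empty());
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<double> log_mag(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = 2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            acc += x[n] * std::complex<double>(std::cos(angle), -std::sin(angle));
        }
        assert(std::abs(acc) > 0.0);  // the logarithm is undefined for an empty bin
        log_mag[k] = std::log(std::abs(acc));
    }
    // log|X[k]| is real and even in k for real x, so the inverse DFT reduces to a cosine sum.
    std::vector<double> c(N);
    for (std::size_t n = 0; n < N; ++n) {
        double sum = 0.0;
        for (std::size_t k = 0; k < N; ++k)
            sum += log_mag[k] * std::cos(2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N));
        c[n] = sum / static_cast<double>(N);
    }
    return c;
}

// Circular convolution y[n] = sum_m a[m] * b[(n - m) mod N].
std::vector<double> circular_convolve(const std::vector<double>& a, const std::vector<double>& b) {
    assert(!a.empty() && a.size() == b.size());
    const std::size_t N = a.size();
    std::vector<double> y(N, 0.0);
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t m = 0; m < N; ++m)
            y[n] += a[m] * b[(n + N - m) % N];
    return y;
}
```

Taking two short sequences, the cepstrum of their circular convolution should equal the element-wise sum of their individual cepstra up to rounding error, which is the log X[k] = log H[k] + log E[k] relation carried back to the cepstral domain.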


https://www.cnblogs.com/helloforworld/p/5283641.html
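DCT-based image compression works by reducing the high-frequency components of an image. High frequencies mainly correspond to fine detail, to which the human eye is not very sensitive, so that information can be discarded.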

http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

Mel Frequency Cepstral Coefficient (MFCC) tutorial
The first step in any automatic speech recognition system is to extract features, i.e. identify the components of the audio signal that are good for identifying the linguistic content and discard all the other
stuff which carries information like background noise, emotion etc.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including tongue, teeth etc.
This shape determines what sound comes out.
If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced.
The shape of the vocal tract manifests itself in the envelope of the short time power spectrum,
and the job of MFCCs is to accurately represent this envelope.
This page will provide a short tutorial on MFCCs.

Why do we do these things?
We will now go a little more slowly through the steps and explain why each of the steps is necessary.

An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn’t change much (when we say it doesn’t change, we mean statistically i.e. statistically stationary, obviously the samples are constantly changing on even short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don’t have enough samples to get a reliable spectral estimate, if it is longer the signal changes too much throughout the frame.
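In short, a frame must be neither too short nor too long: too short and there are not enough samples, too long and the signal changes too much within the frame.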
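Framing can be sketched in a few lines. The 25 ms frame length and 10 ms step used in the note below mirror the FrmLen and FrmSpace defaults of the C++ program later in this post; the 16 kHz sampling rate and the names `ms_to_samples` and `frame_signal` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert a duration in milliseconds to a whole number of samples.
std::size_t ms_to_samples(double ms, double sample_rate_hz) {
    assert(ms > 0.0 && sample_rate_hz > 0.0);
    return static_cast<std::size_t>(std::lround(ms * sample_rate_hz / 1000.0));
}

// Split `samples` into overlapping frames of frame_len samples, advancing by hop samples.
// A trailing partial frame is dropped.
std::vector<std::vector<double>> frame_signal(const std::vector<double>& samples,
                                              std::size_t frame_len, std::size_t hop) {
    assert(frame_len > 0 && hop > 0);
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + frame_len <= samples.size(); start += hop) {
        const auto first = samples.begin() + static_cast<std::ptrdiff_t>(start);
        frames.emplace_back(first, first + static_cast<std::ptrdiff_t>(frame_len));
    }
    return frames;
}
```

At 16 kHz a 25 ms frame is 400 samples and a 10 ms step is 160 samples, so one second of audio yields 98 full frames.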

The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear) which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.

The periodogram spectral estimate still contains a lot of information not required for Automatic Speech Recognition (ASR). In particular the cochlea cannot discern the difference between two closely spaced frequencies. This effect becomes more pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions. This is performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations. We are only interested in roughly how much energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and how wide to make them. See below for how to calculate the spacing.

Once we have the filterbank energies, we take the logarithm of them. This is also motivated by human hearing: we don’t hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear. Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.

The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons this is performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies which means diagonal covariance matrices can be used to model the features in e.g. a HMM classifier. But notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.

What is the Mel scale?
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear.

The formula for converting from frequency to Mel scale is:

$$M(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

To go from Mels back to frequency:

$$M^{-1}(m) = 700\left(10^{m/2595} - 1\right)$$

Once this is performed we are left with 26 numbers that give us an indication of how much energy was in each filterbank.

[Figure: Plot of Mel Filterbank and windowed power spectrum]
4. Take the log of each of the 26 energies from step 3. This leaves us with 26 log filterbank energies.

5. Take the Discrete Cosine Transform (DCT) of the 26 log filterbank energies to give 26 cepstral coefficients. For ASR, only the lower 12-13 of the 26 coefficients are kept.

The resulting features (12 numbers for each frame) are called Mel Frequency Cepstral Coefficients.

Computing the Mel filterbank
In this section the example will use 10 filterbanks because it is easier to display; in reality you would use 26-40 filterbanks.

To get the filterbanks shown in figure 1(a) we first have to choose a lower and upper frequency. Good values are 300Hz for the lower and 8000Hz for the upper frequency. Of course if the speech is sampled at 8000Hz our upper frequency is limited to 4000Hz. Then follow these steps:

[Image not available: the step-by-step filterbank computation]
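Because the steps themselves did not survive in this post, the following is only a hedged sketch of one common way to lay out the filters, not a transcription of the tutorial: space num_filters + 2 points uniformly on the mel scale between the lower and upper frequency, convert them back to Hz, and map each to an FFT bin. Rounding to the nearest bin follows the CreateFilt function in the C++ program below; the 2595·log10 form of the mel formula, the 16 kHz rate and 512-point FFT in the usage note, and the names `mel_bin_points` and `triangle_weight` are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double hz_to_mel(double hz) { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// FFT bin indices of num_filters + 2 points spaced uniformly on the mel scale.
// Filter m (0-based) starts at bins[m], peaks at bins[m+1] and ends at bins[m+2].
std::vector<std::size_t> mel_bin_points(double low_hz, double high_hz, std::size_t num_filters,
                                        std::size_t fft_len, double sample_rate_hz) {
    assert(num_filters > 0 && low_hz >= 0.0 && low_hz < high_hz && high_hz <= sample_rate_hz / 2.0);
    const double mel_low = hz_to_mel(low_hz);
    const double mel_high = hz_to_mel(high_hz);
    std::vector<std::size_t> bins(num_filters + 2);
    for (std::size_t i = 0; i < bins.size(); ++i) {
        const double mel = mel_low + (mel_high - mel_low) * static_cast<double>(i)
                                         / static_cast<double>(num_filters + 1);
        bins[i] = static_cast<std::size_t>(
            std::lround(static_cast<double>(fft_len) * mel_to_hz(mel) / sample_rate_hz));
    }
    return bins;
}

// Weight of FFT bin k in triangular filter m: rises linearly to 1 at the centre, then falls to 0.
double triangle_weight(const std::vector<std::size_t>& bins, std::size_t m, std::size_t k) {
    assert(m + 2 < bins.size());
    const double left = static_cast<double>(bins[m]);
    const double center = static_cast<double>(bins[m + 1]);
    const double right = static_cast<double>(bins[m + 2]);
    const double kk = static_cast<double>(k);
    if (kk < left || kk > right) return 0.0;
    if (kk <= center) return center > left ? (kk - left) / (center - left) : 1.0;
    return (right - kk) / (right - center);
}
```

With the values from the text (10 filters, 300 Hz to 8000 Hz) and the assumed 16 kHz rate and 512-point FFT, `mel_bin_points(300.0, 8000.0, 10, 512, 16000.0)` returns 12 bin indices; adjacent filters share edges, so each filter overlaps its neighbours.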

https://github.com/jameslyons/python_speech_features.git


// This program generates MFCC features, not including delta terms
// Users can specify frame length and space, FFT points, number of filters, number of cepstrums, low and high frequency limits
// Users can also specify whether to have log energy term in the output
// Sampling rate is obtained from the wave file

// INPUT: a windows wave file
// OUTPUT: cepstrum.txt: stores output cepstrums, each row is a feature vector
//         weights.txt: stores filterbank weights, each row is a channel

// Created by Xiaoyu Liu

#include <iostream>
#include <fstream>
#include <cstdio>    // FILE, fopen, fread, fseek, fclose
#include <cmath>
#include <vector>
#include <complex>
#include <bitset>



using namespace std;

typedef struct     // WAV stores wave file header
{
        int rId;    
        int rLen;   
        int wId;    
        int fId;    

        int fLen;   

        short wFormatTag;       
        short nChannels;        
        int nSamplesPerSec;   // Sampling rate. This returns to FS variable
        int nAvgBytesPerSec;  // All other parameters are not used in processing
        short nBlockAlign;      
        short wBitsPerSample;   
        int dId;              
        int wSampleLength;    
}WAV;


// Global variables

int FS=8;                    // default sampling rate in KHz, actual value will be obtained from wave file
int HIGH=4;                   // default high frequency limit in KHz
int LOW=0;                    // default low frequency limit in KHz
int FrmLen=25;             // frame length in ms
int FrmSpace=10;           // frame space in ms
const unsigned long FFTLen=512;           // FFT points
const double PI=3.1415926536;
const int FiltNum=26;              // number of filters
const int PCEP=12;                 // number of cepstrum
vector <double> Hamming;            // Hamming window vector
float FiltWeight[FiltNum][FFTLen/2+1]; //This is the Mel filterbank weights
const int LOGENERGY=1;             // whether to include log energy in the output
vector<float> Coeff; // This stores cepstrum and log energy



// function declarations
void InitHamming();
void HammingWindow(short* buf,float* data);
float FrmEnergy(short* data);
void zero_fft(float *buffer,vector<complex<float> >& vec);
void FFT(const unsigned long & fftlen, vector<complex<float> >& vec);
void InitFilt(float (*w)[FFTLen/2+1], int num_filt); 
void CreateFilt(float (*w)[FFTLen/2+1], int num_filt, int Fs, int high, int low);
void mag_square(vector<complex<float> > & vec, vector<float> & vec_mag);
void Mel_EN(float (*w)[FFTLen/2+1],int num_filt, vector<float>& vec_mag, float * M_energy);
void Cepstrum(float *M_energy);





int main()
{
	WAV header;    // This struct stores wave file header
	FILE *sourcefile;
	ofstream outfile1("cepstrum.txt");     // This file stores output cepstrum
	ofstream outfile2("weights.txt");  // This file stores filter weights
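	sourcefile=fopen("1.wav","rb");  // open the wave file as a binary file
	if (sourcefile==NULL)            // stop if the wave file cannot be opened
	{
		cerr<<"Cannot open 1.wav"<<endl;
		return 1;
	}
	if (fread(&header,sizeof(WAV),1,sourcefile)!=1)   // read in the header, stop if it is incomplete
	{
		cerr<<"Cannot read the wave header"<<endl;
		fclose(sourcefile);
		return 1;
	}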
	FS=header.nSamplesPerSec/1000;  // Obtain sampling frequency
	if (HIGH>(int) (FS/2))                      // Check pre-defined high frequency
	   HIGH=(int) (FS/2);
	if (LOW>HIGH)                               // Check pre-defined low frequency
	   LOW=(int)(HIGH/2);
	FrmLen=FrmLen*FS;                          // Obtain frame length in samples
	FrmSpace=FrmSpace*FS;                      // Obtain frame space in samples
	
	short buffer[FrmLen];    // buffer stores a frame of data, each 2 byte
	float data[FrmLen];
	float energy=0.0;
	float mel_energy[FiltNum]; // This stores the channel output energy for a frame

	vector<complex<float> > zero_padded;   // zero_padded is a vector which stores the zero padded data and FFT
	vector <float> fft_mag;                // This is the magnitude squared FFT
		
	InitHamming();      //Create a Hamming window of length FrmLen
	InitFilt(FiltWeight, FiltNum); // Initialize filter weights to all zero
	CreateFilt(FiltWeight, FiltNum, FS*1000, HIGH*1000, LOW*1000);    // Compute filter weights
	for (int i=0;i<FiltNum; i++)          // Output filter weights to a file
	  { for (int j=0;j<FFTLen/2+1;j++)
	       outfile2<<FiltWeight[i][j]<<' ';
	   outfile2<<endl;
	  } 
	
    // While loop reads in each frame, and compute cepstrum features
	while(fread(buffer,sizeof(short),FrmLen,sourcefile)==FrmLen)  //  continue to read in a frame of data
	{

		HammingWindow(buffer,data);  // multiply Hamming window to speech, return to data 
		energy=FrmEnergy(buffer);//Get frame energy without windowing
		zero_fft(data,zero_padded); // This step first zero pad data, and do FFT
		mag_square(zero_padded, fft_mag);    // This step does magnitude square for the first half of FFT
        Mel_EN(FiltWeight,FiltNum, fft_mag, mel_energy); // This step computes output log energy of each channel
		Cepstrum(mel_energy);
		if (LOGENERGY)   // whether to include log energy term or not
		   Coeff.push_back((float)log(energy>0.0F ? energy : 1e-10F));   // store the log of the frame energy, floored to avoid log(0)
		   
		zero_padded.clear(); // clear up fft vector
		fft_mag.clear();    // clear up fft magnitude 
		//index++;
		fseek(sourcefile, -(long)((FrmLen-FrmSpace)*sizeof(short)), SEEK_CUR); // step back by the frame overlap in bytes so the next frame starts FrmSpace samples later
	}

	int length=Coeff.size();  // Output cepstrum and log energy to a file. Each row is a feature vector
	for(int i=0;i<length;++i)
	{
		outfile1<<Coeff[i]<<' ';
		if((i+1)%(PCEP+LOGENERGY)==0)
			outfile1<<endl;
	}

    fclose(sourcefile);
	return 0;

}

// This function create a hamming window
void InitHamming()
{
	float two_pi=8.0F*atan(1.0F);   // This is just 2*pi;
	float temp;
	int i;
	for( i=0;i<FrmLen;i++)
	{
		temp=(float)(0.54-0.46*cos((float)i*two_pi/(float)(FrmLen-1)));  // create a Hamming window of length FrmLen
		Hamming.push_back(temp);
    }
}

void HammingWindow(short* buf,float* data)  // This function multiply a Hamming window to a frame
{
	int i;
	for(i=0;i<FrmLen;i++)
	{
		data[i]=buf[i]*Hamming[i];
	}
}

float FrmEnergy(short* data)        // This function computes frame energy
{
	int i;
	float frm_en=0.0;
	for(i=0;i<FrmLen;i++)
	{
		frm_en=frm_en+data[i]*data[i];
	}
	return frm_en;
}


void zero_fft(float *data,vector<complex<float> >& vec) // This function does zero padding and FFT
{	
	for(int i=0;i<FFTLen;i++)     // This step does zero padding
	{
		if(i<FrmLen)
		{
			vec.push_back(complex<float>(data[i]));
		}
		else
		{
			vec.push_back(complex<float>(0));
		}
	}
	FFT(FFTLen, vec);    // Compute FFT
}




// In-place iterative radix-2 FFT; fftlen must be a power of 2
void FFT(const unsigned long & fftlen, vector<complex<float> >& vec) 
{ 		 
	unsigned long ulPower = 0;  
	unsigned long fftlen1 = fftlen - 1; 
	while(fftlen1 > 0) 
	{ 
		ulPower++; 
		fftlen1=fftlen1/2; 
	} 


	// Reorder the input into bit-reversed index order
	bitset<sizeof(unsigned long) * 8> bsIndex;
	unsigned long ulIndex; 
	unsigned long ulK; 
	for(unsigned long p = 0; p < fftlen; p++) 
	{ 
		ulIndex = 0; 
		ulK = 1; 
		bsIndex = bitset<sizeof(unsigned long) * 8>(p); 
		for(unsigned long j = 0; j < ulPower; j++) 
			{ 
				ulIndex += bsIndex.test(ulPower - j - 1) ? ulK : 0; 
				ulK *= 2; 
			} 

		if(ulIndex > p) 
			{ 
				complex<float> c = vec[p]; 
				vec[p] = vec[ulIndex]; 
				vec[ulIndex] = c; 
			} 
	} 


	// Precompute the twiddle factors exp(-j*2*pi*i/fftlen)
	vector<complex<float> > vecW; 
	for(unsigned long i = 0; i < fftlen / 2; i++) 
		{ 
			vecW.push_back(complex<float>(cos(2 * i * PI / fftlen) , -1 * sin(2 * i * PI / fftlen))); 
		} 



	// Butterfly stages: the group length doubles at every stage
	unsigned long ulGroupLength = 1; 
	unsigned long ulHalfLength = 0;  
	unsigned long ulGroupCount = 0;  
	complex<float> cw; 
	complex<float> c1;  
	complex<float> c2;  
	for(unsigned long b = 0; b < ulPower; b++) 
		{ 
			ulHalfLength = ulGroupLength; 
			ulGroupLength *= 2; 
			for(unsigned long j = 0; j < fftlen; j += ulGroupLength) 
				{ 
					for(unsigned long k = 0; k < ulHalfLength; k++) 
						{ 
							cw = vecW[k * fftlen / ulGroupLength] * vec[j + k + ulHalfLength]; 
							c1 = vec[j + k] + cw; 
							c2 = vec[j + k] - cw; 
							vec[j + k] = c1; 
							vec[j + k + ulHalfLength] = c2; 
						} 
				} 
		} 
} 


// This function initialize filter weights to 0

void InitFilt(float (*w)[FFTLen/2+1], int num_filt)
{
  int i,j;
  for (i=0;i<num_filt;i++)
      for (j=0;j<FFTLen/2+1;j++)
	     *(*(w+i)+j)=0.0;
}

// This function creates a Mel weight matrix

void CreateFilt(float (*w)[FFTLen/2+1], int num_filt, int Fs, int high, int low)
{
   float df=(float) Fs/(float) FFTLen;    // FFT interval
   int indexlow=round((float) FFTLen*(float) low/(float) Fs); // FFT index of low freq limit
   int indexhigh=round((float) FFTLen*(float) high/(float) Fs); // FFT index of high freq limit

   float melmax=2595.0*log10(1.0+(float) high/700.0); // mel high frequency
   float melmin=2595.0*log10(1.0+(float) low/700.0);  // mel low frequency
   float melinc=(melmax-melmin)/(float) (num_filt+1); //mel half bandwidth
   float melcenters[num_filt];        // mel center frequencies
   float fcenters[num_filt];          // Hertz center frequencies
   int indexcenter[num_filt];         // FFT index for Hertz centers
   int indexstart[num_filt];   //FFT index for the first sample of each filter
   int indexstop[num_filt];    //FFT index for the last sample of each filter
   float increment,decrement; // increment and decrement of the left and right ramp
   float sum=0.0;
   int i,j;
   for (i=1;i<=num_filt;i++)
   {
	     melcenters[i-1]=(float) i*melinc+melmin;   // compute mel center frequencies
		 fcenters[i-1]=700.0*(pow(10.0,melcenters[i-1]/2595.0)-1.0); // compute Hertz center frequencies
		 indexcenter[i-1]=round(fcenters[i-1]/df); // compute fft index for Hertz centers		 
   }
   for (i=1;i<=num_filt-1;i++)  // Compute the start and end FFT index of each channel
      {
	    indexstart[i]=indexcenter[i-1];
		indexstop[i-1]=indexcenter[i];		
	  }
   indexstart[0]=indexlow;
   indexstop[num_filt-1]=indexhigh;
   for (i=1;i<=num_filt;i++)
   {
      increment=1.0/((float) indexcenter[i-1]-(float) indexstart[i-1]); // left ramp
	  for (j=indexstart[i-1];j<=indexcenter[i-1];j++)
	     w[i-1][j]=((float)j-(float)indexstart[i-1])*increment;
	  decrement=1.0/((float) indexstop[i-1]-(float) indexcenter[i-1]);    // right ramp
	  for (j=indexcenter[i-1];j<=indexstop[i-1];j++)
	     w[i-1][j]=1.0-((float)j-(float)indexcenter[i-1])*decrement;		 
   }

   for (i=1;i<=num_filt;i++)     // Normalize filter weights by sum
   {
       for (j=1;j<=FFTLen/2+1;j++)
	      sum=sum+w[i-1][j-1];
	   for (j=1;j<=FFTLen/2+1;j++)
	      w[i-1][j-1]=w[i-1][j-1]/sum;
	   sum=0.0;
   }
   

      
    
   
}

void mag_square(vector<complex<float> > &vec, vector<float> &vec_mag) // This function computes magnitude squared FFT
{
  int i;
  float temp;
  for (i=1;i<=FFTLen/2+1;i++)
     {
	   temp = vec[i-1].real()*vec[i-1].real()+vec[i-1].imag()*vec[i-1].imag();
	   vec_mag.push_back(temp);
	 }
       	   	   
}

void Mel_EN(float (*w)[FFTLen/2+1],int num_filt, vector<float>& vec_mag, float * M_energy) // computes log energy of each channel
{
   int i,j;
   for (i=1;i<=num_filt;i++)    // set initial energy value to 0
     M_energy[i-1]=0.0F;
   
   for (i=1;i<=num_filt;i++)
   {
     for (j=1;j<=FFTLen/2+1;j++)
         M_energy[i-1]=M_energy[i-1]+w[i-1][j-1]*vec_mag[j-1];
     M_energy[i-1]=(float)(log(M_energy[i-1]));			 
   }

}



// Compute Mel cepstrum

void Cepstrum(float *M_energy)
{
	int i,j;
	float Cep[PCEP];
    for (i=1;i<=PCEP;i++)
	{ Cep[i-1]=0.0F;    // initialize to 0
	  for (j=1;j<=FiltNum;j++)
          Cep[i-1]=Cep[i-1]+M_energy[j-1]*cos(PI*((float) i)/((float) FiltNum)*((float) j-0.5F)); // DCT transform
      Cep[i-1]=sqrt(2.0/float (FiltNum))*Cep[i-1];
	  Coeff.push_back(Cep[i-1]);   // store cepstrum in this vector
    }	
	  
}

https://github.com/weedwind/MFCC.git

https://www.jianshu.com/p/4fed25394c73
https://blog.csdn.net/weiqiwu1986/article/details/46127747
https://www.cnblogs.com/LXP-Never/p/11602510.html

https://blog.csdn.net/qq_44945010/article/details/89416485
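Discrete Fourier transform
After windowing each frame, we need to know how the energy of that frame is distributed across frequency bands.
The tool for extracting band-wise spectral information from a discrete (sampled) signal is the discrete Fourier transform (DFT).
The input to the DFT is a windowed frame x[n]…x[m]; the output is a complex number X[k] for each of N frequency bands, representing the magnitude and phase of that frequency component in the original signal.
The DFT is defined as:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$$
A common algorithm for computing the DFT is the fast Fourier transform (FFT). It is very efficient, but generally requires N to be a power of 2.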
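Since a frame length is rarely a power of 2, a frame is usually zero-padded up to the next power of 2 before the FFT, which is what the zero_fft function in the program above does with its fixed FFTLen of 512. A minimal sketch of choosing that length automatically, with hypothetical helper names `next_pow2` and `zero_pad`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest power of 2 that is greater than or equal to n (n must be positive).
std::size_t next_pow2(std::size_t n) {
    assert(n > 0);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Copy of `frame` extended with zeros up to the next power-of-2 length.
std::vector<double> zero_pad(const std::vector<double>& frame) {
    assert(!frame.empty());
    std::vector<double> padded(frame);
    padded.resize(next_pow2(frame.size()), 0.0);
    return padded;
}
```

A 400-sample frame, for instance, is padded to 512 samples; a frame whose length is already a power of 2 is returned unchanged.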

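Mel filterbank
The FFT output contains the energy of the frame in each frequency band.
However, human hearing is not equally sensitive in all bands: the ear is less sensitive to high frequencies than to low ones, with the dividing line at roughly 1000 Hz. Modelling this property when extracting sound features can improve recognition performance.
In MFCC this is done by mapping the frequencies output by the DFT onto the mel scale.
The mel is a unit of pitch: pairs of sounds that are perceived as equally far apart in pitch are separated by the same number of mels.

Formula relating frequency and the mel scale:

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
When computing MFCCs, the FFT spectrum is passed through a bank of mel filters to obtain the mel spectrum. The mel filterbank is usually a set of triangular filters spaced on the mel scale.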