FBK and MFCC Features

huang_yx005

Article link: https://blog.csdn.net/huang_yx005/article/details/122474081

FBK: filter bank

MFCC: Mel-frequency cepstral coefficients

How MFCC features are computed:

The input is a speech waveform with shape=(14520,).

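1. Pre-emphasis:

Why pre-emphasize: the high-frequency part of speech is attenuated more strongly than the low-frequency part as it travels through air, so we boost the high frequencies to compensate for that loss.

How to pre-emphasize: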

pre_emphasis_coeff = 0.95
y(n) = x(n) - pre_emphasis_coeff * x(n-1)  # first-order high-pass filter; y is the pre-emphasized signal

After pre-emphasis, shape=(14520,)
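The filter above can be sketched in NumPy as follows. This is a minimal illustration, not the author's code: the function name `pre_emphasize`, the choice to pass the first sample through unchanged, and the random stand-in signal are all assumptions.

```python
import numpy as np

def pre_emphasize(x, coeff=0.95):
    """Return y with y[0] = x[0] and y[n] = x[n] - coeff * x[n-1] for n >= 1."""
    x = np.asarray(x, dtype=np.float64)
    # The first sample has no predecessor, so it is kept as-is (an assumption).
    return np.append(x[0], x[1:] - coeff * x[:-1])

# Hypothetical stand-in for the real recording: random samples of the same length.
rng = np.random.default_rng(0)
signal = rng.standard_normal(14520)
emphasized = pre_emphasize(signal)
print(emphasized.shape)  # (14520,)
```

Writing the result to a new array avoids the ambiguity of updating `x` in place, where an already-modified `x(n-1)` could leak into the next sample.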

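2. Framing:

fs = 8000 # sampling rate (Hz); implied by the sample counts below
frame_len = 25 # length of each frame (ms)
frame_shift = 10 # frame shift (ms)
frame_len_samples = frame_len*fs//1000 # length of each frame (samples) = 200
frame_shift_samples = frame_shift*fs//1000 # frame shift (samples) = 80

After framing, shape=(180,200): the utterance is split into 180 frames of 200 samples each. Consecutive frames overlap by 120 samples, and 1 + (14520 - 200)//80 = 180.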
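One way to do the framing is with an index matrix, sketched below. The helper name `frame_signal` is hypothetical, `fs = 8000` is inferred from the sample counts, and dropping any leftover tail samples (rather than padding) is an assumption that happens to reproduce the 180 frames reported here.

```python
import numpy as np

fs = 8000                               # assumed sampling rate (Hz)
frame_len_samples = 25 * fs // 1000     # 200 samples per frame
frame_shift_samples = 10 * fs // 1000   # 80 samples between frame starts

def frame_signal(x, frame_len, frame_shift):
    """Split a 1-D signal into overlapping frames, dropping leftover tail samples."""
    num_frames = 1 + (len(x) - frame_len) // frame_shift
    # Row i of idx holds the sample indices that belong to frame i.
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(num_frames)[:, None]
    return x[idx]

# A ramp makes it easy to see which samples land in which frame.
signal = np.arange(14520, dtype=np.float64)
frames = frame_signal(signal, frame_len_samples, frame_shift_samples)
print(frames.shape)  # (180, 200)
```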

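3. Windowing

Why window: a window function is smooth and tapers the samples at both ends of each frame toward zero. This reduces the strength of the sidelobes (spectral leakage) after the Fourier transform and gives a cleaner spectrum.

Hamming window:

The window has length 200, the same as the number of samples per frame. It peaks at about 1 in the middle and tapers to 0.08 at both ends (a Hamming window does not reach exactly zero).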

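[Figures: one frame before windowing, and the same frame after windowing]

After windowing, both ends of the frame are attenuated to nearly zero.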

After windowing, shape=(180,200)
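The windowing step is a single broadcast multiplication. The sketch below uses `np.hamming` and an all-ones stand-in for the framed signal (an assumption, chosen so the window shape is visible directly in the result).

```python
import numpy as np

window = np.hamming(200)         # 0.54 - 0.46 * cos(2*pi*n / 199), n = 0..199
frames = np.ones((180, 200))     # hypothetical stand-in for the framed signal

# Broadcasting multiplies every row (frame) by the same 200-point window.
windowed = frames * window

print(windowed.shape)            # (180, 200)
print(window[0], window.max())   # end value is 0.08; the peak is just under 1
```

Because the window length is even, no sample sits exactly at the center, which is why the maximum is slightly below 1.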

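3. (Discrete) Fourier transform

Physical meaning: converts the signal from the time domain to the frequency domain, giving the magnitude and phase at each frequency (that is, at each frequency bin).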

K=512 # length of DFT

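After the Fourier transform, shape=(180,257)

Each 200-sample frame is zero-padded to K = 512 points. The second axis has size 257 = K/2 + 1: for a real-valued input the spectrum is conjugate-symmetric, so only the bins from 0 Hz up to fs/2 need to be kept. Each bin holds a complex number whose modulus, sqrt(real^2 + imag^2), is the magnitude of the sinusoid at that frequency and whose argument, atan2(imag, real), is its phase (the real and imaginary parts taken separately are not the magnitude and phase).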
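To make the magnitude and phase concrete, here is a small sketch using `np.fft.rfft`. The 1000 Hz test tone and `fs = 8000` are assumptions for illustration; the original does not say which FFT routine it used.

```python
import numpy as np

fs = 8000   # assumed sampling rate (Hz)
K = 512     # DFT length; each 200-sample frame is zero-padded to K points

# Hypothetical test frame: a pure 1000 Hz tone, 200 samples long.
n = np.arange(200)
frame = np.sin(2 * np.pi * 1000 * n / fs)

spectrum = np.fft.rfft(frame, n=K)         # complex; keeps only bins 0..K/2
magnitude = np.abs(spectrum)               # sqrt(real**2 + imag**2)
phase = np.angle(spectrum)                 # atan2(imag, real)
bin_freqs = np.fft.rfftfreq(K, d=1 / fs)   # frequency in Hz of each bin

print(spectrum.shape)                      # (257,)
print(bin_freqs[np.argmax(magnitude)])     # 1000.0

# The same call handles all frames at once along axis 1.
frames = np.zeros((180, 200))
print(np.fft.rfft(frames, n=K, axis=1).shape)  # (180, 257)
```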

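4. Compute the power spectrum

This gives the energy at each frequency (each frequency bin).

How: take the sum of the squares of the real and imaginary parts of each complex value (the squared magnitude) and divide by K.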

power_spec = np.absolute(freq_domain_data) ** 2 * (1/K) # power spectrum

After computing the power spectrum, shape=(180,257)
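The verbal description (sum of squared real and imaginary parts) and the one-line `np.absolute(...) ** 2` expression are the same computation, as this short check suggests. The random frames are a stand-in for real data.

```python
import numpy as np

K = 512
rng = np.random.default_rng(0)
frames = rng.standard_normal((180, 200))   # hypothetical windowed frames

freq_domain_data = np.fft.rfft(frames, n=K, axis=1)

# "Sum of squares of the real and imaginary parts", scaled by 1/K.
power_spec = (freq_domain_data.real ** 2 + freq_domain_data.imag ** 2) / K

# Equivalent form using the complex modulus.
power_spec_abs = np.absolute(freq_domain_data) ** 2 / K

print(power_spec.shape)                          # (180, 257)
print(np.allclose(power_spec, power_spec_abs))   # True
```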

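5. Mel filtering (Mel filter bank)

Why: the human ear is not equally sensitive to the low- and high-frequency parts of speech; it resolves differences at low frequencies much better than at high frequencies.

Mel frequency and ordinary frequency in Hz are in one-to-one correspondence, computed as follows: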

""" 3. Apply the mel filterbank to the power spectrum, sum the energy in each filter.
    The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. 
    Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. 
    Incorporating this scale makes our features match more closely what humans hear.
    The formula for converting from frequency to Mel scale is:
        M(f) = 2595*log10(1+f/700)
    And formula for converting from Mel scale to frequency is:
        F(m) = 700*(10**(m/2595)-1)
"""
low_frequency = 20 # We don't start from 0 Hz because the human ear cannot perceive very low frequencies.
high_frequency = fs//2 # If the speech is sampled at fs Hz, the upper frequency is limited to fs/2 Hz (Nyquist). = 4000
low_frequency_mel = 2595 * np.log10(1 + low_frequency / 700) # =31.74
high_frequency_mel = 2595 * np.log10(1 + high_frequency / 700) # = 2146.06
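The two conversion formulas can be wrapped as a pair of helpers. The names `hz_to_mel` and `mel_to_hz` are hypothetical, and the 42 edge points (40 bands plus two outer edges) anticipate a 40-filter bank; this is a sketch of the idea rather than the author's code.

```python
import numpy as np

def hz_to_mel(f):
    """M(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    """F(m) = 700 * (10 ** (m / 2595) - 1), the inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

# Points equally spaced in mel between 20 Hz and 4000 Hz, mapped back to Hz.
mel_points = np.linspace(hz_to_mel(20), hz_to_mel(4000), 40 + 2)
hz_points = mel_to_hz(mel_points)

# Equal steps in mel become wider and wider steps in Hz.
print(np.diff(hz_points)[:3])    # narrow spacing at low frequencies
print(np.diff(hz_points)[-3:])   # wide spacing at high frequencies
```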

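Divide the frequency range [low_frequency, high_frequency] into 40 frequency bands that are equally spaced on the mel scale (so they are narrow at low frequencies and wide at high frequencies); we want the energy in each band.

How: construct a bank of 40 filters, one per band, each a vector of 257 weights over the frequency bins. Multiply each filter by the power spectrum and sum, which amounts to a matrix product of the (180, 257) power spectrum with the transposed (40, 257) filter matrix.

After mel filtering, shape=(180, 40)

The 40 on the second axis corresponds to the 40 frequency bands.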
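A minimal sketch of building and applying such a filter bank follows. The post only says "a set of filters"; the triangular shape, the overlap between neighboring bands, the function name `mel_filterbank`, and the random power spectrum are all assumptions, chosen because triangular filters are the common convention.

```python
import numpy as np

def mel_filterbank(num_filters=40, K=512, fs=8000, f_low=20.0, f_high=4000.0):
    """Triangular filters with edges equally spaced on the mel scale.

    Returns an array of shape (num_filters, K // 2 + 1).
    """
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # num_filters + 2 edge points: filter i spans [left, center, right].
    edges_hz = inv_mel(np.linspace(mel(f_low), mel(f_high), num_filters + 2))
    bin_freqs = np.arange(K // 2 + 1) * fs / K   # frequency of each DFT bin
    fbank = np.zeros((num_filters, K // 2 + 1))
    for i in range(num_filters):
        left, center, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - left) / (center - left)      # 0 at left, 1 at center
        falling = (right - bin_freqs) / (right - center)   # 1 at center, 0 at right
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

fbank = mel_filterbank()
rng = np.random.default_rng(0)
power_spec = rng.random((180, 257))   # hypothetical power spectrum
fbank_energy = power_spec @ fbank.T   # weighted sum of bin energies per band
print(fbank.shape, fbank_energy.shape)  # (40, 257) (180, 40)
```

Evaluating the triangles at the exact bin frequencies, rather than rounding the edges to the nearest bin, is a design choice of this sketch; other implementations round first and can differ slightly at low frequencies.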

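6. Take the log

Why: this rescales the vertical axis. The log compresses large values and stretches small ones, so energy differences at low energy levels are magnified (picture the curve of the log function).

After taking the log, shape=(180,40)

At this point we have the FBK features.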
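A small sketch of the log step. Flooring the energies at machine epsilon before the log is an assumption added here to avoid `log(0)`; the post does not say how (or whether) zeros were handled, and some implementations use decibels (`10 * log10`) instead of the natural log.

```python
import numpy as np

# Hypothetical band energies spanning six orders of magnitude, including a zero.
fbank_energy = np.array([[0.0, 1e-3, 1.0, 1e3]])

eps = np.finfo(np.float64).eps            # smallest safe floor (an assumption)
log_fbank = np.log(np.maximum(fbank_energy, eps))

# A 1000x ratio becomes the same additive step at low and at high energy.
low_step = log_fbank[0, 2] - log_fbank[0, 1]    # log(1) - log(1e-3)
high_step = log_fbank[0, 3] - log_fbank[0, 2]   # log(1e3) - log(1)
print(np.isclose(low_step, high_step))  # True
```

On a linear axis the step from 0.001 to 1 is nearly invisible next to the step from 1 to 1000; after the log they carry equal weight, which is the "magnifying" effect described above.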

7. Discrete cosine transform (DCT)

Usually only the first 12 to 20 coefficients of the DCT output are kept.

from scipy.fft import dct
num_ceps = 12 # number of MFCC coefficients to keep; coefficients 2-13 (indices 1..12) are the usual choice.
# The 0th coefficient is skipped, and higher coefficients are dropped because they represent rapid changes in the filter bank coefficients and are not helpful for speech models.
mfcc = dct(log_fbank, type=2, axis=1, norm="ortho")[:, 1 : (num_ceps + 1)]

After the DCT, shape=(180, 12)

At this point we have the MFCC features.
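To show what the DCT call computes, the sketch below builds the orthonormal DCT-II basis by hand in NumPy. It is intended to match `dct(..., type=2, norm="ortho")` but is an illustrative reimplementation, not the library routine; the helper name `dct2_ortho_basis` and the random `log_fbank` are assumptions.

```python
import numpy as np

def dct2_ortho_basis(N):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector of length N."""
    n = np.arange(N)
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    basis[0] *= np.sqrt(1.0 / N)     # scaling for the constant (k = 0) row
    basis[1:] *= np.sqrt(2.0 / N)    # scaling for all other rows
    return basis

rng = np.random.default_rng(0)
log_fbank = rng.standard_normal((180, 40))   # hypothetical FBK features

basis = dct2_ortho_basis(40)
cepstra = log_fbank @ basis.T                # DCT of each frame along axis 1
num_ceps = 12
mfcc = cepstra[:, 1:num_ceps + 1]            # skip coefficient 0, keep the next 12
print(mfcc.shape)  # (180, 12)
```

Coefficient 0 is proportional to the sum of the log band energies in a frame, so it mostly tracks overall loudness; that is one common reason for leaving it out, as the slice here does.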

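Summary:

1. MFCC features are computed on top of FBK features.

2. MFCC features have lower dimensionality than FBK features. FBK features are typically 40-dimensional and MFCC features 13-dimensional (the example above keeps 12).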
