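Why was a scale like the Mel scale introduced in the first place?
The human ear can still tell speech sounds apart in noisy environments and under many kinds of variation.
The cochlea plays the key role here.
In effect the cochlea acts as a filter bank whose spacing is approximately linear below 1000 Hz and logarithmic above 1 kHz, so the ear resolves low-frequency sounds finely and high-frequency sounds only coarsely.
In other words, the Mel scale was created to model the hearing mechanism of the human ear.
The Mel filter bank was designed on this principle,
so its filters are packed densely at low frequencies and become sparser toward high frequencies.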
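1. Introduction: the Mel scale
- Why the Mel scale is needed:
The Mel scale models how the human ear perceives speech at different frequencies:
Studies show that human perception of frequency is not linear, and that we are more sensitive to changes at low frequencies than at high frequencies. Below about 1 kHz perceived pitch is roughly linear in frequency; above 1 kHz it is roughly logarithmic, so the higher the frequency, the poorer the discrimination.
For example, people can easily tell 500 Hz from 1000 Hz, but find it hard to tell 7500 Hz from 8000 Hz.
- Definition of the Mel (from "melody") scale:
The Mel scale was proposed to address this. It is a nonlinear transformation of Hz such that equal differences on the Mel scale are perceived as roughly equal differences in pitch.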
$$Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
$$f = 700\left(10^{\frac{m}{2595}} - 1\right),$$
where $m = Mel(f)$ is called the Mel frequency.
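Looking at the figure above,
which plots the mapping from Hz to mel: because the relationship is logarithmic,
mel rises quickly with Hz when the frequency is low,
and rises slowly when the frequency is high, where the slope of the curve is small.
This shows that the ear is sensitive to pitch at low frequencies and much less so at high frequencies, and it is this observation that inspired the way the filters of a Mel filter bank are distributed.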
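The difference in sensitivity can be checked numerically. The following minimal sketch (the function name `hz_to_mel` is chosen here for illustration) uses the $2595 \log_{10}$ formula to compare a 500 Hz step at low frequency (500 to 1000 Hz) with the same step at high frequency (7500 to 8000 Hz).

```python
import math

def hz_to_mel(f_hz):
    """HTK-style formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

low_step = hz_to_mel(1000) - hz_to_mel(500)     # 500 Hz apart at low frequency
high_step = hz_to_mel(8000) - hz_to_mel(7500)   # 500 Hz apart at high frequency
print(f"500 -> 1000 Hz: {low_step:.1f} mel")
print(f"7500 -> 8000 Hz: {high_step:.1f} mel")
```

The same 500 Hz gap covers several times more mels at the low end than at the high end, which is the numerical counterpart of the flattening curve.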
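As the figure above shows, the filter bank consists of 40 triangular filters. At low frequencies the filters are dense and have high peaks; at high frequencies they are sparse and have low peaks. This matches the fact that the ear becomes less sensitive as frequency rises. This form is called the equal-area Mel filter bank (Mel-filter bank with same bank area). It is widely used for the human voice (speech recognition, speaker identification), but in non-voice applications it discards a lot of high-frequency information.
As the figure above shows, for non-voice applications the equal-height Mel filter bank (Mel-filter bank with same bank height) is commonly used instead: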
2. Mel frequency versus ordinary frequency
2.1 Converting between Mel and f
The Mel frequency and the physical frequency are related by:
$$F_{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
$$f = 700\left(10^{\frac{m}{2595}} - 1\right).$$
$F_{mel}(f)$, abbreviated $m$, is the perceived frequency in Mel;
$f$ is the physical frequency in Hz.
Note: if the natural logarithm is used instead, the coefficient becomes $2595 / \ln 10 \approx 1127$ (often quoted as 1125).
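A Mel filter bank is a series of triangular filters, typically 40 or 80 of them. Each has a response of 1 at its center frequency and falls off linearly to 0 at the center frequencies of the two neighbouring filters, as shown in the figure below:
The formula can be written as: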
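One common way of writing the triangular response is the equal-height form below. It assumes that $f(m)$ denotes the FFT bin index of the center of the $m$-th filter, with $f(m-1)$ and $f(m+1)$ the centers of its neighbours. This is the standard textbook expression and not necessarily the exact formula shown in the original figure.

```latex
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\[4pt]
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[8pt]
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\[8pt]
0, & k > f(m+1)
\end{cases}
```

Each filter therefore rises linearly from 0 to 1 between $f(m-1)$ and $f(m)$ and falls back to 0 at $f(m+1)$.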
2.2 Generating the Mel spectrogram
Note that "log mel-filter bank outputs" and "FBANK features" refer to the same thing: in essence both are the output of the Mel filter bank.
Finally, the Mel filter bank is applied to the power spectrum:
$$Y_t(m) = \sum_{k=1}^{N} H_m(k)\,|X_t(k)|^2$$
where $k$ is the FFT bin index and $m$ is the index of the Mel filter.
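import numpy as np

# Assumed values, inferred from the printed outputs below
sample_rate = 8000   # Hz
NFFT = 512           # FFT length
# pow_frames: power spectrum of the framed signal, shape (num_frames, NFFT // 2 + 1),
# computed in an earlier step that is not shown in this section

low_freq_mel = 0
high_freq_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
print(low_freq_mel, high_freq_mel)
0 2146.06452750619
nfilt = 40
# nfilt + 2 equally spaced Mel points: one extra point on each side
# so that every filter has a left and a right neighbour
mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2)
hz_points = 700 * (10 ** (mel_points / 2595) - 1)
fbank = np.zeros((nfilt, NFFT // 2 + 1))  # value of each Mel filter at each power-spectrum bin
# (fractional) FFT bin position of each Mel point
bins = (hz_points / (sample_rate / 2)) * (NFFT / 2)
for i in range(1, nfilt + 1):
    left = int(bins[i - 1])
    center = int(bins[i])
    right = int(bins[i + 1])
    for j in range(left, center):     # rising edge
        fbank[i - 1, j + 1] = (j + 1 - bins[i - 1]) / (bins[i] - bins[i - 1])
    for j in range(center, right):    # falling edge
        fbank[i - 1, j + 1] = (bins[i + 1] - (j + 1)) / (bins[i + 1] - bins[i])
print(fbank)
[[0. 0.46952675 0.93905351 … 0. 0. 0. ]
[0. 0. 0. … 0. 0. 0. ]
[0. 0. 0. … 0. 0. 0. ]
…
[0. 0. 0. … 0. 0. 0. ]
[0. 0. 0. … 0. 0. 0. ]
[0. 0. 0. … 0.14650797 0.07325398 0. ]]
filter_banks = np.dot(pow_frames, fbank.T)
filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)  # avoid log(0)
# dB (the factor 20 is kept from the original; 10 * log10 is the usual convention for power)
filter_banks = 20 * np.log10(filter_banks)
print(filter_banks.shape)
(349, 40)
# plot_spectrogram is a plotting helper that is not shown in this section
plot_spectrogram(filter_banks.T, 'Filter Banks')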
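3. Overall idea behind the Mel filter bank design
3.1 Choosing the center frequencies
Within the spectral range of speech, define a set of band-pass filters $H_m(k)$, $1 \le m \le M$, where $M$ is the number of filters (typically $M = 24$).
Each filter has a triangular response with center frequency $f(m)$, $1 \le m \le M$.
On the Mel frequency axis these filters all have equal bandwidth.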
$f_l$ is the lowest frequency covered by the filter bank,
$f_h$ is the highest frequency covered by the filter bank,
$N$ is the length of the DFT (or FFT),
$f_s$ is the sampling frequency,
and the inverse of $F_{mel}$ is:
$$F^{-1}_{mel}(m) = 700\left(10^{\frac{m}{2595}} - 1\right).$$
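3.2 Analysis and design
Take $M = 6$ (six filters) as an example. Once $f_l$, $f_h$, $f_s$, $N$ and $M$ are chosen they stay fixed.
Divide the Mel frequency range from $F_{mel}(f_l)$ to $F_{mel}(f_h)$ into $M + 1 = 7$ equal segments,
which yields $M + 2 = 8$ Mel frequency values.
Convert each of these $M + 2 = 8$ Mel frequencies back to a physical frequency:
$$F^{-1}_{mel}\left(F_{mel}(f_l) + i\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M + 1}\right), \quad 0 \le i \le M + 1 = 7$$
Then find the FFT bin index corresponding to each of these $M + 2 = 8$ physical frequencies, denoted $f(0), f(1), \ldots, f(7)$.
$f(1)$, $f(2)$, $f(3)$, $f(4)$, $f(5)$, $f(6)$ are the FFT bin indices (rounded up) of the center frequencies of filters 1 to 6, while $f(0)$ and $f(7)$ are the outer edges of the first and last filters. With these, the $M = 6$ filters can be designed.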
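- m = 1 (filter 1)
When f(m-1) < k < f(m), i.e. f(0) < k < f(1), draw the left half of the red triangle;
When f(m) < k < f(m+1), i.e. f(1) < k < f(2), draw the right half of the red triangle;
- m = 2 (filter 2)
When f(m-1) < k < f(m), i.e. f(1) < k < f(2), draw the left half of the yellow triangle;
When f(m) < k < f(m+1), i.e. f(2) < k < f(3), draw the right half of the yellow triangle;
- m = 3 (filter 3)
When f(m-1) < k < f(m), i.e. f(2) < k < f(3), draw the left half of the green triangle;
When f(m) < k < f(m+1), i.e. f(3) < k < f(4), draw the right half of the green triangle;
- m = 4 (filter 4)
When f(m-1) < k < f(m), i.e. f(3) < k < f(4), draw the left half of the blue triangle;
When f(m) < k < f(m+1), i.e. f(4) < k < f(5), draw the right half of the blue triangle;
- m = 5 (filter 5)
When f(m-1) < k < f(m), i.e. f(4) < k < f(5), draw the left half of the purple triangle;
When f(m) < k < f(m+1), i.e. f(5) < k < f(6), draw the right half of the purple triangle;
- m = 6 (filter 6)
When f(m-1) < k < f(m), i.e. f(5) < k < f(6), draw the left half of the black triangle;
When f(m) < k < f(m+1), i.e. f(6) < k < f(7), draw the right half of the black triangle.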
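The M = 6 construction above can be sketched in a few lines of Python. The values of `f_l`, `f_h`, `f_s` and `N` below are assumptions picked only for illustration (they are not given in the text), and the helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Assumed example values (not from the text): f_l, f_h, f_s and N
f_l, f_h, f_s, N, M = 0.0, 4000.0, 8000.0, 256, 6

# M + 2 = 8 equally spaced Mel values, converted back to Hz
mel_edges = np.linspace(hz_to_mel(f_l), hz_to_mel(f_h), M + 2)
hz_edges = mel_to_hz(mel_edges)

# FFT bin index of each point, rounded up as in the text: f(0), ..., f(7).
# The result is clipped to N/2 to guard against floating-point overshoot at f_h.
f = np.minimum(np.ceil(hz_edges / f_s * N).astype(int), N // 2)

# Build the M triangular filters over bins 0 .. N/2
H = np.zeros((M, N // 2 + 1))
for m in range(1, M + 1):
    for k in range(f[m - 1], f[m] + 1):        # left half: rises from 0 to 1
        H[m - 1, k] = (k - f[m - 1]) / (f[m] - f[m - 1])
    for k in range(f[m] + 1, f[m + 1] + 1):    # right half: falls from 1 to 0
        H[m - 1, k] = (f[m + 1] - k) / (f[m + 1] - f[m])

print("bin indices f(0..7):", f)
print("filter bank shape:", H.shape)
```

Row `m - 1` of `H` corresponds to the triangle drawn for filter `m` in the list above; its peak sits at bin `f[m]`.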
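4. Concrete steps for generating the Mel filters
The original Mel filter bank is built as follows,
in three main steps:
- Map the lowest and highest frequencies of the signal to Mel frequencies. Given the number of Mel filters K, insert K additional points at equal spacing between the low and high Mel frequencies, giving (K + 2) Mel frequency points $m(i)$ in total.
The spacing is done on the Mel scale because equal steps in Mel approximate equal steps in pitch as perceived by the human ear.
- Map the (K + 2) Mel frequency points from step one back to (K + 2) ordinary frequencies in Hz, $h(i)$.
- Round the (K + 2) frequencies from step two to the nearest FFT frequency bin, $f(i)$. The interior values $f(1), \ldots, f(K)$ are then the center frequencies of the filters, from which the triangular filters are generated.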
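When using the Python library librosa, I found that it offers two ways to convert from Hz to Mel, and that the default is not the familiar formula,
so I took a closer look at the librosa source code related to Mel spectra.
Here we focus on the mathematics; once that is clear, writing the code is straightforward.
Usually a frequency $f$ in Hz and the Mel scale are related by
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
$$f = 700\left(10^{\frac{m}{2595}} - 1\right).$$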
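4.1 Equally spaced Mel frequencies
First compute the equally spaced Mel frequencies used by the filter bank.
Suppose there are 10 Mel filters
(in practice a Mel filter bank often has 26 filters).
First choose a lowest and a highest frequency; common choices are 300 Hz and 8000 Hz.
Converting to Mel gives 401.25 Mel for 300 Hz and 2834.99 Mel for 8000 Hz. These figures correspond to the natural-log variant $m = 1125 \ln(1 + f/700)$; the $2595 \log_{10}$ formula gives about 401.97 and 2840.02 instead.
Each filter spans from the point before its center to the point after it, and adjacent filters overlap, so 10 filters need 12 points: 10 additional points are spaced linearly between 401.25 and 2834.99:
m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 1949.99, 2171.24, 2392.49, 2613.74, 2834.99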
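A minimal sketch of this step, assuming the natural-log variant $m = 1125 \ln(1 + f/700)$, which is the variant that reproduces the listed values to within about 0.01. The function and variable names are illustrative.

```python
import numpy as np

def hz_to_mel_ln(f_hz):
    """Natural-log variant: m = 1125 * ln(1 + f / 700)."""
    return 1125.0 * np.log(1.0 + f_hz / 700.0)

n_filters = 10
low_mel = hz_to_mel_ln(300.0)
high_mel = hz_to_mel_ln(8000.0)

# n_filters + 2 points: the two end points plus 10 interior points
mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
print(np.round(mel_points, 2))
```

`np.linspace` includes both end points, so asking for `n_filters + 2` values yields exactly the 12 points needed for 10 overlapping triangles.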
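4.2 Converting Mel frequencies to Hz
Now convert them back to Hz with the Mel-to-frequency formula:
import numpy

def hz2mel(hz):
    '''Convert a frequency in Hz to a Mel frequency.'''
    return 2595 * numpy.log10(1 + hz / 700.0)

def mel2hz(mel):
    '''Convert a Mel frequency to a frequency in Hz.'''
    return 700 * (10 ** (mel / 2595.0) - 1)

h(i) = 300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 3261.62, 4122.63, 5170.76, 6446.70, 8000
These listed values correspond to the variant $m = 1125 \ln(1 + f/700)$; applying mel2hz above (which uses $2595 \log_{10}$) to the same m(i) gives slightly different numbers, for example about 299.4 Hz for 401.25 Mel.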
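4.3 Mel filter center frequencies
For example,
map each frequency to the nearest DFT bin, where 257 is the number of DFT bins that are kept (half the DFT length plus one, for a 512-point DFT).
The resulting f(i) are the center frequencies of the filters in the Mel filter bank, expressed as DFT bin indices.
The formula and diagram in the figure concern the computation of these center frequencies for a Mel filter bank as used in signal processing, particularly in speech recognition and audio analysis.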
Formula Explanation
The formula is:
$$f(i) = \left\lfloor \frac{(257 + 1) \times h(i)}{8000} \right\rfloor$$
Where:
- $f(i)$: the DFT bin index of the $i$-th point, which for $1 \le i \le K$ is the center of the $i$-th Mel filter.
- $h(i)$: the frequency in Hz of the $i$-th point (the Mel point $m(i)$ converted back to Hz).
- 257: the number of frequency bins that are kept, i.e. half the DFT length plus one (DFT length 512).
- 8000: the Nyquist frequency, which is half of the sampling rate (16,000 Hz is common in speech processing).
- $\left\lfloor \cdot \right\rfloor$: the floor function, which returns the greatest integer less than or equal to the given value.
Note that in this form $h(i) = 8000$ maps to 258, which lies beyond the last valid index (256) of a 257-bin spectrum. A commonly used form is $f(i) = \lfloor (N_{fft} + 1)\, h(i) / f_s \rfloor$, which for $N_{fft} = 512$ and $f_s = 16000$ maps 8000 Hz to bin 256.
Principles
- DFT and Frequency Bins:
  - In signal processing, the Discrete Fourier Transform (DFT) converts a signal from the time domain to the frequency domain. The DFT length (N) determines the resolution in the frequency domain, with each bin representing a specific frequency.
  - The value 257 corresponds to half the DFT length plus one, representing the effective number of bins used in the frequency analysis.
- Mel Scale:
  - The Mel scale is a perceptual scale of pitches judged by listeners to be equidistant. It is designed to approximate the human ear's response to different frequencies, with more resolution at lower frequencies and less at higher frequencies.
  - $m(i)$ is the Mel frequency of the $i$-th point, and $h(i)$ is the same point expressed in Hz.
- Mapping to DFT Bins:
  - The formula converts the frequencies $h(i)$ in Hz to their corresponding bin indices $f(i)$ on the DFT grid. This is crucial for applying the Mel filterbank to the frequency spectrum of an audio signal.
Function
- Frequency Conversion:
  - The formula maps the Mel filter center frequencies to their nearest corresponding DFT bins. This ensures that each filter in the Mel filterbank accurately targets the intended frequency region in the DFT output.
- Application in Mel Filterbank:
  - The result $f(i)$ is used to construct the Mel filterbank, where each filter is a triangular filter centered at $f(i)$, and its shape is determined by the adjacent center frequencies $f(i-1)$ and $f(i+1)$.
  - The diagram below the formula illustrates how the filters $H_1(k)$, $H_2(k)$, etc., are placed across the frequency spectrum, with each filter focusing on a specific range of frequencies.
Diagram Explanation
- The diagram shows a series of triangular filters $H_1(k)$, $H_2(k)$, $H_3(k)$, etc., across the frequency axis.
- $f(1)$, $f(2)$, $f(3)$, … represent the center frequencies of these filters in the linear frequency domain, corresponding to specific DFT bins.
- The height of the triangles corresponds to the filter's response, with the peak at the center frequency and tapering off toward the neighboring center frequencies.
Overall Summary
The formula provided calculates the center frequencies of the Mel filters by taking the points converted from the Mel scale to the linear frequency scale and mapping them to the nearest DFT bins. This is essential for applying the Mel filterbank in the frequency analysis of signals, particularly in audio and speech processing. The diagram illustrates how these filters are positioned across the frequency spectrum, each focusing on a specific range to capture perceptually relevant features.
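- Frequency response of the Mel filters
Once the center frequency of each filter is known, the frequency response of each filter follows.
Passing the spectrogram formed from the power spectrum through the Mel filter bank then yields the Mel spectrogram.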
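The three steps of this section, followed by the application to a power spectrum, can be put together as in the sketch below. It assumes the $2595 \log_{10}$ conversion, the bin mapping $\lfloor (N_{fft} + 1)\, h / f_s \rfloor$, a 512-point FFT and a 16 kHz sampling rate; the function `mel_filter_bank` and the random stand-in spectrogram are hypothetical and only illustrate the shapes involved.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate, f_low, f_high):
    """Equal-height triangular filters, shape (n_filters, n_fft // 2 + 1)."""
    # Step 1: n_filters + 2 equally spaced points on the Mel axis
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    # Step 2: convert the points back to Hz
    hz_points = mel_to_hz(mel_points)
    # Step 3: map each point to an FFT bin index (floor)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising edge, 0 at `left`
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right + 1):      # falling edge, 1 at `center`
            bank[i - 1, k] = (right - k) / (right - center)
    return bank

# Assumed example settings: 10 filters, 512-point FFT, 16 kHz, 300 to 8000 Hz
bank = mel_filter_bank(10, 512, 16000, 300.0, 8000.0)

# Stand-in power spectrogram (random data, 5 frames), only to show the shapes
rng = np.random.default_rng(0)
power_spec = rng.random((5, 257))
mel_spec = power_spec @ bank.T                  # shape (5, 10)
print(bank.shape, mel_spec.shape)
```

This sketch does not guard against two neighbouring points falling into the same bin, which can happen with many filters and a short FFT; a production implementation would need to handle that case.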
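5. How librosa implements the conversion
As shown in the figure below, the function takes (frequencies, htk): frequencies are the frequencies to convert, and htk selects whether the HTK formula is used; the default is htk=False.
The official documentation describes htk (bool) as follows:
- If True, use HTK formula to convert Hz to mel.
- Otherwise (False), use Slaney's Auditory Toolbox.
So how does Slaney's Auditory Toolbox compute the Mel frequency?
According to the code:
- For frequencies below 1000 Hz the conversion is linear: $m = \frac{3f}{200}$
- For frequencies at or above 1000 Hz the conversion is logarithmic: $m = 15 + \frac{\ln\frac{f}{1000}}{\frac{\ln 6.4}{27}}$,
where 15 comes from $15 = \frac{3 \times 1000}{200}$.
The constants 6.4 and 27 appear to come from the filter layout of Slaney's Auditory Toolbox, in which 27 log-spaced filters cover a factor of 6.4 in frequency (1000 Hz to 6400 Hz), so that each Mel step multiplies the frequency by $6.4^{1/27} \approx 1.0712$.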
import numpy as np

def hz_to_mel(frequencies, htk=False):
    """Convert Hz to Mels

    Examples
    --------
    >>> librosa.hz_to_mel(60)
    0.9
    >>> librosa.hz_to_mel([110, 220, 440])
    array([ 1.65, 3.3 , 6.6 ])

    Parameters
    ----------
    frequencies : number or np.ndarray [shape=(n,)] , float
        scalar or array of frequencies
    htk : bool
        use HTK formula instead of Slaney

    Returns
    -------
    mels : number or np.ndarray [shape=(n,)]
        input frequencies in Mels

    See Also
    --------
    mel_to_hz
    """
    frequencies = np.asanyarray(frequencies)
    if htk:
        return 2595.0 * np.log10(1.0 + frequencies / 700.0)
    # Fill in the linear part
    f_min = 0.0
    f_sp = 200.0 / 3
    mels = (frequencies - f_min) / f_sp
    # Fill in the log-scale part
    min_log_hz = 1000.0  # beginning of log region (Hz)
    min_log_mel = (min_log_hz - f_min) / f_sp  # same (Mels)
    logstep = np.log(6.4) / 27.0  # step size for log region
    if frequencies.ndim:
        # If we have array data, vectorize
        log_t = frequencies >= min_log_hz
        mels[log_t] = min_log_mel + np.log(frequencies[log_t] / min_log_hz) / logstep
    elif frequencies >= min_log_hz:
        # If we have scalar data, check directly
        mels = min_log_mel + np.log(frequencies / min_log_hz) / logstep
    return mels
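To see how the two conventions differ in practice, the sketch below re-implements both rules with plain NumPy (so it does not depend on librosa itself) and evaluates them at a few frequencies. The function names `hz_to_mel_htk` and `hz_to_mel_slaney` are illustrative, not librosa API.

```python
import numpy as np

def hz_to_mel_htk(f):
    """HTK formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    """Piecewise rule: linear below 1 kHz, logarithmic at and above 1 kHz."""
    f = np.asarray(f, dtype=float)
    linear = 3.0 * f / 200.0
    logstep = np.log(6.4) / 27.0
    # np.maximum keeps the log argument >= 1 for entries that use the linear branch
    log_part = 15.0 + np.log(np.maximum(f, 1000.0) / 1000.0) / logstep
    return np.where(f >= 1000.0, log_part, linear)

freqs = np.array([60.0, 440.0, 1000.0, 6400.0])
print("Slaney:", hz_to_mel_slaney(freqs))
print("HTK:   ", hz_to_mel_htk(freqs))
```

The two conventions produce values on very different numerical ranges (15 versus roughly 1000 at 1 kHz), so Mel values computed with one should not be mixed with the other.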
6. Supplementary notes
question:
- When generating a mel spectrogram, suppose the FFT length is 256 and the highest frequency of the mel filter bank is f_s/2, so that the largest FFT bin index used is k = 128. What should be done with the remaining FFT points?
answer:
- In the context of generating mel spectrograms, let's break down the details and implications of your setup:
- FFT Length : 256
- Sampling Rate : $f_s$ (so the highest frequency in the mel filter bank is $f_s/2$)
- Largest FFT Bin Index Used (maximum k value) : 128
When you perform a Fast Fourier Transform (FFT) with a length of 256, you get 256 frequency bins. Bins 0 to 128 represent frequencies from 0 Hz to the Nyquist frequency ($f_s/2$), and bins 129 to 255 represent the negative frequencies.
Since the largest bin index used is 128, only bins 0 to 128 (129 bins) are of interest; they correspond to frequencies from 0 Hz to $f_s/2$.
What to Do with the Remaining FFT Points
The remaining 127 FFT points (indices 129 to 255) correspond to negative frequencies and are redundant for your mel spectrogram computation, because the mel filter bank only spans 0 Hz to $f_s/2$. In practice, here's what typically happens:
- Discarding the Redundant Bins : Since the mel spectrogram computation only requires frequencies up to $f_s/2$, the remaining points are not needed and can simply be discarded. This is common practice because those bins carry no additional information for the mel filter bank.
- Symmetry in FFT : Note that for real-valued signals, the FFT output is conjugate-symmetric. Bins 0 to 128 contain the non-negative frequencies, and bins 129 to 255 contain the negative frequencies, which are complex conjugates of bins 127 down to 1. Since we are dealing with real signals and are only interested in magnitude (or power), the second part is redundant.
Implementation Steps
- Compute the FFT : Compute the FFT of one frame of your signal using a length of 256. This will give you 256 complex values.
- Extract the Relevant Bins : Take bins 0 to 128 (129 values) of the FFT result.
- Apply the Mel Filter Bank : Apply your mel filter bank to the power of these bins to generate one frame of the mel spectrogram.
Example Code (Python)
Here's a Python snippet using numpy and librosa to illustrate these steps:
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Example signal (replace with your actual signal)
fs = 16000   # Example sampling rate
n_fft = 256
signal = np.random.random(n_fft)  # one frame of n_fft samples

# Compute the FFT of the frame (256 complex values)
fft_result = np.fft.fft(signal, n=n_fft)

# Keep bins 0..n_fft/2 (129 bins: 0 Hz up to the Nyquist frequency)
fft_bins = fft_result[:n_fft // 2 + 1]

# Apply mel filter bank (using librosa for convenience); its shape is (40, 129)
mel_filter_bank = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=40)
mel_spectrogram = np.dot(mel_filter_bank, np.abs(fft_bins) ** 2)  # power per mel band

# Plot the mel energies of this single frame in dB
plt.figure(figsize=(10, 4))
plt.plot(librosa.power_to_db(mel_spectrogram, ref=np.max))
plt.xlabel('Mel band')
plt.ylabel('dB')
plt.title('Mel Spectrum of One Frame')
plt.tight_layout()
plt.show()
In this code:
- We compute the FFT of one signal frame.
- We keep bins 0 to 128.
- We apply a mel filter bank to the power of these bins.
- Finally, we plot the resulting mel energies.
This process ensures that you only use the necessary frequency information and discard the redundant negative-frequency bins.
question:
- 1. Why is the highest frequency of the mel filter bank f_s/2? 2. Why, when performing a Fast Fourier Transform (FFT) of length 256, do the bins represent frequencies from 0 Hz to the Nyquist frequency (i.e. f_s/2)?
answer:
The highest frequency of the mel filter bank is $f_s/2$ because:
- Nyquist Theorem : According to the Nyquist theorem, the maximum frequency that can be accurately represented in a digital signal is half of the sampling rate ($f_s/2$). This frequency is known as the Nyquist frequency. Frequencies higher than $f_s/2$ would alias back into the range $[0, f_s/2]$ and cause distortion.
- Mel Scale Design : The mel scale is designed to be a perceptual scale of pitches judged by listeners to be equal in distance from one another. A mel filter bank typically spans from 0 Hz up to the Nyquist frequency because that is the full range of frequencies that can be captured at the given sampling rate.
Thus, the mel filter bank typically spans from 0 Hz to $f_s/2$, ensuring it covers the entire range of frequencies that can be accurately represented in the sampled signal.
Why FFT Bins Represent Frequencies from 0 Hz to the Nyquist Frequency
When performing a Fast Fourier Transform (FFT) of length $N$, the non-redundant frequency bins represent frequencies from 0 Hz to the Nyquist frequency ($f_s/2$) for the following reasons:
- Frequency Resolution : The FFT of a real-valued signal produces $N$ complex coefficients (frequency bins). These bins correspond to specific frequencies based on the sampling rate $f_s$ and the FFT length $N$:
  - Bin 0 corresponds to 0 Hz.
  - Bin 1 corresponds to $\frac{f_s}{N}$ Hz.
  - Bin 2 corresponds to $2 \cdot \frac{f_s}{N}$ Hz.
  - …
  - Bin $N/2$ corresponds to $\frac{N/2 \cdot f_s}{N} = \frac{f_s}{2}$ Hz (Nyquist frequency).
- Symmetry of FFT for Real-Valued Signals : For real-valued signals, the FFT output is conjugate-symmetric. This means:
  - The first part of the FFT output (bins 0 to $N/2$) represents frequencies from 0 Hz to the Nyquist frequency.
  - The rest of the FFT output (bins $N/2+1$ to $N-1$) mirrors bins $N/2-1$ down to 1 and represents negative frequencies, whose values are complex conjugates of those at the corresponding positive frequencies. These are redundant for analysis of real signals.
Detailed Breakdown
For an FFT length of $N = 256$:
- FFT Bins : you get 256 frequency bins.
- Frequency Representation :
  - Bin 0: 0 Hz
  - Bin 1: $\frac{f_s}{256}$ Hz
  - Bin 2: $2 \cdot \frac{f_s}{256}$ Hz
  - …
  - Bin 128: $128 \cdot \frac{f_s}{256} = \frac{f_s}{2}$ Hz (Nyquist frequency)
  - Bins 129 to 255: Mirror image of bins 127 down to 1 (negative frequencies, redundant for real signals)
Since bins 129 to 255 represent negative frequencies and are redundant for a real signal, they are not used in typical spectral analysis and can be discarded.
By using bins 0 to 128 (129 bins), you are effectively covering the entire frequency range from 0 Hz to $f_s/2$ (Nyquist frequency), which is all the meaningful frequency content available in the sampled signal.
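The symmetry described above is easy to verify numerically. The sketch below uses a random real-valued frame (illustrative data only) and NumPy's `fft`, `rfft` and `rfftfreq`; the sampling rate of 16 kHz is an assumption for the example.

```python
import numpy as np

fs = 16000
n_fft = 256
rng = np.random.default_rng(0)
frame = rng.standard_normal(n_fft)           # any real-valued frame

full = np.fft.fft(frame)                     # 256 complex bins
half = np.fft.rfft(frame)                    # 129 bins: 0 .. n_fft/2

# Bins 129..255 are the complex conjugates of bins 127..1
mirror_ok = np.allclose(full[129:], np.conj(full[127:0:-1]))
# rfft returns exactly the non-redundant part of the full FFT
same_half = np.allclose(full[:n_fft // 2 + 1], half)

freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)   # bin center frequencies, 0 Hz .. fs/2
print(mirror_ok, same_half, len(half), freqs[-1])
```

For real input, `np.fft.rfft` is therefore a convenient way to obtain only the 129 bins that a mel filter bank for a 256-point FFT needs.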
question:
- Can you illustrate the above explanations with a diagram, using graphic illustrations?
answer:
- Certainly! Here's a step-by-step illustration of the explanations using diagrams.
1. Frequency Range and Nyquist Theorem
The Nyquist theorem states that the highest frequency that can be represented is half of the sampling rate ($f_s/2$). This is called the Nyquist frequency.
Diagram: Frequency Range
|-----------------------------|-----------------------------|
0                           f_s/2                         f_s
- 0 to $f_s/2$: Range of frequencies that can be accurately represented.
2. FFT Bins and Frequency Representation
When you perform an FFT of length $N$, the frequency bins correspond to specific frequencies based on the sampling rate ($f_s$) and the length of the FFT ($N$).
Diagram: FFT Bins
|-----------------------------|-----------------------------|
Bin 0                      Bin N/2                    Bin N-1
0 Hz                       f_s/2 Hz     -f_s/N Hz (redundant)
- Bins from 0 to $N/2$ (non-negative frequencies).
- Bins from $N/2+1$ to $N-1$ (negative frequencies, redundant for real signals).
3. Symmetry of FFT for Real-Valued Signals
For real-valued signals, the FFT is conjugate-symmetric. The second half of the FFT result mirrors the first half and represents negative frequencies.
Diagram: Symmetry in FFT
|--------|--------|--------|--------|--------|--------|--------|
0 Hz     +f       +2f      ...      f_s/2    ...      -2f      -f
- $f = f_s/N$
- Positive frequencies from 0 to $f_s/2$.
- Negative frequencies mirror the positive ones.
4. Applying Mel Filter Bank
The mel filter bank typically spans from 0 Hz to $f_s/2$, covering the entire range of frequencies that can be accurately represented.
Diagram: Mel Filter Bank
|-----------------------------|-----------------------------|
0                           f_s/2                         f_s
|-----------------------------|
Mel Filter Bank Range
- Mel filter bank spans from 0 Hz to $f_s/2$.
Combined Illustration
Combining all the above points into a single illustration:
Combined Diagram
|-----------------------------------------------------------|
0 Hz                                                      f_s
|-----------------------------|-----------------------------|
     Positive Frequencies     | Negative Frequencies (Redundant)
|-----------------------------|-----------------------------|
0 Hz                        f_s/2                     -f_s/N
|-----------------------------|
Mel Filter Bank Range (0 to f_s/2)
- Left to right : FFT bins 0 to $N-1$, i.e. 0 Hz up to $f_s/2$ (non-negative frequencies) followed by the negative frequencies.
- Middle line : Nyquist frequency $f_s/2$.
- Right half : Negative frequencies, redundant for real signals.
- Highlighted Range : 0 Hz to $f_s/2$, covered by the mel filter bank.
By using bins 0 to 128 of an FFT of length 256, you are focusing on the frequency range from 0 Hz to the Nyquist frequency ($f_s/2$), which is the relevant range for further analysis with the mel filter bank. The bins representing negative frequencies are discarded as they are redundant for real-valued signals.
question:
- What I mean is generating corresponding image explanations
answer:
- Sure, let's break down the explanations with corresponding graphic illustrations. Here are the steps represented visually:
1. Nyquist Theorem and Frequency Range
This diagram shows the frequency range up to the Nyquist frequency ($f_s/2$).
Diagram: Frequency Range
0 Hz              Nyquist Frequency (f_s/2)       Sampling Rate (f_s)
|-----------------------------|-----------------------------|
2. FFT Bins and Frequency Representation
This diagram shows the FFT bins corresponding to frequencies from 0 Hz to $f_s/2$ and the redundant bins.
Diagram: FFT Bins
FFT Output Bins:
|-----------------------------|-----------------------------|
0                            128                         255
0 Hz                  Nyquist Frequency      Redundant (Negative frequencies)
3. Symmetry of FFT for Real-Valued Signals
This diagram illustrates the symmetry of the FFT output for real-valued signals.
Diagram: Symmetry in FFT
FFT Output Bins:
|-----------------------------|-----------------------------|
0                            128                         255
0 Hz              Nyquist Frequency (f_s/2)  Redundant (Negative frequencies)
- Bins 0 to 128 represent the non-negative frequencies.
- Bins 129 to 255 mirror bins 127 down to 1 and are redundant for real-valued signals.
4. Applying Mel Filter Bank
This diagram shows the mel filter bank applied to the FFT output.
Diagram: Mel Filter Bank
FFT Output Bins:
|-----------------------------|
0                            128
0 Hz     Nyquist Frequency (f_s/2)
|-----------------------------|
Mel Filter Bank Range
- The mel filter bank spans from 0 Hz to the Nyquist frequency (bins 0 to 128).
Combined Diagram
This combined diagram summarizes all the above points.
Combined Diagram
Overall Frequency and FFT Representation:
|-----------------------------------------------------------|
0 Hz              Nyquist Frequency (f_s/2)       Sampling Rate (f_s)
|-----------------------------|-----------------------------|
     Positive Frequencies     | Negative Frequencies (Redundant)
|-----------------------------|-----------------------------|
0 Hz              Nyquist Frequency (f_s/2)  Redundant (Negative frequencies)
|-----------------------------|
Mel Filter Bank Range (0 to f_s/2)
Visual Representation
To create these diagrams, let’s put together an actual visual representation:
- Frequency Range and Nyquist Theorem
- FFT Bins and Frequency Representation
- Symmetry of FFT for Real-Valued Signals
- Applying Mel Filter Bank
- Combined Diagram
These diagrams illustrate how the frequency range is mapped to FFT bins, the symmetry of the FFT for real-valued signals, and how the mel filter bank is applied within the meaningful frequency range (0 Hz to $f_s/2$).
question:
- What is the difference between "log mel-filter bank outputs" and "FBANK features"?
answer:
- The terms “log mel-filter bank outputs” and “FBANK features” are often used interchangeably in the context of speech processing, but there are subtle differences. Here’s a breakdown of both concepts:
1. Mel-Frequency Filter Bank (MFB or FBANK)
Mel-Frequency Filter Bank (MFB or FBANK) features refer to the raw outputs obtained after applying a bank of mel filters to the power spectrum of a signal. Here’s a step-by-step outline of the process:
- Pre-emphasis : A high-pass filter is applied to the input signal to emphasize higher frequencies.
- Frame Blocking : The signal is divided into overlapping frames.
- Windowing : Each frame is windowed (typically using a Hamming window) to reduce spectral leakage.
- Fourier Transform : An FFT is applied to each frame to obtain the power spectrum.
- Mel Filter Bank : The power spectrum is passed through a series of triangular bandpass filters spaced according to the mel scale, emphasizing frequencies perceived by the human ear.
The output of the mel filter bank consists of energies in each mel band, creating a feature vector for each frame.
2. Log Mel-Frequency Filter Bank (Log-MFB or Log-FBANK)
Log Mel-Frequency Filter Bank (Log-MFB or Log-FBANK) features take the process a step further by applying a logarithm to the outputs of the mel filter bank. The logarithm is used to transform the features to a scale that more closely matches human perception of loudness, as the human ear perceives loudness logarithmically.
Steps to Obtain Log-MFB Features
- Pre-emphasis
- Frame Blocking
- Windowing
- Fourier Transform
- Mel Filter Bank
- Logarithm : Apply the logarithm to each mel filter bank output to get log-mel filter bank energies.
Visual Representation
Here’s a visual representation of both processes:
Mel-Frequency Filter Bank (FBANK) Features
Signal -> Pre-emphasis -> Frame Blocking -> Windowing -> FFT -> Mel Filter Bank -> FBANK Features
Log Mel-Frequency Filter Bank (Log-FBANK) Features
Signal -> Pre-emphasis -> Frame Blocking -> Windowing -> FFT -> Mel Filter Bank -> Logarithm -> Log-FBANK Features
Key Differences
- FBANK Features : Raw mel filter bank outputs, representing the energy in each mel band.
- Log-FBANK Features : Logarithm of the mel filter bank outputs, representing the log-energy in each mel band, making the features more perceptually relevant.
Practical Usage
- FBANK Features : Used in various speech processing tasks, including speech recognition, but less common than log-mel features.
- Log-FBANK Features : More common in modern speech processing systems, particularly in automatic speech recognition (ASR), as they provide a more perceptually accurate representation of speech.
Example in Python
Here's a Python example using the librosa library to compute both FBANK and Log-FBANK features:
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example signal
signal, sr = librosa.load(librosa.example('trumpet'))

# Parameters
n_fft = 2048
hop_length = 512
n_mels = 128

# Compute the power spectrogram
S = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))**2

# Apply the mel filter bank to the power spectrogram
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
mel_S = np.dot(mel_basis, S)

# Compute FBANK features (mel filter bank energies)
fbank_features = mel_S

# Compute Log-FBANK features (log mel filter bank energies)
log_fbank_features = librosa.power_to_db(mel_S)

# Plot FBANK features (raw energies on a linear scale)
plt.figure(figsize=(10, 4))
librosa.display.specshow(fbank_features, y_axis='mel', x_axis='time',
                         sr=sr, hop_length=hop_length)
plt.colorbar()
plt.title('FBANK Features')
plt.tight_layout()
plt.show()

# Plot Log-FBANK features (dB scale)
plt.figure(figsize=(10, 4))
librosa.display.specshow(log_fbank_features, y_axis='mel', x_axis='time',
                         sr=sr, hop_length=hop_length)
plt.colorbar(format='%+2.0f dB')
plt.title('Log-FBANK Features')
plt.tight_layout()
plt.show()
Conclusion
In summary, the key difference between FBANK and Log-FBANK features is the application of the logarithm to the mel filter bank outputs in Log-FBANK features. This logarithmic scaling aligns better with human auditory perception and is widely used in speech processing tasks, particularly in ASR systems.
question:
- So what is the difference between MFCC and FBANK features?
answer:
- Certainly! Let’s explore the differences between Mel-Frequency Cepstral Coefficients (MFCC) and Mel-Frequency Filter Bank (FBANK) features, which are both commonly used in speech processing and automatic speech recognition (ASR).
Mel-Frequency Filter Bank (FBANK) Features
FBANK features are the raw or logarithmically scaled outputs of the mel filter bank. Here’s the detailed process to obtain FBANK features:
- Pre-emphasis : A high-pass filter is applied to the input signal to emphasize higher frequencies.
- Frame Blocking : The signal is divided into overlapping frames.
- Windowing : Each frame is windowed (typically using a Hamming window) to reduce spectral leakage.
- Fourier Transform : An FFT is applied to each frame to obtain the power spectrum.
- Mel Filter Bank : The power spectrum is passed through a series of triangular bandpass filters spaced according to the mel scale.
- Logarithm (optional) : Apply the logarithm to the mel filter bank outputs to get log-mel filter bank energies.
Mel-Frequency Cepstral Coefficients (MFCC) Features
MFCC features go a step further by performing a discrete cosine transform (DCT) on the log-mel filter bank energies. Here’s the detailed process to obtain MFCC features:
- Pre-emphasis : A high-pass filter is applied to the input signal to emphasize higher frequencies.
- Frame Blocking : The signal is divided into overlapping frames.
- Windowing : Each frame is windowed (typically using a Hamming window) to reduce spectral leakage.
- Fourier Transform : An FFT is applied to each frame to obtain the power spectrum.
- Mel Filter Bank : The power spectrum is passed through a series of triangular bandpass filters spaced according to the mel scale.
- Logarithm : Apply the logarithm to the mel filter bank outputs to get log-mel filter bank energies.
- Discrete Cosine Transform (DCT) : Apply the DCT to the log-mel filter bank energies to decorrelate the features and obtain a set of coefficients.
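The last step, the DCT, can be sketched with plain NumPy. The block below implements an orthonormal DCT-II by hand and applies it to stand-in log-mel energies (random data, for illustration only); the function name `dct_ii_ortho` and the choice of 40 mel bands and 13 coefficients are assumptions for the example, and real toolkits may use a different DCT normalization.

```python
import numpy as np

def dct_ii_ortho(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]            # output (cepstral) index
    i = np.arange(n)[None, :]                # input (mel band) index
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    scale = np.full((n_out, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)           # special scaling of the 0th coefficient
    return x @ (scale * basis).T

# Stand-in log-mel energies: 5 frames x 40 mel bands (random data for illustration)
rng = np.random.default_rng(0)
log_mel = np.log(rng.random((5, 40)) + 1e-10)

mfcc = dct_ii_ortho(log_mel, n_out=13)       # shape (5, 13)
print(mfcc.shape)
```

Keeping only the first 13 of the 40 coefficients is what gives MFCCs their lower dimensionality compared with the FBANK features they are derived from.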
Key Differences
- Transformation :
- FBANK Features : Represent the energy in each mel band, optionally scaled logarithmically.
- MFCC Features : Represent the decorrelated coefficients obtained by applying the DCT to the log-mel filter bank energies.
- Dimensionality :
- FBANK Features : Typically have higher dimensionality, equal to the number of mel filters used (e.g., 40).
- MFCC Features : Typically have lower dimensionality, often retaining only the first 12-13 coefficients after the DCT, plus an additional coefficient for the overall energy (resulting in 13-14 coefficients).
- Application :
- FBANK Features : Often used in deep learning-based ASR systems where raw spectral features are preferred.
- MFCC Features : Traditionally used in many classical ASR systems, speech recognition algorithms, and speaker identification tasks.
Visual Representation
To illustrate the differences visually, let’s break down the processes into diagrams.
FBANK Features Extraction Process
1. Pre-emphasis -> 2. Frame Blocking -> 3. Windowing -> 4. Fourier Transform -> 5. Mel Filter Bank -> 6. (Optional) Logarithm
Diagram: FBANK Features
Signal -> Pre-emphasis -> Frame Blocking -> Windowing -> FFT -> Mel Filter Bank -> (Logarithm) -> FBANK Features
MFCC Features Extraction Process
1. Pre-emphasis -> 2. Frame Blocking -> 3. Windowing -> 4. Fourier Transform -> 5. Mel Filter Bank -> 6. Logarithm -> 7. Discrete Cosine Transform (DCT)
Diagram: MFCC Features
Signal -> Pre-emphasis -> Frame Blocking -> Windowing -> FFT -> Mel Filter Bank -> Logarithm -> DCT -> MFCC Features
Example in Python
Here's a Python example using the librosa library to compute both FBANK and MFCC features:
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example signal
signal, sr = librosa.load(librosa.example('trumpet'))

# Parameters
n_fft = 2048
hop_length = 512
n_mels = 128

# Compute the power spectrogram
S = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))**2

# Apply the mel filter bank to the power spectrogram
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
mel_S = np.dot(mel_basis, S)

# Compute FBANK features (log mel filter bank energies)
fbank_features = librosa.power_to_db(mel_S)

# Compute MFCC features (DCT of the log mel filter bank energies)
mfcc_features = librosa.feature.mfcc(S=librosa.power_to_db(mel_S), n_mfcc=13)

# Plot FBANK features
plt.figure(figsize=(10, 4))
librosa.display.specshow(fbank_features, y_axis='mel', x_axis='time',
                         sr=sr, hop_length=hop_length)
plt.colorbar(format='%+2.0f dB')
plt.title('FBANK Features')
plt.tight_layout()
plt.show()

# Plot MFCC features
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfcc_features, x_axis='time', sr=sr, hop_length=hop_length)
plt.colorbar(format='%+2.0f')
plt.title('MFCC Features')
plt.tight_layout()
plt.show()
Conclusion
In summary, FBANK features represent the energy in each mel band (optionally log-scaled), while MFCC features are obtained by applying a DCT to the log-mel filter bank energies to decorrelate the features and reduce dimensionality. MFCCs are widely used in classical speech processing systems, while FBANK features are often preferred in modern deep learning-based ASR systems.