窗口的长度与 n_fft 需要匹配大小长度;
1. Mel 语谱图的函数定义
librosa.feature.melspectrogram()
函数在spectral.py
中,实现过程为:
def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
power=2.0, **kwargs):
S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)
# Build a Mel filter
mel_basis = filters.mel(sr, n_fft, **kwargs)
return np.dot(mel_basis, S)
可以看出 Mel_ 语谱图的计算主要有两个函数构成
- 计算出信号的语谱图(功率谱形式构成的), 由
_spectrogram()
函数实现 - 构造Mel 滤波器, 由
filters.mel
函数实现; - 将Mel 滤波器组与语谱图做矩阵乘法, 得到 mel 语谱图;
1.1 _spectrogram()
函数实现
def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):
if S is not None:
# Infer n_fft from spectrogram shape
n_fft = 2 * (S.shape[0] - 1)
else:
# Otherwise, compute a magnitude spectrogram from input
S = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**power
return S, n_fft
其中的核心代码部分为stft()函数,短时傅里叶变化stft函数的实现过程为:
def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann',
center=True, dtype=np.complex64, pad_mode='reflect'):
# By default, use the entire frame
if win_length is None:
win_length = n_fft
# Set the default hop, if it's not already specified
if hop_length is None:
hop_length = int(win_length // 4)
fft_window = get_window(window, win_length, fftbins=True)
# Pad the window out to n_fft size
fft_window = util.pad_center(fft_window, n_fft)
# Reshape so that the window can be broadcast
fft_window = fft_window.reshape((-1, 1))
# Check audio is valid
util.valid_audio(y)
# Pad the time series so that frames are centered
if center:
y = np.pad(y, int(n_fft // 2), mode=pad_mode)
# Window the time series.
y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)
# Pre-allocate the STFT matrix
stft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]),
dtype=dtype,
order='F')
# how many columns can we fit within MAX_MEM_BLOCK?
n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] *
stft_matrix.itemsize))
for bl_s in range(0, stft_matrix.shape[1], n_columns):
bl_t = min(bl_s + n_columns, stft_matrix.shape[1])
stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window *
y_frames[:, bl_s:bl_t],
axis=0)[:stft_matrix.shape[0]]
return stft_matrix
1.2 filters.mel
函数实现
输出是 [ n_mels, 1 + (n_fft / 2) ]
这其中需要的注意的是,不同的librosa版本的mel()函数的归一化参数传递的也是不一样的,比如python3.6中的norm可以传递1 或是 None,但是python3.7中就需要传递norm=‘slaney’.
产生一个线性变换矩阵, 将FFT 频率下的bin 的 转化到 Mel 尺度下的bin
@cache(level=10)
def mel(
sr,
n_fft,
n_mels=128,
fmin=0.0,
fmax=None,
htk=False,
norm="slaney",
dtype=np.float32,
):
"""Create a Mel filter-bank.
This produces a linear transformation matrix to project
FFT bins onto Mel-frequency bins.
Parameters
----------
sr : number > 0 [scalar]
sampling rate of the incoming signal
n_fft : int > 0 [scalar]
number of FFT components
n_mels : int > 0 [scalar]
number of Mel bands to generate
fmin : float >= 0 [scalar]
lowest frequency (in Hz)
fmax : float >= 0 [scalar]
highest frequency (in Hz).
If `None`, use ``fmax = sr / 2.0``
htk : bool [scalar]
use HTK formula instead of Slaney
norm : {None, 'slaney', or number} [scalar]
If 'slaney', divide the triangular mel weights by the width of the mel band
(area normalization).
If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm.
See `librosa.util.normalize` for a full description of supported norm values
(including `+-np.inf`).
Otherwise, leave all the triangles aiming for a peak value of 1.0
dtype : np.dtype
The data type of the output basis.
By default, uses 32-bit (single-precision) floating point.
Returns
-------
M : np.ndarray [shape=(n_mels, 1 + n_fft/2)]
Mel transform matrix
See also
--------
librosa.util.normalize
Notes
-----
This function caches at level 10.
Examples
--------
>>> melfb = librosa.filters.mel(22050, 2048)
>>> melfb
array([[ 0. , 0.016, ..., 0. , 0. ],
[ 0. , 0. , ..., 0. , 0. ],
...,
[ 0. , 0. , ..., 0. , 0. ],
[ 0. , 0. , ..., 0. , 0. ]])
Clip the maximum frequency to 8KHz
>>> librosa.filters.mel(22050, 2048, fmax=8000)
array([[ 0. , 0.02, ..., 0. , 0. ],
[ 0. , 0. , ..., 0. , 0. ],
...,
[ 0. , 0. , ..., 0. , 0. ],
[ 0. , 0. , ..., 0. , 0. ]])
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax)
>>> ax.set(ylabel='Mel filter', title='Mel filter bank')
>>> fig.colorbar(img, ax=ax)
"""
if fmax is None:
fmax = float(sr) / 2
# Initialize the weights
n_mels = int(n_mels)
weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype)
# Center freqs of each FFT bin
fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft)
# 'Center freqs' of mel bands - uniformly spaced between limits
mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk)
fdiff = np.diff(mel_f)
ramps = np.subtract.outer(mel_f, fftfreqs)
for i in range(n_mels):
# lower and upper slopes for all bins
lower = -ramps[i] / fdiff[i]
upper = ramps[i + 2] / fdiff[i + 1]
# .. then intersect them with each other and zero
weights[i] = np.maximum(0, np.minimum(lower, upper))
if norm == "slaney":
# Slaney-style mel is scaled to be approx constant energy per channel
enorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])
weights *= enorm[:, np.newaxis]
else:
weights = util.normalize(weights, norm=norm, axis=-1)
# Only check weights if f_mel[0] is positive
if not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)):
# This means we have an empty channel somewhere
warnings.warn(
"Empty filters detected in mel frequency basis. "
"Some channels will produce empty responses. "
"Try increasing your sampling rate (and fmax) or "
"reducing n_mels."
)
return weights
2. 实例分析
假设 调用形式如下 :
librosa.feature.melspectrogram(y=current_window ,sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)
def melspectrogram(
y=None,
sr=22050,
S=None,
n_fft=2048,
hop_length=512,
win_length=None,
window="hann",
center=True,
pad_mode="reflect",
power=2.0,
**kwargs,
):
S, n_fft = _spectrogram(
y=y,
S=S,
n_fft=n_fft,
hop_length=hop_length,
power=power,
win_length=win_length,
window=window,
center=center,
pad_mode=pad_mode,
)
# Build a Mel filter
mel_basis = filters.mel(sr, n_fft, **kwargs)
return np.dot(mel_basis, S)
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
返回:
mel 语谱图的输出大小为 [n_mels, n_frames ],
Mel 语谱图的输出结果是, Mel 滤波器组 和 功率谱格式的语谱图 两者共同作用的结果。
[ n_mels, 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft ] ×[ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
= [n_mels, n_frames ]
注意:
Mel 语谱图函数的输入, 其中的
S
这个参数量, 代表了输入中是否提供了语谱图的输入。
S
如果提供了语谱图输入, 此时S
直接使用输入的语谱图;S
如果没有提供语谱图的输入, 需要根据输入的一维数据y
, 先经过STFT 得到频谱, 然后在取模值的平方, 得到频谱的功率谱形式, 此时的功率谱便作为语谱图。
最终 mel 语谱图的输出大小为 [n_mels, frames ], 由 Mel 滤波器组和短时傅里叶变换STFT 两个 输出矩阵,相乘共同作用到的结果。
该函数中调用了两个函数实现:
2.1 STFT 的输出帧数
根据以上参数
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
调用 STFT ,得到的输出 [ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
-
行数: 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft = 1+ (800/2) = 401
-
列数:等于帧数, 帧数
n_frames
求法:
n f r a m e s = a u d i o l e n g t h / / h o p l e n g t h + 1 , n_{frames} = audio_{length} // hop_{length} + 1, nframes=audiolength//hoplength+1,
带入上式中的, 可以求得:
n_frams = (32000 // 200) + 1 = 161,
故此时的STFT 输出 [ 401, 161]
2.2 filter.mels(sr, n_fft, **kwargs)
调用 filter.mels(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1 )
函数, 生成一个 Mel 滤波器组;
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
其中有几个重要的参数,需要关注以下
- n f f t n_{fft} nfft 的点数选取, n f f t n_{fft} nfft 的点数越多, CPU需要的运算量也增加,可以分别的频率也精细;
该参数与与频率分辨率有关:
sr
采样率为 16 KHZ;
假设将音频的频率分辨率设置为 10 HZ,
频率分辨率 bin = F s N f f t \frac{F_{s}}{N_{fft}} NfftFs = 采 样 率 N f f t 的 点 数 \frac{采样率}{N_{fft}的点数} Nfft的点数采样率
N f f t N_{fft} Nfft = F s 10 \frac{F_{s}}{10} 10Fs = 16000 10 \frac{16000}{10} 1016000 =1600 点;