librosa 语音库（三） librosa.feature. 中的 spectrogram 与 melspectrogram

ZhengXinTang

已于 2022-06-27 17:31:33 修改

阅读量8.6k

点赞数 13

分类专栏： # 音频部分文章标签： Mel_spectgroam

于 2022-05-17 20:42:11 首次发布

本文链接：https://blog.csdn.net/chumingqian/article/details/124824357

版权

音频部分专栏收录该内容

44 篇文章

订阅专栏

本文详细解析了如何通过librosa库中的melspectrogram函数计算Mel频率谱图，涉及关键步骤包括STFT的使用、n_fft长度的选择、Mel滤波器的构造，以及如何将滤波器与功率谱图结合。实例以32000点音频数据为例，展示了参数n_fft=800和n_mels=256的影响，以及输出结果的计算过程。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

窗口的长度与 n_fft 需要匹配大小长度；

1. Mel 语谱图的函数定义

librosa.feature.melspectrogram()函数在spectral.py 中，实现过程为：

def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                   power=2.0, **kwargs):

    S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)

    # Build a Mel filter
    mel_basis = filters.mel(sr, n_fft, **kwargs)

    return np.dot(mel_basis, S)

可以看出 Mel_ 语谱图的计算主要有两个函数构成

计算出信号的语谱图（功率谱形式构成的），由 _spectrogram() 函数实现
构造Mel 滤波器，由filters.mel 函数实现；
将Mel 滤波器组与语谱图做矩阵乘法，得到 mel 语谱图；

1.1 `_spectrogram()` 函数实现

def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):
 
    if S is not None:
        # Infer n_fft from spectrogram shape
        n_fft = 2 * (S.shape[0] - 1)
    else:
        # Otherwise, compute a magnitude spectrogram from input
        S = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**power

    return S, n_fft

其中的核心代码部分为stft（）函数，短时傅里叶变化stft函数的实现过程为：

def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann',
         center=True, dtype=np.complex64, pad_mode='reflect'):

    # By default, use the entire frame
    if win_length is None:
        win_length = n_fft

    # Set the default hop, if it's not already specified
    if hop_length is None:
        hop_length = int(win_length // 4)

    fft_window = get_window(window, win_length, fftbins=True)

    # Pad the window out to n_fft size
    fft_window = util.pad_center(fft_window, n_fft)

    # Reshape so that the window can be broadcast
    fft_window = fft_window.reshape((-1, 1))

    # Check audio is valid
    util.valid_audio(y)

    # Pad the time series so that frames are centered
    if center:
        y = np.pad(y, int(n_fft // 2), mode=pad_mode)

    # Window the time series.
    y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)

    # Pre-allocate the STFT matrix
    stft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]),
                           dtype=dtype,
                           order='F')

    # how many columns can we fit within MAX_MEM_BLOCK?
    n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] *
                                          stft_matrix.itemsize))

    for bl_s in range(0, stft_matrix.shape[1], n_columns):
        bl_t = min(bl_s + n_columns, stft_matrix.shape[1])

        stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window *
                                            y_frames[:, bl_s:bl_t],
                                            axis=0)[:stft_matrix.shape[0]]

    return stft_matrix

1.2 `filters.mel` 函数实现

输出是 [ n_mels, 1 + (n_fft / 2) ]

这其中需要的注意的是，不同的librosa版本的mel（）函数的归一化参数传递的也是不一样的，比如python3.6中的norm可以传递1 或是 None，但是python3.7中就需要传递norm=‘slaney’.

产生一个线性变换矩阵，将FFT 频率下的bin 的转化到 Mel 尺度下的bin

@cache(level=10)
def mel(
    sr,
    n_fft,
    n_mels=128,
    fmin=0.0,
    fmax=None,
    htk=False,
    norm="slaney",
    dtype=np.float32,
):
    """Create a Mel filter-bank.

    This produces a linear transformation matrix to project
    FFT bins onto Mel-frequency bins.

    Parameters
    ----------
    sr        : number > 0 [scalar]
        sampling rate of the incoming signal

    n_fft     : int > 0 [scalar]
        number of FFT components

    n_mels    : int > 0 [scalar]
        number of Mel bands to generate

    fmin      : float >= 0 [scalar]
        lowest frequency (in Hz)

    fmax      : float >= 0 [scalar]
        highest frequency (in Hz).
        If `None`, use ``fmax = sr / 2.0``

    htk       : bool [scalar]
        use HTK formula instead of Slaney

    norm : {None, 'slaney', or number} [scalar]
        If 'slaney', divide the triangular mel weights by the width of the mel band
        (area normalization).

        If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm.
        See `librosa.util.normalize` for a full description of supported norm values
        (including `+-np.inf`).

        Otherwise, leave all the triangles aiming for a peak value of 1.0

    dtype : np.dtype
        The data type of the output basis.
        By default, uses 32-bit (single-precision) floating point.

    Returns
    -------
    M         : np.ndarray [shape=(n_mels, 1 + n_fft/2)]
        Mel transform matrix

    See also
    --------
    librosa.util.normalize

    Notes
    -----
    This function caches at level 10.

    Examples
    --------
    >>> melfb = librosa.filters.mel(22050, 2048)
    >>> melfb
    array([[ 0.   ,  0.016, ...,  0.   ,  0.   ],
           [ 0.   ,  0.   , ...,  0.   ,  0.   ],
           ...,
           [ 0.   ,  0.   , ...,  0.   ,  0.   ],
           [ 0.   ,  0.   , ...,  0.   ,  0.   ]])


    Clip the maximum frequency to 8KHz

    >>> librosa.filters.mel(22050, 2048, fmax=8000)
    array([[ 0.  ,  0.02, ...,  0.  ,  0.  ],
           [ 0.  ,  0.  , ...,  0.  ,  0.  ],
           ...,
           [ 0.  ,  0.  , ...,  0.  ,  0.  ],
           [ 0.  ,  0.  , ...,  0.  ,  0.  ]])


    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots()
    >>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax)
    >>> ax.set(ylabel='Mel filter', title='Mel filter bank')
    >>> fig.colorbar(img, ax=ax)
    """

    if fmax is None:
        fmax = float(sr) / 2

    # Initialize the weights
    n_mels = int(n_mels)
    weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype)

    # Center freqs of each FFT bin
    fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft)

    # 'Center freqs' of mel bands - uniformly spaced between limits
    mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk)

    fdiff = np.diff(mel_f)
    ramps = np.subtract.outer(mel_f, fftfreqs)

    for i in range(n_mels):
        # lower and upper slopes for all bins
        lower = -ramps[i] / fdiff[i]
        upper = ramps[i + 2] / fdiff[i + 1]

        # .. then intersect them with each other and zero
        weights[i] = np.maximum(0, np.minimum(lower, upper))

    if norm == "slaney":
        # Slaney-style mel is scaled to be approx constant energy per channel
        enorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])
        weights *= enorm[:, np.newaxis]
    else:
        weights = util.normalize(weights, norm=norm, axis=-1)

    # Only check weights if f_mel[0] is positive
    if not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)):
        # This means we have an empty channel somewhere
        warnings.warn(
            "Empty filters detected in mel frequency basis. "
            "Some channels will produce empty responses. "
            "Try increasing your sampling rate (and fmax) or "
            "reducing n_mels."
        )

    return weights

2. 实例分析

假设调用形式如下：

librosa.feature.melspectrogram(y=current_window ，sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)



def melspectrogram(
    y=None,
    sr=22050,
    S=None,
    n_fft=2048,
    hop_length=512,
    win_length=None,
    window="hann",
    center=True,
    pad_mode="reflect",
    power=2.0,
    **kwargs,
):

    S, n_fft = _spectrogram(
        y=y,
        S=S,
        n_fft=n_fft,
        hop_length=hop_length,
        power=power,
        win_length=win_length,
        window=window,
        center=center,
        pad_mode=pad_mode,
    )

    # Build a Mel filter
    mel_basis = filters.mel(sr, n_fft, **kwargs)

    return np.dot(mel_basis, S)

输入参数：
y = 32000 个点的音频数据，
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800；
hop_length = win_length / 4 = n_fft / 4

返回：
mel 语谱图的输出大小为 [n_mels, n_frames ]，

Mel 语谱图的输出结果是， Mel 滤波器组和功率谱格式的语谱图两者共同作用的结果。

[ n_mels, 1 + $\frac{n_{fft}}{2}$ ] ×[ 1 + $\frac{n_{fft}}{2}$ , n_frames ]

= [n_mels, n_frames ]

注意：

Mel 语谱图函数的输入，其中的S 这个参数量，代表了输入中是否提供了语谱图的输入。

S 如果提供了语谱图输入，此时 S 直接使用输入的语谱图；
S 如果没有提供语谱图的输入，需要根据输入的一维数据y, 先经过STFT 得到频谱，然后在取模值的平方，得到频谱的功率谱形式，此时的功率谱便作为语谱图。

最终 mel 语谱图的输出大小为 [n_mels, frames ]，由 Mel 滤波器组和短时傅里叶变换STFT 两个输出矩阵，相乘共同作用到的结果。

该函数中调用了两个函数实现：

2.1 STFT 的输出帧数

根据以上参数

输入参数：
y = 32000 个点的音频数据，
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800；
hop_length = win_length / 4 = n_fft / 4

调用 STFT ，得到的输出 [ 1 + $\frac{n_{fft}}{2}$ , n_frames ]

行数: 1 + $\frac{n_{fft}}{2}$ = 1+ (800/2) = 401
列数：等于帧数，帧数 n_frames 求法：
$n_{frames} = audio_{length} // hop_{length} + 1,$
带入上式中的，可以求得：
n_frams = (32000 // 200) + 1 = 161,