Comparing torch.stft() and librosa.stft()

Latest recommended article published on 2025-02-14 14:06:12

呦呵嘿呀

Category: Speech Processing. Tags: pytorch, stft, short-time Fourier transform

Article link: https://blog.csdn.net/b15040915/article/details/105492951

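This article compares the parameters and usage of torch.stft in PyTorch with those of librosa.stft in Librosa when processing speech signals. Both compute a short-time Fourier transform (STFT) from which the magnitude and phase of the signal can be obtained, but their interfaces and return formats differ. torch.stft operates on tensors inside a deep learning framework, while librosa.stft belongs to a library dedicated to audio signal analysis and works on NumPy arrays.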

How torch.stft and librosa.stft differ in the way the magnitude and phase of speech are obtained

torch.stft
librosa.stft

torch.stft

stft(self, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)

Parameters:
----------

input (Tensor) – the input tensor

n_fft (int) – size of Fourier transform

hop_length (int, optional) – the distance between neighboring sliding window frames. Default: None (treated as equal to floor(n_fft / 4))

win_length (int, optional) – the size of window frame and STFT filter. Default: None (treated as equal to n_fft)

window (Tensor, optional) – the optional window function. Default: None (treated as window of all 1 s)

center (bool, optional) – whether to pad input on both sides so that the t-th frame is centered at time t × hop_length. Default: True

pad_mode (string, optional) – controls the padding method used when center is True. Default: "reflect"

normalized (bool, optional) – controls whether to return the normalized STFT results. Default: False

onesided (bool, optional) – controls whether to return half of results to avoid redundancy. Default: True

Returns the real and the imaginary parts together as one tensor of size (* × N × T × 2), where * is the optional batch size of input, N is the number of frequencies where STFT is applied, T is the total number of frames used, and each pair in the last dimension represents a complex number as the real part and the imaginary part.

----------
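The sizes N and T in the description above follow from n_fft, hop_length, center and onesided. The helper below is only a sketch of that arithmetic: the name stft_output_shape is hypothetical and is not part of PyTorch or librosa, and it assumes the padding of n_fft // 2 samples on each side that center=True implies.

```python
def stft_output_shape(num_samples, n_fft, hop_length=None, center=True, onesided=True):
    """Return (number of frequency bins, number of frames) for an STFT.

    Hypothetical helper for illustration; not a PyTorch or librosa function.
    """
    if hop_length is None:
        hop_length = n_fft // 4  # default hop: a quarter of the FFT size
    # A one-sided transform keeps only the non-redundant half of the spectrum.
    n_freq = n_fft // 2 + 1 if onesided else n_fft
    # center=True pads n_fft // 2 samples on both sides of the signal.
    padded = num_samples + 2 * (n_fft // 2) if center else num_samples
    n_frames = 1 + (padded - n_fft) // hop_length
    return n_freq, n_frames


shape_centered = stft_output_shape(16000, n_fft=512, hop_length=128)
shape_uncentered = stft_output_shape(16000, n_fft=512, hop_length=128, center=False)
print(shape_centered, shape_uncentered)
```

For one second of audio at 16 kHz with n_fft=512 and hop_length=128, this gives 257 frequency bins in both cases, and fewer frames when center=False because no padding is added.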

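The input is a one- or two-dimensional time series.
The return value is a tensor whose first dimension is the (optional) batch size of the input, whose second dimension is the number of frequency bins, whose third dimension is the total number of frames, and whose last dimension holds the real and imaginary parts of each complex value.
The magnitude and phase can be obtained as follows: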

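import torch

# Newer PyTorch versions need return_complex given explicitly for real input;
# view_as_real then restores the (..., N, T, 2) layout described above.
spec = torch.stft(mono, n_fft=len_frame, hop_length=len_hop, return_complex=True)
spec = torch.view_as_real(spec)
rea = spec[..., 0]  # real part (works with or without a batch dimension)
imag = spec[..., 1]  # imaginary part
mag = torch.sqrt(torch.pow(rea, 2) + torch.pow(imag, 2))
pha = torch.atan2(imag, rea)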
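In PyTorch versions where torch.stft accepts return_complex=True, the magnitude and phase can also be read directly from the complex tensor. The sketch below makes several assumptions that are not in the original post: a synthetic noise signal stands in for the speech waveform, the frame and hop sizes are arbitrary, and a Hann window is passed explicitly because torch.stft defaults to a window of all ones while librosa.stft defaults to 'hann'.

```python
import torch

# Synthetic stand-in for a speech waveform: 1 second of noise at 16 kHz (assumed values).
mono = torch.randn(16000)
len_frame, len_hop = 512, 128

# Explicit Hann window (assumption: chosen to mirror librosa's default window).
window = torch.hann_window(len_frame)
spec = torch.stft(mono, n_fft=len_frame, hop_length=len_hop,
                  window=window, return_complex=True)  # complex tensor, shape (N, T)

mag = torch.abs(spec)    # magnitude
pha = torch.angle(spec)  # phase angle in radians

# Sanity check: magnitude and phase together recover the complex spectrogram.
recon = torch.polar(mag, pha)
print(spec.shape, mag.shape, pha.shape)
```

With these settings the result has 257 frequency bins and 126 frames, and torch.polar(mag, pha) should reproduce spec up to floating-point error.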

librosa.stft

stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=np.complex64, pad_mode='reflect')

Parameters
----------

y : np.ndarray [shape=(n,)], real-valued
    input signal

n_fft : int > 0 [scalar]
    length of the windowed signal after padding with zeros.
    The number of rows in the STFT matrix `D` is (1 + n_fft/2).
    The default value, n_fft=2048 samples, corresponds to a physical
    duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the
    default sample rate in librosa. This value is well adapted for music
    signals. However, in speech processing, the recommended value is 512,
    corresponding to 23 milliseconds at a sample rate of 22050 Hz.
    In any case, we recommend setting `n_fft` to a power of two for
    optimizing the speed of the fast Fourier transform (FFT) algorithm.

hop_length : int > 0 [scalar]
    number of audio samples between adjacent STFT columns.

    Smaller values increase the number of columns in `D` without
    affecting the frequency resolution of the STFT.

    If unspecified, defaults to `win_length / 4` (see below).

win_length : int <= n_fft [scalar]
    Each frame of audio is windowed by `window()` of length `win_length`
    and then padded with zeros to match `n_fft`.

    Smaller values improve the temporal resolution of the STFT (i.e. the
    ability to discriminate impulses that are closely spaced in time)
    at the expense of frequency resolution (i.e. the ability to discriminate
    pure tones that are closely spaced in frequency). This effect is known
    as the time-frequency localization tradeoff and needs to be adjusted
    according to the properties of the input signal `y`.

    If unspecified, defaults to ``win_length = n_fft``.

window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]
    Either:

    - a window specification (string, tuple, or number);
      see `scipy.signal.get_window`

    - a window function, such as `scipy.signal.hanning`

    - a vector or array of length `n_fft`


    Defaults to a raised cosine window ("hann"), which is adequate for
    most applications in audio signal processing.

    .. see also:: `filters.get_window`

center : boolean
    If `True`, the signal `y` is padded so that frame
    `D[:, t]` is centered at `y[t * hop_length]`.

    If `False`, then `D[:, t]` begins at `y[t * hop_length]`.

    Defaults to `True`,  which simplifies the alignment of `D` onto a
    time grid by means of `librosa.core.frames_to_samples`.
    Note, however, that `center` must be set to `False` when analyzing
    signals with `librosa.stream`.

    .. see also:: `stream`

dtype : numeric type
    Complex numeric type for `D`.  Default is single-precision
    floating-point complex (`np.complex64`).

pad_mode : string or function
    If `center=True`, this argument is passed to `np.pad` for padding
    the edges of the signal `y`. By default (`pad_mode="reflect"`),
    `y` is padded on both sides with its own reflection, mirrored around
    its first and last sample respectively.
    If `center=False`,  this argument is ignored.

    .. see also:: `np.pad`
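To make the roles of n_fft, hop_length, win_length, center and pad_mode concrete, here is a minimal NumPy sketch of an STFT built from those parameters. It is not librosa's implementation: the function name naive_stft is hypothetical, the window is fixed to a periodic Hann window, the padding is fixed to "reflect", and the test tone and sample rate are assumed values.

```python
import numpy as np


def naive_stft(y, n_fft=512, hop_length=None, win_length=None, center=True):
    """Illustrative STFT: Hann window, reflect padding, one-sided output."""
    win_length = n_fft if win_length is None else win_length
    hop_length = win_length // 4 if hop_length is None else hop_length

    # Periodic Hann window of length win_length, zero-padded on both sides to n_fft.
    n = np.arange(win_length)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / win_length)
    left = (n_fft - win_length) // 2
    window = np.pad(window, (left, n_fft - win_length - left))

    if center:
        # Pad so that frame t is centered at y[t * hop_length].
        y = np.pad(y, n_fft // 2, mode="reflect")

    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack(
        [y[t * hop_length: t * hop_length + n_fft] * window for t in range(n_frames)],
        axis=1,
    )
    # One-sided DFT of each frame: rows are frequency bins, columns are frames.
    return np.fft.rfft(frames, axis=0)


sr = 16000  # assumed sample rate
t = np.arange(sr) / sr
y = np.sin(2.0 * np.pi * 1000.0 * t)  # 1 kHz test tone
D = naive_stft(y, n_fft=512, hop_length=128)
peak_bin = int(np.abs(D[:, 10]).argmax())
print(D.shape, peak_bin)
```

The output has 1 + n_fft/2 rows, as the n_fft description states, and the 1 kHz tone peaks at bin 1000 / (16000 / 512) = 32.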

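librosa.stft represents a signal in the time-frequency domain by computing the discrete Fourier transform (DFT) over short overlapping windows. It returns a complex-valued matrix D, where np.abs(D) is the magnitude and np.angle(D) is the phase.
The magnitude and phase can be obtained as follows: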

import librosa
import numpy as np

spec = librosa.stft(mono, n_fft=len_frame, hop_length=len_hop)
mag = np.abs(spec)  # magnitude
pha = np.angle(spec)  # phase in radians

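Alternatively, use the helper function provided in librosa.core. Note that magphase returns the phase as a unit-modulus complex array (so that spec = mag * phase), not as angles, so np.angle is still needed to get the phase in radians.

spec = librosa.stft(mono, n_fft=len_frame, hop_length=len_hop)
mag, phase = librosa.core.magphase(spec)  # phase holds exp(1j * angle)
pha = np.angle(phase)  # phase angle in radians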
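The two styles compared in this post, a real/imaginary pair in the last dimension versus a single complex matrix, describe the same numbers. The NumPy sketch below checks this on a random complex matrix that stands in for an STFT result; the matrix sizes and variable names are assumptions, and the unit-modulus phase factor is computed by hand here rather than by calling librosa.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an STFT matrix D: random complex values, 257 bins x 10 frames (assumed sizes).
D = rng.standard_normal((257, 10)) + 1j * rng.standard_normal((257, 10))

# Complex-matrix style (as with librosa.stft): use the complex values directly.
mag_c = np.abs(D)
pha_c = np.angle(D)

# Real/imaginary pair style (as in the torch.stft layout with a last dimension of 2).
pair = np.stack([D.real, D.imag], axis=-1)  # shape (257, 10, 2)
rea, imag = pair[..., 0], pair[..., 1]
mag_p = np.sqrt(rea ** 2 + imag ** 2)
pha_p = np.arctan2(imag, rea)

# Phase kept as a unit-modulus complex factor, so that D == mag_c * phase_factor.
phase_factor = D / np.maximum(mag_c, np.finfo(float).tiny)

print(np.allclose(mag_c, mag_p), np.allclose(pha_c, pha_p))
```

Both routes give the same magnitude and phase angle, and taking np.angle of the unit-modulus factor recovers the same angles as well.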