1. 原理
短时傅立叶变换中的每一个小块
X
(
m
,
k
)
X(m,k)
X(m,k)(
m
m
m代表帧索引,
k
k
k代表频带索引),都表征了第
m
m
m 帧中,一个模长为
∣
X
(
m
,
k
)
∣
|X(m,k)|
∣X(m,k)∣, 相位为
φ
(
m
,
k
)
φ(m,k)
φ(m,k) ,数字频率为
F
p
r
e
d
=
k
N
∗
F
s
Fpred=kN∗Fs
Fpred=kN∗Fs 的复指数序列成分。然而受到短时傅立叶变换中频谱分辨率的影响,
s
t
f
t
stft
stft 无法分析出原始信号中数字角频率位于频带范围中的频率成分。举个例子,有一个520hz,16k采样率的正弦信号。使用帧长为64(频率分辨率250hz),帧偏移32个采样点的
s
t
f
t
stft
stft来分析,频段为
[
0
250
]
[
250
500
]
[
500
750
]
.
.
.
.
.
.
.
[0~250][250~500][500~750].......
[0 250][250 500][500 750].......;那么520hz的频率分量体现在语谱图上,仅仅在[500~750]这个频段区间。
图2给出stft变换中的一个示意图来理解相位声码器的大致原理。X(m,k)是第m帧,第k个频带对应的正弦分量;频带对应频率的计算方式:
F
p
r
e
d
(
k
)
=
k
F
s
N
F_{pred}(k)=\frac{kF_s}{N}
Fpred(k)=NkFs
其中,
k
k
k代表频带索引,
F
s
F_s
Fs代表音频的采样频率,
N
N
N代表
s
t
f
t
stft
stft的分析帧长。对应的预估角频率
ω
p
r
e
d
\omega_{pred}
ωpred为
F
p
r
e
d
∗
2
π
F_{pred}*2\pi
Fpred∗2π;体现在图2中,即就是两帧中红色和绿色对应的角频率,这个频率是不精确有误差的,我们需要对它进行精确化处理。
X
(
m
+
1
,
k
)
X(m+1,k)
X(m+1,k)对应的是第
m
+
1
m+1
m+1帧,第
k
k
k个频带对应的正弦分量。两帧起始点的采样点间隔即就是stft中的帧偏移
H
a
H_a
Ha,该间隔对应的时间间隔即就是
H
a
/
F
s
(
s
)
H_a / F_s(s)
Ha/Fs(s)。通过stft,还可以得到m和m+1帧中,k频段正弦分量对应的初始相位,分别表示为
φ
1
=
φ
(
m
,
k
)
\varphi_1=\varphi(m,k)
φ1=φ(m,k)和
φ
2
=
φ
(
m
+
1
,
k
)
\varphi_2=\varphi(m+1,k)
φ2=φ(m+1,k);通过预估的角频率
ω
\omega
ω以及m帧的初始相位
φ
1
\varphi_1
φ1,我们可以估计出m+1帧的起始时刻(
t
2
t_2
t2)的相位,即就是图2中蓝色椭圆内红色线在
t
2
t_2
t2时刻的相位。可以发现如果预估的角频率和真实的角频率一致的话,t2时刻预估的相位
φ
p
r
e
d
\varphi_{pred}
φpred应该和m+1帧的初始相位
φ
2
\varphi_2
φ2保持一致(差值是
2
×
π
2×\pi
2×π的正整数倍),即就是蓝色椭圆内,红色点和绿色点应该重合。定义t2时刻的预测相位和真实相位之间的差值为
φ
e
r
r
=
φ
2
−
φ
p
r
e
d
\varphi_{err}=\varphi_2 - \varphi_{pred}
φerr=φ2−φpred,该差值由误差角频率(真实角频率-预估角频率)在帧偏移这段时间的转动相位。
φ
e
r
r
=
(
ω
r
e
a
l
−
ω
p
r
e
d
)
∗
Δ
t
\varphi_{err} = (\omega_{real} - \omega_{pred}) * \Delta_t
φerr=(ωreal−ωpred)∗Δt
因此,真实角频率即就是预估角频率加上误差角频率
ω
r
e
a
l
=
ω
p
r
e
d
+
φ
e
r
r
Δ
t
\omega_{real} = \omega_{pred} + \frac{\varphi_{err}}{\Delta_t}
ωreal=ωpred+Δtφerr
2. 仿真
2.1.参数
参数 | 值 |
---|---|
帧长 | 64 |
帧偏移 | 32 |
真实频率(hz) | 520 |
分析音频文件 |
# -*- coding: utf-8 -*-
# @Time : 2021-05-13
# @Author : Zhangjieyang
# @Email : jieyangzhang@outlook.com
# @File : phase_vocoder_demo.py
# @Software: vscode
import numpy as np
import matplotlib.pyplot as plt
from nara_wpe.utils import stft
import soundfile as sf
def main():
N = 64 # 分帧长度
shift = 32 # 偏移
stft_options = dict(size=N, shift=shift)
audio_name = '520hz_16k.wav'
y, freq = sf.read(audio_name)
Y = stft(y, **stft_options, window=np.ones)
magnitude = np.abs(Y) # 模长
phase = np.arctan2(Y.imag, Y.real) # 相位
k = 16
predict_freq_list = []
for k in range(Y.shape[0]):
base = np.argmax(magnitude[k])
base_freq = base / N * freq
previous_phase = phase[k - 1][base]
current_phase = phase[k][base]
pred_phase = previous_phase + shift / freq * (2 * np.pi * base_freq)
err_phase = (current_phase - pred_phase) % (2 * np.pi)
# if err_phase > np.pi:
# err_phase = 2 * np.pi - err_phase
estimate_freq = base_freq + err_phase / (shift / freq) / (2 * np.pi)
predict_freq_list.append(estimate_freq)
print('frame:%d -- base freq:%d predict phase:%f, current phase:%f, previous phase:%f, err phase:%f, predict freq:%f'
% (k, base_freq, pred_phase, current_phase, previous_phase, err_phase, estimate_freq))
# 计算预估频率均值
average_predict_freq = np.average(predict_freq_list)
print(average_predict_freq)
# 绘制语谱图
half = N / 2 + 1
ylocs = np.linspace(0, half, int(half))
freq_solution = freq / 2 / half
plt.yticks(ylocs, ["%.02f" % l for l in (ylocs * freq_solution)])
plt.ylabel('freq (hz)')
plt.xlabel("frame")
plt.imshow(np.transpose(np.abs(Y)), origin="lower", cmap="jet", aspect="auto", interpolation="none")
plt.title(audio_name)
plt.rcParams['ytick.direction'] = 'in'
plt.show()
if __name__ == "__main__":
main()
2.3.结果总结
截取一段帧区域的预测结果,有:
frame:29987 -- base freq:500 predict phase:7.456728, current phase:1.432367, previous phase:1.173543, err phase:0.258823, predict freq:520.596510
frame:29988 -- base freq:500 predict phase:7.715552, current phase:1.693486, previous phase:1.432367, err phase:0.261119, predict freq:520.779217
frame:29989 -- base freq:500 predict phase:7.976671, current phase:1.954352, previous phase:1.693486, err phase:0.260866, predict freq:520.759074
frame:29990 -- base freq:500 predict phase:8.237537, current phase:2.212500, previous phase:1.954352, err phase:0.258147, predict freq:520.542723
frame:29991 -- base freq:500 predict phase:8.495685, current phase:2.466256, previous phase:2.212500, err phase:0.253757, predict freq:520.193311
frame:29992 -- base freq:500 predict phase:8.749441, current phase:2.715126, previous phase:2.466256, err phase:0.248870, predict freq:519.804421
frame:29993 -- base freq:500 predict phase:8.998311, current phase:2.959832, previous phase:2.715126, err phase:0.244706, predict freq:519.473091
frame:29994 -- base freq:500 predict phase:9.243017, current phase:-3.081169, previous phase:2.959832, err phase:0.242184, predict freq:519.272386
frame:29995 -- base freq:500 predict phase:3.202016, current phase:-2.839341, previous phase:-3.081169, err phase:0.241828, predict freq:519.244056
frame:29996 -- base freq:500 predict phase:3.443844, current phase:-2.595632, previous phase:-2.839341, err phase:0.243710, predict freq:519.393789
frame:29997 -- base freq:500 predict phase:3.687553, current phase:-2.348190, previous phase:-2.595632, err phase:0.247442, predict freq:519.690841
frame:29998 -- base freq:500 predict phase:3.934996, current phase:-2.095976, previous phase:-2.348190, err phase:0.252213, predict freq:520.070504
frame:29999 -- base freq:500 predict phase:4.187209, current phase:-1.839067, previous phase:-2.095976, err phase:0.256909, predict freq:520.444147
预估的均值为:519.9993333555548hz和原始信号基本一致。
3.参考文献
A Review of Time-Scale Modification of Music Signals—5.3