语音增强--------------子空间算法
原理介绍
子空间方法的原理是将观测信号的向量空间分解为信号子空间和噪声子空间,通过消除噪声子空间并保留信号子空间从而估计出干净语音,子空间分解的过程是对带噪语音信号做KLT变换,然后设置一个门限阈值,利用KLT系数的稀疏性,将噪声的KLT系数置为0,之后通过逆KLT变换得到增强后的语音。
若观测噪声
v
\mathbf{v}
v与干净语音
x
\mathbf{x}
x无关,则带噪语音可以表示为
y
=
x
+
v
\mathbf{y}=\mathbf{x}+\mathbf{v}
y=x+v
假定对干净语音
x
\mathbf{x}
x的线性估计为
x
^
=
H
y
\mathbf{\hat{x}}=\mathbf{Hy}
x^=Hy,此处
H
\mathbf{H}
H为
K
×
K
K\times K
K×K的滤波器矩阵。
那么滤波误差可以写为
ε
=
x
^
−
x
=
(
H
−
I
)
⋅
x
+
H
⋅
v
=
ε
x
+
ε
v
\mathbf{\varepsilon }=\widehat{\mathbf{x}}-\mathbf{x}=(\mathbf{H}-\mathbf{I})\cdot \mathbf{x}+\mathbf{H}\cdot \mathbf{v}={{\mathbf{\varepsilon }}_{\mathbf{x}}}+{{\mathbf{\varepsilon }}_{\mathbf{v}}}
ε=x
−x=(H−I)⋅x+H⋅v=εx+εv
其中
ε
x
{{\mathbf{\varepsilon }}_{\mathbf{x}}}
εx为语音失真,
ε
v
{{\mathbf{\varepsilon }}_{\mathbf{v}}}
εv为残留噪声,定义
语音失真能量
ε
x
2
=tr
(
E
[
ε
x
ε
x
T
]
)
\mathbf{\varepsilon }_{\mathbf{x}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{x}}}\mathbf{\varepsilon }_{\mathbf{x}}^{T} \right] \right)
εx2=tr(E[εxεxT])
残留噪声能量
ε
v
2
=tr
(
E
[
ε
v
ε
v
T
]
)
\mathbf{\varepsilon }_{\mathbf{v}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{v}}}\mathbf{\varepsilon }_{\mathbf{v}}^{T} \right] \right)
εv2=tr(E[εvεvT])
为了得到最优线性滤波器,可以寻求解决如下有约束的优化问题:
{
min
H
ε
x
2
s
.
t
.
1
K
ε
v
2
≤
σ
2
\left\{ \begin{aligned} & \underset{\mathbf{H}}{\mathop{\min }}\,\ \ \ \mathbf{\varepsilon }_{\mathbf{x}}^{2} \\ & s.t.\ \ \ \ \frac{1}{K}\mathbf{\varepsilon }_{\mathbf{v}}^{2}\le {{\sigma }^{2}} \\ \end{aligned} \right.
⎩⎪⎨⎪⎧Hmin εx2s.t. K1εv2≤σ2
其中
σ
2
{{\sigma }^{2}}
σ2为正实数,代表噪声的容忍度。
最优解可以近似写作
H
opt
=
R
x
(
R
x
+
μ
R
v
)
−
1
{{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}}
Hopt=Rx(Rx+μRv)−1
此处
R
x
{{\mathbf{R}}_{\mathbf{x}}}
Rx和
R
v
{{\mathbf{R}}_{\mathbf{v}}}
Rv分别是干净语音和噪声的协方差矩阵,
μ
\mu
μ是拉格朗日乘子。参数
μ
\mu
μ由噪声容忍度 确定,其控制着语音失真和残留噪声之间的权衡。
当
μ
=
1
\mu \text{=}1
μ=1时,由于干净语音和噪声的互不相关特性,
H
opt
{{\mathbf{H}}_{\text{opt}}}
Hopt变为维纳意义上的最优解
H
opt
=
R
x
R
y
−
1
{{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}\mathbf{R}_{\mathbf{y}}^{-1}
Hopt=RxRy−1,这意味着,维纳滤波器是子空间算法的一种特殊情况。
对
R
x
{{\mathbf{R}}_{\mathbf{x}}}
Rx进行特征分解
R
x
=
U
Σ
x
U
T
{{\mathbf{R}}_{\mathbf{x}}}=\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\mathbf{U}}^{T}}
Rx=UΣxUT
那么
H
opt
=
R
x
(
R
x
+
μ
R
v
)
−
1
=
U
Σ
x
(
Σ
x
+
μ
U
T
R
v
U
)
−
1
U
T
\begin{aligned} {{\mathbf{H}}_{\text{opt}}}& ={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}} \\ & \text{=}\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} \right)}^{-1}}{{\mathbf{U}}^{T}} \end{aligned}
Hopt=Rx(Rx+μRv)−1=UΣx(Σx+μUTRvU)−1UT
在文献[2]中,
U
T
R
v
U
{{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}
UTRvU 被近似为对角阵
Σ
v
=
d
i
a
g
{
E
[
∣
u
1
T
v
∣
]
E
[
∣
u
2
T
v
∣
]
⋯
E
[
∣
u
K
T
v
∣
]
}
≈
U
T
R
v
U
{{\mathbf{\Sigma }}_{\mathbf{v}}}=diag\left\{ \begin{matrix} E\left[ \left| \mathbf{u}_{1}^{T}\mathbf{v} \right| \right] & E\left[ \left| \mathbf{u}_{2}^{T}\mathbf{v} \right| \right] & \cdots & E\left[ \left| \mathbf{u}_{K}^{T}\mathbf{v} \right| \right] \\ \end{matrix} \right\}\approx {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}
Σv=diag{E[∣∣u1Tv∣∣]E[∣∣u2Tv∣∣]⋯E[∣∣uKTv∣∣]}≈UTRvU
其中
u
k
T
\mathbf{u}_{k}^{T}
ukT是
R
x
{{\mathbf{R}}_{\mathbf{x}}}
Rx的第
k
k
k个特征向量,基于上述近似,可以得到次优估计器
H
opt
≈
U
Σ
x
(
Σ
x
+
μ
Σ
v
)
−
1
U
T
{{\mathbf{H}}_{\text{opt}}}\approx \mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}}
Hopt≈UΣx(Σx+μΣv)−1UT
因此,子空间算法先将含噪的观测语音变换到KLT域,得到KLT系数
U
T
y
{{\mathbf{U}}^{T}}\mathbf{y}
UTy;根据干净语音在KLT域的稀疏性对
U
T
y
{{\mathbf{U}}^{T}}\mathbf{y}
UTy进行加权处理,得到稀疏化的干净语音KLT系数估计值
Σ
x
(
Σ
x
+
μ
Σ
v
)
−
1
U
T
{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}}
Σx(Σx+μΣv)−1UT;最后通过逆KLT变换得到增强后的语音
H
opt
y
{{\mathbf{H}}_{\text{opt}}}\mathbf{y}
Hopty。
在实际的仿真中,即使对于白噪声,
U
T
R
v
U
{{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}
UTRvU不是严格的对角矩阵;对于色噪声,
U
T
R
v
U
{{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}
UTRvU更不会是一个对角矩阵。
如果
R
x
{{\mathbf{R}}_{\mathbf{x}}}
Rx是对称矩阵,
R
v
{{\mathbf{R}}_{\mathbf{v}}}
Rv是正定矩阵,那么存在矩阵
V
\mathbf{V}
V能够同时对角化两个协方差矩阵:
V
T
R
x
V
=
Λ
{{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}=\mathbf{\Lambda }
VTRxV=Λ
V
T
R
v
V
=
I
{{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V}=\mathbf{I}
VTRvV=I
其中
Λ
\mathbf{\Lambda }
Λ和
V
\mathbf{V}
V分别是
Σ
=
R
v
−
1
R
x
\mathbf{\Sigma }=\mathbf{R}_{\mathbf{v}}^{-1}{{\mathbf{R}}_{\mathbf{x}}}
Σ=Rv−1Rx特征值矩阵和特征向量矩阵,即
Σ
V
=
V
Λ
\mathbf{\Sigma V}\text{=}\mathbf{V\Lambda }
ΣV=VΛ
此时
Σ
\mathbf{\Sigma }
Σ的特征值均是实数,利用广义特征值分解,可以得到针对色噪声情况的最优线性估计器
H
opt
=
R
x
[
V
−
T
(
V
T
R
x
V
+
μ
V
T
R
v
V
)
V
−
1
]
−
1
=
R
x
V
(
Λ
+
μ
I
)
−
1
V
T
=
V
−
T
Λ
(
Λ
+
μ
I
)
−
1
V
T
\begin{aligned} {{\mathbf{H}}_{\text{opt}}}&={{\mathbf{R}}_{\mathbf{x}}}{{\left[ {{\mathbf{V}}^{-T}}\left( {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}+\mu {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V} \right){{\mathbf{V}}^{-1}} \right]}^{-1}} \\ & ={{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \\ & \text{=}{{\mathbf{V}}^{-T}}\mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \end{aligned}
Hopt=Rx[V−T(VTRxV+μVTRvV)V−1]−1=RxV(Λ+μI)−1VT=V−TΛ(Λ+μI)−1VT
首先,对含噪声语音
y
\mathbf{y}
y作 变换
V
T
{{\mathbf{V}}^{T}}
VT,通过一个增益函数
Λ
(
Λ
+
μ
I
)
−
1
\mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}
Λ(Λ+μI)−1对变换系数
U
T
y
{{\mathbf{U}}^{T}}\mathbf{y}
UTy进行调整,对调整后的系数做
V
T
{{\mathbf{V}}^{T}}
VT逆变换。
仿真参数为
参数名称 | 参数值 |
---|---|
信噪比 | 5dB |
采样率 | 16KHz |
从图中可以看出,在经过子空间算法处理之后,噪声被很大程度滤除了,达到了良好的去噪效果。此外,从图也可以看出,增强后的语音的一些细节方面也被滤除了,这也是子空间算法存在的一些缺点。
主函数main.m
clear;
close all;
clc;
%% 读入数据
[signal,~]=audioread('clean.wav'); %读入干净语音
[noise,fs]=audioread('noise.wav'); %读入噪声
N=3*fs; %选取3秒的语音
signal=signal(1:N);
noise=noise(1:N);
N=length(signal);
t=(0:N-1)/fs;
SNR=5; %信噪比大小
noise=noise/norm(noise,2).*10^(-SNR/20)*norm(signal);
x=signal+noise; %产生固定信噪比的带噪语音
audiowrite('mixed.wav',x,fs); %将混合的语音写入
enhanced_speech=klt('mixed.wav','enhanced_speech.wav');
figure(1)
subplot(321);
plot(t,signal);ylim([-1.5,1.5]);title('干净语音');xlabel('时间/s');ylabel('幅度');
subplot(323);
plot(t,x);ylim([-1.5,1.5]);title('带噪语音');xlabel('时间/s');ylabel('幅度');
subplot(325);
plot(t,real(enhanced_speech));ylim([-1.5,1.5]);title('子空间法增强后的语音');xlabel('时间/s');ylabel('幅度');
subplot(322);
spectrogram(signal,256,128,256,16000,'yaxis');
subplot(324);
spectrogram(x,256,128,256,16000,'yaxis');
subplot(326);
spectrogram(enhanced_speech,256,128,256,16000,'yaxis');
klt.m
function [enhanced_speech, fs, nbits] =klt( filename, outfile)
%
% Implements the generalized subspace algorithm with embedded pre-whitening [1].
% Makes no assumptions about the nature of the noise (white vs. colored).
%
% Usage: klt(noisyFile, outputFile)
%
% infile - noisy speech file in .wav format
% outputFile - enhanced output file in .wav format
%
% The covariance matrix estimation is done using a sine-taper method [1].
%
% Example call: klt('sp04_babble_sn10.wav','out_klt.wav');
%
% References:
% [1] Hu, Y. and Loizou, P. (2003). A generalized subspace approach for
% enhancing speech corrupted by colored noise. IEEE Trans. on Speech
% and Audio Processing, 11, 334-341.
%
% Authors: Yi Hu and Philipos C. Loizou
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
%-------------------------------------------------------------------------
if nargin<2
fprintf('Usage: klt(noisyfile.wav,outFile.wav) \n\n');
return;
end
L=16; %number of tapers used to estimate the covariance matrix in
% multi-window method
vad_thre= 1.2; % VAD threshold value
mu_vad= 0.98; % mu to use in smoothing of the Rn matrix
% --- initialize mu values -----------------
%
mu_max=5;
mu_toplus= 1; % mu value with SNRlog>= 20dB
mu_tominus= mu_max; % mu value with SNRlog< -5dB
mu_slope= (mu_tominus- mu_toplus )/ 25;
mu0= mu_toplus+ 20* mu_slope;
%===========================================
[noisy_speech, fs]= audioread(filename);
aInfo = audioinfo(filename);
nbits = aInfo.BitsPerSample; % 16 bits resolution
enhanced_speech=zeros(1,length(noisy_speech));
subframe_dur= 4; %subframe length is 4 ms
len= floor( fs* subframe_dur/ 1000);
P= len;
frame_dur= 32; % frame duration in msecs
N= frame_dur* fs/ 1000;
Nover2= N/ 2; % window overlap in 50% of frame size
K= N;
frame_window= hamming( N);
subframe_window= hamming( P);
% ==== Esimate noise covariance matrix ------------
%
L120=floor( 120* fs/ 1000); % assume initial 120ms is noise only
noise= noisy_speech( 1: L120);
if L== 1
noise_autoc= xcorr( noise, P- 1, 'biased');
% from -(len- 1) to (len- 1)
% obtain the autocorrelation functions
Rn= toeplitz( noise_autoc( P: end));
% form a Toeplitz matrix to obtain the noise signal covariance matrix
else
tapers= sine_taper( L, L120); % generate sine tapers
% [tapers, v]= dpss( L120, 4); % generate slepian tapers
Rn= Rest_mt( noise, P, tapers);
end
iRn= inv( Rn); % inverse Rn
% ===================================================
n_start= 1;
In= eye(len);
Nframes= floor( length( noisy_speech)/ (N/ 2))- 1; % number of frames
x_overlap= zeros( Nover2, 1);
tapers1= sine_taper( L, N); % generate sine tapers
% [tapers1, v]= dpss( N, 4); % generate slepian tapers
%=============================== Start Processing =====================
%
for n=1: Nframes
noisy= noisy_speech( n_start: n_start+ N- 1);
if L== 1
noisy_autoc= xcorr( noisy, P- 1, 'biased');
Ry= toeplitz( noisy_autoc( P: 2* P- 1));
else
Ry= Rest_mt( noisy, P, tapers1); % use sine tapers to estimate the cov matrix
end
% use simple VAD to update the noise cov matrix, Rn
vad_ratio= Ry(1,1)/ Rn(1,1);
if (vad_ratio<= vad_thre) % noise dominant
Rn= mu_vad* Rn+ (1- mu_vad)* Ry;
iRn= inv( Rn);
end
%================
iRnRx= iRn* Ry- In; % compute Rn^-1 Rx=Rn^-1- I
[V, D]= eig( iRnRx); % EVD
iV= inv( V);
dRx= max( diag( D), 0); % estimated eigenvalues of Rx
SNR= sum( dRx)/ len;
SNRlog( n)= 10* log10( SNR+ eps);
if SNRlog( n)>= 20
mu( n)= mu_toplus; %actually this corresponds to wiener filtering
elseif ( SNRlog( n)< 20) && ( SNRlog( n)>= -5)
mu( n)= mu0- SNRlog( n)* mu_slope;
else
mu( n)= mu_tominus;
end
gain_vals= dRx./( dRx+ mu( n));
G= diag( gain_vals);
H= iV'* G* V';
% first step of synthesis for subframe
sub_start= 1;
sub_overlap= zeros( P/2, 1);
for m= 1: (2*N/P- 1)
sub_noisy= noisy( sub_start: sub_start+ P- 1);
enhanced_sub_tmp= (H* sub_noisy).* subframe_window;
enhanced_sub( sub_start: sub_start+ P/2- 1)= ...
enhanced_sub_tmp( 1: P/2)+ sub_overlap;
sub_overlap= enhanced_sub_tmp( P/2+1: P);
sub_start= sub_start+ P/2;
end
enhanced_sub( sub_start: sub_start+ P/2- 1)= sub_overlap;
% ===============
xi= enhanced_sub'.* frame_window;
enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap+ xi( 1: Nover2);
x_overlap= xi( Nover2+ 1: N);
n_start= n_start+ Nover2;
end
enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap;
audiowrite( outfile, enhanced_speech, fs, 'BitsPerSample', nbits);
end
%=========================== E N D===============================================
function tapers= sine_taper( L, N)
% this function is used to generate the sine tapers proposed by Riedel et
% al, IEEE Transactions on Signal Processing, pp. 188- 195, Jan. 1995
% there are two parameters, 'L' is the number of the sine tapers generated,
% and 'N' is the length of each sine taper; the returned value 'tapers' is
% a N-by-L matrix with each column being sine taper
tapers= zeros( N, L);
for index= 1: L
tapers( :, index)= sqrt( 2/ (N+ 1))* sin (pi* index* (1: N)'/ (N+ 1));
end
end
function R_mt= Rest_mt( x, p, W)
% multi-taper method for covariance matrix estimation, we have 'x' of
% length N, and estimate p-order covariance matrix.
% this estimator is:
% 1): quadratic in the data
% 2): modulation covariant
% 3): nonnegative definite Toeplitz
% 'x' must be a vector of length N,
% 'p' is the order of the covariance matrix,
% 'W' is the N-by-L taper matrix, each column of which is a taper.
%
%
% Reference: L. T. McWhorter and L. L. Scharf, "Multiwindow estimator of correlation"
% IEEE Trans. on Signal processing, Vol. 46, No. 2, Feb. 1998
%====================
x= x( :); % make data a column vector
[N, L]= size( W);
% tapers is a N-by-L matrix, and has the form of [w_1 w_2 ... w_L]
% first step is to compute R'= sum_{i=1}^{L}(w_i y)(w_i y)*, which is a
% multitaper method: weight x with each taper w_i, and compute outer
% product, and add them together.
x_rep= x( :, ones( 1, L));
% make x repeat L times, i.e., [x x ...x]
x_w= W.* x_rep;
% now each column of x_w is the weighted x
R1= x_w* x_w'; % sum of outer product
for k= 1: p
r( k)= sum( diag( R1, k- 1));
end
R_mt= toeplitz( r);
end
关于语音及噪声文件,具体请参考:语音信号处理常用语料库下载地址
参考文献
[1]Yi Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” in IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, pp. 334-341, July 2003, doi: 10.1109/TSA.2003.814458.
[2]U. Mittal and N. Phamdo, “Signal/noise KLT based approach for enhancing speech degraded by colored noise,” in IEEE Transactions on Speech and Audio Processing, vol. 8, no. 2, pp. 159-167, March 2000, doi: 10.1109/89.824700.
[3]PhiliposC.Loizou. 语音增强:理论与实践:theory and practice[M]// 语音增强:理论与实践:theory and practice. 电子科技大学出版社, 2012.
[4]Loizou P C . Speech Enhancement: Theory and Practice[M]. CRC Press, Inc. 2007.