语音增强--子空间算法及MATLAB实现

语音增强--------------子空间算法

原理介绍

子空间方法的原理是将观测信号的向量空间分解为信号子空间和噪声子空间,通过消除噪声子空间并保留信号子空间从而估计出干净语音,子空间分解的过程是对带噪语音信号做KLT变换,然后设置一个门限阈值,利用KLT系数的稀疏性,将噪声的KLT系数置为0,之后通过逆KLT变换得到增强后的语音。
若观测噪声 v \mathbf{v} v与干净语音 x \mathbf{x} x无关,则带噪语音可以表示为
y = x + v \mathbf{y}=\mathbf{x}+\mathbf{v} y=x+v
假定对干净语音 x \mathbf{x} x的线性估计为 x ^ = H y \mathbf{\hat{x}}=\mathbf{Hy} x^=Hy,此处 H \mathbf{H} H K × K K\times K K×K的滤波器矩阵。
那么滤波误差可以写为
ε = x ^ − x = ( H − I ) ⋅ x + H ⋅ v = ε x + ε v \mathbf{\varepsilon }=\widehat{\mathbf{x}}-\mathbf{x}=(\mathbf{H}-\mathbf{I})\cdot \mathbf{x}+\mathbf{H}\cdot \mathbf{v}={{\mathbf{\varepsilon }}_{\mathbf{x}}}+{{\mathbf{\varepsilon }}_{\mathbf{v}}} ε=x x=(HI)x+Hv=εx+εv
其中 ε x {{\mathbf{\varepsilon }}_{\mathbf{x}}} εx为语音失真, ε v {{\mathbf{\varepsilon }}_{\mathbf{v}}} εv为残留噪声,定义
语音失真能量 ε x 2 =tr ( E [ ε x ε x T ] ) \mathbf{\varepsilon }_{\mathbf{x}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{x}}}\mathbf{\varepsilon }_{\mathbf{x}}^{T} \right] \right) εx2=tr(E[εxεxT])
残留噪声能量 ε v 2 =tr ( E [ ε v ε v T ] ) \mathbf{\varepsilon }_{\mathbf{v}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{v}}}\mathbf{\varepsilon }_{\mathbf{v}}^{T} \right] \right) εv2=tr(E[εvεvT])
为了得到最优线性滤波器,可以寻求解决如下有约束的优化问题:
{ min ⁡ H       ε x 2 s . t .      1 K ε v 2 ≤ σ 2 \left\{ \begin{aligned} & \underset{\mathbf{H}}{\mathop{\min }}\,\ \ \ \mathbf{\varepsilon }_{\mathbf{x}}^{2} \\ & s.t.\ \ \ \ \frac{1}{K}\mathbf{\varepsilon }_{\mathbf{v}}^{2}\le {{\sigma }^{2}} \\ \end{aligned} \right. Hmin   εx2s.t.    K1εv2σ2
其中 σ 2 {{\sigma }^{2}} σ2为正实数,代表噪声的容忍度。
最优解可以近似写作
H opt = R x ( R x + μ R v ) − 1 {{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}} Hopt=Rx(Rx+μRv)1
此处 R x {{\mathbf{R}}_{\mathbf{x}}} Rx R v {{\mathbf{R}}_{\mathbf{v}}} Rv分别是干净语音和噪声的协方差矩阵, μ \mu μ是拉格朗日乘子。参数 μ \mu μ由噪声容忍度 确定,其控制着语音失真和残留噪声之间的权衡。
μ = 1 \mu \text{=}1 μ=1时,由于干净语音和噪声的互不相关特性, H opt {{\mathbf{H}}_{\text{opt}}} Hopt变为维纳意义上的最优解 H opt = R x R y − 1 {{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}\mathbf{R}_{\mathbf{y}}^{-1} Hopt=RxRy1,这意味着,维纳滤波器是子空间算法的一种特殊情况。
R x {{\mathbf{R}}_{\mathbf{x}}} Rx进行特征分解
R x = U Σ x U T {{\mathbf{R}}_{\mathbf{x}}}=\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\mathbf{U}}^{T}} Rx=UΣxUT
那么
H opt = R x ( R x + μ R v ) − 1 = U Σ x ( Σ x + μ U T R v U ) − 1 U T \begin{aligned} {{\mathbf{H}}_{\text{opt}}}& ={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}} \\ & \text{=}\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} \right)}^{-1}}{{\mathbf{U}}^{T}} \end{aligned} Hopt=Rx(Rx+μRv)1=UΣx(Σx+μUTRvU)1UT
在文献[2]中, U T R v U {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} UTRvU 被近似为对角阵
Σ v = d i a g { E [ ∣ u 1 T v ∣ ] E [ ∣ u 2 T v ∣ ] ⋯ E [ ∣ u K T v ∣ ] } ≈ U T R v U {{\mathbf{\Sigma }}_{\mathbf{v}}}=diag\left\{ \begin{matrix} E\left[ \left| \mathbf{u}_{1}^{T}\mathbf{v} \right| \right] & E\left[ \left| \mathbf{u}_{2}^{T}\mathbf{v} \right| \right] & \cdots & E\left[ \left| \mathbf{u}_{K}^{T}\mathbf{v} \right| \right] \\ \end{matrix} \right\}\approx {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} Σv=diag{E[u1Tv]E[u2Tv]E[uKTv]}UTRvU
其中 u k T \mathbf{u}_{k}^{T} ukT R x {{\mathbf{R}}_{\mathbf{x}}} Rx的第 k k k个特征向量,基于上述近似,可以得到次优估计器
H opt ≈ U Σ x ( Σ x + μ Σ v ) − 1 U T {{\mathbf{H}}_{\text{opt}}}\approx \mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}} HoptUΣx(Σx+μΣv)1UT
因此,子空间算法先将含噪的观测语音变换到KLT域,得到KLT系数 U T y {{\mathbf{U}}^{T}}\mathbf{y} UTy;根据干净语音在KLT域的稀疏性对 U T y {{\mathbf{U}}^{T}}\mathbf{y} UTy进行加权处理,得到稀疏化的干净语音KLT系数估计值 Σ x ( Σ x + μ Σ v ) − 1 U T {{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}} Σx(Σx+μΣv)1UT;最后通过逆KLT变换得到增强后的语音 H opt y {{\mathbf{H}}_{\text{opt}}}\mathbf{y} Hopty
在实际的仿真中,即使对于白噪声, U T R v U {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} UTRvU不是严格的对角矩阵;对于色噪声, U T R v U {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} UTRvU更不会是一个对角矩阵。
如果 R x {{\mathbf{R}}_{\mathbf{x}}} Rx是对称矩阵, R v {{\mathbf{R}}_{\mathbf{v}}} Rv是正定矩阵,那么存在矩阵 V \mathbf{V} V能够同时对角化两个协方差矩阵:
V T R x V = Λ {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}=\mathbf{\Lambda } VTRxV=Λ
V T R v V = I {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V}=\mathbf{I} VTRvV=I
其中 Λ \mathbf{\Lambda } Λ V \mathbf{V} V分别是 Σ = R v − 1 R x \mathbf{\Sigma }=\mathbf{R}_{\mathbf{v}}^{-1}{{\mathbf{R}}_{\mathbf{x}}} Σ=Rv1Rx特征值矩阵和特征向量矩阵,即
Σ V = V Λ \mathbf{\Sigma V}\text{=}\mathbf{V\Lambda } ΣV=VΛ
此时 Σ \mathbf{\Sigma } Σ的特征值均是实数,利用广义特征值分解,可以得到针对色噪声情况的最优线性估计器
H opt = R x [ V − T ( V T R x V + μ V T R v V ) V − 1 ] − 1 = R x V ( Λ + μ I ) − 1 V T = V − T Λ ( Λ + μ I ) − 1 V T \begin{aligned} {{\mathbf{H}}_{\text{opt}}}&={{\mathbf{R}}_{\mathbf{x}}}{{\left[ {{\mathbf{V}}^{-T}}\left( {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}+\mu {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V} \right){{\mathbf{V}}^{-1}} \right]}^{-1}} \\ & ={{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \\ & \text{=}{{\mathbf{V}}^{-T}}\mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \end{aligned} Hopt=Rx[VT(VTRxV+μVTRvV)V1]1=RxV(Λ+μI)1VT=VTΛ(Λ+μI)1VT
首先,对含噪声语音 y \mathbf{y} y作 变换 V T {{\mathbf{V}}^{T}} VT,通过一个增益函数 Λ ( Λ + μ I ) − 1 \mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}} Λ(Λ+μI)1对变换系数 U T y {{\mathbf{U}}^{T}}\mathbf{y} UTy进行调整,对调整后的系数做 V T {{\mathbf{V}}^{T}} VT逆变换。
仿真参数为

参数名称参数值
信噪比5dB
采样率16KHz

在这里插入图片描述
从图中可以看出,在经过子空间算法处理之后,噪声被很大程度滤除了,达到了良好的去噪效果。此外,从图也可以看出,增强后的语音的一些细节方面也被滤除了,这也是子空间算法存在的一些缺点。
主函数main.m

clear;
close all;
clc;
%% 读入数据
[signal,~]=audioread('clean.wav');      %读入干净语音
[noise,fs]=audioread('noise.wav');       %读入噪声
N=3*fs;                                              %选取3秒的语音
signal=signal(1:N);
noise=noise(1:N);
N=length(signal);
t=(0:N-1)/fs;
SNR=5;                 %信噪比大小
noise=noise/norm(noise,2).*10^(-SNR/20)*norm(signal);     
x=signal+noise;          %产生固定信噪比的带噪语音
audiowrite('mixed.wav',x,fs);          %将混合的语音写入
enhanced_speech=klt('mixed.wav','enhanced_speech.wav');  
figure(1)
subplot(321);
plot(t,signal);ylim([-1.5,1.5]);title('干净语音');xlabel('时间/s');ylabel('幅度');
subplot(323);
plot(t,x);ylim([-1.5,1.5]);title('带噪语音');xlabel('时间/s');ylabel('幅度');
subplot(325);
plot(t,real(enhanced_speech));ylim([-1.5,1.5]);title('子空间法增强后的语音');xlabel('时间/s');ylabel('幅度');
subplot(322);
spectrogram(signal,256,128,256,16000,'yaxis');
subplot(324);
spectrogram(x,256,128,256,16000,'yaxis');
subplot(326);
spectrogram(enhanced_speech,256,128,256,16000,'yaxis');

klt.m

function [enhanced_speech, fs, nbits] =klt( filename, outfile)
%
%  Implements the generalized subspace algorithm with embedded pre-whitening [1].
%  Makes no assumptions about the nature of the noise (white vs. colored).
%
%  Usage:  klt(noisyFile, outputFile)
%           
%         infile - noisy speech file in .wav format
%         outputFile - enhanced output file in .wav format
%  
%  The covariance matrix estimation is done using a sine-taper method [1].
%
%  Example call:  klt('sp04_babble_sn10.wav','out_klt.wav');
%
%  References:
%   [1] Hu, Y. and Loizou, P. (2003). A generalized subspace approach for 
%       enhancing speech corrupted by colored noise. IEEE Trans. on Speech
%       and Audio Processing, 11, 334-341.
%   
% Authors: Yi Hu and Philipos C. Loizou
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $  $Date: 10/09/2006 $
%-------------------------------------------------------------------------

if nargin<2
   fprintf('Usage: klt(noisyfile.wav,outFile.wav) \n\n');
   return;
end

L=16;   %number of tapers used to estimate the covariance matrix in
        %  multi-window method  
vad_thre= 1.2; % VAD threshold value
mu_vad= 0.98; % mu to use in smoothing of the Rn matrix

% --- initialize mu values -----------------
%
mu_max=5;
mu_toplus= 1; % mu value with SNRlog>= 20dB 
mu_tominus= mu_max;   % mu value with SNRlog< -5dB
mu_slope= (mu_tominus- mu_toplus )/ 25;
mu0= mu_toplus+ 20* mu_slope;
%===========================================

[noisy_speech, fs]= audioread(filename);
aInfo = audioinfo(filename);
nbits = aInfo.BitsPerSample; % 16 bits resolution
enhanced_speech=zeros(1,length(noisy_speech));
subframe_dur= 4;  %subframe length is 4 ms
len= floor( fs* subframe_dur/ 1000);    
P= len; 
frame_dur= 32;  % frame duration in msecs
N= frame_dur* fs/ 1000; 
Nover2= N/ 2; % window overlap in 50% of frame size
K= N;
frame_window= hamming( N);
subframe_window= hamming( P); 


% ==== Esimate noise covariance matrix ------------
%
L120=floor( 120* fs/ 1000);  % assume initial 120ms is noise only
noise= noisy_speech( 1: L120);

if L== 1
    noise_autoc= xcorr( noise, P- 1, 'biased');
    % from -(len- 1) to (len- 1)
    % obtain the autocorrelation functions
    Rn= toeplitz( noise_autoc( P: end));
    % form a Toeplitz matrix to obtain the noise signal covariance matrix
else
    
    tapers= sine_taper( L, L120); % generate sine tapers
%     [tapers, v]= dpss( L120, 4); % generate slepian tapers
    Rn= Rest_mt( noise, P, tapers);
end
iRn= inv( Rn);  % inverse Rn

% ===================================================

n_start= 1;
In= eye(len);
Nframes= floor( length( noisy_speech)/ (N/ 2))- 1;     % number of frames
x_overlap= zeros( Nover2, 1);
tapers1= sine_taper( L, N);  % generate sine tapers
% [tapers1, v]= dpss( N, 4); % generate slepian tapers

%===============================  Start Processing =====================
%
for n=1: Nframes  
    
     noisy= noisy_speech( n_start: n_start+ N- 1);  
    
    if L== 1
        noisy_autoc= xcorr( noisy, P- 1, 'biased');
        Ry= toeplitz( noisy_autoc( P: 2* P- 1));
    else
        Ry= Rest_mt( noisy, P, tapers1); % use sine tapers to estimate the cov matrix
    end
    
    % use simple VAD to update the noise cov matrix, Rn 
    vad_ratio= Ry(1,1)/ Rn(1,1); 
    if (vad_ratio<= vad_thre) % noise dominant
        Rn= mu_vad* Rn+ (1- mu_vad)* Ry;
        iRn= inv( Rn);
    end
    %================
    
    iRnRx= iRn* Ry- In;  % compute Rn^-1 Rx=Rn^-1- I
    [V, D]= eig( iRnRx); % EVD
    iV= inv( V);
    dRx= max( diag( D), 0);  % estimated eigenvalues of Rx
    SNR= sum( dRx)/ len;
    SNRlog( n)= 10* log10( SNR+ eps);
    
    if SNRlog( n)>= 20
        mu( n)= mu_toplus;  %actually this corresponds to wiener filtering
    elseif ( SNRlog( n)< 20) && ( SNRlog( n)>= -5)
        mu( n)= mu0- SNRlog( n)* mu_slope;
    else
        mu( n)= mu_tominus;
    end
    
    gain_vals= dRx./( dRx+ mu( n));   
    G= diag( gain_vals);
    H= iV'* G* V';
    
    % first step of synthesis for subframe
    sub_start= 1; 
    sub_overlap= zeros( P/2, 1);
    for m= 1: (2*N/P- 1)
        sub_noisy= noisy( sub_start: sub_start+ P- 1);
        enhanced_sub_tmp= (H* sub_noisy).* subframe_window;
        enhanced_sub( sub_start: sub_start+ P/2- 1)= ...
            enhanced_sub_tmp( 1: P/2)+ sub_overlap; 
        sub_overlap= enhanced_sub_tmp( P/2+1: P);
        sub_start= sub_start+ P/2;
    end
    enhanced_sub( sub_start: sub_start+ P/2- 1)= sub_overlap; 
    % ===============
        
    xi= enhanced_sub'.* frame_window;    
    enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap+ xi( 1: Nover2);    
    x_overlap= xi( Nover2+ 1: N);               
        
    n_start= n_start+ Nover2;     
    
end

enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap; 

audiowrite( outfile, enhanced_speech, fs, 'BitsPerSample', nbits);
end

%=========================== E N D===============================================

function tapers= sine_taper( L, N)

% this function is used to generate the sine tapers proposed by Riedel et
% al, IEEE Transactions on Signal Processing, pp. 188- 195, Jan. 1995

% there are two parameters, 'L' is the number of the sine tapers generated,
% and 'N' is the length of each sine taper; the returned value 'tapers' is
% a N-by-L matrix with each column being sine taper 

tapers= zeros( N, L);

for index= 1: L
    tapers( :, index)= sqrt( 2/ (N+ 1))* sin (pi* index* (1: N)'/ (N+ 1));
end
end



function R_mt= Rest_mt( x, p, W)
% multi-taper method for covariance matrix estimation, we have 'x' of
% length N, and estimate p-order covariance matrix. 
% this estimator is: 
%   1): quadratic in the data
%   2): modulation covariant
%   3): nonnegative definite Toeplitz

% 'x' must be a vector of length N,
% 'p' is the order of the covariance matrix,
% 'W' is the N-by-L taper matrix, each column of which is a taper.
%
%
% Reference:  L. T. McWhorter and L. L. Scharf, "Multiwindow estimator of correlation" 
%  IEEE Trans. on Signal processing, Vol. 46, No. 2, Feb. 1998 
%====================


x= x( :);   % make data a column vector
[N, L]= size( W);  
% tapers is a N-by-L matrix, and has the form of [w_1 w_2 ... w_L]

% first step is to compute R'= sum_{i=1}^{L}(w_i y)(w_i y)*, which is a
% multitaper method: weight x with each taper w_i, and compute outer
% product, and add them together. 
x_rep= x( :, ones( 1, L)); 
% make x repeat L times, i.e., [x x ...x] 
x_w= W.* x_rep; 
% now each column of x_w is the weighted x

R1= x_w* x_w'; % sum of outer product
for k= 1: p
    r( k)= sum( diag( R1, k- 1));
end

R_mt= toeplitz( r); 
end

关于语音及噪声文件,具体请参考:语音信号处理常用语料库下载地址

参考文献
[1]Yi Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” in IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, pp. 334-341, July 2003, doi: 10.1109/TSA.2003.814458.
[2]U. Mittal and N. Phamdo, “Signal/noise KLT based approach for enhancing speech degraded by colored noise,” in IEEE Transactions on Speech and Audio Processing, vol. 8, no. 2, pp. 159-167, March 2000, doi: 10.1109/89.824700.
[3]PhiliposC.Loizou. 语音增强:理论与实践:theory and practice[M]// 语音增强:理论与实践:theory and practice. 电子科技大学出版社, 2012.
[4]Loizou P C . Speech Enhancement: Theory and Practice[M]. CRC Press, Inc. 2007.

  • 12
    点赞
  • 89
    收藏
    觉得还不错? 一键收藏
  • 34
    评论
评论 34
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值