语音增强--子空间算法及MATLAB实现

最新推荐文章于 2024-06-29 20:24:28 发布

YHCANDOU

最新推荐文章于 2024-06-29 20:24:28 发布

阅读量5.5k

点赞数 12

分类专栏：语音信号处理文章标签： matlab 算法语音识别

本文链接：https://blog.csdn.net/yhcwjh/article/details/112691969

版权

语音信号处理专栏收录该内容

10 篇文章 40 订阅

订阅专栏

语音增强--------------子空间算法

原理介绍

子空间方法的原理是将观测信号的向量空间分解为信号子空间和噪声子空间，通过消除噪声子空间并保留信号子空间从而估计出干净语音，子空间分解的过程是对带噪语音信号做KLT变换,然后设置一个门限阈值，利用KLT系数的稀疏性，将噪声的KLT系数置为0，之后通过逆KLT变换得到增强后的语音。
若观测噪声 $\mathbf{v}$ 与干净语音 $\mathbf{x}$ 无关，则带噪语音可以表示为
$\mathbf{y}=\mathbf{x}+\mathbf{v}$
假定对干净语音 $\mathbf{x}$ 的线性估计为 $\mathbf{\hat{x}}=\mathbf{Hy}$ ，此处 $\mathbf{H}$ 为 $K\times K$ 的滤波器矩阵。
那么滤波误差可以写为
$\mathbf{\varepsilon }=\widehat{\mathbf{x}}-\mathbf{x}=(\mathbf{H}-\mathbf{I})\cdot \mathbf{x}+\mathbf{H}\cdot \mathbf{v}={{\mathbf{\varepsilon }}_{\mathbf{x}}}+{{\mathbf{\varepsilon }}_{\mathbf{v}}}$
其中 ${{\mathbf{\varepsilon }}_{\mathbf{x}}}$ 为语音失真， ${{\mathbf{\varepsilon }}_{\mathbf{v}}}$ 为残留噪声，定义
语音失真能量 $\mathbf{\varepsilon }_{\mathbf{x}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{x}}}\mathbf{\varepsilon }_{\mathbf{x}}^{T} \right] \right)$
残留噪声能量 $\mathbf{\varepsilon }_{\mathbf{v}}^{2}\text{=tr}\left( E\left[ {{\mathbf{\varepsilon }}_{\mathbf{v}}}\mathbf{\varepsilon }_{\mathbf{v}}^{T} \right] \right)$
为了得到最优线性滤波器，可以寻求解决如下有约束的优化问题：
$\left\{ \begin{aligned} & \underset{\mathbf{H}}{\mathop{\min }}\,\ \ \ \mathbf{\varepsilon }_{\mathbf{x}}^{2} \\ & s.t.\ \ \ \ \frac{1}{K}\mathbf{\varepsilon }_{\mathbf{v}}^{2}\le {{\sigma }^{2}} \\ \end{aligned} \right.$
其中 ${{\sigma }^{2}}$ 为正实数，代表噪声的容忍度。
最优解可以近似写作
${{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}}$
此处 ${{\mathbf{R}}_{\mathbf{x}}}$ 和 ${{\mathbf{R}}_{\mathbf{v}}}$ 分别是干净语音和噪声的协方差矩阵， $\mu$ 是拉格朗日乘子。参数 $\mu$ 由噪声容忍度确定，其控制着语音失真和残留噪声之间的权衡。
当 $\mu \text{=}1$ 时，由于干净语音和噪声的互不相关特性， ${{\mathbf{H}}_{\text{opt}}}$ 变为维纳意义上的最优解 ${{\mathbf{H}}_{\text{opt}}}={{\mathbf{R}}_{\mathbf{x}}}\mathbf{R}_{\mathbf{y}}^{-1}$ ，这意味着，维纳滤波器是子空间算法的一种特殊情况。
对 ${{\mathbf{R}}_{\mathbf{x}}}$ 进行特征分解
${{\mathbf{R}}_{\mathbf{x}}}=\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\mathbf{U}}^{T}}$
那么
$\begin{aligned} {{\mathbf{H}}_{\text{opt}}}& ={{\mathbf{R}}_{\mathbf{x}}}{{\left( {{\mathbf{R}}_{\mathbf{x}}}\text{+}\mu {{\mathbf{R}}_{\mathbf{v}}} \right)}^{-1}} \\ & \text{=}\mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U} \right)}^{-1}}{{\mathbf{U}}^{T}} \end{aligned}$
在文献[2]中， ${{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}$ 被近似为对角阵
${{\mathbf{\Sigma }}_{\mathbf{v}}}=diag\left\{ \begin{matrix} E\left[ \left| \mathbf{u}_{1}^{T}\mathbf{v} \right| \right] & E\left[ \left| \mathbf{u}_{2}^{T}\mathbf{v} \right| \right] & \cdots & E\left[ \left| \mathbf{u}_{K}^{T}\mathbf{v} \right| \right] \\ \end{matrix} \right\}\approx {{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}$
其中 $\mathbf{u}_{k}^{T}$ 是 ${{\mathbf{R}}_{\mathbf{x}}}$ 的第 $k$ 个特征向量，基于上述近似，可以得到次优估计器
${{\mathbf{H}}_{\text{opt}}}\approx \mathbf{U}{{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}}$
因此，子空间算法先将含噪的观测语音变换到KLT域，得到KLT系数 ${{\mathbf{U}}^{T}}\mathbf{y}$ ；根据干净语音在KLT域的稀疏性对 ${{\mathbf{U}}^{T}}\mathbf{y}$ 进行加权处理，得到稀疏化的干净语音KLT系数估计值 ${{\mathbf{\Sigma }}_{\mathbf{x}}}{{\left( {{\mathbf{\Sigma }}_{\mathbf{x}}}\text{+}\mu {{\mathbf{\Sigma }}_{\mathbf{v}}} \right)}^{-1}}{{\mathbf{U}}^{T}}$ ；最后通过逆KLT变换得到增强后的语音 ${{\mathbf{H}}_{\text{opt}}}\mathbf{y}$ 。
在实际的仿真中，即使对于白噪声， ${{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}$ 不是严格的对角矩阵；对于色噪声， ${{\mathbf{U}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{U}$ 更不会是一个对角矩阵。
如果 ${{\mathbf{R}}_{\mathbf{x}}}$ 是对称矩阵， ${{\mathbf{R}}_{\mathbf{v}}}$ 是正定矩阵，那么存在矩阵 $\mathbf{V}$ 能够同时对角化两个协方差矩阵：
${{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}=\mathbf{\Lambda }$
${{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V}=\mathbf{I}$
其中 $\mathbf{\Lambda }$ 和 $\mathbf{V}$ 分别是 $\mathbf{\Sigma }=\mathbf{R}_{\mathbf{v}}^{-1}{{\mathbf{R}}_{\mathbf{x}}}$ 特征值矩阵和特征向量矩阵，即
$\mathbf{\Sigma V}\text{=}\mathbf{V\Lambda }$
此时 $\mathbf{\Sigma }$ 的特征值均是实数,利用广义特征值分解，可以得到针对色噪声情况的最优线性估计器
$\begin{aligned} {{\mathbf{H}}_{\text{opt}}}&={{\mathbf{R}}_{\mathbf{x}}}{{\left[ {{\mathbf{V}}^{-T}}\left( {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}+\mu {{\mathbf{V}}^{T}}{{\mathbf{R}}_{\mathbf{v}}}\mathbf{V} \right){{\mathbf{V}}^{-1}} \right]}^{-1}} \\ & ={{\mathbf{R}}_{\mathbf{x}}}\mathbf{V}{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \\ & \text{=}{{\mathbf{V}}^{-T}}\mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}{{\mathbf{V}}^{T}} \end{aligned}$
首先，对含噪声语音 $\mathbf{y}$ 作变换 ${{\mathbf{V}}^{T}}$ ，通过一个增益函数 $\mathbf{\Lambda }{{\left( \mathbf{\Lambda }+\mu \mathbf{I} \right)}^{-1}}$ 对变换系数 ${{\mathbf{U}}^{T}}\mathbf{y}$ 进行调整，对调整后的系数做 ${{\mathbf{V}}^{T}}$ 逆变换。
仿真参数为

参数名称	参数值
信噪比	5dB
采样率	16KHz

在这里插入图片描述
从图中可以看出，在经过子空间算法处理之后，噪声被很大程度滤除了，达到了良好的去噪效果。此外，从图也可以看出，增强后的语音的一些细节方面也被滤除了，这也是子空间算法存在的一些缺点。
主函数main.m

clear;
close all;
clc;
%% 读入数据
[signal,~]=audioread('clean.wav');      %读入干净语音
[noise,fs]=audioread('noise.wav');       %读入噪声
N=3*fs;                                              %选取3秒的语音
signal=signal(1:N);
noise=noise(1:N);
N=length(signal);
t=(0:N-1)/fs;
SNR=5;                 %信噪比大小
noise=noise/norm(noise,2).*10^(-SNR/20)*norm(signal);     
x=signal+noise;          %产生固定信噪比的带噪语音
audiowrite('mixed.wav',x,fs);          %将混合的语音写入
enhanced_speech=klt('mixed.wav','enhanced_speech.wav');  
figure(1)
subplot(321);
plot(t,signal);ylim([-1.5,1.5]);title('干净语音');xlabel('时间/s');ylabel('幅度');
subplot(323);
plot(t,x);ylim([-1.5,1.5]);title('带噪语音');xlabel('时间/s');ylabel('幅度');
subplot(325);
plot(t,real(enhanced_speech));ylim([-1.5,1.5]);title('子空间法增强后的语音');xlabel('时间/s');ylabel('幅度');
subplot(322);
spectrogram(signal,256,128,256,16000,'yaxis');
subplot(324);
spectrogram(x,256,128,256,16000,'yaxis');
subplot(326);
spectrogram(enhanced_speech,256,128,256,16000,'yaxis');

klt.m

function [enhanced_speech, fs, nbits] =klt( filename, outfile)
%
%  Implements the generalized subspace algorithm with embedded pre-whitening [1].
%  Makes no assumptions about the nature of the noise (white vs. colored).
%
%  Usage:  klt(noisyFile, outputFile)
%           
%         infile - noisy speech file in .wav format
%         outputFile - enhanced output file in .wav format
%  
%  The covariance matrix estimation is done using a sine-taper method [1].
%
%  Example call:  klt('sp04_babble_sn10.wav','out_klt.wav');
%
%  References:
%   [1] Hu, Y. and Loizou, P. (2003). A generalized subspace approach for 
%       enhancing speech corrupted by colored noise. IEEE Trans. on Speech
%       and Audio Processing, 11, 334-341.
%   
% Authors: Yi Hu and Philipos C. Loizou
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $  $Date: 10/09/2006 $
%-------------------------------------------------------------------------

if nargin<2
   fprintf('Usage: klt(noisyfile.wav,outFile.wav) \n\n');
   return;
end

L=16;   %number of tapers used to estimate the covariance matrix in
        %  multi-window method  
vad_thre= 1.2; % VAD threshold value
mu_vad= 0.98; % mu to use in smoothing of the Rn matrix

% --- initialize mu values -----------------
%
mu_max=5;
mu_toplus= 1; % mu value with SNRlog>= 20dB 
mu_tominus= mu_max;   % mu value with SNRlog< -5dB
mu_slope= (mu_tominus- mu_toplus )/ 25;
mu0= mu_toplus+ 20* mu_slope;
%===========================================

[noisy_speech, fs]= audioread(filename);
aInfo = audioinfo(filename);
nbits = aInfo.BitsPerSample; % 16 bits resolution
enhanced_speech=zeros(1,length(noisy_speech));
subframe_dur= 4;  %subframe length is 4 ms
len= floor( fs* subframe_dur/ 1000);    
P= len; 
frame_dur= 32;  % frame duration in msecs
N= frame_dur* fs/ 1000; 
Nover2= N/ 2; % window overlap in 50% of frame size
K= N;
frame_window= hamming( N);
subframe_window= hamming( P); 


% ==== Esimate noise covariance matrix ------------
%
L120=floor( 120* fs/ 1000);  % assume initial 120ms is noise only
noise= noisy_speech( 1: L120);

if L== 1
    noise_autoc= xcorr( noise, P- 1, 'biased');
    % from -(len- 1) to (len- 1)
    % obtain the autocorrelation functions
    Rn= toeplitz( noise_autoc( P: end));
    % form a Toeplitz matrix to obtain the noise signal covariance matrix
else
    
    tapers= sine_taper( L, L120); % generate sine tapers
%     [tapers, v]= dpss( L120, 4); % generate slepian tapers
    Rn= Rest_mt( noise, P, tapers);
end
iRn= inv( Rn);  % inverse Rn

% ===================================================

n_start= 1;
In= eye(len);
Nframes= floor( length( noisy_speech)/ (N/ 2))- 1;     % number of frames
x_overlap= zeros( Nover2, 1);
tapers1= sine_taper( L, N);  % generate sine tapers
% [tapers1, v]= dpss( N, 4); % generate slepian tapers

%===============================  Start Processing =====================
%
for n=1: Nframes  
    
     noisy= noisy_speech( n_start: n_start+ N- 1);  
    
    if L== 1
        noisy_autoc= xcorr( noisy, P- 1, 'biased');
        Ry= toeplitz( noisy_autoc( P: 2* P- 1));
    else
        Ry= Rest_mt( noisy, P, tapers1); % use sine tapers to estimate the cov matrix
    end
    
    % use simple VAD to update the noise cov matrix, Rn 
    vad_ratio= Ry(1,1)/ Rn(1,1); 
    if (vad_ratio<= vad_thre) % noise dominant
        Rn= mu_vad* Rn+ (1- mu_vad)* Ry;
        iRn= inv( Rn);
    end
    %================
    
    iRnRx= iRn* Ry- In;  % compute Rn^-1 Rx=Rn^-1- I
    [V, D]= eig( iRnRx); % EVD
    iV= inv( V);
    dRx= max( diag( D), 0);  % estimated eigenvalues of Rx
    SNR= sum( dRx)/ len;
    SNRlog( n)= 10* log10( SNR+ eps);
    
    if SNRlog( n)>= 20
        mu( n)= mu_toplus;  %actually this corresponds to wiener filtering
    elseif ( SNRlog( n)< 20) && ( SNRlog( n)>= -5)
        mu( n)= mu0- SNRlog( n)* mu_slope;
    else
        mu( n)= mu_tominus;
    end
    
    gain_vals= dRx./( dRx+ mu( n));   
    G= diag( gain_vals);
    H= iV'* G* V';
    
    % first step of synthesis for subframe
    sub_start= 1; 
    sub_overlap= zeros( P/2, 1);
    for m= 1: (2*N/P- 1)
        sub_noisy= noisy( sub_start: sub_start+ P- 1);
        enhanced_sub_tmp= (H* sub_noisy).* subframe_window;
        enhanced_sub( sub_start: sub_start+ P/2- 1)= ...
            enhanced_sub_tmp( 1: P/2)+ sub_overlap; 
        sub_overlap= enhanced_sub_tmp( P/2+1: P);
        sub_start= sub_start+ P/2;
    end
    enhanced_sub( sub_start: sub_start+ P/2- 1)= sub_overlap; 
    % ===============
        
    xi= enhanced_sub'.* frame_window;    
    enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap+ xi( 1: Nover2);    
    x_overlap= xi( Nover2+ 1: N);               
        
    n_start= n_start+ Nover2;     
    
end

enhanced_speech( n_start: n_start+ Nover2- 1)= x_overlap; 

audiowrite( outfile, enhanced_speech, fs, 'BitsPerSample', nbits);
end

%=========================== E N D===============================================

function tapers= sine_taper( L, N)

% this function is used to generate the sine tapers proposed by Riedel et
% al, IEEE Transactions on Signal Processing, pp. 188- 195, Jan. 1995

% there are two parameters, 'L' is the number of the sine tapers generated,
% and 'N' is the length of each sine taper; the returned value 'tapers' is
% a N-by-L matrix with each column being sine taper 

tapers= zeros( N, L);

for index= 1: L
    tapers( :, index)= sqrt( 2/ (N+ 1))* sin (pi* index* (1: N)'/ (N+ 1));
end
end



function R_mt= Rest_mt( x, p, W)
% multi-taper method for covariance matrix estimation, we have 'x' of
% length N, and estimate p-order covariance matrix. 
% this estimator is: 
%   1): quadratic in the data
%   2): modulation covariant
%   3): nonnegative definite Toeplitz

% 'x' must be a vector of length N,
% 'p' is the order of the covariance matrix,
% 'W' is the N-by-L taper matrix, each column of which is a taper.
%
%
% Reference:  L. T. McWhorter and L. L. Scharf, "Multiwindow estimator of correlation" 
%  IEEE Trans. on Signal processing, Vol. 46, No. 2, Feb. 1998 
%====================


x= x( :);   % make data a column vector
[N, L]= size( W);  
% tapers is a N-by-L matrix, and has the form of [w_1 w_2 ... w_L]

% first step is to compute R'= sum_{i=1}^{L}(w_i y)(w_i y)*, which is a
% multitaper method: weight x with each taper w_i, and compute outer
% product, and add them together. 
x_rep= x( :, ones( 1, L)); 
% make x repeat L times, i.e., [x x ...x] 
x_w= W.* x_rep; 
% now each column of x_w is the weighted x

R1= x_w* x_w'; % sum of outer product
for k= 1: p
    r( k)= sum( diag( R1, k- 1));
end

R_mt= toeplitz( r); 
end