Article link: https://blog.csdn.net/qq_32591057/article/details/88371590

Summary of Dereverberation Techniques

Zhang Fan

Reverberation is caused by the multipath effect when a sound wave propagates in an enclosure. The received reverberant signal contains three components: the direct-path signal, early reflections and late reflections. The two kinds of reflections have different effects. Early reflections produce a coloration effect, which can enhance the perception of the direct-path signal because they arrive only shortly after it. Late reflections, however, can be distinguished from the direct-path signal by the human auditory system and are perceived as an “echo”. This latter component is largely undesirable from a speech intelligibility perspective, and it severely degrades the performance of ASR systems.

The reverberant signal can be modelled as the convolution of the clean signal with an acoustic impulse response (AIR), represented as an FIR filter. Dereverberation methods aim to recover the clean signal from the reverberant signals when, as in practice, the AIRs are unknown. Here we assume a SISO/SIMO framework, which means there is only one desired source, no competing sources, and one or multiple outputs.
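The convolutive model above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the AIR is a toy one (a unit direct path followed by exponentially decaying noise), white noise stands in for clean speech, and the function names `synthetic_air` and `reverberate` are hypothetical.

```python
import numpy as np

def synthetic_air(fs=16000, t60=0.5, length_s=0.6, seed=0):
    """Toy AIR (assumption): unit direct path plus exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    # Amplitude envelope that has dropped by 60 dB at t = t60.
    decay = np.exp(-3.0 * np.log(10.0) * t / t60)
    h = 0.1 * rng.standard_normal(n) * decay
    h[0] = 1.0  # direct path
    return h

def reverberate(s, h):
    """Reverberant signal = clean signal convolved with the FIR impulse response."""
    return np.convolve(s, h)

fs = 16000
rng = np.random.default_rng(1)
s = rng.standard_normal(fs)   # 1 s of white noise as a stand-in for clean speech
h = synthetic_air(fs)
x = reverberate(s, h)         # single-output (SISO) observation
```

In the SIMO case, each microphone would simply get its own impulse response applied to the same source signal.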

The most straightforward dereverberation method is composed of two stages: blind channel identification and channel equalization. In the first stage, blind channel identification is usually carried out in a SIMO framework, where the AIRs are blindly estimated from multiple outputs[1][2][3]. Blind identification is not possible if the channels share common zeros, and its performance degrades in the presence of near-common zeros, which are not unusual in RTFs, especially for long reverberation times. A forced spectral diversity algorithm[5] employs undermodelling in combination with a spectral shaping filter to reduce the effect of near-common zeros. In the second stage, channel equalizers are designed to remove the convolution effect from the microphone signals. MINT[1][3] can achieve perfect dereverberation with exact channel identification when certain conditions are satisfied. However, channel identification errors and additive noise are unavoidable in practice, so equalizers need to be designed to be robust to these errors. Many approaches have been proposed to address this problem, such as weighted LS[4][12], regularized MINT[7], truncated MINT[7], relaxed multichannel LS (RMCLS)[7][10][13], channel shortening, and partial MINT (P-MINT)[7], etc. To reduce the computational complexity, adaptive methods[6][11][14] and subband versions[8][10][15][16] have been proposed.
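To make the equalization stage concrete, here is a minimal sketch of MINT-style inverse filtering for two channels. It assumes the impulse responses are known exactly, share no common zeros (random toy channels are used), and that the equalizer length is at least L - 1 for channel length L. The helper names `conv_matrix` and `mint_equalizers` are hypothetical and not taken from any of the cited works.

```python
import numpy as np

def conv_matrix(h, n_cols):
    """Matrix C such that C @ g equals np.convolve(h, g) for len(g) == n_cols."""
    C = np.zeros((len(h) + n_cols - 1, n_cols))
    for j in range(n_cols):
        C[j:j + len(h), j] = h
    return C

def mint_equalizers(channels, eq_len, delay=0):
    """Solve [H_1 ... H_M] g = d, where d is a unit impulse at index `delay`."""
    H = np.hstack([conv_matrix(h, eq_len) for h in channels])
    d = np.zeros(H.shape[0])
    d[delay] = 1.0
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return np.split(g, len(channels))

rng = np.random.default_rng(0)
L = 8                                           # toy channel length (assumption)
channels = [rng.standard_normal(L) for _ in range(2)]
g = mint_equalizers(channels, eq_len=L - 1)

# Sum over channels of (channel convolved with its equalizer): ideally a unit impulse.
equalized = sum(np.convolve(h, gm) for h, gm in zip(channels, g))
```

With estimated rather than exact channels, the same least-squares system becomes sensitive to the estimation errors, which is what the robust variants listed above try to mitigate.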

Spectral suppression methods are based on the assumption that early reverberation is uncorrelated with late reverberation, so that spectral subtraction[17][23] and the Wiener estimator[8], originally used to attenuate the noise component in noisy speech, can be used to attenuate late reverberation. Just as the noise PSD is the key parameter in speech enhancement, the key problem here is the estimation of the late reverberation PSD[17][24][28][29][30][31][34][35][36][38][39]. Spectral suppression based methods are robust to additive noise; however, only late reverberation is suppressed, and distortion is introduced at the same time[17]. These methods are therefore usually used as a second stage in combination with inverse filtering[18][21] or microphone arrays[20][25][27][26][36][37]. When reverberation is modelled in the STFT domain as a convolution between the clean speech STFT coefficients and a convolutive transfer function (CTF), many well-known speech spectral estimators, such as the MAP estimator[26][33][39], can be applied.
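As a rough illustration of spectral suppression, the sketch below estimates the late reverberation PSD with a commonly used exponential-decay model (a delayed, attenuated copy of the reverberant PSD) and turns it into a power spectral subtraction gain. The reverberation time is assumed to be known, and the 50 ms early/late boundary, the gain floor, and the function names are assumptions chosen for the example only.

```python
import numpy as np

def late_reverb_psd(psd_x, t60, late_start, hop, fs):
    """psd_x: (n_freq, n_frames) smoothed PSD of the reverberant signal."""
    delta = 3.0 * np.log(10.0) / t60                 # decay rate implied by T60
    n_delay = int(round(late_start * fs / hop))      # early/late boundary in frames
    atten = np.exp(-2.0 * delta * n_delay * hop / fs)
    psd_late = np.zeros_like(psd_x)
    psd_late[:, n_delay:] = atten * psd_x[:, :psd_x.shape[1] - n_delay]
    return psd_late

def suppression_gain(psd_x, psd_late, floor=0.1):
    """Amplitude gain of power spectral subtraction, limited to a spectral floor."""
    ratio = psd_late / np.maximum(psd_x, 1e-12)
    return np.sqrt(np.maximum(1.0 - ratio, floor ** 2))

# Toy input (assumption): a pure exponentially decaying tail in every frequency bin.
fs, hop, t60 = 16000, 256, 0.5
frames = np.arange(40)
tail = np.exp(-2.0 * (3.0 * np.log(10.0) / t60) * frames * hop / fs)
psd_x = np.tile(tail, (4, 1))

psd_late = late_reverb_psd(psd_x, t60, late_start=0.05, hop=hop, fs=fs)
gain = suppression_gain(psd_x, psd_late)
```

For this toy decaying tail, every frame after the boundary is attributed entirely to late reverberation, so the gain sits at the floor there; the gain would be multiplied onto the reverberant STFT coefficients before resynthesis.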

Linear prediction based methods fall roughly into two categories: LPC-residual based and multi-channel linear prediction (MCLP) based. The LPC residual of the reverberant signal can be modelled as the convolution of the LPC residual of the clean signal with the channel impulse response. To de-emphasize the effect of the AIR in the residual signal, one idea is to perform a coarse channel estimation and then form a matched filter that is applied to the residual of the reverberant signal[40]. Another idea is to find a criterion that distinguishes the two residual signals, such as kurtosis, so that adaptive filtering can be applied[41][47][48]. MCLP based methods, first formulated in the time domain[42][43][44][46][53], show that with long prediction filters the effect of reverberation on the speech residual can be fully removed. The extended form (which we refer to as WPE, weighted prediction error)[45][49][50][66][65][55] is developed in the STFT domain, which remarkably reduces the computational complexity while achieving effective dereverberation. WPE has two sets of parameters to estimate: the prediction filter coefficients and the speech power spectrum. This may increase speech distortion, especially when the observation is short. To overcome this problem, [51][56][57][59][61][58][63][67] introduce constraints such as estimated speech log-spectral priors, sparsity, low-rank approximation of speech, and estimated late reverberation power. To improve robustness to noise, WPE can be combined with beamforming, which performs better[62][64].
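The alternating estimation in WPE can be sketched for a single frequency bin as follows. This is a simplified illustration rather than the exact algorithm of any cited paper: the tap count, prediction delay, iteration count, regularization constant and the synthetic test signal are all assumptions, and `wpe_one_bin` is a hypothetical name.

```python
import numpy as np

def wpe_one_bin(Y, taps=5, delay=2, iterations=3, eps=1e-8):
    """Y: (n_channels, n_frames) complex STFT coefficients of one frequency bin."""
    D, T = Y.shape
    # Stack delayed observations Y[:, t - delay - k], k = 0..taps-1, for every frame t.
    Y_tilde = np.zeros((D * taps, T), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    X = Y.copy()
    for _ in range(iterations):
        # Step 1: speech power per frame from the current estimate.
        lam = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)
        # Step 2: prediction filters from the weighted normal equations.
        Yw = Y_tilde / lam
        R = Yw @ Y_tilde.conj().T
        P = Yw @ Y.conj().T
        G = np.linalg.solve(R + eps * np.eye(D * taps), P)
        # Subtract the predicted late reverberation.
        X = Y - G.conj().T @ Y_tilde
    return X

# Synthetic bin (assumption): nonstationary source filtered by short two-channel responses.
rng = np.random.default_rng(0)
T, D, L = 2000, 2, 6
env = 0.1 + np.abs(np.sin(2 * np.pi * np.arange(T) / 50))
S = env * (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
a = 0.5 * rng.standard_normal((D, L)) * np.exp(-0.3 * np.arange(L))
a[:, 0] = 1.0                                   # strong direct path
Y = np.stack([np.convolve(S, a[d])[:T] for d in range(D)])

X = wpe_one_bin(Y)
```

The prediction delay keeps the filter from removing the direct path and early reflections, which is why the output is compared against the early part of the signal rather than the clean source itself.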

Other dereverberation methods, such as cepstral processing[68][69][70][71], subband envelope estimation[72][73][74], spherical microphone array processing[76][77], harmonic structure based methods[78][79][80][81], and DNN based methods[84][85][86][87][88], still have their place.

References: