Unsupervised training of neural mask-based beamforming
Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach
Paderborn University, Department of Communications Engineering, Paderborn, Germany
{drude, heymann, haeb}@nt.upb.de
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2549.pdf
Abstract
We present an unsupervised training approach for a neural network-based mask estimator in an acoustic beamforming application. The network is trained to maximize a likelihood criterion derived from a spatial mixture model of the observations. It is trained from scratch and requires no parallel data consisting of degraded input and clean training targets. Thus, training can be carried out on real recordings of noisy speech rather than simulated ones. In contrast to previous work on unsupervised training of neural mask estimators, our approach entirely avoids the need for a (possibly pre-trained) teacher model. We demonstrate the effectiveness of our approach with speech recognition experiments on two different datasets: one mainly deteriorated by noise (CHiME 4) and one by reverberation (REVERB). The results show that the performance of the proposed system is on par with a supervised system using oracle target masks for training and with a system trained using a model-based teacher.
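To make the training criterion more concrete, the following is a minimal sketch of how a mask-dependent likelihood under a spatial mixture model might be evaluated for a single frequency bin. The abstract does not name the mixture model; the sketch assumes a complex angular central Gaussian mixture model (cACGMM), treats the network masks as class posteriors, and derives the class parameters from a single M-step started at the identity matrix. The function name `cacgmm_log_likelihood`, the array layout, and the regularization constant `eps` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from math import lgamma, log, pi


def cacgmm_log_likelihood(Y, masks, eps=1e-6):
    """Log-likelihood of one frequency bin under an assumed cACGMM.

    Y:     (T, D) complex STFT observations (T frames, D microphones).
    masks: (K, T) non-negative masks that sum to one over the K classes.
    """
    T, D = Y.shape
    K = masks.shape[0]

    # The cACG distribution is defined on unit-norm vectors.
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Z = Y / np.maximum(norms, 1e-12)

    # Mixture weights: average mask value per class.
    weights = masks.mean(axis=1)

    # Log of the cACG normalization constant (D-1)! / (2 * pi^D).
    log_norm = lgamma(D) - log(2.0) - D * log(pi)

    log_p = np.empty((K, T))
    for k in range(K):
        # One M-step starting from B = I: because z^H z = 1, the update
        # reduces to a mask-weighted spatial covariance scaled by D.
        B = D * np.einsum('t,td,te->de', masks[k], Z, Z.conj())
        B = B / max(masks[k].sum(), 1e-12) + eps * np.eye(D)

        B_inv = np.linalg.inv(B)
        # Quadratic form z_t^H B^{-1} z_t for every frame.
        quad = np.real(np.einsum('td,de,te->t', Z.conj(), B_inv, Z))
        quad = np.maximum(quad, 1e-12)
        _, logdet = np.linalg.slogdet(B)

        log_p[k] = (np.log(max(weights[k], 1e-12)) + log_norm
                    - logdet - D * np.log(quad))

    # Log-sum-exp over classes, then sum over frames.
    m = log_p.max(axis=0)
    return float(np.sum(m + np.log(np.exp(log_p - m).sum(axis=0))))
```

In an actual training setup this quantity would be computed in an automatic differentiation framework, summed over frequency bins, and maximized with respect to the network parameters that produce `masks`. The NumPy version above only illustrates the objective itself.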