Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

Link: https://arxiv.org/abs/2002.00319

Jingdong Li∗, Hui Zhang∗, Xueliang Zhang∗ and Changliang Li†
∗ College of Computer Science, Inner Mongolia University, Hohhot, China
jingdong.li@mail.imu.edu.cn {cszh, cszxl}@imu.edu.cn
† Kingsoft AI Laboratory, Beijing, China
lichangliang@kingsoft.com

Abstract
In recent decades, neural network based methods have significantly improved the performance of speech enhancement. Most of them estimate the time-frequency (T-F) representation of the target speech directly or indirectly, then resynthesize the waveform using the estimated T-F representation. In this work, we propose the temporal convolutional recurrent network (TCRN), an end-to-end model that directly maps a noisy waveform to a clean waveform. The TCRN, which combines convolutional and recurrent neural networks, is able to efficiently and effectively leverage both short-term and long-term information. Furthermore, we present an architecture that repeatedly downsamples and upsamples the speech during forward propagation, and show that this design improves performance compared with existing convolutional recurrent networks. We also present several key techniques to stabilize the training process. The experimental results show that our model consistently outperforms existing speech enhancement approaches in terms of speech intelligibility and quality.

1 Introduction

Monaural speech enhancement is the task of extracting clean speech from a one-microphone noisy signal. The purpose of speech enhancement is to improve speech quality and intelligibility (generally for human listeners rather than machines). It is widely and successfully applied in many modern speech applications, such as hearing aids, communication systems, automatic speech recognition (ASR), speaker verification, etc. [1].
Traditional speech enhancement approaches include spectral subtraction [2], Wiener filtering [3], nonnegative matrix factorization [4], etc. These approaches typically rely on the strong assumption that the noise is statistically stationary. However, few real-world noises stay stationary all the time, which makes it hard for traditional methods to achieve the performance they were designed for.
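To make the stationarity assumption concrete, spectral subtraction estimates a single noise magnitude spectrum (e.g., from an assumed noise-only leading segment) and subtracts it from every frame. Below is a minimal illustrative sketch, not an implementation from any of the cited papers; the frame size, segment length, and spectral floor are arbitrary choices:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr, noise_dur=0.3, floor=0.002):
    """Basic spectral subtraction: estimate the noise magnitude from an
    assumed noise-only leading segment, subtract it from every frame,
    and resynthesize using the noisy phase."""
    _, _, spec = stft(noisy, fs=sr, nperseg=512, noverlap=384)
    mag, phase = np.abs(spec), np.angle(spec)

    # Average the magnitude over the first `noise_dur` seconds (hop = 128 samples).
    n_frames = max(1, int(noise_dur * sr / 128))
    noise_mag = mag[:, :n_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate; floor the result to avoid negative magnitudes.
    enhanced_mag = np.maximum(mag - noise_mag, floor * mag)

    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=sr,
                        nperseg=512, noverlap=384)
    return enhanced
```

If the noise changes after the leading segment, the fixed estimate goes stale; this is exactly the failure mode of the stationarity assumption described above.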
To deal with annoying nonstationary noise, deep neural networks (DNNs) [5], [6], [7], [8] were introduced into speech enhancement and achieved unprecedented performance. A DNN predicts a label for each frame from a small context window. This limited input means the DNN cannot capture long-term context, and DNN-based methods also perform poorly on unseen speakers. Long short-term memory networks (LSTMs) [9], [10] were introduced into speech enhancement to alleviate these limits. Chen et al. [10] proposed a four-layer LSTM to deal with speaker generalization in noise-independent speech enhancement. Their experimental results showed that the LSTM model substantially outperforms DNNs. A more recent study found that a combination of convolutional and recurrent networks (CRN) [11] leads to better performance than LSTM.
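To illustrate what a "small context window" means in practice, frame-level DNNs are typically fed the current frame stacked with a few neighboring frames on each side. A hypothetical sketch (the window radius of 5 frames is an arbitrary illustrative choice):

```python
import numpy as np

def stack_context(features, radius=5):
    """Stack each frame with `radius` neighbors on each side, padding the
    edges by repetition. features: (n_frames, n_bins) ->
    (n_frames, (2 * radius + 1) * n_bins)."""
    padded = np.pad(features, ((radius, radius), (0, 0)), mode="edge")
    shifted = [padded[i:i + len(features)] for i in range(2 * radius + 1)]
    return np.concatenate(shifted, axis=1)
```

Anything outside this window is invisible to the model, which is why recurrent models that carry state across frames can capture the long-term context that frame-level DNNs miss.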

Most existing approaches aim to directly or indirectly estimate T-F representations of the target speech. They fall into two groups: "mapping-based" methods and "masking-based" methods. Mapping-based methods directly predict the T-F representation, with the STFT magnitude spectrum being the most popular choice. Masking-based methods first predict a T-F mask, then multiply the estimated mask with the T-F features of the mixture to obtain clean features of the target speech. Earlier masking-based studies focused on masks of the magnitude spectrum, including the ideal binary mask (IBM), ideal ratio mask (IRM) [12], spectral magnitude mask (SMM) [6], phase-sensitive mask (PSM) [9] and so on. Since the performance of magnitude masking is limited by reusing the noisy phase, the complex ideal ratio mask (cIRM) [13] was proposed to improve speech enhancement further. Theoretically, cIRM masking can recover both the perfect magnitude and the perfect phase. However, the imaginary part of the cIRM exhibits unclear temporal and spectral structure, which is difficult to estimate, so the cIRM does not consistently lead to better performance than other methods.
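As a concrete instance of the masking-based approach, the IRM can be computed from parallel clean and noise signals during training and applied multiplicatively to the noisy magnitude at inference. The sketch below uses illustrative STFT parameters and the common square-root form of the IRM:

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(clean, noise, sr, eps=1e-8):
    """IRM: per T-F unit, sqrt(clean energy / (clean energy + noise energy))."""
    _, _, S = stft(clean, fs=sr, nperseg=512, noverlap=384)
    _, _, N = stft(noise, fs=sr, nperseg=512, noverlap=384)
    return np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + eps))

def apply_mask(noisy, mask, sr):
    """Multiply the mask with the noisy magnitude and reuse the noisy
    phase -- the phase reuse that limits magnitude-domain masking."""
    _, _, Y = stft(noisy, fs=sr, nperseg=512, noverlap=384)
    _, enhanced = istft(mask * np.abs(Y) * np.exp(1j * np.angle(Y)),
                        fs=sr, nperseg=512, noverlap=384)
    return enhanced
```

The cIRM extends this idea to a complex-valued mask applied to the complex STFT, recovering phase as well, at the cost of the hard-to-estimate imaginary part discussed above.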
Recently, some works have used neural networks for speech analysis and synthesis in the time domain. Temporal convolutional layers are trained as filterbanks to extract features from the waveform and improve ASR performance [14], [15], [16]. Compared with hand-crafted mel-filterbank and gammatone-filterbank features, an ASR system jointly trained with trainable filterbanks consistently yields a lower word error rate (WER). Sercan et al. [17] utilized group convolution networks to synthesize waveforms conditioned on magnitude spectrograms; they showed that CNN-based methods can generate higher-quality speech than signal processing methods such as Griffin-Lim [18]. Some works have also attempted speech enhancement in the time domain. In [19], a CNN-based autoencoder was proposed for time-domain speech enhancement, which outperforms DNN-based methods operating in the T-F domain. Inspired by these works, we propose the temporal convolutional recurrent network (TCRN) for speech enhancement. Compared with LSTMs and the CRN, the proposed TCRN consistently leads to better speech intelligibility and speech quality.
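To make the overall idea concrete, the sketch below shows a waveform-in, waveform-out convolutional recurrent model in PyTorch: strided 1-D convolutions repeatedly downsample the signal, an LSTM models long-term dependencies on the shortened sequence, and transposed convolutions upsample back to a waveform. This is only an illustrative skeleton of the downsample-recur-upsample pattern; the layer counts, kernel sizes, and channel widths are guesses, not the configuration of the proposed TCRN:

```python
import torch
import torch.nn as nn

class TCRNSketch(nn.Module):
    """Illustrative waveform-to-waveform conv-recurrent model (NOT the
    authors' exact TCRN): downsample -> LSTM -> upsample."""

    def __init__(self, channels=64, hidden=128):
        super().__init__()
        # Each strided conv halves the temporal resolution (downsampling).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, 8, stride=2, padding=3), nn.PReLU(),
            nn.Conv1d(channels, channels, 8, stride=2, padding=3), nn.PReLU(),
            nn.Conv1d(channels, channels, 8, stride=2, padding=3), nn.PReLU(),
        )
        # The recurrent layer leverages long-term context on the 8x-shorter sequence.
        self.rnn = nn.LSTM(channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, channels)
        # Transposed convs restore the original temporal resolution (upsampling).
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.PReLU(),
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.PReLU(),
            nn.ConvTranspose1d(channels, 1, 8, stride=2, padding=3),
        )

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.encoder(wav.unsqueeze(1))       # (B, C, T/8)
        h, _ = self.rnn(x.transpose(1, 2))       # (B, T/8, hidden)
        x = self.proj(h).transpose(1, 2)         # (B, C, T/8)
        return self.decoder(x).squeeze(1)        # (B, samples)

# Example: a batch of four 1-second, 16 kHz waveforms in and out.
out = TCRNSketch()(torch.randn(4, 16000))
assert out.shape == (4, 16000)
```

Mapping waveform to waveform end-to-end sidesteps the noisy-phase problem of magnitude-domain masking entirely, since no phase is ever reused.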
The rest of the paper is organized as follows: Section 2 describes the details of the proposed system. Section 3 describes the loss functions used in this study. Section 4 presents the experimental setup and results. Finally, we conclude our work in Section 5.
