Paper: https://ieeexplore.ieee.org/abstract/document/9414265
Conference: ICASSP 2021
Abstract
Existing generative adversarial networks (GANs) for speech enhancement rely solely on the convolution operation, which may obscure temporal dependencies across the input sequence. To remedy this issue, the authors propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) operating on raw signal input. They further study empirically the effect of placing the self-attention layer at (de)convolutional layers with varying layer indices, as well as at all of them when memory allows. The experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance. Furthermore, applying it at different (de)convolutional layers does not significantly alter performance, suggesting that it can be conveniently applied at the highest-level (de)convolutional layer with the smallest memory overhead.
1. Introduction
SEGAN performs speech enhancement directly in the time domain, but its backbone is still a convolutional neural network. This reliance on the convolution operator limits SEGAN's capability to capture long-range dependencies across an input sequence, due to the convolution operator's local receptive field. Temporal dependency modeling is, in general, an integral part of a speech modeling system [17, 18], including speech enhancement when the input is a long segment of signal with a rich underlying structure; however, it has mostly remained uncharted in SEGAN systems.
On the one hand, self-attention has been successfully used for sequential modeling in different speech modeling tasks. On the other hand, it is more flexible in modeling both long-range and local dependencies, and is more efficient than RNNs in terms of computational cost, especially when applied to long sequences.
The authors therefore propose a self-attention layer following the principle of non-local attention [21, 22] and couple it with the (de)convolutional layers of SEGAN to construct a self-attention SEGAN (SASEGAN for short).
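As a rough illustration of the idea, a non-local (SAGAN-style) self-attention block operating on 1-D (de)convolutional feature maps could look like the minimal PyTorch sketch below. Class and parameter names are illustrative and are not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention1d(nn.Module):
    """Minimal sketch of a non-local self-attention layer for 1-D
    (de)convolutional feature maps. Names are illustrative, not the
    authors' implementation."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # 1x1 convolutions project features into query/key/value spaces.
        self.query = nn.Conv1d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # Learnable scale, initialized to 0, so training starts from the
        # plain convolutional behaviour and gradually mixes in attention
        # (a common choice in non-local attention implementations).
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, channels, time)
        q = self.query(x).permute(0, 2, 1)          # (B, T, C//r)
        k = self.key(x)                              # (B, C//r, T)
        attn = F.softmax(torch.bmm(q, k), dim=-1)    # (B, T, T) attention map
        v = self.value(x)                            # (B, C, T)
        out = torch.bmm(v, attn.permute(0, 2, 1))    # (B, C, T)
        # Residual connection: attention output is added to the input.
        return self.gamma * out + x
```

In SASEGAN, a block of this kind is coupled with a chosen (de)convolutional layer of the network; the zero-initialized scale means the model initially behaves like the purely convolutional SEGAN.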
2. SELF-ATTENTION SEGAN
2.1 SEGAN
A noisy speech signal is written as $\tilde{x} = x + n \in \mathbb{R}^T$, where $x \in \mathbb{R}^T$ is the clean speech signal and $n \in \mathbb{R}^T$ is the background noise. The goal is to learn the mapping $f(\tilde{x}): \tilde{x} \rightarrow x$.
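For context, the original SEGAN paper realizes this mapping with an encoder-decoder generator $G$ trained with a least-squares adversarial loss plus an $\ell_1$ reconstruction term; this objective is recalled from the SEGAN paper and is not reproduced in the excerpt above:

$$
\min_G \; \frac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\Big[\big(D(G(z,\tilde{x}),\tilde{x})-1\big)^2\Big] \;+\; \lambda\,\big\lVert G(z,\tilde{x}) - x \big\rVert_1
$$

where $z$ is a latent code, $D$ is the discriminator conditioned on the noisy input $\tilde{x}$, and $\lambda$ weights the $\ell_1$ term.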