Multi-channel Acoustic Modeling using Mixed Bitrate OPUS Compression
Authors: Aparna Khare, Minhua Wu
Link: https://arxiv.org/abs/2002.00122
Recent literature has shown that a learned front end with multi-channel audio input can outperform traditional beamforming algorithms for automatic speech recognition (ASR). In this paper, we present our study on multi-channel acoustic modeling using OPUS compression with different bitrates for the different channels. We analyze the degradation in word error rate (WER) as a function of the audio encoding bitrate and show that the WER degrades by 12.6% relative at 16 kbps as compared to uncompressed audio. We show that it is always preferable to have a multi-channel audio input over a single-channel audio input given limited bandwidth. Our results show that for the best WER, when one of the two channels can be encoded with a bitrate higher than 32 kbps, it is optimal to encode the other channel with the highest bitrate possible. For bitrates lower than that, it is preferable to distribute the bitrate equally between the two channels. We further show that by training the acoustic model on mixed bitrate input, up to 50% of the degradation can be recovered using a single model.
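The two-channel bandwidth-allocation rule reported above can be sketched as a small helper. This is only an illustration of the stated result, not code from the paper; the function name, the 64 kbps per-channel cap, and the exact tie-breaking around the 32 kbps threshold are our own assumptions:

```python
def allocate_bitrates(total_kbps, cap_kbps=64.0, threshold_kbps=32.0):
    """Split a two-channel bitrate budget following the reported rule:
    if one channel can stay above the 32 kbps threshold while the other
    takes the highest bitrate available, do that; otherwise distribute
    the budget equally between the two channels."""
    # Highest bitrate one channel can take while leaving the other
    # at least the threshold.
    high = min(cap_kbps, total_kbps - threshold_kbps)
    if high > threshold_kbps:
        return high, total_kbps - high
    # Budget too small to keep a channel above the threshold: split evenly.
    return total_kbps / 2, total_kbps / 2
```

For example, a 96 kbps budget yields (64, 32), while a 60 kbps budget falls below the regime where the rule applies and is split as (30, 30).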
1. INTRODUCTION
Multi-channel modeling has become a popular alternative to conventional beamforming in recent literature [1][2]. This method of front-end processing opens up avenues for processing audio not just from microphones within a single microphone array but also from microphones on distinct devices. With the popularity of voice-enabled devices in the consumer market, this opens up an interesting area of research. Different devices from different manufacturers, however, may not adhere to the same encoding standards. In addition, the bandwidth available for audio upload at different times might differ. Motivated by these questions, we wanted to study how multi-channel acoustic models perform when different audio inputs have different bitrates.
Prior work in this area has focused on developing compression algorithms that minimally impact speech recognition systems [3] or on analyzing the quality of the OPUS codec as it pertains to voice quality [4]. In [5], the authors evaluate the performance of different audio codecs on the frequency spectrum. There is also prior work in ASR that examines mixed-bandwidth models that deal with data with different sampling rates [6]. There has been some related work in the emotion recognition domain; the authors in [7] and [8] show how different codecs and different bitrates affect emotion recognition accuracy. The authors in [9] analyze human emotion intelligibility as a function of the bitrate. The other related work is in the domain of multi-channel acoustic modeling, where traditional front-end processing algorithms like beamforming are learned within the acoustic modeling network. [10] and [2] demonstrate that a data-driven beamformer outperforms traditional beamforming approaches.
Our contribution in this paper is to study how ASR model performance varies with different bitrate encodings of the OPUS codec [11]. We chose the OPUS codec for our analysis because it is an open-source, industry-leading standard for audio coding across multiple applications [12]. The main benefit of OPUS over other codecs is its hybrid approach, where it uses SILK for encoding information below 8 kHz and CELT for information above 8 kHz [11], thus providing high quality for both speech and audio/music components. We analyze the performance of both multi-channel and single-channel acoustic models as a function of the audio input bitrate. We demonstrate that by training with mixed bitrates we can recover some of the performance loss. Motivated by the multiple-device use case described above, we also study the performance of the system when the multiple audio input channels to the network are not encoded with the same bitrate. The closest work to this was in [13], where the authors show how the ASR WER varies with different codecs at a fixed bitrate encoding.

This paper is organized as follows. In Section 2, we describe our model architecture, training techniques, and the training and evaluation data. In Section 3, we present our analysis of the ASR system performance with different bitrate audio inputs. We discuss our experiments on how to optimally distribute bandwidth between multiple channels in Section 4. In Section 5, we describe the mixed-bitrate model training aimed at recovering from the degradation introduced by low-bitrate encoding. Finally, in Section 6, we conclude our work and present future directions.
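As a concrete sketch of how per-channel OPUS encodings at several bitrates might be produced offline, the snippet below builds command lines for `opusenc` (from the opus-tools package), whose `--bitrate` option takes a target in kbit/s. The file names are hypothetical, and this is not the paper's actual data pipeline:

```python
from typing import List


def opusenc_cmd(wav_in: str, opus_out: str, bitrate_kbps: int) -> List[str]:
    """Build an opusenc command line that encodes wav_in to opus_out
    at the given target bitrate in kbit/s."""
    return ["opusenc", "--bitrate", str(bitrate_kbps), wav_in, opus_out]


# One command per (channel, bitrate) pair for a two-channel recording,
# e.g. to prepare mixed-bitrate training inputs.
cmds = [
    opusenc_cmd(f"ch{c}.wav", f"ch{c}_{b}kbps.opus", b)
    for c in (0, 1)
    for b in (16, 32, 64)
]
```

The commands can then be run with `subprocess.run` or a shell script; only the command construction is shown here.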