Multi-channel Acoustic Modeling using Mixed Bitrate OPUS Compression

Authors: Aparna Khare, Minhua Wu
Link: https://arxiv.org/abs/2002.00122

Recent literature has shown that a learned front end with multi-channel audio input can outperform traditional beam-forming algorithms for automatic speech recognition (ASR). In this paper, we present our study on multi-channel acoustic modeling using OPUS compression with different bitrates for the different channels. We analyze the degradation in word error rate (WER) as a function of the audio encoding bitrate and show that the WER degrades by 12.6% relative at 16 kbps compared to uncompressed audio. We show that it is always preferable to have a multi-channel audio input over a single-channel audio input given limited bandwidth. Our results show that for the best WER, when one of the two channels can be encoded with a bitrate higher than 32 kbps, it is optimal to encode the other channel with the highest bitrate possible. For bitrates lower than that, it is preferable to distribute the bitrate equally between the two channels. We further show that by training the acoustic model on mixed bitrate input, up to 50% of the degradation can be recovered using a single model.
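
As a concrete illustration of the setup described in the abstract, the short Python sketch below encodes the two channels of a recording at different OPUS bitrates using ffmpeg's libopus encoder. This is not the authors' pipeline; the file names, the 64/16 kbps bitrate pair, and the use of ffmpeg are assumptions made purely for illustration.

# Minimal sketch (not the authors' pipeline): encode each channel of a
# two-channel recording at a different OPUS bitrate. Assumes ffmpeg with
# libopus is installed and on PATH; file names and bitrates are illustrative.
import subprocess

def encode_opus(in_wav, out_opus, bitrate_kbps):
    """Encode a mono WAV file to OPUS at the given bitrate (in kbps)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_wav,
         "-c:a", "libopus", "-b:a", f"{bitrate_kbps}k", out_opus],
        check=True,
    )

# Mixed-bitrate condition: one channel kept at a high bitrate, the other at 16 kbps.
encode_opus("channel_0.wav", "channel_0_64k.opus", 64)
encode_opus("channel_1.wav", "channel_1_16k.opus", 16)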

  1. INTRODUCTION
    Multi-channel modeling has become a popular alternative to conventional beamforming in recent literature [1][2]. This method of front-end processing opens up avenues for processing audio not just from microphones within a single microphone array but from microphones on distinct devices. With the popularity of voice-enabled devices in the consumer market, this opens up an interesting area of research. Different devices from different manufacturers, however, may not adhere to the same encoding standards. In addition, the bandwidth available for audio upload at different times might be different. Motivated by these questions, we wanted to study how multi-channel acoustic models perform when different audio inputs have different bitrates.
    Prior work in this area has focused on developing compression algorithms that minimally impact speech recognition systems [3] or on analyzing the quality of the OPUS codec as it pertains to voice quality [4]. In [5], the authors evaluate the performance of different audio codecs on the frequency spectrum. There is also prior work in ASR that examines mixed-bandwidth models dealing with data at different sampling rates [6]. There has been some related work in the emotion recognition domain; the authors in [7] and [8] show how different codecs and different bitrates affect emotion recognition accuracy. The authors in [9] analyze human emotion intelligibility as a function of the bitrate. The other related work is in the domain of multi-channel acoustic modeling, where traditional front-end processing algorithms like beamforming are learned within the acoustic modeling network. [10] and [2] demonstrate that a data-driven beamformer outperforms traditional beamforming approaches.
    Our contribution in this paper is to study how ASR model performance varies with different bitrate encodings of the OPUS codec [11]. We chose the OPUS codec for our analysis because it is an open-source, industry-leading standard for audio coding across multiple applications [12]. The main benefit of OPUS over other codecs is its hybrid approach, which uses SILK to encode information below 8 kHz and CELT for information above 8 kHz [11], thus providing high quality for both speech and audio/music components. We analyze the performance of both the multi-channel and single-channel acoustic models as a function of the audio input bitrate. We demonstrate that by training with mixed bitrates we can recover some of the performance loss. Motivated by the multiple-device use case described above, we also study the performance of the system when the multiple audio input channels to the network are not encoded with the same bitrate. The closest work to this is [13], where the authors show how the ASR WER varies with different codecs at a fixed bitrate encoding.
    This paper is organized as follows. In Section 2, we describe our model architecture, training techniques, and the training and evaluation data. In Section 3, we present our analysis of the ASR system performance with different bitrate audio inputs. We discuss our experiments on how to optimally distribute bandwidth between multiple channels in Section 4. In Section 5, we describe mixed-bitrate model training focused on recovering from the degradation introduced by low bitrate encoding. Finally, in Section 6, we conclude our work and present future work.
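
    As a rough illustration of the mixed-bitrate training idea above, the Python sketch below samples an independent OPUS bitrate per channel for each training utterance, so that a single acoustic model sees many bitrate combinations during training. The bitrate set, the "None means uncompressed" convention, and the uniform sampling are assumptions for illustration, not the paper's published recipe.

# Hypothetical sketch of mixed-bitrate training data preparation: sample an
# independent bitrate per channel for each utterance. The bitrate set and the
# uniform sampling scheme are assumptions, not the paper's recipe.
import random

BITRATES_KBPS = [8, 16, 24, 32, 64, None]  # None = leave the channel uncompressed

def sample_channel_bitrates(num_channels, rng=random):
    """Pick an independent bitrate (or None for uncompressed) per channel."""
    return [rng.choice(BITRATES_KBPS) for _ in range(num_channels)]

# Example: choose how to encode the two channels of one training utterance;
# the sampled bitrates would then drive an encode/decode pass (e.g. with the
# helper sketched after the abstract) before feature extraction.
print(sample_channel_bitrates(num_channels=2))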