Multi-channel Acoustic Modeling using Mixed Bitrate OPUS Compression

Authors: Aparna Khare, Minhua Wu

Recent literature has shown that a learned front end with multi-channel audio input can outperform traditional beamforming algorithms for automatic speech recognition (ASR). In this paper, we present our study on multi-channel acoustic modeling using OPUS compression with different bitrates for the different channels. We analyze the degradation in word error rate (WER) as a function of the audio encoding bitrate and show that the WER degrades by 12.6% relative at 16 kbps compared to uncompressed audio. We show that it is always preferable to have a multi-channel audio input over a single-channel audio input given limited bandwidth. Our results show that, for the best WER, when one of the two channels can be encoded with a bitrate higher than 32 kbps, it is optimal to encode the other channel with the highest bitrate possible. For bitrates lower than that, it is preferable to distribute the bitrate equally between the two channels. We further show that by training the acoustic model on mixed-bitrate input, up to 50% of the degradation can be recovered using a single model.
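The "relative" degradation quoted in the abstract is the WER increase divided by the baseline WER. The following is a minimal sketch of that arithmetic; the function name and the WER values are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_degradation(baseline_wer, degraded_wer):
    """Return the relative WER degradation, in percent, against a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (degraded_wer - baseline_wer) / baseline_wer


# Hypothetical WER values chosen only to illustrate the formula.
baseline = 10.0      # WER (%) on uncompressed audio (assumed)
compressed = 11.26   # WER (%) on low-bitrate audio (assumed)
print(f"{relative_wer_degradation(baseline, compressed):.1f}% relative degradation")
```

Under these assumed numbers, an absolute increase of 1.26 WER points corresponds to a relative degradation of about 12.6%.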

    Multi-channel modeling has become a popular alternative to conventional beamforming in recent literature [1][2]. This method of front-end processing opens up avenues for processing audio not just from microphones in a single microphone array but also from microphones on distinct devices. With the popularity of voice-enabled devices in the consumer market, this opens up an interesting area of research. Different devices from different manufacturers, however, may not adhere to the same encoding standards. In addition, the bandwidth available for audio upload might differ at different times. Motivated by these questions, we wanted to study how multi-channel acoustic models perform when different audio inputs have different bitrates.
    Prior work in this area has focused on developing compression algorithms that minimally impact speech recognition systems [3] or on analyzing the quality of the OPUS codec as it pertains to voice quality [4]. In [5], the authors evaluate the performance of different audio codecs on the frequency spectrum. There is also prior work in ASR that examines mixed-bandwidth models that deal with data with different sampling rates [6]. There has been some related work in the emotion recognition domain; the authors in [7] and [8] show how different codecs and different bitrates affect emotion recognition accuracy. The authors in [9] analyze human emotion intelligibility as a function of the bitrate. The other related work is in the domain of multi-channel acoustic modeling, where traditional front-end processing algorithms like beamforming are learned within the acoustic modeling network. The authors of [10] and [2] demonstrate that a data-driven beamformer outperforms traditional beamforming approaches.
    Our contribution in this paper is to study how ASR model performance varies with the encoding bitrate of the OPUS codec [11]. We chose the OPUS codec for our analysis because it is an open-source, industry-leading standard for audio coding for multiple applications [12]. The main benefit of OPUS over other codecs is the hybrid approach, where it uses SILK for encoding information below 8 kHz and CELT for information above 8 kHz [11], thus providing high quality for both speech and audio/music components. We analyze the performance of both the multi-channel and single-channel acoustic models as a function of the audio input bitrate. We demonstrate that by training with mixed bitrates we can recover some of the performance loss. Motivated by the multiple-device use case described above, we also study the performance of the system when the multiple audio input channels to the network are not encoded with the same bitrate. The closest work to this is [13], where the authors show how the ASR WER varies with different codecs at a fixed encoding bitrate.
    This paper is organized as follows. In Section 2, we describe our model architecture, training techniques, and the training and evaluation data. In Section 3, we present our analysis of the ASR system performance with audio inputs at different bitrates. We discuss our experiments on how to optimally distribute bandwidth between multiple channels in Section 4. In Section 5, we describe the mixed-bitrate model training aimed at recovering from the degradation introduced by low-bitrate encoding. Finally, in Section 6, we analyze and conclude our work and discuss future work.
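As a rough illustration of the mixed-bitrate setup studied here, each channel can be encoded separately at its own bitrate. The sketch below only builds the command lines and does not run them. It assumes the `opusenc` tool from opus-tools and its `--bitrate` option (in kbit/s); the text above does not state which encoder tooling the authors used, and the helper name and file names are hypothetical.

```python
import os


def opusenc_commands(channel_wavs, bitrates_kbps):
    """Build one opusenc command per mono channel file, each with its own bitrate.

    Assumption: each channel has already been split into its own WAV file.
    """
    if len(channel_wavs) != len(bitrates_kbps):
        raise ValueError("need exactly one bitrate per channel")
    commands = []
    for wav_path, kbps in zip(channel_wavs, bitrates_kbps):
        stem, _ = os.path.splitext(wav_path)
        out_path = f"{stem}_{kbps}k.opus"  # hypothetical naming scheme
        commands.append(["opusenc", "--bitrate", str(kbps), wav_path, out_path])
    return commands


# Hypothetical two-channel example: one channel at 32 kbps, the other at 16 kbps.
for cmd in opusenc_commands(["ch0.wav", "ch1.wav"], [32, 16]):
    print(" ".join(cmd))
```

Each resulting list could be passed to `subprocess.run` on a machine where `opusenc` is installed; the encoded channels would then be decoded back to PCM before feature extraction.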


