AISHELL-ASR0009-OS1 Open-Source Chinese Speech Database

ABSTRACT

An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including the audio capturing devices and environments, is presented in detail. The preparation of the related resources, including transcriptions and lexicon, is described. The corpus is released with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.

Index Terms— Speech Recognition, Mandarin Corpus, Open-Source Data

INTRODUCTION

Automatic Speech Recognition (ASR) has been an active research topic for several decades. Most state-of-the-art ASR systems benefit from powerful statistical models, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Deep Neural Networks (DNN). These statistical frameworks often require a large amount of high-quality data. Fortunately, with the wide adoption of smartphones and the emerging market for various smart devices, real user data are generated worldwide every day, so collecting data has become easier than ever before. Combined with a sufficient amount of real data and supervised training, the statistical approach has achieved great success across the speech industry.

However, for legal and commercial reasons, most companies are not willing to share their data with the public: large industrial datasets are often inaccessible to the academic community, which leads to a divergence between research and industry. On the one hand, researchers are interested in fundamental problems such as designing new model structures or combating over-fitting with limited data. Such innovations and tricks in academic papers sometimes prove ineffective when the dataset gets much larger; different scales of data lead to different stories. On the other hand, industrial developers are more concerned with building products and infrastructure that can quickly accumulate real user data, then feeding the collected data into simple algorithms such as logistic regression and deep learning.

In the ASR community, the OpenSLR project was established to alleviate this problem. For English ASR, industrial-sized datasets such as TED-LIUM and LibriSpeech offer open platforms for both researchers and industrial developers to experiment and compare system performance. Unfortunately, for Chinese ASR, the only open-source corpus is THCHS30, released by Tsinghua University, containing 50 speakers and around 30 hours of Mandarin speech data. Generally speaking, Mandarin ASR systems based on a small dataset like THCHS30 are not expected to perform well. In this paper, we present the AISHELL-1 corpus. To the authors' limited knowledge, AISHELL-1 is by far the largest open-source Mandarin ASR corpus.

This open-source Mandarin speech corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is part of AISHELL-ASR0009, whose utterances cover 11 domains, including smart home, autonomous driving, and industrial production. All recording took place in a quiet indoor environment, using 3 different devices at the same time: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. Through professional speech annotation and strict quality inspection, the manual transcription accuracy rate is above 95%. The corpus is divided into training, development, and test sets. (This database is free for academic research; commercial use is not allowed without permission.)
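After downloading the corpus, the figures quoted above (total duration, 16 kHz sampling rate) can be checked directly from the audio headers. The following is a minimal sketch using only the Python standard library; it assumes the audio is stored as uncompressed `.wav` files somewhere under a single directory. The function name `corpus_stats` and its parameters are illustrative and are not part of the AISHELL release or the Kaldi recipe.

```python
import wave
from pathlib import Path


def corpus_stats(root, expected_rate=16000):
    """Sum the duration of every .wav file under `root`.

    Returns the file count, the total duration in hours, and the list of
    files whose sampling rate differs from `expected_rate` (assumed here
    to be 16 kHz, the rate the description gives for the released audio).
    """
    total_seconds = 0.0
    n_files = 0
    mismatched = []
    for path in sorted(Path(root).rglob("*.wav")):
        # Only the header is needed: frame count and sampling rate.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            total_seconds += wav.getnframes() / rate
        if rate != expected_rate:
            mismatched.append(path)
        n_files += 1
    return {
        "files": n_files,
        "hours": total_seconds / 3600.0,
        "mismatched": mismatched,
    }
```

Calling `corpus_stats("/path/to/corpus")`, where the path is a placeholder for wherever the archive was extracted, returns a dictionary whose `hours` entry can be compared with the stated 178 hours and whose `mismatched` list should be empty if every file is at 16 kHz.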

178 hours

400 Mandarin Chinese speakers

Speech recognition and speaker recognition (voiceprint) experiments

Kaldi recipe included

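AISHELL (希尔贝壳): focused on innovation in AI big data and technology. Beijing Shell Shell Technology Co., Ltd., founded in 2017, is an innovative company focused on AI big data and technology services. It produces scenario-specific speech data and delivers solutions for voice-enabled smart products such as home, in-vehicle, and robot devices. Using its machine learning platform, it has built a leading core technology stack for business scenarios such as speech data evaluation, assisted transcription, data analysis, and intelligent voice customer service. http://www.aishelltech.com/kysjcp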
