AISHELL-ASR0009-OS1 Open-Source Chinese Speech Corpus


An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including the audio capturing devices and environments, is presented in detail. The preparation of the related resources, including the transcriptions and lexicon, is described. The corpus is released with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.

Index Terms— Speech Recognition, Mandarin Corpus, Open-Source Data


Automatic Speech Recognition (ASR) has been an active research topic for several decades. Most state-of-the-art ASR systems benefit from powerful statistical models, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) and Deep Neural Networks (DNN). These statistical frameworks often require a large amount of high-quality data. Luckily, with the wide adoption of smartphones and the emerging market for various smart devices, real user data are generated worldwide every day, so collecting data has become easier than ever before. Combined with a sufficient amount of real data and supervised training, the statistical approach has achieved great success across the speech industry.

However, for legal and commercial reasons, most companies are not willing to share their data with the public. Large industrial datasets are often inaccessible to the academic community, which leads to a divergence between research and industry. On one hand, researchers are interested in fundamental problems such as designing new model structures or combating over-fitting with limited data. Such innovations and tricks in academic papers sometimes prove ineffective when the dataset gets much larger: different scales of data lead to different stories. On the other hand, industrial developers are more concerned with building products and infrastructure that can quickly accumulate real user data, and then feeding the collected data back into simple algorithms such as logistic regression and deep learning.

In the ASR community, the OpenSLR project was established to alleviate this problem. For English ASR, industrial-sized datasets such as TED-LIUM and LibriSpeech offer open platforms on which both researchers and industrial developers can experiment and compare system performance. Unfortunately, for Chinese ASR, the only open-source corpus is THCHS30, released by Tsinghua University, which contains 50 speakers and around 30 hours of Mandarin speech data. Generally speaking, Mandarin ASR systems based on a small dataset like THCHS30 are not expected to perform well. In this paper, we present the AISHELL-1 corpus. To the best of the authors' knowledge, AISHELL-1 is by far the largest open-source Mandarin ASR corpus.

This open-source Mandarin speech corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, whose utterances cover 11 domains, including smart home, autonomous driving, and industrial production. All recordings were made in a quiet indoor environment, using 3 different devices at the same time: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The manual transcription accuracy is above 95%, achieved through professional speech annotation and strict quality inspection. The corpus is divided into training, development and testing sets. (This database is free for academic research; commercial use is not permitted without permission.)
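The re-sampling step and the corpus size quoted above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the storage estimate assumes single-channel, uncompressed PCM, which the description does not state. It derives the rational up/down factors for converting 44.1 kHz to 16 kHz, the number of output samples for a given input length, and the raw size of 178 hours of 16 kHz, 16-bit audio.

```python
from fractions import Fraction

SRC_RATE_HZ = 44_100    # high-fidelity microphone rate (from the description)
DST_RATE_HZ = 16_000    # rate of the released corpus (from the description)
BYTES_PER_SAMPLE = 2    # 16-bit samples
CORPUS_HOURS = 178      # corpus duration (from the description)


def resample_factors(src_hz: int, dst_hz: int) -> tuple[int, int]:
    """Return (up, down) in lowest terms such that dst_hz = src_hz * up / down."""
    ratio = Fraction(dst_hz, src_hz)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples: int, src_hz: int, dst_hz: int) -> int:
    """Number of output samples after rational re-sampling, rounded up."""
    up, down = resample_factors(src_hz, dst_hz)
    return -(-n_samples * up // down)  # ceiling division on integers


def pcm_bytes(hours: float, rate_hz: int, bytes_per_sample: int,
              channels: int = 1) -> int:
    """Uncompressed PCM size in bytes; channels=1 (mono) is an assumption."""
    return int(hours * 3600 * rate_hz * bytes_per_sample * channels)


if __name__ == "__main__":
    up, down = resample_factors(SRC_RATE_HZ, DST_RATE_HZ)
    print(f"up/down factors: {up}/{down}")
    print("samples per second after re-sampling:",
          resampled_length(SRC_RATE_HZ, SRC_RATE_HZ, DST_RATE_HZ))
    size = pcm_bytes(CORPUS_HOURS, DST_RATE_HZ, BYTES_PER_SAMPLE)
    print(f"raw PCM size: {size / 1e9:.1f} GB")
```

The (up, down) pair is the form that polyphase re-samplers typically expect. The sketch only computes the factors and sizes; an actual conversion would also need a low-pass filter to avoid aliasing.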

178 Hours


400 speakers in the recording



Speech & Speaker Recognition


Integrated with the Kaldi toolkit

Kaldi recipe




