AISHELL-3: A High-Fidelity Mandarin Speech Database

ABSTRACT

In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Auxiliary attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus. Accordingly, transcripts at the Chinese character level and pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice-similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code, and generated samples are available online.
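The feedback constraint described above can be sketched as an auxiliary loss that pulls the speaker-verification embedding of the synthesized speech toward that of the ground-truth recording. The PyTorch sketch below is a hedged illustration, not the paper's exact module; `speaker_encoder` stands in for a pretrained, frozen verification model.

```python
import torch
import torch.nn.functional as F

def voice_similarity_loss(mel_pred, mel_ref, speaker_encoder):
    """Sketch of the feedback-constraint idea: penalize dissimilarity
    between the speaker embeddings of synthesized and reference speech.

    `speaker_encoder` is assumed to be a pretrained, frozen speaker
    verification model mapping a mel spectrogram to a fixed embedding.
    """
    with torch.no_grad():
        emb_ref = speaker_encoder(mel_ref)    # reference voice, no gradient
    emb_pred = speaker_encoder(mel_pred)      # gradients flow back to the TTS model
    cos = F.cosine_similarity(emb_pred, emb_ref, dim=-1)
    return (1.0 - cos).mean()                 # 0 when the voices match

# In training, this term would be added with a tunable weight to the
# standard Tacotron-2 reconstruction losses.
```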

Index Terms— open source database, text-to-speech, multi-speaker speech synthesis, speaker embedding, end-to-end

INTRODUCTION

Speech synthesis, or Text-To-Speech (TTS), is the automated process of mapping input text specifications to target utterances. In recent years, neural network based TTS systems have achieved marvelous results in terms of audio quality and perceptual naturalness. This flourishing research progress is largely due to the introduction of attention-based sequence-to-sequence modeling architectures such as Tacotron and Transformer-TTS, and of neural vocoders that map the lower-dimensional acoustic representation to waveforms.
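As a rough illustration of this two-stage design, the sketch below factors synthesis into an acoustic model and a vocoder; `acoustic_model` and `vocoder` are hypothetical placeholders for trained networks such as Tacotron and a neural vocoder, not a real API.

```python
def synthesize(text, acoustic_model, vocoder):
    """Sketch of the two-stage neural TTS pipeline described above."""
    # Front end: normalize raw text into a symbol sequence
    # (for Mandarin, typically pinyin with tone markers).
    symbols = acoustic_model.text_to_symbols(text)

    # Stage 1: the attention-based sequence-to-sequence acoustic model
    # maps symbols to a low-dimensional acoustic representation,
    # e.g. an 80-band mel spectrogram of shape (frames, 80).
    mel = acoustic_model.infer(symbols)

    # Stage 2: the neural vocoder maps the mel spectrogram to a waveform
    # (a 1-D float array at, say, 22.05 kHz).
    return vocoder.infer(mel)
```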

A key characteristic of TTS is its lack of constraint, which renders the task essentially a one-to-many mapping: given only the textual content, utterances spoken by either a male or a female voice, agitated or neutral, are equally valid outputs. Real-world applications, however, require robust and consistent behavior. This begs the question of whether we could provide further specification to the system to gain more flexibility over conventional approaches. There is growing interest within the field in designing TTS systems that are more flexible and admit stronger constraints on their behavior. Recent publications on expressive or prosodic TTS systems tend to feed the acoustic model explicit control signals (e.g., pitch/energy in supervised settings and learned embeddings in unsupervised variants) as augmented input besides normalized text. A more prominent and intuitive feature of speech is speaker identity, and multi-speaker acoustic models give TTS systems the ability to disentangle perceptual speaker identity from the textual content of the synthesized utterance by explicitly conditioning the model on the desired speaker.
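To make the speaker conditioning concrete, one common scheme looks up a learned per-speaker embedding, broadcasts it along the time axis, and concatenates it with the text encoder outputs before attention. The PyTorch sketch below uses illustrative dimensions and is one plausible variant, not the baseline's exact architecture.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Sketch: inject speaker identity into a seq2seq acoustic model."""

    def __init__(self, n_speakers=218, spk_dim=64, enc_dim=512):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)

    def forward(self, encoder_out, speaker_id):
        # encoder_out: (batch, time, enc_dim); speaker_id: (batch,)
        spk = self.spk_table(speaker_id)                  # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        # (batch, time, enc_dim + spk_dim), consumed by the attention/decoder
        return torch.cat([encoder_out, spk], dim=-1)
```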

Training such systems naturally requires a significant amount of annotated data. VCTK is a freely available multi-speaker corpus that can be used to train such systems. However, VCTK only contains recordings in English. As previous studies suggest, despite the cultural influence of English as the lingua franca of academia, language-specific subsystems and model modifications remain an area of active research. TTS systems targeting tonal languages such as Mandarin Chinese and Japanese face a particularly difficult situation given their complex tonal and prosodic structures. The lack of a publicly available multi-speaker Mandarin dataset suitable for TTS training makes research in this area more difficult and costly, and deprives the field of objective indicators that are comparable across studies.

To this end, we introduce the AISHELL-3 corpus to fill this vacancy in open resources. AISHELL-3 contains roughly 85 hours of high-fidelity Mandarin speech recordings from 218 native speakers, with manually transcribed Chinese characters and pronunciations in the form of pinyin notation. Furthermore, we present a multi-speaker TTS system trained on this dataset as a baseline. Objective evaluations of the synthesized samples show behavior consistent with previous studies conducted on a VCTK system with the same architecture.
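For reference, the equal error rate used in such objective evaluations can be computed from same-speaker and different-speaker similarity scores as below; this is the standard definition, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false accepts equal false rejects.

    labels: 1 for same-speaker (target) trials, 0 for impostor trials.
    scores: similarity scores, e.g. cosine similarity between speaker
            embeddings of a synthesized and a reference utterance.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # closest crossover point
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with made-up scores:
print(equal_error_rate([1, 1, 1, 0, 0, 0],
                       [0.82, 0.74, 0.40, 0.55, 0.30, 0.12]))
```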

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers, for a total of 88,035 utterances. Auxiliary speaker attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus, and transcripts at the Chinese character level and pinyin level are provided along with the recordings. Thanks to professional speech annotation and strict quality inspection of tone and prosody, the word and tone transcription accuracy is above 98%. (This database is free for academic research; commercial use without permission is not allowed.)
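As a quick-start illustration, the snippet below parses transcript lines into character-level and pinyin-level transcripts. The file name and the interleaved "character pinyin character pinyin ..." layout are assumptions to verify against the actual release.

```python
def parse_transcript_line(line):
    """Split one transcript line into (utterance_id, characters, pinyin).

    Assumed layout (check against the downloaded corpus):
    SSB00050001.wav  广 guang3 州 zhou1 女 nv3 ...
    """
    utt_id, text = line.rstrip("\n").split(maxsplit=1)
    tokens = text.split()
    chars = tokens[0::2]    # even positions: Chinese characters
    pinyin = tokens[1::2]   # odd positions: tone-numbered pinyin
    return utt_id, "".join(chars), " ".join(pinyin)

# Hypothetical usage, assuming a "content.txt" transcript file:
with open("content.txt", encoding="utf-8") as f:
    for line in f:
        print(parse_transcript_line(line))
```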

85 Hours

88,035 Utterances

218 Speakers

Speech Synthesis Experiments

Text-To-Speech (TTS) Systems

Open-Source TTS System Applications

AISHELL (希尔贝壳) — dedicated to innovation in artificial intelligence big data and technology. Beijing Shell Shell Technology Co., Ltd., founded in 2017, is an innovative company focusing on AI big data and technology services. It produces scenario-specific speech data and delivers solutions for smart-home, in-vehicle, robot, and other voice-enabled products. Built on its machine learning platform, the company has established a leading core technology system for speech data evaluation, transcription assistance, data analysis, and intelligent voice customer service. http://www.aishelltech.com/aishell_3
