AISHELL-3: A High-Fidelity Mandarin Speech Database

ABSTRACT

In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Auxiliary attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus. Accordingly, transcripts at the Chinese character level and pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice-similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code, and generated samples are available online.
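The feedback constraint described above can be sketched as an auxiliary loss that pulls the speaker-verification embedding of the synthesized speech toward that of the ground-truth recording. The PyTorch sketch below is a hedged illustration, not the paper's exact module; `speaker_encoder` stands in for a pretrained, frozen verification model.

```python
import torch
import torch.nn.functional as F

def voice_similarity_loss(mel_pred, mel_ref, speaker_encoder):
    """Sketch of the feedback-constraint idea: penalize dissimilarity
    between the speaker embeddings of synthesized and reference speech.

    `speaker_encoder` is assumed to be a pretrained, frozen speaker
    verification model mapping a mel spectrogram to a fixed embedding.
    """
    with torch.no_grad():
        emb_ref = speaker_encoder(mel_ref)    # reference voice, no gradient
    emb_pred = speaker_encoder(mel_pred)      # gradients flow back to the TTS model
    cos = F.cosine_similarity(emb_pred, emb_ref, dim=-1)
    return (1.0 - cos).mean()                 # 0 when the voices match

# In training, this term would be added with a tunable weight to the
# standard Tacotron-2 reconstruction losses.
```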

Index Terms— open source database, text-to-speech, multi-speaker speech synthesis, speaker embedding, end-to-end

INTRODUCTION

Speech synthesis, or Text-To-Speech (TTS), is the automated process of mapping input text specifications to target utterances. In recent years, neural network based TTS systems have achieved marvelous results in terms of audio quality and perceptual naturalness. This flourishing research progress is largely due to the introduction of attention-based sequence-to-sequence modeling architectures such as Tacotron and Transformer-TTS, and of neural vocoders that map the lower-dimensional acoustic representation to waveforms.
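As a rough illustration of this two-stage design, the sketch below factors synthesis into an acoustic model and a vocoder; `acoustic_model` and `vocoder` are hypothetical placeholders for trained networks such as Tacotron and a neural vocoder, not a real API.

```python
def synthesize(text, acoustic_model, vocoder):
    """Sketch of the two-stage neural TTS pipeline described above."""
    # Front end: normalize raw text into a symbol sequence
    # (for Mandarin, typically pinyin with tone markers).
    symbols = acoustic_model.text_to_symbols(text)

    # Stage 1: the attention-based sequence-to-sequence acoustic model
    # maps symbols to a low-dimensional acoustic representation,
    # e.g. an 80-band mel spectrogram of shape (frames, 80).
    mel = acoustic_model.infer(symbols)

    # Stage 2: the neural vocoder maps the mel spectrogram to a waveform
    # (a 1-D float array at, say, 22.05 kHz).
    return vocoder.infer(mel)
```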

A key characteristic of TTS is its lack of constraint, which renders the task essentially a one-to-many mapping: given only the textual content, utterances spoken by either a male or a female voice, agitated or neutral, are equally valid outputs. Real-world applications, however, require robust and consistent behavior. This begs the question of whether we could provide further specification to the system to gain more flexibility over conventional approaches. There is growing interest within the field in designing TTS systems that are more flexible and admit stronger constraints on their behavior. Recent publications on expressive or prosodic TTS systems tend to feed the acoustic model explicit control signals (e.g., pitch/energy in supervised settings and learned embeddings in unsupervised variants) as augmented input besides normalized text. A more prominent and intuitive feature of speech is speaker identity, and multi-speaker acoustic models give TTS systems the ability to disentangle perceptual speaker identity from the textual content of the synthesized utterance by explicitly conditioning the model on the desired speaker.
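To make the speaker conditioning concrete, one common scheme looks up a learned per-speaker embedding, broadcasts it along the time axis, and concatenates it with the text encoder outputs before attention. The PyTorch sketch below uses illustrative dimensions and is one plausible variant, not the baseline's exact architecture.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Sketch: inject speaker identity into a seq2seq acoustic model."""

    def __init__(self, n_speakers=218, spk_dim=64, enc_dim=512):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)

    def forward(self, encoder_out, speaker_id):
        # encoder_out: (batch, time, enc_dim); speaker_id: (batch,)
        spk = self.spk_table(speaker_id)                  # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        # (batch, time, enc_dim + spk_dim), consumed by the attention/decoder
        return torch.cat([encoder_out, spk], dim=-1)
```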

Training such systems naturally requires a significant amount of annotated data. VCTK is a freely available multi-speaker corpus that can be used to train such systems. However, VCTK only contains recordings in English. As previous studies suggest, despite the cultural influence of English as the lingua franca of academia, language-specific subsystems and model modifications remain an area of active research. TTS systems targeting tonal languages such as Mandarin Chinese and Japanese face a particularly difficult situation given their complex tonal and prosodic structures. The lack of a publicly available multi-speaker Mandarin dataset suitable for TTS training makes research in this area more difficult and costly, and deprives the field of objective indicators that are comparable across studies.

To this end, we introduce the AISHELL-3 corpus to fill this vacancy in open resources. AISHELL-3 contains roughly 85 hours of high-fidelity Mandarin speech recordings from 218 native speakers, with manually transcribed Chinese characters and pronunciations in the form of pinyin notation. Furthermore, we present a multi-speaker TTS system trained on this dataset as a baseline. Objective evaluations of the synthesized samples show behavior consistent with previous studies conducted on a VCTK system with the same architecture.
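For reference, the equal error rate used in such objective evaluations can be computed from same-speaker and different-speaker similarity scores as below; this is the standard definition, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false accepts equal false rejects.

    labels: 1 for same-speaker (target) trials, 0 for impostor trials.
    scores: similarity scores, e.g. cosine similarity between speaker
            embeddings of a synthesized and a reference utterance.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # closest crossover point
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with made-up scores:
print(equal_error_rate([1, 1, 1, 0, 0, 0],
                       [0.82, 0.74, 0.40, 0.55, 0.30, 0.12]))
```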

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers, for a total of 88,035 utterances. Auxiliary speaker attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus, and transcripts at the Chinese character level and pinyin level are provided along with the recordings. Thanks to professional speech annotation and strict quality inspection of tone and prosody, the word and tone transcription accuracy is above 98%. (This database is free for academic research; commercial use without permission is not allowed.)
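As a quick-start illustration, the snippet below parses transcript lines into character-level and pinyin-level transcripts. The file name and the interleaved "character pinyin character pinyin ..." layout are assumptions to verify against the actual release.

```python
def parse_transcript_line(line):
    """Split one transcript line into (utterance_id, characters, pinyin).

    Assumed layout (check against the downloaded corpus):
    SSB00050001.wav  广 guang3 州 zhou1 女 nv3 ...
    """
    utt_id, text = line.rstrip("\n").split(maxsplit=1)
    tokens = text.split()
    chars = tokens[0::2]    # even positions: Chinese characters
    pinyin = tokens[1::2]   # odd positions: tone-numbered pinyin
    return utt_id, "".join(chars), " ".join(pinyin)

# Hypothetical usage, assuming a "content.txt" transcript file:
with open("content.txt", encoding="utf-8") as f:
    for line in f:
        print(parse_transcript_line(line))
```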

85 Hours

88,035 Utterances

218 Speakers

Speech Synthesis Experiments

Text-To-Speech (TTS) Systems

Open-Source TTS System Applications

AISHELL (希尔贝壳) — dedicated to innovation in artificial intelligence big data and technology. Beijing Shell Shell Technology Co., Ltd., founded in 2017, is an innovative company focusing on AI big data and technology services. It produces scenario-specific speech data and delivers solutions for smart-home, in-vehicle, robot, and other voice-enabled products. Built on its machine learning platform, the company has established a leading core technology system for speech data evaluation, transcription assistance, data analysis, and intelligent voice customer service. http://www.aishelltech.com/aishell_3
