How to Use ESPnet
Text-to-speech (TTS), as the name suggests, reads text aloud. It takes written words as input and converts them into audio. TTS can help anyone who would rather listen to a book, blog, or article than read it. In this article, we will see how to build a TTS engine, assuming no prior knowledge of TTS.
Text-To-Speech Architecture
The diagram above is a simplified representation of the architecture we are going to follow. We will look at each component in detail, and we will use the ESPnet framework for the implementation.
Front-end
It has three main components:
POS Tagger: Performs part-of-speech tagging on the input text.
Tokenize: Splits a sentence into words.
Pronunciation: Breaks the input text into phonemes based on its pronunciation, e.g. Hello, how are you → HH AH0 L OW1, HH AW1 AA1 R Y UW1. This is done