I. Related knowledge
1. Text normalization
- Normalizing text into standard format
- Every NLP task requires text normalization
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
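A minimal regex-based sketch of two of these steps, sentence segmentation and word tokenization with case folding; real normalizers handle abbreviations, numbers, and punctuation far more carefully:

```python
import re

def normalize(text):
    """Minimal text normalization sketch: sentence segmentation,
    word tokenization, and case folding."""
    # naive sentence segmentation: split after ., !, ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # tokenize each sentence into lowercase word tokens
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

# Note the limitation: the naive splitter wrongly breaks after "Dr."
print(normalize("Dr. Smith arrived. He said hello!"))
# [['dr'], ['smith', 'arrived'], ['he', 'said', 'hello']]
```

The wrong split after "Dr." is exactly why production front ends need abbreviation lists or learned sentence segmenters rather than pure punctuation rules.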
2. Grapheme-to-phoneme (G2P)
- Grapheme: a letter or group of letters that represents a single phoneme
- Phoneme: the smallest unit of sound that distinguishes one word from another in a particular language
- When a child says the sound /t/, that is a phoneme; when they write the letter 't', that is a grapheme.
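A toy dictionary-lookup G2P with an intentionally naive per-letter fallback. The lexicon entries are illustrative and loosely ARPAbet-style; real systems use trained G2P models for out-of-vocabulary words rather than letter-by-letter mapping:

```python
# Toy pronunciation lexicon (illustrative, loosely ARPAbet symbols)
LEXICON = {
    "cat": ["K", "AE", "T"],
    "the": ["DH", "AH"],
}

# one-letter-per-phoneme fallback table, covering only the demo letters
LETTER_TO_PHONE = {"t": "T", "o": "OW", "p": "P"}

def g2p(word):
    word = word.lower()
    if word in LEXICON:          # dictionary lookup first
        return LEXICON[word]
    # naive fallback: map each letter independently
    # (a real system would run a trained G2P model here)
    return [LETTER_TO_PHONE.get(ch, ch.upper()) for ch in word]

print(g2p("cat"))  # ['K', 'AE', 'T']  (lexicon hit)
print(g2p("top"))  # ['T', 'OW', 'P']  (fallback)
```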
3. Part-of-speech tagging is a disambiguation process
Why? In English, the same spelling can have different meanings and pronunciations in different sentences (e.g. "record" as a noun vs. as a verb), so the front end must decide which one is intended.
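A toy rule illustrating the point: the tag chosen for "record" changes which stress pattern a TTS front end would use. The phoneme strings below are illustrative, not from any real lexicon:

```python
def tag_record(prev_word):
    """Toy POS disambiguation for the word "record" based on the
    preceding word; the pronunciation strings are illustrative."""
    if prev_word in ("a", "the"):
        return ("NOUN", "REH1-KERD")   # stress on the first syllable
    if prev_word == "to":
        return ("VERB", "RIH-KAO1RD")  # stress on the second syllable
    return ("UNKNOWN", None)

print(tag_record("the"))  # ('NOUN', 'REH1-KERD')
print(tag_record("to"))   # ('VERB', 'RIH-KAO1RD')
```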
4. Embedding representations
(see the previous blog post)
II. TTS model
1. Basic introduction
- The end-to-end problem to solve: text -> Text-to-Speech -> waveform
- The two-stage pipeline:
  text -> Front end -> Waveform generator -> waveform
- The three-stage pipeline:
  text -> Front end -> linguistic specification -> Acoustic model -> acoustic features -> Waveform generator -> waveform
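The three-stage pipeline can be sketched as function composition, with each stage as a hypothetical placeholder (the data shapes and numbers below are illustrative only):

```python
def front_end(text):
    # placeholder linguistic specification: here just a symbol list
    return {"phonemes": list(text), "pos": None}

def acoustic_model(linguistic_spec):
    # placeholder: one 80-dim acoustic frame (e.g. mel bins) per symbol
    return [[0.0] * 80 for _ in linguistic_spec["phonemes"]]

def waveform_generator(acoustic_features):
    # placeholder: e.g. 256 waveform samples per acoustic frame
    return [0.0] * (len(acoustic_features) * 256)

def tts(text):
    # the three stages composed end to end
    return waveform_generator(acoustic_model(front_end(text)))

print(len(tts("hi")))  # 512 samples for 2 symbols
```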
2. Front end
- Definition
  text -> Front end [sentence structure, tokenization/text normalization, part-of-speech tagging, linguistic analysis] -> linguistic specification
- Features and processing
  1) Language dependent: each language has its own characteristics
  2) Handles text normalization
     e.g. $123 -> one hundred and twenty-three dollars
  3) Handles words whose pronunciation depends on context
     e.g. English "read" (present /riːd/ vs. past /rɛd/) and "record" (noun vs. verb stress);
     Chinese 奇偶 (jī'ǒu, "parity") vs. 奇怪 (qíguài, "strange"), where the character 奇 takes different pronunciations
- example
Classic front end:
‣ A chain of processes
‣ Each process is performed by a separate model
‣ These models are independently trained in a supervised fashion on annotated data
Neural front end: learned end-to-end by a neural network
text -> Neural net -> linguistic specification
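The $123 expansion mentioned above can be sketched as a small non-standard-word rewriter. This handles only "$<integer>" amounts up to 999; real front ends also cover dates, times, abbreviations, ordinals, and much more:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer 0..999 in (British-style) English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    rest = number_to_words(n % 100) if n % 100 else ""
    return ONES[n // 100] + " hundred" + (" and " + rest if rest else "")

def expand_currency(token):
    """Rewrite a "$<integer>" token into words; pass others through."""
    if token.startswith("$") and token[1:].isdigit():
        return number_to_words(int(token[1:])) + " dollars"
    return token

print(expand_currency("$123"))  # one hundred and twenty-three dollars
```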
3. Acoustic model
1) Input and output
   input sequence: linguistic features
   output sequence: acoustic features
2) Acoustic features with ML algorithms (you can search for more material for further learning)
- Acoustic model: decision tree
  A decision tree clusters HMM states, each of which models a distribution over acoustic features.
- Acoustic model: DNN
  Feedforward neural network
- Acoustic model: RNN based
  Tacotron 2: a sequence-to-sequence model built on recurrent neural networks
- Acoustic model: Transformer based
  FastSpeech 2: generates all frames in parallel and does not rely on attention-based alignment (it uses an explicit duration predictor instead)
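A shape-only sketch of the feedforward (DNN) variant: each frame of linguistic features maps independently to one frame of acoustic features, e.g. an 80-bin mel-spectrogram frame. All dimensions and the untrained random weights are illustrative, not from any specific system:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_hidden, d_out = 100, 300, 256, 80  # frames and feature dims

# untrained random weights, for shape illustration only
W1 = rng.normal(0, 0.01, (d_in, d_hidden))
W2 = rng.normal(0, 0.01, (d_hidden, d_out))

def acoustic_model(linguistic_features):
    h = np.tanh(linguistic_features @ W1)  # hidden layer
    return h @ W2                          # predicted acoustic features

x = rng.normal(size=(T, d_in))   # linguistic features, one row per frame
y = acoustic_model(x)
print(y.shape)  # (100, 80): 100 frames of 80-bin mel features
```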
4. Waveform generator: vocoder
1) Introduction
   acoustic features -> Waveform generator -> waveform
2) Solutions (you can search for more material for further learning)
- Vocoder: signal-processing based
- Vocoder: autoregressive
  e.g. WaveNet: autoregressive model with dilated causal convolutions
  WaveRNN: autoregressive model with an RNN
- Vocoder: flow based
  e.g. AF (autoregressive flow) and IAF (inverse autoregressive flow)
  - Parallel inference with an IAF student
  - Parallel training of an AF teacher
- Vocoder: GAN based
  e.g. MelGAN: a generative adversarial network (generator + discriminator)
- Vocoder: diffusion based
  Diffusion probabilistic model
  - Forward process: diffusion (gradually adds noise to the waveform)
  - Reverse process: denoising (generates the waveform from noise)
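The dilated causal convolution that WaveNet builds on can be sketched in NumPy. This is a minimal single-filter version; the real model stacks gated, multi-channel layers with residual connections:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    (never on future samples), which is what allows sample-by-sample
    autoregressive generation."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2))
# [1. 2. 4. 6.]  -- each output is x[t] + x[t-2], with zeros before t=0

# Stacking layers with dilations 1, 2, 4, 8 (kernel size 2) gives an
# exponentially growing receptive field: 1 + sum of dilations = 16 samples.
receptive_field = 1 + sum((2 - 1) * d for d in [1, 2, 4, 8])
print(receptive_field)  # 16
```

The doubling dilation pattern is why a few layers cover a long audio context cheaply, compared with the very deep stack an undilated causal convolution would need.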
III. Applications of TTS
1. Voice assistants
2. Voice navigation
3. Speech aids for people with disabilities