I. Related knowledge
1. Text normalization
- Normalizing text into standard format
- Every NLP task requires text normalization
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
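A minimal regex-based sketch of two of these steps, sentence segmentation and word tokenization with case folding; real normalizers handle abbreviations, numbers, and punctuation far more carefully:

```python
import re

def normalize(text):
    """Minimal text normalization sketch: sentence segmentation,
    word tokenization, and case folding."""
    # naive sentence segmentation: split after ., !, ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # tokenize each sentence into lowercase word tokens
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

# Note the limitation: the naive splitter wrongly breaks after "Dr."
print(normalize("Dr. Smith arrived. He said hello!"))
# [['dr'], ['smith', 'arrived'], ['he', 'said', 'hello']]
```

The wrong split after "Dr." is exactly why production front ends need abbreviation lists or learned sentence segmenters rather than pure punctuation rules.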
2. Grapheme-to-phoneme (G2P)
- Grapheme: a letter or group of letters that represents a single phoneme
- Phoneme: the smallest unit of sound that distinguishes one word from another in a particular language
- When a child says the sound /t/, that is a phoneme; when they write the letter 't', that is a grapheme.
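A toy dictionary-lookup G2P with an intentionally naive per-letter fallback. The lexicon entries are illustrative and loosely ARPAbet-style; real systems use trained G2P models for out-of-vocabulary words rather than letter-by-letter mapping:

```python
# Toy pronunciation lexicon (illustrative, loosely ARPAbet symbols)
LEXICON = {
    "cat": ["K", "AE", "T"],
    "the": ["DH", "AH"],
}

# one-letter-per-phoneme fallback table, covering only the demo letters
LETTER_TO_PHONE = {"t": "T", "o": "OW", "p": "P"}

def g2p(word):
    word = word.lower()
    if word in LEXICON:          # dictionary lookup first
        return LEXICON[word]
    # naive fallback: map each letter independently
    # (a real system would run a trained G2P model here)
    return [LETTER_TO_PHONE.get(ch, ch.upper()) for ch in word]

print(g2p("cat"))  # ['K', 'AE', 'T']  (lexicon hit)
print(g2p("top"))  # ['T', 'OW', 'P']  (fallback)
```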
3. Part-of-speech tagging is a disambiguation process
Why? In English, the same spelling can have different meanings and pronunciations in different sentences (e.g. "record" as a noun vs. as a verb), so the front end must decide which one is intended.
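A toy rule illustrating the point: the tag chosen for "record" changes which stress pattern a TTS front end would use. The phoneme strings below are illustrative, not from any real lexicon:

```python
def tag_record(prev_word):
    """Toy POS disambiguation for the word "record" based on the
    preceding word; the pronunciation strings are illustrative."""
    if prev_word in ("a", "the"):
        return ("NOUN", "REH1-KERD")   # stress on the first syllable
    if prev_word == "to":
        return ("VERB", "RIH-KAO1RD")  # stress on the second syllable
    return ("UNKNOWN", None)

print(tag_record("the"))  # ('NOUN', 'REH1-KERD')
print(tag_record("to"))   # ('VERB', 'RIH-KAO1RD')
```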
4. Embedding representations
(see the previous blog post)
II. TTS model
1. Basic introduction
- The end-to-end problem to solve: text -> Text-to-Speech -> waveform
- The two-stage pipeline:
  text -> Front end -> Waveform generator -> waveform
- The three-stage pipeline:
  text -> Front end -> linguistic specification -> Acoustic model -> acoustic features -> Waveform generator -> waveform
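The three-stage pipeline can be sketched as function composition, with each stage as a hypothetical placeholder (the data shapes and numbers below are illustrative only):

```python
def front_end(text):
    # placeholder linguistic specification: here just a symbol list
    return {"phonemes": list(text), "pos": None}

def acoustic_model(linguistic_spec):
    # placeholder: one 80-dim acoustic frame (e.g. mel bins) per symbol
    return [[0.0] * 80 for _ in linguistic_spec["phonemes"]]

def waveform_generator(acoustic_features):
    # placeholder: e.g. 256 waveform samples per acoustic frame
    return [0.0] * (len(acoustic_features) * 256)

def tts(text):
    # the three stages composed end to end
    return waveform_generator(acoustic_model(front_end(text)))

print(len(tts("hi")))  # 512 samples for 2 symbols
```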
2. Front end
- Definition
  text -> Front end [sentence structure, tokenization/text normalization, part-of-speech tagging, linguistic analysis] -> linguistic specification
- Features and processing
  1) Language dependent: each language has its own characteristics
  2) Handles text normalization
     e.g. $123 -> one hundred and twenty-three dollars
  3) Handles words whose pronunciation depends on context
     e.g. English "read" (present /riːd/ vs. past /rɛd/) and "record" (noun vs. verb stress);
     Chinese 奇偶 (jī'ǒu, "parity") vs. 奇怪 (qíguài, "strange"), where the character 奇 takes different pronunciations
- example
Classic front end:
‣ A chain of processes
‣ Each process is performed by a separate model
‣ These models are independently trained in a supervised fashion on annotated data
Neural front end: learned end-to-end by a neural network
text -> Neural net -> linguistic specification
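The $123 expansion mentioned above can be sketched as a small non-standard-word rewriter. This handles only "$<integer>" amounts up to 999; real front ends also cover dates, times, abbreviations, ordinals, and much more:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer 0..999 in (British-style) English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    rest = number_to_words(n % 100) if n % 100 else ""
    return ONES[n // 100] + " hundred" + (" and " + rest if rest else "")

def expand_currency(token):
    """Rewrite a "$<integer>" token into words; pass others through."""
    if token.startswith("$") and token[1:].isdigit():
        return number_to_words(int(token[1:])) + " dollars"
    return token

print(expand_currency("$123"))  # one hundred and twenty-three dollars
```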
3. Acoustic model
1) Input and output
   input sequence: linguistic features
   output sequence: acoustic features
2) Acoustic features with ML algorithms (you can search for more material for further learning)
- Acoustic model: decision tree
  A decision tree clusters HMM states, each of which models a distribution over acoustic features.
- Acoustic model: DNN
  Feedforward neural network
- Acoustic model: RNN based
  Tacotron 2: a sequence-to-sequence model built on recurrent neural networks
- Acoustic model: Transformer based
  FastSpeech 2: generates all frames in parallel and does not rely on attention-based alignment (it uses an explicit duration predictor instead)
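A shape-only sketch of the feedforward (DNN) variant: each frame of linguistic features maps independently to one frame of acoustic features, e.g. an 80-bin mel-spectrogram frame. All dimensions and the untrained random weights are illustrative, not from any specific system:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_hidden, d_out = 100, 300, 256, 80  # frames and feature dims

# untrained random weights, for shape illustration only
W1 = rng.normal(0, 0.01, (d_in, d_hidden))
W2 = rng.normal(0, 0.01, (d_hidden, d_out))

def acoustic_model(linguistic_features):
    h = np.tanh(linguistic_features @ W1)  # hidden layer
    return h @ W2                          # predicted acoustic features

x = rng.normal(size=(T, d_in))   # linguistic features, one row per frame
y = acoustic_model(x)
print(y.shape)  # (100, 80): 100 frames of 80-bin mel features
```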
4. Waveform generator: vocoder
1) Introduction
   acoustic features -> Waveform generator -> waveform
2) Solutions (you can search for more material for further learning)
- Vocoder: signal-processing based
- Vocoder: autoregressive
  e.g. WaveNet: autoregressive model with dilated causal convolutions
  WaveRNN: autoregressive model with an RNN
- Vocoder: flow based
  e.g. AF (autoregressive flow) and IAF (inverse autoregressive flow)
  - Parallel inference with an IAF student
  - Parallel training of an AF teacher
- Vocoder: GAN based
  e.g. MelGAN: a generative adversarial network (generator + discriminator)
- Vocoder: diffusion based
  Diffusion probabilistic model
  - Forward process: diffusion (gradually adds noise to the waveform)
  - Reverse process: denoising (generates the waveform from noise)
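The dilated causal convolution that WaveNet builds on can be sketched in NumPy. This is a minimal single-filter version; the real model stacks gated, multi-channel layers with residual connections:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    (never on future samples), which is what allows sample-by-sample
    autoregressive generation."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2))
# [1. 2. 4. 6.]  -- each output is x[t] + x[t-2], with zeros before t=0

# Stacking layers with dilations 1, 2, 4, 8 (kernel size 2) gives an
# exponentially growing receptive field: 1 + sum of dilations = 16 samples.
receptive_field = 1 + sum((2 - 1) * d for d in [1, 2, 4, 8])
print(receptive_field)  # 16
```

The doubling dilation pattern is why a few layers cover a long audio context cheaply, compared with the very deep stack an undilated causal convolution would need.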
III. Applications of TTS
1. Voice assistants
2. Voice navigation
3. Speech aids for people with disabilities