Text-to-Speech Synthesis (TTS)

I. Related knowledge

1,Text normalization

  • Normalizing text into a standard format
  • Nearly every NLP task requires some text normalization, typically including:

         - Tokenizing (segmenting) words

         - Normalizing word formats

         - Segmenting sentences
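A minimal sketch of these three steps using plain regular expressions (a real front end would use a proper NLP toolkit; the example text is made up):

```python
import re

def segment_sentences(text):
    # Naive sentence segmentation: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Naive word tokenization: keep word characters together, split off punctuation.
    return re.findall(r"\w+|[^\w\s]", sentence)

def normalize(token):
    # Naive word-format normalization: just lowercase.
    return token.lower()

text = "TTS systems read text aloud. They need normalized input!"
for sent in segment_sentences(text):
    print([normalize(tok) for tok in tokenize(sent)])
```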

2,Grapheme to phoneme

  • Grapheme: a letter or group of letters that represents a single phoneme
  • Phoneme: the smallest unit of sound that can distinguish one word from another in a particular language
  • When a child says the sound /t/, that is a phoneme; when they write the letter 't', that is a grapheme.
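A toy grapheme-to-phoneme conversion can be sketched as a lexicon lookup with ARPAbet-style symbols. The tiny lexicon and fallback below are only illustrative; a real system combines a large pronunciation dictionary (such as CMUdict) with learned letter-to-sound rules:

```python
# Toy lexicon: orthographic word -> ARPAbet-style phoneme sequence.
LEXICON = {
    "text":   ["T", "EH", "K", "S", "T"],
    "to":     ["T", "UW"],
    "speech": ["S", "P", "IY", "CH"],
}

def g2p(word):
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Fallback for out-of-vocabulary words: return the graphemes themselves,
    # where a real front end would apply letter-to-sound rules instead.
    return list(word.upper())

print([g2p(w) for w in "text to speech".split()])
# [['T', 'EH', 'K', 'S', 'T'], ['T', 'UW'], ['S', 'P', 'IY', 'CH']]
```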

3,Part-of-speech tagging is a disambiguation process

Why? In English, for example, the same word can have different meanings (and pronunciations) in different sentences, so its role must be disambiguated from context.
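For instance, 'record' is pronounced differently as a verb and as a noun, and a part-of-speech tagger decides which one a sentence contains. A small sketch, assuming NLTK and its tokenizer/tagger data have already been downloaded:

```python
import nltk
# Assumes the 'punkt' tokenizer and the averaged perceptron tagger data are
# already available, e.g. via nltk.download(...).

for sentence in ["Please record the meeting.", "She broke the world record."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# "record" should come out with a verb tag (VB) in the first sentence and a noun
# tag (NN) in the second, telling the TTS front end which pronunciation to use.
```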

4,Embedding representations

(See the previous blog post for details.)

II. TTS model

1,Basic introduction

  • At its core, TTS is an end-to-end problem: text -> Text-to-Speech -> waveform
  • The two-stage pipeline:

text -> Front end -> waveform generator -> waveform

  • The three-stage pipeline

text -> Front end -> linguistic specification -> Acoustic model -> acoustic features -> waveform generator -> waveform
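The three-stage pipeline can be pictured as three composed functions. The sketch below only stubs out each stage with placeholder data; the function names and feature shapes are illustrative, not any particular library's API:

```python
def front_end(text):
    # Text normalization, G2P, POS tagging, ... -> linguistic specification.
    # Represented here as a toy dict; real systems use rich structured features.
    return {"phonemes": ["T", "EH", "K", "S", "T"], "prosody": "neutral"}

def acoustic_model(linguistic_spec):
    # Predict acoustic features (e.g. mel-spectrogram frames) from the
    # linguistic specification. Stubbed with a fixed-size placeholder.
    num_frames, num_mels = 100, 80
    return [[0.0] * num_mels for _ in range(num_frames)]

def waveform_generator(acoustic_features):
    # Vocoder: turn acoustic features into audio samples (stub, 256 per frame).
    return [0.0] * (len(acoustic_features) * 256)

def tts(text):
    return waveform_generator(acoustic_model(front_end(text)))

print(len(tts("text")), "samples")
```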

2,Front end

  • Definition

text ---> Front end [sentence structure, tokenization / text normalization, part-of-speech tagging, linguistic analysis] ---> linguistic specification

  • features and processing

1) Language-dependent: each language has its own characteristics

2) Handle text normalization (a small sketch follows this list)

       e.g. $123 -> one hundred and twenty-three dollars

3) Handle the pronunciation of words in different contexts

       e.g. 'read' (present vs. past tense)

             'record' (noun vs. verb)

             奇偶 (jī'ǒu, "parity") vs. 奇怪 (qíguài, "strange"): the character 奇 is read differently in each
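As promised above, a small sketch of the "$123" expansion. The verbalizer below only covers 0-999; a production front end would use a full number grammar (or a library such as num2words) plus rules for dates, years, phone numbers and so on:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell_number(n):
    # Verbalize an integer in the range 0-999.
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    rest = n % 100
    return ONES[n // 100] + " hundred" + (" and " + spell_number(rest) if rest else "")

def normalize_currency(text):
    # Replace "$<number>" with its spoken form.
    return re.sub(r"\$(\d+)",
                  lambda m: spell_number(int(m.group(1))) + " dollars",
                  text)

print(normalize_currency("It costs $123."))
# It costs one hundred and twenty-three dollars.
```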

  • Examples

Classic front end:

   ‣ A chain of processes

   ‣ Each process is performed by a model

   ‣ These models are independently trained in a supervised fashion on annotated data

Neural front end: learned end to end by a neural network

text --> Neural net --> linguistic specification

3,Acoustic model

1) input and output

input sequence: linguistic features

output sequence: acoustic features

2) Acoustic models built with different ML algorithms (you can look up more material for further learning)

  • Acoustic model - Decision tree

A decision tree is used to cluster HMM states, which model the distribution of acoustic features.

  • Acoustic model: DNN

Feedforward neural network

  • Acoustic model - RNN based

Tacotron 2: a sequence-to-sequence model with attention, built on recurrent neural networks

  • Acoustic model - Transformer based

FastSpeech 2: generates all frames in parallel and does not rely on an attention alignment (it uses explicit duration prediction instead)
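As a toy illustration of the feedforward DNN approach above, the sketch below maps per-frame linguistic feature vectors to mel-spectrogram frames. It assumes PyTorch is installed, and the feature dimensions and training data are random stand-ins, not a real recipe:

```python
import torch
import torch.nn as nn

LINGUISTIC_DIM, MEL_DIM = 300, 80   # assumed, illustrative dimensions

# Per-frame regression: linguistic features -> mel-spectrogram bins.
model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, MEL_DIM),
)

# Random stand-in data: 1000 frames of (linguistic features, target mels).
features = torch.randn(1000, LINGUISTIC_DIM)
targets = torch.randn(1000, MEL_DIM)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(5):               # a few steps just to show the training loop
    pred = model(features)
    loss = loss_fn(pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, loss.item())
```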

4,Waveform generator:Vocoder

1)Introduction

acoustic features -> Waveform generator -> waveform

2) Solutions (you can look up more material for further learning)

  • Vocoder: signal-processing based (e.g. the Griffin-Lim algorithm; see the sketch after this list)
  • Vocoder: Autoregressive

      e.g. WaveNet: an autoregressive model built from dilated causal convolutions

           WaveRNN: an autoregressive model built around an RNN

  • Vocoder: Flow based

      e.g. AF (autoregressive flow) and IAF (inverse autoregressive flow), combined by teacher-student distillation

            - Parallel inference with the IAF student

            - Parallel training with the AF teacher

  • Vocoder: GAN based

        e.g. MelGAN: a generative model trained adversarially with a generator and a discriminator

  • Vocoder: Diffusion based

        Diffusion probabilistic models

             - Forward process: diffusion (gradually adding noise)

             - Reverse process: denoising (generating the waveform)
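The signal-processing route can be tried in a few lines: reconstruct a waveform from a mel-spectrogram with the Griffin-Lim algorithm. A small sketch, assuming librosa and numpy are installed; the input is a synthetic sine tone standing in for acoustic-model output, not real speech:

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)     # 1 s of a 440 Hz tone as stand-in audio

# "Acoustic features": an 80-bin mel-spectrogram, like an acoustic model would predict.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Signal-processing vocoder: invert the mel-spectrogram back to a waveform.
# mel_to_audio uses Griffin-Lim internally to estimate the missing phase.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256)
print(y.shape, y_hat.shape)
```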

III. Applications of TTS

1,Voice assistants

2,Voice navigation

3,Speech aids for people with disabilities
