A Collection of TTS Models in TensorFlow Lite (TFLite)

GitHub link:

https://github.com/tulasiram58827/TTS_TFLite

This repository provides a collection of widely popular text-to-speech (TTS) models in TensorFlow Lite (TFLite). These models primarily come from two repositories - TTS and TensorFlowTTS. We provide end-to-end Colab Notebooks that show the model conversion and inference process using TFLite. This includes converting PyTorch models to TFLite as well.
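As a rough illustration of what such a conversion step looks like, here is a minimal sketch using the standard TFLite converter; the SavedModel path, output file name, and quantization settings are illustrative, not the notebooks' exact configuration:

```python
import tensorflow as tf

# Illustrative path; the actual notebooks export each TTS/vocoder model first.
converter = tf.lite.TFLiteConverter.from_saved_model("fastspeech2_saved_model")

# Dynamic-range quantization (weights stored as int8, activations kept in float).
# For a float16 variant, one would instead set:
#   converter.target_spec.supported_types = [tf.float16]
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# TTS models often need TF ops as a fallback beyond the builtin TFLite op set.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

tflite_model = converter.convert()
with open("fastspeech2_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```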

 

TTS is a two-step process: first, a TTS model generates a MEL spectrogram from the input text, and then a VOCODER converts the spectrogram into the audio waveform. We include both kinds of models in this repository.
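A minimal sketch of that two-step pipeline with the TFLite Python interpreter is shown below. The file names, token IDs, and single-input assumption are placeholders; the exact inputs each model expects are shown in the end-to-end notebook described later.

```python
import numpy as np
import tensorflow as tf

def run_tflite(model_path, inputs):
    """Run a TFLite model on a list of numpy inputs and return all outputs."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    # Match each input tensor to the provided array (resize for dynamic shapes).
    for detail, array in zip(interpreter.get_input_details(), inputs):
        interpreter.resize_tensor_input(detail["index"], array.shape)
    interpreter.allocate_tensors()
    for detail, array in zip(interpreter.get_input_details(), inputs):
        interpreter.set_tensor(detail["index"], array)
    interpreter.invoke()
    return [interpreter.get_tensor(d["index"]) for d in interpreter.get_output_details()]

# Placeholder: token IDs produced by the model's text front end for the input sentence.
input_ids = np.array([[23, 45, 12, 67, 8]], dtype=np.int32)

# Step 1: text-to-MEL model (e.g. FastSpeech2) -> MEL spectrogram.
mel = run_tflite("fastspeech2_quant.tflite", [input_ids])[0]

# Step 2: vocoder (e.g. MelGAN) -> audio waveform.
audio = run_tflite("melgan_quant.tflite", [mel])[0]
```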

Note that these models are trained on the LJSpeech dataset.

Here's a sample result (with Fastspeech2 and MelGAN) for the text "Bill got in the habit of asking himself".

 

Models Included

  • TTS: Tacotron2, Fastspeech2, Glow TTS*

  • VOCODER: MelGAN, Multi-Band MelGAN (MB MelGAN), Parallel WaveGAN, HiFi-GAN

In the future, we may add more models.

*Currently, conversion of the Glow TTS model is unavailable.

Notes:

  • The training data used for HiFi-GAN (its MEL spectrogram generation) differs from that of the other models such as Tacotron2 and FastSpeech2, so it is not compatible with the other architectures available in this repo.

  • If you want to use HiFi-GAN in an end-to-end scenario, you can refer to this notebook. We plan to make it compatible with the other architectures and add it to our end-to-end notebook in the future. Stay tuned!

About the Notebooks

  • End_to_End_TTS.ipynb: Lets you load the different TTS and VOCODER models listed above and run inference.

  • MelGAN_TFLite.ipynb: Shows the model conversion process of MelGAN.

  • Parallel_WaveGAN_TFLite.ipynb: Shows the model conversion process of Parallel WaveGAN.

  • HiFi-GAN.ipynb: Shows the model conversion process of HiFi-GAN.

 

Model conversion processes for Tacotron2, Fastspeech2, and Multi-Band MelGAN are available via the following notebooks:

 

  • Tacotron2 & Multi-Band MelGAN

  • Fastspeech2

 

Model Benchmarks

After converting to TFLite, we used the Benchmark tool to report performance metrics for the various models, such as inference latency and peak memory usage, using a Redmi K20. For all experiments we set the number of threads to one and ran on the Redmi K20's CPU, with no other hardware accelerator.
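The Benchmark tool reports these metrics directly on-device. Purely as a rough, desktop-side approximation of the single-threaded latency measurement (not the tool used for the table below), one could time a converted model from Python as sketched here; the model file name is a placeholder:

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder file name; works for any converted model with a fixed input shape.
interpreter = tf.lite.Interpreter(model_path="model_dynamic_range.tflite", num_threads=1)
interpreter.allocate_tensors()

detail = interpreter.get_input_details()[0]
dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
interpreter.set_tensor(detail["index"], dummy)

interpreter.invoke()  # warm-up run
runs = 20
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
print(f"average inference latency: {(time.perf_counter() - start) / runs:.4f} s")
```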

 

| Model            | Quantization  | Model Size (MB) | Average Inference Latency (sec) | Memory Footprint (MB) |
|------------------|---------------|-----------------|---------------------------------|-----------------------|
| Parallel WaveGAN | Dynamic-range | 5.7             | 0.04                            | 31.5                  |
| Parallel WaveGAN | Float16       | 3.2             | 0.05                            | 34                    |
| MelGAN           | Dynamic-range | 17              | 0.51                            | 81                    |
| MelGAN           | Float16       | 8.3             | 0.52                            | 89                    |
| MB MelGAN        | Dynamic-range | 17              | 0.02                            | 17                    |
| HiFi-GAN         | Dynamic-range | 3.5             | 0.0015                          | 9.88                  |
| HiFi-GAN         | Float16       | 2.9             | 0.0036                          | 20.3                  |
| Tacotron2        | Dynamic-range | 30.1            | 1.66                            | 75                    |
| Fastspeech2      | Dynamic-range | 30              | 0.11                            | 55                    |

 

Notes:

  • All the models above support dynamically shaped inputs. However, benchmarking the dynamic-input-size MelGAN models is not currently supported, so to benchmark those models we used inputs of shape (100, 80).

  • Similarly, for Fastspeech2, benchmarking the dynamic-input-size model errors out, so we used inputs of shape (1, 50), where 50 is the number of tokens. This issue thread provides more details: https://github.com/tensorflow/tensorflow/issues/45986. (A Python-side sketch of pinning a dynamic input to a fixed shape follows these notes.)
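When running these models from the Python interpreter rather than the benchmark binary, a dynamic input can be pinned to a concrete shape in the same spirit. A minimal sketch, assuming a MelGAN-style model whose single input is a MEL tensor (the file name is illustrative):

```python
import numpy as np
import tensorflow as tf

# Illustrative file name; any of the dynamically shaped vocoder models applies.
interpreter = tf.lite.Interpreter(model_path="melgan_dynamic_range.tflite")
mel_index = interpreter.get_input_details()[0]["index"]

# Pin the dynamic MEL input to the concrete (100, 80) shape mentioned above.
interpreter.resize_tensor_input(mel_index, [100, 80])
interpreter.allocate_tensors()

interpreter.set_tensor(mel_index, np.random.randn(100, 80).astype(np.float32))
interpreter.invoke()
audio = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
```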

 

Audio Samples

All combinations of samples are available in the audio_samples folder. To listen directly without downloading, refer to this SoundCloud folder: https://soundcloud.com/tulasi-ram-887761209

 

References

  • Dynamic-range quantization in TensorFlow Lite

  • Float16 quantization in TensorFlow Lite

from: https://mp.weixin.qq.com/s/0p9yofI4g8pOo2qJoQFIFw
