A Collection of TTS Models in TensorFlow Lite (TFLite)

GitHub link:

https://github.com/tulasiram58827/TTS_TFLite

This repository provides a collection of widely popular text-to-speech (TTS) models in TensorFlow Lite (TFLite). These models primarily come from two repositories - TTS and TensorFlowTTS. We provide end-to-end Colab Notebooks that show the model conversion and inference process using TFLite. This includes converting PyTorch models to TFLite as well.
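As a rough illustration of what such a conversion step looks like, here is a minimal sketch using the standard TFLite converter; the SavedModel path, output file name, and quantization settings are illustrative, not the notebooks' exact configuration:

```python
import tensorflow as tf

# Illustrative path; the actual notebooks export each TTS/vocoder model first.
converter = tf.lite.TFLiteConverter.from_saved_model("fastspeech2_saved_model")

# Dynamic-range quantization (weights stored as int8, activations kept in float).
# For a float16 variant, one would instead set:
#   converter.target_spec.supported_types = [tf.float16]
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# TTS models often need TF ops as a fallback beyond the builtin TFLite op set.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

tflite_model = converter.convert()
with open("fastspeech2_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```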

 

TTS is a two-step process: first, a TTS model generates a MEL spectrogram from the input text, and then a VOCODER converts the spectrogram into the audio waveform. We include both kinds of models in this repository.
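A minimal sketch of that two-step pipeline with the TFLite Python interpreter is shown below. The file names, token IDs, and single-input assumption are placeholders; the exact inputs each model expects are shown in the end-to-end notebook described later.

```python
import numpy as np
import tensorflow as tf

def run_tflite(model_path, inputs):
    """Run a TFLite model on a list of numpy inputs and return all outputs."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    # Match each input tensor to the provided array (resize for dynamic shapes).
    for detail, array in zip(interpreter.get_input_details(), inputs):
        interpreter.resize_tensor_input(detail["index"], array.shape)
    interpreter.allocate_tensors()
    for detail, array in zip(interpreter.get_input_details(), inputs):
        interpreter.set_tensor(detail["index"], array)
    interpreter.invoke()
    return [interpreter.get_tensor(d["index"]) for d in interpreter.get_output_details()]

# Placeholder: token IDs produced by the model's text front end for the input sentence.
input_ids = np.array([[23, 45, 12, 67, 8]], dtype=np.int32)

# Step 1: text-to-MEL model (e.g. FastSpeech2) -> MEL spectrogram.
mel = run_tflite("fastspeech2_quant.tflite", [input_ids])[0]

# Step 2: vocoder (e.g. MelGAN) -> audio waveform.
audio = run_tflite("melgan_quant.tflite", [mel])[0]
```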

Note that these models are trained on the LJSpeech dataset.

Here's a sample result (with Fastspeech2 and MelGAN) for the text "Bill got in the habit of asking himself".

 

Models Included

  • TTS: Tacotron2, Fastspeech2, Glow TTS*

  • VOCODER: MelGAN, Multi-Band MelGAN (MB MelGAN), Parallel WaveGAN, HiFi-GAN

In the future, we may add more models.

*Currently, conversion of the Glow TTS model is unavailable.

Notes:

  • The training data used for HiFi-GAN (its MEL spectrogram generation) differs from that of the other models such as Tacotron2 and FastSpeech2, so it is not compatible with the other architectures available in this repo.

  • If you want to use HiFi-GAN in an end-to-end scenario, you can refer to this notebook. We plan to make it compatible with the other architectures and add it to our end-to-end notebook in the future. Stay tuned!

About the Notebooks

  • End_to_End_TTS.ipynb: Lets you load the different TTS and VOCODER models listed above and run inference.

  • MelGAN_TFLite.ipynb: Shows the model conversion process of MelGAN.

  • Parallel_WaveGAN_TFLite.ipynb: Shows the model conversion process of Parallel WaveGAN.

  • HiFi-GAN.ipynb: Shows the model conversion process of HiFi-GAN.

 

Model conversion processes for Tacotron2, Fastspeech2, and Multi-Band MelGAN are available via the following notebooks:

 

  • Tacotron2 & Multi-Band MelGAN

  • Fastspeech2

 

Model Benchmarks

After converting to TFLite, we used the Benchmark tool to report performance metrics for the various models, such as inference latency and peak memory usage, using a Redmi K20. For all experiments we set the number of threads to one and ran on the Redmi K20's CPU, with no other hardware accelerator.
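The Benchmark tool reports these metrics directly on-device. Purely as a rough, desktop-side approximation of the single-threaded latency measurement (not the tool used for the table below), one could time a converted model from Python as sketched here; the model file name is a placeholder:

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder file name; works for any converted model with a fixed input shape.
interpreter = tf.lite.Interpreter(model_path="model_dynamic_range.tflite", num_threads=1)
interpreter.allocate_tensors()

detail = interpreter.get_input_details()[0]
dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
interpreter.set_tensor(detail["index"], dummy)

interpreter.invoke()  # warm-up run
runs = 20
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
print(f"average inference latency: {(time.perf_counter() - start) / runs:.4f} s")
```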

 

| Model            | Quantization  | Model Size (MB) | Average Inference Latency (sec) | Memory Footprint (MB) |
|------------------|---------------|-----------------|---------------------------------|-----------------------|
| Parallel WaveGAN | Dynamic-range | 5.7             | 0.04                            | 31.5                  |
| Parallel WaveGAN | Float16       | 3.2             | 0.05                            | 34                    |
| MelGAN           | Dynamic-range | 17              | 0.51                            | 81                    |
| MelGAN           | Float16       | 8.3             | 0.52                            | 89                    |
| MB MelGAN        | Dynamic-range | 17              | 0.02                            | 17                    |
| HiFi-GAN         | Dynamic-range | 3.5             | 0.0015                          | 9.88                  |
| HiFi-GAN         | Float16       | 2.9             | 0.0036                          | 20.3                  |
| Tacotron2        | Dynamic-range | 30.1            | 1.66                            | 75                    |
| Fastspeech2      | Dynamic-range | 30              | 0.11                            | 55                    |

 

Notes:

  • All the models above support dynamically shaped inputs. However, benchmarking the dynamic-input-size MelGAN models is not currently supported, so to benchmark those models we used inputs of shape (100, 80).

  • Similarly, for Fastspeech2, benchmarking the dynamic-input-size model errors out, so we used inputs of shape (1, 50), where 50 is the number of tokens. This issue thread provides more details: https://github.com/tensorflow/tensorflow/issues/45986. (A Python-side sketch of pinning a dynamic input to a fixed shape follows these notes.)
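When running these models from the Python interpreter rather than the benchmark binary, a dynamic input can be pinned to a concrete shape in the same spirit. A minimal sketch, assuming a MelGAN-style model whose single input is a MEL tensor (the file name is illustrative):

```python
import numpy as np
import tensorflow as tf

# Illustrative file name; any of the dynamically shaped vocoder models applies.
interpreter = tf.lite.Interpreter(model_path="melgan_dynamic_range.tflite")
mel_index = interpreter.get_input_details()[0]["index"]

# Pin the dynamic MEL input to the concrete (100, 80) shape mentioned above.
interpreter.resize_tensor_input(mel_index, [100, 80])
interpreter.allocate_tensors()

interpreter.set_tensor(mel_index, np.random.randn(100, 80).astype(np.float32))
interpreter.invoke()
audio = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
```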

 

Audio Samples

All combinations of samples are available in the audio_samples folder. To listen directly without downloading, refer to this SoundCloud folder: https://soundcloud.com/tulasi-ram-887761209

 

References

  • Dynamic-range quantization in TensorFlow Lite

  • Float16 quantization in TensorFlow Lite

from: https://mp.weixin.qq.com/s/0p9yofI4g8pOo2qJoQFIFw
