Tsinghua University's LightGrad-TTS, with a streaming implementation


Paper:

https://arxiv.org/abs/2308.16569

Code:

https://github.com/thuhcsi/LightGrad

Dataset support:

Training scripts are provided for BZNSYP and LJSpeech.


The paper identifies two problems with Grad-TTS:

  1. DPMs are not lightweight enough for resource-constrained devices.

  2. DPMs require many denoising steps in inference, which increases latency.

Proposed solutions:

  1. To reduce model parameters, the regular convolutions in the diffusion decoder are replaced with depthwise separable convolutions.

  2. To accelerate inference, a training-free fast sampling technique for DPMs (DPM-Solver) is adopted.

  3. Streaming inference is also implemented in LightGrad to further reduce latency.
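The parameter savings from solution 1 can be sketched with a quick count. The channel and kernel sizes below are illustrative, not LightGrad's actual configuration:

```python
def conv1d_params(c_in, c_out, k, bias=True):
    """Parameter count of a regular 1-D convolution: every output
    channel has one k-tap filter per input channel."""
    return c_in * c_out * k + (c_out if bias else 0)

def depthwise_separable_params(c_in, c_out, k, bias=True):
    """Depthwise separable variant: a depthwise conv (one k-tap filter
    per input channel) followed by a 1x1 pointwise conv that mixes
    channels. Parameters grow as c_in*k + c_in*c_out instead of
    c_in*c_out*k."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out * 1 + (c_out if bias else 0)
    return depthwise + pointwise

if __name__ == "__main__":
    # Hypothetical decoder-like layer: 256 channels in/out, kernel size 3.
    regular = conv1d_params(256, 256, 3)
    separable = depthwise_separable_params(256, 256, 3)
    print(regular, separable, f"{1 - separable / regular:.1%} fewer params")
```

Even for this toy layer the separable form cuts roughly two thirds of the parameters, which is consistent in spirit with the paper's overall reduction figure.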


Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency while preserving comparable speech quality on both Mandarin Chinese and English with 4 denoising steps.

LightGrad's streaming scheme (based on a Samsung paper):

Paper:

https://arxiv.org/abs/2111.09052

Implementation details:

  1. The decoder input is chopped into chunks at phoneme boundaries, so that each chunk covers several consecutive phonemes and chunk lengths stay within a predefined range.

  2. To give the decoder context information, the last phoneme of the previous chunk and the first phoneme of the following chunk are padded to the head and tail of the current chunk.

  3. The decoder then generates a mel-spectrogram for each padded chunk.

  4. Finally, the mel-spectrogram frames corresponding to the padded phonemes are removed, reversing the changes to each chunk.
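The four steps above can be sketched as follows. This is a simplified illustration: the chunking policy and the toy decoder (one mel frame per phoneme) are assumptions, since the real model tracks per-phoneme durations and emits many frames per phoneme:

```python
def chunk_phonemes(phonemes, max_len=4):
    """Step 1: chop the phoneme sequence into chunks at phoneme
    boundaries, each at most max_len phonemes long. (The paper also
    enforces a minimum chunk length; omitted here for brevity.)"""
    chunks, i = [], 0
    while i < len(phonemes):
        chunks.append(phonemes[i:i + max_len])
        i += max_len
    return chunks

def pad_with_context(chunks):
    """Step 2: prepend the last phoneme of the previous chunk and
    append the first phoneme of the next chunk, so the decoder sees
    cross-chunk context. Returns (n_head, padded_chunk, n_tail)."""
    padded = []
    for j, chunk in enumerate(chunks):
        head = [chunks[j - 1][-1]] if j > 0 else []
        tail = [chunks[j + 1][0]] if j + 1 < len(chunks) else []
        padded.append((len(head), head + chunk + tail, len(tail)))
    return padded

def decode_streaming(chunks, decoder):
    """Steps 3-4: decode each padded chunk, then drop the frames that
    belong to the padded context phonemes and concatenate the rest."""
    out = []
    for n_head, padded_chunk, n_tail in pad_with_context(chunks):
        frames = decoder(padded_chunk)  # toy: one frame per phoneme
        out.extend(frames[n_head:len(frames) - n_tail])
    return out

if __name__ == "__main__":
    toy_decoder = lambda ph: [f"mel({p})" for p in ph]
    phonemes = list("abcdefgh")
    print(decode_streaming(chunk_phonemes(phonemes), toy_decoder))
```

Because each padded chunk overlaps its neighbors by one phoneme on each side, the trimmed outputs concatenate back into exactly one frame sequence for the original phoneme string, with no duplicated or missing frames.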

