RapidTTS Project (Text-to-Speech): Chinese, Digits, and English Speech Synthesis

Liekkas Kono

Last modified: 2022-04-16 22:11:20

Category: Deep Learning. Tags: TTS, text-to-speech

First published: 2022-04-07 08:23:09

Article link: https://blog.csdn.net/shiwanghualuo/article/details/124004619

Introduction

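Before adopting a tool, we usually want to run it on our own data first, see how it performs, and only then decide whether to bring it in.
For most projects today, however, even a quick trial is hard. The documentation may be thorough, but getting a complicated runtime environment to work is often exhausting.
This post introduces RapidTTS, a project put together by our Team-RapidAI-NG.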

RapidTTS

Supported synthesis languages: Chinese, digits, and English, depending on the sub-project:

Directory	Inference engine	Supported languages
csmsc_tts2	Paddle+ONNXRuntime	Chinese and digits
csmsc_tts3	ONNXRuntime	Chinese and digits
ljspeech_tts3	ONNXRuntime	English

csmsc_tts2

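The pipeline has three stages: frontend, acoustic model, and vocoder.
- Acoustic model inference currently runs on PaddlePaddle.
- Vocoder inference runs on ONNXRuntime.
The pretrained models released by PaddleSpeech are listed at the link. csmsc_tts2 uses: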

Component	Model	Language
Acoustic model	speedyspeech_csmsc	zh
Vocoder	pwgan_csmsc	zh
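The three stages can be thought of as three functions chained together, which is why the acoustic model and the vocoder can run on different engines. The following is only a minimal sketch of that data flow: `build_tts` and the three `toy_*` stand-ins are hypothetical names, not part of RapidTTS, and the 80 mel bins and 300 samples per frame are assumptions borrowed from the csmsc_tts3 log later in this post.

```python
from typing import Callable, List

PhoneIds = List[int]
Mel = List[List[float]]   # frames x mel bins
Wave = List[float]        # audio samples


def build_tts(
    frontend: Callable[[str], PhoneIds],
    acoustic: Callable[[PhoneIds], Mel],
    vocoder: Callable[[Mel], Wave],
) -> Callable[[str], Wave]:
    """Chain the three stages; each stage may use a different backend."""
    def tts(text: str) -> Wave:
        return vocoder(acoustic(frontend(text)))
    return tts


# Toy stand-ins (assumptions for illustration only, no real models involved).
N_MELS = 80   # assumed mel bins per frame
HOP = 300     # assumed audio samples per mel frame


def toy_frontend(text: str) -> PhoneIds:
    # One fake "phone id" per character.
    return [ord(ch) % 100 for ch in text]


def toy_acoustic(ids: PhoneIds) -> Mel:
    # Pretend every phone lasts two frames.
    return [[0.0] * N_MELS for _ in ids for _ in range(2)]


def toy_vocoder(mel: Mel) -> Wave:
    # Each mel frame expands to HOP silent samples.
    return [0.0] * (len(mel) * HOP)


tts = build_tts(toy_frontend, toy_acoustic, toy_vocoder)
wave = tts("你好")
print(len(wave))  # 2 characters -> 4 frames -> 1200 samples
```

In a real setup the acoustic callable would wrap a PaddlePaddle predictor and the vocoder callable an ONNXRuntime session, while the calling code stays the same.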

Usage

Download the resources from Google Drive | Baidu Netdisk (extraction code: kmcf) and extract them into the RapidTTS/csmsc_tts2 directory.

Install the dependencies from requirements.txt:

pip install -r requirements.txt -i https://pypi.douban.com/simple/

Run tts2.py:
```
python tts2.py
```

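The run log looks like this (the Chinese lines are the script's own status messages, in order: initializing the frontend, initializing the acoustic feature model, initializing the wav synthesis model, synthesizing the given sentences):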

初始化前处理部分
 frontend done!
 初始化提取特征模型
 am_predictor done!
 初始化合成wav模型
 合成指定句子
 Building prefix dict from the default dictionary ...
 Loading model from cache /tmp/jieba.cache
 Loading model cost 1.431 seconds.
 Prefix dict has been built successfully.
 infer_result/001.wav done!      cost: 7.226019859313965s
 infer_result/002.wav done!      cost: 9.149477005004883s
 infer_result/003.wav done!      cost: 3.4020116329193115s
 infer_result/004.wav done!      cost: 14.5472412109375s
 infer_result/005.wav done!      cost: 14.142913818359375s
 infer_result/006.wav done!      cost: 10.191686630249023s
 infer_result/007.wav done!      cost: 15.726643800735474s
 infer_result/008.wav done!      cost: 15.421608209609985s
 infer_result/009.wav done!      cost: 8.083441972732544s
 infer_result/010.wav done!      cost: 10.538750886917114s
 infer_result/011.wav done!      cost: 7.974739074707031s
 infer_result/012.wav done!      cost: 7.274432897567749s
 infer_result/013.wav done!      cost: 8.204563856124878s
 infer_result/014.wav done!      cost: 8.994312286376953s
 infer_result/015.wav done!      cost: 5.084768056869507s
 infer_result/016.wav done!      cost: 5.3102569580078125s

csmsc_tts3

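Supported synthesis languages: Chinese and digits. English letters are not supported.
Adapted from TTS3 in PaddleSpeech.
The entire inference pipeline uses ONNXRuntime only.
The pretrained models released by PaddleSpeech are listed at the link. csmsc_tts3 uses: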

Component	Model	Language
Acoustic model	fastspeech2_csmsc	zh
Vocoder	hifigan_csmsc	zh
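Since both models in csmsc_tts3 run on ONNXRuntime, the two inference calls can be sketched roughly as follows. This is not the code in tts3.py: the helper names are hypothetical, and the sketch assumes each exported model takes a single input tensor (phone ids for the acoustic model, the mel spectrogram for the vocoder). The input name is read from the session rather than hard-coded.

```python
def load_session(model_path):
    """Create a CPU ONNXRuntime session for one .onnx file."""
    # Imported lazily so the helpers below can be used without onnxruntime.
    import onnxruntime as ort
    return ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])


def run_first_output(session, array):
    """Feed `array` to the model's first input and return its first output."""
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: array})[0]


def synthesize(am_session, voc_session, phone_ids):
    """Acoustic model: phone ids -> mel; vocoder: mel -> waveform."""
    mel = run_first_output(am_session, phone_ids)
    return run_first_output(voc_session, mel)


# Usage, assuming the resources are unpacked as in the directory tree under
# the usage steps, and `phone_ids` is an int64 NumPy array produced by the
# project's frontend:
#   am = load_session("resources/fastspeech2_csmsc_onnx_0.2.0/fastspeech2_csmsc.onnx")
#   voc = load_session("resources/hifigan_csmsc.onnx")
#   wav = synthesize(am, voc, phone_ids)
```

Reading the input name via `get_inputs()` keeps the sketch independent of how the tensors were named at export time.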

Usage

Download the resources from Google Drive | Baidu Netdisk (extraction code: a2nw) and extract them into the csmsc_tts3 directory. The final directory structure is:

 csmsc_tts3
 ├── csmsc_test.txt
 ├── requirements.txt
 ├── frontend
 ├── main.sh
 ├── tts3.py
 ├── infer_result
 ├── resources
 │   ├── fastspeech2_csmsc_onnx_0.2.0
 │   │   ├── fastspeech2_csmsc.onnx
 │   │   └── phone_id_map.txt
 │   └── hifigan_csmsc.onnx
 └── syn_utils.py
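A wrongly unpacked archive is an easy mistake to make at this step, so it can help to check the layout before running anything. The sketch below is a hypothetical helper (not shipped with RapidTTS) that looks for the three resource files shown in the tree above.

```python
from pathlib import Path
from typing import List, Sequence

# Resource files taken from the directory tree above.
EXPECTED = [
    "resources/fastspeech2_csmsc_onnx_0.2.0/fastspeech2_csmsc.onnx",
    "resources/fastspeech2_csmsc_onnx_0.2.0/phone_id_map.txt",
    "resources/hifigan_csmsc.onnx",
]


def missing_resources(project_dir, expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected files that are not present under project_dir."""
    root = Path(project_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from inside the csmsc_tts3 directory.
    missing = missing_resources(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All resource files found.")
```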

Install the dependencies from requirements.txt:

pip install -r requirements.txt -i https://pypi.douban.com/simple/

Run tts3.py:
```
python tts3.py
```

The run log looks like this:

 frontend done!
 warm up done!
 Building prefix dict from the default dictionary ...
 Loading model from cache C:\Users\WANGJI~1\AppData\Local\Temp\jieba.cache
 Loading model cost 0.836 seconds.
 Prefix dict has been built successfully.
 009901, mel: (331, 80), wave: 99300, time: 1.3718173s, Hz: 72385.938204132, RTF: 0.33155610876132857.
 009902, mel: (288, 80), wave: 86400, time: 1.1350326000000024s, Hz: 76121.49025085453, RTF: 0.3152854722222228.
 009903, mel: (341, 80), wave: 102300, time: 1.4687841000000006s, Hz: 69649.7502651354, RTF: 0.3445812785923755.
 generation speed: 72441.68237053939Hz, RTF: 0.33130097499999983

The generated results are saved to the infer_result directory.
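The fields in each log line are related by simple arithmetic. The sketch below reproduces them for the first utterance (009901), assuming a 24000 Hz output sample rate; that rate is not stated in the post but is the value that makes the logged Hz and RTF figures consistent with each other. Hz is samples generated per second of compute, RTF (real-time factor) is compute time divided by audio duration, and the ratio of samples to mel frames gives the hop size.

```python
SAMPLE_RATE = 24000  # assumption inferred from the logged Hz and RTF values


def generation_speed(num_samples: int, seconds: float) -> float:
    """Audio samples produced per second of compute (the 'Hz' field)."""
    return num_samples / seconds


def real_time_factor(num_samples: int, seconds: float, sample_rate: int) -> float:
    """Compute time divided by audio duration; below 1 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return seconds / audio_seconds


# Values copied from the first log line above.
frames, samples, seconds = 331, 99300, 1.3718173

print(samples // frames)                                         # 300 samples per mel frame
print(generation_speed(samples, seconds))                        # roughly 72.4 kHz
print(round(real_time_factor(samples, seconds, SAMPLE_RATE), 3)) # 0.332
```

The results differ from the logged values only in the last few digits, which suggests the script reads its timer at slightly different moments. An RTF of about 0.33 means one second of audio takes roughly a third of a second to synthesize on the machine used for this log.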

ljspeech_tts3

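Supported synthesis language: English.
Adapted from ljspeech-TTS3 in PaddleSpeech.
The entire inference pipeline uses ONNXRuntime only.
The pretrained models released by PaddleSpeech are listed at the link. ljspeech_tts3 uses: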

Component	Model	Language
Acoustic model	fastspeech2_ljspeech	en
Vocoder	pwg_ljspeech	en

Usage

Download the resources from Google Drive | Baidu Netdisk (extraction code: 4vlu) and extract them into the ljspeech_tts3 directory. The final directory structure is:

 ljspeech_tts3
 ├── sentences_en.txt
 ├── requirements.txt
 ├── frontend
 ├── main.sh
 ├── tts3.py
 ├── infer_result
 ├── resources
 │   ├── fastspeech2_ljspeech
 │   │   ├── fastspeech2_ljspeech.onnx
 │   │   └── phone_id_map.txt
 │   └── pwgan_ljspeech.onnx
 └── syn_utils.py
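The phone_id_map.txt file in the tree above is what lets the frontend turn phoneme symbols into the integer ids the acoustic model consumes. The sketch below shows one way such a file could be read, assuming each line holds a symbol and an integer id separated by whitespace and that an `<unk>` entry exists. Both the file format and the sample entries are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_phone_id_map(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'symbol id' lines into a dict; malformed lines are skipped."""
    mapping: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        symbol, idx = parts
        mapping[symbol] = int(idx)
    return mapping


def phones_to_ids(phones: Iterable[str], mapping: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Look up each phone, falling back to the id of the unknown symbol."""
    unk_id = mapping[unk]
    return [mapping.get(p, unk_id) for p in phones]


# Hypothetical sample entries; the real file ships with the resources.
sample_lines = ["<pad> 0", "<unk> 1", "HH 2", "AH0 3"]
phone_map = parse_phone_id_map(sample_lines)
print(phones_to_ids(["HH", "AH0", "ZZ"], phone_map))  # [2, 3, 1]
```

With the real file, the lines would come from `open("resources/fastspeech2_ljspeech/phone_id_map.txt", encoding="utf-8")`.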

Install the dependencies from requirements.txt:

pip install -r requirements.txt -i https://pypi.douban.com/simple/

Run tts3.py:
```
python tts3.py
```
or
```
bash main.sh
```

The run log looks like this:

 frontend done!
 001, mel: (343, 80), wave: 87808, time: 7.583922399999999s, Hz: 11578.186242472837, RTF: 1.9044433677455357.
 002, mel: (274, 80), wave: 70144, time: 5.986744399999999s, Hz: 11716.561243394675, RTF: 1.8819514994154878.
 003, mel: (175, 80), wave: 44800, time: 3.911470399999999s, Hz: 11453.51349948683, RTF: 1.9251734414062498.
 004, mel: (217, 80), wave: 55552, time: 4.678628299999996s, Hz: 11873.585640758554, RTF: 1.8570632888104823.
 005, mel: (371, 80), wave: 94976, time: 7.7152417s, Hz: 12310.185834993608, RTF: 1.7911996045843162.
 006, mel: (338, 80), wave: 86528, time: 7.670878100000003s, Hz: 11280.071739420744, RTF: 1.954774801913832.
 007, mel: (205, 80), wave: 52480, time: 4.628822800000002s, Hz: 11337.668363997142, RTF: 1.9448443270769813.
 008, mel: (390, 80), wave: 99840, time: 8.2700763s, Hz: 12072.447745611855, RTF: 1.826473012319712.
 009, mel: (169, 80), wave: 43264, time: 4.2657806000000065s, Hz: 10142.12548840801, RTF: 2.1741004905926427.
 generation speed: 11613.502408804885Hz, RTF: 1.8986520365538124

The generated results are saved to the infer_result directory.
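The final "generation speed" line appears to be an aggregate over all utterances: total samples divided by total time. The sketch below recomputes such a summary from per-utterance log lines, assuming a 22050 Hz sample rate for the LJSpeech models (inferred from the logged Hz and RTF values, not stated in the post). The `summarize` helper is hypothetical, and only the first three log lines are used to keep the example short.

```python
import re
from typing import Iterable, Tuple

# Matches the 'wave' and 'time' fields of a per-utterance log line.
RECORD_RE = re.compile(r"wave: (\d+), time: ([0-9.]+)s")


def summarize(log_lines: Iterable[str], sample_rate: int) -> Tuple[float, float]:
    """Return (samples per second of compute, real-time factor) over all lines."""
    total_samples = 0
    total_seconds = 0.0
    for line in log_lines:
        match = RECORD_RE.search(line)
        if match:
            total_samples += int(match.group(1))
            total_seconds += float(match.group(2))
    if total_seconds == 0.0:
        raise ValueError("no per-utterance records found")
    speed = total_samples / total_seconds
    return speed, sample_rate / speed


LOG = [
    "001, mel: (343, 80), wave: 87808, time: 7.583922399999999s, Hz: 11578.186242472837, RTF: 1.9044433677455357.",
    "002, mel: (274, 80), wave: 70144, time: 5.986744399999999s, Hz: 11716.561243394675, RTF: 1.8819514994154878.",
    "003, mel: (175, 80), wave: 44800, time: 3.911470399999999s, Hz: 11453.51349948683, RTF: 1.9251734414062498.",
]

speed, rtf = summarize(LOG, sample_rate=22050)
print(f"generation speed: {speed:.1f}Hz, RTF: {rtf:.3f}")
```

An RTF near 1.9 means synthesis in this run took almost twice as long as the audio it produced, in contrast to the csmsc_tts3 run above, which finished faster than real time.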