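The popularity of ChatGPT has brought AI into everyday life. If ChatGPT could be made more interactive, so that it responds to spoken commands the way the "XiaoAi" (小爱同学) voice assistant does, it would be far more convenient to use.
We split the self-built AI assistant into four parts; speech synthesis is the first.
The remaining three are: Self-built AI Assistant: Speech Recognition; Self-built AI Assistant: Natural Language Parsing; and Self-built AI Assistant: Integration.
For speech synthesis we use PaddlePaddle (飞桨), which has an active community.
The content below is drawn from the official PaddlePaddle documentation and from my own hands-on practice.
Python version: Python 3.9.6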
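0. Introduction to PaddleSpeech
🚀 PaddleSpeech is an all-in-one speech toolkit that bundles a range of state-of-the-art speech algorithms and pretrained models. You can pick from its speech processing tools and pretrained models; it supports speech recognition, speech synthesis, audio classification, speaker verification, punctuation restoration, speech translation, and more. The PaddleSpeech Server module helps you deploy speech services on a server quickly. The PaddleSpeech team's paper, An Easy-to-Use All-in-One Speech Toolkit, was accepted to NAACL 2022 and won the NAACL 2022 Best Demo Award.
PaddleSpeech repository: https://github.com/PaddlePaddle/PaddleSpeech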
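1. A first look at the onnxruntime inference flow
Running the speech synthesis ONNX models provided by PaddleSpeech with onnxruntime takes only four steps:
- Text front end
- Load the models and create the sessions
- Model inference
- Save the audio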
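As a rough picture of how the four steps chain together, here is a minimal sketch. The function `run_tts` and its parameter names are hypothetical, and the stages passed in are toy stand-ins so the sketch runs without any model files; the real stages are built in the sections that follow.

```python
from typing import Callable, List, Sequence


def run_tts(text: str,
            frontend: Callable[[str], List[int]],
            load_models: Callable[[], dict],
            infer: Callable[[dict, List[int]], List[float]],
            save: Callable[[Sequence[float]], None]) -> List[float]:
    """Chain the four steps; every stage is injected by the caller."""
    phone_ids = frontend(text)        # 1. text front end
    sessions = load_models()          # 2. load models, create sessions
    wav = infer(sessions, phone_ids)  # 3. model inference
    save(wav)                         # 4. save the audio
    return wav


# Toy stand-ins (not real TTS stages): they only show the data flow.
saved: List[List[float]] = []
wav = run_tts(
    "abc",
    frontend=lambda t: [ord(c) for c in t],
    load_models=lambda: {"scale": 0.5},
    infer=lambda s, ids: [i * s["scale"] for i in ids],
    save=lambda w: saved.append(list(w)),
)
print(wav)
```

Keeping the stages as separate callables mirrors the tutorial's structure: the front end, the sessions, and the saving step can each be swapped without touching the others.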
2. Setting up the PaddleSpeech development environment
You can install PaddleSpeech from source:
# Install PaddleSpeech
!git clone https://gitee.com/paddlepaddle/PaddleSpeech.git
%cd PaddleSpeech
!pip install pytest-runner
!pip install .
# AI Studio reports an error because the paddlespeech repo contains broken symlinks.
# Run the following command to delete them.
!find -L /home/aistudio -type l -delete
# Download the models
%cd /home/aistudio/work
!wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip
!wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip
!unzip fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip
!unzip mb_melgan_csmsc_onnx_0.2.0.zip
# Download the nltk data package; skip this if the project already has it
%cd /home/aistudio
!wget -P data https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz
!tar zxvf data/nltk_data.tar.gz
3. TTS text front end
The text front end provided by PaddleSpeech converts Chinese text into the phoneme sequence that the models need for inference.
phones_dict = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt"
from paddlespeech.t2s.frontend.zh_frontend import Frontend
frontend = Frontend(
    phone_vocab_path=phones_dict,
    tone_vocab_path=None)
text = "今天天气真的很不错,我想出去玩!"
input_ids = frontend.get_input_ids(
    text,
    merge_sentences=True,  # True: merge all sentences into one sequence; False: split at punctuation
    get_tone_ids=False)
input_ids = input_ids['phone_ids']
print(input_ids)
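To make the idea of "phonemes to ids" concrete, here is a small sketch of the lookup. It assumes the vocabulary file holds one `phone id` pair per line; the sample entries in `SAMPLE_MAP` and the helper names `load_phone_vocab` and `phones_to_ids` are made up for illustration and are not part of PaddleSpeech. The real `Frontend` does considerably more before this lookup (text normalization, grapheme-to-phoneme conversion, tone handling).

```python
from typing import Dict, List


def load_phone_vocab(text: str) -> Dict[str, int]:
    """Parse lines of the form '<phone> <id>' into a dict."""
    vocab: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            vocab[parts[0]] = int(parts[1])
    return vocab


def phones_to_ids(phones: List[str], vocab: Dict[str, int],
                  unk: str = "<unk>") -> List[int]:
    """Map each phone to its id, falling back to the unknown token."""
    return [vocab.get(p, vocab[unk]) for p in phones]


# Hypothetical vocabulary, NOT the real phone_id_map.txt contents.
SAMPLE_MAP = """<pad> 0
<unk> 1
j 2
in1 3
t 4
ian1 5
"""

vocab = load_phone_vocab(SAMPLE_MAP)
ids = phones_to_ids(["j", "in1", "t", "ian1", "zzz"], vocab)
print(ids)
```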
4. Load the models and create the onnxruntime sessions
Create the onnxruntime sessions used for inference.
import onnxruntime as ort
# Model paths
onnx_am_encoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx"
onnx_am_decoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx"
onnx_am_postnet = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx"
onnx_voc_melgan = "/home/aistudio/work/mb_melgan_csmsc_onnx_0.2.0/mb_melgan_csmsc.onnx"
# Run inference on the CPU
providers = ['CPUExecutionProvider']
# Configure the ort session options
sess_options = ort.SessionOptions()
# Create the sessions
am_encoder_infer_sess = ort.InferenceSession(onnx_am_encoder, providers=providers, sess_options=sess_options)
am_decoder_sess = ort.InferenceSession(onnx_am_decoder, providers=providers, sess_options=sess_options)
am_postnet_sess = ort.InferenceSession(onnx_am_postnet, providers=providers, sess_options=sess_options)
voc_melgan_sess = ort.InferenceSession(onnx_voc_melgan, providers=providers, sess_options=sess_options)
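The inference code in the next section feeds each session by input name ('text', 'xs', 'logmel'). If you want to check those names yourself, a session's `get_inputs()` and `get_outputs()` return node descriptions with `name` and `shape` attributes. The helper `describe_io` below is a hypothetical convenience wrapper, and `fake_sess` is a stand-in object with invented names and shapes so the sketch runs without onnxruntime or the model files; in practice you would pass one of the sessions created above.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def describe_io(sess: Any) -> List[Tuple[str, str, Any]]:
    """List (kind, name, shape) for every input and output of a session."""
    rows = []
    for kind, nodes in (("input", sess.get_inputs()),
                        ("output", sess.get_outputs())):
        for node in nodes:
            rows.append((kind, node.name, node.shape))
    return rows


# Stand-in for an ort.InferenceSession; the output name and the shapes
# here are invented for the demo, not read from the real model.
fake_sess = SimpleNamespace(
    get_inputs=lambda: [SimpleNamespace(name="text", shape=[None])],
    get_outputs=lambda: [SimpleNamespace(name="hs", shape=[1, None, 384])],
)

for row in describe_io(fake_sess):
    print(row)
```

With the real sessions, `describe_io(am_encoder_infer_sess)` would print whatever the exported model actually declares.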
5. Model inference
# Helper denorm: the mel output was normalized during training, so it must be de-normalized at inference time
import numpy as np
am_stat_path = r"/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy"
am_mu, am_std = np.load(am_stat_path)
from paddlespeech.server.utils.util import denorm
# Wrap the inference stage
# End-to-end synthesis: synthesize the whole sentence in one pass
def inference(text):
    phone_ids = frontend.get_input_ids(text, merge_sentences=True, get_tone_ids=False)['phone_ids']
    orig_hs = am_encoder_infer_sess.run(None, input_feed={'text': phone_ids[0].numpy()})
    hs = orig_hs[0]
    am_decoder_output = am_decoder_sess.run(None, input_feed={'xs': hs})
    am_postnet_output = am_postnet_sess.run(None, input_feed={
        'xs': np.transpose(am_decoder_output[0], (0, 2, 1))
    })
    # am_decoder_output is a list holding one array, so NumPy broadcasts the
    # sum and adds a leading axis; that is why [0][0] is used on the next line
    am_output_data = am_decoder_output + np.transpose(am_postnet_output[0], (0, 2, 1))
    normalized_mel = am_output_data[0][0]
    mel = denorm(normalized_mel, am_mu, am_std)
    wav = voc_melgan_sess.run(output_names=None, input_feed={'logmel': mel})[0]
    return wav
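The de-normalization step is worth seeing in isolation. The sketch below assumes `denorm` undoes a per-dimension z-score, that is `x * std + mean`; that is my reading of the helper rather than something this article states, so treat it as an assumption. It uses plain lists instead of NumPy, and the function names and tiny two-dimensional "mel" frames are invented for the demo.

```python
from typing import List

Frames = List[List[float]]


def norm_sketch(frames: Frames, mean: List[float], std: List[float]) -> Frames:
    """Per-dimension z-score, as assumed to be applied during training."""
    return [[(x - m) / s for x, m, s in zip(f, mean, std)] for f in frames]


def denorm_sketch(frames: Frames, mean: List[float], std: List[float]) -> Frames:
    """Inverse of norm_sketch: scale by std, then shift by mean."""
    return [[x * s + m for x, m, s in zip(f, mean, std)] for f in frames]


# Invented statistics and frames, chosen so the arithmetic is exact.
mean = [1.0, -2.0]
std = [2.0, 0.5]
mel = [[3.0, -1.5], [1.0, -2.0]]

normalized = norm_sketch(mel, mean, std)
restored = denorm_sketch(normalized, mean, std)
print(normalized)
print(restored)
```

The vocoder was trained on un-normalized mel features, which is why the acoustic model's output has to pass through this step before it reaches `voc_melgan_sess`.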
6. Saving the audio
# Save as a wav file and play it back
import soundfile as sf
import time
text = "欢迎使用飞桨语音合成系统,测试一下合成效果。"
t1 = time.time()
wav = inference(text)
print("Synthesis time:", time.time() - t1)
sf.write("demo.wav", wav, samplerate=24000)
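If soundfile is not available, the standard library's `wave` module can write the same kind of file. This is a sketch under a few assumptions: the samples are mono floats in [-1, 1], already flattened to one dimension (for example with `wav.flatten()`), and 16-bit PCM is an acceptable output format. The helper names and the test tone are invented; the tone stands in for real TTS output.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable, Tuple


def write_wav_pcm16(path: str, samples: Iterable[float],
                    sample_rate: int = 24000) -> None:
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))   # clip to the valid range
        pcm += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)                   # 2 bytes = 16 bits
        f.setframerate(sample_rate)
        f.writeframes(bytes(pcm))


def wav_info(path: str) -> Tuple[int, int, int, int]:
    """Return (channels, sample width, frame rate, frame count)."""
    with wave.open(path, "rb") as f:
        return (f.getnchannels(), f.getsampwidth(),
                f.getframerate(), f.getnframes())


# Stand-in signal: 0.1 s of a 440 Hz tone instead of real TTS output.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
tone_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
write_wav_pcm16(tone_path, tone)
print(wav_info(tone_path))
```

The 24000 default matches the `samplerate` used in the `sf.write` call above; writing at a different rate would change the playback speed and pitch.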
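7. Streaming speech synthesis
Streaming synthesis only shows its benefit when paired with streaming playback. The idea is to split each stage into chunks, synthesize chunk by chunk, and have the player play the chunks as they arrive.
Streaming playback needs a sound card, so it is best run on your own laptop. It is awkward to demonstrate on AI Studio, so here we only show the concatenated result and do not demonstrate streaming playback.
To try it locally, download streaming_tts.py to your machine, then download the models and install PaddleSpeech as described above (note that nltk_data downloads slowly, so fetch it in advance as shown above).
pyaudio must also be installed.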
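A quick back-of-the-envelope calculation shows what chunking buys. The numbers come from this section's streaming parameters (vocoder block of 36 mel frames, pad of 14, upsample factor of 300) and the 24000 Hz sample rate used when saving; reading the upsample factor as "audio samples per mel frame" is my interpretation of how the code uses it. The function names are hypothetical.

```python
from typing import Tuple


def frame_ms(hop: int, sample_rate: int) -> float:
    """Duration of one mel frame in milliseconds."""
    return hop * 1000 / sample_rate


def first_chunk(voc_block: int, voc_pad: int, hop: int,
                sample_rate: int, mel_len: int) -> Tuple[int, int, float]:
    """Mel frames needed before the vocoder can start, and how much
    audio (samples, milliseconds) the first emitted chunk contains."""
    frames_needed = min(voc_block + voc_pad, mel_len)
    audio_samples = min(voc_block, mel_len) * hop
    return frames_needed, audio_samples, audio_samples * 1000 / sample_rate


print(frame_ms(300, 24000))
print(first_chunk(voc_block=36, voc_pad=14, hop=300,
                  sample_rate=24000, mel_len=200))
```

In other words, playback can begin once roughly 50 mel frames exist, instead of waiting for the whole sentence; `mel_len=200` is just an example sentence length.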
# Configure the streaming parameters
import math
from paddlespeech.server.utils.util import get_chunks
voc_block = 36
voc_pad = 14
am_block = 72
am_pad = 12
voc_upsample = 300
def depadding(data, chunk_num, chunk_id, block, pad, upsample):
    """
    Streaming inference removes the result of pad inference
    """
    front_pad = min(chunk_id * block, pad)
    # first chunk
    if chunk_id == 0:
        data = data[:block * upsample]
    # last chunk
    elif chunk_id == chunk_num - 1:
        data = data[front_pad * upsample:]
    # middle chunk
    else:
        data = data[front_pad * upsample:(front_pad + block) * upsample]
    return data
def inference_stream(text):
    input_ids = frontend.get_input_ids(
        text,
        merge_sentences=False,
        get_tone_ids=False)
    phone_ids = input_ids["phone_ids"]
    print(phone_ids)
    # Process one sentence at a time
    for sentence_ids in phone_ids:
        # am
        voc_chunk_id = 0
        orig_hs = am_encoder_infer_sess.run(
            None, input_feed={'text': sentence_ids.numpy()})
        orig_hs = orig_hs[0]
        # streaming voc chunk info
        mel_len = orig_hs.shape[1]
        voc_chunk_num = math.ceil(mel_len / voc_block)
        start = 0
        end = min(voc_block + voc_pad, mel_len)
        # streaming am
        hss = get_chunks(orig_hs, am_block, am_pad, "am")
        am_chunk_num = len(hss)
        # am_chunk_id replaces the original inner `i`, which shadowed the outer loop index
        for am_chunk_id, hs in enumerate(hss):
            am_decoder_output = am_decoder_sess.run(
                None, input_feed={'xs': hs})
            am_postnet_output = am_postnet_sess.run(
                None,
                input_feed={
                    'xs': np.transpose(am_decoder_output[0], (0, 2, 1))
                })
            am_output_data = am_decoder_output + np.transpose(
                am_postnet_output[0], (0, 2, 1))
            normalized_mel = am_output_data[0][0]
            sub_mel = denorm(normalized_mel, am_mu, am_std)
            sub_mel = depadding(sub_mel, am_chunk_num, am_chunk_id,
                                am_block, am_pad, 1)
            if am_chunk_id == 0:
                mel_streaming = sub_mel
            else:
                mel_streaming = np.concatenate(
                    (mel_streaming, sub_mel), axis=0)
            # streaming voc
            # Once the streaming AM has produced enough mel frames for the
            # next vocoder chunk, run streaming vocoder inference
            while (mel_streaming.shape[0] >= end and
                   voc_chunk_id < voc_chunk_num):
                voc_chunk = mel_streaming[start:end, :]
                sub_wav = voc_melgan_sess.run(
                    output_names=None, input_feed={'logmel': voc_chunk})
                sub_wav = depadding(
                    sub_wav[0], voc_chunk_num, voc_chunk_id,
                    voc_block, voc_pad, voc_upsample)
                yield sub_wav
                voc_chunk_id += 1
                start = max(
                    0, voc_chunk_id * voc_block - voc_pad)
                end = min(
                    (voc_chunk_id + 1) * voc_block + voc_pad,
                    mel_len)
text = "欢迎使用飞桨语音合成系统,测试一下合成效果。"
wavs = []
t1 = time.time()
for sub_wav in inference_stream(text):
    print("Response time:", time.time() - t1)
    t1 = time.time()
    wavs.append(sub_wav.flatten())
wav = np.concatenate(wavs)
print(wav.shape)
sf.write("demo_stream.wav", data=wav, samplerate=24000)
[Tensor(shape=[21], dtype=int64, place=Place(cpu), stop_gradient=True,
[71 , 199, 126, 177, 115, 138, 69 , 46 , 151, 89 , 241, 120, 71 , 42 ,
39 , 57 , 260, 75 , 182, 163, 179]), Tensor(shape=[16], dtype=int64, place=Place(cpu), stop_gradient=True,
[38 , 44 , 177, 116, 73 , 260, 80 , 71 , 42 , 39 , 57 , 260, 99 , 70 ,
232, 179])]
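To see why the block and pad parameters lose no frames, the following sketch splits a sequence into overlapping chunks and then strips the padding again. `split_with_pad` is a hypothetical stand-in that mirrors the `start`/`end` arithmetic of the vocoder loop above; it is not PaddleSpeech's `get_chunks`. `depad` follows the same rules as `depadding` with an upsample factor of 1, and a list of integers stands in for the mel frames.

```python
import math
from typing import List, Sequence


def split_with_pad(frames: Sequence[int], block: int, pad: int) -> List[List[int]]:
    """Cut frames into blocks, each extended by up to `pad` frames of context."""
    n = len(frames)
    chunk_num = math.ceil(n / block)
    chunks = []
    for k in range(chunk_num):
        start = max(0, k * block - pad)
        end = min((k + 1) * block + pad, n)
        chunks.append(list(frames[start:end]))
    return chunks


def depad(chunk: List[int], chunk_num: int, chunk_id: int,
          block: int, pad: int) -> List[int]:
    """Drop the context frames so that only the chunk's own block remains."""
    front_pad = min(chunk_id * block, pad)
    if chunk_id == 0:                       # first chunk: no front padding
        return chunk[:block]
    if chunk_id == chunk_num - 1:           # last chunk: keep everything after the pad
        return chunk[front_pad:]
    return chunk[front_pad:front_pad + block]   # middle chunk


frames = list(range(100))                   # stand-in for 100 mel frames
chunks = split_with_pad(frames, block=36, pad=14)
restored: List[int] = []
for k, chunk in enumerate(chunks):
    restored.extend(depad(chunk, len(chunks), k, block=36, pad=14))

print([len(c) for c in chunks])
print(restored == frames)
```

The padding gives each chunk some neighbouring context so the model sees what comes before and after the block; only the block itself is kept, so the concatenated pieces line up with the original sequence.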
8. Code listings
- Non-streaming
  - Code: not_stream.py
phones_dict = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt"
from paddlespeech.t2s.frontend.zh_frontend import Frontend
frontend = Frontend(
    phone_vocab_path=phones_dict,
    tone_vocab_path=None)
text = "今天天气真的很不错,我想出去玩!"
input_ids = frontend.get_input_ids(
    text,
    merge_sentences=True,  # True: merge all sentences into one sequence; False: split at punctuation
    get_tone_ids=False)
input_ids = input_ids['phone_ids']
print(input_ids)
import onnxruntime as ort
# Model paths
onnx_am_encoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx"
onnx_am_decoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx"
onnx_am_postnet = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx"
onnx_voc_melgan = "/home/aistudio/work/mb_melgan_csmsc_onnx_0.2.0/mb_melgan_csmsc.onnx"
# Run inference on the CPU
providers = ['CPUExecutionProvider']
# Configure the ort session options
sess_options = ort.SessionOptions()
# Create the sessions
am_encoder_infer_sess = ort.InferenceSession(onnx_am_encoder, providers=providers, sess_options=sess_options)
am_decoder_sess = ort.InferenceSession(onnx_am_decoder, providers=providers, sess_options=sess_options)
am_postnet_sess = ort.InferenceSession(onnx_am_postnet, providers=providers, sess_options=sess_options)
voc_melgan_sess = ort.InferenceSession(onnx_voc_melgan, providers=providers, sess_options=sess_options)
# Helper denorm: the mel output was normalized during training, so it must be de-normalized at inference time
import numpy as np
am_stat_path = r"/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy"
am_mu, am_std = np.load(am_stat_path)
from paddlespeech.server.utils.util import denorm
# Wrap the inference stage
# End-to-end synthesis: synthesize the whole sentence in one pass
def inference(text):
    phone_ids = frontend.get_input_ids(text, merge_sentences=True, get_tone_ids=False)['phone_ids']
    orig_hs = am_encoder_infer_sess.run(None, input_feed={'text': phone_ids[0].numpy()})
    hs = orig_hs[0]
    am_decoder_output = am_decoder_sess.run(None, input_feed={'xs': hs})
    am_postnet_output = am_postnet_sess.run(None, input_feed={
        'xs': np.transpose(am_decoder_output[0], (0, 2, 1))
    })
    am_output_data = am_decoder_output + np.transpose(am_postnet_output[0], (0, 2, 1))
    normalized_mel = am_output_data[0][0]
    mel = denorm(normalized_mel, am_mu, am_std)
    wav = voc_melgan_sess.run(output_names=None, input_feed={'logmel': mel})[0]
    return wav
# Save as a wav file and play it back
import soundfile as sf
import time
text = "欢迎使用飞桨语音合成系统,测试一下合成效果。"
t1 = time.time()
wav = inference(text)
print("Synthesis time:", time.time() - t1)
sf.write("demo.wav", wav, samplerate=24000)
  - Result
    - Run the command:
      python3 not_stream.py
    - Find demo.wav in the code directory.
- Streaming
  - Code: stream.py
phones_dict = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt"
from paddlespeech.t2s.frontend.zh_frontend import Frontend
frontend = Frontend(
    phone_vocab_path=phones_dict,
    tone_vocab_path=None)
text = "今天天气真的很不错,我想出去玩!"
input_ids = frontend.get_input_ids(
    text,
    merge_sentences=True,  # True: merge all sentences into one sequence; False: split at punctuation
    get_tone_ids=False)
input_ids = input_ids['phone_ids']
print(input_ids)
import onnxruntime as ort
# Model paths
onnx_am_encoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx"
onnx_am_decoder = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx"
onnx_am_postnet = "/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx"
onnx_voc_melgan = "/home/aistudio/work/mb_melgan_csmsc_onnx_0.2.0/mb_melgan_csmsc.onnx"
# Run inference on the CPU
providers = ['CPUExecutionProvider']
# Configure the ort session options
sess_options = ort.SessionOptions()
# Create the sessions
am_encoder_infer_sess = ort.InferenceSession(onnx_am_encoder, providers=providers, sess_options=sess_options)
am_decoder_sess = ort.InferenceSession(onnx_am_decoder, providers=providers, sess_options=sess_options)
am_postnet_sess = ort.InferenceSession(onnx_am_postnet, providers=providers, sess_options=sess_options)
voc_melgan_sess = ort.InferenceSession(onnx_voc_melgan, providers=providers, sess_options=sess_options)
# Helper denorm: the mel output was normalized during training, so it must be de-normalized at inference time
import numpy as np
am_stat_path = r"/home/aistudio/work/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy"
am_mu, am_std = np.load(am_stat_path)
# Configure the streaming parameters
import math
from paddlespeech.server.utils.util import get_chunks
voc_block = 36
voc_pad = 14
am_block = 72
am_pad = 12
voc_upsample = 300
def depadding(data, chunk_num, chunk_id, block, pad, upsample):
    """
    Streaming inference removes the result of pad inference
    """
    front_pad = min(chunk_id * block, pad)
    # first chunk
    if chunk_id == 0:
        data = data[:block * upsample]
    # last chunk
    elif chunk_id == chunk_num - 1:
        data = data[front_pad * upsample:]
    # middle chunk
    else:
        data = data[front_pad * upsample:(front_pad + block) * upsample]
    return data
from paddlespeech.server.utils.util import denorm
def inference_stream(text):
    input_ids = frontend.get_input_ids(
        text,
        merge_sentences=False,
        get_tone_ids=False)
    phone_ids = input_ids["phone_ids"]
    print(phone_ids)
    # Process one sentence at a time
    for sentence_ids in phone_ids:
        # am
        voc_chunk_id = 0
        orig_hs = am_encoder_infer_sess.run(
            None, input_feed={'text': sentence_ids.numpy()})
        orig_hs = orig_hs[0]
        # streaming voc chunk info
        mel_len = orig_hs.shape[1]
        voc_chunk_num = math.ceil(mel_len / voc_block)
        start = 0
        end = min(voc_block + voc_pad, mel_len)
        # streaming am
        hss = get_chunks(orig_hs, am_block, am_pad, "am")
        am_chunk_num = len(hss)
        # am_chunk_id replaces the original inner `i`, which shadowed the outer loop index
        for am_chunk_id, hs in enumerate(hss):
            am_decoder_output = am_decoder_sess.run(
                None, input_feed={'xs': hs})
            am_postnet_output = am_postnet_sess.run(
                None,
                input_feed={
                    'xs': np.transpose(am_decoder_output[0], (0, 2, 1))
                })
            am_output_data = am_decoder_output + np.transpose(
                am_postnet_output[0], (0, 2, 1))
            normalized_mel = am_output_data[0][0]
            sub_mel = denorm(normalized_mel, am_mu, am_std)
            sub_mel = depadding(sub_mel, am_chunk_num, am_chunk_id,
                                am_block, am_pad, 1)
            if am_chunk_id == 0:
                mel_streaming = sub_mel
            else:
                mel_streaming = np.concatenate(
                    (mel_streaming, sub_mel), axis=0)
            # streaming voc
            # Once the streaming AM has produced enough mel frames for the
            # next vocoder chunk, run streaming vocoder inference
            while (mel_streaming.shape[0] >= end and
                   voc_chunk_id < voc_chunk_num):
                voc_chunk = mel_streaming[start:end, :]
                sub_wav = voc_melgan_sess.run(
                    output_names=None, input_feed={'logmel': voc_chunk})
                sub_wav = depadding(
                    sub_wav[0], voc_chunk_num, voc_chunk_id,
                    voc_block, voc_pad, voc_upsample)
                yield sub_wav
                voc_chunk_id += 1
                start = max(
                    0, voc_chunk_id * voc_block - voc_pad)
                end = min(
                    (voc_chunk_id + 1) * voc_block + voc_pad,
                    mel_len)
import time
import soundfile as sf
text = "欢迎使用飞桨语音合成系统,测试一下合成效果。"
wavs = []
t1 = time.time()
for sub_wav in inference_stream(text):
    print("Response time:", time.time() - t1)
    t1 = time.time()
    wavs.append(sub_wav.flatten())
wav = np.concatenate(wavs)
print(wav.shape)
sf.write("demo_stream.wav", data=wav, samplerate=24000)
  - Result
    - Run the command:
      python3 stream.py
    - Find demo_stream.wav in the code directory and play it to hear the synthesized speech.