TTS Model Implementation and Deployment

In the Week 2 individual work I already surveyed Microsoft's TTS model: given a speaker, a language, and the text to synthesize, it generates correspondingly realistic, natural-sounding speech. Microsoft's speech models are widely regarded as among the best available; the generated speech is very fluent and accurately pronounced, with careful attention to tone and coherence, which matches the positioning of a spoken-English training assistant very well.

Building on that Week 2 work, this week's task is to wrap TTS into a function and chain it together with the large language model and the digital-human model into a complete pipeline, producing a visualized English speaking assistant.

```python
import azure.cognitiveservices.speech as speechsdk

def tts(voicename="en-US-AvaMultilingualNeural", path_to_save='../files/audio.wav',
        text='Hello, I am SpeakSpark! Nice to meet you!'):
    # Fill in your Azure Speech subscription key and service region.
    speech_key, service_region = "", ""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_synthesis_voice_name = voicename
    # Write the synthesized audio to a WAV file instead of the speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename=path_to_save)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.speak_text_async(text).get()

def test_tts():
    speech_key, service_region = "", ""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_synthesis_language = "en-US"
    # Set the voice name; refer to https://aka.ms/speech/voices/neural for the full list.
    speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"

    # Create a speech synthesizer that uses the default speaker as audio output.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # Read the text to synthesize from console input.
    print("Type some text that you want to speak...")
    text = input()

    # Synthesize the received text to speech.
    # When this line runs, the synthesized speech should be audible on the speaker.
    result = speech_synthesizer.speak_text_async(text).get()
    print(result)
    # Check the result.
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized to speaker for text [{}]".format(text))
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
        print("Did you update the subscription info?")


if __name__ == '__main__':
    tts()
```
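
`speak_text_async` uses the voice's default prosody. If finer control over speaking rate or pauses is ever needed, the same synthesizer also accepts SSML via `speak_ssml_async`. Below is a minimal sketch of such a variant; `tts_ssml` is a hypothetical helper, not part of the project, and the `-10%` rate is illustrative:

```python
def tts_ssml(voicename="en-US-AvaMultilingualNeural",
             path_to_save='../files/audio.wav',
             text='Hello, I am SpeakSpark! Nice to meet you!'):
    speech_key, service_region = "", ""  # fill in as above
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    audio_config = speechsdk.audio.AudioOutputConfig(filename=path_to_save)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    # Slow the speech down slightly so learners can follow; the rate value is illustrative.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voicename}">'
        f'<prosody rate="-10%">{text}</prosody>'
        '</voice></speak>'
    )
    synthesizer.speak_ssml_async(ssml).get()
```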

With each module already wrapped, the remaining work is to chain them together into a single function.

Specifically: the user request (text plus a role) is sent to the LLM for inference; the LLM's reply is passed to the TTS model, and the generated speech is then fed into the SadTalker model, which produces the digital-human video that is returned:

```python
@app.route('/full', methods=['GET', 'POST'])  # accept POST as well, since the request carries a JSON body
def full():
    data = request.get_json()
    query = data.get('query', 'Hello, I am SpeakSpark! Nice to meet you!')
    role = data.get('role', 'girl')
    # Map the requested role to a matching Azure neural voice.
    voicename = 'en-US-AvaMultilingualNeural' if role == 'girl' else 'en-US-BrianMultilingualNeural'
    # 1. LLM inference on the user's query.
    response = pipe(prompt.format(query=query))
    # 2. Synthesize the reply to a WAV file.
    tts(voicename=voicename, path_to_save='./files/audio.wav', text=response.text)
    # 3. Drive the digital human with the synthesized audio;
    #    the rendered video is expected under ./files as {role}_audio.mp4.
    video = talker.test(
        pic_path=pic_path.format(role=role),
        crop_pic_path=crop_pic_path.format(role=role),
        first_coeff_path=first_coeff_path.format(role=role),
        crop_info=crop_info,
        driven_audio=audio,
        preprocess=preprocess_type,
        still_mode=is_still_mode,
        use_enhancer=enhancer,
        batch_size=batch_size,
        size=size_of_image,
        pose_style=pose_style,
        facerender=facerender,
        exp_scale=exp_weight,
        use_idle_mode=use_idle_mode,
        length_of_audio=length_of_audio,
        use_blink=blink_every,
        fps=20
    )
    print(response.text)
    return send_from_directory(directory='./files', path=f'{role}_audio.mp4')
```
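
Once the service is running, the whole pipeline can be exercised with a small client script. The URL below assumes the app is served at http://127.0.0.1:5000; adjust host and port to your deployment:

```python
import requests

# Hypothetical client for the /full endpoint defined above.
resp = requests.post(
    'http://127.0.0.1:5000/full',
    json={'query': 'She is good in playing piano.', 'role': 'girl'},
)
resp.raise_for_status()
# The endpoint returns the rendered digital-human video.
with open('girl_audio.mp4', 'wb') as f:
    f.write(resp.content)
```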

The concrete configuration of the LLM and the other models, along with the prompt, is given below. Specifically, it configures the Turbomind engine, defines a detailed prompt that guides the model to produce natural conversational replies, and sets the parameters needed for video generation, including image size, preprocessing type, whether to use the enhancer, still mode, and so on:

```python
import random

from flask import Flask, request, send_from_directory
from lmdeploy import pipeline, TurbomindEngineConfig
# SadTalker is the project's local wrapper class; the import path below is a
# placeholder -- adjust it to wherever SadTalker is defined in your repo.
from sadtalker import SadTalker

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('/root/assis/quant-4bit', backend_config=backend_config,
                temperature=0.8, model_name='internlm2-chat-7b')

prompt = '''
You are an English speaking practice assistant named SpeakSpark. Your task is to engage in natural, conversational English and correct any non-native expressions. Here are your guidelines:

1. When a user asks a question or makes a statement, respond with a natural and conversational reply.

2. If the user's expression needs improvement, give a natural reply first. Then, provide three specific suggestions to enhance their expression, explaining why each change makes the expression more natural.

### Example1:

User: How can I improve my English speaking skills?

Assistant: One great way to improve your English speaking skills is to practice with native speakers as much as possible. You can also try watching English movies or TV shows and repeating the lines to practice pronunciation.

### Example2:

User: I got so black after my vacation.

Assistant: It sounds like you had a lot of sun on your vacation! You must have spent a lot of time outdoors.

    Here are some suggestions to improve your expression:

    1. Use "tanned" instead of "black" to describe skin darkening from sun exposure. "Tanned" is the appropriate term in English.

    2. Replace "so" with "really" or "quite" to make the sentence sound more natural: "I got really tanned after my vacation."

    3. Specify "vacation" to make the sentence clear. You could also say "holiday" if you prefer British English: "I got really tanned after my holiday."

### User Input:

{query}

### Assistant:

'''

# testQuery = 'She is good in playing piano.'
# response = pipe(prompt.format(query=testQuery))
# print(response)

# SadTalker / video-generation parameters
blink_every = True
size_of_image = 256
preprocess_type = 'crop'
facerender = 'facevid2vid'
enhancer = False
is_still_mode = False
pic_path = './inputs/{role}.png'
crop_pic_path = './inputs/first_frame_dir_{role}/{role}.png'
first_coeff_path = './inputs/first_frame_dir_{role}/{role}.mat'
crop_info = ((403, 403), (19, 30, 502, 513), [40.05956541381802, 40.17324339233366, 443.7892505041507, 443.9029284826663])
exp_weight = 1
batch_size = 10
pose_style = random.randint(0, 45)  # pick a random head-pose style per run
use_ref_video = False
ref_video = None
ref_info = 'pose'
use_idle_mode = False
length_of_audio = 5
audio = './files/audio.wav'

talker = SadTalker()
```
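
The `@app.route` decorator used earlier presupposes a Flask application object created before the route definition; a minimal sketch of that missing glue, with the host and port as illustrative assumptions:

```python
# Created before the @app.route('/full') definition above.
app = Flask(__name__)

if __name__ == '__main__':
    # Serve the assistant; host/port here are illustrative defaults.
    app.run(host='0.0.0.0', port=5000)
```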
