TTS: CosyVoice Voice Cloning

燕双嘤

Last modified: 2025-05-08 10:04:27

First published: 2025-04-28 10:46:21

Article link: https://blog.csdn.net/qq_42192693/article/details/147550415

【Paper】CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

【Paper】CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

【Project】https://github.com/FunAudioLLM/CosyVoice

1. CosyVoice voice cloning

【Server】CosyVoice/runtime/python/fastapi/server.py. Add these environment variables (adjust the path to your own checkout):
PYTHONUNBUFFERED=1;PYTHONPATH=D:\PyCharmWorkSpace\Linly-Talker\CosyVoice\third_party\Matcha-TTS
【Error】TypeError: expected str, bytes or os.PathLike object, not MultiplexedPath

【Fix】MultiplexedPath is not supported on Windows, so pass cache_dir explicitly when constructing the text normalizers.
self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, overwrite_cache=True)
self.en_tn_model = EnNormalizer()
👇
self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, overwrite_cache=True, cache_dir="tn")
self.en_tn_model = EnNormalizer(cache_dir="tn")

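【Client】CosyVoice/runtime/python/fastapi/client.py

mode: set to zero_shot for voice cloning
prompt_wav: the reference audio
prompt_text: the transcript of the reference audio

【Latency】From submitting the text 👉 to the output .wav: 2.8 s in total

Receiving the response object: 40 ms
Assembling the response body: 2400 ms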
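
The two stages above can be timed separately. The following is a minimal sketch, assuming the response body arrives as an iterable of byte chunks (which is what `requests`' `Response.iter_content` yields); `time_stream` is a hypothetical helper, not part of CosyVoice. Note that with `stream=True`, `requests` returns once the headers arrive and downloads the body only while it is iterated, so the "assembling" stage probably includes the time the server spends synthesizing audio.

```python
import time
from typing import Iterable, Tuple


def time_stream(chunks: Iterable[bytes]) -> Tuple[bytes, float, float]:
    """Consume a chunk iterator and time it.

    Returns (audio_bytes, seconds_to_first_chunk, seconds_total).
    """
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    # For an empty stream, report the total time for both values.
    first = (first_chunk_at if first_chunk_at is not None else end) - start
    # Join once at the end instead of growing a bytes object chunk by chunk.
    return b"".join(parts), first, end - start
```

Assumed usage in the client: `audio, first, total = time_stream(response.iter_content(chunk_size=16000))`. A large gap between `first` and `total` would indicate that most of the wait is generation time rather than byte concatenation.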

2. CosyVoice2 voice cloning (streaming)

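【Server】CosyVoice/runtime/python/fastapi/server.py. Add these environment variables (adjust the path to your own checkout):
PYTHONUNBUFFERED=1;PYTHONPATH=D:\PyCharmWorkSpace\Linly-Talker\CosyVoice\third_party\Matcha-TTS
【Error】ZeroDivisionError: 0.0 cannot be raised to a negative power

【Fix】The installed diffusers version is too new; downgrading to 0.29.0 is recommended.

【Error】Pretrained voices cannot be found

【Fix】Download spk2info.pt manually and copy it into pretrained_models/CosyVoice2-0.5B, then rerun webui.py; the pretrained voices then show up. This fix comes from an issue on the project's GitHub repository.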

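【Client】CosyVoice/runtime/python/fastapi/client.py

mode: set to zero_shot for voice cloning
prompt_wav: the reference audio
prompt_text: the transcript of the reference audio

【Latency】From submitting the text 👉 to the output .wav: 2.8 s in total, almost the same as offline CosyVoice.

Receiving the response object: 40 ms
Assembling the response body: 2400 ms

【Streaming client】webui.py. If you would rather not use Gradio, see this reference implementation: https://zhuanlan.zhihu.com/p/16096611214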
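
To benefit from streaming, a client has to consume audio chunk by chunk instead of waiting for the full body. The sketch below assumes the server streams raw 16-bit PCM in the machine's native (little-endian) byte order, which is an assumption about the FastAPI server rather than something stated here. Network chunks can split a 2-byte sample in half, so the leftover byte is carried into the next chunk. `iter_pcm16_frames` is a hypothetical helper name.

```python
from array import array
from typing import Iterable, Iterator


def iter_pcm16_frames(chunks: Iterable[bytes]) -> Iterator[array]:
    """Yield arrays of int16 samples from arbitrarily split byte chunks."""
    carry = b""
    for chunk in chunks:
        buf = carry + chunk
        usable = len(buf) - (len(buf) % 2)  # keep whole 2-byte samples only
        carry = buf[usable:]                # at most one leftover byte
        if usable:
            samples = array("h")            # "h" = signed 16-bit integer
            samples.frombytes(buf[:usable])
            yield samples
    if carry:
        raise ValueError("stream ended in the middle of a sample")
```

Each yielded array can be handed to an audio player or appended to a file as soon as it arrives, which is where the latency gain of streaming comes from.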

【Inference parameters】

cosyvoice = CosyVoice2(args.model_dir, load_jit=False, load_trt=True, fp16=True, use_flow_cache=False)

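【Supported vocal markers】https://funaudiollm.github.io/cosyvoice2/

在他讲述那个荒诞故事的过程中，他突然[laughter]停下来，因为他自己也被逗笑了[laughter]。
(English: While telling that absurd story, he suddenly [laughter] stopped, because he had made himself laugh too [laughter].)

追求卓越不是终点，它需要你每天都<strong>付出</strong>和<strong>精进</strong>，最终才能达到巅峰。
(English: The pursuit of excellence is not an endpoint; it requires you to <strong>give your all</strong> and <strong>keep improving</strong> every day before you can finally reach the summit.)

当你用心去倾听一首音乐时[breath]，你会开始注意到那些细微的音符变化[breath]，并通过它们感受到音乐背后的情感。
(English: When you listen to a piece of music attentively [breath], you start to notice the subtle changes in the notes [breath] and, through them, feel the emotion behind the music.)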

'[breath]': breath
'<strong>', '</strong>': emphasis
'[noise]': noise
'[laughter]': laughter
'[cough]': cough
'[clucking]': clucking
'[accent]': accent (stress)
'[quick_breath]': quick breath
'<laughter>', '</laughter>'
'[hissing]': hissing
'[sigh]': sigh
'[vocalized-noise]': vocalized noise
'[lipsmack]': lip smack
'[mn]'

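A mistyped marker is easy to miss in a long input. The following is a minimal sketch of a pre-flight check against the markers listed above; `find_marker_problems` is a hypothetical helper, and how the model treats an unknown marker is not covered by this article.

```python
import re

# Markers copied from the list above.
BRACKET_TAGS = {"breath", "noise", "laughter", "cough", "clucking", "accent",
                "quick_breath", "hissing", "sigh", "vocalized-noise",
                "lipsmack", "mn"}
SPAN_TAGS = {"strong", "laughter"}

_BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")
_SPAN_RE = re.compile(r"<(/?)([A-Za-z_-]+)>")


def find_marker_problems(text: str) -> list:
    """Return human-readable problems; an empty list means the text looks fine."""
    problems = []
    for name in _BRACKET_RE.findall(text):
        if name not in BRACKET_TAGS:
            problems.append(f"unknown marker [{name}]")
    open_tags = []  # stack of span tags that are still open
    for closing, name in _SPAN_RE.findall(text):
        if name not in SPAN_TAGS:
            problems.append(f"unknown tag <{closing}{name}>")
        elif not closing:
            open_tags.append(name)
        elif not open_tags or open_tags.pop() != name:
            problems.append(f"unmatched </{name}>")
    problems.extend(f"unclosed <{name}>" for name in open_tags)
    return problems
```

The span pattern only matches plain tag names, so the `<|endofprompt|>` separator used in instruct prompts is not reported as an unknown tag.
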
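【Tone control】Only available through the inference_instruct2() mode. In client.py, add instruct2 to the allowed modes and select it:

parser.add_argument('--mode', default='instruct2', choices=['sft', 'zero_shot', 'cross_lingual', 'instruct', 'instruct2'])

用惊讶的语气说<|endofprompt|>走进家门，看见墙上挂满了我的照片，我惊讶得愣住了。原来家人悄悄为我准备了一个惊喜的纪念墙。
(English: Say it in a surprised tone<|endofprompt|>Walking through the front door, I saw the wall covered with my photos and froze in surprise. My family had quietly prepared a surprise memory wall for me.)

用伤心的语气说<|endofprompt|>收到拒信的那一刻，我感到无比伤心。虽然知道失败是成长的一部分，但仍然难以掩饰心中的失落。
(English: Say it in a sad tone<|endofprompt|>The moment I received the rejection letter, I felt deeply sad. I know failure is part of growing up, but it is still hard to hide the disappointment.)

用开心的语气说<|endofprompt|>参加朋友的婚礼，看着新人幸福的笑脸，我感到无比开心。这样的爱与承诺，总是令人心生向往。
(English: Say it in a happy tone<|endofprompt|>At my friend's wedding, watching the newlyweds' happy faces, I felt overjoyed. Such love and commitment always fills people with longing.)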
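
The example lines above combine the instruction and the text to speak in one string. The client payload in this article uses two separate fields, `instruct_text` and `tts_text`, so a combined line has to be split first. A minimal sketch follows; `split_instruct_prompt` is a hypothetical helper, and it assumes the server side adds the `<|endofprompt|>` separator itself.

```python
END_OF_PROMPT = "<|endofprompt|>"


def split_instruct_prompt(line: str) -> dict:
    """Split 'instruction<|endofprompt|>text' into the two payload fields."""
    instruct_text, sep, tts_text = line.partition(END_OF_PROMPT)
    if not sep:
        raise ValueError(f"missing {END_OF_PROMPT} separator")
    return {
        "instruct_text": instruct_text.strip(),
        "tts_text": tts_text.strip(),
    }
```

The returned dict has the same keys as the request payload, so it can be passed to the request as form data.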

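【Error】ValueError: buffer size must be a multiple of element size

【Fix】In client.py, build the request in the else branch as follows, so that the prompt audio is sent together with the instruction:

else:
    payload = {
        'tts_text': args.tts_text,
        'instruct_text': args.instruct_text
    }
    # The prompt audio is uploaded as a file alongside the form fields
    files = [('prompt_wav', ('prompt_wav', open(args.prompt_wav, 'rb'), 'application/octet-stream'))]
    response = requests.request("GET", url, data=payload, files=files, stream=True)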
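
This ValueError is the message NumPy's `frombuffer` raises when the byte count is not a multiple of the element size. When decoding a response as int16, that often means the body is not audio at all, for example an error message returned because a required field was missing. A minimal sketch of a check to run before decoding; `check_pcm16_body` is a hypothetical helper and the content-type test is an assumption about how the server reports errors.

```python
def check_pcm16_body(body: bytes, content_type: str = "") -> bytes:
    """Raise a descriptive error if body cannot be 16-bit PCM audio."""
    preview = body[:200]
    if "json" in content_type.lower():
        raise RuntimeError(f"server returned JSON, not audio: {preview!r}")
    if not body:
        raise RuntimeError("server returned an empty body")
    if len(body) % 2:
        # int16 samples are 2 bytes each, so an odd length cannot be valid PCM
        raise RuntimeError(
            f"body length {len(body)} is odd, probably not PCM: {preview!r}")
    return body
```

Printing the first bytes of the body usually shows the real cause, such as a validation error naming the missing parameter.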

【Latency】8 s on the first request, 6 s on the second.

【Cause of the high latency】

3. Saving and loading a voice

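CosyVoice2 can save and load a voice. Applying Instruct to your own voice, however, is currently not supported by any CosyVoice model: Instruct discards the speaker embedding and uses a built-in one instead (Chinese female). The method below therefore cannot adjust the tone of your own voice; only vocal markers such as [laughter] can be used.

Summary: tone control cannot be applied to your own voice; only the vocal markers work.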

【References】

https://github.com/FunAudioLLM/CosyVoice/issues/671
https://github.com/FunAudioLLM/CosyVoice/issues/1151
https://github.com/FunAudioLLM/CosyVoice/issues/604
https://github.com/FunAudioLLM/CosyVoice/issues/918

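【Saving a voice】To save your own voice and use it like a pretrained voice for synthesis, first record yourself, then convert the .wav into a pretrained-voice .pt file.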

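import os

import torch
from cosyvoice.cli.cosyvoice import CosyVoice2
# load_spk_from_wav is the helper defined below, assumed to have been added to frontend.py
from cosyvoice.cli.frontend import load_spk_from_wav

# Raw string so the backslashes in the Windows path are not treated as escapes.
# The model directory is a CosyVoice2 model, so it is loaded with CosyVoice2.
cosyvoice = CosyVoice2(r'D:\modelscope_cache\hub\iic\CosyVoice2-0___5B', load_jit=False, load_trt=False, fp16=False)

data = load_spk_from_wav("C:\\Users\\shaoqisun\\Desktop\\5.wav", cosyvoice)
os.makedirs('speakers', exist_ok=True)  # torch.save does not create missing directories
torch.save(data, 'speakers/xijun.pt')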

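import os

import librosa
import torch
import torchaudio

lower_sr = 16000  # rate used for speech-token and speaker-embedding extraction
high_sr = 22050   # rate used for acoustic feature extraction

def postprocess(speech, top_db=60, hop_length=220, win_length=440):
    max_val = 0.8

    # Trim leading and trailing silence
    speech, _ = librosa.effects.trim(
        speech, top_db=top_db,
        frame_length=win_length,
        hop_length=hop_length
    )

    # Scale down so the peak does not exceed max_val
    if speech.abs().max() > max_val:
        speech = speech / speech.abs().max() * max_val

    # Append 0.2 s of silence
    zeros = torch.zeros(1, int(high_sr * 0.2))
    speech = torch.concat([speech, zeros], dim=1)

    return speech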


def load_spk_from_wav(wav_file, cosyvoice):
    target_wav, sample_rate = torchaudio.load(wav_file)
    if target_wav.shape[0] == 2:
        # Average the two channels to get mono audio
        target_wav = target_wav.mean(dim=0, keepdim=True)

    target_wav_high = torchaudio.transforms.Resample(sample_rate, high_sr)(target_wav)
    target_wav_high = postprocess(target_wav_high)
    target_wav_lower = torchaudio.transforms.Resample(high_sr, lower_sr)(target_wav_high)

    speech_feat, speech_feat_len = cosyvoice.frontend._extract_speech_feat(target_wav_high)
    speech_token, speech_token_len = cosyvoice.frontend._extract_speech_token(target_wav_lower)
    embedding = cosyvoice.frontend._extract_spk_embedding(target_wav_lower)

    print(f"speech_feat {type(speech_feat)}")
    print(f"speech_token {type(speech_token)}")
    print(f"embedding {type(embedding)}")

    return {
        "speech_feat": speech_feat,
        "speech_feat_len": speech_feat_len,
        "speech_token": speech_token,
        "speech_token_len": speech_token_len,
        "embedding": embedding
    }


def load_spk_from_pt(spk_id, spk_dir="./speakers"):
    spk_pt = os.path.join(spk_dir, f"{spk_id}.pt")

    if os.path.exists(spk_pt) and os.path.isfile(spk_pt):
        return torch.load(spk_pt)

    return None


def scan_spks_from_file(spk_dir="./speakers"):
    spks = []

    for spk_pt in os.listdir(spk_dir):
        if not spk_pt.endswith('.pt'):
            continue

        full_spk_pt = os.path.join(spk_dir, spk_pt)
        spk_id = spk_pt.replace(".pt", "")

        if os.path.exists(full_spk_pt) and os.path.isfile(full_spk_pt):
            spks.append(spk_id)

    return spks

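【Loading a voice】Because the voice is not obtained through training, the quality is limited.

Modify frontend.py around line 244 so that the voice is loaded from the .pt file.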

embedding = self.spk2info[spk_id]['embedding']
👇
embedding = load_spk_from_pt(spk_id)['embedding']
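
One caveat of the one-line replacement: `load_spk_from_pt` returns None when no matching .pt file exists, so indexing the result fails for the built-in speakers. The following is a minimal sketch of a lookup that tries the saved file first and falls back to `spk2info`; `resolve_embedding` and its optional cache are hypothetical and not part of CosyVoice.

```python
from typing import Any, Callable, Dict, Optional


def resolve_embedding(spk_id: str,
                      spk2info: Dict[str, Dict[str, Any]],
                      load_from_pt: Callable[[str], Optional[Dict[str, Any]]],
                      cache: Optional[Dict[str, Any]] = None) -> Any:
    """Return the speaker embedding, preferring a saved .pt file."""
    if cache is not None and spk_id in cache:
        return cache[spk_id]
    saved = load_from_pt(spk_id)  # None when no file exists for spk_id
    if saved is not None:
        embedding = saved["embedding"]
    elif spk_id in spk2info:
        embedding = spk2info[spk_id]["embedding"]
    else:
        raise KeyError(f"unknown speaker id: {spk_id!r}")
    if cache is not None:
        cache[spk_id] = embedding  # avoid re-reading the file on every call
    return embedding
```

Inside frontend.py this would be called as `resolve_embedding(spk_id, self.spk2info, load_spk_from_pt)`; passing a dict as `cache` avoids reloading the file for every request.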

Generate speech with the saved voice.

import time
import torch
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2

# Raw string so the backslashes in the Windows path are not treated as escapes
cosyvoice = CosyVoice2(r'D:\modelscope_cache\hub\iic\CosyVoice2-0___5B')
for run_idx in range(1):
    begin = time.time()
    data = None
    # inference_sft yields one or more chunks; concatenate them along the time axis
    for chunk in cosyvoice.inference_sft(
            '收到拒信的那一刻，我感到无比伤心。虽然知道失败是成长的一部分，但仍然难以掩饰心中的失落。',
            'xijun',
            # instruct_text='用愤怒的语气说',
            stream=False):
        if data is None:
            data = chunk['tts_speech']
        else:
            data = torch.cat((data, chunk['tts_speech']), dim=1)
    torchaudio.save('sft_{}.wav'.format(run_idx), data, cosyvoice.sample_rate)
    print('elapsed: {:.2f} s'.format(time.time() - begin))