【Paper】CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
【Paper】CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
1. CosyVoice Voice Cloning
【Server】CosyVoice/runtime/python/fastapi/server.py; add the environment variables:
PYTHONUNBUFFERED=1;PYTHONPATH=D:\PyCharmWorkSpace\Linly-Talker\CosyVoice\third_party\Matcha-TTS
【报错】TypeError: expected str, bytes or os.PathLike object, not MultiplexedPath
【Fix】Windows does not support MultiplexedPath; manually add an explicit cache_dir:
self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, overwrite_cache=True)
self.en_tn_model = EnNormalizer()
👇
self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, overwrite_cache=True, cache_dir="tn")
self.en_tn_model = EnNormalizer(cache_dir="tn")
【Client】CosyVoice/runtime/python/fastapi/client.py
- mode: pass zero_shot, i.e. voice cloning
- prompt_wav: the reference audio
- prompt_text: the transcript of the reference audio
【Timing】Total from input text 👉 output .wav: 2.8 s
- receiving the speech response: 40 ms
- concatenating and assembling the response: 2400 ms
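For reference, a minimal sketch of issuing such a zero-shot request and timing the two phases. The endpoint name mirrors the repo's server.py; the host, port, file names, and the 22050 Hz output rate are assumptions:

import time
import numpy as np
import requests
import torch
import torchaudio

url = 'http://127.0.0.1:50000/inference_zero_shot'  # host/port assumed
payload = {
    'tts_text': '这里是要合成的文本',
    'prompt_text': '参考音频对应的文本',
}
files = [('prompt_wav', ('prompt_wav', open('reference.wav', 'rb'), 'application/octet-stream'))]

begin = time.time()
response = requests.request("GET", url, data=payload, files=files, stream=True)
got_response = time.time()  # the ~40 ms "receiving the response" step above

tts_audio = b''
for chunk in response.iter_content(chunk_size=16000):
    tts_audio += chunk  # the ~2400 ms "assembly" step above
tts_speech = torch.from_numpy(np.frombuffer(tts_audio, dtype=np.int16).copy()).unsqueeze(dim=0)
torchaudio.save('demo.wav', tts_speech, 22050)  # output sample rate assumed
print(f'response: {got_response - begin:.3f}s, total: {time.time() - begin:.3f}s')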
2. CosyVoice2 Voice Cloning (Streaming)
【Server】CosyVoice/runtime/python/fastapi/server.py; add the environment variables:
PYTHONUNBUFFERED=1;PYTHONPATH=D:\PyCharmWorkSpace\Linly-Talker\CosyVoice\third_party\Matcha-TTS
【报错】ZeroDivisionError: 0.0 cannot be raised to a negative power
【Fix】The installed diffusers version is too new; downgrade to 0.29.0.
【Error】The pretrained voices cannot be found
【Fix】Manually download the spk2info.pt file and place it in pretrained_models/CosyVoice2-0.5B, then rerun webui.py and the pretrained voices will appear (see the upstream Issue).
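Alternatively, re-pulling the model from ModelScope should also fetch the missing file; a minimal sketch:

from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')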
【Client】CosyVoice/runtime/python/fastapi/client.py
- mode: pass zero_shot, i.e. voice cloning
- prompt_wav: the reference audio
- prompt_text: the transcript of the reference audio
【Timing】Total from input text 👉 output .wav: 2.8 s, almost identical to offline CosyVoice.
- receiving the speech response: 40 ms
- concatenating and assembling the response: 2400 ms
【Streaming client】webui.py; if you would rather not use Gradio, see this reference implementation: https://zhuanlan.zhihu.com/p/16096611214
【Inference parameters】
cosyvoice = CosyVoice2(args.model_dir, load_jit=False, load_trt=True, fp16=True, use_flow_cache=False)
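To make the streaming path concrete, here is a minimal consumption loop modeled on the upstream examples (the reference wav and both texts are placeholders; each chunk is saved instead of played back):

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=True, fp16=True)
prompt_speech_16k = load_wav('reference.wav', 16000)  # 16 kHz reference audio
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '这里是要合成的文本', '参考音频对应的文本', prompt_speech_16k, stream=True)):
    # with stream=True the generator yields audio chunks as they are produced
    torchaudio.save('stream_chunk_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)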
【Supported fine-grained control tokens】https://funaudiollm.github.io/cosyvoice2/
Sample inputs (the tokens are placed inline at the point where the effect should occur):
- 在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。
- 追求卓越不是终点,它需要你每天都<strong>付出</strong>和<strong>精进</strong>,最终才能达到巅峰。
- 当你用心去倾听一首音乐时[breath],你会开始注意到那些细微的音符变化[breath],并通过它们感受到音乐背后的情感。
Available tokens:
'[breath]' breath, '<strong>'/'</strong>' emphasis, '[noise]' noise, '[laughter]' laughter, '[cough]' cough, '[clucking]' clucking, '[accent]' stress, '[quick_breath]' quick breath, '<laughter>'/'</laughter>' laughing speech, '[hissing]' hissing, '[sigh]' sigh, '[vocalized-noise]' vocalized noise, '[lipsmack]' lip smack, '[mn]'
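The tokens go straight into the synthesized text. A minimal sketch using the first sample sentence and reusing cosyvoice and prompt_speech_16k from the streaming sketch above (inference_cross_lingual is the call the upstream fine-grained-control examples use):

for i, j in enumerate(cosyvoice.inference_cross_lingual(
        '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
        prompt_speech_16k, stream=False)):
    torchaudio.save('fine_grained_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)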
【Supported speaking styles】Only available through the inference_instruct2() mode in client.py.
parser.add_argument('--mode', default='instruct2', choices=['sft', 'zero_shot', 'cross_lingual', 'instruct', 'instruct2'],  # 'instruct2' must be added to choices, otherwise --mode instruct2 is rejected
Sample instruct prompts (the style instruction comes before the <|endofprompt|> marker):
- 用惊讶的语气说<|endofprompt|>走进家门,看见墙上挂满了我的照片,我惊讶得愣住了。原来家人悄悄为我准备了一个惊喜的纪念墙。
- 用伤心的语气说<|endofprompt|>收到拒信的那一刻,我感到无比伤心。虽然知道失败是成长的一部分,但仍然难以掩饰心中的失落。
- 用开心的语气说<|endofprompt|>参加朋友的婚礼,看着新人幸福的笑脸,我感到无比开心。这样的爱与承诺,总是令人心生向往。
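Programmatically, the style instruction is passed as a separate instruct_text argument and the <|endofprompt|> marker is appended internally; a minimal sketch, again reusing cosyvoice and prompt_speech_16k from the streaming sketch above:

for i, j in enumerate(cosyvoice.inference_instruct2(
        '收到拒信的那一刻,我感到无比伤心。虽然知道失败是成长的一部分,但仍然难以掩饰心中的失落。',
        '用伤心的语气说', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)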
【报错】ValueError: buffer size must be a multiple of element size
【Fix】Configure the following in client.py:
else:
    payload = {
        'tts_text': args.tts_text,
        'instruct_text': args.instruct_text
    }
    files = [('prompt_wav', ('prompt_wav', open(args.prompt_wav, 'rb'), 'application/octet-stream'))]
    response = requests.request("GET", url, data=payload, files=files, stream=True)
【Timing】First request: 8 s; subsequent requests: 6 s.
【Why it is slow】Likely warm-up on the first request (model loading plus TensorRT engine initialization, given load_trt=True above); instruct2 also re-processes the reference audio on every request, which presumably keeps later requests at 6 s.
3. Saving & Loading a Custom Timbre
CosyVoice2 can save and load a custom timbre. However, Instruct-style control of your own timbre is not supported by any current CosyVoice model: Instruct mode simply discards the speaker embedding and substitutes a built-in one (the default "中文女" Chinese female voice). So the method below cannot apply style instructions to your own timbre; only the inline control tokens work.
Summary: style instructions cannot be applied to a custom timbre; only the inline control tokens are available.
【Saving a timbre】To save your own timbre and use it as a pretrained voice for generation, first record your own voice, then convert the .wav into a pretrained-timbre .pt file:
import torch
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.cli.frontend import load_spk_from_wav  # custom helper, defined below and added to frontend.py

cosyvoice = CosyVoice('D:\\modelscope_cache\\hub\\iic\\CosyVoice2-0___5B', load_jit=False, load_trt=False, fp16=False)
data = load_spk_from_wav('C:\\Users\\shaoqisun\\Desktop\\5.wav', cosyvoice)
torch.save(data, 'speakers/xijun.pt')
import os
import torch
import librosa
import torchaudio

lower_sr = 16000  # sample rate for token / speaker-embedding extraction
high_sr = 22050   # sample rate for mel-feature extraction

def postprocess(speech, top_db=60, hop_length=220, win_length=440):
    # trim leading/trailing silence, normalize peaks, pad 0.2 s of tail silence
    max_val = 0.8
    speech, _ = librosa.effects.trim(
        speech, top_db=top_db,
        frame_length=win_length,
        hop_length=hop_length
    )
    if speech.abs().max() > max_val:
        speech = speech / speech.abs().max() * max_val
    zeros = torch.zeros(1, int(high_sr * 0.2))
    speech = torch.concat([speech, zeros], dim=1)
    return speech

def load_spk_from_wav(wav_file, cosyvoice):
    target_wav, sample_rate = torchaudio.load(wav_file)
    if target_wav.shape[0] == 2:
        # stereo input: average the two channels down to mono
        target_wav = target_wav.mean(dim=0, keepdim=True)
    target_wav_high = torchaudio.transforms.Resample(sample_rate, high_sr)(target_wav)
    target_wav_high = postprocess(target_wav_high)
    target_wav_lower = torchaudio.transforms.Resample(high_sr, lower_sr)(target_wav_high)
    speech_feat, speech_feat_len = cosyvoice.frontend._extract_speech_feat(target_wav_high)
    speech_token, speech_token_len = cosyvoice.frontend._extract_speech_token(target_wav_lower)
    embedding = cosyvoice.frontend._extract_spk_embedding(target_wav_lower)
    return {
        "speech_feat": speech_feat,
        "speech_feat_len": speech_feat_len,
        "speech_token": speech_token,
        "speech_token_len": speech_token_len,
        "embedding": embedding
    }

def load_spk_from_pt(spk_id, spk_dir="./speakers"):
    spk_pt = os.path.join(spk_dir, f"{spk_id}.pt")
    if os.path.exists(spk_pt) and os.path.isfile(spk_pt):
        return torch.load(spk_pt)
    return None

def scan_spks_from_file(spk_dir="./speakers"):
    # list the speaker ids of all saved .pt files
    spks = []
    for spk_pt in os.listdir(spk_dir):
        if not spk_pt.endswith('.pt'):
            continue
        full_spk_pt = os.path.join(spk_dir, spk_pt)
        spk_id = spk_pt.replace(".pt", "")
        if os.path.isfile(full_spk_pt):
            spks.append(spk_id)
    return spks
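A quick usage check for the helpers above (assuming a ./speakers directory that already contains xijun.pt from the saving step):

spks = scan_spks_from_file()     # e.g. ['xijun']
spk = load_spk_from_pt('xijun')  # dict with speech_feat / speech_token / embedding
print(spks, spk['embedding'].shape)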
【Loading a timbre】Since the timbre is extracted directly rather than obtained through training, the quality is limited!
- Modify frontend.py around line 244 to load the timbre from the .pt file:
embedding = self.spk2info[spk_id]['embedding']
👇
embedding = load_spk_from_pt(spk_id)['embedding']
- Generate speech with the saved timbre:
import time
import torch
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2, CosyVoice

cosyvoice = CosyVoice2('D:\\modelscope_cache\\hub\\iic\\CosyVoice2-0___5B')
for id in range(1):
    begin = time.time()
    data = None
    for i, j in enumerate(cosyvoice.inference_sft(
            '收到拒信的那一刻,我感到无比伤心。虽然知道失败是成长的一部分,但仍然难以掩饰心中的失落。',
            'xijun',
            # instruct_text='用愤怒的语气说',  # instruct discards the custom embedding, see above
            stream=False)):
        if data is None:
            data = j['tts_speech']
        else:
            # append each returned chunk along the time axis
            data = torch.cat((data, j['tts_speech']), dim=1)
    torchaudio.save('sft_{}.wav'.format(id), data, cosyvoice.sample_rate)
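If you want .pt speakers to coexist with the built-in pretrained voices instead of replacing the lookup outright, a fallback variant of the frontend.py patch above could look like this (a sketch; it assumes the surrounding code still has spk_id and self.spk2info in scope, as on the original line):

# keep the built-in voices, fall back to ./speakers/<spk_id>.pt for custom ones
if spk_id in self.spk2info:
    embedding = self.spk2info[spk_id]['embedding']
else:
    spk = load_spk_from_pt(spk_id)
    assert spk is not None, f'unknown speaker: {spk_id}'
    embedding = spk['embedding']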