[Project] Multimodal Speech Synthesis: CosyVoice 2 in Practice

(I) Overview


CosyVoice is a large speech-generation model open-sourced in July 2024 by the speech team of Alibaba's Tongyi Lab. Built on large-model techniques, it delivers natural, fluent speech generation. Compared with traditional speech synthesis, CosyVoice stands out for natural prosody and lifelike timbre. Since its release, it has won broad community support for high-quality multilingual generation, zero-shot voice generation, cross-lingual synthesis, and fine-grained control through rich text and natural language.

CosyVoice has now received a full upgrade: the CosyVoice 2.0 release delivers more accurate, more stable, faster, and better speech generation.

  • Ultra-low latency: CosyVoice 2.0 introduces a unified offline-and-streaming modeling approach for speech generation, supporting bidirectional streaming synthesis; first-packet latency can reach 150 ms with essentially no loss in quality (see the streaming sketch after this list).

  • High accuracy: pronunciation errors in CosyVoice 2.0's synthesized audio drop by a relative 30%-50% compared with CosyVoice 1.0, and it achieves the lowest character error rate to date on the hard subset of the Seed-TTS test set, with clear gains on tongue twisters, polyphonic characters, and rare characters.

  • Strong stability: CosyVoice 2.0 maintains excellent timbre consistency in zero-shot and cross-lingual speech synthesis; cross-lingual synthesis in particular improves markedly over version 1.0.

  • Natural experience: prosody, audio quality, and emotional matching all improve clearly over 1.0, with the MOS score rising from 5.4 to 5.53 (a commercial large speech-synthesis model scored 5.52 in the same evaluation). CosyVoice 2.0 also upgrades instruction-controllable generation, supporting finer-grained emotion control as well as dialect and accent control.
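
The low-latency path above relies on streaming synthesis. The repository examples in section (IV) all pass stream=False; switching to stream=True makes the generator yield audio chunk by chunk instead of one final waveform. A minimal sketch, assuming the environment and model download described in section (IV):

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# With stream=True the generator yields partial audio as it is synthesized,
# which is what enables the ~150 ms first-packet latency quoted above.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
for out in cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物。',
                                         '希望你以后能够做的比我还好呦。',
                                         prompt_speech_16k, stream=True):
    chunk = out['tts_speech']  # a [1, T] tensor; feed it to a player or buffer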

(II) How It Works

[Figure: diagram illustrating how CosyVoice 2 works; image not included in the source.]

(III) Open-Source Repositories

CosyVoice 2.0: Demos; Paper; Modelscope; HuggingFace

CosyVoice 1.0: Demos; Paper; Modelscope

(IV) Deployment and Usage
(1) Clone and Install
  • Clone the repository:
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
# if the recursive clone failed (e.g. due to network issues), retry the submodules
git submodule update --init --recursive
  • Install Conda: see https://docs.conda.io/en/latest/miniconda.html
  • Create a Conda environment:
conda create -n cosyvoice python=3.10
conda activate cosyvoice

# pynini is required by WeTextProcessing; install it from conda-forge so it works on all platforms
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

# if you encounter sox compatibility issues, install the sox packages for your distro
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
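
Before moving on, a quick sanity check (a convenience sketch, not from the repository) confirms that the core stack imports cleanly inside the new environment:

# Environment sanity check: verify torch/torchaudio are importable
# and report whether CUDA is visible.
import torch
import torchaudio

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torchaudio', torchaudio.__version__)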
(2) Model Download

We strongly recommend that you download the pretrained CosyVoice2-0.5B, CosyVoice-300M, CosyVoice-300M-SFT, and CosyVoice-300M-Instruct models, together with the CosyVoice-ttsfrd resource.

If you are an expert in this field and are only interested in training your own CosyVoice model from scratch, you can skip this step.

# Model download via the ModelScope SDK (run in Python)
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-25Hz', local_dir='pretrained_models/CosyVoice-300M-25Hz')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
# Model download via git (make sure git-lfs is installed first)
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-25Hz.git pretrained_models/CosyVoice-300M-25Hz
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
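
After either download path, it is worth confirming the files actually landed; the check below is a hypothetical convenience, not part of the repository:

from pathlib import Path

# Confirm a downloaded model directory exists and list its contents.
model_dir = Path('pretrained_models/CosyVoice2-0.5B')
if not model_dir.is_dir():
    raise SystemExit(f'{model_dir} not found -- run one of the download snippets above first')
print(sorted(p.name for p in model_dir.iterdir()))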

Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.

Note that this step is not required; if you do not install the ttsfrd package, WeTextProcessing is used by default.

cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
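
Because ttsfrd is optional and the fallback is WeTextProcessing, a one-line probe (a sketch, not from the repository) tells you which text-normalization backend your environment will use:

# Probe the optional ttsfrd package; CosyVoice falls back to WeTextProcessing
# when it is absent.
try:
    import ttsfrd
    print('ttsfrd installed: it will be used for text normalization')
except ImportError:
    print('ttsfrd not installed: WeTextProcessing will be used by default')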

Basic Usage

We strongly recommend using CosyVoice2-0.5B for better performance.
For zero_shot/cross_lingual inference, please use CosyVoice-300M model.
For sft inference, please use CosyVoice-300M-SFT model.
For instruct inference, please use CosyVoice-300M-Instruct model.

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

CosyVoice2 Usage

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=True, load_onnx=False, load_trt=False)

# NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
# zero_shot usage
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
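
Each inference_* call above is a generator that yields dicts containing a 'tts_speech' tensor of shape [1, T]. If you prefer one output file over numbered chunks, you can concatenate along the time axis; a small sketch under that assumption:

import torch

# Collect every yielded chunk and write a single waveform instead of
# zero_shot_0.wav, zero_shot_1.wav, ...
pieces = [out['tts_speech'] for out in cosyvoice.inference_zero_shot(
    '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
    '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)]
torchaudio.save('zero_shot_full.wav', torch.cat(pieces, dim=1), cosyvoice.sample_rate)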

CosyVoice Usage

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT', load_jit=True, load_onnx=False, fp16=True)
# sft usage
print(cosyvoice.list_available_spks())
# change stream=True for chunk stream inference
for i, j in enumerate(cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女', stream=False)):
    torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M') # or change to pretrained_models/CosyVoice-300M-25Hz for 25Hz inference
# zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# vc usage
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
source_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_vc(source_speech_16k, prompt_speech_16k, stream=False)):
    torchaudio.save('vc_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
# instruct usage, support <laughter></laughter><strong></strong>[laughter][breath]
for i, j in enumerate(cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
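
All of the inference_* examples repeat the same iterate-and-save loop, so a tiny helper (not part of the repository) can remove the duplication:

# Convenience wrapper: run any CosyVoice inference_* generator and write
# each yielded chunk to a numbered wav file.
def synthesize_to_files(generator, prefix, sample_rate):
    for i, out in enumerate(generator):
        torchaudio.save(f'{prefix}_{i}.wav', out['tts_speech'], sample_rate)

synthesize_to_files(
    cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的勇气与智慧。', '中文男',
                                 'Theo is a fiery, passionate rebel leader.', stream=False),
    'instruct_demo', cosyvoice.sample_rate)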
(3) Launch the WebUI

You can use our web demo page to get familiar with CosyVoice quickly.
We support sft/zero_shot/cross_lingual/instruct inference in the web demo.

Please see the demo website for details.

# change iic/CosyVoice-300M-SFT for sft inference, or iic/CosyVoice-300M-Instruct for instruct inference
python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
(4) Docker Deployment

Optionally, if you want to use gRPC for service deployment, you can run the following steps; otherwise, skip this section.

cd runtime/python
docker build -t cosyvoice:v1.0 .
# change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
# for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
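
If you would rather hit the FastAPI server with plain HTTP than use the bundled client, a hypothetical requests-based call could look like the sketch below. The /inference_sft route, the tts_text/spk_id form fields, and the streamed-bytes response are assumptions modeled on the bundled client.py; check runtime/python/fastapi/server.py for the authoritative names:

import requests

# Hypothetical HTTP client for the FastAPI deployment above; route name,
# field names, and response format are assumptions -- verify against
# runtime/python/fastapi/server.py.
url = 'http://127.0.0.1:50000/inference_sft'
payload = {'tts_text': '你好,我是通义生成式语音大模型。', 'spk_id': '中文女'}
with requests.post(url, data=payload, stream=True) as resp:
    resp.raise_for_status()
    with open('sft_http_output', 'wb') as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # raw audio bytes; exact format depends on the server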