Local Deployment of VITS-fast-fine-tuning on Windows

novemberl

Tags: windows, python, artificial intelligence

Article link: https://blog.csdn.net/novemberl/article/details/131956275

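This article walks through setting up a Python environment on Windows with an NVIDIA discrete GPU: installing Python 3.8, cloning the VITS-fast-fine-tuning project, installing the Python packages, installing the GPU build of Torch, and downloading and using the pretrained models. It also covers organizing and processing the audio and video files, and the training procedure.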

First, make sure you have an NVIDIA discrete GPU, preferably with at least 12 GB of VRAM.

Now to the main topic:

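Step 1: Install Python 3.8.

I chose 3.8.10 because, as far as I can tell, it is the last 3.8.x release that ships a Windows installer.

Go to https://www.python.org/downloads/release/python-3810/ and scroll to the bottom of the page.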

Pick the installer that matches your system and download it.
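After installing, it can help to confirm that the interpreter you are actually running is from the 3.8 series. The following is a minimal sketch; the helper name `is_python_38` is hypothetical and not part of the project.

```python
import sys


def is_python_38(version_info=sys.version_info):
    """Return True when the given version tuple is a Python 3.8.x release."""
    return tuple(version_info[:2]) == (3, 8)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    if not is_python_38():
        print("Warning: this guide was written against Python 3.8.10.")
```

Save it under any file name and run it with the same `python` command you will use for the rest of the steps.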

Step 2: Clone the project

1. Open the project page https://github.com/Plachtaa/VITS-fast-fine-tuning and copy the repository URL.

2. Pick any local folder, right-click inside it, choose Git Bash Here, and enter:

git clone https://github.com/Plachtaa/VITS-fast-fine-tuning.git

Step 3: Install the Python packages

1. Open the project, find requirements.txt in the project directory, delete the line pyopenjtalk==0.1.3, then run:

pip install -r .\requirements.txt

2. Install cmake by following https://blog.csdn.net/xijinno1/article/details/131199311.

3. Run:

pip install pyopenjtalk
pip install imageio==2.4.1
pip install moviepy

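Yes, the reason for installing cmake beforehand is so that the pyopenjtalk package can be built and installed. If this step fails, try a different pyopenjtalk version.

Step 4: Install the GPU build of Torch.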

# CUDA 11.6
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
# CUDA 11.7
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
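Only one of the two commands above is needed, whichever matches your CUDA version. To check that the installed build can actually see the GPU, a short script like the one below can be used. It is a sketch: the function name `describe_torch` is hypothetical, and it relies only on `torch.cuda.is_available()`, `torch.version.cuda`, and `torch.cuda.get_device_name()`.

```python
def describe_torch():
    """Return a one-line summary of the installed torch build."""
    try:
        import torch  # imported lazily so the script still runs without torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return (
            f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"GPU: {torch.cuda.get_device_name(0)}"
        )
    return f"torch {torch.__version__}, CUDA not available"


if __name__ == "__main__":
    print(describe_torch())
```

If the output says CUDA is not available, the CPU-only wheel was probably installed, or the GPU driver is not working.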

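Step 5: Create some folders that are required but, for some reason, are missing from the project.

1. In the project directory, find the monotonic_align folder, enter it, and create another folder inside it with the same name, monotonic_align. Then, from the outer monotonic_align folder, run: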

python setup.py build_ext --inplace

[Image: the resulting directory structure]

2. Create a pretrained_models folder in the project root.

3. Download this file and extract it into the project root.

https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip

4. In the project root, create five folders: video_data, raw_audio, denoised_audio, custom_character_voice, and segmented_character_voice.
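The folders from items 2 and 4 can also be created in one go from Python. This is a small sketch using `pathlib`; the function name `create_project_dirs` is hypothetical, and the folder names are the ones listed above.

```python
from pathlib import Path

# Folder names taken from the steps above.
REQUIRED_DIRS = [
    "pretrained_models",
    "video_data",
    "raw_audio",
    "denoised_audio",
    "custom_character_voice",
    "segmented_character_voice",
]


def create_project_dirs(project_root):
    """Create each required folder under project_root; existing ones are kept."""
    root = Path(project_root)
    created = []
    for name in REQUIRED_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_dirs(".")` from the project root creates any folders that are missing and leaves existing ones untouched, so it is safe to run more than once.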

Step 6: Download the pretrained models.

CJE: Trilingual (Chinese, Japanese, English)
CJ: Bilingual (Chinese, Japanese)
C: Chinese only

CJE model download links:

https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth
https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth
https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json

CJ model download links:

https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth
https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth
https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json

C model download links:

https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth
https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth
https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json

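Once the three files for your chosen model are downloaded, rename the G model file to G_0.pth, the D model file to D_0.pth, and the config file to finetune_speaker.json.

Put G_0.pth and D_0.pth into the pretrained_models folder.

Put finetune_speaker.json into the configs folder.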
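The download, rename, and placement steps can be combined in one script. The sketch below uses the URLs listed above and `urllib.request.urlretrieve` from the standard library; the names `MODEL_URLS`, `plan_downloads`, and `download_model` are hypothetical helpers, not part of the project.

```python
import urllib.request
from pathlib import Path

_UMA = "https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main"
_CJ = "https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model"
_C = "https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese"

# URLs assembled from the lists above; keys match the CJE / CJ / C labels.
MODEL_URLS = {
    "CJE": {
        "G": f"{_UMA}/pretrained_models/G_trilingual.pth",
        "D": f"{_UMA}/pretrained_models/D_trilingual.pth",
        "config": f"{_UMA}/configs/uma_trilingual.json",
    },
    "CJ": {
        "G": f"{_CJ}/G_0-p.pth",
        "D": f"{_CJ}/D_0-p.pth",
        "config": f"{_CJ}/config.json",
    },
    "C": {
        "G": f"{_C}/G_0.pth",
        "D": f"{_C}/D_0.pth",
        "config": f"{_C}/config.json",
    },
}


def plan_downloads(model, project_root="."):
    """Return (url, destination) pairs using the file names the guide requires."""
    root = Path(project_root)
    urls = MODEL_URLS[model]
    return [
        (urls["G"], root / "pretrained_models" / "G_0.pth"),
        (urls["D"], root / "pretrained_models" / "D_0.pth"),
        (urls["config"], root / "configs" / "finetune_speaker.json"),
    ]


def download_model(model, project_root="."):
    """Download the three files for one model straight to their final paths."""
    for url, dest in plan_downloads(model, project_root):
        dest.parent.mkdir(parents=True, exist_ok=True)
        print("downloading", url, "->", dest)
        urllib.request.urlretrieve(url, str(dest))
```

For example, `download_model("CJ")` run from the project root fetches the CJ files. The model files are large, so a browser or download manager may still be more convenient on a slow connection.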

Step 7: Organize the audio and video files

1. Organize the audio files in the format described in the document below.

https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA.MD

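2. Short audio:

Put the folders containing the short audio clips directly into the ./custom_character_voice/ folder.

3. Long audio:

Put the long audio files into the ./raw_audio/ folder.

4. Video:

Put the videos into the ./video_data/ folder.

You only need to provide one of the audio/video input types above.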
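Before moving on, it may be worth checking that at least one of the three input folders actually contains files. The sketch below only counts files and does not validate the naming rules from DATA.MD; `find_inputs` is a hypothetical helper.

```python
from pathlib import Path

# Input type -> folder, as described in the list above.
INPUT_DIRS = {
    "short audio": "custom_character_voice",
    "long audio": "raw_audio",
    "video": "video_data",
}


def find_inputs(project_root="."):
    """Map each input type to the number of files found under its folder."""
    root = Path(project_root)
    counts = {}
    for label, folder in INPUT_DIRS.items():
        base = root / folder
        if base.is_dir():
            counts[label] = sum(1 for p in base.rglob("*") if p.is_file())
        else:
            counts[label] = 0
    return counts


if __name__ == "__main__":
    for label, count in find_inputs().items():
        print(f"{label}: {count} file(s)")
```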

Step 8: Process the audio and video files.

Run:

python scripts/video2audio.py
python scripts/denoise_audio.py
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/resample.py

Replace {PRETRAINED_MODEL} with CJE, CJ, or C.
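To avoid retyping the placeholder, the five commands can be driven from a small Python wrapper. This is a sketch that uses only the scripts and flags shown above; `build_processing_commands` and `run_processing` are hypothetical names.

```python
import subprocess
import sys

VALID_MODELS = ("CJE", "CJ", "C")


def build_processing_commands(model, whisper_size="large"):
    """Return the five processing commands as argument lists, in order."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    py = sys.executable  # use the same interpreter that runs this script
    transcribe_args = ["--languages", model, "--whisper_size", whisper_size]
    return [
        [py, "scripts/video2audio.py"],
        [py, "scripts/denoise_audio.py"],
        [py, "scripts/long_audio_transcribe.py"] + transcribe_args,
        [py, "scripts/short_audio_transcribe.py"] + transcribe_args,
        [py, "scripts/resample.py"],
    ]


def run_processing(model, project_root="."):
    """Run each command from the project root and stop at the first failure."""
    for cmd in build_processing_commands(model):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, cwd=project_root, check=True)
```

Calling `run_processing("CJ")` from the project root runs the same sequence as the commands above. Because of `check=True`, a failing script raises `subprocess.CalledProcessError` and the later steps are skipped.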

If the scripts report "Warning: no short audios found", reinstall ffmpeg-python:

pip uninstall ffmpeg-python
pip install ffmpeg-python

If it still fails after these two commands, install ffmpeg itself on the system by following the document below.

https://blog.csdn.net/qq_35164554/article/details/124866110
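The ffmpeg-python package is only a wrapper around the ffmpeg executable, so the executable has to be reachable through PATH. A quick way to check, using only the standard library (the helper name `find_ffmpeg` is hypothetical):

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path:
        print("ffmpeg found at", path)
    else:
        print("ffmpeg is not on PATH; install it and reopen the terminal")
```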

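Step 9: Process all text files.

On training quality: experiments so far suggest that the CJ model with ADD_AUXILIARY enabled gives the best results for both Chinese and Japanese, so this combination is the recommended default for a first training run.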

# With ADD_AUXILIARY
python preprocess_v2.py --add_auxiliary_data True --languages "{PRETRAINED_MODEL}"
# Without ADD_AUXILIARY
python preprocess_v2.py --languages "{PRETRAINED_MODEL}"

Replace {PRETRAINED_MODEL} with CJE, CJ, or C.

Step 10: Start training.

# First training run
python finetune_speaker_v2.py -m "./OUTPUT_MODEL" --max_epochs "{Maximum_epochs}" --drop_speaker_embed True
# Continue training
python finetune_speaker_v2.py -m "./OUTPUT_MODEL" --max_epochs "{Maximum_epochs}" --drop_speaker_embed False --cont True

{Maximum_epochs} is the maximum number of training epochs. The default appears to be 200; adjust it to suit your data.
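The two commands differ only in --drop_speaker_embed and --cont. The sketch below builds the right argument list from a single `resume` flag, using only the flags shown above; `build_training_command` is a hypothetical helper.

```python
import sys


def build_training_command(max_epochs, resume=False, model_dir="./OUTPUT_MODEL"):
    """Build the finetune_speaker_v2.py command for a first or resumed run."""
    cmd = [
        sys.executable,
        "finetune_speaker_v2.py",
        "-m", model_dir,
        "--max_epochs", str(max_epochs),
        # A first run drops the speaker embedding; a resumed run keeps it.
        "--drop_speaker_embed", "False" if resume else "True",
    ]
    if resume:
        cmd += ["--cont", "True"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_training_command(200)))
    print(" ".join(build_training_command(200, resume=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root.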

To monitor training, run tensorboard --logdir=./OUTPUT_MODEL and open localhost:6006 in a browser.

Step 11: Inference

Copy modified_finetune_speaker.json from the configs directory to the project root and rename it finetune_speaker.json, then run:

python VC_inference.py --model_dir ./OUTPUT_MODEL/G_latest.pth --share True
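The copy-and-rename step before inference can be scripted with `shutil.copyfile`. This is a minimal sketch; `stage_inference_config` is a hypothetical helper, and the paths are the ones named above.

```python
import shutil
from pathlib import Path


def stage_inference_config(project_root="."):
    """Copy configs/modified_finetune_speaker.json to ./finetune_speaker.json."""
    root = Path(project_root)
    src = root / "configs" / "modified_finetune_speaker.json"
    dst = root / "finetune_speaker.json"
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found")
    shutil.copyfile(src, dst)
    return dst
```

Run `stage_inference_config(".")` from the project root before starting VC_inference.py. An existing finetune_speaker.json in the root is overwritten.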

References:

1. Official local deployment guide: https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/LOCAL.md

2. Deploying VITS locally: https://www.bilibili.com/read/cv24427456

3. The project's Issues, the Colab notebook, and similar resources