Using the Chinese-specialized version of VITS2

I recently had occasion to try deploying VITS2 and ran into a number of problems; compared with the VITS of two years ago, deployment is considerably more involved. I used the Chinese-specialized version from https://github.com/v3ucn/Bert-VITS2-Extra_- , deployed on Windows 11 with an RTX 2080 Ti. The model's author already provides a very thorough tutorial here: https://colab.research.google.com/drive/10FRAJhPjZin3TbBTy3a0GC6EmIMvTII4?usp=sharing , but as a newcomer I still hit many issues, so I am summarizing my deployment workflow here.

Environment setup

I used Python 3.9. When I first tried a newer Python version, some package versions did not match; that can probably be worked around, but it is not worth the time. For installing Anaconda you can consult other guides, or my own post: https://blog.csdn.net/m0_51003570/article/details/140388952?spm=1001.2014.3001.5502.
Then install PyTorch: go to https://pytorch.org, scroll down to the install selector, copy the command it generates, and run it in your Python environment.
Possible issues:
TBB conflict when installing PyTorch: deactivate the conda environment first, then run conda uninstall TBB.

The project has some dependencies of its own, such as FFmpeg. Installing it with pip in your environment is usually enough; if that does not work, follow another installation guide, for example:
https://blog.csdn.net/csdn_yudong/article/details/129182648

pip install FFmpeg

The remaining dependencies are listed in requirements.txt. Assuming you have already downloaded the GitHub repository above, open a command prompt (Win+R, type cmd) and cd into that folder, or right-click an empty area in the folder and choose "Open in Terminal", then pip-install them in your environment:

pip install -r requirements.txt

Warning: the installation should complete without errors; if an error appears, search for it and try reinstalling the offending package.
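Before moving on, it is worth confirming that PyTorch can actually see your GPU, since everything below assumes CUDA works. A minimal check (nothing here is specific to this project):

import torch

# should print the PyTorch version, the CUDA version it was built against,
# True for CUDA availability, and your GPU's name (an RTX 2080 Ti in my case)
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))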

Model downloads

Now tidy up the project structure. Under Data, create a folder of your own (any name works) and the folders below. At the start you only need the four folders configs, models, raw and wavs plus config.json. Copy config.json from the config folder in the project root and edit it following the steps later in this post. Put all of your .wav files into the raw folder; everything else will be generated along the way.
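If you prefer to create this layout from a script instead of by hand, a minimal sketch like the one below works. It assumes your folder is named Yae (the name I use throughout this post) and that the template config.json sits in the config folder in the project root; the later commands in this post read it from Data/<name>/configs/config.json, so that is where the sketch puts it. Adjust both to your setup.

import os
import shutil

name = "Yae"  # your own folder name under Data/

# create Data/<name> with the four folders needed at the start
for sub in ["configs", "models", "raw", "wavs"]:
    os.makedirs(f"./Data/{name}/{sub}", exist_ok=True)

# copy the template config.json into your configs folder
shutil.copy("./config/config.json", f"./Data/{name}/configs/config.json")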

The project relies on a large number of pretrained models; the author lists download links for all of them in the notebook: https://colab.research.google.com/drive/10FRAJhPjZin3TbBTy3a0GC6EmIMvTII4?usp=sharing
How to download and use wget on Windows: https://blog.csdn.net/suncrx/article/details/129377455

However, huggingface.co, which the original author uses, appears to be unreachable from here, so the URLs have to be switched to a mirror; keep in mind that the mirror may also stop working someday.
Replace every huggingface.co with hf-mirror.com, i.e. use the commands below. Yae is my folder name; replace it with the folder you created above.

# BERT model download
wget -P bert/Erlangshen-MegatronBert-1.3B-Chinese/ https://hf-mirror.com/IDEA-CCNL/Erlangshen-MegatronBert-1.3B/resolve/main/pytorch_model.bin
# not sure what this one is for; just download it
wget -P slm/wavlm-base-plus/ https://hf-mirror.com/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
# emotion models
wget -P emotional/clap-htsat-fused/ https://hf-mirror.com/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://hf-mirror.com/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
# other BERT models, possibly unused
wget -P bert/chinese-roberta-wwm-ext-large/ https://hf-mirror.com/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://hf-mirror.com/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://hf-mirror.com/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
# base models, i.e. the pretrained checkpoints
wget -P Data/Yae/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/D_0.pth
wget -P Data/Yae/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/G_0.pth
wget -P Data/Yae/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/WD_0.pth

Yennefer is the original author's folder name; replace it with the folder you created above. For reference, these are the original author's commands.

# BERT model download
wget -P bert/Erlangshen-MegatronBert-1.3B-Chinese/ https://huggingface.co/IDEA-CCNL/Erlangshen-MegatronBert-1.3B/resolve/main/pytorch_model.bin
# not sure what this one is for; just download it
wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
# emotion models
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
# other BERT models, possibly unused
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
# base models, i.e. the pretrained checkpoints
wget -P Data/Yennefer/models/ https://huggingface.co/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/D_0.pth
wget -P Data/Yennefer/models/ https://huggingface.co/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/G_0.pth
wget -P Data/Yennefer/models/ https://huggingface.co/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/WD_0.pth
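Once the downloads finish, it is easy to check that every file landed where wget was told to put it. A small sketch; the paths simply mirror the -P arguments of the mirror commands above, with Yae again standing in for my folder name:

import os

expected = [
    "bert/Erlangshen-MegatronBert-1.3B-Chinese/pytorch_model.bin",
    "slm/wavlm-base-plus/pytorch_model.bin",
    "emotional/clap-htsat-fused/pytorch_model.bin",
    "emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/pytorch_model.bin",
    "bert/chinese-roberta-wwm-ext-large/pytorch_model.bin",
    "bert/deberta-v3-large/pytorch_model.bin",
    "bert/deberta-v3-large/pytorch_model.generator.bin",
    "Data/Yae/models/D_0.pth",
    "Data/Yae/models/G_0.pth",
    "Data/Yae/models/WD_0.pth",
]

for path in expected:
    if os.path.exists(path):
        print(f"OK      {path}  ({os.path.getsize(path) / 1024 / 1024:.0f} MB)")
    else:
        print(f"MISSING {path}")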

Parameter configuration

The project's parameters live mainly in config.yml in the project root and in config.json under your Data/XXX folder.
config.yml can be edited in place; config.json must first be copied into your folder. The changes include:
In config.yml:

# change this to your dataset path
dataset_path: "./Data/Yae"

# number of workers used for training; not recommended to exceed your CPU core count (original comment).
# Setting it too high can pin GPU memory: training reports "CUDA out of memory" even though plenty of VRAM looks free.
train_ms:
  num_workers: 4

in_dir: "raw" # path relative to the dataset root, i.e. /datasetPath/in_dir
  
out_dir: "wavs" # output directory for the resampled audio

transcription_path: "esd.list"
# path of the cleaned text; may be left empty, in which case it is generated in the original text's directory
cleaned_path: ""
# training list path
train_path: "train.list"
# validation list path
val_path: "val.list"

# model path used by the web UI
webui:
  model: "models/G_10050.pth"
 


In config.json:

// training and validation list paths; these lists do not exist yet and will be generated later
"training_files": "Data/Yae/train.list",
"validation_files": "Data/Yae/val.list",

// number of training epochs
"epochs": 1000


// batch size; I used 10, which peaked at roughly 14 GB of VRAM
"batch_size": 10

// speaker-to-ID mapping; change the key to your own folder name
"spk2id": {
      "Yae": 0
    }
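If you want to double-check these edits without rereading the files, a small sketch like this prints the values the later steps will read. It assumes config.json sits at Data/<name>/configs/config.json (the location the commands later in this post use) and the usual train/data sections inside it:

import json
import yaml

with open("config.yml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

dataset_path = cfg["dataset_path"]  # should be ./Data/<your folder>
print("dataset_path:", dataset_path)
print("num_workers :", cfg["train_ms"]["num_workers"])

with open(f"{dataset_path}/configs/config.json", encoding="utf-8") as f:
    hps = json.load(f)

print("epochs      :", hps["train"]["epochs"])
print("batch_size  :", hps["train"]["batch_size"])
print("spk2id      :", hps["data"]["spk2id"])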

Dataset processing

The tutorials online do not quite match this repository: webui_preprocess.py did not work well for me and some commands had to be typed by hand, so I used the commands the author provides; the workflow is below. https://www.bilibili.com/read/cv22206231/ collects some common errors; the errors I hit that are not covered there are shown below.

1. Slicing the dataset

Split long recordings into short clips for training.

The original audio_slicer.py only slices a single audio file, so I modified it slightly so that it reads every wav file in a folder. The modified code is below.

import os

import librosa  # Optional. Use any library you like to read audio files.
import soundfile  # Optional. Use any library you like to write audio files.
import yaml

from slicer2 import Slicer

# read the dataset path (e.g. ./Data/Yae) from config.yml
with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)

dataset_path = configyml["dataset_path"]
raw_dir = os.path.join(dataset_path, "raw")
model_name = os.path.basename(os.path.normpath(dataset_path))

index = 0
files = os.listdir(raw_dir)
for file in files:
    print(file)
    # Load an audio file with librosa, keeping its original sampling rate and channels.
    audio, sr = librosa.load(os.path.join(raw_dir, file), sr=None, mono=False)
    slicer = Slicer(
        sr=sr,
        threshold=-40,
        min_length=2000,
        min_interval=300,
        hop_size=10,
        max_sil_kept=500
    )
    chunks = slicer.slice(audio)
    for _, chunk in enumerate(chunks):
        if len(chunk.shape) > 1:
            chunk = chunk.T  # Swap axes if the audio is stereo.
        # Save sliced audio files with soundfile.
        soundfile.write(os.path.join(raw_dir, f'{model_name}_{index}.wav'), chunk, sr)
        index += 1

    # remove the original long recording once it has been sliced
    if os.path.exists(os.path.join(raw_dir, file)):
        os.remove(os.path.join(raw_dir, file))

Run the following command from the command line:

python audio_slicer.py

2. Transcribing (labeling) the dataset

To keep labeling effort low, this script uses a speech-to-text model to transcribe the clips automatically and produces a training list in the format the project expects.
Whisper needs to be installed for this step:

pip install git+https://github.com/openai/whisper.git

By default the transcription comes out in Traditional Chinese; the code below converts it to Simplified. First install this package:

pip install zhconv

The code is as follows:

import whisper
import os
import json
import argparse
import torch
from config import config


import yaml

with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml=yaml.load(f,Loader=yaml.FullLoader)


model_name = configyml["dataset_path"].replace("./Data/","")


lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
            "en": "EN|",
        }
def transcribe_one(audio_path):
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")
    lang = max(probs, key=probs.get)
    # decode the audio
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)

    # convert Traditional Chinese to Simplified; other languages are returned unchanged
    if lang == "zh":
        import zhconv
        simplified_text = zhconv.convert(result.text, 'zh-hans')
    else:
        simplified_text = result.text

    # print the recognized text
    print(simplified_text)
    return lang, simplified_text

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--languages", default="CJ")
    parser.add_argument("--whisper_size", default="medium")
    args = parser.parse_args()
    if args.languages == "CJE":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
            "en": "EN|",
        }
    elif args.languages == "CJ":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
        }
    elif args.languages == "C":
        lang2token = {
            'zh': "ZH|",
        }
    assert (torch.cuda.is_available()), "Please enable GPU in order to run Whisper!"
    model = whisper.load_model(args.whisper_size)
    # directory containing the sliced wav files (Data/<name>/raw)
    parent_dir = config.resample_config.in_dir
    print(parent_dir)
    speaker = model_name
    speaker_annos = []
    total_files = sum([len(files) for r, d, files in os.walk(parent_dir)])
    processed_files = 0
    print(speaker)
    
    for i, wavfile in enumerate(list(os.walk(parent_dir))[0][2]):
        # transcribe each sliced wav file
        try:
            lang, text = transcribe_one(f"./Data/{speaker}/raw/{wavfile}")

            if lang not in list(lang2token.keys()):
                print(f"{lang} not supported, ignoring\n")
                continue
            # one annotation line: <wav path>|<speaker>|<language>|<text>
            text = f"./Data/{model_name}/wavs/{wavfile}|" + f"{model_name}|" + lang2token[lang] + text + "\n"
            speaker_annos.append(text)

            processed_files += 1
            print(f"Processed: {processed_files}/{total_files}")
        except Exception as e:
            print(e)
            continue

    # write the annotation list (transcription_path, i.e. esd.list)
    if len(speaker_annos) == 0:
        print("Warning: no short audios found, this IS expected if you have only uploaded long audios, videos or video links.")
        print("this IS NOT expected if you have uploaded a zip file of short audios. Please check your file structure or make sure your audio language is supported.")
    with open(config.preprocess_text_config.transcription_path, 'w', encoding='utf-8') as f:
        for line in speaker_annos:
            f.write(line)


Run the transcription with:

python short_audio_transcribe.py
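When it finishes, an esd.list should appear under your Data folder (at the transcription_path configured above; the preprocessing command in step 4 reads it from ./Data/Yae/esd.list). Each line has the form wav path|speaker|language|text; a quick way to peek at the first few lines:

# each line should look roughly like
#   ./Data/Yae/wavs/Yae_0.wav|Yae|ZH|<transcribed text>
with open("./Data/Yae/esd.list", encoding="utf-8") as f:
    for line in list(f)[:5]:
        print(line.strip())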

From here on, the steps are essentially the same as the original author's.

3. Resampling

The model currently generates audio at a 44100 Hz sampling rate, and its input must use the same rate. This is the resampling command; adjust the paths to match yours:

python resample.py --sr 44100 --in_dir ./Data/Yae/raw/ --out_dir ./Data/Yae/wavs/
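To confirm the resampling worked, you can check that every file in the wavs folder now reports 44100 Hz. A small sketch using soundfile, which the project already depends on (Yae is again my folder name):

import os
import soundfile as sf

wav_dir = "./Data/Yae/wavs"
for name in os.listdir(wav_dir):
    if name.endswith(".wav"):
        sr = sf.info(os.path.join(wav_dir, name)).samplerate
        if sr != 44100:
            print(f"{name}: {sr} Hz (expected 44100)")
print("sample-rate check finished")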

4. Preprocessing the label file

Adjust the paths to match yours:

python preprocess_text.py --transcription-path ./Data/Yae/esd.list --train-path ./Data/Yae/train.list --val-path ./Data/Yae/val.list --config-path ./Data/Yae/configs/config.json

Here I ran into an error: Resource punkt not found. Please use the NLTK Downloader to obtain the resource
The fix is described at this link: https://blog.csdn.net/qq_45956730/article/details/128944224
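One common fix is to download the missing resource once from within Python (a minimal sketch; if this does not work for you, follow the linked post instead):

import nltk

# fetches the punkt tokenizer data into the NLTK data directory
nltk.download("punkt")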

5. Generating BERT feature files

Adjust the paths to match yours:

python bert_gen.py --config-path ./Data/Yae/configs/config.json

6. Generating CLAP feature files

python clap_gen.py --config-path ./Data/Yae/configs/config.json

Training

With batch_size set to 10, training used about 14537 MiB of VRAM, roughly 14 GB. The memory usage grows over time, so you can lower batch_size to avoid running out of VRAM.

python train_ms.py
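If you are worried about running out of VRAM, you can watch GPU memory from a second terminal while training runs. A minimal sketch, assuming a reasonably recent PyTorch (torch.cuda.mem_get_info reports free/total memory for the whole GPU, so it also sees the training process):

import time
import torch

# print free / total GPU memory every 10 seconds; stop with Ctrl+C
while True:
    free, total = torch.cuda.mem_get_info()
    print(f"free {free / 1024**3:.1f} GiB / total {total / 1024**3:.1f} GiB")
    time.sleep(10)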

The problem I ran into was: 'HParams' object has no attribute 'Yae'
The fix is to update this block in the config.json under your Data folder so that the key matches your own folder name:

// speaker-to-ID mapping; change the key to your own folder name
"spk2id": {
      "Yae": 0
    }

Inference

Point the web UI at your trained model in config.yml:

# model path; remember to change it
webui:
  model: "models/G_10050.pth"

Then launch the web UI:

python webui.py

References:
Original project: https://github.com/v3ucn/Bert-VITS2-Extra_-
The original author's notes: https://colab.research.google.com/drive/10FRAJhPjZin3TbBTy3a0GC6EmIMvTII4?usp=sharing#scrollTo=Uy4He00grikV
wget (Windows): download, installation and usage: https://blog.csdn.net/suncrx/article/details/129377455
So-VITS-SVC 4.0 training/inference common errors and Q&A: https://www.bilibili.com/read/cv22206231/
Fixing the Python NLTK error "Resource punkt not found. Please use the NLTK Downloader to obtain the resource": https://blog.csdn.net/qq_45956730/article/details/128944224
