Using the Flowtron Text-to-Speech Model

Preface

I recently needed a text-to-speech model for a project and found a 2020 paper that does exactly this. The authors also open-sourced the code, so let's dig into it.

Paper:
Flowtron: an Autoregressive Flow-based Network for Text-to-Mel-spectrogram Synthesis

GitHub:
https://github.com/NVIDIA/flowtron

Environment Setup

Setup
Note that PyTorch needs to be installed first (anyone working in AI surely has it already; a quick sanity check follows the setup commands below).

git clone https://github.com/NVIDIA/flowtron.git
cd flowtron
git submodule update --init; cd tacotron2; git submodule update --init
pip install -r requirements.txt
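
Before continuing, it's worth confirming that PyTorch actually sees a CUDA device, since the inference code below calls .cuda() throughout:

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # Flowtron's inference script assumes a CUDA GPU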

Issues:
(1) Downloading the repo as an archive over HTTPS and unzipping it leads to a "git not configured" error when updating the submodules, so clone with git instead.
(2) The scipy version is too high:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
paddlepaddle-gpu 1.7.2.post107 requires scipy<=1.3.1; python_version >= "3.5", but you have scipy 1.5.2 which is incompatible.

First, uninstall scipy:

D:\flowtron\flowtron>python -m pip uninstall scipy
Found existing installation: scipy 1.5.2
Uninstalling scipy-1.5.2:
  Would remove:
    g:\python\lib\site-packages\scipy-1.5.2.dist-info\*
    g:\python\lib\site-packages\scipy\*
Proceed (Y/n)? y
  Successfully uninstalled scipy-1.5.2

Then install scipy 1.3.1:

D:\flowtron\flowtron>python -m pip install scipy==1.3.1
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting scipy==1.3.1
  Downloading https://mirrors.aliyun.com/pypi/packages/50/eb/defa40367863304e1ef01c6572584c411446a5f29bdd9dc90f91509e9144/scipy-1.3.1-cp37-cp37m-win_amd64.whl (30.3 MB)
     |████████████████████████████████| 30.3 MB 57 kB/s
Requirement already satisfied: numpy>=1.13.3 in g:\python\lib\site-packages (from scipy==1.3.1) (1.19.2)
Installing collected packages: scipy
Successfully installed scipy-1.3.1
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'G:\Python\python.exe -m pip install --upgrade pip' command.

Code

Let's look at the argument parsing in inference.py:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str,
                        help='JSON file for configuration')
    parser.add_argument('-p', '--params', nargs='+', default=[])
    parser.add_argument('-f', '--flowtron_path',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path',
                        help='Path to waveglow state dict', type=str)
    parser.add_argument('-t', '--text', help='Text to synthesize', type=str)
    parser.add_argument('-i', '--id', help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=400, type=int)
    parser.add_argument('-o', "--output_dir", default="results/")
    parser.add_argument("-s", "--sigma", default=0.5, type=float)
    parser.add_argument("-g", "--gate", default=0.5, type=float)
    parser.add_argument("--seed", default=1234, type=int)
    args = parser.parse_args()

    # Parse configs.  Globals nicer in this case
    with open(args.config) as f:
        data = f.read()

    global config
    config = json.loads(data)
    update_params(config, args.params)

    data_config = config["data_config"]
    global model_config
    model_config = config["model_config"]

    # Make directory if it doesn't exist
    if not os.path.isdir(args.output_dir):
        os.makedirs(args.output_dir)
        os.chmod(args.output_dir, 0o775)

    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = False
    infer(args.flowtron_path, args.waveglow_path, args.output_dir, args.text,
          args.id, args.n_frames, args.sigma, args.gate, args.seed)
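
The -p/--params flag accepts key=value strings that update_params uses to override entries in the JSON config without editing config.json. The actual implementation lives in the repo; what follows is only a hypothetical sketch of the idea (update_params_sketch is my name, not the repo's), with dotted keys walking into nested dicts:

import ast

def update_params_sketch(config, params):
    # Hypothetical stand-in for the repo's update_params: each item looks like
    # "data_config.sampling_rate=22050" and overrides a nested config entry.
    for param in params:
        key, value = param.split('=', 1)
        node = config
        parts = key.split('.')
        for part in parts[:-1]:   # walk down the nested dicts
            node = node[part]
        try:
            node[parts[-1]] = ast.literal_eval(value)  # numbers, bools, lists
        except (ValueError, SyntaxError):
            node[parts[-1]] = value                    # fall back to a plain string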

Now let's look at the infer function:

def infer(flowtron_path, waveglow_path, output_dir, text, speaker_id, n_frames,
          sigma, gate_threshold, seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    # load waveglow
    waveglow = torch.load(waveglow_path)['model'].cuda().eval()
    waveglow.cuda().half()
    for k in waveglow.convinv:
        k.float()
    waveglow.eval()

    # load flowtron
    model = Flowtron(**model_config).cuda()
    state_dict = torch.load(flowtron_path, map_location='cpu')['state_dict']
    model.load_state_dict(state_dict)
    model.eval()
    print("Loaded checkpoint '{}')" .format(flowtron_path))

    ignore_keys = ['training_files', 'validation_files']
    trainset = Data(
        data_config['training_files'],
        **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))
    speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
    text = trainset.get_text(text).cuda()
    speaker_vecs = speaker_vecs[None]
    text = text[None]

    with torch.no_grad():
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
        mels, attentions = model.infer(
            residual, speaker_vecs, text, gate_threshold=gate_threshold)

    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
        axes[1].imshow(attention[:, 0].transpose(), origin='bottom', aspect='auto')
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")

    with torch.no_grad():
        audio = waveglow.infer(mels.half(), sigma=0.8).float()

    audio = audio.cpu().numpy()[0]
    # normalize audio for now
    audio = audio / np.abs(audio).max()
    print(audio.shape)

    write(os.path.join(output_dir, 'sid{}_sigma{}.wav'.format(speaker_id, sigma)),
          data_config['sampling_rate'], audio)
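
One detail worth highlighting from the code above: the "residual" is a Gaussian latent z ~ N(0, sigma²·I) of shape (batch, 80 mel channels, n_frames), which the flow then inverts into a mel-spectrogram. A minimal CPU-only sketch of just that sampling step (equivalent in distribution to the CUDA call in the code):

import torch

# Same distribution as torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma;
# sigma trades off fidelity against variation in the generated speech.
sigma, n_frames = 0.5, 400
residual = torch.randn(1, 80, n_frames) * sigma
print(residual.shape)         # torch.Size([1, 80, 400])
print(residual.std().item())  # close to sigma (0.5)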

Application

Example
First, download the pre-trained models.
The flowtron_ljs.pt file on Google Drive:
https://drive.google.com/file/d/1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-/view

You also need a WaveGlow model file such as waveglow_256channels_v4.pt.
WaveGlow's GitHub repo:
https://github.com/NVIDIA/WaveGlow
The latest v5 model on Google Drive:
https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well known that deep generative models have a rich latent space!" -i 0

Since I downloaded the v5 universal WaveGlow model instead, adjust the -w argument accordingly:

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a rich latent space!" -i 0

Errors

ImportError: cannot import name 'betabinom' from 'scipy.stats'

This is again a scipy version problem, which makes the pip error earlier on misleading (I keep that log above for comparison). scipy.stats.betabinom only exists in scipy >= 1.4.0, so upgrade scipy back:

python -m pip install scipy --upgrade
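
A quick way to confirm the upgrade worked, since betabinom was added in scipy 1.4.0:

from scipy.stats import betabinom

# Beta-binomial with n=10, a=b=2 has mean n*a/(a+b) = 5.0
print(betabinom(10, 2.0, 2.0).mean())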

The next error:

ModuleNotFoundError: No module named 'numba.decorators'

This is a numba version problem: uninstall numba and install version 0.48 (the uninstall step may not strictly be necessary):

pip uninstall numba
pip install numba==0.48
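
For context: as far as I can tell, the missing module is imported by the older librosa release that Flowtron depends on, and numba 0.48 is the last version that still ships numba.decorators before the package layout was reorganized. A quick check after the downgrade:

import numba
from numba.decorators import jit  # raises ModuleNotFoundError on newer numba

print(numba.__version__)  # expect 0.48.0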

The next error:

ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'

Newer matplotlib releases validate imshow's origin argument strictly and accept only 'upper' or 'lower', so change origin='bottom' to origin='lower' in the plotting loop:

    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='lower', aspect='auto') # origin='bottom'
        axes[1].imshow(attention[:, 0].transpose(), origin='lower', aspect='auto') # origin='bottom'
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")
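
A standalone sanity check that the patched call is accepted by your matplotlib version:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.imshow(np.random.rand(4, 4), origin='lower', aspect='auto')  # no ValueError
plt.close(fig)
print("origin='lower' accepted")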

Results

The results folder now contains the generated speech file. To my ears the quality is good: it speaks the sentence "It is well known that deep generative models have a rich latent space!" (CSDN can't embed audio files, so I can't demo it directly.)

[Figure: the mel-spectrogram and attention plots saved to results/]

A Brief Look at How It Works

To be filled in later…

Summary

Running the code straight out of the box usually throws errors, generally caused by dependency version mismatches; a quick search on Google / Stack Overflow sorts them out.

Postscript
A binary-exploitation player gradually drifting into data science and AI security… I didn't see that coming either.
But AI, like IoT, is where things are headed, so I might as well go with the flow: now that I'm here, I'll knuckle down and do the work instead of overthinking it (I'll still tinker with binaries when I have time).
