[2022-01-21] Voice conversion

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion: intermediate results; see the github.

1. AutoVC

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

  1. For translations of the paper, see these CSDN posts
  2. For zero-shot learning, see this Zhihu article
  3. Official AutoVC github

Three problems addressed:

  1. training on non-parallel data;
  2. many-to-many conversion;
  3. zero-shot conversion

The goal is to match distributions like a GAN while remaining as easy to train as a CVAE.

zero-shot

(Figure: zero-shot learning concept illustration)
The model is trained on the training set so that it can classify objects in the test set, even though the training and test classes are disjoint; class descriptions are needed to bridge the two sets and make the model work. The first problem is obtaining suitable class descriptions A; the second is building a suitable classification model.

Open issues in zero-shot learning:

  1. Domain shift: handled via the autoencoding process
  2. Hubness: build a mapping from semantic space to feature space; use a generative model
  3. Semantic gap: align the manifolds of the two spaces

Code reproduction

Runtime errors

  1. RuntimeError: CUDA error: out of memory (GPU memory exhausted)
    Fix: wait for the other job on GPU 0 to finish (1725 MiB needed)
  2. module 'librosa' has no attribute 'output'
    Versions after 0.8.0 removed librosa's output module
    Fix:
import soundfile as sf

# replaces the removed librosa.output.write_wav(name + '.wav', waveform, sr=16000)
sf.write(name + '.wav', waveform, 16000)

Pipeline:

  1. Generate spectrogram data from the wav files: python make_spect.py
    Using 201.wav and 202.wav from liuchanhg_angry, wangzhe_happy, and zhaoquanyin_sad in the CASIA database as examples, the generated spectrograms S have shapes (112, 80), (103, 80); (92, 80), (70, 80); (189, 80), (87, 80) respectively

  2. Generate training metadata, including the GE2E speaker embedding (please use one-hot embeddings if you are not doing zero-shot conversion): python make_metadata.py
    If the current utterance is too short, another utterance is chosen
    The resulting format is (see the inspection sketch after the logs below):
    [['liuchanhg_angry', array(speaker embedding), 'liuchanhg_angry/201.npy', 'liuchanhg_angry/202.npy']]

  3. Run the main training script: python main.py
    Converges when the reconstruction loss is around 0.0001.

...
Elapsed [1 day, 3:41:14], Iteration [304060/1000000], G/loss_id: 0.0180, G/loss_id_psnt: 0.0179, G/loss_cd: 0.0001
Elapsed [1 day, 3:41:17], Iteration [304070/1000000], G/loss_id: 0.0110, G/loss_id_psnt: 0.0109, G/loss_cd: 0.0001
...
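To sanity-check the metadata, here is a minimal inspection sketch (the train.pkl path is an assumption based on make_metadata.py's default output; adjust to your setup):

import pickle
import numpy as np

# each entry: [speaker_name, speaker_embedding, utterance_1.npy, utterance_2.npy, ...]
metadata = pickle.load(open('./spmel/train.pkl', 'rb'))
for entry in metadata:
    print(entry[0], np.asarray(entry[1]).shape, entry[2:])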

100k

Elapsed [2:25:56], Iteration [99990/100000], G/loss_id: 0.0294, G/loss_id_psnt: 0.0294, G/loss_cd: 0.0000
Elapsed [2:25:57], Iteration [100000/100000], G/loss_id: 0.0240, G/loss_id_psnt: 0.0240, G/loss_cd: 0.0000

1000k

Elapsed [17:26:39], Iteration [698500/1000000], G/loss_id: 0.0289, G/loss_id_psnt: 0.0289, G/loss_cd: 0.0000

model_2: change dim_neck=32, freq=32, batch_size=2

Elapsed [14:21:06], Iteration [461400/1000000], G/loss_id: 0.0139, G/loss_id_psnt: 0.0139, G/loss_cd: 0.0001
...
Elapsed [23:37:37], Iteration [732500/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0162, G/loss_cd: 0.0001
...
Elapsed [1 day, 8:23:07], Iteration [999900/1000000], G/loss_id: 0.0190, G/loss_id_psnt: 0.0189, G/loss_cd: 0.0001
Elapsed [1 day, 8:23:18], Iteration [1000000/1000000], G/loss_id: 0.0143, G/loss_id_psnt: 0.0143, G/loss_cd: 0.0001

model_3: change len_crop=128*3

Elapsed [14:10:26], Iteration [181400/1000000], G/loss_id: 0.0146, G/loss_id_psnt: 0.0145, G/loss_cd: 0.0002
Elapsed [14:10:54], Iteration [181500/1000000], G/loss_id: 0.0160, G/loss_id_psnt: 0.0159, G/loss_cd: 0.0002
...
Elapsed [23:26:40], Iteration [290100/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0163, G/loss_cd: 0.0001
...
Elapsed [1 day, 9:20:03], Iteration [426600/1000000], G/loss_id: 0.0125, G/loss_id_psnt: 0.0124, G/loss_cd: 0.0001
Elapsed [1 day, 9:20:17], Iteration [426700/1000000], G/loss_id: 0.0186, G/loss_id_psnt: 0.0182, G/loss_cd: 0.0001
...
Elapsed [1 day, 12:16:26], Iteration [501300/1000000], G/loss_id: 0.0137, G/loss_id_psnt: 0.0131, G/loss_cd: 0.0001
Elapsed [1 day, 12:16:40], Iteration [501400/1000000], G/loss_id: 0.0152, G/loss_id_psnt: 0.0149, G/loss_cd: 0.0001

Next I noticed that the mel-spectrogram length is inconsistent with the vocoder's: a roughly 2 s clip yields a mel spectrogram of some (400-500, 80), so I tried generating mel spectrograms directly from the wavenet code instead; see the official github.

model_4 (following this reference): add squeeze:

# x_identic carries an extra singleton dim; squeeze it so shapes match x_real
g_loss_id = F.mse_loss(x_real, x_identic.squeeze())
g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt.squeeze())

But the author notes:

I trimmed the silence off by hand.

and also:

small batch size leads to better generalization

Is it necessary to use other vocoders?

The author:

300k steps, 10 hours

The author:

The only use a subset.

The author:

final loss is around 1e-2

My guesses at why AUTOVC's results are so poor:

  1. The author says "I trimmed the silence off by hand.", so it is unclear how much excess silence hurts training: with the provided code, p225_001.wav yields a (385, 80) mel spectrogram, the wavenet-style extraction yields (177, 80), and the vocoder's copy of the same utterance is (90, 80). I think this matters a lot for generating results (especially the metadata), and many people hit the same problem (https://github.com/auspicious3000/autovc/issues/84, https://github.com/auspicious3000/autovc/issues/17). Also, the original code uses len_crop=128, which may be too short to cover the speaker's voice, so I increased it to 376 (still training); the speaker embedding, however, is still computed with 128, and if training still goes badly I will try changing that last. (PS: the shape is now (166, 80); I also see (129, 80) and (385, 80) for wav48)
  2. The author says "small batch size leads to better generalization", so I changed the batch size back to 2 and retrained;
  3. On training steps, the paper says "100k" but the author also says "300k steps";
  4. The author also says "The only use a subset." [sic], i.e. not the 9:1 VCTK split described in the paper.

retrain

Elapsed [1 day, 3:08:04], Iteration [1000000/1000000], G/loss_id: 0.0077, G/loss_id_psnt: 0.0077, G/loss_cd: 0.0005
Elapsed [1 day, 10:11:06], Iteration [1000000/1000000], G/loss_id: 0.0062, G/loss_id_psnt: 0.0062, G/loss_cd: 0.0002
Elapsed [1 day, 10:10:44], Iteration [1000000/1000000], G/loss_id: 0.0037, G/loss_id_psnt: 0.0036, G/loss_cd: 0.0002
Elapsed [2 days, 6:16:53], Iteration [1000000/1000000], G/loss_id: 0.0034, G/loss_id_psnt: 0.0033, G/loss_cd: 0.0002
Elapsed [2 days, 5:12:11], Iteration [1000000/1000000], G/loss_id: 0.0033, G/loss_id_psnt: 0.0032, G/loss_cd: 0.0002

None of these produced good results.

Vocoder

See the lab wiki toolkit github: its pretrained models cover both mel-spectrogram extraction and vocoder synthesis.

WaveNet CUDA issue
Modify line 112 of /ceph/home/yangsc21/anaconda3/envs/autovc/lib/python3.8/site-packages/wavenet_vocoder/mixture.py

Dataset split

For zero-shot evaluation, 100 speakers are randomly selected for training and 9 speakers are held out of the training set (a minimal split sketch follows below).

Note: exclude the file p376_295.raw

Note that the ‘p315’ text was lost due to a hard disk error.
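A sketch of the split under these constraints (the VCTK wav16 layout used above is assumed; p315 is dropped per the note):

import os
import random

root = '/ceph/datasets/VCTK-Corpus/wav16'
speakers = sorted(s for s in os.listdir(root) if s != 'p315')  # p315 lost its transcripts
random.seed(0)
train_spks = random.sample(speakers, 100)
unseen_spks = [s for s in speakers if s not in train_spks][:9]  # held out for zero-shot tests
print(len(train_spks), unseen_spks)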

Elapsed [0:00:12], Iteration [100/1000000], G/loss_id: 0.7917, G/loss_id_psnt: 1.3611, G/loss_cd: 0.0848
Elapsed [0:00:22], Iteration [200/1000000], G/loss_id: 0.7815, G/loss_id_psnt: 0.6783, G/loss_cd: 0.0517
...
Elapsed [9:17:16], Iteration [325900/1000000], G/loss_id: 0.0984, G/loss_id_psnt: 0.0974, G/loss_cd: 0.0041
Elapsed [9:17:26], Iteration [326000/1000000], G/loss_id: 0.0822, G/loss_id_psnt: 0.0815, G/loss_cd: 0.0030
...
Elapsed [17:51:58], Iteration [630900/1000000], G/loss_id: 0.0621, G/loss_id_psnt: 0.0614, G/loss_cd: 0.0020
Elapsed [17:52:08], Iteration [631000/1000000], G/loss_id: 0.0534, G/loss_id_psnt: 0.0531, G/loss_cd: 0.0018

Bottleneck dimension analysis

  • "too narrow" model: bottleneck dimension reduced from 32 to 16, downsampling factor raised from 32 to 128
  • "too wide" model: dimension 256, sampling factor 8, λ set to 0

The “too narrow” model should have low classification accuracy (good disentanglement) but high reconstruction error (poor reconstruction)

The “too wide” model should have low reconstruction error (good reconstruction) but high classification accuracy (poor disentanglement).
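As a sketch, the two probes map onto AutoVC's hyperparameters roughly as follows (the names follow the repo's hparams; the exact values are my reading of the quotes above):

# 'too narrow': tighter bottleneck -> good disentanglement, poor reconstruction
too_narrow = dict(dim_neck=16, freq=128)
# 'too wide': wider bottleneck, content-code loss disabled -> good reconstruction, poor disentanglement
too_wide = dict(dim_neck=256, freq=8, lambda_cd=0)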

freq=64, dim_neck=24

Elapsed [15:18:01], Iteration [148500/1000000], G/loss_id: 0.1358, G/loss_id_psnt: 0.1339, G/loss_cd: 0.0075
Elapsed [15:18:41], Iteration [148600/1000000], G/loss_id: 0.2336, G/loss_id_psnt: 0.2340, G/loss_cd: 0.0100
...
Elapsed [1 day, 1:08:15], Iteration [236000/1000000], G/loss_id: 0.1385, G/loss_id_psnt: 0.1368, G/loss_cd: 0.0061
Elapsed [1 day, 1:08:55], Iteration [236100/1000000], G/loss_id: 0.1462, G/loss_id_psnt: 0.1443, G/loss_cd: 0.0069
...
Elapsed [1 day, 14:13:03], Iteration [352300/1000000], G/loss_id: 0.1201, G/loss_id_psnt: 0.1191, G/loss_cd: 0.0060
Elapsed [1 day, 14:13:43], Iteration [352400/1000000], G/loss_id: 0.1177, G/loss_id_psnt: 0.1168, G/loss_cd: 0.0066
...
Elapsed [4 days, 14:41:52], Iteration [1000000/1000000], G/loss_id: 0.0673, G/loss_id_psnt: 0.0659, G/loss_cd: 0.0027

freq=64, dim_neck=16

Elapsed [15:18:14], Iteration [151500/1000000], G/loss_id: 0.1823, G/loss_id_psnt: 0.1794, G/loss_cd: 0.0086
Elapsed [15:18:54], Iteration [151600/1000000], G/loss_id: 0.1745, G/loss_id_psnt: 0.1731, G/loss_cd: 0.0073
...
Elapsed [1 day, 1:09:25], Iteration [240900/1000000], G/loss_id: 0.1490, G/loss_id_psnt: 0.1475, G/loss_cd: 0.0058
Elapsed [1 day, 1:10:05], Iteration [241000/1000000], G/loss_id: 0.1617, G/loss_id_psnt: 0.1594, G/loss_cd: 0.0057
...
Elapsed [1 day, 14:13:05], Iteration [359300/1000000], G/loss_id: 0.0810, G/loss_id_psnt: 0.0803, G/loss_cd: 0.0033
Elapsed [1 day, 14:13:45], Iteration [359400/1000000], G/loss_id: 0.0782, G/loss_id_psnt: 0.0785, G/loss_cd: 0.0048
...
Elapsed [4 days, 13:00:51], Iteration [999900/1000000], G/loss_id: 0.1269, G/loss_id_psnt: 0.1248, G/loss_cd: 0.0037
Elapsed [4 days, 13:01:30], Iteration [1000000/1000000], G/loss_id: 0.1593, G/loss_id_psnt: 0.1570, G/loss_cd: 0.0044

freq=64, dim_neck=8

Elapsed [15:14:48], Iteration [146200/1000000], G/loss_id: 0.1795, G/loss_id_psnt: 0.1785, G/loss_cd: 0.0053
Elapsed [15:15:29], Iteration [146300/1000000], G/loss_id: 0.2842, G/loss_id_psnt: 0.2802, G/loss_cd: 0.0078
...
Elapsed [1 day, 1:06:04], Iteration [233200/1000000], G/loss_id: 0.1539, G/loss_id_psnt: 0.1560, G/loss_cd: 0.0054
Elapsed [1 day, 1:06:45], Iteration [233300/1000000], G/loss_id: 0.2401, G/loss_id_psnt: 0.2394, G/loss_cd: 0.0074
...
Elapsed [1 day, 14:09:50], Iteration [348400/1000000], G/loss_id: 0.3423, G/loss_id_psnt: 0.3405, G/loss_cd: 0.0071
Elapsed [1 day, 14:10:31], Iteration [348500/1000000], G/loss_id: 0.3143, G/loss_id_psnt: 0.3121, G/loss_cd: 0.0091
...
Elapsed [4 days, 15:07:50], Iteration [1000000/1000000], G/loss_id: 0.1669, G/loss_id_psnt: 0.1649, G/loss_cd: 0.0045

freq=64, dim_neck=8, lambda_cd=10

Elapsed [11:10:59], Iteration [98700/1000000], G/loss_id: 0.3038, G/loss_id_psnt: 0.3034, G/loss_cd: 0.0017
Elapsed [11:11:40], Iteration [98800/1000000], G/loss_id: 0.3581, G/loss_id_psnt: 0.3517, G/loss_cd: 0.0022
...
Elapsed [21:02:44], Iteration [185700/1000000], G/loss_id: 0.3073, G/loss_id_psnt: 0.3097, G/loss_cd: 0.0017
Elapsed [21:03:25], Iteration [185800/1000000], G/loss_id: 0.2681, G/loss_id_psnt: 0.2694, G/loss_cd: 0.0016
...
Elapsed [1 day, 10:06:21], Iteration [300900/1000000], G/loss_id: 0.2351, G/loss_id_psnt: 0.2319, G/loss_cd: 0.0016
Elapsed [1 day, 10:07:02], Iteration [301000/1000000], G/loss_id: 0.2504, G/loss_id_psnt: 0.2511, G/loss_cd: 0.0016
...
Elapsed [4 days, 12:21:31], Iteration [1000000/1000000], G/loss_id: 0.2125, G/loss_id_psnt: 0.2103, G/loss_cd: 0.0011

2. CLSVC

Code reproduction

Elapsed [11:38:12], Iteration [110100/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0001, content_advloss: 4.6966, code_loss: 0.0006
Elapsed [11:38:50], Iteration [110200/200000], G/loss_id: 0.0019, G/loss_id_psnt: 0.0019, spk_loss: 0.0124, content_advloss: 4.6481, code_loss: 0.0007
...
Elapsed [21:15:26], Iteration [200000/200000], G/loss_id: 0.0016, G/loss_id_psnt: 0.0016, spk_loss: 0.0001, content_advloss: 4.6103, code_loss: 0.0003
Elapsed [11:30:53], Iteration [108800/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0002, content_advloss: 4.6069, code_loss: 0.0005
Elapsed [11:31:31], Iteration [108900/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0001, content_advloss: 4.6133, code_loss: 0.0004
...
Elapsed [21:16:12], Iteration [200000/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0010, content_advloss: 4.6662, code_loss: 0.0006

I read the CLSVC code carefully. Compared with AutoVC, CLSVC simply removes the down-/up-sampling and adds a very simple adversarial classifier (with gradient reversal on the content branch; a sketch follows below). The advertised "flexible hidden feature dimensions" is just the AutoVC encoder's output dimension, which is user-defined anyway, so the contribution does not seem very large…
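For reference, a minimal PyTorch gradient-reversal layer of the kind used in front of a content adversarial classifier (my own sketch, not CLSVC's exact code):

import torch

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; scales the gradient by -lambd in the backward pass
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# usage: logits = classifier(grad_reverse(content_embedding)) makes the encoder
# maximize the classifier's loss while the classifier minimizes it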

The results were poor; retraining.

Changed z in the training loop to content_embedding_source

Elapsed [0:02:48], Iteration [1000/810000], G/loss_id: 0.2151, G/loss_id_psnt: 0.2155, spk_loss: 0.3096, content_advloss: 4.6123, code_loss: 0.0139
...
Elapsed [10 days, 2:53:56], Iteration [810000/810000], G/loss_id: 0.0008, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6048, code_loss: 0.0000

Then changed the coefficient on g_loss_cd from 1 to 0.5

Elapsed [0:00:35], Iteration [100/810000], G/loss_id: 0.2475, G/loss_id_psnt: 0.2654, spk_loss: 4.1111, content_advloss: 4.5997, code_loss: 0.0165
...
Elapsed [10 days, 2:23:01], Iteration [810000/810000], G/loss_id: 0.0009, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6057, code_loss: 0.0001

3. SpeechFlow

Disentangles linguistic content, timbre, pitch, and rhythm

Code reproduction

spk2gen maps each speaker to a gender; len(spk2gen) = 109 (a loading sketch follows the sample below):

{'p250': 'F', 'p285': 'M', 'p277': 'F',...
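A minimal loading sketch (the path is an assumption; spk2gen.pkl ships with the SpeechSplit assets):

import pickle

spk2gen = pickle.load(open('assets/spk2gen.pkl', 'rb'))
print(len(spk2gen), spk2gen['p250'])  # expect: 109 F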

Re-running the experiments on VCTK's wav16, some audio files fail sf.read; opening them directly in a player also reports a corrupted format (a scanning sketch follows the list):

...
p329
/ceph/datasets/VCTK-Corpus/wav16/p329/p329_037.wav
p330
/ceph/datasets/VCTK-Corpus/wav16/p330/p330_101.wav
...
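A sketch for finding the unreadable files (assuming the wav16 layout above; soundfile raises RuntimeError on corrupt data):

import os
import soundfile as sf

root = '/ceph/datasets/VCTK-Corpus/wav16'
for spk in sorted(os.listdir(root)):
    for fname in sorted(os.listdir(os.path.join(root, spk))):
        path = os.path.join(root, spk, fname)
        try:
            sf.read(path)
        except RuntimeError:
            print('corrupt:', path)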

On preprocessing, the author says:

All preprocessing steps are in the code, except trimming silence. But I don’t think they will make any fundamental difference. Your loss value looks fine.

On how to train P, see this discussion.

The demo.pkl format:

import pickle

metadata = pickle.load(open("/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/demo.pkl", "rb"))
print(metadata[0][1].shape)
print(metadata[0][2][0].shape)
print(metadata[0][2][1].shape)

print(metadata[1][1].shape)     # (1, 82)
print(metadata[1][2][0].shape)      # (105, 80)
print(metadata[1][2][1].shape)      # (105,)
'''
[['p226', array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.    # (1, 82)
,(array([[0.43534297        # (135, 80)
,array([-1.0000000e+10,     # (135,)
135, '003002')]
'''

Mel spectrograms extracted his way cannot be turned into audio with mel2wav_GriffinLim, so each mel-spectrogram extraction method has its own matching vocoder.

min_len_seq=128, max_len_seq=128 * 2, max_len_pad=128 * 3,

Elapsed [1 day, 10:29:04], Iteration [311800/1000000], G/loss_id: 0.00094931
...
Elapsed [1 day, 22:15:11], Iteration [393000/1000000], G/loss_id: 0.00089064
Validation loss: 25.521947860717773
...
Elapsed [6 days, 12:43:18], Iteration [1000000/1000000], G/loss_id: 0.00062485
Saved model checkpoints into run/models...
Validation loss: 24.621514320373535

min_len_seq=64, max_len_seq=128, max_len_pad=192,

Elapsed [1 day, 6:19:23], Iteration [392500/1000000], G/loss_id: 0.00089022
...
Elapsed [1 day, 17:51:54], Iteration [510900/1000000], G/loss_id: 0.00070071
...
Elapsed [2 days, 2:45:19], Iteration [601200/1000000], G/loss_id: 0.00064877
Elapsed [2 days, 2:46:17], Iteration [601300/1000000], G/loss_id: 0.00086512
...
Elapsed [4 days, 13:16:14], Iteration [1000000/1000000], G/loss_id: 0.00063725
Saved model checkpoints into run_192/models...
Validation loss: 25.529197692871094

'R' = Rhythm, 'F' = Pitch, 'U' = Timbre

4. VQMIVC

github

In preprocess.py, 88 speakers are randomly chosen for training and 20 for testing; 10% of the training data is randomly held out for validation. Train: 31877 utterances, validation: 3496, test: 8474.

Now p225_001's mel spectrogram has shape (206, 80)… this is the third mel extraction method so far (the first was Kaizhi Qian's, the second the lab's); the lf0 shape is (206,).

Contrastive Predictive Coding (CPC) comes from:

Representation Learning with Contrastive Predictive Coding

  1. The VQMIVC encoder comes from vector-quantized contrastive predictive coding (the VQ-VAE model)
  2. The speaker embedding comes from "One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization"
  3. The MI term comes from "CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information", which not only provides a reliable MI upper-bound estimate but also serves as a learning critic that effectively minimizes correlation in deep models (a sketch follows this list)
  4. The decoder comes from "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss"
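A minimal sketch of the sampled CLUB estimator, following the structure of the official CLUB reference code (layer sizes are placeholders): loglikeli is maximized to fit the variational network q(y|x), while mi_est is the upper bound that the main model minimizes.

import torch
import torch.nn as nn

class CLUBSample(nn.Module):
    def __init__(self, x_dim, y_dim, hidden):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim), nn.Tanh())

    def loglikeli(self, x, y):
        # log q(y|x) under a diagonal Gaussian; maximize w.r.t. this module's parameters
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(mu - y) ** 2 / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_est(self, x, y):
        # sampled upper bound: E[log q(y|x)] - E[log q(y'|x)] with shuffled negatives
        mu, logvar = self.mu(x), self.logvar(x)
        neg = y[torch.randperm(y.size(0))]
        positive = -(mu - y) ** 2 / logvar.exp()
        negative = -(mu - neg) ** 2 / logvar.exp()
        return (positive.sum(dim=1) - negative.sum(dim=1)).mean() / 2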

Code reproduction

Training with mutual information minimization (MIM):

epoch:1, global step:19, recon loss:4.916, cpc loss:2.398, vq loss:0.004, perpexlity:36.475, lld cs loss:-22.828, mi cs loss:1.334E-03, lld ps loss:0.072, mi ps loss:0.000, lld cp loss:-47.886, mi cp loss:0.005, used time:1.841s
[14.59624186 14.74744059 17.68857762 16.60425648  8.98168102 10.80852639]
...
Eval | epoch:150, recon loss:0.597, cpc loss:1.171, vq loss:0.452, perpexlity:331.668, lld cs loss:109.938, mi cs loss:2.622E-03, lld ps loss:0.053, mi ps loss:0.000, lld cp loss:1085.477, mi cp loss:0.027, used time:11.050s
...
epoch:500, global step:62500, recon loss:0.446, cpc loss:1.058, vq loss:0.481, perpexlity:382.976, lld cs loss:133.278, mi cs loss:-3.201E-11, lld ps loss:0.043, mi ps loss:0.001, lld cp loss:1430.699, mi cp loss:0.019, used time:59.947s
[81.88058467 74.5465714  66.79555879 59.56658187 53.12879977 48.21143758]
Saved checkpoint: model.ckpt-500
python convert_example.py -s test_wavs_/p225_001.wav -r test_wavs_/p232_001.wav -c converted_ -m /ceph/home/yangsc21/Python/autovc/VQMIVC/checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt 

Training without MIM:

epoch:1, global step:11, recon loss:5.052, cpc loss:2.398, vq loss:0.007, perpexlity:28.734, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.932s
[15.68673657 15.79730163 17.24362392 18.36610983  9.6393453  11.61267515]
...
eval epoch:500, global step:62500, recon loss:0.428, cpc loss:1.084, vq loss:0.434, perpexlity:236.974, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:2.310s
[80.40287876 73.21134896 65.82538815 58.78130851 52.62680888 48.37559807]
Saved checkpoint: model.ckpt-500

The first training run hit the following problem:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
epoch:22, global step:2738, recon loss:nan, cpc loss:2.190, vq loss:0.020, perpexlity:2.943, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.928s
[69.31384211 62.62814037 61.00833479 59.76804704 60.72292783 60.05764839]
File "train.py", line 132, in mi_second_forward
scaled_loss.backward()
ZeroDivisionError: float division by zero

5. AutoPST

github

spk2emb_82.pkl format: one-hot speaker embeddings. I do not know how to train the SEA model, so I train directly with the 82 speakers in spk2emb_82.pkl, copying over all their audio; note, however, that p248 and p251 are 'Indian' speakers missing from the 'wav16' folder, so 80 speakers are actually used for training.

How to train the SEA model: issue
How to make 'mfcc_stats.pkl' and 'spk2emb_82.pkl': issue

A:

Elapsed [0:00:31], Iteration [100/1000000], P/loss_tx2sp: 1.56564283, P/loss_stop_sp: 0.41424835
...
Elapsed [1 day, 22:56:36], Iteration [739100/1000000], P/loss_tx2sp: 0.04529002, P/loss_stop_sp: 0.00000134
...
Elapsed [3 days, 2:02:54], Iteration [1000000/1000000], P/loss_tx2sp: 0.04210990, P/loss_stop_sp: 0.00000138
Saved model checkpoints into assets ...

B:

Elapsed [0:01:07], Iteration [100/1000000], P/loss_tx2sp: 0.16594851, P/loss_stop_sp: 0.01642439
...
Elapsed [1 day, 3:50:30], Iteration [308000/1000000], P/loss_tx2sp: 0.05139246, P/loss_stop_sp: 0.00002161
129 torch.Size([4, 1635])
(RuntimeError: CUDA out of memory. )
...	# retrain on A100
Elapsed [1 day, 4:01:57], Iteration [612100/1000000], P/loss_tx2sp: 0.06876539, P/loss_stop_sp: 0.00025042

SpeechSplit performs better only when it has the ground truth rhythm.

6. MAP

MCD

896 p299_010_p269_010

9.590664489988024 3.480195005888315

MCD = 6.729270791070735 dB, calculated over a total of 658567 frames, total 896 pairs

933

8.460124809689486 2.212626641284654

MCD = 5.365016399179563 dB, calculated over a total of 845206 frames, total 982 pairs
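For reference, the standard mel-cepstral distortion between aligned frames, averaged over all frames, with c_k and \hat{c}_k the k-th target and converted mel-cepstral coefficients (the 0th power coefficient is usually excluded):

\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} \left(c_k - \hat{c}_k\right)^2}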

7. VQVC+ and AdaIN-VC

AdaIN-VC:
https://github.com/jjery2243542/adaptive_voice_conversion
unofficial: https://github.com/cyhuang-tw/AdaIN-VC
VQVC+:
https://github.com/ericwudayi/SkipVQVC

8. My_Model

run: each step executes G1 five times…, which is slow (about 100 iters/min). Something must be written wrong: the r, p, and c passed out are the detached ones, so I simply stopped this run.

Elapsed [0:00:59], Iteration [100/1000000], G/loss_id: 0.07169086, G/loss_id_psnt: 0.69216526, spk_loss: 4.61872101, content_adv_loss: 4.61275053, mi_cp_loss: 0.01635285, mi_rc_loss: 0.00026382, mi_rp_loss: 0.00063036, lld_cp_loss: -61.53382492, lld_rc_loss: -15.68565655, lld_rp_loss: -58.85915375
...
Elapsed [10:16:29], Iteration [65200/1000000], G/loss_id: 0.00442289, G/loss_id_psnt: 0.00442220, spk_loss: 0.25282666, content_adv_loss: 4.30261326, mi_cp_loss: 0.01877159, mi_rc_loss: 0.00023279, mi_rp_loss: 0.00061867, lld_cp_loss: -62.52816391, lld_rc_loss: -15.75810432, lld_rp_loss: -59.59102631
...
Elapsed [22:28:41], Iteration [139300/1000000], G/loss_id: 0.00412945, G/loss_id_psnt: 0.00413380, spk_loss: 0.00963319, content_adv_loss: 3.85407591, mi_cp_loss: 0.02865396, mi_rc_loss: 0.00045456, mi_rp_loss: 0.00082575, lld_cp_loss: -62.08999634, lld_rc_loss: -15.69443417, lld_rp_loss: -58.60738373

run_: refines the above; each step executes G1 only once, then performs five loglikeli MI-network updates; added eval and sample plots; 100 iters/40 s

Elapsed [0:03:57], Iteration [600/1000000], G/loss_id: 0.30983770, G/loss_id_psnt: 0.19387859, spk_loss: 4.63300323, content_adv_loss: 4.61494732, mi_cp_loss: 0.00945068, mi_rc_loss: 0.00023235, mi_rp_loss: 0.00065887, lld_cp_loss: -58.04447556, lld_rc_loss: -15.88657951, lld_rp_loss: -56.80010986
...
Validation loss: 47.09280776977539
Elapsed [3 days, 2:54:44], Iteration [633100/1000000], G/loss_id: 0.00095636, G/loss_id_psnt: 0.00094097, spk_loss: 0.00062935, content_adv_loss: 4.60769081, mi_cp_loss: -0.00006776, mi_rc_loss: 0.00005649, mi_rp_loss: -0.00000760, lld_cp_loss: -63.45062256, lld_rc_loss: -15.89001274, lld_rp_loss: -63.43792725
...
Elapsed [4 days, 22:57:11], Iteration [1000000/1000000], G/loss_id: 0.00098727, G/loss_id_psnt: 0.00097293, spk_loss: 0.00006716, content_adv_loss: 4.60707092, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00003540, mi_rp_loss: -0.00001546, lld_cp_loss: -63.21662521, lld_rc_loss: -15.81787777, lld_rp_loss: -63.23201752
Saved model checkpoints into run_/models...
Validation loss: 32.39069652557373

run_VQ: further refinement of the above, adding VQ and CPC

Elapsed [0:04:19], Iteration [600/1000000], G/loss_id: 0.21996836, G/loss_id_psnt: 0.25334343, spk_loss: 4.64389563, content_adv_loss: 4.59541082, mi_cp_loss: 0.00000005, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000003, lld_cp_loss: -63.99991989, lld_rc_loss: -15.99912262, lld_rp_loss: -63.99991989, vq_loss: 0.21500304, cpc_loss: 2.32923222
[25.67204237 24.83198941 23.82392436 24.96639788 25.70564449 24.59677458]
...
Validation loss: 44.1766471862793
Elapsed [2 days, 21:25:56], Iteration [518300/1000000], G/loss_id: 0.00244417, G/loss_id_psnt: 0.00243542, spk_loss: 0.00113923, content_adv_loss: 4.59757185, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000069, mi_rp_loss: 0.00000080, lld_cp_loss: -63.96213531, lld_rc_loss: -15.99189568, lld_rp_loss: -63.96203232, vq_loss: 816.06317139, cpc_loss: 1.40695262
[58.23252797 57.29166865 56.85483813 55.94757795 53.49462628 53.93145084]
...
Elapsed [5 days, 12:36:53], Iteration [1000000/1000000], G/loss_id: 0.00149921, G/loss_id_psnt: 0.00148138, spk_loss: 0.00003648, content_adv_loss: 4.60524797, mi_cp_loss: -0.00000074, mi_rc_loss: 0.00000015, mi_rp_loss: 0.00000297, lld_cp_loss: -63.96846771, lld_rc_loss: -15.98954391, lld_rp_loss: -63.96820068, vq_loss: 3281.53906250, cpc_loss: 1.39505339
[65.49059153 62.63440847 62.02957034 60.18145084 58.77016187 58.36693645]
Saved model checkpoints into run_VQ/models...
Validation loss: 44.55451202392578

run_VQ_1: lambda_cd = 0.1 -> 1

Elapsed [0:11:13], Iteration [1500/1000000], G/loss_id: 0.20114005, G/loss_id_psnt: 0.20369399, spk_loss: 4.10042000, content_adv_loss: 4.61024761, mi_cp_loss: 0.00000031, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000008, lld_cp_loss: -63.99993896, lld_rc_loss: -15.99916458, lld_rp_loss: -63.99987411, vq_loss: 0.26906750, cpc_loss: 2.26129460
[33.16532373 33.36693645 33.8373661  31.72042966 31.01478517 31.85483813]
...
Validation loss: 49.884931564331055
Elapsed [2 days, 19:55:09], Iteration [500100/1000000], G/loss_id: 0.00222024, G/loss_id_psnt: 0.00221478, spk_loss: 0.00017085, content_adv_loss: 4.60785437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -64.00000000, lld_rc_loss: -15.99412155, lld_rp_loss: -63.99996185, vq_loss: 885.47814941, cpc_loss: 1.49846244
[48.7231195  47.81585932 46.77419364 45.93414068 47.27822542 45.69892585]

run_pitch: add a pitch decoder; lambda_cd = 1 -> 0.1; the pitch decoder's loss weight is set to 1

Elapsed [0:01:50], Iteration [200/1000000], G/loss_id: 0.03694459, G/loss_id_psnt: 0.88864940, spk_loss: 4.62199402, content_adv_loss: 4.66859436, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000005, mi_rp_loss: 0.00000007, lld_cp_loss: -63.99984741, lld_rc_loss: -15.99911499, lld_rp_loss: -63.99986267, vq_loss: 0.18332298, cpc_loss: 2.39174533, pitch_loss: 84147137801697099776.00000000
[16.83467776 16.70026928 17.37231165 17.13709682 17.40591377 19.52284873]
...
Elapsed [7:49:14], Iteration [53200/1000000], G/loss_id: 0.00311949, G/loss_id_psnt: 0.00310999, spk_loss: 0.22137184, content_adv_loss: 4.62207985, mi_cp_loss: -0.00000042, mi_rc_loss: 0.00000002, mi_rp_loss: -0.00000121, lld_cp_loss: -63.99859238, lld_rc_loss: -15.99206543, lld_rp_loss: -63.66162872, vq_loss: 6.28716087, cpc_loss: 1.54115570, pitch_loss: 82519535137146798080.00000000
[49.05914068 49.96639788 48.01747203 47.47983813 45.69892585 45.0268805 ]
...

After fixing the oversized loss:

Elapsed [0:00:56], Iteration [100/1000000], G/loss_id: 0.05138026, G/loss_id_psnt: 0.77242196, spk_loss: 0.68211454, content_adv_loss: 3.45159459, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -63.99995041, lld_rc_loss: -15.99927044, lld_rp_loss: -63.99993896, vq_loss: 0.17577030, cpc_loss: 2.39789486, pitch_loss: 0.03641313
[14.41532224 13.70967776 11.79435477 14.11290318 15.15457034 11.35752723]
...
Elapsed [1 day, 23:38:22], Iteration [353300/1000000], G/loss_id: 0.00200523, G/loss_id_psnt: 0.00200193, spk_loss: 0.00335806, content_adv_loss: 4.60966444, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000065, mi_rp_loss: -0.00271915, lld_cp_loss: -61.08781052, lld_rc_loss: -15.98555946, lld_rp_loss: -61.11469650, vq_loss: 361.41998291, cpc_loss: 1.45688534, pitch_loss: 0.00808701
[55.07392287 53.76344323 52.08333135 51.74731016 51.00806355 52.55376101]

run_pitch_: the pitch_loss above is absurdly large, even with the tanh() I added. It turns out the raw f0 values range from -10000000000.0 to 1.0, which seemed strange (why so small?), but when I checked with the following function:

import numpy as np

def speaker_normalization(f0, index_nonzero, mean_f0, std_f0):
    # f0 is log-f0; normalize voiced frames to [0, 1], leave unvoiced frames untouched
    f0 = f0.astype(float).copy()
    # index_nonzero = f0 != 0
    f0[index_nonzero] = (f0[index_nonzero] - mean_f0) / std_f0 / 4.0
    f0[index_nonzero] = np.clip(f0[index_nonzero], -1, 1)
    f0[index_nonzero] = (f0[index_nonzero] + 1) / 2.0
    return f0

path = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/raptf0/p226/p226_025_cat.npy"
# path = "/ceph/home/yangsc21/Python/VCTK/wav16/raptf0_100_crop_cat/p225/p225_cat.npy"
f0_rapt = np.load(path)
index_nonzero = (f0_rapt != -1e10)  # -1e10 marks unvoiced frames
mean_f0, std_f0 = np.mean(f0_rapt[index_nonzero]), np.std(f0_rapt[index_nonzero])
f0_norm = speaker_normalization(f0_rapt, index_nonzero, mean_f0, std_f0)
f0_norm[f0_norm == -1e10] = 0  # map remaining unvoiced sentinels to 0
print(f0_rapt, np.max(f0_rapt), np.min(f0_rapt), index_nonzero, mean_f0, std_f0, f0_norm, np.max(f0_norm), np.min(f0_norm), np.mean(f0_norm))

The output:

[-1.e+10 -1.e+10 -1.e+10 ... -1.e+10 -1.e+10 -1.e+10] 1.0 -10000000000.0 [False False False ... False False False] 0.500022 0.12482828 [0. 0. 0. ... 0. 0. 0.] 1.0 0.0 0.278805128421691

So apart from the -1e10 sentinels, all remaining values lie in [0, 1]; accordingly I pass the prediction through a Sigmoid() and set the weight coefficient to 0.1

Elapsed [0:00:23], Iteration [40/1000000], G/loss_id: 0.05261087, G/loss_id_psnt: 0.72422123, spk_loss: 0.69748354, content_adv_loss: 4.07323647, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000005, lld_cp_loss: -63.99515915, lld_rc_loss: -15.99876404, lld_rp_loss: -63.99857712, vq_loss: 0.12072643, cpc_loss: 2.39789486, pitch_loss: 0.05633526
[15.35618305 15.42338729 14.34811801 15.42338729 14.71774131 10.95430106]
...
Elapsed [1 day, 23:40:28], Iteration [317600/1000000], G/loss_id: 0.00184345, G/loss_id_psnt: 0.00184246, spk_loss: 0.12309902, content_adv_loss: 4.59243107, mi_cp_loss: 0.00000003, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000004, lld_cp_loss: -63.99990082, lld_rc_loss: -15.98676586, lld_rp_loss: -63.99086761, vq_loss: 322.91177368, cpc_loss: 1.37124848, pitch_loss: 0.00887575
[63.23924661 62.29838729 60.61828136 59.40859914 58.90457034 58.63575339]

run_pitch_2: no VQ+CPC, only MI + 0.1 * pitch decoder

Elapsed [0:04:26], Iteration [700/1000000], G/loss_id: 0.16231853, G/loss_id_psnt: 0.32284465, spk_loss: 4.62042952, content_adv_loss: 4.61721945, mi_cp_loss: 0.00780822, mi_rc_loss: 0.00016762, mi_rp_loss: 0.00188451, lld_cp_loss: -61.02302551, lld_rc_loss: -15.70524502, lld_rp_loss: -59.94433594, pitch_loss: 0.01591282
...
Elapsed [4 days, 23:25:32], Iteration [1000000/1000000], G/loss_id: 0.00075769, G/loss_id_psnt: 0.00075213, spk_loss: 0.00020621, content_adv_loss: 4.60287952, mi_cp_loss: 0.00002108, mi_rc_loss: 0.00001613, mi_rp_loss: 0.00000331, lld_cp_loss: -63.69352341, lld_rc_loss: -15.81671143, lld_rp_loss: -63.69023132, pitch_loss: 0.00293102
Saved model checkpoints into run_pitch_2/models...
Validation loss: 25.982803344726562

run_pitch_3: VQCPC after the Mel encoder (rhythm)

Elapsed [0:00:47], Iteration [100/1000000], G/loss_id: 0.03398866, G/loss_id_psnt: 0.81710476, spk_loss: 4.63411522, content_adv_loss: 4.56968164, mi_cp_loss: 0.00585751, mi_rc_loss: 0.00000011, mi_rp_loss: -0.00000144, lld_cp_loss: -61.00984192, lld_rc_loss: -15.42654037, lld_rp_loss: -59.25158691, vq_loss: 0.05246538, cpc_loss: 2.39724064, pitch_loss: 0.01885412
[40.2777791  45.13888955 50.69444776 56.25       63.19444776 67.36111045]
...
Elapsed [5 days, 0:38:57], Iteration [1000000/1000000], G/loss_id: 0.00155650, G/loss_id_psnt: 0.00154984, spk_loss: 0.00189495, content_adv_loss: 4.60772943, mi_cp_loss: 0.00018108, mi_rc_loss: 0.00001328, mi_rp_loss: 0.00000790, lld_cp_loss: -63.04440308, lld_rc_loss: -12.41486168, lld_rp_loss: -63.04750824, vq_loss: 0.12090041, cpc_loss: 1.45722890, pitch_loss: 0.01511321
[56.25       61.8055582  63.54166865 67.01388955 69.79166865 75.3472209 ]
Saved model checkpoints into run_pitch_3/models...
Validation loss: 1484.6753540039062

Ablation studies

w/o adv

Saved model checkpoints into run_pitch_wo_adv/models...
Validation loss: 57.68343734741211
Elapsed [4 days, 14:00:24], Iteration [800100/1000000], G/loss_id: 0.00046955, G/loss_id_psnt: 0.00046547, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000706, mi_rc_loss: 0.00001590, mi_rp_loss: -0.00017925, lld_cp_loss: -59.52406311, lld_rc_loss: -15.50933647, lld_rp_loss: -59.52644348, pitch_loss: 0.00289221

w/o MI

Elapsed [3 days, 20:09:18], Iteration [861400/1000000], G/loss_id: 0.00078604, G/loss_id_psnt: 0.00077555, spk_loss: 0.01067678, content_adv_loss: 4.61410141, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00468707
Elapsed [3 days, 20:10:00], Iteration [861500/1000000], G/loss_id: 0.00068698, G/loss_id_psnt: 0.00067951, spk_loss: 0.21367706, content_adv_loss: 4.60780621, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00338293

w/o pitch

Elapsed [3 days, 21:44:47], Iteration [807800/1000000], G/loss_id: 0.00086264, G/loss_id_psnt: 0.00085352, spk_loss: 0.00084184, content_adv_loss: 4.60911512, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00008330, mi_rp_loss: 0.00001434, lld_cp_loss: -62.65066147, lld_rc_loss: -14.44594097, lld_rp_loss: -62.67908859
Elapsed [3 days, 21:45:36], Iteration [807900/1000000], G/loss_id: 0.00084089, G/loss_id_psnt: 0.00082641, spk_loss: 0.00110251, content_adv_loss: 4.60943460, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00016344, mi_rp_loss: 0.00001484, lld_cp_loss: -62.94990158, lld_rc_loss: -14.53773308, lld_rp_loss: -62.95674896

w/o adv + MI

Elapsed [3 days, 17:36:39], Iteration [886500/1000000], G/loss_id: 0.00060688, G/loss_id_psnt: 0.00059975, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00623717

w/o adv + pitch

Elapsed [5 days, 4:59:55], Iteration [831100/1000000], G/loss_id: 0.00050043, G/loss_id_psnt: 0.00049772, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000898, mi_rp_loss: -0.00004322, lld_cp_loss: -63.60225677, lld_rc_loss: -15.57159233, lld_rp_loss: -63.61416626

w/o pitch + MI

Elapsed [4 days, 2:38:30], Iteration [818800/1000000], G/loss_id: 0.00079214, G/loss_id_psnt: 0.00078017, spk_loss: 0.00368636, content_adv_loss: 4.61357307, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000

w/o pitch + adv + MI

Elapsed [2 days, 19:02:29], Iteration [827100/1000000], G/loss_id: 0.00056297, G/loss_id_psnt: 0.00055678, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000

Dimensions doubled (2x)

Elapsed [6 days, 15:18:42], Iteration [1000000/1000000], G/loss_id: 0.00051162, G/loss_id_psnt: 0.00050539, spk_loss: 0.00400598, content_adv_loss: 4.59954834, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00137981, mi_rp_loss: 0.00045807, lld_cp_loss: -117.74221039, lld_rc_loss: -8.09668541, lld_rp_loss: -117.96425629, pitch_loss: 0.00297939
Saved model checkpoints into run_dim2_pitch/models...
Validation loss: 24.739643096923828

w/o adv

Elapsed [6 days, 17:55:16], Iteration [1000000/1000000], G/loss_id: 0.00193528, G/loss_id_psnt: 0.00193324, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00027657, mi_rp_loss: 0.00118168, lld_cp_loss: -36.59268570, lld_rc_loss: -31.09467697, lld_rp_loss: -41.84099579, pitch_loss: 0.00602757
Saved model checkpoints into run_dim2_pitch_wo_adv/models...
Validation loss: 61.93159866333008

w/o MI

Elapsed [4 days, 14:43:55], Iteration [874800/1000000], G/loss_id: 0.00076253, G/loss_id_psnt: 0.00075406, spk_loss: 0.00023900, content_adv_loss: 4.60489750, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00438170

w/o pitch

Elapsed [7 days, 12:10:08], Iteration [1000000/1000000], G/loss_id: 0.00056876, G/loss_id_psnt: 0.00055979, spk_loss: 0.00775459, content_adv_loss: 4.60920477, mi_cp_loss: -0.00001970, mi_rc_loss: -0.00001161, mi_rp_loss: 0.00000567, lld_cp_loss: -126.23571777, lld_rc_loss: -28.51069260, lld_rp_loss: -126.23281097
Saved model checkpoints into run_dim2_pitch_wo_pitch/models...
Validation loss: 23.904769897460938

w/o adv + MI

Elapsed [7 days, 2:04:21], Iteration [1000000/1000000], G/loss_id: 0.00036989, G/loss_id_psnt: 0.00036615, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00280974
Saved model checkpoints into run_dim2_pitch_wo_adv_mi/models...
Validation loss: 17.292850494384766

w/o adv + pitch

Elapsed [4 days, 14:56:20], Iteration [819400/1000000], G/loss_id: 0.00043808, G/loss_id_psnt: 0.00043352, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00003838, mi_rp_loss: -0.00066438, lld_cp_loss: -116.34136963, lld_rc_loss: -31.60812950, lld_rp_loss: -116.60922241

w/o pitch + MI

Elapsed [4 days, 1:50:09], Iteration [837000/1000000], G/loss_id: 0.00066237, G/loss_id_psnt: 0.00065387, spk_loss: 0.16599277, content_adv_loss: 4.60649395, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000

w/o pitch + adv + MI

Elapsed [3 days, 16:58:49], Iteration [817700/1000000], G/loss_id: 0.00039425, G/loss_id_psnt: 0.00039077, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000

9. Model

run: the models above convert timbre poorly, so here the speaker embedding is replaced with a one-hot vector.

Elapsed [0:00:42], Iteration [100/1000000], G/loss_id: 0.03971826, G/loss_id_psnt: 0.77077729, content_adv_loss: 4.60181856, mi_cp_loss: 0.01571883, mi_rc_loss: 0.00013684, mi_rp_loss: 0.00233091, lld_cp_loss: -61.88400269, lld_rc_loss: -15.71234703, lld_rp_loss: -58.84709930
...
Elapsed [2 days, 23:36:30], Iteration [664400/1000000], G/loss_id: 0.00082198, G/loss_id_psnt: 0.00081689, content_adv_loss: 4.60013008, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00009694, mi_rp_loss: -0.00047144, lld_cp_loss: -55.83151245, lld_rc_loss: -15.27785683, lld_rp_loss: -55.92704010
...
Elapsed [4 days, 12:08:59], Iteration [1000000/1000000], G/loss_id: 0.00..., G/loss_id_psnt: 0.00059563, content_adv_loss: 4.61593437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00007099, mi_rp_loss: -0.00033119, lld_cp_loss: -50.61760712, lld_rc_loss: -15.26475143, lld_rp_loss: -51.30073166
Saved model checkpoints into run/models...
Validation loss: 37.75163459777832

run_pitch:

Elapsed [0:00:51], Iteration [100/1000000], G/loss_id: 0.04715083, G/loss_id_psnt: 0.75423688, content_adv_loss: 4.61723232, mi_cp_loss: 0.01895704, mi_rc_loss: 0.00019252, mi_rp_loss: -0.00003131, lld_cp_loss: -60.86524582, lld_rc_loss: -15.71782684, lld_rp_loss: -57.30221176, pitch_loss: 0.03021768
...
Elapsed [2 days, 23:46:05], Iteration [598000/1000000], G/loss_id: 0.00082708, G/loss_id_psnt: 0.00082202, content_adv_loss: 4.60971642, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00007787, mi_rp_loss: 0.00014183, lld_cp_loss: -62.80820847, lld_rc_loss: -15.05598736, lld_rp_loss: -62.85067749, pitch_loss: 0.00637796

run_use_VQCPC: reached iteration 580000 before I had to give machine 13 back to someone else

Elapsed [3 days, 5:22:11], Iteration [580000/1000000], G/loss_id: 0.0027..., G/loss_id_psnt: 0.00274249, content_adv_loss: 4.59387255, mi_cp_loss: 0.00000014, mi_rc_loss: 0.00000099, mi_rp_loss: 0.00000026, lld_cp_loss: -63.99961853, lld_rc_loss: -15.95115471, lld_rp_loss: -63.99967575, vq_loss: 1067.20275879, cpc_loss: 1.43544137, pitch_loss: 0.01039015
[58.77016187 58.33333135 56.95564747 55.94757795 54.30107713 54.33467627]
Saved model checkpoints into run_use_VQCPC/models...
Validation loss: 81.5089225769043

run_use_VQCPC_2

Elapsed [4 days, 11:03:03], Iteration [812400/1000000], G/loss_id: 0.00247453, G/loss_id_psnt: 0.00247133, content_adv_loss: 4.60783482, mi_cp_loss: 0.00007319, mi_rc_loss: -0.00000709, mi_rp_loss: 0.00000088, lld_cp_loss: -63.97900009, lld_rc_loss: -15.18019390, lld_rp_loss: -63.95431519, vq_loss: 0.13141513, cpc_loss: 1.21822941, pitch_loss: 0.01545527
[64.23611045 63.88888955 64.23611045 63.54166865 63.54166865 65.2777791 ]

new: I changed the flow so the mel spectrogram first passes through G1 five times, and then through G1 and G2, which is somewhat slower

MI:

Elapsed [5 days, 20:36:16], Iteration [676100/1000000], G/loss_id: 0.00094016, G/loss_id_psnt: 0.00093268, content_adv_loss: 4.60860682, mi_cp_loss: -0.00006625, mi_rc_loss: 0.00011470, mi_rp_loss: 0.00001495, lld_cp_loss: -62.70265961, lld_rc_loss: -14.90209007, lld_rp_loss: -62.72354126

MI + pitch:

Elapsed [5 days, 20:37:14], Iteration [647300/1000000], G/loss_id: 0.00082176, G/loss_id_psnt: 0.00081603, content_adv_loss: 4.60966206, mi_cp_loss: 0.00002162, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00004071, lld_cp_loss: -63.53780365, lld_rc_loss: -15.10877800, lld_rp_loss: -63.54629135, pitch_loss: 0.00451364

References for result visualization

1. Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations

Objective evaluation: MCD, WER
Subjective evaluation: MOS

For English the metric should be WER: first run ASR with the ESPnet end-to-end speech processing toolkit (official github); WER computation can follow this github (a minimal WER sketch is given below).
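A self-contained WER sketch via word-level edit distance (substitutions, insertions, and deletions, divided by the reference length):

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

print(wer('please call stella', 'please call stela'))  # one substitution -> 0.333...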

The recognition result below comes out as 'SS'; to be fixed:

import json
import torch
import argparse
from espnet.bin.asr_recog import get_parser
from espnet.nets.pytorch_backend.e2e_asr_transformer import E2E
import os
import scipy.io.wavfile as wav
from python_speech_features import fbank

filename = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/test_wav/p225/p225_001.wav"
sample_rate, waveform = wav.read(filename)
fbank1, _ = fbank(waveform,samplerate=16000,winlen=0.025,winstep=0.01,
      nfilt=86,nfft=512,lowfreq=0,highfreq=None,preemph=0.97)

# print(fbank1[0].shape, fbank1[1].shape)     # (204, 86), (204, )


root = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/asr1/"
model_dir = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/exp/train_trim_sp_pytorch_nbpe500_ngpu8_train_pytorch_transformer.v2_specaug/results/"

# load model
with open(model_dir + "/model.json", "r") as f:
  idim, odim, conf = json.load(f)
model = E2E.build(idim, odim, **conf)
model.load_state_dict(torch.load(model_dir + "/model.last10.avg.best"))
model.cpu().eval()

# load token_list
token_list = conf['char_list']
print(token_list)
# recognize speech
parser = get_parser()
args = parser.parse_args(["--beam-size", "1", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""])

x = torch.as_tensor(fbank1).to(torch.float32)
result = model.recognize(x, args, token_list)

print(result)

s = "".join(conf["char_list"][y] for y in result[0]["yseq"]).replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")

print("prediction: ", s)

Giving up on raw espnet in favor of espnet_model_zoo (official github); the corpus and model names can be looked up in its model table. Here we do ASR with espnet2 based on the librispeech models (a minimal sketch follows).
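A minimal espnet_model_zoo sketch (the model tag is an assumption; substitute any LibriSpeech ASR entry from the model-zoo table):

import soundfile as sf
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
# downloads and unpacks the pretrained model; the tag below is a placeholder
speech2text = Speech2Text(**d.download_and_unpack('<librispeech ASR model tag>'))

speech, rate = sf.read('converted.wav')  # 16 kHz mono
text, tokens, token_ids, hyp = speech2text(speech)[0]
print(text)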

2. Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder

tSNE Visualization of speaker embedding space

Fig. 3 illustrates speaker embedding visualized by tSNE method, there are 30 utterances sampled for every speaker to calculate the speaker representation. According to the empirical results, we found that a chunk of 2 seconds is adequate to extract the speaker representation. As shown in Fig. 3, speaker embeddings are separable for different speakers. In contrast, the speaker embeddings of utterances of the same speaker are close to each other. As a result, our method is able to extract speaker-dependence information by using the encoder network.

p335 p264 p247 p278 p272 p262 (F, F, M, M, M, F): these speakers did not appear in training (a t-SNE sketch follows).
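A t-SNE plotting sketch for the speaker-embedding space (the file names are hypothetical; emb is (n_utts, d), labels holds one speaker id per utterance):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = np.load('spk_embs.npy')
labels = np.load('spk_labels.npy')
pts = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(emb)
for spk in np.unique(labels):
    m = labels == spk
    plt.scatter(pts[m, 0], pts[m, 1], s=8, label=str(spk))
plt.legend(fontsize=6)
plt.savefig('tsne_spk.png', dpi=200)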

3. Non-Parallel Many-To-Many Voice Conversion by Knowledge Transfer from a Text-To-Speech Model
Add text as an input?

4. Non-Parallel Many-To-Many Voice Conversion Using Local Linguistic Tokens
VQ-VAE

5. Fragmentvc: Any-To-Any Voice Conversion by End-To-End Extracting and Fusing Fine-Grained Voice Fragments with Attention
Ablation studies

6. Zero-Shot Voice Conversion with Adjusted Speaker Embeddings and Simple Acoustic Features
F0 distributions, subjective evaluations

7. Non-Autoregressive Sequence-To-Sequence Voice Conversion

root mean square error (RMSE) of log F0, and character error rate (CER)

8. fake speech detection

Towards fine-grained prosody control for voice conversion
