Getting LJSpeech running first
1. On v100-monster the LJ data lives outside the repo; change the basedir in datasets/LJ/prepare accordingly.
2. Path and module issues:
import sys
import os
#print(sys.path)
#print(os.getcwd())
sys.path.append(os.getcwd())
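The same fix can be made slightly more robust, a minimal sketch of the idea above:

```python
import os
import sys

# Same idea as sys.path.append(os.getcwd()), but idempotent, and inserted
# first so repo-local modules shadow any installed packages of the same name.
cwd = os.getcwd()
if cwd not in sys.path:
    sys.path.insert(0, cwd)
```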
3. Pay attention to the '|' normalization in LJSpeech; check whether it was handled incorrectly in the existing experiments.
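For reference, each line of LJSpeech's metadata.csv is '|'-separated into an id, the raw transcript, and the normalized transcript; the sample line below is truncated for illustration:

```python
# One (truncated) metadata.csv line: id|raw transcript|normalized transcript.
line = "LJ001-0001|Printing, in the only sense|Printing, in the only sense"
audio_id, raw_text, normalized_text = line.strip().split("|")
assert audio_id == "LJ001-0001"
```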
4. The generated JSON files end up alongside the LJSpeech data.
5. generate.py also needs the sys.path.append() fix.
6. hparams.py:
A few things here are off:
line 169 is actually in generate.py.
hparams.skip_inadequate:
not sure what this does:
import tensorflow as tf

SCALE_FACTOR = 1

def f(num):
    # Uniformly scales layer widths; with SCALE_FACTOR = 1 this is a no-op.
    return num // SCALE_FACTOR
cleaner:
basic_params = {
# Comma-separated list of cleaners to run on text prior to training and eval. For non-English
# text, you may want to use "basic_cleaners" or "transliteration_cleaners" See TRAINING_DATA.md.
'cleaners': 'english_cleaners', #originally korean_cleaners
}
basic_params.update({
# Audio
'num_mels': 80,
'num_freq': 1025,
'sample_rate': 24000, # trained at 20000 but needs to be 24000
'frame_length_ms': 50,
'frame_shift_ms': 12.5,
'preemphasis': 0.97,
'min_level_db': -100,
'ref_level_db': 20,
})
if True:
basic_params.update({
'sample_rate': 22050, #originally 24000 (krbook), 22050(lj-data), 20000(others)
})
basic_params.update({
# Model
'model_type': 'single', # [single, simple, deepvoice]
'speaker_embedding_size': f(16),
'embedding_size': f(256),
'dropout_prob': 0.5,
# Encoder
'enc_prenet_sizes': [f(256), f(128)],
'enc_bank_size': 16,
'enc_bank_channel_size': f(128),
'enc_maxpool_width': 2,
'enc_highway_depth': 4,
'enc_rnn_size': f(128),
'enc_proj_sizes': [f(128), f(128)],
'enc_proj_width': 3,
# Attention
'attention_type': 'bah_mon', # ntm2-5
'attention_size': f(256),
'attention_state_size': f(256),
# Decoder recurrent network
'dec_layer_num': 2,
'dec_rnn_size': f(256),
# Decoder
'dec_prenet_sizes': [f(256), f(128)],
'post_bank_size': 8,
'post_bank_channel_size': f(256),
'post_maxpool_width': 2,
'post_highway_depth': 4,
'post_rnn_size': f(128),
'post_proj_sizes': [f(256), 80], # num_mels=80
'post_proj_width': 3,
'reduction_factor': 4,
})
if False: # Deep Voice 2 AudioBook Dataset
basic_params.update({
'dropout_prob': 0.8,
'attention_size': f(512),
'dec_prenet_sizes': [f(256), f(128), f(64)],
'post_bank_channel_size': f(512),
'post_rnn_size': f(256),
'reduction_factor': 5, # changed from 4
})
elif False: # Deep Voice 2 VCTK dataset
basic_params.update({
'dropout_prob': 0.8,
#'attention_size': f(512),
#'dec_prenet_sizes': [f(256), f(128)],
#'post_bank_channel_size': f(512),
'post_rnn_size': f(256),
'reduction_factor': 5,
})
elif True: # Single Speaker
basic_params.update({
'dropout_prob': 0.5,
'attention_size': f(128),
'post_bank_channel_size': f(128),
#'post_rnn_size': f(128),
'reduction_factor': 5, # changed from 4
})
elif False: # Single Speaker with generalization
basic_params.update({
'dropout_prob': 0.8,
'attention_size': f(256),
'dec_prenet_sizes': [f(256), f(128), f(64)],
'post_bank_channel_size': f(128),
'post_rnn_size': f(128),
'reduction_factor': 4,
})
basic_params.update({
# Training
'batch_size': 32,
'adam_beta1': 0.9,
'adam_beta2': 0.999,
'use_fixed_test_inputs': False,
'initial_learning_rate': 0.001,
'decay_learning_rate_mode': 0, # True in deepvoice2 paper
'initial_data_greedy': True,
'initial_phase_step': 8000,
'main_data_greedy_factor': 0,
'main_data': [''],
'prioritize_loss': False,
'recognition_loss_coeff': 0.2,
'ignore_recognition_level': 0, # 0: use all, 1: ignore only unmatched_alignment, 2: fully ignore recognition
# Eval
'min_tokens': 50, # originally 50; 30 is good for Korean
'min_iters': 30,
'max_iters': 200,
'skip_inadequate': False,
'griffin_lim_iters': 60,
'power': 1.5, # Power to raise magnitudes to prior to Griffin-Lim
})
# Default hyperparameters:
hparams = tf.contrib.training.HParams(**basic_params)

def hparams_debug_string():
    values = hparams.values()
    hp = ['  %s: %s' % (name, values[name]) for name in sorted(values)]
    return 'Hyperparameters:\n' + '\n'.join(hp)
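As a sanity check on the audio hparams above, the STFT sizes derived from them can be computed directly (a sketch, not the repo's code; note that at 22050 Hz the 12.5 ms shift no longer gives a round number of samples):

```python
# Derive hop/window/FFT sizes from the audio hparams (illustrative).
sample_rate = 22050        # after the override above; originally 24000
frame_shift_ms = 12.5
frame_length_ms = 50
num_freq = 1025

hop_length = int(frame_shift_ms / 1000 * sample_rate)   # 275 (275.625 truncated)
win_length = int(frame_length_ms / 1000 * sample_rate)  # 1102
n_fft = (num_freq - 1) * 2                              # 2048

assert (hop_length, win_length, n_fft) == (275, 1102, 2048)
```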
7. from tensorflow.contrib.data.python.util import nest => from tensorflow.python.data.util import nest
(the module moved between TF 1.3 and 1.4)
8. There is an attention-related incompatibility that is hard to fix.
Went back and forth on it; finally settled on two principles: 1. follow the official docs when porting, rather than improvising from my own reading of the logic; 2. there is a version on GitHub that has already been ported (the whole codebase adapted to TF 1.14), so just diff against it and copy the changes.
9. Still too hard to port; easier to use conda to build a CUDA 8.0 + TF 1.3 environment, which will be useful later anyway.
http://ask.ainoob.cn/article/5201
https://blog.csdn.net/H_O_W_E/article/details/77370456
https://medium.com/@yckim/tensorflow-1-3-install-on-ubuntu-16-04-2d191a6e5546
First: conda create -n tf1.3-cuda8 pip python=3.6
Then: conda install cudnn=6
(this pulls in cuda 8.0 by default)
Then: pip install tensorflow-gpu==1.3.0
Then: pip install -r requirements.txt
Finally: python train.py --data_path=/home/ec2-user/data_LJSpeech-1.1/
10. Try synthesizing something.
11. Prepare the Biaobei (标贝) data and train on both datasets together.
Change model_type from single to deepvoice.
Set the cleaner to mix:
python train.py --data_path=../data_LJSpeech-1.1,./Biao-Bei
To continue from an existing run,
first change the cleaner and model type, then:
python train.py --data_path=../data_LJSpeech-1.1,./Biao-Bei --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-13_13-00-51
Kill a tmux session:
tmux kill-session -t <name-of-my-session>
Leave the conda environment:
source deactivate
List the packages in the current environment — check both, since they can differ:
pip list
conda list
12. Rough testing.
Modify synthesizer.py:
parser = argparse.ArgumentParser()
parser.add_argument('--load_path', required=True)
parser.add_argument('--sample_path', default="samples")
parser.add_argument('--text', required=True)
parser.add_argument('--num_speakers', default=2, type=int)
parser.add_argument('--speaker_id', type=int, required=True)
parser.add_argument('--checkpoint_step', default=None, type=int)
parser.add_argument('--is_korean', default=False, type=str2bool)
config = parser.parse_args()
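str2bool in the snippet above comes from the repo's utilities; a typical equivalent (an assumption, not the repo's exact code) looks like this:

```python
import argparse

def str2bool(v):
    # Parse common true/false spellings for argparse flags.
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")

assert str2bool("True") and not str2bool("0")
```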
The synthesis commands become:
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-13_13-00-51 --text="I can speak slower without degrading the quality of my voice." --speaker_id=0
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-13_13-00-51 --text="kao3 shi4 kao3 de2 hao3 quan2 kao4 tong2 zhuo1 hao3" --speaker_id=1
At ~50k steps it can synthesize English but still not Chinese; will look again tomorrow. In the training log, Chinese is synthesized fine at train time (teacher-forced) but fails at test time; in free-running, non-teacher-forced synthesis, Chinese does not work.
Checked again the next day:
still cannot synthesize Chinese, which is odd.
Moving on for now, while keeping the training running; since English works, this architecture may simply be hard to train on Chinese.
Also kicking off a Biaobei-only run today:
set hparams model_type to single,
still using the mix cleaner:
python train.py --data_path=./Biao-Bei
For synthesis:
python synthesizer.py --load_path logs/Biao-Bei_2019-11-14_13-19-37 --text="kao3 shi4 kao3 de2 hao3 quan2 kao4 tong2 zhuo1 hao3" --num_speakers=1 --speaker_id=0
Also kicking off a simple run today:
set hparams model_type to simple,
using the mix cleaner:
python train.py --data_path=../data_LJSpeech-1.1,./Biao-Bei
To continue from the previous run after switching to simple and the mix cleaner
(not run yet due to network problems):
python train.py --data_path=../data_LJSpeech-1.1,./Biao-Bei --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-14_13-27-02
For testing, first switch hparams to simple,
then:
The Chinese speaker speaking Chinese:
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-14_13-27-02 --text="kao3 shi4 kao3 de2 hao3 quan2 kao4 tong2 zhuo1 hao3" --speaker_id=1
The Chinese speaker speaking English:
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-14_13-27-02 --text="It would appear that a speech made at the weekend by mr fischler indicates a change of his position." --speaker_id=1
The English speaker speaking English:
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-14_13-27-02 --text="It would appear that a speech made at the weekend by mr fischler indicates a change of his position." --speaker_id=0
The English speaker speaking Chinese:
python synthesizer.py --load_path logs/data_LJSpeech-1.1+Biao-Bei_2019-11-14_13-27-02 --text="kao3 shi4 kao3 de2 hao3 quan2 kao4 tong2 zhuo1 hao3" --speaker_id=0
The machine keeps running out of disk space; check usage with:
du -h --max-depth=1
df -h
Mini-conclusion: the simple model does seem able to transfer across languages! Not very natural, but intelligibility is acceptable; needs further testing.
Brute-force reproduction of the code-switched architecture
First, save a git checkpoint of the current state: single runs correctly, but Chinese single only accepts pinyin input and the pausing is poor; simple runs, and Chinese and English can each be synthesized on their own, though cross-language synthesis is untested; the deepvoice version cannot synthesize Chinese, which remains strange.
The LDE architecture
hparams.py
'cleaners': 'mix_cleaners',
'model_type': 'code-switch-lde', # [single, simple, deepvoice, code-switch-lde],
'language_dim': 32,
'language_FC_units': 64,
'language_num': 2,
# 'speaker_embedding_size': f(16),
'speaker_embedding_size': f(32),
'speaker_fc_size': f(64),
# reduction_factor changed to 2, as in the Tacotron paper.
# there may be other parameters that still need updating
Tacotron.py
The plan for the speaker id:
create an embedding lookup table, then proceed as in simple, but with one extra FC layer.
At the attention RNN, called DecoderPrenetWrapper in the code,
do not inject speaker information at this point; branch with an if.
At the final:
ConcatOutputAndAttentionWrapper
do the same as simple, but pass through an FC before concatenating.
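The speaker-conditioning plan above (lookup table, then an extra FC, then concat) can be sketched shape-wise in NumPy; sizes follow the hparams (f(32)/f(64) with SCALE_FACTOR = 1), and all names here are illustrative rather than the repo's:

```python
import numpy as np

B, T, D = 2, 10, 256                      # batch, decoder steps, decoder output dim
num_speakers, emb_size, fc_size = 2, 32, 64

table = np.random.randn(num_speakers, emb_size)   # speaker embedding lookup table
W = np.random.randn(emb_size, fc_size)            # the extra FC layer's weights

speaker_ids = np.array([0, 1])
spk = np.tanh(table[speaker_ids] @ W)             # lookup + FC: [B, fc_size]
spk = np.repeat(spk[:, None, :], T, axis=1)       # broadcast over time: [B, T, fc_size]

decoder_out = np.random.randn(B, T, D)
conditioned = np.concatenate([decoder_out, spk], axis=-1)
assert conditioned.shape == (B, T, D + fc_size)   # (2, 10, 320)
```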
About the language sequence:
in this version of the CBHG, only the final RNN applies a mask.
Focus on the CBHG changes; they differ both from the paper and from my previous version, and involve modules.py.
For the training data flow specifically:
datafeeder.py, train.py
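The masking mentioned for the CBHG's final RNN amounts to zeroing outputs past each utterance's true length; a minimal NumPy sketch (illustrative, not the repo's code):

```python
import numpy as np

lengths = np.array([3, 5])                # true lengths of two sequences
T, D = 5, 4                               # padded length, feature dim

mask = np.arange(T)[None, :] < lengths[:, None]   # [B, T] boolean mask
outputs = np.ones((2, T, D))
masked = outputs * mask[:, :, None]               # zero out padding positions

assert masked[0, 3:].sum() == 0           # padding positions zeroed
assert masked[1].sum() == T * D           # full-length sequence untouched
```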
Parts I don't understand yet:
my guess is the attention RNN, i.e. these:
'attention_size': f(256),
'attention_state_size': f(256),
Training
source activate tf1.3-cuda8
python train.py --data_path=../data_LJSpeech-1.1,./Biao-Bei
Noticed a problem in the data processing:
############# data_dirs: ../data_LJSpeech-1.1/data
filter_by_min_max_frame_batch: 100%|██████████| 13100/13100 [00:01<00:00, 8832.50it/s]
[../data_LJSpeech-1.1/data] Loaded metadata for 2163 examples (2.47 hours)
[../data_LJSpeech-1.1/data] Max length: 398
[../data_LJSpeech-1.1/data] Min length: 195
filter_by_min_max_frame_batch: 100%|██████████| 10000/10000 [00:00<00:00, 11307.50it/s]
[./Biao-Bei/data] Loaded metadata for 5490 examples (5.90 hours)
[./Biao-Bei/data] Max length: 398
[./Biao-Bei/data] Min length: 193
========================================
{'../data_LJSpeech-1.1/data': 0.5, './Biao-Bei/data': 0.5}
========================================
(the same loading log is then printed a second time)
This is caused by reduction_factor; it has to be changed consistently everywhere.
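One way to see why the filter drops so many clips (2163 of 13100 for LJSpeech above): the decoder-iteration count for a clip is ceil(frames / reduction_factor), and filter_by_min_max_frame_batch keeps only clips whose count lies within [min_iters, max_iters], so changing reduction_factor moves the window. The formula below is my reading of the behavior, not the repo's verified code:

```python
import math

def keep(num_frames, reduction_factor, min_iters=30, max_iters=200):
    # Clip survives if its decoder-iteration count is inside the window.
    iters = math.ceil(num_frames / reduction_factor)
    return min_iters <= iters <= max_iters

assert keep(398, 2)       # ceil(398/2) = 199, inside [30, 200]
assert not keep(500, 2)   # 250 iterations, above max_iters
assert not keep(398, 20)  # 20 iterations, below min_iters
```

This would also explain the Max length of 398 in the log: 200 iterations x reduction_factor 2 gives a 400-frame ceiling.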