A TensorFlow Implementation of DC-TTS: yet another text-to-speech model

I implement yet another text-to-speech model, dc-tts, introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. My goal, however, is not just replicating the paper. Rather, I'd like to gain insights about various sound projects.

Requirements

NumPy >= 1.11.1

TensorFlow >= 1.3 (Note that the API of tf.contrib.layers.layer_norm has changed since 1.3)

librosa

tqdm

matplotlib

scipy

Data

(Dataset thumbnails omitted: LJ letters, Kate Winslet, Nick Offerman, and the Korean alphabet.)

I train English models and a Korean model on four different speech datasets.

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available and offers 24 hours of reasonable-quality samples. Nick's and Kate's audiobooks, 18 hours and 5 hours long respectively, are used in addition, to see whether the model can learn from less, and more variable, speech data. Finally, the KSS Dataset is a Korean single-speaker speech dataset of more than 12 hours.

Training

STEP 0. Download LJ Speech Dataset or prepare your own data.

STEP 1. Adjust hyperparameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)

STEP 2. Run python train.py 1 to train Text2Mel. (If you set prepro to True, run python prepro.py first.)

STEP 3. Run python train.py 2 to train SSRN.

You can run STEP 2 and STEP 3 at the same time if you have more than one GPU.

Training Curves

training_curves.png

Attention Plot

attention.gif

Sample Synthesis

I generate speech samples based on the Harvard Sentences, as the original paper does. They are already included in the repo.

Run synthesize.py and check the files in samples.

Generated Samples

Pretrained Model for LJ

Download this.

Notes

The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
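For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what layer normalization does per feature vector; this illustrates the math only, not the tf.contrib.layers.layer_norm implementation the repo actually uses:

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-8):
    """Shift a feature vector to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta).
    gamma/beta defaults are illustrative; in practice they are trained."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]
```

Unlike batch normalization, the statistics are computed over the features of a single example, so it behaves identically at training and inference time.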

The paper fixed the learning rate to 0.001, but it didn't work for me. So I decayed it.
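One common decay scheme for attention-based TTS models is the Noam schedule (linear warm-up followed by inverse-square-root decay). This is an illustrative sketch under that assumption, not necessarily the exact schedule used here; the warmup value is a typical placeholder:

```python
def decayed_lr(step, init_lr=0.001, warmup=4000):
    """Noam-style schedule: ramp up linearly for `warmup` steps,
    then decay proportionally to 1/sqrt(step).
    Peaks at init_lr when step == warmup."""
    step = max(step, 1)
    return init_lr * warmup ** 0.5 * min(step * warmup ** -1.5, step ** -0.5)
```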

I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.

The authors claimed that the model can be trained within a day, but unfortunately I was not so lucky. It is, however, clearly much faster to train than Tacotron, as it uses only convolutional layers.

Thanks to the guided attention loss, the attention plot looks monotonic almost from the beginning. It seems to keep the alignment tight so it doesn't lose track.
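The guided attention loss from the paper penalizes attention weights that fall far from the diagonal of the (text position, time step) matrix. A minimal sketch of the penalty matrix W, using the paper's formula W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):

```python
import math

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix: near zero on the diagonal (n/N close to t/T),
    approaching one far from it. The guided attention loss is the mean
    of W multiplied elementwise by the attention matrix, so off-diagonal
    attention is pushed toward zero."""
    return [[1.0 - math.exp(-((n / N - t / T) ** 2) / (2 * g * g))
             for t in range(T)] for n in range(N)]
```

Because the penalty is largest in the corners, the network is encouraged to attend monotonically from the first training steps, which matches the attention plots above.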

The paper didn't mention dropout. I applied it, as I believe it helps with regularization.
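For reference, inverted dropout (the variant TensorFlow implements) can be sketched in plain Python; the rate here is only a placeholder, not the value used in this repo:

```python
import random

def dropout(x, rate=0.05, training=True, rng=random):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale survivors by 1/(1 - rate) so the expected activation
    is unchanged; at inference time, pass inputs through untouched."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]
```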

Check also other TTS models such as Tacotron and Deep Voice 3.
