Fantasy Mix-Lingual Tacotron Version 1: Alibaba

0. Overview

A simple design and training plan for a Mandarin-English code-switching Tacotron TTS model on a mixed-language corpus.

The dataset lives on the jump server; the steps below get a copy so you can listen to some samples.

  1. First ask a colleague to copy the data from the jump server to the PAME server
  2. Find the path of the file you want to download, e.g. hujk17/cuncun/chinese/wavs/1.wav
  3.

0.1. Dataset

  1. The Chinese data is real recordings; the English data is virtual recordings constructed by some method (details TODO...)
  2. 8 kHz sampling rate; 8,000 Chinese utterances, 10,000 English utterances

0.2. Task Requirements

  1. Mandarin-English code-switched synthesis
  2. A single voice for both Chinese and English, with the English parts pronounced according to native English habits

0.3. Naming

Fantasy Mixed-Lingual Tacotron

 

1. A First Look at the Dataset

1.1. Chinese Recordings

1.2. English Recordings

1.3. Chinese Transcriptions

1.4. English Transcriptions

1.5. Other Dataset Statistics

Omitted, TODO...

2. Choosing Input Symbols for Chinese and English

2.1. IPA

Represent both Chinese and English in IPA. A language ID may also be added, either into the symbol string itself or broadcast onto the Text Encoding; whether to share the embedding table, and whether to share the encoder, are both open options.
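One concrete reading of "broadcast a language ID onto the Text Encoding" is to concatenate a one-hot language vector to every frame of the encoder output. A minimal numpy sketch; the function name, dimensions, and ID assignment are all illustrative, not from any of the papers:

```python
import numpy as np

def concat_language_id(text_encoding, lang_ids, num_langs=2):
    """Concatenate a one-hot language ID to each frame of a text encoding.

    text_encoding: (T, D) array of encoder outputs.
    lang_ids: length-T sequence of integer IDs (e.g. 0 = zh, 1 = en).
    """
    one_hot = np.eye(num_langs)[np.asarray(lang_ids)]         # (T, num_langs)
    return np.concatenate([text_encoding, one_hot], axis=-1)  # (T, D + num_langs)

enc = np.random.randn(5, 8)   # 5 symbols, 8-dim encoding
lids = [0, 0, 1, 1, 0]        # zh zh en en zh
out = concat_language_id(enc, lids)
print(out.shape)              # (5, 10)
```

An alternative is to add a learned language embedding instead of a one-hot vector; either way the language signal rides along every encoder frame.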

2.1.1. Chuan's 2019 Paper

Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

  1. From the paper: We use the epitran IPA library [24] to convert English and Chinese transcripts to International Phonetic Alphabet (IPA), which improves the pronunciation accuracy and unifies phonetic transcriptions of different languages.
  2. Haven't checked carefully yet whether anything beyond IPA (e.g. a language tag) is used, TODO...
  3. That paper involves cross-lingual transfer of reference speech, where IPA may make sharing easier. When code-switching is the only requirement, IPA is not strictly necessary, since it is not very accurate (at minimum, add language tags); to be decided.

[24] D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.

2.1.2. Google's 2019 Paper

Omitted, TODO...

2.2. Initials/Finals plus Phonemes (Mandarin Initials/Finals and English Phonemes)

Chinese uses initials and finals (shengmu/yunmu); English uses phonemes (strictly speaking also IPA, but "phoneme" is used here to refer to them). Some systems use only these two inventories; others follow Google and attach stress and tone to the symbols.

As shown in the figure below:
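The initial/final decomposition can be sketched with a toy splitter over tone-numbered Pinyin. The initial table here is abbreviated and purely illustrative; a real front end would use the full Mandarin inventory and handle edge cases properly:

```python
# Split a tone-numbered Pinyin syllable into (initial, final, tone).
# Longest initials ("zh", "ch", "sh") must be listed first.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    tone = syllable[-1] if syllable[-1].isdigit() else "5"  # 5 = neutral tone
    body = syllable.rstrip("012345")
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable, e.g. "ai4"

print(split_pinyin("zhong1"))  # ('zh', 'ong', '1')
print(split_pinyin("ai4"))     # ('', 'ai', '4')
```

This illustrates why initials/finals are a smaller, more regular unit set than whole Pinyin syllables.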

2.2.1. Prof. Lei Xie's 2019 Paper

Building a mixed-lingual neural TTS system with only monolingual data

  1. The simplest setup: separate embeddings per language, with no other tags added
  2. Also tried residual connections between the phoneme input and the Text Encoding, attention connections, etc., for more accurate pronunciation; worth trying

2.3. Pinyin plus Graphemes (Pinyin and Grapheme)

2.3.1. My SPE Reproduction

Note that the encoders here are completely independent: the embedding table is shared, but the encoders are separate.
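The "shared embedding, separate encoders" layout can be sketched as below. This is a toy numpy stand-in: real encoders are CBHG or convolutional stacks rather than single matrices, and every name and size here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID = 100, 16, 32

shared_embedding = rng.standard_normal((VOCAB, EMB))  # one table for both languages
encoder_zh = rng.standard_normal((EMB, HID))          # stand-in for the Chinese encoder
encoder_en = rng.standard_normal((EMB, HID))          # stand-in for the English encoder

def encode(token_ids, lang):
    x = shared_embedding[token_ids]                 # shared lookup, (T, EMB)
    w = encoder_zh if lang == "zh" else encoder_en  # language-specific encoder
    return np.tanh(x @ w)                           # (T, HID)

print(encode([3, 7, 9], "zh").shape)  # (3, 32)
```

The same token IDs produce different encodings depending on the language route, which is exactly the property the split-encoder design relies on.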

The trade-off between Pinyin and initials/finals can be judged from the autoregressive next-step-prediction angle: initials/finals are at least simpler units and depend less on context.

For the cross-lingual direction, PPGs could also replace mels in the autoregression, or the model could first autoregressively predict PPGs and then upgrade the PPGs to mels with the target timbre. That two-step scheme, first prosody plus accurate pronunciation, then timbre similarity, allows more detailed modeling.

Only then is it really clear what "timbre" does and does not refer to.

2.3.2. CUHK19

Since both Pinyin and English use alphabetic characters, the model needs to distinguish the phonetic sound of the same character from different languages. Although Mandarin is tonal and English is not, the model can learn partial linguistic knowledge from the Pinyin's encoded tone information. However, the default Tacotron is likely to synthesize speech with inconsistent voices with respect to different languages in CS text, or even fails to generate intelligible speech. In order to better model the differences between languages and explicitly model language alternation at local context, we augment the encoder with an explicit character-level language ID: LID and SPE.
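A character-level language ID of the kind described above can be approximated with a simple Unicode-range check. This is only a sketch; the CUHK paper's actual LID derivation may differ:

```python
def char_lid(text):
    # 0 = Mandarin (CJK Unified Ideographs block), 1 = English/other.
    return [0 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in text]

print(char_lid("我爱NLP"))  # [0, 0, 1, 1, 1]
```

The resulting per-character IDs can then be embedded and concatenated to (or summed with) the character embeddings before the encoder.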

2.4. Choice: Initials/Finals plus Phonemes

  1. Each language uses its own symbol set; the encoder is not split by language, and no explicit language distinction is added
  2. Watch whether the code-switching (CS) segments degrade easily; try adding an extra pause there

3. Choosing a Tacotron Version

TODO...

4. Choosing a Vocoder

  1. Parallel WaveGAN
  2. LPCNet
  3. GL (Griffin-Lim)

Stick with whatever everyone else uses; don't spend effort here.

5. Data Preprocessing Scripts

TODO...

6. Enumerating Possible Approaches to Mixed-Language Synthesis

There are actually a great many possible schemes. First pick one paper to reproduce strictly, get in sync with the company, and produce audio before anything else.

Whether speaking Chinese and speaking English should be treated as two different speakers can be enumerated and tested experimentally.

6.1. Fully Reproducing Alibaba's VC-then-TTS Paper

The whole model follows a single-speaker (personalized) design; no multi-speaker embedding concatenation is involved.

6.1.1. Bilingual and code-switched TTS models

The next step is to build bilingual and code-switched TTS using the bilingual corpora obtained from the above conversion step.

  1. For each speaker, we apply three different model architectures including Tacotron2, Transformer, and FastSpeech.
  2. Note that there are no code-switched utterances in the obtained bilingual corpora. The TTS models still need to learn code-switching from monolingual English and Mandarin utterances.

No speaker embedding; each speaker gets its own TTS model, and the bilingual data counts as one speaker (it could also be treated as two speakers; perhaps they worried that would expose the flaws of the virtual English data).

6.1.2. Input representation

Instead of using a unified phone set across languages, we

  1. combine English and Mandarin phone sets together as a whole. For English utterances, we use 44 British English phoneme symbols plus 3 possible stress symbols. For Mandarin utterances, we use 62 Pinyin initials and finals plus 5 possible tones.
  2. The tone or stress symbols are attached to the corresponding phoneme symbols.
  3. We also use symbols to indicate in-utterance pauses and utterance ends.
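Items 1-3 above amount to building one combined symbol inventory with the tone or stress attached to each phoneme. A sketch with abbreviated inventories (the full 44-phoneme and 62-initial/final tables from the paper are not reproduced here, and the symbol spellings are illustrative):

```python
# Abbreviated inventories for illustration only; the paper uses
# 44 British English phonemes + 3 stresses and 62 initials/finals + 5 tones.
EN_PHONES = ["AA", "AE", "AH"]        # ... 44 in total
EN_STRESS = ["0", "1", "2"]
ZH_PHONES = ["zh", "ong", "h", "ao"]  # ... 62 in total
ZH_TONES  = ["1", "2", "3", "4", "5"]
SPECIALS  = ["<pau>", "<eos>"]        # in-utterance pause, utterance end

def attach(phone, suprasegmental):
    # The tone or stress symbol is attached to its phoneme, forming one symbol.
    return phone + suprasegmental

symbols = SPECIALS \
    + [attach(p, s) for p in EN_PHONES for s in EN_STRESS] \
    + [attach(p, t) for p in ZH_PHONES for t in ZH_TONES]
print(len(symbols))  # 2 + 3*3 + 4*5 = 31
```

With the real inventories this would give 2 + 44*3 + 62*5 = 444 input symbols, all drawn from a single flat vocabulary rather than a unified cross-lingual phone set.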

6.1.3. Training TTS models

  1. Tacotron: We modify the model to predict 20-dim LPCNet features and use an LPCNet vocoder for waveform generation. Our implementation is based on open-source code: https://github.com/keithito/tacotron and https://github.com/mozilla/LPCNet
  2. We use the open-source ESPnet code to train the Transformer and FastSpeech models.
  3. In our experiments, we observe that the Tacotron2 TTS model trained only on the bilingual corpus produces more prosodic errors for code-switched text than the Transformer and the FastSpeech TTS models. We therefore use the Transformer TTS model to create a set of code-switched speech data for each of the two speakers. We then add these code-switched utterances into the bilingual training sets to refine the Tacotron2 TTS model. We find that such a data augmentation process can also benefit the Transformer and FastSpeech TTS models.

6.1.4. Experimental setup

  1. The English corpus is produced by a female native British English speaker. It has 27,000 utterances and the total length is about 41 hours.
  2. The Mandarin corpus is produced by a female native Mandarin speaker. It has 32,000 utterances and the total length is about 30 hours.
  3. We select 250 utterances for validation and 250 utterances for testing. All speech data are sampled at 16 kHz with 16-bit resolution.
  4. We also create a code-switched text corpus of 17,000 sentences by replacing selected English or Chinese words in monolingual sentences with their translated counterparts.
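The code-switched text construction in item 4 can be sketched as a word-substitution pass. The lexicon below is a made-up stand-in for a real translation dictionary, and a real pipeline would also need proper word segmentation:

```python
# Hypothetical zh->en lexicon for illustration only.
LEXICON = {"苹果": "apple", "电脑": "computer"}

def make_code_switched(words, lexicon, replace):
    # Replace the selected source words with their translated counterparts.
    return [lexicon.get(w, w) if w in replace else w for w in words]

sent = ["我", "的", "电脑", "坏", "了"]
print(make_code_switched(sent, LEXICON, {"电脑"}))
# ['我', '的', 'computer', '坏', '了']
```

Which words to select (content words, named entities, etc.) is a design choice the paper does not spell out here.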
