Table of Contents
- Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis 2018
- Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron 2018
- HIERARCHICAL GENERATIVE MODELING FOR CONTROLLABLE SPEECH SYNTHESIS 2018
- Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis 20190404
- MULTI-REFERENCE NEURAL TTS STYLIZATION WITH ADVERSARIAL CYCLE CONSISTENCY 20191125
- MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS 20191126
- PROSODY TRANSFER IN NEURAL TEXT TO SPEECH USING GLOBAL PITCH AND LOUDNESS FEATURES 20191221
- USING VAES AND NORMALIZING FLOWS FOR ONE-SHOT TEXT-TO-SPEECH SYNTHESIS OF EXPRESSIVE SPEECH 20200217
- UNSUPERVISED STYLE AND CONTENT SEPARATION BY MINIMIZING MUTUAL INFORMATION FOR SPEECH SYNTHESIS 20200309
- Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis 2018
Adds emotion and speaker information as an extra embedding alongside the text embedding; the decoder afterwards is trained exactly the same way as before.
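Roughly how that looks in code: a minimal PyTorch sketch of attention over a bank of learnable style tokens, where a reference embedding queries the token bank and the weighted sum becomes the style embedding. Token count, dimensions and head count below are placeholders, not the paper's exact hyper-parameters.

```python
# Minimal sketch of GST-style token attention (PyTorch); token count,
# dimensions and head count are illustrative placeholders.
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    def __init__(self, ref_dim=128, token_dim=256, num_tokens=10, num_heads=4):
        super().__init__()
        # Bank of learnable "global style tokens".
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Multi-head attention: the reference embedding is the query,
        # the token bank supplies keys and values.
        self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=num_heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                         # (batch, ref_dim)
        query = self.query_proj(ref_embedding).unsqueeze(1)   # (batch, 1, token_dim)
        keys = self.tokens.unsqueeze(0).expand(ref_embedding.size(0), -1, -1)
        style_embedding, _ = self.attn(query, keys, keys)     # weighted sum of tokens
        return style_embedding.squeeze(1)                     # (batch, token_dim)
```

The resulting style embedding is then combined with the text encoder outputs, and the rest of the model trains unchanged, which matches the note above.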
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron 2018
Adds extra inputs at training time: a prosody reference input plus a speaker input plus the text input.
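The prosody input typically comes from a reference encoder that squeezes a reference mel spectrogram down to a single fixed vector; a rough sketch under that assumption (conv stack plus GRU, with illustrative layer sizes rather than the paper's exact configuration):

```python
# Rough sketch of a reference encoder: compress a reference mel spectrogram
# into one prosody embedding. Layer sizes are illustrative placeholders.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, out_dim=128):
        super().__init__()
        chans = [1, 32, 32, 64, 64]
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chans[i + 1]), nn.ReLU())
            for i in range(len(chans) - 1)
        ])
        # Four stride-2 convs shrink the mel axis by a factor of 16.
        self.gru = nn.GRU(chans[-1] * (n_mels // 16), out_dim, batch_first=True)

    def forward(self, mel):                       # (batch, time, n_mels)
        x = self.convs(mel.unsqueeze(1))          # (batch, C, time', mels')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, h = self.gru(x)                        # final hidden state summarizes the clip
        return h.squeeze(0)                       # (batch, out_dim) prosody embedding
```

This prosody embedding is concatenated, together with a speaker embedding, onto the text encoder outputs before decoding.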
HIERARCHICAL GENERATIVE MODELING FOR CONTROLLABLE SPEECH SYNTHESIS 2018
Introduces a variational autoencoder (VAE) to pull latent attributes out of noisy data. There are plenty of VAE explanations online; my simplest take is this: I have some observations X and I want to generate data similar to X. I assume there is some function F(Z) = X, and my goal is to work backwards from the observed X to the hidden Z and the form of F(Z). If that works, I can generate as many X-like samples as I want. Mapped onto speech: take the audio, find the hidden driver behind it, then use that driver to generate. That is where the controllability comes from, since I can manipulate the latent Z to control the generated X. The catch is that what each latent dimension ends up meaning cannot be predicted in advance, so there can be surprises.
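To make the F(Z) = X intuition concrete, here is a toy VAE in PyTorch; it is a generic sketch with placeholder dimensions, not the paper's hierarchical model.

```python
# Toy VAE sketch: infer a latent Z behind X, then decode Z back to X.
# Generic illustration, not the paper's architecture; dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=80, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)    # predicts mean and log-variance of q(Z|X)
        self.dec = nn.Linear(z_dim, x_dim)        # the "F(Z)" mapping latents back to X

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_hat = self.dec(z)
        recon = F.mse_loss(x_hat, x)                               # make F(Z) look like X
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # keep q(Z|X) near the prior
        return recon + kl, z
```

At synthesis time you can sample or edit z to steer the output, which is exactly the controllability argument above.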
Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis 20190404
A paper from Baidu building on GST. Where the input used to be text only, it now also takes reference audio information and uses multi-head attention, which works even better.
Style is controlled by three factors: speaker, emotion, and prosody. There are some three hundred different speakers; emotions such as happiness, anger, and sadness; and different prosodic styles such as news, storytelling, and broadcast.
MULTI-REFERENCE NEURAL TTS STYLIZATION WITH ADVERSARIAL CYCLE CONSISTENCY 20191125
Embeds two reference audios (audio 1 and audio 2) at the same time and crosses them over, which gives stronger stylization.
MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS 20191126
Works on standard data and can even synthesize singing. One set of explicit variables: text, speaker id, and pitch contour; one set of latent variables: rhythm and GST.
The pitch contour is obtained with either Alain de Cheveigné and Hideki Kawahara, “YIN, a fundamental frequency estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002, or Justin Salamon and Emilia Gómez, “Melody extraction from polyphonic music signals using pitch contour characteristics,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012.
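As an aside, one practical way to extract such an F0 contour is librosa's implementation of YIN; the file name, frequency bounds and frame settings below are placeholder choices, not anything specified by the paper.

```python
# Frame-level pitch (F0) contour with librosa's YIN implementation;
# fmin/fmax are placeholder bounds for speech, "reference.wav" is hypothetical.
import librosa

y, sr = librosa.load("reference.wav", sr=22050)
f0 = librosa.yin(y, fmin=65, fmax=600, sr=sr,
                 frame_length=1024, hop_length=256)   # one F0 value per frame, in Hz
```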
PROSODY TRANSFER IN NEURAL TEXT TO SPEECH USING GLOBAL PITCH AND LOUDNESS FEATURES 20191221
Transfers the prosody of a reference audio onto the synthesized audio using the pitch contour and the RMS energy curve, i.e. fundamental frequency (F0) and energy (RMS).
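A short sketch of computing the RMS loudness curve with librosa (the F0 contour can be extracted as in the YIN snippet above); file name and frame settings are placeholders.

```python
# Per-frame RMS energy curve of a reference clip; settings are placeholders.
import librosa

y, sr = librosa.load("reference.wav", sr=22050)                         # hypothetical file
rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]    # loudness per frame
```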
USING VAES AND NORMALIZING FLOWS FOR ONE-SHOT TEXT-TO-SPEECH SYNTHESIS OF EXPRESSIVE SPEECH 20200217
Variational autoencoder plus a Householder flow.
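For intuition, a single Householder flow step reflects the VAE's latent sample about a learned vector v, i.e. z' = z - 2*v*(v^T z)/||v||^2. A minimal PyTorch sketch with placeholder dimensions; predicting v from an encoder hidden state h is an assumption about the wiring, not the paper's exact design.

```python
# One Householder flow step: reflect the latent z about a learned vector v.
import torch
import torch.nn as nn

class HouseholderStep(nn.Module):
    def __init__(self, h_dim, z_dim):
        super().__init__()
        self.to_v = nn.Linear(h_dim, z_dim)   # v is predicted from the encoder hidden state h

    def forward(self, z, h):
        v = self.to_v(h)                                   # (batch, z_dim)
        vtz = (v * z).sum(dim=-1, keepdim=True)            # v^T z
        return z - 2.0 * v * vtz / (v.pow(2).sum(-1, keepdim=True) + 1e-8)
```

Since a Householder reflection has determinant -1, the Jacobian's |det| is 1, so stacking these steps enriches the posterior without adding a log-determinant term to the ELBO.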
UNSUPERVISED STYLE AND CONTENT SEPARATION BY MINIMIZING MUTUAL INFORMATION FOR SPEECH SYNTHESIS 20200309
Achieves a cleaner separation of text (content) and style by minimizing the mutual information between the two representations.
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
The paper proposes three components to address the problem: (1) formulating a conditional generative model with factorized latent variables; (2) using data augmentation to add noise that is uncorrelated with speaker identity and whose label is known during training; and (3) using adversarial factorization to improve disentanglement.
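A common way to implement the adversarial factorization idea is domain-adversarial training with a gradient reversal layer: an auxiliary classifier tries to predict the noise/augmentation label from the speaker embedding, while the reversed gradient pushes the embedding to discard that information. A minimal sketch under that assumption (dimensions and label set are placeholders), not necessarily the paper's exact setup.

```python
# Gradient reversal layer plus an adversarial noise classifier (PyTorch sketch).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Adversarial branch: predict the noise condition from the speaker embedding.
# Because of the reversed gradient, the embedding is trained to *remove* noise
# information while the classifier tries to recover it.
noise_classifier = nn.Linear(128, 2)                        # 128-dim embedding, 2 noise conditions (placeholders)
speaker_embedding = torch.randn(8, 128, requires_grad=True) # stand-in for the learned embedding
noise_labels = torch.randint(0, 2, (8,))
logits = noise_classifier(grad_reverse(speaker_embedding))
adv_loss = nn.functional.cross_entropy(logits, noise_labels)
adv_loss.backward()
```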