paperswithcode
https://paperswithcode.com/task/speech-synthesis
20 Amazing Deep Learning Applications (Demo+Paper+Code)
https://www.cnblogs.com/czaoth/p/6755609.html
An entertaining history of speech synthesis: How Speech Synthesizers Work
https://www.youtube.com/watch?v=XsMRxNSDccc
SANE2018 | Yu Zhang - Towards End-to-end Speech Synthesis
https://www.youtube.com/watch?v=tHAdlv7ThjA
Are there any industry-standard evaluation metrics for speech synthesis?
https://zhidao.baidu.com/question/1900363820532029580.html
Historical approaches to speech synthesis:
https://baike.baidu.com/item/语音合成/9790227?fr=aladdin
awesome-speech-recognition-speech-synthesis-papers
https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers
Deep learning approaches to speech synthesis
Deep learning-based synthesizers use deep neural networks (DNNs) trained on recorded speech data. Some DNN-based speech synthesizers are approaching the quality of the human voice. Examples include WaveNet by DeepMind, Tacotron by Google, and Deep Voice (which uses WaveNet technology) from Baidu.
WaveNet:
Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499. Bibcode:2016arXiv160903499V.
WaveNet is a deep neural network for generating raw audio.
It was created by researchers at London-based artificial intelligence firm DeepMind.
The technique, outlined in a paper in September 2016,[1] is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech.
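The WaveNet paper models the raw waveform autoregressively with a stack of dilated causal convolutions and gated activations, predicting a categorical distribution over the next (quantized) audio sample. Below is a minimal sketch of that idea, assuming PyTorch; the class names, channel count and dilation schedule are illustrative choices, not DeepMind's implementation:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution that only looks at past samples (left-pad, then crop the right)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__(channels, channels, kernel_size,
                         padding=(kernel_size - 1) * dilation,
                         dilation=dilation)

    def forward(self, x):
        out = super().forward(x)
        return out[..., :x.size(-1)]            # drop the "future" positions

class TinyWaveNet(nn.Module):
    """Sketch of a WaveNet-style stack: dilated causal convolutions with gated
    activations, producing logits over 256 quantized amplitude levels per step."""
    def __init__(self, channels=32, num_classes=256, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        self.filters = nn.ModuleList([CausalConv1d(channels, 2, d) for d in dilations])
        self.gates   = nn.ModuleList([CausalConv1d(channels, 2, d) for d in dilations])
        self.out = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):                        # x: (batch, 1, time), values in [-1, 1]
        h = self.embed(x)
        for f, g in zip(self.filters, self.gates):
            h = h + torch.tanh(f(h)) * torch.sigmoid(g(h))   # gated residual block
        return self.out(h)                       # (batch, 256, time) logits

logits = TinyWaveNet()(torch.randn(1, 1, 16000))  # one second of audio at 16 kHz
print(logits.shape)                               # torch.Size([1, 256, 16000])
```

Training would minimize cross-entropy between these per-sample logits and the quantized waveform shifted by one step; generation then samples one value at a time and feeds it back in.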
Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech.[2]
WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.[3]
Its ability to clone voices has raised ethical concerns about WaveNet being used to mimic the voices of living and dead persons.
According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to use watermarking inaudible to humans to prevent counterfeiting, while maintaining that voice cloning that satisfies, for instance, the needs of the entertainment industry would be of far lower complexity, and would use different methods, than what is required to fool forensic evidencing methods and electronic ID devices,
so that natural voices and voices cloned for entertainment-industry purposes could still be easily told apart by technological analysis.[4]
Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set to output values in a (countable) smaller set, often with a finite number of elements.
Rounding and truncation are typical examples of quantization processes.
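A minimal sketch of both, assuming NumPy; the step size of 0.25 and the sample values are arbitrary:

```python
import numpy as np

x = np.array([0.07, 0.23, -0.48, 0.91])   # arbitrary real-valued samples
step = 0.25                                # quantization step size (arbitrary choice)

rounded   = step * np.round(x / step)      # round each sample to the nearest level
truncated = step * np.trunc(x / step)      # truncate toward zero instead

print(rounded)    # [ 0.    0.25 -0.5   1.  ]
print(truncated)  # [ 0.    0.   -0.25  0.75]
```

Either way, infinitely many possible input values are mapped onto a small, countable set of output levels, which is exactly the definition above.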
Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding.
Quantization also forms the core of essentially all lossy compression algorithms.
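In the audio setting this connects back to WaveNet: the paper applies a μ-law companding transformation and quantizes each sample to 256 values, so the network only has to predict one of 256 classes per time step. A minimal sketch, assuming NumPy and samples already scaled to [-1, 1]:

```python
import numpy as np

def mu_law_quantize(x, mu=255):
    """Map samples in [-1, 1] to integers in {0, ..., mu} via mu-law companding."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)          # 256 discrete levels

def mu_law_expand(q, mu=255):
    """Approximate inverse: integers back to samples in [-1, 1]."""
    compressed = 2 * q.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 8))     # toy "waveform"
q = mu_law_quantize(x)
print(q)                                      # 8 integers in [0, 255]
print(np.max(np.abs(x - mu_law_expand(q))))   # small but nonzero reconstruction error
```

The nonzero reconstruction error is the "lossy" part: the 256 levels cannot represent every original amplitude, but the companding spends more levels on quiet samples, where the ear is most sensitive.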