1. Purpose and Significance of the Practice
1.1. Background and Significance
Code-switching is a common phenomenon in multilingual societies around the world. The latest speech synthesis systems can generate monolingual speech with high intelligibility and naturalness. However, they cannot fully handle code-switched text, which can lead to missing or incorrect pronunciations in the synthesized output. Building a code-switch TTS from bilingual recordings of bilingual speakers is straightforward; in reality, however, obtaining large amounts of such bilingual data is expensive. We therefore explore cross-lingual TTS: given a source speaker's corpus in the target language and the target speaker's recordings in the source language, generate speech of the target speaker in the target language.
1.2. Existing Approaches and Their Shortcomings
Several papers have attempted cross-lingual TTS. They can generate expressive speech, but may produce a wrong accent because speaker and language information are not completely disentangled. Moreover, synthesis quality varies with the input text and the speaker, which is also a major problem for commercial cross-lingual TTS. Apple studies the characteristics of the speaker embedding in cross-lingual TTS: by adjusting the small difference between the same speaker's embeddings in different languages, it achieves better timbre similarity and speech naturalness. In voice cloning and voice conversion, more attention is paid to modeling timbre. CUHK papers disentangle the linguistic content and the timbre in speech; for unseen speakers, these methods can also reproduce the target timbre by modeling the speaker embedding from a reference utterance, and they can implement cross-lingual TTS by referring to speech in different languages. However, these methods are not optimized for the cross-lingual TTS task: because the text languages differ and the speaker-verification (SV) module does not generalize across languages, the speaker similarity of the synthesized speech suffers.
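The embedding-adjustment idea above can be sketched as follows. This is a minimal illustration, not Apple's actual method: it assumes we already have two embeddings of the same speaker, one extracted from speech in each language, and simply interpolates between them and renormalizes to shrink the language-induced gap. The function name and the interpolation weight `alpha` are hypothetical.

```python
import math

def adjust_speaker_embedding(emb_lang_a, emb_lang_b, alpha=0.5):
    """Blend two embeddings of the SAME speaker, each extracted from
    speech in a different language, to reduce the language-induced
    difference between them (illustrative sketch only).

    alpha=0 keeps the language-A embedding, alpha=1 the language-B one.
    """
    # Linear interpolation, dimension by dimension.
    mixed = [(1 - alpha) * a + alpha * b
             for a, b in zip(emb_lang_a, emb_lang_b)]
    # Renormalize to unit length, since speaker embeddings are
    # typically compared with cosine similarity.
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]

# Toy 2-D example: the midpoint of two unit vectors, renormalized.
emb = adjust_speaker_embedding([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

In a real system the embeddings would come from an SV model and the adjustment could be learned rather than a fixed interpolation; the point here is only that the same speaker's embeddings from different languages are pulled closer together before conditioning the TTS model.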