1. Purpose and Significance of the Practice
1.1. Background and Significance
Code-switching is a common phenomenon in multilingual societies around the world. The latest speech synthesis systems can generate monolingual speech with high intelligibility and naturalness. However, they cannot fully handle code-switched text, which can lead to missing or incorrect pronunciations in the synthesized output. Building a code-switch TTS from bilingual recordings of bilingual speakers is straightforward; in reality, however, obtaining large amounts of such bilingual data is expensive. We therefore explore cross-lingual TTS: given a source speaker's corpus in the target language and the target speaker's recordings in the source language, generate speech of the target speaker in the target language.
1.2. Existing Approaches and Their Shortcomings
Several papers have attempted cross-lingual TTS. They can generate expressive speech, but may produce a wrong accent because speaker and language information are not completely disentangled. Moreover, synthesis quality varies with the input text and the speaker, which is also a major problem for commercial cross-lingual TTS. Apple studies the characteristics of the speaker embedding in cross-lingual TTS: by adjusting the small difference between the same speaker's embeddings in different languages, it achieves better timbre similarity and speech naturalness. In voice cloning and voice conversion, more attention is paid to modeling timbre. CUHK papers disentangle the linguistic content and the timbre in speech; for unseen speakers, these methods can also reproduce the target timbre by modeling the speaker embedding from a reference utterance, and they can implement cross-lingual TTS by referring to speech in different languages. However, these methods are not optimized for the cross-lingual TTS task: because the text languages differ and the speaker-verification (SV) module does not generalize across languages, the speaker similarity of the synthesized speech suffers.
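The embedding-adjustment idea above can be sketched as follows. This is a minimal illustration, not Apple's actual method: it assumes we already have two embeddings of the same speaker, one extracted from speech in each language, and simply interpolates between them and renormalizes to shrink the language-induced gap. The function name and the interpolation weight `alpha` are hypothetical.

```python
import math

def adjust_speaker_embedding(emb_lang_a, emb_lang_b, alpha=0.5):
    """Blend two embeddings of the SAME speaker, each extracted from
    speech in a different language, to reduce the language-induced
    difference between them (illustrative sketch only).

    alpha=0 keeps the language-A embedding, alpha=1 the language-B one.
    """
    # Linear interpolation, dimension by dimension.
    mixed = [(1 - alpha) * a + alpha * b
             for a, b in zip(emb_lang_a, emb_lang_b)]
    # Renormalize to unit length, since speaker embeddings are
    # typically compared with cosine similarity.
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]

# Toy 2-D example: the midpoint of two unit vectors, renormalized.
emb = adjust_speaker_embedding([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

In a real system the embeddings would come from an SV model and the adjustment could be learned rather than a fixed interpolation; the point here is only that the same speaker's embeddings from different languages are pulled closer together before conditioning the TTS model.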