Key terms: melody, lyrics, timbre, vocal (a cappella singing).
2021 ICASSP
【singer conversion】PPG-based Singing Voice Conversion with Adversarial Representation Learning
Affiliation: ByteDance (Toutiao)
Paper link
Demo link
Reading notes
Key points: multiple sub-networks trained adversarially, complementing each other to improve performance; the demo samples sound decent.
2020 ICASSP
- SVC
- Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders.
- SS
- Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System
- S2S
- Speech-To-Singing Conversion in an Encoder-Decoder Framework.
【singer conversion】PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network [2020 ICASSP]
Affiliation: Tencent AI Lab; first author: Chengqi Deng
Abstract:
Many existing SVC results are off-key, which suggests inaccurate pitch prediction. This work aims to predict pitch more accurately and correct it more flexibly.
The paper proposes an adversarially trained pitch regression network that pushes the encoder toward a pitch-invariant phoneme representation (a singer-invariant embedding); a separate module feeds the pitch extracted from the source into the decoder. The task is SVC on non-parallel data, following the earlier WaveNet-autoencoder approach, which can synthesize highly similar voices but with poor audio quality, a drawback of modeling phone and pitch jointly.
Demo
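The adversarial pitch regression idea can be sketched numerically: the encoder is rewarded when the pitch regressor fails, which in a gradient-reversal setup amounts to subtracting the pitch loss from the encoder objective. A minimal sketch; all names and numbers are illustrative, not from the paper:

```python
def adversarial_encoder_loss(rec_loss, pitch_loss, lam=0.1):
    """Encoder objective (sketch): minimize reconstruction error while
    *maximizing* the adversarial pitch regressor's error, so the latent
    carries no pitch information. The sign flip on pitch_loss plays the
    role of a gradient-reversal layer; lam is an assumed weight."""
    return rec_loss - lam * pitch_loss

# A latent that hides pitch (the regressor fails, high pitch_loss)
# scores better than one that leaks pitch (low pitch_loss).
hides_pitch = adversarial_encoder_loss(rec_loss=1.0, pitch_loss=5.0)
leaks_pitch = adversarial_encoder_loss(rec_loss=1.0, pitch_loss=0.5)
```

With the latent stripped of pitch, the decoder must take pitch from the separate source-pitch module, which is what makes flexible pitch correction possible.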
Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders
Conference: 2020 ICASSP
Author: Yin-Jyun Luo
Affiliation: Singapore University of Technology and Design
Demo link
- Abstract
A VAE-based model performs many-to-many singer conversion and vocal-technique conversion from non-parallel data. Two separate encoders encode singer identity and vocal-technique information respectively; the two are recombined by vector arithmetic in the latent space, and a decoder reconstructs the audio.
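The disentangle-then-recombine step can be illustrated with a toy latent recombination. The encoder/decoder networks are assumed; random vectors stand in for their codes, and the singer/technique names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent codes the paper's two encoders would produce
# (random vectors stand in for real encoder outputs).
z_singer = {"A": rng.normal(size=8), "B": rng.normal(size=8)}
z_tech = {"breathy": rng.normal(size=8), "vibrato": rng.normal(size=8)}

def recombine(singer, technique):
    """Recombine disentangled codes into one decoder input: pair any
    singer identity with any vocal technique, which is what enables
    many-to-many conversion."""
    return np.concatenate([z_singer[singer], z_tech[technique]])

converted = recombine("A", "vibrato")  # singer A, vibrato technique
```

Because the two codes live in separate subspaces, swapping one while holding the other fixed converts only that factor.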
2020 Interspeech
- S2S
- Speech-to-Singing Conversion Based on Boundary Equilibrium GAN
【20 s of target speech enables SVC to a new target speaker】DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
Conference: 2020 Interspeech
Author: Liqiang Zhang
Affiliation: Beijing Institute of Technology, Tencent AI Lab
Demo link
- Abstract
Motivation: perform SVC when very little singing data is available for the target speaker.
Method: generate high-quality singing data from the target speaker's ordinary speech. By unifying the features used for speech synthesis and singing synthesis, training/conversion for speech and singing is merged into one framework, so ordinary speech data also helps SVC training, especially when singing data is scarce. Because the goal is one-shot SVC, a separate speaker-embedding module (trained on both speech and singing data) is needed.
Result: 20 s of enrollment speech from the target speaker suffices to convert a source singing voice to the target.
- Introduction
Singing synthesis needs a large amount of data from a single singer, which is hard and expensive to collect. [4] trains a multi-speaker singing-synthesis model, then fine-tunes it with a small amount of target-speaker singing data. For unseen voices, SVC can be used instead. [Unsupervised Singing Voice Conversion] first proposed SVC on non-parallel data with a WaveNet-autoencoder architecture; neither singing data nor transcribed lyrics or notes are needed.
Even so, SVC still needs a considerable amount of singing data. [10] tackled the speech-to-singing task by correcting the F0 contour and duration information, but manual correction is required to reach good intelligibility and naturalness.
Duration Informed Attention Network (DurIAN) was proposed for multimodal synthesis; an autoregressive network generates acoustic features frame by frame. This paper builds on DurIAN for speech & singing conversion. Contributions: (1) the speech-synthesis and singing-synthesis networks are merged, so singing voice conversion can be trained with speech data; (2) the speaker embedding is extracted by a pre-trained d-vector network rather than taken from a look-up table (LUT). At conversion time, 20 s of the target speaker's speech or singing data is used to extract the d-vector, which completes the conversion.
The TTS front end converts the speech/singing text into a phone sequence, and a TDNN performs forced alignment to obtain durations. Acoustic features include the mel spectrogram, F0, and RMSE (root-mean-square energy). Unlike the tonal phone set used in TTS, a non-tonal phone set is used to model speech and singing phones jointly.
- Speaker embedding network: trained jointly on speech and singing data to extract an utterance-level d-vector.
- Loss: mel loss.
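The utterance-level d-vector step can be sketched as mean-pooling frame embeddings followed by L2 normalization. The speaker encoder itself is assumed; random frames (and the 100 frames/s, 256-dim figures) stand in for its real output:

```python
import numpy as np

def d_vector(frame_embeddings):
    """Utterance-level d-vector (sketch): average the speaker encoder's
    frame-level embeddings over time, then L2-normalize so utterances of
    different lengths are comparable."""
    v = np.asarray(frame_embeddings).mean(axis=0)
    return v / np.linalg.norm(v)

# Roughly 20 s of enrollment audio at an assumed 100 frames/s
# -> 2000 frames, each a hypothetical 256-dim encoder output.
frames = np.random.default_rng(0).normal(size=(2000, 256))
emb = d_vector(frames)
```

Because the embedding comes from a pre-trained network rather than a look-up table, any unseen speaker's 20 s clip yields a usable conditioning vector, which is what makes the one-shot conversion possible.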
2019 Interspeech
- SVC
- Unsupervised Singing Voice Conversion
- S2S
- A Combination of Model-Based and Feature-Based Strategy for Speech-to-Singing Alignment
- NUS Speak-to-Sing: A Web Platform for Personalized Speech-to-Singing Conversion
- A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis
【singer conversion】 Unsupervised Singing Voice Conversion
Affiliation: Facebook AI
Demo link
2019 ICASSP
II. Style Conversion
1. 【Speaking style conversion】Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion [2019 ICASSP]
Uses a CycleGAN on non-parallel data for speaking-style conversion (between Lombard and normal speech, in both directions).
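What lets CycleGAN train without parallel Lombard/normal pairs is the cycle-consistency loss. A toy invertible pair of mappings, standing in for the two learned generators, shows the constraint (all mappings and data here are illustrative):

```python
import numpy as np

# Toy invertible mappings standing in for the two generators:
# G: normal -> Lombard features, F: Lombard -> normal.
def G(x):
    return 1.5 * x + 0.2

def F(y):
    return (y - 0.2) / 1.5

def cycle_consistency_loss(x, y):
    """L1 cycle loss: F(G(x)) must reconstruct x and G(F(y)) must
    reconstruct y, which constrains the mappings even though no
    paired (normal, Lombard) utterances exist."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

x = np.linspace(0.0, 1.0, 5)  # stand-in normal-style features
y = np.linspace(0.0, 1.0, 5)  # stand-in Lombard-style features
loss = cycle_consistency_loss(x, y)
```

Since this toy G and F are exact inverses, the loss is essentially zero; in training, adversarial losses on each domain are added on top of this term.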
2. 【Emotion conversion】Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion [LHZ] [2020 Interspeech]
code and demo
The speech quality is quite poor, which makes it hard to judge whether the model is actually effective.
- EVC (emotional voice conversion): convert the emotion while preserving the linguistic content and speaker identity; models speaker-independent emotion states, trained on non-parallel data with VAW-GAN.
Emotion conversion involves both spectral and prosody conversion, whereas traditional VC focuses only on the spectral side.
- Converts a speaker from neutral to angry.
Ideas
- speech2singing
- Text-preserving (like the 菠萝唱歌 APP): keep the sentence content unchanged and map its durations and notes onto an existing song;
- Not text-preserving, extract timbre only: let people who cannot sing, sing.
2019 Interspeech
- Augmented CycleGANs for Continuous Scale Normal-to-Lombard Speaking Style Conversion
2019 icassp
- Unsupervised Melody Style Conversion
- Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion
- Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis.
2020 Interspeech
- Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
- Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
- Transferring Source Style in Non-Parallel Voice Conversion
- Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
2020 ICASSP
- Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
- Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System.