Key terms: melody, lyrics, timbre, vocal (a cappella singing).
2021 ICASSP
【singer conversion】PPG-based Singing Voice Conversion with Adversarial Representation Learning
Affiliation: ByteDance (Toutiao)
Paper link
Demo link
Reading notes
Key points: multiple sub-networks trained adversarially, complementing each other to improve performance; the demo samples sound decent.
2020 ICASSP
- SVC
- Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders.
- SS
- Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System
- S2S
- Speech-To-Singing Conversion in an Encoder-Decoder Framework.
【singer conversion】PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network [2020 ICASSP]
Affiliation: Tencent AI Lab; first author: Chengqi Deng
Abstract:
Many existing SVC results are off-key, which suggests inaccurate pitch prediction. This work aims to predict pitch more accurately and correct it more flexibly.
The paper proposes an adversarially trained pitch regression network that pushes the encoder toward a pitch-invariant phoneme representation (a singer-invariant embedding); a separate module feeds the pitch extracted from the source into the decoder. The task is SVC on non-parallel data, following the earlier WaveNet-autoencoder approach, which can synthesize highly similar voices but with poor audio quality, a drawback of modeling phone and pitch jointly.
Demo
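The adversarial pitch regression idea can be sketched numerically: the encoder is rewarded when the pitch regressor fails, which in a gradient-reversal setup amounts to subtracting the pitch loss from the encoder objective. A minimal sketch; all names and numbers are illustrative, not from the paper:

```python
def adversarial_encoder_loss(rec_loss, pitch_loss, lam=0.1):
    """Encoder objective (sketch): minimize reconstruction error while
    *maximizing* the adversarial pitch regressor's error, so the latent
    carries no pitch information. The sign flip on pitch_loss plays the
    role of a gradient-reversal layer; lam is an assumed weight."""
    return rec_loss - lam * pitch_loss

# A latent that hides pitch (the regressor fails, high pitch_loss)
# scores better than one that leaks pitch (low pitch_loss).
hides_pitch = adversarial_encoder_loss(rec_loss=1.0, pitch_loss=5.0)
leaks_pitch = adversarial_encoder_loss(rec_loss=1.0, pitch_loss=0.5)
```

With the latent stripped of pitch, the decoder must take pitch from the separate source-pitch module, which is what makes flexible pitch correction possible.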
Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders
Conference: 2020 ICASSP
Author: Yin-Jyun Luo
Affiliation: Singapore University of Technology and Design
Demo link
- Abstract
A VAE-based model performs many-to-many singer conversion and vocal-technique conversion from non-parallel data. Two separate encoders encode singer identity and vocal-technique information respectively; the two are recombined by vector arithmetic in the latent space, and a decoder reconstructs the audio.
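The disentangle-then-recombine step can be illustrated with a toy latent recombination. The encoder/decoder networks are assumed; random vectors stand in for their codes, and the singer/technique names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent codes the paper's two encoders would produce
# (random vectors stand in for real encoder outputs).
z_singer = {"A": rng.normal(size=8), "B": rng.normal(size=8)}
z_tech = {"breathy": rng.normal(size=8), "vibrato": rng.normal(size=8)}

def recombine(singer, technique):
    """Recombine disentangled codes into one decoder input: pair any
    singer identity with any vocal technique, which is what enables
    many-to-many conversion."""
    return np.concatenate([z_singer[singer], z_tech[technique]])

converted = recombine("A", "vibrato")  # singer A, vibrato technique
```

Because the two codes live in separate subspaces, swapping one while holding the other fixed converts only that factor.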
2020 Interspeech
- S2S
- Speech-to-Singing Conversion Based on Boundary Equilibrium GAN
【20 s of target speech enables SVC to a new target speaker】DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
Conference: 2020 Interspeech
Author: Liqiang Zhang
Affiliation: Beijing Institute of Technology, Tencent AI Lab
Demo link
- Abstract
Motivation: perform SVC when very little singing data is available for the target speaker.
Method: generate high-quality singing data from the target speaker's ordinary speech. By unifying the features used for speech synthesis and singing synthesis, training/conversion for speech and singing is merged into one framework, so ordinary speech data also helps SVC training, especially when singing data is scarce. Because the goal is one-shot SVC, a separate speaker-embedding module (trained on both speech and singing data) is needed.
Result: 20 s of enrollment speech from the target speaker suffices to convert a source singing voice to the target.
- Introduction
Singing synthesis needs a large amount of data from a single singer, which is hard and expensive to collect. [4] trains a multi-speaker singing-synthesis model, then fine-tunes it with a small amount of target-speaker singing data. For unseen voices, SVC can be used instead. [Unsupervised Singing Voice Conversion] first proposed SVC on non-parallel data with a WaveNet-autoencoder architecture; neither singing data nor transcribed lyrics or notes are needed.
Even so, SVC still needs a considerable amount of singing data. [10] tackled the speech-to-singing task by correcting the F0 contour and duration information, but manual correction is required to reach good intelligibility and naturalness.
Duration Informed Attention Network (DurIAN) was proposed for multimodal synthesis; an autoregressive network generates acoustic features frame by frame. This paper builds on DurIAN for speech & singing conversion. Contributions: (1) the speech-synthesis and singing-synthesis networks are merged, so singing voice conversion can be trained with speech data; (2) the speaker embedding is extracted by a pre-trained d-vector network rather than taken from a look-up table (LUT). At conversion time, 20 s of the target speaker's speech or singing data is used to extract the d-vector, which completes the conversion.
The TTS front end converts the speech/singing text into a phone sequence, and a TDNN performs forced alignment to obtain durations. Acoustic features include the mel spectrogram, F0, and RMSE (root-mean-square energy). Unlike the tonal phone set used in TTS, a non-tonal phone set is used to model speech and singing phones jointly.
- Speaker embedding network: trained jointly on speech and singing data to extract an utterance-level d-vector.
- Loss: mel loss.
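The utterance-level d-vector step can be sketched as mean-pooling frame embeddings followed by L2 normalization. The speaker encoder itself is assumed; random frames (and the 100 frames/s, 256-dim figures) stand in for its real output:

```python
import numpy as np

def d_vector(frame_embeddings):
    """Utterance-level d-vector (sketch): average the speaker encoder's
    frame-level embeddings over time, then L2-normalize so utterances of
    different lengths are comparable."""
    v = np.asarray(frame_embeddings).mean(axis=0)
    return v / np.linalg.norm(v)

# Roughly 20 s of enrollment audio at an assumed 100 frames/s
# -> 2000 frames, each a hypothetical 256-dim encoder output.
frames = np.random.default_rng(0).normal(size=(2000, 256))
emb = d_vector(frames)
```

Because the embedding comes from a pre-trained network rather than a look-up table, any unseen speaker's 20 s clip yields a usable conditioning vector, which is what makes the one-shot conversion possible.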
2019 Interspeech
- SVC
- Unsupervised Singing Voice Conversion
- S2S
- A Combination of Model-Based and Feature-Based Strategy for Speech-to-Singing Alignment
- NUS Speak-to-Sing: A Web Platform for Personalized Speech-to-Singing Conversion
- A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis
【singer conversion】 Unsupervised Singing Voice Conversion
Affiliation: Facebook AI
Demo link
2019 ICASSP
II. Style Conversion
1. 【Speaking style conversion】Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion [2019 ICASSP]
Uses a CycleGAN on non-parallel data for speaking-style conversion (between Lombard and normal speech, in both directions).
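What lets CycleGAN train without parallel Lombard/normal pairs is the cycle-consistency loss. A toy invertible pair of mappings, standing in for the two learned generators, shows the constraint (all mappings and data here are illustrative):

```python
import numpy as np

# Toy invertible mappings standing in for the two generators:
# G: normal -> Lombard features, F: Lombard -> normal.
def G(x):
    return 1.5 * x + 0.2

def F(y):
    return (y - 0.2) / 1.5

def cycle_consistency_loss(x, y):
    """L1 cycle loss: F(G(x)) must reconstruct x and G(F(y)) must
    reconstruct y, which constrains the mappings even though no
    paired (normal, Lombard) utterances exist."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

x = np.linspace(0.0, 1.0, 5)  # stand-in normal-style features
y = np.linspace(0.0, 1.0, 5)  # stand-in Lombard-style features
loss = cycle_consistency_loss(x, y)
```

Since this toy G and F are exact inverses, the loss is essentially zero; in training, adversarial losses on each domain are added on top of this term.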
2. 【Emotion conversion】Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion [LHZ] [2020 Interspeech]
code and demo
The speech quality is quite poor, which makes it hard to judge whether the model is actually effective.
- EVC (emotional voice conversion): convert the emotion while preserving the linguistic content and speaker identity; models speaker-independent emotion states, trained on non-parallel data with VAW-GAN.
Emotion conversion involves both spectral and prosody conversion, whereas traditional VC focuses only on the spectral side.
- Converts a speaker from neutral to angry.
Ideas
- speech2singing
- Text-preserving (like the 菠萝唱歌 APP): keep the sentence content unchanged and map its durations and notes onto an existing song;
- Not text-preserving, extract timbre only: let people who cannot sing, sing.
2019 Interspeech
- Augmented CycleGANs for Continuous Scale Normal-to-Lombard Speaking Style Conversion
2019 icassp
- Unsupervised Melody Style Conversion
- Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion
- Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis.
2020 Interspeech
- Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
- Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
- Transferring Source Style in Non-Parallel Voice Conversion
- Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
2020 ICASSP
- Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
- Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System.