Unsupervised Cross-Domain Singing Voice Conversion


林林宋


Tags: deep learning, machine learning, speech synthesis

Original link: https://arxiv.org/abs/2008.02830


Conference: Interspeech 2020
Affiliation: Facebook
Author: Adam Polyak

demo page


Abstract

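Both speech and singing data are used. "Cross-domain" means that a source singing utterance can be converted to the timbre of a target whose original data is either speech or singing.
Conversion is waveform-to-waveform (wav2wav), using a GAN.
An ASR model extracts acoustic features, a CNN extracts the fundamental frequency (F0), and a loudness feature is extracted separately.
A perceptual loss is proposed: it measures F0 consistency and content consistency between the reconstructed x and the original x.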
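The perceptual loss described above can be illustrated with a minimal sketch. It assumes an L1 distance between features and uses toy stand-ins for the pitch and content extractors; the names `perceptual_loss`, `toy_pitch`, `toy_content`, and the two weights are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence

Wave = Sequence[float]
Features = Sequence[float]


def l1_distance(a: Features, b: Features) -> float:
    """Mean absolute difference between two equal-length feature sequences."""
    if len(a) != len(b):
        raise ValueError("feature sequences must have the same length")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def perceptual_loss(
    x: Wave,
    x_hat: Wave,
    pitch_fn: Callable[[Wave], Features],
    content_fn: Callable[[Wave], Features],
    pitch_weight: float = 1.0,    # assumption: weights are illustrative
    content_weight: float = 1.0,
) -> float:
    """Compare the original x and the reconstruction x_hat in feature space."""
    pitch_term = l1_distance(pitch_fn(x), pitch_fn(x_hat))
    content_term = l1_distance(content_fn(x), content_fn(x_hat))
    return pitch_weight * pitch_term + content_weight * content_term


# Toy stand-ins (assumptions) for the real pitch and content networks.
def toy_pitch(wave: Wave) -> Features:
    """Count sign changes as a crude pitch proxy."""
    crossings = sum(1 for a, b in zip(wave, wave[1:]) if a * b < 0)
    return [float(crossings)]


def toy_content(wave: Wave) -> Features:
    """Use the per-sample magnitude as a crude content proxy."""
    return [abs(s) for s in wave]


x = [0.0, 1.0, 0.0, -1.0]
x_hat = [0.0, 0.5, 0.0, -0.5]
loss_same = perceptual_loss(x, x, toy_pitch, toy_content)
loss_diff = perceptual_loss(x, x_hat, toy_pitch, toy_content)
```

In a real system the two callables would be pretrained networks kept frozen during training, so that the loss penalizes differences in pitch and content rather than raw sample differences.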

Figure 1: Proposed GAN architecture. (a) Generator architecture. Musical and speech features are extracted from a singing waveform (f_loud(x), f_w2l(x), Γ(f_crepe(x))) and passed through context stacks (colored green). The features are then concatenated and temporally upsampled to match the audio frequency. The joint embedding is used to condition a non-causal WaveNet (colored blue), which receives random noise as input. (b) Discriminator architecture. Losses are drawn with dashed lines, input/output with solid lines. The discriminator (colored orange) differentiates between synthesized and real singing. Multi-scale spectral loss and perceptual losses are computed between matching real and generated samples.

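In the multi-speaker setting, back-translation is used: a sample $x^j$ from singer $j$ is converted to a target singer $u$ by
$x^j_u = G(z, E(x^j), u)$
where $G$ is the generator, $z$ is the input noise, and $E(x^j)$ denotes the features extracted from $x^j$.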
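A minimal sketch of the back-translation step follows. It assumes the converted sample is then fed back through the feature extractor and the generator, conditioned on the original singer, to reconstruct the source (the usual back-translation recipe). The function `back_translation_pair` and all the toy components are hypothetical illustrations, not the paper's networks.

```python
import random
from typing import Callable, Dict, List, Tuple

Wave = List[float]


def back_translation_pair(
    x_j: Wave,
    j: int,
    u: int,
    generator: Callable[[Wave, Wave, int], Wave],
    encoder: Callable[[Wave], Wave],
    sample_noise: Callable[[int], Wave],
) -> Tuple[Wave, Wave]:
    """Return (x_u_j, x_rec): the converted sample and its reconstruction."""
    # x_u^j = G(z, E(x^j), u): convert singer j's sample to singer u.
    x_u_j = generator(sample_noise(len(x_j)), encoder(x_j), u)
    # Assumed second step: map the converted sample back to singer j.
    x_rec = generator(sample_noise(len(x_u_j)), encoder(x_u_j), j)
    return x_u_j, x_rec


# Toy components (assumptions): each "singer" is just an amplitude gain.
GAINS: Dict[int, float] = {0: 1.0, 1: 0.5}


def toy_encoder(wave: Wave) -> Wave:
    """Peak-normalize, which removes the singer-dependent gain."""
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]


def toy_generator(z: Wave, features: Wave, singer: int) -> Wave:
    """Apply the target singer's gain; this toy version ignores the noise z."""
    return [GAINS[singer] * f for f in features]


def toy_noise(n: int) -> Wave:
    return [random.random() for _ in range(n)]


x_j = [0.0, 1.0, -1.0]
x_u_j, x_rec = back_translation_pair(x_j, 0, 1, toy_generator, toy_encoder, toy_noise)
```

In this toy setup the reconstruction matches the source exactly because the encoder discards all singer information; with real networks, a reconstruction loss between `x_rec` and `x_j` would be the training signal.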

Architecture

input → conv block (8 non-causal layers) → generator (WaveNet), which maps noise drawn from U(0, 1) to a sample-level waveform → discriminator
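The conditioning path in this pipeline (concatenate frame-level features, upsample them to the audio rate, and pair them with uniform noise) can be sketched as follows. Nearest-neighbour repetition, the hop size, and the feature dimensions are assumptions chosen for illustration; the conv block, WaveNet generator, and discriminator themselves are not implemented here.

```python
import random
from typing import List, Optional

Frame = List[float]


def concat_features(*streams: List[Frame]) -> List[Frame]:
    """Concatenate per-frame feature vectors from several feature streams."""
    n_frames = len(streams[0])
    if any(len(s) != n_frames for s in streams):
        raise ValueError("all streams must have the same number of frames")
    joint = []
    for t in range(n_frames):
        frame: Frame = []
        for stream in streams:
            frame.extend(stream[t])
        joint.append(frame)
    return joint


def upsample(frames: List[Frame], hop: int) -> List[Frame]:
    """Repeat each frame `hop` times (nearest-neighbour upsampling)."""
    return [frame for frame in frames for _ in range(hop)]


def uniform_noise(n: int, seed: Optional[int] = None) -> List[float]:
    """Draw n samples from U(0, 1) as the generator input."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two frames of toy loudness, content, and pitch features (made-up values).
loudness = [[0.1], [0.2]]
content = [[1.0, 2.0], [3.0, 4.0]]
pitch = [[220.0], [230.0]]

cond = upsample(concat_features(loudness, content, pitch), hop=4)
noise = uniform_noise(len(cond), seed=0)
```

Each noise sample now has a matching conditioning vector, which is the shape a sample-level generator needs: one conditioning vector per output audio sample.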

Experiments

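Single-speaker data: LJSpeech, LCSING (single-singer singing data)
Multi-speaker data: VCTK, NUS

Models are trained on speech-only data, singing-only data, and speech + singing data as the target speaker; at test time, the input is NUS data.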
