Paper Notes: "MaskCycleGAN-VC: Learning Non-Parallel Voice Conversion with Filling in Frames"


大雪001


Category: Paper Notes. Tags: voice conversion, speech synthesis, CycleGAN

Article link: https://blog.csdn.net/LeavingBook/article/details/116975304


Paper: https://ieeexplore.ieee.org/abstract/document/9414851
Conference: ICASSP 2021

Abstract

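The TFAN (time-frequency adaptive normalization) module used in CycleGAN-VC3 greatly increases the computational cost. As an alternative, this paper proposes MaskCycleGAN-VC, an extension of CycleGAN-VC2 that is trained with an auxiliary task called filling in frames (FIF). With FIF, a temporal mask is applied to the input mel-spectrogram, and the converter is encouraged to fill in the missing frames from the surrounding frames. FIF lets the converter learn time-frequency structure in a self-supervised manner, with no additional module.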

As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN.
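
The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `apply_temporal_mask`, the 80 x 64 input shape, the mask length, and the choice to stack the masked spectrogram and the mask as two input channels are all assumptions made for the example.

```python
import numpy as np


def apply_temporal_mask(mel, mask_len, start=None, rng=None):
    """Zero out `mask_len` consecutive frames of a (n_mels, n_frames) mel-spectrogram.

    Returns the masked spectrogram and the binary mask (1 = kept, 0 = missing).
    `start` is the first masked frame; if omitted, it is drawn uniformly at random.
    """
    n_mels, n_frames = mel.shape
    if not 0 <= mask_len <= n_frames:
        raise ValueError("mask_len must be between 0 and n_frames")
    if start is None:
        if rng is None:
            rng = np.random.default_rng()
        # Upper bound is exclusive, so the masked span always fits in the input.
        start = int(rng.integers(0, n_frames - mask_len + 1))
    elif start < 0 or start + mask_len > n_frames:
        raise ValueError("masked span must lie inside the spectrogram")
    mask = np.ones_like(mel)
    mask[:, start:start + mask_len] = 0.0  # every mel bin of the chosen frames
    return mel * mask, mask


# Toy input: random values standing in for a real 80-bin, 64-frame mel-spectrogram.
mel = np.random.default_rng(0).standard_normal((80, 64)).astype(np.float32)
masked, mask = apply_temporal_mask(mel, mask_len=16, start=24)

# Assumption: the converter sees the masked input and the mask as two channels.
converter_input = np.stack([masked, mask], axis=0)  # shape (2, 80, 64)
```

The mask covers whole frames (all mel bins at the selected time steps), so the converter has to infer the missing content from temporal context alone, which is what the notes mean by filling in frames from the surrounding frames.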

1. Introduction

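MaskCycleGAN-VC is an extension of CycleGAN-VC2 that is trained with filling in frames (FIF). A temporal mask is applied to the input mel-spectrogram, and the converter is encouraged to fill in the missing frames from the surrounding frames.
By completing the missing frames, the converter learns the time-frequency feature structure in a self-supervised manner.
Existing problems: CycleGAN-VC2 converts MCEPs (mel-cepstral coefficients) and then reconstructs the speech from them, which loses time-frequency information during conversion and rules out the use of a neural vocoder. CycleGAN-VC3 uses TFAN to compensate for the lost time-frequency information, but its computational cost is very high.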

As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in the missing frames based on the surrounding frames.
Similarly, FIF allows the converter to learn the time-frequency feature structure in a self-supervised manner through a complementation process.
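
The FIF task quoted above can be written compactly as follows. This is a sketch based on my reading of the paper rather than something stated in these notes, so the symbols are assumptions: $x$ is a source mel-spectrogram, $m$ a binary temporal mask of the same size, $m'$ an all-ones mask, and $G_{X \to Y}$, $G_{Y \to X}$ the forward and inverse converters.

```latex
\begin{align}
  \hat{x} &= x \odot m
    && \text{(mask out a span of frames)} \\
  y' &= G_{X \to Y}\bigl(\mathrm{concat}(\hat{x},\, m)\bigr)
    && \text{(convert the masked input)} \\
  x'' &= G_{Y \to X}\bigl(\mathrm{concat}(y',\, m')\bigr)
    && \text{(convert back with nothing masked)} \\
  \mathcal{L}_{\mathrm{mcyc}} &= \mathbb{E}_{x,\, m}\bigl[\lVert x'' - x \rVert_1\bigr]
    && \text{(compare with the unmasked original)}
\end{align}
```

Because the cycle-consistency target is the original $x$ and not the masked $\hat{x}$, the loss can only be reduced if the missing frames are filled in somewhere along the cycle, which is why no extra module or label is needed.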