Almost Unsupervised Text to Speech and Automatic Speech Recognition

Latest recommended article published on 2022-02-15 13:32:28

咕噜咕噜day

Category: Speech Learning (语音学习). Tags: light-TTS, TTS

Article link: https://blog.csdn.net/qq_36533552/article/details/102529245

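Abstract:
- An almost unsupervised method: TTS and ASR are trained with only a few hundred paired text-speech samples plus additional unpaired data
- components:
  - 1. a denoising auto-encoder
  - 2. Dual transformation: TTS converts text y into speech x, and ASR is trained on the resulting (x, y) pair, and vice versa
  - 3. Bidirectional sequence modeling, mainly to address the error propagation that arises with long speech and text sequences
  - 4. a unified model that contains both TTS and ASR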
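Introduction:
- Surveys several papers on ASR and TTS in low-resource and zero-resource settings
- Synthesizing a specific speaker's voice from large amounts of labeled speech-text data is a form of transfer learning, and it relies on already trained ASR and TTS models
- methods:
  - 1. Self-supervised learning on unpaired speech and text data, to build language understanding and modeling ability in both the speech and text domains. A denoising auto-encoder is used for this.
  - 2. Training procedure:
    - 1. TTS synthesizes speech x from text y, then ASR is trained on (x, y)
    - 2. ASR recognizes text y from speech x, then TTS is trained on (y, x).
  - 3. Speech and text sequences are longer than those in other seq2seq tasks, so they suffer more from error propagation. Bidirectional sequence modeling is used to mitigate this.
  - 4. A unified Transformer-based model structure combines TTS and ASR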
Background:
- 2.1. Sequence to Sequence Learning
  - Based on the encoder-decoder framework:
    - The encoder reads the source sequence and generates a set of representations.
    - The decoder estimates the conditional probability of each target element given the source representations and its preceding elements.
    - The attention mechanism (Bahdanau et al., 2015) is further introduced between the encoder and decoder in order to determine which source representation to focus on when predicting the current element, and is an important component for sequence to sequence learning.
- 2.2. TTS and ASR based on the Encoder-Decoder Framework
  - End-to-end TTS and ASR; a Transformer is used to handle long sentences
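The attention step described in 2.1 can be illustrated with a minimal sketch. It uses scaled dot-product attention (the Transformer variant) rather than the additive form of Bahdanau et al.; the function name `attend` and the toy vectors are assumptions made for illustration, not code from the paper.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single decoder query.

    query:  list[float], the current decoder state
    keys:   list[list[float]], one key per source position
    values: list[list[float]], one value per source position
    Returns (context vector, attention weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax with max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the values.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Toy encoder outputs for three source positions (hypothetical numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([2.0, 0.0], encoder_states, encoder_states)
```

The weights sum to 1, and source positions whose keys align with the query receive the larger share, which is what "deciding which source representation to focus on" means in practice.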
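Our Method:
- 3.1. Denoising Auto-Encoder
  - Given a large amount of unpaired data, the goal is to better understand speech and text. A denoising auto-encoder reconstructs the speech or text sequence from a corrupted version of itself.
  - The denoising auto-encoder is a typical self-supervised method and is widely used in unsupervised learning.
  - loss: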
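As a rough illustration of the denoising setup, the sketch below builds (corrupted input, clean target) pairs by random masking. The masking scheme, the `<mask>` symbol, the zero-vector frame mask, and the function names are all assumptions for illustration; the notes only state that the sequence is reconstructed and that a mask probability is varied in 4.3.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol for text

def corrupt(sequence, mask_prob, mask_value, rng):
    """Return a copy of `sequence` where each element is masked with probability mask_prob."""
    return [mask_value if rng.random() < mask_prob else item for item in sequence]

def dae_training_pair(sequence, mask_prob, mask_value, rng):
    """Build (input, target) for a denoising auto-encoder.

    The model sees the corrupted sequence and is trained to reconstruct the clean one.
    """
    return corrupt(sequence, mask_prob, mask_value, rng), list(sequence)

rng = random.Random(0)

# Text side: phoneme tokens, masked with a special symbol.
phonemes = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
noisy_text, clean_text = dae_training_pair(phonemes, 0.3, MASK_TOKEN, rng)

# Speech side: frames are vectors, so a masked frame becomes a zero vector.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
noisy_speech, clean_speech = dae_training_pair(frames, 0.3, [0.0, 0.0], rng)
```

The reconstruction loss would then compare the model's output on `noisy_text` (or `noisy_speech`) against the clean target.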
- 3.2. Dual Transformation
  - 1. TTS synthesizes speech x from text y, then ASR is trained on (x, y)
  - 2. ASR recognizes text y from speech x, then TTS is trained on (y, x)
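The two directions above can be sketched as a single loop over unpaired data. `ToyModel` is a hypothetical stand-in that only records the pairs it would be trained on; a real system would run the TTS and ASR networks and update their parameters.

```python
class ToyModel:
    """Stand-in for a TTS or ASR network; it only records the pairs it is trained on."""

    def __init__(self, name, convert):
        self.name = name
        self.convert = convert      # hypothetical inference function
        self.training_pairs = []    # (input, target) pairs seen so far

    def infer(self, source):
        return self.convert(source)

    def train_step(self, source, target):
        # A real model would compute a loss and update its parameters here.
        self.training_pairs.append((source, target))


def dual_transformation_step(tts, asr, unpaired_text, unpaired_speech):
    """One round of dual transformation over unpaired data."""
    # Direction 1: text y -> synthesized speech x_hat, then ASR trains on (x_hat, y).
    for y in unpaired_text:
        x_hat = tts.infer(y)
        asr.train_step(x_hat, y)
    # Direction 2: speech x -> recognized text y_hat, then TTS trains on (y_hat, x).
    for x in unpaired_speech:
        y_hat = asr.infer(x)
        tts.train_step(y_hat, x)


# Toy "speech" is a list of character codes so the example stays self-contained.
tts = ToyModel("tts", lambda text: [ord(c) for c in text])
asr = ToyModel("asr", lambda speech: "".join(chr(v) for v in speech))
dual_transformation_step(tts, asr, ["hi"], [[111, 107]])
```

In each direction the real (unpaired) sample serves as the training target and the generated sample serves as the input, so each model learns from pseudo pairs produced by the other.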
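- 3.3. Bidirectional Sequence Modeling
  - The right part of a generated sequence is usually of lower quality than the left part, and the gap is larger in a low- or zero-resource setting.
  - To address this, the authors use bidirectional sequence modeling, generating speech or text sequences both left-to-right and right-to-left.
  - When data is scarce in the unsupervised setting, bidirectional sequence modeling also acts as data augmentation.
  - Unlike the conventional decoder using a zero vector as the start element for training and inference, we learn four start embeddings in total, two for speech generation and the other two for text generation.
  - When generating right-to-left, we also reverse the source sequence to make it consistent with the target sequence.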
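A minimal sketch of how one training pair could yield both a left-to-right and a right-to-left example. The start symbols stand in for the learned start embeddings, and the names `make_example` and `bidirectional_examples` are hypothetical; the actual model uses embeddings and mel-spectrogram frames rather than strings.

```python
# Hypothetical start symbols, one per generation direction. They stand in for
# the learned start embeddings mentioned above.
START_L2R = "<s_l2r>"
START_R2L = "<s_r2l>"

def make_example(source, target, start_symbol):
    """Teacher-forcing example: the decoder input is the target shifted right by one."""
    return {
        "source": list(source),
        "decoder_input": [start_symbol] + list(target[:-1]),
        "target": list(target),
    }

def bidirectional_examples(source, target):
    """Return (left-to-right, right-to-left) examples from one (source, target) pair.

    The right-to-left example reverses both sequences so that the source stays
    aligned with the direction in which the target is generated.
    """
    left_to_right = make_example(source, target, START_L2R)
    right_to_left = make_example(source[::-1], target[::-1], START_R2L)
    return left_to_right, right_to_left

mel_frames = ["f1", "f2", "f3"]   # placeholder speech frames
phonemes = ["HH", "AY"]
l2r, r2l = bidirectional_examples(mel_frames, phonemes)
```

Because every pair produces two examples, this also shows why the approach works as data augmentation when paired data is scarce.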
- 3.4. Model Structure
  - Unified Training Flow:
  - Transformer Module:
    - embedding size: 256; hidden_size: 256; fft: 1025
  - In/Out Module:
    - The post-net consists of a 5-layer 1-dimensional convolutional network with hidden size of 256, which aims to refine the quality of the generated mel-spectrograms.
    - a phoneme embedding
Experiments and Results:
- 4.1. Training and Evaluation Setup
  - LJSpeech: 13,100 clips (12,500 training + 300 validation + 300 test), about 24 hours
  - evaluation: MOS (mean opinion score) for TTS and PER (phoneme error rate) for ASR
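For reference, both metrics are simple to compute. The sketch below assumes PER is the Levenshtein edit distance between the recognized and reference phoneme sequences divided by the reference length, and MOS is the plain average of listener ratings; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

def mean_opinion_score(ratings):
    """MOS = average of listener ratings (commonly on a 1 to 5 scale)."""
    return sum(ratings) / len(ratings)

reference = ["HH", "AH", "L", "OW"]
hypothesis = ["HH", "AH", "OW"]          # one phoneme dropped
per = phoneme_error_rate(reference, hypothesis)
mos = mean_opinion_score([4, 3, 5])
```

Lower PER is better for ASR, and higher MOS is better for TTS.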
- 4.2. Results
  - PER and MOS:
- 4.3. Analyses
  - Different Components of Our Method:
    - We conduct ablation studies by gradually adding each component to the baseline Pair-200 system to check the performance changes.
  - Visualization of Mel-Spectrograms
    - pic:
  - Varying Paired Data
    - pic:
  - Different Masking Probabilities in DAE
    - varying the mask probability:
Related work:
- TTS and ASR:
  - TTS: DeepVoice; Tacotron; ClariNet
- Zero-/Low-resource TTS and ASR :
  - pic
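Phoneme embedding; encoder: two-layer Transformer; one machine with four GPUs, total batch size 512; trained for three days
So how exactly are the 200 paired samples used?
1. text to phoneme: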
