Interspeech 2020 Paper Reading Notes

Ⅰ.Streaming ASR

1.Scout Network


(1) Scout Network (SN)
The paper uses the SN to detect word boundaries (strictly speaking, label boundaries). The model consists of N self-attention layers, preceded by CNN layers for downsampling. Because the output for the i-th frame depends only on previous frames (how is this achieved, presumably via masking?), the SN introduces no latency. The SN output layer is a linear layer followed by a sigmoid that predicts a probability p_i, and the SN is trained by minimizing a cross-entropy loss (formula figure omitted).
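To make the description above concrete, here is a minimal PyTorch sketch of such a boundary detector, assuming the "no look-ahead" property is implemented with a causal attention mask and the loss is a per-frame binary cross-entropy against 0/1 boundary labels; the layer sizes and the omitted CNN downsampler are my assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ScoutNetwork(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.boundary_head = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (batch, T, d_model), already downsampled
        T = x.size(1)
        # Causal mask: frame i may only attend to frames <= i, so the SN adds no latency.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.encoder(x, mask=causal)
        return torch.sigmoid(self.boundary_head(h)).squeeze(-1)   # p_i in (0, 1)

def scout_loss(p, boundaries):                  # boundaries: 0/1 tensor, 1 = label-boundary frame
    return nn.functional.binary_cross_entropy(p, boundaries.float())
```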
(2)Recognition Network training
The paper adopts triggered attention as the streaming decoder; MoCHA could also be used.
A debatable point is that the model is initialized from an offline Transformer. Without this initialization, would the model converge at all, and how fast? (In my own experiments, even with offline-Transformer initialization, the loss on the dev set was quite unstable.)
(3)Decoding
This is the most important part; see the original paper for the detailed decoding algorithm.
(4)Experiment
An interesting point: for the Scout Network evaluation, the edit distance between predicted boundaries and reference boundaries is used as the evaluation metric.
Experiment results (figure omitted).

2.Knowledge Distillation from Offline to Streaming RNN Transducer

The idea is to train an offline RNN-T that can serve as a good teacher for training a streaming student RNN-T. Nothing particularly novel in this paper.

3.Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

we proposed a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all the processed segments.
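As a rough PyTorch sketch of the mechanism quoted above: queries come from the current segment, while keys and values are the concatenation of a memory bank (one vector per processed segment) and the segment itself; how the new memory slot is summarized (mean pooling here) and the dimensions are my assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AugmentedMemoryAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, segment, memory_bank):
        # segment: (batch, seg_len, d_model); memory_bank: (batch, n_past_segments, d_model)
        kv = torch.cat([memory_bank, segment], dim=1)       # attend to memories + current segment
        out, _ = self.attn(query=segment, key=kv, value=kv)
        # Store a summary of the processed segment as a new memory slot (mean pooling here).
        new_bank = torch.cat([memory_bank, out.mean(dim=1, keepdim=True)], dim=1)
        return out, new_bank
```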
Experiment results (figure omitted).

4.Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model rescores the hypotheses with full audio sequence context. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model.

This paper replaces the LSTM in the 2nd-pass rescorer with Transformer layers, which allows the whole hypothesis sequence produced by the 1st pass to be processed in parallel.
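A minimal sketch of what such parallel rescoring could look like: each 1st-pass hypothesis is scored by one teacher-forced pass through a causally masked Transformer decoder (so all tokens are scored in parallel rather than step by step as with an LSTM), and the token log-probabilities are summed into a 2nd-pass score. The `decoder` interface and the interpolation weight `lam` are placeholders, not the paper's exact setup.

```python
import torch

def rescore(decoder, enc_out, hyp_tokens, first_pass_score, lam=0.5):
    # decoder(tokens, enc_out) -> (1, len(tokens), vocab) logits, causally masked inside (placeholder).
    # hyp_tokens: 1-D LongTensor starting with the start-of-sequence token.
    logits = decoder(hyp_tokens.unsqueeze(0), enc_out)
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = hyp_tokens[1:]                                           # predict y_t from y_<t
    token_lp = log_probs[0, :-1].gather(1, targets.unsqueeze(1)).sum()
    return lam * first_pass_score + (1.0 - lam) * token_lp.item()
```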
On rescorer training

The 1st-pass model is trained first, then the rescorer; during rescorer training the RNN-T encoder and decoder are frozen. Rescorer training itself has two stages (figure omitted).

5.High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

Downloaded; to read later.

6.Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

Transfer learning for RNN-T; to read later.

Ⅱ.Training Strategies for ASR

1.Semi-supervised ASR by End-to-end Self-training

Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update.
My reading: a CTC model is first trained on the labeled data; then, for each mini-batch of unlabeled utterances, pseudo-labels are generated with the current model, and these pseudo-labeled pairs augment the labeled data for the immediate model update. Is that the right interpretation?
we alternate the following two procedures: generating pseudo-labels using a token-level decoder on a mini-batch of unsupervised utterances, and augmenting the just decoded (input, pseudo-label) pairs for supervised training
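A schematic of the alternating procedure quoted above, under my reading of it: decode one mini-batch of unlabeled utterances with the current CTC model to get pseudo-labels, then take one supervised step on the labeled batch plus the just-decoded (input, pseudo-label) pairs. `ctc_model` (returning (T, B, vocab) log-probabilities) and `greedy_decode` are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def self_training_step(ctc_model, optimizer, labeled, unlabeled):
    # 1) Generate pseudo-labels on the unlabeled mini-batch (no gradient).
    with torch.no_grad():
        pseudo_labels, pseudo_lens = greedy_decode(ctc_model(unlabeled["feats"]))

    # 2) Supervised update on labeled data augmented with the just-decoded pairs.
    optimizer.zero_grad()
    loss = F.ctc_loss(ctc_model(labeled["feats"]), labeled["targets"],
                      labeled["in_lens"], labeled["tgt_lens"])
    loss = loss + F.ctc_loss(ctc_model(unlabeled["feats"]), pseudo_labels,
                             unlabeled["in_lens"], pseudo_lens)
    loss.backward()
    optimizer.step()
```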

2.Serialized Output Training for End-to-End Overlapped Speech Recognition

Serialized Output Training (SOT) has two advantages over Permutation Invariant Training (PIT): (1) no limitation on the maximum number of speakers, and (2) the ability to model the dependencies among the outputs of different speakers. SOT uses a single output layer that produces the transcriptions of multiple speakers one after another, and it reduces the computational complexity from PIT's O(S^3) to O(S), where S is the number of speakers.
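To make the single-output-layer idea concrete, here is a tiny illustration of how a serialized reference could be assembled: per-speaker transcriptions are concatenated in order of each speaker's start time, separated by a speaker-change token. The `<sc>` symbol and the sort-by-start-time rule reflect my understanding of SOT and are used only for illustration.

```python
def serialize_targets(utterances, sc_token="<sc>"):
    # utterances: list of (start_time, token_list), one entry per speaker
    ordered = sorted(utterances, key=lambda u: u[0])
    target = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            target.append(sc_token)       # speaker-change token between speakers
        target.extend(tokens)
    return target

# Example:
# serialize_targets([(1.2, ["hi"]), (0.3, ["how", "are", "you"])])
# -> ["how", "are", "you", "<sc>", "hi"]
```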

3.Improved Noisy Student Training for Automatic Speech Recognition


The NST training procedure is as described in the paper (figure omitted).
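The paper's figure gives the exact recipe; below is only a high-level sketch of a generic noisy student loop as I understand it: the teacher decodes the unlabeled data, pseudo-labels are filtered and balanced, a new student is trained on noised labeled plus pseudo-labeled data, and the student becomes the next teacher. Every function name here is a placeholder for a stage, not runnable code from the paper.

```python
def noisy_student_training(labeled, unlabeled, n_generations=4):
    teacher = train_asr(labeled, augment=True)                 # generation 0 (placeholder trainer)
    for _ in range(n_generations):
        pseudo = [(x, decode_with_lm(teacher, x)) for x in unlabeled]   # LM-fused decoding (placeholder)
        pseudo = filter_and_balance(pseudo)                    # e.g. score-based filtering (placeholder)
        # The student sees noise (SpecAugment etc.) on both labeled and pseudo-labeled data.
        student = train_asr(labeled + pseudo, augment=True)
        teacher = student                                      # next generation's teacher
    return teacher
```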
Task 1: LibriSpeech 100-860, with the 100 h subset as labeled data and the remaining 860 h as unlabeled data (results figure omitted).

Task 2: LibriSpeech-LibriLight, with all of LibriSpeech as labeled data and the unlab-60k portion of LibriLight as unlabeled data (results figure omitted).

Ⅲ.Computational Resource Constrained Speech Recognition

1.Scaling Up Online Speech Recognition Using ConvNets

Facebook's TDS-encoder model applied to streaming ASR; the main modifications are the following:
1. Asymmetrically Padded Convolutions (figure omitted)
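A minimal sketch of an asymmetrically padded 1-D convolution, assuming the point is to push most of the padding to the past (left) side so the layer consumes little or no future context; the exact left/right split is my illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricConv1d(nn.Module):
    def __init__(self, channels, kernel_size, future_frames=1):
        super().__init__()
        self.left_pad = kernel_size - 1 - future_frames   # look-back padding
        self.right_pad = future_frames                    # limited look-ahead padding
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, self.right_pad))      # pad mostly on the past side
        return self.conv(x)
```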
The last piece is online beam-search decoding; see the paper for details.

Experiment
The training set contains 13.7k hours of audio.

2.Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition

Experiment results: both CER and RTF look good (figure omitted).

Ⅳ.ASR Neural Network Architectures and Training

1.Semi-supervised end-to-end ASR via teacher-student learning with conditional posterior distribution

Ⅴ.ASR Neural Network Architectures — Transformers

1.Improving Transformer-based Speech Recognition with Unsupervised Pre-training and Multi-task Semantic Knowledge Learning

Contributions:
(1)We propose two unsupervised pre-training strategies, speech predictive coding (SPC) and text predictive coding (TPC), respectively. The SPC strategy employs a large amount of unpaired speech data with an MLM-like [11] objective to obtain general feature representations for speech, such as acoustic semantic features. And the TPC strategy uses a large amount of unpaired text data with an autoregression language model objective to get general feature representations of the text, such as linguistic semantic features. (A small sketch of the SPC objective follows after this list.)
(2)In order to prevent the model from forgetting the semantic knowledge during the fine-tuning stage, we propose a new semi-supervised fine-tuning method, named multi-task semantic knowledge learning (MTSL), which further strengthens the model’s learning ability of semantic knowledge.
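To make the SPC idea in (1) concrete, here is a minimal masked-prediction loss on speech features: random frame spans are zeroed out and the encoder is trained to reconstruct them, i.e. an MLM-like objective on continuous features, hence an L1 reconstruction loss rather than a softmax. The mask probability, span length, and loss choice are my assumptions.

```python
import torch
import torch.nn.functional as F

def spc_loss(encoder, feats, mask_prob=0.15, span=10):
    # feats: (batch, T, feat_dim); encoder maps (batch, T, feat_dim) back to feat_dim (assumption).
    B, T, _ = feats.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=feats.device)
    for b in range(B):
        starts = torch.nonzero(torch.rand(T, device=feats.device) < mask_prob / span).squeeze(-1)
        for s in starts.tolist():
            mask[b, s:s + span] = True                 # mask a contiguous span of frames
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = encoder(corrupted)
    return F.l1_loss(pred[mask], feats[mask])          # reconstruct only the masked frames
```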
I have not read the experiments in detail yet.

2.Weak-Attention Suppression For Transformer Based Speech Recognition

Key idea: a threshold is computed for each query position; all attention probabilities smaller than this threshold are set to zero, and the remaining probabilities are re-normalized to sum to one.
The method is as follows (figures omitted):
Here $m_i$ and $\delta_i$ are the mean and standard deviation of the attention probabilities for the $i$-th position of $query \in \mathbb{R}^{L \times d_q}$.
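A small sketch of the suppression step as I read it: for each query position, derive a threshold from the mean and standard deviation of its attention probabilities (here $\theta_i = m_i - \gamma\,\delta_i$ with a hyperparameter $\gamma$, which is my assumption about how the two statistics are combined), zero out probabilities below the threshold, and renormalize the rest to sum to one.

```python
import torch

def weak_attention_suppression(attn, gamma=0.5, eps=1e-8):
    # attn: (..., L_query, L_key) attention probabilities, each row summing to 1
    mean = attn.mean(dim=-1, keepdim=True)
    std = attn.std(dim=-1, keepdim=True)
    threshold = mean - gamma * std                        # per-query dynamic threshold
    suppressed = torch.where(attn < threshold, torch.zeros_like(attn), attn)
    return suppressed / (suppressed.sum(dim=-1, keepdim=True) + eps)   # renormalize
```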

3.Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Contributions:
(1) First, we apply interleaved convolutions to gradually downsample the input audio sequence. (The authors find that this progressive downsampling scheme causes no loss in accuracy.)

(2) Second, we limit the length of history context (left-context) of self-attention in Transformer layers to maintain constant computation cost for each decoding step. (This was also used in Transformer Transducer; see the sketch after this list.)

(3) Finally, we apply relative position encoding, which enables hidden-state reuse for the Transformer.
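To illustrate contribution (2), here is a sketch of a self-attention mask that lets each frame see only a fixed number of past frames and no future frames, so per-step computation stays constant during streaming decoding; the context size is an illustrative value, not the paper's setting.

```python
import torch

def limited_left_context_mask(T, left_context=64):
    idx = torch.arange(T)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)          # rel[i, j] = j - i
    allowed = (rel <= 0) & (rel > -left_context)       # attend to frames in (i - left_context, i]
    # Float mask: 0 where attention is allowed, -inf where it is blocked.
    return torch.where(allowed, torch.zeros(T, T), torch.full((T, T), float("-inf")))
```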

The Conv-Transformer Transducer achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The look-ahead window introduced by the convolution layers is 140 ms. (How do the convolution layers implement the look-ahead?)

Overall network structure (figure omitted).

Audio encoder structure (figure omitted).
Downsampling scheme (figure omitted).
Prediction network structure (figure omitted).
Experiment results (figures omitted).
Limiting the audio encoder's history context degrades performance.

A few points worth clarifying:
(1) Low frame rate: the original input frame rate is 10 ms; after one downsampling step it becomes 20 ms. I take "frame rate" here to mean the frame shift.

(2) Unidirectional Transformer: similar to Transformer-XL, the current frame cannot see future information, hence "unidirectional". The look-ahead is provided by the convolution layers.

4.Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR

we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR.
Two motivations:
(1) we hope to benefit from a deep decoder network structure that encodes multi-level of abstraction from both acoustic and linguistic representation
(2) we hypothesize that a shared acoustic and linguistic embedding space will help the network to learn the association between acoustic and linguistic information, and improve their alignments.

Network structure (figures omitted).
Description of the SMA module (figures omitted).
Multi-task learning (figures omitted).
Experiment results (figures omitted).

5.Exploring Transformers for Large-Scale Speech Recognition

How streaming is implemented (figure omitted).

Ⅵ.ASR Neural Network Architectures and training

1.Emitting Word Timings with End-to-End Models

In this paper, we present an approach to word timings by constraining the attention head of the Listen, Attend, Spell (LAS) 2nd-pass rescorer. See the paper for details.

Ⅶ.Cross/Multi-Lingual and Code-Switched Speech Recognition

1.Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition


Ⅷ.ASR Model Training and Strategies

1.Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition

The paper proposes minimum Bayes risk (MBR) training for RNN-T: MBR training is conducted by minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypotheses. The authors also introduce a heuristic to incorporate an external NNLM in RNN-T beam-search decoding and explore MBR training with the external NNLM.
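The objective can be sketched directly from the description above: the expected edit distance of the N-best hypotheses under the model posterior renormalized over the N-best list. `hyp_log_probs` would come from on-the-fly beam search with the current model, and `edit_distance` is any token-level edit distance; both are placeholders.

```python
import torch

def mbr_loss(hyp_log_probs, hyps, reference):
    # hyp_log_probs: (N,) model log-probabilities of the N-best hypotheses
    # hyps: list of N token sequences; reference: reference token sequence
    post = torch.softmax(hyp_log_probs, dim=0)           # renormalize over the N-best list
    risks = torch.tensor([float(edit_distance(h, reference)) for h in hyps])
    return (post * risks).sum()                           # expected edit distance
```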

2.Semantic Mask for Transformer based End-to-End Speech Recognition

Our masking approach is more structured in the sense that we mask the acoustic signals corresponding to a particular output token. Besides the benefit in terms of model regularization, our approach also encourages the model to reconstruct the missing token based on the contextual information, which improves the power of the implicit language model in the decoder.
Similar to BERT, except that the semantic mask here removes the acoustic features corresponding to a particular output token, with the token boundaries obtained via MFA (Montreal Forced Aligner).
This differs from SpecAugment, which randomly masks regions of the spectrogram (figure omitted).
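A minimal sketch of the semantic-mask idea, under my reading: given forced-alignment spans for each output token, pick some tokens at random and overwrite all acoustic frames inside their spans, rather than SpecAugment's random time/frequency blocks. The masking probability and the mean-fill value are my choices.

```python
import random
import torch

def semantic_mask(feats, token_spans, mask_prob=0.15):
    # feats: (T, feat_dim); token_spans: list of (start_frame, end_frame) from forced alignment
    masked = feats.clone()
    fill = feats.mean(dim=0)                      # fill masked spans with the mean feature vector
    for start, end in token_spans:
        if random.random() < mask_prob:
            masked[start:end] = fill              # mask the whole acoustic span of this token
    return masked
```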

3.Unsupervised Regularization-Based Adaptive Training for Speech Recognition

We propose two novel regularization-based speaker adaptive training approaches for connectionist temporal classification (CTC) based speech recognition:
(1) The first method is center loss (CL) regularization, which is used to penalize the distances between the embeddings of different speakers and the only center. (A small sketch follows after this list.)
(2)The second method is speaker variance loss (SVL) regularization in which we directly minimize the speaker interclass variance during model training.
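A small sketch of the center-loss-style regularizer in (1), as I read it: pull the per-utterance speaker embeddings of all speakers toward one shared, learnable center so the representation becomes speaker-invariant; the embedding extraction and the weight `alpha` are placeholders.

```python
import torch
import torch.nn as nn

class SingleCenterLoss(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(embed_dim))    # the single shared center

    def forward(self, speaker_embeddings):                     # (batch, embed_dim)
        return ((speaker_embeddings - self.center) ** 2).sum(dim=1).mean()

# Combined objective (sketch): total_loss = ctc_loss + alpha * SingleCenterLoss(d)(embeddings)
```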

Ⅸ.Noise robust and distant speech recognition

1.Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones

Meeting transcription with asynchronous distributed microphones; the whole framework consists of five parts (figure omitted):
Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without considering sampling frequency mismatch between microphones. (Why does doing diarization before speech enhancement remove the need to handle the sampling-rate mismatch between microphones?)

The five modules:
(1)Blind synchronization
(2)Speaker diarization
In this study, we concatenate two kinds of features: speaker-characteristics-based features (x-vectors in this paper) and power-ratio-based features. See the paper for details.
(3)Speech enhancement
We first apply Weighted Prediction Error [22] to the input multichannel signals in a short-time Fourier transform (STFT) domain for dereverberation.
After that, speech separation by GSS [15] using a complex Angular Central Gaussian Mixture Model (cACGMM) [23] is applied.
Finally, Blind Analytic Normalization (BAN) postfilter [25] is applied for wf to obtain the final beamformer.
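Purely as an outline of the processing chain above (every function below is a placeholder; no real library API is implied): WPE dereverberation in the STFT domain, GSS-style mask estimation with a cACGMM guided by the diarization output, mask-based beamforming, and a BAN postfilter.

```python
def enhance(multichannel_wav, diarization_segments):
    stft = compute_stft(multichannel_wav)                       # (channels, frames, bins), placeholder
    dereverbed = wpe_dereverberation(stft)                      # weighted prediction error (placeholder)
    masks = cacgmm_guided_source_separation(dereverbed,         # GSS with cACGMM, guided by the
                                             diarization_segments)  # diarization activity (placeholder)
    beamformed = mask_based_beamformer(dereverbed, masks)       # mask-based beamforming (placeholder)
    enhanced = blind_analytic_normalization(beamformed)         # BAN postfilter (placeholder)
    return inverse_stft(enhanced)
```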
(4)Speech recognition
(5)Duplication reduction
See the paper for details.

Experiment
