A Survey of the CNN Encoders in Self-Supervised Pre-trained Speech Models

Haulyn5

Last modified: 2022-06-01 15:47:33

First published: 2022-05-31 14:45:30

Article link: https://blog.csdn.net/Haulyn5/article/details/125065221

Preface

I have recently been reading the WavLM paper and noticed that its CNN encoder has seven layers. The relevant description is quoted below.

“The convolutional encoder is composed of seven blocks of temporal convolution followed by layer normalization and a GELU activation layer. The temporal convolutions have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2), resulting in each output representing about 25ms of audio strided by 20ms.” (Chen et al., 2022, p. 3)

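This made me curious: is there a deeper reason behind hyperparameters such as the number of CNN layers and the strides? Why this configuration and not another? Most of the self-supervised speech pre-training papers I have read recently mention the CNN encoder only in passing, with a brief description of its role and parameters. At most, a paper will say that the encoder reduces the dimensionality of the high-dimensional speech signal, and that the chosen hyperparameters give roughly a 25 ms receptive field with a 20 ms stride, which are the values traditionally recommended in speech signal processing.

Because the WavLM paper devotes a whole section to HuBERT, I suspected that this CNN configuration simply follows HuBERT. A quick search confirmed that HuBERT uses the same configuration, but HuBERT does not explain the choice either. After all, there are infinitely many CNN designs that yield a 25 ms receptive field. Since HuBERT mainly discusses and compares itself against wav2vec 2.0, I dug one step further. At this point I have collected enough material to be worth recording in a blog post.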
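The 25 ms and 20 ms figures can be checked from the quoted kernel widths and strides. The snippet below is a minimal sketch, assuming 16 kHz audio and unpadded, undilated 1-D convolutions (neither is stated in the quote); the function name is my own. It also evaluates a hypothetical single-layer design to illustrate that the same receptive field and stride can be reached in other ways.

```python
# (kernel_width, stride) per layer, taken from the WavLM quote above.
WAVLM_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # Hz; an assumption, not stated in the quote


def receptive_field_and_stride(layers):
    """Return (receptive field, total stride) in input samples."""
    rf, jump = 1, 1  # jump = distance in input samples between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # strides multiply through the stack
    return rf, jump


rf, hop = receptive_field_and_stride(WAVLM_LAYERS)
print(rf, hop)                                            # 400 320
print(1000 * rf / SAMPLE_RATE, 1000 * hop / SAMPLE_RATE)  # 25.0 20.0

# Hypothetical alternative: one very wide convolution gives the same numbers.
print(receptive_field_and_stride([(400, 320)]))           # (400, 320)
```

The single-layer variant is only there to show that the receptive field and stride alone do not pin down the architecture; it is not a design taken from any of the papers.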

Main Text

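In short, for the front-end CNN encoder, WavLM follows HuBERT's configuration, and HuBERT follows that of wav2vec 2.0. As far as I knew, wav2vec 2.0 builds directly on vq-wav2vec, but when I looked into vq-wav2vec I found that it does not use the same CNN encoder.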

The table below records the details.

CNN encoder parameters of several self-supervised pre-trained speech models
Model	Configuration	Quote
vq-wav2vec	8 convolutional layers	“The encoder has 8 layers with 512 channels each, kernel sizes (10,8,4,4,4,1,1,1) and strides (5,4,2,2,2,1,1,1), yielding a total stride of 160. Each layer contains a convolution, followed by dropout, group normalization with a single group (Wu & He, 2018) and a ReLU non-linearity.” (Baevski et al., 2019, p. 4)
wav2vec 2.0	7 convolutional layers	“The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).” (Baevski et al., 2020, p. 4) “This results in an encoder output frequency of 49 hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25ms of audio.” (Baevski et al., 2020, p. 4)
HuBERT	Same as wav2vec 2.0	“The waveform encoder is identical for all the three configurations, which is composed of seven 512-channel layers with strides [5,2,2,2,2,2,2] and kernel widths [10,3,3,3,3,2,2]. The BERT encoder consists of many identical transformer blocks, whose parameters along with the parameter of the subsequent projection layer are specified in Table” (Hsu et al., 2021, p. 3)
WavLM	Same as HuBERT	“The convolutional encoder is composed of seven blocks of temporal convolution followed by layer normalization and a GELU activation layer. The temporal convolutions have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2), resulting in each output representing about 25ms of audio strided by 20ms.” (Chen et al., 2022, p. 3)
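To make the difference between the two distinct encoders in the table concrete, the sketch below counts how many output frames each produces for one second of audio. It assumes 16 kHz input and convolutions without padding or dilation, so each layer maps a length L to floor((L - k) / s) + 1; the dictionary and function names are illustrative only.

```python
def conv_output_length(n_samples, layers):
    """Frames left after a stack of unpadded 1-D convolutions."""
    length = n_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


# (kernel, stride) pairs built from the kernel and stride tuples quoted in the table.
ENCODERS = {
    "vq-wav2vec": list(zip((10, 8, 4, 4, 4, 1, 1, 1), (5, 4, 2, 2, 2, 1, 1, 1))),
    "wav2vec 2.0 / HuBERT / WavLM": list(zip((10, 3, 3, 3, 3, 2, 2), (5, 2, 2, 2, 2, 2, 2))),
}

ONE_SECOND = 16_000  # samples in one second at an assumed 16 kHz sample rate

frames_per_second = {
    name: conv_output_length(ONE_SECOND, layers) for name, layers in ENCODERS.items()
}
for name, frames in frames_per_second.items():
    print(f"{name}: {frames} frames per second")
# vq-wav2vec: 98 frames per second
# wav2vec 2.0 / HuBERT / WavLM: 49 frames per second
```

Under these assumptions the seven-layer encoder yields 49 frames, which agrees with the "49 hz" in the wav2vec 2.0 quote, while the vq-wav2vec encoder runs at roughly twice that rate because its total stride is 160 rather than 320.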

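I will add more when I get the chance, including other models and the rationale behind these parameter choices (once I understand it myself).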

Addendum (2022-06-01)

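However, when I actually ran the WavLM code, the CNN encoder did not seem to reduce dimensionality. As a test, I fed it an input of shape (8, 80000), i.e. batch_size=8 and 80000 audio samples per utterance, and obtained the model summary below (printed with the torchinfo library).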

from torchinfo import summary
from wavlm.WavLM import ConvFeatureExtractionModel

# One tuple per block: (output channels, kernel width, stride).
feature_enc_layers = [
    (512, 10, 5),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 2, 2),
    (512, 2, 2),
]
conv_feature_extractor = ConvFeatureExtractionModel(
    conv_layers=feature_enc_layers,
    dropout=0.0,
    mode='default',  # as the summary shows, only the first block gets a group norm
    conv_bias=False,
)

# Batch of 8 raw waveforms, 80000 samples each.
summary(conv_feature_extractor, input_size=(8, 80000))

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
ConvFeatureExtractionModel               --                        --
├─ModuleList: 1-1                        --                        --
│    └─Sequential: 2-1                   [8, 512, 15999]           --
│    │    └─Conv1d: 3-1                  [8, 512, 15999]           5,120
│    │    └─Dropout: 3-2                 [8, 512, 15999]           --
│    │    └─Fp32GroupNorm: 3-3           [8, 512, 15999]           1,024
│    │    └─GELU: 3-4                    [8, 512, 15999]           --
│    └─Sequential: 2-2                   [8, 512, 7999]            --
│    │    └─Conv1d: 3-5                  [8, 512, 7999]            786,432
│    │    └─Dropout: 3-6                 [8, 512, 7999]            --
│    │    └─GELU: 3-7                    [8, 512, 7999]            --
│    └─Sequential: 2-3                   [8, 512, 3999]            --
│    │    └─Conv1d: 3-8                  [8, 512, 3999]            786,432
│    │    └─Dropout: 3-9                 [8, 512, 3999]            --
│    │    └─GELU: 3-10                   [8, 512, 3999]            --
│    └─Sequential: 2-4                   [8, 512, 1999]            --
│    │    └─Conv1d: 3-11                 [8, 512, 1999]            786,432
│    │    └─Dropout: 3-12                [8, 512, 1999]            --
│    │    └─GELU: 3-13                   [8, 512, 1999]            --
│    └─Sequential: 2-5                   [8, 512, 999]             --
│    │    └─Conv1d: 3-14                 [8, 512, 999]             786,432
│    │    └─Dropout: 3-15                [8, 512, 999]             --
│    │    └─GELU: 3-16                   [8, 512, 999]             --
│    └─Sequential: 2-6                   [8, 512, 499]             --
│    │    └─Conv1d: 3-17                 [8, 512, 499]             524,288
│    │    └─Dropout: 3-18                [8, 512, 499]             --
│    │    └─GELU: 3-19                   [8, 512, 499]             --
│    └─Sequential: 2-7                   [8, 512, 249]             --
│    │    └─Conv1d: 3-20                 [8, 512, 249]             524,288
│    │    └─Dropout: 3-21                [8, 512, 249]             --
│    │    └─GELU: 3-22                   [8, 512, 249]             --
==========================================================================================
Total params: 4,200,448
Trainable params: 4,200,448
Non-trainable params: 0
Total mult-adds (G): 98.14
==========================================================================================
Input size (MB): 2.56
Forward/backward pass size (MB): 1564.41
Params size (MB): 16.80
Estimated Total Size (MB): 1583.77
==========================================================================================
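The parameter counts in the summary can be reproduced by hand. The sketch below assumes what the summary itself suggests: bias-free Conv1d layers (conv_bias=False), a single input channel for the raw waveform, and one group normalization with a learnable scale and shift per channel in the first block. The variable names are my own.

```python
# (output channels, kernel width, stride) for each block, as in the code above.
conv_layers = [(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2),
               (512, 3, 2), (512, 2, 2), (512, 2, 2)]

in_channels = 1  # raw waveform: assumed to enter as a single channel
conv_params = []
for out_channels, kernel, _stride in conv_layers:
    # A bias-free Conv1d holds in_channels * out_channels * kernel weights.
    conv_params.append(in_channels * out_channels * kernel)
    in_channels = out_channels

group_norm_params = 2 * 512  # per-channel scale and shift in the first block

total = sum(conv_params) + group_norm_params
print(conv_params)  # [5120, 786432, 786432, 786432, 786432, 524288, 524288]
print(total)        # 4200448
```

The total matches the 4,200,448 reported by torchinfo, and almost all of it comes from the six 512-to-512 convolutions rather than from the first layer that touches the waveform.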

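The output for each sample therefore contains 512 * 249 = 127,488 values, so for each utterance the feature dimension actually grows compared with the 80,000 input samples. The convolutions themselves do shorten the time axis, but with enough filters the total output size still increases. For 16,000 Hz audio, a 25 ms receptive field and a 20 ms frame shift give a frame rate of about 50 Hz, so once the number of filters exceeds roughly 16000 / 50 = 320, the output becomes larger than the input.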
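The arithmetic in this paragraph can be written out as a short check. This is only a sketch: the frame count 249 is copied from the summary above rather than recomputed, and the variable names are illustrative.

```python
n_input = 80_000   # samples per utterance in the test input
n_frames = 249     # time steps after the seven blocks (from the summary)
n_channels = 512   # filters per layer

output_values = n_channels * n_frames
print(output_values)                      # 127488
print(round(output_values / n_input, 4))  # 1.5936, i.e. about 1.6x the input size

# Channel count at which the output is as large as the input for this utterance.
break_even_channels = n_input / n_frames
print(round(break_even_channels, 1))      # 321.3

# The same threshold estimated from the frame rate: 16000 Hz / 50 Hz.
print(16_000 // 50)                       # 320
```

The two thresholds differ slightly because the real frame rate is a little under 50 Hz for a finite-length input, but both say the same thing: with 512 channels the encoder trades a shorter time axis for a larger overall representation.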

References

[1] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” arXiv:2110.13900 [cs, eess], Jan. 2022, Accessed: May 11, 2022. [Online]. Available: http://arxiv.org/abs/2110.13900

[2] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” arXiv:2106.07447 [cs, eess], Jun. 2021, Accessed: Apr. 28, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447

[3] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations,” presented at the International Conference on Learning Representations, Sep. 2019. Accessed: Apr. 01, 2022. [Online]. Available: OpenReview

[4] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv:2006.11477 [cs, eess], Oct. 2020, Accessed: Mar. 11, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477