An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Sophie'sCookingLab

Last modified: 2024-07-11 12:00:25

Category: Large Models. Tags: language models, artificial intelligence, deep learning

First published: 2024-07-10 23:03:52

Article link: https://blog.csdn.net/weixin_40566713/article/details/140336564

An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition
https://arxiv.org/html/2312.03668v1

1. Main idea of the paper

This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR.

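The model consists of two main parts:
1. A pre-trained speech representation model (for example HuBERT [4] or w2v-BERT [26]): it supplies the speech representations that serve as speech prompts.
2. An LLM: it generates text tokens in an autoregressive manner from those speech prompts, taking advantage of the vast knowledge provided by the LLM.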
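
The autoregressive generation from a speech prompt described above can be sketched with a toy greedy decoding loop. This is only an illustration under stated assumptions, not the paper's code: `greedy_decode`, `toy_embed` and `toy_next_token` are hypothetical names, the "LLM" is a scripted stand-in, and greedy search is assumed purely for simplicity.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def greedy_decode(
    speech_prompt: Sequence[Vector],
    embed: Callable[[int], Vector],
    next_token: Callable[[Sequence[Vector]], int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> List[int]:
    """Generate token ids one at a time, conditioned on a speech prompt."""
    # The context starts as the continuous speech prompt (no text tokens yet).
    context: List[Vector] = list(speech_prompt)
    output: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)  # choose the next token from the full context
        if token == eos_id:
            break
        output.append(token)
        # Feed the generated token back in as an embedding (autoregression).
        context.append(embed(token))
    return output


# --- Toy stand-ins (hypothetical; a real system would call the LLM here) ---
PROMPT = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three "speech prompt" vectors
SCRIPT = [5, 7, 2]                             # token ids the toy model emits
EOS_ID = 2


def toy_embed(token: int) -> Vector:
    return [float(token), 0.0]


def toy_next_token(context: Sequence[Vector]) -> int:
    # Tokens generated so far = context length minus prompt length.
    return SCRIPT[len(context) - len(PROMPT)]


tokens = greedy_decode(PROMPT, toy_embed, toy_next_token, EOS_ID)
print(tokens)
```

The point of the sketch is that the prompt is a sequence of continuous vectors rather than token ids; only the generated part of the context goes through the token embedding.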

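Previous work

Earlier studies still relied on acoustic features derived from signal processing, such as filter bank outputs; these features were not optimized end to end between the speech processing model and the LLM.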

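Advantages of this paper

In contrast, this paper combines a pre-trained speech representation model with an LLM and bridges the speech-modality information directly into the LLM as continuous features, performing speech recognition in a fully E2E manner.
The paper focuses on the speech recognition task, which is a suitable first target because it measures whether a spoken utterance is passed to the LLM accurately and without omissions.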

2. Overall model architecture

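The speech waveform x is fed into the audio encoder to obtain speech representations; the bridge network then maps these representations into the text-token embedding space so that they can be fed to the LLM as a speech prompt.
an audio encoder: adopt a pre-trained HuBERT model. It embeds the waveform into the speech representation space.
a bridge network: converts the speech representations into the embedding space of the LLM. Since the output of HuBERT is a 20 ms shifted feature sequence (one frame every 20 ms, about 50 frames per second; the shift is the hop between consecutive frames, not the amount of overlap), the sequence length is longer than that of text tokens, making direct handling by the LLM inefficient. Because the speech sequence is much longer than the corresponding text, the bridge network both shortens it and maps it into the LLM's embedding space; its output is still a sequence of continuous vectors, not text.
and an LLM: Any LLM can be used.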
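
To make the "20 ms shift" concrete, the sketch below computes how many frames the audio encoder produces for a given waveform length. It assumes the standard HuBERT Base convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding) and 16 kHz audio; the function names are illustrative and not taken from the paper.

```python
# Assumed HuBERT Base feature-encoder configuration (7 unpadded 1D conv layers).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz


def num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1  # unpadded conv output length
    return length


def hop_and_receptive_field() -> tuple:
    """Frame hop and receptive field of the whole stack, both in samples."""
    hop, field = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        field += (kernel - 1) * hop  # each layer widens the field by (k-1) hops
        hop *= stride                # strides multiply into the overall hop
    return hop, field


hop, field = hop_and_receptive_field()
print("hop (ms):", hop * 1000 / SAMPLE_RATE)
print("receptive field (ms):", field * 1000 / SAMPLE_RATE)
print("frames for 10 s:", num_frames(10 * SAMPLE_RATE))
```

Under these assumptions the hop is 320 samples (20 ms), each frame sees 400 samples (25 ms), and a 10-second utterance becomes 499 frames, far more than the number of text tokens in a typical transcript of that length.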

[Image not included]

3. The bridge network: two methods, both of which shorten the speech embedding sequence

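The bridge network is the only component that is newly initialized; HuBERT and GPT both start from pre-trained weights. Note that, according to the paper's own description (next sentence), those two components are fine-tuned rather than frozen.
In the proposed model, the pre-trained HuBERT and GPT are connected by a convolution-based bridge network and fully fine-tuned; the bridge network passes meaningful continuous latent representations extracted from the speech waveform samples to the LLM as speech prompts.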
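(1) Downsampling: a kernel size of 4 and a stride size of 2. Convolutional downsampling compresses the HuBERT output so that it is about 1/4 of its original length (a single stride-2 layer only halves the length, so a 1/4 ratio corresponds to two such layers in sequence).
(2) CTC compression: frames that CTC predicts as blank are removed.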
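
A minimal sketch of the two length-reduction steps listed above, in plain Python. Assumptions: two unpadded stride-2 convolutions (the padding is not given in these notes), a blank id of 0, and toy CTC posteriors; `downsampled_len` and `ctc_compress` are illustrative names, not the paper's code.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def conv1d_out_len(length: int, kernel: int = 4, stride: int = 2, padding: int = 0) -> int:
    """Output length of a 1D convolution with the given kernel/stride/padding."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_len(num_frames: int, num_layers: int = 2) -> int:
    """Sequence length after stacking num_layers kernel-4, stride-2 convolutions."""
    for _ in range(num_layers):
        num_frames = conv1d_out_len(num_frames)
    return num_frames


def ctc_compress(
    frames: Sequence[Frame],
    ctc_posteriors: Sequence[Sequence[float]],
    blank_id: int = 0,
) -> List[Frame]:
    """Keep only the frames whose most likely CTC label is not blank."""
    kept = []
    for frame, posterior in zip(frames, ctc_posteriors):
        predicted = max(range(len(posterior)), key=posterior.__getitem__)  # argmax
        if predicted != blank_id:
            kept.append(frame)
    return kept


# 499 frames is roughly 10 s of audio at a 20 ms frame shift.
print(downsampled_len(499))

# Toy example: four frames, two labels (index 0 is the blank).
frames = ["f0", "f1", "f2", "f3"]
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
kept = ctc_compress(frames, posteriors)
print(kept)
```

The downsampling ratio is fixed by the layer configuration, whereas the amount removed by CTC compression depends on how many frames are predicted as blank for each utterance.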
[Image not included]

4. Model training

Corpus: ReazonSpeech corpus [31], a 19,000-hour speech corpus collected from Japanese TV programs with 16kHz sampling for training ASR models.
dev:test:train = 1000:1000:17000
Data preprocessing:
Audio representation model: japanese-hubert-base
LLM: japanese-gpt-neox-3.6b
tokenizer: sentencepiece-based tokenizer
Training resources and duration: The proposed models were trained on four NVIDIA A100 80GB GPUs for five epochs, with a total batch size of 64 utterances per GPU, by doing four gradient accumulations. The wall time for training the proposed model was approximately 56 hours.
Inference: DeepSpeed-Inference
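
The batch-size figures in the training line can be read in more than one way. The arithmetic below works through one reading, stated as an assumption rather than a fact from the paper: 64 utterances per GPU per optimizer update, reached by accumulating gradients over four micro-batches, on four GPUs.

```python
# Assumed reading of the training setup (not confirmed by the source).
NUM_GPUS = 4
BATCH_PER_GPU = 64       # utterances per GPU per optimizer update
GRAD_ACCUM_STEPS = 4     # micro-batches accumulated before each update

# Utterances processed in a single forward/backward pass on one GPU.
micro_batch_per_gpu = BATCH_PER_GPU // GRAD_ACCUM_STEPS

# Utterances contributing to one optimizer update across all GPUs.
global_batch = BATCH_PER_GPU * NUM_GPUS

print(micro_batch_per_gpu, global_batch)
```

Under this reading each GPU holds 16 utterances in memory at a time and every update averages over 256 utterances; a different reading of "total batch size" would change both numbers.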

5. Experimental comparison

[Image not included]
