Notes on the LSLM Paper

Problem Addressed

Current speech language models (SLMs) enhance spoken-dialogue capability, but they are limited to turn-based exchanges and lack the ability to interact with users in real-time scenarios, for example being interrupted when the generated response is unsatisfactory. This paper therefore applies full duplex modeling (FDM) to interactive speech language models (iSLM), aiming to enhance real-time interactivity and, more specifically, to explore the essence of the interruption capability.

We introduce a novel model design, namely the listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies—early fusion, middle fusion, and late fusion—are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions.

The proposed LSLM uses a token-based decoder-only TTS to model the ability to speak and a streaming self-supervised learning (SSL) encoder to model the ability to listen.
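To make the two-channel design concrete, here is a minimal PyTorch sketch of that layout. The module names, dimensions, and the simple additive (early-fusion-style) merge of the two channels are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LSLMSketch(nn.Module):
    """Minimal sketch: a decoder-only backbone over discrete speech tokens
    (speaking channel) plus projected streaming SSL features (listening
    channel). Sizes and the additive merge are assumptions."""

    def __init__(self, vocab_size=1024, ssl_dim=768, d_model=512, n_layers=12):
        super().__init__()
        self.speak_embed = nn.Embedding(vocab_size, d_model)  # TTS token embeddings
        self.listen_proj = nn.Linear(ssl_dim, d_model)        # streaming SSL features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # A causal mask turns the encoder stack into a decoder-only LM.
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, speak_tokens, listen_feats):
        # speak_tokens: (B, T) discrete speech tokens; listen_feats: (B, T, ssl_dim)
        h = self.speak_embed(speak_tokens) + self.listen_proj(listen_feats)
        T = h.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=h.device), diagonal=1)
        h = self.backbone(h, mask=causal)
        return self.head(h)  # next-token logits for the speaking channel
```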

LLMs have facilitated a paradigm shift from simplex models to half-duplex models, also known as turn-based models, as shown in Figure 1(C). Prominent models include SpeechGPT [48], LauraGPT [5], and VioLA [42]. While these half-duplex models can both listen and speak, they are constrained to performing only one action at any given instant and thus fail to address the turn-taking problem.

Simplex and Half-Duplex Modeling

In a simplex or half-duplex spoken dialogue system, the model generates the response $R$ autoregressively conditioned on the context $C$, with training loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(r_t \mid C, R_{1:t-1}; \theta),$$

where $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$ and $T$ is the sequence length. During the inference phase, the model can only predict the next token autoregressively based on the previous output within the current channel, without information from other channels.
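As a sketch, this loss is ordinary next-token cross-entropy over the response channel; the tensor shapes below are assumptions for illustration.

```python
import torch.nn.functional as F

def half_duplex_loss(logits, response_tokens):
    """logits: (B, T, V) per-step predictions given C and R_{1:t-1};
    response_tokens: (B, T) ground-truth tokens r_1..r_T."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        response_tokens.reshape(-1),          # (B*T,)
    )  # mean of -log P(r_t | C, R_{1:t-1}), i.e. L(theta) up to a 1/(B*T) factor
```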

Full-Duplex Modeling

In modeling a full duplex spoken dialogue system within an autoregressive language model, the model needs to predict the next token $r_t$ in the response $R$ not only based on the context $C$ and the generated response history $R_{1:t-1} = [r_1, r_2, \ldots, r_{t-1}]$ in the current channel, but also by simultaneously utilizing information $S_{1:t-1} = [s_1, s_2, \ldots, s_{t-1}]$ from the other channel. The training loss $\mathcal{L}(\theta)$ is now formulated as

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(r_t \mid C, R_{1:t-1}, S_{1:t-1}; \theta).$$

A key point in FDM is that the sequence S is produced in real time and unpredictably. 
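At inference time this means each decoding step consumes one newly arrived listening frame while emitting the next speaking token, and generation stops when the model signals a turn change. In the sketch below, `get_listen_frame`, the `LSLMSketch` model from the earlier snippet, and the interrupt-style stop token id (standing in for the paper's turn-taking signal) are all assumptions.

```python
import torch

@torch.no_grad()
def stream_decode(model, context_tokens, get_listen_frame, max_steps=500, irq_token=0):
    """context_tokens: (1, t0) speaking-channel prompt; get_listen_frame()
    returns one (1, ssl_dim) frame of streaming SSL features."""
    speak = context_tokens
    # Align the listening channel with the prompt, one frame per position.
    listen = [get_listen_frame() for _ in range(speak.size(1))]
    for _ in range(max_steps):
        feats = torch.stack(listen, dim=1)      # (1, t, ssl_dim)
        logits = model(speak, feats)            # fuse both channels, predict r_t
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        if next_tok.item() == irq_token:        # turn-taking detected: stop speaking
            break
        speak = torch.cat([speak, next_tok], dim=1)
        listen.append(get_listen_frame())       # S arrives in real time, unpredictably
    return speak
```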

LSLM's speaking capability, listening capability, and the fusion methods that integrate the two

The core difference between LSLM and previous speech language models lies in its capability to simultaneously speak and listen. We first introduce the speaking capability of LSLM, followed by its listening capability, and finally, we discuss various fusion methods that integrate these capabilities, endowing LSLM with full duplex ability.
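The three fusion placements can be pictured with a short sketch. Where exactly the paper injects the listening features inside each block is more involved, so treat the per-block additive injection here as an assumption.

```python
def early_fusion(speak_emb, listen_emb, blocks, head):
    # Fuse once at the input, then run the whole backbone.
    h = speak_emb + listen_emb
    for blk in blocks:
        h = blk(h)
    return head(h)

def middle_fusion(speak_emb, listen_emb, blocks, head):
    # Re-inject the listening channel before every block, letting
    # generation and real-time listening interact at each layer.
    h = speak_emb
    for blk in blocks:
        h = blk(h + listen_emb)
    return head(h)

def late_fusion(speak_emb, listen_emb, blocks, head, listen_head):
    # Keep channels separate through the backbone; merge at the output logits.
    h = speak_emb
    for blk in blocks:
        h = blk(h)
    return head(h) + listen_head(listen_emb)

# e.g., with the LSLMSketch components from the earlier snippet:
# logits = middle_fusion(model.speak_embed(speak), model.listen_proj(feats),
#                        model.backbone.layers, model.head)
```

The paper's finding that middle fusion gives the best balance matches the intuition that real-time listening should influence every layer of the ongoing generation, not just the input or the output.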

Index Terms: Full Duplex Modeling, Interactive Speech Language Model

Related Work: Speech Language Models

This paradigm involves encoding the speech signal into discrete tokens or continuous embeddings, modeling them with a language model, and decoding the speech tokens or embeddings back to the speech signal. Some studies [19, 17, 26] utilize this paradigm for speech continuation, generating expressive speech and natural multi-round dialogue:

[26] Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling. Proc. TACL, 2023.
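In code, this paradigm reduces to an encode-model-decode loop. `codec` and `lm` below are placeholders for any neural audio tokenizer and token language model, not a specific library's API.

```python
def speech_continuation(codec, lm, waveform, n_new=200):
    tokens = codec.encode(waveform)              # speech -> discrete tokens
    tokens = lm.generate(tokens, max_new=n_new)  # LM continues the token stream
    return codec.decode(tokens)                  # tokens -> speech again
```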

Other research applies this paradigm to task-specific applications, such as decoder-only high-fidelity TTS [40, 3, 31, 13] and decoder-only streaming ASR [33, 38, 4, 8]. Moreover, SpeechGPT [48] and LauraGPT [5] initialize SLMs using LLMs, expanding speech tokens to the LLM vocabulary and continuing training on speech.

[33] Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, and Chunyang Wu. Speech ReaLLM–real-time streaming speech recognition with multimodal LLMs by teaching the flow of time. arXiv preprint arXiv:2406.09569, 2024.

[38] Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, and Shinji Watanabe. Decoder-only architecture for streaming end-to-end speech recognition. Proc. Interspeech, 2024.

[4] Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, and Lei Xie. Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study. Proc. Interspeech, 2024.

[8] Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C Puvvada, Nithin Rao Koluguri, Piotr Zelasko, Jagadeesh Balam, and Boris Ginsburg. BESTOW: Efficient and streamable speech language model with the best of two worlds in GPT and T5. arXiv preprint arXiv:2406.19954, 2024.

Despite these advances, all these models are limited to turn-based conversations and cannot handle real-time sound or interruptions, limiting their applicability in real-life scenarios.

We focus on investigating Full Duplex Modeling (FDM) in interactive Speech Language Models (iSLM), a crucial topic affecting the user experience.

Related Work: Duplex Models

Lin et al. [22] propose processing real-time audio input with a separate comprehension module. Other works [49, 41] suggest modifying the order in which text tokens are organized in the LLM to tackle the duplex modeling problem. All these models are based on text-centric LLMs that require external ASR and TTS modules for spoken dialogue. As a result, latency remains perceivable and paralinguistic ability is still lacking. We believe the FDM capability should be an intrinsic capability of SLMs, enabling simultaneous listening and speaking.
