Talking with a Digital Human -- A Case Study of Combining an LLM with a Digital Human

After ChatGPT was released at the end of 2022, large models entered a phase of rapid development, with all kinds of models from major companies at home and abroad springing up like mushrooms. Beyond human-machine conversation, large models can perform a wide variety of tasks: video understanding and generation, image understanding and generation, speech understanding and generation, speech-to-text and text-to-speech conversion, and so on. Combining these capabilities with digital humans could greatly ease the problem of bringing digital humans into real application scenarios, and we will surely see digital humans in more and more ToB and ToC settings.

I built a conversational digital human project to try combining a large model with a digital human, hoping that once the basic technical problems are solved it can be applied to real business scenarios.

The goal of the project: the user can talk to the digital human directly by voice.

This form of interaction is like two people talking face to face: simple and direct, but implementing it is not so direct. It has to be broken down into several steps, shown below:

Figure 1: Implementation flow of the digital human conversation
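The flow in Figure 1 consists of five steps, matching the sections of this series:

1. Voice recording
2. Speech-to-text
3. AI chat
4. Text-to-speech
5. Driving the digital human's talking animation from the generated speech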

The whole implementation is covered in two articles: this one covers the first four steps, and the next article covers the fifth.

Demo environment: Win 11 + Nvidia RTX 3050 + CUDA 12.1 + Unity

Figure 2: UI of the digital human conversation demo

1. Voice Recording

The user input part offers two modes:

  • Text input
  • Voice input

Voice input is implemented with the Microphone class in UnityEngine, and the recording is saved as a wav file. Here is the code:

using System.IO; // File, MemoryStream, BinaryWriter
using UnityEngine;

private AudioClip clip; // recorded audio clip
private int audioRecordMaxLength = 60; // maximum recording length: 60 seconds
private byte[] bytes; // recorded audio encoded as WAV bytes

private void StartRecording()
{
    // record from the default microphone for up to audioRecordMaxLength seconds at 44.1 kHz
    clip = Microphone.Start(null, false, audioRecordMaxLength, 44100);
}

private void StopRecording()
{
    // how many samples were actually recorded before the microphone stopped
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    SendRecording();

    // save the recorded audio as a wav file
    File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);
}

private byte[] EncodeAsWAV(float[] samples, int frequency, int channels)
{
    // 44-byte standard PCM WAV header followed by 16-bit samples
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2))
    {
        using (var writer = new BinaryWriter(memoryStream))
        {
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2);      // RIFF chunk size
            writer.Write("WAVE".ToCharArray());
            writer.Write("fmt ".ToCharArray());
            writer.Write(16);                           // fmt chunk size
            writer.Write((ushort)1);                    // audio format: PCM
            writer.Write((ushort)channels);
            writer.Write(frequency);                    // sample rate
            writer.Write(frequency * channels * 2);     // byte rate
            writer.Write((ushort)(channels * 2));       // block align
            writer.Write((ushort)16);                   // bits per sample
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);           // data chunk size

            // convert float samples in [-1, 1] to 16-bit PCM
            foreach (var sample in samples)
            {
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}

2. Speech-to-Text

Speech-to-text uses the Whisper model, an automatic speech recognition (ASR) system developed by OpenAI. Whisper is an end-to-end model built on a Transformer encoder-decoder architecture; it is a powerful, flexible, multilingual speech-to-text system suitable for scenarios such as transcription, video subtitling, and meeting notes.

To see results quickly, I used the Whisper model API on Hugging Face. First register a Hugging Face account, install the Hugging Face API package in Unity, and set your Hugging Face Access Token in Unity; after that, the Hugging Face API can be called from Unity (for details, see: How to Install and Use the Hugging Face Unity API).

The Unity code that calls the Hugging Face API is shown below; _submittedText is the recognized text returned by the API.

private void SendRecording()
{
    // send the WAV bytes to the Whisper endpoint; the callback returns the transcript
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        _submittedText = GenSubmitText(response);
    }, error => {
        _errorMsg = error;
    });
}
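GenSubmitText is a small helper of the project that is not shown in the post; a minimal sketch, assuming the callback delivers the transcript as a plain string, could be:

// Hypothetical helper: tidy the transcript before handing it to the chat loop.
// Returning "" keeps the loop waiting (see UniTask.WaitUntil in ChatRoutine below).
private string GenSubmitText(string response)
{
    return string.IsNullOrEmpty(response) ? "" : response.Trim();
}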

3. AI Chat

AI chat uses the open-source Llama 2 model. Meta's Llama 2 is a very capable open-source large model, available in 7B, 13B, and 70B parameter versions, and it can support efficient data analysis and processing. Llama 2 supports several deployment options; this project deploys it locally (for deployment, see: Tutorial on Deploying Meta's Open-Source Large Language Model Llama 2 on a Local Machine).

Unity development is done in C#, and luckily there is a C# inference project for Llama 2 on GitHub, which makes things much easier (the Llama 2 C# inference project: LLamaSharp).

After importing the LLamaSharp package into Unity, the conversation can begin; outputMessage is the Llama model's reply. Again, straight to the code:

using System.Collections.Generic; // List<string>, IAsyncEnumerable<string>
using LLama;
using LLama.Common;
using static LLama.StatefulExecutorBase;
using Cysharp.Threading.Tasks;
using System.Threading;

    public string ModelPath = "models/llama-2-7b-chat.Q4_0.gguf";
    [TextArea(3, 10)]
    public string SystemPrompt = "Transcript of a dialog, where I interacts with an Assistant named Amy. Amy is helpful, kind, honest, good at writing, and never fails to answer my requests immediately and with precision.\r\n\r\nI: Hello, Amy.\r\nAmy: Hello. How may I help you today?\r\nI: Please tell me the best city in Europe.\r\nAmy: Sure. The best city in Europe is Kyiv, the capital of Ukraine.\r\nI:";
    
    private ExecutorBaseState _emptyState;
    private ChatSession _chatSession;

    private string _submittedText = "";
    private string _errorMsg = "";
    private CancellationTokenSource _cts;

    async UniTaskVoid Start()
    {
        _cts = new CancellationTokenSource();

        // Load a model
        var parameters = new ModelParams(Application.streamingAssetsPath + "/" + ModelPath)
        {
            ContextSize = 4096,
            Seed = 1337,
            GpuLayerCount = 35
        };
        // Switch to the thread pool for long-running operations
        await UniTask.SwitchToThreadPool();
        using var model = LLamaWeights.LoadFromFile(parameters);
        await UniTask.SwitchToMainThread();
        // Initialize a chat session
        using var context = model.CreateContext(parameters);
        var ex = new InteractiveExecutor(context);
        // Save the empty state for cases when we need to switch to empty session
        _emptyState = ex.GetStateData();
        _chatSession = new ChatSession(ex);
        _chatSession.AddSystemMessage(SystemPrompt);

        // run the inference in a loop to chat with LLM
        await ChatRoutine(_cts.Token);
    }

    public async UniTask ChatRoutine(CancellationToken cancel = default)
    {
        var userMessage = "";
        var outputMessage = "";
        while (!cancel.IsCancellationRequested)
        {
            // Allow input and wait for the user to submit a message or switch the session
            SetInteractable(true);
            await UniTask.WaitUntil(() => _submittedText != "");
            userMessage = _submittedText;
            _submittedText = "";
            outputMessage = "";

            // Disable input while processing the message. ChatConcurrent and
            // SetInteractable are helpers of the project; a sketch of
            // ChatConcurrent follows this block.
            await foreach (var token in ChatConcurrent(
                _chatSession.ChatAsync(
                    new ChatHistory.Message(AuthorRole.User, userMessage),
                    new InferenceParams()
                    {
                        Temperature = 0.6f,
                        // stop when the model starts generating the user's ("I:") turn
                        AntiPrompts = new List<string> { "I:" }
                    }
                )
            ))
            {
                outputMessage += token;
                await UniTask.NextFrame();
            }
        }
    }
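ChatConcurrent is a project helper that the post does not show. A minimal sketch, assuming its only job is to keep LLamaSharp's token stream off the Unity main thread:

// Hypothetical wrapper: drain the token stream on the thread pool so inference
// does not block the Unity main thread, then switch back when it finishes.
private async IAsyncEnumerable<string> ChatConcurrent(IAsyncEnumerable<string> tokens)
{
    await UniTask.SwitchToThreadPool();
    await foreach (var token in tokens)
    {
        yield return token;
    }
    await UniTask.SwitchToMainThread();
}

SetInteractable, also referenced above, is assumed to be a UI helper that toggles the input controls on and off.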

4. Text-to-Speech

Text-to-speech uses the Bark model, a Transformer-based text-to-audio model created by Suno AI. It is an end-to-end model that can generate highly realistic multilingual speech as well as other audio, including music, background noise, and simple sound effects. Bark can also produce non-verbal sounds such as laughing, sighing, and crying.

No C# inference code was found for Bark, so the model is deployed locally (for deployment, see: Bark, a Powerful Text-to-Speech Tool: Local Installation, Cloud Deployment, and Online Demo Tutorial). An HTTP interface is exposed through uvicorn and FastAPI, and Unity calls that HTTP interface to convert text to speech.

The HTTP interface code for Bark inference is as follows:

import uvicorn
from fastapi import FastAPI
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
from bark import SAMPLE_RATE, generate_audio
from bark.generation import preload_models
from scipy.io.wavfile import write as write_wav
import os
import time
import random
import string

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

app = FastAPI()

# load the Bark models once at startup instead of lazily on the first request
preload_models()

def productFileName():
    # timestamp plus 6 random characters, so concurrent requests don't collide
    timestamp = int(time.time())
    characters = string.ascii_letters + string.digits
    file_name = str(timestamp) + ''.join(random.choice(characters) for _ in range(6))
    file_name = file_name + ".wav"
    print(file_name)
    return file_name

@app.get("/GenAudio", summary="download audio file")
async def GenAudio(text: str, speaker: str):
    audio_array = generate_audio(text, history_prompt=speaker)

    # save the audio next to this script, return it, and delete it after sending
    file_name = productFileName()
    directory_path = os.path.dirname(__file__)
    file_path = os.path.join(directory_path, file_name)
    write_wav(file_path, SAMPLE_RATE, audio_array)
    return FileResponse(file_path, filename=file_name, media_type="audio/wav",
                        background=BackgroundTask(lambda: os.remove(file_path)))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
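Once the server is running, the endpoint can be tested directly in a browser, for example http://localhost:8080/GenAudio?text=Hello%20there&speaker=v2/en_speaker_6 (v2/en_speaker_6 is one of Bark's built-in speaker presets; the exact preset used in the demo is an assumption here).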

In Unity, UnityWebRequest calls the HTTP interface and saves the response as a wav file:

// requires: using System; using System.IO; using UnityEngine.Networking;
// text2AudioUrl (the server URL), audioFilesDict (the output folder) and fullPath are fields of the class
string queryStringText = Uri.EscapeDataString(text);
string queryStringSpeaker = Uri.EscapeDataString(speaker);
string queryString = "?text=" + queryStringText + "&speaker=" + queryStringSpeaker;
string urlWithParams = text2AudioUrl + queryString;
string filename = string.Empty;

using (UnityWebRequest request = UnityWebRequest.Get(urlWithParams))
{
    request.downloadHandler = new DownloadHandlerBuffer();
    await request.SendWebRequest().ToUniTask();

    if (request.isDone)
    {
        if (request.result == UnityWebRequest.Result.ProtocolError || request.result == UnityWebRequest.Result.ConnectionError)
        {
            Debug.Log(request.error);
        }
        else
        {
            // the file name comes back in the Content-Disposition header set by FileResponse
            filename = request.GetResponseHeader("Content-Disposition").Split(';')[1].Split('=')[1].Trim('"');
            fullPath = Path.Combine(Application.dataPath, "..", audioFilesDict, filename);
            string directory = Path.GetDirectoryName(fullPath);
            if (!Directory.Exists(directory))
            {
                Directory.CreateDirectory(directory);
            }
            File.WriteAllBytes(fullPath, request.downloadHandler.data);
        }
    }
}
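To make the reply audible in the scene, the saved wav can be loaded back as an AudioClip and played. A minimal sketch, assuming the component has an AudioSource field named audioSource (this helper is not part of the original code):

// Hypothetical helper: load the saved wav as an AudioClip and play it.
private async UniTask PlayReply(string path)
{
    using (UnityWebRequest request = UnityWebRequestMultimedia.GetAudioClip("file://" + path, AudioType.WAV))
    {
        await request.SendWebRequest().ToUniTask();
        if (request.result == UnityWebRequest.Result.Success)
        {
            audioSource.clip = DownloadHandlerAudioClip.GetContent(request);
            audioSource.Play();
        }
    }
}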

5. Wrap-up

At this point the core steps of the voice conversation are complete. The next step is to generate the digital human's talking animation from the speech, so that the user really appears to be talking with a digital human; that part is left for the next article. Finally, take a look at the result.

Demo
