After ChatGPT's release at the end of 2022, large language models entered a phase of rapid development, with models from major companies at home and abroad springing up like mushrooms. Beyond conversation, large models can handle all kinds of tasks: video understanding and generation, image understanding and generation, speech understanding and generation, speech-to-text and text-to-speech conversion, and more. If these capabilities can be combined with digital humans, the long-standing problem of finding practical application scenarios for digital humans would be greatly eased, and we will surely see digital humans in more and more ToB and ToC scenarios.
I built a conversational digital human project that combines large models with a digital human, hoping that once the basic technical problems are solved it can be applied to real business scenarios.
The goal of the project: the user can talk to the digital human directly by voice.
This works like two people chatting face to face, simple and direct, but the implementation is not so straightforward. It has to be broken down into several steps:
Figure 1: The digital human dialogue pipeline
The pipeline is covered in two articles: this one covers the first four steps, and the next article covers step 5.
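The steps above can be sketched as a single loop, with stub functions standing in for the real models (illustrative only; in the real project these stages run inside Unity):

```python
def record_audio() -> bytes:
    # stub: in the project this is Unity's Microphone plus WAV encoding
    return b"RIFF...fake user wav..."

def speech_to_text(wav: bytes) -> str:
    # stub: Whisper ASR
    return "hello"

def chat(prompt: str) -> str:
    # stub: Llama 2 inference
    return "You said: " + prompt

def text_to_speech(text: str) -> bytes:
    # stub: Bark TTS
    return b"RIFF...reply wav..."

def dialogue_turn() -> bytes:
    """One full turn of the voice dialogue pipeline."""
    wav_in = record_audio()                # 1. record the user's voice
    user_text = speech_to_text(wav_in)     # 2. speech to text
    reply_text = chat(user_text)           # 3. LLM generates the reply
    wav_out = text_to_speech(reply_text)   # 4. text to speech
    return wav_out                         # 5. (next article) drive the avatar animation
```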
Demo environment: Windows 11 + Nvidia RTX 3050 + CUDA 12.1 + Unity
Figure 2: The digital human dialogue UI
1. Voice Recording
The user input part supports two modes:
- Text input
- Voice input
Voice input is implemented with the Microphone class in UnityEngine, and the recording is saved as a WAV file. Here is the code:
using System.IO;
using UnityEngine;
private AudioClip clip; // recorded audio clip
private int audioRecordMaxLength = 60; // maximum recording length in seconds
private byte[] bytes; // recorded audio data
private void StartRecording()
{
clip = Microphone.Start(null, false, audioRecordMaxLength, 44100);
}
private void StopRecording()
{
var position = Microphone.GetPosition(null);
Microphone.End(null);
var samples = new float[position * clip.channels];
clip.GetData(samples, 0);
bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
SendRecording();
// save the recorded audio to a file
File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);
}
private byte[] EncodeAsWAV(float[] samples, int frequency, int channels)
{
using (var memoryStream = new MemoryStream(44 + samples.Length * 2))
{
using (var writer = new BinaryWriter(memoryStream))
{
writer.Write("RIFF".ToCharArray());
writer.Write(36 + samples.Length * 2);  // RIFF chunk size
writer.Write("WAVE".ToCharArray());
writer.Write("fmt ".ToCharArray());
writer.Write(16);                       // fmt chunk size
writer.Write((ushort)1);                // audio format: PCM
writer.Write((ushort)channels);
writer.Write(frequency);
writer.Write(frequency * channels * 2); // byte rate
writer.Write((ushort)(channels * 2));   // block align
writer.Write((ushort)16);               // bits per sample
writer.Write("data".ToCharArray());
writer.Write(samples.Length * 2);       // data chunk size
foreach (var sample in samples)
{
writer.Write((short)(sample * short.MaxValue));
}
}
return memoryStream.ToArray();
}
}
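The 44-byte WAV header written above can be cross-checked with a short Python sketch using only the standard library (the function mirrors EncodeAsWAV as a reference implementation; it is not part of the project):

```python
import struct

def encode_as_wav(samples, frequency, channels):
    """Encode float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    data = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    header = (
        b"RIFF" + struct.pack("<I", 36 + len(data)) + b"WAVE"
        + b"fmt " + struct.pack(
            "<IHHIIHH",
            16,                        # fmt chunk size
            1,                         # audio format: PCM
            channels,
            frequency,
            frequency * channels * 2,  # byte rate
            channels * 2,              # block align
            16,                        # bits per sample
        )
        + b"data" + struct.pack("<I", len(data))
    )
    return header + data
```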
2. Speech to Text
Speech-to-text uses Whisper, the automatic speech recognition (ASR) system developed by OpenAI. Whisper is an end-to-end Transformer encoder-decoder model; it is a powerful, flexible, multilingual speech-to-text system suited to scenarios such as transcription, video subtitling, and meeting notes.
To see results quickly, I called the Whisper model directly through the Hugging Face API. First register a Hugging Face account, install the Hugging Face API package in Unity, and set your Hugging Face Access Token in Unity; after that the API can be called from Unity (for details, see: How to Install and Use the Hugging Face Unity API).
The Unity code for calling the Hugging Face API is below; _submittedText is the recognized text returned by the service.
private void SendRecording()
{
HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
_submittedText = GenSubmitText(response);
}, error => {
_errorMsg = error;
});
}
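Under the hood, the Unity wrapper posts the raw WAV bytes to Hugging Face's hosted inference endpoint with the Access Token as a bearer header. A rough stand-alone Python equivalent (the model name and the response shape are assumptions based on the hosted API's conventions):

```python
import json
import urllib.request

# assumed model; the Unity package lets you configure which ASR model to use
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"

def build_request(wav_bytes: bytes, token: str) -> urllib.request.Request:
    """Build the ASR request: raw audio in the body, token as a bearer header."""
    return urllib.request.Request(
        API_URL,
        data=wav_bytes,
        headers={"Authorization": "Bearer " + token},
        method="POST",
    )

def transcribe(wav_bytes: bytes, token: str) -> str:
    # network call; requires a valid Hugging Face Access Token
    with urllib.request.urlopen(build_request(wav_bytes, token)) as resp:
        return json.load(resp).get("text", "")
```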
3. AI Dialogue
The dialogue uses the open-source Llama 2 model. Meta's Llama 2 is a very capable open-source large model, available with 7B, 13B, and 70B parameters, and well suited to efficient data analysis and processing. Llama 2 can be deployed in several ways; this project deploys it locally (for setup, see: Tutorial on Deploying Meta's Open-Source Large Language Model Llama 2 on a Local Machine).
Unity development is done in C#, and fortunately there is a C# inference project for Llama 2 on GitHub, which made things much easier (the Llama 2 C# inference project: LLamaSharp).
After importing the LLamaSharp package into Unity, the chat can begin; outputMessage holds the model's reply. On to the code:
using System.Collections.Generic;
using UnityEngine;
using LLama;
using LLama.Common;
using static LLama.StatefulExecutorBase;
using Cysharp.Threading.Tasks;
using System.Threading;
public string ModelPath = "models/llama-2-7b-chat.Q4_0.gguf";
[TextArea(3, 10)]
public string SystemPrompt = "Transcript of a dialog, where I interacts with an Assistant named Amy. Amy is helpful, kind, honest, good at writing, and never fails to answer my requests immediately and with precision.\r\n\r\nI: Hello, Amy.\r\nAmy: Hello. How may I help you today?\r\nI: Please tell me the best city in Europe.\r\nAmy: Sure. The best city in Europe is Kyiv, the capital of Ukraine.\r\nI:";
private ExecutorBaseState _emptyState;
private ChatSession _chatSession;
private string _submittedText = "";
private string _errorMsg = "";
private CancellationTokenSource _cts;
async UniTaskVoid Start()
{
_cts = new CancellationTokenSource();
// Load a model
var parameters = new ModelParams(Application.streamingAssetsPath + "/" + ModelPath)
{
ContextSize = 4096,
Seed = 1337,
GpuLayerCount = 35
};
// Switch to the thread pool for long-running operations
await UniTask.SwitchToThreadPool();
using var model = LLamaWeights.LoadFromFile(parameters);
await UniTask.SwitchToMainThread();
// Initialize a chat session
using var context = model.CreateContext(parameters);
var ex = new InteractiveExecutor(context);
// Save the empty state for cases when we need to switch to empty session
_emptyState = ex.GetStateData();
_chatSession = new ChatSession(ex);
_chatSession.AddSystemMessage(SystemPrompt);
// run the inference in a loop to chat with LLM
await ChatRoutine(_cts.Token);
}
public async UniTask ChatRoutine(CancellationToken cancel = default)
{
var userMessage = "";
var outputMessage = "";
while (!cancel.IsCancellationRequested)
{
// Allow input and wait for the user to submit a message or switch the session
SetInteractable(true);
await UniTask.WaitUntil(() => _submittedText != "");
userMessage = _submittedText;
_submittedText = "";
outputMessage = "";
// Disable input while processing the message
await foreach (var token in ChatConcurrent(
_chatSession.ChatAsync(
new ChatHistory.Message(AuthorRole.User, userMessage),
new InferenceParams()
{
Temperature = 0.6f,
AntiPrompts = new List<string> { " " }
}
)
))
{
outputMessage += token;
await UniTask.NextFrame();
}
}
}
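The AntiPrompts parameter above stops generation once the model starts producing a stop string, which keeps it from speaking the user's lines (ChatConcurrent is a helper not shown here). The idea can be illustrated with a small Python simulation; the token stream and stop string are invented, and the logic mirrors LLamaSharp's behavior only in spirit:

```python
def stream_with_antiprompt(tokens, anti_prompts):
    """Accumulate streamed tokens, stopping when any anti-prompt appears."""
    output = ""
    for token in tokens:
        output += token
        for stop in anti_prompts:
            if stop in output:
                # trim the anti-prompt itself off the reply
                return output[: output.index(stop)]
    return output

# simulated token stream: the model answers, then tries to speak as the user
reply = stream_with_antiprompt(
    ["Hello", ", ", "how are you?", "\nI:", " next question"],
    anti_prompts=["\nI:"],
)
```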
4. Text to Speech
Text-to-speech uses the Bark model, a Transformer-based text-to-audio model created by Suno AI. It is an end-to-end model that generates highly realistic multilingual speech as well as other audio, including music, background noise, and simple sound effects. Bark can also produce non-verbal sounds such as laughing, sighing, and crying.
I could not find C# inference code for Bark, so it runs as a local deployment (for setup, see: The Strongest Text-to-Speech Tool: Bark, a Detailed Tutorial for Local Installation, Cloud Deployment, and Online Trial), exposing an HTTP interface via uvicorn and FastAPI; Unity then calls this HTTP interface to convert text to speech.
The HTTP interface code for Bark inference:
import uvicorn
from fastapi import FastAPI
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
from bark import SAMPLE_RATE, generate_audio
from scipy.io.wavfile import write as write_wav
import os
import time
import random
import string
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
app = FastAPI()
def productFileName():
timestamp = int(time.time())
characters = string.ascii_letters + string.digits
file_name = str(timestamp) + ''.join(random.choice(characters) for _ in range(6))
file_name = file_name + ".wav"
print(file_name)
return file_name
@app.get("/GenAudio", summary="download audio file")
async def GenAudio(text: str, speaker: str):
    text_input = text
    speaker_preset = speaker
    audio_array = generate_audio(text_input, history_prompt=speaker_preset)
# save audio to disk
file_name = productFileName()
write_wav(file_name, SAMPLE_RATE, audio_array)
directory_path = f"{os.path.dirname(__file__)}"
file_path = os.path.join(directory_path, file_name)
return FileResponse(file_path, filename=file_name, media_type="audio/wav", background=BackgroundTask(lambda: os.remove(file_name)),)
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8080)
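Before wiring up Unity, the endpoint can be smoke-tested with a small Python client (the base URL assumes a local deployment on port 8080, and `v2/en_speaker_6` is one of Bark's built-in speaker presets):

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080/GenAudio"  # assumed local deployment

def build_url(text: str, speaker: str) -> str:
    """URL-encode the query string the same way the Unity client does."""
    query = urllib.parse.urlencode({"text": text, "speaker": speaker})
    return BASE_URL + "?" + query

def fetch_audio(text: str, speaker: str = "v2/en_speaker_6") -> bytes:
    # network call; requires the Bark server above to be running
    with urllib.request.urlopen(build_url(text, speaker)) as resp:
        return resp.read()
```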
In Unity, UnityWebRequest is used to call the HTTP interface and save the response as a WAV file:
string queryStringText = Uri.EscapeDataString(text);
string queryStringSpeaker = Uri.EscapeDataString(speaker);
string queryString = "?text=" + queryStringText + "&speaker=" + queryStringSpeaker;
string urlWithParams = text2AudioUrl + queryString;
string filename = string.Empty;
using (UnityWebRequest request = UnityWebRequest.Get(urlWithParams))
{
request.downloadHandler = new DownloadHandlerBuffer();
await request.SendWebRequest().ToUniTask();
if (request.isDone)
{
if (request.result == UnityWebRequest.Result.ProtocolError || request.result == UnityWebRequest.Result.ConnectionError)
{
Debug.Log(request.error);
}
else
{
filename = request.GetResponseHeader("Content-Disposition").Split(';')[1].Split('=')[1].Trim('"');
fullPath = Path.Combine(Application.dataPath, "..", audioFilesDict, filename);
string directory = Path.GetDirectoryName(fullPath);
if (!Directory.Exists(directory))
{
Directory.CreateDirectory(directory);
}
File.WriteAllBytes(fullPath, request.downloadHandler.data);
}
}
}
5. Wrapping Up
This completes the core steps of voice dialogue. The next step is to generate the digital human's talking animation from the audio, so that the user really seems to be conversing with a digital human; that part is saved for the next article. Finally, have a look at the result.
Demo video