Automatically Generating Algorithm-Problem Explanation Videos with RAG [RAG + TTS + Digital Human]

Project Overview

LeetCode currently hosts more than 3,000 programming problems; at a pace of one problem per day, working through all of them would take roughly ten years.
Large language models can solve some of these problems, but their accuracy still leaves much to be desired, and even repeated attempts may fail on hard problems.
By combining Retrieval-Augmented Generation (RAG) with reference solutions, this project aims to solve problems and produce explanations automatically.

Technical Approach and Implementation Steps

Technical Approach

  1. Parse a screenshot of the problem with the phi-3-vision-128k-instruct multimodal model
  2. Query the Chroma vector database for the reference solution
  3. Use the llama-3.1-405b-instruct text model to generate an explanation of how the solution works
  4. Use ChatTTS to convert the explanation into speech and subtitles
  5. Use SyncTalk to turn the speech into a talking-head video
  6. Use FFmpeg to add the subtitles to the video

Data Construction

Chroma is used as the vector database.
The solution corpus comes from https://github.com/walkccc/LeetCode
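
A minimal sketch of fetching the corpus, assuming git is available locally. The ingestion code in the next section reads from LeetCode-main/solutions and expects each problem folder to be named "<id>. <Title>" with a matching "<id>.py" inside:

import subprocess

# Clone the solution corpus that serves as the knowledge base
subprocess.run(
    ["git", "clone", "--depth", "1", "https://github.com/walkccc/LeetCode.git", "LeetCode-main"],
    check=True,
)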

Feature Integration

Use ChatTTS for text-to-speech
Use SyncTalk to turn the speech into video
Add subtitles to the video

Environment Setup

1. Setting up TTS

Follow the ChatTTS repository; recent versions ship with an API server example.
Download the model locally, then start the service:

fastapi dev examples/api/main.py --host 0.0.0.0 --port 8000

The client code below adapts the service to this project. It assumes a thin wrapper exposing GET /convert?text=... that returns the path of the generated wav file as plain text.

import shutil
import time

import requests
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
import os

# URL of the TTS conversion wrapper (adjust host/port to wherever it is deployed)
base_url = "http://127.0.0.1:5000/convert?text="

# Directory for the generated audio segments
output_dir = "audio_segments"
os.makedirs(output_dir, exist_ok=True)

# Paths of the individual audio segments
audio_paths = []

# Collected subtitle entries: (start_seconds, end_seconds, text)
subtitles = []


def format_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Take the fractional part before truncating, otherwise the milliseconds would always be 0
    milliseconds = int((seconds - int(seconds)) * 1000)
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def convert(texts):
    # Reset state from any previous call so repeated requests don't accumulate old segments
    audio_paths.clear()
    subtitles.clear()
    elapsed = 0.0  # current position in the merged audio, in seconds

    # Request one audio file per text segment
    for i, text in enumerate(texts):
        # Build the request URL
        url = f"{base_url}{text}"

        # Call the TTS service
        response = requests.get(url)
        if response.status_code == 200:
            # The service returns the path of the generated WAV file as plain text
            wav_path = response.text.strip()
            print(f"WAV file path: {wav_path}")

            # Copy the WAV file into the output directory
            output_path = os.path.join(output_dir, f"segment_{i}.wav")
            print(f"Copying {wav_path} to {output_path}")
            shutil.copy(wav_path, output_path)
            time.sleep(1)

            audio_paths.append(output_path)

            # Detect the non-silent range to time the subtitle within this segment,
            # then shift it onto the merged timeline
            audio = AudioSegment.from_wav(output_path)
            nonsilent_ranges = detect_nonsilent(audio, min_silence_len=100, silence_thresh=-40)
            start_time = elapsed + nonsilent_ranges[0][0] / 1000  # start (seconds)
            end_time = elapsed + nonsilent_ranges[-1][1] / 1000  # end (seconds)

            # Record the subtitle entry
            subtitles.append((start_time, end_time, text))

            # Advance the merged-timeline cursor by the full length of this segment
            elapsed += len(audio) / 1000


    # Merge all audio segments
    combined_audio = AudioSegment.empty()
    for path in audio_paths:
        combined_audio += AudioSegment.from_wav(path)

    # Export the merged audio file
    final_audio_path = os.path.join(output_dir, "final_audio.wav")
    combined_audio.export(final_audio_path, format="wav")

    # Write the SRT subtitle file
    with open(os.path.join(output_dir, "subtitles.srt"), "w", encoding="utf-8") as srt_file:
        for i, (start, end, text) in enumerate(subtitles, start=1):
            srt_file.write(f"{i}\n")
            srt_file.write(f"{format_time(start)} --> {format_time(end)}\n")
            srt_file.write(f"{text}\n\n")

    print("Audio and subtitle files created successfully.")
    return final_audio_path, os.path.join(output_dir, "subtitles.srt")
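
A minimal usage sketch, assuming the module above is saved as dub_demo.py (the name under which the RAG script in section 4 imports it) and the /convert wrapper is reachable:

from dub_demo import convert

# Two short sample sentences; each one becomes its own TTS request and subtitle entry
audio_path, srt_path = convert([
    "这道题可以用哈希表在一次遍历内解决",
    "时间复杂度和空间复杂度都是线性的",
])
print(audio_path, srt_path)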

2. Building the Vector Database

Several thousand LeetCode solutions are imported into the database here. To make lookups easier, each document is tagged with its problem ID as metadata, which improves retrieval accuracy.

import os

import chromadb
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document


# Initialize the Chroma client
client = chromadb.Client()


def read_folders(directory):
    data = []
    for folder_name in os.listdir(directory):
        print(f"Folder: {folder_name}")
        folder_path = os.path.join(directory, folder_name)
        if os.path.isdir(folder_path):
            id = int(folder_name.split('.')[0])
            py_file_path = os.path.join(folder_path, f"{id}.py")
            if os.path.exists(py_file_path):
                with open(py_file_path, "r", encoding="utf-8") as file:
                    solution = file.read()
                title = folder_name.split('.', 1)[1].strip()  # extract the title after "<id>."
                data.append({"id": id, "title": title, "solution": solution})
    return data


def get_embeddings():
    return HuggingFaceEmbeddings(model_name="text2vec-base-chinese")



def search_solution_by_id(id, vectorstore):
    query = f"LeetCode problem {id}"
    query_vector = vectorstore.embeddings.embed_query(query)
    docs = vectorstore.similarity_search_by_vector(query_vector, k=1)

    if not docs:
        return None

    doc = docs[0]
    return doc.metadata["solution"]


# Create the collection on the first run, or load the persisted one afterwards
directory = "LeetCode-main/solutions"  # path to the solutions folder of the cloned repo
persist_dir = "chroma_db_demo"
if not os.path.exists(persist_dir):
    # Read the solution data
    data = read_folders(directory)
    documents = [Document(page_content=f"LeetCode problem {d['id']}", metadata={"id": d["id"], "title": d['title'], "solution": d["solution"]}) for d in data]

    embeddings = get_embeddings()
    vectorstore = Chroma.from_documents(documents, embeddings, persist_directory=persist_dir)
    vectorstore.persist()
else:
    vectorstore = Chroma(persist_directory=persist_dir, embedding_function=get_embeddings())


# Query example
id_to_search = 551
solution = search_solution_by_id(id_to_search, vectorstore)
print(f"Solution for ID {id_to_search}:")
print(solution)

The test produces the following output:

Solution for ID 551:
class Solution:
  def checkRecord(self, s: str) -> bool:
    return s.count('A') <= 1 and 'LLL' not in s
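
Since each document already stores the problem id in its metadata, an exact metadata filter is a natural alternative to embedding the phrase "LeetCode problem <id>" and taking the nearest neighbour. A minimal sketch (reusing the vectorstore built above), assuming the langchain Chroma wrapper's get(where=...) pass-through to the underlying collection:

def search_solution_by_metadata(problem_id, vectorstore):
    # Exact match on the "id" metadata field; pass the id with the type it was stored as (int here)
    result = vectorstore.get(where={"id": problem_id})
    metadatas = result.get("metadatas") or []
    return metadatas[0]["solution"] if metadatas else None

print(search_solution_by_metadata(551, vectorstore))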

3. Setting up SyncTalk

SyncTalk has many environment dependencies, so the full release package is used here.
After verifying that it runs, the inference script is wrapped behind an HTTP interface.
Note that pointing the workspace to a new path leaves some files missing and produces videos without a face,
so the workspace path is left unchanged here.

import argparse
import os.path
import shutil

from nerf_triplane.provider import NeRFDataset
from nerf_triplane.utils import *
from nerf_triplane.network import NeRFNetwork

# torch.autograd.set_detect_anomaly(True)
# Close tf32 features. Fix low numerical accuracy on rtx30xx gpu.
try:
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
except AttributeError as e:
    print('Info: this PyTorch version does not support tf32.')

parser = argparse.ArgumentParser()
parser.add_argument('--path', type=str, default='data/May')
parser.add_argument('-O', action='store_true', help="equals --fp16 --cuda_ray --exp_eye", default=True)
parser.add_argument('--test', action='store_true', help="test mode (load model and test dataset)", default=True)
parser.add_argument('--test_train', action='store_true', help="test mode (load model and train dataset)",
                    default=True)
parser.add_argument('--data_range', type=int, nargs='*', default=[0, -1], help="data range to use")
parser.add_argument('--workspace', type=str, default='model/trial_may')
parser.add_argument('--seed', type=int, default=0)

### training options
parser.add_argument('--iters', type=int, default=200000, help="training iters")
parser.add_argument('--lr', type=float, default=1e-2, help="initial learning rate")
parser.add_argument('--lr_net', type=float, default=1e-3, help="initial learning rate")
parser.add_argument('--ckpt', type=str, default='latest')
parser.add_argument('--num_rays', type=int, default=4096 * 16,
                    help="num rays sampled per image for each training step")
parser.add_argument('--cuda_ray', action='store_true', help="use CUDA raymarching instead of pytorch")
parser.add_argument('--max_steps', type=int, default=16,
                    help="max num steps sampled per ray (only valid when using --cuda_ray)")
parser.add_argument('--num_steps', type=int, default=16,
                    help="num steps sampled per ray (only valid when NOT using --cuda_ray)")
parser.add_argument('--upsample_steps', type=int, default=0,
                    help="num steps up-sampled per ray (only valid when NOT using --cuda_ray)")
parser.add_argument('--update_extra_interval', type=int, default=16,
                    help="iter interval to update extra status (only valid when using --cuda_ray)")
parser.add_argument('--max_ray_batch', type=int, default=4096,
                    help="batch size of rays at inference to avoid OOM (only valid when NOT using --cuda_ray)")

### loss set
parser.add_argument('--warmup_step', type=int, default=10000, help="warm up steps")
parser.add_argument('--amb_aud_loss', type=int, default=1, help="use ambient aud loss")
parser.add_argument('--amb_eye_loss', type=int, default=1, help="use ambient eye loss")
parser.add_argument('--unc_loss', type=int, default=1, help="use uncertainty loss")
parser.add_argument('--lambda_amb', type=float, default=1e-1, help="lambda for ambient loss")
parser.add_argument('--pyramid_loss', type=int, default=0, help="use perceptual loss")

### network backbone options
parser.add_argument('--fp16', action='store_true', help="use amp mixed precision training")

parser.add_argument('--bg_img', type=str, default='', help="background image")
parser.add_argument('--fbg', action='store_true', help="frame-wise bg")
parser.add_argument('--exp_eye', action='store_true', help="explicitly control the eyes")
parser.add_argument('--fix_eye', type=float, default=-1,
                    help="fixed eye area, negative to disable, set to 0-0.3 for a reasonable eye")
parser.add_argument('--smooth_eye', action='store_true', help="smooth the eye area sequence")
parser.add_argument('--bs_area', type=str, default="upper", help="upper or eye")

parser.add_argument('--torso_shrink', type=float, default=0.8,
                    help="shrink bg coords to allow more flexibility in deform")

### dataset options
parser.add_argument('--color_space', type=str, default='srgb', help="Color space, supports (linear, srgb)")
parser.add_argument('--preload', type=int, default=0,
                    help="0 means load data from disk on-the-fly, 1 means preload to CPU, 2 means GPU.")
# (the default value is for the fox dataset)
parser.add_argument('--bound', type=float, default=1,
                    help="assume the scene is bounded in box[-bound, bound]^3, if > 1, will invoke adaptive ray marching.")
parser.add_argument('--scale', type=float, default=4, help="scale camera location into box[-bound, bound]^3")
parser.add_argument('--offset', type=float, nargs='*', default=[0, 0, 0], help="offset of camera location")
parser.add_argument('--dt_gamma', type=float, default=1 / 256,
                    help="dt_gamma (>=0) for adaptive ray marching. set to 0 to disable, >0 to accelerate rendering (but usually with worse quality)")
parser.add_argument('--min_near', type=float, default=0.05, help="minimum near distance for camera")
parser.add_argument('--density_thresh', type=float, default=10,
                    help="threshold for density grid to be occupied (sigma)")
parser.add_argument('--density_thresh_torso', type=float, default=0.01,
                    help="threshold for density grid to be occupied (alpha)")
parser.add_argument('--patch_size', type=int, default=1,
                    help="[experimental] render patches in training, so as to apply LPIPS loss. 1 means disabled, use [64, 32, 16] to enable")

parser.add_argument('--init_lips', action='store_true', help="init lips region")
parser.add_argument('--finetune_lips', action='store_true', help="use LPIPS and landmarks to fine tune lips region")
parser.add_argument('--smooth_lips', action='store_true', help="smooth the enc_a in a exponential decay way...")

parser.add_argument('--torso', action='store_true', help="fix head and train torso")
parser.add_argument('--head_ckpt', type=str, default='', help="head model")

### GUI options
parser.add_argument('--gui', action='store_true', help="start a GUI")
parser.add_argument('--W', type=int, default=450, help="GUI width")
parser.add_argument('--H', type=int, default=450, help="GUI height")
parser.add_argument('--radius', type=float, default=3.35, help="default GUI camera radius from center")
parser.add_argument('--fovy', type=float, default=21.24, help="default GUI camera fovy")
parser.add_argument('--max_spp', type=int, default=1, help="GUI rendering max sample per pixel")

### else
parser.add_argument('--att', type=int, default=2,
                    help="audio attention mode (0 = turn off, 1 = left-direction, 2 = bi-direction)")
parser.add_argument('--aud', type=str, default='./demo/test.wav',
                    help="audio source (empty will load the default, else should be a path to a npy file)")
parser.add_argument('--emb', action='store_true', help="use audio class + embedding instead of logits")
parser.add_argument('--portrait', action='store_true', help="only render face", default=True)
parser.add_argument('--ind_dim', type=int, default=4, help="individual code dim, 0 to turn off")
parser.add_argument('--ind_num', type=int, default=20000,
                    help="number of individual codes, should be larger than training dataset size")

parser.add_argument('--ind_dim_torso', type=int, default=8, help="individual code dim, 0 to turn off")

parser.add_argument('--amb_dim', type=int, default=2, help="ambient dimension")
parser.add_argument('--part', action='store_true', help="use partial training data (1/10)")
parser.add_argument('--part2', action='store_true', help="use partial training data (first 15s)")

parser.add_argument('--train_camera', action='store_true', help="optimize camera pose")
parser.add_argument('--smooth_path', action='store_true',
                    help="brute-force smooth camera pose trajectory with a window size")
parser.add_argument('--smooth_path_window', type=int, default=7, help="smoothing window size")

# asr
parser.add_argument('--asr', action='store_true', help="load asr for real-time app")
parser.add_argument('--asr_wav', type=str, default='', help="load the wav and use as input")
parser.add_argument('--asr_play', action='store_true', help="play out the audio")

parser.add_argument('--asr_model', type=str, default='ave')  # ave  deepspeech

parser.add_argument('--asr_save_feats', action='store_true')
# audio FPS
parser.add_argument('--fps', type=int, default=50)
# sliding window left-middle-right length (unit: 20ms)
parser.add_argument('-l', type=int, default=10)
parser.add_argument('-m', type=int, default=50)
parser.add_argument('-r', type=int, default=10)

opt = parser.parse_args()

opt.fp16 = True
opt.exp_eye = True
opt.cuda_ray = True
# assert opt.cuda_ray, "Only support CUDA ray mode."

if opt.patch_size > 1:
    # assert opt.patch_size > 16, "patch_size should > 16 to run LPIPS loss."
    assert opt.num_rays % (opt.patch_size ** 2) == 0, "num_rays should be divisible by patch_size ** 2."

# if opt.finetune_lips:
#     # do not update density grid in finetune stage
#     opt.update_extra_interval = 1e9

print(opt)

seed_everything(opt.seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def _run_model(opt):
    model = NeRFNetwork(opt)

    # manually load state dict for head
    if opt.torso and opt.head_ckpt != '':

        model_dict = torch.load(opt.head_ckpt, map_location='cpu')['model']

        missing_keys, unexpected_keys = model.load_state_dict(model_dict, strict=False)

        if len(missing_keys) > 0:
            print(f"[WARN] missing keys: {missing_keys}")
        if len(unexpected_keys) > 0:
            print(f"[WARN] unexpected keys: {unexpected_keys}")

            # freeze these keys
        for k, v in model.named_parameters():
            if k in model_dict:
                print(f'[INFO] freeze {k}, {v.shape}')
                v.requires_grad = False

    # print(model)

    # criterion = torch.nn.MSELoss(reduction='none')
    criterion = torch.nn.L1Loss(reduction='none')

    metrics = [PSNRMeter(), LPIPSMeter(device=device), LMDMeter(backend='fan')]

    trainer = Trainer('ngp', opt, model, device=device, workspace=opt.workspace, criterion=criterion, fp16=opt.fp16,
                      metrics=metrics, use_checkpoint=opt.ckpt)
    test_set = NeRFDataset(opt, device=device, type='train')
    # a manual fix to test on the training dataset
    test_set.training = False
    test_set.num_rays = -1
    test_loader = test_set.dataloader()

    # temp fix: for update_extra_states
    model.aud_features = test_loader._data.auds
    model.eye_areas = test_loader._data.eye_area
    trainer.test(test_loader)

    ## evaluate metrics (slow)
    if test_loader.has_gt:
        trainer.evaluate(test_loader)



import flask

app = flask.Flask(__name__)


@app.route('/wav2video')
def convert_video():
    try:
        file_name = flask.request.args.get('file_path')
        opt.aud = file_name
        basename = os.path.basename(file_name)
        filename, extension = os.path.splitext(basename)
        _run_model(opt)
        for root, dirs, files in os.walk(opt.workspace):
            for file_name in files:
                if file_name.endswith("_audio.mp4"):
                    return os.path.join(root, file_name)
    except Exception as e:
        print(e)
    return 'error'


app.run(debug=True, port=6666)
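
Once this Flask wrapper is running, the LangChain environment can request a talking-head video over HTTP. A minimal usage sketch (the wav path must be readable from the machine running SyncTalk):

import requests

# The endpoint renders the video and returns the path of the generated *_audio.mp4,
# or the string 'error' if something went wrong.
resp = requests.get(
    "http://127.0.0.1:6666/wav2video",
    params={"file_path": "audio_segments/final_audio.wav"},
)
print(resp.text)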

4. Building the RAG System

This part ties all of the conversion steps together:

  1. The input is an image containing the problem title
  2. Parse the problem from the image
  3. Query the vector database for the solution
  4. Ask the LLM for an explanation of how the solution works
  5. Convert the explanation into speech and subtitles
  6. Convert the speech into a video
  7. Add the subtitles to the video

import subprocess

import gradio as gr
import requests
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.chroma import Chroma

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableLambda


import os
import re
import base64
import chromadb
from dub_demo import convert


client = chromadb.Client()

os.environ["NVIDIA_API_KEY"] = (
    "****"
)

solution = ""

def search_solution_by_id(id, vectorstore):
    query = f"LeetCode problem {id}"
    query_vector = vectorstore.embeddings.embed_query(query)
    docs = vectorstore.similarity_search_by_vector(query_vector, k=1)
    if not docs:
        return None

    doc = docs[0]
    return doc.metadata["solution"]


def query_db(input_text):
    global solution  # expose the retrieved solution so chart_agent_gr can return it
    pattern = r'(\d+)\.'
    print(input_text.content)
    id = '551'  # fallback problem id if no number is found in the OCR output
    matches = re.findall(pattern, input_text.content)
    if matches:
        id = matches[0]
    solution = search_solution_by_id(id, vectorstore)
    print(f"LeetCode problem {id} solution: {solution}")
    prompt = ("Given the following code snippet, write a concise explanation of the problem it solves in Chinese, "
              "the solution approach, the algorithm principle, and the time and space complexity analysis. ") + solution
    print(f"prompt = {prompt}")
    return prompt


def message_to_str(message):
    print(message)
    return message.content


def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64


def add_subtitles_to_video(video_file, subtitle_file, output_file):
    # Build the FFmpeg command: burn the SRT subtitles into the video, copy the audio stream
    command = [
        'ffmpeg',
        '-i', video_file,
        '-vf', f"subtitles={subtitle_file}",
        '-c:a', 'copy',
        output_file
    ]

    # Run the command and wait for it to finish
    try:
        result = subprocess.run(command, check=True)
        print("Subtitles added successfully")
    except subprocess.CalledProcessError as e:
        print(f"FFmpeg failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

def generate_video(content):
    texts = []
    for i in content.split("\n"):
        if not i:
            continue
        # Keep only Chinese characters, letters, digits and the punctuation ",。"
        result = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9,。]', '', i)
        # Split on punctuation so each TTS request stays reasonably short
        result = re.split(r'[,。]', result)
        texts.extend([i for i in result if i])
    audio_path, srt_path = convert(texts)
    rq = requests.get(f"http://127.0.0.1:6666/wav2video?file_path={audio_path}")
    video_path = rq.text
    assert os.path.exists(video_path)
    os.makedirs("output", exist_ok=True)
    output_video_path = "output/output_video.mp4"
    add_subtitles_to_video(video_path, srt_path, output_video_path)
    return output_video_path


def chart_agent_gr(image_b64):
    image_b64 = image2b64(image_b64)
    chart_reading = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
        'get ocr result of the image below, : <img src="data:image/png;base64,{image_b64}" />'
    )
    chart_chain = chart_reading_prompt | chart_reading

    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")

    chain = (
        chart_chain
        | RunnableLambda(query_db)
        | instruct_chat
        | RunnableLambda(message_to_str)
    )

    content = chain.invoke({"image_b64": image_b64})
    video_path = generate_video(content)
    return solution, content, video_path


def get_embeddings():
    return HuggingFaceEmbeddings(model_name="text2vec-base-chinese")


if __name__ == "__main__":
    persist_dir = "chroma_db_demo"
    vectorstore = Chroma(persist_directory=persist_dir, embedding_function=get_embeddings())
    multi_modal_chart_agent = gr.Interface(
        fn=chart_agent_gr,
        inputs=[gr.Image(label="Upload image", type="filepath")],
        outputs=["text", "text", "video"],
        title="Multi Modal chat agent",
        description="Multi Modal chat agent",
        allow_flagging="never",
    )
    multi_modal_chart_agent.launch(
        debug=True, share=False, show_api=False, server_port=5000, server_name="0.0.0.0"
    )

Results

(Screenshot of the generated result.)

Limitations and Outlook

Each component has different dependencies

The components depend on different Python and torch versions, and their project layouts and code differ as well.
Solution:
The components call each other over HTTP interfaces and run independently.
There are currently three separate runtime environments: Langchain, ChatTTS, and SyncTalk.
Alternatively, each component could run in its own Docker container.

The multimodal model's OCR ability is limited

Its OCR accuracy is noticeably worse than dedicated engines such as PaddleOCR.
(screenshot)
Solution:
OCR is not a strength of the multimodal model; for now the OCR output is cleaned up with a regular expression, and better approaches will be explored later.
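
As a possible follow-up, a dedicated OCR engine could replace the multimodal model for the title-extraction step. A hedged sketch using PaddleOCR (the call and return structure follow the classic paddleocr 2.x API and should be treated as an assumption, since the package's interface has changed across versions; problem_screenshot.png is a placeholder file name):

import re
from paddleocr import PaddleOCR  # assumes the paddleocr package is installed

ocr = PaddleOCR(lang="ch")  # Chinese + English recognition
result = ocr.ocr("problem_screenshot.png")

# result[0] is a list of [box, (text, confidence)] entries for the single input image
lines = [entry[1][0] for entry in result[0]]
text = "\n".join(lines)

# Reuse the same regex as query_db() to pull the problem id (e.g. "551.") out of the recognized text
match = re.search(r"(\d+)\.", text)
problem_id = match.group(1) if match else None
print(problem_id)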

Project Evaluation

The project implements the following functionality:

  1. Take an image containing the problem title as input
  2. Parse the problem from the image
  3. Query the vector database for the solution
  4. Ask the LLM for an explanation of how the solution works
  5. Convert the explanation into speech and subtitles
  6. Convert the speech into a video
  7. Add the subtitles to the video

Future Directions

  • Use a more accurate OCR approach
  • Make the concatenated audio segments flow more smoothly
  • Improve the subtitle styling
  • Show the code in the video, with the digital human occupying only part of the screen
  • Add diagrams and other visual aids to the explanation, so the video feels closer to a human-presented lecture