Hackathon: A RAG-Based Intelligent Multimodal Homework-Grading Robot

weixin_44578002

Last modified: 2024-09-25 15:53:47



First published: 2019-01-19 16:29:12

Article link: https://blog.csdn.net/weixin_44578002/article/details/86553839


Report date: September 25, 2024
Project lead: Li Zhang

Table of Contents

Preface
I. Technical Approach and Implementation Steps
II. Data Construction

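Preface

This project builds an intelligent conversational robot based on multimodal techniques, focused on grading math problems. The robot recognizes handwritten math problems in images uploaded by the user, parses the equations, grades them, and reports whether each answer is correct or wrong. Its main feature is the combination of image processing and natural language processing: information flows from image to text to feedback, offering a new approach to automated grading in education.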

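I. Technical Approach and Implementation Steps

This project uses the following combination of models:

Vision model: microsoft/phi-3-vision-128k-instruct, used to extract text from images of handwritten math problems.
Natural language processing model: meta/llama-3.1-405b-instruct, used to parse the recognized equations, compute the results, and grade them.
These models were chosen because they perform strongly on image recognition and text generation respectively, which the multimodal pipeline relies on for accuracy and responsiveness.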

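II. Data Construction

The project data consists mainly of images of handwritten math problems. We collected problem images in a variety of handwriting styles and annotated them to serve as training and test data for the vision model. After cleaning and annotation, the data was preprocessed to ensure input quality and improve recognition accuracy.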
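The post does not show how the annotated data was stored or split. As a minimal sketch, assuming each annotation is one record holding an image path, the transcribed equation, and a correctness label (all three field names are hypothetical, as is the 80/20 split), a reproducible train/test split could look like this:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSample:
    image_path: str   # hypothetical field: path to the handwritten image
    expression: str   # hypothetical field: ground-truth transcription, e.g. "2 + 3 = 5"
    is_correct: bool  # hypothetical field: whether the written answer is right


def split_samples(samples, test_ratio=0.2, seed=42):
    """Shuffle deterministically and split into (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records for illustration only; these are not the project's data.
samples = [
    LabeledSample(f"img_{i:03d}.png", f"{i} + 1 = {i + 1}", True) for i in range(10)
]
train, test = split_samples(samples)
print(len(train), len(test))
```

Fixing the seed means that different recognition models can later be compared on exactly the same test images.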

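1. Feature integration

The project integrates the following multimodal functions:

Image to text: the vision model extracts the text of the handwritten math problem.
Text to computation and feedback: the natural language processing model parses the extracted equation, computes the result, and produces the grading outcome (correct or wrong).
Result display: a Gradio interface shows the graded result to the user as an image, so the outcome is visible at a glance.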


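2. Implementation steps

Environment setup:

The development environment is based on Python, with Anaconda managing the dependencies. It consists of:

Python 3.9
NVIDIA CUDA Toolkit 11.8
PyTorch 2.0
Pillow (image processing)
Gradio (front-end display)
LangChain (model calls and chained operations)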
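Before running the project it can help to confirm that the listed packages are actually installed. The sketch below uses only the standard library. The distribution names are the usual PyPI names and are an assumption about how the environment was set up; `langchain-nvidia-ai-endpoints` is included because the code in this post imports from it.

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust if your environment differs.
REQUIRED = ["torch", "Pillow", "gradio", "langchain", "langchain-nvidia-ai-endpoints"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```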

3. Code implementation

The key code is as follows:

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableBranch, RunnableLambda

import os
import base64
import io
from PIL import Image

# Set the API key
os.environ["NVIDIA_API_KEY"] = "nvapi-********"  # Apply for your own API key from NVIDIA NIM

# Helper functions
def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64

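def b64_to_image(image_b64):
    image_data = base64.b64decode(image_b64)
    # Convert to RGBA so the transparent marks can be pasted onto any input mode
    image = Image.open(io.BytesIO(image_data)).convert("RGBA")
    return image

def overlay_image(base_image, overlay, position=(0, 0)):
    # The overlay's own alpha channel is used as the paste mask
    base_image.paste(overlay, position, overlay)
    return base_image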

# Load correct.png and wrong.png
correct_image_path = r"E:\AI\LLMs\AI_Agent\1_lessons\day3\correct.png"
wrong_image_path = r"E:\AI\LLMs\AI_Agent\1_lessons\day3\wrong.png"
correct_image = Image.open(correct_image_path).convert("RGBA")
wrong_image = Image.open(wrong_image_path).convert("RGBA")

def apply_corrections_on_image(image, is_correct, position=(10, 10)):
    if is_correct:
        overlay_image(image, correct_image, position=position)
    else:
        overlay_image(image, wrong_image, position=position)
    return image

# Functions for chain
def recognize_arithmetic_problem(input_dict):
    image_b64 = input_dict['image_b64']
    # The image is embedded inline as a base64 data URI. The prompt is already
    # complete, so it is sent to the model directly without a prompt template.
    prompt_text = f'Identify and solve the handwritten arithmetic problem in the following image: <img src="data:image/png;base64,{image_b64}" />'

    # Vision model named in the model selection above
    handwriting_recognition = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
    result = handwriting_recognition.invoke(prompt_text)

    return {"recognized_text": result.content, "image_b64": image_b64}

def check_correctness(input_dict):
    recognized_text = input_dict["recognized_text"]
    # Placeholder check: only the hard-coded equation "2 + 3 = 5" counts as correct
    is_correct = "2 + 3 = 5" in recognized_text
    return {"is_correct": is_correct, "image_b64": input_dict["image_b64"]}

def generate_corrected_image(input_dict):
    is_correct = input_dict["is_correct"]
    image_b64 = input_dict["image_b64"]
    original_image = b64_to_image(image_b64)
    corrected_image = apply_corrections_on_image(original_image, is_correct, position=(10, 10))
    
    corrected_image_path = "corrected_image.png"
    corrected_image.save(corrected_image_path)
    
    return corrected_image_path

# Creating the chain
def chart_agent_chain(image_file_path, user_input):
    image_b64 = image2b64(image_file_path)

    # Chain setup: each step receives the dict returned by the previous one
    chain = (
        RunnableLambda(recognize_arithmetic_problem)  # Recognize the problem
        | RunnableLambda(check_correctness)  # Check correctness
        | RunnableLambda(generate_corrected_image)  # Generate corrected image
    )

    # user_input comes from the UI text box but is not used by the current chain
    result_path = chain.invoke({"image_b64": image_b64})
    return result_path

import gradio as gr
multi_modal_chart_agent = gr.Interface(
    fn=chart_agent_chain,
    inputs=[gr.Image(label="Upload image", type="filepath"), 'text'],
    outputs='image',
    title="Handwritten Arithmetic Problem Corrector",
    description="This tool corrects handwritten arithmetic problems and displays the corrected image.",
    allow_flagging="never"
)

# Launch Gradio interface
multi_modal_chart_agent.launch(debug=True, share=False, show_api=False)
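The `check_correctness` step in the listing above accepts only one hard-coded equation. The post describes using a language model for this step. As an alternative illustration (not the author's method), the sketch below grades any simple equation of the form `a op b = c` deterministically. The function name `grade_equation`, the supported operators, and the regular expression are all assumptions made for this example.

```python
import re
from fractions import Fraction

# Matches "a op b = c" with optional decimals and negative numbers.
_EQUATION = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/×÷xX])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
# Handwritten multiplication and division signs mapped to their ASCII operators.
_ALIASES = {"×": "*", "x": "*", "X": "*", "÷": "/"}


def grade_equation(text):
    """Grade the first 'a op b = c' in text: True, False, or None if none is found."""
    match = _EQUATION.search(text)
    if match is None:
        return None
    left, op, right, claimed = match.groups()
    op = _ALIASES.get(op, op)
    # Fractions avoid floating-point rounding when comparing results.
    a, b, c = Fraction(left), Fraction(right), Fraction(claimed)
    if op == "/" and b == 0:
        return False  # division by zero can never equal the claimed answer
    return _OPS[op](a, b) == c


print(grade_equation("2 + 3 = 5"))
print(grade_equation("2 + 3 = 6"))
```

Returning `None` when no equation is found lets the caller tell a recognition failure apart from a wrong answer, which the current two-state mark (tick or cross) does not do.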

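4. Project results and demonstration

Application scenarios:
This multimodal conversational robot is well suited to education, where it can help students get handwritten math homework graded quickly and study more efficiently. It could also be extended to online education platforms as an assistant tool for teachers.

Feature demonstration
The main features implemented are:
Handwritten math problem recognition and grading: the user uploads an image of a handwritten math problem, and the robot recognizes the equation, grades it, and reports whether it is correct or wrong.
Result display: the grading result is shown in the UI, and the graded image file can be downloaded.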

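5. Testing and tuning

Time was too short for tuning, so none has been done yet. At present every recognized math problem comes out wrong, because the model reads the equals sign ("=") as "= -", so every input image is marked with a cross. The next step is to try other models and compare their recognition quality. The recognition ability of the SLM "microsoft/phi-3-vision-128k-instruct" may still need improvement.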
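To compare candidate models on recognition quality, a simple exact-match score over labeled samples is one place to start. The sketch below assumes each model's output has already been reduced to a plain equation string. The function names and sample strings are illustrative only and are not measured results.

```python
def normalize(expr):
    """Drop all whitespace so '2 + 3 = 5' and '2+3=5' compare equal."""
    return "".join(expr.split())


def recognition_accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the labels after normalization."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, ground_truth))
    return hits / len(ground_truth)


# Illustrative data only: the second prediction shows the "= -" misreading
# of the equals sign described above.
labels = ["2 + 3 = 5", "4 + 4 = 8"]
predicted = ["2+3=5", "4 + 4 = - 8"]
print(recognition_accuracy(predicted, labels))  # 0.5
```

Running the same labeled set through each candidate model and comparing the scores gives a concrete basis for choosing a replacement.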