Building a Multimodal Agent by Hand in Three Days

sunzhuojun

Category: Large Language Model Applications. Tags: python, artificial intelligence, language models, ocr, chatgpt

Article link: https://blog.csdn.net/sunzhuojun/article/details/141300577

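Project overview:

By combining a multimodal large model with a large language model, the system parses images of shipping documents and automatically extracts the key fields. It then turns the extracted information into Python code that renders a formatted image in a standard layout. It can also draft a professional email from the extracted data, which saves time and keeps communication consistent and tailored to the recipient.

Using a multimodal model and a large language model to parse document images gives the following highlights:

Automated field extraction: image recognition identifies and extracts key fields from the document image, such as shipper, consignee, date, amount, and trading parties.

Code generation: from the extracted fields, the system generates Python code that renders an image in a specified standard format, so the output is clean and consistent.

Email drafting: the system writes the email from the extracted fields, which makes the message more specific and professional and saves the time of writing it by hand.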

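Technical approach and implementation steps

AI models parse the document image, generate the chart, and write the email. The detailed approach follows. The models used are:

Multimodal model: phi-3-vision-128k-instruct

Large language model: llama-3.1-405b-instruct

Both are called through NVIDIA NIM API endpoints.

Choice of large language model:

llama-3.1-405b-instruct was, at the time of writing, among the most capable openly available models, with strong comprehension, reasoning, coding, and text generation abilities.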

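Advantages of RAG

Retrieval augmentation: a RAG model first fetches relevant information in the retrieval stage and then uses it in the generation stage. Combining the two can improve the accuracy and relevance of the output.

Context understanding: because it retrieves and analyzes relevant information before responding, a RAG model has more context to work with.

Flexibility: RAG applies to many tasks, including question answering, summarization, and content generation, which makes it a versatile solution.

Knowledge integration: the retrieval stage lets the model combine knowledge from different sources and give richer, more complete answers.

Up-to-date output: RAG can draw on the latest retrieved information, which is useful in applications that need current data.

Reduced bias: because the output is grounded in retrieved information, RAG can help reduce the effect of biases present in the training data.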
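
The retrieve-then-generate idea described above can be sketched in a few lines. This is only an illustration, not part of the project's code: the keyword-overlap scoring, the function names `retrieve` and `build_rag_prompt`, and the sample documents are all assumptions made for the example. A real system would normally use an embedding-based retriever.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, top_k=2):
    """Rank documents by shared tokens with the query and return the best top_k."""
    query_tokens = Counter(tokenize(query))
    scored = []
    for doc in documents:
        overlap = sum((query_tokens & Counter(tokenize(doc))).values())
        scored.append((overlap, doc))
    # sort() is stable, so ties keep their original document order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_rag_prompt(query, documents, top_k=2):
    """Retrieval stage followed by prompt assembly for the generation stage."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge snippets used only for this illustration
docs = [
    "The consignee is the party that receives the goods.",
    "The port of loading is where cargo is loaded onto the vessel.",
    "Matplotlib can render tables as images.",
]
prompt = build_rag_prompt("What is the port of loading?", docs, top_k=1)
```

The resulting `prompt` would then be sent to the language model, which is the generation stage.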

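Integrating the multimodal capabilities

The phi-3-vision-128k-instruct model is optimized for image and natural language tasks and provides accurate image analysis and text generation. Here it reads the document image, covering both OCR and semantic entity recognition (SER). Guided by the prompt, it extracts the required fields and outputs them as JSON. llama-3.1-405b-instruct then takes the user's input and the extracted JSON and automatically generates code for a form in the required layout, as well as the email content.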

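Implementation steps

Feature overview

Image parsing: pass the Base64-encoded document image to the model in a format it can read, and extract the key information.

Information extraction: use ChatNVIDIA to call the phi-3-vision-128k-instruct model, which analyzes the image and extracts the specified fields.

Chart generation: generate a chart from the extracted information with Python and the matplotlib library.

Email drafting: write an email from the extracted information.

The experiment has two steps

1. Use phi-3-vision-128k-instruct to extract the data and produce JSON

2. Use llama-3.1-405b-instruct to generate the standard document and the email

Generated email

Generated standard document (the prompt can be tuned further to improve the result)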
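Environment setup

The environment is Jupyter Lab. The application is built on the NIM API and LangChain, with Gradio for the front end, and has the following dependencies.

Four packages are needed:

langchain_nvidia_ai_endpoints: calls NVIDIA NIM compute resources
langchain: builds the chain and connects the agent's components
base64: this experiment builds a multimodal agent, so base64 (part of the Python standard library) is used to encode and decode images
Gradio: serves the front-end page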
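
Before running the notebook it can help to confirm that the dependencies above are importable and that an API key is set. The sketch below is an assumption-laden helper, not part of the original project: the function names are made up for illustration, and it assumes the key is supplied through the `NVIDIA_API_KEY` environment variable.

```python
import importlib.util
import os

# Top-level module names for the packages listed above
REQUIRED_MODULES = ["langchain_nvidia_ai_endpoints", "langchain", "gradio", "base64"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def api_key_configured(env=os.environ):
    """Check for a non-empty NVIDIA_API_KEY (assumed variable name)."""
    return bool(env.get("NVIDIA_API_KEY", "").strip())


if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
    print("API key configured:", api_key_configured())
```

Any module reported as missing can then be installed with pip before continuing.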

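Code implementation

Image parsing and information extraction

The Base64-encoded image is embedded in the prompt in a form the model can read. A ChatNVIDIA model and a ChatPromptTemplate extract the Shipper, Consignee, Notify Party, Port of Loading, and Place of Delivery fields from the image.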

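import base64
import re

# Import paths assumed from the packages listed in the environment section
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableBranch, RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Holds the most recent JSON produced while the chain runs
json_data = None

# Save the JSON seen during a langchain run to a global variable
def save_json_to_global(x):
    global json_data
    if 'json' in x.content:
        json_data = x.content
    return x

# Helper function for debugging
def print_and_return(x):
    print(x)
    return x

# Post-process the model output: drop comments and explanatory text,
# keep only the Python code inside ```python fences
def extract_python_code(text):
    pattern = r'```python\s*(.*?)\s*```'
    matches = re.findall(pattern, text, re.DOTALL)
    return [match.strip() for match in matches]

# Execute the code generated by the model
def execute_and_return(x):
    blocks = extract_python_code(x.content)
    if not blocks:
        return x
    try:
        # exec() runs model-generated code as-is and always returns None
        exec(blocks[0])
    except Exception as err:
        print(f"The code is not executable, don't give up, try again! ({err})")
    return x

# Encode an image file as base64 so it can be passed to the model
def image2b64(image_file):
    with open(image_file, "rb") as f:
        return base64.b64encode(f.read()).decode()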

# The indented fragments from here on are the body of the agent function;
# its def line is not shown in this article.
    chart_reading = ChatNVIDIA(model="ai-phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
         """Analyze the image below and extract the following information:
    Shipper, Consignee, Notify Party, Port of Loading, and Place of Delivery.
    Present the information in JSON format.
    <img src="data:image/png;base64,{image_b64}" />

    Important:
    - 'Shipper', 'Consignee' and 'Notify Party' fields may contain multiple lines of text.
      Capture all lines within these fields and join them with '\\n' into a single string.
    - 'Port of Loading' and 'Place of Delivery' fields contain only a single line of text.

    Return ONLY the JSON object, nothing else. Use this exact format:
    {{
        "Shipper": "",
        "Consignee": "",
        "Notify Party": "",
        "Port of Loading": "",
        "Place of Delivery": ""
    }}
    """
    )
    # Prompt piped into the vision model: image in, JSON text out
    chart_chain = chart_reading_prompt | chart_reading
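
The prompt above hardcodes `image/png` in the data URL. If the scanned documents may also arrive as JPEG or other formats, one option is to build the data URL from the file name. The helper below is a sketch under that assumption; `image_bytes_to_data_url` is a hypothetical name and is not part of the original code.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data, filename="document.png"):
    """Build a data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file name: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Example with a few placeholder bytes instead of a real scan
url = image_bytes_to_data_url(b"\x89PNG", "scan.png")
```

With such a helper, the `<img src="...">` part of the prompt would take the whole data URL as its template variable instead of only the Base64 payload.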

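Chart generation

From the extracted information, instruct_chat calls the llama-3.1-405b-instruct model to generate Python code that uses matplotlib to render a table image. The model returns the code as a Markdown code block.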

    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")
    # Adjacent string literals are concatenated, so each one ends with a space
    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. "
        "Please use Python and matplotlib to generate a table image based on this JSON string: {json_text}. "
        "Each row represents a key and its corresponding value in the JSON. "
        "Save the image with the name shipper_info.png and do not show it. "
        "If there is code, start with '```python' and end with '```'."
    )
    instruct_chain = instruct_prompt | instruct_chat
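
For reference, the kind of program the model is being asked to write might look roughly like the sketch below. It is a hand-written illustration, not actual model output; the function names `json_to_rows` and `render_table_image` and the figure sizing are assumptions.

```python
import json


def json_to_rows(json_text):
    """Turn a JSON object into [key, value] rows, one row per field."""
    rows = []
    for key, value in json.loads(json_text).items():
        if isinstance(value, list):
            # Multi-line fields may arrive as a list of lines
            value = "\n".join(str(item) for item in value)
        rows.append([key, str(value)])
    return rows


def render_table_image(rows, output_path="shipper_info.png"):
    """Draw the rows as a two-column table and save it without showing it."""
    # Imported lazily so json_to_rows works even without matplotlib installed
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend: save only, never display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 0.6 * len(rows) + 1))
    ax.axis("off")
    table = ax.table(cellText=rows, colLabels=["Field", "Value"],
                     loc="center", cellLoc="left")
    table.scale(1, 2)  # taller cells leave room for multi-line values
    fig.savefig(output_path, bbox_inches="tight", dpi=150)
    plt.close(fig)
    return output_path


sample = '{"Consignee": ["ACME Ltd", "1 Harbour Rd"], "Port of Loading": "Shanghai"}'
rows = json_to_rows(sample)
```

Calling `render_table_image(rows)` would then write `shipper_info.png` to the working directory.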

Email drafting

instruct_email_prompt is the template for generating the email content. The email is written from the extracted JSON and the user's input.

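    instruct_email_prompt = ChatPromptTemplate.from_template(
        "Write an email based on this JSON {json}, to {input}"
    )
    # instruct_email_chain is used later but never defined in the original;
    # it is assumed here to reuse the same instruct_chat model
    instruct_email_chain = instruct_email_prompt | instruct_chat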

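Conditional branching and execution

RunnableBranch selects which branch to run based on a condition. If the json field is missing, the image-reading chain runs and extracts the fields from the document; if it is present, the input passes through unchanged. A second branch saves the JSON data when the model output contains it.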

    # Decide from "json" whether the document image needs to be read
    chart_reading_branch = RunnableBranch(
        (lambda x: x.get('json') is None, RunnableAssign({'json_text': chart_chain })),
        (lambda x: x.get('json') is not None, lambda x: x),
        lambda x: x
    )
    # Update the saved JSON when the output contains it
    update_json = RunnableBranch(
        (lambda x: 'json' in x.content, save_json_to_global),
        lambda x: x
    )
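
The branching behavior used above (conditions checked in order, the first match runs, otherwise the default runs) can be mimicked in plain Python. This is only a simplified model of the semantics for illustration, not LangChain's implementation; `run_branch` and `fake_read_document` are hypothetical names.

```python
def run_branch(branches, default, value):
    """Run the action of the first branch whose condition holds, else the default."""
    for condition, action in branches:
        if condition(value):
            return action(value)
    return default(value)


def fake_read_document(state):
    """Hypothetical stand-in for the image-reading chain."""
    return {**state, "json_text": '{"Consignee": "ACME"}'}


def identity(state):
    """Default branch: pass the state through unchanged."""
    return state


branches = [(lambda s: s.get("json") is None, fake_read_document)]

# No JSON yet: the document is read and json_text is added
first_call = run_branch(branches, identity, {"json": None})
# JSON already available: the state passes through untouched
later_call = run_branch(branches, identity, {"json": '{"Consignee": "ACME"}'})
```

The first call gains a `json_text` entry, while the later call is returned as it came in.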

Executing the Python code

Run the generated chart code, which saves the resulting image.

    
    execute_code = RunnableBranch(
        (lambda x: '```python' in x.content, execute_and_return),
        lambda x: x
    )
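
Since this step runs code written by a model, it can be useful to execute it in its own namespace and keep the error text for a retry prompt. The sketch below is one possible variant, not the article's implementation; `run_generated_code` and its result dictionary are assumptions. A separate namespace only avoids clobbering notebook variables and is not a security sandbox.

```python
import re
import traceback

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
CODE_BLOCK = re.compile(FENCE + r"python\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_first_python_block(text):
    """Return the first fenced python block in the text, or None."""
    match = CODE_BLOCK.search(text)
    return match.group(1).strip() if match else None


def run_generated_code(text):
    """Execute the first python block in an isolated namespace and report the outcome."""
    code = extract_first_python_block(text)
    if code is None:
        return {"ok": False, "error": "no python block found", "namespace": {}}
    namespace = {}
    try:
        exec(code, namespace)  # isolated globals, but still NOT a sandbox
    except Exception:
        return {"ok": False, "error": traceback.format_exc(), "namespace": namespace}
    return {"ok": True, "error": None, "namespace": namespace}


reply = "Here is the code:\n" + FENCE + "python\nx = 1 + 2\n" + FENCE + "\nDone."
result = run_generated_code(reply)
```

When `ok` is false, the `error` text could be fed back to the model in a follow-up prompt asking it to fix the code.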

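Running the chains and returning the result

Run the chains above. The email content produced at the end is the function's return value.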

    # chain_a: read the document if needed, generate the chart code, run it
    chain_a = (
        chart_reading_branch
        # | RunnableLambda(print_and_return)
        | instruct_chain
        # | RunnableLambda(print_and_return)
        | update_json
        # | RunnableLambda(print_and_return)
        | execute_code
        # | RunnableLambda(print_and_return)
        # | instruct_email_chain
    )
    chain_a.invoke({"image_b64": image_b64, "input": user_input, "json": json_data}).content
    # chain_b: read the document if needed, then write the email
    chain_b = (
        chart_reading_branch
        # | RunnableLambda(print_and_return)
        | instruct_email_chain
        | update_json
    )

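Testing and tuning

Testing and tuning fell into three parts:

Document data extraction: mainly adjusting the prompt so that the model's JSON meets the requirements. In particular, some cells in the document contain several lines of text, and the prompt has to get them extracted under a single key. This part also includes post-processing the output JSON for downstream use.
Generating a Python program from the JSON to render the form image: this step is hard to control. The program currently produces an image containing the key content fairly reliably, but the layout is not well controlled and needs further optimization.
Generating the email from the JSON: little time was spent here. A separate chain currently handles this part; it could later be merged into a single chain.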
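
The JSON post-processing mentioned in the first item could look something like the sketch below. It assumes the model may wrap the JSON in extra text or a code fence and may return multi-line fields either as a string or as a list of lines; the function names `parse_model_json` and `normalize_fields` are hypothetical and not from the original code.

```python
import json
import re


def parse_model_json(text):
    """Extract the outermost {...} object from a model reply and parse it."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def normalize_fields(data):
    """Give every field a single stripped string, joining list values with newlines."""
    normalized = {}
    for key, value in data.items():
        if isinstance(value, list):
            value = "\n".join(str(item).strip() for item in value)
        normalized[key] = str(value).strip()
    return normalized


fence = "`" * 3  # three backticks
reply = (
    "Sure, here is the result:\n" + fence + "json\n"
    '{"Consignee": ["ACME Ltd", " 1 Harbour Rd "], "Port of Loading": " Shanghai "}\n'
    + fence
)
fields = normalize_fields(parse_model_json(reply))
```

Normalizing the values this way gives the chart and email prompts the same input shape regardless of how the vision model formatted its answer.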