RAG Multimodal AI Agent Model: AI's 'Super Search Engine', but the Kind That Can Talk

Latest recommended article published on 2024-09-07 08:00:00

立花云龙


Tags: artificial intelligence, python, language models

Article link: https://blog.csdn.net/qq_52529447/article/details/141291463


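1. Project Overview

This article uses RAG and a multimodal agent model to build a conversational system that can chat with a user and also accept and analyze images.

A RAG (Retrieval-Augmented Generation) dialogue model is a natural language processing technique intended to make dialogue systems more accurate. By combining information retrieval with a generative model, it can give richer and more relevant answers, and it does particularly well on complex conversations that need broad knowledge and detailed background.

Multi-Modal Agents can be understood, informally, as intelligent systems that act like an all-round assistant able to understand and process information from different sources. Such an agent can not only understand what you say but also see what you point at, and even interpret your actions. That lets it grasp your needs more fully and respond more accurately while helping you complete a task.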

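2. Technical Approach and Implementation Steps

Model selection

Many deep learning models can now handle human-machine question-answering dialogue. By parameter count and compute requirements they fall into large models and small models. Large models can handle complex tasks, generate high-quality answers, and carry out complex reasoning. Their drawbacks are that they need large amounts of training data to perform at their best, and that training and inference demand substantial compute and storage. Small models are the opposite.
Large LLMs are accurate but use a lot of memory, so this project uses the small model phi-3-small-128k-instruct instead. In testing, the small model's output sometimes showed hallucination, that is, answers that did not address the question, so RAG is used to correct the small model.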

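Data construction

The figure below shows the workflow of the RAG model.
[Figure: RAG workflow]
Text documents uploaded by the user are preprocessed, split into chunks, converted to vectors by the embedding model, and stored in the vector store.
When the user asks a question, the vector store is searched first, and the answer is generated from the retrieved passages. This is how the small model's output is corrected.
In my view, one real advantage of RAG is that you can tailor a model to your own needs simply by adding documents.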
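
To make the workflow above concrete, here is a minimal sketch of the retrieve-then-prompt idea in plain Python. It is not the project's pipeline: the `embed` function is a toy character-bigram counter standing in for a real embedding model, the three sample chunks are made up for illustration, and a plain list plays the role of the vector store.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: character-bigram counts (a stand-in for a real embedding model)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Hypothetical document chunks, as they might sit in a vector store
chunks = [
    "Bose-Einstein condensation occurs when bosons occupy the same quantum state near absolute zero.",
    "FAISS is a library for similarity search over dense vectors.",
    "Gradio builds simple web interfaces for Python functions.",
]

question = "What is Bose-Einstein condensation?"
context = retrieve(question, chunks, k=1)

# The retrieved text is placed in the prompt so the model answers from it
prompt = f"Answer solely based on the following context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

A real setup swaps `embed` for an embedding model and the list for a vector index, but the order of operations stays the same: embed the question, rank the stored chunks, and put the best ones in the prompt.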

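Feature integration

Feature 1: use RAG to correct the small model microsoft/phi-3-small-128k-instruct and improve its accuracy on dialogue tasks.
Feature 2: use the Microsoft Phi 3 vision model and the LangChain framework to build an interactive UI that can analyze the information in an image.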

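3. Implementation Steps

Environment setup

Everything below was done on Windows.

Install Anaconda
See the following link: Anaconda installation
Install the required libraries in Anaconda
See the following link: configuring the required libraries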

Code implementation

Go to the NVIDIA website: NIM home page
Click any model to obtain an api_key

Note: every time the api_key is regenerated, the previous api_key stops working
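
Because a regenerated key invalidates the old one, it can help to keep the key out of the source code and read it from the environment, so a new key only has to be changed in one place. The sketch below is one way to do that. It assumes the key is stored in an environment variable named `NVIDIA_API_KEY` and that valid keys start with `nvapi-`; the helper name `load_nvidia_api_key` is hypothetical.

```python
import os

def load_nvidia_api_key(env=os.environ, var_name="NVIDIA_API_KEY"):
    """Read the API key from the environment and do a basic sanity check.

    Assumption: NVIDIA-issued keys start with the prefix "nvapi-".
    """
    key = env.get(var_name, "").strip()
    if not key.startswith("nvapi-"):
        raise ValueError(f"{var_name} is missing or does not look like an NVIDIA API key")
    return key

# Usage with a fake environment, so the example runs without a real key
fake_env = {"NVIDIA_API_KEY": "nvapi-example-not-a-real-key"}
print(load_nvidia_api_key(fake_env)[:6])
```

With the real environment variable set, `load_nvidia_api_key()` can be passed wherever an `api_key` argument is expected.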

Test the api_key

# Simple LLM text-generation example
# Tests the api_key
import os
from openai import OpenAI

client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  # Read the key from the NVIDIA_API_KEY environment variable instead of hard-coding it
  api_key = os.environ["NVIDIA_API_KEY"]
)

completion = client.chat.completions.create(
  model="databricks/dbrx-instruct",
  # Prompt: "How can Bose-Einstein condensation be explained in plain terms?"
  messages=[{"role":"user","content":"如何通俗理解玻色爱因斯坦凝聚态"}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

# Print each streamed chunk as soon as it arrives
for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

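Generated result (translated from the Chinese output):

Bose-Einstein condensation
is a state of matter. Put simply, near absolute zero (−273.15 °C), bosons (a kind of elementary particle) gather together and form a macroscopic quantum state. This condensate has many special properties; for example, it can form a superfluid that flows without any friction, which cannot happen in classical physics. The discovery of Bose-Einstein condensation is of great significance for our understanding of quantum mechanics and statistical physics.

The model in this step's code can be swapped freely for any model listed on the NIM home page.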
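
The streaming loop in the test prints each piece as it arrives and then discards it. If the full reply is needed afterwards, the pieces can be collected instead. The sketch below shows the idea with fake chunk objects built from `SimpleNamespace`, so it runs without a network call. The helper names `collect_stream` and `fake_chunk` are hypothetical; only the `chunk.choices[0].delta.content` shape is taken from the test code.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join the text pieces of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:          # defensive: skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:          # some chunks carry no text at all
            parts.append(piece)
    return "".join(parts)

def fake_chunk(text):
    """Build an object with the same attribute path as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

fake_stream = [fake_chunk("Bose-"), fake_chunk(None), fake_chunk("Einstein")]
reply = collect_stream(fake_stream)
print(reply)
```

Passing the real `completion` object to `collect_stream` should work the same way, since it is iterated exactly as in the loop above.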

List the available models

from langchain_nvidia_ai_endpoints import ChatNVIDIA
ChatNVIDIA.get_available_models()

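Select the small model and the embedding model

import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# The key is read from the NVIDIA_API_KEY environment variable
nvapi_key = os.environ["NVIDIA_API_KEY"]

# Small models are more prone to hallucination
llm = ChatNVIDIA(model="microsoft/phi-3-small-128k-instruct", nvidia_api_key=nvapi_key, max_tokens=512)
# Prompt: "How can Bose-Einstein condensation be explained in plain terms?"
result = llm.invoke("如何通俗理解玻色爱因斯坦凝聚态")
print(result.content)

embedder = NVIDIAEmbeddings(model="ai-embed-qa-4")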

Build up the database

import os
from tqdm import tqdm
from pathlib import Path
from operator import itemgetter
from langchain.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
import faiss

ps = os.listdir("./zh_data/")
data = []
sources = []
for p in ps:
    if p.endswith('.txt'):
        path2file="./zh_data/"+p
        with open(path2file,encoding="utf-8") as f:
            lines=f.readlines()
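            for line in lines:
                # Skip blank lines (every line from readlines() still carries its newline)
                if line.strip():
                    data.append(line)
                    sources.append(path2file)

# Split the text into chunks before embedding it into the vector store, which improves RAG retrieval accuracy.
# chunk_size: 200-1000, depending on the embedding model's token limit.
# Note: separator=" " only splits at spaces, so text without spaces can yield chunks longer than chunk_size.
text_splitter = CharacterTextSplitter(chunk_size=400, separator=" ")
docs = []
metadatas = []

for i, d in enumerate(data):
    splits = text_splitter.split_text(d)
    #print(len(splits))
    docs.extend(splits)
    metadatas.extend([{"source": sources[i]}] * len(splits))

store = FAISS.from_texts(docs, embedder, metadatas=metadatas)
# Replace with the directory where the vector store should be saved
store.save_local('path/to/vector_store')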
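
One thing to watch with the splitter settings above: `CharacterTextSplitter` with `separator=" "` cuts only at spaces, and Chinese text usually has none. As an alternative, here is a sketch of a fixed-size character splitter with overlap, written in plain Python. The function name `split_by_chars` and the parameter values are illustrative only and are not what the project uses.

```python
def split_by_chars(text, chunk_size=400, overlap=50):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence that straddles
    a boundary still appears whole in at least one chunk most of the time.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# A Chinese sentence without spaces (19 characters), split into small chunks for display
sample = "玻色子在极低温度下会聚集到同一个量子态"
chunks = split_by_chars(sample, chunk_size=8, overlap=2)
for c in chunks:
    print(len(c), c)
```

The chunks returned this way could be passed to `FAISS.from_texts` in place of the output of `text_splitter.split_text`.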

Question answering with the database

# Use the documents to correct the small model and reduce hallucination
retriever = store.as_retriever()

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

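# Same question as before: "How can Bose-Einstein condensation be explained in plain terms?"
print(chain.invoke("如何通俗理解玻色爱因斯坦凝聚态"))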
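
In the chain above, the retriever's output goes straight into the `{context}` slot, so the prompt receives the string form of a list of document objects. A common refinement is to join just the text of each document first. The sketch below shows such a formatter. `Doc` is a hypothetical stand-in for LangChain's document class, with the same `page_content` and `metadata` fields, and the file name is made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join retrieved documents into one context string, tagging each with its source."""
    return "\n\n".join(
        f"[{d.metadata.get('source', 'unknown')}] {d.page_content}" for d in docs
    )

docs = [
    Doc("Bosons can share one quantum state.", {"source": "./zh_data/bec.txt"}),
    Doc("Fermions obey the Pauli exclusion principle."),
]
context = format_docs(docs)
print(context)
```

In an LCEL chain, writing `retriever | format_docs` in place of `retriever` should apply the formatter, since LangChain wraps plain functions into runnables when they are piped.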

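Encode the image as base64

import base64
from PIL import Image
from IPython.display import display  # available by default in Jupyter notebooks

# Read the image file and return its contents as a base64 string
def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64

image_b64 = image2b64("economic-assistance-chart.png")
display(Image.open("economic-assistance-chart.png"))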
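
The base64 string is later embedded in an `<img src="data:image/png;base64,...">` tag, with the MIME type fixed to PNG. Here is a sketch of building that data URL with the MIME type guessed from the file name, so a JPEG would be labelled correctly too. The helper name `to_data_url` is hypothetical, and the example encodes only the 8-byte PNG signature so that it runs without an image file.

```python
import base64
import mimetypes

def to_data_url(raw_bytes, filename):
    """Build a data URL for an image, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{filename} does not look like an image file")
    b64 = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# The 8-byte PNG file signature stands in for real image data
png_signature = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_signature, "economic-assistance-chart.png")
print(url)

# Decoding the payload gives back the original bytes
payload = url.split(",", 1)[1]
print(base64.b64decode(payload) == png_signature)
```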

Using the Microsoft Phi 3 vision model as an example, extract the data shown in the image

chart_reading = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
result = chart_reading.invoke(f'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />')
print(result.content)

Building a multimodal agent with LangChain
The functions below display output, execute code, and handle similar tasks

import base64
import re

# Save the table produced during the LangChain run into a global variable
def save_table_to_global(x):
    global table
    if 'TABLE' in x.content:
        table = x.content.split('TABLE', 1)[1].split('END_TABLE')[0]
    return x

# Helper function for debugging
def print_and_return(x):
    print(x)
    return x

# Process the code generated by the large model: strip comments and explanatory text, keep the Python code
def extract_python_code(text):
    pattern = r'```python\s*(.*?)\s*```'
    matches = re.findall(pattern, text, re.DOTALL)
    return [match.strip() for match in matches]

# Execute the code generated by the large model
def execute_and_return(x):
    code = extract_python_code(x.content)[0]
    try:
        exec(str(code))
    except Exception as e:
        print(f"The code is not executable, don't give up, try again! ({e})")
    return x

# Encode the image as base64 so it can be passed to the large model
def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64
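
To see what these helpers pull out of a model reply, here is a small self-contained sketch that runs the same two extraction steps on a made-up reply. The sample table and code are invented for illustration, `extract_table` is a hypothetical regex-based variant of the table step, and the code fence marker is built at runtime only so that this example contains no literal fence.

```python
import re

FENCE = "`" * 3  # the three-backtick marker, built at runtime

def extract_table(text):
    """Return the text between TABLE and END_TABLE, or None if there is no table."""
    match = re.search(r"TABLE(.*?)END_TABLE", text, re.DOTALL)
    return match.group(1).strip() if match else None

def extract_python_code(text):
    """Return the contents of every fenced python block in the text."""
    pattern = FENCE + r"python\s*(.*?)\s*" + FENCE
    return [m.strip() for m in re.findall(pattern, text, re.DOTALL)]

# A made-up reply in the format the instruct prompt asks for
reply = (
    "TABLE\nCountry,Aid\nUK,10\nEND_TABLE\n"
    + FENCE + "python\nprint('hello')\n" + FENCE
)

table = extract_table(reply)
code_blocks = extract_python_code(reply)
print(table)
print(code_blocks)
```

Note that running model-generated code with `exec` gives that code full access to the Python process, so it is best kept to a trusted, local setting.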

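The function below defines the multimodal data-analysis agent

from langchain_core.runnables import RunnableAssign, RunnableBranch, RunnableLambda

def chart_agent(image_b64, user_input, table):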
    # Chart reading Runnable
    chart_reading = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
        'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />'
    )
    chart_chain = chart_reading_prompt | chart_reading

    # Instruct LLM Runnable
    # instruct_chat = ChatNVIDIA(model="nv-mistralai/mistral-nemo-12b-instruct")
    # instruct_chat = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
    #instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")

    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. Based on this table {table}, {input}" \
        "If has table string, start with 'TABLE', end with 'END_TABLE'." \
        "If has code, start with '```python' and end with '```'." \
        "Do NOT include table inside code, and vice versa."
    )
    instruct_chain = instruct_prompt | instruct_chat

    # Decide whether to read the chart, depending on whether a table already exists
    chart_reading_branch = RunnableBranch(
        (lambda x: x.get('table') is None, RunnableAssign({'table': chart_chain })),
        (lambda x: x.get('table') is not None, lambda x: x),
        lambda x: x
    )
    # Update the table when the reply contains one
    update_table = RunnableBranch(
        (lambda x: 'TABLE' in x.content, save_table_to_global),
        lambda x: x
    )
    # Execute the chart-drawing code when the reply contains code
    execute_code = RunnableBranch(
        (lambda x: '```python' in x.content, execute_and_return),
        lambda x: x
    )

    chain = (
        chart_reading_branch
        #| RunnableLambda(print_and_return)
        | instruct_chain
        #| RunnableLambda(print_and_return)
        | update_table
        | execute_code
    )

    return chain.invoke({"image_b64": image_b64, "input": user_input, "table": table}).content

# Use the global variable table to store the data
table = None
# Convert the image to be processed to base64
image_b64 = image2b64("economic-assistance-chart.png")

# Show the image that was read
from PIL import Image
from IPython.display import display  # available by default in Jupyter notebooks
display(Image.open("economic-assistance-chart.png"))

# Convert the data in the image to a string
user_input = "show this table in string"
chart_agent(image_b64, user_input, table)
print(table)    # let's see what 'table' looks like now

# Let the agent try to modify the contents itself
user_input = "replace table string's 'UK' with 'United Kingdom'"
chart_agent(image_b64, user_input, table)
print(table)    # let's see what 'table' looks like now
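
The agent's control flow rests on `RunnableBranch`, which checks its (condition, action) pairs in order, runs the action of the first condition that holds, and otherwise falls back to the default given last. The plain-Python sketch below mimics that behaviour for the chart-reading step. The `branch` helper and the placeholder table string are hypothetical and do not use LangChain.

```python
def branch(*pairs_and_default):
    """Mimic first-match routing: (condition, action) pairs followed by a default action."""
    *pairs, default = pairs_and_default

    def run(x):
        for condition, action in pairs:
            if condition(x):
                return action(x)
        return default(x)

    return run

def read_chart(state):
    """Stand-in for the vision model call: fill in the missing table."""
    return {**state, "table": "<table read from image>"}

chart_reading_branch = branch(
    (lambda s: s.get("table") is None, read_chart),  # no table yet: read the chart
    lambda s: s,                                      # table present: pass through unchanged
)

first = chart_reading_branch({"input": "show this table", "table": None})
second = chart_reading_branch({"input": "edit the table", "table": "A,B"})
print(first["table"])
print(second["table"])
```

This is why the second call to the agent can skip the vision model: once `table` holds a value, the first condition fails and the state passes through untouched.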

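Adding a UI to the agent

# img_path is the path where the generated image is saved
img_path ='C:/Users/13116/2024_summer_bootcamp/image.png'
print(img_path)

# Execute the generated code, then return the path of the saved image
def execute_and_return_gr(x):
    code = extract_python_code(x.content)[0]
    try:
        exec(str(code))
    except Exception as e:
        print(f"The code is not executable, don't give up, try again! ({e})")
    return img_path

# Gradio passes the uploaded image as a file path, so it is base64-encoded first.
# table defaults to None because the Gradio interface supplies only the image and the text.
def chart_agent_gr(image_b64, user_input, table=None):
    image_b64 = image2b64(image_b64)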
    # Chart reading Runnable
    chart_reading = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
        'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />'
    )
    chart_chain = chart_reading_prompt | chart_reading

    # Instruct LLM Runnable
    # instruct_chat = ChatNVIDIA(model="nv-mistralai/mistral-nemo-12b-instruct")
    # instruct_chat = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
    #instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")

    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. Based on this table {table}, {input}" \
        "If has table string, start with 'TABLE', end with 'END_TABLE'." \
        "If has code, start with '```python' and end with '```'." \
        "Do NOT include table inside code, and vice versa."
    )
    instruct_chain = instruct_prompt | instruct_chat

    # Decide whether to read the chart, depending on whether a table already exists
    chart_reading_branch = RunnableBranch(
        (lambda x: x.get('table') is None, RunnableAssign({'table': chart_chain })),
        (lambda x: x.get('table') is not None, lambda x: x),
        lambda x: x
    )
    
    # Update the table when the reply contains one
    update_table = RunnableBranch(
        (lambda x: 'TABLE' in x.content, save_table_to_global),
        lambda x: x
    )

    # Execute the chart-drawing code when the reply contains code
    execute_code = RunnableBranch(
        (lambda x: '```python' in x.content, execute_and_return_gr),
        lambda x: x
    )
    
    # Assemble the full chain
    chain = (
        chart_reading_branch
        | RunnableLambda(print_and_return)
        | instruct_chain
        | RunnableLambda(print_and_return)
        | update_table
        | execute_code
    )

    return chain.invoke({"image_b64": image_b64, "input": user_input, "table": table})

Next, create a Gradio interface

import gradio as gr
multi_modal_chart_agent = gr.Interface(fn=chart_agent_gr,
                    inputs=[gr.Image(label="Upload image", type="filepath"), 'text'],
                    outputs=['image'],
                    title="Multi Modal chat agent",
                    description="Multi Modal chat agent",
                    allow_flagging="never")

multi_modal_chart_agent.launch(debug=True, share=False, show_api=False, server_port=5000, server_name="0.0.0.0")

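4. Project Results

Application scenarios

In physics education, RAG (Retrieval-Augmented Generation) can noticeably improve the learning and teaching experience. For example, a student who gets stuck on a physics problem can put the question to a RAG system. The system retrieves relevant information from textbooks and scientific literature and generates detailed explanations and examples, so the student receives individual tutoring. RAG can also be used to generate lab instructions and worked solutions for complex problems, helping students understand both experiments and theory. It can recommend resources that match a student's progress and automatically generate test and practice questions. In short, by combining retrieval with generation, RAG can give students more precise and timely learning support.
A multimodal agent can enrich the learning experience by integrating several sources of information. Image analysis can help explain complex physical phenomena. During teaching, the system can analyze images of physics experiments or demonstration models and combine them with text descriptions to generate explanations. For example, when a student watches a video of a mechanics experiment, the system can not only identify the objects and actions in the experiment but also use that visual information to generate a detailed explanation, helping the student understand how forces act and what effects they have.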

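Feature demonstration

The text answer obtained with RAG applied is more precise (translated from the Chinese output):

A Bose condensate is a state of matter in which particles (such as atoms or ions) gather together at very low temperature to form a single entity described by quantum mechanics. The state is named after the physicist Yuri Bose, who proposed Bose-Einstein statistics in 1925 to describe the behavior of particles at low temperature. In a Bose condensate the particles' wave functions overlap, forming a quantum state called a quantum fluctuation. These quantum fluctuations can interact, linking particles throughout the system. This linkage makes each particle behave the same way as the whole system, producing collective behavior. One example of a Bose condensate is the superconductor, which shows zero electrical resistance and zero magnetic resistance at low temperature. This is because in a superconductor electrons can be linked through quantum fluctuations, allowing current to flow without resistance. Another example is the superfluid, which flows without friction at low temperature. This is because in a superfluid atoms can be linked through quantum fluctuations, allowing the fluid to flow without friction. In short, a Bose condensate is a state of matter in which particles gather together at very low temperature to form a single entity described by quantum mechanics. Examples of this state are superconductors and superfluids, which show zero resistance and frictionless flow at low temperature.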

Below is the image-analysis feature
[Figure: image-analysis demo]