A Beginner's Hands-On Guide to RAG + Multimodal LLMs from Scratch: Smooth and Simple

amishor

Published 2024-08-18 13:22:56

Tags: artificial intelligence, nlp, llama, big data

Article link: https://blog.csdn.net/amishor/article/details/141298359

NVIDIA AI-AGENT Summer Camp

Project name: AI-AGENT Summer Camp: Vertical-Domain RAG + Multimodal LLM for Financial Risk Control

Report date: August 18, 2024

Project lead: Amishor

Code: https://github.com/amishior/LLM/tree/main

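Project overview:

Risk control is critical to the stable operation of companies and institutions in the financial industry. Traditional risk-control systems often struggle with complex data and regulatory requirements. To make modern risk-control techniques easy to pick up and understand, this project starts from scratch and combines regulatory documents with statistical data so that users can quickly understand and apply risk-management strategies. The system accepts text, image, and voice input, is simple to operate, and suits beginners who are learning and applying financial risk control.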

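Project highlights:

1. Applying RAG: retrieval-augmented generation searches the regulatory documents efficiently and generates responses grounded in what it finds, which makes regulations and policies easier to understand and apply.

2. Multimodal question-answering model: beyond conventional text Q&A, the system handles text, image, and voice input, and analyzes images such as statistical charts to produce detailed data analysis.

3. Practicality and innovation: by combining current AI techniques with financial risk-management practice, the project gives financial institutions a novel, practical risk-control tool that improves the efficiency and accuracy of decisions.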

Technical approach and implementation steps

Model selection:

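The project builds on NVIDIA's NIM platform and combines an embedding model (NV-Embed-QA) with two large language models (meta/llama-3.1-405b-instruct and microsoft/phi-3-vision-128k-instruct) to build the RAG + multimodal system. llama-3.1-405b-instruct serves as the generator: it composes answers from the passages that retrieval pulls out of a large text corpus, which makes the answers more precise and relevant to the context. phi-3-vision-128k-instruct broadens the kinds of input the system can handle. It accepts both text and images, so users can query complex statistical data simply by supplying an image.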

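Data construction:

Data construction consists mainly of collecting regulatory documents and vectorizing them. NV-Embed-QA converts the text into high-dimensional vectors so that it can be searched efficiently in vector space. Vectorization improves both retrieval speed and accuracy, and it keeps performing well as the dataset grows.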

import os

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# nvapi_key comes from get_nv_api_key() in the environment-setup step
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    os.environ["NVIDIA_API_KEY"] = nvapi_key

# Initialize the NV-Embed-QA embedding model
embedder = NVIDIAEmbeddings(model="NV-Embed-QA")

# Read every non-empty .txt file under ./zh_data/ and remember where it came from
data_dir = "./zh_data/"
documents = []
sources = []
for name in os.listdir(data_dir):
    if not name.endswith(".txt"):
        continue
    path2file = os.path.join(data_dir, name)
    with open(path2file, encoding="utf-8") as f:
        content = f.read()
    # Skip whitespace-only files so documents and sources stay aligned
    if content.strip():
        documents.append(content)
        sources.append(path2file)

# Split each document into chunks of about 500 characters, keeping the source file as metadata
text_splitter = CharacterTextSplitter(chunk_size=500, separator=" ")
docs = []
metadatas = []
for document, source in zip(documents, sources):
    splits = text_splitter.split_text(document)
    docs.extend(splits)
    metadatas.extend([{"source": source}] * len(splits))

# Embed the chunks, build the FAISS index, and save it to disk
store = FAISS.from_texts(docs, embedder, metadatas=metadatas)
store.save_local("./zh_data/nv_embedding")
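
One thing to watch in the splitting step: Chinese text contains few spaces, so a splitter that only breaks on " " may leave chunks much longer than the 500 characters requested. Below is a minimal sketch of an alternative that breaks on sentence-ending punctuation instead. The function name split_chinese_text and the punctuation set are my own assumptions for illustration, not part of LangChain or of the project code.

```python
import re


def split_chinese_text(text, chunk_size=500):
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    # Split right after Chinese/ASCII sentence punctuation or a newline,
    # keeping the punctuation attached to the sentence it ends.
    sentences = [s for s in re.split(r"(?<=[。！？；!?\n])", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than chunk_size is hard-cut into pieces.
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        # Start a new chunk when the next sentence would overflow this one.
        if len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

The resulting list could be passed to FAISS.from_texts in place of the output of text_splitter.split_text, with the metadata list built the same way.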

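Feature integration (advanced RAG):

The project adds speech generation, so the information that RAG retrieves can be delivered as audio. With the multimodal model integrated as well, the system can take statistical chart images uploaded by the user and turn them into data that can be analyzed.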
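
To send a chart image to a vision model over an HTTP API, the image usually has to be embedded in the request as text. Here is a minimal sketch of that step. Both helper names are hypothetical, and the inline img-tag layout of the prompt is an assumption about how the vision endpoint expects images, so check the model's documentation before relying on it.

```python
import base64


def image_bytes_to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_prompt(question, image_bytes):
    """Attach the image to the question (assumed inline <img> convention)."""
    uri = image_bytes_to_data_uri(image_bytes)
    return f'{question} <img src="{uri}" />'
```

In practice the bytes would come from reading the uploaded file in binary mode, for example open(path, "rb").read().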

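Implementation steps

1. Environment setup:

Setting up the development environment means installing Python and the required libraries (LangChain, Torch, and others) and configuring the necessary API keys and data storage paths. Anaconda isolates the environment so that the dependencies of different projects do not conflict.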

# Run in CMD
conda create --name ai_endpoint python=3.10
conda activate ai_endpoint
pip install jupyterlab
pip install langchain-nvidia-ai-endpoints
pip install langchain_core
pip install langchain matplotlib
pip install numpy
pip install faiss-cpu==1.7.2
pip install langchain-community
pip install torch
pip install openai-whisper
# Note: whisper and pydub call the ffmpeg executable, which must be on PATH; this pip package alone may not provide it
pip install ffmpeg
pip install edge-tts
pip install openai
pip install pydub
# gradio provides the front end used in the deployment step
pip install gradio
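
Several of these packages shell out to command-line tools, so it can save time to confirm the tools are reachable before running the notebook. A minimal sketch follows; the helper name missing_tools is hypothetical, and the tool list is my reading of what the later code invokes.

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # ffmpeg is used by whisper and pydub; edge-tts is the CLI called for speech
    missing = missing_tools(["ffmpeg", "edge-tts"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```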

Set the NVIDIA API key:

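import getpass
import os

def get_nv_api_key():
    """Return the NVIDIA API key, prompting for it if the environment has no valid one."""
    if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
        print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
    else:
        nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
        assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
        os.environ["NVIDIA_API_KEY"] = nvapi_key
    # Read the key back from the environment so both branches return a value
    return os.environ["NVIDIA_API_KEY"]

# The later snippets refer to this variable
nvapi_key = get_nv_api_key()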
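
For anyone who wants to exercise the key-handling logic without an interactive prompt, here is a minimal sketch of the same check with the environment and the prompt passed in as arguments. The name resolve_api_key and its two parameters are hypothetical. The "nvapi-" prefix check comes from the function above.

```python
def resolve_api_key(env, ask):
    """Return a valid key from env, or ask for one and store it in env.

    env is any dict-like mapping (os.environ in real use) and ask is a
    zero-argument callable that returns the key typed by the user.
    """
    key = env.get("NVIDIA_API_KEY", "")
    if not key.startswith("nvapi-"):
        key = ask()
        if not key.startswith("nvapi-"):
            # Do not echo any part of the key in the error message
            raise ValueError("NVIDIA API keys are expected to start with 'nvapi-'")
        env["NVIDIA_API_KEY"] = key
    return key
```

In a notebook this could be called as resolve_api_key(os.environ, lambda: getpass.getpass("NVAPI Key: ")).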

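2. Testing and tuning

The tests cover several scenarios: regulatory text files, text-to-speech, and statistical chart images as input. Tuning the model and adding more data improved the system's response speed and accuracy.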
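
A lightweight way to track both speed and accuracy while tuning is to run a fixed set of questions and record latency and keyword hits for each. This is a minimal sketch under my own assumptions: the harness name run_test_cases, the (question, keywords) case format, and the keyword-matching criterion are all hypothetical and not taken from the project.

```python
import time


def run_test_cases(answer_fn, cases):
    """Run each (question, expected_keywords) case and record the outcome."""
    results = []
    for question, expected_keywords in cases:
        start = time.perf_counter()
        answer = answer_fn(question)
        elapsed = time.perf_counter() - start
        # A case passes when every expected keyword appears in the answer
        missing = [kw for kw in expected_keywords if kw not in answer]
        results.append(
            {
                "question": question,
                "seconds": elapsed,
                "passed": not missing,
                "missing": missing,
            }
        )
    return results
```

Any callable that maps a question string to an answer string can be passed as answer_fn, for example the rag_model function defined later in this post.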

RAG workflow: LangChain orchestrates document retrieval and the call to the LLM:

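from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Wrap the FAISS store as a retriever that returns the chunks closest to the question
retriever = store.as_retriever()
llm = ChatNVIDIA(model="meta/llama-3.1-405b-instruct", nvidia_api_key=nvapi_key, max_tokens=1024)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

# The question goes to the retriever (to fill {context}) and is passed through unchanged as {question}
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)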
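
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with plain Python lists and cosine similarity. It is an illustration only: the vectors are toy values, the function names are hypothetical, and FAISS uses its own index structures and distance metric rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in scored[:k]]
```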

Speech generation module: edge_tts produces the spoken answer:

import os
import shutil
import uuid

from pydub import AudioSegment

# Output directory for the generated audio files
os.makedirs("./audio", exist_ok=True)

Language = "Chinese"
voice_name = "zh-CN-XiaoxiaoNeural"
speed = 1


#@title Edge TTS
def calculate_rate_string(input_value):
    # Convert a speed multiplier (1 = normal) into a signed percentage, e.g. 1.5 -> "+50"
    rate = (input_value - 1) * 100
    sign = '+' if input_value >= 1 else '-'
    return f"{sign}{abs(int(rate))}"

def make_chunks(input_text, language):
    # Split on the sentence terminator: the full stop "。" for Chinese, "." otherwise
    terminator = "。" if language == "Chinese" else "."
    temp_list = input_text.strip().split(terminator)
    # Re-attach the terminator and drop empty pieces and stray quote marks
    filtered_list = [element.strip() + terminator for element in temp_list[:-1] if element.strip() and element.strip() not in ("'", '"')]
    if temp_list[-1].strip():
        filtered_list.append(temp_list[-1].strip())
    return filtered_list

def tts_file_name(text):
    if text.endswith("."):
        text = text[:-1]
    text = text.lower()
    text = text.strip()
    text = text.replace(" ","_")
    truncated_text = text[:25] if len(text) > 25 else text if len(text) > 0 else "empty"
    random_string = uuid.uuid4().hex[:8].upper()
    file_name = f"./audio/{truncated_text}_{random_string}.mp3"
    return file_name

def merge_audio_files(audio_paths, output_path):
    # Initialize an empty AudioSegment
    merged_audio = AudioSegment.silent(duration=0)

    # Iterate through each audio file path
    for audio_path in audio_paths:
        # Load the audio file using Pydub
        audio = AudioSegment.from_file(audio_path)

        # Append the current audio file to the merged_audio
        merged_audio += audio

    # Export the merged audio to the specified output path
    merged_audio.export(output_path, format="mp3")

import os
import shutil
import subprocess

def edge_free_tts(chunks_list, speed, voice_name, save_path):
    def synthesize(text, out_path):
        # Pass the arguments as a list so quotes inside the text cannot break the command
        command = [
            "edge-tts",
            f"--rate={calculate_rate_string(speed)}%",
            "--voice", voice_name,
            "--text", text,
            "--write-media", out_path,
        ]
        if subprocess.run(command).returncode != 0:
            print(f"Failed: {text}")
            return False
        return True

    if len(chunks_list) > 1:
        # Synthesize each chunk into a scratch directory, then merge the pieces
        chunk_dir = "./audio/edge_tts_voice"
        if os.path.exists(chunk_dir):
            shutil.rmtree(chunk_dir)
        os.makedirs(chunk_dir)
        chunk_audio_list = []
        for k, chunk in enumerate(chunks_list, start=1):
            chunk_path = f"{chunk_dir}/{k}.mp3"
            # Only merge chunks that were actually written
            if synthesize(chunk, chunk_path):
                chunk_audio_list.append(chunk_path)
        merge_audio_files(chunk_audio_list, save_path)
    else:
        synthesize(chunks_list[0], save_path)
    return save_path

def random_audio_name_generate():
  random_uuid = uuid.uuid4()
  audio_extension = ".mp3"
  random_audio_name = str(random_uuid)[:8] + audio_extension
  return random_audio_name

# Set to True to chunk by the configured Language instead of the Chinese default
translate_text_flag = False

def talk(input_text):
    # Only long answers are split into sentence chunks; short ones are synthesized in one call
    if len(input_text) >= 600:
        chunk_language = Language if translate_text_flag else "Chinese"
        chunks_list = make_chunks(input_text, chunk_language)
    else:
        chunks_list = [input_text]
    save_path = "./audio/" + random_audio_name_generate()
    edge_save_path = edge_free_tts(chunks_list, speed, voice_name, save_path)
    return edge_save_path

def convert_to_text(audio_path):
    # Imported here because loading whisper is slow and only voice input needs it
    import whisper
    select_model = "base"  # 'tiny' or 'base'
    whisper_model = whisper.load_model(select_model)
    result = whisper_model.transcribe(audio_path, word_timestamps=True, fp16=False, language='Chinese')
    # Save the raw transcription result for inspection
    with open('scan.txt', 'w', encoding='utf-8') as file:
        file.write(str(result))
    return result["text"]
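
A note on the speed setting: calculate_rate_string truncates with int(), and floating-point arithmetic can leave the result one percent low (for instance, a speed of 1.15 can come out as 14 rather than 15). A minimal sketch of a rounding variant is below. The name speed_to_rate is hypothetical, and unlike the original it returns the percent sign as part of the string.

```python
def speed_to_rate(speed):
    """Convert a speed multiplier (1.0 = normal) to a signed percent string."""
    # round() avoids the off-by-one that int() truncation can introduce
    percent = round((speed - 1) * 100)
    # The "+d" format always writes the sign, so 0 becomes "+0"
    return f"{percent:+d}%"
```

The returned value could be used directly as the value of the --rate option, without appending another percent sign.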

Text-to-speech pipeline: RAG retrieves information from the regulatory documents and the answer is rendered as speech:

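from IPython.display import Audio, display
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

def rag_model(question):
    # Load the vector store back from disk.
    embedder = NVIDIAEmbeddings(model="NV-Embed-QA")
    # allow_dangerous_deserialization is required to unpickle the index; only load files you created yourself
    store = FAISS.load_local("./zh_data/nv_embedding", embedder, allow_dangerous_deserialization=True)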
    retriever = store.as_retriever()
    llm = ChatNVIDIA(model="meta/llama-3.1-405b-instruct", nvidia_api_key=nvapi_key, max_tokens=1024)
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
            ),
            ("user", "{question}"),
        ]
    )

    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(question)

def run_text_prompt(message, chat_history):
    bot_message = rag_model(message)
    edge_save_path=talk(bot_message)
    display(Audio(edge_save_path, autoplay=True))

    chat_history.append((message, bot_message))
    return edge_save_path, chat_history


def run_audio_prompt(audio, chat_history):
    if audio is None:
        return None, chat_history
    message_transcription = convert_to_text(audio)
    edge_save_path, chat_history = run_text_prompt(message_transcription, chat_history)
    return edge_save_path, chat_history
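
In rag_model the retriever's output is placed into the prompt as-is, so the model sees the printed form of a list of document objects. A common refinement is to join only the text of each chunk, optionally with its source file. The sketch below shows that with a small stand-in class. Doc is hypothetical and only mimics the page_content and metadata attributes of a LangChain document, and format_docs is my own helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join retrieved chunks into one context string, labeled by source."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[source: {source}]\n{doc.page_content}")
    return "\n\n".join(parts)
```

In the chain this would be wired in as {"context": retriever | format_docs, "question": RunnablePassthrough()}, assuming the installed LangChain version accepts a plain function after the pipe operator.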

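3. Integration and deployment

The modules are integrated through APIs and deployed locally, with Gradio providing the front end. Stability was a focus during deployment.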

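import gradio as gr

# chart_agent_gr (not shown in this post) receives the image file path and the text prompt and returns an image
multi_modal_chart_agent = gr.Interface(fn=chart_agent_gr,
                    inputs=[gr.Image(label="Upload image", type="filepath"), 'text'],
                    outputs=['image'],
                    title="Multi Modal chat agent",
                    description="Multi Modal chat agent",
                    allow_flagging="never")

# Start the local web UI
multi_modal_chart_agent.launch()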