One-Click Meeting Minutes from Recordings: An NVIDIA AI-AGENT Multimodal Assistant for Speech, Text, and Images

menbinwan

Last modified: 2024-08-18 15:37:42

Tags: ai

First published: 2024-08-18 15:32:17

Article link: https://blog.csdn.net/menbinwan/article/details/141299964

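Project name: AI-AGENT Summer Training Camp - RAG Intelligent Chatbot

Report date: August 18, 2024

Project lead: 林三胖

Project overview (required):

The project uses NVIDIA NIM to call Microsoft's multimodal Phi model, together with OpenAI Whisper for speech recognition, Microsoft Edge text-to-speech (edge-tts) for speech output, and LangChain to define the full chain, resulting in a combined RAG and agent pipeline.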
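The overall flow described above (speech in, summary, speech out) can be sketched as three stages wired together. This is only an illustrative skeleton, not the project's actual code: the class name MeetingPipeline and the three stage names are hypothetical, and the stages are injected as plain callables so that real Whisper, Phi-3, and edge-tts wrappers could be plugged in later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MeetingPipeline:
    """Wires three stages together; each stage is an injected callable."""
    transcribe: Callable[[str], str]  # audio path -> transcript (e.g. Whisper)
    summarize: Callable[[str], str]   # transcript -> minutes (e.g. Phi-3 via NIM)
    speak: Callable[[str], str]       # minutes -> path of spoken audio (e.g. edge-tts)

    def run(self, audio_path: str) -> dict:
        transcript = self.transcribe(audio_path)
        minutes = self.summarize(transcript)
        spoken = self.speak(minutes)
        return {"transcript": transcript, "minutes": minutes, "audio": spoken}


# Stand-in stages so the wiring can be exercised without any model or network.
demo = MeetingPipeline(
    transcribe=lambda path: f"transcript of {path}",
    summarize=lambda text: text.upper(),
    speak=lambda text: "minutes.mp3",
)
result = demo.run("meeting.wav")
```

Keeping the stages as callables means each model can be swapped or mocked independently, which is convenient when only some of the services are reachable.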

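Technical approach and implementation steps

1. Model selection (required): describe the technical approach in detail, including the reasons for choosing the large model and an analysis of the advantages of the RAG model.

Phi-3 is a family of small language models (SLMs) from Microsoft Research, designed to deliver language understanding and reasoning comparable to much larger models while keeping the parameter count small.

OpenAI Whisper is an open-source automatic speech recognition (ASR) model developed by OpenAI. It supports many languages, English included, covering most of the world's population and regions. According to publicly released information it supports 99 languages, among them Chinese, Japanese, French, German, and Spanish. Beyond speech recognition, Whisper also handles speech translation, language identification, and voice activity detection, so it can be used flexibly across different scenarios.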
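A minimal sketch of how a recording might be transcribed, assuming the open-source openai-whisper package is run locally (the article does not show how Whisper was actually invoked). The helper names format_segments and transcribe_meeting and the "base" model size are illustrative choices; the result keys "text" and "segments" are the ones returned by the package's transcribe call.

```python
def format_segments(segments):
    """Turn Whisper-style segments into '[mm:ss - mm:ss] text' lines."""
    def mmss(seconds):
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"

    return [
        f"[{mmss(seg['start'])} - {mmss(seg['end'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_meeting(audio_path, model_size="base"):
    """Transcribe a recording; needs `pip install openai-whisper` and ffmpeg."""
    import whisper  # imported lazily so format_segments works without it

    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return result["text"], format_segments(result["segments"])
```

The timestamped lines are handy as input for a summarization prompt, because the model can then refer back to when something was said.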

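2. Data construction (required): explain how the data was built, how it was vectorized, and the advantages of that approach.

In a RAG (Retrieval-Augmented Generation) system, embeddings play a central role. An embedding converts discrete, unstructured data (words, sentences, documents, and so on) into continuous vector representations that capture the data's semantic and syntactic information.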
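To make the idea concrete, here is a toy illustration of retrieval over embedding vectors using cosine similarity. The three-dimensional vectors and topic names are made up for the example; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hand-written toy vectors standing in for real embeddings.
toy_vectors = {
    "budget review": [0.9, 0.1, 0.0],
    "hiring plan": [0.1, 0.9, 0.1],
    "release schedule": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
best = top_k(query, toy_vectors, k=1)
```

A vector store such as FAISS performs the same kind of nearest-neighbour lookup, only with index structures that scale to large collections.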

This project uses langchain.vectorstores from LangChain to build embeddings for the documents and images, which are then used for later retrieval.
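The article does not show the indexing code, so the following is only a sketch of what that step could look like. chunk_text is a simplified stand-in for LangChain's RecursiveCharacterTextSplitter (fixed-size windows with overlap), and build_index is a hypothetical helper that passes the chunks to FAISS.from_texts; the chunk size and overlap values are arbitrary.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


def build_index(chunks, embedder):
    """Embed chunks and store them in FAISS; needs faiss-cpu and langchain."""
    from langchain.vectorstores import FAISS  # same import path as the article

    return FAISS.from_texts(chunks, embedding=embedder)
```

With an NVIDIA API key configured, one plausible usage is to pass an NVIDIAEmbeddings instance as the embedder and then call similarity_search(query, k=3) on the returned store.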

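3. Feature integration (required for advanced RAG): describe the integration strategy and implementation of advanced features such as speech, agent, and multimodal capabilities.

Implementation steps:

Environment setup (required): describe how the development environment was set up, including installation and configuration of the required software and libraries.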

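!pip install -U langchain
!pip install faiss-cpu
!pip install langchain_nvidia_ai_endpoints
# Also needed by the code below: the OpenAI client, the edge-tts CLI, and pydub (which requires ffmpeg)
!pip install openai edge-tts pydub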


from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

from openai import OpenAI

# List the models that can be called through the NVIDIA endpoints
ChatNVIDIA.get_available_models()
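Once a model has been picked from that list, a summarization call might look like the sketch below. The prompt wording, both helper names, and the model id microsoft/phi-3-mini-4k-instruct are assumptions made for illustration; the id should be checked against the list returned above, and an NVIDIA_API_KEY must be set in the environment.

```python
def build_minutes_prompt(transcript, max_points=5):
    """Compose a plain-text instruction asking for concise meeting minutes."""
    return (
        "You are a meeting assistant. Summarize the transcript below into "
        f"at most {max_points} bullet points, then list action items.\n\n"
        f"Transcript:\n{transcript.strip()}"
    )


def summarize_with_phi3(transcript, model="microsoft/phi-3-mini-4k-instruct"):
    """Send the prompt to a NIM-hosted chat model (model id is an assumption)."""
    from langchain_nvidia_ai_endpoints import ChatNVIDIA  # lazy import

    llm = ChatNVIDIA(model=model)
    return llm.invoke(build_minutes_prompt(transcript)).content
```

Separating prompt construction from the network call keeps the prompt easy to inspect and adjust without spending API calls.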

Code implementation (required): list the key implementation steps; screenshots or code blocks of the key code may be attached.

#@title Edge TTS
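def calculate_rate_string(input_value):
    # Convert a speed multiplier (1 = normal) into a signed percentage, e.g. 1.2 -> "+20"
    rate = round((input_value - 1) * 100)  # round() avoids 1.2 becoming 19 through float truncation
    sign = '+' if input_value >= 1 else '-'
    return f"{sign}{abs(rate)}"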


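def make_chunks(input_text, language):
    # Split text into sentence-sized chunks so each edge-tts call stays short.
    # Only English-style "." splitting is implemented; other languages are
    # returned as a single chunk instead of falling through and returning None.
    if language != "English":
      return [input_text]
    temp_list = input_text.strip().split(".")
    filtered_list = [element.strip() + '.' for element in temp_list[:-1] if element.strip() and element.strip() != "'" and element.strip() != '"']
    if temp_list[-1].strip():
        filtered_list.append(temp_list[-1].strip())
    return filtered_list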


import re
import uuid
def tts_file_name(text):
    # Build a unique .mp3 path from the first 25 characters of the text
    if text.endswith("."):
        text = text[:-1]
    text = text.lower()
    text = text.strip()
    # Replace runs of non-word characters (spaces, commas, quotes, slashes) so the name is filesystem-safe
    text = re.sub(r"\W+", "_", text).strip("_")
    truncated_text = text[:25] if text else "empty"
    random_string = uuid.uuid4().hex[:8].upper()
    file_name = f"./content/edge_tts_voice/{truncated_text}_{random_string}.mp3"
    print('tts_file_name:',file_name)
    return file_name


from pydub import AudioSegment
import shutil
import os
def merge_audio_files(audio_paths, output_path):
    print('audio_paths:', audio_paths)
    # Initialize an empty AudioSegment
    merged_audio = AudioSegment.silent(duration=0)

    # Iterate through each audio file path
    for audio_path in audio_paths:
        # Load the audio file using Pydub
        audio = AudioSegment.from_file(audio_path)

        # Append the current audio file to the merged_audio
        merged_audio += audio

    # Export the merged audio to the specified output path
    merged_audio.export(output_path, format="mp3")

import subprocess

def run_edge_tts(text, speed, voice_name, out_path):
  # Pass arguments as a list (no shell) so quotes inside the text cannot break the command
  edge_command=["edge-tts", f"--rate={calculate_rate_string(speed)}%", "--voice", voice_name, f"--text={text}", "--write-media", out_path]
  print('edge_command:', edge_command)
  return subprocess.run(edge_command).returncode

def edge_free_tts(chunks_list,speed,voice_name,save_path):
  print("chunks_list:",chunks_list)
  if len(chunks_list)>1:
    chunk_audio_list=[]
    # Start from an empty scratch directory for the per-chunk audio files
    if os.path.exists("./content/edge_tts_voice"):
      shutil.rmtree("./content/edge_tts_voice")
    os.makedirs("./content/edge_tts_voice")
    for k, chunk in enumerate(chunks_list, start=1):
      filename = f"./content/edge_tts_voice/{k}.mp3"
      returncode = run_edge_tts(chunk, speed, voice_name, filename)
      assert returncode==0, f"edge-tts failed on chunk {k}: {chunk}"
      # Only merge chunks that actually produced a non-empty audio file
      if os.path.exists(filename) and os.path.getsize(filename):
        chunk_audio_list.append(filename)
      else:
        print(f"missing or empty audio for chunk {k}:", filename)
    print('chunk_audio_list:', chunk_audio_list)
    merge_audio_files(chunk_audio_list, save_path)
  else:
    # A single chunk is written straight to save_path, no merge needed
    returncode = run_edge_tts(chunks_list[0], speed, voice_name, save_path)
    if returncode!=0:
      raise RuntimeError(f"edge-tts failed on: {chunks_list[0]}")
  return save_path

# text = "This is Microsoft Phi 3 mini 4k instruct Demo" Simply update the text variable with the text you want to convert to speech
text = "This is Microsoft Phi 3 mini 4k instruct Demo, The provided code is a Python script designed to convert text to speech using Microsoft's Edge TTS service. It includes functions to calculate the speech rate, split text into manageable chunks, generate unique filenames for audio files, and merge multiple audio files into one. The edge_free_tts function handles the text-to-speech conversion, while the talk function orchestrates the process based on the input text's length. The script also includes parameters for language, gender of the voice, and speech speed. Finally, it plays the generated audio using IPython's Audio class, making it suitable for use in Jupyter Notebooks."  # @param {type: "string"}
Language = "English" # @param ['English']
# Gender of voice simply change from male to female and choose the voice you want to use
Gender = "Female"# @param ['Male', 'Female']
female_voice="en-US-AriaNeural"# @param["en-US-AriaNeural",'zh-CN-XiaoxiaoNeural','zh-CN-XiaoyiNeural']
speed = 1  # @param {type: "number"}
translate_text_flag  = False
if len(text) >= 7:
  long_sentence = True
else:
  long_sentence = False

# long_sentence = False # @param {type:"boolean"}
save_path = ''  # @param {type: "string"}
if len(save_path)==0:
  save_path=tts_file_name(text)
if Language == "English" :
  # Pick the voice by gender; anything other than "Male" uses the selected female voice,
  # so voice_name is always defined
  if Gender=="Male":
    voice_name="en-US-ChristopherNeural"
  else:
    voice_name=female_voice


if translate_text_flag:
  input_text=text
  # input_text=translate_text(text, Language)
  # print("Translateting")
else:
  input_text=text
if long_sentence==True and translate_text_flag==True:
  chunks_list=make_chunks(input_text,Language)
elif long_sentence==True and translate_text_flag==False:
  chunks_list=make_chunks(input_text,"English")
else:
  chunks_list=[input_text]
# print(chunks_list)
# edge_save_path=edge_free_tts(chunks_list,speed,voice_name,save_path)
# from IPython.display import clear_output
# clear_output()
# from IPython.display import Audio
# Audio(edge_save_path, autoplay=True)

from IPython.display import clear_output
from IPython.display import Audio
# makedirs also creates the parent ./content directory if it does not exist yet
os.makedirs("./content/audio", exist_ok=True)
import uuid
def random_audio_name_generate():
  random_uuid = uuid.uuid4()
  audio_extension = ".mp3"
  random_audio_name = str(random_uuid)[:8] + audio_extension
  return random_audio_name
def talk(input_text):
  global translate_text_flag,Language,speed,voice_name
  if len(input_text)>=7:
    long_sentence = True
  else:
    long_sentence = False

  if long_sentence==True and translate_text_flag==True:
    print('1')
    chunks_list=make_chunks(input_text,Language)
  elif long_sentence==True and translate_text_flag==False:
    print('2')
    chunks_list=make_chunks(input_text,"English")
  else:
    print('3')
    chunks_list=[input_text]
  save_path="./content/audio/"+random_audio_name_generate()
  edge_save_path=edge_free_tts(chunks_list,speed,voice_name,save_path)
  return edge_save_path


edge_save_path=talk(text)
Audio(edge_save_path, autoplay=True)
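As an alternative to calling the edge-tts command line, the same package also offers a Python API. The sketch below assumes the edge_tts.Communicate class and its save coroutine; the helper names rate_string and synthesize are hypothetical and not part of the original notebook.

```python
import asyncio


def rate_string(speed):
    """Map a speed multiplier (1.0 = normal) to edge-tts's '+N%' / '-N%' form."""
    percent = round((speed - 1) * 100)
    return f"{percent:+d}%"


async def synthesize(text, voice, speed, out_path):
    """Write spoken audio to out_path; needs `pip install edge-tts` and network access."""
    import edge_tts  # lazy import so rate_string is usable without the package

    communicate = edge_tts.Communicate(text, voice, rate=rate_string(speed))
    await communicate.save(out_path)
    return out_path


# In a plain script (outside Jupyter) this could be run as:
# asyncio.run(synthesize("Hello", "en-US-AriaNeural", 1.0, "hello.mp3"))
```

Inside a Jupyter notebook an event loop is already running, so the coroutine would be awaited directly (await synthesize(...)) instead of being passed to asyncio.run.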