Multiple file loading and embeddings with OpenAI

Background:

I am trying to load a set of PDF files and query them using the OpenAI APIs.

from langchain.text_splitter import CharacterTextSplitter
#from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings
import pickle
import os


print("Loading data...")
pdf_folder_path = "content/"
print(os.listdir(pdf_folder_path))

# Load multiple files
# location of the pdf file/files. 
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]


print(loaders)

alldocument = []
vectorstore = None
for loader in loaders:

    print("Loading raw document..." + loader.file_path)
    raw_documents = loader.load()

    print("Splitting text...")
    text_splitter = CharacterTextSplitter(
        separator="\n\n",
        chunk_size=800,
        chunk_overlap=100,
        length_function=len,
    )
    documents = text_splitter.split_documents(raw_documents)
    #alldocument = alldocument + documents

    print("Creating vectorstore...")
    embeddings = OpenAIEmbeddings()
    
    vectorstore = FAISS.from_documents(documents, embeddings)

    #with open("vectorstore.pkl", "wb") as f:
    with open("vectorstore.pkl", "ab") as f:
        pickle.dump(vectorstore, f)
        f.close()

I am trying to load multiple files for Q&A, but the index only remembers the last file loaded from the folder.

Do I need to restructure the `for` loop, or pass a different parameter to `open`?

Solution:

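The problem is that each iteration of the loop creates a new vectorstore from only the current file's chunks and overwrites the previous one, so no vectorstore ever contains more than one file. Opening "vectorstore.pkl" in append mode does not merge them either: the file ends up holding several separate pickled objects, and a single pickle.load call returns only one of them. Instead, collect the chunks from all files first, then build and save a single vectorstore after the loop.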
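
To see why append mode ("ab") in the question's code does not combine the per-file stores, here is a minimal sketch using only the standard library. The dicts and the file names a.pdf, b.pdf, c.pdf are hypothetical stand-ins for the per-file vectorstores, not real langchain objects.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the per-file vectorstores (plain dicts, no langchain needed).
stores = [{"source": "a.pdf"}, {"source": "b.pdf"}, {"source": "c.pdf"}]

path = os.path.join(tempfile.mkdtemp(), "vectorstore.pkl")

# Same pattern as the question: one dump per loop iteration, in append mode.
for store in stores:
    with open(path, "ab") as f:
        pickle.dump(store, f)

# A single load returns one pickled object, not a merged index.
with open(path, "rb") as f:
    first = pickle.load(f)

# Recovering every object requires repeated loads until end of file.
recovered = []
with open(path, "rb") as f:
    while True:
        try:
            recovered.append(pickle.load(f))
        except EOFError:
            break

print(first)
print(len(recovered))
```

Even when every object is read back this way, the result is still a list of separate stores rather than one searchable index, which is why building a single vectorstore from all chunks is the simpler route.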

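# Same imports as in the question
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings
import pickle
import os

print("Loading data...")
pdf_folder_path = "content/"
print(os.listdir(pdf_folder_path))

# Create one loader per file in the folder
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]

print(loaders)

# Build the splitter once and reuse it for every file
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,
)

# Collect the chunks from every file before building the index
all_documents = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    raw_documents = loader.load()

    print("Splitting text...")
    documents = text_splitter.split_documents(raw_documents)
    all_documents.extend(documents)

# Build a single vectorstore over all chunks, outside the loop
print("Creating vectorstore...")
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(all_documents, embeddings)

# Write the combined vectorstore once ("wb" overwrites any previous file)
with open("vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)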
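
One fragile spot in the code above is that os.listdir returns every entry in the folder, including subdirectories and non-PDF files, and each one is handed to UnstructuredPDFLoader. A possible guard, sketched here with a hypothetical helper named list_pdf_paths (not part of langchain), is to keep only regular files with a .pdf extension:

```python
from pathlib import Path
import tempfile


def list_pdf_paths(folder):
    """Return sorted paths of regular files in `folder` whose extension is .pdf (any case)."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Small demonstration on a throwaway directory with mixed contents.
demo = Path(tempfile.mkdtemp())
for name in ["a.pdf", "B.PDF", "notes.txt", ".DS_Store"]:
    (demo / name).write_bytes(b"")
(demo / "subdir").mkdir()

pdf_paths = list_pdf_paths(demo)
print([Path(p).name for p in pdf_paths])
```

With such a helper, the loaders could be built as [UnstructuredPDFLoader(p) for p in list_pdf_paths(pdf_folder_path)], assuming the folder layout from the question.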
