[ChatGPT] Using a PDF file as a knowledge source

Article link: https://blog.csdn.net/u013066244/article/details/131332365

Overview

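Answer questions from a designated knowledge source. This works well for specialized domains inside a company.
The example below uses the file 2023_GPT4All_Technical_Report.pdf as the knowledge source for answering questions.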

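In detail:

Load the PDF file and read its text content.
Split the content into chunks and hand them to OpenAI embeddings, which builds the mapping between each piece of knowledge's "door number" (its vector) and its "room" (the actual text).
Use FAISS (short for Facebook AI Similarity Search) to search for the chunks most similar to the question.
Pass the question and the retrieved chunks to OpenAI, which composes and polishes the final answer.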
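
To make the embed-and-search steps above more concrete, here is a minimal sketch that runs without any API key. It is not what OpenAI or FAISS do internally: the `embed` function is a toy word-count stand-in for a real embedding model, the search is a brute-force cosine ranking rather than a FAISS index, and the sample chunks, `similarity_search`, and `build_prompt` are hypothetical names and data invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag of lowercase word counts (assumption, not OpenAI's model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, chunks, k=1):
    """Return the k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs):
    """Put the retrieved chunks and the question into one prompt for the LLM."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up chunks standing in for the text extracted from a PDF.
chunks = [
    "cats sleep most of the day",
    "the invoice is due in thirty days",
    "python lists are mutable sequences",
]

question = "when is the invoice due"
top = similarity_search(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A real embedding model captures meaning rather than exact word overlap, and FAISS avoids comparing the query against every chunk, but the overall flow (vectorize, rank by similarity, pass the best chunks to the model) has the same shape.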

Prerequisites

pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken

Code

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # put your own key here; never publish a real key


# location of the pdf file/files. 
reader = PdfReader('/Users/yutao/Downloads/2023_GPT4All_Technical_Report.pdf')

# read the text of every page and accumulate it in raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

# raw_text
# raw_text[:100]

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

print(len(texts))
# print(texts[0])

# Create the OpenAI embeddings client; the vectors are computed when the index is built below
embeddings = OpenAIEmbeddings()

# FAISS is short for Facebook AI Similarity Search,
# an index structure for efficient search over embedding vectors
# https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt#using-faiss-for-efficient-similarity-search
docsearch = FAISS.from_texts(texts, embeddings)

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(), chain_type="stuff")

query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
# hand the retrieved chunks and the question to OpenAI to compose the answer
aa = chain.run(input_documents=docs, question=query)
print("---------")
# print(docs)
print(aa)

# Note: embeddings map the tokenized text into a vector space, where relatedness can be computed.

query = "What was the cost of training the GPT4all model?"
docs = docsearch.similarity_search(query)
aa = chain.run(input_documents=docs, question=query)
print(aa)
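
The splitter settings used above (chunk_size and chunk_overlap) are easier to reason about with a tiny example. The sketch below is a simplified sliding window over raw characters. It is an assumption-laden illustration, not LangChain's implementation: as far as I understand, CharacterTextSplitter first splits on the separator and then merges pieces up to the size limit, whereas the hypothetical `split_with_overlap` here ignores separators entirely.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.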