Streamlit's st.file_uploader widget lets users upload PDF files interactively, selecting a PDF document in the browser so that the Streamlit web application can process or analyze it further.
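The object that st.file_uploader returns behaves like an ordinary binary file. The sketch below shows the guard-then-read pattern, using io.BytesIO as a stand-in for Streamlit's UploadedFile; the handle_upload helper is illustrative, not part of the app:

```python
import io

# io.BytesIO stands in for the UploadedFile that st.file_uploader returns.
uploaded = io.BytesIO(b"%PDF-1.4 fake content")

def handle_upload(pdf):
    # Streamlit keeps the widget's value at None until a file is chosen,
    # so always guard before reading.
    if pdf is None:
        return None
    return pdf.read()

data = handle_upload(uploaded)
print(data is not None)             # a file was "uploaded"
print(handle_upload(None) is None)  # nothing selected yet
```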
We start with the following code:

app.py
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from streamlit_extras.add_vertical_space import add_vertical_space
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from elasticsearch import Elasticsearch, helpers
from langchain_community.vectorstores import ElasticsearchStore
from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms import OpenAI
from langchain_community.callbacks import get_openai_callback
import os
# Sidebar contents
with st.sidebar:
    st.title('💬 PDF Summarizer and Q/A App')
    st.markdown('''
    ## About this application
    You can build your own customized LLM-powered chatbot using:
    - [Streamlit]( )
    - [LangChain]( )
    - [OpenAI]( ) LLM model
    ''')
    add_vertical_space(2)
    st.write('Why drown in papers when your chat buddy can give you the highlights and summary? Happy Reading.')
    add_vertical_space(2)
def main():
    load_dotenv()
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ES_USER = os.getenv("ES_USER")
    ES_PASSWORD = os.getenv("ES_PASSWORD")
    ES_ENDPOINT = os.getenv("ES_ENDPOINT")
    elastic_index_name = 'pdf_docs'

    # Main Content
    st.header("Ask About Your PDF 🤷‍♀️💬")

    # upload file
    pdf = st.file_uploader("Upload your PDF File and Ask Questions", type="pdf")

if __name__ == '__main__':
    main()
We run the code with the following command:

streamlit run app.py
Extracting the text and writing it to Elasticsearch
app.py
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from streamlit_extras.add_vertical_space import add_vertical_space
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from elasticsearch import Elasticsearch, helpers
from langchain_community.vectorstores import ElasticsearchStore
from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms import OpenAI
from langchain_community.callbacks import get_openai_callback
import os
# Sidebar contents
with st.sidebar:
st.title('💬PDF Summarizer and Q/A App')
st.markdown('''
## About this application
You can built your own customized LLM-powered chatbot using:
- [Streamlit]( )
- [LangChain]( )
- [OpenAI]( ) LLM model
''')
add_vertical_space(2)
st.write(' Why drown in papers when your chat buddy can give you the highlights and summary? Happy Reading. ')
add_vertical_space(2)
def main():
    load_dotenv()
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ES_USER = os.getenv("ES_USER")
    ES_PASSWORD = os.getenv("ES_PASSWORD")
    ES_ENDPOINT = os.getenv("ES_ENDPOINT")
    elastic_index_name = 'pdf_docs'

    # Main Content
    st.header("Ask About Your PDF 🤷‍♀️💬")

    # upload file
    pdf = st.file_uploader("Upload your PDF File and Ask Questions", type="pdf")

    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        # Make a connection to Elasticsearch
        url = f"https://{ES_USER}:{ES_PASSWORD}@{ES_ENDPOINT}:9200"
        connection = Elasticsearch(
            hosts=[url],
            ca_certs="./http_ca.crt",
            verify_certs=True
        )
        print(connection.info())

        # create embeddings
        embeddings = OpenAIEmbeddings()

        if not connection.indices.exists(index=elastic_index_name):
            print("The index does not exist, going to generate embeddings")
            docsearch = ElasticsearchStore.from_texts(
                chunks,
                embedding=embeddings,
                es_url=url,
                es_connection=connection,
                index_name=elastic_index_name,
                es_user=ES_USER,
                es_password=ES_PASSWORD
            )
        else:
            print("The index already exists")
            docsearch = ElasticsearchStore(
                es_connection=connection,
                embedding=embeddings,
                es_url=url,
                index_name=elastic_index_name,
                es_user=ES_USER,
                es_password=ES_PASSWORD
            )

if __name__ == '__main__':
    main()
We extract the text of the PDF file with the following section:
if pdf is not None:
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
It first checks whether a PDF file has been uploaded (that is, whether the variable pdf is not None). If a PDF file was indeed uploaded, a PdfReader object is created to read its contents. The code then iterates over every page of the PDF document, extracts the text of each page with the extract_text() method, and concatenates the text of all pages into a single string variable named text.
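The loop can be exercised without a real PDF by standing in for PyPDF2's reader. FakePage and FakeReader below are illustrative stand-ins that only mimic the .pages / extract_text() interface:

```python
# Minimal stand-ins for PyPDF2's page and reader objects (sketch only).
class FakePage:
    def __init__(self, content):
        self._content = content

    def extract_text(self):
        return self._content

class FakeReader:
    def __init__(self, pages):
        self.pages = [FakePage(p) for p in pages]

# Same concatenation loop as in the app, over the fake reader.
pdf_reader = FakeReader(["Page one text. ", "Page two text."])
text = ""
for page in pdf_reader.pages:
    text += page.extract_text()

print(text)  # "Page one text. Page two text."
```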
We split the document into chunks with the following code:
# split into chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_text(text)
The chunks variable holds the segmented portions of the text extracted from the PDF file. Splitting the text into chunks is essential because it makes large documents easier to handle: processing the entire text at once can consume excessive memory and compute. By dividing the text into smaller pieces, the application can manage and analyze the data more efficiently, and segmentation also enables targeted analysis or processing of specific parts of the document.
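The effect of chunk_size and chunk_overlap can be sketched with a naive fixed-size splitter. This is a simplification: LangChain's CharacterTextSplitter additionally splits on the separator and merges pieces back up to the size limit.

```python
def split_text(text, chunk_size=10, chunk_overlap=3):
    """Naive fixed-size splitter illustrating chunk_size/chunk_overlap.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghijklmnopqrst", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrst']
```

The overlap preserves context across chunk boundaries, so a sentence cut in half at the end of one chunk is still seen intact at the start of the next.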
We generate the embeddings with the following code:
ElasticsearchStore.from_texts(
    chunks,
    embedding=embeddings,
    es_url=url,
    es_connection=connection,
    index_name=elastic_index_name,
    es_user=ES_USER,
    es_password=ES_PASSWORD
)
Embeddings are numeric representations of objects, typically used to capture their semantics or context in a mathematical space. Word embeddings are vector representations of words, phrases, or documents in a high-dimensional space in which similar items lie closer together. In this code, the embeddings are created with OpenAIEmbeddings(), which generates embeddings (vector representations) for the text extracted from the PDF file. These embeddings capture semantic information about the text, allowing the application to understand and process the content more effectively.
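The "closer together" idea can be illustrated with toy vectors: similarity between embeddings is typically measured by cosine similarity, so semantically related texts score higher. The 3-dimensional vectors below are made up for illustration; real OpenAI embeddings have 1536 dimensions, but the geometry is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up toy "embeddings" for three words.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.2, 0.95]

# Related words end up closer in the embedding space.
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```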
Then the knowledge base is built with ElasticsearchStore.from_texts(), which creates a searchable index from the previously segmented text chunks. This knowledge base, implemented with the Elasticsearch library, enables efficient similarity search and retrieval based on the embeddings of the text fragments.
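What the vector store does at query time can be sketched as a brute-force nearest-neighbor search over precomputed vectors. The index contents and vectors below are made up for illustration; Elasticsearch performs this kind of search efficiently at scale.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# A toy index: chunk text -> precomputed vector (stand-ins for real embeddings).
index = {
    "Elasticsearch stores JSON documents.": [0.9, 0.1, 0.2],
    "Streamlit builds data apps in Python.": [0.1, 0.9, 0.3],
    "Embeddings map text to vectors.": [0.2, 0.3, 0.9],
}

def similarity_search(query_vector, k=1):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vector, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector close to the Elasticsearch chunk's vector:
print(similarity_search([0.88, 0.15, 0.25], k=1))
```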
After the script runs successfully, we can inspect the index in Elasticsearch:
Connecting to the OpenAI LLM
llm = OpenAI()
chain = load_qa_chain(llm, chain_type="stuff")

with get_openai_callback() as cb:
    response = chain.run(input_documents=docs, question=user_question)
    print(cb)

st.write(response)
This code initializes and uses an OpenAI language model (LLM) to build the question-answering (Q&A) part of the app. It loads a QA chain backed by the LLM, sets up callback handling to track events (such as token usage) during question answering, runs the chain over the input documents and the user's question, and displays the generated response with Streamlit for user interaction.
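The "stuff" chain type gets its name from stuffing all retrieved chunks into a single prompt. A rough sketch of the idea (not LangChain's actual prompt template):

```python
def build_stuff_prompt(docs, question):
    """Mimic a 'stuff' QA chain: concatenate every retrieved chunk into one
    context block and append the question. Sketch only."""
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Chunk one about the PDF.", "Chunk two about the PDF."]
prompt = build_stuff_prompt(docs, "What is the PDF about?")
print(prompt)
```

Because everything is placed in one prompt, "stuff" is simple and keeps full context, but it only works while the retrieved chunks fit within the model's context window; other chain types (e.g. map_reduce) exist for larger inputs.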
The final, complete app.py is as follows:

app.py
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from streamlit_extras.add_vertical_space import add_vertical_space
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from elasticsearch import Elasticsearch, helpers
from langchain_community.vectorstores import ElasticsearchStore
from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms import OpenAI
from langchain_community.callbacks import get_openai_callback
import os
# Sidebar contents
with st.sidebar:
    st.title('💬 PDF Summarizer and Q/A App')
    st.markdown('''
    ## About this application
    You can build your own customized LLM-powered chatbot using:
    - [Streamlit]( )
    - [LangChain]( )
    - [OpenAI]( ) LLM model
    ''')
    add_vertical_space(2)
    st.write('Why drown in papers when your chat buddy can give you the highlights and summary? Happy Reading.')
    add_vertical_space(2)
def main():
    load_dotenv()
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ES_USER = os.getenv("ES_USER")
    ES_PASSWORD = os.getenv("ES_PASSWORD")
    ES_ENDPOINT = os.getenv("ES_ENDPOINT")
    elastic_index_name = 'pdf_docs'

    # Main Content
    st.header("Ask About Your PDF 🤷‍♀️💬")

    # upload file
    pdf = st.file_uploader("Upload your PDF File and Ask Questions", type="pdf")

    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        # Make a connection to Elasticsearch
        url = f"https://{ES_USER}:{ES_PASSWORD}@{ES_ENDPOINT}:9200"
        connection = Elasticsearch(
            hosts=[url],
            ca_certs="./http_ca.crt",
            verify_certs=True
        )
        print(connection.info())

        # create embeddings
        embeddings = OpenAIEmbeddings()

        if not connection.indices.exists(index=elastic_index_name):
            print("The index does not exist, going to generate embeddings")
            docsearch = ElasticsearchStore.from_texts(
                chunks,
                embedding=embeddings,
                es_url=url,
                es_connection=connection,
                index_name=elastic_index_name,
                es_user=ES_USER,
                es_password=ES_PASSWORD
            )
        else:
            print("The index already exists")
            docsearch = ElasticsearchStore(
                es_connection=connection,
                embedding=embeddings,
                es_url=url,
                index_name=elastic_index_name,
                es_user=ES_USER,
                es_password=ES_PASSWORD
            )

        # The original listing is truncated at this point; the ending below is
        # reconstructed from the Q&A snippet earlier in the article: take the
        # user's question, retrieve the most similar chunks, and answer.
        user_question = st.text_input("Ask a question about your PDF:")
        if user_question:
            docs = docsearch.similarity_search(user_question)

            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=user_question)
                print(cb)

            st.write(response)

if __name__ == '__main__':
    main()