BigDL-LLM combined with LangChain enables retrieval-based question answering over documents, which also opens the door to later integration with knowledge graphs.
1.1 Load the document
input_doc = """
BigDL: fast, distributed, secure AI for Big Data

BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:

Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of Tensorflow & PyTorch Programs on XPU
DLlib: "Equivalent of Spark MLlib" for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
LLM: A library for running large language models with very low latency using low-precision techniques on Intel platforms
"""
Alternatively, the document can be loaded from a file:

from langchain.document_loaders import TextLoader

loader = TextLoader("../../state_of_the_union.txt")
input_doc = loader.load()

Note that loader.load() returns a list of Document objects rather than a string; in that case use text_splitter.split_documents() instead of split_text() in the next step.
1.2 Split the text of the input document
Text splitters break a document into pieces of a specified size. Here, we split the document into chunks to be used for embedding and vector storage.
Note: CharacterTextSplitter only splits on the separator (by default '\n\n'). chunk_size is the maximum number of characters per chunk, provided the text can be split that finely. chunk_overlap is the number of characters that overlap between consecutive chunks.
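The effect of chunk_size and chunk_overlap can be illustrated with a simplified sliding-window splitter. This is a toy sketch only, not LangChain's actual implementation (which splits on the separator first and then merges pieces):

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters,
    where each window shares chunk_overlap characters with the
    previous one (simplified illustration only)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk starts chunk_size − chunk_overlap characters after the previous one, so adjacent chunks share exactly chunk_overlap characters.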
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=650, chunk_overlap=0)
texts = text_splitter.split_text(input_doc)
1.3 Create embeddings and store them in a vector store
After splitting the document, we need to store the splits so they can later be searched against an input query. The most common approach is to embed the contents of each split and store the resulting embedding vectors in a vector store.
In Transformers, embedding layers convert unstructured data into embedding vectors (arrays of numbers) on which various operations can be performed. Embedding vectors represent real-world objects and concepts such as words, documents, and so on.
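The idea of a vector store can be sketched in a few lines of plain Python. This is a toy stand-in, not how FAISS or Chroma actually work internally: it keeps (vector, text) pairs and ranks them by squared L2 distance to a query vector, the metric used by FAISS's default flat index.

```python
class ToyVectorStore:
    """Minimal in-memory vector store sketch (illustration only)."""

    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def search(self, query_vec, k=1):
        """Return the k stored texts nearest to query_vec (squared L2)."""
        def dist(vec):
            return sum((x - y) ** 2 for x, y in zip(vec, query_vec))
        ranked = sorted(self.entries, key=lambda entry: dist(entry[0]))
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "about BigDL")
store.add([0.0, 1.0], "about cooking")
print(store.search([0.9, 0.1], k=1))  # ['about BigDL']
```

A real vector store does the same thing at scale, with approximate nearest-neighbor indexes instead of a full sort.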
BigDL-LLM provides TransformersEmbeddings, which lets you obtain embeddings from text input using an LLM. TransformersEmbeddings is instantiated in the same way as TransformersLLM:
from bigdl.llm.langchain.embeddings import TransformersEmbeddings
embeddings = TransformersEmbeddings.from_model_id(model_id="lmsys/vicuna-7b-v1.5")
With TransformersEmbeddings in place, let's create the embeddings and store them in a vector store. A vector store is responsible for storing the embedded data and performing vector search. Here we use FAISS as an example, a library for efficient similarity search and clustering of dense vectors.
from langchain.vectorstores import FAISS
docsearch = FAISS.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
Following the LangChain tutorial, you can also use Chroma to build the docsearch object:

from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# llm is the TransformersLLM instance created earlier
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())

If texts contains plain strings (as produced by split_text above), use Chroma.from_texts instead of Chroma.from_documents.
1.4 Search documents
As mentioned above, embedding vectors can represent both queries and documents. This representation turns semantic similarity into proximity in vector space, so we can search documents through that similarity.
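"Proximity in vector space" can be made concrete with cosine similarity, one common metric for comparing embeddings (FAISS's default flat index uses L2 distance instead). The vectors below are made-up toy values, not real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the query should land closer to the relevant document.
query     = [1.0, 0.9, 0.1]
doc_close = [0.9, 1.0, 0.0]
doc_far   = [-0.5, 0.1, 1.0]
print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))  # True
```

A retriever ranks all stored documents by such a score and returns the top matches.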
query = "What is BigDL?"
docs = docsearch.get_relevant_documents(query)
print("-"*20+"number of relevant documents"+"-"*20)
print(len(docs))
--------------------number of relevant documents--------------------
2
1.5 Prepare the chain
from langchain.chains.chat_vector_db.prompts import QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
doc_chain = load_qa_chain(
llm, chain_type="stuff", prompt=QA_PROMPT
)
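The chain_type="stuff" strategy simply "stuffs" all retrieved documents into a single prompt and makes one LLM call. The sketch below illustrates the idea; the template is hypothetical, only modeled on the general shape of a QA prompt, not the actual contents of QA_PROMPT:

```python
# Hypothetical template, illustrating the shape of a QA prompt.
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(doc_texts, question):
    """Concatenate all retrieved document texts into one prompt."""
    context = "\n\n".join(doc_texts)
    return QA_TEMPLATE.format(context=context, question=question)

prompt = build_stuff_prompt(
    ["BigDL seamlessly scales your data analytics & AI applications."],
    "What is BigDL?",
)
print(prompt)
```

Because everything goes into one prompt, "stuff" only works when the combined retrieved text fits in the model's context window; other chain types (e.g. map_reduce) exist for larger inputs.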
1.6 Generate
result = doc_chain.run(input_documents=docs, question=query)
BigDL is a fast, distributed, and secure AI library for Big Data. It enables seamless scaling of data analytics and AI applications from laptops to the cloud. BigDL supports various libraries, including Orca, Nano, DLlib, Chronos, Friesian, PPML, and LLM. These libraries cater to different use cases, such as distributed Big Data processing, transparent acceleration of TensorFlow and PyTorch programs, scalable time series analysis, end-to-end recommendation systems, and secure Big Data and AI with SGX hardware security. BigDL aims to provide a unified platform for AI and data analytics, making it easier for developers to build and deploy their applications at scale.