LangChain text vectorization and storage, top-K similarity retrieval: a survey of retrieval methods

Contents

Chroma retrieval

FAISS retrieval

Retrievers

Similarity

Maximal marginal relevance (MMR)

Similarity score threshold

Multi-query

Contextual compression

Hybrid retrieval

Post-retrieval context reordering

Parent document retriever

Self-query

Time-weighted retrieval

TF-IDF retrieval

KNN retrieval

RAG full-pipeline module



The source txt has one record per line; my file has 67 lines, with samples like:

field1\tvalue1\n

field2\tvalue2\n

...

Chroma retrieval

pip install langchain-chroma

Here I use a locally downloaded embedding model to vectorize the text and retrieve the top 3.

persist_directory specifies where the vectorized database is saved on disk.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain.vectorstores import Chroma


filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
# create the vector database
db = Chroma.from_documents(documents, embedding, persist_directory=r"./chroma/")
db.persist()  # make sure the embeddings are written to disk
'''
If the database already exists, load it directly instead:
db = Chroma(persist_directory="./chroma/", embedding_function=embedding)
'''

# search with raw text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"
docs = db.similarity_search(query, k=3)
# docs = db.similarity_search_with_score(query, k=3)  # with scores
print(docs[0].page_content)

# search with an embedding vector
embedding_vector = embedding.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector, k=3)
print(docs[0].page_content)

FAISS retrieval

pip install faiss-cpu

FAISS indexing feels a bit faster to me.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS


filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
# create the vector database
db = FAISS.from_documents(documents, embedding)
# save it
db.save_local("./faiss_index")
'''
If the index already exists, load it directly instead:
db = FAISS.load_local("./faiss_index", embedding, allow_dangerous_deserialization=True)
'''

# search with raw text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"
docs = db.similarity_search(query, k=3)
# docs = db.similarity_search_with_score(query, k=3)  # with scores
print(docs[0].page_content)

# search with an embedding vector
embedding_vector = embedding.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector, k=3)
print(docs[0].page_content)

Retrievers

Similarity

By default, as seen above, a vector store retriever uses similarity search.

Reusing the example above, with the FAISS vector database already built, we only change the retrieval code at the end.

Take the top 30:

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS


filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
# # create the vector database
# db = FAISS.from_documents(documents, embedding)
# # save it
# db.save_local("./faiss_index")

# the index already exists, so load it directly
db = FAISS.load_local("./faiss_index", embedding, allow_dangerous_deserialization=True)

# the query text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"
retriever = db.as_retriever(search_kwargs={'k': 30})  # build the retriever
docs = retriever.get_relevant_documents(query)
print(docs)

Maximal marginal relevance (MMR)

Plain similarity search may return duplicate entries; with MMR the retrieved results contain no duplicates.

retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': 30})  # build the retriever

You will find that although top 30 was requested, only 20 results come back.

fetch_k defaults to 20; it is the number of candidate documents fetched from the database, best understood as an internal knob of the MMR algorithm.

To actually get 30 results out, just set fetch_k to something larger than 30:

retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': 30, 'fetch_k': 50})  # build the retriever

Similarity score threshold

Return only the documents whose similarity score is above 0.5:

retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5})  # build the retriever

Multi-query

Retrieval based on vector distance can return different results after tiny wording changes, or when the embeddings fail to capture a query's semantics accurately.

MultiQueryRetriever uses an LLM to automatically generate multiple queries from different angles, effectively optimizing the prompt.

For a user query it generates several new queries expressing its different aspects (i.e. the LLM produces multiple phrasings of the query), runs retrieval for each phrasing, and takes the union of the results.

The advantage is that the generated queries are multi-angled and can cover the semantics and information needs more comprehensively.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
import os
from dotenv import load_dotenv
from langchain_community.llms import Tongyi

load_dotenv('key.env')  # load the env file
key = os.getenv('DASHSCOPE_API_KEY')  # read the environment variable
DASHSCOPE_API_KEY = os.environ["DASHSCOPE_API_KEY"]  # read the environment variable
model = Tongyi(temperature=1)

filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')

# the index already exists, so load it directly
db = FAISS.load_local("./faiss_index", embedding, allow_dangerous_deserialization=True)

# the query text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"

# MultiQueryRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(search_kwargs={'k': 8}), llm=model
)
unique_docs = retriever_from_llm.get_relevant_documents(query=query)

print(unique_docs)

Contextual compression

Instead of returning the retrieved documents as-is, contextual compression uses the context of the given query to compress the retrieval output so that only relevant information is returned.

In effect it extracts the core of each hit and simplifies each document, using the LLM's capabilities.

Here we take only the top 1; the compressed result comes back identical to the query, the very same sentence.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
import os
from dotenv import load_dotenv
from langchain_community.llms import Tongyi

load_dotenv('key.env')  # load the env file
key = os.getenv('DASHSCOPE_API_KEY')  # read the environment variable
DASHSCOPE_API_KEY = os.environ["DASHSCOPE_API_KEY"]  # read the environment variable
model = Tongyi(temperature=1)

filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')

# the index already exists, so load it directly
db = FAISS.load_local("./faiss_index", embedding, allow_dangerous_deserialization=True)

# the query text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"

# retrieval
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
retriever = db.as_retriever(search_kwargs={'k': 1})
compressor = LLMChainExtractor.from_llm(model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
unique_docs = compression_retriever.get_relevant_documents(query)

print(unique_docs)

Above I took only the top 1, but when printing the full result set I found duplicates; with the retrieval code below the results are deduplicated. Per the official docs:

LLMChainFilter uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without manipulating the document contents.

# retrieval
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
_filter = LLMChainFilter.from_llm(model)
retriever = db.as_retriever(search_kwargs={'k': 10})
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)
unique_docs = compression_retriever.get_relevant_documents(query)

print(unique_docs)

Making an additional LLM call per retrieved document is expensive and slow. EmbeddingsFilter offers a cheaper, faster option: it embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query.

In other words it avoids calling the LLM to judge which documents are relevant, using the embedding model instead.

# retrieval
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
retriever = db.as_retriever(search_kwargs={'k': 10})
embeddings_filter = EmbeddingsFilter(embeddings=embedding, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)
print(compressed_docs)

There is one more variant, which splits the documents into even smaller chunks before embedding them; here is a standalone version of the method that also appears in the Retriever class later in this post:

    def contextual_compression_by_embedding_split(cls, db, query, embedding_model, topk=5, similarity_threshold=0.76,
                                                  chunk_size=300, chunk_overlap=0, separator=". "):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        Contextual compression retriever backed by an embedding model; deduplicates results and splits documents into smaller pieces first.
        Uses the context of the given query to compress the retrieval output so that only relevant information is returned, instead of the documents as-is.
        Relies on embeddings for the similarity computation.
        :param db:
        :param query:
        :param embedding_model:
        :param topk: has no effect; the default of 4 documents applies
        :return:
        """
        retriever = db.as_retriever(search_kwargs={'k': topk})
        splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=separator)
        redundant_filter = EmbeddingsRedundantFilter(embeddings=embedding_model)
        relevant_filter = EmbeddingsFilter(embeddings=embedding_model, similarity_threshold=similarity_threshold)
        pipeline_compressor = DocumentCompressorPipeline(
            transformers=[splitter, redundant_filter, relevant_filter]
        )
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=pipeline_compressor, base_retriever=retriever
        )

        retriever_docs = compression_retriever.get_relevant_documents(query)
        return retriever_docs

Hybrid retrieval

By leveraging the strengths of different algorithms, EnsembleRetriever can achieve better performance than any single algorithm.

The most common pattern combines a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary; this is also known as "hybrid search". The sparse retriever is good at finding relevant documents by keywords, while the dense retriever is good at finding them by semantic similarity.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "You like apples",
    "You like oranges",
]

embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
docs = ensemble_retriever.invoke("apples")
print(docs)

Post-retrieval context reordering

Models tend to overlook relevant information buried in the middle of a long context ("lost in the middle"); LongContextReorder reorders the retrieved documents so the most relevant ones sit at the beginning and end of the list.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS

filepath = 'data/专业描述.txt'
raw_documents = TextLoader(filepath, encoding='utf8').load()

# split into chunks by line
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n",
    length_function=len,
    is_separator_regex=True,
)
documents = text_splitter.split_documents(raw_documents)
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')

# the index already exists, so load it directly
db = FAISS.load_local("./faiss_index", embedding, allow_dangerous_deserialization=True)

# the query text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"

# retrieval (no LLM needed for reordering)
from langchain_community.document_transformers import LongContextReorder
retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': 10, 'fetch_k': 50})  # build the retriever
docs = retriever.get_relevant_documents(query)
# reorder the retrieved documents
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

print(reordered_docs)

Parent document retriever

Large documents (for example several txt files) are split into small chunks.

The small chunks are embedded into the vector space for more accurate semantic retrieval, while the large chunks provide the more complete context.

Retrieval runs over the small chunks, and the corresponding parent documents are then returned by id.

from langchain.storage import InMemoryStore
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever


loaders = [
    TextLoader("data/专业描述.txt", encoding="utf-8"),
    TextLoader("data/专业描述_copy.txt", encoding="utf-8"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')

# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=embedding
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)

# there are two keys, since two documents were added
# print(list(store.yield_keys()))

# the query text
query = "材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属"

# retrieve the small chunks
sub_docs = vectorstore.similarity_search(query)
print(sub_docs[0].page_content)

# retrieve the full parent documents
retrieved_docs = retriever.get_relevant_documents(query)
print(retrieved_docs)

If the full documents are still too large, the parent documents can themselves be split first, as shown in the sketch below; reference:

Parent Document Retriever | 🦜️🔗 LangChain
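
A minimal sketch of that variant, reusing the loaders and embedding model from the code above; the chunk sizes here are illustrative, not tuned:

# also split the parent documents, into mid-sized chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(collection_name="split_parents", embedding_function=embedding)
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,  # retrieval now returns mid-sized parent chunks instead of whole files
)
retriever.add_documents(docs)
print(retriever.get_relevant_documents(query)[0].page_content)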

Self-query

A self-query retriever uses an LLM to generate a query that the underlying vector store can recognize and execute.

Given a natural-language query, the self-query retriever first has the LLM write a structured query, then converts that structured query into the underlying vector store's own query syntax, and finally applies it to the store to obtain the results.

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
import os
from dotenv import load_dotenv
from langchain_community.llms import Tongyi


# load the LLM
load_dotenv('key.env')  # load the env file
key = os.getenv('DASHSCOPE_API_KEY')  # read the environment variable
DASHSCOPE_API_KEY = os.environ["DASHSCOPE_API_KEY"]  # read the environment variable
model = Tongyi(temperature=1)

# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')

# sample data; pay attention to the metadata on each document
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
# use a vector store that supports this advanced retrieval: Chroma works here, FAISS does not
vectorstore = Chroma.from_documents(docs, embedding)

# define the structured fields the self-query can extract, down to attribute name, description, and type
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
# describe the documents' subject matter
document_content_description = "Brief summary of a movie"

# build the self-query retriever from the LLM, vector store, and structured-field descriptions prepared above
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True  # lets the retriever parse a result-count limit from the natural-language query
)

# filter on metadata only
res1 = retriever.invoke("I want to watch a movie rated higher than 8.5")

# filter on metadata and document contents
res2 = retriever.invoke("Has Greta Gerwig directed any movies about women")

# filter on several metadata fields
res3 = retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

# filter on several metadata fields and document contents
res4 = retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

print(res1, res2, res3, res4)

The code above comes from the official docs (Self-querying | 🦜️🔗 LangChain) but raised an error when I ran it. One common cause worth checking is the missing lark dependency (pip install lark), which the self-query query constructor needs.

Time-weighted retrieval

Time-weighted vector store retriever | 🦜️🔗 LangChain
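
For reference, a minimal sketch, assuming bge-small-zh-v1.5 outputs 512-dimensional vectors; the retriever scores each document as semantic_similarity + (1.0 - decay_rate) ** hours_passed, where hours_passed counts from the document's last access:

import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
index = faiss.IndexFlatL2(512)  # assumes a 512-dim embedding model
vectorstore = FAISS(embedding.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=0.5, k=1)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="材料科学与工程", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="计算机科学与技术")])
# with a high decay rate, the stale document's score decays and the fresh one wins
print(retriever.get_relevant_documents("工程"))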

TF-IDF retrieval

from langchain.retrievers import TFIDFRetriever  # requires scikit-learn: pip install scikit-learn
with open('data/专业描述.txt', encoding='utf8') as f:
    lst = f.readlines()
retriever = TFIDFRetriever.from_texts(lst)
result = retriever.get_relevant_documents("材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属")
print(result)

KNN retrieval

from langchain.retrievers import KNNRetriever
from langchain.embeddings import HuggingFaceEmbeddings
with open('data/专业描述.txt', encoding='utf8') as f:
    lst = f.readlines()
# load the local embedding model
embedding = HuggingFaceEmbeddings(model_name='bge-small-zh-v1.5')
retriever = KNNRetriever.from_texts(lst, embedding)
result = retriever.get_relevant_documents("材料科学与工程是一门研究材料的组成、性质、制备、加工及应用的多学科交叉领域。它涵盖了金属、无机非金属")
print(result)

RAG full-pipeline module

Load data → split → vectorize → retrieve

from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveJsonSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.vectorstores import Chroma
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_community.document_transformers import LongContextReorder
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.retrievers import KNNRetriever
from langchain.retrievers import TFIDFRetriever
from pathlib import Path
import json
import os


class DocsLoader:

    @classmethod
    def txt_loader(cls, filepath):
        """
        Load txt data.
        :param filepath:
        :return:
        """
        loader = TextLoader(filepath, encoding='utf8')
        docs = loader.load()
        return docs

    @classmethod
    def csv_loader(cls, filepath):
        """
        https://python.langchain.com/docs/modules/data_connection/document_loaders/csv/
        Parameter reference: https://blog.csdn.net/zjkpy_5/article/details/137727850?spm=1001.2014.3001.5501
        Load csv data.
        :param filepath:
        :return:
        """
        loader = CSVLoader(file_path=filepath, encoding='utf8')
        docs = loader.load()
        return docs

    @classmethod
    def json_loader(cls, filepath):
        """
        https://python.langchain.com/docs/modules/data_connection/document_loaders/json/
        The official jq approach does not work on Windows.
        Load json data.
        :param filepath:
        :return:
        """
        docs = json.loads(Path(filepath).read_text(encoding='utf8'))
        return docs


class TextSpliter:

    @classmethod
    def text_split_by_char(cls, docs, separator='\n', chunk_size=100, chunk_overlap=20, length_function=len,
            is_separator_regex=False):
        """
        https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter/
        Split on a single character given by separator; when the separator takes effect, chunk_size is effectively ignored.
        :param docs: the document; must be str, so documents loaded by langchain need converting first
        :param separator: the character to split on
        :param chunk_size: chunk size
        :param chunk_overlap: allowed character overlap between chunks
        :param length_function:
        :param is_separator_regex:
        :return:
        """
        text_splitter = CharacterTextSplitter(
            separator=separator,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=length_function,
            is_separator_regex=is_separator_regex,
        )
        docs = docs[0].page_content  # convert the langchain-loaded txt to str
        text_split = text_splitter.create_documents([docs])
        return text_split

    @classmethod
    def text_split_by_manychar_or_charnum(cls, docs, separator=["\n\n", "\n", " ", ""], chunk_size=100, chunk_overlap=20,
                               length_function=len, is_separator_regex=True):
        """
        https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/
        Splits every chunk_size characters; separator need not be passed, the default is fine.
        Splits on multiple characters: any character in the separator list triggers a split.
        :param docs: the document; must be str, so documents loaded by langchain need converting first
        :param separator: characters to split on, by default the list ["\n\n", "\n", " ", ""]
        :param chunk_size: chunk size
        :param chunk_overlap: allowed character overlap between chunks
        :param length_function:
        :param is_separator_regex:
        :return:
        """
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,  # chunk size
            chunk_overlap=chunk_overlap,  # number of characters chunks may overlap
            length_function=length_function,
            is_separator_regex=is_separator_regex,
            separators=separator  # which characters to split on; if unspecified, splits every chunk_size +- chunk_overlap (100 +- 20) characters
        )
        docs = docs[0].page_content  # convert the langchain-loaded txt to str
        split_text = text_splitter.create_documents([docs])
        return split_text

    @classmethod
    def json_split(cls, json_data, min_chunk_size=50, max_chunk_size=300):
        """
        https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_json_splitter/
        Split json; each chunk is a complete dict.
        :param json_data:
        :param min_chunk_size:
        :param max_chunk_size:
        :return:
        """
        splitter = RecursiveJsonSplitter(min_chunk_size=min_chunk_size, max_chunk_size=max_chunk_size)
        json_chunks = splitter.split_json(json_data=json_data)
        return json_chunks


class EmbeddingVectorDB:

    @classmethod
    def load_local_embedding_model(cls, embedding_model_path, device='cpu'):
        """Load a local embedding model."""
        embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_path, model_kwargs={'device': device})
        return embedding_model

    @classmethod
    def faiss_vector_db(cls, split_docs, vector_db_path, embedding_model):
        """
        https://python.langchain.com/docs/modules/data_connection/vectorstores/
        Create a FAISS vector database.
        :param split_docs: the split text chunks
        :param vector_db_path: where the vector database is stored
        :param embedding_model: the embedding model
        :return:
        """
        if os.path.exists(vector_db_path):
            print('Loading vector database from =>', vector_db_path)
            db = FAISS.load_local(vector_db_path, embedding_model, allow_dangerous_deserialization=True)
        else:
            print('Creating vector database at =>', vector_db_path)
            db = FAISS.from_documents(split_docs, embedding_model)
            db.save_local(vector_db_path)
        return db


    @classmethod
    def chroma_vector_db(cls, split_docs, vector_db_path, embedding_model):
        """
        https://python.langchain.com/docs/modules/data_connection/vectorstores/
        Create a Chroma vector database.
        :param split_docs: the split text chunks
        :param vector_db_path: where the vector database is stored
        :param embedding_model: the embedding model
        :return:
        """
        if os.path.exists(vector_db_path):
            print('Loading vector database from =>', vector_db_path)
            db = Chroma(persist_directory=vector_db_path, embedding_function=embedding_model)
        else:
            print('Creating vector database at =>', vector_db_path)
            db = Chroma.from_documents(split_docs, embedding_model, persist_directory=vector_db_path)
            db.persist()
        return db


class Retriever:

    @classmethod
    def similarity(cls, db, query, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Similarity search without scores; returns all the most similar chunks, so if the corpus contains duplicates, duplicates come back.
        :param db:
        :param query:
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_kwargs={'k': topk})
        retriever_docs = retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def similarity_with_score(cls, db, query, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/integrations/vectorstores/usearch/#similarity-search-with-score
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        With scores; the distance score is L2 distance, so lower is better.
        :param db:
        :param query:
        :param long_context: long-context reordering
        :return:
        """
        retriever_docs = db.similarity_search_with_score(query, k=topk)
        if long_context:
            # LongContextReorder expects Document objects, so drop the scores before reordering
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents([doc for doc, _ in retriever_docs])
        return retriever_docs

    @classmethod
    def mmr(cls, db, query, topk=5, fetch_k=50, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        The MMR algorithm deduplicates while returning the most similar chunks.
        :param db:
        :param query:
        :param topk: how many of the most similar documents to return, never more than fetch_k
        :param fetch_k: maximum number of candidate documents handed to MMR
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': topk, 'fetch_k': fetch_k})
        retriever_docs = retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def similarity_score_threshold(cls, db, query, topk=5, score_threshold=0.8, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Filter by similarity score.
        :param db:
        :param query:
        :param topk:
        :param score_threshold: the similarity score threshold
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_type="similarity_score_threshold",
                                    search_kwargs={'k': topk, "score_threshold": score_threshold})
        retriever_docs = retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def multi_query_retriever(cls, db, query, model, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Multi-query retriever.
        Vector-distance retrieval can return different results after tiny wording changes, or when embeddings fail to capture the query's semantics.
        Uses an LLM to automatically generate multiple queries from different angles, effectively optimizing the prompt.
        Generates multiple rephrasings of the user query, retrieves for each, and takes the union of the results.
        The generated queries cover the semantics and information needs from multiple angles, more comprehensively.
        Setting topk does not seem to take effect; I do not know why.
        :param db:
        :param query:
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_kwargs={'k': topk})
        retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=model)
        retriever_docs = retriever.get_relevant_documents(query=query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def contextual_compression_by_llm(cls, db, query, model, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Contextual compression retriever backed by an LLM; deduplicates results.
        Uses the context of the given query to compress the retrieval output so that only relevant information is returned, instead of the documents as-is.
        In effect it extracts the core of each hit and simplifies each document, using the LLM's capabilities.
        Setting topk does not seem to take effect; I do not know why.
        :param db:
        :param query:
        :param model:
        :param topk:
        :param long_context: long-context reordering
        :return:
        """
        _filter = LLMChainFilter.from_llm(model)
        retriever = db.as_retriever(search_kwargs={'k': topk})
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=_filter, base_retriever=retriever
        )
        retriever_docs = compression_retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def contextual_compression_by_embedding(cls, db, query, embedding_model, topk=5, similarity_threshold=0.76,
                                            long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Contextual compression retriever backed by an embedding model; deduplicates results.
        Uses the context of the given query to compress the retrieval output so that only relevant information is returned, instead of the documents as-is.
        Relies on embeddings for the similarity computation.
        :param db:
        :param query:
        :param embedding_model:
        :param topk:
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_kwargs={'k': topk})
        embeddings_filter = EmbeddingsFilter(embeddings=embedding_model, similarity_threshold=similarity_threshold)
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=embeddings_filter, base_retriever=retriever
        )
        retriever_docs = compression_retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def contextual_compression_by_embedding_split(cls, db, query, embedding_model, topk=5, similarity_threshold=0.76,
                                                  chunk_size=100, chunk_overlap=0, separator=". ", long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Contextual compression retriever backed by an embedding model; deduplicates results and splits documents into smaller pieces first.
        Uses the context of the given query to compress the retrieval output so that only relevant information is returned, instead of the documents as-is.
        Relies on embeddings for the similarity computation.
        :param db:
        :param query:
        :param embedding_model:
        :param topk: has no effect; the default of 4 documents applies
        :param long_context: long-context reordering
        :return:
        """
        retriever = db.as_retriever(search_kwargs={'k': topk})
        splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=separator)
        redundant_filter = EmbeddingsRedundantFilter(embeddings=embedding_model)
        relevant_filter = EmbeddingsFilter(embeddings=embedding_model, similarity_threshold=similarity_threshold)
        pipeline_compressor = DocumentCompressorPipeline(
            transformers=[splitter, redundant_filter, relevant_filter]
        )
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=pipeline_compressor, base_retriever=retriever
        )
        retriever_docs = compression_retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs


    @classmethod
    def ensemble(cls, query, text_split_docs, embedding_model, bm25_topk=5, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Hybrid retrieval.
        The most common pattern combines a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary; this is also known as "hybrid search".
        The sparse retriever is good at finding relevant documents by keywords, the dense retriever by semantic similarity.
        :param query:
        :param text_split_docs: documents as split by langchain
        :param long_context: long-context reordering
        :param bm25_topk: topk for bm25
        :param topk: topk for similarity search
        :return: the union of both result sets; may contain fewer than bm25_topk + topk documents
        """
        text_split_docs = [text.page_content for text in text_split_docs]
        bm25_retriever = BM25Retriever.from_texts(
            text_split_docs, metadatas=[{"source": 1}] * len(text_split_docs)
        )
        bm25_retriever.k = bm25_topk

        faiss_vectorstore = FAISS.from_texts(
            text_split_docs, embedding_model, metadatas=[{"source": 2}] * len(text_split_docs)
        )
        faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": topk})

        ensemble_retriever = EnsembleRetriever(
            retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
        )
        retriever_docs = ensemble_retriever.invoke(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def bm25(cls, query, text_split_docs, topk=5, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        A sparse retriever, good at finding relevant documents by keywords.
        :param query:
        :param text_split_docs: documents as split by langchain
        :param topk:
        :param long_context: long-context reordering
        """
        text_split_docs = [text.page_content for text in text_split_docs]
        bm25_retriever = BM25Retriever.from_texts(
            text_split_docs, metadatas=[{"source": 1}] * len(text_split_docs)
        )
        bm25_retriever.k = topk
        retriever_docs = bm25_retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def parent_document_retriever(cls, docs, query, embedding_model):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever/
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        Parent document retrieval; only works with the Chroma database, FAISS is not supported.
        Suited to loading multiple documents and retrieving both the matching small text chunks and the corresponding full txt files.
        After retrieving the full txt this way, a finer-grained method can be run over its contents.
        :param docs: example
            loaders = [
                        TextLoader("data/专业描述.txt", encoding="utf-8"),
                        TextLoader("data/专业描述_copy.txt", encoding="utf-8"),
                    ]
            docs = []
            for loader in loaders:
                docs.extend(loader.load())
        :return:
        """
        child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
        vectorstore = Chroma(
            collection_name="full_documents", embedding_function=embedding_model
        )
        store = InMemoryStore()
        retriever = ParentDocumentRetriever(
            vectorstore=vectorstore,
            docstore=store,
            child_splitter=child_splitter,
        )

        retriever.add_documents(docs, ids=None)
        sub_docs = vectorstore.similarity_search(query)
        parent_docs = retriever.get_relevant_documents(query)

        return sub_docs, parent_docs

    @classmethod
    def tfidf(cls, query, docs_lst, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        TF-IDF keyword retrieval.
        :param query:
        :param docs_lst: ['xxx', 'dsfsdg'.....]
        :param long_context: long-context reordering
        :return:
        """
        retriever = TFIDFRetriever.from_texts(docs_lst)
        retriever_docs = retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs

    @classmethod
    def knn(cls, query, docs_lst, embedding_model, long_context=False):
        """
        https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder/
        KNN retrieval.
        :param query:
        :param docs_lst: ['xxx', 'dsfsdg'.....]
        :param long_context:
        :return:
        """
        retriever = KNNRetriever.from_texts(docs_lst, embedding_model)
        retriever_docs = retriever.get_relevant_documents(query)
        if long_context:
            reordering = LongContextReorder()
            retriever_docs = reordering.transform_documents(retriever_docs)
        return retriever_docs
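
A minimal usage sketch wiring these classes together; the data path, index path, and model name are the same placeholders used throughout this post:

if __name__ == '__main__':
    # load -> split -> vectorize -> retrieve
    docs = DocsLoader.txt_loader('data/专业描述.txt')
    split_docs = TextSpliter.text_split_by_char(docs, separator='\n')
    embedding_model = EmbeddingVectorDB.load_local_embedding_model('bge-small-zh-v1.5')
    db = EmbeddingVectorDB.faiss_vector_db(split_docs, './faiss_index', embedding_model)
    query = "材料科学与工程"
    print(Retriever.similarity(db, query, topk=5, long_context=True))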
