AI: Vector Storage and Retrieval

小瓶盖的猪猪侠

Published 2024-05-23 14:50:29

Category: AI. Tags: artificial intelligence, python, algorithms

Article link: https://blog.csdn.net/qq_29983883/article/details/139135559

step1 Document

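LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

page_content: a string holding the content;
metadata: a dict holding arbitrary metadata.

The metadata attribute can capture the source of the document, its relationship to other documents, and other information. A single Document object often represents a chunk of a larger document.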

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
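
Because a single Document usually represents one piece of a larger document, a loader typically splits the source text and attaches the same source metadata to every piece. The sketch below illustrates the idea with the standard library only. `Chunk` and `split_into_chunks` are hypothetical stand-ins, not LangChain classes, and the fixed-size character split is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in that mirrors Document's two attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_chunks(text: str, source: str, size: int = 40) -> list[Chunk]:
    """Cut text into fixed-size pieces, each tagged with its source and offset."""
    return [
        Chunk(
            page_content=text[start:start + size],
            metadata={"source": source, "start": start},
        )
        for start in range(0, len(text), size)
    ]


text = (
    "Dogs are great companions, known for their loyalty and friendliness. "
    "Rabbits are social animals that need plenty of space to hop around."
)
chunks = split_into_chunks(text, source="mammal-pets-doc")
for chunk in chunks:
    print(chunk.metadata, repr(chunk.page_content))
```

Every chunk keeps the same source value, so a search hit can be traced back to the document it came from.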

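step2 Vector retrieval

Vector search is a common way to store and retrieve unstructured data. The idea is to store the embedding vectors of the texts; given a query, we encode it into a vector of the same dimension and then use a similarity metric to find the related data.
LangChain VectorStore objects contain methods for adding text and Document objects to the store and for querying them with various similarity metrics. They are usually initialized with an embedding model, which determines how text data is converted into numeric vectors.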
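
The store, encode, compare idea can be sketched without any library. The toy `embed` function here (word counts over a small fixed vocabulary) is an assumption for illustration only; a real embedding model produces dense vectors that capture meaning rather than word overlap. All names in this sketch are hypothetical.

```python
import math

VOCAB = ["cat", "dog", "fish", "pet", "space", "loyal"]  # toy vocabulary (assumption)


def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word occurs as a substring."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


texts = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]
# "Store": keep each text next to its vector.
store = [(t, embed(t)) for t in texts]


def search(query: str, k: int = 1) -> list[str]:
    """Encode the query with the same function, then rank stored texts by similarity."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]


print(search("cat"))
```

The key point is that the query and the stored texts go through the same embedding function, so their vectors live in the same space and can be compared.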

Below I use the bce-embedding model as the embedding model (download address).

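from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Local path (or model id) of the downloaded bce-embedding model.
# Placeholder: the value is not given in this post, so set your own.
EMBEDDING_PATH = "path/to/bce-embedding-model"

# init embedding model
model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_PATH,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

# Embed the documents and store them in an in-memory Chroma collection
vetorstore = Chroma.from_documents(
    documents,
    embedding=embed_model,
)

vetorstore.similarity_search("cat")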

Output:

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Search with similarity scores

vetorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
0.9107884),
(Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
1.3231826),
(Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
1.4060305),
(Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
1.4284585),
(Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
1.4566814)]

The score returned above is a distance: the smaller the score, the closer the match.
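
One way to read these numbers, assuming the store ranks by squared Euclidean (L2) distance, which as far as I know is Chroma's default: because the embeddings are normalized to unit length (normalize_embeddings is True), the distance d and the cosine similarity c are related by d = 2 - 2c. The sketch below checks that identity on made-up vectors; the vectors and helper names are illustrative and do not come from the model.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product equals cosine similarity only for unit-length vectors.
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
doc = normalize([0.8, 1.5, 1.0])

d = squared_l2(query, doc)
c = cosine(query, doc)
print(d, 2 - 2 * c)  # the two values agree up to floating-point error

# Under the same assumption, the top score above maps to this cosine similarity:
print(1 - 0.9107884 / 2)
```

If the collection is configured with a different distance function, the conversion does not apply, but the ordering rule (smaller is closer) still holds for distances.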

Query by vector

embedding = embed_model.embed_query("cat")
vetorstore.similarity_search_by_vector(embedding)

Output:

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

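step3 Retrievers

LangChain VectorStore objects do not subclass Runnable, so they cannot be integrated directly into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves, without subclassing Retriever. Once we choose which method to use for retrieving documents, we can easily create a runnable. Below we build one around the similarity_search method: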

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda


retriever = RunnableLambda(vetorstore.similarity_search).bind(k=1)

print(retriever.invoke("cat"))
print(retriever.batch(["cat","dog"]))

Output:

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})]
[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})], [Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'})]]
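
To see why wrapping a function gives it invoke and batch, here is a rough standard-library sketch of the idea. `TinyRunnable` and `fake_similarity_search` are hypothetical stand-ins, not LangChain's implementation; the real RunnableLambda also supports async calls, composition with `|`, and more.

```python
from functools import partial


class TinyRunnable:
    """Hypothetical, minimal imitation of the invoke/batch/bind interface."""

    def __init__(self, func):
        self.func = func

    def bind(self, **kwargs):
        # Return a new runnable with some keyword arguments fixed in advance.
        return TinyRunnable(partial(self.func, **kwargs))

    def invoke(self, value):
        return self.func(value)

    def batch(self, values):
        # Run invoke once per input and collect the results in order.
        return [self.invoke(v) for v in values]


def fake_similarity_search(query, k=4):
    """Stand-in for a vector store search: match on a shared three-letter prefix."""
    corpus = ["Cats are independent pets.", "Dogs are great companions."]
    hits = [text for text in corpus if text.lower().startswith(query.lower()[:3])]
    return hits[:k]


toy_retriever = TinyRunnable(fake_similarity_search).bind(k=1)
print(toy_retriever.invoke("cat"))
print(toy_retriever.batch(["cat", "dog"]))
```

As in the real example, invoke returns one list of hits and batch returns one such list per query.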

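VectorStore implements an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers include search_type and search_kwargs attributes, which identify which method of the underlying vector store to call and how to parameterize it.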

retriever = vetorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

Output:

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
[Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]
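
Conceptually, search_type selects which store method the retriever calls, and search_kwargs is forwarded to that method. The sketch below shows the dispatch with a hypothetical `ToyStore` and `ToyRetriever`. It is a simplification: the method bodies only report how they were called instead of running a real search.

```python
class ToyStore:
    """Hypothetical store whose methods just report how they were called."""

    def similarity_search(self, query, k=4):
        return f"similarity_search({query!r}, k={k})"

    def max_marginal_relevance_search(self, query, k=4):
        return f"max_marginal_relevance_search({query!r}, k={k})"


class ToyRetriever:
    """Hypothetical retriever that dispatches on search_type."""

    def __init__(self, store, search_type="similarity", search_kwargs=None):
        self.store = store
        self.search_type = search_type
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query):
        # Pick the store method by search_type, then pass search_kwargs through.
        methods = {
            "similarity": self.store.similarity_search,
            "mmr": self.store.max_marginal_relevance_search,
        }
        return methods[self.search_type](query, **self.search_kwargs)


print(ToyRetriever(ToyStore(), "similarity", {"k": 1}).invoke("cat"))
print(ToyRetriever(ToyStore(), "mmr", {"k": 2}).invoke("shark"))
```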

Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

chat = ChatOpenAI()

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

retriever = vetorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("human", message),
    ]
)

rag_chat = {"context": retriever, "question": RunnablePassthrough()} | prompt | chat

response = rag_chat.invoke("tell me about cats")
print(response.content)

Output:

Cats are independent pets that often enjoy their own space.
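
The chain above does three things in order: the dict runs the retriever on the input while passing the input through unchanged as question, the prompt fills both values into the template, and the chat model answers. The plain-Python sketch below traces that data flow. `fake_retriever` and `fake_chat` are hypothetical stand-ins so that it runs without an API key, and this is not how LangChain implements the pipe operator.

```python
TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def fake_retriever(question: str) -> list[str]:
    """Stand-in for retriever.invoke: always returns the single cat document."""
    return ["Cats are independent pets that often enjoy their own space."]


def fake_chat(prompt_text: str) -> str:
    """Stand-in for the chat model: echoes back whatever follows 'Context:'."""
    return prompt_text.split("Context:", 1)[1].strip()


def rag(question: str) -> str:
    # Step 1: fan the input out into the two prompt variables.
    inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: fill the template (the retrieved texts are joined into one string).
    prompt_text = TEMPLATE.format(
        context="\n".join(inputs["context"]), question=inputs["question"]
    )
    # Step 3: call the model.
    return fake_chat(prompt_text)


print(rag("tell me about cats"))
```

Because the prompt restricts the model to the provided context, the quality of the answer depends directly on what the retriever returns.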
