A Quick Implementation of Text Search

Installing the Model

1. Installation

You need to process your data so the search engine can work with it. The Sentence Transformers framework gives you access to common pre-trained embedding models that turn raw data into embeddings.

pip install -U sentence-transformers

After importing the two main frameworks, you need to specify the exact model the engine will use.

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

The Sentence Transformers framework includes many embedding models; all-MiniLM-L6-v2 is a relatively fast text encoder, so it is used here.

encoder = SentenceTransformer("all-MiniLM-L6-v2")
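
The steps below also need a Qdrant client instance, which is not shown in the excerpt above. A minimal sketch, assuming you are just experimenting locally: the ":memory:" mode runs Qdrant inside the Python process, while a real deployment would use the server URL instead.

# Create the Qdrant client used in the steps below.
# ":memory:" runs an in-process instance, convenient for experiments;
# for a server deployment use e.g. QdrantClient(url="http://localhost:6333").
client = QdrantClient(":memory:")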

Converting a Dataset to Vectors

1. Add the dataset

all-MiniLM-L6-v2 will encode the data you provide. Here you will list all of the science fiction books in your library. Each book's metadata consists of a name, an author, a publication year, and a short description.

documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]

2. Create a collection

All data in Qdrant is organized into collections. In this case you are storing books, so the collection will be called my_books.

client.recreate_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)
  • recreate_collection first tries to delete any existing collection with the same name, then creates the collection.
  • The size parameter defines the size of the vectors in this collection. Vectors of different sizes cannot be compared, so no distance can be computed between them. The encoder's output dimension is 384; you can also call encoder.get_sentence_embedding_dimension() to get the dimension of the model you are using (see the quick check after this list).
  • The distance parameter lets you specify the function used to measure the distance between two points.
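
As a quick check of the points above (the value shown is for all-MiniLM-L6-v2; other distance functions are also available in qdrant-client):

# The collection's vector size must match the encoder's output dimension
print(encoder.get_sentence_embedding_dimension())
# > 384

# Besides models.Distance.COSINE, qdrant-client also provides
# models.Distance.DOT and models.Distance.EUCLID.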

3. Upload the data to the collection

Tell the database to upload documents into the my_books collection. This gives each record an ID and a payload. The payload is the metadata from the dataset.

client.upload_records(
    collection_name="my_books",
    records=[
        models.Record(
            # Encode the description as the vector; the whole document becomes the payload
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ],
)
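
To confirm the upload worked, you can fetch a record back by its ID; a small sketch (the IDs are the enumerate indices assigned above):

records = client.retrieve(collection_name="my_books", ids=[0])
print(records[0].payload["name"])
# > The Time Machine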

4. Ask the engine a question

Now that the data is stored in Qdrant, you can ask it questions and receive semantically relevant results.

hits = client.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

Response:

The search engine returns the three responses most likely to be related to an alien invasion. Each response is assigned a score that shows how close it is to the original query.

{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.570093257022374
{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.5040468703143637
{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216

Narrowing the query

How about the most recent book from the early 2000s?

hits = client.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

Response:

The query has been narrowed down to a single result from 2008.

{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
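
Filters are not limited to a single bound. A sketch (same collection and payload fields) that restricts the query to books published in the 1960s:

hits = client.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[
            # Range conditions can combine lower and upper bounds
            models.FieldCondition(key="year", range=models.Range(gte=1960, lte=1969))
        ]
    ),
    limit=3,
)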

Converting an External Dataset to Vectors

1. Prepare a sample dataset

To run a neural search over startup descriptions, you must first encode the description data into vectors. To process text you can use a pre-trained model such as BERT or a sentence transformer. The Sentence Transformers library lets you conveniently download and use many pre-trained models, such as DistilBERT, MPNet, and others.

  1. First, download the dataset.
wget https://storage.googleapis.com/generall-shared-data/startups_demo.json
  2. Install the sentence-transformers library along with the other required packages.
pip install sentence-transformers numpy pandas tqdm
  3. Import all of the relevant modules.
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

You will use a pre-trained model called all-MiniLM-L6-v2. It is a performance-optimized sentence embedding model; you can read more about it and other available models in the Sentence Transformers documentation.

  4. Download and create a pre-trained sentence encoder.
model = SentenceTransformer(
    "all-MiniLM-L6-v2", device="cuda"
)  # or device="cpu" if you don't have a GPU
  5. Read the raw data file.
df = pd.read_json("./startups_demo.json", lines=True)
  6. Encode all of the startup descriptions, creating one embedding vector per description. Internally, the encode function splits the input into batches, which speeds up the process significantly.
vectors = model.encode(
    [row.alt + ". " + row.description for row in df.itertuples()],
    show_progress_bar=True,
)

All descriptions are now converted to vectors: 40474 vectors of 384 dimensions each, matching the dimensionality of the model's output layer.

vectors.shape
# > (40474, 384)
  7. Save the vectors into a new file named startup_vectors.npy.
np.save("startup_vectors.npy", vectors, allow_pickle=False)

2. Upload the data to Qdrant
  1. Create an iterator

The Qdrant client library defines a special function that lets you load a dataset into the service. However, because the data may be too large to fit into a single machine's memory, the function takes an iterator over the data as input.

fd = open("./startups_demo.json")

# payload is now an iterator over startup data
payload = map(json.loads, fd)

# Load all vectors into memory, numpy array works as iterable for itself.
# Other option would be to use Mmap, if you don't want to load all data into RAM
vectors = np.load("./startup_vectors.npy")
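
Before uploading, a client needs to be connected and the startups collection created, a step not shown in the excerpt above. A minimal sketch, assuming a local Qdrant server on the default port (adjust the URL to your deployment); the vector size matches the 384-dimensional encoder output:

from qdrant_client import QdrantClient, models

# Connect to a running Qdrant instance
qdrant_client = QdrantClient(url="http://localhost:6333")

# (Re)create the collection that will hold the startup vectors
qdrant_client.recreate_collection(
    collection_name="startups",
    vectors_config=models.VectorParams(
        size=384,  # output dimension of all-MiniLM-L6-v2
        distance=models.Distance.COSINE,
    ),
)
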
  2. Upload the data
qdrant_client.upload_collection(
    collection_name="startups",
    vectors=vectors,
    payload=payload,
    ids=None,  # Vector ids will be assigned automatically
    batch_size=256,  # How many vectors will be uploaded in a single request?
)

The vectors are now uploaded to Qdrant.
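
As a quick sanity check, you can count the points now stored in the collection; the number should match the 40474 vectors encoded earlier:

print(qdrant_client.count(collection_name="startups", exact=True))
# > count=40474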

  3. Search the vector store
text = "Artificial intelligence machine learning"
vector = model.encode(text).tolist()

city_of_interest = "Berlin"

# Define a filter that only keeps startups from the city of interest
city_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="city",
            match=models.MatchValue(value=city_of_interest),
        )
    ]
)

search_result = qdrant_client.search(
    collection_name="startups",
    query_vector=vector,
    query_filter=city_filter,
    limit=5,
)

payloads = [hit.payload for hit in search_result]

print(payloads)

Output:

[
    {'alt': 'Lateral -  machine learning developer apis productivity software content discovery', 'city': 'Berlin', 'description': 'Automated Intelligent Discovery\nWe enable developers and individuals to easily integrate predictive intelligence into their services. This lets them automate the discovery of relevant information based on the content they already generate.\nFor example our service can help social media tools ...', 'images': 'https://d1qb2nb5cznatu.cloudfront.net/startups/i/560522-45bd593e65b7ff59d0337ba693b2e98c-thumb_jpg.jpg?buster=1418931216', 'link': 'https://lateral.io/', 'name': 'Lateral'}, 
    {'alt': 'Kelsen -  Legal Tech', 'city': 'Berlin', 'description': 'IBM Watson for Legal Industry\nKelsen is a learning algorithm that computes valuable answers to legal questions in real time by combining big data and machine learning technologies. Kelsen learns from existing cases and human curation to provide automated, reliable answers over time.\nOur algorithms ...', 'images': 'https://d1qb2nb5cznatu.cloudfront.net/startups/i/585625-d9f396b68cd4a497b2f4ae79193aa72b-thumb_jpg.jpg?buster=1421855917', 'link': 'http://www.ask-kelsen.com', 'name': 'Kelsen'}, 
    {'alt': 'micropsi industries -  artificial intelligence software Researchers Artificial Neural Networks', 'city': 'Berlin', 'description': 'building cognitive machines\nmicropsi industries is a software startup with roots in the artificial general intelligence community, building autonomous software agents and researching true cognitive machines. Micropsi, the cognitive architecture used in micropsi industries’ agents, defines ...', 'images': 'https://d1qb2nb5cznatu.cloudfront.net/startups/i/456625-066868ea8b149c5fc19cc760922cc96d-thumb_jpg.jpg?buster=1407490288', 'link': 'http://www.micropsi-industries.com', 'name': 'micropsi industries'}, 
    {'alt': 'Shopboostr -  e-commerce algorithms big data online shopping', 'city': 'Berlin', 'description': 'From Big-Data to Customer Personalisation\nShopboostr helps ecommerce retailers to deliver a personalized user experience. Through the collection of big data our machine learning algorithms can predict every user behavior - automatic and in real time! As an outcome we are able to target every customer with ...', 'images': 'https://d1qb2nb5cznatu.cloudfront.net/startups/i/334573-ef08122374dc56345289e988c891be86-thumb_jpg.jpg?buster=1405246053', 'link': 'http://www.shopboostr.de', 'name': 'Shopboostr'}, 
    {'alt': 'Patience -  education machine learning big data', 'city': 'Berlin', 'description': 'The easiest way to create your own online learning website to sell courses online\nPatience is a Berlin-based learning technology provider with the mission of enabling virtually anybody to teach online.\nOur customizable platform makes it possible for educators around the world to easily build and manage their own online learning applications ...', 'images': 'https://d1qb2nb5cznatu.cloudfront.net/startups/i/327633-745ec02caf76582849e98d5c882f96e4-thumb_jpg.jpg?buster=1390352409', 'link': 'http://www.patience.io', 'name': 'Patience'}
]

Next, we will build this search into an API service.
