Getting started with txtai
This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code, which can also be run in Colab.
Colab link
This article gives an overview of txtai and shows how to run similarity searches.
Install dependencies
Install txtai and all dependencies.
pip install txtai
Create an Embeddings instance
The Embeddings instance is the main entry point for txtai. It defines the method used to tokenize sections of text and convert them into embedding vectors.
from txtai.embeddings import Embeddings
# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
运行相似度查询
嵌入实例依赖于底层的转换器模型来构建文本嵌入。以下示例展示了如何使用 Transformers Embedding 实例对不同概念的列表运行相似性搜索。
data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]
    print("%-20s %s" % (query, data[uid]))
# ----------------------------------- Results -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
The example above shows that, for almost all of the queries, the query text does not literally appear in the stored list of text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!
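To make that point concrete, here is a minimal sketch (not txtai) of a keyword-overlap baseline: the token overlap between "feel good story" and the headline a semantic model correctly matches is zero, so a purely token-based search could never pair them.

```python
# Toy keyword-overlap score, the kind of signal token-based search relies on.
def token_overlap(query, text):
    """Count tokens shared between a query and a text (case-insensitive)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

query = "feel good story"
best_match = "Maine man wins $1M from $25 lottery ticket"

# No shared tokens, yet a semantic model ranks this headline first
print(token_overlap(query, best_match))  # 0
```

A semantic model matches on meaning instead, which is why the queries above succeed despite sharing no words with their results.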
Build an Embeddings index
For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert every text to embeddings on each query. txtai supports building pre-computed indexes, which significantly improves performance.
Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector on each search.
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
# ----------------------------------- Results -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
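The performance gain from a pre-computed index can be sketched outside txtai with a toy implementation (hypothetical code, not txtai internals): every document is vectorized once at build time, so each search only has to vectorize the query and scan the stored vectors. Simple bag-of-words counts stand in for real embeddings here.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Index build: vectorize every document once, up front
docs = ["Maine man wins lottery", "ice shelf collapses in Canada"]
index = [(uid, vectorize(text)) for uid, text in enumerate(docs)]

def search(query, limit=1):
    """Per-query cost is one vectorize call plus a scan of stored vectors."""
    q = vectorize(query)
    scored = sorted(((cosine(q, v), uid) for uid, v in index), reverse=True)
    return [uid for _, uid in scored[:limit]]

print(docs[search("lottery winner")[0]])  # Maine man wins lottery
```

Real indexes replace the linear scan with approximate nearest-neighbor structures, but the division of work is the same: expensive encoding happens once, cheap lookups happen per query.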
Embeddings load/save
An embeddings index can be saved to disk and reloaded. At this time, the index is not created incrementally; a full rebuild is required to incorporate new data.
embeddings.save("index")
embeddings = Embeddings()
embeddings.load("index")
uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])
Embeddings update/delete
Updates and deletes are supported for embeddings indexes. The upsert operation inserts new data and updates existing data.
The following section runs a query, then updates a value, changing the top result, and finally deletes the updated value to revert to the original query results.
# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Update data
data[0] = "See it: baby panda born"
embeddings.upsert([(0, data[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", data[uid])

# Remove record just added from index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", data[uid])
# ----------------------------------- Results -------------------------
Initial:  Maine man wins $1M from $25 lottery ticket
After update:  See it: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket
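The upsert/delete semantics shown above can be sketched with a toy in-memory index (hypothetical code, not txtai internals): upsert overwrites an entry by id, delete removes it, and subsequent searches reflect the change immediately.

```python
# Toy index illustrating upsert/delete semantics; token overlap stands in
# for real embedding similarity.
class ToyIndex:
    def __init__(self):
        self.entries = {}  # uid -> text

    def upsert(self, documents):
        """Insert new entries and overwrite existing ones by uid."""
        for uid, text in documents:
            self.entries[uid] = text

    def delete(self, uids):
        """Remove entries by uid; missing uids are ignored."""
        for uid in uids:
            self.entries.pop(uid, None)

    def search(self, query):
        """Return the uid whose text shares the most tokens with the query."""
        q = set(query.lower().split())
        return max(self.entries, key=lambda uid: len(q & set(self.entries[uid].lower().split())))

index = ToyIndex()
index.upsert([(0, "lottery ticket wins big"), (1, "ice shelf collapses")])
index.upsert([(0, "baby panda born")])  # upsert replaces entry 0
print(index.search("baby panda"))       # 0
index.delete([0])
print(index.search("baby panda"))       # entry 0 is gone; 1 is all that remains
```

The key behavior mirrored here is that upsert is a single operation for both insert and update, keyed on the document id.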
Embeddings methods
Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits, as shown below:
- sentence-transformers
  - Creates a single embeddings vector via mean pooling of the vectors generated by the transformers library.
  - Supports models stored on the Hugging Face model hub or stored locally.
  - See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to the Hugging Face model hub.
  - Base models require significant compute capability (GPU preferred). It is possible to build smaller/lighter-weight models that trade accuracy for speed.
- word embeddings vectors
  - Creates a single embeddings vector via BM25 scoring of each word component. See this Medium article for the logic behind this method.
  - Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
  - See words.py for code that can build word vectors for custom datasets.
  - Significantly better performance with default models. For larger datasets, this offers a good tradeoff of speed and accuracy.
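The mean pooling step behind the first method can be illustrated directly: given one vector per token from a transformer, the sentence vector is simply their element-wise average. The token vectors below are made up for illustration; a real model produces one per input token.

```python
# Mean pooling sketch: average per-token vectors into one sentence vector.
token_vectors = [
    [0.2, 0.8, -0.1],   # vector for token 1
    [0.4, 0.0,  0.3],   # vector for token 2
    [0.0, 0.4,  0.1],   # vector for token 3
]

dims = len(token_vectors[0])
sentence_vector = [sum(v[d] for v in token_vectors) / len(token_vectors)
                   for d in range(dims)]
print(sentence_vector)  # ≈ [0.2, 0.4, 0.1]
```

Because the result has the same dimensionality regardless of sentence length, any two sentences can be compared with a single cosine similarity between their pooled vectors.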
References
https://dev.to/neuml/tutorial-series-on-txtai-ibg