45: Customize your own embeddings database

Q shen

Category: txtai tutorial series (45-part series). Tags: database, numpy, python

Article link: https://blog.csdn.net/qq_52010446/article/details/130570668

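txtai supports many different database and vector index backends, including external databases. With modern hardware, it's remarkable how far a single-node index can take us, easily into the hundreds of millions or even billions of records.

txtai provides maximum flexibility in creating your own embeddings database. Sensible defaults are used out of the box, so none of this configuration is necessary unless you are looking for it. This article explores the options available when you do want to customize an embeddings database.

More information on embeddings configuration settings can be found in the txtai documentation.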

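Install dependencies
Install txtai and all dependencies.

# Install txtai
pip install txtai[database,similarity] datasets

Load dataset
This example uses the ag_news dataset, a collection of news article headlines. We'll use a subset of 25,000 headlines.

import timeit

from datasets import load_dataset

def timer(embeddings, query="red sox"):
    # Average the search time over 250 runs
    elapsed = timeit.timeit(lambda: embeddings.search(query), number=250)
    print(f"{elapsed / 250} seconds per query")

# First 25,000 text entries of the ag_news training split
dataset = load_dataset("ag_news", split="train")["text"][:25000]
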
NumPy
Let's start with the simplest possible embeddings database. It is a thin wrapper that vectorizes text with sentence-transformers, stores the results as a NumPy array and runs similarity queries.

from txtai.embeddings import Embeddings

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": "numpy"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")

[(19831, 0.6780003309249878),
 (18302, 0.6639199256896973),
 (16370, 0.6617192029953003)]

embeddings.info()

{
  "backend": "numpy",
  "build": {
    "create": "2023-05-04T12:12:02Z",
    "python": "3.10.11",
    "settings": {
      "numpy": "1.22.4"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:02Z"
}

The embeddings instance above vectorizes text and stores the content as a NumPy array. Array index positions are returned along with similarity scores. While the same could easily be done with sentence-transformers directly, using the txtai framework makes it easy to swap in different options, as the following sections show.
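
To make the mechanics concrete, here is a minimal sketch of what a brute-force NumPy search amounts to: normalize the vectors, take dot products against the query and return the best positions with their scores. It uses random vectors as a stand-in for sentence-transformers output, and the helper names normalize and search are hypothetical. This illustrates the idea only; it is not txtai's implementation.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(index, query, limit=3):
    """Brute-force search: score every row, return (position, score) for the best rows."""
    scores = index @ query
    top = np.argsort(-scores)[:limit]
    return [(int(i), float(scores[i])) for i in top]

# Stand-in data: 1,000 random 384-dimensional vectors instead of real text embeddings
rng = np.random.default_rng(0)
index = normalize(rng.standard_normal((1000, 384)).astype(np.float32))

# Query with a slightly perturbed copy of row 42, so row 42 should rank first
noise = 0.01 * rng.standard_normal((1, 384)).astype(np.float32)
query = normalize(index[42:43] + noise)[0]

results = search(index, query)
print(results)
```

Every query scans all rows, which is why search time for this approach grows linearly with the number of records.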

SQLite and NumPy
The next combination we'll test is a SQLite database with a NumPy array.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "numpy"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

Now let's run a search.

embeddings.search("red sox")

[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]

embeddings.info()

{
  "backend": "numpy",
  "build": {
    "create": "2023-05-04T12:12:24Z",
    "python": "3.10.11",
    "settings": {
      "numpy": "1.22.4"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": "sqlite",
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:24Z"
}

The ids and scores are the same as before. The only difference is that content is now available through the associated SQLite database, so each result carries its text.
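
The split of responsibilities can be sketched with the standard library alone: the vector index returns ids and scores, and a database lookup turns them into rows with text. The table name sections, its columns and the resolve helper are hypothetical, chosen for illustration; txtai's internal schema may differ.

```python
import sqlite3

# Hypothetical content table: one row per indexed item
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sections (id TEXT PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO sections VALUES (?, ?)",
    [
        ("0", "Red Sox win in extra innings"),
        ("1", "Stocks rally on earnings"),
        ("2", "New phone released"),
    ],
)

def resolve(hits):
    """Attach stored text to (id, score) pairs returned by a vector search."""
    results = []
    for uid, score in hits:
        row = connection.execute(
            "SELECT text FROM sections WHERE id = ?", (str(uid),)
        ).fetchone()
        results.append({"id": str(uid), "text": row[0], "score": score})
    return results

# Pretend these (id, score) pairs came back from the vector index
hits = [(0, 0.68), (2, 0.12)]
results = resolve(hits)
print(results)
```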

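Let's inspect the ANN object to see what it looks like.

print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))

(25000, 384)
<class 'numpy.memmap'>

As expected, it's a NumPy array (memory-mapped, as the numpy.memmap type shows). Let's time how long a search query takes.

timer(embeddings)

0.03392000120000011 seconds per query

Not bad!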

SQLite and PyTorch
Now let's try the PyTorch backend.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "torch"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

Let's run the search again.

embeddings.search("red sox")

[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.678000271320343},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617191433906555}]

embeddings.info()

{
  "backend": "torch",
  "build": {
    "create": "2023-05-04T12:12:53Z",
    "python": "3.10.11",
    "settings": {
      "torch": "2.0.0+cu118"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": "sqlite",
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:53Z"
}

And once again, let's inspect the ANN object.

print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))

torch.Size([25000, 384])
<class 'torch.Tensor'>

As expected, this time the backend is a Torch tensor. Next we'll measure the average search time.

timer(embeddings)

0.021084972200000267 seconds per query

A bit faster (about 21 ms versus 34 ms per query), since Torch uses the GPU to compute the similarity matrix.

SQLite and Faiss
Now let's run the same code with the standard txtai setup of Faiss + SQLite.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")

[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]

embeddings.info()

{
  "backend": "faiss",
  "build": {
    "create": "2023-05-04T12:13:23Z",
    "python": "3.10.11",
    "settings": {
      "components": "IVF632,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": true,
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:13:23Z"
}

timer(embeddings)

0.008729957724000087 seconds per query

Everything is consistent with the previous examples. Note that Faiss is faster, since it is a vector index. For 25,000 records the difference is small, but the advantage of a vector index grows quickly for datasets in the millions and beyond.
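
The IVF632,Flat setting in the output above describes an inverted file index: vectors are grouped into cells around centroids, and a query scans only the cells closest to it instead of every vector. The sketch below illustrates that idea in NumPy under simplifying assumptions: centroids are sampled from the data rather than trained with k-means, the data is random, and the names build, ivf_search and nprobe are used for illustration only. It is not how Faiss is implemented.

```python
import numpy as np

def build(vectors, ncells, rng):
    """Pick centroids from the data and assign each vector to its closest centroid."""
    centroids = vectors[rng.choice(len(vectors), ncells, replace=False)]
    assignments = np.argmax(vectors @ centroids.T, axis=1)
    cells = {c: np.flatnonzero(assignments == c) for c in range(ncells)}
    return centroids, cells

def ivf_search(vectors, centroids, cells, query, nprobe=4, limit=3):
    """Score only the vectors that live in the nprobe cells closest to the query."""
    closest = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([cells[int(c)] for c in closest])
    scores = vectors[candidates] @ query
    top = np.argsort(-scores)[:limit]
    return [(int(candidates[i]), float(scores[i])) for i in top]

# Stand-in data: 5,000 random unit vectors
rng = np.random.default_rng(0)
vectors = rng.standard_normal((5000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

centroids, cells = build(vectors, ncells=50, rng=rng)

# Search for a vector that is in the index: probing 4 of 50 cells vs all 50 cells
query = vectors[123]
approximate = ivf_search(vectors, centroids, cells, query, nprobe=4)
exhaustive = ivf_search(vectors, centroids, cells, query, nprobe=50)
print(approximate[0], exhaustive[0])
```

Probing fewer cells trades some recall for speed, which is the reason the gap to brute-force search widens as the dataset grows.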

SQLite and HNSW
While txtai strives to keep things as simple as possible with common defaults out of the box, customizing backend options can improve performance. The next example stores vectors in an HNSW index and customizes the index options.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "backend": "hnsw", "hnsw": {"m": 32}})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")

[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639198660850525},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]

embeddings.info()

{
  "backend": "hnsw",
  "build": {
    "create": "2023-05-04T12:13:59Z",
    "python": "3.10.11",
    "settings": {
      "efconstruction": 200,
      "m": 32,
      "seed": 100
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": true,
  "deletes": 0,
  "dimensions": 384,
  "hnsw": {
    "m": 32
  },
  "metric": "ip",
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:13:59Z"
}

timer(embeddings)

0.006160191656000279 seconds per query

Once again, everything matches the previous examples. The performance difference compared with Faiss is negligible at this size.

Hnswlib powers a number of popular vector databases. It is definitely an option worth evaluating.

Configuration storage
Configuration is passed to an embeddings instance as a dictionary. When an embeddings instance is saved, the default behavior is to store the configuration as a pickled object. JSON can be used instead.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "format": "json"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

# Save embeddings
embeddings.save("index")

!cat index/config.json

{
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "content": true,
  "format": "json",
  "dimensions": 384,
  "backend": "faiss",
  "offset": 25000,
  "build": {
    "create": "2023-05-04T12:14:25Z",
    "python": "3.10.11",
    "settings": {
      "components": "IVF632,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "update": "2023-05-04T12:14:25Z"
}

The stored configuration is almost identical to the output of embeddings.info(). This is by design: the JSON configuration is meant to be human-readable. It is a good choice when sharing an embeddings database on the Hugging Face Hub.
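
The difference between the two formats can be seen with the standard library alone. The sketch below serializes a configuration dictionary like the one used in this article both ways; it says nothing about txtai's file layout and only contrasts the formats themselves.

```python
import json
import pickle

# A configuration dictionary like the one passed to Embeddings in this article
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "format": "json",
}

# JSON is plain text that can be read, diffed and edited by hand
as_json = json.dumps(config, indent=2)
print(as_json)

# Pickle is a binary format that is only meaningful to Python
as_pickle = pickle.dumps(config)
print(as_pickle[:20])

# Both round-trip to the same dictionary
restored_json = json.loads(as_json)
restored_pickle = pickle.loads(as_pickle)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. That is one more reason to prefer JSON for an index that will be shared.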

SQLite vs DuckDB
The last thing we'll explore is the database backend.

SQLite is a row-oriented database and DuckDB is column-oriented. This design difference matters and is a factor to consider when evaluating the expected workload. Let's explore.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

timer(embeddings, "SELECT text FROM txtai where id = 3980")

0.0001413383999997677 seconds per query

timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")

0.03718761139199978 seconds per query

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "duckdb"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

timer(embeddings, "SELECT text FROM txtai where id = 3980")

0.002780103128000519 seconds per query

timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")

0.01854579007600023 seconds per query

While a 25,000-row dataset is small, the differences are starting to show. SQLite has the faster single-row retrieval time (about 0.14 ms versus 2.8 ms), while DuckDB does better on the aggregate query (about 18.5 ms versus 37.2 ms). This is a product of row-oriented versus column-oriented databases and a factor to consider when developing a solution.
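
The trade-off can be illustrated without either database. The sketch below keeps the same toy records in a row layout (one dict per record) and in a column layout (one list per field). The data and variable names are made up for illustration, and the example only conveys the general orientation, not how SQLite or DuckDB store data internally.

```python
from collections import Counter

# Row-oriented: each record is stored together
rows = [
    {"id": 0, "text": "sports"},
    {"id": 1, "text": "business"},
    {"id": 2, "text": "sports"},
]

# Column-oriented: each field is stored together
columns = {
    "id": [0, 1, 2],
    "text": ["sports", "business", "sports"],
}

# Single-record lookup: the row layout hands back the whole record in one step
record = rows[2]

# The column layout has to visit every column to reassemble the same record
position = columns["id"].index(2)
rebuilt = {name: values[position] for name, values in columns.items()}

# Aggregation over one field: the column layout scans a single list
counts = Counter(columns["text"])

# The row layout has to touch every record to pull out the one field it needs
row_counts = Counter(row["text"] for row in rows)

print(record, rebuilt, counts.most_common())
```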
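
Wrapping up
This article explored different combinations of database and vector index backends. With modern hardware, it's remarkable how far a single-node index can take us, easily into the hundreds of millions or even billions of records. When hardware limits become a bottleneck, an external vector database is one option to consider. Another is building a distributed txtai embeddings cluster.

There is power in simplicity. Many paid services try to convince us that signing up for an API account is the best place to start. In some cases, such as teams with few or no developers, that is true. But for teams with developers, options like txtai are worth evaluating.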
