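txtai supports a number of different database and vector index backends, including external databases. With modern hardware, it's remarkable how far a single-node index can take us: easily into the hundreds of millions and even billions of records.
txtai provides maximum flexibility in creating your own embeddings database. Sensible defaults are used out of the box, so custom configuration is only needed when you want it. This article explores the options available when you do want to customize your embeddings database.
More information on embeddings configuration settings can be found in the txtai documentation.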
Install dependencies
Install txtai and all dependencies.
# Install txtai
pip install txtai[database,similarity] datasets
Load dataset
This example uses the ag_news dataset, a collection of news article headlines. We'll use a subset of 25,000 headlines.
import timeit
from datasets import load_dataset
def timer(embeddings, query="red sox"):
    # Average the search time over 250 runs
    elapsed = timeit.timeit(lambda: embeddings.search(query), number=250)
    print(f"{elapsed / 250} seconds per query")

# Load the first 25,000 headlines
dataset = load_dataset("ag_news", split="train")["text"][:25000]
NumPy
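Let's start with the simplest possible embeddings database. It is just a thin wrapper around vectorizing text with sentence-transformers, storing the results as a NumPy array and running similarity queries.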
from txtai.embeddings import Embeddings
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": "numpy"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[(19831, 0.6780003309249878),
(18302, 0.6639199256896973),
(16370, 0.6617192029953003)]
embeddings.info()
{
“backend”: “numpy”,
“build”: {
“create”: “2023-05-04T12:12:02Z”,
“python”: “3.10.11”,
“settings”: {
“numpy”: “1.22.4”
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“dimensions”: 384,
“offset”: 25000,
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“update”: “2023-05-04T12:12:02Z”
}
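The embeddings instance above vectorizes the text and stores the vectors as a NumPy array. Array index positions are returned along with the similarity scores. While the same can easily be done with sentence-transformers alone, the txtai framework makes it easy to swap in different options, as the following sections show.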
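To make the idea concrete, here is a minimal sketch of a brute-force similarity search over a NumPy array. The `search` function and the toy 2-dimensional vectors are hypothetical stand-ins for illustration; that txtai's NumPy backend works exactly this way is an assumption, not something this article confirms.

```python
import numpy as np

def search(vectors, query, limit=3):
    """Return (index, score) pairs for the rows most similar to query."""
    # Normalize so that the dot product equals cosine similarity
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)

    # One matrix-vector product scores every row (brute force, no index)
    scores = vectors @ query

    # Highest scores first
    top = np.argsort(-scores)[:limit]
    return [(int(i), float(scores[i])) for i in top]

# Toy 2-dimensional "embeddings" standing in for real model output
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
results = search(vectors, np.array([1.0, 0.1]), limit=2)
print(results)
```

Every query compares against every row, which is why this approach is simple but scales linearly with the number of records.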
SQLite and NumPy
The next combination we'll test is a SQLite database with a NumPy array.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "numpy"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
Now let's run a search.
embeddings.search("red sox")
[{‘id’: ‘19831’,
‘text’: 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
‘score’: 0.6780003309249878},
{‘id’: ‘18302’,
‘text’: ‘BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.’,
‘score’: 0.6639199256896973},
{‘id’: ‘16370’,
‘text’: ‘Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.’,
‘score’: 0.6617192029953003}]
embeddings.info()
{
“backend”: “numpy”,
“build”: {
“create”: “2023-05-04T12:12:24Z”,
“python”: “3.10.11”,
“settings”: {
“numpy”: “1.22.4”
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“content”: “sqlite”,
“dimensions”: 384,
“offset”: 25000,
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“update”: “2023-05-04T12:12:24Z”
}
The results are the same as before. The only difference is that the content is now available through the associated SQLite database.
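Conceptually, the database turns the (index, score) pairs from the vector search into full records. The sketch below illustrates that lookup with Python's built-in `sqlite3` module. The `sections` table, its columns and the `resolve` helper are hypothetical names chosen for illustration and are not claimed to match txtai's internal schema.

```python
import sqlite3

# Hypothetical schema: one row of text per array index position
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (indexid INTEGER PRIMARY KEY, text TEXT)")
db.executemany(
    "INSERT INTO sections VALUES (?, ?)",
    enumerate(["Red Sox win series", "Stocks fall sharply", "Yankees lose big"]),
)

def resolve(db, hits):
    """Attach stored text to (index, score) pairs from a vector search."""
    results = []
    for indexid, score in hits:
        row = db.execute("SELECT text FROM sections WHERE indexid = ?", (indexid,)).fetchone()
        results.append({"id": str(indexid), "text": row[0], "score": score})
    return results

# Pretend these pairs came back from the vector index
results = resolve(db, [(0, 0.68), (2, 0.41)])
print(results)
```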
Let's inspect the ANN object to see what it looks like.
print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
(25000, 384)
<class 'numpy.memmap'>
As expected, it's a NumPy array (a memory-mapped one). Now let's time how long a search query takes.
timer(embeddings)
0.03392000120000011 seconds per query
Not bad at all!
SQLite and PyTorch
Now let's try the PyTorch backend.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "torch"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
Let's run the search again.
embeddings.search("red sox")
[{‘id’: ‘19831’,
‘text’: 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
‘score’: 0.678000271320343},
{‘id’: ‘18302’,
‘text’: ‘BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.’,
‘score’: 0.6639199256896973},
{‘id’: ‘16370’,
‘text’: ‘Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.’,
‘score’: 0.6617191433906555}]
embeddings.info()
{
“backend”: “torch”,
“build”: {
“create”: “2023-05-04T12:12:53Z”,
“python”: “3.10.11”,
“settings”: {
“torch”: “2.0.0+cu118”
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“content”: “sqlite”,
“dimensions”: 384,
“offset”: 25000,
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“update”: “2023-05-04T12:12:53Z”
}
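Once again, let's inspect the ANN object.
print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
torch.Size([25000, 384])
<class 'torch.Tensor'>
As expected, the backend this time is a Torch tensor. Next we'll calculate the average search time.
timer(embeddings)
0.021084972200000267 seconds per query
A bit faster (about 21 ms versus 34 ms per query), since Torch uses the GPU to compute the similarity matrix.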
SQLite and Faiss
Now let's run the same code with the standard txtai setup of Faiss + SQLite.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[{‘id’: ‘19831’,
‘text’: 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
‘score’: 0.6780003309249878},
{‘id’: ‘18302’,
‘text’: ‘BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.’,
‘score’: 0.6639199256896973},
{‘id’: ‘16370’,
‘text’: ‘Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.’,
‘score’: 0.6617192029953003}]
embeddings.info()
{
“backend”: “faiss”,
“build”: {
“create”: “2023-05-04T12:13:23Z”,
“python”: “3.10.11”,
“settings”: {
“components”: “IVF632,Flat”
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“content”: true,
“dimensions”: 384,
“offset”: 25000,
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“update”: “2023-05-04T12:13:23Z”
}
timer(embeddings)
0.008729957724000087 seconds per query
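Everything lines up with the previous examples. Note that Faiss is faster because it is a vector index. With 25,000 records the difference is small in absolute terms, but the advantage of a vector index grows quickly once datasets reach millions of records.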
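The `IVF632,Flat` components string in the output above describes a Faiss inverted-file index with 632 cells. A query only scans the vectors in the few cells closest to it instead of all 25,000. The sketch below shows one sizing heuristic, about four times the square root of the record count, that happens to reproduce 632 for this dataset. That txtai picks the cell count this way is an assumption here, and both helper functions are hypothetical.

```python
import math

def ivf_cells(count):
    """Hypothetical helper: cell count of about 4 * sqrt(count)."""
    return round(4 * math.sqrt(count))

def scanned_fraction(cells, nprobe):
    """Rough share of vectors compared per query when probing nprobe equally sized cells."""
    return min(nprobe, cells) / cells

cells = ivf_cells(25000)
print(f"IVF{cells},Flat")

# Probing a handful of cells touches only a small share of the data
print(scanned_fraction(cells, nprobe=6))
```

The fewer cells a query probes, the less work it does, at the cost of possibly missing a near neighbor that landed in an unprobed cell.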
SQLite and HNSW
While txtai strives to keep things as simple as possible with common defaults out of the box, customizing the backend options can improve performance. The next example stores vectors in an HNSW index and customizes the index options.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "backend": "hnsw", "hnsw": {"m": 32}})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[{‘id’: ‘19831’,
‘text’: 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
‘score’: 0.6780003309249878},
{‘id’: ‘18302’,
‘text’: ‘BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.’,
‘score’: 0.6639198660850525},
{‘id’: ‘16370’,
‘text’: ‘Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.’,
‘score’: 0.6617192029953003}]
embeddings.info()
{
“backend”: “hnsw”,
“build”: {
“create”: “2023-05-04T12:13:59Z”,
“python”: “3.10.11”,
“settings”: {
“efconstruction”: 200,
“m”: 32,
“seed”: 100
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“content”: true,
“deletes”: 0,
“dimensions”: 384,
“hnsw”: {
“m”: 32
},
“metric”: “ip”,
“offset”: 25000,
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“update”: “2023-05-04T12:13:59Z”
}
timer(embeddings)
0.006160191656000279 seconds per query
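Once again, everything matches the previous examples. The performance difference compared with Faiss is negligible at this scale (roughly 6 ms versus 9 ms per query).
Hnswlib powers a number of popular vector databases. It's definitely an option worth evaluating.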
Configuration storage
Configuration is passed to an embeddings instance as a dictionary. When an embeddings instance is saved, the configuration is stored as a pickled object by default. JSON can be used instead.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "format": "json"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
# Save embeddings
embeddings.save("index")
!cat index/config.json
{
“path”: “sentence-transformers/all-MiniLM-L6-v2”,
“content”: true,
“format”: “json”,
“dimensions”: 384,
“backend”: “faiss”,
“offset”: 25000,
“build”: {
“create”: “2023-05-04T12:14:25Z”,
“python”: “3.10.11”,
“settings”: {
“components”: “IVF632,Flat”
},
“system”: “Linux (x86_64)”,
“txtai”: “5.6.0”
},
“update”: “2023-05-04T12:14:25Z”
}
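The stored configuration is almost identical to the output of an embeddings.info() call. This is by design: the JSON configuration is meant to be human-readable. It is a good option when sharing an embeddings database on the Hugging Face Hub.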
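The difference between the two formats is easy to see with the standard library alone. The sketch below serializes a subset of the configuration shown above both ways; it only illustrates the formats and does not reproduce how txtai itself writes its files.

```python
import json
import pickle

# Subset of the configuration shown above
config = {"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "format": "json"}

# Pickle produces opaque bytes that need Python to read
pickled = pickle.dumps(config)

# JSON produces text that can be inspected with cat or any editor
text = json.dumps(config, indent=2)
print(text)

# Both formats restore the same dictionary
assert pickle.loads(pickled) == json.loads(text) == config
```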
SQLite vs DuckDB
The last thing we'll explore is the database backend.
SQLite is a row-oriented database; DuckDB is column-oriented. This design difference matters and is a factor to consider when evaluating the expected workload. Let's explore.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.0001413383999997677 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.03718761139199978 seconds per query
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "duckdb"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.002780103128000519 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.01854579007600023 seconds per query
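Even though a 25,000-row dataset is small, the differences are already visible. SQLite retrieves a single row faster (about 0.14 ms versus 2.8 ms), while DuckDB handles the aggregate query better (about 19 ms versus 37 ms). This follows from row-oriented versus column-oriented storage and is a factor to consider when developing a solution.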
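The trade-off can be illustrated without either database. In the sketch below, a dictionary of records stands in for row-oriented storage and a dictionary of lists stands in for column-oriented storage. The data and the `lookup` and `groupcount` helpers are hypothetical, and real engines are far more sophisticated than this.

```python
from collections import Counter

texts = ["a", "b", "a", "c", "a"]

# Row-oriented: each record is stored together, so fetching one record is a single lookup
rows = {i: {"id": i, "text": text} for i, text in enumerate(texts)}

# Column-oriented: each field is stored together, so scanning one field touches only that field
columns = {"id": list(range(len(texts))), "text": texts}

def lookup(rows, uid):
    """Point query, similar in spirit to SELECT text ... WHERE id = ?"""
    return rows[uid]["text"]

def groupcount(columns):
    """Aggregate query, similar in spirit to SELECT count(*), text ... GROUP BY text"""
    return Counter(columns["text"]).most_common()

print(lookup(rows, 3))
print(groupcount(columns))
```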
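Wrapping up
This article explored different combinations of database and vector index backends. With modern hardware, it's remarkable how far a single-node index can take us: easily into the hundreds of millions and even billions of records. When hardware becomes the bottleneck, an external vector database is one option to consider. Another is building a distributed txtai embeddings cluster.
There is power in simplicity. Many paid services try to convince us that signing up for an API account is the best place to start. In some cases, such as teams with few or no developers, that is true. But teams that do have developers should evaluate options like txtai.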