Distributed embeddings cluster

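This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code that can also be run in Colab.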

This article installs the txtai API and walks through an example of building an embeddings cluster.

Install dependencies

Install txtai and all dependencies. Since this article uses the API, we need to install the api extras package.

pip install txtai[api]

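Start the distributed embeddings cluster

First, we start multiple API instances that serve as embeddings index shards. Each shard stores a subset of the index data, and the shards work together to form a single logical index.

Then we start the main API instance, which brings the shards together into one logical instance.

All API instances are started in the background.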

import os
os.chdir("/content")

Shard configuration (index.yml):

writable: true

# Embeddings settings
embeddings:
    path: sentence-transformers/nli-mpnet-base-v2

Cluster configuration (cluster.yml):

# Embeddings cluster
cluster:
    shards:
        - http://127.0.0.1:8001
        - http://127.0.0.1:8002

Start the shards and the main instance:

# Start embeddings shards
CONFIG=index.yml nohup uvicorn --port 8001 "txtai.api:app" &> shard-1.log &
CONFIG=index.yml nohup uvicorn --port 8002 "txtai.api:app" &> shard-2.log &

# Start main instance
CONFIG=cluster.yml nohup uvicorn --port 8000 "txtai.api:app" &> main.log &

# Wait for startup
sleep 90
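
The fixed 90-second wait above can be replaced by polling each instance until it answers. The following is a minimal sketch, not part of the original tutorial: the helper names http_probe and wait_until_ready are hypothetical, and it assumes that an instance is ready as soon as it answers any HTTP request on the /count endpoint used later in this article.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True once the server at url answers an HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status
        return True
    except OSError:
        # Connection refused, timeout or similar: not up yet
        return False


def wait_until_ready(urls, probe=http_probe, timeout=120.0, interval=2.0):
    """Poll every url until all of them respond or the timeout expires."""
    deadline = time.monotonic() + timeout
    pending = list(urls)
    while pending:
        # Keep only the instances that still do not respond
        pending = [url for url in pending if not probe(url)]
        if not pending:
            break
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A possible call would be wait_until_ready(["http://127.0.0.1:8001/count", "http://127.0.0.1:8002/count", "http://127.0.0.1:8000/count"]), which returns False if any instance is still unreachable after the timeout.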

Python

Let's first try the cluster directly in Python. The code below aggregates the two shards into a cluster and runs operations against it.

from txtai.api import Cluster

cluster = Cluster({"shards": ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]})

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

# Index data
cluster.add([{"id": x, "text": row} for x, row in enumerate(data)])
cluster.index()

# Test search
uid = cluster.search("feel good story", 1)[0]["id"]
print("Query: feel good story\nResult:", data[uid])

Output:

Query: feel good story
Result: Maine man wins $1M from $25 lottery ticket
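
The same search can also be sent to the main API instance over HTTP instead of going through the Cluster class. The sketch below is an illustration only: the helper names search_url and search are hypothetical, and it assumes the API exposes GET /search with query and limit parameters and returns a JSON list of results with an id field, as the Python and JavaScript examples in this article suggest.

```python
import json
import urllib.parse
import urllib.request


def search_url(base, query, limit=1):
    """Build the URL for a GET /search request against an API instance."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    return f"{base.rstrip('/')}/search?{params}"


def search(base, query, limit=1):
    """Run the query and return the decoded JSON list of results."""
    with urllib.request.urlopen(search_url(base, query, limit)) as response:
        return json.load(response)
```

With the instances from this article running, search("http://127.0.0.1:8000", "feel good story") would query the whole cluster, while pointing it at port 8001 or 8002 would query a single shard.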

JavaScript

Next, let's run the same code as above through the API using JavaScript.

npm install txtai

For this example, we will clone the txtai.js project to import the example build configuration.

git clone https://github.com/neuml/txtai.js

Create cluster.js

The following script is a JavaScript version of the logic above.

import {Embeddings} from "txtai";
import {sprintf} from "sprintf-js";

const run = async () => {
    try {
        let embeddings = new Embeddings(process.argv[2]);

        let data  = ["US tops 5 million confirmed virus cases",
                     "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
                     "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
                     "The National Park Service warns against sacrificing slower friends in a bear attack",
                     "Maine man wins $1M from $25 lottery ticket",
                     "Make huge profits without work, earn up to $100,000 a day"];

        console.log();
        console.log("Querying an Embeddings cluster");
        console.log(sprintf("%-20s %s", "Query", "Best Match"));
        console.log("-".repeat(50));

        for (let query of ["feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"]) {
            let results = await embeddings.search(query, 1);
            let uid = results[0].id;
            console.log(sprintf("%-20s %s", query, data[uid]));
        }
    }
    catch (e) {
        console.trace(e);
    }
};

run();

Build and run cluster.js

cd txtai.js/examples/node
npm install
npm run build

Next, let's run the code against the main cluster URL.

node dist/cluster.js http://127.0.0.1:8000
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day

The JavaScript program shows the same result as the Python code above. It runs each query against both nodes in the cluster and aggregates the results.
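
One way to picture the aggregation step is as a merge of the per-shard result lists by score. The sketch below is an assumption about how such a merge could work, not txtai's actual implementation; merge_results is a hypothetical helper and the scores are made up for illustration.

```python
import heapq


def merge_results(shard_results, limit):
    """Combine per-shard result lists into the best `limit` hits overall."""
    combined = [hit for results in shard_results for hit in results]
    return heapq.nlargest(limit, combined, key=lambda hit: hit["score"])


# Made-up scores for illustration only; ids refer to rows of the data list
shard_1 = [{"id": 4, "score": 0.08}]
shard_2 = [{"id": 5, "score": 0.03}]

best = merge_results([shard_1, shard_2], 1)
print(best[0]["id"])
```

With these example scores the merged top hit is id 4, the lottery headline, which is also what the cluster returns for "feel good story".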

Queries can also be run against each individual shard to see what each one returns on its own.

node dist/cluster.js http://127.0.0.1:8001
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Beijing mobilises invasion craft along coast as Taiwan tensions escalate
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             Beijing mobilises invasion craft along coast as Taiwan tensions escalate
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       US tops 5 million confirmed virus cases

node dist/cluster.js http://127.0.0.1:8002
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Make huge profits without work, earn up to $100,000 a day
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  The National Park Service warns against sacrificing slower friends in a bear attack
war                  The National Park Service warns against sacrificing slower friends in a bear attack
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 The National Park Service warns against sacrificing slower friends in a bear attack
lucky                The National Park Service warns against sacrificing slower friends in a bear attack
dishonest junk       Make huge profits without work, earn up to $100,000 a day

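Note the differences. The section below runs a count against the full cluster and against each shard to show how many records each one holds.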

curl http://127.0.0.1:8000/count
printf "\n"
curl http://127.0.0.1:8001/count
printf "\n"
curl http://127.0.0.1:8002/count
6
3
3
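
The per-shard results and counts above are consistent with rows being split by id modulo the number of shards: port 8001 only ever returns rows 0, 2 and 4 of the data list, and port 8002 only rows 1, 3 and 5. The sketch below illustrates that placement rule. It is an assumption inferred from the output, not a statement of how txtai routes records, and shard_for and partition are hypothetical names.

```python
def shard_for(uid, shard_count):
    """Pick a shard index for an integer id (assumed rule: id modulo shards)."""
    return uid % shard_count


def partition(ids, shard_count):
    """Group ids by the shard they would be sent to."""
    shards = [[] for _ in range(shard_count)]
    for uid in ids:
        shards[shard_for(uid, shard_count)].append(uid)
    return shards


print(partition(range(6), 2))
```

For the six ids used in this article and two shards, this yields three ids per shard, matching the counts of 3 and 3 reported above.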

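This article showed how to create a distributed embeddings cluster with txtai. The example can be scaled further on Kubernetes with StatefulSets, which will be covered in a future tutorial.

References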

https://dev.to/neuml/tutorial-series-on-txtai-ibg
