43 - Embeddings in the Cloud


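Embeddings are the engine that delivers semantic search. Data is transformed into embeddings vectors, and similar concepts produce similar vectors. Indexes both large and small are built from these vectors. An index is used to find results that share the same meaning, not necessarily the same keywords.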
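To make "similar concepts produce similar vectors" concrete, here is a minimal sketch that ranks candidates by cosine similarity. The three-dimensional vectors are hand-picked, hypothetical values used only for illustration; real embeddings come from a model and have hundreds of dimensions, and this is not how txtai stores or searches them internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: made-up values, not model output)
vectors = {
    "lottery win": [0.9, 0.1, 0.0],
    "lucky jackpot": [0.8, 0.2, 0.1],
    "ice shelf collapse": [0.0, 0.2, 0.9],
}


def best_match(query, candidates):
    """Return the candidate whose vector is most similar to the query vector."""
    return max(candidates, key=lambda key: cosine(vectors[query], vectors[key]))


print(best_match("lottery win", ["lucky jackpot", "ice shelf collapse"]))
```

The two lottery-related phrases share no keywords, yet their vectors point in nearly the same direction, so they rank as the closest pair.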

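In addition to local storage, embeddings can be synced with cloud storage. Because a txtai index is a fully encapsulated format, cloud sync is simply a matter of moving a group of files to and from cloud storage. The target can be an object store such as AWS S3/Azure Blob/Google Cloud, or the Hugging Face Hub. See the documentation for more details on the available options. A separate article covers how to build and store indexes in cloud object storage.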
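The idea that cloud sync is just moving a group of files can be sketched with the standard library alone. The example below is a simulation under stated assumptions: a local directory stands in for the cloud bucket, the file names `config` and `embeddings` are placeholders, and the helpers `upload`, `download` and `sync_roundtrip` are hypothetical names, not txtai functions.

```python
import shutil
import tempfile
from pathlib import Path


def upload(index_dir, bucket_dir, name):
    """Pack index_dir into a zip archive stored under bucket_dir (the stand-in bucket)."""
    archive = shutil.make_archive(str(Path(bucket_dir) / name), "zip", root_dir=index_dir)
    return Path(archive)


def download(archive, target_dir):
    """Unpack an archive from the stand-in bucket into target_dir."""
    shutil.unpack_archive(str(archive), str(target_dir))
    return Path(target_dir)


def sync_roundtrip():
    """Write placeholder index files, push them to the bucket and pull them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        index_dir, bucket, restored = root / "index", root / "bucket", root / "restored"
        index_dir.mkdir()
        bucket.mkdir()

        # Placeholder files standing in for a saved index (assumed names)
        (index_dir / "config").write_text("{}")
        (index_dir / "embeddings").write_bytes(b"\x00\x01")

        archive = upload(index_dir, bucket, "index")
        download(archive, restored)
        return sorted(path.name for path in restored.iterdir())


print(sync_roundtrip())
```

Since the index is nothing more than its files, the restored directory is a complete copy that could be loaded like any local index.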

This article walks through an example of loading an embeddings index from the Hugging Face Hub.

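Install dependencies
Install txtai and all dependencies.

# Install txtai
pip install txtai

Integration with the Hugging Face Hub
The Hugging Face Hub hosts a wide range of models, datasets and example applications for jumpstarting your project. This now includes txtai indexes 🔥🔥🔥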

Let's load the embeddings used in the standard Introducing txtai example.

from txtai.embeddings import Embeddings

# Load the index from the Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-intro")

Note the two fields, provider and container. The provider field tells txtai to look for the index on the Hugging Face Hub. The container field sets the target repository.

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get the top result
    result = embeddings.search(query, 1)[0]

    # Print text
    print("%-20s %s" % (query, result["text"]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day

If you've seen txtai before, this is the classic example. The big difference is that the index was loaded from the Hugging Face Hub instead of being built on the fly.

Wikipedia search with txtai
Let's try something more interesting using the Wikipedia index available on the Hugging Face Hub.

from txtai.embeddings import Embeddings

# Load the index from the Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

Now run a series of searches to show what kind of data is available in this index.
import json

for x in embeddings.search("Roman Empire", 1):
    print(json.dumps(x, indent=2))

{
  "id": "Roman Empire",
  "text": "The Roman Empire ( ; ) was the post-Republican period of ancient Rome. As a polity, it included large territorial holdings around the Mediterranean Sea in Europe, North Africa, and Western Asia, and was ruled by emperors. From the accession of Caesar Augustus as the first Roman emperor to the military anarchy of the 3rd century, it was a Principate with Italia as the metropole of its provinces and the city of Rome as its sole capital. The Empire was later ruled by multiple emperors who shared control over the Western Roman Empire and the Eastern Roman Empire. The city of Rome remained the nominal capital of both parts until AD 476 when the imperial insignia were sent to Constantinople following the capture of the Western capital of Ravenna by the Germanic barbarians. The adoption of Christianity as the state church of the Roman Empire in AD 380 and the fall of the Western Roman Empire to Germanic kings conventionally marks the end of classical antiquity and the beginning of the Middle Ages. Because of these events, along with the gradual Hellenization of the Eastern Roman Empire, historians distinguish the medieval Roman Empire that remained in the Eastern provinces as the Byzantine Empire.",
  "score": 0.8913329243659973
}
for x in embeddings.search("How does a car engine work", 1):
    print(json.dumps(x, indent=2))

{
  "id": "Internal combustion engine",
  "text": "An internal combustion engine (ICE or IC engine) is a heat engine in which the combustion of a fuel occurs with an oxidizer (usually air) in a combustion chamber that is an integral part of the working fluid flow circuit. In an internal combustion engine, the expansion of the high-temperature and high-pressure gases produced by combustion applies direct force to some component of the engine. The force is typically applied to pistons (piston engine), turbine blades (gas turbine), a rotor (Wankel engine), or a nozzle (jet engine). This force moves the component over a distance, transforming chemical energy into kinetic energy which is used to propel, move or power whatever the engine is attached to. This replaced the external combustion engine for applications where the weight or size of an engine was more important.",
  "score": 0.8664469122886658
}
for x in embeddings.search("Who won the World Series in 2022?", 1):
    print(json.dumps(x, indent=2))

{
  "id": "2022 World Series",
  "text": "The 2022 World Series was the championship series of Major League Baseball's (MLB) 2022 season. The 118th edition of the World Series, it was a best-of-seven playoff between the American League (AL) champion Houston Astros and the National League (NL) champion Philadelphia Phillies. The Astros defeated the Phillies in six games to earn their second championship. The series was broadcast in the United States on Fox television and ESPN Radio. ",
  "score": 0.8889098167419434
}
for x in embeddings.search("What was New York called under the Dutch?", 1):
    print(json.dumps(x, indent=2))

{
  "id": "Dutch Americans in New York City",
  "text": "Dutch people have had a continuous presence in New York City for nearly 400 years, being the earliest European settlers. New York City traces its origins to a trading post founded on the southern tip of Manhattan Island by Dutch colonists in 1624. The settlement was named New Amsterdam in 1626 and was chartered as a city in 1653. Because of the history of Dutch colonization, Dutch culture, politics, law, architecture, and language played a formative role in shaping the culture of the city. The Dutch were the majority in New York City until the early 1700s and the Dutch language was commonly spoken until the mid to late-1700s. Many places and institutions in New York City still bear a colonial Dutch toponymy, including Brooklyn (Breukelen), Harlem (Haarlem), Wall Street (Waal Straat), The Bowery (bouwerij (\u201cfarm\u201d), and Coney Island (conyne).",
  "score": 0.8840358853340149
}
By now it is probably clear how these results could be combined with another component, such as an LLM prompt, to build a conversational QA system!
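One way to picture that combination is a small helper that turns search results into a prompt. This is a minimal sketch, assuming results are dictionaries with `id` and `text` keys as in the output above; the `build_prompt` helper and the prompt wording are hypothetical, and the call to an actual LLM is left out.

```python
def build_prompt(question, results, limit=3):
    """Assemble a question-answering prompt from the top search results."""
    context = "\n\n".join(f"[{row['id']}] {row['text']}" for row in results[:limit])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# A shortened copy of the search result shown earlier
results = [
    {
        "id": "2022 World Series",
        "text": "The Astros defeated the Phillies in six games to earn their second championship.",
        "score": 0.8889098167419434,
    }
]

prompt = build_prompt("Who won the World Series in 2022?", results)
print(prompt)
```

In a real pipeline, `results` would come straight from `embeddings.search(question, limit)` and the prompt would be passed to whichever LLM is in use.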

Filter by popularity
Let's try one last query. This is a generic query, and similarity search alone returns many matching results.
for x in embeddings.search("Boston", 1):
    print(json.dumps(x, indent=2))

{
  "id": "Boston (song)",
  "text": "\"Boston\" is a song by American rock band Augustana, from their debut album All the Stars and Boulevards (2005). It was originally produced in 2003 by Jon King for their demo, Midwest Skies and Sleepless Mondays, and was later re-recorded with producer Brendan O'Brien for All the Stars and Boulevards.",
  "score": 0.8729256987571716
}
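While the result is about Boston, it is not the most popular result. This is where the percentile field comes in: results can be filtered based on the number of page views.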
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))

{
  "id": "Boston",
  "text": "Boston, officially the City of Boston, is the state capital and most populous city of the Commonwealth of Massachusetts, as well as the cultural and financial center of the New England region of the United States. It is the 24th-most populous city in the country. The city boundaries encompass an area of about and a population of 675,647 as of 2020. It is the seat of Suffolk County (although the county government was disbanded on July 1, 1999). The city is the economic and cultural anchor of a substantially larger metropolitan area known as Greater Boston, a metropolitan statistical area (MSA) home to a census-estimated 4.8\u00a0million people in 2016 and ranking as the tenth-largest MSA in the country. A broader combined statistical area (CSA), generally corresponding to the commuting area and including Providence, Rhode Island, is home to approximately 8.2\u00a0million people, making it the sixth most populous in the United States.",
  "score": 0.8668985366821289,
  "percentile": 0.9999025135905505
}
This query adds an additional filter to only match results from the top 1% of visited Wikipedia pages.
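The filtering logic can be sketched in plain Python. This is an illustration under stated assumptions: the page view counts are made up, the percentile is taken to be the fraction of pages with the same or fewer views (the source does not say how the index computes the field), and `percentile_ranks` and `filter_popular` are hypothetical helpers.

```python
from bisect import bisect_right


def percentile_ranks(pageviews):
    """Map each title to the fraction of pages whose view count is <= its own."""
    ordered = sorted(pageviews.values())
    total = len(ordered)
    return {title: bisect_right(ordered, views) / total for title, views in pageviews.items()}


def filter_popular(results, ranks, minimum):
    """Keep only results whose percentile rank is at least minimum."""
    return [row for row in results if ranks[row["id"]] >= minimum]


# Made-up page view counts, for illustration only
pageviews = {
    "Boston": 90000,
    "Boston (song)": 400,
    "Boston Harbor": 3000,
    "Boston cream pie": 1200,
}
ranks = percentile_ranks(pageviews)

# Similarity-ordered results: the song scores slightly higher than the city
results = [{"id": "Boston (song)", "score": 0.87}, {"id": "Boston", "score": 0.86}]
print(filter_popular(results, ranks, 0.99))
```

The song still wins on similarity, but it drops out once the popularity threshold is applied, which mirrors what the `percentile >= 0.99` clause does in the query above.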

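Wrapping up
This article covered how to load embeddings indexes from cloud storage. The Hugging Face Hub is a great resource for sharing models, datasets, example applications and now txtai embeddings indexes. This is especially useful when indexing takes a long time or requires significant GPU resources.

Looking forward to seeing the embeddings indexes the community shares in the coming months!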
