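Technical Background
OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications. Built on Apache Lucene, it is a distributed search and analytics engine. This article shows how to run similarity search against OpenSearch through LangChain's OpenSearchVectorSearch, from installation through practical use.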
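Core Principles
OpenSearch supports several approaches to similarity search: approximate k-NN search, script scoring, and Painless scripting. Approximate k-NN search suits large datasets because it trades a small amount of accuracy for speed, while script scoring and Painless scripting score documents exactly and leave more room for customization.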
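To make the exact-versus-approximate distinction concrete, here is a minimal plain-Python sketch of what an exact search does conceptually: it scores every stored vector against the query and keeps the k best. The toy index and the function names are hypothetical and are not part of OpenSearch or LangChain. Approximate k-NN avoids this full scan by searching an index structure such as an HNSW graph.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy index: three 2-dimensional "embeddings".
toy_index = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
top_two = exact_knn([1.0, 0.1], toy_index, k=2)
print(top_two)
```

The cost of this scan grows linearly with the number of stored vectors, which is why exact scoring is usually paired with a filter or reserved for smaller datasets.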
Code Walkthrough
Installation
First install the required Python packages:
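# %pip is a Jupyter magic; from a shell, run the same command as "pip install ...".
# langchain-openai and langchain-text-splitters are imported by the examples below.
%pip install --upgrade --quiet opensearch-py langchain-community langchain-openai langchain-text-splitters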
Setting the API Key
The examples use OpenAI embeddings, so configure an OpenAI API key:
import getpass
import os
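# Prompt only when the key is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")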
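Loading Data and Creating Embeddings
The following example loads a text file, splits it into chunks, and creates the embedding model. The vectors themselves are computed later, when the chunks are indexed: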
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("path/to/your/document.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
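As a rough illustration of what chunk_size and chunk_overlap control, the sketch below cuts a string into fixed-width character windows. It is a simplified stand-in, not LangChain's algorithm: CharacterTextSplitter first splits on a separator and then merges the pieces, so its chunks will differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Simplified illustration only: consecutive windows share
    chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

A non-zero overlap repeats some text across neighbouring chunks, which can help when a relevant passage straddles a chunk boundary, at the cost of indexing more vectors.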
Approximate k-NN Search
Run an approximate k-NN search with the default parameters:
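# Index the chunks, then retrieve the 10 nearest neighbours of the query.
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200"
)
query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the chunk list in `docs` is not overwritten.
results = docsearch.similarity_search(query, k=10)
print(results[0].page_content)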
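Approximate k-NN Search with the FAISS Engine
To build the index with the FAISS engine and custom index parameters, pass them to from_documents: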
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="http://localhost:9200",
    engine="faiss",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)
query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the chunk list in `docs` is not overwritten.
results = docsearch.similarity_search(query)
print(results[0].page_content)
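The keyword arguments above describe how the vector index is built. The sketch below shows roughly the kind of index body they correspond to in the OpenSearch k-NN plugin. It is an illustration under assumptions, not the exact body LangChain sends: the field name vector_field and the dimension 1536 are assumed values, and the helper hnsw_index_body is hypothetical. In HNSW, m is the number of links kept per graph node and ef_construction is the size of the candidate list used while building the graph.

```python
import json


def hnsw_index_body(dimension, engine="faiss", space_type="innerproduct",
                    ef_construction=256, m=48, vector_field="vector_field"):
    """Build an index body with an HNSW-backed knn_vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                }
            }
        },
    }


# 1536 is an assumed embedding dimension; use the size your model produces.
body = hnsw_index_body(dimension=1536)
print(json.dumps(body, indent=2))
```

Larger values of m and ef_construction generally improve recall but increase memory use and indexing time, so they are worth tuning against your own data.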
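Script Scoring
To run an exact search with script scoring, disable approximate search (is_appx_search=False) when indexing and set search_type when querying: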
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200", is_appx_search=False
)
query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the chunk list in `docs` is not overwritten.
results = docsearch.similarity_search(
    query,
    k=1,
    search_type="script_scoring",
)
print(results[0].page_content)
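For readers who want to see what a script scoring request looks like at the OpenSearch level, the sketch below builds a script_score query that uses the k-NN plugin's knn_score script. It is a hedged illustration: the helper name, the default field name vector_field, and the default space type are assumptions, and the body LangChain generates may differ in detail.

```python
def script_score_query(query_vector, k=1, space_type="l2",
                       vector_field="vector_field", pre_filter=None):
    """Build an exact k-NN request body using the knn_score script."""
    return {
        "size": k,
        "query": {
            "script_score": {
                # Only documents matching this inner query are scored.
                "query": pre_filter or {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": vector_field,
                        "query_value": query_vector,
                        "space_type": space_type,
                    },
                },
            }
        },
    }


request_body = script_score_query([0.1, 0.2, 0.3], k=1, space_type="cosinesimil")
print(request_body)
```

Because every document matched by the inner query is scored, narrowing that query with a filter is the main way to keep exact search affordable.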
Painless Scripting
Run an exact search with a Painless script, restricting the scored documents with a pre-filter:
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200", is_appx_search=False
)
# Only documents whose text contains the term "smuggling" are scored.
filter = {"bool": {"filter": {"term": {"text": "smuggling"}}}}
query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the chunk list in `docs` is not overwritten.
results = docsearch.similarity_search(
    query,
    search_type="painless_scripting",
    space_type="cosineSimilarity",
    pre_filter=filter,
)
print(results[0].page_content)
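Maximal Marginal Relevance (MMR) Search
If you want more diverse results, use maximal marginal relevance search. It fetches fetch_k candidates and then picks k of them, balancing relevance to the query against similarity to the results already selected: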
query = "What did the president say about Ketanji Brown Jackson"
# Use a separate name so the chunk list in `docs` is not overwritten.
results = docsearch.max_marginal_relevance_search(
    query, k=2, fetch_k=10, lambda_param=0.5
)
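The selection step behind MMR can be sketched in plain Python as a greedy loop. This is a minimal illustration of the general technique with hypothetical toy vectors, not LangChain's implementation: at each step it picks the candidate with the best trade-off between similarity to the query and similarity to what has already been chosen, weighted by lambda_param.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k, lambda_param=0.5):
    """Greedy MMR over candidates, a list of (doc_id, vector) pairs."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest already-selected result (0 if none yet).
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


# Hypothetical candidates: "a" and "a2" are near duplicates, "b" is different.
candidates = [("a", [0.95, 0.31]), ("a2", [0.94, 0.34]), ("b", [0.8, -0.6])]
query_vec = [1.0, 0.0]
diverse = mmr(query_vec, candidates, k=2, lambda_param=0.5)
relevance_only = mmr(query_vec, candidates, k=2, lambda_param=1.0)
print(diverse, relevance_only)
```

With lambda_param=1.0 the redundancy term vanishes and the result is a plain relevance ranking; lower values push the selection toward results that differ from one another.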
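Using a Pre-existing OpenSearch Instance
If you already have an OpenSearch instance whose index contains vectors, connect to it directly and name the fields that hold the vector, the text, and the metadata: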
# The wildcard pattern matches every index whose name starts with "index-".
docsearch = OpenSearchVectorSearch(
    index_name="index-*",
    embedding_function=embeddings,
    opensearch_url="http://localhost:9200",
)
# The field names below must match the mapping of the existing index.
results = docsearch.similarity_search(
    "Who was asking about getting lunch today?",
    search_type="script_scoring",
    space_type="cosinesimil",
    vector_field="message_embedding",
    text_field="message",
    metadata_field="message_metadata",
)
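Using Amazon OpenSearch Service Serverless (AOSS)
The following example connects to Amazon OpenSearch Service Serverless (service name "aoss") and signs requests with AWS Signature Version 4: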
%pip install --upgrade --quiet boto3 requests requests-aws4auth
import boto3
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth
service = "aoss"
region = "us-east-2"
# Placeholder credentials; replace them with your own.
credentials = boto3.Session(
    aws_access_key_id="your-aws-access-key-id",
    aws_secret_access_key="your-aws-secret-access-key",
).get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    service,
    session_token=credentials.token,
)
# `docs` is the list of chunks produced by the text splitter.
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="your-opensearch-url",
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name="test-index-using-aoss",
    engine="faiss",
)
# `filter` is the query DSL filter defined in the Painless scripting example.
results = docsearch.similarity_search(
    "What is feature selection",
    efficient_filter=filter,
    k=200,
)
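Application Scenarios
The examples above cover both local deployments and cloud services such as Amazon OpenSearch Service. By choosing among the search types and engines, you can match the approach to the size of your dataset and the needs of your queries.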
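Practical Recommendations
- Use approximate k-NN search on large datasets to improve performance.
- Choose the k-NN engine (for example, FAISS) that fits your requirements.
- For scenarios involving sensitive data, make sure SSL and authentication are configured.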
If you run into problems, feel free to discuss them in the comments.