Semantic Vector Search: Tales from the Trenches


This blog post describes our recent experience implementing semantic vector search in a customer case. Semantic vector search is a way to spice up your search with some machine learning magic. We report on our findings about the advantages and disadvantages of this technique, and the accuracy gains you can expect.


TL;DR

Semantic vector search (also known as neural search or neural information retrieval) is a good technique to have in your toolbox. It will definitely help you squeeze out more performance points from your working solution, but it comes at a cost. Make sure you analyze carefully the trade-offs for your use case to see if this solution is worth the additional complexity.


Combining machine learning with traditional information retrieval algorithms seems to work better than each approach alone.


The customer

Our customer is one of the largest ship chandlers/ship suppliers in the world and coordinates global activities through regional centers in Europe, Far East, Middle East and North America. Their core business is to provide a 24/7/365 service for every marine, offshore and navy operation, including land operations. They are a full service provider, including handling of owners’ goods, shipping, airfreight and related marine services. Their mission is to make it easy for their customers to receive their supplies, wherever they are needed, efficiently and at the best possible price.


The problem

The company has a product catalogue with several thousand products. The users submit a “request for quote” (RFQ), which is like a “shopping list”: a list of items, each with a quantity and a textual description. Given this RFQ, the company has to match the textual description of each item in the list to the products available in their catalogue. Is there a way we could automate the process?


The promise of semantic vector search

So why would we want to implement semantic vector search at all? Well, classic search implementations start with keyword search. To improve search relevance, one starts to manually fine-tune the search ingestion pipeline (analyzers, token filtering, synonyms) or the search queries (boosts, query expansion, domain specific business rules, etc). But these approaches have their shortcomings, and scaling the manual process can become difficult.


Furthermore, one of the hardest challenges of implementing a good search solution is to understand what your users mean (query intent) and expect when searching. Advanced techniques like query classification, semantic query parsing, knowledge graphs, and personalization can help. Powering your search with AI techniques can help you both automate the relevance-tuning process (and thus scale) and better understand your users.


Semantic vector search is an example of AI powered search. It can enable you to encode documents (even in different languages), pictures, and videos in the same space, and let you search across these types. By encoding queries in the same space, it allows you to search by what you mean, and not only by what you type.


The approach

There are several ways to approach the problem. But we thought the simplest baseline we could produce was to index the product catalogue in a search engine, like Elasticsearch, define a good query function and look at the top result. From the historical data, we could build a benchmark dataset to measure the accuracy of the top result by using the users’ RFQs as queries. Building this dataset is a complex process, and deserves its own blog post.


Once we have that baseline, there are several ways to build on and improve the solution, like building a better signal from your data, using synonyms specialized for the domain in question, trying more complex queries in Elasticsearch and so on. But what we really wanted to find out is the best way to use Semantic Vector Search for this use case. More concretely:


  • Is it better to just use Semantic Vector Search without the traditional BM25/TF-IDF algorithms?


  • Or should we use the traditional BM25 and use Semantic Vector Search for rescoring the top X results for a query?

  • What are the performance gains for each approach? What are the costs of implementing each solution?


What is Semantic Vector Search?

Search, or information retrieval, is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need within large collections (usually stored on computers). Typically, a user enters a query representing their information need, defined by keywords. The query does not (usually) uniquely identify a single document, but rather several, with different degrees of relevance.


Documents are indexed into a searchable database, in order to retrieve them at query time. One way of indexing documents is by using a data structure called the inverted index.


Another way to do this is by embedding the documents into a vector space. You convert a document into a sequence of numbers (a vector) in a space with a certain structure, in a way that preserves the semantics of the data. The important bit is that documents with similar meaning are mapped to vectors close to each other. The Illustrated Word2vec is a nice reference if you want to grasp this transformation in more detail. We call these vectors the semantic vector representations of the given documents. There are of course several ways to produce embeddings from documents: you can encode subwords, words, sentences, paragraphs and whole documents. Choosing the correct embedding technique is highly important, and depends on your use case.


Semantic Vector Search then simply looks for the closest semantic vectors to a given query in the vector space. This requires that you also convert the query into a vector before you can perform the search.

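Stripped of the search-engine machinery, the nearest-neighbour lookup at the heart of semantic vector search can be sketched in a few lines. The toy 3-dimensional vectors below stand in for real model output; a sentence encoder would produce vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, k=3):
    """Brute-force nearest-neighbour search: score every document vector
    against the query and return the top-k (index, score) pairs."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: -s[1])[:k]

# Hypothetical document "embeddings":
docs = [
    [1.0, 0.0, 0.0],   # e.g. "paint brush"
    [0.9, 0.1, 0.0],   # e.g. "radiator brush"
    [0.0, 1.0, 0.0],   # e.g. "shampoo"
]
query = [1.0, 0.05, 0.0]
print(semantic_search(query, docs, k=2))  # the two brush-like documents rank first
```

Real systems replace the brute-force loop with an approximate nearest-neighbour index, but the scoring idea is exactly this.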

Testing semantic search

To test our hypothesis, we need to:


  1. Create a benchmarking dataset, Benchmark, from the historical records.

  2. Clean the data (lowercasing, removing stop words, removing clutter words).

  3. Remove duplicates. We ended up with 588,500 records. The fields are:

     - query: the textual description of the item from the user
     - category: the category of the item, Food or Technical.

  4. Add an embeddings field computed from the query field. We used the sentence-transformers roberta-large-nli-stsb-mean-tokens language model to transform the query field into a vector. This process can be time consuming (around 10 hours on 64-bit Linux with 8 CPU cores, 32 GB RAM and kernel version 5.3.0-62-generic).

  5. Create the product catalogue dataset, ProductCatalogue.

  6. Clean the product catalogue (lowercasing, removing stop words, removing duplicates). The fields are: product_id, description, category.

  7. Add an embeddings field computed from the description field, using the same language model.

  8. Index the product catalogue into Elasticsearch. To have some control over the reproducibility of the scores, set num_primary_shards to 1, num_replicas to 0, and sort the search results by the description field. You will still observe variations in the scores if you reindex your dataset, but they will be smaller.

  9. Define the queries for each approach: semantic vector search (SVS), the baseline, and the baseline with semantic vector search rescoring.
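For step 8, the index settings and mapping might look like the sketch below. The field names come from the steps above; the 1024-dimensional dense_vector matches the output size of roberta-large-nli-stsb-mean-tokens, and the official setting names in the Elasticsearch API are number_of_shards and number_of_replicas:

```json
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "product_id": {"type": "keyword"},
      "description": {"type": "text"},
      "category": {"type": "keyword"},
      "embeddings": {"type": "dense_vector", "dims": 1024}
    }
  }
}
```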

Semantic Vector Search (SVS)

Here we use Elasticsearch’s script_score query to implement the Semantic Vector Search. Using the Benchmark dataset, the query takes the following parameters:

  - _query_embeddings: Benchmark.embeddings
  - _query_sort_column: the column to sort results by.

{
   "query": {
     "script_score": {
       "query": {
         "match_all": {}
       },
       "script": {
         "source": "cosineSimilarity(params.queryVector, 'embeddings') + 1.0",
         "params": {
           "queryVector": _query_embeddings
         }
       }
     }
   },
   "sort": [
     {
       "_score": {
         "order": "desc"
       },
       _query_sort_column: {
         "order": "asc"
       }
     }
   ]
}
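In Python, the request body above can be assembled with a small helper before being passed to the client’s search call (a sketch; `svs_query` is our own helper name, not part of any library):

```python
def svs_query(query_embeddings, sort_column):
    """Build the script_score request body for semantic vector search.

    query_embeddings: the query vector produced by the sentence encoder.
    sort_column: column used to break score ties deterministically.
    """
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # The +1.0 shift keeps the score non-negative:
                    # cosineSimilarity ranges over [-1, 1] and Elasticsearch
                    # rejects negative script scores.
                    "source": "cosineSimilarity(params.queryVector, 'embeddings') + 1.0",
                    "params": {"queryVector": query_embeddings},
                },
            }
        },
        "sort": [
            {"_score": {"order": "desc"}, sort_column: {"order": "asc"}}
        ],
    }

body = svs_query([0.1, 0.2, 0.3], "description")
# e.g. es.search(index="product_catalogue", body=body)
```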

Baseline

To implement the baseline search, we use Elasticsearch’s multi_match query. Using the Benchmark dataset, the query takes the following parameters:

  - _query: Benchmark.query
  - _query_sort_column: the column to sort results by.

{
   "query": {
     "bool": {
       "should": [
         {
           "multi_match": {
             "query": _query,
             "fields": [
               "description"
             ]
           }
         }
       ]
     }
   },
   "sort": [
     {
       "_score": {
         "order": "desc"
       },
       _query_sort_column: {
         "order": "asc"
       }
     }
   ]
 }

This is in fact a boolean query wrapping a multi_match query; in our final implementation, we use several fields to build the relevance signal for search. The query shown here can be rewritten as a simple multi_match query.

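The simplified form, with the bool/should wrapper dropped and the same parameters as above, looks like this:

```json
{
   "query": {
     "multi_match": {
       "query": _query,
       "fields": ["description"]
     }
   },
   "sort": [
     {
       "_score": {"order": "desc"},
       _query_sort_column: {"order": "asc"}
     }
   ]
}
```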

Baseline with SVS rescoring (baseline_svs)

Here we use a multi_match query with a rescore clause, to recalculate the ranking of the baseline query using semantic vector search. Parameters:

  - _query: Benchmark.query
  - _query_embeddings: Benchmark.embeddings, used for the rescoring.

{
   "query": {
     "bool": {
       "should": [
         {
           "multi_match": {
             "query": _query,
             "fields": [
               "description"
             ]
           }
         }
       ]
     }
   },
   "rescore": {
     "window_size": 50,
     "query": {
       "rescore_query": {
         "script_score": {
           "query": {
             "match_all": {}
           },
           "script": {
             "source": "cosineSimilarity(params.queryVector, 'embeddings') + 1.0",
             "params": {
               "queryVector": _query_embeddings
             }
           }
         }
       },
       "query_weight": 1,
       "rescore_query_weight": 1.5,
       "score_mode": "multiply"
     }
   }
 }

We apply the baseline query, take the top 50 results and apply a rescore based on vector similarity search (this is not part of the open-source functionality of Elasticsearch; you need at least a Basic license to use this feature). Notice the “rescore_query_weight” parameter: we boost the score of the rescore query by a factor of 1.5 and multiply it with the previous score (via the “score_mode” parameter). The reason is that we want a big gap between the scores of the top results. One could certainly run a hyperparameter search to find better values for these parameters, but this requires a more complex setup in which you split your benchmark dataset into train/test parts in order to evaluate results correctly.

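To make the weighting concrete, here is the arithmetic Elasticsearch applies with score_mode set to multiply, as a sketch of the documented rescore formula (not code we ran in production):

```python
def rescored(bm25_score, cosine_sim,
             query_weight=1.0, rescore_query_weight=1.5):
    """Final score for a document inside the rescore window.

    The rescore script returns cosineSimilarity + 1.0, so it lies in [0, 2].
    Each component is weighted first, then the two are multiplied
    (score_mode = "multiply").
    """
    original = query_weight * bm25_score
    rescore = rescore_query_weight * (cosine_sim + 1.0)
    return original * rescore

# A strong BM25 hit with high semantic similarity pulls far ahead
# of one with low similarity:
print(rescored(2.0, 0.9))   # 2.0 * (1.5 * 1.9) ≈ 5.7
print(rescored(2.0, 0.1))   # 2.0 * (1.5 * 1.1) ≈ 3.3
```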

Example queries

Let’s look at an example query to get a feeling of the results we can expect with each technique. Let’s perform the query:


jordan radiator paint brush 50mm -2"


Where the correct hit is:


brush radiator angle (dog leg) 50 mm jordan qa


Here we present the top 10 results by the three techniques:


What we see from the results is that SVS failed to retrieve the correct document within the top 10. Baseline and baseline_svs behaved similarly for this query, with a few items changing the order in the top 5.


The results

The following table summarizes the results of the experiments. It was obtained by running each experiment five times and averaging the results. Each experiment consists of indexing the data in Elasticsearch, running the queries for each algorithm and calculating the statistics. We also show the standard deviation to give an idea of the spread of the values. We tried to make the results reproducible (deterministic sorting, merging all Lucene segments into one, testing across different machines), but in the end we still observed different results when reindexing the data into Elasticsearch. It is a known issue that scores are not reproducible.


From the tables we can make the following remarks:


  1. The variations between runs are small, and they diminish as the size of the top-results batch grows. We believe that few documents in our benchmark dataset have equal scores for a given query.

  2. Semantic Vector Search (SVS) alone performs worse than the simple baseline (Baseline). Empirical experience shows that semantic search has a higher false-positive probability than BM25. This effect is more extreme when the dataset consists of many unrelated documents, like the product catalogue we are working with.

  3. Adding a rescoring step with Semantic Vector Search (Baseline_svs) gives a solid increase in precision at k, for k = 1, 3, 5 and 10.

  4. SVS is quite costly, as currently implemented in Elasticsearch. Using annoy, faiss or milvus may improve performance (at the cost of functionality, since these tools only offer similarity search, and their APIs are not as rich as Elasticsearch’s). There is also a considerable cost to doing the rescoring on top of the baseline, depending on the size of the rescoring window, as Table 2 shows.
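The precision-at-k figures above are computed per query and averaged: since each benchmark query has exactly one correct product, the per-query value is simply whether the correct match appears in the top k. A sketch of the metric, with hypothetical product IDs:

```python
def precision_at_k(ranked_ids, correct_id, k):
    """1.0 if the single correct product is in the top-k results, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def mean_precision_at_k(runs, k):
    """Average over (ranked_ids, correct_id) pairs, one per benchmark query."""
    hits = [precision_at_k(ids, correct, k) for ids, correct in runs]
    return sum(hits) / len(hits)

# Two hypothetical benchmark queries: the first is matched at rank 2,
# the second has no correct match within the top 3.
runs = [
    (["p7", "p3", "p9"], "p3"),
    (["p1", "p2", "p4"], "p8"),
]
print(mean_precision_at_k(runs, k=1))  # 0.0
print(mean_precision_at_k(runs, k=3))  # 0.5
```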

We can conclude that, for this dataset at least, rescoring with Semantic Vector Search gives the best results.


One may wonder whether you can expect such performance gains when building a more complex solution. To test this idea, we also benchmarked the final solution we delivered to the customer against the same solution with an SVS rescoring step added. Here are the results:


As we can see, there is a significant increase in performance (the confidence intervals do not overlap), but the gains were not as dramatic as in the baseline case. We hypothesize that this is due to the optimizations already present in the final version. It is worth mentioning that we did not fine-tune the language model to this domain; this is something we could have tested with more time available.


A final remark about the benchmarking dataset: it is not possible to build a solution with 100% precision, because the historical data is noisy. An example is a query where “hygiene product nr 2” has “head & shoulders shampoo 200 ml” as its correct match. Further work is needed to curate the dataset down to queries that are meaningful, in the sense that they carry an information signal sufficient to find the correct match. Unfortunately, this requires much manual work and doesn’t scale.


Conclusions

Semantic Vector Search is a new and exciting technique that is starting to mature for production scenarios. Here are some of the lessons we learned through this customer case:


  • A simple search query goes a long way. Don’t overcomplicate your search queries from the start. Start with a simple baseline and increase complexity gradually.

  • Create a benchmark dataset. It will help you be data-driven in your performance optimizations. But always keep in mind which parts of the experience your benchmark is not covering.

  • Reproducibility is hard. Always reindex your data when benchmarking with Elasticsearch.

  • Semantic Vector Search is costly. It takes time to build the vector embeddings, and it takes time to serve the results (since incoming queries must be converted to vectors too). You need a good understanding of your latency constraints to scale up accordingly when serving. Changes in your ingestion pipeline will also take more time, since the embeddings must be recalculated.

  • Relevance tuning is hard with Semantic Vector Search. At the moment, it is hard to explain the ranking results of Semantic Vector Search, which matters when trying to fix relevance issues. Here is an example of how to explain the search results of plain BM25.

From our experiments with this dataset, the usual BM25 combined with Semantic Vector Search rescoring was the best-performing solution. This approach is relatively straightforward to add to your current search implementation. But should you do it? Well, as always, it depends! I guess it all comes down to the questions of:


  • How many expected RFQs will you win by using this technique?

  • How much performance gain are you willing to trade for the increase in complexity of your search pipeline?

  • How important is it for you to be able to explain the results of your search engine?

  • What kind of control do you need over the search relevancy for your users?

  • Can your users tolerate the added increase in query execution time?

Well, that sums it up. If you want to connect with me and discuss more about search or machine learning, don’t hesitate to reach out.


Acknowledgments

I would like to thank Josephine Honoré for helping me to reproduce and discuss the results. I also want to thank Morten Forfang, David Skålid Amundsen and an anonymous manager from our customer, for proof-reading and commenting on earlier drafts of this document.


Useful links

Translated from: https://medium.com/grensesnittet/semantic-vector-search-tales-from-the-trenches-fa8b61ea3680
