Sentence Embeddings with the sentence-transformers Library

Latest recommended article published on 2024-06-17 09:55:58

杨_明

Tags: natural language processing, NLP

Original article: https://medium.com/towards-artificial-intelligence/sentence-embeddings-with-sentence-transformers-library-7420fc6e3815

Sentence Embeddings

Data Science, Machine Learning

I came across the easy-to-use sentence-transformers library while recently implementing semantic search functionality. As part of that work, I had to index a dense vector representation of each document into Elasticsearch for the semantic search to work. With this library, I was able to implement this quickly and effectively. I hope you find this article helpful.

This article assumes knowledge of embeddings (word embeddings or sentence embeddings). You can refer to this article to quickly refresh your memory. If you already know about embeddings, you can continue reading.

Installation

pip install -U sentence-transformers

Usage

1. Sentence Embedding

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model_name_or_path')

In this example, we pass the pre-trained model distilbert-base-nli-stsb-mean-tokens to SentenceTransformer to compute sentence embeddings. The full list of pre-trained models is found here. Note that no single embedding works for every task, so it is worth trying several of these models and selecting the one that works best.

Note: sentence-transformers models are also hosted on the Hugging Face model hub, so we can use Hugging Face's Transformers library directly to generate sentence embeddings without installing the sentence-transformers library. The sample code is given here.
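The mean-tokens suffix in these model names refers to mean pooling: the sentence vector is the average of the token vectors, with padding positions left out. When the Transformers library is used directly, this pooling step has to be written by hand. Here is a minimal pure-Python sketch of that step only, assuming toy token vectors; the helper name mean_pool is hypothetical and does not come from either library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy input (made up for illustration): three real tokens, one padding position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```

The padding vector [9.0, 9.0] does not affect the result, which is the point of passing the attention mask into the pooling step.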

2. Semantic Textual Similarity

Now that we have understood how to generate the sentence embedding, the next step is to compare the sentences for semantic textual similarity and rank them based on the cosine similarity.

The models recommended for sentence similarity are listed below. They are trained on NLI and STS data and evaluated on the STSbenchmark dataset. The authors recommend distilbert-base-nli-stsb-mean-tokens because it offers a good balance between speed and performance.

roberta-large-nli-stsb-mean-tokens: STSb performance 86.39
roberta-base-nli-stsb-mean-tokens: STSb performance 85.44
bert-large-nli-stsb-mean-tokens: STSb performance 85.29
distilbert-base-nli-stsb-mean-tokens: STSb performance 85.16

Let’s look at an example of cosine similarity between the sentences we have used in the previous example:
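The original example is not reproduced in this copy, so here is a minimal pure-Python sketch of the idea instead. It assumes toy 2-dimensional vectors standing in for the output of the model; the helper names cosine_similarity and rank_pairs are hypothetical and are not the library's API.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_pairs(embeddings):
    """Score every pair (i, j) with i < j and sort by similarity, highest first."""
    scored = [
        (cosine_similarity(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(scored, reverse=True)


# Toy vectors (made up for illustration) standing in for real sentence embeddings.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

for score, i, j in rank_pairs(embeddings):
    print(f"sentence {i} vs sentence {j}: {score:.2f}")
```

Scoring every pair this way takes n(n-1)/2 comparisons for n sentences.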

This method uses a brute-force approach to find the highest-scoring pairs, which has quadratic complexity in the number of sentences. For long lists of sentences, it becomes infeasible. Paraphrase Mining, which is discussed next, is the better option in that case.

3. Paraphrase Mining

Paraphrase Mining is used when we need to deal with a large collection of sentences (10,000 and more). A more detailed explanation of Paraphrase Mining is found here.

Let’s look at an example using Paraphrase Mining:
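The original example is not reproduced in this copy. In the library, paraphrase mining is exposed as util.paraphrase_mining(model, sentences), which returns scored pairs of sentence indices. The sketch below does not reproduce that routine. It only illustrates one idea that helps with large collections: instead of storing every scored pair, keep a bounded min-heap of the best ones. The function name mine_top_pairs, the toy vectors, and the way max_pairs is used here are all assumptions for illustration.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def mine_top_pairs(embeddings, max_pairs=2):
    """Return the max_pairs most similar pairs as (score, i, j), best first."""
    unit = [normalize(v) for v in embeddings]
    heap = []  # min-heap: the weakest pair kept so far sits at heap[0]
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            score = sum(x * y for x, y in zip(unit[i], unit[j]))
            if len(heap) < max_pairs:
                heapq.heappush(heap, (score, i, j))
            elif score > heap[0][0]:
                # Better than the weakest kept pair: swap it in.
                heapq.heapreplace(heap, (score, i, j))
    return sorted(heap, reverse=True)


# Toy vectors (made up for illustration) standing in for real sentence embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]

for score, i, j in mine_top_pairs(embeddings, max_pairs=2):
    print(f"{i} <-> {j}: {score:.3f}")
```

This sketch still visits every pair, so its running time stays quadratic; only the memory used for results is bounded by max_pairs.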

4. Semantic Search

Traditional search engines were designed for lexical (keyword-based) search, whereas semantic search can also find documents that use synonyms of the query terms. Using the techniques covered above, we can implement semantic search functionality. Semantic search seeks to improve search accuracy by understanding the content of the search query.
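As a rough illustration, semantic search comes down to embedding the query with the same model as the documents and returning the documents whose vectors are closest to it. The sketch below assumes toy vectors in place of real embeddings; the documents, the vectors, and the helper name semantic_search are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, corpus_vectors, top_k=2):
    """Return (score, index) for the top_k corpus vectors closest to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(corpus_vectors)
    ]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical documents with toy vectors standing in for real embeddings.
documents = ["how to cook pasta", "stock market news", "easy noodle recipes"]
corpus_vectors = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
query_vector = [1.0, 0.2]  # stands in for the embedding of the search query

for score, index in semantic_search(query_vector, corpus_vectors, top_k=2):
    print(f"{documents[index]}: {score:.3f}")
```

With real embeddings, the third document could rank highly for a pasta query even though it shares no keyword with it, which is what a purely lexical search would miss.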

Semantic search is most commonly used in search engines such as Elasticsearch. If you have a basic understanding of Elasticsearch, go through this link to understand how semantic search can be implemented in Elasticsearch.
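To give a feel for what this involves, here is a hedged sketch of the two request bodies such a setup typically needs: an index mapping with a dense_vector field, and a script_score query that ranks documents by cosine similarity. The field names text and text_vector and the helper build_query are hypothetical, the 768 dimensions assume a DistilBERT-base model, and the exact script syntax depends on the Elasticsearch version, so check the documentation for the version you run.

```python
import json

# Assumption: 768 is the hidden size of DistilBERT-base; verify your model's output size.
EMBEDDING_DIMS = 768

# Index mapping: one text field plus one dense vector per document.
index_body = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "text_vector": {"type": "dense_vector", "dims": EMBEDDING_DIMS},
        }
    }
}


def build_query(query_vector, size=5):
    """Build a script_score query that ranks all documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # 1.0 is added because Elasticsearch scores must not be negative.
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


# A short toy vector is used here only to keep the printed output readable.
print(json.dumps(build_query([0.1, 0.2, 0.3], size=3), indent=2))
```

These are plain dictionaries, so they can be sent with whichever Elasticsearch client or HTTP library you already use.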

Conclusion

I hope you now understand how to use the sentence-transformers library to compute sentence embeddings, how to measure the similarity between sentences, and finally how to make use of sentence embeddings to implement semantic search.

Thank you for reading this article. You can reach me at https://www.linkedin.com/in/chetanambi/

Translated from: https://medium.com/towards-artificial-intelligence/sentence-embeddings-with-sentence-transformers-library-7420fc6e3815
