BERT Text Similarity Calculation
Getting Started
Introduction
Document similarity is one of the most crucial problems in NLP. Finding similarity across documents is used in several domains, such as recommending similar books and articles, identifying plagiarised documents, matching legal documents, etc.
We can call two documents similar if they are semantically similar and define the same concept or if they are duplicates.
To make machines figure out the similarity between documents, we need to define a way to measure the similarity mathematically, and the measure should be comparable so that a machine can tell us which documents are most similar and which are least similar. We also need to represent the text of documents in a quantifiable form (a mathematical object, usually a vector), so that we can perform similarity calculations on top of it.
So, converting a document into a mathematical object and defining a similarity measure are the two main steps required to make machines perform this exercise. We will look at different ways of doing this.
Similarity Function
Some of the most common and effective ways of calculating similarities are,
Cosine Distance/Similarity - It is the cosine of the angle between two vectors, which gives us the angular distance between the vectors. The formula to calculate the cosine similarity between two vectors A and B is:
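$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$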
In a two-dimensional space it will look like this,
You can easily work out the math and prove this formula using the law of cosines.
Cosine is 1 at theta = 0 and -1 at theta = 180, which means the cosine is highest for two overlapping vectors and lowest for two exactly opposite vectors. For this reason it is called a similarity measure, and you can treat 1 - cosine as a distance.
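As a quick illustration, here is a minimal NumPy sketch of this measure (NumPy is assumed here purely for convenience; libraries such as SciPy and scikit-learn provide equivalent functions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # points in the same direction as a

print(cosine_similarity(a, b))      # ~1.0: overlapping vectors
print(cosine_similarity(a, -b))     # ~-1.0: exactly opposite vectors
print(1 - cosine_similarity(a, b))  # ~0.0: cosine distance
```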
Euclidean Distance - This is the Minkowski distance with p = 2. It is defined as follows:
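$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$

(The general Minkowski distance is $\left(\sum_{i=1}^{n} |A_i - B_i|^p\right)^{1/p}$; setting p = 2 gives the Euclidean distance.)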
In two-dimensional space, Euclidean distance will look like this,
Jaccard Distance - The Jaccard Index is used to calculate the similarity between two finite sets. The Jaccard Distance can be taken as 1 - Jaccard Index.
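For two finite sets A and B, the Jaccard Index and the corresponding distance are:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)$$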
We can use the Cosine or Euclidean distance if we can represent documents in a vector space. The Jaccard Distance can be used if we treat our documents as just sets or collections of words, without any semantic meaning.
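For instance, a minimal sketch of the Jaccard similarity over word sets might look as follows (the whitespace tokenizer here is an assumption made purely for illustration):

```python
def jaccard_similarity(doc1: str, doc2: str) -> float:
    """Jaccard index over the sets of lowercased tokens in two documents."""
    set1 = set(doc1.lower().split())
    set2 = set(doc2.lower().split())
    return len(set1 & set2) / len(set1 | set2)

print(jaccard_similarity("the cat sat on the mat",
                         "the cat lay on the rug"))  # 3 shared / 7 total ≈ 0.43
```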
Cosine and Euclidean distance are the most widely used measures and we will use these two in our examples below.
Embeddings
Embeddings are vector representations of text in which words or sentences with similar meaning or context have similar representations.
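As a preview of what this looks like in practice, here is a minimal sketch using the sentence-transformers library (both the library and the model name all-MiniLM-L6-v2 are illustrative choices, not requirements of this article):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Encode two semantically similar sentences into embedding vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The cat sat on the mat.",
                           "A cat was sitting on a rug."])

# Cosine similarity between the two embeddings.
a, b = embeddings[0], embeddings[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # close to 1
```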