Cosine similarity in Python
What is cosine similarity?
Cosine similarity measures how similar two vectors are by calculating the cosine of the angle between them.
Cosine similarity is one of the most widely used similarity measures in data science. Its applications include finding similar documents in NLP, information retrieval, finding sequences similar to a given DNA sequence in bioinformatics, detecting plagiarism, and many more.
Cosine similarity is calculated as follows:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
Why does the cosine of the angle between A and B give us the similarity?
The cosine function is 1 at theta = 0° and -1 at theta = 180°. The cosine is therefore highest for two vectors that point in the same direction and lowest for two vectors that point in exactly opposite directions. You can treat 1 - cosine as a distance.
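To make this behaviour concrete, here is a minimal sketch in plain Python that rotates a unit vector away from a fixed reference vector and prints the similarity and the 1 - cosine distance at each angle. The helper name `cosine_similarity` and the chosen angles are illustrative assumptions, not part of the original article.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

reference = (1.0, 0.0)  # fixed vector along the x-axis
results = {}            # angle in degrees -> cosine similarity

for degrees in (0, 45, 90, 135, 180):
    theta = math.radians(degrees)
    # Unit vector rotated by theta away from the reference
    rotated = (math.cos(theta), math.sin(theta))
    sim = cosine_similarity(reference, rotated)
    results[degrees] = sim
    print(f"theta={degrees:3d}  similarity={sim:+.3f}  distance={1 - sim:.3f}")
```

The similarity falls from 1 towards -1 as the angle grows, so the 1 - cosine distance rises from 0 towards 2.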
How to calculate it in Python?
The numerator of the formula is the dot product of the two vectors, and the denominator is the product of their L2 norms. The dot product of two vectors is the sum of their element-wise products, and the L2 norm of a vector is the square root of the sum of the squares of its elements.
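The two building blocks can be written out directly from these definitions. The following is a small sketch in plain Python; the function names `dot_product`, `l2_norm`, and `cosine_similarity` are chosen here for illustration, and the zero-vector check is an added assumption, since the formula is undefined when either norm is zero.

```python
import math

def dot_product(u, v):
    """Sum of the element-wise products of u and v."""
    return sum(x * y for x, y in zip(u, v))

def l2_norm(u):
    """Square root of the sum of the squared elements of u."""
    return math.sqrt(sum(x * x for x in u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two L2 norms."""
    denominator = l2_norm(u) * l2_norm(v)
    if denominator == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(u, v) / denominator

A = [7, 3]
B = [3, 7]
print(dot_product(A, B))        # 7*3 + 3*7 = 42
print(l2_norm(A) * l2_norm(B))  # sqrt(58) * sqrt(58), about 58
print(cosine_similarity(A, B))  # 42 / 58, roughly 0.724
```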
We can either use NumPy's built-in functions to compute the dot product and the L2 norms and plug them into the formula, or call cosine_similarity from sklearn.metrics.pairwise directly. For two vectors A and B in 2-D, the following code calculates the cosine similarity:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Consider two vectors A and B in 2-D
A = np.array([7, 3])
B = np.array([3, 7])

# Draw both vectors as arrows from the origin
ax = plt.axes()
ax.arrow(0.0, 0.0, A[0], A[1], head_width=0.4, head_length=0.5)
plt.annotate(f"A({A[0]},{A[1]})", xy=(A[0], A[1]), xytext=(A[0] + 0.5, A[1]))
ax.arrow(0.0, 0.0, B[0], B[1], head_width=0.4, head_length=0.5)
plt.annotate(f"B({B[0]},{B[1]})", xy=(B[0], B[1]), xytext=(B[0] + 0.5, B[1]))
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.show()
plt.close()

# Cosine similarity between A and B, straight from the formula
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"Cosine Similarity between A and B:{cos_sim}")
print(f"Cosine Distance between A and B:{1 - cos_sim}")

# Using sklearn, which expects 2-D arrays of shape (n_samples, n_features)
cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
cos_dist = cosine_distances(A.reshape(1, -1), B.reshape(1, -1))
print(f"Cosine Similarity between A and B:{cos_sim}")
print(f"Cosine Distance between A and B:{cos_dist}")

# Using scipy, which takes 1-D vectors and returns the distance, 1 - cosine
print(f"Cosine Distance between A and B:{distance.cosine(A, B)}")
Proof of the formula
The cosine similarity formula can be proved using the law of cosines.
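Consider two vectors A and B in two dimensions, with theta the angle between them. Together with the difference vector A - B, they form a triangle. Applying the law of cosines to that triangle and expanding the squared length of A - B in terms of the components of A and B yields the formula.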
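One way to write out this argument is sketched below. The component names $a_1, a_2, b_1, b_2$ are notation introduced here for illustration and are not taken from the original article.

```latex
\begin{align*}
% Notation (assumed): A = (a_1, a_2), B = (b_1, b_2), angle \theta between them.
% Law of cosines on the triangle with sides A, B and A - B:
\lVert A - B \rVert^2 &= \lVert A \rVert^2 + \lVert B \rVert^2
                        - 2 \lVert A \rVert \lVert B \rVert \cos\theta \\
% Expanding the left-hand side in components:
\lVert A - B \rVert^2 &= (a_1 - b_1)^2 + (a_2 - b_2)^2 \\
                      &= (a_1^2 + a_2^2) + (b_1^2 + b_2^2) - 2 (a_1 b_1 + a_2 b_2) \\
                      &= \lVert A \rVert^2 + \lVert B \rVert^2 - 2 \, A \cdot B \\
% Equating the two expressions and cancelling the common terms:
A \cdot B &= \lVert A \rVert \lVert B \rVert \cos\theta \\
\cos\theta &= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
\end{align*}
```

The last step divides by the two norms, so it assumes that neither A nor B is the zero vector.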
The same proof works in three dimensions, or in any number of dimensions, and follows exactly the same steps as above.
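As a quick numerical sanity check of the higher-dimensional claim, the sketch below compares the cosine obtained from the dot-product formula with the cosine obtained by solving the law of cosines for the angle, using two 5-dimensional vectors. The vectors and helper names are arbitrary choices made for illustration; a numerical check of this kind supports the claim for one example but is not a proof.

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    """L2 norm of u."""
    return math.sqrt(dot(u, u))

# Two arbitrary 5-dimensional vectors
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [5.0, 3.0, 1.0, -1.0, -3.0]
difference = [a - b for a, b in zip(A, B)]

# Cosine from the cosine similarity formula
cos_from_formula = dot(A, B) / (norm(A) * norm(B))

# Cosine from the law of cosines: |A-B|^2 = |A|^2 + |B|^2 - 2|A||B|cos(theta)
cos_from_law = (norm(A) ** 2 + norm(B) ** 2 - norm(difference) ** 2) / (
    2 * norm(A) * norm(B)
)

print(cos_from_formula, cos_from_law)
print(math.isclose(cos_from_formula, cos_from_law, abs_tol=1e-12))
```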
Summary
We saw how cosine similarity works, how to use it, and why it works. I hope this article helped you understand the concept behind this widely used metric.