Cosine Similarity: How Does It Measure Similarity, the Maths Behind It, and Usage in Python

What is cosine similarity?

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between the two vectors.

Cosine similarity is one of the most widely used similarity measures in data science. It is used in many applications, such as finding similar documents in NLP, information retrieval, finding sequences similar to a given DNA sequence in bioinformatics, detecting plagiarism, and many more.

Cosine similarity is calculated as follows:

Figure: Angle between two 2-D vectors A and B (image by author)
Figure: Calculation of the cosine of the angle between A and B
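
The formula image is not reproduced here. As a sketch of what it most likely showed, this is the standard definition of cosine similarity for two n-dimensional vectors, consistent with the description of the numerator and denominator later in this article. The component notation A_i, B_i is an assumption made for illustration.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```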

Why does the cosine of the angle between A and B give us the similarity?

If you look at the cosine function, it is 1 at theta = 0° and -1 at theta = 180°. That means the cosine is highest for two vectors pointing in the same direction and lowest for two vectors pointing in exactly opposite directions. You can treat 1 - cosine as a distance, which then ranges from 0 (same direction) to 2 (opposite directions).

Figure: The cosine function (image by author)
Figure: Values of cosine at different angles (image by author)
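
To make the behaviour of the cosine concrete, here is a minimal sketch that tabulates the cosine and 1 - cosine for a few angles. The choice of angles and the helper name cosine_table are assumptions made for illustration; they are not taken from the original figures.

```python
import math

# Angles in degrees; an illustrative selection between 0 and 180.
angles_deg = [0, 45, 90, 135, 180]


def cosine_table(angles):
    """Return (angle, cosine, 1 - cosine) rows for each angle given in degrees."""
    rows = []
    for deg in angles:
        cos_val = math.cos(math.radians(deg))
        rows.append((deg, cos_val, 1.0 - cos_val))
    return rows


table = cosine_table(angles_deg)
for deg, cos_val, dist in table:
    print(f"theta = {deg:3d} deg  cosine = {cos_val:+.3f}  1 - cosine = {dist:.3f}")
```

The similarity falls steadily from 1 to -1 as the angle opens from 0° to 180°, while the distance 1 - cosine grows from 0 to 2.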

How to calculate it in Python?

The numerator of the formula is the dot product of the two vectors, and the denominator is the product of their L2 norms. The dot product of two vectors is the sum of their element-wise products, and the L2 norm is the square root of the sum of the squares of a vector's elements.
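
The two definitions above can be sketched in plain Python without any library. The function names dot, l2_norm and cosine_similarity_manual are hypothetical helpers written for this illustration, and the zero-vector check is an added assumption, since the formula is undefined when either norm is 0.

```python
import math


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(u, v))


def l2_norm(u):
    """Square root of the sum of squared elements."""
    return math.sqrt(sum(x * x for x in u))


def cosine_similarity_manual(u, v):
    """Dot product divided by the product of the two L2 norms."""
    denom = l2_norm(u) * l2_norm(v)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / denom


A = [7, 3]
B = [3, 7]
print(dot(A, B))                       # 7*3 + 3*7 = 42
print(l2_norm(A) ** 2)                 # close to 7**2 + 3**2 = 58
print(cosine_similarity_manual(A, B))  # close to 42 / 58
```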

We can either use built-in functions from the NumPy library to calculate the dot product and the L2 norms of the vectors and plug them into the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Consider two vectors A and B in 2-D; the following code calculates the cosine similarity:

import numpy as np
import matplotlib.pyplot as plt

# consider two vectors A and B in 2-D
A = np.array([7, 3])
B = np.array([3, 7])

# draw both vectors as arrows from the origin
ax = plt.axes()
ax.arrow(0.0, 0.0, A[0], A[1], head_width=0.4, head_length=0.5)
plt.annotate(f"A({A[0]},{A[1]})", xy=(A[0], A[1]), xytext=(A[0] + 0.5, A[1]))
ax.arrow(0.0, 0.0, B[0], B[1], head_width=0.4, head_length=0.5)
plt.annotate(f"B({B[0]},{B[1]})", xy=(B[0], B[1]), xytext=(B[0] + 0.5, B[1]))
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.show()
plt.close()

# cosine similarity between A and B
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"Cosine Similarity between A and B:{cos_sim}")
print(f"Cosine Distance between A and B:{1-cos_sim}")
Figure: Code output (image by author)
# using sklearn to calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# sklearn expects 2-D inputs of shape (n_samples, n_features), hence the reshape
cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
print(f"Cosine Similarity between A and B:{cos_sim}")
print(f"Cosine Distance between A and B:{1-cos_sim}")
Figure: Code output (image by author)
# using scipy; distance.cosine returns 1 - cosine similarity
# it takes 1-D vectors (recent SciPy versions reject 2-D input), so no reshape here
from scipy.spatial import distance
distance.cosine(A, B)
Figure: Code output (image by author)

Proof of the formula

The cosine similarity formula can be proved using the law of cosines:

Figure: Law of cosines (image by author)

Consider two vectors A and B in two dimensions, such as:

Figure: Two 2-D vectors (image by author)

Using the law of cosines:

Figure: Cosine similarity using the law of cosines (image by author)
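
The derivation images are not reproduced here, so the following is a sketch of how the argument can go rather than the author's exact steps. It assumes the notation A = (a_1, a_2) and B = (b_1, b_2), and applies the law of cosines to the triangle with sides ‖A‖, ‖B‖ and ‖A - B‖, where theta is the angle between A and B.

```latex
\begin{aligned}
\lVert A - B \rVert^2
  &= \lVert A \rVert^2 + \lVert B \rVert^2
     - 2\,\lVert A \rVert\,\lVert B \rVert \cos\theta
  && \text{(law of cosines)} \\
\lVert A - B \rVert^2
  &= (a_1 - b_1)^2 + (a_2 - b_2)^2 \\
  &= (a_1^2 + a_2^2) + (b_1^2 + b_2^2) - 2\,(a_1 b_1 + a_2 b_2) \\
  &= \lVert A \rVert^2 + \lVert B \rVert^2 - 2\,(A \cdot B) \\
\Rightarrow\quad A \cdot B
  &= \lVert A \rVert\,\lVert B \rVert \cos\theta \\
\Rightarrow\quad \cos\theta
  &= \frac{A \cdot B}{\lVert A \rVert\,\lVert B \rVert}
\end{aligned}
```

Equating the two expressions for ‖A - B‖² cancels the squared norms on both sides, which leaves the dot product equal to the product of the norms times cos(theta).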

You can prove the same for three dimensions, or for any number of dimensions in general. The proof follows exactly the same steps as above.
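
As a numerical sanity check rather than a proof, the sketch below compares the cosine obtained from the dot-product formula with the cosine obtained from the law of cosines, for vectors in 3 and 5 dimensions. The example vectors and the function names are hypothetical, chosen only for illustration.

```python
import math


def norm(u):
    """L2 norm of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in u))


def cos_from_dot(u, v):
    """cos(theta) from the dot-product formula."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))


def cos_from_law_of_cosines(u, v):
    """cos(theta) from the triangle with sides |u|, |v| and |u - v|."""
    diff = [x - y for x, y in zip(u, v)]
    a, b, c = norm(u), norm(v), norm(diff)
    return (a * a + b * b - c * c) / (2 * a * b)


# Hypothetical example vectors in 3 and 5 dimensions.
pairs = [
    ([1.0, 2.0, 3.0], [4.0, -1.0, 2.0]),
    ([1.0, 0.0, 2.0, -3.0, 5.0], [2.0, 1.0, -1.0, 0.5, 4.0]),
]
for u, v in pairs:
    print(len(u), cos_from_dot(u, v), cos_from_law_of_cosines(u, v))
```

Both columns should agree up to floating-point rounding, which is what the algebra predicts in any number of dimensions.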

Summary

We saw how cosine similarity works, how to use it, and why it works. I hope this article helped you understand the whole concept behind this powerful metric.

Translated from: https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db
