1. Minkowski Distance
Applicable conditions:
- The values along each dimension are continuous
- The Minkowski distance does not account for differing scales (units) across dimensions, so all values should carry the same meaning when computing similarity
Depending on the value of p, the Minkowski distance specializes to each of the distances below.
LaTeX: y=\left( \sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{p}} \right)^{\frac{1}{p}}
Python implementation:
import numpy as np
import pandas as pd

def minkowski_distance(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def minkowski_distance_2(x, y, p):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'minkowski', p=p)[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(minkowski_distance(df.x, df.y, p=5))
    print(minkowski_distance_2(df.x, df.y, p=5))
    # p=1 gives the Manhattan distance, p=2 the Euclidean distance
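As a quick sanity check on the two special cases (p=1 gives the Manhattan distance, p=2 the Euclidean distance), the general formula can be compared against direct computations; the vectors here are made up purely for illustration:

```python
import numpy as np

def minkowski_distance(x, y, p):
    # generalized Minkowski distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# p=1 reduces to the Manhattan distance: sum of absolute differences
assert np.isclose(minkowski_distance(x, y, 1), np.sum(np.abs(x - y)))
# p=2 reduces to the Euclidean distance: square root of summed squares
assert np.isclose(minkowski_distance(x, y, 2), np.sqrt(np.sum((x - y) ** 2)))
print(minkowski_distance(x, y, 1))  # → 5.0
```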
p=1: Manhattan Distance

[Figure: the yellow and orange paths both realize the Manhattan distance; the red and blue paths are equivalent Manhattan paths; the green line is the Euclidean distance]
Intuition: the minimum distance from X to Y when only horizontal and vertical moves are allowed.
LaTeX: y=\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}
Python implementation:
import pandas as pd
import numpy as np

def manhattan_distance(x, y):
    return np.sum(np.abs(x - y))

def manhattan_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'cityblock')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(manhattan_distance(df.x, df.y))
    print(manhattan_distance_2(df.x, df.y))
p=2: Euclidean Distance
As shown by the green line in the figure above, the Euclidean distance squares the coordinate differences, sums them, and takes the square root; this is the straight-line distance from x to y.
Intuition: the straight-line distance between X and Y.
LaTeX: y=\sqrt{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{2}}}
Python implementation:
import pandas as pd
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum(np.square(x - y)))

def euclidean_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X)[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(euclidean_distance(df.x, df.y))
    print(euclidean_distance_2(df.x, df.y))
Standardized Euclidean Distance
scipy doc: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.seuclidean.html
LaTeX: D=\sqrt{\sum_{i=1}^{n}{\frac{\left( x_{i}-y_{i} \right)^{2}}{V_{i}}}}, where V_{i} is the sample variance of the i-th dimension (the sk in the code below)
Python implementation:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

def standardized_euclidean_distance(x, y):
    # per-dimension sample variance (ddof=1, matching scipy's default)
    sk = np.var(np.vstack([x, y]), axis=0, ddof=1)
    return np.sqrt(((x - y) ** 2 / sk).sum())

def standardized_euclidean_distance_2(x, y):
    return pdist(np.vstack([x, y]), 'seuclidean')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(10, 2)), columns=['x', 'y'])
    print(standardized_euclidean_distance(df.x, df.y))
    print(standardized_euclidean_distance_2(df.x, df.y))
p→∞: Chebyshev Distance

Intuition: the minimum distance from X to Y when horizontal, vertical, and diagonal moves are all allowed.
LaTeX: y=\max _{i}\left( \left| x_{i}-y_{i} \right| \right)
Python implementation:
import numpy as np
import pandas as pd

def chebyshev_distance(x, y):
    return np.max(np.abs(x - y))

def chebyshev_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'chebyshev')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(chebyshev_distance(df.x, df.y))
    print(chebyshev_distance_2(df.x, df.y))
2. Cosine Similarity
Intuition: the angle between two vectors in a high-dimensional space.
LaTeX: y=\frac{\sum_{i=1}^{n}{\left( x_{i}\times y_{i} \right)}}{\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}}\times \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}}
Python implementation:
import numpy as np
import pandas as pd

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_similarity_2(x, y):
    from scipy.spatial.distance import pdist
    return 1 - pdist(np.vstack([x, y]), 'cosine')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(cosine_similarity(df.x, df.y))
    print(cosine_similarity_2(df.x, df.y))
Adjusted Cosine Similarity
An example:
- User A rates harshly: movie 1 put them to sleep, 1 point; movie 2 is a great film, 3 points; movie 3 is mediocre, 2 points.
- User B rates generously: movie 1 also put them to sleep, 4 points; movie 2 is a great film, 5 points; movie 3 is mediocre, 4.5 points.
- A and B actually have similar tastes, yet cosine_similarity([1, 3, 2], [4, 5, 4.5]) evaluates to about 0.95, not the expected 1. With only three dimensions the gap is small; with more dimensions the deviation grows.
import numpy as np
from sklearn.preprocessing import StandardScaler

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sim_acs(x, y):
    # standardize each rating vector first, then compute cosine similarity
    # (StandardScaler expects 2-D input, so reshape to a column and back)
    stdsc = StandardScaler()
    x = stdsc.fit_transform(np.asarray(x, dtype=float).reshape(-1, 1)).ravel()
    y = stdsc.fit_transform(np.asarray(y, dtype=float).reshape(-1, 1)).ravel()
    return cosine_similarity(x, y)

if __name__ == '__main__':
    print(sim_acs([1, 3, 2], [4, 5, 4.5]))  # ≈ 1.0, as expected
3. Pearson Correlation Coefficient
This coefficient measures the degree of linear correlation between two data sets. Its range is [-1, 1]: values above 0 indicate positive correlation (1 is perfect positive linear correlation); values below 0 indicate negative correlation (-1 is perfect negative linear correlation).
LaTeX: y=\frac{\mathrm{Cov}\left( X,Y \right)}{\sqrt{D\left( X \right)}\sqrt{D\left( Y \right)}}
Python implementation:
import numpy as np
import pandas as pd

def pearson_correlation(x, y):
    x_mean = x - np.mean(x)
    y_mean = y - np.mean(y)
    return np.dot(x_mean, y_mean) / (np.linalg.norm(x_mean) * np.linalg.norm(y_mean))

def pearson_correlation_2(x, y):
    X = np.vstack([x, y])
    return np.corrcoef(X)[0][1]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(pearson_correlation(df.x, df.y))
    print(pearson_correlation_2(df.x, df.y))
4. Mahalanobis Distance
LaTeX: D_{M}\left( X,Y \right)=\sqrt{\left( X-Y \right)^{T}S^{-1}\left( X-Y \right)}, where S is the covariance matrix of the data and S^{-1} its inverse
Python implementation:
import numpy as np
import pandas as pd

def mahalanobis_distance(x, y):
    X = np.vstack([x, y])
    XT = X.T
    S = np.cov(X)  # covariance matrix between the two dimensions
    SI = np.linalg.inv(S)  # inverse of the covariance matrix
    n = XT.shape[0]
    d1 = []
    # Mahalanobis distance between every pair of observations
    for i in range(0, n):
        for j in range(i + 1, n):
            delta = XT[i] - XT[j]
            d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
            d1.append(d)
    return d1

def mahalanobis_distance_2(x, y):
    from scipy.spatial.distance import pdist
    return pdist(np.vstack([x, y]).T, 'mahalanobis')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(mahalanobis_distance(df.x, df.y))
    print(mahalanobis_distance_2(df.x, df.y))
5. Jaccard Distance
The Jaccard similarity coefficient of two sets A and B is the proportion of their intersection within their union; the Jaccard distance is 1 minus the Jaccard similarity coefficient.
LaTeX: J\left( A,B \right)=\frac{\left| A\cup B \right|-\left| A\cap B \right|}{\left| A\cup B \right|}
Python implementation:
from scipy.spatial.distance import pdist
import pandas as pd
import numpy as np

def jaccard_distance(x, y):
    # mismatching positions among those where either value is nonzero
    up = np.double(np.bitwise_and((x != y), np.bitwise_or(x != 0, y != 0)).sum())
    down = np.double(np.bitwise_or(x != 0, y != 0).sum())
    return up / down

def jaccard_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'jaccard')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 20, size=(3, 2)), columns=['x', 'y'])
    print(jaccard_distance(df.x, df.y))
    print(jaccard_distance_2(df.x, df.y))
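Since the Jaccard coefficient is defined in terms of sets, a direct set-based sketch may make the definition clearer than the vector form; the sets and the helper name here are made up for illustration:

```python
def jaccard_distance_sets(a: set, b: set) -> float:
    # (|A ∪ B| - |A ∩ B|) / |A ∪ B|
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are identical
    return (len(union) - len(a & b)) / len(union)

A = {1, 2, 3, 4}
B = {3, 4, 5}
# intersection {3, 4} has 2 elements, union {1, 2, 3, 4, 5} has 5
print(jaccard_distance_sets(A, B))  # → 0.6
```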
6. Bray–Curtis Distance
Applies when the values of X and Y are non-negative.
Meaning: commonly used in ecology and environmental science to give a rough estimate of the dissimilarity between samples. It is the sum of the absolute differences between X and Y divided by the sum of all values of X and Y. The range is [0, 1]; the closer to 0, the more similar the samples.
LaTeX: y=\frac{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}}{\sum_{i=1}^{n}{x_{i}}+\sum_{i=1}^{n}{y_{i}}}
Python implementation:
import numpy as np
from scipy.spatial.distance import pdist
import pandas as pd

def bray_curtis_distance(x, y):
    up = np.sum(np.abs(y - x))
    down = np.sum(x) + np.sum(y)
    return up / down

def bray_curtis_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'braycurtis')[0]

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(1, 2)), columns=['x', 'y'])
    print(bray_curtis_distance(df.x, df.y))
    print(bray_curtis_distance_2(df.x, df.y))
7. Spearman's Rank Correlation Coefficient
Meaning: the Pearson correlation coefficient is strongly affected by outliers, whereas the Spearman coefficient works on ranks, which removes part of that influence, so it applies more broadly than Pearson's. A drawback is that differences show up only as rank differences: with very little data the squared rank differences carry little information and the coefficient performs poorly.
Like the Pearson coefficient, its range is [-1, 1]: values above 0 indicate positive correlation, values below 0 negative correlation, and the closer to 1, the stronger the correlation.
YouTube video explanation: https://www.youtube.com/watch?v=DE58QuNKA-c
LaTeX: D=1-\frac{6\sum_{i=1}^{n}{d_{i}^{2}}}{n^{3}-n}
Python implementation:
import pandas as pd
import numpy as np

def spearman_rank_correlation(x, y):
    from scipy.stats import spearmanr
    r, p = spearmanr(x, y)
    return r, p

def spearman_rank_correlation_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('spearman')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(spearman_rank_correlation(df.x, df.y)[0])  # [0] is the coefficient, [1] the p-value
    print(spearman_rank_correlation_2(df))
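The closed-form rank formula D = 1 - 6Σd²/(n³-n) can also be applied directly and checked against scipy; it is exact only when there are no ties, and the data and helper name below are arbitrary tie-free examples:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_from_ranks(x, y):
    # D = 1 - 6 * sum(d_i^2) / (n^3 - n), d_i = rank difference at position i
    n = len(x)
    d = rankdata(x) - rankdata(y)
    return 1 - 6 * np.sum(d ** 2) / (n ** 3 - n)

x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 25, 22, 48, 51])
# ranks of y are [1, 3, 2, 4, 5], so sum(d^2) = 2 and D = 1 - 12/120 = 0.9
print(spearman_from_ranks(x, y))  # → 0.9
print(spearmanr(x, y)[0])         # should match
```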
8. Kendall's Rank Correlation Coefficient
LaTeX: \tau =\frac{n_{c}-n_{d}}{\frac{1}{2}n\left( n-1 \right)}, where n_{c} and n_{d} are the numbers of concordant and discordant pairs (the tie-free form; scipy's kendalltau additionally applies a tie correction)
Python implementation:
import pandas as pd
import numpy as np

def kendall_correlation_coefficient(x, y):
    from scipy.stats import kendalltau
    return kendalltau(x, y)[0]

def kendall_correlation_coefficient_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('kendall')

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(kendall_correlation_coefficient(df.x, df.y))
    print(kendall_correlation_coefficient_2(df))
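As a cross-check on scipy's result, Kendall's tau can be computed naively by counting concordant and discordant pairs; this O(n²) sketch assumes no ties, and the data and helper name are made up for illustration:

```python
from scipy.stats import kendalltau

def kendall_tau_naive(x, y):
    # tau = (concordant - discordant) / (n * (n - 1) / 2), no-ties form
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1  # pair ordered the same way in x and y
            elif s < 0:
                discordant += 1  # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 5, 4]
# 7 concordant and 3 discordant pairs out of 10: tau = (7 - 3) / 10
print(kendall_tau_naive(x, y))  # → 0.4
print(kendalltau(x, y)[0])      # should match (no ties here)
```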
9. Edit Distance and Hamming Distance
For details see: https://blog.csdn.net/weixin_35757704/article/details/115439449
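The linked post covers these in depth; as a brief sketch, here are the standard textbook versions (not taken from that post): Levenshtein edit distance via dynamic programming, and Hamming distance for equal-length strings:

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def hamming(a: str, b: str) -> int:
    # number of positions at which two equal-length strings differ
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

print(levenshtein('kitten', 'sitting'))  # → 3
print(hamming('karolin', 'kathrin'))     # → 3
```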
References
- SciPy distance computations docs: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#distance-computations-scipy-spatial-distance
- A summary of common distance measures: https://zhuanlan.zhihu.com/p/58819850
- Measuring Distance: https://github.com/Chris3606/GoRogue/wiki/Measuring-Distance
- Python implementations of various distance formulas: https://blog.csdn.net/xc_zhou/article/details/81535033
- The Mahalanobis distance in metric learning: https://www.jianshu.com/p/5706a108a0c6