A Survey of Vector Distances (Continuous and Discrete Values), with LaTeX and Python Implementations

1. Minkowski Distance

Applicable when:

  1. The values in each dimension are continuous
  2. The Minkowski distance ignores differences in scale between dimensions, so when computing similarity all values should carry the same meaning (or be normalized first)

The Minkowski distance is really a family of metrics: depending on the value of p, it reduces to any of the distances below.
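Condition 2 can be made concrete: because the Minkowski distance treats every dimension alike, changing the units of one feature can flip which of two neighbors is "closer". A minimal sketch with made-up points (the `minkowski` helper and data are illustrative only):

```python
import numpy as np


def minkowski(x, y, p):
    # generalized Minkowski distance
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)


# Both features on the same scale: a is the nearer neighbor of q
q, a, b = [0, 0], [1, 2], [3, 1]
print(minkowski(q, a, 2) < minkowski(q, b, 2))  # True

# Express the second feature in different units (x100):
# the ranking flips, driven entirely by the unit change
q2, a2, b2 = [0, 0], [1, 200], [3, 100]
print(minkowski(q2, a2, 2) < minkowski(q2, b2, 2))  # False
```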

LaTeX: y=\left( \sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{p}} \right)^{\frac{1}{p}}

Python implementation:

import numpy as np
import pandas as pd


def minkowski_distance(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)


def minkowski_distance_2(x, y, p):
    from scipy.spatial.distance import pdist

    X = np.vstack([x, y])
    return pdist(X, 'minkowski', p=p)[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(minkowski_distance(df.x, df.y, p=5))
    print(minkowski_distance_2(df.x, df.y, p=5))
    # p=1 gives the Manhattan distance, p=2 the Euclidean distance
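
The comment above can be verified numerically: with p=1 the Minkowski formula collapses to the Manhattan sum, and with p=2 to the Euclidean root (the vectors here are illustrative):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 1.0, 5.0])

minkowski_p1 = np.sum(np.abs(x - y) ** 1) ** 1.0  # p = 1
manhattan = np.sum(np.abs(x - y))
minkowski_p2 = np.sum(np.abs(x - y) ** 2) ** 0.5  # p = 2
euclidean = np.sqrt(np.sum((x - y) ** 2))

print(np.isclose(minkowski_p1, manhattan))  # True
print(np.isclose(minkowski_p2, euclidean))  # True
```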

p=1: Manhattan Distance

(Figure: the yellow and orange paths both represent Manhattan distances; the red and blue paths are equivalent Manhattan distances; the green line is the Euclidean distance)

Intuition: the minimum distance from X to Y when X may only move horizontally and vertically

LaTeX: y=\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}

Python implementation:

import pandas as pd
import numpy as np


def manhattan_distance(x, y):
    return np.sum(np.abs(x - y))


def manhattan_distance_2(x, y):
    from scipy.spatial.distance import pdist

    X = np.vstack([x, y])
    return pdist(X, 'cityblock')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(manhattan_distance(df.x, df.y))
    print(manhattan_distance_2(df.x, df.y))

p=2: Euclidean Distance

As the green line in the figure above shows, the Euclidean distance is the square root of the sum of squared coordinate differences, i.e. the straight-line distance from x to y

Intuition: the straight-line distance between X and Y

LaTeX: y=\sqrt{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|^{2}}}

Python implementation:

import pandas as pd
import numpy as np


def euclidean_distance(x, y):
    return np.sqrt(np.sum(np.square(x - y)))


def euclidean_distance_2(x, y):
    from scipy.spatial.distance import pdist

    X = np.vstack([x, y])
    return pdist(X)[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(euclidean_distance(df.x, df.y))
    print(euclidean_distance_2(df.x, df.y))

Standardized Euclidean Distance

scipy doc:https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.seuclidean.html

LaTeX: D=\sqrt{\sum_{i=1}^{n}{\left( \frac{x_{i}-y_{i}}{S_{i}} \right)^{2}}}, where S_{i} is the standard deviation of dimension i

Python implementation:

import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist


def standardized_euclidean_distance(x, y):
    sk = np.var(np.vstack([x, y]), axis=0, ddof=1)
    return np.sqrt(((x - y) ** 2 / sk).sum())


def standardized_euclidean_distance_2(x, y):
    return pdist(np.vstack([x, y]), 'seuclidean')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(10, 2)), columns=['x', 'y'])
    print(standardized_euclidean_distance(df.x, df.y))
    print(standardized_euclidean_distance_2(df.x, df.y))
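
The 'seuclidean' metric divides each squared coordinate difference by that dimension's variance; equivalently, it is the plain Euclidean distance after rescaling every coordinate by its standard deviation. A small check on synthetic data (illustrative only):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five 3-dimensional points as rows

V = np.var(X, axis=0, ddof=1)          # per-dimension sample variance
d_se = pdist(X, 'seuclidean', V=V)[0]  # distance between points 0 and 1

# Equivalent: ordinary Euclidean distance after dividing each coordinate by its std
Z = X / np.sqrt(V)
d_eq = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))
print(np.isclose(d_se, d_eq))  # True
```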

p→∞: Chebyshev Distance

Intuition: the minimum distance from X to Y when X may move up, down, left, right, or diagonally

LaTeX: y=\max_{i}\left( \left| x_{i}-y_{i} \right| \right)

Python implementation:

import numpy as np
import pandas as pd


def chebyshev_distance(x, y):
    return np.max(np.abs(x - y))


def chebyshev_distance_2(x, y):
    from scipy.spatial.distance import pdist
    X = np.vstack([x, y])
    return pdist(X, 'chebyshev')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(chebyshev_distance(df.x, df.y))
    print(chebyshev_distance_2(df.x, df.y))
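
The "moves in all eight directions" intuition is exactly how a chess king moves, which gives a quick sanity check (the helper below is an illustrative re-implementation):

```python
import numpy as np


def chebyshev(x, y):
    return np.max(np.abs(np.asarray(x) - np.asarray(y)))


# A chess king moves one square in any direction, including diagonals;
# the minimum number of moves between two squares is the Chebyshev distance.
start, target = (0, 0), (3, 5)
print(chebyshev(start, target))  # 5: three diagonal moves, then two straight ones
```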

2. Cosine Similarity

Intuition: the angle between two vectors in a high-dimensional space

LaTeX: y=\frac{\sum_{i=1}^{n}{\left( x_{i}\times y_{i} \right)}}{\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}}\times \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}}

Python implementation:

import numpy as np
import pandas as pd


def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def cosine_similarity_2(x, y):
    from scipy.spatial.distance import pdist
    return 1 - pdist(np.vstack([x, y]), 'cosine')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(cosine_similarity(df.x, df.y))
    print(cosine_similarity_2(df.x, df.y))
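
Because cosine similarity depends only on the angle between the vectors, it is invariant to positive rescaling, a property worth verifying directly:

```python
import numpy as np


def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


x = np.array([1.0, 2.0, 3.0])
# Rescaling a vector does not change its direction, so the similarity stays put:
print(np.isclose(cosine_similarity(x, 10 * x), 1.0))  # True: same direction
print(np.isclose(cosine_similarity(x, -x), -1.0))     # True: opposite direction
```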

Adjusted Cosine Similarity

An example:

  • User A rates harshly: movie 1 put them to sleep, 1 point; movie 2 is an excellent film, 3 points; movie 3 is mediocre, 2 points;
  • User B rates generously: movie 1 also put them to sleep, 4 points; movie 2 is excellent, 5 points; movie 3 is mediocre, 4.5 points;
  • A and B actually have very similar taste, yet cosine_similarity([1, 3, 2], [4, 5, 4.5]) comes out to about 0.95 rather than the expected 1. That is with only three dimensions; with more dimensions the bias would grow.
import numpy as np


def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def sim_acs(x, y):
    # standardize each vector first (subtract its mean, divide by its std),
    # then compute the ordinary cosine similarity
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return cosine_similarity(x, y)


if __name__ == '__main__':
    print(cosine_similarity([1, 3, 2], [4, 5, 4.5]))  # ~0.95: biased by rating habits
    print(sim_acs([1, 3, 2], [4, 5, 4.5]))            # ~1.0: the bias is removed

3. Pearson Correlation Coefficient

This coefficient measures the degree of linear correlation between two data series. Its range is [-1, 1]: a value above 0 indicates positive correlation (1 means perfect positive linear correlation); a value below 0 indicates negative correlation (-1 means perfect negative linear correlation).

LaTeX: y=\frac{\mathrm{Cov}\left( X,Y \right)}{\sqrt{D\left( X \right)}\sqrt{D\left( Y \right)}}, where Cov(X, Y) is the covariance and D(X), D(Y) the variances

import numpy as np
import pandas as pd


def pearson_correlation(x, y):
    x_centered = x - np.mean(x)
    y_centered = y - np.mean(y)
    return np.dot(x_centered, y_centered) / (np.linalg.norm(x_centered) * np.linalg.norm(y_centered))


def pearson_correlation_2(x, y):
    X = np.vstack([x, y])
    return np.corrcoef(X)[0][1]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(pearson_correlation(df.x, df.y))
    print(pearson_correlation_2(df.x, df.y))
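
The mean-centered implementation above works because the Pearson correlation is exactly the cosine similarity of the two vectors after subtracting their means, which can be checked on random data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = rng.normal(size=50)

# Pearson correlation == cosine similarity of the mean-centered vectors
xc, yc = x - x.mean(), y - y.mean()
cos_centered = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(np.isclose(cos_centered, np.corrcoef(x, y)[0, 1]))  # True
```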

4. Mahalanobis Distance

LaTeX: D_{M}\left( X,Y \right)=\sqrt{\left( X-Y \right)^{T}S^{-1}\left( X-Y \right)}, where S is the covariance matrix of the data (so S^{-1} is its inverse)

Python implementation:

import numpy as np
import pandas as pd


def mahalanobis_distance(x, y):
    X = np.vstack([x, y])
    XT = X.T
    S = np.cov(X)  # covariance matrix between the two variables
    SI = np.linalg.inv(S)  # inverse of the covariance matrix
    n = XT.shape[0]
    d1 = []
    for i in range(0, n):
        for j in range(i + 1, n):
            delta = XT[i] - XT[j]
            d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
            d1.append(d)
    return d1


def mahalanobis_distance_2(x, y):
    from scipy.spatial.distance import pdist
    return pdist(np.vstack([x, y]).T, 'mahalanobis')


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(mahalanobis_distance(df.x, df.y))
    print(mahalanobis_distance_2(df.x, df.y))
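
One way to build intuition (an illustrative check, not from the original): when the covariance matrix is the identity, i.e. uncorrelated dimensions with unit variance, the Mahalanobis distance reduces to the ordinary Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 5.0])

# VI is the *inverse* covariance matrix; the inverse of the identity is itself
VI = np.eye(3)
print(np.isclose(mahalanobis(u, v, VI), euclidean(u, v)))  # True
```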

5. Jaccard Distance

The Jaccard similarity coefficient is the fraction of the union of two sets A and B that is occupied by their intersection; the Jaccard distance is 1 minus the Jaccard similarity coefficient

LaTeX: J\left( A,B \right)=\frac{\left| A\cup B \right|-\left| A\cap B \right|}{\left| A\cup B \right|}

Python implementation:

from scipy.spatial.distance import pdist
import pandas as pd
import numpy as np


def jaccard_distance(x, y):
    # scipy's definition: among positions where at least one value is non-zero,
    # the fraction where the two values disagree
    up = np.double(np.bitwise_and((x != y), np.bitwise_or(x != 0, y != 0)).sum())
    down = np.double(np.bitwise_or(x != 0, y != 0).sum())
    return up / down


def jaccard_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'jaccard')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 20, size=(3, 2)), columns=['x', 'y'])
    print(jaccard_distance(df.x, df.y))
    print(jaccard_distance_2(df.x, df.y))
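
For actual Python sets, rather than the vector encoding scipy uses, the set formula can be written directly (a small illustrative helper, not part of scipy):

```python
# J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|
def jaccard_distance_sets(a, b):
    a, b = set(a), set(b)
    union = len(a | b)
    return (union - len(a & b)) / union


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(jaccard_distance_sets(A, B))  # 0.6: the union has 5 elements, the intersection 2
```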

6. Bray-Curtis Distance

Applicable when the values of X and Y are non-negative

Meaning: commonly used in ecology and environmental science to roughly estimate the dissimilarity between samples. It is computed as the sum of |x_i - y_i| divided by the sum of all values of X and Y. Its range is [0, 1]; the closer to 0, the smaller the difference between the samples.

LaTeX: y=\frac{\sum_{i=1}^{n}{\left| x_{i}-y_{i} \right|}}{\sum_{i=1}^{n}{x_{i}}+\sum_{i=1}^{n}{y_{i}}}

Python implementation:

import numpy as np
from scipy.spatial.distance import pdist
import pandas as pd


def bray_curtis_distance(x, y):
    up = np.sum(np.abs(y - x))
    down = np.sum(x) + np.sum(y)
    d1 = (up / down)
    return d1


def bray_curtis_distance_2(x, y):
    X = np.vstack([x, y])
    return pdist(X, 'braycurtis')[0]


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(1, 2)), columns=['x', 'y'])
    print(bray_curtis_distance(df.x, df.y))
    print(bray_curtis_distance_2(df.x, df.y))
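
The two endpoints of the [0, 1] range are easy to verify: identical samples give 0, and samples with no overlapping non-zero positions give 1 (the data here is illustrative):

```python
import numpy as np
from scipy.spatial.distance import braycurtis

x = np.array([10.0, 0.0, 5.0])
y = np.array([8.0, 2.0, 5.0])

print(braycurtis(x, x))                  # 0.0: identical samples
print(braycurtis([1, 0, 0], [0, 2, 3]))  # 1.0: completely disjoint samples
print(0.0 <= braycurtis(x, y) <= 1.0)    # True, given non-negative data
```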

7. Spearman's Rank Correlation Coefficient

Meaning: the Pearson coefficient can be strongly affected by outliers, whereas the Spearman coefficient works on ranks, which removes much of that influence, so it applies in more situations than the Pearson coefficient. The drawback is that the relationship is expressed through differences in rank: with very little data the squared rank differences carry little information, and the coefficient performs poorly.

Like the Pearson coefficient, its range is [-1, 1]: above 0 means positive correlation, below 0 negative, and the closer to 1 in absolute value, the stronger the relationship

Video walkthrough on YouTube: https://www.youtube.com/watch?v=DE58QuNKA-c

LaTeX: D=1-\frac{6\sum_{i=1}^{n}{d_{i}^{2}}}{n^{3}-n}

Here d_{i} is the difference between the rank of x_{i} and the rank of y_{i}, and n is the number of observations.

import pandas as pd
import numpy as np


def spearman_rank_correlation(x, y):
    from scipy.stats import spearmanr
    r, p = spearmanr(x, y)
    return r, p


def spearman_rank_correlation_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('spearman')


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(spearman_rank_correlation(df.x, df.y)[0])  # index 0 is the coefficient, index 1 the p-value
    print(spearman_rank_correlation_2(df))
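
The rank-based behaviour shows up clearly on a monotone but nonlinear relationship (illustrative data): Spearman still scores 1 because the ranks agree perfectly, while Pearson drops below 1 because the relationship is not linear:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1.0, 11.0)
y = x ** 3  # monotone, but not linear

print(spearmanr(x, y)[0])       # 1.0: the ranks match exactly
print(pearsonr(x, y)[0] < 1.0)  # True: nonlinearity lowers Pearson
```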

8. Kendall Rank Correlation Coefficient

LaTeX (tau-a, the tie-free form): \tau=\frac{P-Q}{\frac{1}{2}n\left( n-1 \right)}

Here P is the number of concordant pairs and Q the number of discordant pairs among the n(n-1)/2 pairs of observations. scipy's kendalltau computes the tie-corrected tau-b variant, which coincides with tau-a when there are no ties.

import pandas as pd
import numpy as np


def kendall_correlation_coefficient(x, y):
    from scipy.stats import kendalltau
    return kendalltau(x, y)[0]


def kendall_correlation_coefficient_2(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.corr('kendall')


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 50, size=(100, 2)), columns=['x', 'y'])
    print(kendall_correlation_coefficient(df.x, df.y))
    print(kendall_correlation_coefficient_2(df))
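
scipy's kendalltau can be cross-checked against a direct count of concordant and discordant pairs; the toy data below has no ties, so the tie-free tau-a computation matches scipy's tau-b:

```python
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]

# Count concordant vs discordant pairs by brute force
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n = len(x)
tau_manual = (concordant - discordant) / (n * (n - 1) / 2)
print(np.isclose(tau_manual, kendalltau(x, y)[0]))  # True
```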

9. Edit Distance and Hamming Distance

For details, see: https://blog.csdn.net/weixin_35757704/article/details/115439449
