pdist, squareform
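The scipy.spatial.distance module provides two functions, pdist and squareform. pdist computes the pairwise distances between samples (Euclidean by default), and squareform converts the result into a square distance matrix.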
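(Aside)
SciPy: built on NumPy; provides ready-made routines that compute results directly, wrapping higher-level abstractions and physical models.
NumPy: stores and processes large arrays and matrices far more efficiently than Python's own nested lists; its core is implemented in C.
Pandas: a tool built on NumPy, created for data-analysis tasks.
Reference: https://www.jianshu.com/p/32cb09d84487
(Back to the topic)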
1. pdist, squareform usage example
pdist and squareform operate on NumPy arrays:
>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> x=np.array([[1,1,1],[2,2,2],[4,4,4]]) # three 3-dimensional vectors: x1=[1,1,1], x2=[2,2,2], x3=[4,4,4]
>>> Dis=pdist(x)
>>> Dis # d(x1,x2)=sqrt(3)≈1.73, d(x1,x3)=sqrt(27)≈5.20, d(x2,x3)=sqrt(12)≈3.46
array([1.73205081, 5.19615242, 3.46410162])
>>> D=squareform(Dis)
>>> D
array([[0.        , 1.73205081, 5.19615242],   # d(x1,x1), d(x1,x2), d(x1,x3)
       [1.73205081, 0.        , 3.46410162],   # d(x2,x1), d(x2,x2), d(x2,x3)
       [5.19615242, 3.46410162, 0.        ]])  # d(x3,x1), d(x3,x2), d(x3,x3)
Because a distance metric is symmetric, i.e. $d(x_1,x_2)=d(x_2,x_1)$, the matrix above is symmetric.
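The relationship between the condensed vector and the square matrix can be checked directly. The sketch below reuses the same three samples and assumes only the documented behaviour that squareform converts in both directions; the variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([[1, 1, 1], [2, 2, 2], [4, 4, 4]])
condensed = pdist(x)            # length n*(n-1)/2 = 3, pair order: (0,1), (0,2), (1,2)
square = squareform(condensed)  # 3x3 symmetric matrix with a zero diagonal

# squareform also converts back: a square distance matrix becomes the condensed vector
round_trip = squareform(square)

# The condensed vector is the strict upper triangle of the square matrix, read row by row
upper = square[np.triu_indices(len(x), k=1)]

print(np.allclose(round_trip, condensed))
print(np.allclose(upper, condensed))
```

Storing only the condensed vector roughly halves the memory, since the lower triangle and the zero diagonal carry no extra information.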
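2. Implementing pdist and squareform with basic matrix arithmetic
Given three 3-dimensional samples x1=[1,1,1], x2=[2,2,2], x3=[4,4,4], the square matrix of distances between the samples is: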
$$D=\begin{bmatrix} d(x_1,x_1) & d(x_1,x_2) & d(x_1,x_3)\\ d(x_2,x_1) & d(x_2,x_2) & d(x_2,x_3)\\ d(x_3,x_1) & d(x_3,x_2) & d(x_3,x_3)\end{bmatrix}$$
Treating each sample as a row vector, the squared Euclidean distance expands as:
$$d(x,y)^2=xx^T+yy^T-2xy^T$$
So the matrix of squared distances, written $D^{(2)}$ here with entries $d(x_i,x_j)^2$, is:
$$D^{(2)}=\begin{bmatrix} x_1x_1^T+x_1x_1^T-2x_1x_1^T & x_1x_1^T+x_2x_2^T-2x_1x_2^T & x_1x_1^T+x_3x_3^T-2x_1x_3^T\\ x_2x_2^T+x_1x_1^T-2x_2x_1^T & x_2x_2^T+x_2x_2^T-2x_2x_2^T & x_2x_2^T+x_3x_3^T-2x_2x_3^T\\ x_3x_3^T+x_1x_1^T-2x_3x_1^T & x_3x_3^T+x_2x_2^T-2x_3x_2^T & x_3x_3^T+x_3x_3^T-2x_3x_3^T\end{bmatrix}$$
$$=\begin{bmatrix} x_1x_1^T & x_1x_1^T & x_1x_1^T\\ x_2x_2^T & x_2x_2^T & x_2x_2^T\\ x_3x_3^T & x_3x_3^T & x_3x_3^T\end{bmatrix}+\begin{bmatrix} x_1x_1^T & x_1x_1^T & x_1x_1^T\\ x_2x_2^T & x_2x_2^T & x_2x_2^T\\ x_3x_3^T & x_3x_3^T & x_3x_3^T\end{bmatrix}^T-2\begin{bmatrix} x_1x_1^T & x_1x_2^T & x_1x_3^T\\ x_2x_1^T & x_2x_2^T & x_2x_3^T\\ x_3x_1^T & x_3x_2^T & x_3x_3^T\end{bmatrix}$$
The matrix of squared norms,
$$\begin{bmatrix} x_1x_1^T & x_1x_1^T & x_1x_1^T\\ x_2x_2^T & x_2x_2^T & x_2x_2^T\\ x_3x_3^T & x_3x_3^T & x_3x_3^T\end{bmatrix}$$
is obtained by multiplying the sample matrix (one sample per row) with itself element-wise, summing each row, and copying the resulting column across all columns.
The matrix of inner products is the sample matrix multiplied by its own transpose:
$$\begin{bmatrix} x_1x_1^T & x_1x_2^T & x_1x_3^T\\ x_2x_1^T & x_2x_2^T & x_2x_3^T\\ x_3x_1^T & x_3x_2^T & x_3x_3^T\end{bmatrix}=\begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}\begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}^T$$
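To make the two building blocks concrete, the sketch below evaluates them for the three samples used in this post. The names sq_norms, norm_matrix and gram are illustrative choices, and np.tile is just one way to do the column copy.

```python
import numpy as np

X = np.array([[1, 1, 1], [2, 2, 2], [4, 4, 4]])
n = X.shape[0]

# Squared norm of every sample as a column, then copied across n columns
sq_norms = (X * X).sum(axis=1, keepdims=True)   # shape (3, 1): [[3], [12], [48]]
norm_matrix = np.tile(sq_norms, (1, n))         # entry (i, j) is x_i x_i^T

# Gram matrix: entry (i, j) is the inner product x_i x_j^T
gram = X @ X.T

print(norm_matrix)
print(gram)
print(norm_matrix + norm_matrix.T - 2 * gram)   # squared distances: 3, 27, 12 off the diagonal
```

The off-diagonal values 3, 27 and 12 are the squares of the distances sqrt(3), sqrt(27) and sqrt(12) returned by pdist in section 1.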
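Implementation:
import numpy as np

X=np.array([[1,1,1],[2,2,2],[4,4,4]])  # same samples as in section 1
# squared norm of each row, copied across the columns: X2[i,j] = x_i x_i^T
X2=(X*X).sum(1, keepdims=True)*np.ones([3,3])
XXT=np.matmul(X,X.T)  # inner products: XXT[i,j] = x_i x_j^T
D2=X2+X2.T-2*XXT  # squared distances
D=np.sqrt(D2)  # distances
print (D)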
# Output
[[0.         1.73205081 5.19615242]
 [1.73205081 0.         3.46410162]
 [5.19615242 3.46410162 0.        ]]
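For data that is not made of small integers, floating-point rounding can make the squared distances slightly negative or leave a nonzero diagonal. The following is a minimal sketch of a more defensive variant, assuming that clipping at zero and resetting the diagonal are acceptable; pairwise_euclidean is a hypothetical helper name, not a SciPy function, and the random samples are only there for the comparison.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def pairwise_euclidean(X):
    """Hypothetical helper: Euclidean distance matrix via the inner-product expansion."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X * X).sum(axis=1)                       # shape (n,)
    # Broadcasting adds sq_norms[i] + sq_norms[j] without building an explicit ones matrix
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)                          # clip tiny negative rounding errors
    np.fill_diagonal(D2, 0.0)                            # d(x_i, x_i) is exactly zero
    return np.sqrt(D2)


rng = np.random.default_rng(0)
samples = rng.normal(size=(5, 3))
D_manual = pairwise_euclidean(samples)
D_scipy = squareform(pdist(samples))
print(np.allclose(D_manual, D_scipy))
```

The expansion is convenient because it reduces to one matrix product, but it can lose precision when two samples are very close together, so comparing against pdist on real data is a reasonable sanity check.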
**Note:** the matrix above holds distances. In practice, check whether your application needs the squared distance or the distance itself.
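If squared distances are what you need, pdist can return them directly through its metric argument. A short sketch on the same three samples, using the "euclidean" and "sqeuclidean" metric names that SciPy documents:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 1, 1], [2, 2, 2], [4, 4, 4]])

D = squareform(pdist(X, metric="euclidean"))      # distances
D2 = squareform(pdist(X, metric="sqeuclidean"))   # squared distances

print(D2)
print(np.allclose(D ** 2, D2))
```

Asking for the squared metric up front avoids taking a square root only to square the result again.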