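Goal: given n row vectors

$$\begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix}$$

compute the Pearson correlation coefficient between every pair of them. The simplest approach is two nested for loops that handle one pair at a time. That is fine while n is small, but once n gets large (around 100,000 is typical), the loops cost a lot of time. In scientific computing, use matrix operations instead of loops wherever you can, because loops are too slow.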
PS: how are you supposed to show off if your code doesn't even contain a matrix operation? Haha.
==== I'm a divider who just wants to write code in peace ====
The Pearson correlation coefficient of two vectors has a simple computational formula:
$$\rho_{x,y}= \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})\cdot (y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\cdot\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}$$
The corresponding code, written straight from the formula:
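# Method of the DataHelper class used in the check further down.
# Pearson coefficient of two sequences. The absolute value is returned,
# so a larger value means a stronger correlation.
def get_distance(self, vector1, vector2):
    num1 = vector1 - np.average(vector1)
    num2 = vector2 - np.average(vector2)
    num = np.sum(num1 * num2)
    den = np.sqrt(np.sum(np.power(num1, 2)) * np.sum(np.power(num2, 2)))
    # A constant vector makes the denominator 0; return 0.0 in that case
    if den == 0:
        return 0.0
    return np.abs(num / den)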
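As a sanity check on the formula, the sketch below evaluates it with plain NumPy and compares the result with NumPy's built-in `np.corrcoef`. The function name `pearson` and the sample vectors are made up for illustration. Unlike `get_distance`, it keeps the sign of the coefficient.

```python
import numpy as np

def pearson(x, y):
    """Pearson coefficient of two 1-D arrays, straight from the formula."""
    dx = x - x.mean()          # x_i - mean(x)
    dy = y - y.mean()          # y_i - mean(y)
    den = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if den == 0:               # constant vector: return 0.0, as get_distance does
        return 0.0
    return float(np.sum(dx * dy) / den)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x, perfectly correlated
z = np.array([4.0, 3.0, 2.0, 1.0])   # x reversed, perfectly anti-correlated

print(pearson(x, y))                 # close to 1.0
print(pearson(x, z))                 # close to -1.0
print(np.corrcoef(x, y)[0, 1])       # NumPy's own result, for comparison
```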
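Now let's rewrite the formula above with matrices. Subtract from each row vector $e_{i}$ the mean of its own elements, $\bar{e}_{i}$: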
$$e-\bar{e} = \tilde{e} = \begin{bmatrix}e_{1} - \bar{e}_{1}\\ e_{2} - \bar{e}_{2}\\ \vdots\\ e_{n} - \bar{e}_{n}\end{bmatrix}$$

$$\tilde{e}=\begin{bmatrix}\tilde{e}_{1}\\ \tilde{e}_{2}\\ \vdots\\ \tilde{e}_{n}\end{bmatrix}$$
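In NumPy, this centering step is a single broadcast subtraction. A minimal sketch on a made-up 2x3 array `e`; `keepdims=True` keeps the means as a column, so each row has its own mean subtracted.

```python
import numpy as np

e = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

row_means = e.mean(axis=1, keepdims=True)   # shape (2, 1): one mean per row (2 and 30)
e_tilde = e - row_means                     # broadcasts across the columns

print(row_means.shape)          # (2, 1)
print(e_tilde)                  # rows become [-1, 0, 1] and [-20, -10, 30]
print(e_tilde.sum(axis=1))      # every centered row sums to 0
```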
Then, with $'$ denoting the transpose:
$$\tilde{e}\cdot {\tilde{e}}'=\begin{bmatrix} \tilde{e}_{1}^{2} & \tilde{e}_{1}\cdot\tilde{e}_{2} & \cdots & \tilde{e}_{1}\cdot\tilde{e}_{n}\\ \tilde{e}_{2}\cdot\tilde{e}_{1} & \tilde{e}_{2}^{2} & \cdots & \tilde{e}_{2}\cdot\tilde{e}_{n} \\ \vdots & \vdots & \ddots & \vdots\\ \tilde{e}_{n}\cdot\tilde{e}_{1} & \tilde{e}_{n}\cdot\tilde{e}_{2} & \cdots & \tilde{e}_{n}^{2} \end{bmatrix}$$
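To see that one matrix product really yields every pairwise numerator at once, the sketch below compares entries of the product with explicit dot products of the centered rows. The array `e` is made up, and `gram` is just an illustrative name.

```python
import numpy as np

e = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0],
              [3.0, 1.0, 2.0]])
e_tilde = e - e.mean(axis=1, keepdims=True)   # center every row

gram = e_tilde @ e_tilde.T                    # shape (3, 3)

# Entry (i, j) is the dot product of centered rows i and j.
print(np.isclose(gram[0, 1], np.dot(e_tilde[0], e_tilde[1])))     # True
# The diagonal holds each row's sum of squared deviations.
print(np.allclose(np.diag(gram), np.sum(e_tilde ** 2, axis=1)))   # True
# The product is symmetric: entry (i, j) equals entry (j, i).
print(np.allclose(gram, gram.T))                                  # True
```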
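This is the numerator matrix. Looking at the denominator, the sums of squares it needs are exactly the diagonal entries of this matrix, $\tilde{e}_{i}^{2} = \tilde{e}_{i}\cdot\tilde{e}_{i}$.
Let: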
$$dot = \begin{bmatrix}\tilde{e}_{1}^{2}\\ \tilde{e}_{2}^{2}\\ \vdots\\ \tilde{e}_{n}^{2}\end{bmatrix}$$
Then the denominator is (the square root is taken element-wise):
$$\sqrt{dot\cdot {dot}'}$$
So, dividing element-wise:
$$\rho_{x,y}=\frac{\tilde{e}\cdot {\tilde{e}}'}{\sqrt{dot\cdot {dot}'}}$$
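The whole derivation fits in a few lines of NumPy. This is only a sketch: `pairwise_pearson` is a hypothetical name, and `np.outer` plays the role of $dot\cdot dot'$. The result is compared with `np.corrcoef`, which treats each row as one variable.

```python
import numpy as np

def pairwise_pearson(e):
    """Pearson coefficients between all pairs of rows of a 2-D array."""
    e_tilde = e - e.mean(axis=1, keepdims=True)   # center each row
    num = e_tilde @ e_tilde.T                     # numerator matrix
    dot = np.diag(num)                            # sums of squares, one per row
    den = np.sqrt(np.outer(dot, dot))             # element-wise sqrt of dot * dot'
    return num / den                              # element-wise division

a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)

rho = pairwise_pearson(a)
print(np.round(rho, 4))
print(np.allclose(rho, np.corrcoef(a)))   # True
```

Note that a constant row would make its entry in `dot` zero, so this sketch assumes every row has some variation.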
The code, written from the formula:
def get_pairwise_distances(self, embeddings):
    """
    Compute the Pearson correlation coefficients between embedding vectors.
    Written against the TensorFlow 1.x API (tf.diag_part, tf.Session).
    Args:
        embeddings: tensor of shape (batch_size, embed_dim)
    Returns:
        pairwise_distances: array of shape (batch_size, batch_size)
    """
    avg_vec = tf.reduce_mean(embeddings, axis=1)
    # Center each row so that its mean E(x) = 0
    nomal_embed = embeddings - tf.expand_dims(avg_vec, 1)
    # Matrix of sum((x-avg(x))*(y-avg(y))) for every pair, i.e. the numerator matrix
    dot_product = tf.matmul(nomal_embed, tf.transpose(nomal_embed))
    # Denominator: sqrt(sum((x-avg(x))^2) * sum((y-avg(y))^2))
    square_norm = tf.diag_part(dot_product)
    square_norm = tf.matmul(tf.expand_dims(square_norm, 1), tf.expand_dims(square_norm, 0))
    distance = dot_product / tf.sqrt(square_norm)
    # Evaluate the graph, then close the session
    with tf.Session() as sess:
        return sess.run(distance)
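One difference from `get_distance` is worth noting: for a constant row the diagonal entry is 0, so the division above yields NaN, whereas `get_distance` returns 0.0. A possible guard, sketched in NumPy rather than TensorFlow; the name `safe_pairwise_pearson` is hypothetical, and returning 0 for undefined pairs simply follows the convention of `get_distance`.

```python
import numpy as np

def safe_pairwise_pearson(e):
    """Pairwise Pearson matrix with 0.0 wherever the denominator is 0."""
    e_tilde = e - e.mean(axis=1, keepdims=True)
    num = e_tilde @ e_tilde.T
    sq = np.diag(num)
    den = np.sqrt(np.outer(sq, sq))
    out = np.zeros_like(num)
    # Divide only where den is non-zero; all other entries stay at 0.0.
    np.divide(num, den, out=out, where=den != 0)
    return out

e = np.array([[1.0, 2.0, 3.0],
              [5.0, 5.0, 5.0],     # constant row: zero variance
              [3.0, 2.0, 1.0]])
r = safe_pairwise_pearson(e)
print(r)                           # row 1 and column 1 are all zeros, no NaN
```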
Let's verify:
import time
import tensorflow as tf
import numpy as np

# DataHelper is the class that holds get_distance and get_pairwise_distances
data_helper = DataHelper()
a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)

# Loop version: one pair at a time
start = time.time()
dis = []
for i in range(a.shape[0] - 1):
    for j in range(i + 1, a.shape[0]):
        dis.append(data_helper.get_distance(a[i, :], a[j, :]))
end = time.time()
for i in dis:
    print(i)
print('for cost {0}s'.format(end - start))

# Matrix version: all pairs at once
start = time.time()
dis = data_helper.get_pairwise_distances(a)
end = time.time()
print(dis)
print('matrix cost {0}s'.format(end - start))
The results:
0.9983487486440501
0.7993246094489326
0.7851394659823645
for cost 0.0s
2018-08-02 15:03:52.121404: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
[[1. 0.99834875 0.79932461]
[0.99834875 1. 0.78513947]
[0.79932461 0.78513947 1. ]]
matrix cost 0.10700535774230957s
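With only three rows, the matrix timing above is most likely dominated by the fixed cost of starting a TensorFlow session, so it says little about larger inputs. A rough way to repeat the comparison on more rows is sketched below, in NumPy only and on random data with made-up sizes. Timings depend on the machine, so none are quoted here.

```python
import time
import numpy as np

def pearson_loops(a):
    """Coefficient of every pair (i, j) with i < j, one pair at a time."""
    out = []
    for i in range(a.shape[0] - 1):
        for j in range(i + 1, a.shape[0]):
            dx = a[i] - a[i].mean()
            dy = a[j] - a[j].mean()
            out.append(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))
    return np.array(out)

def pearson_matrix(a):
    """Same coefficients from one matrix product."""
    c = a - a.mean(axis=1, keepdims=True)
    num = c @ c.T
    sq = np.diag(num)
    rho = num / np.sqrt(np.outer(sq, sq))
    # Keep the upper triangle, in the same (i, j) order as the loops.
    return rho[np.triu_indices(a.shape[0], k=1)]

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 64))   # hypothetical sizes: 200 rows of 64 values

start = time.perf_counter()
r_loop = pearson_loops(a)
loop_cost = time.perf_counter() - start

start = time.perf_counter()
r_mat = pearson_matrix(a)
matrix_cost = time.perf_counter() - start

print('for cost {0}s'.format(loop_cost))
print('matrix cost {0}s'.format(matrix_cost))
print(np.allclose(r_loop, r_mat))   # both methods agree
```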