Computing Pairwise Pearson Coefficients Between N Vectors with Matrix Operations

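Goal: given N row vectors $\begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix}$, compute the Pearson coefficient between every pair. The simplest way is two nested for loops that handle each pair separately. That is fine when n is small, but once n gets large (typically around 100,000), the loops cost a lot of time. In scientific computing, use matrix operations instead of loops whenever you can, because loops are too slow!
PS: How are you supposed to show off if your code doesn't even contain a matrix operation, haha
==== I'm a divider line that just wants to write code in peace ====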
The formula for the Pearson coefficient of two vectors:
[Figure: Pearson coefficient formula]
There is a simpler formula:
$$\rho_{x,y}= \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})\cdot (y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\cdot\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}$$
The corresponding code, written directly from the formula:

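 # Pearson coefficient of two sequences; the larger the value, the stronger the correlation.
 # Note: np.abs drops the sign, so the returned value always lies in [0, 1].
 def get_distance(self, vector1, vector2):
     # Center each vector on its own mean
     num1 = vector1 - np.average(vector1)
     num2 = vector2 - np.average(vector2)
     # Numerator: sum of products of the centered values
     num = np.sum(num1 * num2)
     # Denominator: square root of the product of the two sums of squares
     den = np.sqrt(np.sum(np.power(num1, 2)) * np.sum(np.power(num2, 2)))
     # A constant vector has zero variance; return 0.0 instead of dividing by zero
     if den == 0:
         return 0.0
     return np.abs(num / den)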
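
As a quick check of the function above, here is a standalone sketch of the same logic without `self` (the name `pearson_abs` and the three toy vectors are mine, not from the original class). It is meant to illustrate two details: because of `np.abs`, a perfectly anti-correlated pair scores the same as a perfectly correlated one, and a constant vector hits the zero-denominator guard.

```python
import numpy as np

def pearson_abs(vector1, vector2):
    """Standalone copy of the get_distance logic: returns |Pearson coefficient|."""
    c1 = vector1 - np.average(vector1)  # center on the mean
    c2 = vector2 - np.average(vector2)
    num = np.sum(c1 * c2)
    den = np.sqrt(np.sum(c1 ** 2) * np.sum(c2 ** 2))
    if den == 0:
        return 0.0  # zero variance in at least one vector
    return np.abs(num / den)

x = np.array([1.0, 2.0, 3.0])
same_direction = pearson_abs(x, np.array([2.0, 4.0, 6.0]))      # perfectly correlated
opposite_direction = pearson_abs(x, np.array([3.0, 2.0, 1.0]))  # perfectly anti-correlated
constant_vector = pearson_abs(x, np.array([5.0, 5.0, 5.0]))     # zero variance
print(same_direction, opposite_direction, constant_vector)
```

If the sign of the correlation matters for your use case, dropping the `np.abs` call gives the usual signed coefficient.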

Now let's rewrite the formula above using matrices:
$$e-\bar{e} = \tilde{e} = \begin{bmatrix}e_{1} - \bar{e}_{1}\\ e_{2} - \bar{e}_{2}\\ \vdots\\ e_{n} - \bar{e}_{n}\end{bmatrix}$$

$$\tilde{e}=\begin{bmatrix}\tilde{e}_{1}\\ \tilde{e}_{2}\\ \vdots\\ \tilde{e}_{n}\end{bmatrix}$$
Then:
$$\tilde{e}\cdot \tilde{e}'=\begin{bmatrix} \tilde{e}_{1}^{2} & \tilde{e}_{1}\cdot\tilde{e}_{2} & \cdots & \tilde{e}_{1}\cdot\tilde{e}_{n}\\ \tilde{e}_{2}\cdot\tilde{e}_{1} & \tilde{e}_{2}^{2} & \cdots & \tilde{e}_{2}\cdot\tilde{e}_{n}\\ \vdots & \vdots & \ddots & \vdots\\ \tilde{e}_{n}\cdot\tilde{e}_{1} & \tilde{e}_{n}\cdot\tilde{e}_{2} & \cdots & \tilde{e}_{n}^{2} \end{bmatrix}$$
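This gives the numerator matrix. Looking at the denominator, you can see that its terms are exactly the diagonal of the matrix above.
Let:
$$\mathrm{dot} = \begin{bmatrix}\tilde{e}_{1}^{2}\\ \tilde{e}_{2}^{2}\\ \vdots\\ \tilde{e}_{n}^{2}\end{bmatrix}$$
where each $\tilde{e}_{i}^{2}$ is the dot product $\tilde{e}_{i}\cdot\tilde{e}_{i}$. Then the denominator is:
$$\sqrt{\mathrm{dot}\cdot \mathrm{dot}'}$$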
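To see why this works, it may help to expand the outer product explicitly (a worked expansion using the same symbols as above, with $\tilde{e}_{i}^{2}$ meaning $\tilde{e}_{i}\cdot\tilde{e}_{i}$):
$$\mathrm{dot}\cdot \mathrm{dot}'=\begin{bmatrix} \tilde{e}_{1}^{2}\,\tilde{e}_{1}^{2} & \tilde{e}_{1}^{2}\,\tilde{e}_{2}^{2} & \cdots & \tilde{e}_{1}^{2}\,\tilde{e}_{n}^{2}\\ \tilde{e}_{2}^{2}\,\tilde{e}_{1}^{2} & \tilde{e}_{2}^{2}\,\tilde{e}_{2}^{2} & \cdots & \tilde{e}_{2}^{2}\,\tilde{e}_{n}^{2}\\ \vdots & \vdots & \ddots & \vdots\\ \tilde{e}_{n}^{2}\,\tilde{e}_{1}^{2} & \tilde{e}_{n}^{2}\,\tilde{e}_{2}^{2} & \cdots & \tilde{e}_{n}^{2}\,\tilde{e}_{n}^{2} \end{bmatrix}$$
After an element-wise square root, entry $(i, j)$ is $\sqrt{\tilde{e}_{i}^{2}\cdot\tilde{e}_{j}^{2}}$, which is the denominator of the two-vector formula applied to the pair $(e_{i}, e_{j})$.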
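Therefore
$$\rho_{x,y}=\frac{\tilde{e}\cdot \tilde{e}'}{\sqrt{\mathrm{dot}\cdot \mathrm{dot}'}}$$
where the square root and the division are taken element-wise, so entry $(i, j)$ of the result is the coefficient between $e_{i}$ and $e_{j}$.
The code, written from this formula: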

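def get_pairwise_distances(self, embeddings):
    """
        Compute the pairwise Pearson correlation coefficients between
        embedding vectors (uses the TensorFlow 1.x API).
        Args:
            embeddings: tensor of shape (batch_size, embed_dim)
        Returns:
            pairwise_distances: array of shape (batch_size, batch_size)
    """
    avg_vec = tf.reduce_mean(embeddings, axis=1)
    # Center each row so that its mean E(x) = 0
    normal_embed = embeddings - tf.expand_dims(avg_vec, 1)

    # Matrix of sum((x-avg(x))*(y-avg(y))) for every pair of rows, i.e. the numerator matrix
    dot_product = tf.matmul(normal_embed, tf.transpose(normal_embed))

    # Denominator: sqrt(sum((x-avg(x))^2) * sum((y-avg(y))^2))
    square_norm = tf.diag_part(dot_product)
    square_norm = tf.matmul(tf.expand_dims(square_norm, 1), tf.expand_dims(square_norm, 0))

    # Unlike get_distance, there is no abs() and no guard against a zero denominator here
    distance = dot_product / tf.sqrt(square_norm)
    # Evaluate inside a context manager so the session is closed afterwards
    with tf.Session() as sess:
        return sess.run(distance)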
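
For readers who want to try the idea without TensorFlow 1.x, the same steps can be sketched in plain NumPy. This is my own rough translation, not the author's code: the function name `pairwise_pearson` is made up for illustration, and `np.corrcoef` is used only as an independent cross-check. The input is the three-row array from the verification in this post.

```python
import numpy as np

def pairwise_pearson(embeddings):
    """Pairwise Pearson coefficients between the rows of a 2-D array."""
    x = np.asarray(embeddings, dtype=float)
    # Center each row on its own mean
    centered = x - x.mean(axis=1, keepdims=True)
    # Numerator matrix: dot products between all pairs of centered rows
    dot_product = centered @ centered.T
    # The diagonal holds each row's sum of squares
    square_norm = np.diag(dot_product)
    # Denominator matrix: element-wise sqrt of the outer product of the diagonal
    denominator = np.sqrt(np.outer(square_norm, square_norm))
    return dot_product / denominator

a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)
result = pairwise_pearson(a)
print(result)
# Cross-check against NumPy's built-in row-wise correlation matrix
print(np.allclose(result, np.corrcoef(a)))
```

Like the TensorFlow version, this sketch keeps the sign of the coefficient and does not guard against rows with zero variance, which would produce a division by zero.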

Let's verify:

 import time
 import tensorflow as tf
 import numpy as np
 
 # DataHelper is assumed to be the class that defines the two methods above
 data_helper = DataHelper()
 a = np.array([[2, 7, 18, 88, 157,90, 177, 570],
               [3, 5, 15, 90, 180, 88, 160, 580],
               [1,2,3,4,5,6,7,8]],dtype=float)
 start = time.time()
 dis = []
 for i in range(a.shape[0] - 1):
     for j in range(i + 1, a.shape[0]):
         dis.append(data_helper.get_distance(a[i,:],a[j,:]))
 end = time.time()
 for i in dis:
     print(i)
 print('for cost {0}s'.format(end - start))

 start = time.time()
 dis = data_helper.get_pairwise_distances(a)
 end = time.time()
 print(dis)
 print('matrix cost {0}s'.format(end - start))

The results are as follows:

0.9983487486440501
0.7993246094489326
0.7851394659823645
for cost 0.0s
2018-08-02 15:03:52.121404: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
[[1.         0.99834875 0.79932461]
 [0.99834875 1.         0.78513947]
 [0.79932461 0.78513947 1.        ]]
matrix cost 0.10700535774230957s
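
Three rows are too few to show any speed difference, and the matrix timing above also includes creating the `tf.Session` inside the timed call. As a rough way to compare the two approaches on more rows, here is a NumPy-only sketch. Everything in it is my own assumption for illustration: the helper names, the random data, and the size (200 rows of 16 values). The loop version here keeps the sign (no `abs`) so that the two results can be compared directly, and actual timings will vary by machine.

```python
import time
import numpy as np

def pearson(v1, v2):
    """Signed Pearson coefficient of two 1-D arrays (no abs, for comparison)."""
    c1 = v1 - v1.mean()
    c2 = v2 - v2.mean()
    return np.sum(c1 * c2) / np.sqrt(np.sum(c1 ** 2) * np.sum(c2 ** 2))

def pairwise_pearson(x):
    """Matrix form: all row pairs at once."""
    centered = x - x.mean(axis=1, keepdims=True)
    dot_product = centered @ centered.T
    square_norm = np.diag(dot_product)
    return dot_product / np.sqrt(np.outer(square_norm, square_norm))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))  # hypothetical test data
n = data.shape[0]

# Two nested loops over the upper triangle, mirrored into a full matrix
start = time.perf_counter()
loop_result = np.eye(n)
for i in range(n - 1):
    for j in range(i + 1, n):
        loop_result[i, j] = loop_result[j, i] = pearson(data[i], data[j])
loop_seconds = time.perf_counter() - start

# Single matrix computation
start = time.perf_counter()
matrix_result = pairwise_pearson(data)
matrix_seconds = time.perf_counter() - start

print('for cost {0}s'.format(loop_seconds))
print('matrix cost {0}s'.format(matrix_seconds))
print('results match:', np.allclose(loop_result, matrix_result))
```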