快速 Pearson Correlation 计算

最新推荐文章于 2024-08-31 16:09:59 发布

大宋黄药师

最新推荐文章于 2024-08-31 16:09:59 发布

阅读量762

点赞数

分类专栏： debug 机器学习文章标签： python numpy 机器学习

本文链接：https://blog.csdn.net/Yao_Shi_Fighting/article/details/126626167

版权

debug 同时被 2 个专栏收录

3 篇文章

订阅专栏

机器学习

3 篇文章

订阅专栏

调用scipy.stats 的pearsonr进行运算，运算量 (33003 * 33002 / 2) 达到五亿规模的情况下运算速度特别慢，无法接受。如何快速进行correlation计算？

源代码和结果：

import pickle
from scipy.stats import pearsonr
import numpy as np


def CorrelationPool():
    
    with open('map_pool.pkl','rb') as handle:
        pool = pickle.load(handle)
    
    print("There are "+str(len(pool))+" cells in the pool!")
    l = len(pool)
    # Notes: the shape of the pool is (l, 2304), the type is 'list'. 

    correlation_pool = np.zeros(int(l*(l-1)/2), dtype = np.float64)
    t = 0
    for i in tqdm(range(l-1)):
        for j in range(i+1,l):
            correlation_pool[t], _ = pearsonr(np.array(pool[i]),np.array(pool[j]))
            t += 1
    
    with open('CorrelaitonPool.pkl','wb') as f:
        pickle.dump(correlation_pool,f)

预计用时13小时（考虑到for循环结构，最终用时大约是26小时的一半）这可能是调用两层for循环导致的。为检验这种可能，隐去第二层for循环内的pearson计算时，此时五亿次for循环计算仅用时36秒：

可见scipy.stats.pearson运算占据了绝对的时间消耗。尝试用numpy重写pearson r，利用numpy的矩阵运算速度优势显著提高运算效率。

方法：标准化数据之后利用标准化数据中欧氏距离方等价于pearson相关系数的原理计算pearsonr相关系数。参考

https://www.zhihu.com/question/19734616

采用此方法后利用numpy进行运算：

def CorrelationPool(map_type = 'smooth'):
    
    with open('map_pool.pkl','rb') as handle:
        pool = np.array(pickle.load(handle), dtype = np.float64)
    
    # Z-score Normalization
    pool = scale(pool, axis = 1)
    
    l = pool.shape[0]
    print("There are "+str(l)+" cells in the pool!")
    correlation_pool = np.zeros(int(l*(l-1)/2), dtype = np.float64)

    t = 0
    for i in tqdm(range(l-1)):
        eu = np.sum((pool[0:l-i-1,:] - pool[i+1:l])**2, axis = 1)
        correlation_pool[t:t + l-i-1] = 1- eu / (2*pool.shape[1])
        t += l-i-1
    
    with open('CorrelaitonPool_'+map_type+'.pkl','wb') as f:
        pickle.dump(correlation_pool,f)