Python correlation matrix program: correlation coefficients for a sparse matrix in Python?

Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that will work on a SciPy sparse matrix.

Solution

You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

import numpy as np
from scipy import sparse


def sparse_corrcoef(A, B=None):
    # Rows are variables and columns are observations, as in np.corrcoef
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
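
For reference, the identity the function relies on can be written out as below. This is only a sketch for real-valued data. The symbols a_{ik} (entry of A in row i, column k), r_i (row sum) and \rho_{ij} are notation introduced here, not names from the code. The point of the rearrangement is that the means are never subtracted from A itself, which would turn every zero entry into a non-zero one.

```latex
r_i = \sum_{k=1}^{n} a_{ik}, \qquad \bar{a}_i = \frac{r_i}{n}

C_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (a_{ik} - \bar{a}_i)(a_{jk} - \bar{a}_j)
       = \frac{1}{n-1} \left( \sum_{k=1}^{n} a_{ik} a_{jk} - \frac{r_i r_j}{n} \right)

\rho_{ij} = \frac{C_{ij}}{\sqrt{C_{ii} \, C_{jj}}}
```

In the code, the sum of products corresponds to A.dot(A.T) and the term r_i r_j / n corresponds to centering.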

Check that it works OK:

# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')

coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())

print(np.allclose(coeffs1, coeffs2))
# True

Be warned:

The covariance matrix C is a dense (m, m) array, where m is the number of rows of A (plus the rows of B, if given), so its size grows quadratically with the number of rows. The memory needed for the intermediate product A.dot(A.T) also depends heavily on the sparsity structure of A. For example, if A is an (m, n) matrix containing just a single column of non-zero values, then the product is an (m, m) matrix containing all non-zero values. If m is large then this could be very bad news in terms of memory consumption.
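
The effect of the sparsity structure can be seen on a small, made-up case. The sketch below builds a matrix whose only non-zero entries sit in the first column and counts the stored values of the product. The sizes m and n and the name G are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse

m, n = 200, 5000  # illustrative sizes only

# Worst case for the product: every row has one non-zero, all in column 0.
rows = np.arange(m)
cols = np.zeros(m, dtype=int)
A = sparse.csr_matrix((np.ones(m), (rows, cols)), shape=(m, n))

# The product is still held in a sparse format, but every entry is non-zero.
G = A.dot(A.T)

print("stored values in A:", A.nnz)
print("shape of product:", G.shape)
print("stored values in product:", G.nnz)
```

Here A stores only m values, yet the product stores m * m of them, so a sparse input gives no guarantee of a sparse intermediate result.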
