Python correlation analysis and correlation matrices: SciPy Spearman correlation for a matrix matches neither the two-array result nor pandas.DataFrame.corr()

I was computing Spearman correlations for a matrix. With scipy.stats.spearmanr, the matrix input and the two-array input gave different results. These results also differ from those of pandas.DataFrame.corr.

from scipy.stats import spearmanr  # scipy 1.0.1
import pandas as pd  # 0.22.0
import numpy as np

# Data
X = pd.DataFrame({"A": [-0.4, 1, 12, 78, 84, 26, 0, 0],
                  "B": [-0.4, 3.3, 54, 87, 25, np.nan, 0, 1.2],
                  "C": [np.nan, 56, 78, 0, np.nan, 143, 11, np.nan],
                  "D": [0, -9.3, 23, 72, np.nan, -2, -0.3, -0.4],
                  "E": [78, np.nan, np.nan, 0, -1, -11, 1, 323]})

matrix_rho_scipy = spearmanr(X, nan_policy='omit', axis=0)[0]
matrix_rho_pandas = X.corr('spearman')

print(matrix_rho_scipy == matrix_rho_pandas.values)  # All False except diagonal
print(spearmanr(X['A'], X['B'], nan_policy='omit', axis=0)[0])  # 0.8839285714285714 from scipy 1.0.1
print(spearmanr(X['A'], X['B'], nan_policy='omit', axis=0)[0])  # 0.8829187134416477 from scipy 1.1.0
print(matrix_rho_scipy[0, 1])  # 0.8263621207201486
print(matrix_rho_pandas.values[0, 1])  # 0.8829187134416477
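Because `==` on floats can flag differences that are only rounding noise, it may help to compare the reported values against a tolerance instead. The sketch below reuses the numbers printed above (the variable names are illustrative, not from the original code) and `numpy.isclose` with its default tolerances.

```python
import numpy as np

# Values of rho(A, B) reported above; names are illustrative only.
rho_pandas = 0.8829187134416477           # X.corr('spearman')
rho_scipy_matrix = 0.8263621207201486     # spearmanr(X, nan_policy='omit', axis=0)
rho_scipy_pair_101 = 0.8839285714285714   # spearmanr(X['A'], X['B'], ...) on scipy 1.0.1

candidates = {
    "scipy matrix input": rho_scipy_matrix,
    "scipy two-array input (1.0.1)": rho_scipy_pair_101,
}

for label, value in candidates.items():
    # isclose() tolerates tiny floating-point differences only.
    close = bool(np.isclose(value, rho_pandas))
    print(f"{label}: close to pandas = {close}, abs diff = {abs(value - rho_pandas):.3e}")
```

Both comparisons come out as not close, so the gaps are much larger than rounding error and the methods really do disagree.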

Later I found that pandas's rho is the same as R's rho.

X = data.frame(A=c(-0.4,1,12,78,84,26,0,0),
               B=c(-0.4,3.3,54,87,25,NaN,0,1.2),
               C=c(NaN,56,78,0,NaN,143,11,NaN),
               D=c(0,-9.3,23,72,NaN,-2,-0.3,-0.4),
               E=c(78,NaN,NaN,0,-1,-11,1,323))

cor.test(X$A, X$B, method='spearman', exact = FALSE, na.action="na.omit") # 0.8829187

However, pandas's corr doesn't work with large tables (others have reported this too, and in my case the size is 16,000).

Thanks to Warren Weckesser's testing, I found that the two-array results from SciPy 1.1.0 (but not 1.0.1) are the same as the results from pandas and R.

Please let me know if you have any suggestions or comments. Thank you.

I use Python: 3.6.2 (Anaconda); Mac OS: 10.10.5.

Solution

It appears that scipy.stats.spearmanr doesn't handle nan values as expected when the input is an array and an axis is given. Here's a script that compares a few methods of computing pairwise Spearman rank-order correlations:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

x = np.array([[np.nan, 3.0, 4.0, 5.0, 5.1, 6.0, 9.2],
              [5.0, np.nan, 4.1, 4.8, 4.9, 5.0, 4.1],
              [0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6]])

# Array input: each row of x is one variable (axis=1).
r = spearmanr(x, nan_policy='omit', axis=1)[0]
print("spearmanr, array: %11.7f %11.7f %11.7f" % (r[0, 1], r[0, 2], r[1, 2]))

# Two-array input, one pair of variables at a time.
r01 = spearmanr(x[0], x[1], nan_policy='omit')[0]
r02 = spearmanr(x[0], x[2], nan_policy='omit')[0]
r12 = spearmanr(x[1], x[2], nan_policy='omit')[0]
print("spearmanr, individual: %11.7f %11.7f %11.7f" % (r01, r02, r12))

# Pandas expects variables in columns, hence the transpose.
df = pd.DataFrame(x.T)
c = df.corr('spearman')
print("Pandas df.corr('spearman'): %11.7f %11.7f %11.7f" % (c[0][1], c[0][2], c[1][2]))

print("R cor.test: 0.2051957 0.4857143 -0.4707919")
print(' (method="spearman", continuity=FALSE)')

"""
# R code:
> x0 = c(NA, 3, 4, 5, 5.1, 6.0, 9.2)
> x1 = c(5.0, NA, 4.1, 4.8, 4.9, 5.0, 4.1)
> x2 = c(0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6)
> cor.test(x0, x1, method="spearman", continuity=FALSE)
> cor.test(x0, x2, method="spearman", continuity=FALSE)
> cor.test(x1, x2, method="spearman", continuity=FALSE)
"""

Output:

spearmanr, array: -0.0727393 -0.0714286 -0.4728054
spearmanr, individual: 0.2051957 0.4857143 -0.4707919
Pandas df.corr('spearman'): 0.2051957 0.4857143 -0.4707919
R cor.test: 0.2051957 0.4857143 -0.4707919
(method="spearman", continuity=FALSE)
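The three methods that agree are consistent with pairwise deletion: for each pair of variables, drop the observations where either value is NaN, rank what remains (ties get average ranks), and take the Pearson correlation of the ranks. A minimal sketch of that definition using only NumPy and pandas follows; the helper name `spearman_by_ranks` is made up for illustration and is not part of either library.

```python
import numpy as np
import pandas as pd

x = np.array([[np.nan, 3.0, 4.0, 5.0, 5.1, 6.0, 9.2],
              [5.0, np.nan, 4.1, 4.8, 4.9, 5.0, 4.1],
              [0.5, 4.0, 7.1, 3.8, 8.0, 5.1, 7.6]])


def spearman_by_ranks(a, b):
    """Hypothetical helper: Spearman's rho as Pearson's r of average ranks."""
    # Keep only the observations where both variables are present.
    pair = pd.DataFrame({"a": a, "b": b}).dropna()
    # Rank each variable separately; tied values share their average rank.
    ranks = pair.rank(method="average")
    # Series.corr defaults to the Pearson correlation.
    return ranks["a"].corr(ranks["b"])


r01 = spearman_by_ranks(x[0], x[1])
r02 = spearman_by_ranks(x[0], x[2])
r12 = spearman_by_ranks(x[1], x[2])
print("ranks + Pearson: %11.7f %11.7f %11.7f" % (r01, r02, r12))
```

Under these assumptions the printed values should agree with the individual spearmanr, pandas, and R rows above to the digits shown.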

My suggestion is not to use scipy.stats.spearmanr in the form spearmanr(x, nan_policy='omit', axis=). Instead, use the corr() method of the pandas DataFrame, or loop over the pairs of variables and compute each value with spearmanr(x0, x1, nan_policy='omit').
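The pairwise loop could look something like the sketch below. It is one possible arrangement, not the only one: the function name `pairwise_spearman` is invented for this example, and it drops incomplete pairs with an explicit mask before calling spearmanr rather than passing nan_policy='omit', so that the result should not depend on how a given SciPy version handles NaN. The data is the DataFrame from the question.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

X = pd.DataFrame({"A": [-0.4, 1, 12, 78, 84, 26, 0, 0],
                  "B": [-0.4, 3.3, 54, 87, 25, np.nan, 0, 1.2],
                  "C": [np.nan, 56, 78, 0, np.nan, 143, 11, np.nan],
                  "D": [0, -9.3, 23, 72, np.nan, -2, -0.3, -0.4],
                  "E": [78, np.nan, np.nan, 0, -1, -11, 1, 323]})


def pairwise_spearman(df):
    """Hypothetical helper: Spearman matrix built one column pair at a time."""
    cols = list(df.columns)
    values = df.to_numpy(dtype=float)
    # Diagonal is set to 1.0; this assumes no column is entirely NaN.
    rho = np.eye(len(cols))
    for i, j in combinations(range(len(cols)), 2):
        a, b = values[:, i], values[:, j]
        # Pairwise deletion: keep rows where both columns have a value.
        keep = ~(np.isnan(a) | np.isnan(b))
        if keep.sum() < 2:
            r = np.nan  # too few complete pairs to define a correlation
        else:
            r = spearmanr(a[keep], b[keep])[0]
        rho[i, j] = rho[j, i] = r
    return pd.DataFrame(rho, index=cols, columns=cols)


rho = pairwise_spearman(X)
print(rho.loc["A", "B"])
print(np.allclose(rho.to_numpy(), X.corr("spearman").to_numpy()))
```

Each pair is computed independently, so for a very wide table the loop can be split into chunks or parallelized without changing the results.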
