pandas vs. Spark: a speed comparison for computing correlation coefficients

There are three common correlation algorithms: Pearson, Spearman, and Kendall.

In pandas, the correlation matrix of a DataFrame can be computed directly with any of these three methods via data.corr().

Under the hood, this relies on algorithms from the scipy library.
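As a quick illustration, a minimal call looks like this (df here stands in for any numeric DataFrame; the full timing script appears further below):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 5))
print(df.corr(method='spearman'))  # method is one of 'pearson', 'spearman', 'kendall'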

To speed up the computation, the Spark platform was used to parallelize it.

This post compares the execution speed of pandas, scipy parallelized with Spark, and Spark MLlib.

Overall, Spark MLlib is the fastest, Spark + scipy comes second, and pandas is the slowest.

corr execution speed test results

Times are in seconds.

| Data size | corr algorithm | pandas | spark + scipy | spark mllib | Notes |
| --- | --- | --- | --- | --- | --- |
| 1000*3600 | pearsonr | 203 | 170 | 37 | pyspark |
| 1000*3600 | pearsonr | 203 | 50 | not run | spark + scipy computed only half the matrix |
| 1000*3600 | pearsonr | 203 | 125 | 37 | client mode |
| 1000*3600 | pearsonr | 202 | 157 | 38 | client mode |
| 1000*3600 | spearmanr | 1386 | 6418 | 37 | client mode |
| 1000*3600 | spearmanr | 1327 | 6392 | 38 | client mode |
| 1000*3600 | kendall | 4326 | 398 | not in MLlib | client mode |
| 1000*3600 | kendall | 4239 | 346 | not in MLlib | client mode |
| 1000*1000 | spearmanr | 127 | 294 | 12 | client mode |
| 1000*1000 | spearmanr | 98 | 513 | 5.55 | client mode |
| 1000*360 | spearmanr | 13 | 150 | not run | 160 s with the plain list comprehension res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)] |
| 1000*360 | kendall | 40 | 45 | not in MLlib | 116 s with the plain list comprehension res = [st.kendalltau(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)] |

Note: spearmanr runs strikingly slowly under the Spark + scipy combination. This needs further comparison and analysis; something seems off there.
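One plausible explanation, offered here as an assumption rather than a finding from the benchmark: st.spearmanr re-ranks both columns on every pairwise call, so the ranking work is repeated thousands of times across tasks. Since the Spearman coefficient is just the Pearson coefficient computed on ranks, ranking every column once up front avoids that repetition. A minimal sketch:

import numpy as np
import pandas as pd

C, N = 1000, 360
data = pd.DataFrame(np.random.randn(C, N))

# Rank each column once (average ranks for ties, matching scipy's default),
# then take plain Pearson correlations of the ranked columns.
ranked = data.rank()
res = ranked.corr(method='pearson')  # equivalent to data.corr(method='spearman')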

The scripts for the three approaches follow.

pandas script

import numpy as np
import pandas as pd
import time

C = 1000
N = 3600
data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))
print("============================ {}".format(data.shape))
print("start pandas corr ---{} ".format(time.time()))

start = time.time()
# method is one of {'pearson', 'kendall', 'spearman'}
res = data.corr(method='pearson')
end_1 = time.time()
res = data.corr(method='spearman')
end_2 = time.time()
res = data.corr(method='kendall')
end_3 = time.time()

print("pandas pearson count {} total cost : {}".format(len(res), end_1 - start))
print("pandas spearman count {} total cost : {}".format(len(res), end_2 - end_1))
print("pandas kendall count {} total cost : {}".format(len(res), end_3 - end_2))

spark + scipy script

from pyspark import SparkContext

import numpy as np
import pandas as pd
from scipy import stats as st
import time

sc = SparkContext()

# Pairwise scipy calls used below:
# st.kendalltau(x, y), st.spearmanr(x, y), st.pearsonr(x, y)

C = 1000
N = 3600
data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))

def pearsonr(n):
    # correlate column n against every column
    x = data.iloc[:, n]
    return [st.pearsonr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]

def spearmanr(n):
    x = data.iloc[:, n]
    return [st.spearmanr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]

def kendalltau(n):
    x = data.iloc[:, n]
    return [st.kendalltau(x, data.iloc[:, i])[0] for i in range(data.shape[1])]

# Distribute one column index per task; run one algorithm at a time
# (uncomment the line you want to benchmark).
start = time.time()
res = sc.parallelize(np.arange(N)).map(lambda x: pearsonr(x)).collect()
# res = sc.parallelize(np.arange(N)).map(lambda x: spearmanr(x)).collect()
# res = sc.parallelize(np.arange(N)).map(lambda x: kendalltau(x)).collect()
end = time.time()
print("corr count {} total cost : {}".format(len(res), end - start))

# Pure Python baseline: every (i, j) pair in one list comprehension.
s = time.time()
res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)]
end = time.time()
print(end - s)

# Distribute the per-pair computation. Note: the original parallelized `res`,
# which holds coefficients; the RDD elements must be index pairs instead.
pairs = [(i, j) for i in range(N) for j in range(N)]

start = time.time()
dd = sc.parallelize(pairs).map(lambda x: st.spearmanr(data.iloc[:, x[0]], data.iloc[:, x[1]])).collect()
end = time.time()
print(end - start)

start = time.time()
dd = sc.parallelize(pairs).map(lambda x: st.kendalltau(data.iloc[:, x[0]], data.iloc[:, x[1]])).collect()
end = time.time()
print(end - start)
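One caveat about the script above, noted here as an observation rather than something measured in the original post: data is captured in each task's closure, so Spark serializes the whole DataFrame along with the job. An explicit broadcast variable ships it to each executor once instead. A minimal sketch, assuming the same C, N, and SparkContext as above:

from pyspark import SparkContext
import numpy as np
import pandas as pd
from scipy import stats as st

sc = SparkContext()
C, N = 1000, 3600
data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))

# Broadcast the DataFrame once; tasks read it via bdata.value.
bdata = sc.broadcast(data)

def pearson_row(n):
    df = bdata.value
    x = df.iloc[:, n]
    return [st.pearsonr(x, df.iloc[:, i])[0] for i in range(df.shape[1])]

res = sc.parallelize(range(N)).map(pearson_row).collect()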

spark mllib script

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics
import numpy as np
import time

sc = SparkContext()

L = 1000
N = 3600
# Build an RDD of rows, each row a length-N vector of observations.
t = [np.random.randn(N) for i in range(L)]
data = sc.parallelize(t)

start = time.time()
res = Statistics.corr(data, method="pearson")  # method: 'pearson' or 'spearman'; Kendall is not offered
end = time.time()
print("pearson : ", end - start)

start = time.time()
res = Statistics.corr(data, method="spearman")
end = time.time()
print("spearman: ", end - start)

Original post: https://www.cnblogs.com/StitchSun/p/13225260.html
