Try this code, replacing the data with your own read() call. Note that I convert the SQL DataFrame to an RDD before mapping the lambda function over it.
from pyspark.mllib.stat import Statistics
import pandas as pd
# df = sqlCtx.read.format('com.databricks.spark.csv').option('header', 'true').option('inferschema', 'true').load('corr_test.csv')
df = datos  # your existing Spark DataFrame
col_names = df.columns
# slice each Row into a plain tuple of values so Statistics.corr can consume it
features = df.rdd.map(lambda row: row[0:])
corr_mat = Statistics.corr(features, method="pearson")
corr_df = pd.DataFrame(corr_mat)
corr_df.index, corr_df.columns = col_names, col_names
Sample output:
print(corr_df.to_string())
p1m p2m p3m p6m p9m p1m_ya p2m_ya p3m_ya p6m_ya p9m_ya p3m_q_ty 1ya_sales 2ya_sales seasonal_sales
p1m 1.000000 0.755679 0.755452 0.506780 0.557281 0.299348 0.182835 -0.001173 0.332484 0.308060 0.354096 0.029385 0.871112 0.292136
p2m 0.755679 ...
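If you want to sanity-check the result without a Spark cluster, the same Pearson matrix can be computed locally with pandas, since `DataFrame.corr` uses Pearson by default. This sketch uses a tiny made-up dataset (the column names `p1m`, `p2m`, `p3m` mirror the output above, but the values are illustrative, not from corr_test.csv):

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values, not the original data)
pdf = pd.DataFrame({
    "p1m": [1.0, 2.0, 3.0, 4.0],
    "p2m": [2.0, 4.1, 5.9, 8.2],
    "p3m": [4.0, 3.0, 2.5, 1.0],
})

# pandas computes the Pearson correlation matrix column by column,
# yielding a labeled square DataFrame just like corr_df above
local_corr = pdf.corr(method="pearson")
print(local_corr.to_string())
```

For small data this is the quicker path; the Spark/RDD route in the answer is what you want once the data no longer fits on one machine.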