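Correlation coefficients
Computing the correlation between two series of data is a common operation in statistics. MLlib provides methods for computing pairwise correlations among many series. The correlation methods currently supported are Pearson's correlation and Spearman's correlation.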
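To make the two measures concrete, here is a minimal sketch in plain Scala that works on small in-memory sequences. It is not MLlib's distributed implementation. The object name CorrelationSketch and its helpers pearson, ranks and spearman are hypothetical names introduced only for illustration. The sketch assumes the usual textbook definitions: Pearson's correlation is the covariance divided by the product of the standard deviations, and Spearman's correlation is Pearson's correlation applied to average ranks.

```scala
object CorrelationSketch {
  // Pearson correlation: covariance divided by the product of standard deviations.
  // The 1/n factors cancel, so plain sums of deviations are used.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "series must be non-empty and of equal length")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val varX = x.map(a => math.pow(a - meanX, 2)).sum
    val varY = y.map(b => math.pow(b - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }

  // 1-based ranks; tied values share the mean of the positions they occupy.
  def ranks(v: Seq[Double]): Seq[Double] = {
    val sorted = v.zipWithIndex.sortBy(_._1)
    val out = Array.ofDim[Double](v.length)
    var i = 0
    while (i < sorted.length) {
      // Extend j to the end of the run of values equal to sorted(i).
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avgRank = (i + j) / 2.0 + 1.0
      for (k <- i to j) out(sorted(k)._2) = avgRank
      i = j + 1
    }
    out.toSeq
  }

  // Spearman correlation: Pearson correlation of the rank-transformed series.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))
}
```

For a monotonic but non-linear relationship such as y = x * x on positive x, spearman returns 1 while pearson stays below 1, which is the practical difference between the two methods.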
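Statistics provides methods for computing correlations between series. Depending on the type of input, either two RDD[Double]s or a single RDD[Vector], the output is a Double or a correlation matrix, respectively. The following is a usage example.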
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val sc: SparkContext = ...
val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
val data: RDD[Vector] = ... // note that each Vector is a row and not a column
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
As this example shows, the entry point for computing correlations is Statistics.corr. When the input is two RDD[Double]s, its actual implementation is Corre