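Pearson and Spearman correlation are both already implemented in Spark's machine learning libraries, through two APIs: Statistics.corr() and Correlation.corr(). At their core both call the same underlying interface, but Correlation.corr() accepts a DataFrame directly and computes the whole correlation matrix in one call, which is much faster than the Statistics.corr() approach used here, where each column is converted to an RDD and the column pairs are computed one at a time.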
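As a reminder of what the two methods compute, here is a minimal plain-Scala sketch that does not use Spark. The object name CorrelationSketch and its helper names are made up for illustration, and ties are handled with average ranks, which is an assumption of this sketch rather than a statement about Spark's internals.

```scala
object CorrelationSketch {
  // Pearson correlation: covariance divided by the product of the standard deviations.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "inputs must be non-empty and of equal length")
    val meanX = x.sum / x.length
    val meanY = y.sum / y.length
    val dx = x.map(_ - meanX)
    val dy = y.map(_ - meanY)
    val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
    val varX = dx.map(d => d * d).sum
    val varY = dy.map(d => d * d).sum
    cov / math.sqrt(varX * varY)
  }

  // 1-based ranks; tied values share the mean of the positions they occupy.
  def ranks(values: Seq[Double]): Seq[Double] = {
    val sorted = values.zipWithIndex.sortBy(_._1)
    val result = new Array[Double](values.length)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avgRank = (i + j) / 2.0 + 1.0
      for (k <- i to j) result(sorted(k)._2) = avgRank
      i = j + 1
    }
    result.toSeq
  }

  // Spearman correlation: the Pearson correlation of the ranks.
  def spearman(x: Seq[Double], y: Seq[Double]): Double = pearson(ranks(x), ranks(y))
}
```

For a monotonic but non-linear relationship such as y = x², Spearman treats the relationship as perfect while Pearson reports a value below 1, which is the practical difference between the two methods.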
Implementing Pearson and Spearman correlation with Statistics.corr()
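import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

// Computes the correlation of every column pair with a separate Statistics.corr call.
def pearsonOrSpearman(df: DataFrame, colArray: Seq[String], method: String): Seq[Seq[String]] = {
  val result = colArray.map(col1 => {
    // Extract the first column of the pair as an RDD[Double].
    val rdd1 = df.rdd.map(row => row(row.fieldIndex(col1)).toString.toDouble)
    // Each output row starts with the column name, followed by its correlations.
    col1 +: colArray.map(col2 => {
      val rdd2 = df.rdd.map(row => row(row.fieldIndex(col2)).toString.toDouble)
      "%.6f".format(Statistics.corr(rdd1, rdd2, method))
    })
  })
  result
}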
The input data is as follows:
The data set has 1,000 rows and 18 fields in total.
Result:
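The computation took 34 seconds in total, and the run time grows roughly quadratically with the number of fields, because every pair of columns triggers its own Statistics.corr() call.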
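The function above calls Statistics.corr() once per ordered pair of columns, so 18 fields lead to 18 × 18 = 324 calls. Since correlation is symmetric, one possible mitigation is to compute only the pairs with i < j and mirror them, which needs 153 calls for 18 fields. The sketch below shows the idea in plain Scala; PairwiseSketch, symmetricMatrix and the corr parameter are hypothetical names, and the diagonal is set to 1.0 on the assumption that no column is constant.

```scala
object PairwiseSketch {
  // Builds an n x n matrix from a pairwise function, calling it only for i < j.
  // `corr` is a hypothetical stand-in for any symmetric pairwise correlation call.
  def symmetricMatrix(cols: Seq[String], corr: (String, String) => Double): Array[Array[Double]] = {
    val n = cols.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      // Assumes a non-constant column, whose correlation with itself is 1.
      m(i)(i) = 1.0
      for (j <- i + 1 until n) {
        val v = corr(cols(i), cols(j))
        m(i)(j) = v
        m(j)(i) = v // correlation is symmetric, so mirror the value
      }
    }
    m
  }
}
```

This only halves the number of calls; the growth with the number of fields stays quadratic, which is why computing the whole matrix in a single call is the more effective fix.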
Implementing Pearson and Spearman correlation with Correlation.corr()
import org.apache.spark.ml.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def pearsonOrSpearman(df: DataFrame, colArray: Seq[String], method: String): Seq[Seq[String]] = {
  val colName = "features"
  // Keep only the requested columns, in colArray order, so matrix rows line up with the labels.
  val vecRDD: RDD[Vector] = df.select(colArray.head, colArray.tail: _*).rdd.map { row =>
    Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray)
  }
  // Converting between RDD and DataFrame/Dataset requires these implicits;
  // they are taken from the SparkSession that owns the input DataFrame.
  val spark = df.sparkSession
  import spark.implicits._
  val dataFrame: DataFrame = vecRDD.map(Tuple1.apply).toDF(colName)
  // Compute the Pearson or Spearman correlation matrix in a single call.
  val Row(matResult: Matrix) = Correlation.corr(dataFrame, colName, method).head()
  // Prefix each matrix row with its column name and format the values to six decimals.
  colArray.indices.map { i =>
    colArray(i) +: (0 until matResult.numCols).map(j => "%.6f".format(matResult(i, j)))
  }
}
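One detail worth noting when unpacking the result: Spark's Matrix.toArray returns the values in column-major order. A correlation matrix is symmetric, so reading that array row by row happens to give the same numbers, but for a general matrix the index arithmetic matters. The following plain-Scala sketch shows the conversion; MatrixLayoutSketch and columnMajorToRows are names made up for this illustration.

```scala
object MatrixLayoutSketch {
  // Rebuilds rows from a flat array stored column by column:
  // element (r, c) of a numRows x numCols matrix sits at index r + c * numRows.
  def columnMajorToRows(values: Array[Double], numRows: Int, numCols: Int): Array[Array[Double]] = {
    require(values.length == numRows * numCols, "array length must equal numRows * numCols")
    Array.tabulate(numRows, numCols)((r, c) => values(r + c * numRows))
  }
}
```

Indexing the matrix directly with matrix(i, j) avoids the question of layout altogether.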
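The input data is the same as in the previous example. Time taken to compute the output:
On the same data, the computation is substantially faster.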
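To reproduce this kind of comparison, a simple wall-clock timer is enough. The helper below is a minimal sketch; TimingSketch and timed are hypothetical names, not part of Spark, and a single run is only a rough measurement because it includes job scheduling and JVM warm-up.

```scala
object TimingSketch {
  // Runs `block` once and returns its result together with the elapsed
  // wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

Wrapping a call such as pearsonOrSpearman(df, colArray, "pearson") in TimingSketch.timed measures the full computation, since both versions of the function return fully materialized strings rather than a lazy DataFrame.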