Spark PCA
In machine learning and data mining, the data we work with is often high-dimensional and noisy, so we want to condense many indicators into a few composite ones. PCA (principal component analysis) is a standard tool for this in machine learning frameworks, and Spark's ML module implements it as well.
The main PCA methods
Set the input column:
def setInputCol(value: String): this.type = set(inputCol, value)
Set the output column:
def setOutputCol(value: String): this.type = set(outputCol, value)
Set the number of principal components to keep (the target dimensionality):
def setK(value: Int): this.type = set(k, value)
Fit the model:
def fit(dataset: Dataset[_]): PCAModel
Transform data with the fitted model:
def transform(dataset: Dataset[_]): DataFrame
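Putting these methods together, a minimal end-to-end sketch might look like the following. This is an illustration only: the tiny in-memory dataset, app name, and column names are made up for the example, not taken from the article.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pca-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// a tiny 5-dimensional dataset, purely for illustration
val df = Seq(
  Vectors.dense(1.0, 0.0, 7.0, 0.0, 0.0),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply).toDF("features")

val pca = new PCA()
  .setInputCol("features")     // column holding the input vectors
  .setOutputCol("pcaFeatures") // column the projected vectors are written to
  .setK(2)                     // keep the top 2 principal components

val model = pca.fit(df)         // fit: compute the principal components
model.transform(df).show(false) // transform: project each row into the 2-D subspace
```

Note that `PCA` expects its input column to contain `Vector` values, which is why the full example below first runs the raw columns through a `VectorAssembler`.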
Spark PCA example
val spark = SparkSession.builder().appName("pca").master("local[4]").getOrCreate()
val file = spark.read.format("csv")
.option("sep",",")
.option("header","true")
.load("boston_house_prices.csv")
file.show(true)
import spark.implicits._
// shuffle the rows
val rand = new Random()
val data = file.select("MEDV", "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT")
.map(row => (row.getString(0).toDouble, row.getString(1).toDouble, row.getString(2).toDouble, row.getString(3).toDouble, row.getString(4).toDouble, row.getString(5).toDouble, row.getString(6).toDouble, row.getString(7).toDouble, row.getString(8).toDouble, row.getString(9).toDouble, row.getString(10).toDouble, row.getString(11).toDouble, row.getString(12).toDouble, row.getString(13).toDouble, rand.nextDouble()))
.toDF("price", "crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "rand")
.sort("rand") // every CSV column is read as String, so cast each to Double; adding .option("inferSchema", "true") when loading would avoid these manual casts
data.show(true)
val assembler = new VectorAssembler().setInputCols(Array("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "rand")).setOutputCol("features")
val pca = new PCA().setInputCol("features").setOutputCol("featuresPca").setK(3)
val assembler_data = assembler.transform(data)
val pca_model = pca.fit(assembler_data)
val pca_data = pca_model.transform(assembler_data)
pca_data.select("features","featuresPca").show(false)
The feature vector is reduced from 14 dimensions (the 13 Boston housing features plus the random shuffle column) down to 3.
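To judge whether k = 3 actually retains enough information, the fitted PCAModel exposes the proportion of variance explained by each component, along with the principal-component matrix itself. A short sketch continuing the example above (the printed values depend on the data, so none are shown here):

```scala
// fraction of the total variance captured by each of the 3 components
println(pca_model.explainedVariance)

// the 14 x 3 principal-component matrix used for the projection
println(pca_model.pc)
```

If the three explained-variance fractions sum to a value close to 1, little information is lost by the projection; if not, a larger k may be worth trying.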