Spark ML Feature Engineering: Principal Component Analysis (PCA)

Introduction

Principal component analysis (PCA) is a statistical technique that applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called principal components. Spark ML provides a corresponding API for this transformation.

Hands-On Example

1. Dependencies in the Spark project's pom file

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <scala.version>2.11</scala.version>
        <spark.version>2.3.0</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
 

2. Preparing the test data

        Spark ML operates primarily on DataFrames; in production you can read a DataFrame directly from Hive and compute on it. For convenience, this demo creates one by hand.

val spark: SparkSession = SparkSession.builder().appName("SparkSql").master("local[2]").getOrCreate()
    // Prepare sample data and convert it to a DataFrame
    import spark.implicits._
    val dataList: List[(Int, Double, Double, Double, Double, Double, Double)] = List(
      (0, 8.9255, -6.7863, 11.9081, 5.093, 11.4607, -9.2834),
      (0, 11.5006, -4.1473, 13.8588, 5.389, 12.3622, 7.0433),
      (0, 8.6093, -2.7457, 12.0805, 7.8928, 10.5825, -9.0837),
      (0, 11.0604, -2.1518, 8.9522, 7.1957, 12.5846, -1.8361),
      (1, 9.8369, -1.4834, 12.8746, 6.6375, 12.2772, 2.4486),
      (1, 11.4763, -2.3182, 12.608, 8.6264, 10.9621, 3.5609),
      (0, 11.8091, -0.0832, 9.3494, 4.2916, 11.1355, -8.0198),
      (0, 13.558, -7.9881, 13.8776, 7.5985, 8.6543, 0.831),
      (0, 16.1071, 2.4426, 13.9307, 5.6327, 8.8014, 6.163),
      (1, 12.5088, 1.9743, 8.896, 5.4508, 13.6043, -16.2859),
      (0, 5.0702, -0.5447, 9.59, 4.2987, 12.391, -18.8687),
      (0, 12.7188, -7.975, 10.3757, 9.0101, 12.857, -12.0852),
      (0, 8.7671, -4.6154, 9.7242, 7.4242, 9.0254, 1.4247),
      (1, 16.3699, 1.5934, 16.7395, 7.333, 12.145, 5.9004),
      (0, 13.808, 5.0514, 17.2611, 8.512, 12.8517, -9.1622),
      (0, 3.9416, 2.6562, 13.3633, 6.8895, 12.2806, -16.162),
      (0, 5.0615, 0.2689, 15.1325, 3.6587, 13.5276, -6.5477),
      (0, 8.4199, -1.8128, 8.1202, 5.3955, 9.7184, -17.839),
      (0, 4.875, 1.2646, 11.919, 8.465, 10.7203, -0.6707),
      (0, 4.409, -0.7863, 15.1828, 8.0631, 11.2831, -0.7356))

    val inputDF: DataFrame = dataList.toDF("target", "feature1", "feature2", "feature3", "feature4", "feature5", "feature6")
    inputDF.show()

3. Assembling the fields into a vector

     The PCA algorithm transforms a feature vector, so the fields to be reduced must first be assembled into a single vector column (every input field must be numeric).

    val transCols: Array[String] = Array("feature1", "feature2", "feature3", "feature4", "feature5", "feature6")
    val assembler: VectorAssembler = new VectorAssembler().setInputCols(transCols).setOutputCol("fea_vector")
    val vectorDf: DataFrame = assembler.transform(inputDF)
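PCA is sensitive to the scale of its inputs, so in practice it is common (though not done in this demo, whose features are on similar scales) to standardize the assembled vector first, for example with `StandardScaler`. A minimal sketch reusing `vectorDf` from above; the output column name `fea_scaled` is my own choice:

```scala
import org.apache.spark.ml.feature.StandardScaler

// Center each feature at zero and scale it to unit variance so that no
// column dominates the principal components purely because of its units.
val scaler: StandardScaler = new StandardScaler()
  .setInputCol("fea_vector")
  .setOutputCol("fea_scaled")
  .setWithMean(true)
  .setWithStd(true)
val scaledDf: DataFrame = scaler.fit(vectorDf).transform(vectorDf)
```

The scaled column would then be fed to PCA via setInputCol("fea_scaled") instead of "fea_vector".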

4. The PCA transformation

    Call PCA from the ml package to perform the dimensionality reduction; K must be less than or equal to the number of assembled fields.

    val pca: PCA = new PCA().setInputCol("fea_vector").setOutputCol("fea_pca_vector").setK(6)
    val pcaDf: DataFrame = pca.fit(vectorDf).transform(vectorDf)
    pcaDf.show(100)

Output

+------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
|target|feature1|feature2|feature3|feature4|feature5|feature6|          fea_vector|      fea_pca_vector|
+------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
|     0|  8.9255| -6.7863| 11.9081|   5.093| 11.4607| -9.2834|[8.9255,-6.7863,1...|[5.37827793617585...|
|     0| 11.5006| -4.1473| 13.8588|   5.389| 12.3622|  7.0433|[11.5006,-4.1473,...|[-11.007146502429...|
|     0|  8.6093| -2.7457| 12.0805|  7.8928| 10.5825| -9.0837|[8.6093,-2.7457,1...|[5.20471538182984...|
|     0| 11.0604| -2.1518|  8.9522|  7.1957| 12.5846| -1.8361|[11.0604,-2.1518,...|[-1.6205023579940...|
|     1|  9.8369| -1.4834| 12.8746|  6.6375| 12.2772|  2.4486|[9.8369,-1.4834,1...|[-6.0466221528193...|
|     1| 11.4763| -2.3182|  12.608|  8.6264| 10.9621|  3.5609|[11.4763,-2.3182,...|[-7.6342080203012...|
|     0| 11.8091| -0.0832|  9.3494|  4.2916| 11.1355| -8.0198|[11.8091,-0.0832,...|[4.29214674176063...|
|     0|  13.558| -7.9881| 13.8776|  7.5985|  8.6543|   0.831|[13.558,-7.9881,1...|[-5.9564493908368...|
|     0| 16.1071|  2.4426| 13.9307|  5.6327|  8.8014|   6.163|[16.1071,2.4426,1...|[-11.025959785947...|
|     1| 12.5088|  1.9743|   8.896|  5.4508| 13.6043|-16.2859|[12.5088,1.9743,8...|[12.3480102757833...|
|     0|  5.0702| -0.5447|    9.59|  4.2987|  12.391|-18.8687|[5.0702,-0.5447,9...|[16.1266643342444...|
|     0| 12.7188|  -7.975| 10.3757|  9.0101|  12.857|-12.0852|[12.7188,-7.975,1...|[7.34400328544611...|
|     0|  8.7671| -4.6154|  9.7242|  7.4242|  9.0254|  1.4247|[8.7671,-4.6154,9...|[-4.7371370904251...|
|     1| 16.3699|  1.5934| 16.7395|   7.333|  12.145|  5.9004|[16.3699,1.5934,1...|[-11.193626843428...|
|     0|  13.808|  5.0514| 17.2611|   8.512| 12.8517| -9.1622|[13.808,5.0514,17...|[3.88645081933627...|
|     0|  3.9416|  2.6562| 13.3633|  6.8895| 12.2806| -16.162|[3.9416,2.6562,13...|[13.1767302316998...|
|     0|  5.0615|  0.2689| 15.1325|  3.6587| 13.5276| -6.5477|[5.0615,0.2689,15...|[3.56307341821141...|
|     0|  8.4199| -1.8128|  8.1202|  5.3955|  9.7184| -17.839|[8.4199,-1.8128,8...|[14.4037358280618...|
|     0|   4.875|  1.2646|  11.919|   8.465| 10.7203| -0.6707|[4.875,1.2646,11....|[-1.9500646332272...|
|     0|   4.409| -0.7863| 15.1828|  8.0631| 11.2831| -0.7356|[4.409,-0.7863,15...|[-2.3260546476231...|
+------+--------+--------+--------+--------+--------+--------+--------------------+--------------------+
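Note that with K = 6, equal to the number of input fields, this step only rotates the data rather than reducing it. To pick a smaller K, you can keep the fitted `PCAModel` and inspect its `explainedVariance` member, which reports the fraction of total variance captured by each component. A minimal sketch, reusing `vectorDf` from step 3:

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}

// Fit once and keep the model so the per-component variance can be inspected
// before settling on a value of K.
val pcaModel: PCAModel = new PCA()
  .setInputCol("fea_vector")
  .setOutputCol("fea_pca_vector")
  .setK(6)
  .fit(vectorDf)

// explainedVariance(i) is the fraction of total variance captured by
// component i; the entries are non-increasing. If the first few entries
// already sum close to 1.0, a correspondingly small K is a reasonable choice.
println(pcaModel.explainedVariance)
```

The same model can then be reused for the transform, avoiding a second fit.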

5. Expanding the reduced vector column

      After the reduction, the new features live in a single vector column; to make them easier to store and inspect, expand the vector into multiple separate columns.

    // Expand the PCA vector column into multiple fields
    val keepCols: Array[Column] = inputDF.schema.fieldNames.map(colName => $"$colName")
    // UDF that converts a Vector to an Array
    val vecToArray = udf((xs: DenseVector) => xs.toArray)
    // Apply the UDF
    val arrayCols: Array[Column] = Array(vecToArray($"fea_pca_vector").alias("fea_pca_array"))
    val arrayDf: DataFrame = pcaDf.select((keepCols ++ arrayCols): _*)

    // Split the array into multiple fields
    val strings: Array[String] = Array.tabulate(6)(i => "c" + (i + 1))
    val extendExprs: Array[Column] = strings.zipWithIndex.map { case (newCol, index) =>
      $"fea_pca_array".getItem(index).alias(newCol)
    }
    val pcaTransDf: DataFrame = arrayDf.select((keepCols ++ extendExprs): _*)
    pcaTransDf.show(100)

Output

+------+--------+--------+--------+--------+--------+--------+-------------------+--------------------+------------------+------------------+--------------------+------------------+
|target|feature1|feature2|feature3|feature4|feature5|feature6|                 c1|                  c2|                c3|                c4|                  c5|                c6|
+------+--------+--------+--------+--------+--------+--------+-------------------+--------------------+------------------+------------------+--------------------+------------------+
|     0|  8.9255| -6.7863| 11.9081|   5.093| 11.4607| -9.2834|  5.378277936175852|  0.8085996977489435| 9.765071394094718|17.767701575832902|  3.0316488808403053|7.8211656429027165|
|     0| 11.5006| -4.1473| 13.8588|   5.389| 12.3622|  7.0433|-11.007146502429592| -2.3872999387657505|  9.21100353534196|  16.0850321635918|  3.6911686117456326|  9.43141416460013|
|     0|  8.6093| -2.7457| 12.0805|  7.8928| 10.5825| -9.0837|  5.204715381829842| -2.7756558119271997|  9.71398321887668| 17.07361313534964|-0.39201897316702417| 7.973697841675099|
|     0| 11.0604| -2.1518|  8.9522|  7.1957| 12.5846| -1.8361|-1.6205023579940805| -2.3148666738460997|10.903262101103776|13.192584362140138|  0.7156899703956436|10.918051142890702|
|     1|  9.8369| -1.4834| 12.8746|  6.6375| 12.2772|  2.4486| -6.046622152819378|  -4.472306164358786| 8.734728668947925|15.333588676013058|   1.771778594153545| 9.715209853692036|
|     1| 11.4763| -2.3182|  12.608|  8.6264| 10.9621|  3.5609| -7.634208020301263|  -3.298607269322372| 10.05930898454645|15.646780327003022| -0.3532640057636149| 9.381120923946531|
|     0| 11.8091| -0.0832|  9.3494|  4.2916| 11.1355| -8.0198|  4.292146741760639|  -4.173413568817214|12.929572248105465|12.716368366095086|   2.268932713373908| 7.983708736657731|
|     0|  13.558| -7.9881| 13.8776|  7.5985|  8.6543|   0.831| -5.956449390836877|  1.8014434282316747|12.099846366855708| 18.63170093453373|   0.681252772624825| 6.387989056319241|
|     0| 16.1071|  2.4426| 13.9307|  5.6327|  8.8014|   6.163|-11.025959785947084|  -7.706429472765043|14.300253341844073| 13.49356798431484|   1.147005938548894| 6.147666083861472|
|     1| 12.5088|  1.9743|   8.896|  5.4508| 13.6043|-16.2859|   12.3480102757833|  -6.105331616136972|15.459834514071053|13.698355324238221|  1.5642340813740612|10.033883091967365|
|     0|  5.0702| -0.5447|    9.59|  4.2987|  12.391|-18.8687| 16.126664334244463|   -4.32289470160106| 8.451683445406559|14.892389300334058|   2.323784782195987| 8.285189165700926|
|     0| 12.7188|  -7.975| 10.3757|  9.0101|  12.857|-12.0852|  7.344003285446111|  2.6089831541951822| 14.05660334439746|18.847036921603753| 0.03607053558529383|10.583887860088282|
|     0|  8.7671| -4.6154|  9.7242|  7.4242|  9.0254|  1.4247| -4.737137090425153|-0.01518501667156...| 7.743220582764422|13.639363318603786|-0.34387047342556865| 7.923493173409812|
|     1| 16.3699|  1.5934| 16.7395|   7.333|  12.145|  5.9004|-11.193626843428998|  -8.380724459009917|14.498466590233898| 17.29305570070696|  1.5791796375873906| 8.917909079406792|
|     0|  13.808|  5.0514| 17.2611|   8.512| 12.8517| -9.1622| 3.8864508193362752| -11.872944341915899|15.201248427589983|19.335416786167627|-0.21452670708261612| 8.639195554128658|
|     0|  3.9416|  2.6562| 13.3633|  6.8895| 12.2806| -16.162| 13.176730231699892|  -8.668643514047263| 6.907720843075445|17.200330799510507| 0.20024961849351852| 8.263068422760218|
|     0|  5.0615|  0.2689| 15.1325|  3.6587| 13.5276| -6.5477| 3.5630734182114168|  -7.391453929939899| 5.861647300248105| 17.28456634835369|   4.592966545668411| 8.508730836924814|
|     0|  8.4199| -1.8128|  8.1202|  5.3955|  9.7184| -17.839| 14.403735828061878|  -2.048230528664273| 11.45105892146462|13.916717603332888| 0.35105723396700206| 6.731966513698952|
|     0|   4.875|  1.2646|  11.919|   8.465| 10.7203| -0.6707|-1.9500646332272775|  -6.616776359507901| 4.730122194108645|14.172039600382977| -1.2902512357868443| 9.054739235175216|
|     0|   4.409| -0.7863| 15.1828|  8.0631| 11.2831| -0.7356|-2.3260546476231543|  -6.092097719226293|3.9991882958691547| 17.67278820500946| 0.14672873361657204| 8.518881087294343|
+------+--------+--------+--------+--------+--------+--------+-------------------+--------------------+------------------+------------------+--------------------+------------------+
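On Spark 3.0 and later, the hand-written UDF above can be replaced by the built-in `vector_to_array` function from `org.apache.spark.ml.functions` (not available in the Spark 2.3 used in this article). A sketch reusing `pcaDf` and `keepCols` from above:

```scala
import org.apache.spark.ml.functions.vector_to_array

// Built-in replacement for the DenseVector-to-Array UDF; unlike the UDF,
// it also handles sparse vectors transparently.
val arrayDf2: DataFrame = pcaDf.select(
  (keepCols :+ vector_to_array($"fea_pca_vector").alias("fea_pca_array")): _*)
```

The rest of the column-splitting logic is unchanged.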

The complete code is as follows:

import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

/**
  * date       : Created in 2019/4/11 17:15
  * description: Spark ML PCA feature-engineering demo
  */

object PCADemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("SparkSql").master("local[2]").getOrCreate()
    // Prepare sample data and convert it to a DataFrame
    import spark.implicits._
    val dataList: List[(Int, Double, Double, Double, Double, Double, Double)] = List(
      (0, 8.9255, -6.7863, 11.9081, 5.093, 11.4607, -9.2834),
      (0, 11.5006, -4.1473, 13.8588, 5.389, 12.3622, 7.0433),
      (0, 8.6093, -2.7457, 12.0805, 7.8928, 10.5825, -9.0837),
      (0, 11.0604, -2.1518, 8.9522, 7.1957, 12.5846, -1.8361),
      (1, 9.8369, -1.4834, 12.8746, 6.6375, 12.2772, 2.4486),
      (1, 11.4763, -2.3182, 12.608, 8.6264, 10.9621, 3.5609),
      (0, 11.8091, -0.0832, 9.3494, 4.2916, 11.1355, -8.0198),
      (0, 13.558, -7.9881, 13.8776, 7.5985, 8.6543, 0.831),
      (0, 16.1071, 2.4426, 13.9307, 5.6327, 8.8014, 6.163),
      (1, 12.5088, 1.9743, 8.896, 5.4508, 13.6043, -16.2859),
      (0, 5.0702, -0.5447, 9.59, 4.2987, 12.391, -18.8687),
      (0, 12.7188, -7.975, 10.3757, 9.0101, 12.857, -12.0852),
      (0, 8.7671, -4.6154, 9.7242, 7.4242, 9.0254, 1.4247),
      (1, 16.3699, 1.5934, 16.7395, 7.333, 12.145, 5.9004),
      (0, 13.808, 5.0514, 17.2611, 8.512, 12.8517, -9.1622),
      (0, 3.9416, 2.6562, 13.3633, 6.8895, 12.2806, -16.162),
      (0, 5.0615, 0.2689, 15.1325, 3.6587, 13.5276, -6.5477),
      (0, 8.4199, -1.8128, 8.1202, 5.3955, 9.7184, -17.839),
      (0, 4.875, 1.2646, 11.919, 8.465, 10.7203, -0.6707),
      (0, 4.409, -0.7863, 15.1828, 8.0631, 11.2831, -0.7356))

    val inputDF: DataFrame = dataList.toDF("target", "feature1", "feature2", "feature3", "feature4", "feature5", "feature6")
    inputDF.show()
    // Assemble the columns to be transformed into a single vector column
    val transCols: Array[String] = Array("feature1", "feature2", "feature3", "feature4", "feature5", "feature6")
    val assembler: VectorAssembler = new VectorAssembler().setInputCols(transCols).setOutputCol("fea_vector")
    val vectorDf: DataFrame = assembler.transform(inputDF)
    // Call PCA from the ml package
    val pca: PCA = new PCA().setInputCol("fea_vector").setOutputCol("fea_pca_vector").setK(6)
    val pcaDf: DataFrame = pca.fit(vectorDf).transform(vectorDf)
    pcaDf.show(100)

    // Expand the PCA vector column into multiple fields
    val keepCols: Array[Column] = inputDF.schema.fieldNames.map(colName => $"$colName")
    // UDF that converts a Vector to an Array
    val vecToArray = udf((xs: DenseVector) => xs.toArray)
    // Apply the UDF
    val arrayCols: Array[Column] = Array(vecToArray($"fea_pca_vector").alias("fea_pca_array"))
    val arrayDf: DataFrame = pcaDf.select((keepCols ++ arrayCols): _*)

    // Split the array into multiple fields
    val strings: Array[String] = Array.tabulate(6)(i => "c" + (i + 1))
    val extendExprs: Array[Column] = strings.zipWithIndex.map { case (newCol, index) =>
      $"fea_pca_array".getItem(index).alias(newCol)
    }
    val pcaTransDf: DataFrame = arrayDf.select((keepCols ++ extendExprs): _*)
    pcaTransDf.show(100)
    spark.stop()
  }
}

 
