SparkML Feature Extraction (Part 1): Principal Component Analysis (PCA)

Tags: Spark, machine learning, source code

Principal Component Analysis (PCA) is a multivariate statistical method that applies a linear transformation to a set of variables in order to extract a smaller number of important variables (the principal components).
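Concretely, given a data vector x of dimension n and a matrix pc whose k columns (k < n) are the principal components, PCA projects x into the lower-dimensional space as y = pcᵀ·x. A minimal plain-Scala sketch of that projection (no Spark; the names and data are illustrative):

```scala
object PcaProjection {
  // Project x (length n) onto the k columns of pc (an n x k matrix): y = pc^T * x
  def project(pc: Array[Array[Double]], x: Array[Double]): Array[Double] = {
    val k = pc(0).length
    Array.tabulate(k) { j =>
      x.indices.map(i => pc(i)(j) * x(i)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Two principal components in 3-dimensional space (columns of pc)
    val pc = Array(
      Array(1.0, 0.0),
      Array(0.0, 1.0),
      Array(0.0, 0.0))
    val x = Array(2.0, 5.0, 7.0)
    // The projection keeps the first two coordinates and drops the third
    println(project(pc, x).mkString(","))  // prints 2.0,5.0
  }
}
```

This is exactly the operation `PCAModel.transform` performs below; the hard part, which `PCA.fit` delegates to `RowMatrix`, is computing the columns of pc in the first place.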

--------------------------------------------Contents------------------------------------------------------

Theory and data: see the Appendix

Spark source code (mllib package)

Experiment

----------------------------------------------------------------------------------------------------------

Spark source code (mllib package)

/**
 * A feature transformer that projects vectors to a low-dimensional space using PCA.
 *
 * @param k number of principal components
 */
@Since("1.4.0")
class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
  require(k > 0,
    s"Number of principal components must be positive but got ${k}")

  /**
   * Computes a [[PCAModel]] that contains the principal components of the input vectors.
   *
   * @param sources source vectors
   */
  @Since("1.4.0")
  def fit(sources: RDD[Vector]): PCAModel = {
    require(k <= sources.first().size,
      s"source vector size ${sources.first().size} must be no less than k=$k")

    val mat = new RowMatrix(sources)
    val (pc, explainedVariance) = mat.computePrincipalComponentsAndExplainedVariance(k)
    val densePC = pc match {
      case dm: DenseMatrix =>
        dm
      case sm: SparseMatrix =>
        /* Convert a sparse matrix to dense.
         *
         * RowMatrix.computePrincipalComponents always returns a dense matrix.
         * The following code is a safeguard.
         */
        sm.toDense
      case m =>
        throw new IllegalArgumentException("Unsupported matrix format. Expected " +
          s"SparseMatrix or DenseMatrix. Instead got: ${m.getClass}")

    }
    val denseExplainedVariance = explainedVariance match {
      case dv: DenseVector =>
        dv
      case sv: SparseVector =>
        sv.toDense
    }
    new PCAModel(k, densePC, denseExplainedVariance)
  }

  /**
   * Java-friendly version of [[fit()]]
   */
  @Since("1.4.0")
  def fit(sources: JavaRDD[Vector]): PCAModel = fit(sources.rdd)
}

/**
 * Model fitted by [[PCA]] that can project vectors to a low-dimensional space using PCA.
 *
 * @param k number of principal components.
 * @param pc a principal components Matrix. Each column is one principal component.
 */
@Since("1.4.0")
class PCAModel private[spark] (
    @Since("1.4.0") val k: Int,
    @Since("1.4.0") val pc: DenseMatrix,
    @Since("1.6.0") val explainedVariance: DenseVector) extends VectorTransformer {
  /**
   * Transform a vector by computed Principal Components.
   *
   * @param vector vector to be transformed.
   *               Vector must be the same length as the source vectors given to [[PCA.fit()]].
   * @return transformed vector. Vector will be of length k.
   */
  @Since("1.4.0")
  override def transform(vector: Vector): Vector = {
    vector match {
      case dv: DenseVector =>
        pc.transpose.multiply(dv)
      case SparseVector(size, indices, values) =>
        /* SparseVector -> single row SparseMatrix */
        val sm = Matrices.sparse(size, 1, Array(0, indices.length), indices, values).transpose
        val projection = sm.multiply(pc)
        Vectors.dense(projection.values)
      case _ =>
        throw new IllegalArgumentException("Unsupported vector format. Expected " +
          s"SparseVector or DenseVector. Instead got: ${vector.getClass}")
    }
  }
}
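In the `SparseVector` branch of `transform` above, the vector is rewritten as a single-row sparse matrix so that the multiplication only touches the non-zero entries. A plain-Scala sketch of the same idea, with a sparse vector given as parallel index/value arrays (illustrative names, no Spark):

```scala
object SparseProjection {
  // Project a sparse vector (parallel indices/values arrays, logical size n)
  // onto the k columns of pc (an n x k matrix); only non-zero entries contribute.
  def project(indices: Array[Int], values: Array[Double],
              pc: Array[Array[Double]]): Array[Double] = {
    val k = pc(0).length
    Array.tabulate(k) { j =>
      indices.indices.map(t => pc(indices(t))(j) * values(t)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Sparse vector of size 4: only positions 1 and 3 are non-zero
    val indices = Array(1, 3)
    val values = Array(2.0, 4.0)
    // pc is 4 x 2
    val pc = Array(
      Array(0.5, 0.0),
      Array(0.5, 0.0),
      Array(0.0, 0.5),
      Array(0.0, 0.5))
    println(project(indices, values, pc).mkString(","))  // prints 1.0,2.0
  }
}
```

The result is identical to densifying the vector and computing pcᵀ·x, but the work is proportional to the number of non-zeros rather than to n.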

---------------------------------------------------------------------------------------------------------

SparkML Experiment

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}


object myPCA {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("PCA example").setMaster("local")
    val sc = new SparkContext(conf)

    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    // Load the data; each line holds space-separated feature values
    val data = sc.textFile("/root/application/upload/pca2.data")
    //data.foreach(println)

    // Parse each line into a dense feature vector
    val parsedData = data.map { line =>
      val parts = line.split(' ')
      Vectors.dense(parts.map(_.toDouble))
    }

    // Fit a PCA model that keeps the top 3 principal components
    val model = new PCA(3).fit(parsedData)

    // Project each vector onto the principal components and print the result
    model.transform(parsedData).foreach(println)
    //--------------------------------------------------------------------------
    /* Sample output:
        [-198.49935555431662,61.7455925014451,-33.61561582724634]
        [-142.6503762139188,42.83576581230462,-27.723300375043127]
        [-94.48444346449276,37.63137787042039,-18.467916687311757]
        [-93.78770648660057,53.13442729915277,-20.324679585348406]
        [-115.21309309209421,64.72629901491086,-24.068684431501]
        [-141.13717390563068,62.443549430022024,-32.15482042868974]
        [-139.84404002633448,85.49929177772042,-26.90430756804854]
        [-106.34627395862736,57.60589638943985,-23.47345414370614]
        [-254.30867520979697,40.87956572432333,-12.424267061380176]
        [-146.56200808994245,52.842236008590454,-16.703674457958243]
        [-170.42181527333886,63.27229377718993,-21.440842300235158]
        [-139.13974251002367,74.9052975468746,-12.130842693355147]
        [-131.03062483262897,72.29955746812841,-15.20705763790804]
        [-126.21628609915788,71.19600990352119,-11.411808043562743]
        [-120.23904415710874,39.83322827884836,-26.220672650471542]
        [-97.36990893617941,43.377395313806836,-17.568739657112463]
      */
    println("---------------------------------------------------")
    
    sc.stop()

  }
}
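The fitted model also exposes `explainedVariance` (the fraction of the total variance carried by each component), which is the usual basis for choosing k: keep the smallest k whose cumulative explained variance reaches some threshold. A small sketch of that selection rule (plain Scala; the ratios and the 90% threshold are illustrative):

```scala
object ChooseK {
  // Smallest k whose cumulative explained-variance ratio reaches `threshold`
  def chooseK(explainedVariance: Array[Double], threshold: Double): Int = {
    val cumulative = explainedVariance.scanLeft(0.0)(_ + _).tail
    cumulative.indexWhere(_ >= threshold) + 1
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical variance ratios for 4 components
    val ratios = Array(0.60, 0.25, 0.10, 0.05)
    // 0.60 + 0.25 + 0.10 = 0.95 >= 0.90, so 3 components suffice
    println(chooseK(ratios, 0.90))  // prints 3
  }
}
```

In practice one would fit with a generous k first, inspect `model.explainedVariance`, and then refit with the chosen k.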

Appendix

Data link: http://pan.baidu.com/s/1dELByj3  Password: wsnb

