Learning MLlib: Basic Statistics

young_so_nice

Article link: https://blog.csdn.net/young_so_nice/article/details/52210244

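First, summary statistics.
1. Summary statistics
Summary statistics provide column-wise statistics for a set of vectors, including six quantities per column: mean, variance, number of nonzero entries, count, minimum and maximum. The example below also prints the L1 and L2 norms of each column.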

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

/**
 * Created by Administrator on 2016/8/15.
 */
object Summarystatistics {
  def main(args: Array[String]): Unit = {
    test()
  }

  def test(): Unit = {
    val conf = new SparkConf().setAppName("Summarystatistics").setMaster("local")
    val sc = new SparkContext(conf)
    // Each vector is one observation (row); statistics are computed per column.
    val observations = sc.parallelize(
      Seq(
        Vectors.dense(1.0, 10.0, 100.0),
        Vectors.dense(2.0, 20.0, 200.0),
        Vectors.dense(3.0, 30.0, 300.0)
      )
    )
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.count)       // number of observations
    println(summary.max)         // maximum of each column
    println(summary.min)         // minimum of each column
    println(summary.mean)        // mean of each column
    println(summary.normL1)      // L1 norm of each column
    println(summary.normL2)      // L2 norm of each column
    println(summary.numNonzeros) // number of nonzero entries in each column
    println(summary.variance)    // variance of each column

    sc.stop()
  }
}
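
As a quick check of the program above, the statistics of the first column (values 1, 2, 3) can be worked out by hand. This is only a sketch, and it assumes that `variance` is the unbiased sample variance (dividing by n - 1):

```latex
\begin{aligned}
\text{count} &= 3, \quad \text{numNonzeros} = 3, \quad \min = 1, \quad \max = 3 \\
\text{mean} &= \frac{1 + 2 + 3}{3} = 2 \\
\text{variance} &= \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3 - 1} = 1 \\
\text{normL1} &= |1| + |2| + |3| = 6 \\
\text{normL2} &= \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.742
\end{aligned}
```

The other two columns are 10 and 100 times the first, so their means and norms scale by 10 and 100, and their variances by 100 and 10000.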

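2. Correlations
Computes the correlation between two series of data. A correlation coefficient is a statistic that measures how closely two variables are related. For the Pearson coefficient, the closer the value is to 1 or -1, the better the data can be fitted by a straight line. Spark currently supports two correlation coefficients: Pearson and Spearman's rank correlation.

Spearman rank correlation
Put simply, however the values of the two variables vary and whatever distribution they follow, only the rank of each value within its own variable matters.
If corresponding values of the two variables have the same or similar ranks within their own series, the two variables are strongly correlated.

Of all the ways to compute a correlation coefficient, Pearson correlation is the most common.
Baidu Baike describes it as follows: the Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient. It is a statistic that reflects the degree of linear correlation between two variables.
The coefficient is denoted r:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where n is the sample size, and $x_i$, $y_i$ and $\bar{x}$, $\bar{y}$ are the observations and the means of the two variables.
r describes the strength of the linear relationship between the two variables. The larger the absolute value of r, the stronger the correlation.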
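
To make the two definitions concrete, here is a minimal plain-Scala sketch (no Spark needed). It computes Pearson's r directly from its definition, and Spearman's coefficient as Pearson's r applied to the ranks of the values. The object and method names (`CorrelationSketch`, `pearson`, `ranks`, `spearman`) are my own for illustration, and giving tied values the average of their ranks is an assumption of this sketch, not a statement about Spark's internals.

```scala
object CorrelationSketch {
  // Pearson correlation of two equally long, non-empty series.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "series must be non-empty and equally long")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    val dx = xs.map(_ - meanX)
    val dy = ys.map(_ - meanY)
    val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
    cov / math.sqrt(dx.map(a => a * a).sum * dy.map(b => b * b).sum)
  }

  // 1-based ranks; tied values share the average of the ranks they span
  // (an assumption of this sketch).
  def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.sorted
    xs.map { x =>
      val first = sorted.indexOf(x) + 1
      val last = sorted.lastIndexOf(x) + 1
      (first + last) / 2.0
    }
  }

  // Spearman correlation: Pearson correlation of the ranks.
  def spearman(xs: Seq[Double], ys: Seq[Double]): Double =
    pearson(ranks(xs), ranks(ys))
}
```

For example, `CorrelationSketch.ranks(Seq(11.0, 22.0, 33.0, 33.0, 555.0))` gives the ranks 1, 2, 3.5, 3.5, 5: the outlier 555 simply becomes rank 5, which is why Spearman is much less sensitive to it than Pearson.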

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

/**
 * Created by Administrator on 2016/8/15.
 */
object Correlations {

  def main(args: Array[String]): Unit = {
    test()
  }

  def test(): Unit = {
    val conf = new SparkConf().setAppName("Correlations").setMaster("local")
    val sc = new SparkContext(conf)
    // The two series must have the same number of elements.
    val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
    val seriesY: RDD[Double] = sc.parallelize(Array(11.0, 22.0, 33.0, 33.0, 555.0))

    // Pearson correlation coefficient:
    // val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

    // Spearman rank correlation coefficient
    val correlation: Double = Statistics.corr(seriesX, seriesY, "spearman")
    println(s"Correlation is: $correlation")

    // Each vector is one row; correlations are computed between columns.
    val data: RDD[Vector] = sc.parallelize(
      Seq(
        Vectors.dense(1.0, 10.0, 100.0),
        Vectors.dense(2.0, 20.0, 200.0),
        Vectors.dense(5.0, 33.0, 366.0)
      )
    )
    // Pairwise Pearson correlation matrix of the columns
    val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    println(correlMatrix.toString)

    sc.stop()
  }
}

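3. Stratified sampling
A feature for sampling by key: a selection probability can be set for each key. See the code and its comments for details.
Unlike the other statistics functions, sampleByKey and sampleByKeyExact are called on RDDs of key-value pairs. The key can be thought of as a label and the value as a specific attribute. For example, the key can be man or woman, or a document ID, and the values can be the ages of people in a population or the words in a document. sampleByKey flips a coin for each observation to decide whether it is sampled, so it needs only one pass over the data and yields an expected sample size rather than an exact one. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used by sampleByKey, but it provides the exact sample size, ⌈f_k · n_k⌉ for a key k with n_k items and fraction f_k, with 99.99% confidence.

In effect, fractions sets the probability with which the items of each key are selected: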

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Administrator on 2016/8/15.
 */
object Stratifiedsampling {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Stratifiedsampling").setMaster("local")
    val sc = new SparkContext(conf)

    // an RDD[(K, V)] of any key-value pairs
    val data = sc.parallelize(
      Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))

    // specify the fraction desired from each key
    val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

    // Get an approximate sample from each stratum
    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
    // Get an exact sample from each stratum
    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)

    // Collect each sample once, then print its size and contents
    val approx = approxSample.collect()
    println("approxSample size is " + approx.length)
    approx.foreach(println)

    val exact = exactSample.collect()
    println("exactSample size is " + exact.length)
    exact.foreach(println)

    sc.stop()
  }
}
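
The coin-flip behaviour of sampleByKey described in this section can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not Spark's implementation: `StratifiedSamplingSketch`, `sampleByKeySketch` and `exactSizes` are hypothetical names, keys missing from `fractions` are treated as fraction 0 (an assumption of this sketch), and `exactSizes` just evaluates ⌈f_k · n_k⌉ per key.

```scala
import scala.util.Random

object StratifiedSamplingSketch {
  // Keep each (key, value) pair independently with probability fractions(key).
  // Keys absent from `fractions` are never sampled (assumption of this sketch).
  def sampleByKeySketch[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
  }

  // Per-key sample size an exact sampler would target: ceil(fraction * itemCount).
  def exactSizes[K, V](data: Seq[(K, V)], fractions: Map[K, Double]): Map[K, Int] =
    data.groupBy(_._1).map { case (key, items) =>
      key -> math.ceil(items.size * fractions.getOrElse(key, 0.0)).toInt
    }
}
```

With the data and fractions from the program above, `exactSizes` yields 1 item for key 1, 2 items for key 2 and 1 item for key 3, whereas the coin-flip version may return anywhere from 0 to 6 items depending on the seed.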

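4. Hypothesis testing

Hypothesis testing is a statistical inference method for judging whether a difference between samples, or between a sample and the population, is caused by sampling error or by a real difference. The basic idea is to first state a hypothesis about some characteristic of the population, and then use statistical reasoning on the sample to decide whether that hypothesis should be rejected. The example below uses Pearson's chi-squared tests for goodness of fit and for independence.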

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD

object HypothesisTestingExample {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("HypothesisTestingExample").setMaster("local")
    val sc = new SparkContext(conf)

    // a vector composed of the frequencies of events
    val vec: Vector = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)

    // compute the goodness of fit. If a second vector to test against is not supplied
    // as a parameter, the test runs against a uniform distribution.
    val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
    // summary of the test including the p-value, degrees of freedom, test statistic, the method
    // used, and the null hypothesis.
    println(s"$goodnessOfFitTestResult\n")

    // a contingency matrix. Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0));
    // Matrices.dense takes the values in column-major order.
    val mat: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

    // conduct Pearson's independence test on the input contingency matrix
    val independenceTestResult = Statistics.chiSqTest(mat)
    // summary of the test including the p-value, degrees of freedom
    println(s"$independenceTestResult\n")

    // (label, features) pairs
    val obs: RDD[LabeledPoint] =
      sc.parallelize(
        Seq(
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
          LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
          LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
        )
      )

    // The contingency table is constructed from the raw (label, feature) pairs and used to conduct
    // the independence test. Returns an array containing the ChiSqTestResult for every feature
    // against the label.
    val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
    featureTestResults.zipWithIndex.foreach { case (result, index) =>
      println("Column " + (index + 1) + ":")
      println(result) // summary of the test
    }

    sc.stop()
  }
}
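
The test statistics reported for the first two tests above can be checked by hand. Below is a minimal plain-Scala sketch of Pearson's chi-squared statistic, the sum over all cells of (observed - expected)² / expected, for the goodness-of-fit case against a uniform distribution and for the independence case on a contingency table. The names (`ChiSquareSketch`, `statistic`, `goodnessOfFitUniform`, `independence`) are hypothetical, and the sketch computes only the statistic, not the p-value.

```scala
object ChiSquareSketch {
  // Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
  def statistic(observed: Seq[Double], expected: Seq[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

  // Goodness of fit against a uniform distribution over the categories:
  // every category is expected to hold an equal share of the total.
  def goodnessOfFitUniform(observed: Seq[Double]): Double = {
    val expectedPerCategory = observed.sum / observed.length
    statistic(observed, Seq.fill(observed.length)(expectedPerCategory))
  }

  // Independence test on a contingency table given as a sequence of rows:
  // expected(i, j) = rowSum(i) * colSum(j) / total.
  def independence(table: Seq[Seq[Double]]): Double = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    // Row-major order, matching table.flatten
    val expected = for (r <- rowSums; c <- colSums) yield r * c / total
    statistic(table.flatten, expected)
  }
}
```

For the frequency vector (0.1, 0.15, 0.2, 0.3, 0.25) the expected value per category is 0.2, so the statistic is 0.025 / 0.2 = 0.125 with 5 - 1 = 4 degrees of freedom. For the 3 x 2 matrix the statistic works out to 14/99 ≈ 0.1414 with (3 - 1)(2 - 1) = 2 degrees of freedom.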