Spark ML (2): Basic Statistics (Summary Statistics, Correlation, Hypothesis Testing)

I. Purpose

Basic statistical methods give you a quantitative feel for the whole dataset before any further processing, which improves both the efficiency and the accuracy of the later steps.

II. Summary Statistics

1. Purpose

Before training a machine-learning model with Spark, summary statistics give a quick overall picture of the dataset.

2. Reference: official documentation

http://spark.apache.org/docs/2.1.0/mllib-statistics.html

Official example:
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)  // a dense vector containing the mean value for each column
println(summary.variance)  // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column
```
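As a sanity check that needs no Spark installation, the per-column quantities colStats reports can be reproduced in plain Scala. This is just the arithmetic (mean, unbiased variance, nonzero count per column) over the same three rows as the official example, not MLlib's implementation:

```scala
// Plain-Scala sketch of what Statistics.colStats reports per column,
// using the same three rows as the official example above.
val rows = Seq(
  Array(1.0, 10.0, 100.0),
  Array(2.0, 20.0, 200.0),
  Array(3.0, 30.0, 300.0)
)

val n = rows.length
val cols = rows.head.length

val mean = (0 until cols).map(j => rows.map(_(j)).sum / n)
// MLlib reports the unbiased (sample) variance, hence the n - 1 divisor.
val variance = (0 until cols).map { j =>
  val m = mean(j)
  rows.map(r => (r(j) - m) * (r(j) - m)).sum / (n - 1)
}
val numNonzeros = (0 until cols).map(j => rows.count(_(j) != 0.0))

println(mean)        // Vector(2.0, 20.0, 200.0)
println(variance)    // Vector(1.0, 100.0, 10000.0)
println(numNonzeros) // Vector(3, 3, 3)
```

These are exactly the values summary.mean, summary.variance and summary.numNonzeros print for the example data.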

3. Example: summary statistics of Beijing annual rainfall

(1) Dataset: annual rainfall values. (The file beijing.txt also contains a line of years, listed in section III below; that is why years appear in the take(10) output in step (4).)

0.4806,0.4839,0.318,0.4107,0.4835,0.4445,0.3704,0.3389,0.3711,0.2669,0.7317,0.4309,0.7009,0.5725,0.8132,0.5067,0.5415,0.7479,0.6973,0.4422,0.6733,0.6839,0.6653,0.721,0.4888,0.4899,0.5444,0.3932,0.3807,0.7184,0.6648,0.779,0.684,0.3928,0.4747,0.6982,0.3742,0.5112,0.597,0.9132,0.3867,0.5934,0.5279,0.2618,0.8177,0.7756,0.3669,0.5998,0.5271,1.406,0.6919,0.4868,1.1157,0.9332,0.9614,0.6577,0.5573,0.4816,0.9109,0.921

(2) Load the dataset

val txt = sc.textFile("file:///opt/datas/beijing.txt")

(3) Import the required packages

import org.apache.spark.mllib.{stat,linalg}
import org.apache.spark.mllib.linalg.Vectors

(4) Parse the dataset

scala> val data=txt.flatMap(_.split(",")).map(value=>Vectors.dense(value.toDouble))
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[5] at map at <console>:29
Check the result:
scala> data.take(10)
res9: Array[org.apache.spark.mllib.linalg.Vector] = Array([2009.0], [2007.0], [2006.0], [2005.0], [2004.0], [2003.0], [2002.0], [2001.0], [2000.0], [1999.0])

(5) Summary statistics: column statistics

scala> stat.Statistics.colStats(data)
res4: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@14cb5250

(6) Inspect the summary

Pressing Tab after res4. lists the available statistics:

scala> res4.
count   max   mean   min   normL1   normL2   numNonzeros   variance

III. Correlation

1. Purpose: measure the degree of linear association between two variables, using the Pearson correlation coefficient.

2. Beijing annual rainfall dataset: correlation between year and rainfall

2009,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998,1997,1996,1995,1994,1993,1992,1991,1990,1989,1988,1987,1986,1985,1984,1983,1982,1981,1980,1979,1978,1977,1976,1975,1974,1973,1972,1971,1970,1969,1968,1967,1966,1965,1964,1963,1962,1961,1960,1959,1958,1957,1956,1955,1954,1953,1952,1951,1950,1949
0.4806,0.4839,0.318,0.4107,0.4835,0.4445,0.3704,0.3389,0.3711,0.2669,0.7317,0.4309,0.7009,0.5725,0.8132,0.5067,0.5415,0.7479,0.6973,0.4422,0.6733,0.6839,0.6653,0.721,0.4888,0.4899,0.5444,0.3932,0.3807,0.7184,0.6648,0.779,0.684,0.3928,0.4747,0.6982,0.3742,0.5112,0.597,0.9132,0.3867,0.5934,0.5279,0.2618,0.8177,0.7756,0.3669,0.5998,0.5271,1.406,0.6919,0.4868,1.1157,0.9332,0.9614,0.6577,0.5573,0.4816,0.9109,0.921

3. Scala implementation

val txt = sc.textFile("file:///opt/datas/beijing.txt")
val data = txt.flatMap(_.split(",")).map(_.toDouble)
val years = data.filter(_>1000)    // years (all > 1000)
val values = data.filter(_<=1000)  // rainfall values (all < 2)
scala> stat.Statistics.corr(years,values)
Result:
res6: Double = -0.4385405496488065

4. Interpretation

res6 is -0.4385405496488065, a moderate negative correlation: rainfall tends to decrease as the year increases (equivalently, earlier years tend to have more rainfall). The result can be cross-checked in Excel.
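For intuition, the Pearson coefficient that Statistics.corr computes by default can be sketched in plain Scala. This is a minimal formula-level sketch (covariance over the product of standard deviations), not MLlib's distributed implementation:

```scala
// Minimal plain-Scala Pearson correlation: cov(x, y) / (sd(x) * sd(y)).
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.length > 1, "need two equal-length series")
  val mx = x.sum / x.length
  val my = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// A perfectly anti-correlated pair gives -1.
println(pearson(Seq(1.0, 2.0, 3.0), Seq(6.0, 4.0, 2.0))) // ≈ -1.0
```

Applying the same formula to the collected years and values arrays should reproduce the -0.4385... that Spark reports.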

IV. Hypothesis Testing

1. Concept: a statistical method for drawing inferences about a population from a sample. The basic approach is to state a hypothesis (the null hypothesis), compute a test statistic, and decide from the result whether to reject the hypothesis. Commonly used tests include the chi-squared test and the t-test.

2. Spark implements Pearson's chi-squared test, which supports both goodness-of-fit tests and tests of independence.

Goodness of fit: checks whether the observed frequency distribution matches a theoretical one.
Independence: checks whether two sampled variables are independent of each other.
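The goodness-of-fit variant can be sketched in a few lines of plain Scala. The statistic is the sum over categories of (observed - expected)^2 / expected; the counts below are made up purely for illustration:

```scala
// Pearson goodness-of-fit statistic: chi2 = sum((O - E)^2 / E).
// The observed/expected counts here are illustrative, not from the article's data.
def chiSqGoodnessOfFit(observed: Seq[Double], expected: Seq[Double]): Double =
  observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

val observed = Seq(50.0, 30.0, 20.0)
val expected = Seq(40.0, 40.0, 20.0)
println(chiSqGoodnessOfFit(observed, expected)) // 5.0, with df = 3 - 1 = 2
```

Spark exposes this form as Statistics.chiSqTest(observedVector) or chiSqTest(observed, expected) on mllib Vectors.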

3. Is handedness related to gender?

             Male  Female
Right-handed  127     147
Left-handed    19      10

4. Implementation

import org.apache.spark.mllib.{linalg,stat}
// Matrices.dense is column-major: Array(127,19,147,10) fills the Male column, then the Female column
val data = linalg.Matrices.dense(2,2,Array(127,19,147,10))
scala> stat.Statistics.chiSqTest(data)
Result:
res9: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 3.8587031204632654
pValue = 0.049488567227318536
Strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

5.结果分析

默认假设是二者无关的。如果pValue >0.05,则认为假设出现的概率比较大,可以接受;pValue <0.05,则反对假设检验。pValue = 0.049488567227318536<0.05,所以左右手概率和男女性别是有关系的。
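As a cross-check, the chi-squared statistic that Spark reported can be reproduced in plain Scala from the contingency table: compute each cell's expected count from the row and column totals, then sum (observed - expected)^2 / expected:

```scala
// Independence test on the handedness table: expected counts come from
// rowTotal * colTotal / grandTotal, then chi2 = sum((O - E)^2 / E).
val table = Array(
  Array(127.0, 147.0), // right-handed: male, female
  Array( 19.0,  10.0)  // left-handed:  male, female
)

val rowTotals = table.map(_.sum)
val colTotals = table.transpose.map(_.sum)
val total     = rowTotals.sum

val chi2 = (for {
  i <- table.indices
  j <- table(i).indices
  expected = rowTotals(i) * colTotals(j) / total
} yield {
  val d = table(i)(j) - expected
  d * d / expected
}).sum

val df = (table.length - 1) * (table(0).length - 1)
println(chi2) // ≈ 3.8587, matching the statistic in Spark's output
println(df)   // 1
```

The p-value would then come from the chi-squared distribution with 1 degree of freedom, which needs a stats library; Spark computes it internally.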
 

