2.1 Summary statistics
The Statistics class provides column summary statistics for RDD[Vector].
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise maximum, minimum, mean, variance, and number of nonzeros, as well as the L1 norm of each column.
Scala MultivariateStatisticalSummary API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
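For intuition, the per-column aggregates that colStats() reports can be sketched in plain Scala (a minimal local sketch with illustrative names, not Spark's distributed implementation):

```scala
// Column-wise mean and sample variance over rows of equal length,
// mirroring what MultivariateStatisticalSummary reports per column.
def columnStats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
  val n = rows.length
  val dim = rows.head.length
  val means = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
  val variances = Array.tabulate(dim) { j =>
    rows.map(r => math.pow(r(j) - means(j), 2)).sum / (n - 1)
  }
  (means, variances)
}

val (means, variances) = columnStats(Seq(
  Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0)
))
// means = [2.0, 20.0], variances = [1.0, 100.0]
```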
2.2 Correlations
Calculating the correlation between two series of data (vectors or matrices) is a common statistical task. spark.mllib provides pairwise correlation computation, implementing both Pearson's and Spearman's correlation. The output type depends on the input: correlating two RDD[Double]s yields a Double, while correlating an RDD[Vector] yields a correlation Matrix.
Scala Statistics API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.Statistics
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val sc: SparkContext = ...
val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
val data: RDD[Vector] = ... // note that each Vector is a row and not a column
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
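For intuition, the Pearson coefficient that corr computes for each pair of series can be sketched in plain Scala (an illustrative local helper, not Spark's distributed implementation):

```scala
// Pearson's correlation coefficient: covariance of the two series
// divided by the product of their standard deviations.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty)
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

println(pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))) // perfectly linear -> 1.0
```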
2.3 Stratified sampling
spark.mllib provides two stratified sampling methods on key-value RDDs: sampleByKey and sampleByKeyExact. In stratified sampling, the keys can be thought of as labels and the values as the attributes of interest. For example, the key could be man/woman or a document ID, and the corresponding value could be a person's age or the words in a document. The sampleByKey method draws an approximate sample: conceptually it flips a coin for each observation to decide whether to keep it, so it needs only one pass over the data, and the per-key sample sizes are only expected values. sampleByKeyExact requires significantly more resources than the per-stratum random sampling of sampleByKey, but guarantees the exact sample size for each key with 99.99% confidence; sampleByKeyExact is currently not supported in Python.
sampleByKeyExact() samples exactly ⌈f_k · n_k⌉ items for every key k in the key set K, where f_k is the desired fraction of values for key k and n_k is the number of key-value pairs with key k. Sampling without replacement (withReplacement = false, the default, so no drawn item can appear twice) requires one additional pass over the RDD to guarantee the sample size, while sampling with replacement requires two additional passes.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
val sc: SparkContext = ...
val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
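To make the coin-flip semantics concrete, here is a plain-Scala sketch (no Spark; function and data names are illustrative) of the per-key Bernoulli sampling that sampleByKey performs:

```scala
import scala.util.Random

// Illustrative per-key Bernoulli sampling: each (key, value) pair is kept
// independently with probability fractions(key), so per-key sample sizes
// are only approximately fractions(key) * count(key).
def sampleByKeyLocal[K, V](data: Seq[(K, V)],
                           fractions: Map[K, Double],
                           rng: Random): Seq[(K, V)] =
  data.filter { case (k, _) => rng.nextDouble() < fractions(k) }

val data = (1 to 1000).map(i => (if (i % 2 == 0) "even" else "odd", i))
val sample = sampleByKeyLocal(data, Map("even" -> 0.1, "odd" -> 0.5), new Random(42))
// Roughly 50 "even" and 250 "odd" pairs survive; exact counts vary with the seed.
```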
2.4 Hypothesis testing
Hypothesis testing is used in statistics to determine whether a result is statistically significant, i.e. how much confidence it carries. spark.mllib currently supports Pearson's chi-squared tests for goodness of fit and independence. The input data type determines which test is run: the goodness-of-fit test requires a Vector as input, while the independence test requires a Matrix.
spark.mllib also supports an RDD[LabeledPoint] as input, applying the chi-squared independence test for feature selection, testing each feature against the label.
Statistics provides methods to run Pearson's chi-squared tests. The following example demonstrates how to run the tests.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD
val sc: SparkContext = ...
val vec: Vector = ... // a vector composed of the frequencies of events
// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
// the test runs against a uniform distribution.
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
// test statistic, the method used, and the null hypothesis.
val mat: Matrix = ... // a contingency matrix
// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat)
println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
featureTestResults.zipWithIndex.foreach { case (result, i) =>
  println(s"Column ${i + 1}:\n$result")
} // summary of the test for each feature against the label
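For intuition, the goodness-of-fit statistic that chiSqTest computes against a uniform expectation can be sketched in plain Scala (an illustrative helper, not Spark's implementation, and without the p-value lookup):

```scala
// Chi-squared goodness-of-fit statistic against a uniform expected
// distribution: the sum over categories of (observed - expected)^2 / expected.
def chiSqUniform(observed: Seq[Double]): Double = {
  val expected = observed.sum / observed.length
  observed.map(o => (o - expected) * (o - expected) / expected).sum
}

println(chiSqUniform(Seq(25.0, 25.0, 25.0, 25.0))) // 0.0: perfectly uniform
println(chiSqUniform(Seq(40.0, 20.0, 20.0, 20.0))) // 12.0: evidence against uniformity
```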
Statistics also provides a 1-sample, 2-sided Kolmogorov-Smirnov (KS) test for equality of probability distributions. By providing the name and parameters of a theoretical distribution, or a function that computes the cumulative distribution, the user can test the null hypothesis that the sample is drawn from that distribution. In the special case where the user tests against the normal distribution but does not provide distribution parameters, the test defaults to the standard normal distribution.
Scala Statistics API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = ... // an RDD of sample data
// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis
// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
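The KS test statistic itself is the largest gap between the empirical CDF of the sample and the reference CDF; a plain-Scala sketch (illustrative names, not Spark's implementation, and without the p-value) looks like this:

```scala
// One-sample KS statistic: the maximum distance between the empirical CDF
// of the sorted sample and a reference CDF, checked just before and just
// after each sample point (where the empirical CDF jumps).
def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
  val sorted = sample.sorted
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    val f = cdf(x)
    math.max(f - i / n, (i + 1) / n - f)
  }.max
}

// Against the CDF of Uniform(0, 1):
val d = ksStatistic(Seq(0.1, 0.4, 0.5, 0.9), x => math.min(math.max(x, 0.0), 1.0))
```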
2.4.1 Streaming significance testing
spark.mllib provides online implementations of significance tests to support use cases such as A/B testing. These tests run on a Spark Streaming DStream[(Boolean, Double)], where the first element of each tuple indicates the control group (false) or treatment group (true), and the second element is the value of the observation.
Streaming significance testing supports these two parameters:
1 peacePeriod: the number of initial data points from the stream to ignore, used to mitigate novelty effects after startup.
2 windowSize: the number of past batches to perform hypothesis testing over. Setting it to 0 performs cumulative processing over all prior batches.
StreamingTest provides streaming hypothesis testing.
import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}

val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
  case Array(label, value) => BinarySample(label.toBoolean, value.toDouble)
})
val streamingTest = new StreamingTest()
  .setPeacePeriod(0)
  .setWindowSize(0)
  .setTestMethod("welch")
val out = streamingTest.registerStream(data)
out.print()
For the complete example code, see examples/src/main/scala/org/apache/spark/examples/mllib/StreamingTestExample.scala
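The "welch" method above refers to Welch's unequal-variance t-test. Its test statistic can be sketched in plain Scala (an illustrative helper under simplifying assumptions, not the StreamingTest internals, and without degrees of freedom or p-value):

```scala
// Welch's t statistic for two independent samples with unequal variances:
// t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2)
def welchT(a: Seq[Double], b: Seq[Double]): Double = {
  def meanVar(s: Seq[Double]): (Double, Double) = {
    val m = s.sum / s.length
    val v = s.map(x => (x - m) * (x - m)).sum / (s.length - 1) // sample variance
    (m, v)
  }
  val (ma, va) = meanVar(a)
  val (mb, vb) = meanVar(b)
  (ma - mb) / math.sqrt(va / a.length + vb / b.length)
}

val control = Seq(1.0, 2.0, 3.0)
val treatment = Seq(1.0, 2.0, 3.0)
println(welchT(treatment, control)) // identical samples -> 0.0
```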
2.5 Random data generation
Random data generation is useful for randomized algorithms, prototyping, and performance testing. spark.mllib generates random RDDs with i.i.d. values drawn from a given distribution: uniform, standard normal, or Poisson.
RandomRDDs provides factory methods to generate random double RDDs or random vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), and then maps it to N(1, 4) by shifting and scaling.
Scala RandomRDD API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._
val sc: SparkContext = ...
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
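The shift-and-scale step works because if X ~ N(0, 1), then a + bX ~ N(a, b²), so x => 1.0 + 2.0 * x yields N(1, 4). A quick plain-Scala check (illustrative, using scala.util.Random instead of Spark):

```scala
import scala.util.Random

// Draw standard-normal samples, apply x => 1.0 + 2.0 * x, and check that
// the sample mean approaches 1 and the sample variance approaches 4.
val rng = new Random(0)
val xs = Seq.fill(100000)(rng.nextGaussian()).map(x => 1.0 + 2.0 * x)
val mean = xs.sum / xs.length
val variance = xs.map(v => (v - mean) * (v - mean)).sum / xs.length
// mean ~ 1.0, variance ~ 4.0 (up to sampling error)
```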
2.6 Kernel density estimation
Kernel density estimation is useful for visualizing empirical probability distributions without assuming that the observed samples come from any particular distribution. It computes an estimate of the probability density function of a random variable, evaluated at a given set of points. It does so by expressing the PDF of the empirical distribution at a particular point as the mean of the PDFs of normal distributions centered at each of the samples.
KernelDensity provides methods to compute kernel density estimates from an RDD of samples, as the following example shows:
Scala KernelDensity API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
val data: RDD[Double] = ... // an RDD of sample data
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
.setSample(data)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
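The estimate at a point x is the average of Gaussian PDFs centered at each sample, with the bandwidth as their standard deviation. A plain-Scala sketch (illustrative names, not KernelDensity's internals):

```scala
// Gaussian kernel density estimate: the density at x is the mean of
// normal PDFs N(sample_i, bandwidth^2) evaluated at x.
def gaussianKde(samples: Seq[Double], bandwidth: Double)(x: Double): Double = {
  val norm = 1.0 / (bandwidth * math.sqrt(2 * math.Pi))
  samples.map { s =>
    val z = (x - s) / bandwidth
    norm * math.exp(-0.5 * z * z)
  }.sum / samples.length
}

val density = gaussianKde(Seq(-1.0, 0.0, 1.0), bandwidth = 1.0) _
// The estimate is symmetric around 0 here and peaks near the samples.
```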