Chi-squared tests: a well-written article, reposted

Latest recommended article published 2024-06-04 21:48:00

alexrad


Spark 1.1.0 Basic Statistics (Part 2)

Source: http://blog.selfup.cn/1157.html

Hypothesis testing

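Spark currently supports Pearson's chi-squared tests for hypothesis testing, covering both the goodness-of-fit test and the independence test.

Pearson's chi-squared test

Pearson's chi-squared test is the best known of the chi-squared tests; unless stated otherwise, "chi-squared test" means Pearson's test. It can be used for goodness-of-fit testing and for independence testing.

Goodness-of-fit test

The goodness-of-fit test checks whether the frequency distribution of a set of observations differs from a theoretical distribution. Its null hypothesis $H_0$ is that the observed event counts in the sample follow a particular theoretical distribution. That theoretical distribution is usually the uniform distribution, which is also what Spark assumes by default.

Independence test

The independence test checks whether paired observations drawn from two variables are independent of each other. Its null hypothesis is that the two variables are statistically independent.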

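The three steps of the test

1. Compute the chi-squared statistic $\chi^2$: for each category, square the difference between the observed and the expected count, divide by the expected count, then sum the results.
2. Compute the degrees of freedom $df$ of the $\chi^2$ statistic.
3. For the significance level chosen by the researcher, look up the critical value of the chi-squared distribution with $df$ degrees of freedom, and compare it with the $\chi^2$ statistic from step 1 to decide whether the null hypothesis can be rejected.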

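Goodness-of-fit example

Scenario

Label the five points of a five-pointed star 1, 2, 3, 4, 5. Spin the star a number of times and record how often each point ends up pointing at you.

The first star gives (1, 7, 2, 3, 18) and the second star gives (7, 8, 6, 7, 9). The null hypothesis is that every point of the star is equally likely to end up pointing at you.

Code example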

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.ChiSqTestResult;

public static void main(String[] args) {
    // Observed counts for the five points of each star
    Vector vec1 = Vectors.dense(1.0, 7.0, 2.0, 3.0, 18.0);
    Vector vec2 = Vectors.dense(7.0, 8.0, 6.0, 7.0, 9.0);
    // With no expected vector supplied, the test compares against a uniform distribution
    ChiSqTestResult goodnessOfFitTestResult1 = Statistics.chiSqTest(vec1);
    ChiSqTestResult goodnessOfFitTestResult2 = Statistics.chiSqTest(vec2);
    System.out.println(goodnessOfFitTestResult1);
    System.out.println(goodnessOfFitTestResult2);
}

Output

Chi squared test summary:
method: pearson
degrees of freedom = 4
statistic = 31.41935483870968
pValue = 2.5138644141226737E-6
Very strong presumption against null hypothesis: observed follows the same distribution as expected..

Chi squared test summary:
method: pearson
degrees of freedom = 4
statistic = 0.7027027027027026
pValue = 0.9509952049458091
No presumption against null hypothesis: observed follows the same distribution as expected..

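Calculation

For the counts (1, 7, 2, 3, 18) the total is 31, so under the uniform hypothesis every category has the expected count $E_i = 31/5 = 6.2$. With $n = 5$ categories:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \frac{(1-6.2)^2 + (7-6.2)^2 + (2-6.2)^2 + (3-6.2)^2 + (18-6.2)^2}{6.2} = 31.419354839$$

$$df = n - 1 = 5 - 1 = 4$$

The p-value can be read from a chi-squared table: $p \ll 0.001$.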
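The hand calculation can be cross-checked without Spark. The following is a minimal sketch in plain Java, not the MLlib implementation: the class and method names are made up for illustration, and the p-value helper assumes exactly 4 degrees of freedom, where the upper tail of the chi-squared distribution has the closed form $e^{-x/2}(1 + x/2)$.

```java
public class GoodnessOfFitByHand {

    /** Pearson statistic against a uniform expectation: sum((O - E)^2 / E). */
    public static double statistic(double[] observed) {
        double total = 0.0;
        for (double o : observed) {
            total += o;
        }
        double expected = total / observed.length; // same E_i for every category
        double chi2 = 0.0;
        for (double o : observed) {
            double diff = o - expected;
            chi2 += diff * diff / expected;
        }
        return chi2;
    }

    /** Upper-tail probability of a chi-squared variable; valid for df = 4 only. */
    public static double pValueDf4(double chi2) {
        double half = chi2 / 2.0;
        return Math.exp(-half) * (1.0 + half);
    }

    public static void main(String[] args) {
        double[][] samples = { {1, 7, 2, 3, 18}, {7, 8, 6, 7, 9} };
        for (double[] observed : samples) {
            double chi2 = statistic(observed);
            int df = observed.length - 1;
            System.out.println("statistic = " + chi2 + ", df = " + df
                    + ", pValue = " + pValueDf4(chi2));
        }
    }
}
```

For other degrees of freedom the tail probability needs the regularized incomplete gamma function, which is why a table or a statistics library is normally used instead.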

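Interpretation

There are 5 categories, so there are 4 degrees of freedom. Under a uniform distribution each category should occur with probability 1/5 = 0.2. The observed counts 1, 7, 2, 3, 18 clearly do not fit that. The p-value (2.5138644141226737E-6) is the probability of seeing a deviation at least this large in a single experiment if the null hypothesis were true. That probability is so small that, having observed such data, we have ample reason to consider the hypothesis false. We therefore reject the null hypothesis (observed follows the same distribution as expected): the data are not uniformly distributed, perhaps because the star's mass is unevenly distributed. The other data set (7, 8, 6, 7, 9) is much closer to equal probabilities, so its null hypothesis cannot be rejected.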

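Independence test example

Scenario

Sample 100 people and record each person's sex and whether they are left-handed, as in the table below:

Handedness	Male	Female	Total
Right	43	44	87
Left	9	4	13
Total	52	48	100

The null hypothesis is that sex and handedness are independent.

Code example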

import org.apache.spark.mllib.linalg.DenseMatrix;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.ChiSqTestResult;

public static void main(String[] args) {
    // 2x2 contingency table in column-major order:
    // column 0 = male (right 43, left 9), column 1 = female (right 44, left 4)
    Matrix matrix = new DenseMatrix(2, 2, new double[]{43, 9, 44, 4});
    ChiSqTestResult independenceTestResult = Statistics.chiSqTest(matrix);
    System.out.println(independenceTestResult);
}

Output

Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 1.7774150400145103
pValue = 0.18246706526055168
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

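Calculation

The expected count for each cell is the product of its row total and column total divided by the sample size $N$:

$$E_{i,j} = \frac{\left(\sum_{n=1}^{c} O_{i,n}\right)\left(\sum_{m=1}^{r} O_{m,j}\right)}{N}$$

where $r$ is the number of rows and $c$ is the number of columns. For example, $E_{1,1} = (43+44) \times (43+9) / 100 = 45.24$.

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = \frac{(43-45.24)^2}{45.24} + \frac{(44-41.76)^2}{41.76} + \frac{(9-6.76)^2}{6.76} + \frac{(4-6.24)^2}{6.24} = 1.7774150400145103$$

$$df = (r-1)(c-1) = (2-1) \times (2-1) = 1$$

The p-value can be read from a chi-squared table: $0.1 < p < 0.2$.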
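The expected counts and the statistic can likewise be reproduced in a few lines of plain Java. This is an illustrative sketch rather than Spark's code: the class and method names are hypothetical, and the decision uses the standard tabulated critical value 3.841 (significance level 0.05, 1 degree of freedom) instead of computing a p-value.

```java
public class IndependenceByHand {

    /** Expected counts under independence: rowTotal * columnTotal / grandTotal. */
    public static double[][] expected(double[][] observed) {
        int r = observed.length;
        int c = observed[0].length;
        double[] rowTotals = new double[r];
        double[] colTotals = new double[c];
        double n = 0.0;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                rowTotals[i] += observed[i][j];
                colTotals[j] += observed[i][j];
                n += observed[i][j];
            }
        }
        double[][] e = new double[r][c];
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                e[i][j] = rowTotals[i] * colTotals[j] / n;
            }
        }
        return e;
    }

    /** Pearson statistic: sum over all cells of (O - E)^2 / E. */
    public static double statistic(double[][] observed) {
        double[][] e = expected(observed);
        double chi2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            for (int j = 0; j < observed[i].length; j++) {
                double diff = observed[i][j] - e[i][j];
                chi2 += diff * diff / e[i][j];
            }
        }
        return chi2;
    }

    public static void main(String[] args) {
        // Rows: right-handed, left-handed. Columns: male, female.
        double[][] observed = { {43, 44}, {9, 4} };
        double chi2 = statistic(observed);
        int df = (observed.length - 1) * (observed[0].length - 1);
        double critical = 3.841; // chi-squared table, alpha = 0.05, df = 1
        System.out.println("statistic = " + chi2 + ", df = " + df);
        System.out.println(chi2 > critical ? "reject H0" : "cannot reject H0");
    }
}
```

Note that the array here is written row by row, whereas the `DenseMatrix` passed to Spark takes its values in column-major order.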

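Interpretation

Because the p-value is greater than 0.05, the null hypothesis cannot be rejected: the data give no grounds to reject the assumption that sex and handedness are independent.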

Random data generation

Random data generation produces RDDs of random numbers. RandomRDDs currently supports three distributions: normal, Poisson, and uniform.

Example program

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.random.RandomRDDs;

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("Statistics").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // 100 random numbers from the standard normal N(0,1), spread evenly over 2 partitions
    JavaDoubleRDD n = RandomRDDs.normalJavaRDD(sc, 100L, 2);
    // 100 random numbers from a Poisson distribution with mean and variance 10
    JavaDoubleRDD p = RandomRDDs.poissonJavaRDD(sc, 10, 100L);
    // 100 random numbers from the continuous uniform distribution on 0 to 1
    JavaDoubleRDD u = RandomRDDs.uniformJavaRDD(sc, 100);
    // Turn the standard normal N(0,1) into N(1,4): shift by 1, scale by the standard deviation 2
    JavaDoubleRDD v = n.mapToDouble(x -> 1.0 + 2.0 * x);
    for (Double d : v.collect()) {
        System.out.println(d);
    }
    for (Double d : p.collect()) {
        System.out.println(d);
    }
    // Turn the uniform distribution on 0 to 1 into a uniform distribution on 2 to 5
    JavaDoubleRDD g = u.mapToDouble(x -> 2.0 + 3.0 * x);
    for (Double d : g.collect()) {
        System.out.println(d);
    }
}
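
The two `mapToDouble` calls rely on the fact that a linear map `a + b * x` shifts the mean by `a` and scales the standard deviation by `b`. The sketch below illustrates this without Spark, using `java.util.Random` as a stand-in for `RandomRDDs`; the class name, helper names, seed, and sample size are all assumptions made for the illustration.

```java
import java.util.Random;

public class RescaleSamples {

    /** Draws n values from N(mean, stdDev^2) by rescaling standard normal draws. */
    public static double[] normalSamples(int n, double mean, double stdDev, long seed) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = mean + stdDev * rng.nextGaussian();
        }
        return out;
    }

    /** Draws n values from the uniform distribution on low..high by rescaling draws from 0..1. */
    public static double[] uniformSamples(int n, double low, double high, long seed) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = low + (high - low) * rng.nextDouble();
        }
        return out;
    }

    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Population variance: the mean squared deviation from the sample mean. */
    public static double variance(double[] xs) {
        double m = mean(xs);
        double sum = 0.0;
        for (double x : xs) {
            sum += (x - m) * (x - m);
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double[] v = normalSamples(100000, 1.0, 2.0, 42L);  // targets N(1,4)
        double[] g = uniformSamples(100000, 2.0, 5.0, 42L); // targets the uniform distribution on 2..5
        System.out.println("normal:  mean = " + mean(v) + ", variance = " + variance(v));
        System.out.println("uniform: mean = " + mean(g) + ", variance = " + variance(g));
    }
}
```

With a large sample the printed values should sit close to mean 1 and variance 4 for the normal case, and close to mean 3.5 and variance 0.75 for the uniform case.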

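Uniform distribution

Probability density function

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$

Cumulative distribution function

$$F(x) = \begin{cases} 0, & x < a \\ \dfrac{x-a}{b-a}, & a \le x < b \\ 1, & x \ge b \end{cases}$$

Expected value

$$E(x) = \frac{a+b}{2}$$

Variance

$$Var(x) = \frac{(b-a)^2}{12}$$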

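Normal distribution

Probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Cumulative distribution function

$$F(x) = \frac{1}{2}\left(1 + \operatorname{erf}\frac{x-\mu}{\sigma\sqrt{2}}\right)$$

Expected value

$$E(x) = \mu$$

Variance

$$Var(x) = \sigma^2$$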

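Poisson distribution

Probability mass function

$$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$$

Cumulative distribution function

$$F(k) = P(X \le k) = \frac{\Gamma(k+1, \lambda)}{k!}$$

where $\Gamma(\cdot,\cdot)$ is the upper incomplete gamma function and $k$ is a non-negative integer.

Expected value

$$E(x) = \lambda$$

Variance

$$Var(x) = \lambda$$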
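
The Poisson mean and variance can be checked numerically from the probability mass function $P(X=k) = e^{-\lambda}\lambda^k / k!$. The following plain Java sketch does so for $\lambda = 10$, the value used in the example program; the class name and the truncation point of the sum are assumptions chosen for illustration.

```java
public class PoissonMoments {

    /** P(X = k) = e^{-lambda} * lambda^k / k!, built up term by term to avoid large factorials. */
    public static double pmf(int k, double lambda) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) {
            p *= lambda / i;
        }
        return p;
    }

    /** Returns {total probability, mean, variance} summed over k = 0..maxK. */
    public static double[] moments(double lambda, int maxK) {
        double total = 0.0;
        double mean = 0.0;
        double secondMoment = 0.0;
        for (int k = 0; k <= maxK; k++) {
            double p = pmf(k, lambda);
            total += p;
            mean += k * p;
            secondMoment += (double) k * k * p;
        }
        return new double[] { total, mean, secondMoment - mean * mean };
    }

    public static void main(String[] args) {
        // maxK = 200 is far enough into the tail that the truncated mass is negligible for lambda = 10
        double[] m = moments(10.0, 200);
        System.out.println("total = " + m[0] + ", mean = " + m[1] + ", variance = " + m[2]);
    }
}
```

Both the mean and the variance come out at approximately $\lambda$, matching the two formulas above.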
