Comparison of two proportions: parametric (Z-test) and non-parametric (chi-squared test) methods

timothyzh

Source: http://www.r-bloggers.com/comparison-of-two-proportions-parametric-z-test-and-non-parametric-chi-squared-methods/

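Consider the following example. The owner of a betting company wants to check whether a customer is cheating. To do so, he compares the number of wins of that player with the number of wins of one of his employees. Over one month, the player placed 74 bets and won 30; over the same period, the employee placed 103 bets and won 65. Is the customer a cheat?

A problem like this can be solved in two different ways: with a parametric method or with a non-parametric method.

* Parametric solution: Z-test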

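You can use the Z-test if two assumptions hold: the probability of success is close to 0.5, and the number of bets is large (under these two conditions the binomial distribution is well approximated by a Gaussian distribution). Suppose these conditions hold. Base R has no function that returns the Z statistic directly, so we recall the mathematical formula and write the corresponding function: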

$$Z=\frac{\frac{x_1}{n_1}-\frac{x_2}{n_2}}{\sqrt{\widehat{p}(1-\widehat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}},\qquad \widehat{p}=\frac{x_1+x_2}{n_1+n_2}$$

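# Two-proportion Z statistic with a pooled estimate of the success probability.
# x1, x2: number of successes; n1, n2: number of bets.
z.prop = function(x1,x2,n1,n2){
  numerator = (x1/n1) - (x2/n2)            # difference of the observed proportions
  p.common = (x1+x2) / (n1+n2)             # pooled proportion under H0
  denominator = sqrt(p.common * (1-p.common) * (1/n1 + 1/n2))  # standard error
  z.prop.ris = numerator / denominator
  return(z.prop.ris)
}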

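The z.prop function computes the value of Z. Its arguments are the numbers of successes (x1 and x2) and the total numbers of bets (n1 and n2). We apply it to the data of the problem: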

z.prop(30, 65, 74, 103)
[1] -2.969695
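
As a check on the call above, the same value can be reproduced by hand from the formula (intermediate values rounded to four decimals):

$$\widehat{p}=\frac{30+65}{74+103}=\frac{95}{177}\approx 0.5367$$

$$Z\approx\frac{0.4054-0.6311}{\sqrt{0.5367\times 0.4633\times\left(\frac{1}{74}+\frac{1}{103}\right)}}\approx\frac{-0.2257}{0.0760}\approx -2.97$$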

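Since |Z| = 2.97 exceeds the tabulated critical value (1.96 for a two-sided test at the 5% level), we reject the hypothesis that the two success probabilities are equal. The owner would read this as evidence of cheating; note, however, that Z is negative because the player's observed win rate (30/74 ≈ 0.41) is lower than the employee's (65/103 ≈ 0.63), not higher.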
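
Instead of a printed table, the critical value and a p-value can be obtained from the normal distribution functions in R. The following is a minimal sketch, assuming a two-sided test at the 5% level; the variable names (z.crit, p.value, reject) are illustrative and not part of the original article.

```r
z <- -2.969695                      # Z statistic returned by z.prop above

alpha   <- 0.05                     # assumed significance level
z.crit  <- qnorm(1 - alpha / 2)     # two-sided critical value, about 1.96
p.value <- 2 * pnorm(-abs(z))       # two-sided p-value under the normal approximation
reject  <- abs(z) > z.crit          # TRUE means H0 (equal proportions) is rejected

cat("critical value:", z.crit, " p-value:", p.value, " reject H0:", reject, "\n")
```

Comparing |Z| with z.crit and comparing p.value with alpha are two ways of stating the same decision.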

=====================================================

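* Non-parametric solution: Chi-squared test

Suppose now that we cannot make any assumption about the data, so that the binomial distribution cannot be approximated by a Gaussian distribution. We then solve the problem with the chi-squared test applied to a 2x2 contingency table. R provides the function prop.test for this: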

prop.test(x = c(30, 65), n = c(74, 103), correct = FALSE)

    2-sample test for equality of proportions without continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 8.8191, df = 1, p-value = 0.002981
alternative hypothesis: two.sided
95 percent confidence interval:
  -0.37125315 -0.08007196
sample estimates:
prop 1     prop 2
0.4054054  0.6310680

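The prop.test function computes the chi-squared value. Its arguments are the numbers of successes (in the vector x) and the total numbers of attempts (in the vector n). The vectors x and n can also be defined beforehand and then passed to the function: prop.test(x,n,correct=FALSE).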
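
To make the 2x2 contingency table explicit, the sketch below builds it from the same counts and passes it to chisq.test, which should reproduce the X-squared value reported by prop.test. It also recomputes the pooled Z statistic to show that, without continuity correction, X-squared equals Z squared. The object names (tab, res, x2) are illustrative.

```r
# Data from the example: wins and total bets for the player and the employee
x <- c(player = 30, employee = 65)
n <- c(player = 74, employee = 103)

# 2x2 contingency table: rows are outcomes, columns are the two people
tab <- rbind(won = x, lost = n - x)

# Pearson chi-squared test without continuity correction
res <- chisq.test(tab, correct = FALSE)
x2  <- unname(res$statistic)

# Pooled two-proportion Z statistic, same formula as z.prop
p.common <- sum(x) / sum(n)
z <- unname((x[1] / n[1] - x[2] / n[2]) /
  sqrt(p.common * (1 - p.common) * sum(1 / n)))

print(tab)
cat("X-squared =", x2, " Z^2 =", z^2, "\n")
```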

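For small samples (small values of n), you should set correct=TRUE, which is also the default of prop.test, so that the chi-squared statistic is computed with Yates' continuity correction: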

prop.test(x = c(30, 65), n = c(74, 103), correct=TRUE)

    2-sample test for equality of proportions with continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 7.9349, df = 1, p-value = 0.004849
alternative hypothesis: two.sided
95 percent confidence interval:
  -0.38286428 -0.06846083
sample estimates:
prop 1     prop 2
0.4054054  0.6310680
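
The effect of Yates' correction can be seen by computing both statistics by hand from observed and expected counts. This is a sketch using the textbook formulas (subtracting 0.5 from each absolute deviation before squaring); the names observed, expected, x2.plain and x2.yates are illustrative.

```r
# Observed 2x2 table: columns are player and employee
observed <- rbind(won  = c(30, 65),
                  lost = c(74 - 30, 103 - 65))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic without and with Yates' continuity correction
x2.plain <- sum((observed - expected)^2 / expected)
x2.yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

cat("uncorrected:", x2.plain, " Yates-corrected:", x2.yates, "\n")
```

The correction shrinks every deviation by 0.5, so the corrected statistic is always the smaller of the two and its p-value the larger.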

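In both cases the p-value is below 0.05, so we reject the hypothesis that the two proportions are equal, which is the basis for treating the player as a cheat. To confirm, we compare the computed chi-squared value with the tabulated chi-squared value, which in R is computed as: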