

by Michelle Jones


Pearson, p-values, and plots

What is a p-value?

If your experiment needs statistics, you ought to have done a better experiment. — Ernest Rutherford

The use of p-values in research is very common. Peer reviewed journal articles are chock-full of them. It seems like every scientist and their salivating dog uses them.

Okay, without looking up the answer:

what is the definition of a p-value?

Be honest, I won’t tell anyone your answer. I promise. It’s our secret. We’ll keep coming back to this question over my next few posts.

We’re starting our journey with Pearson.

Karl Pearson

In 1900, Karl Pearson published his paper that discussed the concept of p-values. Most of the paper is worked examples of the form that we know as the chi-squared test. Thus, the focus of the paper is on frequencies of counts, and the degree to which observed counts differ (in Pearson’s term, deviate) from expected counts. In his statistical terms, each deviation (n) is an error.

His definition of P was as follows:

the probability of a complex system of n errors occurring with a frequency as great or greater than that of the observed system (p.158)

In other words, given the expected counts, how probable are our observed counts and counts that are even more different?

Two main types of chi-squared tests

There is one key point to note about Pearson's chi-squared test. Yes, his method determines whether observed counts differ from the expected counts. However, he directly compared observed counts to their expected counts based on a pre-determined, underlying distribution. This is a goodness-of-fit test. The sorts of questions he was asking were:

  • does this set of results of dice rolls follow a binomial distribution?
  • does this set of petal counts from 222 buttercups fit a specific skew curve?

This is completely different to how we normally use a chi-squared test, where we compare two groups, rather than one group to a predefined distribution:

  • do cases and controls (e.g. smokers/non-smokers) significantly differ on disease incidence (e.g. lung cancer)?
  • do men and women vote for the same political candidates?

In our normal case, the expected counts (and therefore distribution) are directly derived from the contingency table margins. In this second method, we are performing the chi-square test of independence. (There’s another type of chi-squared test that basically uses the same analysis as this one, but that’s a technicality we will be ignoring.)

Back to the chi squared goodness-of-fit test.

Dice rolling

Let’s work through Pearson’s first example using R. Base R is sufficient for this. We’ll use R to work out the chi-squared value (χ²) for ourselves. Finally, we’ll sum the theoretical and observed counts as a check that the numbers are what we expect from the table in the paper.

Description of experiment

The data arise from an experiment where twelve dice were rolled 26,306 times. In each roll, the number of dice with a 5 or 6 showing was counted. (I imagine it was some poor graduate student who drew the short straw.)

With 12 dice, the range of possible numbers on each throw is from zero (no die had a 5 or 6 showing) to twelve (all dice had a 5 or 6 showing).


Binomial distribution

The values for rolled dice follow a binomial distribution. We use this distribution for count data. For those interested in the relationship between the binomial distribution and the chi-squared test, this is an accessible explanation.

The expected values for each possible value in the range 0 through 12 can be calculated using the dbinom function in R. For example, the probability of obtaining zero dice showing a 5 or 6 when 12 dice are rolled is

dbinom(0,12,1/3)
[1] 0.007707347

We multiply the probability by the total number of trials to get the expected count for no 5s and no 6s across the 26,306 trials.

dbinom(0,12,1/3)*26306
[1] 202.7495

We round this up to 203. We can repeat this process for the values 1 through 12. However, when we reach 12 we hit the following problem.

dbinom(12,12,1/3)
[1] 1.881676e-06

dbinom(12,12,1/3)*26306
[1] 0.04949938

Our probability and associated count are extremely close to 0. More on this below.

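Rather than calling dbinom thirteen times, the whole set of expected counts can be produced in one vectorised call. This is a sketch added for checking purposes, not something from Pearson's paper:

```r
# Expected counts for 0 through 12 dice showing a 5 or 6, across 26,306 throws.
# The success probability is 1/3: two of the six faces (5 and 6) count.
expected <- dbinom(0:12, size = 12, prob = 1/3) * 26306
round(expected)
```

Rounding these values agrees with the theoretical column we enter below, although an entry or two can differ by 1, since Pearson's own rounding was not always to the nearest integer.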
Create the data.frame

We’ll construct the data.frame as we read in the values.

PearsonChiSquare <- data.frame(Face5or6=c(0,1,2,3,4,5,6,7,8,9,10,11,12),
                               Theoretical=c(203,1217,3345,5576,6273,5018,2927,1254,392,87,13,1,0),
                               Observed=c(185,1149,3265,5475,6114,5194,3067,1331,403,105,14,4,0))

Back to the problem with our data, which only shows up when we perform the chi-squared test. Notice the 0 theoretical and observed counts for all 12 dice showing 5 or 6? The chi-squared test will not give us a result when there is division by 0.

What can we do? The easiest method to deal with this problem is to remove the last row of the data.frame. In our case, this is exactly the same as combining the 11 and 12 categories (the normal method when cell counts are very small). Our trial count is still the same. Our underlying probabilities sum almost to 1. We will do a little correction for that.

We drop the row where the theoretical value is 0. I could have addressed this problem by simply not reading in the 0 values. However, dropping a data.frame row based on a value is a common task in R. This code can be generalised to any instance where you need to drop one or more rows on the basis of a specific value of a variable.

# as the group of 12 has 0 theoretical and 0 observed counts, drop this observation
PearsonChiSquare <- PearsonChiSquare[!(PearsonChiSquare$Theoretical==0),]

Now we construct our column of theoretical probabilities based on the remaining data. Remember, the probability of having twelve dice with 5s or 6s was very small, but was not zero. Thus, the probabilities calculated for each row of our remaining data will slightly differ from the probabilities that Pearson was using.

PearsonChiSquare$probs <- with(PearsonChiSquare, Theoretical/sum(Theoretical))

Why do we need to create a column of probabilities? Because we are doing a goodness-of-fit test. We are comparing each observed count to the probability of that count, assuming a binomial distribution.


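We can also see how small the correction is. The raw binomial probabilities for 0 through 11 fall just short of summing to 1, and rescaling by their sum fixes that; the probs column above does the same rescaling, only from the rounded theoretical counts. A quick sketch:

```r
raw <- dbinom(0:11, size = 12, prob = 1/3)
sum(raw)             # just under 1; the missing mass is the all-5s-or-6s case
1 - sum(raw)         # equals dbinom(12, 12, 1/3), roughly 1.9e-06
sum(raw / sum(raw))  # after rescaling, the probabilities sum to 1
```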
Chi-squared test

Now we are ready to do our chi-squared test. The output is shown below the command.


chisq.test(x=PearsonChiSquare$Observed, p=PearsonChiSquare$probs)

        Chi-squared test for given probabilities

data:  PearsonChiSquare$Observed
X-squared = 43.876, df = 11, p-value = 7.641e-06

The function automatically calculates the number of degrees of freedom (df) from our data. The calculation for the number of df in the chi-squared test is very easy: it is (rows - 1) x (columns - 1).


We have 12 categories (range 0 to 11 = 12 categories). We have two columns (observed counts, binomial probabilities). Our df are therefore (12–1) x (2–1) = 11 x 1 = 11.


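As a sanity check on the chisq.test output, the statistic can also be computed by hand from its definition, the sum of (observed − expected)² / expected. This sketch re-enters the two count columns so it runs on its own:

```r
# Hand calculation of the chi-squared statistic, sum((O - E)^2 / E),
# with expected counts rebuilt from the theoretical proportions.
Theoretical <- c(203,1217,3345,5576,6273,5018,2927,1254,392,87,13,1)
Observed    <- c(185,1149,3265,5475,6114,5194,3067,1331,403,105,14,4)
Expected    <- (Theoretical / sum(Theoretical)) * sum(Observed)
sum((Observed - Expected)^2 / Expected)   # about 43.876, matching chisq.test
```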
How did we do for the χ² value, especially as we removed one row of data?

Pearson calculated χ2=43.87241. We got really close! He calculated P=0.000016. He argued that his result gave 62,499 to 1 against the observed values arising from a binomial distribution. Putting our result into decimal notation, we got P=0.0000076.


Our p-value is very small. We reject the null hypothesis that our observed results arise from a binomial distribution.


What is causing the difference?

The reason for this is the positive bias towards rolling a 5 or 6 (except in the extreme case of all 5s or 6s). We can duplicate this bias by calculating the deviations of the observed values from the theoretical values, duplicating more of Pearson’s work.


The code below performs this calculation and then writes the values to the console.


PearsonChiSquare$Deviation <- with(PearsonChiSquare,Observed-Theoretical)
PearsonChiSquare[, c("Face5or6","Deviation")]
   Face5or6 Deviation
1         0       -18
2         1       -68
3         2       -80
4         3      -101
5         4      -159
6         5       176
7         6       140
8         7        77
9         8        11
10        9        18
11       10         1
12       11         3

As you can see, the observed counts for trials where four or fewer dice showed a 5 or 6 are lower than expected (these are all negative deviations). Conversely, observed counts for trials where five or more dice showed a 5 or 6 are higher than expected (these are all positive deviations).

Interpreting the p-value

Paraphrasing, Pearson said that the p-value is the probability of results that are as improbable or more improbable than the one encountered. Our p-value is the probability of getting our χ² value or any larger χ² value.

Why isn’t the p-value related only to the counts that we fitted? The p-value arises from a cumulative distribution function. The probability of obtaining our exact results, or any specified counts, is very close to 0. For the chi-squared statistic, we are evaluating the integral where our χ² value is the lower bound. For the very interested, here is a description of the maths.

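That upper-tail integral is exactly what pchisq computes when asked for the upper tail, so the reported p-value can be recovered in one line (a sketch using our χ² value of 43.876):

```r
# P(chi-squared with 11 df >= 43.876): the area under the curve to the
# right of our test statistic, i.e. the p-value reported by chisq.test.
pchisq(43.876, df = 11, lower.tail = FALSE)   # about 7.6e-06
```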
Visualizing chi-squared p-values in R

Our chi-squared distribution

We can visualise the chi-squared distribution in R. Let’s draw the probability distribution for the chi-squared distribution when there are 11 degrees of freedom. Remember, our test had 11 degrees of freedom. As you can see, I’m using my favourite package for this.


library("ggplot2")

ggplot(data.frame(x=c(0,50)), aes(x=x))+
        stat_function(fun=dchisq, args=list(df=11))+
        labs(title="Chi-squared probability distribution for 11 df",
             x="Chi-squared value",
             y="Probability")

This produces the following graph.


We read the probabilities associated with the χ² values from right to left. As you can see from the graph, the probability of getting any χ² value larger than 40 is extremely small. We know it rounds to 0.000 because the line looks flat at that point. Probabilities can get very small, but they can never reach zero.

Pearson’s chi-squared distribution

What if we had Pearson’s original 12 degrees of freedom, because we hadn’t dropped that group (and assume we fudged our way around the division-by-zero problem)? Let’s overlay the probability distribution for 12 degrees of freedom on the distribution we already plotted.

ggplot(data.frame(x=c(0,50)), aes(x=x))+
        stat_function(fun=dchisq, args=list(df=11), aes(colour="11 df"))+
        stat_function(fun=dchisq, args=list(df=12), aes(colour="12 df"))+
        scale_colour_manual(name="", values=c("black","blue"))+
        labs(title="Chi-squared probability distribution for 11 and 12 df",
             x="Chi-squared value",
             y="Probability")

Which produces the graph:


We can see that the difference between our χ2 value and Pearson’s χ2 value is negligible, considered against the relevant χ2 probability distribution.


In practical terms, the χ2 value has to be quite small in order to accept the null hypothesis. The null hypothesis, in this case, is that the observed data arise from a binomial distribution.


Showing the rejection area for specific p-values

We can show how the p-value changes as the χ2 value is decreased. This also shows how the χ2 acts as the lower bound on the rejection area. If we want to reject at P ≤ 0.025, we can show the rejection area on the graph.


RejectionArea   <- data.frame(x=seq(0,50,0.1))
RejectionArea$y <- dchisq(RejectionArea$x,11)

library(ggplot2)

ggplot(RejectionArea) +
  geom_path(aes(x,y)) +
  geom_ribbon(data=RejectionArea[RejectionArea$x>qchisq(0.025,11,lower.tail = FALSE),],
              aes(x, ymin=0, ymax=y), fill="red")+
  labs(title="Chi-squared probability distribution for 11 df showing rejection area\nfor p<=0.025",
       x="Chi-squared value",
       y="Probability")

We can show this for any p-value we like. Here is the rejection area when we set p ≤0.05.


How did we change the rejection area in the graph? We used qchisq(0.05,11,lower.tail=FALSE) instead of qchisq(0.025,11,lower.tail=FALSE). All the other code remained exactly the same.

Upcoming!

I will be writing separate posts on the Fisher and the Neyman-Pearson approaches to p-values. Both of these post-date Pearson. Again, I’ll use R to demonstrate the concepts so you can follow along.


As always, please feel free to amend the code as you wish.


Translated from: https://www.freecodecamp.org/news/pearson-p-values-and-plots-d5eed2fd6d1a/
