Planned Contrasts and Post hoc Tests & Multiple Testing Correction

While reading papers recently I kept running into post hoc analysis and the Bonferroni correction without knowing what they were, so I studied them and wrote up these notes for future reference.

Analysis of Variance

Some background on Analysis of Variance is also recorded in my earlier note: Normal Distribution & Chi-squared Distribution & t distribution & F-distribution

Below are some explanations I found illuminating:

a one-way ANOVA includes one factor, whereas a two-way ANOVA includes two factors.

the term factor is used to designate a nominal variable, or in the case of an experimental design, the independent variable, that designates the groups being compared. If we have a drug trial in which we are comparing the mean pain scores of patients after receiving placebo, a low dose of the drug, or a high dose of the drug, the factor would be “drug dose.”

the term levels refers to the individual conditions or values that make up a factor. In our drug trial example, we have three levels of drug dose: placebo, low dose, and high dose.

So how is this ANOVA thing different from the t-tests we already learned? Well, in fact, you can think of it as an extension of the t-test to more than 2 groups. If you run an ANOVA on just 2 groups, the results are equivalent to the t-test. The only difference is that you get an F-value instead of a t-value.
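The two-group equivalence can be checked numerically. Here is a small sketch using SciPy; the group names and scores are made up for illustration:

```python
from scipy import stats

# Illustrative pain scores for two hypothetical groups.
placebo = [6.1, 5.8, 7.0, 6.4, 6.7]
high_dose = [4.2, 3.9, 4.8, 4.5, 4.1]

t_stat, t_p = stats.ttest_ind(placebo, high_dose)
f_stat, f_p = stats.f_oneway(placebo, high_dose)

# With exactly two groups, F equals t squared and the two p-values match.
```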


Planned Contrasts and Post hoc Tests

Planned contrasts and post-hoc tests are commonly performed following Analysis of Variance.

This is necessary in many instances, because ANOVA compares all individual mean differences simultaneously, in one test (referred to as an omnibus test).

If we run an ANOVA hypothesis test, and the F-test comes out significant, this indicates that at least one among the mean differences is statistically significant.

However, when the factor has more than two levels, it does not indicate which means differ significantly from each other.

In this example, a significant F-test result from a one-way ANOVA with the three drug dose conditions does not tell us where the significant difference lies.
Is it between 0 and 100 mg? Or between 100 and 200 mg? Or is it only the biggest difference that is significant – 0 vs. 200 mg?

Planned contrasts and post hoc tests are additional tests to determine exactly which mean differences are significant, and which are not.

Why can we not just run 3 independent-means t-tests here? Each time we conduct a t-test we take on a certain risk of a Type I error; if we run 3 tests, we take on up to triple that risk.

So first we test for omnibus significance using the overall ANOVA as detailed in the first part of this chapter.
Then, if a statistically significant difference exists among the means, we do the pairwise comparisons with an adjustment to be more conservative.

These follow-up tests are designed specifically to avoid inflating risk of Type I error.

Now, this is very important. We are only allowed to conduct these tests if the F-test result was significant.

Planned contrasts

Concept

Planned contrasts are used when researchers know in advance which groups they expect to differ.

For example, suppose from our worksheet example, we expect the pop group to differ from the classical group on our measure of working memory. We can then conduct a single comparison between these means without worrying about Type I error.

Because we hypothesized this difference before we saw the data, perhaps based on prior research studies or a strong intuitive hunch, and because there is only one comparison to be analyzed, we need not be concerned about inflated experimentwise alpha.
If multiple comparisons are planned, then we will need to adjust the significance level.

Calculation steps

Let us take a look at how to conduct a single planned contrast. The process is quite simple, as it is just a modified ANOVA analysis.
First we calculate the between-groups sum of squares (SSB) using just the two groups involved in the planned contrast, and we compute the between-groups degrees of freedom using just those two groups.

Then, we calculate the variance between using the new SSB and degrees of freedom, and we calculate an F-test for the comparison using the new variance between and the original overall variance within.

To find out if the F-test result is significant, we can use the new degrees of freedom but the original significance level for the cutoff. (Because there is just one pairwise comparison, we can use original significance level.)
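The steps above can be sketched numerically; the data and group names (`g1` through `g3`) are invented for illustration:

```python
import numpy as np

# Hypothetical data for a three-group design; the planned contrast
# compares g1 against g3 only.
g1 = np.array([3.0, 4.0, 5.0, 4.0])
g2 = np.array([5.0, 6.0, 7.0, 6.0])
g3 = np.array([7.0, 8.0, 9.0, 8.0])
groups = [g1, g2, g3]

# Variance within comes from the ORIGINAL overall ANOVA (all groups).
n_total = sum(len(g) for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_within = n_total - len(groups)
ms_within = ss_within / df_within

# SSB and the between degrees of freedom use ONLY the two contrasted groups.
pair = [g1, g3]
pair_mean = np.concatenate(pair).mean()
ss_between = sum(len(g) * (g.mean() - pair_mean) ** 2 for g in pair)
df_between = len(pair) - 1
ms_between = ss_between / df_between

# Compare this F against the critical value of F(df_between, df_within)
# at the ORIGINAL significance level.
f_contrast = ms_between / ms_within
```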

Multiple planned contrasts

If we were to perform multiple planned contrasts, things change a little.

Suppose we had hypothesized in this experiment that each group would differ from every other group.

The Bonferroni correction involves adjusting the significance level to protect from the inflation of risk of Type I error.

The procedure for each comparison is the same as for a single planned contrast. The difference is that the cutoff score to determine statistical significance will use a more conservative significance level.

When we do multiple pairwise comparisons, the Bonferroni correction is to use the original significance level divided by number of planned contrasts.
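A small sketch of the adjustment; the p-values are invented for illustration:

```python
# Bonferroni: divide the original significance level by the number of
# planned contrasts, then test each contrast against the adjusted level.
alpha = 0.05
n_contrasts = 3                       # e.g. all three pairwise contrasts
alpha_adjusted = alpha / n_contrasts  # 0.05 / 3

# Hypothetical p-values for the three contrasts.
p_values = [0.004, 0.030, 0.200]
significant = [p <= alpha_adjusted for p in p_values]
```

Note that 0.030 would pass the unadjusted 0.05 level but fails the adjusted one.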

Post hoc tests

Concept

What about post hoc tests?

As the name suggests, these tests come into the picture when we are doing pairwise comparisons (usually all possible combinations) after the fact to find out where the significant differences were.

These are tests that do not require that we had an a priori hypothesis ahead of data collection.

Essentially, these are an allowable and acceptable form of data-snooping.

Multiple post hoc tests

This is where we must be cautious about doing so many tests – we could end up with huge risk of Type I error.

If we use the Bonferroni correction that we saw for multiple planned comparisons on more than 3 tests, the significance level would be vanishingly small.
This would make it nearly impossible to detect significant differences.

For this reason, slightly more forgiving procedures such as Scheffé's correction and Dunn's or Tukey's post hoc tests are more popular.

There are many different post-hoc tests out there, and the choice of which one researchers use is often a matter of convention in their area of research.

Calculation steps

Now we shall take a look at how to conduct post hoc tests using Scheffé’s correction.

In this example, we will test all pairwise comparisons.

The Scheffé technique involves adjusting the F-test result, rather than adjusting the significance level.

The way it works is the same as the planned contrast procedure, except for the very end.


Before we compare the F-test result to the cutoff score, we divide the F value by the overall degrees of freedom between, or the number of groups minus one.

Thus, we keep the significance level at the original level, but divide the calculated F by overall degrees of freedom between from the overall ANOVA.
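The Scheffé adjustment is a one-line calculation; the F value below is hypothetical:

```python
# Scheffé: keep alpha at the original level, but divide the comparison's F
# by the overall degrees of freedom between (number of groups minus one)
# before comparing it to the usual critical F.
k = 3                      # number of groups in the overall ANOVA
df_between_overall = k - 1
f_comparison = 7.2         # hypothetical F from one pairwise comparison
f_scheffe = f_comparison / df_between_overall
```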

Multiple Testing Correction

Hypothesis Testing

Hypothesis testing is also covered in my earlier note: Normal Distribution & Chi-squared Distribution & t distribution & F-distribution

Hypothesis testing is a method for testing statistical hypotheses. Its basic idea is the small-probability principle: a small-probability event is essentially not expected to occur in a single trial.

The basic approach of hypothesis testing is to state a null hypothesis, denoted $H_0$, and then the alternative hypothesis of interest, denoted $H_1$ (or $H_A$).

The guiding principle is that the null hypothesis is the uninteresting conclusion that matters little to the research, while the alternative hypothesis is the conclusion we are interested in and want to establish.

For example, given some physiological measurements of a person, decide whether they have a certain disease (say, pneumonia).

$H_0$: the person is healthy (not of interest); $H_1$: the person is ill (of interest).
We compare the person's measurements with sample data from people already known to be healthy or ill and compute a $p$-value, typically at a significance level of $\alpha = 0.05$. If the $p$-value is below 0.05, the observed result would be a small-probability event under the null hypothesis. Following the small-probability principle, rather than believe that such an event happened, the more reasonable choice is to reject the null hypothesis and conclude that the person is ill. Otherwise we fail to reject (i.e. retain) the null hypothesis, meaning there is insufficient evidence to conclude that the person is ill.

Notes:

  1. Statistical significance: the level of risk of rejecting the null hypothesis when it is actually true; also called the probability level.
  2. $p$-value: the probability, assuming the null hypothesis is true, of obtaining the observed sample result or a more extreme one; a key measure of statistical significance.
  3. Significance level $\alpha$: the probability of wrongly rejecting the null hypothesis when it is true; equivalently, the decision risk we accept in a hypothesis test.
  4. Instead of computing a $p$-value, one can compute a test statistic and check whether it falls in the rejection region for the chosen significance level. The statistic is less intuitive than the $p$-value, so results are usually stated in terms of $p$-values.
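The decision rule above can be sketched with a one-sample t-test; the data are invented for illustration:

```python
from scipy import stats

# Hypothetical measurements; H0: the population mean equals 5.0.
data = [5.2, 5.8, 6.1, 5.5, 6.0, 5.9]
t_stat, p_val = stats.ttest_1samp(data, popmean=5.0)

alpha = 0.05
reject_h0 = p_val < alpha  # a small p-value -> reject the null hypothesis
```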

A single hypothesis test has four possible outcomes, shown in the table below:

| | Fail to reject $H_0$ | Reject $H_0$ |
| --- | --- | --- |
| $H_0$ true | Correct decision (TN), probability $1-\alpha$ | Type I error (FP), probability $\alpha$ |
| $H_0$ false | Type II error (FN), probability $\beta$ | Correct decision (TP), probability $1-\beta$ |

  • Type I error, also called an $\alpha$ error
  • Type II error, also called a $\beta$ error
  • FP: false positive, a Type I error
  • FN: false negative, a Type II error
  • TP: true positive
  • TN: true negative

A Type I error occurs when the null hypothesis is true but we reject it; the probability of making this error is denoted $\alpha$, so it is also called an $\alpha$ error, or rejecting a true hypothesis.

A Type II error occurs when the null hypothesis is false but we fail to reject it; the probability of making this error is denoted $\beta$, so it is also called a $\beta$ error, or retaining a false hypothesis.

Accordingly, the probability of not rejecting a true null hypothesis is $1-\alpha$, and the probability of rejecting a false null hypothesis is $1-\beta$.

Multiple Hypothesis Testing: FWER and FDR

As the name suggests, multiple hypothesis testing simply means running many hypothesis tests. If $m$ people must be checked for a disease, then $m$ hypothesis tests are needed. The outcomes of the $m$ tests can be summarized as:

| | Fail to reject | Reject | Total |
| --- | --- | --- | --- |
| Null hypothesis true | $U$ | $V$ | $m_0$ |
| Alternative true | $T$ | $S$ | $m - m_0$ |
| Total | $m - R$ | $R$ | $m$ |

  • $m$: the number of hypothesis tests
  • $m_0$: the number of true null hypotheses
  • $m - m_0$: the number of true alternative hypotheses
  • $V$: the number of false positives
  • $S$: the number of true positives
  • $U$: the number of true negatives
  • $T$: the number of false negatives
  • $R = V + S$: the number of rejected null hypotheses

If a hypothesis test yields a $p$-value below the significance level $\alpha$, we reject the null hypothesis and conclude, rightly or wrongly, that we have found a sick person; each such case is called a discovery.

So $R = V + S$ is the number of discoveries, $V$ the number of false discoveries, and $S$ the number of true discoveries.

Let $Q$ denote the proportion of false discoveries among all discoveries: $Q = V/R = V/(V+S)$.

$\mathrm{FWER}$ is defined as the probability that $V \geq 1$: $\mathrm{FWER} = \Pr\{V \geq 1\} = 1 - \Pr\{V = 0\}$.

$\mathrm{FDR}$ is defined as the expectation of $Q$: $\mathrm{FDR} = E[Q]$.

Because $V, S, U, T$ are all random variables across the $m$ tests, the FDR must be written as an expectation. Moreover, if $R = 0$ we set $Q = 0$; to cover that case, $\mathrm{FDR} = E[V/R \mid R > 0] \cdot \Pr\{R > 0\}$. Informally, one can read this as $\mathrm{FDR} = Q = V/R = V/(V+S)$.
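A tiny bookkeeping example, with invented counts, to make the symbols concrete:

```python
# Invented outcome of m = 10 tests.
V, S = 1, 3        # false discoveries, true discoveries
U, T = 6, 0        # true non-discoveries, false non-discoveries
m = V + S + U + T  # total number of tests

R = V + S                    # discoveries (rejected null hypotheses)
Q = V / R if R > 0 else 0.0  # proportion of discoveries that are false
```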

FWER and FDR each name both a concept and a family of methods.

As concepts, FWER is the probability of making at least one Type I error across the multiple tests, while FDR is the expected proportion of false discoveries among all discoveries.

Correspondingly, there are FWER correction methods and FDR correction methods (also called control procedures).

Both families of corrections aim to keep the probability of Type I errors in multiple hypothesis testing below the significance level $\alpha$.

FWER control has several implementations, the most classic being the Bonferroni correction; FDR control also has several, the most classic being the Benjamini–Hochberg procedure.

Multiple testing correction

In a single hypothesis test, we draw a conclusion from the significance level $\alpha$ and the $p$-value. $\alpha$ is usually set to 0.05 or 0.01, which keeps the probability of a Type I error (and thus the decision risk) in that one test below $\alpha$.

Across $m$ hypothesis tests, however, things change. Suppose $m = 100$, $\alpha = 0.01$, and the tests are mutually independent. The probability of making no Type I error at all is $(1-0.01)^{100} \approx 36.6\%$, so the probability of making at least one is as high as $P = 1 - (1-0.01)^{100} \approx 63.4\%$.
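The arithmetic above in a few lines:

```python
# Probability of at least one Type I error across m independent tests,
# each run at significance level alpha.
m = 100
alpha = 0.01
p_no_error = (1 - alpha) ** m    # about 0.366
p_at_least_one = 1 - p_no_error  # about 0.634
```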

As a concrete example, suppose a diagnostic test for HIV is 99% accurate (one false positive per 100 diagnoses). For a single person being tested, that accuracy may be adequate. For a hospital running many tests, it is far from adequate: for every 10,000 people diagnosed, about 100 would be falsely diagnosed as HIV-positive, which is clearly unacceptable. In short, without any control, the probability of Type I errors in multiple testing grows rapidly with the number of tests.

To address the problems caused by running many tests, we need to correct for multiple testing.

Both FWER and FDR corrections keep the overall probability of Type I errors in multiple hypothesis testing below the preset significance level $\alpha$.

FWER control is comparatively conservative: it works mainly by reducing the number of false positives (Type I errors), but in doing so it also reduces the true discovery rate.

FDR control is a newer and often more practical approach: it assigns each test a corrected $p$-value (the $q$-value) and strikes a better balance, detecting as many positives as possible while keeping the false discovery rate within an acceptable range.

FWER and FDR Corrections

Both FWER and FDR correction have multiple implementations; here we introduce only the classic implementation of each.

Setup: in $m$ hypothesis tests, denote the null hypotheses by $H_1, H_2, \ldots, H_m$ and the corresponding $p$-values by $p_1, p_2, \ldots, p_m$, with significance level $\alpha$.

Family-wise error rate (FWER): Bonferroni correction

The Bonferroni correction is a method used in statistics to correct for multiple comparisons, named after the Italian mathematician Carlo Emilio Bonferroni.

Let $H_1, \ldots, H_m$ be a family of hypotheses and $p_1, \ldots, p_m$ the corresponding $p$-values, where $m$ is the total number of null hypotheses and $m_0$ the number of null hypotheses that are actually true. The family-wise error rate (FWER) is the probability of rejecting at least one true null hypothesis, i.e. of making at least one Type I error. The Bonferroni correction rejects every null hypothesis with $p_i \leq \frac{\alpha}{m}$. With this correction, $\text{FWER} \leq \alpha$, which follows from Boole's inequality:

$$\text{FWER} = \Pr\left\{\bigcup_{i=1}^{m_0}\left(p_i \leq \frac{\alpha}{m}\right)\right\} \leq \sum_{i=1}^{m_0} \Pr\left(p_i \leq \frac{\alpha}{m}\right) \leq m_0\,\frac{\alpha}{m} \leq \alpha$$

This keeps the overall probability of a Type I error across the multiple tests below the preset significance level $\alpha$.

Note that the Bonferroni (FWER) correction does not require the hypotheses to be independent of one another, nor does it place any requirement on the number of true null hypotheses.

The Bonferroni correction is a relatively conservative FWER control method and increases the probability of Type II errors.
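A minimal implementation sketch of the Bonferroni rejection rule; the p-values are invented:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i exactly when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# With m = 5 the per-test threshold becomes 0.05 / 5 = 0.01, so p-values
# that would pass the unadjusted 0.05 level (0.020, 0.049) no longer do.
rejections = bonferroni_reject([0.001, 0.009, 0.020, 0.049, 0.600])
```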

False discovery rate (FDR): Benjamini–Hochberg procedure

False discovery rate

The false discovery rate (FDR) refines how multiple hypothesis tests are assessed.

$$\mathrm{FDR} = Q_e = \mathrm{E}[Q], \quad Q = V/R = V/(V+S)$$

where $E$ denotes expectation, $V$ is the number of falsely rejected null hypotheses, and $R$ is the number of rejected null hypotheses. When $R = 0$, the FDR is taken to be $0$; written in one line, $\mathrm{FDR} = \mathrm{E}\left[V/R \mid R > 0\right] \cdot \Pr(R > 0)$.

The FDR is used to correct for the errors induced by multiple comparisons. When many null hypotheses are rejected, FDR-controlling procedures bound the chance of falsely rejecting null hypotheses (false positives) while still finding a useful set of rejections.

Compared with FWER control (e.g. the Bonferroni correction, which in effect tolerates no false positives at all), FDR procedures apply a looser criterion. FDR control therefore accepts a somewhat higher rate of Type I errors (rejecting null hypotheses that should have been retained) in exchange for greater statistical power.

Benjamini–Hochberg procedure

The Benjamini–Hochberg procedure (BH) works as follows. First sort all the $p$-values in ascending order and denote them $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$, with corresponding null hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(m)}$.

To control the FDR at level $\alpha$, find the largest positive integer $k$ such that $p_{(k)} \leq \frac{k\alpha}{m}$ (Formula 1). Then reject all the null hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(k)}$.

In effect, BH gives the sorted hypotheses different significance thresholds, one per rank.

Since the FDR is the expectation of $Q$, i.e. $\mathrm{FDR} = E[Q]$, it can be shown that this procedure guarantees $\mathrm{FDR} \leq \alpha$, keeping the overall probability of Type I errors in the multiple tests below the preset significance level $\alpha$.

Note that BH is valid under the condition that the $m$ tests are mutually independent.

Finally, a word on the $q$-value: $q = \frac{p_{(k)} \cdot m}{k}$, obtained by rearranging Formula 1. The $q$-value is commonly called the corrected (adjusted) $p$-value.
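A minimal sketch of the BH procedure and q-values. The p-values are invented, and the q-value helper adds the standard monotonicity step (taking a running minimum from the largest rank down), which goes slightly beyond the simple formula above:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H_(1..k), where k is the largest rank with p_(k) <= k*alpha/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:   # Formula 1 in the text
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject

def bh_q_values(p_values):
    """q-value per test: p_(k) * m / k, made non-decreasing in p."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

rejections = benjamini_hochberg([0.01, 0.04, 0.30, 0.005], alpha=0.05)
q_vals = bh_q_values([0.01, 0.04, 0.30, 0.005])
```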

References

How to intuitively understand the family-wise error rate (FWER) and false discovery rate (FDR)

Multiple hypothesis testing, the Bonferroni correction, and FDR correction

Bonferroni correction (Chinese Wikipedia)

False discovery rate (Chinese Wikipedia)

Family-wise error rate

False discovery rate

Beginner Statistics for Psychology
