How does multiple testing correction work?

When prioritizing hits from a high-throughput experiment, it is important to correct for random events that falsely appear significant. How is this done and what methods should be used?

Imagine that you have just invested a substantial amount of time and money in a shotgun proteomics experiment designed to identify proteins involved in a particular biological process. The experiment successfully identifies most of the proteins that you already know to be involved in the process and implicates a few more. Each of these novel candidates will need to be verified with a follow-up assay. How do you decide how many candidates to pursue?

The answer lies in the tradeoff between the cost associated with a false positive versus the benefit of identifying a novel participant in the biological process that you are studying. False positives tend to be particularly problematic in genomic or proteomic studies where many candidates must be statistically tested.

Such studies may include identifying genes that are differentially expressed on the basis of microarray or RNA-Seq experiments, scanning a genome for occurrences of candidate transcription factor binding sites, searching a protein database for homologs of a query protein or evaluating the results of a genome-wide association study. In a nutshell, the property that makes these experiments so attractive—their massive scale—also creates many opportunities for spurious discoveries, which must be guarded against.

In assessing the cost-benefit tradeoff, it is helpful to associate with each discovery a statistical confidence measure. These measures may be stated in terms of P-values, false discovery rates or q-values. The goal of this article is to provide an intuitive understanding of these confidence measures, a sense for how they are computed and some guidelines for how to select an appropriate measure for a given experiment.

As a motivating example, suppose that you are studying CTCF, a highly conserved zinc-finger DNA-binding protein that exhibits diverse regulatory functions and that may play a major role in the global organization of the chromatin architecture of the human genome [1]. To better understand this protein, you want to identify candidate CTCF binding sites in human chromosome 21. Using a previously published model of the CTCF binding motif (Fig. 1a), you can score each 20-nucleotide (nt) subsequence of chromosome 21 for its similarity to the CTCF motif. Considering both DNA strands, there are 68 million such subsequences. Figure 1b lists the top 20 scores from such a search.

Figure 1

(a) The binding preference of CTCF [2] represented as a sequence logo [9], in which the height of each letter is proportional to the information content at that position. (b) The 20 top-scoring occurrences of the CTCF binding site in human chromosome 21. Coordinates of the starting position of each occurrence are given with respect to human genome assembly NCBI 36.1. (c) A histogram of scores produced by scanning a shuffled version of human chromosome 21 with the CTCF motif. (d) This panel zooms in on the right tail of the distribution shown in c. The blue histogram is the empirical null distribution of scores observed from scanning a shuffled chromosome. The gray line is the analytic distribution. The P-value associated with an observed score of 17.0 is equal to the area under the curve to the right of 17.0 (shaded pink). (e) The false discovery rate is estimated from the empirical null distribution for a score threshold of 17.0. There are 35 null scores >17.0 and 519 observed scores >17.0, leading to an estimate of 6.7%. This procedure assumes that the number of observed scores equals the number of null scores.

Interpreting scores with the null hypothesis and the P-value

How biologically meaningful are these scores? One way to answer this question is to assess the probability that a particular score would occur by chance. This probability can be estimated by defining a 'null hypothesis' that represents, essentially, the scenario that we are not interested in (that is, the random occurrence of 20 nucleotides that match the CTCF binding site).

The first step in defining the null hypothesis might be to shuffle the bases of chromosome 21. After this shuffling procedure, high-scoring occurrences of the CTCF motif will only appear because of random chance. Then, the shuffled chromosome can be rescanned with the same CTCF matrix. Performing this procedure results in the distribution of scores shown in Figure 1c.
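
To make the procedure concrete, here is a minimal sketch of building such an empirical null distribution. The scoring function below is a hypothetical stand-in (a simple consensus-match count), not the log-odds scoring used in the actual CTCF scan.

```python
import random

def toy_score(window, consensus="ACGTACGTACGTACGTACGT"):
    """Hypothetical stand-in for a motif log-odds score: count of matches to an arbitrary consensus."""
    return sum(1 for a, b in zip(window, consensus) if a == b)

def empirical_null_scores(sequence, motif_width=20, score_fn=toy_score, seed=0):
    """Shuffle the sequence to destroy real motif occurrences, then score every window."""
    bases = list(sequence)
    random.Random(seed).shuffle(bases)
    shuffled = "".join(bases)
    return [score_fn(shuffled[i:i + motif_width])
            for i in range(len(shuffled) - motif_width + 1)]
```

Scanning both strands, as in the article, would additionally score the reverse complement of each window; that detail is omitted here for brevity.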

Although it is not visible in Figure 1c, out of the 68 million 20-nt sequences in the shuffled chromosome, only one had a score ≥26.30. In statistics, we say that the probability of observing this score under the null hypothesis is 1/68 million, or 1.5 × 10⁻⁸. This probability, the probability that a score at least as large as the observed score would occur in data drawn according to the null hypothesis, is called the P-value.

Likewise, the P-value of a candidate CTCF binding site with a score of 17.0 is equal to the fraction of scores in the null distribution that are ≥17.0. Among the 68 million null scores shown in Figure 1c, 35 are ≥17.0, leading to a P-value of 5.5 × 10⁻⁷ (35/68 million). The P-value associated with score x corresponds to the area under the null distribution to the right of x (Fig. 1d).
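
In code, the empirical P-value is simply the fraction of null scores at least as large as the observed score (a sketch; `null_scores` is the list produced by a shuffling procedure such as the one above).

```python
def empirical_p_value(observed_score, null_scores):
    """P-value = fraction of null scores at least as large as the observed score."""
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return exceed / len(null_scores)

# In the CTCF example, 35 of the roughly 68 million null scores are >= 17.0,
# giving the P-value quoted in the text.
```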

Shuffling the human genome and rescanning with the CTCF motif is an example of an 'empirical null model'. Such an approach can be inefficient because a large number of scores must be computed. In some cases, however, it is possible to analytically calculate the form of the null distribution and calculate corresponding P-values (that is, by defining the null distribution with mathematical formulae rather than by estimating it from measured data).

In the case of scanning for CTCF motif occurrences, an analytic null distribution (gray line in Fig. 1d) can be calculated using a dynamic programming algorithm, assuming that the sequence being scanned is generated randomly with a specified frequency of each of the four nucleotides [3]. This distribution allows us to compute, for example, that the P-value associated with the top score in Figure 1b is 2.3 × 10⁻¹⁰ (compared to 1.5 × 10⁻⁸ under the empirical null model). This P-value is more accurate and much cheaper to compute than the P-value estimated from the empirical null model.
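
The following is a minimal sketch of the dynamic-programming idea behind such an analytic null: under a background model in which positions are independent and each nucleotide has a fixed frequency, the exact score distribution is obtained by convolving the per-column score distributions of the motif matrix. It assumes integer-rounded scores so that the support stays small; the published algorithm [3] handles score discretization more carefully.

```python
from collections import defaultdict

def pssm_null_distribution(pssm, background):
    """Exact null distribution of motif scores under an i.i.d. background model.

    pssm: list of columns, each a dict mapping base -> integer-rounded log-odds score.
    background: dict mapping base -> assumed background frequency (summing to 1).
    Returns a dict mapping total score -> probability under the null.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for base, s in column.items():
                nxt[partial + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def analytic_p_value(observed_score, dist):
    """P-value = total null probability of scores >= the observed score."""
    return sum(p for s, p in dist.items() if s >= observed_score)
```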

In practice, determining whether an observed score is statistically significant requires comparing the corresponding statistical confidence measure (the P-value) to a confidence threshold α. For historical reasons, many studies use thresholds of α = 0.01 or α = 0.05, though there is nothing magical about these values. The choice of the significance threshold depends on the costs associated with false positives and false negatives, and these costs may differ from one experiment to the next.

Why P-values are problematic in a high-throughput experiment

Unfortunately, in the context of an experiment that produces many scores, such as scanning a chromosome for CTCF binding sites, reporting a P-value is inappropriate. This is because the P-value is only statistically valid when a single score is computed. For instance, if a single 20-nt sequence had been tested as a match to the CTCF binding site, rather than scanning all of chromosome 21, the P-value could be used directly as a statistical confidence measure.

In contrast, in the example above, 68 million 20-nt sequences were tested. In the case of a score of 17.0, even though it is associated with a seemingly small P-value of 5.5 × 10⁻⁷ (the chance of obtaining such a P-value from null data is less than one in a million), scores of 17.0 or larger were in fact observed in a scan of the shuffled genome, owing to the large number of tests performed. We therefore need a 'multiple testing correction' procedure to adjust our statistical confidence measures based on the number of tests performed.
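
A quick calculation makes the problem concrete: under the null hypothesis, the expected number of tests passing a P-value threshold p is simply n × p, so even a "one in a million" P-value is expected to occur dozens of times among 68 million tests.

```python
n_tests = 68_000_000     # 20-nt windows scanned on both strands of chromosome 21
p_threshold = 5.5e-7     # P-value associated with a score of 17.0
expected_null_hits = n_tests * p_threshold
print(round(expected_null_hits))   # ~37, consistent with the 35 null scores >= 17.0
```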

Correcting for multiple hypothesis tests

Perhaps the simplest and most widely used method of multiple testing correction is the Bonferroni adjustment. If a significance threshold of α is used, but n separate tests are performed, then the Bonferroni adjustment deems a score significant only if the corresponding P-value is ≤ α/n. In the CTCF example, we considered 68 million distinct 20-mers as candidate CTCF sites, so achieving statistical significance at α = 0.01 according to the Bonferroni criterion would require a P-value < 0.01/(68 × 10⁶) = 1.5 × 10⁻¹⁰. Because the smallest observed P-value in Figure 1b is 2.3 × 10⁻¹⁰, no scores are deemed significant after correction.
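
As a sketch, the Bonferroni procedure amounts to a single division of the significance threshold by the number of tests:

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Indices of tests whose P-values pass the Bonferroni-adjusted threshold alpha / n."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]

# For the CTCF scan, alpha / n = 0.01 / 68e6 ~ 1.5e-10; the smallest observed
# P-value (2.3e-10) misses this cutoff, so no sites are called significant.
```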

The Bonferroni adjustment, when applied with a threshold of α to a collection of n scores, controls the 'family-wise error rate'. That is, for the chosen score threshold, the adjustment ensures that the probability of observing one or more null scores above the threshold (one or more false positives) is at most α. Practically speaking, this means that, given a set of CTCF sites reported at a Bonferroni-adjusted significance threshold of α = 0.01, we can be 99% sure that none of the scores would be observed by chance in data drawn according to the null hypothesis.

In many multiple testing settings, controlling the family-wise error rate is too strict. Rather than insisting that we be 99% sure that none of the observed scores is drawn according to the null, it is frequently sufficient to identify a set of scores in which only a specified percentage is expected to be drawn according to the null. This is the basis of multiple testing correction using false discovery rate (FDR) estimation.

The simplest form of FDR estimation is illustrated in Figure 1e, again using an empirical null distribution for the CTCF scan. For a specified score threshold t = 17.0, we count the number s_obs of observed scores ≥ t and the number s_null of null scores ≥ t. Assuming that the total numbers of observed scores and null scores are equal, the estimated FDR is simply s_null/s_obs. In the case of our CTCF scan, the FDR associated with a score of 17.0 is 35/519 = 6.7%.
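
A direct implementation of this estimate (a sketch, assuming the observed and null score lists are the same size, as stated above):

```python
def empirical_fdr(observed_scores, null_scores, threshold):
    """Estimated FDR at a score threshold: (# null >= threshold) / (# observed >= threshold)."""
    s_null = sum(1 for s in null_scores if s >= threshold)
    s_obs = sum(1 for s in observed_scores if s >= threshold)
    return s_null / s_obs if s_obs else 0.0

# At threshold 17.0 in the CTCF scan: 35 null scores and 519 observed scores pass,
# giving an estimated FDR of 35 / 519, about 6.7%.
```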

Note that, in Figure 1e, FDR estimates were computed directly from the score. It is also possible to compute FDRs from P-values using the Benjamini-Hochberg procedure, which relies on the P-values being uniformly distributed under the null hypothesis [4]. For example, if the P-values are uniformly distributed, then the P-value 5% of the way down the sorted list should be ∼0.05. Accordingly, the procedure consists of sorting the P-values in ascending order and then dividing each observed P-value by its percentile rank to get an estimated FDR. In this way, small P-values that appear far down the sorted list will result in small FDR estimates, and vice versa.
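
A minimal implementation of this procedure follows. Each sorted P-value is divided by its percentile rank (i/n); the running minimum taken from the largest P-value downward keeps the estimates monotone, anticipating the q-value idea discussed below.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR estimates, one per input P-value (in the original order)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # indices, smallest P-value first
    fdr = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):                          # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        fdr[i] = running_min
    return fdr

# Example: benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.51])
# -> approximately [0.005, 0.02, 0.051, 0.051, 0.51]
```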

In general, when an analytical null model is available, you should use it to compute P-values and then use the Benjamini-Hochberg procedure because the resulting estimated FDRs will be more accurate. However, if you only have an empirical null model, then there is no need to estimate P-values in an intermediate step; instead you may directly compare your score distribution to the empirical null, as in Figure 1e.

These simple FDR estimation methods are sufficient for many studies, and the resulting estimates are provably conservative with respect to a specified null hypothesis; that is, if the simple method estimates that the FDR associated with a collection of scores is 5%, then on average the true FDR is ≤5%. However, a variety of more sophisticated methods have been developed for achieving more accurate FDR estimates (reviewed in ref. 5). Most of these methods focus on estimating a parameter π0, which represents the percentage of the observed scores that are drawn according to the null distribution. Depending on the data, applying such methods may make a big difference or almost no difference at all. For the CTCF scan, one such method [6] assigns slightly lower estimated FDRs to each observed score, but the number of sites identified at a 5% FDR threshold remains unchanged relative to the simpler method.
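
As one illustration of the idea (not the specific method applied to the CTCF scan), a simple Storey-style estimator exploits the fact that null P-values are roughly uniform, so the density of P-values above some cutoff λ reflects the null fraction; the published methods choose λ and smooth the estimate more carefully.

```python
def estimate_pi0(p_values, lam=0.5):
    """Rough estimate of pi0, the fraction of tests drawn from the null distribution."""
    n = len(p_values)
    return min(1.0, sum(1 for p in p_values if p > lam) / (n * (1.0 - lam)))

# Scaling the simple FDR (or Benjamini-Hochberg) estimates by pi0 < 1
# yields slightly smaller, less conservative estimates.
```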

Complementary to the FDR, Storey [6] proposed defining the q-value as an analog of the P-value that incorporates FDR-based multiple testing correction. The q-value is motivated, in part, by a somewhat unfortunate mathematical property of the FDR: when considering a ranked list of scores, it is possible for the FDR associated with the first m scores to be higher than the FDR associated with the first m + 1 scores. For example, the FDR associated with the first 84 candidate CTCF sites in our ranked list is 0.0119, but the FDR associated with the first 85 sites is 0.0111. Unfortunately, this property (called nonmonotonicity, meaning that the FDR does not consistently increase as we move down the ranked list) can make the resulting FDR estimates difficult to interpret. Consequently, Storey proposed defining the q-value as the minimum FDR attained at or above a given score. If we use a score threshold of T, then the q-value associated with T is the expected proportion of false positives among all of the scores above the threshold. This definition yields a well-behaved measure that is a function of the underlying score. We saw above that the Bonferroni adjustment yielded no significant matches at α = 0.01. If we use FDR analysis instead, then we are able to identify a collection of 519 sites at a q-value threshold of 0.05.
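
Given FDR estimates for the top-1, top-2, ... lists of a score ranking, the q-values are obtained with a single pass that enforces monotonicity (a sketch):

```python
def q_values(fdrs_by_rank):
    """q-value at rank m = minimum FDR attained at rank m or any larger (more permissive) rank."""
    q = list(fdrs_by_rank)
    for m in range(len(q) - 2, -1, -1):   # sweep upward from the bottom of the ranked list
        q[m] = min(q[m], q[m + 1])
    return q

# In the article's example, the FDRs 0.0119 (rank 84) and 0.0111 (rank 85) both
# become q-values of at most 0.0111, restoring a monotone confidence measure.
```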

In general, for a fixed significance threshold and fixed null hypothesis, performing multiple testing correction by means of FDR estimation will always yield at least as many significant scores as using the Bonferroni adjustment. In most cases, FDR analysis will yield many more significant scores, as in our CTCF analysis. The question naturally arises, then, whether a Bonferroni adjustment is ever appropriate.

Costs and benefits help determine the best correction method

Like choosing a significance threshold, choosing which multiple testing correction method to use depends upon the costs associated with false positives and false negatives. In particular, FDR analysis is appropriate if follow-up analyses will depend upon groups of scores. For example, if you plan to perform a collection of follow-up experiments and are willing to tolerate having a fixed percentage of those experiments fail, then FDR analysis may be appropriate. Alternatively, if follow-up will focus on a single example, then the Bonferroni adjustment is more appropriate.

It is worth noting that the statistics literature describes a related probability score, known as the 'local FDR' [7]. Unlike the FDR, which is calculated with respect to a collection of scores, the local FDR is calculated with respect to a single score. The local FDR is the probability that a particular test gives rise to a false positive. In many situations, especially if we are interested in following up on a single gene or protein, this score may be precisely what is desired. However, in general, the local FDR is quite difficult to estimate accurately.

Furthermore, all methods for calculating P-values or for performing multiple testing correction assume a valid statistical model—either analytic or empirical—that captures dependencies in the data. For example, scanning a chromosome with the CTCF motif leads to dependencies among overlapping 20-nt sequences. Also, the simple null model produced by shuffling assumes that nucleotides are independent. If these assumptions are not met, we risk introducing inaccuracies in our statistical confidence measures.

In summary, in any experimental setting in which multiple tests are performed, P-values must be adjusted appropriately. The Bonferroni adjustment controls the probability of making at least one false positive call. In contrast, false discovery rate estimation, as summarized in a q-value, controls the error rate among a set of tests. In general, multiple testing correction can be much more complex than is implied by the simple methods described here. In particular, it is often possible to design strategies that minimize the number of tests performed for a particular hypothesis or set of hypotheses. For a more in-depth treatment of multiple testing issues, see ref. 8.

References

1. Phillips, J.E. & Corces, V.G. Cell 137, 1194–1211 (2009).
2. Kim, T.H. et al. Cell 128, 1231–1245 (2007).
3. Staden, R. Methods Mol. Biol. 25, 93–102 (1994).
4. Benjamini, Y. & Hochberg, Y. J. R. Stat. Soc. B 57, 289–300 (1995).
5. Kerr, K.F. Bioinformatics 25, 2035–2041 (2009).
6. Storey, J.D. J. R. Stat. Soc. B 64, 479–498 (2002).
7. Efron, B., Tibshirani, R., Storey, J. & Tusher, V. J. Am. Stat. Assoc. 96, 1151–1161 (2001).
8. Dudoit, S. & van der Laan, M.J. Multiple Testing Procedures with Applications to Genomics (Springer, New York, 2008).
9. Schneider, T.D. & Stephens, R.M. Nucleic Acids Res. 18, 6097–6100 (1990).
