Differences between GATK4 Mutect2 and HaplotypeCaller

1. Mutect2 cannot calculate reference confidence, the HaplotypeCaller feature that is key to producing GVCFs. As a result, there is currently no way to perform joint calling for somatic variant discovery.

2. Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF refer only to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which INFO annotations are typically derived from data for all observed alleles, including the reference.

3. HaplotypeCaller relies on a fixed ploidy assumption to calculate the genotype likelihoods that are the basis for genotype probabilities (PL), whereas Mutect2 allows for varying ploidy in the form of an allele fraction for each variant. Varying allele fractions are often seen within a tumor sample due to fractional purity, multiple subclones and copy number variation.

4. Mutect2 also differs from HaplotypeCaller in that it can apply various prefilters to sites and alleles, depending on whether it is given a matched normal, a panel of normals (PoN) and a common population variant resource containing allele-specific frequencies. If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.

5. The variant site annotations that HaplotypeCaller and Mutect2 apply by default are very different; see their respective tool documentation for details.

6. Finally, Mutect2 has additional parameters not available to HaplotypeCaller. These parameters factor into the decision to perform reassembly on a region, whether to emit a variant and whether to filter a site:

  • First, the frequency of alleles not in the germline resource (--af-of-alleles-not-in-resource) defines the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.

  • Second, the log somatic prior (--log-somatic-prior) defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.

  • Third, the normal log odds ratio (--normal-lod) defines the filter threshold for variants in the tumor not being in the normal, i.e. the germline risk factor.

  • Fourth, the tumor log odds ratio for emission (--tumor-lod-to-emit) defines the cutoff for a tumor variant to appear in a callset.
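
To make the parameters above concrete, here is a minimal Python sketch that assembles a Mutect2 command line without running it. The file names and the sample name are hypothetical placeholders, the numeric values are purely illustrative rather than recommended settings, and whether a given GATK4 release accepts each of these flags is an assumption to check against that release's tool documentation.

```python
import shlex


def build_mutect2_command(reference, tumor_bam, normal_bam, normal_sample,
                          germline_resource, panel_of_normals, output_vcf,
                          af_not_in_resource=None, tumor_lod_to_emit=None,
                          normal_lod=None):
    """Return a Mutect2 argument list; nothing is executed here."""
    cmd = [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-I", normal_bam,
        "-normal", normal_sample,                # sample name of the matched normal
        "--germline-resource", germline_resource,
        "-pon", panel_of_normals,
        "-O", output_vcf,
    ]
    # Optional tuning parameters named in the text; added only when set.
    optional = [
        ("--af-of-alleles-not-in-resource", af_not_in_resource),
        ("--normal-lod", normal_lod),
        ("--tumor-lod-to-emit", tumor_lod_to_emit),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical inputs, for illustration only.
cmd = build_mutect2_command(
    reference="ref.fasta",
    tumor_bam="tumor.bam",
    normal_bam="normal.bam",
    normal_sample="NORMAL_SAMPLE_NAME",
    germline_resource="germline_resource.vcf.gz",
    panel_of_normals="pon.vcf.gz",
    output_vcf="somatic.vcf.gz",
    tumor_lod_to_emit=3.0,   # illustrative value, not a recommendation
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, which makes it easy to see which optional thresholds were changed from the tool's defaults.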

Historical perspective explains some quirks of somatic calling

Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason stems from the original intent for somatic callsets in cancer research.

  • First and foremost, protect patient privacy. Germline variants, in particular those in untranslated or noncoding regions of the genome, can identify individuals. The evolutionary constraints that mutations in coding regions are subject to, e.g. against detrimental frameshift mutations, do not apply to noncoding regions, and this difference affects the extent to which mutations can identify individuals. To protect patient identities, somatic calling was designed to avoid passing on any identifying germline variation from untranslated and noncoding regions. Somatic mutations in coding regions do not identify individuals and are publicly sharable according to TCGA policies.

  • Maximize specificity. Somatic calling in cancer research was intended to generate data for use in computational analyses. These analyses focus on triangulating cancer driver genes in cancer cohorts. Because of the number of samples within a cohort, an analysis can tolerate loss of signal from individual samples. By the same token, however, an analysis can also pick up recurrent artifacts of sequencing technology. Researchers therefore prefer to remove as many false positives as possible, even at the expense of losing some true positives.

Somatic callers reflect these two preferences in their stringent filtering, either upfront, such that a variant call is not emitted, or downstream, such that a site is annotated in the FILTER column with the filter name.

A somatic caller should detect low fraction alleles, can make no explicit ploidy assumption and omits genotyping in the traditional sense. Mutect2 adheres to all of these criteria. A number of cancer sample characteristics necessitate such caller features. First, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction can be much higher than the tumor fraction of a sample. Second, a tumor can be heterogeneous in its mutations. Third, these mutations not uncommonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.
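
A small worked example helps show why these sample characteristics push allele fractions away from the 0.5 or 1.0 that a diploid germline model expects. The sketch below uses a commonly used back-of-the-envelope formula, not anything taken from Mutect2 itself, and assumes that normal cells are diploid and do not carry the variant; the function and parameter names are made up for illustration.

```python
def expected_allele_fraction(purity, mutated_copies, tumor_copy_number,
                             cancer_cell_fraction=1.0, normal_copy_number=2):
    """Expected fraction of reads carrying a somatic variant at a locus.

    purity               fraction of cells in the sample that are tumor cells
    mutated_copies       copies of the locus carrying the variant per mutated cell
    tumor_copy_number    total copies of the locus per tumor cell
    cancer_cell_fraction fraction of tumor cells carrying the variant (subclonality)
    normal_copy_number   total copies per normal cell (assumed diploid)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0.0 <= cancer_cell_fraction <= 1.0:
        raise ValueError("cancer_cell_fraction must be between 0 and 1")
    # Total copies of the locus contributed by tumor and normal cells.
    total_copies = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    if total_copies <= 0:
        raise ValueError("locus has no copies in the sample")
    variant_copies = purity * cancer_cell_fraction * mutated_copies
    return variant_copies / total_copies


# Pure, clonal, diploid tumor: the familiar heterozygous fraction.
print(round(expected_allele_fraction(1.0, 1, 2), 3))
# 30% purity: the same heterozygous mutation drops well below 0.5.
print(round(expected_allele_fraction(0.3, 1, 2), 3))
# 30% purity, locus amplified to 4 copies, variant in half of the tumor cells.
print(round(expected_allele_fraction(0.3, 1, 4, cancer_cell_fraction=0.5), 3))
```

Each of the three factors named above (normal contamination, subclonal heterogeneity, copy number change) lowers or shifts the expected fraction independently, which is why a fixed-ploidy genotype model fits tumor data poorly.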

A variant allele in the case sample is not called if the site is variant in controls. An exception for GATK4 Mutect2 is explained below.

Historically, somatic callers have called somatic variants at the site level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, it is discounted from the somatic callset even if the variant allele differs from that in the control or resource. This practice stems in part from cancer study designs in which the control normal sample is sequenced at much lower depth than the case tumor sample. Because mutations are assumed to strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember that in humans, common germline variant sites occur on average roughly once per thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance that it arose from a true somatic event against the chance that it is something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case. The variant is unlikely to have arisen from a somatic event and more likely to be a germline variant or some artifact, e.g. from mapping or cross-sample contamination.

GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites shared with the panel of normals or with a matched normal control's unambiguously variant site. If the matched normal's variant allele is supported by few reads, at low allele fraction, then the tool accounts for the possibility that the site is not a germline variant.

When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and those in the case sample. That is, Mutect2 will call a variant site somatic if the allele differs from that in the germline resource.
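
The difference between site-level and allele-level comparison can be shown with a toy Python sketch. This is only an illustration of the two matching rules described above, not Mutect2's implementation; variants are represented here as hypothetical (chrom, pos, ref, alt) tuples.

```python
def shared_at_site_level(case_variant, resource):
    """True if the resource has any variant at the same chromosome and position."""
    chrom, pos, _ref, _alt = case_variant
    return any((c, p) == (chrom, pos) for c, p, _r, _a in resource)


def shared_at_allele_level(case_variant, resource):
    """True only if the resource has the same ALT allele at the same site."""
    return case_variant in resource


# Hypothetical germline resource with one known A>G variant.
germline_resource = {("chr1", 1000, "A", "G")}

# The case sample shows a different ALT allele (A>T) at the same position.
case_variant = ("chr1", 1000, "A", "T")

# Site-level matching discounts the call, allele-level matching keeps it.
print(shared_at_site_level(case_variant, germline_resource))    # True
print(shared_at_allele_level(case_variant, germline_resource))  # False
```

Under the site-level rule the A>T call would be discounted because the position is known to vary; under the allele-level rule it remains a somatic candidate because the resource only records A>G.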

Somatic workflows filter case sites with multiple variant alleles. By logic similar to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls filters such sites with the multiallelic filter. A multiallelic site in the case sample may represent a somatic event, but it is more likely a germline variant site or an artifactual site.
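
As a rough illustration of what "multiple variant alleles at a site" looks like in a VCF record, the Python sketch below counts the comma-separated entries in the ALT column. It is a toy check, not the logic FilterMutectCalls uses; the `max_alt_alleles` threshold and the example records are assumptions made up for this example.

```python
def count_alt_alleles(vcf_line):
    """Count ALT alleles in one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # VCF columns: CHROM, POS, ID, REF, ALT, ...
    if alt == ".":
        return 0
    return len(alt.split(","))


def looks_multiallelic(vcf_line, max_alt_alleles=1):
    """Flag a record whose ALT allele count exceeds the (hypothetical) threshold."""
    return count_alt_alleles(vcf_line) > max_alt_alleles


# Hypothetical records: one biallelic site and one site with two ALT alleles.
biallelic = "chr1\t1000\t.\tA\tG\t.\t.\t."
multiallelic = "chr1\t2000\t.\tC\tT,G\t.\t.\t."

print(looks_multiallelic(biallelic))     # False
print(looks_multiallelic(multiallelic))  # True
```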

The panel of normals helps filter systematic artifacts of sequencing. Artifacts are apparent variants in the read data that are in fact false positives. Sequencing artifacts are not all random. Some come from sample preparation and appear in specific sequence contexts. Others come from mapping. These artifacts often look like low allele fraction somatic mutations. When somatic callsets are gathered in a cohort, these artifacts can present a strong signal, as they occur systematically in some fraction of samples. To remove such false signals, Mutect2 filters sites present in a given panel of normals (specified with -pon). Typically, a PoN is constructed from germline normal samples. First, calls are made with the same sensitivity as that used in somatic calling, i.e. with Mutect2. Second, the multiple normal samples are gathered into a cohort. Finally, the panel retains sites called in two or more samples. GATK4's CreateSomaticPanelOfNormals performs the latter two steps. Use of a PoN constructed from germline normals has the added benefit of filtering common germline variant sites. This is especially useful for somatic analysis of species that lack a common germline variant resource.
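
The "retain sites called in two or more samples" step can be sketched in a few lines of Python. This is a toy model of the idea only, not how CreateSomaticPanelOfNormals is implemented; sites are represented as hypothetical (chrom, pos) tuples and `min_samples` is an illustrative parameter name.

```python
from collections import Counter


def build_panel(per_sample_sites, min_samples=2):
    """Return the sites called in at least `min_samples` normal samples."""
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(set(sites))  # count each site at most once per sample
    return {site for site, n in counts.items() if n >= min_samples}


def filter_with_panel(tumor_sites, panel):
    """Drop tumor call sites that are present in the panel."""
    return [site for site in tumor_sites if site not in panel]


# Hypothetical call sites from three normal samples.
normals = [
    [("chr1", 100), ("chr2", 5)],
    [("chr1", 100)],
    [("chr3", 7)],
]
panel = build_panel(normals)
print(panel)  # {('chr1', 100)}

# A recurrent site is removed from the tumor calls, a private one is kept.
print(filter_with_panel([("chr1", 100), ("chr4", 42)], panel))  # [('chr4', 42)]
```

Requiring a site in at least two normals keeps one-off calls in a single normal from entering the panel, while artifacts that recur across samples are retained and later filtered from tumor callsets.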

