GATK4 somatic mutations analysis

Section 1 用Mutect2调用体细胞突变

Section 2 概述如何使用Mutect2的纯肿瘤模式创建panel of normal resource

Section 3 概述了如何估计交叉样本污染

Section 4 显示了如何用FilterMutectCalls过滤callset。与GATK3不同,GATK4中,体细胞调用和过滤由不同的工具来体现

Section 5 显示了一个可选的过滤步骤,以过滤带有方向性偏差的序列背景artifacts

Section 6 显示了如何在IGV中设置手动观察

参考基因组resource:

GATK Resource Bundle

gs://gatk-best-practices/somatic-hg38

1. call somatic short variants and generate a bamout with Mutect2

这里我们有一个相当复杂的命令,使用Mutect2调用HCC1143肿瘤样本上的体细胞变异。关于体细胞调用的概要,见这篇文章。该命令调用了肿瘤样本中的体细胞变异,并使用了一个匹配的正常人、一个正常人面板(PoN)和一个群体生殖系变异资源。Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see this article. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
    -R hg38/Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -I normal.bam \
    -tumor HCC1143_tumor \
    -normal HCC1143_normal \
    -pon resources/chr17_pon.vcf.gz \
    --germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
    --af-of-alleles-not-in-resource 0.0000025 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -L chr17plus.interval_list \
    -O 1_somatic_m2.vcf.gz \
    -bamout 2_tumor_normal_m2.bam 

这将产生一个未经过滤的体细胞呼叫集1_somatic_m2.vcf.gz,一个重新组装的读数BAM 2_tumor_normal_m2.bam和各自的索引1_somatic_m2.vcf.gz.tbi和2_tumor_normal_m2.bai。

Comments on select parameters

用两个参数指定体细胞调用的病例样本。用-I提供BAM,用-tumor提供样本的读组样本名称(SM字段值)。使用GetSampleName来查找读组SM字段。或者,使用Samtools view -H tumor.bam | grep '@RG'。Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.

预先过滤控制样本排列中的变异点。用-I指定对照BAM,用-normal指定对照样本的读组样本名称(SM字段值)。在肿瘤与正常对照相匹配的情况下,我们甚至可以排除罕见的种系变异和个体特异性的伪影。如果我们用Mutect2分析我们的肿瘤样本,而没有匹配的正常样本,我们得到的调用比匹配的正常样本多一个数量级。Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.

预先过滤常模调用集中的变异点。用-pon指定常模面板(PoN)的VCF。第2节概述了如何创建一个PoN。常态面板不仅代表了常见的种系变异位点,它还展示了测序数据中常见的噪声位点,如测序中的映射神器或其他有些随机但系统的神器。默认情况下,该工具不会重新组合也不会发出与PoN变体完全匹配的变体位点。要启用PoN位点的基因分型,请使用--基因分型-on-sites选项。如果匹配不准确,例如有一个等位基因不匹配,该工具会重新组合该区域,发出呼叫并在INFO字段中用IN_PON注释匹配。Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.

通过用--种系资源指定一个种群种系资源--germline-resource来注释变异等位基因。种系资源必须包含特定的等位基因频率,即它必须包含INFO字段中的AF注释[4]。该工具用群体等位基因频率来注释变异的等位基因。当使用群体种系资源时,可以考虑将--af-of-alleles-not-in-resource参数从其默认的0.001调整。例如,gnomAD资源af-only-gnomad_grch38.vcf.gz代表了~20万个外显子和~16万个基因组,而教程数据是外显子数据,所以我们将--af-of-alleles-not-in-resource调整为0.0000025,相当于1/(2*外显子样本)。默认的0.001适合于没有任何群体资源的人类样本分析。它是基于人类的平均杂合率。群体等位基因频率(POP_AF)和非资源性等位基因的因素在变异体的概率计算中。Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.

包括那些配偶映射到不同Contig的读数。对于我们使用alt-aware和alt后处理的对准GRCh38的体细胞分析,我们用-disable-read-filter MateOnSameContigOrNoMappedMateReadFilter禁用一个特定的读数过滤器。这个过滤器会从分析中删除那些配偶映射到不同contig的配对读数。由于BWA对那些与交替的contig(对GRCh38的alt-aware映射)对齐较好的配偶信息进行纵横交错的方式,我们希望将这些类型的读数纳入我们的分析中。否则,我们可能会错过检测与交替单倍型相关的SNVs和indels。禁用这个过滤器偏离了目前的生产实践。Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.

用-L参数将分析锁定在特定的基因组区间。在这里,我们指定这个选项来加快我们在小型教程数据上的运行速度。对于我们在第4节中使用的完整调用集,调用的是整个数据,没有区间文件。Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.

用-bamout生成重新组装的排列组合文件。bamout排列包含了正常人和肿瘤的人工单倍型和重新组合的排列,可以手动审查调用。这个参数不是工具所要求的,但推荐使用,因为添加这个参数只需要花费总运行时间的一小部分。Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.

为了说明Mutect2是如何应用注释的,下面是全部调用集中的五个多平行基因点。用gzcat somatic_m2.vcf.gz | awk '$5 ~", "拉出这些。awk '$5 ~","'对第5列中含有逗号的记录进行子集。To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

我们看到每一个变异体调用有11列信息,包括正常和肿瘤的基因型调用。注意QUAL和FILTER的空字段,以及站点(INFO)和样本水平(第10和11列)的注释。每个样本都有基因型,当一个部位是多等位基因时,我们可以看到特定等位基因的注释。样本可能有额外的注释,例如与相位有关的PGT和PID。We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.

 1.1 What are the Mutect2 annotations?

我们可以在VCF头中查看标准的FORMAT级和INFO级Mutect2注释。We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

工具索引中的变体注释部分进一步描述了一些注释。关于GATK4中可用的注释的完整列表,请参见本网站。gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator at master · broadinstitute/gatk · GitHub

要启用依赖于非标准注释的特定过滤,或者只是添加额外的注释,请使用-A参数。例如,-A ReferenceBases为变体调用添加了ReferenceBases注解。注意,如果过滤器所依赖的注释不存在,FilterMutectCalls将跳过特定的过滤而没有任何警告信息。To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.

1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

为了了解其影响,请看一些数字。在所有其他读数过滤器之后,MateOnSameContigOrNoMappedMateReadFilter(MOSCO)过滤器从分析中额外删除了完整数据中 8.71% (8,681,271 个)肿瘤样本读数和 8.18% (6,256,996 个)正常样本读数。禁用 MOSCO 过滤器的影响是,交替等位基因上的读数和跨等位基因的读数对现在可以为变异调用提供支持。To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter(MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigsand read pairs that span contigs can now lend support to variant calls.

就教程数据而言,将通常由 MOSCO 过滤器过滤的读数包括在内,可使 Mutect2 调用的数量增加大约一倍。大部分额外调用来自 ALT、HLA 和诱饵等位基因。For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

我们利用三个种系样本做出了创建PoN的动议。这些样本是HG00190、NA19771和HG02759。

首先,在每个正常样本上以纯肿瘤模式运行Mutect2。在纯肿瘤模式下,用-肿瘤标志来分析单个病例样本,而没有附带的匹配对照-正常样本。在本教程中,我们只对HG00190样本运行此命令。

gatk Mutect2 \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
    -I HG00190.bam \
    -tumor HG00190 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -L chr17plus.interval_list \
    -O 3_HG00190.vcf.gz

这将产生一个调用集3_HG00190.vcf.gz和一个匹配索引。Mutect2调用样本中的变体,其敏感标准与在体细胞模式下调用肿瘤中的变体相同。由于该命令省略了触发前期过滤的选项,我们希望所有可检测的变异体都被调用。调用的变体将包括低等位基因分数的变体和具有多个变体等位基因的部位,即多等位基因部位。下面是3_HG00190.vcf.gz中的两条多等位基因记录。This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

我们看到对于每个位点,Mutect2都调用了参考等位基因和三个备用等位基因。GT基因型的调用是0/1/2/3。两个位点的AD等位基因深度分别为16,3,12,4和41,5,24,4。We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

 Comments on select parameters

这里没有使用的一个选项是使用 --germline-resource 包含种系资源。请记住第 1 节的内容,该资源必须在 INFO 栏中包含 AF 群体等位基因频率。在纯肿瘤模式下使用该资源,就像在体细胞模式下一样,可以预先过滤常见的种系变异等位基因。这实际上是从 PoN 中省略了常见的种系变异等位基因。请注意,相关的可选参数 --max-population-af(默认值 0.01)定义了等位基因频率的截止值。如果给定了一个资源,并读取了变异的证据,Mutect2 仍会发出等位基因频率小于或等于 --max-population-af 的变异等位基因。

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.

在常模样本调用面板中重述体细胞调用中使用的任何特殊选项,例如--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter。该选项适用于alt-aware和后alt处理的排列。

  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g.--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

其次,使用 CreateSomaticPanelOfNormals 将所有正常 VCF 整理成一个调用集。在教程中,为了用小数据说明这一步骤,我们在三个正常样本 VCF 上运行了这一命令。一般建议法线面板至少包含 40 个样本。Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
    -vcfs 3_HG00190.vcf.gz \
    -vcfs 4_NA19771.vcf.gz \
    -vcfs 5_HG02759.vcf.gz \
    -O 6_threesamplepon.vcf.gz

这将生成 PoN VCF 6_threesamplepon.vcf.gz 和索引。教程 PoN 包含 8,275 条记录。This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.

CreateSomaticPanelOfNormals 可保留两个或多个样本中存在变异的位点。如图所示,它保留了样本中的等位基因,但删除了所有其他注释,创建了一个八列、仅有位点的 VCF。CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

理想情况下,PoN 包括在技术上能代表肿瘤病例样本的样本--即在相同平台上使用相同化学方法(如外显子组捕获试剂盒)测序并使用相同工具链分析的样本。不过,即使是不匹配的 PoN 也能有效过滤大部分测序伪影。这是因为在短读测序方法中,几乎相同的基因组位点都会出现映射伪影和聚合酶滑动错误。Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

例如,考虑对代表个体集合或高度相似但截然不同的 DNA 分子(如线粒体 DNA)集合的数据进行变异调用。Mutect2 能以高效的计算方式在一个位点上调用多个变体。此外,只针对肿瘤的模式也可用于简单地调用两个样本之间的差异。For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples.

3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.

首先,在肿瘤 BAM 上运行 GetPileupSummaries,总结已知变异位点的读数支持。使用仅包含常见双等位基因变异的种群种系资源,例如使用 SelectVariants --restrict-alleles-to BIALLELIC 子集,以及 INFO 字段中的种群 AF 等位基因频率[4]。该工具将支持资源中位点的参考等位基因、候补等位基因和其他等位基因的读数计数制成表格。First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.

gatk GetPileupSummaries \
    -I tumor.bam \
    -V resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O 7_tumor_getpileupsummaries.table

这样就生成了如图所示的六列表格。Alt_count 是种系资源中支持 ALT 等位基因的读数计数。等位基因频率与种系资源中的等位基因频率一致。other_alt_count 的计数指的是支持所有其他等位基因的读数。This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

Comments on select parameters

该工具只考虑样本中等位基因频率介于--最小群体等位基因频率(默认值 0.01)和--最大群体等位基因频率(默认值 0.2)之间的同源交替位点。这些设置的原理如下。如果同源交替位点具有稀有等位基因,那么在存在跨样本污染的情况下,我们更有可能观察到 REF 或其他更常见等位基因的存在。这样我们就能更准确地测量污染情况。

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.

加快分析速度的一个选项是使用 -L 参数将数据收集限制在足够大的基因组区域内,但必须是子集。

  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.

自 2018 年 8 月 2 日发布的 GATK4.0.8.0 起,GetPileupSummaries 需要 -L 和 -V 两个参数。在教程中,请为每个参数提供相同的 resources/chr17_small_exac_common_3_grch38.vcf.gz 文件。有关详情,请参阅 GetPileupSummaries 工具文档。

  • As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

其次,使用计算污染(CalculateContamination)估算污染。该工具从 GetPileupSummaries 中获取汇总表,并给出污染分数。这一估算结果将为使用 FilterMutectCalls 进行下游过滤提供信息。Second, estimate contamination withCalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
    -I 7_tumor_getpileupsummaries.table \
    -O 8_tumor_calculatecontamination.table

这样就得到了一个包含污染和误差估计值的表格。肿瘤全样本的估计值如下所示,污染率为 0.0205。展望未来,我们知道要怀疑等位基因交替率小于 ~2% 的调用。This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.

Comments on select parameters

CalculateContamination 有两种操作模式。上面的命令使用的是简单估算给定样本污染度的模式。另一种模式结合了匹配的正常样本的度量指标,可以获得更准确的估计值。对于第二种模式,请在正常样本上运行 GetPileupSummaries,然后使用 -matched 参数为 CalculateContamination 提供正常堆积表。

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

交叉样本污染不同于正常样本污染肿瘤和肿瘤污染正常样本。目前,工作流程没有考虑后一种纯度问题。

 3.1 What if I find high levels of contamination?

需要排除的一个问题是在读取组层面上的样本交换。One thing to rule out is sample swaps at the read group level.

Picard 的 CrosscheckFingerprints 可以在读取组水平上检测样本交换,还可以测量两个样本的关联程度。由于测序过程中可能需要在不同泳道上对样本进行多重处理,并对样本的多个读数组进行重新分组,这取决于处理这些读数组的自动化程度,因此有可能包含来自不相关样本的读数组。在肿瘤样本中加入这样的交叉样本将不利于体细胞分析。在不涉及细节的情况下,该工具允许我们:(i) 在样本层面检查肿瘤和正常样本是否相关,因为它们必须来自同一个个体;(ii) 在读取组层面检查每个读取组数据是否来自同一个个体。Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

再设想一下,如果我们把污染读数组数据误认为是某个肿瘤亚群!教程中的正常样本和肿瘤样本分别由 16 个和 22 个读取组组成,当我们提供这些读取组并设置 EXPECT_ALL_GROUPS_TO_MATCH=true 时,CrosscheckReadGroupFingerprints(该工具现已被 CrosscheckFingerprints 取代)会通知我们所有读取组都与预期相关。Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.

4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls 可确定一个调用是否是可信的体细胞调用。该工具使用调用集内的注释,并应用针对人类体细胞分析而调整的预设阈值。FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

使用 FilterMutectCalls 过滤 Mutect2 调用集。这里我们使用的是完整的调用集 somatic_m2.vcf.gz。要激活基于污染估计的过滤,请使用 --contamination-table 提供污染表。在 GATK v4.0.0.0 中,该工具使用污染估计值作为硬截止值。Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
    -V somatic_m2.vcf.gz \
    --contamination-table tumor_calculatecontamination.table \
    -O 9_somatic_oncefiltered.vcf.gz

这将产生一个 VCF 调用集 9_somatic_oncefiltered.vcf.gz 和索引。可能是真阳性的调用会在 FILTER 字段中得到 PASS 标签,而可能是假阳性的调用则会在 VCF 的 FILTER 字段中标注过滤原因。我们可以使用 grep '##FILTER' 查看 VCF 头中可用的过滤器。This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.

这一步看似应用了 14 种过滤器,包括污染过滤器。但是,如果过滤器所依赖的注释不存在,工具就会跳过特定的过滤。过滤器仍会出现在页眉中。例如,重复证据(duplicate_evidence)过滤器需要一个非标准注释,而我们的调用集省略了这一注释。This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

到目前为止,我们已获得 3,695 次调用,其中 2,966 次被过滤,729 次作为可靠的体细胞调用通过。在被过滤的呼叫中,污染过滤了 8 个呼叫,所有这些呼叫都会因其他原因被过滤。对于有统计倾向的人来说,这可能是一个惊喜。不过,请记住,绝大多数污染变异都是常见的种系等位基因,对此我们有其他保障措施。So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.

5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias 允许根据序列上下文人工制品(如 OxoG 和 FFPE)进行过滤。这一步是可选的,如果使用,应始终在使用 FilterMutectCalls 过滤之后执行。该工具需要 Picard CollectSequencingArtifactMetrics 中的 pre_adapter_detail_metrics。FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

首先,使用 CollectSequencingArtifactMetrics 收集序列上下文人工痕迹的指标。该工具将其分为杂交选择前(preadapter)和杂交选择过程中(baitbias)出现的假象。分析结果提供了整个基因组的全局视图,有助于决策制定,这是特定位点分析无法提供的。这些指标有助于决定是否考虑下游过滤。First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
    -I tumor.bam \
    -O 10_tumor_artifact \
    –-FILE_EXTENSION ".txt" \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
    CollectSequencingArtifactMetrics \
    I=tumor.bam \
    O=10_tumor_artifact \
    FILE_EXTENSION=.txt \
    R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

这会生成五个度量文件,包括 pre_adapter_detail_metrics,其中包含 FilterByOrientationBias 使用的计数。以下是完整数据的 pre_adapter_summary_metrics 摘要。我们的样本并非来自 FFPE,因此不会出现这种人工痕迹。不过,我们似乎发现了一些 OxoG 逆转现象。This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

TOTAL_QSCORE 采用 Phred 标度,分数越低,人为变化的可能性越大。例如,40 分相当于万分之一的概率。对于 OxoG 而言,关注度的大致临界值为 30。筛选方向偏差(FilterByOrientationBias)使用质量分数作为上下文将产生伪变的先验。该工具还会权衡读取的证据。例如,如果 QSCORE 为 50,但等位基因在 F1R2 中有 15 个读数支持,而在 F2R1 中没有读数支持,那么该工具就应该过滤调用。

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.

FFPE 是福尔马林固定、石蜡包埋的缩写。甲醛会使胞嘧啶脱氨基,从而导致 C→T 转换突变。在文库制备过程中,鸟嘌呤氧化为 8-氧鸟嘌呤会导致 G→T 转换突变。另一个 Picard 工具 CollectOxoGMetrics 也同样给出了 16 个三碱基扩展序列上下文的 Phred 标度分数。在 GATK4 Mutect2 中,F1R2 和 F2R1 注释计算支持等位基因的成对方向读数。这与 GATK3 的 FOXOG(部分 OxoG)注释有所不同。

  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.

其次,使用 FilterByOrientationBias 进行方向偏差过滤。我们为该工具提供了一次过滤调用 9_somatic_oncefiltered.vcf.gz、pre_adapter_detail_metrics 文件以及 FFPE(C→T 转换)和 OxoG(G→T 转换)的测序上下文。该工具知道要包括反向互补上下文。Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
    -A G/T \
    -A C/T \
    -V 9_somatic_oncefiltered.vcf.gz \
    -P tumor_artifact.pre_adapter_detail_metrics.txt \
    -O 11_somatic_twicefiltered.vcf.gz

这将生成 VCF 11_somatic_twicefiltered.vcf.gz、索引和摘要 11_somatic_twicefiltered.vcf.gz.summary。在摘要中,我们可以看到序列上下文的调用次数和工具过滤的调用次数。This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

在 VCF 文件头中,我们可以看到增加了第 15 个过滤器,即 orientation_bias,该工具适用于 56 个调用。所有这 56 个调用之前都是 PASS 站点,即未经过滤的站点。在总共 3,695 次调用中,我们现在有 673 次调用通过。In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.

5.1 Tally of applied filters for the tutorial data

该表显示了应用于 11_somatic_twicefiltered.vcf.gz 的过滤器的细分情况。中间一栏统计了在所有调用中应用每种筛选器的情况,第三栏统计了筛选器是某个位点未通过的唯一原因的情况。在所有调用中,约 18% (673/3,695)是可信的体细胞调用。在经过过滤的呼叫中,约 56%(1,694/3,022)是单个过滤的。我们发现平均每个过滤呼叫(5,223/3,022)过滤约 1.73 次。The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallys the instances in which each filter was applied across the calls and the third column tallys the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).

使用以下命令检查通过的记录。请特别注意 AD 和 AF 注释值,因为它们显示了调用者的高度敏感性。

gzcat 11somatictwicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less

6. Set up in IGV to review somatic calls

要获得一个好的体细胞调用集,需要比较不同调用者或调用方法的调用集,人工审核通过的和过滤的调用集,必要时合并调用集并进行额外过滤。人工审核的范围从解读调用记录注释扩展到使用可视化工具审核读数排列的细节。Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

要手动查看调用结果,请使用功能丰富的桌面版整合基因组学查看器(IGV)。请记住,Mutect2 在重新组装的比对上进行调用,并不一定反映起始 BAM 的比对。因此,查看原始 BAM 并不足以了解调用结果。我们必须检查 Mutect2 图形组装生成的 bamout。To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

首先,在 IGV 中加载人类 (hg38) 作为参照。然后依次加载这六个文件:First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17af-only-gnomadgrch38.vcf.gz
  • 11somatictwicefiltered.vcf.gz
  • 2tumornormal_m2.bam
  • normal.bam
  • tumor.bam

除了体细胞调用集 11_somatic_twicefiltered.vcf.gz,数据覆盖的子集区域都在 chr17plus.interval_list 中。With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

其次,将 IGV 导航至 TP53 基因座(chr17:7,666,402-7,689,550)。Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.其中一条轨道占据了视图。右键单击轨道 chr17_af-only-gnomad_grch38.vcf.gz,折叠其视图。
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.点击拖动标尺,放大 11_somatic_twicefiltered.vcf.gz 中的体细胞调用,即外显子 3 中的灰色矩形。
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.将鼠标悬停在或点击轨道 11_somatic_twicefiltered.vcf.gz 中的灰色调用,可查看 INFO 级别的注释。同样,下方的蓝色调用提供了 HCC1143_tumor 样本级别的信息。
  • Scroll through the alignment data and notice the coverage for the samples.滚动比对数据,注意样本的覆盖率。

第三,调整 IGV 设置,以帮助可视化重新组装的排列。

  • 腾出空间,重点关注轨道 2_tumor_normal_m2.bam。在左侧面板上 Shift+ 选择 track tumor.bam、normal.bam 及其覆盖。右键单击并删除轨迹。Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bamnormal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.转到 "视图">"首选项">"对齐"。切换 "显示中心线",并切换 "关闭降样读数"。
  • Drag the alignments panel to center the red variant.拖动排列面板,使红色变体居中。
  • 右键单击排列轨道,然后
    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.滚动以记录组的数量。点击每组中的一个读数,以确定哪个组属于哪个样本。

下面是相应的 VCF 记录。请记住,Mutect2 不做倍性假设。GT 字段将从参考等位基因开始的每个等位基因的存在情况制成表格。Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.

POSIDREFALTQUALFILTERINFO
chr177,674,220.CT.PASSDP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
FORMATGT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

最后,这里最后,这里是我们有 bamout alignment 的 indel 调用。所有这 17 项都恰好经过了过滤。在 IGV 中探索其中的几个站点,练习手动审查的设置动作,并研究不同过滤器背后的逻辑。

\Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROMPOSREFALTFILTER
chr174,539,344TTAartifact_in_normal;germline_risk;panel_of_normals
chr177,221,420CACTGCCCTAGGTCAGGACartifact_in_normal;panel_of_normals;str_contraction
chr177,483,063AACmapping_quality;t_lod
chr178,513,688GTTGpanel_of_normals
chr1719,748,387GGAt_lod
chr1726,982,033GGCartifact_in_normal;clustered_events
chr1730,059,463CTCt_lod
chr1735,422,473CCAt_lod
chr1735,671,734CTTC,CT,CTTTartifact_in_normal;multiallelic;panel_of_normals
chr1743,104,057CACartifact_in_normal;germline_risk;panel_of_normals
chr1743,104,072AAAAAAAAAGAAAAGApanel_of_normals;t_lod
chr1746,332,538GGTartifact_in_normal;panel_of_normals
chr1747,157,394CAACpanel_of_normals;t_lod
chr1750,124,771GCACACACACACACACAGclustered_events;panel_of_normals;t_lod
chr1768,907,890GAGartifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr1769,182,632CCAartifact_in_normal;t_lod
chr1769,182,835GAAAAGpanel_of_normals

  • 21
    点赞
  • 20
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值