0. basic
mutation: a change in DNA
- somatic
- germline
Variants: any difference that exists between any DNA
- small variants
- SNV
- INDEL
- Large variants
- Structural variance (> 1,000 bps)
- Copy number variation
- Translocations
- Inversions
- Deletions/insertions
- Chromosomal abberation
- Structural variance (> 1,000 bps)
Polymorpism: variation that is common in a population (often AF > 1%)
Alleles: the variation in sequence
Genotype: estimating the allele counts in the sample
GATK胚系短变异最佳实践工作流:包括比对,去重,碱基质量分数校准,变异检测,合并多个vcf,过滤,注释等步骤。
1. read alignment
首先,对参考基因组序列建索引。
# build index
bwa index <reference.fa>
下游gatk
要求所有输入文件(fa/bam/vcf)和索引文件中的染色体名称相同。在开始比对之前,最好检查一下输入文件的染色体命名方式。因为更改 fa 中的染色体名称比更改 bam 中的染色体名称更容易。
然后,进行比对和比对后排序。
# reads alignment
bwa mem \
reference.fa \
data/demo_R1.fastq.gz \
data/demo_R2.fastq.gz \
> alignment/demo.sam
# sort and compression
samtools sort -o alignment/demo.sorted.sam alignment/demo.sam
samtools view -bh demo.sorted.sam > demo.bam
# alignment statistics
#samtools flagstat demo.sam
上述操作也可以通过管道用法提升IO性能。
bwa mem reference.fa data/demo_R1.fastq.gz data/demo_R2.fastq.gz | \
samtools sort | \
samtools view -bh > demo.bam
(可选)修改或添加read group字段。
# add read groups
gatk AddOrReplaceReadGroups \
--INPUT alignment/demo.bam \
--OUTPUT alignment/demo.rg.bam \
--RGLB lib1 \
--RGPU H0164.2.ALXX140820 \
--RGPL ILLUMINA \
--RGSM demo \
--RGID H0164.2
read group:测序仪单次运行产生的一组reads。如果单个sample制备的单个library在单个flowcell上的单一lane运行,该lane上运行的所有read都属于同一group。如果涉及multiplexing,则来自该lane上不同library运行的每个reads子集将构成不同的group。
在 (S|B|CR)AM 文件中,read group是通过官方规范中定义的一些标记来标识的。这些标记不仅可以区分样本,还可以区分与artifacts相关的各种技术特征,进而可以在重复标记和碱基校准步骤中减轻artifacts的影响。
GATK
要求输入文件中有以下几个必需read group字段:
PL:测序平台。本例为ILUMINA。
SM:样本名。来自同一个体的所有比对reads在该字段中应具有相同ID,本例为demo。
LB:文库ID。分子重复只存在于一个文库中。如果一个文库在多条lane上测序,追踪该信息非常重要。本例只测序了一个文库lib1。
ID:read group ID。通常是 flowcell 和 lane 的组合。
(非必需)PU:platform unit。该字段用于标识测序lane,格式为[FLOWCELL].[Lane].[SAMPLE BARCODE]。虽然非必须字段,但如果存在,则在碱基校准时优先于ID。
接着,标记重复。尽管有文献(Ebbert et al., 2016)报道称去除重复对变异分析影响不大。
# Mark duplicates
gatk MarkDuplicates \
--INPUT alignment/demo.rg.bam \
--OUTPUT alignment/demo.rg.md.bam \
--METRICS_FILE alignment/marked_dup_metrics_demo.txt
# Indexing
samtools index demo.rg.md.bam
最后,为比对文件建立索引,以供下游分析工具使用。
# Indexing
samtools index demo.rg.md.bam
扩展:
bwa manual
samtools manual
Read groups
Ebbert MTW et al. (2016) Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics.
2. variant calling
gatk
需要参考序列的两个索引文件,用下述命令分别创建。 此外,如果有输入的外部vcf文件(如dbSNP, 1kg),也需要建立相应的索引。
# reference sequence
samtools faidx <reference.fa> #.fai索引
gatk CreateSequenceDictionary --REFERENCE <reference.fa> #.dict索引
# vcf
#bcftools annotate --rename-chrs <tab-delimited-renaming> <input.vcf>
#gatk要求所有输入文件(fa/bam/vcf)和索引文件中的染色体名称相同
gatk IndexFeatureFile --input <variants.vcf>
然后,进行碱基质量分数校正(BQSR)。BQSR 根据系统误差评估碱基质量。它可以忽略已知变异的位点。BQSR 有助于识别测序错误的碱基,从而降低发现假阳性变异位置的概率。
gatk BaseRecalibrator \
--reference <reference.fa> \
--input alignment/demo.rg.md.bam \
--known-sites <variants1.vcf> \
--known-sites <variants2.vcf> \
--output bqsr/demo.recal.table
gatk ApplyBQSR \
--input demo.rg.md.bam \
--bqsr-recal-file bqsr/demo.recal.table \
--output bqsr/demo.recal.bam
使用HaplotypeCaller
子命令进行变异检测。
gatk HaplotypeCaller \
--reference <reference.fa> \
--input bqsr/demo.recal.bam \
--output variants/demo.HC.vcf \
--intervals chr20:10018000-10220000 \ # 仅检测指定区域
--emit-ref-confidence GVCF \ # 为了后续合并多样本,输出为gvcf,存储非变异位点,从而实现增量分析
--bam-output variants/demo.phased.bam # 为了后续可视化单倍型定相,输出定相bam
# get some statistics
#gatk VariantsToTable --variant variants/demo.HC.vcf --fields CHROM -F POS -F TYPE -GF GT --output variants/demo.HC.table
# 扩展阅读:如何使用R计算PL和GQ?
genotype likelihoods <- function(m,g,e,ref,alt){
(((m-g)*e+g*(1-e))^alt * ((m-g)*(1-e)+g*e)^ref)/(m^(ref+alt))
}
#m: ploidy
#g : number of alternative alleles
#e : base error probability
#ref : number of reference alleles counted
#alt : number of alternative alleles counted
-10*log10(genotype_likelihood(m=2, g=0, e=0.01, ref=22, alt=4)) # PL = 80.96026, when g=0
-10*log10(genotype_likelihood(m=2, g=1, e=0.01, ref=22, alt=4)) # PL = 78.2678 , when g=1
-10*log10(genotype_likelihood(m=2, g=2, e=0.01, ref=22, alt=4)) # PL = 440.1746, when g=2
#最可能的基因型PL值最小,所以 g=1 (heterozygous)
#GL值的算法是第二小PL值减去最小PL值,本例为 80.96 - 78.27 = 2.69
##本例基因分型质量低,phred尺度转换后错误率达0.54。
(可选)将多个样本的vcf合并到一个数据库。
# generate a GenomicsDB
gatk GenomicsDBImport \
--variant variants/demo.HC.g.vcf \
--variant variants/demo2.HC.g.vcf \
--intervals chr20:10018000-10220000 \
--genomicsdb-workspace-path genomicsdb
# retrieve the combined vcf from the database
gatk GenotypeGVCFs \
--reference <reference.fa> \
--variant gendb://genomicsdb \
--intervals chr20:10018000-10220000 \
--output variants/trio.vcf
3. variant filter
强烈建议在过滤SNP/INDEL时进行VQSR。然而,在数据可用性有限,如非模式生物数据、样本水平(FORMAT字段)过滤等无法应用VQSR的情况下,可以使用硬过滤。
# split SNPs and INDELs, due to different filtering thresholds
gatk SelectVariants \
--variant variants/demo.vcf \
--select-type SNP \
--output variants/demo.SNP.vcf
gatk SelectVariants \
--variant variants/demo.vcf \
--select-type INDEL \
--output variants/demo.INDEL.vcf
# filter SNPs: INFO field (per variant) & FORMAT field (per genotype)
gatk VariantFiltration \
--variant variants/demo.SNP.vcf \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--output variants/demo.SNP.filtered.vcf
# filter INDELs
gatk VariantFiltration \
--variant variants/demo.INDEL.vcf \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
--output variants/demo.INDEL.filtered.vcf
# merge filtered SNPs and INDELs
gatk MergeVcfs \
--INPUT variants/demo.SNP.filtered.vcf \
--INPUT variants/demo.INDEL.filtered.vcf \
--OUTPUT variants/demo.filtered.vcf
使用VQSR完成位点水平(INFO字段)的变异过滤。
# create sites-only VCF, to speed up the analysis in the modeling step
gatk MakeSitesOnlyVcf \
-I demo.vcf.gz \
-O demo_sitesonly.vcf.gz
# calculate VQSLOD tranches for indels
## population resource files gs://gcp-public-data--broad-references/hg38/v0
## The parameters are tuned for WGS. For WES, https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR?id=1259
gatk VariantRecalibrator \
-V demo_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
-an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \
-mode INDEL \
--max-gaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-resource:axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf \
-O demo_indels.recal \
--tranches-file demo_indels.tranches
# calculate VQSLOD tranches for SNPs
gatk VariantRecalibrator \
-V demo_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-mode SNP \
--max-gaussians 6 \
-resource:hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf \
-O demo_snps.recal \
--tranches-file demo_snps.tranches
# filter indels on VQSLOD
gatk ApplyVQSR \
-V demo_sitesonly.vcf.gz \
--recal-file demo_indels.recal \
--tranches-file demo_indels.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode INDEL \
-O indel.recalibrated.vcf.gz
# filter SNPs on VQSLOD
gatk ApplyVQSR \
-V indel.recalibrated.vcf.gz \
--recal-file demo_snps.recal \
--tranches-file demo_snps.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP \
-O snp.recalibrated.vcf.gz \
参考:
Filter variants either with VQSR or by hard-filtering
4. variant annotation
常见的注释软件包括VEP
, annovar
, snpEff
等。