变异检测准确性评估软件hap.py使用

变异检测准确性评估软件hap.py使用

1. hap.py简介

Hap.py 是illumina官方开发的在单倍型水平上比较二倍体基因型的工具。可用于针对金标准变异数据集(例如NA12878样本)对检测的变异结果进行基准测试,以判断检测结果的准确性, 也可以用于比较和评估两个不同变异检测软件检测结果的差异。

2. hap.py 的使用

安装就不多说了,怕麻烦的同学建议直接使用docker镜像,省时省力。docker pull pkrusche/hap.py
下面直接上基本使用命令行吧。

hap.py 
	hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
    hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
    -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
    -r hap.py/example/happy/hg38.chr21.fa \
    --threads 5 \
    -o gatk-all

其中,

  1. -r hap.py/example/happy/hg38.chr21.fa 用于指定使用的参考序列。 也可以通过设置环境变量HGREF来指定export HGREF=hap.py/example/happy/hg38.chr21.fa , 其中hap.py/example/happy/hg38.chr21.fa 需要换成你实际使用的fasta序列
  2. hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gzhap.py/example/happy/NA12878-GATK3-chr21.vcf.gz 为待比较的两个VCF文件,替换为你实际VCF文件即可。
  3. -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz 指定需要比较的区间,即只有该bed区间内的位点用于比较。
  4. --threads 5 指定线程
  5. -o out_prefix 指定输出文件前缀

3. 结果说明

结果包含以下几个文件:

Output FileContents
gatk-all.summary.csvSummary statistics
gatk-all.extended.csvExtended statistics
gatk-all.roc.all.csvAll precision / recall data points that were calculated
gatk-all.vcf.gzAnnotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl
gatk-all.vcf.gz.tbiVCF Tabix Index
gatk-all.metrics.jsonJSON file containing all computed metrics and tables.
gatk-all.roc.Locations.INDEL.csvROC for ALL indels only.
gatk-all.roc.Locations.SNP.csvROC for ALL SNPs only.
gatk-all.roc.Locations.INDEL.PASS.csvROC for PASSing indels only.
gatk-all.roc.Locations.SNP.PASS.csvROC for PASSing SNPs only.

其中两个比较重要的文件为:gatk-all.summary.csvgatk-all.vcf.gz
1)gatk-all.summary.csv包含直接的统计结果,包含变异位点数,精确度,召回率,F1值等统计信息,可以直观的看到两者之间的差异。gatk-all.extended.csv 是所有的统计信息,包含了gatk-all.summary.csv里面的信息,以及其他拓展信息。相关字段说明如下(就不中文解释了,有疑问可以一起讨论):

Stratification ColumnDescription
TypeVariant type (SNP / INDEL)
SubtypeVariant Subtype (ti/tv/indel length, see above
SubsetSubset of the genome/stratification region
FilterVariant filters: PASS, SEL, ALL, or a particular filter from the query VCF
GenotypeGenotype of benchmarked variants (het / homalt / hetalt)
QQ.FieldWhich field from the original VCF was used to produce QQ values in truth and query
QQQQ threshold for ROC values
Metric ColumnDescription
METRIC.RecallRecall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
METRIC.PrecisionPrecision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
METRIC.Frac_NAFraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL
METRIC.F1_ScoreHarmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision)
TRUTH.TOTALTotal number of truth variants
TRUTH.TPNumber of true-positive calls in truth representation (counted via the truth sample column)
TRUTH.FNNumber of false-negative calls = calls in truth without matching query call
QUERY.TOTALTotal number of query calls
QUERY.TPNumber of true positive calls in query representation (counted via the query sample column)
QUERY.FPNumber of false-positive calls in the query file (mismatched query calls within the confident regions)
QUERY.UNKNumber of query calls outside the confident regions
FP.gtNumber of genotype mismatches (alleles match, but different zygosity)
FP.alNumber of allele mismatches (variants matched by position and not by haplotype)
TRUTH.TOTAL.TiTv_ratioTransition / Transversion ratio for all truth variants
TRUTH.TOTAL.het_hom_ratioHet/Hom ratio for all truth variants
TRUTH.FN.TiTv_ratioTransition / Transversion ratio for false-negative variants
TRUTH.FN.het_hom_ratioHet/Hom ratio for false-negative variants
TRUTH.TP.TiTv_ratioTransition / Transversion ratio for true positive variants
TRUTH.TP.het_hom_ratioHet/Hom ratio for true positive variants
QUERY.FP.TiTv_ratioTransition / Transversion ratio for false positive variants
QUERY.FP.het_hom_ratioHet/Hom ratio for false-positive variants
QUERY.TOTAL.TiTv_ratioTransition / Transversion ratio for all query variants
QUERY.TOTAL.het_hom_ratioHet/Hom ratio for all query variants
QUERY.TP.TiTv_ratioTransition / Transversion ratio for true positive variants (query representation)
QUERY.TP.het_hom_ratioHet/Hom ratio for true positive variants (query representation)
QUERY.UNK.TiTv_ratioTransition / Transversion ratio for unknown variants
QUERY.UNK.het_hom_ratioHet/Hom ratio for unknown variants
Subset.SizeWhen using stratification regions, this gives the number of nucleotides contained in the current subset
Subset.IS_CONF.SizeThis gives the number of confident bases (-f regions) in the current subset

2)gatk-all.vcf.gz 是VCF格式的文件,文件最后两列(TRUTH和QUERY列)包含两个比较VCF样本的位点对比的情况。
在这里插入图片描述
其中涉及GT:BD:BK:BI:BVT:BLT:QQ等多个字段

fieldDescription
GTGenotype
BDDecision for call (TP/FP/FN/N)
BKSub-type for decision (match/mismatch type)
BIAdditional comparison information
QQVariant quality for ROC creation
BVTHigh-level variant type (SNP/INDEL)
BLTHigh-level location type (het/homref/hetalt/homalt/nocall)

hap.py的基本用法就酱了,喜欢可以点赞加收藏留着吃灰去吧。

参考:
https://github.com/Illumina/hap.py
https://github.com/Illumina/hap.py/blob/master/doc/happy.md

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 5
    评论
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值