Using hap.py, a tool for evaluating variant-calling accuracy

心如止水-WTF

Article link: https://blog.csdn.net/qq_28723681/article/details/127836407

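This article introduces Illumina's hap.py tool, which evaluates and compares the accuracy of genotype calls, and walks through an example of benchmarking VCF files with it. It covers installation, command-line parameters, and how to read the main output files.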

1. Introduction to hap.py

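hap.py is a tool developed by Illumina for comparing diploid genotypes at the haplotype level. It can benchmark a set of variant calls against a gold-standard truth set (for example, the NA12878 sample) to assess how accurate the calls are, and it can also be used to compare the results of two different variant callers.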

2. Using hap.py

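I won't go into installation in detail. If you want to avoid the hassle, I recommend using the Docker image directly, which saves time and effort: docker pull pkrusche/hap.py
Here is the basic command line.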

hap.py \
    hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
    hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
    -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
    -r hap.py/example/happy/hg38.chr21.fa \
    --threads 5 \
    -o gatk-all

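Where:

-r hap.py/example/happy/hg38.chr21.fa specifies the reference sequence. You can also set it through the HGREF environment variable: export HGREF=hap.py/example/happy/hg38.chr21.fa. Replace hap.py/example/happy/hg38.chr21.fa with the FASTA file you actually use.
hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz and hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz are the two VCF files to compare: the first is the truth set and the second is the query (the calls being evaluated). Replace them with your own VCF files.
-f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz specifies the confident regions to compare. Only sites inside these BED intervals are assessed; query calls outside them are counted as unknown (QUERY.UNK).
--threads 5 sets the number of threads.
-o gatk-all sets the output file prefix.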
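
When the comparison has to be run for many samples, it can help to assemble the command programmatically. The following is a minimal Python sketch that only builds and prints the argument list for the example invocation above; the helper name build_happy_command is made up for illustration, and it uses only the options already shown in this article (-f, -r, --threads, -o).

```python
import shlex


def build_happy_command(truth_vcf, query_vcf, out_prefix,
                        confident_bed=None, reference=None, threads=None):
    """Return the hap.py argument list for a truth-vs-query comparison.

    Hypothetical helper: it only assembles the options described in the
    article and does not run anything.
    """
    cmd = ["hap.py", truth_vcf, query_vcf]
    if confident_bed:
        cmd += ["-f", confident_bed]      # confident regions (BED)
    if reference:
        cmd += ["-r", reference]          # reference FASTA (or set HGREF)
    if threads:
        cmd += ["--threads", str(threads)]
    cmd += ["-o", out_prefix]             # prefix for all output files
    return cmd


example_dir = "hap.py/example/happy"
cmd = build_happy_command(
    truth_vcf=f"{example_dir}/PG_NA12878_hg38-chr21.vcf.gz",
    query_vcf=f"{example_dir}/NA12878-GATK3-chr21.vcf.gz",
    out_prefix="gatk-all",
    confident_bed=f"{example_dir}/PG_Conf_hg38-chr21.bed.gz",
    reference=f"{example_dir}/hg38.chr21.fa",
    threads=5,
)

# Print a copy-pasteable shell command (shlex.join needs Python 3.8+).
print(shlex.join(cmd))
```

To actually execute the command, the list can be passed to subprocess.run(cmd, check=True), provided hap.py is available on the PATH.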

3. Understanding the results

The output consists of the following files:

Output File	Contents
gatk-all.summary.csv	Summary statistics
gatk-all.extended.csv	Extended statistics
gatk-all.roc.all.csv	All precision / recall data points that were calculated
gatk-all.vcf.gz	Annotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl
gatk-all.vcf.gz.tbi	VCF Tabix Index
gatk-all.metrics.json	JSON file containing all computed metrics and tables.
gatk-all.roc.Locations.INDEL.csv	ROC for ALL indels only.
gatk-all.roc.Locations.SNP.csv	ROC for ALL SNPs only.
gatk-all.roc.Locations.INDEL.PASS.csv	ROC for PASSing indels only.
gatk-all.roc.Locations.SNP.PASS.csv	ROC for PASSing SNPs only.
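
After a run, a quick check that every file in the table exists can catch a failed or truncated job. The sketch below derives the expected file names from the output prefix. The suffix list is copied from the table above, which is an assumption: the exact set of files may differ between hap.py versions (some versions may compress certain outputs), so adjust it to match what your installation writes.

```python
from pathlib import Path

# Suffixes taken from the output table above (assumed, version-dependent).
OUTPUT_SUFFIXES = [
    ".summary.csv",
    ".extended.csv",
    ".roc.all.csv",
    ".vcf.gz",
    ".vcf.gz.tbi",
    ".metrics.json",
    ".roc.Locations.INDEL.csv",
    ".roc.Locations.SNP.csv",
    ".roc.Locations.INDEL.PASS.csv",
    ".roc.Locations.SNP.PASS.csv",
]


def expected_outputs(prefix):
    """List the output paths expected for a given -o prefix."""
    return [Path(prefix + suffix) for suffix in OUTPUT_SUFFIXES]


def missing_outputs(prefix):
    """Return the expected output paths that do not exist on disk."""
    return [path for path in expected_outputs(prefix) if not path.exists()]


for path in missing_outputs("gatk-all"):
    print("missing:", path)
```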

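The two most important files are gatk-all.summary.csv and gatk-all.vcf.gz.
1) gatk-all.summary.csv contains the headline statistics, including variant counts, precision, recall, and F1 score, so you can see the differences between the two call sets at a glance. gatk-all.extended.csv contains the full set of statistics: everything in gatk-all.summary.csv plus additional, more detailed information. The relevant fields are described below (the original English descriptions are kept as-is; feel free to discuss any questions):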

Stratification Column	Description
Type	Variant type (SNP / INDEL)
Subtype	Variant subtype (ti/tv/indel length)
Subset	Subset of the genome/stratification region
Filter	Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF
Genotype	Genotype of benchmarked variants (het / homalt / hetalt)
QQ.Field	Which field from the original VCF was used to produce QQ values in truth and query
QQ	QQ threshold for ROC values

Metric Column	Description
METRIC.Recall	Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
METRIC.Precision	Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
METRIC.Frac_NA	Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL
METRIC.F1_Score	Harmonic mean of precision and recall = 2 * METRIC.Recall * METRIC.Precision / (METRIC.Recall + METRIC.Precision)
TRUTH.TOTAL	Total number of truth variants
TRUTH.TP	Number of true-positive calls in truth representation (counted via the truth sample column)
TRUTH.FN	Number of false-negative calls = calls in truth without matching query call
QUERY.TOTAL	Total number of query calls
QUERY.TP	Number of true positive calls in query representation (counted via the query sample column)
QUERY.FP	Number of false-positive calls in the query file (mismatched query calls within the confident regions)
QUERY.UNK	Number of query calls outside the confident regions
FP.gt	Number of genotype mismatches (alleles match, but different zygosity)
FP.al	Number of allele mismatches (variants matched by position and not by haplotype)
TRUTH.TOTAL.TiTv_ratio	Transition / Transversion ratio for all truth variants
TRUTH.TOTAL.het_hom_ratio	Het/Hom ratio for all truth variants
TRUTH.FN.TiTv_ratio	Transition / Transversion ratio for false-negative variants
TRUTH.FN.het_hom_ratio	Het/Hom ratio for false-negative variants
TRUTH.TP.TiTv_ratio	Transition / Transversion ratio for true positive variants
TRUTH.TP.het_hom_ratio	Het/Hom ratio for true positive variants
QUERY.FP.TiTv_ratio	Transition / Transversion ratio for false positive variants
QUERY.FP.het_hom_ratio	Het/Hom ratio for false-positive variants
QUERY.TOTAL.TiTv_ratio	Transition / Transversion ratio for all query variants
QUERY.TOTAL.het_hom_ratio	Het/Hom ratio for all query variants
QUERY.TP.TiTv_ratio	Transition / Transversion ratio for true positive variants (query representation)
QUERY.TP.het_hom_ratio	Het/Hom ratio for true positive variants (query representation)
QUERY.UNK.TiTv_ratio	Transition / Transversion ratio for unknown variants
QUERY.UNK.het_hom_ratio	Het/Hom ratio for unknown variants
Subset.Size	When using stratification regions, this gives the number of nucleotides contained in the current subset
Subset.IS_CONF.Size	This gives the number of confident bases (-f regions) in the current subset
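
To make the metric definitions in the table concrete, here is a small Python sketch that recomputes recall, precision, Frac_NA, and F1 from the count columns of one row. The column names are the ones listed in the table; the CSV content is made-up illustration data, not real hap.py output, and the helper names are hypothetical.

```python
import csv
import io


def safe_div(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else float("nan")


def recompute_metrics(row):
    """Recompute the METRIC.* values from one row's count columns."""
    truth_tp = float(row["TRUTH.TP"])
    truth_fn = float(row["TRUTH.FN"])
    query_tp = float(row["QUERY.TP"])
    query_fp = float(row["QUERY.FP"])
    query_unk = float(row["QUERY.UNK"])
    query_total = float(row["QUERY.TOTAL"])

    recall = safe_div(truth_tp, truth_tp + truth_fn)
    precision = safe_div(query_tp, query_tp + query_fp)
    return {
        "METRIC.Recall": recall,
        "METRIC.Precision": precision,
        "METRIC.Frac_NA": safe_div(query_unk, query_total),
        "METRIC.F1_Score": safe_div(2 * recall * precision, recall + precision),
    }


# Made-up counts for illustration only; not real hap.py output.
SAMPLE_CSV = """Type,Filter,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.TP,QUERY.FP,QUERY.UNK
SNP,PASS,90,10,120,90,10,20
INDEL,PASS,40,10,60,40,5,15
"""

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    metrics = recompute_metrics(row)
    print(row["Type"], {name: round(value, 4) for name, value in metrics.items()})
```

For a real run, replacing io.StringIO(SAMPLE_CSV) with an open handle on gatk-all.extended.csv should work as long as the file contains the count columns used here; that is worth checking against your own output first.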

2) gatk-all.vcf.gz is a VCF file whose last two columns (TRUTH and QUERY) record the site-by-site comparison between the two VCF samples.
These columns use several FORMAT fields, including GT:BD:BK:BI:BVT:BLT:QQ:

Field	Description
GT	Genotype
BD	Decision for call (TP/FP/FN/N)
BK	Sub-type for decision (match/mismatch type)
BI	Additional comparison information
QQ	Variant quality for ROC creation
BVT	High-level variant type (SNP/INDEL)
BLT	High-level location type (het/homref/hetalt/homalt/nocall)
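
To inspect individual sites, the TRUTH and QUERY columns can be split using the FORMAT string. The sketch below parses a single record line in plain Python, assuming (as described above) that TRUTH and QUERY are the last two columns and FORMAT is the column just before them. The record itself is a made-up example, and the function name is hypothetical.

```python
def parse_happy_record(line):
    """Split one annotated VCF record into (truth, query) field dictionaries.

    Assumes the last two columns are TRUTH and QUERY and the column before
    them is FORMAT, as described in the article.
    """
    columns = line.rstrip("\n").split("\t")
    keys = columns[-3].split(":")
    truth = dict(zip(keys, columns[-2].split(":")))
    query = dict(zip(keys, columns[-1].split(":")))
    return truth, query


# Made-up record for illustration only; not real hap.py output.
record = "\t".join([
    "chr21", "100", ".", "A", "G", ".", ".", ".",
    "GT:BD:BK:BI:BVT:BLT:QQ",
    "0/1:TP:gm:ti:SNP:het:50",
    "0/1:TP:gm:ti:SNP:het:50",
])

truth, query = parse_happy_record(record)
print("truth decision:", truth["BD"], "| query decision:", query["BD"])
```

For a real file, each non-header line read from gatk-all.vcf.gz with gzip.open(path, "rt") can be passed to the same function; a dedicated VCF library is a better choice for anything beyond quick inspection.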