1: Reference:
Li H. Toward better understanding of artifacts in variant calling from high-coverage samples[J]. Bioinformatics, 2014: btu356.
2: GATK offers two tools for calling SNPs: UnifiedGenotyper and HaplotypeCaller. HaplotypeCaller can now essentially replace UnifiedGenotyper. The reasons, quoted below:
The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.
As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions. (Quoted from an official GATK reply.)
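3: Recommended parameters for filtering VCF files: https://software.broadinstitute.org/gatk/guide/article?id=3225. These hard-filter thresholds are mainly for cases where VQSR cannot be used. A variant is filtered out when it meets any of the following conditions: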
For SNPs:
QD < 2.0
MQ < 40.0
FS > 60.0
SOR > 3.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
If your callset was generated with UnifiedGenotyper for legacy reasons, you can add HaplotypeScore > 13.0.
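To make the direction of each SNP threshold concrete, here is a minimal Python sketch that checks a record's annotations against the list above. It is not how GATK evaluates filters internally. The dict-based input format and the choice to skip missing annotations are assumptions made for illustration.

```python
import operator

# SNP hard-filter thresholds from the list above: (annotation, comparison, threshold).
# A record FAILS a filter when the comparison evaluates to True.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("SOR", operator.gt, 3.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(info, filters=SNP_FILTERS):
    """Return the names of the annotations whose threshold the record violates.

    `info` is a plain dict of INFO annotations (hypothetical input format).
    Missing annotations are skipped, which is an assumption of this sketch.
    """
    failed = []
    for name, compare, threshold in filters:
        value = info.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


# Made-up annotation values, for illustration only.
good = {"QD": 25.1, "MQ": 60.0, "FS": 1.2, "SOR": 0.7,
        "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
poor = {"QD": 1.5, "MQ": 35.0, "FS": 75.0, "SOR": 0.7}

print(failed_filters(good))  # []
print(failed_filters(poor))  # ['QD', 'MQ', 'FS']
```

A record with an empty result passes; any non-empty result means the SNP would be filtered out.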
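I also add these two parameters: --clusterWindowSize 5 --clusterSize 2. When SNPs cluster densely in one place, the underlying event may actually be a deletion or an insertion.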
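The idea behind the two cluster parameters can be sketched in a few lines of Python. This is an illustration of the concept only: the exact boundary rule used here (a run of SNPs counts as a cluster when it fits inside the window, both ends included) is an assumption and may differ from GATK's own implementation.

```python
def flag_snp_clusters(positions, window=5, size=2):
    """Return the SNP positions (all on one chromosome) that belong to a cluster.

    window and size mirror --clusterWindowSize and --clusterSize.
    """
    ordered = sorted(positions)
    flagged = set()
    # Slide over every run of `size` consecutive SNPs.
    for i in range(len(ordered) - size + 1):
        run = ordered[i:i + size]
        # Assumption: a run is a cluster if it fits inside `window` bases.
        if run[-1] - run[0] + 1 <= window:
            flagged.update(run)
    return flagged


# Made-up SNP coordinates, for illustration only.
snps = [100, 103, 200, 500, 504, 505]
print(sorted(flag_snp_clusters(snps)))  # [100, 103, 500, 504, 505]
```

The isolated SNP at position 200 is kept, while the two dense groups are flagged as possible indel artifacts.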
For indels:
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
SOR > 10.0
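Thresholds like these are usually passed to GATK's VariantFiltration as one JEXL expression joined with `||`. The sketch below builds that string from the indel list above. The `--filterExpression` and `--filterName` flags follow GATK 3 naming; the filter name `indel_hard_filter` is a hypothetical label chosen for this example.

```python
# Indel hard-filter thresholds from the list above: (annotation, operator, threshold).
INDEL_THRESHOLDS = [
    ("QD", "<", 2.0),
    ("ReadPosRankSum", "<", -20.0),
    ("InbreedingCoeff", "<", -0.8),
    ("FS", ">", 200.0),
    ("SOR", ">", 10.0),
]


def jexl_expression(thresholds):
    """Join the thresholds into a single 'fail if any is true' JEXL expression."""
    return " || ".join(f"{name} {op} {value}" for name, op, value in thresholds)


expr = jexl_expression(INDEL_THRESHOLDS)

# Argument fragment for VariantFiltration; the filter name is a hypothetical label.
filter_args = ["--filterExpression", expr, "--filterName", "indel_hard_filter"]

print(expr)
# QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0 || SOR > 10.0
```

Keeping the thresholds in a list makes it easy to drop an annotation, for example InbreedingCoeff, when the callset does not carry it.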
4: In the reference paper, the parameters used for calling SNPs with GATK were: -stand_call_conf 30 -stand_emit_conf 10. The stand_emit_conf parameter no longer exists in GATK v3.7, the version I use.
I also recommend adding:
-minPruning Minimum support to not prune paths in the graph
-mbq Minimum base quality required to consider a base for calling
-nct Number of CPU threads to allocate per data thread
These are the parameters I use: "-stand_call_conf 30 -mbq 20 --minPruning 2 -nct 10"
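For reference, here is one way these parameters could be assembled into a full GATK 3 HaplotypeCaller command from Python. The jar path and the file names (`ref.fasta`, `sample.bam`, `sample.raw.vcf`) are placeholders, not values from this post. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def haplotypecaller_command(reference, bam, out_vcf, threads=10):
    """Build a GATK 3 HaplotypeCaller argument list with the parameters above."""
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",  # placeholder jar path
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-o", out_vcf,
        "-stand_call_conf", "30",
        "-mbq", "20",
        "--minPruning", "2",
        "-nct", str(threads),
    ]


# Placeholder file names, for illustration only.
cmd = haplotypecaller_command("ref.fasta", "sample.bam", "sample.raw.vcf")
print(shlex.join(cmd))  # shlex.join needs Python 3.8 or newer
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; building a list rather than a single string avoids shell-quoting problems with unusual file names.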
5: In addition, indels are generally taken to mean insertions and deletions of up to about 50 bp. For a reference, see:
Tattini L, D’Aurizio R, Magi A. Detection of genomic structural variants from next-generation sequencing data[J]. Frontiers in Bioengineering and Biotechnology, 2015, 3: 92.
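The 50 bp convention can be turned into a small classifier based on the length difference between the REF and ALT alleles. This is a rough sketch: whether exactly 50 bp counts as an indel is an assumption, and symbolic alleles such as `<DEL>` are not handled.

```python
INDEL_MAX_BP = 50  # boundary from the text; treating exactly 50 bp as an indel is an assumption


def classify_variant(ref, alt, indel_max=INDEL_MAX_BP):
    """Classify a simple REF/ALT pair by the length difference of the two alleles."""
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length: a single-base or multi-base substitution.
        return "SNP" if len(ref) == 1 else "MNP"
    kind = "insertion" if diff > 0 else "deletion"
    size_class = "indel" if abs(diff) <= indel_max else "structural variant"
    return f"{size_class} ({kind}, {abs(diff)} bp)"


print(classify_variant("A", "G"))        # SNP
print(classify_variant("A", "ATTG"))     # indel (insertion, 3 bp)
print(classify_variant("A" * 121, "A"))  # structural variant (deletion, 120 bp)
```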