生物信息数据格式：vcf格式

最新推荐文章于 2022-06-27 21:00:42 发布

sunchengquan

最新推荐文章于 2022-06-27 21:00:42 发布

阅读量4.4k

点赞数 1

分类专栏： bioinformation

本文链接：https://blog.csdn.net/sunchengquan/article/details/86677629

版权

bioinformation 专栏收录该内容

21 篇文章

订阅专栏

文章目录

格式说明
实操

格式说明

VCF格式，Variant Call Format 变异判读文件格式

分为两部分内容：以“#”开头的注释部分；没有“#”开头的主体部分

先讲VCF文件主题部分的结构

CHROM ：表示变异位点是在哪个contig里call出来的，如果是人类全基因组的话那就是chr1…chr22，chrX,Y,M了
POS：变异位点相对于参考基因组所在的位置，如果是indel，就是第一个碱基所在的位置
ID：如果call出来的SNP存在于dbsnp数据库里，就会显示相应的dbsnp里的rs编号, 若没有，则用’.'表示其为一个novel variant。
REF和ALT：在这个变异位点处，参考基因组中所对应的碱基和研究对象基因组中所对应的碱基
QUAL：Phred格式(Phred_scaled)的质量值，表示在该位点存在variant的可能性；该值越高，则variant的可能性越大；计算方法：Phred值 = -10 * l/g (1-p) p为variant存在的概率; 通过计算公式可以看出值为10的表示错误概率为0.1，该位点为variant的概率为90%。
FILTER：使用上一个QUAL值来进行过滤的话，是不够的。GATK能使用其它的方法来进行过滤，过滤结果中通过则该值为”PASS”;若variant不可靠，则该项不为”PASS”或”.”。
FORMAT 和 testxxx：这两行合起来提供了’testxxx’ 这个sample的基因型的信息。 testxxx 代表这该名称的样品，是由BAM文件中的@RG下的 SM 标签决定的。
- GT：样品的基因型（genotype）。两个数字中间用’/'分开，这两个数字表示双倍体的sample的基因型。0 表示样品中有ref的allele； 1 表示样品中variant的allele； 2表示有第二个variant的allele。因此： 0/0 表示sample中该位点为纯合的，和ref一致； 0/1 表示sample中该位点为杂合的，有ref和variant两个基因型； 1/1 表示sample中该位点为纯合的，和variant一致。
- AD：对应两个以逗号隔开的值，这两个值分别表示覆盖到REF和ALT碱基的reads数，相当于支持REF和支持ALT的测序深度。
- DP：覆盖到这个位点的总的reads数量，相当于这个位点的深度（并不是多有的reads数量，而是大概一定质量值要求的reads数）。
- PL：指定的三种基因型的质量值(provieds the likelihoods of the given genotypes)。这三种指定的基因型为(0/0,0/1,1/1)，这三种基因型的概率总和为1。和之前不一致，该值越大，表明为该种基因型的可能性越小。 Phred值 = -10 * lg § p为基因型存在的概率=10^（-Phred值/10）。
  INFO
- AC：表示该Allele的数目，Allele数目为1表示双倍体的样本在该位点只有1个等位基因发生了突变
- AF：表示Allele的频率，Allele频率为0.5表示双倍体的样本在该位点只有50%的等位基因发生了突变
- AN：表示Allele的总数目,即：对于1个diploid sample而言：则基因型 0/1 表示sample为杂合子，Allele数为1(双倍体的sample在该位点只有1个等位基因发生了突变)，Allele的频率为0.5(双倍体的 sample在该位点只有50%的等位基因发生了突变)，总的Allele为2；基因型 1/1 则表示sample为纯合的，Allele数为2，Allele的频率为1，总的Allele为2。
- DP：样本在这个位置的reads覆盖度，是一些reads被过滤掉后的覆盖度（跟上面提到的DP类似）
- FS：使用Fisher’s精确检验来检测strand bias而得到的Fhred格式的p值，值越小越好
- MQ：表示覆盖序列质量的均方值RMS Mapping Quality

实操

查看头部注释信息

%%bash
grep '^#' ./data/genotype.vcf|wc -l

%%bash
grep '^#' ./data/genotype.vcf

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=QD,Description="QD < 2.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=A1,length=34083085>
##contig=<ID=A2,length=34414252>
##contig=<ID=A3,length=28939167>
##contig=<ID=A4,length=24315960>
##contig=<ID=A5,length=33714806>
##contig=<ID=A6,length=27018480>
##contig=<ID=A7,length=31477646>
##contig=<ID=A8,length=26149438>
##contig=<ID=A9,length=34986854>
##contig=<ID=A10,length=28419553>
##contig=<ID=A11,length=27106780>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	test102	test103	test105	test110	test111	test112	test116	test117	test118	test119	test121	test122	test123	test125	test127	test128	test129	test13	test133	test134	test138	test14	test140	test146	test147	test149	test151	test154	test155	test160	test162	test163	test164	test166	test17	test171	test173	test174	test176	test178	test179	test18	test181	test183	test184	test185	test186	test187	test188	test189	test19	test190	test195	test197	test198	test199	test201	test202	test205	test206	test208	test212	test213	test214	test215	test216	test217	test218	test219	test22	test220	test222	test23	test26	test3	test30	test31	test32	test33	test36	test37	test38	test39	test40	test41	test45	test46	test47	test48	test49	test51	test52	test56	test58	test59	test6	test62	test64	test65	test67	test68	test69	test70	test72	test73	test76	test77	test79	test80	test81	test84	test86	test87	test88	test92	test93	test94	test95	test96	test99

查看样本信息

%%bash
grep '#CHROM' ./data/genotype.vcf|cut -f 10-

test102	test103	test105	test110	test111	test112	test116	test117	test118	test119	test121	test122	test123	test125	test127	test128	test129	test13	test133	test134	test138	test14	test140	test146	test147	test149	test151	test154	test155	test160	test162	test163	test164	test166	test17	test171	test173	test174	test176	test178	test179	test18	test181	test183	test184	test185	test186	test187	test188	test189	test19	test190	test195	test197	test198	test199	test201	test202	test205	test206	test208	test212	test213	test214	test215	test216	test217	test218	test219	test22	test220	test222	test23	test26	test3	test30	test31	test32	test33	test36	test37	test38	test39	test40	test41	test45	test46	test47	test48	test49	test51	test52	test56	test58	test59	test6	test62	test64	test65	test67	test68	test69	test70	test72	test73	test76	test77	test79	test80	test81	test84	test86	test87	test88	test92	test93	test94	test95	test96	test99

%%bash
grep '#CHROM' ./data/genotype.vcf|cut -f 10- |tr '\t' '\n'|wc -l

查看主体信息

%%bash
grep -v ‘^#’ ./data/genotype.vcf |wc -l
grep -v ‘^#’ ./data/genotype.vcf |head -1|cut -f 1-10

过滤质量值大于80小于20000的标记

%%bash
grep -v '^#' ./data/genotype.vcf |awk  '{if(80 < $6 && $6< 20000) print $6}'

只保留SNP

%%bash
grep -v '^#' ./data/genotype.vcf |awk '{if (length($4) == 1 && length($5) == 1) print $1,$2,$3,$4,$5}'|head

A1 5418 . C A
A1 42134 . A G
A1 90833 . C A
A1 113451 . T A
A1 249739 . T A
A1 547087 . C T
A1 547089 . C T
A1 547591 . G T
A1 868930 . A G
A1 872821 . C T