What is VCF?

The material below is drawn from Wikipedia and the following blog posts:

https://en.wikipedia.org/wiki/Variant_Call_Format

https://blog.csdn.net/genome_denovo/article/details/78697679

https://blog.csdn.net/g863402758/article/details/53068366


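I don't know how you felt after reading those, but I was still confused. I think that to really understand VCF you also need to know how a VCF file is produced. Don't you agree? I certainly do.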

I am new to this area of data processing and do not fully understand the format yet. If you also have questions about VCF or run into errors with it, you can reach me on QQ at 1130346295, and we can work toward understanding the format better together.

"The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome."

"The standard is currently in version 4.3, although the 1000 Genomes Project has developed their own specification for structural variations such as duplications, which are not easily accommodated into the existing schema. A set of tools is also available for editing and manipulating the files."

Example 1 (chromosome 1 of Denisova)




The VCF header

"The header begins the file and provides metadata describing the body of the file. Header lines are denoted as starting with #. Special keywords in the header are denoted with ##. Recommended keywords include fileformat, fileDate and reference.

The header contains keywords that optionally semantically and syntactically describe the fields used in the body of the file, notably INFO, FILTER, and FORMAT (see below)."
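To make the header rules concrete, here is a minimal Python sketch that separates the ## meta-information lines from the single #CHROM column line and from the data lines. The VCF text, its values (file date, reference path, QUAL) and the function name split_header are made up for illustration; for real files a dedicated VCF library is usually a better choice.

```python
# Hypothetical, hand-written VCF text used only for illustration.
VCF_TEXT = """##fileformat=VCFv4.3
##fileDate=20090805
##reference=file:///seq/references/example.fasta
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t899282\trs28548431\tC\tT\t50\tPASS\tDP=4
"""


def split_header(text):
    """Return (meta, columns, records) for a block of VCF text."""
    meta = {}      # key -> list of raw values taken from "##key=value" lines
    columns = []   # column names taken from the "#CHROM ..." line
    records = []   # raw data lines, still unparsed
    for line in text.splitlines():
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            meta.setdefault(key, []).append(value)
        elif line.startswith("#"):
            columns = line[1:].split("\t")
        elif line:
            records.append(line)
    return meta, columns, records


meta, columns, records = split_header(VCF_TEXT)
print(meta["fileformat"])
print(columns)
print(len(records), "data line(s)")
```

Keys such as INFO can appear many times in a header, which is why each key maps to a list in this sketch.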

The columns of a VCF

"The body of VCF follows the header, and is tab separated into 8 mandatory columns and an unlimited number of optional columns that may be used to record other information about the sample(s). When additional columns are used, the first optional column is used to describe the format of the data in the columns that follow."

Name
Brief description (see the specification for details).
1
CHROM
The name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as 'the reference sequence', i.e. the sequence against which the given sample varies.
2
POS

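<1>The 1-based position of the variation on the given sequence.


<2>The position of the variant site relative to the reference genome. For an indel, this is the position of the first base.


<3>The left-most position of the variant (1-based), i.e. the position of the first base at which the variation occurs.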

3
ID

An optional identifier for the variation.


The identifier of the variation, e.g. a dbSNP rs identifier, or if unknown a ".". Multiple identifiers should be separated by semi-colons without white-space.

4
REF

<1>The reference allele at this site. REF and ALT together give the alleles observed in a sample, set of samples, or a population in general (depending on how the VCF was generated).


<2>The reference base (or bases in the case of an indel) at the given position on the given reference sequence.


Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

5
ALT

<1>The list of alternative alleles at this position.


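<2>The variant allele(s); multiple alleles are separated by commas. For a SNP this is the single alternate base; for an indel it is the sequence that results from the inserted or deleted bases, so both the number and the type of bases can differ from REF.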

6
QUAL

<1>A quality score associated with the inference of the given alleles.


<2>The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
Because the Phred scale is -10 * log10(1-p), where p is the probability that the polymorphism exists, a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is often not a very useful property for evaluating the quality of a variant call. See the GATK documentation on filtering variants for more information on this topic.
Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

7
FILTER

<1>A flag indicating which of a given set of filters the variation has passed.


<2>This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. 
If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See the GATK documentation on filtering variants for more information on this topic.

8
INFO

<1>An extensible list of key-value pairs (fields) describing the variation. See below for some common fields. Multiple fields are separated by semicolons with optional values in the format: "<key>=<data>[,data]".


<2>Various site-level annotations.
The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ; as in this example: MQ=99.00;MQ0=0;QD=17.94
They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

9
FORMAT

<1>An (optional) extensible list of fields for describing the samples. See below for some common fields.

<2>The format of the per-sample fields, for example GT:AD:DP:GQ:PL.

+
SAMPLEs
For each (optional) sample described in the file, values are given for the fields listed in FORMAT
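
As a rough illustration of the column layout described above, the following Python sketch splits one tab-separated record into the eight fixed columns plus the optional FORMAT and sample columns, and labels each ALT allele as a SNP, insertion or deletion by comparing its length with REF. The record and the helper names parse_record and classify are assumptions made up for this example, and symbolic alleles (such as those used for structural variants) are not handled.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line):
    """Split one VCF data line into a dict of named fields."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, parts[:8]))
    record["POS"] = int(record["POS"])         # POS is 1-based
    record["ALT"] = record["ALT"].split(",")   # several ALT alleles are comma-separated
    record["FORMAT"] = parts[8].split(":") if len(parts) > 8 else []
    record["SAMPLES"] = parts[9:]              # one column per sample, if any
    return record


def classify(ref, alt):
    """Very rough variant type, based only on allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. a multi-base substitution of equal length


# Hypothetical record: a SNP and a one-base insertion at the same site.
line = "1\t1000\t.\tA\tG,AT\t50\tPASS\tDP=10\tGT:DP\t0/1:10"
rec = parse_record(line)
for alt in rec["ALT"]:
    print(rec["CHROM"], rec["POS"], rec["REF"], alt, classify(rec["REF"], alt))
```

The length comparison works because, as noted for REF and ALT, indel alleles include the base preceding the event.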

Arbitrary INFO keys are permitted, although the following sub-fields are reserved (albeit optional):

AA : ancestral allele
AC : allele count in genotypes, for each ALT allele, in the same order as listed
AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
AN : total number of alleles in called genotypes
BQ : RMS base quality at this position
CIGAR : cigar string describing how to align an alternate allele to the reference allele
DB : dbSNP membership
DP : combined depth across samples, e.g. DP=154
END : end position of the variant described in this record (for use with symbolic alleles)
H2 : membership in hapmap2
H3 : membership in hapmap3
MQ : RMS mapping quality, e.g. MQ=52
MQ0 : number of MAPQ == 0 reads covering this record
NS : number of samples with data
SB : strand bias at this position
SOMATIC : indicates that the record is a somatic mutation, for cancer genomics
VALIDATED : validated by follow-up experiment
1000G : membership in 1000 Genomes
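
An INFO string such as the MQ=99.00;MQ0=0;QD=17.94 example given for the INFO column can be unpacked with a few lines of Python. This is only a sketch: parse_info is a hypothetical helper, the DB and AC entries in the test string are added here for illustration, values are left as strings rather than converted to the types declared in the header, and keys without a value (flags such as DB) are stored as True.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict of key -> value."""
    result = {}
    if info in ("", "."):            # "." means the field is empty
        return result
    for field in info.split(";"):    # fields are separated by semicolons
        key, sep, value = field.partition("=")
        if not sep:
            result[key] = True                # flag with no value, e.g. DB
        elif "," in value:
            result[key] = value.split(",")    # one value per ALT allele, for example
        else:
            result[key] = value
    return result


info = parse_info("MQ=99.00;MQ0=0;QD=17.94;DB;AC=1,2")
for key, value in info.items():
    print(key, value)
```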


How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:



Looking at that last column, here is what the tags mean:

GT : The genotype of this sample at this site.

For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. 

When there's a single ALT allele (by far the more common case), GT will be either:

  • 0/0 - the sample is homozygous reference
  • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
  • 1/1 - the sample is homozygous alternate
In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. 

For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
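
The GT encoding can be sketched in Python as follows. The helper names are hypothetical, and the sketch also accepts "|" as a separator, which the VCF specification uses for phased genotypes (this is not covered in the text above). A "." stands for a missing call.

```python
import re


def split_gt(gt):
    """Split a GT string such as "0/1" or "0|1" into allele indices."""
    return re.split(r"[/|]", gt)


def decode_gt(gt, ref, alts):
    """Map GT indices to bases: 0 is REF, 1 is the first ALT, and so on."""
    alleles = [ref] + list(alts)
    return ["." if i == "." else alleles[int(i)] for i in split_gt(gt)]


def describe(gt):
    """Name the genotype class of a GT string."""
    indices = split_gt(gt)
    if "." in indices:
        return "missing"
    if len(indices) == 1:
        return "haploid"
    if len(set(indices)) > 1:
        return "heterozygous"
    return "homozygous reference" if indices[0] == "0" else "homozygous alternate"


# REF = C with two ALT alleles T and G, chosen only for illustration.
for gt in ["0/0", "0/1", "1/1", "1/2", "1", "./."]:
    print(gt, decode_gt(gt, "C", ["T", "G"]), describe(gt))
```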


  • AD and DP : Allele depth and depth of coverage.
    These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. 
    AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. 
    DP is the filtered depth, at the sample level. This gives you the total number of filtered reads covering the site for this sample. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
    See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage).

  • PL : "Normalized" Phred-scaled likelihoods of the possible genotypes.
    For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities. The PL of the most likely genotype is set to 0 to make the field easy to read, and the other values are scaled relative to it.
    Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

  • GQ : Quality of the assigned genotype.
    The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
    Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
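
The relationship between GQ and PL described above can be written as a small Python function. This is a simplified sketch of the rule as stated in the text (second smallest PL minus the smallest, capped at 99), not GATK's actual implementation; the function name gq_from_pl and the second test vector are made up.

```python
def gq_from_pl(pl, cap=99):
    """Genotype quality as the gap between the two best PLs, capped at `cap`."""
    ordered = sorted(pl)
    # With normalized PLs the smallest value is 0, so this is the second smallest PL.
    return min(ordered[1] - ordered[0], cap)


print(gq_from_pl([103, 0, 26]))    # PL taken from the 1:899282 record
print(gq_from_pl([500, 0, 300]))   # hypothetical PLs where the cap applies
```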

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only 4 reads in total at this site (DP = 4), 1 of which was REF (it had the reference base) and 3 of which were ALT (they had the alternate base), as indicated by AD = 1,3. The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0), as is always the case for the assigned genotype, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). Either way, it's clear that the subject is not hom-ref (homozygous with the reference allele), since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
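
The same reading of the 1:899282 record can be reproduced in Python. The sketch pairs the FORMAT keys with the sample values and converts each PL back from the Phred scale with 10^(-PL/10). The helper names parse_sample and phred_to_prob are hypothetical, and the genotype order 0/0, 0/1, 1/1 assumes a diploid sample with a single ALT allele.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def phred_to_prob(q):
    """Convert a Phred-scaled value back to the linear scale."""
    return 10 ** (-q / 10)


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:1,3:4:26:103,0,26")
ad = [int(x) for x in sample["AD"].split(",")]   # reads supporting REF, ALT
pl = [int(x) for x in sample["PL"].split(",")]   # one PL per genotype

print("GT", sample["GT"], "DP", sample["DP"], "AD", ad, "GQ", sample["GQ"])
for gt, value in zip(["0/0", "0/1", "1/1"], pl):
    print(gt, value, f"{phred_to_prob(value):.2e}")
```

Here the two AD values add up to DP, but that need not hold in general because AD and DP count reads under different rules.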
