What is VCF?

Latest recommended article published 2022-09-09 21:00:32

xmanxworld


The material below is collected from Wikipedia and some blogs:

https://en.wikipedia.org/wiki/Variant_Call_Format

https://blog.csdn.net/genome_denovo/article/details/78697679

https://blog.csdn.net/g863402758/article/details/53068366

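I don't know how you feel after reading these, but I am still quite confused. I think that to really understand VCF you also need to know how a VCF file is produced. Don't you agree? I certainly think so.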

I have only recently started working in this data processing area myself and do not understand it well yet. If you also have questions or run into errors with VCF, my QQ number is 1130346295; discussing it together may help us both understand the VCF format better.

"The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome."

"The standard is currently in version 4.3, although the 1000 Genomes Project has developed their own specification for structural variations such as duplications, which are not easily accommodated into the existing schema. A set of tools is also available for editing and manipulating the files."

Example 1 (Chromosome 1 of Denisova)

The VCF header

"The header begins the file and provides metadata describing the body of the file. Header lines are denoted as starting with #. Special keywords in the header are denoted with ##. Recommended keywords include fileformat, fileDate and reference.

The header contains keywords that optionally semantically and syntactically describe the fields used in the body of the file, notably INFO, FILTER, and FORMAT (see below)."
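As a rough illustration of the header rules quoted above, the sketch below separates the `##` meta-information lines from the single `#CHROM` column line. The header text and the function name `split_header` are invented for illustration; they are not taken from the Denisova file or from any library.

```python
def split_header(lines):
    """Split VCF header lines into ##key=value metadata and the column names."""
    meta = {}
    columns = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # "##key=value": keep everything after the first "=" as the value.
            key, _, value = line[2:].partition("=")
            meta.setdefault(key, []).append(value)
        elif line.startswith("#"):
            # The single "#CHROM ..." line names the columns of the body.
            columns = line[1:].split("\t")
        else:
            break  # first body record reached: the header is over
    return meta, columns


# Hypothetical header, for illustration only.
example = [
    "##fileformat=VCFv4.3",
    "##fileDate=20090805",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t899282\trs28548431\tC\tT\t.\t.\t.",
]
meta, columns = split_header(example)
print(meta["fileformat"])
print(columns)
```

Keeping a list per keyword is a deliberate choice here, since keywords such as INFO, FILTER and FORMAT normally appear on several header lines.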

The columns of a VCF

"The body of VCF follows the header, and is tab separated into 8 mandatory columns and an unlimited number of optional columns that may be used to record other information about the sample(s). When additional columns are used, the first of the optional columns is used to describe the format of the data in the columns that follow."

Name	Brief description (see the specification for details).
1	CHROM	The name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as 'the reference sequence', i.e. the sequence against which the given sample varies.
2	POS	(1) The 1-based position of the variation on the given sequence. (2) The position of the variant site relative to the reference genome; for an indel, this is the position of the first base. (3) The left-most position of the variant (1-based), i.e. the position of the first base at which the variation occurs.
3	ID	(1) An optional identifier for the variation. (2) The identifier of the variation, e.g. a dbSNP rs identifier, or "." if unknown. Multiple identifiers should be separated by semicolons without white-space.
4	REF	(1) The reference allele at this site; the ALT column lists the alternative allele(s) observed in a sample, set of samples, or a population in general (depending on how the VCF was generated). (2) The reference base (or bases in the case of an indel) at the given position on the given reference sequence. Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion, so you know where the insertion is relative to the reference sequence. For deletions, the ALT allele is the base before the deletion, and REF is that base followed by the deleted bases.
5	ALT	(1) The list of alternative alleles at this position. (2) The variant allele(s); if there are several, they are separated by commas. This column gives the bases that support the variant: for a SNP it is a single base, while for an indel it shows the bases that were inserted or deleted, along with any change in base type.
6	QUAL	(1) A quality score associated with the inference of the given alleles. (2) The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. Because the Phred scale is -10 * log10(1-p), where p is the probability that the polymorphism exists, a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of data is used for variant calling, so QUAL is often not a very useful property for evaluating the quality of a variant call; see the GATK documentation on filtering variants for more information. Not to be confused with the sample-level annotation GQ, which describes confidence in one sample's genotype rather than in the existence of a variant at the site.
7	FILTER	(1) A flag indicating which of a given set of filters the variation has failed, or PASS if it passed all of them. (2) This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ".", then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis; see the GATK documentation on filtering variants for more information.
8	INFO	(1) An extensible list of key-value pairs (fields) describing the variation. See below for some common fields. Multiple fields are separated by semicolons, with optional values in the format "<key>=<data>[,data]". (2) Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equals sign (=) and pairs are separated by semicolons (;), as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that is an easy way to check what an annotation means if you don't recognize it. Additional information on how they are calculated and how they should be interpreted is in the "Annotations" section of the GATK Tool Documentation.
9	FORMAT	(1) An (optional) extensible list of fields for describing the samples. See below for some common fields. (2) The format of the per-sample data, e.g. GT:AD:DP:GQ:PL.
+	SAMPLEs	For each (optional) sample described in the file, values are given for the fields listed in FORMAT.
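To make the column layout above more concrete, here is a minimal sketch that splits one body line into the eight mandatory columns plus FORMAT and the sample columns, and then guesses the variant type from the REF/ALT convention described in the REF row. The function names are hypothetical, the QUAL, FILTER and INFO values in the example line are invented placeholders, and the classification is a simplification that treats symbolic or missing alleles as "other".

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line):
    """Split one tab-separated VCF body line into named columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(FIXED_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    record = dict(zip(FIXED_COLUMNS, fields))   # zip stops after 8 columns
    record["POS"] = int(record["POS"])          # 1-based position
    record["ALT"] = record["ALT"].split(",")    # comma-separated alleles
    record["FORMAT"] = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = fields[9:]              # one column per sample
    return record


def classify_allele(ref, alt):
    """Rough variant type for one ALT allele, based on allele lengths."""
    if alt in (".", "*") or alt.startswith("<"):
        return "other"                          # missing or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"                      # ALT = preceding base + inserted bases
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"                       # ALT = base before the deletion
    return "other"


# Placeholder values for QUAL (50), FILTER (PASS) and INFO (DP=4).
line = "1\t899282\trs28548431\tC\tT\t50\tPASS\tDP=4\tGT:AD:DP:GQ:PL\t0/1:1,3:4:26:103,0,26"
record = parse_record(line)
print(record["POS"], record["REF"], record["ALT"])
print([classify_allele(record["REF"], alt) for alt in record["ALT"]])
print(classify_allele("C", "CTAG"), classify_allele("CTAG", "C"))
```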

In the INFO column, arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):

Name	Brief description
AA	ancestral allele
AC	allele count in genotypes, for each ALT allele, in the same order as listed
AF	allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
AN	total number of alleles in called genotypes
BQ	RMS base quality at this position
CIGAR	cigar string describing how to align an alternate allele to the reference allele
DB	dbSNP membership
DP	combined depth across samples, e.g. DP=154
END	end position of the variant described in this record (for use with symbolic alleles)
H2	membership in hapmap2
H3	membership in hapmap3
MQ	RMS mapping quality, e.g. MQ=52
MQ0	number of MAPQ == 0 reads covering this record
NS	number of samples with data
SB	strand bias at this position
SOMATIC	indicates that the record is a somatic mutation, for cancer genomics
VALIDATED	validated by follow-up experiment
1000G	membership in 1000 Genomes
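The INFO syntax (semicolon-separated fields, `key=value` pairs, comma-separated multiple values, and bare keys used as flags) can be sketched as follows. The helper name `parse_info` is made up for illustration, and values are deliberately left as strings, because their real types are declared in the header rather than in the record itself.

```python
def parse_info(info):
    """Turn an INFO string into a dict; flags map to True, lists to lists."""
    if info in ("", "."):
        return {}                       # "." means no INFO annotations
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            result[key] = True          # flag such as DB or SOMATIC
        elif "," in value:
            result[key] = value.split(",")   # e.g. one value per ALT allele
        else:
            result[key] = value
    return result


gatk_style = parse_info("MQ=99.00;MQ0=0;QD=17.94")
print(gatk_style)

# Hypothetical INFO string mixing a list value and a flag.
mixed = parse_info("DP=154;AF=0.5,0.25;DB")
print(mixed)
```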

How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it is actually not that hard to interpret once you understand that it is just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

Looking at that last column, here is what the tags mean:

GT : The genotype of this sample at this site.

For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc.

When there's a single ALT allele (by far the more common case), GT will be either:

0/0 - the sample is homozygous reference
0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
1/1 - the sample is homozygous alternate

In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.

For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.

AD and DP : Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller's filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the total number of filtered reads covering the site for this sample (a single number, not one per allele). You can check the variant caller's documentation to see which filters are applied by default. Only reads that passed the variant caller's filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.
PL : "Normalized" Phred-scaled likelihoods of the possible genotypes.
For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. "Normalized" is in quotes because these are not probabilities; the PL of the most likely genotype is simply set to 0 for ease of reading. The other values are scaled relative to this most likely genotype.
Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.
GQ : Quality of the assigned genotype.
The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
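The tag descriptions above can be tied together with a small sketch: pair the FORMAT keys with one sample's values, translate GT indices into alleles, and recompute GQ from the PLs in the way the GQ paragraph describes (second-smallest PL minus smallest PL, capped at 99). All function names are hypothetical. The "|" separator handled below is the VCF notation for phased genotypes, which this article does not otherwise cover.

```python
def decode_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_alleles(gt, ref, alts):
    """Translate GT indices (0 = REF, 1 = first ALT, ...) into alleles."""
    alleles = [ref] + list(alts)
    separator = "|" if "|" in gt else "/"    # "|" marks a phased genotype
    return [alleles[int(i)] if i != "." else "." for i in gt.split(separator)]


def gq_from_pl(pl_values, cap=99):
    """GQ as described above: gap between the two smallest PLs, capped."""
    ordered = sorted(pl_values)
    return min(ordered[1] - ordered[0], cap)


sample = decode_sample("GT:AD:DP:GQ:PL", "0/1:1,3:4:26:103,0,26")
alleles = genotype_alleles(sample["GT"], "C", ["T"])
pl = [int(x) for x in sample["PL"].split(",")]
print(alleles)          # the two alleles carried by the sample
print(gq_from_pl(pl))   # should agree with the GQ stored in the record
```

The values used here are those of the 1:899282 record for NA12878, so the recomputed GQ can be compared with the GQ field that the record itself carries.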

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only 4 reads in total at this site (DP=4), 1 of which was REF (had the reference base) and 3 of which were ALT (had the alternate base), as indicated by AD=1,3. The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0), as is always the case for the assigned genotype, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or about 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous for the variant allele). Either way, it's clear that the subject is not hom-ref (homozygous for the reference allele), since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
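The arithmetic in this interpretation is the usual Phred conversion, value = 10^(-Q/10), and the same conversion applies to the QUAL column. A minimal sketch, with hypothetical helper names, that reproduces the numbers for the three PLs of this record:

```python
import math


def phred_to_prob(q):
    """Convert a Phred-scaled value Q into 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability p into its Phred-scaled value -10*log10(p)."""
    return -10 * math.log10(p)


# PLs of NA12878 at 1:899282, in the order 0/0, 0/1, 1/1.
pls = {"0/0": 103, "0/1": 0, "1/1": 26}
relative = {gt: phred_to_prob(q) for gt, q in pls.items()}
for gt, value in relative.items():
    # 0/1 gives 1.0, 1/1 roughly 0.0025, 0/0 a value on the order of 1e-11
    print(gt, value)
```

Because the PLs are normalized, these numbers are likelihoods relative to the best genotype, not probabilities that sum to 1.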

Name	Brief description (see the specification for details).
1	CHROM	The name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as 'the reference sequence', i.e. the sequence against which the given sample varies.
2	POS	The 1-based position of the variation on the given sequence.
3	ID	The identifier of the variation, e.g. a dbSNP rs identifier, or if unknown a ".". Multiple identifiers should be separated by semi-colons without white-space.
4	REF	The reference base (or bases in the case of an indel) at the given position on the given reference sequence.
5	ALT	The list of alternative alleles at this position.
6	QUAL	A quality score associated with the inference of the given alleles.
7	FILTER	A flag indicating which of a given set of filters the variation has failed, or PASS if it passed all of them.
8	INFO	An extensible list of key-value pairs (fields) describing the variation. See below for some common fields. Multiple fields are separated by semicolons, with optional values in the format "<key>=<data>[,data]".
9	FORMAT	An (optional) extensible list of fields for describing the samples. See below for some common fields.
+	SAMPLEs	For each (optional) sample described in the file, values are given for the fields listed in FORMAT
