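Installation
You must register before you can download ANNOVAR; an email address with an educational-institution domain is enough to register.
http://www.openbioinformatics.org/annovar/annovar_download_form.php
After registering, you receive the software package by email:
annovar.latest.tar/
Perl scripts included in the package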
toucan@tssys /opt/script/tool/annovar Sun Oct 07 16:42 forstart
$tree -L 1
.
├── annotate_variation.pl
├── coding_change.pl
├── convert2annovar.pl
├── example
├── humandb
├── retrieve_seq_from_fasta.pl
├── table_annovar.pl
└── variants_reduction.pl
humandb
The ANNOVAR package ships with a few commonly used databases, located in the humandb/ directory.
toucan@tssys /opt/script/tool/annovar/humandb Sun Oct 07 16:43 forstart
$tree -L 1
.
├── genometrax-sample-files-gff
├── GRCh37_MT_ensGeneMrna.fa
├── GRCh37_MT_ensGene.txt
├── hg19_example_db_generic.txt
├── hg19_example_db_gff3.txt
├── hg19_MT_ensGeneMrna.fa
├── hg19_MT_ensGene.txt
├── hg19_refGeneMrna.fa
├── hg19_refGene.txt
├── hg19_refGeneVersion.txt
├── hg19_refGeneWithVerMrna.fa
└── hg19_refGeneWithVer.txt
GFF files
toucan@tssys /opt/script/tool/annovar/humandb/genometrax-sample-files-gff Sun Oct 07 16:44 forstart
$tree -L 1
.
├── list
├── sample_chip_featuretype_hg19.gff
├── sample_common_snp_featuretype_hg19.gff
├── sample_cosmic_featuretype_hg19.gff
├── sample_cpg_islands_featuretype_hg19.gff
├── sample_dbnsfp_featuretype_hg19.gff
├── sample_disease_featuretype_hg19.gff
├── sample_dnase_featuretype_hg19.gff
├── sample_drug_featuretype_hg19.gff
├── sample_evs_featuretype_hg19.gff
├── sample_gwas_featuretype_hg19.gff
├── sample_hgmd_common_snp_featuretype_hg19.gff
├── sample_hgmd_disease_genes_featuretype_hg19.gff
├── sample_hgmd_featuretype_hg19.gff
├── sample_hgmdimputed_featuretype_hg19.gff
├── sample_microsatellites_featuretype_hg19.gff
├── sample_miRNA_featuretype_hg19.gff
├── sample_omim_featuretype_hg19.gff
├── sample_pathway_featuretype_hg19.gff
├── sample_pgx_featuretype_hg19.gff
├── sample_ptms_featuretype_hg19.gff
├── sample_snps_dbsnp_featuretype_hg19.gff
├── sample_snps_ensembl_featuretype_hg19.gff
├── sample_transfac_sites_featuretype_hg19.gff
└── sample_tss_featuretype_hg19.gff
0 directories, 25 files
example
toucan@DELL5577:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ ll
total 20152
drwxrwxrwx 1 toucan toucan 4096 Sep 26 22:47 ./
drwxrwxrwx 1 toucan toucan 4096 Sep 26 22:47 ../
-rwxrwxrwx 1 toucan toucan 1940 Apr 17 03:41 README*
-rwxrwxrwx 1 toucan toucan 1831 Apr 17 03:41 ex1.avinput*
-rwxrwxrwx 1 toucan toucan 1706 Apr 17 03:41 ex2.vcf*
-rwxrwxrwx 1 toucan toucan 44 Apr 17 03:41 example.simple_region*
-rwxrwxrwx 1 toucan toucan 44 Apr 17 03:41 example.tab_region*
-rwxrwxrwx 1 toucan toucan 20317115 Apr 17 03:41 gene_fullxref.txt*
-rwxrwxrwx 1 toucan toucan 295664 Apr 17 03:41 gene_xref.txt*
-rwxrwxrwx 1 toucan toucan 1436 Apr 17 03:41 grantham.matrix*
-rwxrwxrwx 1 toucan toucan 43 Apr 17 03:41 snplist.txt*
README notes
toucan@DELL5577:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ cat README
visit ANNOVAR website at http://www.openbioinformatics.org/annovar for more example.
Please cite ANNOVAR if you use it in your research (Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data, Nucleic Acids Research, 38:e164, 2010). I spent tremendous amount of time and effort to maintain this tool, and your citation really means a lot to me.
ex1.avinput: a simple ANNOVAR input example with a few variants (in hg19 coordinate)
ex2.vcf: a simple VCF file with genotype information for 3 samples
gene_xref.txt: an example gene cross-reference file to be used on 'gx' operation in table_annovar.pl
example.simple_region: a file containing a list of genomic regions in sample format (for use in retrieve_seq_from_fasta.pl)
example.tab_region: a file containing a list of genomic regions in tab-delimited format (for use in retrieve_seq_from_fasta.pl)
snplist.txt: a text file listing several dbSNP rs identifiers, one per line
humandb/hg19_example_db_generic.txt: an example file for generic database
humandb/hg19_example_db_gff3.txt: an example file for GFF3 database
grantham.matrix: a matrix file containing GRANTHAM scores for gene-based annotation
humandb/genometrax-sample-files-gff: a directory containing several "sample" GFF files provided by BioBase
humandb/hg19_MT_ensGene.txt and humandb/hg19_MT_ensGene.fa: mitochondria sequence for the NC_001807 contig used by UCSC Genome Browser. Even if you align your sequence data with reference sequences from UCSC, you should still use these files, not the ENSEMBLE file, for mitochondria annotation, because the ENSEMBLE annotations have some errors.
humandb/GRCh37_MT_ensGene.txt and humandb/GRCh37_MT_ensGene.fa: mitochondria sequence for the NC_012920 contig. If you align your sequence data using 1000 Genomes Project reference FASTA file, then you should use this file for annotating mitochondria variants.
For any other annotation, use the -downdb argument to download the required database into the 'humandb/' directory:
# Download the 1000g2015aug database
$perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/
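When several databases are needed, the download command above can be wrapped in a small script. The following is a minimal Python sketch, not part of ANNOVAR: the helper names `build_downdb_command` and `download_all` are hypothetical, and the database names in the usage lines are simply those that appear in the help text further down. Only the flags `-buildver`, `-downdb` and `-webfrom` come from the command shown above.

```python
import subprocess


def build_downdb_command(dbname, buildver="hg19", dbdir="humandb/", webfrom="annovar"):
    """Return the argv list for one `annotate_variation.pl -downdb` call.

    Hypothetical helper: it only assembles the flags used in the command above.
    """
    cmd = ["perl", "annotate_variation.pl", "-buildver", buildver, "-downdb"]
    if webfrom:  # leave -webfrom out to fall back to the default source (UCSC)
        cmd += ["-webfrom", webfrom]
    cmd += [dbname, dbdir]
    return cmd


def download_all(dbnames, run=False, **kwargs):
    """Build one download command per database and optionally execute them."""
    commands = [build_downdb_command(name, **kwargs) for name in dbnames]
    if run:
        for cmd in commands:
            # check=True stops at the first download that fails
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in download_all(["refGene", "1000g2015aug", "esp6500siv2_all"]):
        print(" ".join(cmd))
```

Keeping the default as a dry run makes it easy to check the generated commands before anything is downloaded; pass `run=True` from the ANNOVAR directory to actually execute them.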
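Software help documentation
(ANNOVAR program structure
│ annotate_variation.pl # main program: downloads databases and performs the three types of annotation
│ coding_change.pl # can be used to infer the protein sequence
│ convert2annovar.pl # converts various formats to .avinput
│ retrieve_seq_from_fasta.pl # used to build transcript sequences for other species yourself
│ table_annovar.pl # annotation program that performs all three annotation types in one run
│ variants_reduction.pl # for customizing filtering and annotation workflows more flexibly
│
├─example # example files
│
└─humandb # human annotation databases)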
annotate_variation.pl
$cat mam/annotate_variation.txt
SYNOPSIS
annotate_variation.pl [arguments] <query-file|table-name> <database-location>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
Arguments to download databases or perform annotations
--downdb download annotation database
--geneanno annotate variants by gene-based annotation (infer functional consequence on genes)
--regionanno annotate variants by region-based annotation (find overlapped regions in database)
--filter annotate variants by filter-based annotation (find identical variants in database)
Arguments to control input and output
--outfile <file> output file prefix
--webfrom <string> specify the source of database (ucsc or annovar or URL) (downdb operation)
--dbtype <string> specify database type
--buildver <string> specify genome build version (default: hg18 for human)
--time print out local time during program run
--comment print out comment line (those starting with #) in output files
--exonsort sort the exon number in output line (gene-based annotation)
--transcript_function use transcript name rather than gene name (gene-based annotation)
--hgvs use HGVS format for exonic annotation (c.122C>T rather than c.C122T)(gene-based annotation)
--separate separately print out all functions of a variant in several lines (gene-based annotation)
--seq_padding create a new file with cDNA sequence padded by this much either side(gene-based annotation)
--(no)firstcodondel treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
--aamatrix <file> specify an amino acid substitution matrix file (gene-based annotation)
--colsWanted <string> specify which columns to output by comma-delimited numbers (region-based annotation)
--scorecolumn <int> the column with scores in DB file (region-based annotation)
--poscolumn <string> the comma-delimited column with position information in DB file (region-based annotation)
--gff3dbfile <file> specify a DB file in GFF3 format (region-based annotation)
--gff3attribute output all fields in GFF3 attribute (default: ID and score only)
--bedfile <file> specify a DB file in BED format file (region-based annotation)
--genericdbfile <file> specify a DB file in generic format (filter-based annotation)
--vcfdbfile <file> specify a DB file in VCF format (filter-based annotation)
--otherinfo print out additional columns in database file (filter-based annotation)
--infoasscore use INFO field in VCF file as score in output (filter-based annotation)
--idasscore use ID field in VCF file as score in output (filter-based annotation)
--infosep use # rather than , to separate fields when -otherinfo is used
Arguments to fine-tune the annotation procedure
--batchsize <int> batch size for processing variants per batch (default: 5m)
--genomebinsize <int> bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
--expandbin <int> check nearby bin to find neighboring genes (default: 2m/genomebinsize)
--neargene <int> distance threshold to define upstream/downstream of a gene
--exonicsplicing report exonic variants near exon/intron boundary as 'exonic;splicing' variants
--score_threshold <float> minimum score of DB regions to use in annotation
--normscore_threshold <float> minimum normalized score of DB regions to use in annotation
--reverse reverse directionality to compare to score_threshold
--rawscore output includes the raw score (not normalized score) in UCSC BrowserTrack
--minqueryfrac <float> minimum percentage of query overlap to define match to DB (default: 0)
--splicing_threshold <int> distance between splicing variants and exon/intron boundary (default: 2)
--indel_splicing_threshold <int> if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
--maf_threshold <float> filter 1000G variants with MAF above this threshold (default: 0)
--sift_threshold <float> SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
--precedence <string> comma-delimited to specify precedence of variant function (default: exonic>intronic...)
--indexfilter_threshold <float> controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
--thread <int> use multiple threads for filter-based annotation
--maxgenethread <int> max number of threads for gene-based annotation (default: 6)
--mingenelinecount <int> min line counts to enable threaded gene-based annotation (default: 1000000)
Arguments to control memory usage
--memfree <int> ensure minimum amount of free system memory (default: 0)
--memtotal <int> limit total amount of memory used by ANNOVAR (default: 0, unlimited,in the order of kb)
--chromosome <string> examine these specific chromosomes in database file
Function: annotate a list of genetic variants against genome annotation
databases stored at local disk.
# Examples
Example: #download annotation databases from ANNOVAR or UCSC and save to humandb/ directory
annotate_variation.pl -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver mm9 -downdb refGene mousedb/
annotate_variation.pl -downdb -webfrom annovar esp6500siv2_all humandb/
#gene-based annotation of variants in the varlist file (by default --geneanno is ON)
annotate_variation.pl -buildver hg19 ex1.avinput humandb/
#region-based annotate variants
annotate_variation.pl -regionanno -buildver hg19 -dbtype cytoBand ex1.avinput humandb/
annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.avinput humandb/
#filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
annotate_variation.pl -filter -dbtype 1000g2015aug_all -maf 0.01 ex1.avinput humandb/
annotate_variation.pl -filter -buildver hg19 -dbtype snp138 ex1.avinput humandb/
annotate_variation.pl -filter -dbtype dbnsfp30a -otherinfo ex1.avinput humandb/
Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018) $
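The --genomebinsize argument listed above mentions a "bin size to speed up search". As an illustration of that general idea only, here is a minimal Python sketch of bin-based interval lookup. It is not ANNOVAR's actual code: the function names, the tuple layout and the toy regions are assumptions made for the example, and the 100,000 bp default simply mirrors the documented default for gene-based annotation.

```python
from collections import defaultdict


def build_bin_index(regions, binsize=100_000):
    """Index (chrom, start, end, name) regions under every bin they touch."""
    index = defaultdict(list)
    for chrom, start, end, name in regions:
        for b in range(start // binsize, end // binsize + 1):
            index[(chrom, b)].append((start, end, name))
    return index


def find_overlaps(index, chrom, start, end, binsize=100_000):
    """Return sorted names of regions overlapping [start, end] on chrom.

    Only the bins spanned by the query are inspected, not the whole chromosome.
    """
    hits = set()
    for b in range(start // binsize, end // binsize + 1):
        for r_start, r_end, name in index.get((chrom, b), []):
            if r_start <= end and start <= r_end:  # closed-interval overlap
                hits.add(name)
    return sorted(hits)


if __name__ == "__main__":
    # Toy regions, invented for the example.
    regions = [
        ("16", 50_000, 250_000, "geneA"),
        ("16", 300_000, 310_000, "geneB"),
        ("18", 50_000, 60_000, "geneC"),
    ]
    index = build_bin_index(regions)
    print(find_overlaps(index, "16", 120_000, 120_000))
    print(find_overlaps(index, "16", 260_000, 305_000))
```

Smaller bins mean fewer candidates per lookup but more index entries for long regions, which is the trade-off a bin-size setting controls.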
OPTIONS
--help print a brief usage message and detailed explanation of options.
--man print the complete manual of the program.
--verbose
use verbose output.
--downdb
download annotation databases from UCSC Genome Browser, Ensembl,
1000 Genomes Project, ANNOVAR website or other resources. The
annotation databases are required for functional annotation of
genetic variants.
--geneanno
perform gene-based annotation. For each variant, examine whether
it hit exon, intron, intergenic region, or close to a transcript,
or hit a non-coding RNA gene, or is located in an untranslated
region (see *.variant_function output file). In addition, for an
exonic variant, determine whether it causes splicing change,
non-synonymous amino acid change, synonymous amino acid change or
frameshift changes (see *.exonic_variant_function output file).
--regionanno
perform region-based annotation. For each variant, examine whether
its genomic region (one or multiple base pairs) overlaps with a
specific genomic region, such as the most conserved elements, the
predicted transcription factor binding sites, the specific
cytogenetic bands, the evolutionarily conserved RNA secondary
structures.
--filter
perform filter-based annotation. For each variant, filter it
against a variation database, such as the 1000 Genomes Project
database, to identify whether it has been reported in the database.
Exact match of nucleotide position and nucleotide composition are
required.
--outfile
specify the output file prefix. Several output files will be
generated using this prefix and different suffixes. A directory
name can also be specified as part of the argument, so that the
output files can be written to a different directory than the
current directory.
--webfrom
specify the source of database (ucsc or annovar or URL) in the
downdb operation. By default, files from UCSC Genome Browser
annotation database will be downloaded.
--dbtype
specify the database type to be used in gene-based, region-based
or filter-based annotations. For gene-based annotation, by default
refGene annotations from the UCSC Genome Browser will be used for
annotating variants. However, users can switch to use Ensembl
annotations, or use the UCSC Gene annotations, or the GENCODE Gene
annotations, or other types of gene annotations. For region-based
annotations, users can select any UCSC annotation databases (by
providing the database name), or alternatively select a Generic
Feature Format version 3 (GFF3) formatted file for annotation (by
providing 'gff3' as the --dbtype and providing the --gff3dbfile
argument), or select a BED file (by providing '--dbtype bed' and
--bedfile arguments). For filter-based annotations, users can
select a dbSNP file, a 1000G file, a generic format file (with
simple columns including chr, start, end, reference, observed,
score), a VCF format file (which is a widely used format for
variants exchange), or many other types of formats.
--buildver
genome build version to use. By default, the hg18 build for human
genome is used. The build version will be used by ANNOVAR to
identify corresponding database files automatically, for example,
when gene-based annotation is used for hg18 build, ANNOVAR will
search for the hg18_refGene.txt file, but if the hg19 is used as
--buildver, ANNOVAR will examine hg19_refGene.txt instead.
--time print out the local time during execution of the program
--comment
specify that the program should include comment lines in the
output files. Comment lines are defined as any line starting with
#. By default, these lines are not recognized as valid ANNOVAR
input and are therefore written to the INVALID_INPUT file. This
argument can be very useful to keep columns headers in the output
file, if the input file use comment line to flag the column
headers (usually the first line in the input file).
--exonsort
sort the exon number in output line in the exonic_variant_function
file during gene-based annotation. If a mutation affects multiple
transcripts, the ones with the smaller exon number will be printed
before the transcript with larger exon number in the output.
--transcript_function
use transcript name rather than gene name in output, for
gene-based annotation
--hgvs use HGVS format for exonic annotation (c.122C>T rather than
c.C122T) for gene-based annotation
--separate
for gene-based annotation, separate the effects of each variant,
so that each effect (intronic, exonic, splicing) is printed in one
output line. By default, all effects are printed in the same line,
in the comma-separated form of 'UTR3,UTR5' or 'exonic,splicing'.
--seq_padding
create a new file with cDNA sequence padded by this much either
side (gene-based annotation)
--firstcodondel
if the first codon of a gene is deleted, then the whole gene will
be treated as deleted in gene-based annotation. By default, this
option is ON.
--aamatrixfile
specify an amino acid substitution matrix, so that the scores are
printed in the exonic_variant_function file in gene-based
annotation. The matrix file is tab-delimited, and an example is
included in the ANNOVAR package.
--colsWanted
specify which columns are desired in the output for -regionanno.
By default, ANNOVAR intelligently selects the columns based on the
DB type. However, users can use a list of comma-delimited numbers,
or use 'all', or use 'none', to request custom output columns.
--scorecolumn
specify the column with desired output scores in UCSC database
file (for region-based annotation). The default usually works
okay.
--poscolumn
the comma-delimited column with position information in DB file
(region-based annotation). The default usually works okay.
--gff3dbfile
specify the GFF3-formatted database file used in the region-based
annotation. Please consult
http://www.sequenceontology.org/resources/gff3.html for detailed
description on this file format. Note that GFF3 is generally not
compatible with previous versions of GFF.
--gff3attribute
output should contain all fields in GFF3 file attribute column
(the 9th column). By default, only the ID in the attribute and the
scores for the GFF3 file will be printed.
--bedfile
specify a DB file in BED format file in region-based annotation.
Please consult http://genome.ucsc.edu/FAQ/FAQformat.html#format1
for detailed descriptions on this format.
--genericdbfile
specify the generic format database file used in the filter-based
annotation.
--vcfdbfile
specify the database file in VCF format in the filter-based
annotation. VCF has been a popular format for summarizing SNP and
indel calls in a population of samples, and has been adopted by
1000 Genomes Project in their most recent data release.
--otherinfo
print out additional columns in database file in filter-based
annotation. This argument is useful when the annotation database
contains more than one annotation columns, so that all columns
will be printed out and separated by comma (by default).
--idasscore
when annotating against a VCF file, treat the ID field in VCF file
as the score to be printed in the output, in filter-based
annotation. By default the score is the allele frequency inferred
from VCF file.
--infoasscore
when annotating against a VCF file, treat the INFO field in VCF
file as the score to be printed in the output, in filter-based
annotation. By default the score is allele frequency inferred from
VCF file.
--infosep
use '#' rather than ',' to separate multiple fields when
-otherinfo is used in annotation. This argument is useful when the
annotation string itself contains comma, to help users clearly
separate different annotation fields.
--batchsize
this argument specifies the batch size for processing variants by
gene-based annotation. Normally 5 million variants (usually one
human genome will have about 3-5 million variants depending on
ethnicity) are annotated as a batch, to reduce the amounts of
memory. The users can adjust the parameters: larger values make
the program slightly faster, at the expense of slightly larger
memory requirements. In a 64bit computer, the default settings
usually take 1GB memory for gene-based annotation for human genome
for a typical query file, but this depends on the complexity of
the query (note that the query has a few required fields, but may
have many optional fields and those fields need to be read and
kept in memory).
--genomebinsize
the bin size of genome to speed up search. By default 100kb is
used for gene-based annotation, so that variant annotation
focused on specific bins only (based on the start-end site of a
given variant), rather than searching the entire chromosomes for
each variant. By default 10kb is used for region-based annotation.
The filter-based annotations look for variants directly so no bin
is used.
--expandbin
expand bin to both sides to find neighboring genes/regions. For
gene-based annotation, ANNOVAR tries to find nearby genes for any
intergenic variant, with a maximum number of nearby bins to
search. By default, ANNOVAR will automatically set this argument
to search 2 megabases to the left and right of the variant in
genome.
--neargene
the distance threshold to define whether a variant is in the
upstream or downstream region of a gene. By default 1 kilobase
from the start or end site of a transcript is defined as upstream
or downstream, respectively. This is useful, for example, when one
wants to identify variants that are located in the promoter
regions of genes across the genome.
--exonicsplicing
report exonic variants near exon/intron boundary as
'exonic;splicing' variants. These variants are technically exonic
variants, but there are some literature reports that some of them
may also affect splicing so a keyword is preserved specifically
for them.
--score_threshold
the minimum score to consider when examining region-based
annotations on UCSC Genome Browser tables. Some tables do not have
such scores and this argument will not be effective.
--normscore_threshold
the minimum normalized score to consider when examining
region-based annotations on UCSC Genome Browser tables. The
normalized score is calculated by UCSC, ranging from 0 to 1000, to
make visualization easier. Some tables do not have such scores and
this argument will not be effective.
--reverse
reverse the criteria for --score_threshold and
--normscore_threshold. So the minimum score becomes maximum score
for a result to be printed.
--rawscore
for region-based annotation, print out raw scores from UCSC Genome
Browser tables, rather than normalized scores. By default,
normalized scores are printed in the output files. Normalized
scores are compiled by UCSC Genome Browser for each track, and
they usually range from 0 to 1000, but there are some exceptions.
--minqueryfrac
The minimum fraction of overlap between a query and a database
record to decide on their match. By default, any overlap is
regarded as a match, but this may not work best when the query consists
of large copy number variants.
--splicing_threshold
distance between splicing variants and exon/intron boundary, to
claim that a variant is a splicing variant. By default, 2bp is
used. ANNOVAR is relatively more stringent than some other
software to claim variant as regulating splicing. In addition, if
a variant is an exonic variant, it will not be reported as
splicing variant even if it is within 2bp to an exon/intron
boundary.
--indel_splicing_threshold
If set, max size of indel allowed to be called a splicing variant
(if boundary within --splicing_threshold bases of an intron/exon
junction.) If not set, this is equal to the --splicing_threshold,
as per original behavior.
--maf_threshold
the minor allele frequency (MAF) threshold to be used in the
filter-based annotation for the 1000 Genomes Project databases. By
default, any variant annotated in the 1000G will be used in
filtering.
--sift_threshold
the default SIFT threshold for deleterious prediction for -dbtype
avsift (default: 0.05). This argument is obsolete, since the
recommended database for SIFT annotation is LJB database now,
rather than avsift database.
--thread
specify the number of threads to use in filter-based annotation.
Perl and all components in the system need to support
multi-threaded analysis to use this feature. It is recommended
when your database is stored at a SSD drive, which results in
nearly linear speed up of annotation for large genome files.
--maxgenethread
specify the maximum number of threads for gene-based annotation
(default: 6). Generally speaking, too many threads for gene-based
annotation will negatively impact the performance.
--mingenelinecount
specify the minimum line counts to enable threaded gene-based
annotation (default: 1000000). For input files with less lines,
the threaded annotation will not be used, since it actually cost
more time than non-threaded annotation.
--memfree
the minimum amount of free system memory that ANNOVAR should
ensure to have.
--memtotal
the total amount of memory that ANNOVAR should use at most. By
default, this value is zero, meaning that there is no limit on
that. Decreasing this threshold reduces the memory requirement by
ANNOVAR, but may increase the execution time.
--chromosome
examine these specific chromosomes in database file. The argument
takes comma-delimited values, and the dash can be correctly
recognized. For example, 5-10,X represent chromosome 5 through
chromosome 10 plus chromosome X.
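The --chromosome syntax described above (comma-delimited values, with a dash denoting a range) can be made concrete with a short Python sketch. This only illustrates the notation and is not how ANNOVAR parses the argument internally: the function name `parse_chromosome_spec` is hypothetical, and restricting ranges to numeric chromosomes is an assumption of the example.

```python
def parse_chromosome_spec(spec):
    """Expand a spec such as '5-10,X' into a list of chromosome names."""
    chromosomes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            # Assumption: a range only makes sense between numeric chromosomes.
            if not (first.isdigit() and last.isdigit()):
                raise ValueError(f"non-numeric range: {part!r}")
            if int(first) > int(last):
                raise ValueError(f"descending range: {part!r}")
            chromosomes.extend(str(n) for n in range(int(first), int(last) + 1))
        else:
            chromosomes.append(part)
    return chromosomes


if __name__ == "__main__":
    print(parse_chromosome_spec("5-10,X"))  # ['5', '6', '7', '8', '9', '10', 'X']
```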
DESCRIPTION
ANNOVAR is a software tool that can be used to functionally annotate a
list of genetic variants, possibly generated from next-generation
sequencing experiments. For example, given a whole-genome resequencing
data set for a human with specific diseases, typically around 3 million
SNPs and around half million insertions/deletions will be identified.
Given this massive amount of data (and candidate disease-causing
variants), it is necessary to have a fast algorithm that scans the data
and identify a prioritized subset of variants that are most likely
functional for follow-up Sanger sequencing studies and functional assays.
Currently, these various types of functional annotations produced by
ANNOVAR can be (1) gene-based annotations (the default behavior), such as
exonic variants, intronic variants, intergenic variants, downstream
variants, UTR variants, splicing site variants, etc. For exonic variants,
ANNOVAR will try to predict whether each of the variants is non-synonymous
SNV, synonymous SNV, frameshifting change, nonframeshifting change. (2)
region-based annotation, to identify whether a given variant overlaps with
a specific type of genomic region, for example, predicted transcription
factor binding site or predicted microRNAs.(3) filter-based annotation, to
filter a list of variants so that only those not observed in variation
databases (such as 1000 Genomes Project and dbSNP) are printed out.
Detailed documentation for ANNOVAR should be viewed in ANNOVAR website
(http://annovar.openbioinformatics.org/). Below is description on commonly
encountered file formats when using ANNOVAR software.
* variant file format
A sample variant file contains one variant per line, with the
fields being chr, start, end, reference allele, observed allele,
other information. The other information can be anything (for
example, it may contain sample identifiers for the corresponding
variant.) An example is shown below:
16 49303427 49303427 C T rs2066844 R702W (NOD2)
16 49314041 49314041 G C rs2066845 G908R (NOD2)
16 49321279 49321279 - C rs2066847 c.3016_3017insC (NOD2)
16 49290897 49290897 C T rs9999999 intronic (NOD2)
16 49288500 49288500 A T rs8888888 intergenic (NOD2)
16 49288552 49288552 T - rs7777777 UTR5 (NOD2)
18 56190256 56190256 C T rs2229616 V103I (MC4R)
* database file format: UCSC Genome Browser annotation database
Most but not all of the gene annotation databases are directly
downloaded from UCSC Genome Browser, so the file format is
identical to what was used by the genome browser. The users can
check Table Browser (for example, human hg18 table browser is at
http://www.genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18) to
see what fields are available in the annotation file. Note that
even for the same species (such as humans), the file format might
be different between different genome builds (such as between
hg16, hg17 and hg18). ANNOVAR will try to be smart about guessing
file format, based on the combination of the --buildver argument
and the number of columns in the input file. In general, the
database file format should not be something that users need to
worry about.
* database file format: GFF3 format for gene-based annotations
As of June 2010, ANNOVAR cannot perform gene-based annotations
using GFF3 input files, and any annotations on GFF3 is
region-based. I suggest that users download gff3ToGenePred tool
from UCSC and convert GFF3-based gene annotation to UCSC format,
so that ANNOVAR can perform gene-based annotation for your species
of interests.
* database file format: GFF3 format for region-based annotations
Currently, region-based annotations can support the Generic
Feature Format version 3 (GFF3) formatted files. The GFF3 has
become the de facto golden standards for many model organism
databases, such that many users may want to take a custom
annotation database and run ANNOVAR on them, and it would be the
most convenient if the custom file is made with GFF3 format.
* database file format: generic format for filter-based annotations
The 'generic' format is designed for filter-based annotation that
looks for exact variants. The format is almost identical to the
ANNOVAR input format, with chr, start, end, reference allele,
observed allele and scores (higher scores are regarded as better).
* database file format: VCF format for filter-based annotations
ANNOVAR can directly interrogate VCF files as database files. A
VCF file may contain summary information for variants (for
example, this variant has MAF of 5% in this population), or it may
contain the actual variant calls for each individual in a specific
population.
* sequence file format
ANNOVAR can directly examine FASTA-formatted sequence files. For
mRNA sequences, the name of the sequences are the mRNA identifier.
For genomic sequences, the name of the sequences in the files are
usually chr1, chr2, chr3, etc, so that ANNOVAR knows which
sequence corresponds to which chromosome. Unfortunately, UCSC uses
things like chr6_random to annotate un-assembled sequences, as
opposed to using the actual contig identifiers. This causes some
issues (depending on how reads alignment algorithms works), but in
general should not be something that user need to worry about. If
the users absolutely care about the exact contigs rather than
chr*_random, then they will need to re-align the short reads at
chr*_random to a different FASTA file that contains the contigs
(such as the GRCh36/37/38), and then execute ANNOVAR on the newly
identified variants.
* invalid input
If the query file contains input lines with invalid format,
ANNOVAR will skip such line and continue with the annotation on
next lines. These invalid input lines will be written to a file
with suffix invalid_input. Users should manually examine this file
and identify sources of error.
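To make the variant file format and the handling of invalid lines more concrete, here is a minimal Python sketch that reads lines in the five-column layout described above (chr, start, end, reference allele, observed allele, then optional other information) and sets aside the lines that do not fit. It is an illustration under simplifying assumptions, not ANNOVAR's validation logic: the names `Variant`, `parse_avinput_line` and `split_valid_invalid` are hypothetical, fields are split on any whitespace, and alleles are not checked.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    obs: str
    other: str


def parse_avinput_line(line):
    """Return a Variant, or None if the five required fields are not usable."""
    fields = line.split(None, 5)  # any whitespace; the remainder stays in one piece
    if len(fields) < 5:
        return None
    chrom, start, end, ref, obs = fields[:5]
    if not (start.isdigit() and end.isdigit()) or int(start) > int(end):
        return None
    other = fields[5].strip() if len(fields) > 5 else ""
    return Variant(chrom, int(start), int(end), ref, obs, other)


def split_valid_invalid(lines):
    """Separate parsable variants from lines that would need manual review."""
    valid, invalid = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        variant = parse_avinput_line(line)
        if variant is None:
            invalid.append(line.rstrip("\n"))
        else:
            valid.append(variant)
    return valid, invalid


if __name__ == "__main__":
    sample = [
        "16 49321279 49321279 - C rs2066847 c.3016_3017insC (NOD2)\n",
        "#chr start end ref obs\n",
    ]
    valid, invalid = split_valid_invalid(sample)
    print(valid)
    print(invalid)
```

In this sketch a header line starting with # ends up in the invalid list, which is in line with the documented default behaviour when --comment is not given.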
--------------------------------------------------------------------------------
ANNOVAR is free for academic, personal and non-profit use.
For questions or comments, please contact $Author: kaichop
<kaichop@gmail.com> $.
convert2annovar.pl
$cat mam/convert2annovar.txt
SYNOPSIS
convert2annovar.pl [arguments] <variantfile>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
--format <string> input format (default: pileup)
--includeinfo include supporting information in output
--outfile <file> output file name (default: STDOUT)
--snpqual <float> quality score threshold in pileup file (default: 20)
--snppvalue <float> SNP P-value threshold in GFF3-SOLiD file (default: 1)
--coverage <int> read coverage threshold in pileup file (default: 0)
--maxcoverage <int> maximum coverage threshold (default: none)
--chr <string> specify the chromosome (for CASAVA format)
--chrmt <string> chr identifier for mitochondria (default: M)
--fraction <float> minimum allelic fraction to claim a mutation (for pileup format)
--altcov <int> alternative allele coverage threshold (for pileup format)
--allelicfrac print out allelic fraction rather than het/hom status (for pileup format)
--species <string> if human, convert chr23/24/25 to X/Y/M (for gff3-solid format)
--filter <string> output variants with this filter (case insensitive, for vcf4 format)
--confraction <float> minimal fraction for two indel calls as a 0-1 value (for vcf4old format)
--allallele print all alleles rather than first one (for vcf4old format)
--withzyg print zygosity/coverage/quality when -includeinfo is used (for vcf4 format)
--comment keep comment line in output (for vcf4 format)
--allsample process all samples in file with separate output files (for vcf4 format)
--genoqual <float> genotype quality score threshold (for vcf4 format)
--varqual <float> variant quality score threshold (for vcf4 format)
--dbsnpfile <file> dbSNP file in UCSC format (for rsid format)
--withfreq for --allsample, print frequency information instead (for vcf4 format)
--withfilter print filter information in output (for vcf4 format)
--seqdir <string> directory with FASTA sequences (for region format)
--inssize <int> insertion size (for region format)
--delsize <int> deletion size (for region format)
--subsize <int> substitution size (default: 1, for region format)
--genefile <file> specify the gene file from UCSC (for transcript format)
--splicing_threshold <int> the splicing threshold (for transcript format)
--context <int> print context nucleotide for indels (for casava format)
--avsnpfile <file> specify the avSNP file (for rsid format)
--keepindelref keep Ref/Alt alleles for indels (for vcf4 format)
Function: convert variant call file generated from various software programs
into ANNOVAR input format
Example: convert2annovar.pl -format pileup -outfile variant.query variant.pileup
convert2annovar.pl -format cg -outfile variant.query variant.cg
convert2annovar.pl -format cgmastervar variant.masterVar.txt
convert2annovar.pl -format gff3-solid -outfile variant.query variant.snp.gff
convert2annovar.pl -format soap variant.snp > variant.avinput
convert2annovar.pl -format maq variant.snp > variant.avinput
convert2annovar.pl -format casava -chr 1 variant.snp > variant.avinput
convert2annovar.pl -format vcf4 variantfile > variant.avinput
convert2annovar.pl -format vcf4 -filter pass variantfile -allsample -outfile variant
convert2annovar.pl -format vcf4old input.vcf > output.avinput
convert2annovar.pl -format rsid snplist.txt -dbsnpfile snp138.txt > output.avinput
convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/
Version: $Date: 2018-04-16 00:48:00 -0400 (Mon, 16 Apr 2018) $
OPTIONS
--help print a brief usage message and detailed explanation of options.
--man print the complete manual of the program.
--verbose
use verbose output.
--format
the format of the input files. Currently supported formats include
pileup, cg, cgmastervar, gff3-solid, soap, maq, casava, vcf4,
vcf4old, rsid, region and transcript (the last two appear in the
examples above). In August 2013, the VCF file processing subroutine
was changed (multiple samples in a VCF file can be processed in a
genotype-aware manner), but users can use vcf4old to obtain
results identical to the old behavior.
--outfile
specify the output file name. By default, output is written to
STDOUT.
--snpqual
quality score threshold in the pileup file, such that variant
calls with lower quality scores will not be printed out in the
output file.
--snppvalue
SNP p-value threshold in the GFF3-SOLiD file, such that variant calls
with higher values will not be printed out in the output file.
--coverage
read coverage threshold in the pileup file, such that variant
calls generated with lower coverage will not be printed in the
output file.
--maxcoverage
maximum read coverage threshold in the pileup file, such that
variant calls generated with higher coverage will not be printed
in the output file.
--includeinfo
specify that the output should contain additional information in
the input line. By default, only the chr, start, end, reference
allele, observed allele and homozygosity status are included in
output files.
--chr specify the chromosome for CASAVA format
--chrmt specify the name of the mitochondrial chromosome (default is M according to the synopsis above)
--altcov
the minimum coverage of the alternative (mutated) allele to be
printed out in output
--allelicfrac
print out allelic fraction rather than het/hom status (for pileup
format). This is useful when processing mitochondria variants.
--fraction
specify the minimum fraction of alternative allele, to print out
the mutation. For example, if a site has 10 reads and 3 of them
support the alternative allele (a fraction of 0.3), a -fraction of
0.4 will not allow the mutation to be printed out.
--species
specify the species from which the sequencing data is obtained.
For the GFF3-SOLiD format, when species is human, chromosomes
23, 24 and 25 will be converted to X, Y and M, respectively.
--filter
for VCF4 file, only print out variant calls with this filter
annotated. For example, if using GATK VariantFiltration walker,
you will see PASS, GATKStandard, HARD_TO_VALIDATE, etc in the
filter field. Using 'pass' as a filter is recommended in this
case.
--allsample
for multi-sample VCF4 file, the --allsample argument will process
all samples in the file and generate separate output files for
each sample. By default, only the first sample in VCF4 file will
be processed.
--withzyg
for VCF4 format, print out zygosity information, coverage
information and genotype quality information when -includeinfo is
used. By default, this information is printed out only if
-includeinfo is not used.
--genoqual
minimum genotype quality for the variant in this sample, to be
printed out. The genotype quality is typically denoted as GQ in
the SAMPLE column
--varqual
minimum variant quality (the QUAL column in the VCF file) to
handle the variant in VCF file.
--comment
include VCF4 header comment lines in the output file
--genoqual
specify the genotype quality score to be included in the output
file
--varqual
specify the variant quality score to be included in the output
file
--dbsnpfile
specify the dbSNP file to query (for rsid format)
--withfreq
include frequency information in the output (for VCF format with
multiple samples)
--withfilter
include filter information in the output file (for VCF format)
--seqdir
specify the directory for sequence file (for region format)
--inssize
specify the insertion size when generating all mutations (for
region format)
--delsize
specify the deletion size when generating all mutations (for
region format)
--subsize
specify the substitution size when generating all mutations (for
region format)
--genefile
specify the gene file from UCSC, which can be refGene, knownGene
or ensGene (for transcript format)
--splicing_threshold
specify the splicing threshold (for transcript format)
--context
print context for indels which is useful to convert to VCF files
(for CASAVA format)
--avsnpfile
specify the avsnpfile that will be queried when using rsid as the
input file format
--keepindelref
do not alter the Ref and Alt alleles for indels in the VCF file
(by default the program automatically changes and shortens the Ref
and Alt allele)
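The --fraction rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the worked example (3 of 10 reads, threshold 0.4), not ANNOVAR's own code; the function name passes_fraction is hypothetical, and treating a fraction exactly equal to the threshold as passing is an assumption.

```python
def passes_fraction(alt_reads: int, total_reads: int, min_fraction: float) -> bool:
    """Return True when the alternative-allele fraction reaches min_fraction.

    Assumption: a fraction exactly equal to the threshold passes.
    """
    if total_reads <= 0:
        return False  # no coverage, so there is nothing to report
    return alt_reads / total_reads >= min_fraction


# Worked example from the --fraction description:
# 10 reads cover the site and 3 support the alternative allele (fraction 0.3).
print(passes_fraction(3, 10, 0.4))   # 0.3 is below 0.4, so the mutation is dropped
print(passes_fraction(3, 10, 0.25))  # 0.3 reaches 0.25, so the mutation is kept
```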
DESCRIPTION
This program is used to convert variant call files generated by various
software programs into ANNOVAR input format. Currently, the program can
handle Samtools genotype-calling pileup format, SOLiD GFF format, Complete
Genomics variant format, SOAP format, MAQ format, CASAVA format and VCF
format. These formats are described below.
* pileup format
The pileup format can be produced by the Samtools genotype-calling
subroutine. Note that the phrase pileup format can be used
in several instances, and here I am only referring to the pileup
files that contain the actual genotype calls.
Using SamTools, given an alignment file in BAM format, a pileup
file with genotype calls can be produced by the commands below:
samtools pileup -vcf ref.fa aln.bam > raw.pileup
samtools.pl varFilter raw.pileup > final.pileup
ANNOVAR will automatically filter the pileup file so that only
SNPs reaching a quality threshold are printed out (default is 20,
use --snpqual argument to change this). Most likely, users will
also want to apply a coverage threshold, such that SNP calls from
only a few reads are not considered. This can be achieved using
the -coverage argument (default value is 0).
An example of pileup files for SNPs is shown below:
chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/1A@B@@
chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>A@B?B;
chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./B=?@@>5.
chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<
chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%3B@%?%C?AB@BB/./-1A7?
The columns are chromosome, 1-based coordinate, reference base,
consensus base, consensus quality, SNP quality, maximum mapping
quality of the reads covering the sites, the number of reads
covering the site, read bases and base qualities.
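To make the ten SNP columns listed above concrete, here is a minimal Python sketch that splits one pileup line into named fields and applies thresholds in the spirit of --snpqual and --coverage. The names PileupSite, parse_pileup_snp_line and keep_site are hypothetical, the comparison operators are assumptions, and this is not how convert2annovar.pl is implemented. Indel lines (reference base *) have a different column layout and are not handled.

```python
from typing import NamedTuple


class PileupSite(NamedTuple):
    chrom: str
    pos: int             # 1-based coordinate
    ref: str             # reference base
    consensus: str       # consensus base (may be an IUPAC code such as K)
    consensus_qual: int
    snp_qual: int
    max_map_qual: int    # maximum mapping quality of the covering reads
    depth: int           # number of reads covering the site
    read_bases: str
    base_quals: str


def parse_pileup_snp_line(line: str) -> PileupSite:
    """Split one 10-column SNP pileup line into named fields."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return PileupSite(f[0], int(f[1]), f[2], f[3], int(f[4]),
                      int(f[5]), int(f[6]), int(f[7]), f[8], f[9])


def keep_site(site: PileupSite, snpqual: int = 20, coverage: int = 0) -> bool:
    """Hypothetical filter modeled on the documented --snpqual/--coverage defaults."""
    return (site.consensus != site.ref
            and site.snp_qual >= snpqual
            and site.depth >= coverage)


# One line taken from the SNP example above
line = "chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>A@B?B;"
site = parse_pileup_snp_line(line)
print(site.pos, site.consensus, site.snp_qual, keep_site(site))
```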
An example of pileup files for indels is shown below:
seq2 156 * +AG/+AG 71 252 99 11 +AG * 3 8 0
ANNOVAR automatically recognizes both SNPs and indels in the pileup
file, and processes them correctly.
* GFF3-SOLiD format
SOLiD provides a GFF3-compatible format for SNPs, indels and
structural variants. A typical example file is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version 2
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files Yoruban_snp_10x.txt
##run-path
chr_name AB_SOLiD SNP caller SNP coord coord 1 . . coverage=# cov;ref_base=ref;ref_score=score;ref_confi=confi;ref_single=Single;ref_paired=Paired;consen_base=consen;consen_score=score;consen_confi=conf;consen_single=Single;consen_paired=Paired;rs_id=rs_id,dbSNP129
1 AB_SOLiD SNP caller SNP 997 997 1 . . coverage=3;ref_base=A;ref_score=0.3284;ref_confi=0.9142;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6716;consen_confi=0.9349;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 2061 2061 1 . . coverage=2;ref_base=G;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.8985;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4770 4770 1 . . coverage=2;ref_base=A;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G;consen_score=1.0000;consen_confi=0.8854;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4793 4793 1 . . coverage=14;ref_base=A;ref_score=0.0723;ref_confi=0.8746;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6549;consen_confi=0.8798;consen_single=0/0;consen_paired=9/9
1 AB_SOLiD SNP caller SNP 6241 6241 1 . . coverage=2;ref_base=T;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.7839;consen_single=0/0;consen_paired=2/2
Newer versions of ABI BioScope now use the diBayes caller, and an
example of the output file is given below:
##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##List of SNPs. Date Sat Dec 18 10:30:45 2010 Stringency: medium Mate Pair: 1 Read Length: 50 Polymorphism Rate: 0.003000 Bayes Coverage: 60 Bayes_Single_SNP: 1 Filter_Single_SNP: 1 Quick_P_Threshold: 0.997000 Bayes_P_Threshold: 0.040000 Minimum_Allele_Ratio: 0.150000 Minimum_Allele_Ratio_Multiple_of_Dicolor_Error: 100
##1 chr1
##2 chr2
##3 chr3
##4 chr4
##5 chr5
##6 chr6
##7 chr7
##8 chr8
##9 chr9
##10 chr10
##11 chr11
##12 chr12
##13 chr13
##14 chr14
##15 chr15
##16 chr16
##17 chr17
##18 chr18
##19 chr19
##20 chr20
##21 chr21
##22 chr22
##23 chrX
##24 chrY
##25 chrM
# source-version SOLiD BioScope diBayes(SNP caller)
#Chr Source Type Pos_Start Pos_End Score Strand Phase Attributes
chr1 SOLiD_diBayes SNP 221367 221367 0.091151 . . genotype=R;reference=G;coverage=3;refAlleleCounts=1;refAlleleStarts=1;refAlleleMeanQV=29;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=27;diColor1=11;diColor2=33;het=1;flag=
chr1 SOLiD_diBayes SNP 555317 555317 0.095188 . . genotype=Y;reference=T;coverage=13;refAlleleCounts=11;refAlleleStarts=10;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=00;diColor2=22;het=1;flag=
chr1 SOLiD_diBayes SNP 555327 555327 0.037582 . . genotype=Y;reference=T;coverage=12;refAlleleCounts=6;refAlleleStarts=6;refAlleleMeanQV=19;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=12;diColor2=30;het=1;flag=
chr1 SOLiD_diBayes SNP 559817 559817 0.094413 . . genotype=Y;reference=T;coverage=9;refAlleleCounts=5;refAlleleStarts=4;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=14;diColor1=11;diColor2=33;het=1;flag=
chr1 SOLiD_diBayes SNP 714068 714068 0.000000 . . genotype=M;reference=C;coverage=13;refAlleleCounts=7;refAlleleStarts=6;refAlleleMeanQV=25;novelAlleleCounts=6;novelAlleleStarts=4;novelAlleleMeanQV=22;diColor1=00;diColor2=11;het=1;flag=
The file conforms to standard GFF3 specifications, but the last column is
SOLiD-specific and it gives certain parameters for the SNP calls.
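The SOLiD-specific last column is a semicolon-separated list of key=value pairs, which is easy to unpack. The Python sketch below parses one diBayes record from the example above; the function names are hypothetical. It splits on whitespace because that matches the records as pasted here, whereas a real GFF3 file is tab-delimited and should be split on tabs (the older "AB_SOLiD SNP caller" source name contains spaces).

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Turn 'key=value;key=value' into a dict; empty values such as 'flag=' are kept."""
    attrs = {}
    for item in column9.strip().split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


def parse_dibayes_line(line: str) -> dict:
    """Parse one diBayes SNP record into its nine GFF3 columns."""
    (chrom, source, ftype, start, end,
     score, strand, phase, attributes) = line.split(None, 8)
    return {
        "chrom": chrom,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "attributes": parse_gff3_attributes(attributes),
    }


line = ("chr1 SOLiD_diBayes SNP 221367 221367 0.091151 . . "
        "genotype=R;reference=G;coverage=3;refAlleleCounts=1;novelAlleleCounts=2;het=1;flag=")
rec = parse_dibayes_line(line)
print(rec["chrom"], rec["start"], rec["attributes"]["genotype"], rec["attributes"]["reference"])
```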
An example of the short indel format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
##type DNA
##date 2009-01-26
##time 18:33:20
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files ../../mp-results/JOAN_20080104_1.pas,../../mp-results/BARB_20071114_1.pas,../../mp-results/BARB_20080227_2.pas
##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-w2x25-2x-4x-8x-10x/2x
##Filter-settings: max-ave-read-pos=none,min-ave-from-end-pos=9.1,max-nonreds-4filt=2,min-insertion-size=none,min-deletion-size=none,max-insertion-size=none,max-deletion-size=none,require-called-indel-size?=T
chr1 AB_SOLiD Small Indel Tool deletion 824501 824501 1 . . del_len=1;tight_chrom_pos=824501-824502;loose_chrom_pos=824501-824502;no_nonred_reads=2;no_mismatches=1,0;read_pos=4,6;from_end_pos=21,19;strands=+,-;tags=R3,F3;indel_sizes=-1,-1;read_seqs=G3021212231123203300032223,T3321132212120222323222101;dbSNP=rs34941678,chr1:824502-824502(-),EXACT,1,/GG
chr1 AB_SOLiD Small Indel Tool insertion_site 1118641 1118641 1 . . ins_len=3;tight_chrom_pos=1118641-1118642;loose_chrom_pos=1118641-1118642;no_nonred_reads=2;no_mismatches=0,1;read_pos=17,6;from_end_pos=8,19;strands=+,+;tags=F3,R3;indel_sizes=3,3;read_seqs=T0033001100022331122033112,G3233112203311220000001002
The keyword deletion or insertion_site is used in the fourth
column to indicate the variant type in this file format.
An example of the medium CNV format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
##type DNA
##date 2009-01-27
##time 15:54:36
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files big_d20e5-del12n_up-ConsGrp-2nonred.pas.sum
##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-results-lmp-e5/big_d20e5-indel_950_2050
chr1 AB_SOLiD Small Indel Tool deletion 3087770 3087831 1 . . del_len=62;tight_chrom_pos=none;loose_chrom_pos=3087768-3087773;no_nonred_reads=2;no_mismatches=2,2;read_pos=27,24;from_end_pos=23,26;strands=-,+;tags=F3,F3;indel_sizes=-62,-62;read_seqs=T11113022103331111130221213201111302212132011113022,T02203111102312122031111023121220311111333012203111
chr1 AB_SOLiD Small Indel Tool deletion 4104535 4104584 1 . . del_len=50;tight_chrom_pos=4104534-4104537;loose_chrom_pos=4104528-4104545;no_nonred_reads=3;no_mismatches=0,4,4;read_pos=19,19,27;from_end_pos=31,31,23;strands=+,+,-;tags=F3,R3,R3;indel_sizes=-50,-50,-50;read_seqs=T31011011013211110130332130332132110110132020312332,G21031011013211112130332130332132110132132020312332,G20321302023001101123123303103303101113231011011011
chr1 AB_SOLiD Small Indel Tool insertion_site 2044888 2044888 1 . . ins_len=18;tight_chrom_pos=2044887-2044888;loose_chrom_pos=2044887-2044889;no_nonred_reads=2;bead_ids=1217_1811_209,1316_908_1346;no_mismatches=0,2;read_pos=13,15;from_end_pos=37,35;strands=-,-;tags=F3,F3;indel_sizes=18,18;read_seqs=T31002301231011013121000101233323031121002301231011,T11121002301231011013121000101233323031121000101231;non_indel_no_mismatches=3,1;non_indel_seqs=NIL,NIL
chr1 AB_SOLiD Small Indel Tool insertion_site 74832565 74832565 1 . . ins_len=16;tight_chrom_pos=74832545-74832565;loose_chrom_pos=74832545-74832565;no_nonred_reads=2;bead_ids=1795_181_514,1651_740_519;no_mismatches=0,2;read_pos=13,13;from_end_pos=37,37;strands=-,-;tags=F3,R3;indel_sizes=16,16;read_seqs=T33311111111111111111111111111111111111111111111111,G23311111111111111111111111111111111111111311011111;non_indel_no_mismatches=1,0;non_indel_seqs=NIL,NIL
An example of the large indel format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version ???
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files /data/results5/yoruban_strikes_back_large_indels/LMP/five_mm_unique_hits_no_rescue/5_point_6x_del_lib_1/results/NA18507_inter_read_indels_5_point_6x.dat
##run-path
chr1 AB_SOLiD Large Indel Tool insertion_site 1307279 1307791 1 . . deviation=-742;stddev=7.18;ref_clones=-;dev_clones=4
chr1 AB_SOLiD Large Indel Tool insertion_site 2042742 2042861 1 . . deviation=-933;stddev=8.14;ref_clones=-;dev_clones=3
chr1 AB_SOLiD Large Indel Tool insertion_site 2443482 2444342 1 . . deviation=-547;stddev=11.36;ref_clones=-;dev_clones=17
chr1 AB_SOLiD Large Indel Tool insertion_site 2932046 2932984 1 . . deviation=-329;stddev=6.07;ref_clones=-;dev_clones=14
chr1 AB_SOLiD Large Indel Tool insertion_site 3166925 3167584 1 . . deviation=-752;stddev=13.81;ref_clones=-;dev_clones=14
An example of the CNV format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version ???
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files Yoruban_cnv.coords
##run-path
chr1 AB_CNV_PIPELINE repeat_region 1062939 1066829 . . . fraction_mappable=51.400002;logratio=-1.039300;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 1073630 1078667 . . . fraction_mappable=81.000000;logratio=-1.409500;copynum=1;numwindows=2
chr1 AB_CNV_PIPELINE repeat_region 2148325 2150352 . . . fraction_mappable=98.699997;logratio=-1.055000;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 2245558 2248109 . . . fraction_mappable=78.400002;logratio=-1.042900;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 3489252 3492632 . . . fraction_mappable=59.200001;logratio=-1.119900;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 5654415 5657276 . . . fraction_mappable=69.900002;logratio=1.114500;copynum=4;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 9516165 9522726 . . . fraction_mappable=65.850006;logratio=-1.316700;numwindows=2
chr1 AB_CNV_PIPELINE repeat_region 16795117 16841025 . . . fraction_mappable=44.600002;logratio=1.880778;copynum=7;numwindows=9
The keyword repeat_region is used here, although it actually
refers to CNVs.
An example of the inversion format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.2
##generated by SOLiD inversion tool
chr10 AB_SOLiD inversion 46443107 46479585 268.9 . . left=chr10:46443107-46443146;right=chr10:46479583-46479585;leftscore=295.0;rightscore=247.0;count_AAA_further_left=117;count_AAA_left=3;count_AAA_right=3;count_AAA_further_right=97;left_min_count_AAA=chr10:46443107-46443112;count_AAA_min_left=0;count_AAA_max_left=3;right_min_count_AAA=chr10:46479585-46479585;count_AAA_min_right=1;count_AAA_max_right=3;homozygous=UNKNOWN
chr4 AB_SOLiD inversion 190822813 190850112 214.7 . . left=chr4:190822813-190822922;right=chr4:190850110-190850112;leftscore=140.0;rightscore=460.0;count_AAA_further_left=110;count_AAA_left=78;count_AAA_right=74;count_AAA_further_right=77;left_min_count_AAA=chr4:190822813-190822814;count_AAA_min_left=69;count_AAA_max_left=77;right_min_count_AAA=chr4:190850110-190850112;count_AAA_min_right=74;count_AAA_max_right=74;homozygous=NO
chr6 AB_SOLiD inversion 168834969 168837154 175.3 . . left=chr6:168834969-168835496;right=chr6:168836643-168837154;leftscore=185.4;rightscore=166.2;count_AAA_further_left=67;count_AAA_left=43;count_AAA_right=40;count_AAA_further_right=59;left_min_count_AAA=chr6:168835058-168835124,chr6:168835143-168835161,chr6:168835176-168835181,chr6:168835231-168835262;count_AAA_min_left=23;count_AAA_max_left=29;right_min_count_AAA=chr6:168836643-168836652;count_AAA_min_right=23;count_AAA_max_right=31;homozygous=NO
The program should be able to recognize all the above GFF3-SOLiD
formats automatically, and handle them accordingly.
* Complete Genomics format
This format is provided by the Complete Genomics company to their
customers. The file var-[ASM-ID].tsv.bz2 includes a description of
all loci where the assembled genome differs from the reference
genome.
An example of the Complete Genomics format is shown below:
#BUILD 1.5.0.5
#GENERATED_AT 2009-Nov-03 19:52:21.722927
#GENERATED_BY dbsnptool
#TYPE VAR-ANNOTATION
#VAR_ANN_SET /Proj/Pipeline/Production_Data/REF/HUMAN-F_06-REF/dbSNP.csv
#VAR_ANN_TYPE dbSNP
#VERSION 0.3
>locus ploidy haplotype chromosome begin end varType reference alleleSeq totalScore hapLink xRef
1 2 all chr1 0 959 no-call = ?
2 2 all chr1 959 972 = = =
3 2 all chr1 972 1001 no-call = ?
4 2 all chr1 1001 1008 = = =
5 2 all chr1 1008 1114 no-call = ?
6 2 all chr1 1114 1125 = = =
7 2 all chr1 1125 1191 no-call = ?
8 2 all chr1 1191 1225 = = =
9 2 all chr1 1225 1258 no-call = ?
10 2 all chr1 1258 1267 = = =
12 2 all chr1 1267 1275 no-call = ?
13 2 all chr1 1275 1316 = = =
14 2 all chr1 1316 1346 no-call = ?
15 2 all chr1 1346 1367 = = =
16 2 all chr1 1367 1374 no-call = ?
17 2 all chr1 1374 1388 = = =
18 2 all chr1 1388 1431 no-call = ?
19 2 all chr1 1431 1447 = = =
20 2 all chr1 1447 1454 no-call = ?
The following information is provided in documentation from
Complete Genomics, that describes the var-ASM format.
1. locus. Identifier of a particular genomic locus
2. ploidy. The ploidy of the reference genome at the locus (= 2 for autosomes, 2 for pseudoautosomal regions on the sex chromosomes, 1 for males on the non-pseudoautosomal parts of the sex chromosomes, 1 for mitochondrion, '?' if varType is 'no-ref' or 'PAR-called-in-X'). The reported ploidy is fully determined by gender, chromosome and location, and is not inferred from the sequence data.
3. haplotype. Identifier for each haplotype at the variation locus. For diploid genomes, 1 or 2. Shorthand of 'all' is allowed where the varType field is one of 'ref', 'no-call', 'no-ref', or 'PAR-called-in-X'. Haplotype numbering does not imply phasing; haplotype 1 in locus 1 is not necessarily in phase with haplotype 1 in locus 2. See hapLink, below, for phasing information.
4. chromosome. Chromosome name in text: 'chr1','chr2', ... ,'chr22','chrX','chrY'. The mitochondrion is represented as 'chrM'. The pseudoautosomal regions within the sex chromosomes X and Y are reported at their coordinates on chromosome X.
5. begin. Reference coordinate specifying the start of the variation (not the locus) using the half-open zero-based coordinate system. See section 'Sequence Coordinate System' for more information.
6. end. Reference coordinate specifying the end of the variation (not the locus) using the half-open zero-based coordinate system. See section 'Sequence Coordinate System' for more information.
7. varType. Type of variation, currently one of:
snp: single-nucleotide polymorphism
ins: insertion
del: deletion
sub: Substitution of one or more reference bases with the bases in the allele column
'ref' : no variation; the sequence is identical to the reference sequence on the indicated haplotype
no-call-rc: 'no-call reference consistent' one or more bases are ambiguous, but the allele is potentially consistent with the reference
no-call-ri: 'no-call reference inconsistent' one or more bases are ambiguous, but the allele is definitely inconsistent with the reference
no-call: an allele is completely indeterminate in length and composition, i.e. alleleSeq = '?'
no-ref: the reference sequence is unspecified at this locus.
PAR-called-in-X: this locus overlaps one of the pseudoautosomal regions on the sex chromosomes. The called sequence is reported as diploid sequence on Chromosome X; on chromosome Y the sequence is reported as varType = 'PAR-called-in-X'.
8. reference. The reference sequence for the locus of variation. Empty when varType is ins. A value of '=' indicates that the user must consult the reference for the sequence; this shorthand is only used in regions where no haplotype deviates from the reference sequence.
9. alleleSeq. The observed sequence at the locus of variation. Empty when varType is del. '?' is used to indicate 0 or more unknown bases within the sequence; 'N' is used to indicate exactly one unknown base within the sequence. '=' is used as shorthand to indicate identity to the reference sequence for non-variant sequence, i.e. when varType is 'ref'.
10. totalScore. A score corresponding to a single variation and haplotype, representing the confidence in the call.
11. hapLink. Identifier that links a haplotype at one locus to haplotypes at other loci. Currently only populated for very proximate variations that were assembled together. Two calls that share a hapLink identifier are expected to be on the same haplotype.
12. xRef. Field containing external variation identifiers, currently only populated for variations corroborated directly by dbSNP. Format: dbsnp:[rsID], with multiple entries separated by the semicolon (;).
In older versions of the format specification, the sub keyword
was called insdel. ANNOVAR takes care of this.
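Because the begin and end fields use a half-open zero-based coordinate system, they need to be shifted before they can be compared with 1-based positions such as those in the pileup format. The Python sketch below shows that conversion on one row of the example above. The function names are hypothetical and this is not ANNOVAR's own conversion code; zero-length intervals (begin equal to end, as for an insertion) are deliberately rejected because their representation is a separate decision. The whitespace split suits the rows as pasted here, while real files are tab-delimited and may contain empty fields.

```python
def cg_interval_to_one_based(begin: int, end: int) -> tuple:
    """Convert a half-open zero-based interval [begin, end) to 1-based inclusive."""
    if end <= begin:
        raise ValueError("zero-length interval (e.g. an insertion) needs separate handling")
    return begin + 1, end


def parse_cg_locus(line: str) -> dict:
    """Read the first seven columns of a var-ASM row as pasted in the example."""
    f = line.split()
    start, end = cg_interval_to_one_based(int(f[4]), int(f[5]))
    return {
        "locus": int(f[0]),
        "ploidy": f[1],        # kept as text because '?' is allowed
        "haplotype": f[2],
        "chrom": f[3],
        "start": start,        # 1-based, inclusive
        "end": end,            # 1-based, inclusive
        "var_type": f[6],
    }


row = parse_cg_locus("2 2 all chr1 959 972 = = =")
print(row["chrom"], row["start"], row["end"])
```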
* SOAPsnp format
An example of the SOAP SNP caller format is shown below:
chr8 35782 A R 1 A 27 1 2 G 26 1 2 5 0.500000 2.00000 1 5
chr8 35787 G R 0 G 25 4 6 A 17 2 4 10 0.266667 1.60000 0 5
The following information is provided in documentation from BGI
who developed SOAP suite. It differs slightly from the description
at the SOAPsnp website, and presumably the website is outdated.
Format description:(left to right)
1. Chromosome name
2. Position of locus
3. Nucleotide at corresponding locus of reference sequence
4. Genotype of sequencing sample
5. Quality value
6. nucleotide with the highest probability(first nucleotide)
7. Quality value of the nucleotide with the highest probability
8. Number of supported reads that can only be aligned to this locus
9. Number of all supported reads that can be aligned to this locus
10. Nucleotide with higher probability
11. Quality value of nucleotide with higher probability
12. Number of supported reads that can only be aligned to this locus
13. Number of all supported reads that can be aligned to this locus
14. Total number of reads that can be aligned to this locus
15. Order and quality value
16. Estimated copy number for this locus
17. Presence of this locus in the dbSNP database. 1 refers to presence and 0 refers to absence
18. The distance between this locus and another closest SNP
Later SOAPsnp changed its output format to 17 columns. An example of the format is shown below:
1 12837840 G C 12 C 37 5 5 G 0 0 0 5 1.00000 1.00000 0
1 12853805 T K 0 T 39 1 1 G 35 1 1 2 1.00000 1.00000 0
The following information is provided on SOAPsnp website as of
16Apr2013, and it is slightly different from the documentation
with SOAPsnp, which only has 14 columns.
The result of SOAPsnp has 17 columns:
1) Chromosome ID
2) Coordinate on chromosome, start from 1
3) Reference genotype
4) Consensus genotype
5) Quality score of consensus genotype
6) Best base
7) Average quality score of best base
8) Count of uniquely mapped best base
9) Count of all mapped best base
10) Second best bases
11) Average quality score of second best base
12) Count of uniquely mapped second best base
13) Count of all mapped second best base
14) Sequencing depth of the site
15) Rank sum test p_value
16) Average copy number of nearby region
17) Whether the site is a dbSNP.
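The 17-column layout listed above maps naturally onto named fields. The Python sketch below is one way to read such a record; the column names are my own labels for the 17 descriptions, and the numeric types (integer counts, floating-point averages) are assumptions based on the example record rather than on any SOAPsnp specification.

```python
SOAPSNP_COLUMNS = [
    "chrom", "pos", "ref", "consensus", "consensus_qual",
    "best_base", "best_avg_qual", "best_unique_count", "best_all_count",
    "second_base", "second_avg_qual", "second_unique_count", "second_all_count",
    "depth", "rank_sum_p", "copy_number", "is_dbsnp",
]
INT_COLUMNS = {"pos", "consensus_qual", "best_unique_count", "best_all_count",
               "second_unique_count", "second_all_count", "depth"}
FLOAT_COLUMNS = {"best_avg_qual", "second_avg_qual", "rank_sum_p", "copy_number"}


def parse_soapsnp_line(line: str) -> dict:
    """Parse one 17-column SOAPsnp record into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(SOAPSNP_COLUMNS):
        raise ValueError(f"expected 17 columns, got {len(fields)}")
    rec = dict(zip(SOAPSNP_COLUMNS, fields))
    for name in INT_COLUMNS:
        rec[name] = int(rec[name])
    for name in FLOAT_COLUMNS:
        rec[name] = float(rec[name])
    rec["is_dbsnp"] = rec["is_dbsnp"] == "1"
    return rec


rec = parse_soapsnp_line("1 12837840 G C 12 C 37 5 5 G 0 0 0 5 1.00000 1.00000 0")
print(rec["chrom"], rec["pos"], rec["ref"], rec["consensus"], rec["depth"], rec["is_dbsnp"])
```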
* SOAPindel format
The current version of ANNOVAR handles SoapSNP and SoapIndel
automatically via a single argument '--format soap'. An example of
SOAP indel caller format is shown below:
chr11 44061282 - +2 CT Hete
chr11 45901572 + +1 C Hete
chr11 48242562 * -3 TTC Homo
chr11 57228723 * +4 CTTT Homo
chr11 57228734 * +4 CTTT Homo
chr11 57555685 * -1 C Hete
chr11 61482191 - +3 TCC Hete
chr11 64608031 * -1 T Homo
chr11 64654936 * +1 C Homo
chr11 71188303 + -1 T Hete
chr11 75741034 + +1 T Hete
chr11 76632438 * +1 A Hete
chr11 89578266 * -2 AG Homo
chr11 104383261 * +1 T Hete
chr11 124125940 + +4 CCCC Hete
chr12 7760052 * +1 T Homo
chr12 8266049 * +3 ACG Homo
I have not seen any documentation describing this format as of
September 2010.
* SOAPsv format
An example is given below:
Chr2 Deletion 42894 43832 43167 43555 388 0-0-0 FR 41
An explanation of the structural variation format is given below:
Format description (from left to right)
1. Chromosome name
2. Type of structural variation
3. Minimal value of start position in cluster
4. Maximal value of end position in cluster
5. Estimated start position of this structural variation
6. Estimated end position of this structural variation
7. Length of SV
8. Breakpoint of SV (only for insertion)
9. Unusual matching mode (F refers to alignment with the forward sequence, R refers to alignment with the reverse sequence)
10. Number of paired-end reads which support this structural variation
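The ten columns above can be read into a small record type, as in the Python sketch below. The names SoapSv and parse_soapsv_line are hypothetical. In the single example record, the length column equals the estimated end minus the estimated start (43555 - 43167 = 388); whether that relationship holds for every record is an assumption I have not verified.

```python
from typing import NamedTuple


class SoapSv(NamedTuple):
    chrom: str
    sv_type: str
    cluster_start: int   # minimal start position in the cluster
    cluster_end: int     # maximal end position in the cluster
    est_start: int       # estimated start of the structural variation
    est_end: int         # estimated end of the structural variation
    length: int
    breakpoint: str      # only meaningful for insertions
    match_mode: str      # F = forward, R = reverse
    support: int         # number of supporting paired-end reads


def parse_soapsv_line(line: str) -> SoapSv:
    """Split one SOAPsv record into its ten documented columns."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return SoapSv(f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
                  int(f[5]), int(f[6]), f[7], f[8], int(f[9]))


sv = parse_soapsv_line("Chr2 Deletion 42894 43832 43167 43555 388 0-0-0 FR 41")
print(sv.sv_type, sv.est_start, sv.est_end, sv.length, sv.support)
```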
* MAQ format
MAQ can perform alignment and generate genotype calls, including
SNP calls and indel calls. The format is described below:
For indel header: The output is TAB delimited with each line
consisting of chromosome, start position, type of the indel,
number of reads across the indel, size of the indel and
inserted/deleted nucleotides (separated by colon), number of
indels on the reverse strand, number of indels on the forward
strand, 5' sequence ahead of the indel, 3' sequence following the
indel, number of reads aligned without indels and three additional
columns for filters.
An example is below:
chr10 110583 - 2 -2:AG 0 1 GCGAGACTCAGTATCAAAAAAAAAAAAAAAAA AGAAAGAAAGAAAAAGAAAAAAATAGAAAGAA 1 @2, @72, @0,
chr10 120134 - 8 -2:CA 0 1 CTCTTGCCCGCTCACACATGTACACACACGCG CACACACACACACACACATCAGCTACCTACCT 7 @65,62,61,61,45,22,7, @9,12,13,13,29,52,67, @0,0,0,0,0,0,0,
chr10 129630 - 1 -1:T 1 0 ATGTTGTGACTCTTAATGGATAAGTTCAGTCA TTTTTTTTTAGCTTTTAACCGGACAAAAAAAG 0 @ @ @
chr10 150209 - 1 4:TTCC 1 0 GCATATAGGGATGGGCACTTTACCTTTCTTTT TTCCTTCCTTCCTTCCTTCCCTTTCCTTTCCT 0 @ @ @
chr10 150244 - 2 -4:TTCT 0 1 CTTCCTTCCTTCCTTCCCTTTCCTTTCCTTTC TTCTTTCTTTCTTTCTTTCTTTTTTTTTTTTT 0 @ @ @
chr10 159622 - 1 3:AGG 0 1 GAAGGAGGAAGGACGGAAGGAGGAAGGAAGGA AGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGA 0 @ @ @
chr10 206372 - 2 2:GT 1 0 ATAATAGTAACTGTGTATTTGATTATGTGTGC GTGTGTGTGTGTGTGTGTGTGTGTGCGTGCTT 1 @37, @37, @8,
chr10 245751 - 11 -1:C 0 1 CTCATAAATACAAGTCATAATGAAAGAAATTA CCACCATTTTCTTATTTTCATTCATTTTTAGT 10 @69,64,53,41,30,25,22,14,5,4, @5,10,21,33,44,49,52,60,69,70, @0,0,0,0,0,0,0,0,0,0,
chr10 253066 - 1 2:TT 0 1 TATTGATGAGGGTGGATTATACTTTAGAACAC TATTCAAACAGTTCTTCCACATATCTCCCTTT 0 @ @ @
chr10 253455 - 2 -3:AAA 1 0 GTTGCACTCCAGCCTGGCGAGATTCTGTCTCC AAAAAAAAAAAAAAAAATTGTTGTGAAATACA 1 @55, @19, @4,
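The fifth column of the indel output packs the indel size and the inserted or deleted nucleotides into one colon-separated token such as -2:AG or 4:TTCC. The Python sketch below unpacks it along with the other columns. The function name is hypothetical, and reading a negative size as a deletion and a positive size as an insertion is an assumption inferred from the example rows, not from MAQ documentation.

```python
def parse_maq_indel_line(line: str) -> dict:
    """Parse one 13-column MAQ indel record as pasted in the example above."""
    f = line.split()
    if len(f) != 13:
        raise ValueError(f"expected 13 columns, got {len(f)}")
    size_text, _, bases = f[4].partition(":")
    size = int(size_text)
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "indel_type": f[2],
        "reads_across": int(f[3]),
        "size": size,
        "bases": bases,
        # Assumption: the sign of the size distinguishes deletions from insertions
        "kind": "deletion" if size < 0 else "insertion",
        "reverse_count": int(f[5]),
        "forward_count": int(f[6]),
        "seq_5prime": f[7],
        "seq_3prime": f[8],
        "reads_without_indel": int(f[9]),
        "filters": f[10:13],
    }


line = ("chr10 110583 - 2 -2:AG 0 1 GCGAGACTCAGTATCAAAAAAAAAAAAAAAAA "
        "AGAAAGAAAGAAAAAGAAAAAAATAGAAAGAA 1 @2, @72, @0,")
indel = parse_maq_indel_line(line)
print(indel["start"], indel["kind"], indel["size"], indel["bases"])
```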
For snp output file: Each line consists of chromosome, position,
reference base, consensus base, Phred-like consensus quality, read
depth, the average number of hits of reads covering this position,
the highest mapping quality of the reads covering the position,
the minimum consensus quality in the 3bp flanking regions at each
side of the site (6bp in total), the second best call, log
likelihood ratio of the second best and the third best call, and
the third best call.
An example is below:
chr10 83603 C T 28 12 2.81 63 34 Y 26 C
chr10 83945 G R 59 61 4.75 63 62 A 47 G
chr10 83978 G R 47 40 3.31 63 62 A 21 G
chr10 84026 G R 89 22 2.44 63 62 G 49 A
chr10 84545 C T 54 9 1.69 63 30 N 135 N
chr10 85074 G A 42 5 1.19 63 38 N 108 N
chr10 85226 A T 42 5 1.00 63 42 N 107 N
chr10 85229 C T 42 5 1.00 63 42 N 112 N
chr10 87518 A G 39 4 3.25 63 38 N 9 N
chr10 116402 T C 39 4 1.00 63 38 N 76 N
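A minimal sketch of reading one line of this SNP output follows. The function and key names are hypothetical. It assumes the consensus base uses IUPAC ambiguity codes (the `R` and `T` values in the example suggest this, but the text above does not say so) and that columns can be split on whitespace.

```python
# Standard IUPAC codes for one base or a two-base heterozygous call.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def non_reference_alleles(ref, consensus):
    """Return the bases implied by the consensus code that differ from ref."""
    return [base for base in IUPAC[consensus.upper()] if base != ref.upper()]


def parse_snp_line(line):
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return {
        "chrom": f[0],
        "pos": int(f[1]),
        "ref": f[2],
        "consensus": f[3],
        "consensus_quality": int(f[4]),
        "depth": int(f[5]),
        "avg_hits": float(f[6]),
        "max_mapping_quality": int(f[7]),
        "min_flank_quality": int(f[8]),
        "second_best": f[9],
        "log_likelihood_ratio": int(f[10]),
        "third_best": f[11],
    }


snp = parse_snp_line("chr10 83945 G R 59 61 4.75 63 62 A 47 G")
print(snp["pos"], non_reference_alleles(snp["ref"], snp["consensus"]))
```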
* CASAVA format
An example of Illumina CASAVA format is given below:
#position A C G T modified_call total used score reference type
14930 3 0 8 0 GA 11 11 29.10:11.10 A SNP_het2
14933 4 0 7 0 GA 11 11 23.50:13.70 G SNP_het1
14976 3 0 8 0 GA 11 11 24.09:9.10 G SNP_het1
15118 2 1 4 0 GA 8 7 10.84:6.30 A SNP_het2
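Because the CASAVA SNP file carries its column names in a `#`-prefixed header, a reader can be driven by that header instead of fixed positions. The sketch below is an illustration under that assumption; `parse_casava_snps` is a hypothetical name and the added `counts` key is a convenience that is not part of the format.

```python
def parse_casava_snps(lines):
    """Turn CASAVA SNP lines into dicts keyed by the '#position ...' header."""
    header = None
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split()
            continue
        if header is None:
            raise ValueError("data line seen before the header line")
        record = dict(zip(header, line.split()))
        # Per-base read counts as integers, for convenience.
        record["counts"] = {base: int(record[base]) for base in "ACGT"}
        records.append(record)
    return records


sample = [
    "#position A C G T modified_call total used score reference type",
    "14930 3 0 8 0 GA 11 11 29.10:11.10 A SNP_het2",
    "14933 4 0 7 0 GA 11 11 23.50:13.70 G SNP_het1",
]
for rec in parse_casava_snps(sample):
    print(rec["position"], rec["reference"], rec["modified_call"], rec["counts"])
```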
An example of the indels is given below:
# ** CASAVA depth-filtered indel calls **
#$ CMDLINE /illumina/pipeline/install/CASAVA_v1.7.0/libexec/CASAVA-1.7.0/filterIndelCalls.pl --meanReadDepth=2.60395068970547 --indelsCovCutoff=-1 --chrom=chr1.fa /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0000.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0001.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0002.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0003.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0004.txt
#$ CHROMOSOME chr1.fa
#$ MAX_DEPTH undefined
#
#$ COLUMNS pos CIGAR ref_upstream ref/indel ref_downstream Q(indel) max_gtype Q(max_gtype) max2_gtype bp1_reads ref_reads indel_reads other_reads repeat_unit ref_repeat_count indel_repeat_count
948847 1I CCTCAGGCTT -/A ATAATAGGGC 969 hom 47 het 22 0 16 6 A 1 2
978604 2D CACTGAGCCC CT/-- GTGTCCTTCC 251 hom 20 het 8 0 4 4 CT 1 0
1276974 4I CCTCATGCAG ----/ACAC ACACATGCAC 838 hom 39 het 18 0 14 4 AC 2 4
1289368 2D AGCCCGGGAC TG/-- GGAGCCGCGC 1376 hom 83 het 33 0 25 9 TG 1 0
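The indel file names its columns on the `#$ COLUMNS` line, so the same header-driven idea applies. The following sketch is not taken from convert2annovar.pl: `parse_casava_indels` is a hypothetical helper, and it only understands the simple CIGAR forms that appear in the example (`<n>I` and `<n>D`).

```python
import re

# Only the simple forms seen in the example, such as "1I" or "2D".
CIGAR_RE = re.compile(r"^(\d+)([ID])$")


def parse_casava_indels(lines):
    columns = None
    records = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#$ COLUMNS"):
            columns = line.split()[2:]   # drop the "#$" and "COLUMNS" tokens
            continue
        if not line or line.startswith("#"):
            continue                     # other header or comment lines
        if columns is None:
            raise ValueError("no '#$ COLUMNS' line before the first data line")
        rec = dict(zip(columns, line.split()))
        match = CIGAR_RE.match(rec["CIGAR"])
        if match is None:
            raise ValueError(f"unsupported CIGAR: {rec['CIGAR']}")
        rec["length"] = int(match.group(1))
        rec["kind"] = "insertion" if match.group(2) == "I" else "deletion"
        rec["ref_allele"], rec["indel_allele"] = rec["ref/indel"].split("/")
        records.append(rec)
    return records


sample = [
    "#$ COLUMNS pos CIGAR ref_upstream ref/indel ref_downstream Q(indel) "
    "max_gtype Q(max_gtype) max2_gtype bp1_reads ref_reads indel_reads "
    "other_reads repeat_unit ref_repeat_count indel_repeat_count",
    "948847 1I CCTCAGGCTT -/A ATAATAGGGC 969 hom 47 het 22 0 16 6 A 1 2",
    "978604 2D CACTGAGCCC CT/-- GTGTCCTTCC 251 hom 20 het 8 0 4 4 CT 1 0",
]
for rec in parse_casava_indels(sample):
    print(rec["pos"], rec["kind"], rec["length"], rec["ref_allele"], rec["indel_allele"])
```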
* VCF4 format
VCF4 can describe either population-level variation or variants
called from the reads of a single individual.
One example of the indel format for one individual is given below:
##fileformat=VCFv4.0
##IGv2_bam_file_used=MIAPACA2.alnReAln.bam
##INFO=<ID=AC,Number=2,Type=Integer,Description="# of reads supporting consensus indel/any indel at the site">
##INFO=<ID=DP,Number=1,Type=Integer,Description="total coverage at the site">
##INFO=<ID=MM,Number=2,Type=Float,Description="average # of mismatches per consensus indel-supporting read/per reference-supporting read">
##INFO=<ID=MQ,Number=2,Type=Float,Description="average mapping quality of consensus indel-supporting reads/reference-supporting reads">
##INFO=<ID=NQSBQ,Number=2,Type=Float,Description="Within NQS window: average quality of bases from consensus indel-supporting reads/from reference-supporting reads">
##INFO=<ID=NQSMM,Number=2,Type=Float,Description="Within NQS window: fraction of mismatching bases in consensus indel-supporting reads/in reference-supporting reads">
##INFO=<ID=SC,Number=4,Type=Integer,Description="strandness: counts of forward-/reverse-aligned indel-supporting reads / forward-/reverse-aligned reference supporting reads">
##IndelGenotyperV2=""
##reference=hg18.fa
##source=IndelGenotyperV2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Miapaca_trimmed_sorted.bam
chr1 439 . AC A . PASS AC=5,5;DP=7;MM=7.0,3.0;MQ=23.4,1.0;NQSBQ=23.98,25.5;NQSMM=0.04,0.0;SC=2,3,0,2 GT 1/0
chr1 714048 . T TCAAC . PASS AC=3,3;DP=9;MM=3.0,7.1666665;MQ=1.0,10.833333;NQSBQ=23.266666,21.932203;NQSMM=0.0,0.15254237;SC=3,0,3,3 GT 0/1
chr1 714049 . G GC . PASS AC=3,3;DP=9;MM=3.0,7.1666665;MQ=1.0,10.833333;NQSBQ=23.233334,21.83051;NQSMM=0.0,0.15254237;SC=3,0,3,3 GT 0/1
chr1 813675 . A AATAG . PASS AC=5,5;DP=8;MM=0.4,1.0;MQ=5.0,67.0;NQSBQ=25.74,25.166666;NQSMM=0.0,0.033333335;SC=4,1,1,2 GT 0/1
chr1 813687 . AGAGAGAGAGAAG A . PASS AC=5,5;DP=8;MM=0.4,1.0;MQ=5.0,67.0;NQSBQ=24.54,25.2;NQSMM=0.02,0.06666667;SC=4,1,1,2 GT 1/0
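To make the record layout concrete, here is a small sketch that splits one of the data lines above into its fixed columns and unpacks the `INFO` field into key/value pairs. The helper names are hypothetical. Real VCF files are tab-separated; splitting on any whitespace here is an assumption that matches the example as printed and would break on INFO values containing spaces.

```python
def parse_info(info):
    """Unpack 'AC=5,5;DP=7' into {'AC': ['5', '5'], 'DP': ['7']}."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Keys without '=' are flags; keep them as True.
        parsed[key] = value.split(",") if sep else True
    return parsed


def parse_vcf_line(line):
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    alts = alt.split(",")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alts,
        "qual": qual,
        "filter": filt,
        "info": parse_info(info),
        # Positive = bases gained (insertion), negative = bases lost (deletion).
        "length_change": [len(a) - len(ref) for a in alts],
    }


line = ("chr1 439 . AC A . PASS AC=5,5;DP=7;MM=7.0,3.0;MQ=23.4,1.0;"
        "NQSBQ=23.98,25.5;NQSMM=0.04,0.0;SC=2,3,0,2 GT 1/0")
rec = parse_vcf_line(line)
print(rec["pos"], rec["ref"], rec["alt"], rec["length_change"], rec["info"]["SC"])
```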
* annovar2vcf format
This is useful for converting certain ANNOVAR files to VCF format.
These ANNOVAR input files MUST include zygosity, quality and
filter information as the 3 extra columns after Chr, Start, End,
Ref and Alt alleles.
The code was written by Dr. Kai Wang and modified by Dr. Germán Gastón
Leparc. Various users have provided sample input files for many SNP calling
software packages, for the development of conversion subroutines. We thank these
users for their continued support to improve the functionality of the
script.
For questions or comments, please contact kai@openbioinformatics.org.
table_annovar.pl
$cat table_annovar.txt
SYNOPSIS
table_annovar.pl [arguments] <query-file> <database-location>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
--protocol <string> comma-delimited string specifying database protocol
--operation <string> comma-delimited string specifying type of operation
--outfile <string> output file name prefix
--buildver <string> genome build version (default: hg18)
--remove remove all temporary files
--(no)checkfile check if database file exists (default: ON)
--genericdbfile <files> specify comma-delimited generic db files
--gff3dbfile <files> specify comma-delimited GFF3 files
--bedfile <files> specify comma-delimited BED files
--vcfdbfile <files> specify comma-delimited VCF files
--otherinfo print out otherinfo (information after fifth column in queryfile)
--onetranscript print out only one transcript for exonic variants (default: all transcripts)
--nastring <string> string to display when a score is not available (default: null)
--csvout generate comma-delimited CSV file (default: tab-delimited txt file)
--argument <string> comma-delimited strings as optional argument for each operation (use & for comma inside string)
--convertarg <string> argument to convert2annovar.pl
--codingarg <string> argument to coding_change.pl
--tempdir <dir> directory to store temporary files (default: --outfile)
--vcfinput specify that input is in VCF format and output will be in VCF format
--dot2underline change dot in field name to underline (eg, Func.refGene to Func_refGene)
--thread <int> specify the number of threads to be used in annotation
--maxgenethread <int> specify the maximum number of threads allowed in gene annotation (default: 6)
--polish polish the protein notation for indels (such as p.G12Vfs*2)
--xreffile <file> specify a cross-reference file for gene-based annotation
Function: automatically run a pipeline on a list of variants and summarize
their functional effects in a tab-delimited file (comma-delimited with --csvout),
or in an annotated VCF file if the original input is a VCF file
Example: table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,dbnsfp30a -operation g,r,f -nastring . -csvout -polish -xreffile example/gene_fullxref.txt
table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,dbnsfp30a -operation g,r,f -nastring . -vcfinput
Version: $Date: 2018-04-16 00:47:49 -0400 (Mon, 16 Apr 2018) $
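In the examples above, `-protocol` and `-operation` are parallel comma-delimited lists: the n-th operation (`g`, `r` or `f`) applies to the n-th protocol. The sketch below builds such a command line from (protocol, operation) pairs so the two lists cannot drift apart. `build_table_annovar_command` is a hypothetical wrapper, not part of ANNOVAR, and it only uses flags that appear in the synopsis.

```python
import shlex

ALLOWED_OPERATIONS = {"g", "r", "f"}   # gene, region, filter


def build_table_annovar_command(query, dbdir, buildver, out_prefix, steps,
                                nastring=".", csvout=False, remove=True):
    """Return an argument list for table_annovar.pl.

    steps is a list of (protocol, operation) pairs, e.g. ("refGene", "g").
    """
    for protocol, operation in steps:
        if operation not in ALLOWED_OPERATIONS:
            raise ValueError(f"unknown operation {operation!r} for {protocol}")
    cmd = ["table_annovar.pl", query, dbdir,
           "-buildver", buildver, "-out", out_prefix]
    if remove:
        cmd.append("-remove")
    cmd += ["-protocol", ",".join(p for p, _ in steps),
            "-operation", ",".join(o for _, o in steps),
            "-nastring", nastring]
    if csvout:
        cmd.append("-csvout")
    return cmd


cmd = build_table_annovar_command(
    "example/ex1.avinput", "humandb/", "hg19", "myanno",
    [("refGene", "g"), ("cytoBand", "r"), ("dbnsfp30a", "f")],
    csvout=True,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where table_annovar.pl is on the `PATH` and the named databases have been downloaded.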
OPTIONS
--help print a brief usage message and detailed explanation of options.
--man print the complete manual of the program.
--verbose
use verbose output.
--protocol
comma-delimited string specifying annotation protocol. These
strings typically represent database names in ANNOVAR.
--operation
comma-delimited string specifying type of operation. These strings
can be g (gene), r (region) or f (filter).
--outfile
the prefix of output file names
--buildver
specify the genome build version
--remove
remove all temporary files. By default, all temporary files will
be kept for user inspection, but this will easily clutter the
directory.
--(no)checkfile
the program will check if all required database files exist before
execution of annotation
--genericdbfile
specify the generic db files used in -dbtype generic. Note that
multiple comma-delimited file names can be supplied.
--gff3dbfile
specify the GFF3 db files used in -dbtype gff3. Note that
multiple comma-delimited file names can be supplied.
--bedfile
specify the BED db files used in -dbtype bed. Note that
multiple comma-delimited file names can be supplied.
--vcfdbfile
specify the VCF db files used in -dbtype vcf. Note that
multiple comma-delimited file names can be supplied.
--otherinfo
print out otherinfo in the output file. "otherinfo" refers to all
the information after the fifth column in the input queryfile.
--onetranscript
print out only one random transcript for exonic variants. By
default, all transcripts are printed in the output.
--nastring
string to display when a score is not available. By default, an empty
string is printed in the output file.
--csvout
generate comma-delimited CSV file. By default, a tab-delimited text
file is generated.
--argument
a comma-separated list of arguments, to be supplied to each of the
protocols. This list facilitates a customized annotation procedure
for each protocol.
--convertarg
a string as argument to be supplied to the convert2annovar.pl
program
--codingarg
a string as argument to be supplied to the coding_change.pl
program
--tempdir
specify the directory location for storing temporary files used by
table_annovar. This argument is especially useful in a cluster
computing environment, so that temporary files are written to
local disk of compute nodes, yet results files are written to
possibly remote hosts.
--vcfinput
specify that input is in VCF format and output will be in VCF
format. If you want to generate a tab-delimited or
comma-delimited output file, you must use convert2annovar to
generate an ANNOVAR input file first.
--dot2underline
change dot in field name to underline (eg, Func.refGene to
Func_refGene), which is useful for post-processing of the results
in some software tools that cannot handle dot in field names.
--thread
specify the number of threads to be used in annotation (when
multi-threading support is enabled in the system)
--maxgenethread
specify the maximum number of threads allowed in gene annotation
(default: 6)
--polish
polish the protein notation for indels (such as p.G12Vfs*2) by
re-calculating the protein sequence after a mutation is introduced
in coding_change.pl
--xreffile
specify a cross-reference file for gene-based annotation, so that
the final output includes extra columns for genes
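The `--argument` convention (one entry per protocol, with `&` standing in for a comma inside an entry) can be illustrated as follows. This is a reading of the option text above, not a copy of what table_annovar.pl does internally. The requirement that the number of entries matches the number of protocols is an assumption, and `-opt a&b` is a placeholder string, not a real ANNOVAR option.

```python
def split_per_protocol_arguments(protocols, argument):
    """Pair each protocol with its entry from a --argument string.

    Entries are separated by commas; '&' inside an entry stands for a
    literal comma. An empty entry means no extra argument.
    """
    entries = argument.split(",")
    if len(entries) != len(protocols):
        raise ValueError(
            f"{len(protocols)} protocols but {len(entries)} argument entries"
        )
    return [(p, e.replace("&", ",")) for p, e in zip(protocols, entries)]


protocols = ["refGene", "cytoBand", "dbnsfp30a"]
# Placeholder argument: only the first protocol receives extra options.
for protocol, extra in split_per_protocol_arguments(protocols, "-opt a&b,,"):
    print(protocol, repr(extra))
```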
DESCRIPTION
ANNOVAR is a software tool that can be used to functionally annotate a
list of genetic variants, possibly generated from next-generation
sequencing experiments. For example, given a whole-genome resequencing
data set for a human with specific diseases, typically around 3 million
SNPs and around half a million insertions/deletions will be identified.
Given this massive amount of data (and candidate disease-causing
variants), it is necessary to have a fast algorithm that scans the data
and identifies a prioritized subset of variants that are most likely
functional for follow-up Sanger sequencing studies and functional assays.
The table_annovar.pl program is designed to replace summarize_annovar.pl
in earlier versions of ANNOVAR. Basically, it takes an input file, runs
a series of annotations on it, and generates a tab-delimited
output file in which each column represents a specific type of annotation.
The new table_annovar.pl therefore allows better customization for users
who want to annotate specific columns.
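Since each column of the tab-delimited output holds one type of annotation, the file can be loaded with any table reader. The sketch below uses Python's standard `csv` module and optionally applies the same renaming as `--dot2underline` after the fact. The helper name, the sample header beyond `Func.refGene`, and the data row are all made up for illustration. They are not real ANNOVAR output.

```python
import csv
import io


def read_annotation_table(handle, dot2underline=False):
    """Read a tab-delimited annotation table into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if dot2underline:
        # Same idea as --dot2underline: Func.refGene -> Func_refGene.
        header = [name.replace(".", "_") for name in header]
    return [dict(zip(header, row)) for row in reader]


# Made-up sample standing in for an output file.
sample = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\n"
    "1\t1000\t1000\tA\tG\texonic\n"
)
rows = read_annotation_table(io.StringIO(sample), dot2underline=True)
print(rows[0]["Func_refGene"])
```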
ANNOVAR is freely available to the community for non-commercial use. For
questions or comments, please contact kai@openbioinformatics.org.