084 - [Bioinformatics Software] - ANNOVAR Software Help Documentation

leadingsci

Article link: https://blog.csdn.net/leadingsci/article/details/82959523

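ANNOVAR is a tool for functional annotation of large-scale genetic variant data, suited to whole-genome sequencing experiments. It can quickly filter and annotate potentially functional variants such as SNPs, insertions, and deletions, which supports follow-up validation and functional studies.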

Installation

You must register before you can download; an email address with an educational-institution domain is enough to register.
http://www.openbioinformatics.org/annovar/annovar_download_form.php

The software package is then sent to you by email:

annovar.latest.tar/

Perl scripts included in the package

toucan@tssys /opt/script/tool/annovar  Sun Oct 07 16:42  forstart
$tree -L 1
.
├── annotate_variation.pl
├── coding_change.pl
├── convert2annovar.pl
├── example
├── humandb
├── retrieve_seq_from_fasta.pl
├── table_annovar.pl
└── variants_reduction.pl

humandb

The ANNOVAR package ships with some commonly used databases in the humandb/ directory.

toucan@tssys /opt/script/tool/annovar/humandb  Sun Oct 07 16:43  forstart
$tree -L 1
.
├── genometrax-sample-files-gff
├── GRCh37_MT_ensGeneMrna.fa
├── GRCh37_MT_ensGene.txt
├── hg19_example_db_generic.txt
├── hg19_example_db_gff3.txt
├── hg19_MT_ensGeneMrna.fa
├── hg19_MT_ensGene.txt
├── hg19_refGeneMrna.fa
├── hg19_refGene.txt
├── hg19_refGeneVersion.txt
├── hg19_refGeneWithVerMrna.fa
└── hg19_refGeneWithVer.txt

GFF files

toucan@tssys /opt/script/tool/annovar/humandb/genometrax-sample-files-gff  Sun Oct 07 16:44  forstart
$tree -L 1
.
├── list
├── sample_chip_featuretype_hg19.gff
├── sample_common_snp_featuretype_hg19.gff
├── sample_cosmic_featuretype_hg19.gff
├── sample_cpg_islands_featuretype_hg19.gff
├── sample_dbnsfp_featuretype_hg19.gff
├── sample_disease_featuretype_hg19.gff
├── sample_dnase_featuretype_hg19.gff
├── sample_drug_featuretype_hg19.gff
├── sample_evs_featuretype_hg19.gff
├── sample_gwas_featuretype_hg19.gff
├── sample_hgmd_common_snp_featuretype_hg19.gff
├── sample_hgmd_disease_genes_featuretype_hg19.gff
├── sample_hgmd_featuretype_hg19.gff
├── sample_hgmdimputed_featuretype_hg19.gff
├── sample_microsatellites_featuretype_hg19.gff
├── sample_miRNA_featuretype_hg19.gff
├── sample_omim_featuretype_hg19.gff
├── sample_pathway_featuretype_hg19.gff
├── sample_pgx_featuretype_hg19.gff
├── sample_ptms_featuretype_hg19.gff
├── sample_snps_dbsnp_featuretype_hg19.gff
├── sample_snps_ensembl_featuretype_hg19.gff
├── sample_transfac_sites_featuretype_hg19.gff
└── sample_tss_featuretype_hg19.gff

0 directories, 25 files

example

toucan@DELL5577:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ ll
total 20152
drwxrwxrwx 1 toucan toucan     4096 Sep 26 22:47 ./
drwxrwxrwx 1 toucan toucan     4096 Sep 26 22:47 ../
-rwxrwxrwx 1 toucan toucan     1940 Apr 17 03:41 README*
-rwxrwxrwx 1 toucan toucan     1831 Apr 17 03:41 ex1.avinput*
-rwxrwxrwx 1 toucan toucan     1706 Apr 17 03:41 ex2.vcf*
-rwxrwxrwx 1 toucan toucan       44 Apr 17 03:41 example.simple_region*
-rwxrwxrwx 1 toucan toucan       44 Apr 17 03:41 example.tab_region*
-rwxrwxrwx 1 toucan toucan 20317115 Apr 17 03:41 gene_fullxref.txt*
-rwxrwxrwx 1 toucan toucan   295664 Apr 17 03:41 gene_xref.txt*
-rwxrwxrwx 1 toucan toucan     1436 Apr 17 03:41 grantham.matrix*
-rwxrwxrwx 1 toucan toucan       43 Apr 17 03:41 snplist.txt*

README contents

toucan@DELL5577:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ cat README
visit ANNOVAR website at http://www.openbioinformatics.org/annovar for more examples.

Please cite ANNOVAR if you use it in your research (Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data, Nucleic Acids Research, 38:e164, 2010). I spent tremendous amount of time and effort to maintain this tool, and your citation really means a lot to me.

ex1.avinput: a simple ANNOVAR input example with a few variants (in hg19 coordinate)

ex2.vcf: a simple VCF file with genotype information for 3 samples

gene_xref.txt: an example gene cross-reference file to be used on 'gx' operation in table_annovar.pl

example.simple_region: a file containing a list of genomic regions in simple format (for use in retrieve_seq_from_fasta.pl)

example.tab_region: a file containing a list of genomic regions in tab-delimited format (for use in retrieve_seq_from_fasta.pl)

snplist.txt: a text file listing several dbSNP rs identifiers, one per line

humandb/hg19_example_db_generic.txt: an example file for generic database

humandb/hg19_example_db_gff3.txt: an example file for GFF3 database

grantham.matrix: a matrix file containing GRANTHAM scores for gene-based annotation

humandb/genometrax-sample-files-gff: a directory containing several "sample" GFF files provided by BioBase

humandb/hg19_MT_ensGene.txt and humandb/hg19_MT_ensGene.fa: mitochondria sequence for the NC_001807 contig used by UCSC Genome Browser. Even if you align your sequence data with reference sequences from UCSC, you should still use these files, not the Ensembl file, for mitochondria annotation, because the Ensembl annotations have some errors.

humandb/GRCh37_MT_ensGene.txt and humandb/GRCh37_MT_ensGene.fa: mitochondria sequence for the NC_012920 contig. If you align your sequence data using 1000 Genomes Project reference FASTA file, then you should use this file for annotating mitochondria variants.

For any other annotation, use the -downdb argument to download the required database into the humandb/ directory:

# Download the 1000g2015aug database
$perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/

Software help documentation

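ANNOVAR program structure
│ annotate_variation.pl # main program: downloads databases and performs the three types of annotation
│ coding_change.pl # infers the resulting protein sequence
│ convert2annovar.pl # converts various input formats to .avinput
│ retrieve_seq_from_fasta.pl # used to build transcript sequences for other species
│ table_annovar.pl # annotation wrapper that runs all three annotation types in one pass
│ variants_reduction.pl # for more flexible, customized filtering and annotation pipelines
│
├─example # example files
│
└─humandb # human annotation databases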

annotate_variation.pl

$cat mam/annotate_variation.txt
SYNOPSIS
     annotate_variation.pl [arguments] <query-file|table-name> <database-location>

     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output

            Arguments to download databases or perform annotations
                --downdb                    download annotation database
                --geneanno                  annotate variants by gene-based annotation (infer functional consequence on genes)
                --regionanno                annotate variants by region-based annotation (find overlapped regions in database)
                --filter                    annotate variants by filter-based annotation (find identical variants in database)

            Arguments to control input and output
                --outfile <file>            output file prefix
                --webfrom <string>          specify the source of database (ucsc or annovar or URL) (downdb operation)
                --dbtype <string>           specify database type
                --buildver <string>         specify genome build version (default: hg18 for human)
                --time                      print out local time during program run
                --comment                   print out comment line (those starting with #) in output files
                --exonsort                  sort the exon number in output line (gene-based annotation)
                --transcript_function       use transcript name rather than gene name (gene-based annotation)
                --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T)(gene-based annotation)
                --separate                  separately print out all functions of a variant in several lines (gene-based annotation)
                --seq_padding               create a new file with cDNA sequence padded by this much either side(gene-based annotation)
                --(no)firstcodondel         treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
                --aamatrix <file>           specify an amino acid substitution matrix file (gene-based annotation)
                --colsWanted <string>       specify which columns to output by comma-delimited numbers (region-based annotation)
                --scorecolumn <int>         the column with scores in DB file (region-based annotation)
                --poscolumn <string>        the comma-delimited column with position information in DB file (region-based annotation)
                --gff3dbfile <file>         specify a DB file in GFF3 format (region-based annotation)
                --gff3attribute             output all fields in GFF3 attribute (default: ID and score only)
                --bedfile <file>            specify a DB file in BED format file (region-based annotation)
                --genericdbfile <file>      specify a DB file in generic format (filter-based annotation)
                --vcfdbfile <file>          specify a DB file in VCF format (filter-based annotation)
                --otherinfo                 print out additional columns in database file (filter-based annotation)
                --infoasscore               use INFO field in VCF file as score in output (filter-based annotation)
                --idasscore                 use ID field in VCF file as score in output (filter-based annotation)
                --infosep                   use # rather than , to separate fields when -otherinfo is used


            Arguments to fine-tune the annotation procedure
                --batchsize <int>           batch size for processing variants per batch (default: 5m)
                --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                --neargene <int>            distance threshold to define upstream/downstream of a gene
                --exonicsplicing            report exonic variants near exon/intron boundary as 'exonic;splicing' variants
                --score_threshold <float>   minimum score of DB regions to use in annotation
                --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                --reverse                   reverse directionality to compare to score_threshold
                --rawscore                  output includes the raw score (not normalized score) in UCSC BrowserTrack
                --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                --indel_splicing_threshold <int>    if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
                --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                --sift_threshold <float>    SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
                --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
                --indexfilter_threshold <float>     controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
                --thread <int>              use multiple threads for filter-based annotation
                --maxgenethread <int>       max number of threads for gene-based annotation (default: 6)
                --mingenelinecount <int>    min line counts to enable threaded gene-based annotation (default: 1000000)

           Arguments to control memory usage
                --memfree <int>             ensure minimum amount of free system memory (default: 0)
                --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited,in the order of kb)
                --chromosome <string>       examine these specific chromosomes in database file


     Function: annotate a list of genetic variants against genome annotation
     databases stored at local disk.
# Examples
     Example: #download annotation databases from ANNOVAR or UCSC and save to humandb/ directory
              annotate_variation.pl -downdb -webfrom annovar refGene humandb/
              annotate_variation.pl -buildver mm9 -downdb refGene mousedb/
              annotate_variation.pl -downdb -webfrom annovar esp6500siv2_all humandb/

              #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
              annotate_variation.pl -buildver hg19 ex1.avinput humandb/

              #region-based annotate variants
              annotate_variation.pl -regionanno -buildver hg19 -dbtype cytoBand ex1.avinput humandb/
              annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.avinput humandb/

              #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
              annotate_variation.pl -filter -dbtype 1000g2015aug_all -maf 0.01 ex1.avinput humandb/
              annotate_variation.pl -filter -buildver hg19 -dbtype snp138 ex1.avinput humandb/
              annotate_variation.pl -filter -dbtype dbnsfp30a -otherinfo ex1.avinput humandb/

     Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018) $

OPTIONS
    --help  print a brief usage message and detailed explanation of options.

    --man   print the complete manual of the program.

    --verbose
            use verbose output.

    --downdb
            download annotation databases from UCSC Genome Browser, Ensembl,
            1000 Genomes Project, ANNOVAR website or other resources. The
            annotation databases are required for functional annotation of
            genetic variants.

    --geneanno
            perform gene-based annotation. For each variant, examine whether
            it hit exon, intron, intergenic region, or close to a transcript,
            or hit a non-coding RNA gene, or is located in a untranslated
            region (see *.variant_function output file). In addition, for an
            exonic variant, determine whether it causes splicing change,
            non-synonymous amino acid change, synonymous amino acid change or
            frameshift changes (see *.exonic_variant_function output file).

    --regionanno
            perform region-based annotation. For each variant, examine whether
            its genomic region (one or multiple base pairs) overlaps with a
            specific genomic region, such as the most conserved elements, the
            predicted transcription factor binding sites, the specific
            cytogenetic bands, the evolutionarily conserved RNA secondary
            structures.

    --filter
            perform filter-based annotation. For each variant, filter it
            against a variation database, such as the 1000 Genomes Project
            database, to identify whether it has been reported in the database.
            Exact match of nucleotide position and nucleotide composition are
            required.

    --outfile
            specify the output file prefix. Several output files will be
            generated using this prefix and different suffixes. A directory
            name can also be specified as part of the argument, so that the
            output files can be written to a different directory than the
            current directory.

    --webfrom
            specify the source of database (ucsc or annovar or URL) in the
            downdb operation. By default, files from UCSC Genome Browser
            annotation database will be downloaded.

    --dbtype
            specify the database type to be used in gene-based, region-based
            or filter-based annotations. For gene-based annotation, by default
            refGene annotations from the UCSC Genome Browser will be used for
            annotating variants. However, users can switch to use Ensembl
            annotations, or use the UCSC Gene annotations, or the GENCODE Gene
            annotations, or other types of gene annotations. For region-based
            annotations, users can select any UCSC annotation databases (by
            providing the database name), or alternatively select a Generic
            Feature Format version 3 (GFF3) formatted file for annotation (by
            providing 'gff3' as the --dbtype and providing the --gff3dbfile
            argument), or select a BED file (by providing '--dbtype bed' and
            --bedfile arguments). For filter-based annotations, users can
            select a dbSNP file, a 1000G file, a generic format file (with
            simple columns including chr, start, end, reference, observed,
            score), a VCF format file (which is a widely used format for
            variants exchange), or many other types of formats.

    --buildver
            genome build version to use. By default, the hg18 build for human
            genome is used. The build version will be used by ANNOVAR to
            identify corresponding database files automatically, for example,
            when gene-based annotation is used for hg18 build, ANNOVAR will
            search for the hg18_refGene.txt file, but if the hg19 is used as
            --buildver, ANNOVAR will examine hg19_refGene.txt instead.
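
The file lookup described for --buildver can be pictured with a small sketch. It assumes the naming pattern is simply `<buildver>_<dbtype>.txt`, which is inferred only from the hg18_refGene.txt / hg19_refGene.txt examples above; the function name `expected_db_file` is hypothetical and not part of ANNOVAR, and other database types may be named differently.

```python
from pathlib import Path


def expected_db_file(dbdir: str, buildver: str = "hg18", dbtype: str = "refGene") -> Path:
    """Return the path suggested by the assumed <buildver>_<dbtype>.txt pattern.

    The defaults mirror the documented defaults (hg18 build, refGene annotation).
    """
    return Path(dbdir) / f"{buildver}_{dbtype}.txt"


if __name__ == "__main__":
    # With -buildver hg19 and the default dbtype, this prints the hg19 refGene path.
    print(expected_db_file("humandb", buildver="hg19"))
```

This is why forgetting `-buildver hg19` on an hg19 data set usually shows up as a "file not found" style problem: the tool goes looking for the hg18 file instead.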

    --time  print out the local time during execution of the program

    --comment
            specify that the program should include comment lines in the
            output files. Comment lines are defined as any line starting with
            #. By default, these lines are not recognized as valid ANNOVAR
            input and are therefore written to the INVALID_INPUT file. This
            argument can be very useful to keep columns headers in the output
            file, if the input file use comment line to flag the column
            headers (usually the first line in the input file).

    --exonsort
            sort the exon number in output line in the exonic_variant_function
            file during gene-based annotation. If a mutation affects multiple
            transcripts, the ones with the smaller exon number will be printed
            before the transcript with larger exon number in the output.

    --transcript_function
            use transcript name rather than gene name in output, for
            gene-based annotation

    --hgvs  use HGVS format for exonic annotation (c.122C>T rather than
            c.C122T) for gene-based annotation
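
To make the difference between the two notations concrete, here is a minimal sketch that rewrites a simple substitution from the default style (c.C122T) into the HGVS style (c.122C>T). It only handles single-base cDNA substitutions and returns anything else unchanged; the function name is hypothetical and this is not how ANNOVAR itself produces the output.

```python
import re

# Default ANNOVAR-style single-base cDNA substitution: c.<ref><position><alt>
_CDNA_SUBSTITUTION = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")


def to_hgvs_substitution(change: str) -> str:
    """Rewrite 'c.C122T' as 'c.122C>T'; leave other strings untouched."""
    match = _CDNA_SUBSTITUTION.match(change)
    if match is None:
        return change
    ref, pos, alt = match.groups()
    return f"c.{pos}{ref}>{alt}"


if __name__ == "__main__":
    print(to_hgvs_substitution("c.C122T"))
```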

    --separate
            for gene-based annotation, separate the effects of each variant,
            so that each effect (intronic, exonic, splicing) is printed in one
            output line. By default, all effects are printed in the same line,
            in the comma-separated form of 'UTR3,UTR5' or 'exonic,splicing'.

    --seq_padding
            create a new file with cDNA sequence padded by this much either
            side (gene-based annotation)

    --firstcodondel
            if the first codon of a gene is deleted, then the whole gene will
            be treated as deleted in gene-based annotation. By default, this
            option is ON.

    --aamatrixfile
            specify an amino acid substitution matrix, so that the scores are
            printed in the exonic_variant_function file in gene-based
            annotation. The matrix file is tab-delimited, and an example is
            included in the ANNOVAR package.

    --colsWanted
            specify which columns are desired in the output for -regionanno.
            By default, ANNOVAR intelligently selects the columns based on the
            DB type. However, users can use a list of comma-delimited numbers,
            or use 'all', or use 'none', to request custom output columns.

    --scorecolumn
            specify the column with desired output scores in UCSC database
            file (for region-based annotation). The default usually works
            okay.

    --poscolumn
            the comma-delimited column with position information in DB file
            (region-based annotation). The default usually works okay.

    --gff3dbfile
            specify the GFF3-formatted database file used in the region-based
            annotation. Please consult
            http://www.sequenceontology.org/resources/gff3.html for detailed
            description on this file format. Note that GFF3 is generally not
            compatible with previous versions of GFF.

    --gff3attribute
            output should contain all fields in GFF3 file attribute column
            (the 9th column). By default, only the ID in the attribute and the
            scores for the GFF3 file will be printed.

    --bedfile
            specify a DB file in BED format file in region-based annotation.
            Please consult http://genome.ucsc.edu/FAQ/FAQformat.html#format1
            for detailed descriptions on this format.

    --genericdbfile
            specify the generic format database file used in the filter-based
            annotation.

    --vcfdbfile
            specify the database file in VCF format in the filter-based
            annotation. VCF has been a popular format for summarizing SNP and
            indel calls in a population of samples, and has been adopted by
            1000 Genomes Project in their most recent data release.

    --otherinfo
            print out additional columns in database file in filter-based
            annotation. This argument is useful when the annotation database
            contains more than one annotation columns, so that all columns
            will be printed out and separated by comma (by default).

    --idasscore
            when annotating against a VCF file, treat the ID field in VCF file
            as the score to be printed in the output, in filter-based
            annotation. By default the score is the allele frequency inferred
            from VCF file.

    --infoasscore
            when annotating against a VCF file, treat the INFO field in VCF
            file as the score to be printed in the output, in filter-based
            annotation. By default the score is allele frequency inferred from
            VCF file.

    --infosep
            use '#' rather than ',' to separate multiple fields when
            -otherinfo is used in annotation. This argument is useful when the
            annotation string itself contains comma, to help users clearly
            separate different annotation fields.

    --batchsize
            this argument specifies the batch size for processing variants by
            gene-based annotation. Normally 5 million variants (usually one
            human genome will have about 3-5 million variants depending on
            ethnicity) are annotated as a batch, to reduce the amounts of
            memory. The users can adjust the parameters: larger values make
            the program slightly faster, at the expense of slightly larger
            memory requirements. In a 64bit computer, the default settings
            usually take 1GB memory for gene-based annotation for human genome
            for a typical query file, but this depends on the complexity of
            the query (note that the query has a few required fields, but may
            have many optional fields and those fields need to be read and
            kept in memory).

    --genomebinsize
            the bin size of genome to speed up search. By default 100kb is
            used for gene-based annotation, so that variant annotation
            focused on specific bins only (based on the start-end site of a
            given variant), rather than searching the entire chromosomes for
            each variant. By default 10kb is used for region-based annotation.
            The filter-based annotations look for variants directly so no bin
            is used.
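
The binning idea can be sketched as follows. This only illustrates the general technique (index records by fixed-size bins so that a variant is compared against the bins its start-end range touches); it is not ANNOVAR's actual code, and the function name and integer-division scheme are assumptions.

```python
def bins_for_variant(start: int, end: int, binsize: int = 100_000) -> range:
    """Return the indices of the fixed-size bins touched by [start, end].

    binsize defaults to 100 kb, the documented default for gene-based
    annotation; 10_000 would correspond to the region-based default.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    return range(start // binsize, end // binsize + 1)


if __name__ == "__main__":
    # A single-base variant touches exactly one bin.
    print(list(bins_for_variant(49303427, 49303427)))
    # A variant spanning a bin boundary touches two bins.
    print(list(bins_for_variant(199_999, 200_001)))
```

Smaller bins mean fewer candidate records per lookup but more bins to hold in memory, which is the trade-off the option lets you tune.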

    --expandbin
            expand bin to both sides to find neighboring genes/regions. For
            gene-based annotation, ANNOVAR tries to find nearby genes for any
            intergenic variant, with a maximum number of nearby bins to
            search. By default, ANNOVAR will automatically set this argument
            to search 2 megabases to the left and right of the variant in
            genome.

    --neargene
            the distance threshold to define whether a variant is in the
            upstream or downstream region of a gene. By default 1 kilobase
            from the start or end site of a transcript is defined as upstream
            or downstream, respectively. This is useful, for example, when one
            wants to identify variants that are located in the promoter
            regions of genes across the genome.

    --exonicsplicing
            report exonic variants near exon/intron boundary as
            'exonic;splicing' variants. These variants are technically exonic
            variants, but there are some literature reports that some of them
            may also affect splicing so a keyword is preserved specifically
            for them.

    --score_threshold
            the minimum score to consider when examining region-based
            annotations on UCSC Genome Browser tables. Some tables do not have
            such scores and this argument will not be effective.

    --normscore_threshold
            the minimum normalized score to consider when examining
            region-based annotations on UCSC Genome Browser tables. The
            normalized score is calculated by UCSC, ranging from 0 to 1000, to
            make visualization easier. Some tables do not have such scores and
            this argument will not be effective.

    --reverse
            reverse the criteria for --score_threshold and
            --normscore_threshold. So the minimum score becomes maximum score
            for a result to be printed.

    --rawscore
            for region-based annotation, print out raw scores from UCSC Genome
            Browser tables, rather than normalized scores. By default,
            normalized scores are printed in the output files. Normalized
            scores are compiled by UCSC Genome Browser for each track, and
            they usually range from 0 to 1000, but there are some exceptions.

    --minqueryfrac
            The minimum fraction of overlap between a query and a database
            record to decide on their match. By default, any overlap is
            regarded as a match, but this may not work best when query consist
            of large copy number variants.
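
A rough sketch of what a "fraction of query overlap" check looks like is given below. It assumes 1-based inclusive coordinates and measures the overlap relative to the query length; both are assumptions made for illustration, and the function names are hypothetical.

```python
def query_overlap_fraction(qstart: int, qend: int, dbstart: int, dbend: int) -> float:
    """Fraction of the query interval covered by the database interval."""
    overlap = min(qend, dbend) - max(qstart, dbstart) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (qend - qstart + 1)


def is_match(qstart: int, qend: int, dbstart: int, dbend: int,
             minqueryfrac: float = 0.0) -> bool:
    """With the default of 0, any overlap counts as a match."""
    frac = query_overlap_fraction(qstart, qend, dbstart, dbend)
    return frac > 0 and frac >= minqueryfrac


if __name__ == "__main__":
    # A 100 bp query of which 50 bp fall inside the database record.
    print(query_overlap_fraction(100, 199, 150, 300))
```

For a large copy number variant that merely clips the edge of a database region, raising the threshold prevents that marginal overlap from being reported as a hit.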

    --splicing_threshold
            distance between splicing variants and exon/intron boundary, to
            claim that a variant is a splicing variant. By default, 2bp is
            used. ANNOVAR is relatively more stringent than some other
            software to claim variant as regulating splicing. In addition, if
            a variant is an exonic variant, it will not be reported as
            splicing variant even if it is within 2bp to an exon/intron
            boundary.

    --indel_splicing_threshold
            If set, max size of indel allowed to be called a splicing variant
            (if boundary within --splicing_threshold bases of an intron/exon
            junction.) If not set, this is equal to the --splicing_threshold,
            as per original behavior.

    --maf_threshold
            the minor allele frequency (MAF) threshold to be used in the
            filter-based annotation for the 1000 Genomes Project databases. By
            default, any variant annotated in the 1000G will be used in
            filtering.

    --sift_threshold
            the default SIFT threshold for deleterious prediction for -dbtype
            avsift (default: 0.05). This argument is obsolete, since the
            recommended database for SIFT annotation is LJB database now,
            rather than avsift database.

    --thread
            specify the number of threads to use in filter-based annotation.
            Perl and all components in the system need to support
            multi-threaded analysis to use this feature. It is recommended
            when your database is stored on an SSD drive, which results in
            nearly linear speed up of annotation for large genome files.

    --maxgenethread
            specify the maximum number of threads for gene-based annotation
            (default: 6). Generally speaking, too many threads for gene-based
            annotation will negatively impact the performance.

    --mingenelinecount
            specify the minimum line counts to enable threaded gene-based
            annotation (default: 1000000). For input files with fewer lines,
            the threaded annotation will not be used, since it actually costs
            more time than non-threaded annotation.

    --memfree
            the minimum amount of free system memory that ANNOVAR should
            ensure to have.

    --memtotal
            the total amount of memory that ANNOVAR should use at most. By
            default, this value is zero, meaning that there is no limit on
            that. Decreasing this threshold reduces the memory requirement of
            ANNOVAR, but may increase the execution time.

    --chromosome
            examine these specific chromosomes in database file. The argument
            takes comma-delimited values, and the dash can be correctly
            recognized. For example, 5-10,X represents chromosome 5 through
            chromosome 10 plus chromosome X.
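
The `5-10,X` syntax can be read as "comma-separated tokens, where a numeric `a-b` token is a range". The sketch below expands such a string into an explicit list; the function name is hypothetical and the parsing rules are only what the paragraph above describes, not ANNOVAR's own implementation.

```python
import re


def expand_chromosomes(spec: str) -> list[str]:
    """Expand a --chromosome style string such as '5-10,X' into a list."""
    chroms: list[str] = []
    for token in spec.split(","):
        token = token.strip()
        numeric_range = re.fullmatch(r"(\d+)-(\d+)", token)
        if numeric_range:
            low, high = int(numeric_range.group(1)), int(numeric_range.group(2))
            chroms.extend(str(n) for n in range(low, high + 1))
        elif token:
            # Non-range tokens such as X, Y or a single number are kept as-is.
            chroms.append(token)
    return chroms


if __name__ == "__main__":
    print(expand_chromosomes("5-10,X"))
```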

DESCRIPTION
    ANNOVAR is a software tool that can be used to functionally annotate a
    list of genetic variants, possibly generated from next-generation
    sequencing experiments. For example, given a whole-genome resequencing
    data set for a human with specific diseases, typically around 3 million
    SNPs and around half million insertions/deletions will be identified.
    Given this massive amount of data (and candidate disease-causing
    variants), it is necessary to have a fast algorithm that scans the data
    and identifies a prioritized subset of variants that are most likely
    functional for follow-up Sanger sequencing studies and functional assays.

    Currently, these various types of functional annotations produced by
    ANNOVAR can be (1) gene-based annotations (the default behavior), such as
    exonic variants, intronic variants, intergenic variants, downstream
    variants, UTR variants, splicing site variants, etc. For exonic variants,
    ANNOVAR will try to predict whether each of the variants is non-synonymous
    SNV, synonymous SNV, frameshifting change, nonframeshifting change. (2)
    region-based annotation, to identify whether a given variant overlaps with
    a specific type of genomic region, for example, predicted transcription
    factor binding site or predicted microRNAs.(3) filter-based annotation, to
    filter a list of variants so that only those not observed in variation
    databases (such as 1000 Genomes Project and dbSNP) are printed out.

    Detailed documentation for ANNOVAR should be viewed in ANNOVAR website
    (http://annovar.openbioinformatics.org/). Below is description on commonly
    encountered file formats when using ANNOVAR software.

    *       variant file format

            A sample variant file contains one variant per line, with the
            fields being chr, start, end, reference allele, observed allele,
            other information. The other information can be anything (for
            example, it may contain sample identifiers for the corresponding
            variant.) An example is shown below:

                    16      49303427        49303427        C       T       rs2066844       R702W (NOD2)
                    16      49314041        49314041        G       C       rs2066845       G908R (NOD2)
                    16      49321279        49321279        -       C       rs2066847       c.3016_3017insC (NOD2)
                    16      49290897        49290897        C       T       rs9999999       intronic (NOD2)
                    16      49288500        49288500        A       T       rs8888888       intergenic (NOD2)
                    16      49288552        49288552        T       -       rs7777777       UTR5 (NOD2)
                    18      56190256        56190256        C       T       rs2229616       V103I (MC4R)

    *       database file format: UCSC Genome Browser annotation database

            Most but not all of the gene annotation databases are directly
            downloaded from UCSC Genome Browser, so the file format is
            identical to what was used by the genome browser. The users can
            check Table Browser (for example, human hg18 table browser is at
            http://www.genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18) to
            see what fields are available in the annotation file. Note that
            even for the same species (such as humans), the file format might
            be different between different genome builds (such as between
            hg16, hg17 and hg18). ANNOVAR will try to be smart about guessing
            file format, based on the combination of the --buildver argument
            and the number of columns in the input file. In general, the
            database file format should not be something that users need to
            worry about.

    *       database file format: GFF3 format for gene-based annotations

            As of June 2010, ANNOVAR cannot perform gene-based annotations
            using GFF3 input files, and any annotations on GFF3 files are
            region-based. I suggest that users download the gff3ToGenePred
            tool from UCSC and convert GFF3-based gene annotations to UCSC
            format, so that ANNOVAR can perform gene-based annotation for the
            species of interest.

    *       database file format: GFF3 format for region-based
            annotations

            Currently, region-based annotations support files in Generic
            Feature Format version 3 (GFF3). GFF3 has become the de facto
            gold standard for many model organism databases, so many users
            may want to run ANNOVAR on a custom annotation database, and this
            is most convenient if the custom file is in GFF3 format.

    *       database file format: generic format for filter-based
            annotations

            The 'generic' format is designed for filter-based annotation that
            looks for exact variants. The format is almost identical to the
            ANNOVAR input format, with chr, start, end, reference allele,
            observed allele and scores (higher scores are regarded as better).

    *       database file format: VCF format for filter-based annotations

            ANNOVAR can directly interrogate VCF files as database files. A
            VCF file may contain summary information for variants (for
            example, this variant has MAF of 5% in this population), or it may
            contain the actual variant calls for each individual in a specific
            population.

    *       sequence file format

            ANNOVAR can directly examine FASTA-formatted sequence files. For
            mRNA sequences, the names of the sequences are the mRNA
            identifiers. For genomic sequences, the names of the sequences in
            the files are usually chr1, chr2, chr3, etc, so that ANNOVAR
            knows which sequence corresponds to which chromosome.
            Unfortunately, UCSC uses names like chr6_random to annotate
            un-assembled sequences, as opposed to using the actual contig
            identifiers. This causes some issues (depending on how read
            alignment algorithms work), but in general it should not be
            something that users need to worry about. If users absolutely
            care about the exact contigs rather than chr*_random, then they
            will need to re-align the short reads at chr*_random to a
            different FASTA file that contains the contigs (such as the
            GRCh36/37/38), and then execute ANNOVAR on the newly identified
            variants.

    *       invalid input

            If the query file contains input lines with an invalid format,
            ANNOVAR will skip such lines and continue with the annotation of
            the next lines. These invalid input lines will be written to a
            file with the suffix invalid_input. Users should manually examine
            this file and identify the sources of error.
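
The variant file layout and the invalid-input behaviour described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not ANNOVAR's own validation logic: the names `Variant`, `parse_line` and `split_valid_invalid` are hypothetical, fields are assumed to be whitespace-separated, and the only checks made are that five fields exist and that start and end are integers with start <= end.

```python
from typing import List, NamedTuple, Optional, Tuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str    # reference allele ("-" for an insertion)
    obs: str    # observed allele ("-" for a deletion)
    other: str  # free-text remainder of the line, may be empty


def parse_line(line: str) -> Optional[Variant]:
    """Return a Variant, or None if the line lacks five usable fields."""
    fields = line.split(None, 5)  # at most 6 pieces; the 6th keeps its spacing
    if len(fields) < 5:
        return None
    chrom, start, end, ref, obs = fields[:5]
    if not (start.isdigit() and end.isdigit()) or int(start) > int(end):
        return None
    other = fields[5].strip() if len(fields) > 5 else ""
    return Variant(chrom, int(start), int(end), ref, obs, other)


def split_valid_invalid(lines: List[str]) -> Tuple[List[Variant], List[str]]:
    """Separate parseable variants from lines that should be reported."""
    valid: List[Variant] = []
    invalid: List[str] = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        parsed = parse_line(line)
        if parsed is None:
            invalid.append(line)
        else:
            valid.append(parsed)
    return valid, invalid


lines = [
    "16\t49303427\t49303427\tC\tT\trs2066844\tR702W (NOD2)",
    "16\t49321279\t49321279\t-\tC\trs2066847\tc.3016_3017insC (NOD2)",
    "16\tnot_a_number\t49290897\tC\tT",  # deliberately malformed
]
valid, invalid = split_valid_invalid(lines)
print(len(valid), "valid;", len(invalid), "invalid")
```

In a real run the `invalid` list would be written to a separate file for manual inspection, which mirrors the invalid_input file mentioned above.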

    --------------------------------------------------------------------------------

    ANNOVAR is free for academic, personal and non-profit use.

    For questions or comments, please contact $Author: kaichop
    <kaichop@gmail.com> $.

convert2annovar.pl

$cat mam/convert2annovar.txt
SYNOPSIS
     convert2annovar.pl [arguments] <variantfile>

     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
                --format <string>           input format (default: pileup)
                --includeinfo               include supporting information in output
                --outfile <file>            output file name (default: STDOUT)
                --snpqual <float>           quality score threshold in pileup file (default: 20)
                --snppvalue <float>         SNP P-value threshold in GFF3-SOLiD file (default: 1)
                --coverage <int>            read coverage threshold in pileup file (default: 0)
                --maxcoverage <int>         maximum coverage threshold (default: none)
                --chr <string>              specify the chromosome (for CASAVA format)
                --chrmt <string>            chr identifier for mitochondria (default: M)
                --fraction <float>          minimum allelic fraction to claim a mutation (for pileup format)
                --altcov <int>              alternative allele coverage threshold (for pileup format)
                --allelicfrac               print out allelic fraction rather than het/hom status (for pileup format)
                --species <string>          if human, convert chr23/24/25 to X/Y/M (for gff3-solid format)
                --filter <string>           output variants with this filter (case insensitive, for vcf4 format)
                --confraction <float>       minimal fraction for two indel calls as a 0-1 value (for vcf4old format)
                --allallele                 print all alleles rather than first one (for vcf4old format)
                --withzyg                   print zygosity/coverage/quality when -includeinfo is used (for vcf4 format)
                --comment                   keep comment line in output (for vcf4 format)
                --allsample                 process all samples in file with separate output files (for vcf4 format)
                --genoqual <float>          genotype quality score threshold (for vcf4 format)
                --varqual <float>           variant quality score threshold (for vcf4 format)
                --dbsnpfile <file>          dbSNP file in UCSC format (for rsid format)
                --withfreq                  for --allsample, print frequency information instead (for vcf4 format)
                --withfilter                print filter information in output (for vcf4 format)
                --seqdir <string>           directory with FASTA sequences (for region format)
                --inssize <int>             insertion size (for region format)
                --delsize <int>             deletion size (for region format)
                --subsize <int>             substitution size (default: 1, for region format)
                --genefile <file>           specify the gene file from UCSC (for transcript format)
                --splicing_threshold <int>  the splicing threshold (for transcript format)
                --context <int>             print context nucleotide for indels (for casava format)
                --avsnpfile <file>          specify the avSNP file (for rsid format)
                --keepindelref              keep Ref/Alt alleles for indels (for vcf4 format)

     Function: convert variant call file generated from various software programs
     into ANNOVAR input format

     Example: convert2annovar.pl -format pileup -outfile variant.query variant.pileup
              convert2annovar.pl -format cg -outfile variant.query variant.cg
              convert2annovar.pl -format cgmastervar variant.masterVar.txt
              convert2annovar.pl -format gff3-solid -outfile variant.query variant.snp.gff
              convert2annovar.pl -format soap variant.snp > variant.avinput
              convert2annovar.pl -format maq variant.snp > variant.avinput
              convert2annovar.pl -format casava -chr 1 variant.snp > variant.avinput
              convert2annovar.pl -format vcf4 variantfile > variant.avinput
              convert2annovar.pl -format vcf4 -filter pass variantfile -allsample -outfile variant
              convert2annovar.pl -format vcf4old input.vcf > output.avinput
              convert2annovar.pl -format rsid snplist.txt -dbsnpfile snp138.txt > output.avinput
              convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
              convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/

     Version: $Date: 2018-04-16 00:48:00 -0400 (Mon, 16 Apr 2018) $

OPTIONS
    --help  print a brief usage message and detailed explanation of options.

    --man   print the complete manual of the program.

    --verbose
            use verbose output.

    --format
            the format of the input files. Currently supported formats include
            pileup, cg, cgmastervar, gff3-solid, soap, maq, casava, vcf4,
            vcf4old, rsid. In August 2013, the VCF file processing subroutine
            was changed (multiple samples in a VCF file can be processed in a
            genotype-aware manner), but users can use vcf4old to obtain
            results identical to the old behavior.

    --outfile
            specify the output file name. By default, output is written to
            STDOUT.

    --snpqual
            quality score threshold in the pileup file, such that variant
            calls with lower quality scores will not be printed out in the
            output file.

    --snppvalue
            SNP P-value threshold in the GFF3-SOLiD file, such that variant
            calls with higher values will not be printed out in the output
            file.

    --coverage
            read coverage threshold in the pileup file, such that variant
            calls generated with lower coverage will not be printed in the
            output file.

    --maxcoverage
            maximum read coverage threshold in the pileup file, such that
            variant calls generated with higher coverage will not be printed
            in the output file.

    --includeinfo
            specify that the output should contain additional information in
            the input line. By default, only the chr, start, end, reference
            allele, observed allele and homozygosity status are included in
            output files.

    --chr   specify the chromosome for CASAVA format

    --chrmt specify the name of mitochondria chromosome (default is MT)

    --altcov
            the minimum coverage of the alternative (mutated) allele to be
            printed out in output

    --allelicfrac
            print out allelic fraction rather than het/hom status (for pileup
            format). This is useful when processing mitochondria variants.

    --fraction
            specify the minimum fraction of the alternative allele required
            to print out the mutation. For example, if a site has 10 reads
            and 3 of them support the alternative allele, the allelic
            fraction is 0.3, so a -fraction of 0.4 will not allow the
            mutation to be printed out.

    --species
            specify the species from which the sequencing data is obtained.
            For the GFF3-SOLiD format, when the species is human,
            chromosomes 23, 24 and 25 will be converted to X, Y and M,
            respectively.

    --filter
            for VCF4 file, only print out variant calls with this filter
            annotated. For example, if using GATK VariantFiltration walker,
            you will see PASS, GATKStandard, HARD_TO_VALIDATE, etc in the
            filter field. Using 'pass' as a filter is recommended in this
            case.

    --allsample
            for multi-sample VCF4 file, the --allsample argument will process
            all samples in the file and generate separate output files for
            each sample. By default, only the first sample in VCF4 file will
            be processed.

    --withzyg
            for VCF4 format, print out zygosity information, coverage
            information and genotype quality information when -includeinfo is
            used. By default, this information is printed out if
            -includeinfo is not used.

    --genoqual
            minimum genotype quality for the variant in this sample, to be
            printed out. The genotype quality is typically denoted as GQ in
            the SAMPLE column.

    --varqual
            minimum variant quality (the QUAL column in the VCF file) to
            handle the variant in VCF file.

    --comment
            include VCF4 header comment lines in the output file

    --genoqual
            specify the genotype quality score to be included in the output
            file

    --varqual
            specify the variant quality score to be included in the output
            file

    --dbsnpfile
            specify the dbSNP file to query (for rsid format)

    --withfreq
            include frequency information in the output (for VCF format with
            multiple samples)

    --withfilter
            include filter information in the output file (for VCF format)

    --seqdir
            specify the directory for sequence file (for region format)

    --inssize
            specify the insertion size when generating all mutations (for
            region format)

    --delsize
            specify the deletion size when generating all mutations (for
            region format)

    --subsize
            specify the substitution size when generating all mutations (for
            region format)

    --genefile
            specify the gene file from UCSC, which can be refGene, knownGene
            or ensGene (for transcript format)

    --splicing_threshold
            specify the splicing threshold (for transcript format)

    --context
            print context for indels which is useful to convert to VCF files
            (for CASAVA format)

    --avsnpfile
            specify the avsnpfile that will be queried when using rsid as the
            input file format

    --keepindelref
            do not alter the Ref and Alt alleles for indels in the VCF file
            (by default the program automatically changes and shortens the Ref
            and Alt allele)
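
The arithmetic behind the --fraction example in the option list above (10 reads, 3 supporting the alternative allele) can be checked with a minimal Python sketch. The function name `passes_fraction` is hypothetical, and treating the threshold as inclusive (>=) is an assumption made here, not something the manual states.

```python
def passes_fraction(alt_reads: int, total_reads: int, min_fraction: float) -> bool:
    """True if the alternative-allele fraction reaches the minimum.

    Assumption: the comparison is inclusive (>=); the manual does not say
    how an exact tie is handled.
    """
    if total_reads <= 0:
        return False  # no coverage, nothing to report
    return alt_reads / total_reads >= min_fraction


# 3 of 10 reads support the alternative allele: fraction is 0.3
print(passes_fraction(3, 10, 0.4))   # False: 0.3 is below 0.4
print(passes_fraction(3, 10, 0.25))  # True: 0.3 is above 0.25
```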

DESCRIPTION
    This program is used to convert variant call files generated by various
    software programs into ANNOVAR input format. Currently, the program can
    handle Samtools genotype-calling pileup format, SOLiD GFF format, Complete
    Genomics variant format, SOAP format, MAQ format, CASAVA format and VCF
    format. These formats are described below.

    *       pileup format

            The pileup format can be produced by the Samtools
            genotype-calling subroutine. Note that the phrase pileup format
            can be used in several contexts, and here I am only referring to
            the pileup files that contain the actual genotype calls.

            Using SamTools, given an alignment file in BAM format, a pileup
            file with genotype calls can be produced by the command below:

                    samtools pileup -vcf ref.fa aln.bam > raw.pileup
                    samtools.pl varFilter raw.pileup > final.pileup

            ANNOVAR will automatically filter the pileup file so that only
            SNPs reaching a quality threshold are printed out (default is 20,
            use the --snpqual argument to change this). Most likely, users
            will also want to apply a coverage threshold, such that SNP calls
            supported by only a few reads are not considered. This can be
            achieved using the --coverage argument (default value is 0).

            An example of pileup files for SNPs is shown below:

                    chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
                    chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
                    chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
                    chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
                    chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
                    chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/1A@B@@
                    chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>A@B?B;
                    chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./B=?@@>5.
                    chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<
                    chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%3B@%?%C?AB@BB/./-1A7?

            The columns are chromosome, 1-based coordinate, reference base,
            consensus base, consensus quality, SNP quality, maximum mapping
            quality of the reads covering the sites, the number of reads
            covering the site, read bases and base qualities.

            An example of pileup files for indels is shown below:

                    seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0

            ANNOVAR automatically recognizes both SNPs and indels in the
            pileup file, and processes them correctly.
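
To make the pileup column layout and the quality/coverage thresholds concrete, here is a small Python sketch. It is an illustration only, not the converter's actual code: `PileupSnp`, `parse_pileup_snp` and `keep` are hypothetical names, only SNP lines are handled (indel lines, whose reference column is `*`, are skipped), and the thresholds are assumed to be inclusive.

```python
from typing import NamedTuple, Optional


class PileupSnp(NamedTuple):
    chrom: str
    pos: int             # 1-based coordinate
    ref: str             # reference base
    consensus: str       # consensus base (IUPAC code for heterozygotes)
    consensus_qual: int
    snp_qual: int
    max_map_qual: int
    depth: int           # number of reads covering the site


def parse_pileup_snp(line: str) -> Optional[PileupSnp]:
    """Parse the first eight columns of a SNP pileup line; None otherwise."""
    f = line.split()
    if len(f) < 10 or f[2] == "*":  # too short, or an indel line
        return None
    return PileupSnp(f[0], int(f[1]), f[2], f[3],
                     int(f[4]), int(f[5]), int(f[6]), int(f[7]))


def keep(rec: PileupSnp, snpqual: int = 20, coverage: int = 0) -> bool:
    """Assumed filter: a real variant with enough SNP quality and coverage."""
    return (rec.consensus != rec.ref
            and rec.snp_qual >= snpqual
            and rec.depth >= coverage)


snp_line = ("chr1 556678 G K 60 60 60 24 "
            ",$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89")
ref_line = "chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:"

for text in (snp_line, ref_line):
    rec = parse_pileup_snp(text)
    print(rec.pos, rec.ref, rec.consensus, keep(rec))
```

With the default thresholds the site at 556678 (consensus K, SNP quality 60) is kept, while the site at 556674 is dropped because its consensus base equals the reference.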

    *       GFF3-SOLiD format

            The SOLiD provides a GFF3-compatible format for SNPs, indels and
            structural variants. A typical example file is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version 2
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files Yoruban_snp_10x.txt
                    ##run-path
                    chr_name        AB_SOLiD SNP caller     SNP     coord   coord   1       .       .       coverage=# cov;ref_base=ref;ref_score=score;ref_confi=confi;ref_single=Single;ref_paired=Paired;consen_base=consen;consen_score=score;consen_confi=conf;consen_single=Single;consen_paired=Paired;rs_id=rs_id,dbSNP129
                    1       AB_SOLiD SNP caller     SNP     997     997     1       .       .       coverage=3;ref_base=A;ref_score=0.3284;ref_confi=0.9142;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6716;consen_confi=0.9349;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     2061    2061    1       .       .       coverage=2;ref_base=G;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.8985;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     4770    4770    1       .       .       coverage=2;ref_base=A;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G;consen_score=1.0000;consen_confi=0.8854;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     4793    4793    1       .       .       coverage=14;ref_base=A;ref_score=0.0723;ref_confi=0.8746;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6549;consen_confi=0.8798;consen_single=0/0;consen_paired=9/9
                    1       AB_SOLiD SNP caller     SNP     6241    6241    1       .       .       coverage=2;ref_base=T;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.7839;consen_single=0/0;consen_paired=2/2

            Newer versions of ABI BioScope now use the diBayes caller, and an
            example of its output file is given below:

                    ##gff-version 3
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##List of SNPs. Date Sat Dec 18 10:30:45 2010    Stringency: medium Mate Pair: 1 Read Length: 50 Polymorphism Rate: 0.003000 Bayes Coverage: 60 Bayes_Single_SNP: 1 Filter_Single_SNP: 1 Quick_P_Threshold: 0.997000 Bayes_P_Threshold: 0.040000 Minimum_Allele_Ratio: 0.150000 Minimum_Allele_Ratio_Multiple_of_Dicolor_Error: 100
                    ##1     chr1
                    ##2     chr2
                    ##3     chr3
                    ##4     chr4
                    ##5     chr5
                    ##6     chr6
                    ##7     chr7
                    ##8     chr8
                    ##9     chr9
                    ##10    chr10
                    ##11    chr11
                    ##12    chr12
                    ##13    chr13
                    ##14    chr14
                    ##15    chr15
                    ##16    chr16
                    ##17    chr17
                    ##18    chr18
                    ##19    chr19
                    ##20    chr20
                    ##21    chr21
                    ##22    chr22
                    ##23    chrX
                    ##24    chrY
                    ##25    chrM
                    # source-version SOLiD BioScope diBayes(SNP caller)
                    #Chr    Source  Type    Pos_Start       Pos_End Score   Strand  Phase   Attributes
                    chr1    SOLiD_diBayes   SNP     221367  221367  0.091151        .       .       genotype=R;reference=G;coverage=3;refAlleleCounts=1;refAlleleStarts=1;refAlleleMeanQV=29;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=27;diColor1=11;diColor2=33;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     555317  555317  0.095188        .       .       genotype=Y;reference=T;coverage=13;refAlleleCounts=11;refAlleleStarts=10;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=00;diColor2=22;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     555327  555327  0.037582        .       .       genotype=Y;reference=T;coverage=12;refAlleleCounts=6;refAlleleStarts=6;refAlleleMeanQV=19;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=12;diColor2=30;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     559817  559817  0.094413        .       .       genotype=Y;reference=T;coverage=9;refAlleleCounts=5;refAlleleStarts=4;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=14;diColor1=11;diColor2=33;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     714068  714068  0.000000        .       .       genotype=M;reference=C;coverage=13;refAlleleCounts=7;refAlleleStarts=6;refAlleleMeanQV=25;novelAlleleCounts=6;novelAlleleStarts=4;novelAlleleMeanQV=22;diColor1=00;diColor2=11;het=1;flag=

            The file conforms to standard GFF3 specifications, but the last
            column is SOLiD-specific and gives certain parameters for the SNP
            calls.

            An example of the short indel format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
                    ##type DNA
                    ##date 2009-01-26
                    ##time 18:33:20
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files ../../mp-results/JOAN_20080104_1.pas,../../mp-results/BARB_20071114_1.pas,../../mp-results/BARB_20080227_2.pas
                    ##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-w2x25-2x-4x-8x-10x/2x
                    ##Filter-settings: max-ave-read-pos=none,min-ave-from-end-pos=9.1,max-nonreds-4filt=2,min-insertion-size=none,min-deletion-size=none,max-insertion-size=none,max-deletion-size=none,require-called-indel-size?=T
                    chr1    AB_SOLiD Small Indel Tool       deletion        824501  824501  1       .       .   del_len=1;tight_chrom_pos=824501-824502;loose_chrom_pos=824501-824502;no_nonred_reads=2;no_mismatches=1,0;read_pos=4,6;from_end_pos=21,19;strands=+,-;tags=R3,F3;indel_sizes=-1,-1;read_seqs=G3021212231123203300032223,T3321132212120222323222101;dbSNP=rs34941678,chr1:824502-824502(-),EXACT,1,/GG
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  1118641 1118641 1       .       .   ins_len=3;tight_chrom_pos=1118641-1118642;loose_chrom_pos=1118641-1118642;no_nonred_reads=2;no_mismatches=0,1;read_pos=17,6;from_end_pos=8,19;strands=+,+;tags=F3,R3;indel_sizes=3,3;read_seqs=T0033001100022331122033112,G3233112203311220000001002

            The keyword deletion or insertion_site is used in the third
            column (the GFF3 type field) to indicate the type of indel.

            An example of the medium CNV format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
                    ##type DNA
                    ##date 2009-01-27
                    ##time 15:54:36
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files big_d20e5-del12n_up-ConsGrp-2nonred.pas.sum
                    ##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-results-lmp-e5/big_d20e5-indel_950_2050
                    chr1    AB_SOLiD Small Indel Tool       deletion        3087770 3087831 1       .       .   del_len=62;tight_chrom_pos=none;loose_chrom_pos=3087768-3087773;no_nonred_reads=2;no_mismatches=2,2;read_pos=27,24;from_end_pos=23,26;strands=-,+;tags=F3,F3;indel_sizes=-62,-62;read_seqs=T11113022103331111130221213201111302212132011113022,T02203111102312122031111023121220311111333012203111
                    chr1    AB_SOLiD Small Indel Tool       deletion        4104535 4104584 1       .       .   del_len=50;tight_chrom_pos=4104534-4104537;loose_chrom_pos=4104528-4104545;no_nonred_reads=3;no_mismatches=0,4,4;read_pos=19,19,27;from_end_pos=31,31,23;strands=+,+,-;tags=F3,R3,R3;indel_sizes=-50,-50,-50;read_seqs=T31011011013211110130332130332132110110132020312332,G21031011013211112130332130332132110132132020312332,G20321302023001101123123303103303101113231011011011
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  2044888 2044888 1       .       .   ins_len=18;tight_chrom_pos=2044887-2044888;loose_chrom_pos=2044887-2044889;no_nonred_reads=2;bead_ids=1217_1811_209,1316_908_1346;no_mismatches=0,2;read_pos=13,15;from_end_pos=37,35;strands=-,-;tags=F3,F3;indel_sizes=18,18;read_seqs=T31002301231011013121000101233323031121002301231011,T11121002301231011013121000101233323031121000101231;non_indel_no_mismatches=3,1;non_indel_seqs=NIL,NIL
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  74832565        74832565        1   .       .       ins_len=16;tight_chrom_pos=74832545-74832565;loose_chrom_pos=74832545-74832565;no_nonred_reads=2;bead_ids=1795_181_514,1651_740_519;no_mismatches=0,2;read_pos=13,13;from_end_pos=37,37;strands=-,-;tags=F3,R3;indel_sizes=16,16;read_seqs=T33311111111111111111111111111111111111111111111111,G23311111111111111111111111111111111111111311011111;non_indel_no_mismatches=1,0;non_indel_seqs=NIL,NIL

            An example of the large indel format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version ???
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files /data/results5/yoruban_strikes_back_large_indels/LMP/five_mm_unique_hits_no_rescue/5_point_6x_del_lib_1/results/NA18507_inter_read_indels_5_point_6x.dat
                    ##run-path
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  1307279 1307791 1       .       .   deviation=-742;stddev=7.18;ref_clones=-;dev_clones=4
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2042742 2042861 1       .       .   deviation=-933;stddev=8.14;ref_clones=-;dev_clones=3
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2443482 2444342 1       .       .   deviation=-547;stddev=11.36;ref_clones=-;dev_clones=17
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2932046 2932984 1       .       .   deviation=-329;stddev=6.07;ref_clones=-;dev_clones=14
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  3166925 3167584 1       .       .   deviation=-752;stddev=13.81;ref_clones=-;dev_clones=14

            An example of the CNV format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version ???
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files Yoruban_cnv.coords
                    ##run-path
                    chr1    AB_CNV_PIPELINE repeat_region   1062939 1066829 .       .       .       fraction_mappable=51.400002;logratio=-1.039300;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   1073630 1078667 .       .       .       fraction_mappable=81.000000;logratio=-1.409500;copynum=1;numwindows=2
                    chr1    AB_CNV_PIPELINE repeat_region   2148325 2150352 .       .       .       fraction_mappable=98.699997;logratio=-1.055000;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   2245558 2248109 .       .       .       fraction_mappable=78.400002;logratio=-1.042900;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   3489252 3492632 .       .       .       fraction_mappable=59.200001;logratio=-1.119900;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   5654415 5657276 .       .       .       fraction_mappable=69.900002;logratio=1.114500;copynum=4;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   9516165 9522726 .       .       .       fraction_mappable=65.850006;logratio=-1.316700;numwindows=2
                    chr1    AB_CNV_PIPELINE repeat_region   16795117        16841025        .       .       .   fraction_mappable=44.600002;logratio=1.880778;copynum=7;numwindows=9

            The keyword repeat_region is used here, although it actually
            refers to CNVs.
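
The ninth GFF3 column packs the CNV details into key=value pairs separated by semicolons. The following is a minimal Python sketch of splitting one such record. It assumes tab-delimited input (the tabs show up as runs of spaces in the listing above), and `parse_gff3_line` is a hypothetical helper name, not part of ANNOVAR:

```python
def parse_gff3_line(line):
    """Split one GFF3 record into its nine columns and decode the attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, feature, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:  # skip empty pieces left by a trailing semicolon
            key, _, value = pair.partition("=")
            attributes[key] = value
    return {
        "seqid": seqid, "source": source, "type": feature,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


# First CNV record from the example above, rewritten with explicit tabs.
record = parse_gff3_line(
    "chr1\tAB_CNV_PIPELINE\trepeat_region\t1062939\t1066829\t.\t.\t.\t"
    "fraction_mappable=51.400002;logratio=-1.039300;copynum=1;numwindows=1"
)
# Not every record carries copynum (the chr1:9516165 line lacks it),
# so use .get() instead of indexing.
print(record["seqid"], record["start"], record["end"],
      record["attributes"].get("copynum"))
```

Attribute values are kept as strings here; converting `logratio` or `copynum` to numbers is left to the caller, because a key may be missing from a record.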

            An example of the inversion format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.2
                    ##generated by SOLiD inversion tool
                    chr10   AB_SOLiD        inversion       46443107        46479585        268.9   .       .   left=chr10:46443107-46443146;right=chr10:46479583-46479585;leftscore=295.0;rightscore=247.0;count_AAA_further_left=117;count_AAA_left=3;count_AAA_right=3;count_AAA_further_right=97;left_min_count_AAA=chr10:46443107-46443112;count_AAA_min_left=0;count_AAA_max_left=3;right_min_count_AAA=chr10:46479585-46479585;count_AAA_min_right=1;count_AAA_max_right=3;homozygous=UNKNOWN
                    chr4    AB_SOLiD        inversion       190822813       190850112       214.7   .       .   left=chr4:190822813-190822922;right=chr4:190850110-190850112;leftscore=140.0;rightscore=460.0;count_AAA_further_left=110;count_AAA_left=78;count_AAA_right=74;count_AAA_further_right=77;left_min_count_AAA=chr4:190822813-190822814;count_AAA_min_left=69;count_AAA_max_left=77;right_min_count_AAA=chr4:190850110-190850112;count_AAA_min_right=74;count_AAA_max_right=74;homozygous=NO
                    chr6    AB_SOLiD        inversion       168834969       168837154       175.3   .       .   left=chr6:168834969-168835496;right=chr6:168836643-168837154;leftscore=185.4;rightscore=166.2;count_AAA_further_left=67;count_AAA_left=43;count_AAA_right=40;count_AAA_further_right=59;left_min_count_AAA=chr6:168835058-168835124,chr6:168835143-168835161,chr6:168835176-168835181,chr6:168835231-168835262;count_AAA_min_left=23;count_AAA_max_left=29;right_min_count_AAA=chr6:168836643-168836652;count_AAA_min_right=23;count_AAA_max_right=31;homozygous=NO

            The program should be able to recognize all the above GFF3-SOLiD
            format automatically, and handle them accordingly.

    *       Complete Genomics format

            This format is provided by the Complete Genomics company to their
            customers. The file var-[ASM-ID].tsv.bz2 includes a description of
            all loci where the assembled genome differs from the reference
            genome.

            An example of the Complete Genomics format is shown below:

                    #BUILD  1.5.0.5
                    #GENERATED_AT   2009-Nov-03 19:52:21.722927
                    #GENERATED_BY   dbsnptool
                    #TYPE   VAR-ANNOTATION
                    #VAR_ANN_SET    /Proj/Pipeline/Production_Data/REF/HUMAN-F_06-REF/dbSNP.csv
                    #VAR_ANN_TYPE   dbSNP
                    #VERSION        0.3

                    >locus  ploidy  haplotype       chromosome      begin   end     varType reference       alleleSeq       totalScore      hapLink xRef
                    1       2       all     chr1    0       959     no-call =       ?
                    2       2       all     chr1    959     972     =       =       =
                    3       2       all     chr1    972     1001    no-call =       ?
                    4       2       all     chr1    1001    1008    =       =       =
                    5       2       all     chr1    1008    1114    no-call =       ?
                    6       2       all     chr1    1114    1125    =       =       =
                    7       2       all     chr1    1125    1191    no-call =       ?
                    8       2       all     chr1    1191    1225    =       =       =
                    9       2       all     chr1    1225    1258    no-call =       ?
                    10      2       all     chr1    1258    1267    =       =       =
                    12      2       all     chr1    1267    1275    no-call =       ?
                    13      2       all     chr1    1275    1316    =       =       =
                    14      2       all     chr1    1316    1346    no-call =       ?
                    15      2       all     chr1    1346    1367    =       =       =
                    16      2       all     chr1    1367    1374    no-call =       ?
                    17      2       all     chr1    1374    1388    =       =       =
                    18      2       all     chr1    1388    1431    no-call =       ?
                    19      2       all     chr1    1431    1447    =       =       =
                    20      2       all     chr1    1447    1454    no-call =       ?

            The following information comes from the Complete Genomics
            documentation that describes the var-ASM format.

                    1. locus. Identifier of a particular genomic locus
                    2. ploidy. The ploidy of the reference genome at the locus (= 2 for autosomes, 2 for pseudoautosomal regions on the sex chromosomes, 1 for males on the non-pseudoautosomal parts of the sex chromosomes, 1 for mitochondrion, '?' if varType is 'no-ref' or 'PAR-called-in-X'). The reported ploidy is fully determined by gender, chromosome and location, and is not inferred from the sequence data.
                    3. haplotype. Identifier for each haplotype at the variation locus. For diploid genomes, 1 or 2. Shorthand of 'all' is allowed where the varType field is one of 'ref', 'no-call', 'no-ref', or 'PAR-called-in-X'. Haplotype numbering does not imply phasing; haplotype 1 in locus 1 is not necessarily in phase with haplotype 1 in locus 2. See hapLink, below, for phasing information.
                    4. chromosome. Chromosome name in text: 'chr1','chr2', ... ,'chr22','chrX','chrY'. The mitochondrion is represented as 'chrM'. The pseudoautosomal regions within the sex chromosomes X and Y are reported at their coordinates on chromosome X.
                    5. begin. Reference coordinate specifying the start of the variation (not the locus) using the half-open zero-based coordinate system. See section 'Sequence Coordinate System' for more information.
                    6. end. Reference coordinate specifying the end of the variation (not the locus) using the half-open zero-based coordinate system. See section 'Sequence Coordinate System' for more information.
                    7. varType. Type of variation, currently one of:
                            snp: single-nucleotide polymorphism
                            ins: insertion
                            del: deletion
                            sub: Substitution of one or more reference bases with the bases in the allele column
                            ref: no variation; the sequence is identical to the reference sequence on the indicated haplotype
                            no-call-rc: 'no-call reference consistent': one or more bases are ambiguous, but the allele is potentially consistent with the reference
                            no-call-ri: 'no-call reference inconsistent': one or more bases are ambiguous, but the allele is definitely inconsistent with the reference
                            no-call: an allele is completely indeterminate in length and composition, i.e. alleleSeq = '?'
                            no-ref: the reference sequence is unspecified at this locus.
                            PAR-called-in-X: this locus overlaps one of the pseudoautosomal regions on the sex chromosomes. The called sequence is reported as diploid sequence on Chromosome X; on chromosome Y the sequence is reported as varType = 'PAR-called-in-X'.
                    8. reference. The reference sequence for the locus of variation. Empty when varType is ins. A value of '=' indicates that the user must consult the reference for the sequence; this shorthand is only used in regions where no haplotype deviates from the reference sequence.
                    9. alleleSeq. The observed sequence at the locus of variation. Empty when varType is del. '?' is used to indicate 0 or more unknown bases within the sequence; 'N' is used to indicate exactly one unknown base within the sequence. '=' is used as shorthand to indicate identity to the reference sequence for non-variant sequence, i.e. when varType is 'ref'.
                    10. totalScore. A score corresponding to a single variation and haplotype, representing the confidence in the call.
                    11. hapLink. Identifier that links a haplotype at one locus to haplotypes at other loci. Currently only populated for very proximate variations that were assembled together. Two calls that share a hapLink identifier are expected to be on the same haplotype.
                    12. xRef. Field containing external variation identifiers, currently only populated for variations corroborated directly by dbSNP. Format: dbsnp:[rsID], with multiple entries separated by the semicolon (;).

            In older versions of the format specification, the sub keyword
            was called insdel. ANNOVAR handles both keywords.
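
Because begin and end use a half-open, zero-based coordinate system, a conversion is needed before comparing them with 1-based positions. The sketch below shows that arithmetic in Python. The helper names are hypothetical, the input is assumed to be tab-delimited with trailing empty columns omitted (as in the example rows above), and this is not a description of how ANNOVAR itself converts the file:

```python
CG_COLUMNS = [
    "locus", "ploidy", "haplotype", "chromosome", "begin", "end",
    "varType", "reference", "alleleSeq", "totalScore", "hapLink", "xRef",
]
# varType values that describe an actual called variant.
CALLED_VARIANT_TYPES = {"snp", "ins", "del", "sub"}


def parse_var_row(line):
    """Map one data row onto the 12 documented column names."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (len(CG_COLUMNS) - len(fields))  # pad missing trailing columns
    row = dict(zip(CG_COLUMNS, fields))
    row["begin"] = int(row["begin"])
    row["end"] = int(row["end"])
    return row


def to_one_based(begin, end):
    """Convert half-open zero-based [begin, end) to 1-based inclusive (start, end).

    A zero-length interval (begin == end), which is how an insertion point
    looks in half-open coordinates, has no inclusive equivalent; None is
    returned and the caller decides how to report it.
    """
    if begin < 0 or end < begin:
        raise ValueError(f"invalid interval: {begin}-{end}")
    if begin == end:
        return None
    return begin + 1, end


# Second row of the example above, with explicit tabs.
row = parse_var_row("2\t2\tall\tchr1\t959\t972\t=\t=\t=")
print(row["chromosome"], to_one_based(row["begin"], row["end"]))
print(row["varType"] in CALLED_VARIANT_TYPES)
```

The interval length is unchanged by the conversion: `end - begin` in half-open coordinates equals `end - start + 1` in 1-based inclusive coordinates.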

    *       SOAPsnp format

            An example of the SOAP SNP caller format is shown below:

                    chr8  35782  A  R  1  A  27  1  2  G  26  1  2  5   0.500000  2.00000  1  5
                    chr8  35787  G  R  0  G  25  4  6  A  17  2  4  10  0.266667  1.60000  0  5

            The following information is provided in documentation from BGI
            who developed SOAP suite. It differs slightly from the description
            at the SOAPsnp website, and presumably the website is outdated.

                    Format description:(left to right)
                    1. Chromosome name
                    2. Position of locus
                    3. Nucleotide at corresponding locus of reference sequence
                    4. Genotype of sequencing sample
                    5. Quality value
                    6. nucleotide with the highest probability(first nucleotide)
                    7. Quality value of the nucleotide with the highest probability
                    8. Number of supported reads that can only be aligned to this locus
                    9. Number of all supported reads that can be aligned to this locus
                    10. Nucleotide with higher probability
                    11. Quality value of nucleotide with higher probability
                    12. Number of supported reads that can only be aligned to this locus
                    13. Number of all supported reads that can be aligned to this locus
                    14. Total number of reads that can be aligned to this locus
                    15. Order and quality value
                    16. Estimated copy number for this locus
                    17. Presence of this locus in the dbSNP database. 1 refers to presence and 0 refers to inexistence
                    18. The distance between this locus and another closest SNP

            Later, SOAPsnp changed its output format to 17 columns. An example
            of the format is shown below:

                    1 12837840 G C 12 C 37 5 5 G 0 0 0 5 1.00000 1.00000 0
                    1 12853805 T K 0 T 39 1 1 G 35 1 1 2 1.00000 1.00000 0

            The following information is provided on SOAPsnp website as of
            16Apr2013, and it is slightly different from the documentation
            with SOAPsnp, which only has 14 columns.

                    The result of SOAPsnp has 17 columns:
                    1)  Chromosome ID
                    2)  Coordinate on chromosome, start from 1
                    3)  Reference genotype
                    4)  Consensus genotype
                    5)  Quality score of consensus genotype
                    6)  Best base
                    7)  Average quality score of best base
                    8)  Count of uniquely mapped best base
                    9)  Count of all mapped best base
                    10) Second best bases
                    11) Average quality score of second best base
                    12) Count of uniquely mapped second best base
                    13) Count of all mapped second best base
                    14) Sequencing depth of the site
                    15) Rank sum test p_value
                    16) Average copy number of nearby region
                    17) Whether the site is a dbSNP.
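
The 17-column layout listed above can be mapped onto named fields as follows. This is a hedged Python sketch: the field names are made up for readability, whitespace-separated input is assumed, and treating column 17 as a 1/0 flag is an assumption carried over from the older 18-column description:

```python
SOAPSNP_COLUMNS = [
    "chrom", "pos", "ref", "consensus", "consensus_qual",
    "best_base", "best_avg_qual", "best_unique_count", "best_all_count",
    "second_base", "second_avg_qual", "second_unique_count", "second_all_count",
    "depth", "rank_sum_p", "avg_copy_number", "in_dbsnp",
]
INT_FIELDS = {
    "pos", "consensus_qual", "best_avg_qual", "best_unique_count",
    "best_all_count", "second_avg_qual", "second_unique_count",
    "second_all_count", "depth",
}
FLOAT_FIELDS = {"rank_sum_p", "avg_copy_number"}


def parse_soapsnp(line):
    """Parse one 17-column SOAPsnp record into a dict with typed values."""
    fields = line.split()
    if len(fields) != len(SOAPSNP_COLUMNS):
        raise ValueError(f"expected 17 columns, got {len(fields)}")
    rec = dict(zip(SOAPSNP_COLUMNS, fields))
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    for key in FLOAT_FIELDS:
        rec[key] = float(rec[key])
    rec["in_dbsnp"] = rec["in_dbsnp"] == "1"  # assumption: "1" means present
    return rec


rec = parse_soapsnp("1 12837840 G C 12 C 37 5 5 G 0 0 0 5 1.00000 1.00000 0")
# A site is a candidate variant when the consensus genotype differs from the reference.
print(rec["chrom"], rec["pos"], rec["ref"], rec["consensus"],
      rec["consensus"] != rec["ref"])
```

The consensus genotype may be an IUPAC ambiguity code (such as K in the second example record), so a real converter would need to expand it into the two underlying bases.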

    *       SOAPindel format

            The current version of ANNOVAR handles SoapSNP and SoapIndel
            automatically via a single argument '--format soap'. An example of
            SOAP indel caller format is shown below:

                    chr11   44061282        -       +2      CT      Hete
                    chr11   45901572        +       +1      C       Hete
                    chr11   48242562        *       -3      TTC     Homo
                    chr11   57228723        *       +4      CTTT    Homo
                    chr11   57228734        *       +4      CTTT    Homo
                    chr11   57555685        *       -1      C       Hete
                    chr11   61482191        -       +3      TCC     Hete
                    chr11   64608031        *       -1      T       Homo
                    chr11   64654936        *       +1      C       Homo
                    chr11   71188303        +       -1      T       Hete
                    chr11   75741034        +       +1      T       Hete
                    chr11   76632438        *       +1      A       Hete
                    chr11   89578266        *       -2      AG      Homo
                    chr11   104383261       *       +1      T       Hete
                    chr11   124125940       +       +4      CCCC    Hete
                    chr12   7760052 *       +1      T       Homo
                    chr12   8266049 *       +3      ACG     Homo

            As of September 2010, I have not seen any documentation
            describing this format.

    *       SOAPsv format

            An example is given below:

                    Chr2 Deletion 42894 43832 43167 43555 388 0-0-0 FR 41

            An explanation of the structural variation format is given below:

                    Format description (from left to right)
                    1. Chromosome name
                    2. Type of structure variation
                    3. Minimal value of start position in cluster
                    4. Maximal value of end position in cluster
                    5. Estimated start position of this structure variation
                    6. Estimated end position of this structure variation
                    7. Length of SV
                    8. Breakpoint of SV (only for insertion)
                    9. Unusual matching mode (F refers to align with forward sequence, R refers to align with reverse sequence)
                    10. Number of paired-end reads that support this structure variation
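
The ten columns described above can be read into a small record, as in this Python sketch. The field names are hypothetical, whitespace-separated input is assumed, and the breakpoint column is kept as raw text because its internal layout is not documented here:

```python
SOAPSV_COLUMNS = [
    "chrom", "sv_type", "cluster_min_start", "cluster_max_end",
    "est_start", "est_end", "length", "breakpoint", "match_mode", "support",
]
INT_FIELDS = {
    "cluster_min_start", "cluster_max_end", "est_start", "est_end",
    "length", "support",
}


def parse_soapsv(line):
    """Parse one SOAPsv record; integer columns are converted, others kept as text."""
    fields = line.split()
    if len(fields) != len(SOAPSV_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    rec = dict(zip(SOAPSV_COLUMNS, fields))
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    return rec


sv = parse_soapsv("Chr2 Deletion 42894 43832 43167 43555 388 0-0-0 FR 41")
# In this example the reported length equals est_end - est_start.
print(sv["sv_type"], sv["est_start"], sv["est_end"],
      sv["length"] == sv["est_end"] - sv["est_start"])
```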

    *       MAQ format

            MAQ can perform alignment and generate genotype calls, including
            SNP calls and indel calls. The format is described below:

            For indel header: The output is TAB delimited with each line
            consisting of chromosome, start position, type of the indel,
            number of reads across the indel, size of the indel and
            inserted/deleted nucleotides (separated by colon), number of
            indels on the reverse strand, number of indels on the forward
            strand, 5' sequence ahead of the indel, 3' sequence following the
            indel, number of reads aligned without indels and three additional
            columns for filters.

            An example is below:

                    chr10   110583  -       2       -2:AG   0       1       GCGAGACTCAGTATCAAAAAAAAAAAAAAAAA   AGAAAGAAAGAAAAAGAAAAAAATAGAAAGAA        1       @2,     @72,   @0,
                    chr10   120134  -       8       -2:CA   0       1       CTCTTGCCCGCTCACACATGTACACACACGCG   CACACACACACACACACATCAGCTACCTACCT        7       @65,62,61,61,45,22,7,   @9,12,13,13,29,52,67,   @0,0,0,0,0,0,0,
                    chr10   129630  -       1       -1:T    1       0       ATGTTGTGACTCTTAATGGATAAGTTCAGTCA   TTTTTTTTTAGCTTTTAACCGGACAAAAAAAG        0       @       @      @
                    chr10   150209  -       1       4:TTCC  1       0       GCATATAGGGATGGGCACTTTACCTTTCTTTT   TTCCTTCCTTCCTTCCTTCCCTTTCCTTTCCT        0       @       @      @
                    chr10   150244  -       2       -4:TTCT 0       1       CTTCCTTCCTTCCTTCCCTTTCCTTTCCTTTC   TTCTTTCTTTCTTTCTTTCTTTTTTTTTTTTT        0       @       @      @
                    chr10   159622  -       1       3:AGG   0       1       GAAGGAGGAAGGACGGAAGGAGGAAGGAAGGA   AGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGA        0       @       @      @
                    chr10   206372  -       2       2:GT    1       0       ATAATAGTAACTGTGTATTTGATTATGTGTGC   GTGTGTGTGTGTGTGTGTGTGTGTGCGTGCTT        1       @37,    @37,   @8,
                    chr10   245751  -       11      -1:C    0       1       CTCATAAATACAAGTCATAATGAAAGAAATTA   CCACCATTTTCTTATTTTCATTCATTTTTAGT        10      @69,64,53,41,30,25,22,14,5,4,   @5,10,21,33,44,49,52,60,69,70,  @0,0,0,0,0,0,0,0,0,0,
                    chr10   253066  -       1       2:TT    0       1       TATTGATGAGGGTGGATTATACTTTAGAACAC   TATTCAAACAGTTCTTCCACATATCTCCCTTT        0       @       @      @
                    chr10   253455  -       2       -3:AAA  1       0       GTTGCACTCCAGCCTGGCGAGATTCTGTCTCC   AAAAAAAAAAAAAAAAATTGTTGTGAAATACA        1       @55,    @19,   @4,
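
The fifth column combines the indel size and the affected bases, separated by a colon. The Python sketch below decodes it. It assumes that a negative size marks a deletion and a positive size an insertion, which matches the example rows (deleted bases reappear at the start of the 3' flank) but is not stated in the description above; the helper name is hypothetical:

```python
def parse_maq_indel_field(field):
    """Decode a 'size:bases' field such as '-2:AG' or '4:TTCC'.

    Assumption: negative size = deletion, positive size = insertion.
    """
    size_text, sep, bases = field.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in indel field: {field!r}")
    size = int(size_text)
    if size == 0 or abs(size) != len(bases):
        raise ValueError(f"size and sequence disagree in: {field!r}")
    kind = "deletion" if size < 0 else "insertion"
    return kind, abs(size), bases


# First example row, written with explicit tabs.
line = ("chr10\t110583\t-\t2\t-2:AG\t0\t1\t"
        "GCGAGACTCAGTATCAAAAAAAAAAAAAAAAA\tAGAAAGAAAGAAAAAGAAAAAAATAGAAAGAA\t"
        "1\t@2,\t@72,\t@0,")
fields = line.split("\t")
chrom, pos = fields[0], int(fields[1])
kind, size, bases = parse_maq_indel_field(fields[4])
print(chrom, pos, kind, size, bases)
```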

            For snp output file: Each line consists of chromosome, position,
            reference base, consensus base, Phred-like consensus quality, read
            depth, the average number of hits of reads covering this position,
            the highest mapping quality of the reads covering the position,
            the minimum consensus quality in the 3bp flanking regions at each
            side of the site (6bp in total), the second best call, log
            likelihood ratio of the second best and the third best call, and
            the third best call.

            An example is below:

                    chr10   83603   C       T       28      12      2.81    63      34      Y       26      C
                    chr10   83945   G       R       59      61      4.75    63      62      A       47      G
                    chr10   83978   G       R       47      40      3.31    63      62      A       21      G
                    chr10   84026   G       R       89      22      2.44    63      62      G       49      A
                    chr10   84545   C       T       54      9       1.69    63      30      N       135     N
                    chr10   85074   G       A       42      5       1.19    63      38      N       108     N
                    chr10   85226   A       T       42      5       1.00    63      42      N       107     N
                    chr10   85229   C       T       42      5       1.00    63      42      N       112     N
                    chr10   87518   A       G       39      4       3.25    63      38      N       9       N
                    chr10   116402  T       C       39      4       1.00    63      38      N       76      N

    *       CASAVA format

            An example of Illumina CASAVA format is given below:

                    #position       A       C       G       T       modified_call   total   used    score   reference       type
                    14930   3       0       8       0       GA      11      11      29.10:11.10             A   SNP_het2
                    14933   4       0       7       0       GA      11      11      23.50:13.70             G   SNP_het1
                    14976   3       0       8       0       GA      11      11      24.09:9.10              G   SNP_het1
                    15118   2       1       4       0       GA      8       7       10.84:6.30              A   SNP_het2

            An example of the indels is given below:

                    # ** CASAVA depth-filtered indel calls **
                    #$ CMDLINE /illumina/pipeline/install/CASAVA_v1.7.0/libexec/CASAVA-1.7.0/filterIndelCalls.pl --meanReadDepth=2.60395068970547 --indelsCovCutoff=-1 --chrom=chr1.fa /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0000.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0001.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0002.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0003.txt /data/Basecalls/100806_HARMONIAPILOT-H16_0338_A2065HABXX/Data/Intensities/BaseCalls/CASAVA_PE_L2/Parsed_14-08-10/chr1.fa/Indel/varling_indel_calls_0004.txt
                    #$ CHROMOSOME chr1.fa
                    #$ MAX_DEPTH undefined
                    #
                    #$ COLUMNS pos CIGAR ref_upstream ref/indel ref_downstream Q(indel) max_gtype Q(max_gtype) max2_gtype bp1_reads ref_reads indel_reads other_reads repeat_unit ref_repeat_count indel_repeat_count
                    948847  1I      CCTCAGGCTT      -/A     ATAATAGGGC      969     hom     47      het     22   0       16      6       A       1       2
                    978604  2D      CACTGAGCCC      CT/--   GTGTCCTTCC      251     hom     20      het     8   0       4       4       CT      1       0
                    1276974 4I      CCTCATGCAG      ----/ACAC       ACACATGCAC      838     hom     39      het   18      0       14      4       AC      2       4
                    1289368 2D      AGCCCGGGAC      TG/--   GGAGCCGCGC      1376    hom     83      het     33   0       25      9       TG      1       0

    *       VCF4 format

            VCF4 can be used to describe either population-level variation
            or variants called from the reads of a single individual.

            One example of the indel format for one individual is given below:

                    ##fileformat=VCFv4.0
                    ##IGv2_bam_file_used=MIAPACA2.alnReAln.bam
                    ##INFO=<ID=AC,Number=2,Type=Integer,Description="# of reads supporting consensus indel/any indel at the site">
                    ##INFO=<ID=DP,Number=1,Type=Integer,Description="total coverage at the site">
                    ##INFO=<ID=MM,Number=2,Type=Float,Description="average # of mismatches per consensus indel-supporting read/per reference-supporting read">
                    ##INFO=<ID=MQ,Number=2,Type=Float,Description="average mapping quality of consensus indel-supporting reads/reference-supporting reads">
                    ##INFO=<ID=NQSBQ,Number=2,Type=Float,Description="Within NQS window: average quality of bases from consensus indel-supporting reads/from reference-supporting reads">
                    ##INFO=<ID=NQSMM,Number=2,Type=Float,Description="Within NQS window: fraction of mismatching bases in consensus indel-supporting reads/in reference-supporting reads">
                    ##INFO=<ID=SC,Number=4,Type=Integer,Description="strandness: counts of forward-/reverse-aligned indel-supporting reads / forward-/reverse-aligned reference supporting reads">
                    ##IndelGenotyperV2=""
                    ##reference=hg18.fa
                    ##source=IndelGenotyperV2
                    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Miapaca_trimmed_sorted.bam
                    chr1    439     .       AC      A       .       PASS    AC=5,5;DP=7;MM=7.0,3.0;MQ=23.4,1.0;NQSBQ=23.98,25.5;NQSMM=0.04,0.0;SC=2,3,0,2   GT      1/0
                    chr1    714048  .       T       TCAAC   .       PASS    AC=3,3;DP=9;MM=3.0,7.1666665;MQ=1.0,10.833333;NQSBQ=23.266666,21.932203;NQSMM=0.0,0.15254237;SC=3,0,3,3 GT      0/1
                    chr1    714049  .       G       GC      .       PASS    AC=3,3;DP=9;MM=3.0,7.1666665;MQ=1.0,10.833333;NQSBQ=23.233334,21.83051;NQSMM=0.0,0.15254237;SC=3,0,3,3  GT      0/1
                    chr1    813675  .       A       AATAG   .       PASS    AC=5,5;DP=8;MM=0.4,1.0;MQ=5.0,67.0;NQSBQ=25.74,25.166666;NQSMM=0.0,0.033333335;SC=4,1,1,2       GT      0/1
                    chr1    813687  .       AGAGAGAGAGAAG   A       .       PASS    AC=5,5;DP=8;MM=0.4,1.0;MQ=5.0,67.0;NQSBQ=24.54,25.2;NQSMM=0.02,0.06666667;SC=4,1,1,2    GT      1/0
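
In the records above, REF and ALT share their first base, and the indel is whatever follows it. The Python sketch below classifies a single REF/ALT pair under that assumption. It is an illustration only, not the logic convert2annovar.pl uses, and multi-allelic ALT values are deliberately rejected:

```python
def classify_vcf_allele(ref, alt):
    """Return (kind, bases) for one REF/ALT pair.

    Assumes indels are anchored on a shared leading sequence, as in the
    example records; anything else is reported as 'complex'.
    """
    if "," in alt:
        raise ValueError("split multi-allelic ALT values before classifying")
    if len(ref) == len(alt):
        return ("snv" if len(ref) == 1 else "substitution", "")
    if len(ref) > len(alt) and ref.startswith(alt):
        return ("deletion", ref[len(alt):])   # bases removed from the reference
    if len(alt) > len(ref) and alt.startswith(ref):
        return ("insertion", alt[len(ref):])  # bases added after the anchor
    return ("complex", "")


# REF/ALT pairs taken from the example records above.
for ref, alt in [("AC", "A"), ("T", "TCAAC"), ("AGAGAGAGAGAAG", "A")]:
    print(ref, alt, classify_vcf_allele(ref, alt))
```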

    *       annovar2vcf format

            This is useful for converting certain ANNOVAR files to VCF format.
            These ANNOVAR input files MUST include zygosity, quality and
            filter information as the 3 extra columns after Chr, Start, End,
            Ref and Alt alleles.

    The code was written by Dr. Kai Wang and modified by Dr. Germán Gastón
    Leparc. Various users have provided sample input files for many SNP calling
    software, for the development of conversion subroutines. We thank these
    users for their continued support to improve the functionality of the
    script.

    For questions or comments, please contact kai@openbioinformatics.org.

table_annovar.pl

$cat table_annovar.txt
SYNOPSIS
     table_annovar.pl [arguments] <query-file> <database-location>

     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
                --protocol <string>         comma-delimited string specifying database protocol
                --operation <string>        comma-delimited string specifying type of operation
                --outfile <string>          output file name prefix
                --buildver <string>         genome build version (default: hg18)
                --remove                    remove all temporary files
                --(no)checkfile             check if database file exists (default: ON)
                --genericdbfile <files>     specify comma-delimited generic db files
                --gff3dbfile <files>        specify comma-delimited GFF3 files
                --bedfile <files>           specify comma-delimited BED files
                --vcfdbfile <files>         specify comma-delimited VCF files
                --otherinfo                 print out otherinfo (information after fifth column in queryfile)
                --onetranscript             print out only one transcript for exonic variants (default: all transcripts)
                --nastring <string>         string to display when a score is not available (default: null)
                --csvout                    generate comma-delimited CSV file (default: tab-delimited txt file)
                --argument <string>         comma-delimited strings as optional argument for each operation (use & for comma inside string)
                --convertarg <string>       argument to convert2annovar.pl
                --codingarg <string>        argument to coding_change.pl
                --tempdir <dir>             directory to store temporary files (default: --outfile)
                --vcfinput                  specify that input is in VCF format and output will be in VCF format
                --dot2underline             change dot in field name to underline (eg, Func.refGene to Func_refGene)
                --thread <int>              specify the number of threads to be used in annotation
                --maxgenethread <int>       specify the maximum number of threads allowed in gene annotation (default: 6)
                --polish                    polish the protein notation for indels (such as p.G12Vfs*2)
                --xreffile <file>           specify a cross-reference file for gene-based annotation


     Function: automatically run a pipeline on a list of variants and summarize
     their functional effects in a tab-delimited (or, with --csvout,
     comma-delimited) file, or in an annotated VCF file if the original input
     is a VCF file

     Example: table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,dbnsfp30a -operation g,r,f -nastring . -csvout -polish -xreffile example/gene_fullxref.txt
              table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,dbnsfp30a -operation g,r,f -nastring . -vcfinput

     Version: $Date: 2018-04-16 00:47:49 -0400 (Mon, 16 Apr 2018) $

OPTIONS
    --help  print a brief usage message and detailed explanation of options.

    --man   print the complete manual of the program.

    --verbose
            use verbose output.

    --protocol
            comma-delimited string specifying annotation protocol. These
            strings typically represent database names in ANNOVAR.

    --operation
            comma-delimited string specifying type of operation. These strings
            can be g (gene), r (region) or f (filter).

    --outfile
            the prefix of output file names

    --buildver
            specify the genome build version

    --remove
            remove all temporary files. By default, all temporary files will
            be kept for user inspection, but this will easily clutter the
            directory.

    --(no)checkfile
            the program will check if all required database files exist before
            execution of annotation

    --genericdbfile
            specify the genericdb files used in -dbtype generic. Note that
            multiple comma-delimited file names can be supplied.

    --gff3dbfile
            specify the GFF3 db files used in -dbtype gff3. Note that
            multiple comma-delimited file names can be supplied.

    --bedfile
            specify the BED files used in -dbtype bed. Note that
            multiple comma-delimited file names can be supplied.

    --vcfdbfile
            specify the VCF db files used in -dbtype vcf. Note that
            multiple comma-delimited file names can be supplied.

    --otherinfo
            print out otherinfo in the output file. "otherinfo" refers to all
            the information after the fifth column in the input queryfile.

    --onetranscript
            print out only one random transcript for exonic variants. By
            default, all transcripts are printed in the output.

    --nastring
            string to display when a score is not available. By default, an
            empty string is printed in the output file.

    --csvout
            generate comma-delimited CSV file. By default, tab-delimited text
            file is generated.

    --argument
            a comma-separated list of arguments, to be supplied to each of the
            protocols. This list facilitates a customized annotation procedure
            for each protocol.

    --convertarg
            a string as argument to be supplied to the convert2annovar.pl
            program

    --codingarg
            a string as argument to be supplied to the coding_change.pl
            program

    --tempdir
            specify the directory location for storing temporary files used by
            table_annovar. This argument is especially useful in a cluster
            computing environment, so that temporary files are written to
            local disk of compute nodes, yet results files are written to
            possibly remote hosts.

    --vcfinput
            specify that input is in VCF format and output will be in VCF
            format. If you want to generate a tab-delimited or
            comma-delimited output file, you must use convert2annovar to
            generate an ANNOVAR input file first.

    --dot2underline
            change dot in field name to underline (eg, Func.refGene to
            Func_refGene), which is useful for post-processing of the results
            in some software tools that cannot handle dot in field names.

    --thread
            specify the number of threads to be used in annotation (when
            multi-threading support is enabled in the system)

    --maxgenethread
            specify the maximum number of threads allowed in gene annotation
            (default: 6)

    --polish
            polish the protein notation for indels (such as p.G12Vfs*2) by
            re-calculating the protein sequence after a mutation is introduced
            in coding_change.pl

    --xreffile
            specify a cross-reference file for gene-based annotation, so that
            the final output includes extra columns for genes

DESCRIPTION
    ANNOVAR is a software tool that can be used to functionally annotate a
    list of genetic variants, possibly generated from next-generation
    sequencing experiments. For example, given a whole-genome resequencing
    data set for a human with specific diseases, typically around 3 million
    SNPs and around half a million insertions/deletions will be identified.
    Given this massive amount of data (and candidate disease-causing
    variants), it is necessary to have a fast algorithm that scans the data
    and identifies a prioritized subset of variants that are most likely
    functional for follow-up Sanger sequencing studies and functional assays.

    The table_annovar.pl program is designed to replace summarize_annovar.pl
    in earlier versions of ANNOVAR. Basically, it takes an input file, runs
    a series of annotations on it, and generates a tab-delimited
    output file, where each column represents a specific type of annotation.
    Therefore, the new table_annovar.pl allows better customization for users
    who want to annotate specific columns.

    ANNOVAR is freely available to the community for non-commercial use. For
    questions or comments, please contact kai@openbioinformatics.org.
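
To show how the options above fit together, here is a minimal Python sketch that assembles a table_annovar.pl command line. The helper name `build_table_annovar_cmd` is hypothetical and not part of ANNOVAR. The input file `example/ex1.avinput` and the `cytoBand` protocol are assumptions: the cytoBand database is not among the files bundled in humandb/ and would have to be downloaded first. The sketch only builds the argument list and does not run ANNOVAR. Its main point is that `-protocol`, `-operation`, and `-argument` are parallel comma-separated lists, so `-argument` needs one entry (possibly empty) per protocol.

```python
import shlex


def build_table_annovar_cmd(query, dbdir, out_prefix, protocols, operations,
                            arguments=None, buildver="hg19", nastring=".",
                            csvout=False, polish=False):
    """Assemble an argv list for table_annovar.pl (hypothetical helper)."""
    # Each protocol (database) must be paired with exactly one operation
    # (g = gene-based, r = region-based, f = filter-based).
    if len(protocols) != len(operations):
        raise ValueError("each protocol needs exactly one operation")
    cmd = [
        "table_annovar.pl", query, dbdir,
        "-buildver", buildver,
        "-out", out_prefix,
        "-remove",                      # delete temporary files when done
        "-protocol", ",".join(protocols),
        "-operation", ",".join(operations),
        "-nastring", nastring,          # placeholder for missing scores
    ]
    if arguments is not None:
        # One entry per protocol; use "" for protocols without extra arguments.
        if len(arguments) != len(protocols):
            raise ValueError("-argument needs one entry per protocol")
        cmd += ["-argument", ",".join(arguments)]
    if csvout:
        cmd.append("-csvout")           # comma-delimited instead of tab-delimited
    if polish:
        cmd.append("-polish")           # recalculate protein notation for indels
    return cmd


cmd = build_table_annovar_cmd(
    "example/ex1.avinput", "humandb/", "myanno",   # assumed paths
    protocols=["refGene", "cytoBand"],
    operations=["g", "r"],
    arguments=["-splicing_threshold 5", ""],       # only refGene gets an extra argument
    csvout=True, polish=True,
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

The printed command can be reviewed before running it by hand. Keeping the three lists the same length in code avoids the common mistake of shifting an argument onto the wrong protocol.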