Manual
What is TopHat?
TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat runs on Linux and OS X.
What types of reads can I use TopHat with?
TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. In TopHat 1.1.0, we began supporting Applied Biosystems' Colorspace format. The software is optimized for reads 75bp or longer.
How does TopHat find junctions?
TopHat can find splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping information, TopHat builds a database of possible splice junctions and then maps the reads against these junctions to confirm them.
Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, so they would be missed in the initial mapping. TopHat solves this problem mainly by splitting all input reads into smaller segments, which are then mapped independently. The segment alignments are put back together in a final step of the program to produce the end-to-end read alignments.
TopHat generates its database of possible splice junctions from two sources of evidence. The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence, or when an internal segment fails to map, again suggesting that such reads span multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio.
The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We suggest using this second option (--coverage-search) only for short reads (< 45bp) and with a small number of reads (<= 10 million). This latter option will only report alignments across "GT-AG" introns.
Prerequisites
To use TopHat, you will need the following programs in your PATH:
- bowtie2 and bowtie2-align (or bowtie)
- bowtie2-inspect (or bowtie-inspect)
- bowtie2-build (or bowtie-build)
- samtools
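One way to confirm that these programs are visible is to look each one up with the shell's command -v builtin. The sketch below is illustrative only: missing_tools is a made-up helper name, not part of TopHat, and the list assumes the Bowtie 2 names (substitute bowtie, bowtie-inspect and bowtie-build if you plan to use Bowtie 1).

```shell
# Print each named program that cannot be found in PATH, one per line.
# "missing_tools" is a hypothetical helper name, not part of TopHat.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Bowtie 2 variant of the prerequisite list shown above.
missing=$(missing_tools bowtie2 bowtie2-align bowtie2-inspect bowtie2-build samtools)
if [ -n "$missing" ]; then
    printf 'Not found in PATH:\n%s\n' "$missing"
else
    printf 'All prerequisites found in PATH.\n'
fi
```

If anything is reported as missing, add its install directory to PATH before running TopHat.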
Because TopHat outputs and handles alignments in BAM format, you will need to download and install the SAM tools. You may want to take a look at the Getting started guide for more detailed installation instructions, including installation of SAM tools and Boost. You will also need Python version 2.6 or higher.
Obtaining and installing TopHat
The latest source release and precompiled binaries for Linux and Mac OS X are available for download. See the Getting started guide for detailed instructions about installing TopHat from the binary package or building TopHat and its dependencies from source.
To install TopHat from the source package, unpack the tarball and change directory to the package directory as follows:
tar zxvf tophat-2.0.0.tar.gz
cd tophat-2.0.0/
Configure the package, specifying the install path and the library dependencies as needed (see the Getting started guide for details):
./configure --prefix=<install_prefix> --with-boost=<boost_install_prefix> --with-bam=<samtools_install_prefix>
Finally, build and install TopHat:
make
make install
As detailed in the Getting started guide, if you want to install TopHat 2 without overwriting a previous version of TopHat already installed on your system, specify a new, separate <install_prefix> for the ./configure command above. After the 'make install' step, copy the tophat2 script from <install_prefix>/bin to a directory that is in your shell's PATH, so that you can invoke the new version of TopHat with the command 'tophat2'.
Below you will find a detailed list of command-line options you can use to control TopHat. Beginning users should take a look at the Getting started guide for a tutorial on installing and running TopHat and its prerequisites.
Using TopHat
Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
When running TopHat with paired reads, it is critical that the *_1 files and the *_2 files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same. TopHat also accepts additional unpaired reads after the paired reads. These unpaired reads can either be given at the end of the paired read files on one side (as reads that can no longer be paired with reads from the other side), or they can be given in separate file(s) that are appended (comma-delimited) to the list of paired input files on either side, e.g.:
tophat [options]* <genome_index_base> PE_reads_1.fq.gz,SE_reads.fa PE_reads_2.fq.gz
or
tophat [options]* <genome_index_base> PE_reads_1.fq.gz PE_reads_2.fq.gz,SE_reads.fa
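Because pairing depends purely on the order of the two lists, it can help to generate both lists from a single set of sample names. The following is a minimal sketch under stated assumptions: paired_lists is a hypothetical helper (not a TopHat command), and the sample names and the <sample>_1.fq.gz / <sample>_2.fq.gz naming scheme are invented for illustration.

```shell
# Build the two comma-delimited read lists from one list of sample names,
# so the *_1 and *_2 files always appear in the same order.
# The sample names and the _1/_2 file naming scheme are assumptions.
paired_lists() {
    left=""
    right=""
    for sample in "$@"; do
        # ${var:+...} adds the comma only when the list is already non-empty.
        left="${left:+$left,}${sample}_1.fq.gz"
        right="${right:+$right,}${sample}_2.fq.gz"
    done
    # Two space-separated arguments, ready to follow <genome_index_base>.
    printf '%s %s\n' "$left" "$right"
}

paired_lists sampleA sampleB
# prints: sampleA_1.fq.gz,sampleB_1.fq.gz sampleA_2.fq.gz,sampleB_2.fq.gz
```

The output can be placed after <genome_index_base> on the tophat command line. This sketch does not handle file names that contain spaces or commas.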
Starting with version 2.0.10, TopHat accepts mixed input file formats (FASTA/FASTQ).
NOTE: TopHat can align reads that are up to 1024 bp long, and it handles paired-end reads and unpaired reads at once, but we do not recommend mixing different types of reads in the same TopHat run. For example, mixing 100bp single end reads and 2x27bp paired reads in the same TopHat run may give sub-optimal results. If you'd like to combine results from data sets with different types of RNA-Seq reads, you can follow a protocol like this:
- run TopHat on the first set of reads, with the appropriate parameters for this data set
- use bed_to_juncs to convert the junctions.bed file obtained in this first run to a junction file usable by TopHat's -j option
- run TopHat on the second set of reads using the -j option to supply the junctions file produced by bed_to_juncs in the previous step
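The three steps above might look like the following on the command line. This is a sketch, not a tested recipe: the helper protocol_commands only prints the commands so they can be reviewed first, the index name, read file names and run1/run2 directory names are placeholders, junctions.bed is assumed to be written into the -o output directory, and bed_to_juncs is assumed to read BED on standard input and write the junction list to standard output.

```shell
# Print (do not run) the commands for the two-data-set protocol.
# All file and directory names are placeholders; "protocol_commands"
# is a hypothetical helper, not part of TopHat.
protocol_commands() {
    index=$1     # <genome_index_base>
    reads_a=$2   # read file(s) for the first run
    reads_b=$3   # read file(s) for the second run
    # Step 1: first data set, with parameters suited to it.
    printf 'tophat -o run1 %s %s\n' "$index" "$reads_a"
    # Step 2: convert the junctions found in run 1 (assumed stdin/stdout usage).
    printf 'bed_to_juncs < run1/junctions.bed > run1.juncs\n'
    # Step 3: second data set, supplying the run 1 junctions via -j.
    printf 'tophat -o run2 -j run1.juncs %s %s\n' "$index" "$reads_b"
}

protocol_commands genome_index long_reads.fq "short_reads_1.fq short_reads_2.fq"
```

Printing the commands first makes it easy to add the per-data-set options each run needs before executing anything.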
The following is a detailed description of the options used to control the TopHat script.
Arguments:
| <genome_index_base> | The basename of the genome index to be searched. The basename is the name of any of the index files up to but not including the first period. Bowtie first looks in the current directory for the index files, then looks in the indexes subdirectory under the directory where the currently-running bowtie executable is located, then looks in the directory specified in the BOWTIE_INDEXES (or BOWTIE2_INDEXES) environment variable. Please note that it is highly recommended that a FASTA file with the sequence(s) of the genome being indexed be present in the same directory as the Bowtie index files, with the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files. |
| <reads1_1[,...,readsN_1]> | A comma-separated list of files containing reads in FASTQ or FASTA format. When running TopHat with paired-end reads, this should be the *_1 ("left") set of files. |
| [reads1_2,...readsN_2] | A comma-separated list of files containing reads in FASTQ or FASTA format. Only used when running TopHat with paired-end reads, and contains the *_2 ("right") set of files. The *_2 files MUST appear in the same order as the *_1 files. |
Options:
| -h/--help | Prints the help message and exits. |
| -v/--version | Prints the TopHat version number and exits. |
| -N/--read-mismatches | Final read alignments having more than this many mismatches are discarded. The default is 2. |
| --read-gap-length | Final read alignments having a total gap length greater than this value are discarded. The default is 2. |
| --read-edit-dist | Final read alignments having an edit distance greater than this value are discarded. The default is 2. |
| --read-realign-edit-dist | Some of the reads spanning multiple exons may be mapped incorrectly as a contiguous alignment to the genome even though the correct alignment should be a spliced one. This can happen in the presence of processed pseudogenes that are rarely (if at all) transcribed or expressed. This option can direct TopHat to re-align reads for which the edit distance of an alignment obtained in a previous mapping step is above or equal to this option value. If you set this option to 0, TopHat will map every read in all the mapping steps (transcriptome if you provided gene annotations, genome, and finally splice variants detected by TopHat), reporting the best possible alignment found in any of these mapping steps. This may greatly increase the mapping accuracy at the expense of an increase in running time. The default value for this option is set such that TopHat will not try to realign reads already mapped in earlier steps. |
| --bowtie1 | Uses Bowtie1 instead of Bowtie2. If you use colorspace reads, you need to use this option, as Bowtie2 does not support colorspace reads. |
| -o/--output-dir <string> | Sets the name of the directory in which TopHat will write all of its output. The default is "./tophat_out". |
| -r/--mate-inner-dist <int> | This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp. |
| --mate-std-dev <int> | The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. |
| -a/--min-anchor-length <int> | The "anchor length". TopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This must be at least 3 and the default is 8. |
| -m/--splice-mismatches <int> | The maximum number of mismatches that may appear in the "anchor" region of a spliced alignment. The default is 0. |
| -i/--min-intron-length <int> | The minimum intron length. TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default is 70. |
| -I/--max-intron-length <int> | The maximum intron length. When searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. The default is 500000. |
| --max-insertion-length <int> | The maximum insertion length. The default is 3. |
| --max-deletion-length <int> | The maximum deletion length. The default is 3. |
| --solexa-quals | Use the Solexa scale for quality values in FASTQ files. |
| --solexa1.3-quals | As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later. |
| -Q/--quals | Separate quality value files. Colorspace read files (CSFASTA) come with separate qual files. |
| --integer-quals | Quality values are space-delimited integer values; this becomes the default when you specify -C/--color. |
| -C/--color | Colorspace reads. Note that this uses a colorspace bowtie index and requires Bowtie 0.12.6 or higher. Common usage: tophat --color --quals [other options]* <colorspace_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2] <quals1_1[,...,qualsN_1]> [quals1_2,...qualsN_2] |
| -p/--num-threads <int> | Use this many threads to align reads. The default is 1. |
| -g/--max-multihits <int> | Instructs TopHat to allow up to this many alignments to the reference for a given read, and to choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. When --report-secondary-alignments is used, TopHat will try to report alignments up to this option value, and may randomly output some of the alignments with the same score to meet this number. |
| --report-secondary-alignments | By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way; this limit can be changed by using the -g/--max-multihits option above). |
| --no-discordant | For paired reads, report only concordant mappings. |
| --no-mixed | For paired reads, only report read alignments if both reads in a pair can be mapped (by default, if TopHat cannot find a concordant or discordant alignment for both reads in a pair, it will find and report alignments for each read separately; this option disables that behavior). |
| --no-coverage-search | Disables the coverage based search for junctions. |
| --coverage-search | Enables the coverage based search for junctions. Use when coverage search is disabled by default (such as for reads 75bp or longer), for maximum sensitivity. |
| --microexon-search | With this option, the pipeline will attempt to find alignments incident to micro-exons. Works only for reads 50bp or longer. |
| --library-type | The default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified, every read alignment will have an XS attribute tag as explained below. Consider supplying library type options below to select the correct RNA-seq protocol. |
| Library Type | Examples | Description |
| fr-unstranded | Standard Illumina | Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. |
| fr-firststrand | dUTP, NSR, NNSR | Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced. |
| fr-secondstrand | Ligation, Standard SOLiD | Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced. |
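Returning to the -r/--mate-inner-dist option in the options table: the example given there (300bp fragments, 50bp reads, -r 200) follows from subtracting both read lengths from the fragment length. A small sketch of that arithmetic is shown here; mate_inner_dist is a hypothetical helper name, not a TopHat command.

```shell
# Expected inner distance between mates:
#   fragment length - 2 * read length
# "mate_inner_dist" is a hypothetical helper, not part of TopHat.
mate_inner_dist() {
    echo $(( $1 - 2 * $2 ))
}

mate_inner_dist 300 50   # prints 200, matching the -r example in the table
```

The result can be passed directly to TopHat, for example with -r "$(mate_inner_dist 300 50)".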