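With the rise of third-generation sequencing, the demand for long-read alignment keeps growing. Until now I have been using GMAP, but it has a drawback: a sequence that carries a polyA stretch may fail to align to the reference, yet it aligns fine once the polyA is removed. In other words, when the beginning of a sequence cannot be aligned, GMAP may decide that the whole sequence is unalignable. GMAP also sometimes fails to report every possible hit. When you use GMAP, it is worth randomly sampling some of the sequences that failed to map and checking whether they really cannot be mapped.

Today I came across a new long-read aligner, alfalfa (named after the plant). It claims to handle short reads as well. In the paper's benchmarks it unsurprisingly comes out ahead of BWA-MEM, BWA-SW, Bowtie 2, and CUSHAW3. Below is a brief introduction to installing and using it.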
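####Installation
```
$ git clone https://github.com/readmapping/alfalfa.git
$ cd alfalfa
$ make
```
That is all there is to it. After the build, an `alfalfa` executable appears in the current directory; you can copy it to a directory on your `PATH`.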
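The copy step can be sketched roughly as follows. The destination `$HOME/bin` and the `BIN_DIR` variable are assumptions made for illustration, not something the alfalfa build prescribes; any directory already on your `PATH` works just as well.

```shell
# Assumption: personal executables live in $HOME/bin (override by setting BIN_DIR).
BIN_DIR="${BIN_DIR:-$HOME/bin}"
mkdir -p "$BIN_DIR"

# Copy the freshly built binary if it is present in the current directory.
if [ -x ./alfalfa ]; then
    cp ./alfalfa "$BIN_DIR/"
fi

# Add the directory to PATH for this shell session unless it is already there.
case ":$PATH:" in
    *":$BIN_DIR:"*) ;;
    *) PATH="$BIN_DIR:$PATH"; export PATH ;;
esac
```

To make the change permanent, the `PATH` line would typically go into your shell start-up file (for example `~/.bashrc`).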
####Usage
@PG ID:alfalfa VN:0.8.1
Usage: alfalfa <command> [<subcommand>] [option...]
Command should be index, align or evaluate
Subcommand is only required for the evaluate command
commands:
index is used to construct the data structures for indexing a given reference
genome.
align is used for mapping and aligning a read set onto a reference genome.
evaluate is used for evaluating the accuracy of simulated reads and summarizing
statistics from the SAM-formatted alignments reported by a read mapper.
call alfalfa -h/--help for more detailed information on the specific
commands
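As a rough illustration of the first step in that workflow, the sketch below assembles an `alfalfa index` call using only the `-r`, `-s` and `-p` options described in the index help text. The file name `reference.fa`, the prefix `reference_idx` and the genome size are hypothetical placeholders. The last lines estimate the size of the sparse child array from the "(4/s) bytes per base" figure quoted in that help text.

```shell
# Hypothetical inputs; replace with your own reference and prefix.
REF="reference.fa"          # multi-FASTA reference genome
PREFIX="reference_idx"      # prefix later passed to the -i option of align
SPARSENESS=12               # default value according to the help text

# Build the command line from the documented index options.
INDEX_CMD="alfalfa index -r $REF -s $SPARSENESS -p $PREFIX"
echo "$INDEX_CMD"

# Only run it when alfalfa is installed and the reference file exists.
if command -v alfalfa >/dev/null 2>&1 && [ -f "$REF" ]; then
    $INDEX_CMD
fi

# Back-of-the-envelope child-array size: (4/s) bytes per reference base.
GENOME_BASES=3000000000     # assumed genome length, for illustration only
CHILD_BYTES=$(( 4 * GENOME_BASES / SPARSENESS ))
echo "estimated child array size: $CHILD_BYTES bytes"
```

A larger sparseness value shrinks this estimate proportionally, which is the memory side of the speed-memory trade-off that `-s` controls.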
#####Usage: alfalfa index [option...]
index is used to construct the data structures for indexing a given reference
genome.
options
-r/--reference (file).
Specifies the location of a file that contains the reference
genome in multi-fasta format.
-s/--sparseness (int, 12).
Specifies the sparseness of the index structure as a way to
control part of the speed-memory trade-off.
-p/--prefix (string, filename passed to the -r option).
Specifies the prefix that will be used to name all generated
index files. The same prefix has to be passed to the -i option
of the align command to load the index structure when mapping
reads.
--no-child.
By default, a sparse child array is constructed and stored in an
index file with extension .child. The construction of this
sparse child array is skipped when the --no-child option is set.
This data structure speeds up seed-finding at the cost of (4/s)
bytes per base in the reference genome. As the data structure
provides a major speed-up, it is advised to have it constructed.
--suflink.
Suffix link support is disabled by default. Suffix link support
is enabled when the --suflink option is set, resulting in an
index file with extension .isa to be generated. This data
structure speeds up seed-finding at the cost of (4/s) bytes per
base. It is only useful when sparseness is less than four and
minimum seed length is very low (less than 10), because it
conflicts with skipping suffixes in matching the read. In
practice, this is rarely the case.
--no-kmer.
By default, a 10-mer lookup table is constructed that contains
the suffix array interval positions to depth 10 in the virtual
suffix tree. It is stored in