where:
   database and query are each either a .fa, .nib or .2bit file,
      or a list of these files, one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
      an external file. This will increase the speed
      by a factor of 40 in many cases, but is not required.
   output.psl is where to put the output.
   Subranges of .nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the last form (no seqid), a sequence id of file:start-end will be used.
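The subrange syntax above is easy to take apart in a script. The following is a minimal Python sketch, not part of BLAT: the function name `parse_subrange` is hypothetical, and it assumes POSIX-style paths that contain no colon.

```python
from typing import Optional, Tuple


def parse_subrange(spec: str) -> Tuple[str, Optional[str], int, int]:
    """Split a BLAT subrange spec into (path, seqid, start, end).

    Accepts '/path/file.2bit:seqid:start-end' or '/path/file.nib:start-end'.
    Hypothetical helper; assumes the path itself contains no ':'.
    """
    parts = spec.split(":")
    if len(parts) == 3:
        path, seqid, coords = parts
    elif len(parts) == 2:
        path, coords = parts
        seqid = None  # BLAT itself then uses an id of the form file:start-end
    else:
        raise ValueError(f"not a subrange spec: {spec!r}")

    # Unpacking fails with ValueError if coords is not exactly 'start-end'.
    start_text, end_text = coords.split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start is greater than end in {spec!r}")
    return path, seqid, start, end
```

For example, `parse_subrange("/path/file.2bit:chr1:100-200")` splits the spec into the file path, the sequence id `chr1`, and the two integer coordinates.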
options:
   -t=type        Database type (the database sequences). Type is one of:
                    dna - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna.
   -q=type        Query type (the query sequences). Type is one of:
                    dna - DNA sequence
                    rna - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
                  The default is dna.
   -prot          Synonymous with -t=prot -q=prot.
   -ooc=N.ooc     Use overused tile file N.ooc. N should correspond to
                  the tileSize.
   -tileSize=N    Sets the size of match that triggers an alignment.
                  Usually between 8 and 12.
                  Default is 11 for DNA and 5 for protein.
   -stepSize=N    Spacing between tiles. Default is tileSize.
   -oneOff=N      If set to 1, this allows one mismatch in a tile and still
                  triggers an alignment. Default is 0.
   -minMatch=N    Sets the number of tile matches. Usually set from 2 to 4.
                  Default is 2 for nucleotide, 1 for protein.
   -minScore=N    Sets minimum score. This is the matches minus the
                  mismatches minus some sort of gap penalty. Default is 30.
   -minIdentity=N Sets minimum sequence identity (in percent). Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
   -maxGap=N      Sets the size of maximum gap between tiles in a clump. Usually
                  set from 0 to 3. Default is 2. Only relevant for minMatch > 1.
   -noHead        Suppress .psl header (so it's just a tab-separated file).
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N    Sets the number of repetitions of a tile allowed before
                  it is marked as overused. Typically this is 256 for tileSize
                  12, 1024 for tileSize 11, 4096 for tileSize 10.
                  Default is 1024. Typically only comes into play with makeOoc.
                  Also affected by stepSize. When stepSize is halved repMatch is
                  doubled to compensate.
   -mask=type     Mask out repeats. Alignments won't be started in masked region
                  but may extend through it in nucleotide searches. Masked areas
                  are ignored entirely in protein or translated searches. Types are:
                    lower - mask out lower cased sequence
                    upper - mask out upper cased sequence
                    out - mask according to database.out RepeatMasker .out file
                    file.out - mask database according to RepeatMasker file.out
   -qMask=type    Mask out repeats in query sequence. Similar to -mask above but
                  for query rather than target sequence.
   -repeats=type  Type is same as mask types above. Repeat bases will not be
                  masked in any way, but matches in repeat areas will be reported
                  separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow
                  them to be unmasked. Default is 15. Only relevant for
                  masking using RepeatMasker .out files.
   -dots=N        Output dot every N sequences to show program's progress.
   -trimT         Trim leading poly-T.
   -noTrimA       Don't trim trailing poly-A.
   -trimHardA     Remove poly-A tail from qSize as well as alignments in
                  psl output.
   -fastMap       Run for fast DNA/DNA remapping - not allowing introns,
                  requiring high %ID.
   -out=type      Controls output file format. Type is one of:
                    psl - Default. Tab separated format, no sequence
                    pslx - Tab separated format with sequence
                    axt - blastz-associated axt format
                    maf - multiz-associated maf format
                    sim4 - similar to sim4 format
                    wublast - similar to wublast format
                    blast - similar to NCBI blast format
                    blast8 - NCBI blast tabular format
                    blast9 - NCBI blast tabular format with comments
   -fine          For high quality mRNAs look harder for small initial and
                  terminal exons. Not recommended for ESTs.
   -maxIntron=N   Sets maximum intron size. Default is 750000.
   -extendThroughN - Allows extension of alignment through large blocks of N's.
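The -repMatch description in the options above gives three typical values and a rule for stepSize. A small Python sketch of that arithmetic follows. It is an illustration only, not BLAT's own code: the function name `typical_rep_match` is hypothetical, only the three tile sizes quoted in the help text are covered, and scaling by tileSize/stepSize is one reading of "when stepSize is halved repMatch is doubled".

```python
from typing import Optional

# Typical -repMatch values quoted in the help text, keyed by tileSize.
TYPICAL_REP_MATCH = {12: 256, 11: 1024, 10: 4096}


def typical_rep_match(tile_size: int, step_size: Optional[int] = None) -> int:
    """Return a typical -repMatch for a tileSize/stepSize pair.

    Assumption: repMatch scales with tileSize / stepSize, so halving the
    step (relative to the default stepSize == tileSize) doubles repMatch.
    """
    if tile_size not in TYPICAL_REP_MATCH:
        raise ValueError("only tile sizes 10, 11 and 12 are listed in the help text")
    if step_size is None:
        step_size = tile_size  # the documented default for -stepSize
    if step_size <= 0:
        raise ValueError("stepSize must be positive")
    return TYPICAL_REP_MATCH[tile_size] * tile_size // step_size
```

With the default step this returns the table value unchanged; with `tile_size=12, step_size=6` the value doubles from 256 to 512.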
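BLAT, short for the BLAST-Like Alignment Tool, was developed by W. James Kent in 2002. As the Human Genome Project progressed, there was an urgent need to map large numbers of genes and ESTs quickly onto large genomes. BLAST had several shortcomings for this kind of alignment: it was relatively slow, its results were hard to process, and it could not represent gene mappings that contain introns. BLAT was created to address exactly these problems.
BLAT's main strengths are speed and simple, easy-to-read output of collinear alignments. For aligning relatively small sequences (such as cDNAs) against a large genome, BLAT is undoubtedly the first choice. BLAT joins related collinear alignments into larger alignments, from which exons and introns are easy to identify. As a result, BLAT is widely used in gene homology analysis between closely related species and in EST analysis.
As the figure in the original post illustrates, BLAST reports every alignment as a separate hit, whereas BLAT joins alignments that are collinear with each other and reports them as a single hit.
BLAT reads FASTA input (the usage text above also lists .nib and .2bit files). Running it is very simple: no database has to be built before aligning. The basic BLAT command: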
blat
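When the program runs normally, it prints statistics for the database to the screen once it has read all subject sequences in the database:
Loaded 1493629 letters in 486 sequences### 1493629 letters in 486 sequences
Searched 1493629 bases in 486 sequences### the file was aligned against itself
The default output is a tabular text file in psl format.
A psl result contains detailed alignment position information, and the meaning of each column is listed at the top of the file. Columns 1 to 8 are overall alignment statistics, including the number of exactly matching bases, the number of mismatches, and the number and total length of gaps on the query and on the subject. Columns 9 to 17 give the alignment position: the strand, then the name, length, and alignment start and end positions of the query and of the subject. Columns 18 to 21 describe the exactly aligned blocks: the number of blocks, the length of each block, and the position of each block on the query and on the subject.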
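The column layout described above can be read with a few lines of Python. This is a minimal sketch rather than an official parser: the function name `parse_psl_line` is hypothetical, the field names follow the usual psl header, and it assumes a plain 21-column psl data line (header lines must be skipped by the caller).

```python
PSL_COLUMNS = [
    # columns 1-8: overall alignment statistics
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    # columns 9-17: strand plus query and subject (target) positions
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    # columns 18-21: per-block information
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_COLUMNS = {"strand", "qName", "tName"}
LIST_COLUMNS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line: str) -> dict:
    """Turn one tab-separated psl data line into a dict of typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PSL_COLUMNS):
        raise ValueError(f"expected 21 columns, got {len(fields)}")
    record = {}
    for name, value in zip(PSL_COLUMNS, fields):
        if name in TEXT_COLUMNS:
            record[name] = value
        elif name in LIST_COLUMNS:
            # psl lists are comma-separated and end with a trailing comma
            record[name] = [int(item) for item in value.split(",") if item]
        else:
            record[name] = int(value)
    return record
```

After parsing, `record["blockSizes"]`, `record["qStarts"]` and `record["tStarts"]` are lists of integers with one entry per block, which makes the block-level notes that follow straightforward to check.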
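Keep the following points in mind when reading psl output:
1. BLAT allows very large gaps (intron regions) on the subject, so the regions that one hit covers on the query and on the subject can differ greatly in length. This is different from BLAST.
2. When aligning genes to a genome, the number of blocks is not the number of exons. BLAT defines a block as an alignment without insertions or deletions, so any inserted or deleted base ends a block, and a single exon may well consist of many blocks. The number of exons and introns therefore has to be judged from gaps that are large enough.
3. Base positions in psl output are counted from 0, not from 1.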
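Notes 2 and 3 above can be combined in a short Python sketch that merges blocks separated by small gaps into exons and reports them in 1-based coordinates. The function name `blocks_to_exons` and the `min_intron` threshold of 20 bases are hypothetical choices for illustration. The sketch works on subject (target) coordinates of a nucleotide alignment and relies on psl intervals being 0-based with an exclusive end.

```python
from typing import List, Tuple


def blocks_to_exons(block_sizes: List[int], t_starts: List[int],
                    min_intron: int = 20) -> List[Tuple[int, int]]:
    """Merge psl blocks into exons on the subject sequence.

    Blocks whose gap on the subject is shorter than min_intron are treated
    as one exon interrupted by a small insertion or deletion; larger gaps
    are treated as introns. Returns 1-based, inclusive (start, end) pairs.
    """
    exons: List[List[int]] = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size  # 0-based start, exclusive end
        if exons and start - exons[-1][1] < min_intron:
            exons[-1][1] = end          # small gap: extend the current exon
        else:
            exons.append([start, end])  # large gap (or first block): new exon
    # Convert 0-based half-open intervals to 1-based inclusive coordinates.
    return [(start + 1, end) for start, end in exons]
```

For three blocks of sizes 50, 30 and 100 starting at 1000, 1052 and 5000, the first two blocks are 2 bases apart and merge, so the result is two exons: (1001, 1082) and (5001, 5100).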
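One thing to watch when running different kinds of alignments: the types given with "-t" and "-q" must be compatible, so that database and query are compared at the same level. For example, if the database and the query are both protein sequences and both are declared as "prot", the alignment runs normally. If the database is DNA and the query is protein, then "-q=prot" has to be combined with "-t=dnax". The following examples use the DNA and protein sequences of the same gene.
Command 1: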
blat
The program exits with an error:
d
Command 2:
blat
ok, right
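Note how the output of a protein alignment differs from that of a nucleotide alignment: the strand column shows two "+" signs, meaning that both the query and the subject are aligned in the forward direction.
Command 3, a protein-level alignment of nucleotide sequences: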
blat
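The pairing rule for "-t" and "-q" discussed above can be checked before launching a job. The Python sketch below is a hedged illustration: the function name `types_compatible` is hypothetical. Only the prot/prot and dnax/prot pairs come from this text; the remaining pairs in the table are assumptions about which combinations BLAT accepts and should be verified against the installed version.

```python
# (database type, query type) pairs treated as valid.
# Only ("prot", "prot") and ("dnax", "prot") are stated in the text above;
# the other entries are assumptions and may not match every BLAT version.
SUPPORTED_TYPE_PAIRS = {
    ("dna", "dna"),
    ("dna", "rna"),
    ("prot", "prot"),
    ("dnax", "prot"),
    ("dnax", "dnax"),
    ("dnax", "rnax"),
}


def types_compatible(t_type: str = "dna", q_type: str = "dna") -> bool:
    """Return True if the -t / -q pair is in the table above.

    Both arguments default to "dna", mirroring the documented defaults.
    """
    return (t_type, q_type) in SUPPORTED_TYPE_PAIRS
```

Under these assumptions, a DNA database with a protein query is rejected as `("dna", "prot")` and accepted once the database type is changed to "dnax".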
http://blog.sina.com.cn/s/blog_959d22480101k348.html