BLAT: The BLAST-Like Alignment Tool

zhu_si_tao

blat database query [-ooc=11.ooc] output.psl

where:
database and query are each either a .fa, .nib or .2bit file,
or a list of these files, one file name per line.
-ooc=11.ooc tells the program to load over-occurring 11-mers from
an external file. This will increase the speed
by a factor of 40 in many cases, but is not required.
output.psl is where to put the output.
Subranges of .nib and .2bit files may be specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the second form, a sequence id of file:start-end will be used.
options:
-t=type Database (target sequence) type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
The default is dna
-prot Synonymous with -t=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to
the tileSize
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12
Default is 11 for DNA and 5 for protein.
-stepSize=N spacing between tiles. Default is tileSize.
-oneOff=N If set to 1, this allows one mismatch in a tile and still
triggers an alignment. Default is 0.
-minMatch=N sets the number of tile matches. Usually set from 2 to 4
Default is 2 for nucleotide, 1 for protein.
-minScore=N sets minimum score. This is the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-maxGap=N sets the size of maximum gap between tiles in a clump. Usually
set from 0 to 3. Default is 2. Only relevant for minMatch > 1.
-noHead suppress .psl header (so it's just a tab-separated file)
-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
-repMatch=N sets the number of repetitions of a tile allowed before
it is marked as overused. Typically this is 256 for tileSize
12, 1024 for tile size 11, 4096 for tile size 10.
Default is 1024. Typically only comes into play with makeOoc.
Also affected by stepSize. When stepSize is halved repMatch is
doubled to compensate.
-mask=type Mask out repeats. Alignments won't be started in masked region
but may extend through it in nucleotide searches. Masked areas
are ignored entirely in protein or translated searches. Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above but for query rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not be
masked in any way, but matches in repeat areas will be reported
separately from matches in other areas in the psl output.
-minRepDivergence=NN - minimum percent divergence of repeats to allow
them to be unmasked. Default is 15. Only relevant for
masking using RepeatMasker .out files.
-dots=N Output dot every N sequences to show program's progress
-trimT Trim leading poly-T
-noTrimA Don't trim trailing poly-A
-trimHardA Remove poly-A tail from qSize as well as alignments in
psl output
-fastMap Run for fast DNA/DNA remapping - not allowing introns,
requiring high %ID
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high quality mRNAs look harder for small initial and
terminal exons. Not recommended for ESTs
-maxIntron=N Sets maximum intron size. Default is 750000
-extendThroughN - Allows extension of alignment through large blocks of N's
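
The options above combine into a single command line. As a minimal sketch, the Python helper below assembles such a command as an argument list without running it. The function name `build_blat_command`, its parameters, and the file names are hypothetical and chosen only for illustration; the flags themselves are taken from the usage text above.

```python
import shlex


def build_blat_command(database, query, output, *, t_type="dna", q_type="dna",
                       ooc=None, min_identity=None, out_format=None,
                       no_head=False):
    """Assemble an argument list for a blat run (nothing is executed here)."""
    cmd = ["blat", database, query, f"-t={t_type}", f"-q={q_type}"]
    if ooc is not None:
        cmd.append(f"-ooc={ooc}")          # over-occurring tile file
    if min_identity is not None:
        cmd.append(f"-minIdentity={min_identity}")
    if out_format is not None:
        cmd.append(f"-out={out_format}")   # e.g. psl, pslx, blast8
    if no_head:
        cmd.append("-noHead")              # suppress the .psl header
    cmd.append(output)
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_blat_command("genome.2bit", "mrna.fa", "out.psl",
                         ooc="11.ooc", min_identity=95, no_head=True)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where blat is installed.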

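BLAT, short for the BLAST-Like Alignment Tool, was developed by W. James Kent and published in 2002. As the Human Genome Project progressed, there was an urgent need to map large numbers of genes and ESTs quickly onto large genomes. BLAST has several drawbacks for this kind of alignment: it is relatively slow, its results are hard to post-process, and it cannot represent the placement of a gene that contains introns. BLAT was created to meet this need.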

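BLAT's main strengths are its speed and its simple, readable output of collinear alignments. For aligning relatively short sequences (such as cDNAs) against a large genome, BLAT is a natural first choice. It joins related, collinear alignments into a larger alignment, from which exons and introns are easy to identify. For this reason BLAT is widely used in gene homology analysis between closely related species and in EST analysis.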

BLAST reports every local alignment as a separate hit, whereas BLAT joins alignments that are collinear with one another and reports them as a single hit.
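
To make the idea of collinearity concrete, here is a toy sketch that groups hits into chains when each hit starts at or after the end of the previous one on both the query and the target. This is not BLAT's actual chaining algorithm; the function name `chain_collinear`, the greedy strategy, and the coordinates are all assumptions made for illustration.

```python
def chain_collinear(hits):
    """Greedily group (q_start, q_end, t_start, t_end) hits into chains.

    A hit extends the current chain only if it lies downstream of the
    chain's last hit on both query and target (same order on both).
    """
    chains = []
    for hit in sorted(hits):
        q_start, _, t_start, _ = hit
        if chains:
            _, last_q_end, _, last_t_end = chains[-1][-1]
            if q_start >= last_q_end and t_start >= last_t_end:
                chains[-1].append(hit)
                continue
        chains.append([hit])
    return chains


# Invented coordinates: three exon-like hits in order, plus one stray hit.
hits = [
    (0, 100, 5000, 5100),
    (100, 250, 9000, 9150),
    (250, 300, 20000, 20050),
    (0, 80, 700, 780),
]
chains = chain_collinear(hits)
for chain in chains:
    print(len(chain), "hit(s):", chain)
```

In BLAST-style output these would be four separate hits; here the three ordered hits end up in one chain and the stray hit stays on its own.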

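BLAT takes sequence input in FASTA format (the database may also be a .nib or .2bit file, as listed in the usage above). It is very simple to run: no separate database-building step is needed before aligning. The basic command is:

blat database query [-options] output.psl

When the program runs normally, it prints summary statistics for the database to the screen after it has read all the subject sequences in the database:

Loaded 1493629 letters in 486 sequences ### 1,493,629 letters were read from 486 sequences

Searched 1493629 bases in 486 sequences ### in this run the file was aligned against itself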

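The default output is a tabular text file in psl format.

A psl file contains detailed alignment coordinates, and the meaning of each column is listed in the header at the top of the file. Columns 1 to 8 are overall alignment statistics, including the number of exactly matching bases, the number of mismatches, and the number and total length of gaps in the query and in the subject. Columns 9 to 17 give the alignment location: the strand, and the name, length, and alignment start and end positions for the query and the subject. Columns 18 to 21 describe the exactly aligned blocks: the number of blocks, the length of each block, and each block's start position in the query and in the subject.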
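
As a minimal sketch of how these 21 columns can be read, the function below turns one psl data line into a dictionary. The column names follow the standard psl header; the function name `parse_psl_line` and the example record are invented for illustration, and header lines are assumed to have been skipped already (or suppressed with -noHead).

```python
PSL_COLUMNS = [
    "matches", "misMatches", "repMatches", "nCount",        # columns 1-4
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",  # columns 5-8
    "strand", "qName", "qSize", "qStart", "qEnd",           # columns 9-13
    "tName", "tSize", "tStart", "tEnd",                     # columns 14-17
    "blockCount", "blockSizes", "qStarts", "tStarts",       # columns 18-21
]
TEXT_COLUMNS = {"strand", "qName", "tName"}
LIST_COLUMNS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Parse one tab-separated psl data line into a dict of typed fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(PSL_COLUMNS):
        raise ValueError(f"expected at least 21 columns, got {len(values)}")
    record = {}
    for name, value in zip(PSL_COLUMNS, values):
        if name in TEXT_COLUMNS:
            record[name] = value
        elif name in LIST_COLUMNS:
            # Comma-separated lists end with a trailing comma in psl files.
            record[name] = [int(x) for x in value.split(",") if x]
        else:
            record[name] = int(value)
    return record


# Invented record: a 300-base cDNA aligned to a target in three blocks.
example = "\t".join([
    "290", "10", "0", "0", "0", "0", "2", "14900",
    "+", "cdna1", "300", "0", "300",
    "chr1", "1000000", "5000", "20200",
    "3", "100,150,50,", "0,100,250,", "5000,9000,20150,",
])
rec = parse_psl_line(example)
print(rec["qName"], rec["tName"], rec["blockCount"], rec["blockSizes"])
```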

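Keep the following points in mind when reading psl output:

1. BLAT allows very large gaps (intron regions) on the subject, so the regions covered by a single hit on the query and on the subject can differ greatly in length. This differs from BLAST.

2. When aligning a gene against a genome, the number of blocks is not the same as the number of exons. BLAT defines a block as an alignment without insertions or deletions, so any inserted or deleted base ends the current block, and a single exon may well be made up of many blocks. The number of exons and introns therefore has to be inferred from gaps that are sufficiently large.

3. Base positions in psl output are counted from 0, not from 1.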
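
Points 2 and 3 can be sketched in a few lines of Python: blocks separated by only a small target gap are merged into one exon, and the resulting 0-based, half-open intervals are converted to 1-based, inclusive coordinates. The function names and the `min_intron=20` threshold are arbitrary assumptions for illustration, not values used by BLAT, and the sketch assumes an alignment on the plus strand.

```python
def blocks_to_exons(block_sizes, t_starts, min_intron=20):
    """Merge adjacent blocks into exons on the target.

    Blocks whose target gap is smaller than min_intron are treated as
    parts of the same exon (the gap is taken to be a small indel).
    Returns 0-based, half-open (start, end) intervals.
    """
    exons = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size
        if exons and start - exons[-1][1] < min_intron:
            exons[-1] = (exons[-1][0], end)   # extend the current exon
        else:
            exons.append((start, end))        # large gap: start a new exon
    return exons


def to_one_based(start, end):
    """Convert a 0-based, half-open interval to 1-based, inclusive."""
    return start + 1, end


# Invented blocks: the first two are split only by a 2-base deletion.
block_sizes = [60, 40, 150, 50]
t_starts = [5000, 5062, 9000, 20150]

exons = blocks_to_exons(block_sizes, t_starts)
one_based = [to_one_based(s, e) for s, e in exons]
print(len(block_sizes), "blocks ->", len(exons), "exons:", one_based)
```

Four blocks collapse into three exons here, which is why the block count alone would overestimate the exon count.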

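When running different kinds of alignments, note that "-t" and "-q" must be set to the same kind of type. For example, if the database and the query are both protein sequences and both are set to "prot", the alignment runs normally. If the database is DNA and the query is protein, then "-t=dnax" must be given together with "-q=prot". The examples below use the DNA and protein sequences of the same gene.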

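Command 1:

blat cdna.seq pro.seq -q=prot out.psl

The program exits with an error:

d and q must both be either protein or dna

Command 2:

blat cdna.seq pro.seq -t=dnax -q=prot -noHead out.psl

This time the run succeeds.

Note how protein-level output differs from nucleotide output: the strand column shows two "+" characters, meaning that both the query and the subject are aligned in the forward direction.

Command 3, a protein-level alignment of nucleotide sequences:

blat cdna.seq cdna.seq -t=dnax -q=dnax -noHead out.psl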
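
The rule behind these three commands can be written as a small pre-flight check. The sketch below only mirrors the rule as stated here (both types protein-level, or both nucleotide-level); it is not BLAT's own code, blat may reject further combinations, and the function name `types_compatible` is hypothetical. The allowed type names come from the -t and -q entries in the usage text.

```python
T_TYPES = {"dna", "prot", "dnax"}                 # values accepted by -t
Q_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}  # values accepted by -q
PROTEIN_LEVEL = {"prot", "dnax", "rnax"}          # compared as protein


def types_compatible(t_type="dna", q_type="dna"):
    """Return True if -t and -q are both protein-level or both nucleotide."""
    if t_type not in T_TYPES:
        raise ValueError(f"unknown -t type: {t_type}")
    if q_type not in Q_TYPES:
        raise ValueError(f"unknown -q type: {q_type}")
    return (t_type in PROTEIN_LEVEL) == (q_type in PROTEIN_LEVEL)


# The three commands from the text: -t defaults to dna in command 1.
print(types_compatible("dna", "prot"))    # command 1
print(types_compatible("dnax", "prot"))   # command 2
print(types_compatible("dnax", "dnax"))   # command 3
```

Command 1 fails the check because -t is left at its default of dna while -q is prot; commands 2 and 3 pass.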

http://blog.sina.com.cn/s/blog_959d22480101k348.html