How to Run the BLAST Programs Locally

keying0520

Article link: https://blog.csdn.net/keying0520/article/details/6476445

The BLAST executables can be downloaded from:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release

The databases can be downloaded from:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/

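There, nr.gz is the non-redundant protein database and nt.gz is the nucleotide database.

month.nt.gz contains the nucleotide sequences from the most recent month.

After decompressing month.nt.gz, format it as a BLAST database: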

formatdb -i month.nt -p F -o T

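-i (input file): the database file to be formatted

-p (type of file): the sequence type; T for protein, F for nucleotide; the default is T

-o (parse options): whether to parse sequence IDs and create indexes; T creates them, F does not; the default is F.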

blastall -p blastn -d month.nt -i test.txt -o out.txt

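-p (program name): the program to run

blastn compares a nucleotide query against a nucleotide database

blastp compares a protein query against a protein database

blastx searches a protein database with a translated nucleotide query

tblastn searches a translated nucleotide database with a protein query

tblastx searches a translated nucleotide database with a translated nucleotide query

-d (database name): the database to search

-i (input file): the file of query sequences

-o (output file): the file in which to save the results

The results are then written to out.txt.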
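The value passed to -p follows from the query type and the database type. A minimal sketch of that mapping, assuming a hypothetical helper named pick_blast_program (it is not part of BLAST) and the labels nucl, prot and translated, which are chosen here only for illustration:

```shell
# Hypothetical helper (not part of BLAST): map query/database types to the
# blastall -p program name, following the list above.
#   $1 = query type:    nucl | prot
#   $2 = database type: nucl | prot
#   $3 = "translated" to compare translated nucleotide sequences (optional)
pick_blast_program() {
  local query="$1" db="$2" mode="${3:-}"
  case "$query:$db:$mode" in
    nucl:nucl:)           echo blastn  ;;  # nucleotide vs nucleotide
    prot:prot:)           echo blastp  ;;  # protein vs protein
    nucl:prot:*)          echo blastx  ;;  # translated query vs protein db
    prot:nucl:*)          echo tblastn ;;  # protein vs translated nucleotide db
    nucl:nucl:translated) echo tblastx ;;  # translated query vs translated db
    *) echo "unsupported combination: $query vs $db" >&2; return 1 ;;
  esac
}
```

It could then be used as: blastall -p "$(pick_blast_program nucl nucl)" -d month.nt -i test.txt -o out.txt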

In addition, I had earlier run formatdb.exe without the -o T option, so no index files were generated and the following error appeared:

[NULL_Caption] WARNING: Test: Could not find index files for database month.nt

The fix is to remember the -o option when running formatdb.exe, because by default it does not create indexes. Also make sure the database type (-p) is set correctly!
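To confirm that formatdb produced its output before running a search, one can check for the files on disk. A rough sketch, assuming the extensions written by legacy formatdb (.nhr/.nin/.nsq for nucleotide, .phr/.pin/.psq for protein, plus the string index files .nsd/.nsi or .psd/.psi when -o T is used); the function name missing_db_files is hypothetical, and the extension list should be checked against your formatdb version:

```shell
# Hypothetical helper: print the formatdb output files that are missing.
#   $1 = database base name (e.g. month.nt)
#   $2 = F for nucleotide, T for protein (same meaning as formatdb -p)
# Assumption: -o T adds the string index files *.?sd and *.?si.
missing_db_files() {
  local base="$1" prefix ext status=0
  case "$2" in
    F) prefix=n ;;
    T) prefix=p ;;
    *) echo "second argument must be T or F" >&2; return 2 ;;
  esac
  for ext in hr in sq sd si; do
    if [ ! -e "$base.$prefix$ext" ]; then
      echo "$base.$prefix$ext"   # report each missing file
      status=1
    fi
  done
  return "$status"
}
```

Running missing_db_files month.nt F after formatting prints nothing and succeeds when all five files are present.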

The detailed formatdb documentation follows:

Command Line Options
A list of the command line options and the current version for formatdb may be obtained by executing formatdb without options, as in:

formatdb -

The formatdb options are summarized below:

formatdb 2.2.5 arguments:

-t Title for database file [String]
Optional
-i Input file(s) for formatting (this parameter must be set)
[File In]
-l Logfile name: [File Out]
Optional
default = formatdb.log
-p Type of file
T - protein
F - nucleotide [T/F] Optional
default = T

-o Parse options
T - True: Parse SeqId and create indexes.
F - False: Do not parse SeqId. Do not create indexes.
[T/F] Optional default = F

If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see the section "Note on creating custom databases" below.

-a Input file is database in ASN.1 format (otherwise FASTA is expected)
T - True,
F - False.
[T/F] Optional default = F

-b ASN.1 database in binary mode
T - binary,
F - text mode.
[T/F] Optional default = F

A source ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored in the case of a FASTA input database.

-e Input is a Seq-entry [T/F]
Optional
default = F

A source ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.

-n Base name for BLAST files [String]
Optional

This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli':

formatdb -i ecoli.nuc.txt -p F -o T -n ecoli

uncompress -c nr.z | formatdb -i stdin -o T -n nr

This can be used in situations where the original FASTA file is not required other than by formatdb. This can help in a situation where disk space is tight.

-v Database volume size in millions of letters [Integer] Optional
default = 0
range from 0 to

This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume, formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.

-s Create indexes limited only to accessions - sparse [T/F]
Optional
default = F

This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets like the ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.

-L Create an alias file with this name
use the gifile arg (below) if set to calculate db size
use the BLAST db specified with -i (above) [File Out] Optional

This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.

-F Gifile (file containing list of gi's) [File In] Optional

This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.

-B Binary Gifile produced from the Gifile specified above [File Out]
Optional

This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.

Notes/Troubleshooting:
A.) Note on -o option:
It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ftp.NCBI.nih.gov/blast/db/README. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers for the following cases:

1.) ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE.

2.) query-anchored alignments are desired (i.e., the '-m' option with a non-zero value is used).

3.) The gi's are desired as part of the output (i.e., '-I' is used).

4.) fastacmd will be used to fetch sequences from the database by accession or gi.

See Appendix 1: The Files Produced by Formatdb for more information on the -o T option.
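Since -o T relies on the first word of each definition line being a unique identifier, it can help to check a custom FASTA file before formatting it. A minimal sketch using awk; the function name duplicate_fasta_ids is made up for this example, and it only checks that the first word is unique, not that the defline follows the full convention:

```shell
# Hypothetical helper: print every identifier (first word after ">") that
# occurs on more than one definition line of a FASTA file.
duplicate_fasta_ids() {
  awk '/^>/ {
         id = substr($1, 2)               # drop the leading ">"
         if (++seen[id] == 2) print id    # report each duplicate once
       }' "$1"
}
```

An empty result means every definition line starts with a distinct identifier.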

B.) Note on "SORTFiles failed" message:

Formatdb will use the 'standard' temporary directory to sort the string indices on disk. Under UNIX this directory is often /var/tmp and if there is not enough space there, then the error message: "ERROR: [000.000] SORTFiles failed" will be issued. This can be avoided by setting the TMPDIR environment variable to a partition with more free space. This message may also often be avoided by using the sparse option (-s) for formatdb described above.
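Before a long formatdb run it may be worth checking how much room the temporary directory has. A small sketch, assuming a POSIX-compatible df; the function name free_kb is a placeholder chosen for this example:

```shell
# Hypothetical helper: free space (in 1 KiB blocks) on the file system
# that holds the given directory, read from POSIX "df -Pk" output.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'   # 4th column = available blocks
}
```

For example, free_kb /var/tmp reports the space formatdb would have for sorting there. If that is too little, TMPDIR can be set for a single command, as in TMPDIR=/data/tmp formatdb -i month.nt -p F -o T, where /data/tmp stands for any directory on a larger partition.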

C.) Note on formatting large (4 Gig and larger) FASTA files:

A single BLAST database can contain up to 4 billion letters. If one wishes to formatdb a FASTA file containing more letters than this, several databases, each of a maximum size of 4 billion bases, will be produced. This will be done automatically if the -v argument is not set. One may also specify a smaller size for the volume databases by using the -v option:

formatdb -i hugefasta -p F -v 2000000000

This command line will format the "hugefasta" FASTA file as a number of database "volumes," each containing a maximum of two billion base pairs, as specified by the "-v" option. Two billion is the current limitation on the NCBI toolkit command-line parser. The volumes will have names consisting of the root database name, "hugefasta", followed by a two-digit volume extension, followed by the usual BLAST database extensions. These smaller databases can be searched as if they were a single entity using:

blastall -i infile -d hugefasta -p blastn -o out

In this case, BLAST recognizes that the database "hugefasta" has been partitioned into several volumes because it detects a file with the name of the root database followed by an extension of "nal" (for protein databases, the extension is "pal"). This file specifies a database list to be searched when the root database name is specified to BLAST. BLAST sequentially searches each database listed in this "nal" file and generates output that is indistinguishable from that of a single database search. A sample "nal" file, resulting from formatting the datafile "hugefasta" into three volumes, is given below. The "DBLIST" line can also be edited to specify additional databases to be searched.

#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE hugefasta
#
DBLIST hugefasta.00 hugefasta.01 hugefasta.02
#
#GILIST
#
#OIDLIST
#

The "nal" and "pal" files can also be used to simplify searches of multiple databases created separately. For instance, a file called "multi.nal" containing the following lines could be created from scratch using a text editor.

#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE multi
#
DBLIST part1 part2 part3
#
#GILIST
#
#OIDLIST
#

The "multi.nal" file would allow the three databases, "part1", "part2", and "part3", to be searched by specifying a single database name, "multi", on the blastall command line as follows:

blastall -i infile -d multi -p blastn -o out

The reason for using database volumes, as opposed to simply making the indices in the BLAST databases large enough to handle all conceivable databases with an eight-byte 'integer', is that this would have doubled the size of the indices for all searches no matter how small the database. Hence very large FASTA files are broken down into a couple of databases.
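An alias file like multi.nal can also be generated rather than typed by hand. A minimal sketch, assuming the layout of the sample files above is all that BLAST needs; write_alias_file is a hypothetical name, and the creation date it prints is only a comment line:

```shell
# Hypothetical helper: print a BLAST alias file to standard output.
#   $1   = title
#   $2.. = databases to list on the DBLIST line
write_alias_file() {
  local title="$1"
  shift
  printf '#\n# Alias file created %s\n#\n#\n' "$(date)"
  printf 'TITLE %s\n#\n' "$title"
  printf 'DBLIST %s\n#\n' "$*"      # "$*" joins the names with spaces
  printf '#GILIST\n#\n#OIDLIST\n#\n'
}
```

For example: write_alias_file multi part1 part2 part3 > multi.nal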

Formatdb must be able to open files larger than 2 Gig in order to work on very large files. This is not a problem on a 64-bit OS or on certain 32-bit OSes that allow binaries to be made large-file aware. The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled large-file aware.
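To estimate how many volumes a given volume size would yield, the arithmetic is a ceiling division. A small sketch, assuming sizes are given in letters and that volumes are numbered from 00 with a two-digit suffix as described above; volume_names is a hypothetical helper that does not call formatdb, and the real count may differ slightly because sequences are not split across volumes:

```shell
# Hypothetical helper: list the expected volume names.
#   $1 = root database name, $2 = total letters, $3 = letters per volume
volume_names() {
  local base="$1" total="$2" per_volume="$3" count i=0
  count=$(( (total + per_volume - 1) / per_volume ))   # ceiling division
  while [ "$i" -lt "$count" ]; do
    printf '%s.%02d\n' "$base" "$i"
    i=$(( i + 1 ))
  done
}
```

For a five-billion-letter file and two-billion-letter volumes, volume_names hugefasta 5000000000 2000000000 lists hugefasta.00, hugefasta.01 and hugefasta.02, matching the three-volume sample alias file.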

D.) Note on running formatdb on a database without uncompressing it:

Under UNIX it is possible to uncompress a database on the fly and pipe it to formatdb. This can reduce the disk-space needed for running formatdb on a large database. In addition, some operating systems cannot write files larger than 2 Gig to disk. To circumvent this on Unix or Linux systems, use a "pipe" system such as:

uncompress -c nt.Z | formatdb -i stdin -o T -p F -n "nt" -v 100000000

In this case, no file is written which is larger than 1 Gig and an arbitrarily large database is formatted as a set of 1 Gig volumes. Because nt is a nucleotide database, the file type is given as '-p F'. Note the use of the '-n' option that specifies the name of the resulting BLAST database. Note also that 'stdin' specifies that input will be coming from 'standard input'. The nt FASTA file is not needed for running BLAST searches and nt.Z may be deleted after formatdb has been run.
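The files on the FTP site listed earlier are gzip-compressed (.gz), so the same idea applies with gzip -dc in place of uncompress -c. A sketch, assuming a hypothetical wrapper named stream_into that decompresses an archive and feeds it to whatever command follows:

```shell
# Hypothetical wrapper: decompress a .gz archive to standard output and pipe
# it into the command given by the remaining arguments, so the uncompressed
# FASTA file is never written to disk.
stream_into() {
  local archive="$1"
  shift
  gzip -dc "$archive" | "$@"
}
```

For example: stream_into nt.gz formatdb -i stdin -o T -p F -n nt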

E.) Note on creating custom databases:

With Standalone BLAST it is possible to take any custom file of FASTA sequences and use it as a database source file for searching. All BLAST database source files must be in FASTA format. In order to use the formatdb option -o T, especially with the NCBI toolkit retrieval tools, the FASTA defline must follow a specific format.

F.) Note on creating an alias file for a GI list:

Formatdb can now produce a BLAST database alias file that specifies a (real) BLAST database to search as well as a GI list to limit the search. This is useful if one often searches a subset of a database (e.g., based on organism or a curated list). The alias file makes the search appear as if one were searching a real database rather than the subset of one. The procedure to produce an alias file for searching (protein) nr limiting it to a list of zebrafish GI's would be:

1.) obtain the list of zebrafish GI's from Entrez or some other source and keep it in a file called "zebrafish.gi.in".

2.) invoke formatdb to convert the text GI list to the more efficient binary format:

formatdb -F zebrafish.gi.in -B zebrafish.gi

3.) invoke formatdb with the following options:

formatdb -i nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"

This will produce the alias file zebrafish.pal listing the database title, the real database to be searched, the GI file, and some statistics:

#
# Alias file created Thu Jul 5 15:04:29 2001
#
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724

One can search this by invoking (for example):

blastall -p blastp -d zebrafish -i MYQUERY -o MYOUTPUT

NOTE: One may wish to prepare the alias file in one directory, but move it to a different production directory that does not contain the real database. In that case you may use the '-n' option to specify a path to the real database in the production environment. In the example below the -n option is used to specify that the nr database can really be found at a relative path of ../../newest_blast/blast

formatdb -i nr -n ../../newest_blast/blast/nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"

and the alias file will be:

#
# Alias file created Wed Nov 28 13:55:41 2001
#
#
TITLE My zebrafish database
#
DBLIST ../../newest_blast/blast/nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
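
A text GI list such as zebrafish.gi.in holds one GI number per line, and a stray header or blank line picked up while exporting it is easy to miss. A small sketch that flags such lines before the list is handed to formatdb -F; the strict one-integer-per-line check is an assumption made here, and check_gi_list is a made-up name:

```shell
# Hypothetical helper: print (with line numbers) every line of a text GI
# list that is not a plain non-negative integer; succeed only if none exist.
check_gi_list() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  if grep -nEv '^[0-9]+$' "$1"; then
    return 1      # grep found at least one bad line and printed it
  fi
  return 0
}
```

For example: check_gi_list zebrafish.gi.in && formatdb -F zebrafish.gi.in -B zebrafish.gi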