BLAST Chinese Manual (1)

Original source: BLAST Command Line Applications User Manual

Building a BLAST database with your (local) sequences

Created: June 23, 2008; Updated: January 7, 2021.
If you would like to search the BLAST databases NCBI offers, please see “Get NCBI BLAST databases”.

The makeblastdb application produces BLAST databases from FASTA files. It is possible to use completely unstructured (or even blank) FASTA definition lines, but this is not recommended. Assigning a unique identifier to every sequence in the database lets you retrieve a sequence by its identifier and associate every sequence with a taxonomic node (through the taxid of the sequence). The unique identifier can be a simple string (as in the example below) or the actual accession of the sequence if it comes from a public database (e.g., GenBank). Being able to associate a database sequence with a taxonomic node is especially powerful for the version 5 databases, which BLAST can use to limit a search by taxonomy. The identifier should begin right after the “>” sign on the definition line and contain no spaces, and the -parse_seqids flag should be used. In general, you should not use a “|” (bar) in your identifier. The bar is a reserved character for the NCBI FASTA ID parser, and makeblastdb will return an error unless the bar is used in the specific manner described at https://ncbi.github.io/cxx-toolkit/pages/ch_demo#ch_demo.T5
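The identifier rules above can be checked before a database is built. The following is a minimal sketch, not part of BLAST+: the file name sample_ids.fsa, its sequences, and the helper check_fasta_ids are hypothetical. It takes the text between “>” and the first whitespace as the identifier and reports empty identifiers, duplicates, and any identifier containing a bar. The bar check is deliberately conservative, since a bar is legal in the structured forms the NCBI FASTA ID parser accepts.

```shell
# Hypothetical sample FASTA, written only so the check has something to read.
cat > sample_ids.fsa <<'EOF'
>seq1 first demo protein
MKVLAAGIVGLLLA
>seq2 second demo protein
MSTNPKPQRKTKRN
>seq1 duplicate identifier
MKKLLPTAAAGLLL
>bad|id contains a bar
MAHHHHHHSSGLVP
EOF

# check_fasta_ids FILE: print one line per problem found.
# The identifier is the text between ">" and the first whitespace.
check_fasta_ids() {
  awk '
    /^>/ {
      id = substr($1, 2)
      if (id == "")           print "empty identifier at line " NR
      if (index(id, "|") > 0) print "bar in identifier: " id
      if (seen[id]++)         print "duplicate identifier: " id
    }
  ' "$1"
}

problems=$(check_fasta_ids sample_ids.fsa)
printf '%s\n' "$problems"
```

An empty result means every definition line carries a unique, bar-free identifier.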

An additional (optional) file mapping the identifiers to taxids (numbers that each identify a taxonomic node) may be used to associate each sequence with a taxonomic node.
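As an illustration of how the two input files fit together, here is a minimal sketch. The file names test.fsa and test_map.txt match the makeblastdb command used in this section, but the sequences and the choice of taxids are assumptions made for the example. The map is assumed to hold one identifier and one taxid per line, separated by whitespace. The last step lists any FASTA identifier that has no entry in the map.

```shell
# Hypothetical two-sequence protein FASTA; sequences are illustrative only.
cat > test.fsa <<'EOF'
>seq1 demo protein one
MKVLAAGIVGLLLAQPSFA
>seq2 demo protein two
MSTNPKPQRKTKRNTNRRPQ
EOF

# One "identifier taxid" pair per line, separated by whitespace.
# 9606 is Homo sapiens and 10090 is Mus musculus in the NCBI taxonomy.
cat > test_map.txt <<'EOF'
seq1 9606
seq2 10090
EOF

# List FASTA identifiers that have no entry in the map (ideally none).
missing=$(awk 'NR == FNR { mapped[$1] = 1; next }
               /^>/ { id = substr($1, 2); if (!(id in mapped)) print id }' \
          test_map.txt test.fsa)
echo "unmapped identifiers: ${missing:-none}"
```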
The taxid for a taxonomic node can be looked up with the get_species_taxids.sh script distributed with BLAST+. The NCBI also provides other resources. The files in https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ provide a mapping from accession to taxid (useful if the sequences are from a public database). See https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/ for information on other taxonomy files. Finally, https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi provides a means to perform species name to taxid lookups.

makeblastdb can be invoked on the FASTA and (optional) taxid mapping files as shown below. We use the -blastdb_version parameter to construct a version 5 database and the -taxid_map parameter to associate each sequence with a taxonomic node. Note that we also use -parse_seqids.

makeblastdb -in test.fsa -parse_seqids -blastdb_version 5 -taxid_map test_map.txt \
    -title "Cookbook demo" -dbtype prot

If you do add the taxids to your database, make sure you have the BLAST taxonomy data files (taxdb.bt[di]), which are available from https://ftp.ncbi.nlm.nih.gov/blast/db/ and are also packaged with most BLAST databases distributed by the NCBI.

If all of the sequences in your database have the same taxid, you can simply use the -taxid flag on makeblastdb to associate all sequences with that taxid, with no need to prepare a mapping file.

For releases prior to BLAST+ 2.9.0, ad hoc identifiers (as shown in our example above) should be prefixed with “lcl|” (e.g., lcl|seq1 in place of seq1) in the taxid mapping file.

The NCBI makes the databases that are searchable on the NCBI web site (such as nr, refseq_rna, and swissprot) available on its FTP site. It is better to download these preformatted databases than to start from FASTA. The databases on the FTP site contain taxonomic information for each sequence, include the identifier indices for lookups, and can be up to four times smaller than the FASTA. The original FASTA can be generated from the BLAST database using blastdbcmd.
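Regenerating FASTA from a preformatted database might look like the sketch below. It assumes the swissprot database has already been downloaded and is visible to BLAST+ (through the BLASTDB environment variable or the current directory); the output file name is an arbitrary choice. The guard only prints a message when blastdbcmd or the database is not available.

```shell
db=swissprot   # assumed to be downloaded already and visible to BLAST+

if command -v blastdbcmd >/dev/null 2>&1 &&
   blastdbcmd -db "$db" -info >/dev/null 2>&1; then
  # -entry all selects every sequence; the default output format is FASTA.
  blastdbcmd -db "$db" -entry all -out "$db.fasta"
else
  echo "skipping: blastdbcmd or the $db database is not available"
fi
```

For a large database such as nr the resulting FASTA file is very large, so it may be worth trying this on a small database first.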

Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, and these use LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended). Virtual memory is just that (virtual) and does not depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory limit to unlimited.
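On UNIX-like systems the virtual memory limit is usually controlled per shell with the ulimit builtin. A minimal sketch, assuming a shell such as bash that supports the -v option: it prints the current limit, tries to lift it for the current shell and its child processes, and reports the hard limit if that is not permitted.

```shell
# Show the current soft limit on virtual memory (in KiB, or "unlimited").
echo "before: $(ulimit -v)"

# Try to lift the limit for this shell and its children.
# This fails when the hard limit set by the administrator is lower.
if ulimit -v unlimited 2>/dev/null; then
  echo "after:  $(ulimit -v)"
else
  echo "could not raise the limit; hard limit is $(ulimit -H -v)"
fi
```

The change lasts only for the current shell session, so it needs to be applied in the same shell (or job script) that later runs makeblastdb or the search programs.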

Multiple databases vs. spaces in filenames and paths

Created: June 23, 2008; Updated: January 7, 2021.
BLAST has been able to search multiple databases since 1997. The databases can be listed after the “-db” argument or in an alias file (see the cookbook entries on blastdb_aliastool), separated by spaces. Many operating systems now allow spaces in filenames and directory paths, so some care is required. As a rule, any path containing a space needs two sets of quotes. blastdbcmd is used as the example below, but the same rules apply to makeblastdb and to search programs such as blastn or blastp.

To access a BLAST database whose path contains spaces under Microsoft Windows, it is necessary to use two sets of double quotes, escaping the innermost quotes with a backslash. For example, a database named mydb in Users\joeuser\My Documents\Downloads would be accessed by:

blastdbcmd -db "\"Users\joeuser\My Documents\Downloads\mydb\"" -info

The first backslash escapes the opening inner quote, and the backslash following “mydb” escapes the closing inner quote.

A second database can be added to this command by including it within the outer pair of quotes:

blastdbcmd -db "\"Users\joeuser\My Documents\Downloads\mydb\" myotherdb" -info

If the name or path of the second database had contained a space, it would also have to be surrounded by quotes escaped with a backslash.

Under UNIX systems (including Linux and Mac OS X) it is preferable to use single quotes (') as the outer pair, so that the inner double quotes need no escaping:

blastdbcmd -db '"path with spaces/mydb"' -info

Multiple databases can also be listed within the single quotes, similar to the procedure described for Microsoft Windows.
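To see why the nested quoting matters, it helps to look at what the shell actually hands to the program. In the sketch below, show_args is a hypothetical stand-in for blastdbcmd that prints each argument it receives in brackets; no BLAST installation is needed to run it. The database names are the ones used in the examples above.

```shell
# Stand-in for a BLAST program: prints each received argument in brackets.
show_args() { for a in "$@"; do printf '[%s]\n' "$a"; done; }

# Single quotes preserve the inner double quotes, so the value after -db
# arrives as ONE argument that still contains the quoted path with spaces
# followed by the second database name.
show_args -db '"path with spaces/mydb" myotherdb' -info
```

The second line of output is the single string that BLAST then splits into database names itself, treating the part inside the inner double quotes as one path.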

Specifying a sequence as the multiple sequence alignment master in psiblast

Created: June 23, 2008; Updated: January 7, 2021.
The -in_msa option of psiblast, unlike that of blastpgp, does not support specifying a master sequence via the -query option. To make a sequence other than the first one in the multiple sequence alignment file the master sequence, specify it with the -msa_master_idx option. For instance, in the example below, the third sequence in the multiple sequence alignment would be used as the master sequence:

psiblast -in_msa align1 -db pataa -msa_master_idx 3

Ignoring the consensus sequence in the multiple sequence alignment in psiblast

Created: June 23, 2008; Updated: January 7, 2021.
Often a consensus sequence is added to a multiple sequence alignment to be used as the master sequence in a PSI-BLAST search. The consensus sequence is a good choice for displaying the query-subject alignment in the output and for defining which MSA columns are converted to the PSSM. At the same time, adding the consensus sequence changes the statistical properties of the original alignment. To avoid this, the -ignore_msa_master option can be used:

psiblast -in_msa align1 -db pataa -ignore_msa_master

In this case the master sequence is displayed in the output but ignored when the PSSM scores are calculated.
