BLAST Chinese Manual (2)

Get NCBI BLAST databases

Created: June 23, 2008; Updated: January 7, 2021.
The best way to obtain BLAST databases is to download them from NCBI or from cloud providers (currently Google Cloud Platform and Amazon Web Services). These are the same databases available via the public BLAST Web Service (https://blast.ncbi.nlm.nih.gov); they are updated regularly and have taxonomic information built in. They can also serve as a source of biological sequence data (see below). To download a preformatted NCBI BLAST database, run the update_blastdb.pl program followed by any relevant options and the name(s) of the BLAST databases to download. For example:

update_blastdb.pl --decompress nr [*]

This command downloads the compressed nr BLAST database from NCBI to the current working directory and decompresses it. Subsequent identical invocations of the script in that directory download data only if its time stamp differs from that of the data at NCBI.
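
As a minimal sketch of how the download step might be scripted for several databases at once, the helper below prints the update_blastdb.pl command for each database name and only executes it when DRY_RUN is set to 0. The function name fetch_db, the DRY_RUN variable, and the choice of databases are assumptions made for illustration; they are not part of BLAST+.

```shell
# Hypothetical helper (not part of BLAST+): print, or optionally run,
# the download command for one preformatted database.
DRY_RUN="${DRY_RUN:-1}"   # assumption: default to printing only

fetch_db() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "update_blastdb.pl --decompress $1"
    else
        update_blastdb.pl --decompress "$1"
    fi
}

# Database names as they appear elsewhere in this document.
for db in nr refseq_protein; do
    fetch_db "$db"
done
```

Printing first makes it easy to review what would be downloaded, since nr in particular is very large.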

The update_blastdb.pl script can determine whether you are calling it from within a cloud provider and, if so, automatically downloads from the appropriate cloud bucket.

To see which BLAST databases are available for download, run:

update_blastdb.pl --showall [*]

For more information on available NCBI BLAST databases, please see https://go.usa.gov/xPhky. For a demo of this tool, please see https://bit.ly/2UA7tYb (external link).

For more details about the command line options this tool supports, run:

update_blastdb.pl --help

If you need FASTA from these BLAST databases, you can obtain it as follows:

blastdbcmd -entry all -db nr -out nr.fsa

If you need FASTA for selected sequence(s) from these BLAST databases, you can obtain it as follows (in this example the sequence of interest is identified by the accession u00001):

blastdbcmd -entry u00001 -db nr -out u00001.fsa
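
After extracting FASTA, a quick sanity check is to count the records in the output file. The sketch below assumes standard FASTA, in which every record starts with a ">" defline; the function name count_fasta_records and the small sample file are made up for illustration.

```shell
# Count FASTA records by counting defline markers at line starts.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tiny illustrative file (not real data): two records.
sample="$(mktemp)"
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n' > "$sample"
count_fasta_records "$sample"   # prints 2
rm -f "$sample"
```

The same function could be pointed at nr.fsa or u00001.fsa once those files exist.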

[*] If you run into any problems with this invocation, please try the --passive option, which is enabled by default in BLAST+ 2.8.1 and later. The --decompress option is only needed if the source data comes from NCBI.

Create a masked BLAST database

Created: June 23, 2008; Updated: January 7, 2021.
Creating a masked BLAST database is a two-step process:
a. Generate the masking data using a sequence filtering utility such as windowmasker or dustmasker
b. Generate the actual BLAST database using makeblastdb
For both steps, the input file can be a text file containing sequences in FASTA format, or an existing BLAST database created using makeblastdb. We will provide examples for both scenarios.

Collect mask information files

For nucleotide sequence data in FASTA files or BLAST database format, we can generate the mask information files using windowmasker or dustmasker. Windowmasker masks over-represented sequence data, and it can also mask low-complexity sequence data using its built-in dust algorithm (through the -dust option). To mask low-complexity sequences only, we need to use dustmasker.

For protein sequence data in FASTA files or BLAST database format, we need to use segmasker to generate the mask information file.
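
The choice of masking utility described above can be sketched as a small lookup. This is only an illustration of the rule in the text: the function name pick_masker is hypothetical, and for nucleotide data it returns dustmasker even though windowmasker is an equally valid choice when over-represented sequences should be masked as well.

```shell
# Hypothetical helper: map a database type to a masking utility,
# following the rules described in the text.
pick_masker() {
    case "$1" in
        nucl) echo "dustmasker" ;;   # low-complexity only; windowmasker is the alternative
        prot) echo "segmasker" ;;
        *)    echo "unknown database type: $1" >&2; return 1 ;;
    esac
}

pick_masker nucl
pick_masker prot
```

The type names nucl and prot match the values that makeblastdb accepts for -dbtype.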

The following examples assume that the BLAST databases listed in "Obtaining sample data for this cookbook entry" are available in the current working directory. Note that sequence id parsing should be used consistently. In all our examples, we enable it by including -parse_seqids in the command line arguments.

Create masking information using dustmasker

We can generate the masking information with dustmasker using a single command line:

dustmasker -in hs_chr -infmt blastdb -parse_seqids \
-outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Here we specify that the input is a BLAST database named hs_chr (-in hs_chr -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary ASN.1 format (-outfmt maskinfo_asn1_bin), and name the output file hs_chr_dust.asnb (-out hs_chr_dust.asnb).

If the input is the original FASTA file, hs_chr.fa, we need to change the -in and -infmt options as follows:

dustmasker -in hs_chr.fa -infmt fasta -parse_seqids \
-outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Create masking information using windowmasker

To generate the masking information using windowmasker from the BLAST database hs_chr, we first need to generate a counts file:

windowmasker -in hs_chr -infmt blastdb -mk_counts \
-parse_seqids -out hs_chr_mask.counts

Here we specify the input BLAST database (-in hs_chr -infmt blastdb), request generation of the counts (-mk_counts) with sequence id parsing (-parse_seqids), and save the output to a file named hs_chr_mask.counts (-out hs_chr_mask.counts).

To generate the counts from the FASTA file hs_chr.fa instead, we need to change the input file name and format:

windowmasker -in hs_chr.fa -infmt fasta -mk_counts \
-parse_seqids -out hs_chr_mask.counts

With the counts file we can then create the file containing the masking information as follows:

windowmasker -in hs_chr -infmt blastdb -ustat hs_chr_mask.counts \
-outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Here we use the same input (-in hs_chr -infmt blastdb) together with the output of step 1 (-ustat hs_chr_mask.counts). We set the mask file format to binary ASN.1 (-outfmt maskinfo_asn1_bin), enable sequence id parsing (-parse_seqids), and save the masking data to hs_chr_mask.asnb (-out hs_chr_mask.asnb).
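
Because step 2 must read exactly the file that step 1 wrote, one way to avoid a file name mismatch is to derive both names from shared variables. The sketch below only assembles and prints the two windowmasker command lines using the options shown above; the variable names are assumptions for illustration, and nothing is executed.

```shell
# Base name of the input database, as in the examples in the text.
DB=hs_chr
COUNTS="${DB}_mask.counts"   # written by step 1, read by step 2
MASK="${DB}_mask.asnb"       # final masking data file

STEP1="windowmasker -in $DB -infmt blastdb -mk_counts -parse_seqids -out $COUNTS"
STEP2="windowmasker -in $DB -infmt blastdb -ustat $COUNTS -outfmt maskinfo_asn1_bin -parse_seqids -out $MASK"

# Print the two commands for review; they are not executed here.
printf '%s\n' "$STEP1" "$STEP2"
```

Changing DB in one place then keeps the -out value of the first command and the -ustat value of the second in agreement.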

To use the FASTA file hs_chr.fa, we change the input file name and file type:

windowmasker -in hs_chr.fa -infmt fasta -ustat hs_chr_mask.counts \
-outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Create masking information using segmasker

We can generate the masking information with segmasker using a single command line:

segmasker -in refseq_protein -infmt blastdb -parse_seqids \
-outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Here we specify the refseq_protein BLAST database (-in refseq_protein -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary ASN.1 format (-outfmt maskinfo_asn1_bin), and name the output file refseq_seg.asnb (-out refseq_seg.asnb).

If the input is a FASTA file, we need to change the command line to specify the input file and format:

segmasker -in refseq_protein.fa -infmt fasta -parse_seqids \
-outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Extract masking information from FASTA sequences with lowercase masking

We can also extract the masking information from a FASTA sequence file with lowercase masking (generated by various means) using the convert2blastmask utility. An example command line follows:

convert2blastmask -in hs_chr.mfa -parse_seqids -masking_algorithm repeat \
-masking_options "repeatmasker, default" -outfmt maskinfo_asn1_bin \
-out hs_chr_mfa.asnb

Here we specify the input file (-in hs_chr.mfa), enable parsing of sequence ids (-parse_seqids), specify the masking algorithm name (-masking_algorithm repeat) and its parameters (-masking_options "repeatmasker, default"), and ask for binary ASN.1 output (-outfmt maskinfo_asn1_bin) to be saved in the specified file (-out hs_chr_mfa.asnb).
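
To make the idea of lowercase masking concrete, the sketch below lists the lowercase runs in a FASTA file as tab-separated id, start, and end columns. It is an illustration of what a lowercase-masked file encodes, not a description of how convert2blastmask works internally; the function name masked_intervals, the 1-based inclusive coordinates, and the sample sequence are assumptions.

```shell
# List lowercase (soft-masked) runs in a FASTA file as: id, start, end
# (tab-separated, 1-based, inclusive). Illustration only.
masked_intervals() {
    awk '
    function flush() {            # emit the currently open lowercase run, if any
        if (start > 0) { printf "%s\t%d\t%d\n", id, start, pos; start = 0 }
    }
    /^>/ { flush(); id = substr($1, 2); pos = 0; next }
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c != toupper(c)) {             # lowercase residue
                if (start == 0) start = pos + 1
            } else {
                flush()                        # run ended at the previous residue
            }
            pos++
        }
    }
    END { flush() }
    ' "$1"
}

# Tiny made-up example: residues 3-4 and 7-8 of seq1 are lowercase.
sample="$(mktemp)"
printf '>seq1 demo\nACgtAC\nttGG\n' > "$sample"
masked_intervals "$sample"
rm -f "$sample"
```

Positions are counted across line breaks within a record, so a masked run that spans two sequence lines is reported as a single interval.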

Create BLAST database with the masking information

Using the masking information data files generated in the previous 4 steps, we can create a BLAST database with the masking information incorporated.

Notes:

  1. We should use -parse_seqids in a consistent manner: either use it in both steps or not use it at all.
  2. Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, which use LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended). Virtual memory is just that (virtual) and does not depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory to unlimited.
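
On Linux-like systems, the virtual memory limit mentioned in note 2 is typically inspected and changed with the shell's ulimit builtin. The sketch below assumes a shell such as bash that supports ulimit -v; whether the limit can actually be raised depends on the hard limit configured on your system.

```shell
# Show the current soft limit on virtual memory (in KB, or "unlimited").
ulimit -v

# Try to lift the limit for this shell session only; this can fail when
# the hard limit set by the administrator is lower.
ulimit -v unlimited 2>/dev/null || echo "could not raise the virtual memory limit" >&2
```

The setting applies only to the current shell and the processes started from it, so it needs to be issued in the same session that runs makeblastdb.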

Create BLAST database with masking information using an existing BLAST database or FASTA sequence file as input

For example, we can use the following command line to apply the masking information created above to the existing BLAST database generated in "Obtaining sample data for this cookbook entry":

makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \
-mask_data hs_chr_mask.asnb -out hs_chr -title \
"Human Chromosome, Ref B37.1"

Here, we use the existing BLAST database as the input (-in hs_chr), specify its input and molecule type (-input_type blastdb -dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mask.asnb), and name the output database with the same base name (-out hs_chr), overwriting the existing one.

To use the original FASTA sequence file (hs_chr.fa) as the input instead, we need to pass -in hs_chr.fa (with -input_type fasta in place of -input_type blastdb) to instruct makeblastdb to use that FASTA file.

We can check the "re-created" database to find out whether the masking information was added properly, using blastdbcmd with the following command line:

blastdbcmd -db hs_chr -info

This command prints a summary of the target database.

The extra lines under "Available filtering algorithms ..." describe the masking algorithms available. The "Algorithm ID" field, 30 in our case, is the value to pass to the -db_soft_mask parameter if we want to invoke database soft masking during an actual search.
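
As a hedged illustration of how the algorithm ID is used, the sketch below assembles a blastn search against hs_chr with database soft masking enabled. The query and output file names are hypothetical, and the ID 30 is simply the value quoted in the text; the command is printed rather than executed.

```shell
# Algorithm ID as reported by blastdbcmd -info in the text.
ALGO_ID=30
QUERY=query.fa             # hypothetical query file
OUT=query_vs_hs_chr.txt    # hypothetical output file

SEARCH="blastn -query $QUERY -db hs_chr -db_soft_mask $ALGO_ID -out $OUT"

# Print the command for review; run it once BLAST+ and the database are in place.
echo "$SEARCH"
```

If your database reports a different algorithm ID, substitute that value for ALGO_ID.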

We can apply additional masking data to an existing BLAST database that already has one type of masking information added. For example, we can apply the dust masking generated above to the database generated earlier by using this command line:

makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \
-mask_data hs_chr_dust.asnb -out hs_chr -title "Human Chromosome, Ref B37.1"

Here, we use the existing database as the input (-in hs_chr), specify its input and molecule type (-input_type blastdb -dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the dust masking data (-mask_data hs_chr_dust.asnb), and name the database with the same base name (-out hs_chr), overwriting the existing one.

We can check the "re-generated" database with blastdbcmd:

blastdbcmd -db hs_chr -info

The summary shows that both sets of masking information are available.

A more straightforward approach is to apply multiple sets of masking information in a single makeblastdb run by providing the masking data files as a comma-delimited list:

makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \
-mask_data hs_chr_dust.asnb,hs_chr_mask.asnb -out hs_chr
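
Since the shell splits arguments on spaces, the comma-delimited list has to reach makeblastdb as one word. A minimal sketch of building such a list from separate file names follows; the helper name join_masks is hypothetical, and the makeblastdb command is only printed.

```shell
# Join file names with commas and no spaces, so that the whole list
# is passed to -mask_data as a single argument.
join_masks() {
    ( IFS=,; printf '%s\n' "$*" )
}

MASKS="$(join_masks hs_chr_dust.asnb hs_chr_mask.asnb)"
echo "makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids -mask_data $MASKS -out hs_chr"
```

The subshell keeps the change to IFS from leaking into the rest of the script.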

Create a protein BLAST database with masking information

We can use the masking data file generated in "Create masking information using segmasker" to create a protein BLAST database:

makeblastdb -in refseq_protein -input_type blastdb -dbtype prot -parse_seqids \
-mask_data refseq_seg.asnb -out refseq_protein -title \
"RefSeq Protein Database"

Using blastdbcmd, we can check the database thus generated:

blastdbcmd -db refseq_protein -info

This produces a summary of the database that includes the masking information.

Create a nucleotide BLAST database using the masking information extracted from a lowercase-masked FASTA file

We use the following command line:

makeblastdb -in hs_chr.mfa -input_type fasta -dbtype nucl -parse_seqids \
-mask_data hs_chr_mfa.asnb -out hs_chr_mfa -title "Human chromosomes (mfa)"

Here we use the lowercase-masked FASTA sequence file as input (-in hs_chr.mfa), specify its file type (-input_type fasta), specify the database as nucleotide (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mfa.asnb), and name the resulting database hs_chr_mfa (-out hs_chr_mfa).

We can check the database thus generated using blastdbcmd, as before. In its summary, the algorithm name and algorithm options are the values we provided in "Extract masking information from FASTA sequences with lowercase masking".

Obtaining sample data for this cookbook entry

For input nucleotide sequences, we use the BLAST database generated from a FASTA input file, hs_chr.fa, containing the complete human chromosomes from BUILD38. The file is generated by inflating and combining the hs_ref_*.fa.gz files located at:

ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
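
The "inflating and combining" step can be sketched with gunzip, assuming the compressed files have already been downloaded into the current directory. The function name combine_fasta_gz is hypothetical; the file names in the usage comment are the ones given in the text.

```shell
# Decompress every input file to stdout and concatenate the results
# into one FASTA file.
# Usage: combine_fasta_gz OUTPUT INPUT.gz...
combine_fasta_gz() {
    out="$1"; shift
    gunzip -c "$@" > "$out"
}

# For the files named in the text, the call would look like:
#   combine_fasta_gz hs_chr.fa hs_ref_*.fa.gz
```

The shell expands the glob in lexical order, so the order of chromosomes in the combined file follows the file names rather than chromosome number.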

We use this command line to create the BLAST database from the input nucleotide sequences:

makeblastdb -in hs_chr.fa -dbtype nucl -parse_seqids \
-out hs_chr -title "Human chromosomes, Ref B38"

For input nucleotide sequences with lowercase masking, we use the FASTA file hs_chr.mfa, containing the complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.mfa.gz files located in the same ftp directory.

For input protein sequences, we use the preformatted refseq_protein database from the NCBI blast/db/ ftp directory:

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.00.tar.gz
ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.01.tar.gz
ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.02.tar.gz