构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb...

最新推荐文章于 2024-03-07 20:33:54 发布

weixin_30497527

最新推荐文章于 2024-03-07 20:33:54 发布

阅读量3k

点赞数 1

文章标签：数据库 json 操作系统

原文链接：http://www.cnblogs.com/leezx/p/6425620.html

版权

本文详细介绍了如何下载和构建NCBI BLAST数据库，包括makeblastdb的使用，资源消耗分析，并建议使用diamond作为替代工具。还提供了blast结果的解读以及数据库更新和格式化的相关信息。

摘要由CSDN通过智能技术生成

参考链接：

FTP README

如何下载 NCBI NR NT数据库？

下载blast：ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+

先了解BLAST Databases：BLAST FTP Site

如何下载NCBI blast数据库？

NCBI提供了一个非常智能化的脚本update_blastdb.pl来自动下载所有blast数据库。

脚本使用方法：

perl update_blastdb.pl nr

有哪些可供下载的blast数据库？

perl update_blastdb.pl --showall

该命令会显示所有可供下载的blast数据库，请自行选择：

16SMicrobial
cdd_delta
env_nr
env_nt
est
est_human
est_mouse
est_others
gss
gss_annot
htgs
human_genomic
landmark
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
ref_prok_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
refseq_genomic
refseq_protein
refseq_rna
refseqgene
sts
swissprot
taxdb
tsa_nr
tsa_nt
vector

这里我选择的是nr数据库。

nohup perl update_blastdb.pl --decompress nr >out.log 2>&1 &

自动在后台下载，然后自动解压。（下载到一半断网了，在运行会接着下载，而不会覆盖已经下载好的文件）

blast如何使用？

这里只演示blastx的使用方法。

刚才下载的nr库就是蛋白库，blastx就是用来将核酸序列比对到蛋白库上的。（nt就是核酸库）

因为我们下载的是已经建好索引的数据库，所以省去了makeblastdb的过程。

常见的命令有下面几个：

-query <File_In> 要查询的核酸序列

-db <String> 数据库名字

-out <File_Out> 输出文件

-evalue <Real> evalue阈值

-outfmt <String> 输出的格式

blast构建索引 | makeblastdb

makeblastdb -in mature.fa -input_type fasta -dbtype nucl -title miRBase -parse_seqids -out miRBase -logfile File_Name

-in 后接输入文件，你要格式化的fasta序列
-dbtype 后接序列类型，nucl为核酸，prot为蛋白
-title 给数据库起个名，好看~~(不能用在后面搜索时-db的参数)
-parse_seqids 推荐加上，现在有啥原因还没搞清楚
-out 后接数据库名，自己起一个有意义的名字，以后blast+搜索时要用到的-db的参数
-logfile 日志文件，如果没有默认输出到屏幕

资源消耗

blastx -query test.merged.transcript.fasta -db nr -out test.blastx.out

其中fasta文件只有19938行。

可是运行起来耗费了很多资源：

平均内存消耗：51.45G；峰值：115.37G

cpu：1个

运行时间：06:00:24（你敢信？这才是一个小小的test）

所以我强烈推荐用diamond替代blast来做数据库搜索。

blast结果解读

每一个合格的序列比对都会给出一个这样的结果（一个query sequence比对到多个就有多个结果）：

>AAB70410.1 Similar to Schizosaccharomyces CCAAT-binding factor (gb|U88525).
EST gb|T04310 comes from this gene [Arabidopsis thaliana]
Length=208

 Score = 238 bits (607),  Expect = 7e-76, Method: Compositional matrix adjust.
 Identities = 116/145 (80%), Positives = 127/145 (88%), Gaps = 2/145 (1%)
 Frame = +1

Query  253  FWASQYQEIEQTSDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  432
            FW +Q++EIE+T+DFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR
Sbjct  39   FWENQFKEIEKTTDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  98

Query  433  SWNHTEENKRRTLQKNDIAAAITRNEIFDFLVDIVPREDLKDEVLASIPRGTLPMGAPTE  612
            SWNHTEENKRRTLQKNDIAAA+TR +IFDFLVDIVPREDL+DEVL SIPRGT+P  A
Sbjct  99   SWNHTEENKRRTLQKNDIAAAVTRTDIFDFLVDIVPREDLRDEVLGSIPRGTVPEAA-AA  157

Query  613  GLPYYYMQPQHAPQVGAPGMFMGKP  687
            G PY Y+    AP +G PGM MG P
Sbjct  158  GYPYGYLPAGTAP-IGNPGMVMGNP  181

结果解读网上很多，这里不啰嗦了。

以下是我在同样条件下测试的diamond：

平均内存消耗：11.01G；峰值：12.44G

cpu：1个（571.17%）也就是会自动占用5-6个cpu

运行时间：00:26:15

而且diamond注明了，它的优势是处理>1M 的query，量越大速度越快。

diamond的简单用法：

diamond makedb --in nr.fa -d nr
diamond blastx -d nr -q test.merged.transcript.fasta -o test.matches.m8

但是diamond使用有限制，只能用于比对蛋白数据库。

以下是OrfPredictor推荐的参数设置：

To minimize the file size of BLASTX output for loading, the following parameters are recommended if the BLASTX in the 'NCBI-blastall' package is used: "-v 1 -b 1 -e 1e-5" (Note: we used version 2.2.19 - earlier or later versions may not work properly).

下面是详细的blastx帮助文档，以供查阅：

$ blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
    [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.7.1+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand <String, `both', `minus', `plus'>
   Query strand(s) to search against database/subject
   Default = `both'
 -query_gencode <Integer, values between: 1-6, 9-16, 21-25>
   Genetic code to use to translate query (see user manual for details)
   Default = `1'

 *** General search options
 -task &