NCBI Local BLAST

LOCAL BLAST


#安装、配置、使用

#!/bin/bash
function install_blast(){
	#install local blast programs
	wget 
	tar -zxvf 
	export PATH=/home/xx/ncbi/bin:${PATH}
	balstp -version #产看是否安装成功
}

function download_database(){
	#download databases from NCBI websites
	#wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz 
	
	if false;then
	# 1.下载fasta格式序列文件,随后需要用makeblastdb格式化
	time wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
	time unzip nr 
	time makeblastdb \
	-in nr \
	-parse_seqids \
	-dbtype prot \
	-out nr \
	-logfile nr_logfile 
	#-parse_seqids根据序列ID来分裂
	fi 
	
	# 2.下载NCBI已经格式化好的blast序列文件(推荐)
	time for i in {00..38};do wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.${i}.tar.gz &;done 
	time for i in {00..38};do tar -zxvf nr.${i}.tar.gz &;done 
	time for i in {00..38};do rm nr.${i}.tar.gz;done #节省空间
}

function blast(){
	#start blast+
	 
	nohup time blastp \
	-num_threads 20 \
	-query Bacillus_subtilis_GLB191_Genome_ORF_Prediction_translations.fasta \
	-db ../nr/nr \
	-out result.txt  \
	-max_hsps 1  \
	-num_alignments 1 \
	-evalue 1e-5 \
	-outfmt "6 qseqid sseqid sgi stitle evalue bitscore pident qcovs length mismatch gapopen qstart qend sstart send" & #太耗时,用nohup和&
	if false;then
	1.一般常用的就是-outfmt 5, 6 或者 7 参数,“5”输出XML格式,“6”输出TAB分割格式,“7”输出带注释的TAB分割格式
		qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
	
	2.如果像输出其他信息,比如比对覆盖度信息(qcovs:Query Coverage Per Subject),需要如在outfmt 6基础上加上 “qcovs”( 覆盖度信息),
	添加位置和顺序就是输出TAB文件中列的排列形式:
		-outfmt "6 qseqid sseqid sgi stitle evalue bitscore pident qcovs length mismatch gapopen qstart qend sstart send"
	
	3.如果需要blast比对返回一个最优的比对结果,需要控制-max_target_seqs , -num_alignments 和 -max_hsps 选项:
		-max_target_seqs <Integer, >=1>Maximum number of aligned sequences to keepNot applicable for outfmt <= 4* Incompatible with: num_descriptions, num_alignments 
		-num_alignments <Integer, >=0>Number of database sequences to show alignments for* Incompatible with: max_target_seqs
		
	4.NCB blast-2.8版本可支持用NCBI自带代码分割的NR子库的索引作为比对的库,使用比较方便
	当然如果用这个版本的话,NR库也要重新下载了ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/
		I.使用方式也比较简单(至少比之前的方法方便了),如果只想比对单一物种(如人:9606的话),命令如下:
			blastp –db nr –query query.fasta –taxids 9606 –outfmt 6 –out blast.outfm6
			
		II.如果想比对NR子库哺乳动物的话,需要先建个哺乳动物子库tax_id索引:
		    get_species_taxids.sh -t 40674 > 40674.txids
			
		II.然后再将序列比对至NR哺乳动物子库:
			blastp –db nr –query query.fasta –taxidlist 40674.txids –outfmt 6 –out blast.outfm6
			具体说明可看:https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf
	fi 
}



#main process
install_blast
download_database
blast
*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report
 
   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
       qaccver means Query accesion.version
          qlen means Query sequence length
        sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       saccver means Subject accession.version
       sallacc means All subject accessions
          slen means Subject sequence length
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
          btop means Blast traceback operations (BTOP)
        staxid means Subject Taxonomy ID
      ssciname means Subject Scientific Name
      scomname means Subject Common Name
    sblastname means Subject Blast Name
     sskingdom means Subject Super Kingdom
       staxids means unique Subject Taxonomy ID(s), separated by a ';'
             (in numerical order)
     sscinames means unique Subject Scientific Name(s), separated by a ';'
     scomnames means unique Subject Common Name(s), separated by a ';'
    sblastnames means unique Subject Blast Name(s), separated by a ';'
             (in alphabetical order)
    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
             (in alphabetical order) 
        stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
       sstrand means Subject Strand
         qcovs means Query Coverage Per Subject
       qcovhsp means Query Coverage Per HSP
        qcovus means Query Coverage Per Unique Subject (blastn only)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值