BLAST Chinese Manual (6)

Appendices

Created: June 23, 2008; Updated: March 14, 2021.

Conversion from C toolkit applications

The functionality offered by the BLAST+ applications has been organized by program type. The following graph depicts a correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications:

[Figure: correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications]

The easiest way to get started using the BLAST+ command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using

[Figure: example command lines]

The purpose of the legacy_blast.pl PERL script is to help users make the transition from the C Toolkit BLAST command line applications to the BLAST+ applications. Invoking the script without any arguments prints its documentation.

The legacy_blast.pl script supports two modes of operation, one in which the C Toolkit BLAST command line invocation is converted and executed on behalf of the user and another which solely displays the BLAST+ application equivalent to what was provided, without executing the command.

The first mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and optionally providing the --path argument after the command line to convert if the installation path for the BLAST+ applications differs from the default (available by invoking the script without arguments). See example in the first section of the Quick start.

The second mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and appending the --print_only command line option as follows:

./legacy_blast.pl megablast -i query.fsa -d nt -o mb.out --print_only
/opt/ncbi/blast/bin/blastn -query query.fsa -db "nt" -out mb.out
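
The two modes can also be driven from a script. The following is a minimal Python sketch rather than anything shipped with BLAST+: the helper name `legacy_blast_command` and the default script location are assumptions made for illustration, while `--path` and `--print_only` are the options described above.

```python
from typing import List, Optional


def legacy_blast_command(
    legacy_args: List[str],
    print_only: bool = False,
    blast_path: Optional[str] = None,
    script: str = "./legacy_blast.pl",  # assumed location of the script
) -> List[str]:
    """Build an argument list for legacy_blast.pl (hypothetical helper)."""
    command = [script] + list(legacy_args)
    if blast_path is not None:
        # --path is appended after the C Toolkit command line being converted.
        command += ["--path", blast_path]
    if print_only:
        # Only display the equivalent BLAST+ command; do not execute it.
        command.append("--print_only")
    return command


if __name__ == "__main__":
    args = ["megablast", "-i", "query.fsa", "-d", "nt", "-o", "mb.out"]
    print(" ".join(legacy_blast_command(args, print_only=True)))
```

The returned list can be handed to `subprocess.run` on a machine where the script and Perl are installed.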

Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:

| Exit Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error in query sequence(s) or BLAST options |
| 2 | Error in BLAST database |
| 3 | Error in BLAST engine |
| 4 | Out of memory |
| 5 | Network error connecting to NCBI to fetch sequence data |
| 6 | Error creating output files |
| 255 | Unknown error |

In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).
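
A wrapper script can turn these exit codes into readable messages. The sketch below is illustrative only: the function name `describe_exit_code` and the fallback text for unlisted codes are assumptions, and the code-to-meaning mapping is copied from the table above.

```python
# Exit codes of the BLAST+ search applications, as listed in the table above.
SEARCH_EXIT_CODES = {
    0: "Success",
    1: "Error in query sequence(s) or BLAST options",
    2: "Error in BLAST database",
    3: "Error in BLAST engine",
    4: "Out of memory",
    5: "Network error connecting to NCBI to fetch sequence data",
    6: "Error creating output files",
    255: "Unknown error",
}

# The database applications only report success (0) or failure (1).
DATABASE_EXIT_CODES = {0: "Success", 1: "Failure"}


def describe_exit_code(code: int, database_app: bool = False) -> str:
    """Return a short description for an exit code (hypothetical helper)."""
    table = DATABASE_EXIT_CODES if database_app else SEARCH_EXIT_CODES
    # Codes missing from the table get a neutral fallback message.
    return table.get(code, "Unrecognized exit code")


if __name__ == "__main__":
    for code in (0, 2, 255):
        print(code, describe_exit_code(code))
```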

Options for the command-line applications

This appendix consists of several tables that list option names, types, default values, and a short description of the option. These tables were first published as an appendix to an article in BMC Bioinformatics (BLAST+: architecture and applications). They have been updated for this manual.

Table C1: Options common to all BLAST+ search applications. An option of type “flag” takes no argument, but if present is true. Some options are valid only for a local search (“remote” option not used), others are valid only for a remote search (“remote” option used).

| option | type | default value | description and notes |
|---|---|---|---|
| db | string | none | BLAST database name. |
| query | string | stdin | Query file name. |
| query_loc | string | none | Location on the query sequence (Format: start-stop). |
| out | string | stdout | Output file name. |
| evalue | real | 10.0 | Expect value (E) for saving hits. |
| subject | string | none | File with subject sequence(s) to search. |
| subject_loc | string | none | Location on the subject sequence (Format: start-stop). |
| show_gis | flag | N/A | Show NCBI GIs in report. |
| num_descriptions | integer | 500 | Show one-line descriptions for this number of database sequences. |
| num_alignments | integer | 250 | Show alignments for this number of database sequences. |
| max_target_seqs | integer | 500 | Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments. Ties are broken by order of sequences in the database. |
| max_hsps | integer | none | Maximum number of HSPs (alignments) to keep for any single query-subject pair. The HSPs shown will be the best as judged by expect value. This number should be an integer that is one or greater. If this option is not set, BLAST shows all HSPs meeting the expect value criteria. Setting it to one will show only the best HSP for every query-subject pair. |
| html | flag | N/A | Produce HTML output. |
| gilist | string | none | Restrict search of database to GIs listed in this file. Local searches only. |
| negative_gilist | string | none | Restrict search of database to everything except the GIs listed in this file. Local searches only. |
| entrez_query | string | none | Restrict search with the given Entrez query. Remote searches only. |
| culling_limit | integer | none | Delete a hit that is enveloped by at least this many higher-scoring hits. |
| best_hit_overhang | real | none | Best Hit algorithm overhang value (recommended value: 0.1). |
| best_hit_score_edge | real | none | Best Hit algorithm score edge value (recommended value: 0.1). |
| dbsize | integer | none | Effective size of the database. |
| searchsp | integer | none | Effective length of the search space. |
| import_search_strategy | string | none | Search strategy file to read. |
| export_search_strategy | string | none | Record search strategy to this file. |
| parse_deflines | flag | N/A | Parse query and subject bar delimited sequence identifiers (e.g., gi\|129295). |
| num_threads | integer | 1 | Number of threads (CPUs) to use in blast search. |
| remote | flag | N/A | Execute search on NCBI servers? |

outfmt (type: string; default value: 0). Alignment view options:
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = XML Blast output
6 = tabular
7 = tabular with comment lines
8 = Text ASN.1
9 = Binary ASN.1
10 = Comma-separated values
11 = BLAST archive format (ASN.1)
12 = Seqalign (JSON)
13 = Multiple-file BLAST JSON
14 = Multiple-file BLAST XML2
15 = Single-file BLAST JSON
16 = Single-file BLAST XML2
17 = Sequence Alignment/Map (SAM)
18 = Organism Report
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ‘;’
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
sallacc means All subject accessions
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a ‘/’
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ‘;’ (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ‘;’
scomnames means unique Subject Common Name(s), separated by a ‘;’
blastnames means unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a ‘<>’
sstrand means Subject Strand
qcovs means Query Coverage Per Subject (for all HSPs)
qcovhsp means Query Coverage Per HSP
qcovus is a measure of Query Coverage that counts a position in a subject sequence for this measure only once. The second time the position is aligned to the query is not counted towards this measure.
When not provided, the default value is:
‘qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore’, which is equivalent to the keyword ‘std’
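
Tabular output (outfmt 6) is tab-delimited, so each hit line can be split directly into the requested fields. The following Python sketch shows one possible way to read the default ‘std’ columns and to compose a custom outfmt string. The helper names and the sample hit line are made up for illustration and do not come from a real search.

```python
from typing import Dict, List, Sequence, Union

# Default column order for outfmt 6/7/10 (the 'std' keyword).
STD_FIELDS: List[str] = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()

INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}

Value = Union[str, int, float]


def parse_tabular_line(line: str, fields: Sequence[str] = STD_FIELDS) -> Dict[str, Value]:
    """Split one tab-delimited hit line into a dict keyed by format specifier."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    record: Dict[str, Value] = {}
    for name, value in zip(fields, values):
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in FLOAT_FIELDS:
            record[name] = float(value)
        else:
            record[name] = value  # identifiers and other text stay as strings
    return record


def outfmt_argument(fields: Sequence[str], view: int = 6) -> str:
    """Compose the value for -outfmt: view number plus space delimited specifiers."""
    return " ".join([str(view), *fields])


if __name__ == "__main__":
    sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360"  # invented hit
    print(parse_tabular_line(sample))
    print(outfmt_argument(["qseqid", "sseqid", "evalue", "qcovs"]))
```

When a custom specifier list is used on the command line, the same list should be passed as `fields` so that column names and values stay aligned.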

blastn: compares a nucleotide sequence against a nucleotide database, directly assessing homology between nucleotide sequences.

Table C2: blastn application options. The blastn application searches a nucleotide query against nucleotide subject sequences or a nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. Four different tasks are supported: 1.) “megablast”, for very similar sequences (e.g., sequencing errors), 2.) “dc-megablast”, typically used for inter-species comparisons, 3.) “blastn”, the traditional program used for inter-species comparisons, 4.) “blastn-short”, optimized for sequences shorter than 30 nucleotides.

| option | task(s) | type | default value | description and notes |
|---|---|---|---|---|
| word_size | megablast | integer | 28 | Length of initial exact match. |
| word_size | dc-megablast | integer | 11 | Number of matching nucleotides in initial match. dc-megablast allows non-consecutive letters to match. |
| word_size | blastn | integer | 11 | Length of initial exact match. |
| word_size | blastn-short | integer | 7 | Length of initial exact match. |
| gapopen | megablast | integer | 0 | Cost to open a gap. See appendix “BLASTN reward/penalty values”. |
| gapextend | megablast | integer | none | Cost to extend a gap. This default is a function of reward/penalty value. See appendix “BLASTN reward/penalty values”. |
| gapopen | blastn, blastn-short, dc-megablast | integer | 5 | Cost to open a gap. See appendix “BLASTN reward/penalty values”. |
| gapextend | blastn, blastn-short, dc-megablast | integer | 2 | Cost to extend a gap. See appendix “BLASTN reward/penalty values”. |
| reward | megablast | integer | 1 | Reward for a nucleotide match. |
| penalty | megablast | integer | -2 | Penalty for a nucleotide mismatch. |
| reward | blastn, dc-megablast | integer | 2 | Reward for a nucleotide match. |
| penalty | blastn, dc-megablast | integer | -3 | Penalty for a nucleotide mismatch. |
| reward | blastn-short | integer | 1 | Reward for a nucleotide match. |
| penalty | blastn-short | integer | -3 | Penalty for a nucleotide mismatch. |
| strand | all | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
| dust | blastn-short | string | 20 64 1 | Filter query sequence with dust. |
| filtering_db | all | string | none | Mask query using the sequences in this database. |
| window_masker_taxid | all | integer | none | Enable WindowMasker filtering using a Taxonomic ID. |
| window_masker_db | all | string | none | Enable WindowMasker filtering using this file. |
| soft_masking | all | boolean | true | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| perc_identity | all | integer | 0 | Percent identity cutoff. |
| template_type | dc-megablast | string | coding | Discontiguous MegaBLAST template type. Allowed values are coding, optimal and coding_and_optimal. |
| template_length | dc-megablast | integer | 18 | Discontiguous MegaBLAST template length. |
| use_index | megablast | boolean | false | Use MegaBLAST database index. Indices may be created with the makembindex application. |
| index_name | megablast | string | none | MegaBLAST database index name. |
| xdrop_ungap | all | real | 20 | Heuristic value (in bits) for ungapped extensions. |
| xdrop_gap | all | real | 30 | Heuristic value (in bits) for preliminary gapped extensions. |
| xdrop_gap_final | all | real | 100 | Heuristic value (in bits) for final gapped alignment. |
| no_greedy | megablast | flag | N/A | Use non-greedy dynamic programming extension. |
| min_raw_gapped_score | all | integer | none | Minimum raw gapped score to keep an alignment in the preliminary gapped and trace-back stages. Normally set based upon expect value. |
| ungapped | all | flag | N/A | Perform ungapped alignment. |
| window_size | dc-megablast | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
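
Because several defaults depend on the task, it can help to keep them in one lookup and to build the command line from it. The Python sketch below is an illustration under stated assumptions: the helper `blastn_command` is hypothetical, the per-task values are copied from Table C2, and the task is selected with blastn's `-task` option. Options of type flag are emitted without an argument.

```python
from typing import Dict, List, Union

# Per-task defaults copied from Table C2 (word size, match reward, mismatch penalty).
BLASTN_TASK_DEFAULTS: Dict[str, Dict[str, int]] = {
    "megablast": {"word_size": 28, "reward": 1, "penalty": -2},
    "dc-megablast": {"word_size": 11, "reward": 2, "penalty": -3},
    "blastn": {"word_size": 11, "reward": 2, "penalty": -3},
    "blastn-short": {"word_size": 7, "reward": 1, "penalty": -3},
}

OptionValue = Union[bool, int, float, str]


def blastn_command(query: str, db: str, task: str = "megablast",
                   **options: OptionValue) -> List[str]:
    """Build a blastn argument list (hypothetical helper, nothing is executed)."""
    if task not in BLASTN_TASK_DEFAULTS:
        raise ValueError(f"unknown blastn task: {task}")
    command = ["blastn", "-task", task, "-query", query, "-db", db]
    for name, value in options.items():
        if value is True:
            command.append(f"-{name}")           # flag: present means true
        elif value is False:
            continue                             # flag not requested
        else:
            command += [f"-{name}", str(value)]  # option with an argument
    return command


if __name__ == "__main__":
    print(" ".join(blastn_command("query.fsa", "nt", task="blastn-short",
                                  evalue=1000, ungapped=True)))
```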

blastp: compares a protein sequence against a protein database, directly assessing homology between protein sequences.

Table C3: blastp application options. The blastp application searches a protein sequence against protein subject sequences or a protein database. An option of type “flag” takes no arguments, but if present the argument is true. Three different tasks are supported: 1.) “blastp”, for standard protein-protein comparisons, 2.) “blastp-short”, optimized for query sequences shorter than 30 residues, and 3.) “blastp-fast”, a faster version that uses a larger word-size per https://www.ncbi.nlm.nih.gov/pubmed/17921491. This table reflects the 2.2.27 BLAST+ release.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | blastp | integer | 3 | Word size of initial match. Valid word sizes are 2-7. |
| word_size | blastp-short | integer | 2 | Word size of initial match. |
| word_size | blastp-fast | integer | 6 | Word size of initial match. |
| gapopen | blastp | integer | 11 | Cost to open a gap. |
| gapextend | blastp | integer | 1 | Cost to extend a gap. |
| gapopen | blastp-short | integer | 9 | Cost to open a gap. |
| gapextend | blastp-short | integer | 1 | Cost to extend a gap. |
| matrix | blastp | string | BLOSUM62 | Scoring matrix name. |
| matrix | blastp-short | string | PAM30 | Scoring matrix name. |
| threshold | blastp | integer | 11 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastp-short | integer | 16 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastp-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| comp_based_stats | blastp and blastp-fast | string | 2 | Use composition-based statistics: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: Composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. |
| comp_based_stats | blastp-short | string | 0 | Use composition-based statistics: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: Composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. |
| seg | all | string | no | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
| soft_masking | blastp | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | blastp and blastp-fast | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| window_size | blastp-short | integer | 5 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| use_sw_tback | all | flag | N/A | Compute locally optimal Smith-Waterman alignments? |

blastx: compares a nucleotide sequence against a protein database; the nucleotide sequence is first translated into protein sequences, which are then compared against the protein database.

Table C4: blastx application options. The blastx application translates a nucleotide query and searches it against protein subject sequences or a protein database. Two different tasks are supported: 1.) “blastx” for standard translated nucleotide-protein comparison and 2.) “blastx-fast”, a faster version that uses a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | blastx | integer | 3 | Word size for initial match. Valid word sizes are 2-7. |
| word_size | blastx-fast | integer | 6 | Word size for initial match. |
| gapopen | all | integer | 11 | Cost to open a gap. |
| gapextend | all | integer | 1 | Cost to extend a gap. |
| matrix | all | string | BLOSUM62 | Scoring matrix name. |
| threshold | blastx | integer | 12 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastx-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| seg | all | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
| soft_masking | all | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | all | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| strand | all | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
| query_genetic_code | all | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | all | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
| comp_based_stats | all | integer | 2 | Use composition-based statistics for blastx: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: Composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. Default = 2. |
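
To make the translation step concrete, the sketch below translates a nucleotide query in all six reading frames using the standard genetic code (genetic code 1, the query_genetic_code default). It is a conceptual illustration only, not the BLAST implementation: the function names are invented, the frame labels follow a simple offset convention chosen here, and ambiguous codons are rendered as ‘X’ by assumption.

```python
from typing import Dict

# Standard genetic code (code 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(dna: str) -> Dict[int, str]:
    """Return translations for frames +1..+3 and -1..-3 (minus = reverse complement)."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames: Dict[int, str] = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frames("ATGGCCATTGTAATGGGCCGC").items()):
        print(frame, protein)
```

With strand set to plus or minus, only the three frames of the corresponding strand would be relevant.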

tblastn: compares a protein sequence against a nucleotide database; the nucleotide database is first translated into protein sequences, and the protein query is then compared against the translated database.

Table C5: tblastn application options. The tblastn application searches a protein query against nucleotide subject sequences or a nucleotide database translated at search time. Two different tasks are supported: 1.) “tblastn” for a standard protein-translated nucleotide comparison and 2.) “tblastn-fast” for a faster version with a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | tblastn | integer | 3 | Word size for initial match. Valid word sizes are 2-7. |
| word_size | tblastn-fast | integer | 6 | Word size for initial match. |
| gapopen | all | integer | 11 | Cost to open a gap. |
| gapextend | all | integer | 1 | Cost to extend a gap. |
| matrix | all | string | BLOSUM62 | Scoring matrix name. |
| threshold | tblastn | integer | 13 | Minimum score to add a word to the BLAST lookup table. |
| threshold | tblastn-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| seg | all | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
| soft_masking | all | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | all | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| db_gen_code | all | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | all | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
| comp_based_stats | all | string | 2 | Use composition-based statistics for tblastn: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: Composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. Default = 2. |

tblastx: compares a nucleotide sequence against a nucleotide database at the protein level.

Table C6: tblastx application options. The tblastx application searches a translated nucleotide query against translated nucleotide subject sequences or a translated nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. This table reflects the 2.2.27 BLAST+ release. Only ungapped searches are supported for tblastx.

| option | type | default value | description and notes |
|---|---|---|---|
| word_size | integer | 3 | Word size for initial match. |
| matrix | string | BLOSUM62 | Scoring matrix name. |
| threshold | integer | 13 | Minimum word score to add the word to the BLAST lookup table. |
| seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
| soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| strand | string | both | Query strand(s) to search against database subject sequences. Choice of both, minus, or plus. |
| query_genetic_code | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| db_gen_code | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |

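CDD (Conserved Domain Database)
Overview: CDD is a database of conserved protein domains that collects a large amount of conserved-domain sequence information and protein sequence information. The conserved domains of a protein reflect its function to some extent. Searching through the CD-Search service returns the conserved domains contained in a protein sequence, which can then be used to analyze and predict the function of that protein. https://zhuanlan.zhihu.com/p/460178458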

Table C7: rpsblast application options. The rpsblast application searches a protein query against the conserved domain database (CDD), which is a set of protein profiles. Many of the common options such as matrix or word threshold are set when the CDD is built and cannot be changed by the rpsblast application. A search ready CDD can be downloaded from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

| option | type | default value | description and notes |
|---|---|---|---|
| window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| xdrop_ungap | real | 15 | Heuristic value (in bits) for ungapped extensions. |
| xdrop_gap | real | 25 | Heuristic value (in bits) for preliminary gapped extensions. |
| xdrop_gap_final | real | 40 | Heuristic value (in bits) for final gapped alignment. |
| seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
| soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| mt_mode | integer | 0 | Set to 1 if a large number of queries are to be searched and you wish to use multiple threads, as specified by the num_threads argument. |
| comp_based_stats | integer | 2 | Use composition-based statistics for rpsblast: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: Composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. Default = 2. |

Table C8: Makeblastdb application options. This application builds a BLAST database. An option of type “flag” takes no arguments, but if present the argument is true. Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, which use LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended) to build an index. If makeblastdb cannot access enough virtual memory, it will produce a message containing the string “mdb_env_open”. Virtual memory is just that (virtual) and doesn’t depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory to unlimited. The alternative is to use an environment variable (BLASTDB_LMDB_MAP_SIZE) to lower the required virtual memory, but this runs the risk of LMDB not being able to finish indexing the database. For a smaller database (tens of millions of letters) it may be possible to use a value of 100 million.

| option | type | default value | description and notes |
|---|---|---|---|
| in | string | stdin | Input file/database name. |
| input_type | string | fasta | Input file type, it may be any of the following: fasta: for FASTA file(s); blastdb: for BLAST database(s); asn1_txt: for Seq-entries in text ASN.1 format; asn1_bin: for Seq-entries in binary ASN.1 format. |
| dbtype | string | prot | Molecule type of input, values can be nucl or prot. |
| title | string | none | Title for BLAST database. If not set, the input file name will be used. |
| parse_seqids | flag | N/A | Parse bar delimited sequence identifiers (e.g., gi |
| hash_index | flag | N/A | Create index of sequence hash values. |
| mask_data | string | none | Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker). |
| out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. This field is required if input consists of multiple files. |
| max_file_size | string | 1GB | Maximum file size to use for BLAST database. 4GB is the maximum supported by the database structure. |
| blastdb_version | integer | 5 | Version 5 (taxonomy aware) is the default starting with the 2.10.0 release. Value must be 4 or 5. |
| taxid | integer | none | Taxonomy ID to assign to all sequences. |
| taxid_map | string | none | File with two columns mapping sequence ID to the taxonomy ID. The first column is the sequence ID represented as one of: 1. fasta with accessions (e.g., emb\|X17276.1\|); 2. fasta with GI (e.g., gi\|4); 3. GI as a bare number (e.g., 4); 4. A local ID. The local ID must be prefixed with “lcl” (e.g., lcl\|4). The second column should be the NCBI taxonomy ID (e.g., 9606 for human). |
| metadata_output_prefix | string | none | Path prefix for “files” field in BLASTDB metadata file. |
| logfile | string | none | Program log file (default is stderr). |
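
As an illustration of how these options and the BLASTDB_LMDB_MAP_SIZE environment variable fit together, here is a minimal Python sketch. The helper names and file names are hypothetical, the sketch only builds the argument list and environment without running anything, and the value of 100 million is the one the text suggests may work for a small database.

```python
import os
from typing import Dict, List, Mapping, Optional


def makeblastdb_command(input_file: str, dbtype: str, out: Optional[str] = None,
                        title: Optional[str] = None,
                        parse_seqids: bool = False) -> List[str]:
    """Build a makeblastdb argument list (hypothetical helper)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    command = ["makeblastdb", "-in", input_file, "-dbtype", dbtype]
    if out:
        command += ["-out", out]
    if title:
        command += ["-title", title]
    if parse_seqids:
        command.append("-parse_seqids")  # flag: takes no argument
    return command


def lmdb_environment(map_size: Optional[int] = None,
                     base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and optionally lower the LMDB map size."""
    env = dict(os.environ if base is None else base)
    if map_size is not None:
        # A lower value reduces the virtual memory LMDB asks for, at the
        # risk of indexing failing on larger databases.
        env["BLASTDB_LMDB_MAP_SIZE"] = str(map_size)
    return env


if __name__ == "__main__":
    cmd = makeblastdb_command("seqs.fsa", "nucl", out="mydb", parse_seqids=True)
    env = lmdb_environment(map_size=100_000_000)
    print(" ".join(cmd))
    print("BLASTDB_LMDB_MAP_SIZE =", env["BLASTDB_LMDB_MAP_SIZE"])
```

Both results can be passed to `subprocess.run(cmd, env=env)` on a system where makeblastdb is installed.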

Table C9: Makeprofiledb application options. This application builds an RPS-BLAST database. An option of type “flag” takes no arguments, but if present the argument is true. COBALT (a multiple sequence alignment program) and DELTA-BLAST both use RPS-BLAST searches as part of their processing but use specialized versions of the database. This application can build databases for COBALT, DELTA-BLAST, and a standard RPS-BLAST search. The “dbtype” option (see entry in table) determines which flavor of the database is built.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| in | string | stdin | Input file that contains a list of scoremat files (delimited by space, tab, or newline). |
| binary | flag | N/A | The scoremat files are binary ASN.1. |
| title | string | none | Title for RPS-BLAST database. If not set, the input file name will be used. |
| threshold | real | 9.82 | Threshold for RPS-BLAST lookup table. |
| out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. |
| max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
| dbtype | string | rps | Specifies use for RPS-BLAST database. One of rps, cobalt, or delta. |
| index | flag | N/A | Creates index files. |
| gapopen | integer | none | Cost to open a gap. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| gapextend | integer | none | Cost to extend a gap by one residue. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| scale | real | 100 | PSSM scale factor. |
| matrix | string | BLOSUM62 | Matrix to use in constructing PSSM. One of BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM250, PAM30 or PAM70. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| obsr_threshold | real | 6 | Exclude domains with maximum number of independent observations below this value (for use in DELTA-BLAST searches). |
| exclude_invalid | boolean | true | Exclude domains that do not pass validation test (for use in DELTA-BLAST searches). |
| logfile | string | none | Program log file (default is stderr). |

Table C10: Blastdbcmd application options. This application reads a BLAST database and produces reports.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| db | string | nr | BLAST database name. |
| dbtype | string | guess | Molecule type stored in BLAST database, one of nucl, prot, or guess. |
| entry | string | none | Comma-delimited search string(s) of sequence identifiers, e.g., 555, AC147927, 'gnl\|dbname\|tag', or 'all' to select all sequences in the database. |
| entry_batch | string | none | Input file for batch processing. The format requires one entry per line; each line should begin with the sequence ID followed by any of the following optional specifiers (in any order): range (format: 'from-to', inclusive in 1-offsets), strand ('plus' or 'minus'), or masking algorithm ID (integer value representing the available masking algorithm). Omitting the ending range (e.g., '10-') is supported, but there should not be any spaces around the '-'. |
| pig | integer | none | PIG (protein identity group) to retrieve. |
| info | flag | N/A | Print BLAST database information. |
| range | string | none | Range of sequence to extract (format: start-stop). |
| strand | string | plus | Strand of nucleotide sequence to extract. Choice of plus or minus. |
| mask_sequence_with | string | none | Produce lower-case masked FASTA using the algorithm IDs specified. |
| out | string | stdout | Output file name. |
| outfmt | string | %f | Output format, where the available format specifiers are: %f (sequence in FASTA format); %s (sequence data, without defline); %a (accession); %g (gi); %o (ordinal id, OID); %t (sequence title); %l (sequence length); %T (taxid); %L (common taxonomic name); %S (scientific name); %P (PIG); %mX (sequence masking data, where X is an optional comma-separated list of integers to specify the algorithm ID(s) to display, or all masks if absent or invalid specification). Masking data will be displayed as a series of 'N-M' values separated by ';' or the word 'none' if none are available. For every format except '%f', each line of output will correspond to a sequence. |
| target_only | flag | N/A | Definition line should contain target GI only. |
| get_dups | flag | N/A | Retrieve duplicate accessions. |
| line_length | integer | 80 | Line length for output. |
| ctrl_a | flag | N/A | Use Ctrl-A as the non-redundant definition line separator. |
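
As a rough illustration of the `entry_batch` line format and the `outfmt` specifiers listed above, the sketch below formats batch entries and assembles a `blastdbcmd` argument list. The helper names are hypothetical, a single space between the sequence ID and its specifiers is an assumption, and nothing is executed.

```python
def format_batch_entry(seq_id, start=None, stop=None, strand=None):
    """Format one line of an entry_batch file: ID, then optional specifiers."""
    parts = [str(seq_id)]
    if start is not None:
        # Leaving out the end gives an open-ended range such as '10-';
        # no spaces are placed around the '-'.
        end = "" if stop is None else str(stop)
        parts.append(f"{start}-{end}")
    if strand is not None:
        if strand not in ("plus", "minus"):
            raise ValueError("strand must be 'plus' or 'minus'")
        parts.append(strand)
    return " ".join(parts)


def blastdbcmd_command(db, batch_path, outfmt="%f", out=None):
    """Assemble (but do not run) a blastdbcmd call for batch retrieval."""
    cmd = ["blastdbcmd", "-db", db, "-entry_batch", batch_path,
           "-outfmt", outfmt]
    if out is not None:
        cmd += ["-out", out]
    return cmd


if __name__ == "__main__":
    print(format_batch_entry("AC147927", 10, 20, "minus"))
    print(format_batch_entry("555", 10))
    # Accession, sequence length and taxid, one sequence per output line.
    print(blastdbcmd_command("nt", "ids.txt", outfmt="%a %l %T"))
```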

Table C11: Makembindex application options. The indexed databases created by makembindex are used by production MegaBLAST software and by a new srsearch utility designed to quickly search for nearly exact matches (up to one mismatch) of short queries against a genomic database. When a FASTA formatted file is used as the input, then masking by lower case letters is incorporated in the index. Makembindex can currently build two types of indices, called “old style” and “new style” indexing. The NCBI offers full support for the new style and has deprecated the old style. A MegaBLAST search with a new style index requires that both the index and the corresponding BLAST database be present. The index structure is described in PMID:18567917. Please cite this paper in any publication that uses makembindex.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| input | string | stdin | Input file name or BLAST database name, depending on the value of the iformat parameter. For FASTA formatted input, this parameter is optional and defaults to the program's standard input stream. |
| output | string | none | The resulting index name. The index itself can consist of multiple files, called volumes, named `<index_name>.00.idx`, `<index_name>.01.idx`,… This option should not be used with new style indices. |
| iformat | string | fasta | The input format selector. Possible values are 'fasta' and 'blastdb'. |
| old_style_index | boolean | false | The old_style_index is no longer supported. If set to 'false' the new style index is created. New style indices require a BLAST database as input (use -iformat blastdb), which can be downloaded from the NCBI FTP site or created with makeblastdb. The option -output is ignored for a new style index. New style indices are always created at the same location as the corresponding BLAST database. |
| db_mask | integer | none | Exclude masked regions of BLAST db from the index. Use makeblastdb to discover the algorithm ID to be used as input for this argument. |
| legacy | boolean | true | This is a compatibility feature to support current production MegaBLAST. If true, then -stride, -nmer, and -ws_hint are ignored. The legacy format must be used for BLAST. |
| nmer | integer | 12 | N-mer size to use. Ignored if -legacy is specified. |
| ws_hint | integer | 28 | This is an optimization hint for makembindex that indicates an expected minimum match size in searches that use the index. If n is the value of the -nmer parameter and s is the value of the -stride parameter, then the value of -ws_hint must be at least n + s - 1. |
| stride | integer | 5 | makembindex will index every stride-th N-mer of the database. |
| volsize | integer | 1536 | Target index volume size in megabytes. |
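
The constraint on `-ws_hint` in the table (at least n + s - 1, where n is `-nmer` and s is `-stride`) can be checked before building an index. The sketch below is only an illustration with hypothetical function names; the defaults are the values listed in Table C11.

```python
def min_ws_hint(nmer, stride):
    """Smallest -ws_hint allowed for the given -nmer and -stride."""
    return nmer + stride - 1


def check_index_params(nmer=12, stride=5, ws_hint=28):
    """Raise ValueError if ws_hint violates ws_hint >= nmer + stride - 1."""
    lower = min_ws_hint(nmer, stride)
    if ws_hint < lower:
        raise ValueError(
            f"ws_hint={ws_hint} is below the minimum {lower} "
            f"for nmer={nmer}, stride={stride}"
        )
    return True


if __name__ == "__main__":
    print(min_ws_hint(12, 5))      # minimum for the default nmer and stride
    print(check_index_params())    # the defaults satisfy the constraint
```

Per the table, these three parameters are ignored when `-legacy` is true, so the check only matters for non-legacy indices.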

BLASTN reward/penalty values

BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned, with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [2].

For each reward/penalty pair, a number of different gap costs are supported. A gap cost includes a value to open the gap and a value to extend the gap by a base. Following the convention of the command-line applications, these costs are listed as positive numbers here. MegaBLAST uses a specialized algorithm, described in PMID:10890397, to calculate the default gap costs for a reward/penalty pair. Briefly, the default megaBLAST cost to open a gap is zero, and the cost to extend a gap by two letters is given by the absolute value of two mismatches minus one match. For example, given a reward of 1 and a penalty of -5, the cost to extend a gap by one letter is 5.5. The default gap costs for the other tasks supported by the blastn application are 5 to open a gap and 2 to extend it by one base.
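
The arithmetic behind the default megaBLAST gap costs can be written out as a short sketch. The function names are hypothetical and the code only restates the rule given in the text (open cost zero; two-letter extension cost equal to the absolute value of two mismatches minus one match). It is not taken from the BLAST source.

```python
from fractions import Fraction


def reward_penalty_ratio(reward, penalty):
    """Absolute reward/penalty ratio, e.g. 1/-3 gives 1/3."""
    return abs(Fraction(reward, penalty))


def megablast_default_gap_costs(reward, penalty):
    """Return (open, extend-by-one-letter) costs as positive numbers.

    Extending by two letters costs |2 * penalty - reward|, so one letter
    costs half of that.
    """
    extend_two = abs(2 * penalty - reward)
    return 0, Fraction(extend_two, 2)


if __name__ == "__main__":
    gap_open, gap_extend = megablast_default_gap_costs(1, -5)
    print(gap_open, float(gap_extend))   # reward 1, penalty -5
    print(float(reward_penalty_ratio(1, -2)))
```

Using `Fraction` keeps half-integer extension costs exact until they are printed.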

Table D1 presents the supported reward/penalty values and gap costs.

Table D1: Supported reward/penalty values and gap costs for the blastn application. The left-most column presents the supported reward/penalty values. The middle column presents pairs of numbers for the cost to open and extend a gap for each reward/penalty value. Blastn also supports gap costs more stringent than those listed (e.g., for reward/penalty of 1/-3 gap costs of 5/2 or 500/2 are supported). The reward/penalty values are ordered from most to least stringent, with the more stringent values better suited for alignments with high sequence identity. The default megaBLAST gap costs are shown in the right-most column. Accurate statistics for these default megaBLAST gap costs can only be calculated for the most stringent reward/penalty values, but the values listed in the middle column can always be used.

[Table D1: image not reproduced]

BLAST Substitution Matrices

BLAST uses a substitution matrix for any program that aligns residues. The program may align residues because both the query and database consist of proteins (e.g. BLASTP), or the program may align DNA translated to protein with protein (e.g. BLASTX). A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2].

In general, different substitution matrices are tailored to detecting similarities among sequences that have diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Supplementary source:

https://blog.csdn.net/weixin_43202635/article/details/82962032?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-82962032-blog-88382137.pc_relevant_vip_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-82962032-blog-88382137.pc_relevant_vip_default&utm_relevant_index=2



[Table of recommended substitution matrices and gap costs by query length: image not reproduced]

Gap Costs

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. **Gapped BLAST and PSI-BLAST use "affine gap costs", which charge the score -a for the existence of a gap and the score -b for each residue in the gap.** Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
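
The affine gap formula can be made concrete with a small sketch. The function names and the example numbers are illustrative assumptions; only the formula -(a+bk) comes from the text.

```python
def gap_score(a, b, k):
    """Score of a single gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


def raw_score(pair_scores, gap_lengths, a, b):
    """Raw alignment score: aligned-pair scores plus the (negative) gap scores."""
    return sum(pair_scores) + sum(gap_score(a, b, k) for k in gap_lengths)


if __name__ == "__main__":
    # With a = 11 and b = 1, a gap of length 1 scores -(11 + 1).
    print(gap_score(11, 1, 1))
    # Three aligned pairs and one gap of two residues.
    print(raw_score([4, 5, 6], [2], a=11, b=1))
```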


Lambda Ratio

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, **but with infinite gap costs** [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
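
The bit-score conversion and the lambda ratio can be sketched as below. The function names are hypothetical, and the numeric parameters (lambda of about 0.267 and K of about 0.041 for BLOSUM62 with gap costs 11/1, and an ungapped lambda of about 0.318) are values commonly printed in BLAST reports; they are used here only as illustrative assumptions and should be checked against the report of an actual search.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of the gapped lambda to the lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped


if __name__ == "__main__":
    # Assumed illustrative parameters, not taken from this document.
    lam_gapped, k_gapped, lam_ungapped = 0.267, 0.041, 0.318
    print(round(bit_score(100, lam_gapped, k_gapped), 1))
    print(round(lambda_ratio(lam_gapped, lam_ungapped), 2))
```

With these assumed values the ratio falls inside the 0.8 to 0.9 range that the text describes as most effective.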

