MMseqs2: A Fast and Efficient Protein Sequence Alignment Tool

2401_83947105

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/2401_83947105/article/details/137657888

make install
export PATH=$(pwd)/bin/:$PATH


Full help output, listing all modules:

MMseqs2 Version: 13.45111
© Martin Steinegger (martin.steinegger@snu.ac.kr)

usage: mmseqs <command> [<args>]

Easy workflows for plain text input/output
easy-search Sensitive homology search
easy-linsearch Fast, less sensitive homology search
easy-cluster Slower, sensitive clustering
easy-linclust Fast linear time cluster, less sensitive clustering
easy-taxonomy Taxonomic classification
easy-rbh Find reciprocal best hit

Main workflows for database input/output
search Sensitive homology search
linsearch Fast, less sensitive homology search
map Map nearly identical sequences
rbh Reciprocal best hit search
linclust Fast, less sensitive clustering
cluster Slower, sensitive clustering
clusterupdate Update previous clustering with new sequences
taxonomy Taxonomic classification

Input database creation
databases List and download databases
createdb Convert FASTA/Q file(s) to a sequence DB
createindex Store precomputed index on disk to reduce search overhead
createlinindex Create linsearch index
convertmsa Convert Stockholm/PFAM MSA file to a MSA DB
tsv2db Convert a TSV file to any DB
tar2db Convert content of tar archives to any DB
msa2profile Convert a MSA DB to a profile DB

Handle databases on storage and memory
compress Compress DB entries
decompress Decompress DB entries
rmdb Remove a DB
mvdb Move a DB
cpdb Copy a DB
lndb Symlink a DB
unpackdb Unpack a DB into separate files
touchdb Preload DB into memory (page cache)

Unite and intersect databases
createsubdb Create a subset of a DB from list of DB keys
concatdbs Concatenate two DBs, giving new IDs to entries from 2nd DB
splitdb Split DB into subsets
mergedbs Merge entries from multiple DBs
subtractdbs Remove all entries from first DB occurring in second DB by key

Format conversion for downstream processing
convertalis Convert alignment DB to BLAST-tab, SAM or custom format
createtsv Convert result DB to tab-separated flat file
convert2fasta Convert sequence DB to FASTA format
result2flat Create flat file by adding FASTA headers to DB entries
createseqfiledb Create a DB of unaligned FASTA entries
taxonomyreport Create a taxonomy report in Kraken or Krona format

Sequence manipulation/transformation
extractorfs Six-frame extraction of open reading frames
extractframes Extract frames from a nucleotide sequence DB
orftocontig Write ORF locations in alignment format
reverseseq Reverse (without complement) sequences
translatenucs Translate nucleotides to proteins
translateaa Translate proteins to lexicographically lowest codons
splitsequence Split sequences by length
masksequence Soft mask sequence DB using tantan
extractalignedregion Extract aligned sequence region from query

Result manipulation
swapresults Transpose prefilter/alignment DB
result2rbh Filter a merged result DB to retain only reciprocal best hits
result2msa Compute MSA DB from a result DB
result2dnamsa Compute MSA DB with out insertions in the query for DNA sequences
result2stats Compute statistics for each entry in a DB
filterresult Pairwise alignment result filter
offsetalignment Offset alignment by ORF start position
proteinaln2nucl Transform protein alignments to nucleotide alignments
result2repseq Get representative sequences from result DB
sortresult Sort a result DB in the same order as the prefilter or align module
summarizealis Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)
summarizeresult Extract annotations from alignment DB

Taxonomy assignment
createtaxdb Add taxonomic labels to sequence DB
createbintaxonomy Create binary taxonomy from NCBI input
addtaxonomy Add taxonomic labels to result DB
taxonomyreport Create a taxonomy report in Kraken or Krona format
filtertaxdb Filter taxonomy result database
filtertaxseqdb Filter taxonomy sequence database
aggregatetax Aggregate multiple taxon labels to a single label
aggregatetaxweights Aggregate multiple taxon labels to a single label
lcaalign Efficient gapped alignment for lca computation
lca Compute the lowest common ancestor
majoritylca Compute the lowest common ancestor using majority voting

Multi-hit search
multihitdb Create sequence DB for multi hit searches
multihitsearch Search with a grouped set of sequences against another grouped set
besthitperset For each set of sequences compute the best element and update p-value
combinepvalperset For each set compute the combined p-value
mergeresultsbyset Merge results from multiple ORFs back to their respective contig

Prefiltering
prefilter Double consecutive diagonal k-mer search
ungappedprefilter Optimal diagonal score search
kmermatcher Find bottom-m-hashed k-mer matches within sequence DB
kmersearch Find bottom-m-hashed k-mer matches between target and query DB

Alignment
align Optimal gapped local alignment
alignall Within-result all-vs-all gapped local alignment
transitivealign Transfer alignments via transitivity
rescorediagonal Compute sequence identity for diagonal
alignbykmer Heuristic gapped local k-mer based alignment

Clustering
clust Cluster result by Set-Cover/Connected-Component/Greedy-Incremental
clusthash Hash-based clustering of equal length sequences
mergeclusters Merge multiple cascaded clustering steps

Profile databases
result2profile Compute profile DB from a result DB
msa2result Convert a MSA DB to a profile DB
msa2profile Convert a MSA DB to a profile DB
profile2pssm Convert a profile DB to a tab-separated PSSM file
profile2consensus Extract consensus sequence DB from a profile DB
profile2repseq Extract representative sequence DB from a profile DB
convertprofiledb Convert a HH-suite HHM DB to a profile DB

Profile-profile databases
enrich Boost diversity of search result
result2pp Merge two profile DBs by shared hits
profile2cs Convert a profile DB into a column state sequence DB
convertca3m Convert a cA3M DB to a result DB
expandaln Expand an alignment result based on another
expand2profile Expand an alignment result based on another and create a profile

Utility modules to manipulate DBs
view Print DB entries given in --id-list to stdout
apply Execute given program on each DB entry
filterdb DB filtering by given conditions
swapdb Transpose DB with integer values in first column
prefixid For each entry in a DB prepend the entry key to the entry itself
suffixid For each entry in a DB append the entry key to the entry itself
renamedbkeys Create a new DB with original keys renamed

Special-purpose utilities
diffseqdbs Compute diff of two sequence DBs
summarizetabs Extract annotations from HHblits BLAST-tab-formatted results
gff2db Extract regions from a sequence database based on a GFF3 file
maskbygff Mask out sequence regions in a sequence DB by features selected from a GFF3 file
convertkb Convert UniProtKB data to a DB
summarizeheaders Summarize FASTA headers of result DB
nrtotaxmapping Create taxonomy mapping for NR database
extractdomains Extract highest scoring alignment regions for each sequence from BLAST-tab file
countkmer Count k-mers


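The help output can be overwhelming at first glance, but it is clearly organized overall. You will get familiar with these modules as you use them step by step.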


The main workflow modules are as follows.

The top of the help output lists the easy workflow modules, which read and write plain text files:
easy-search Sensitive homology search
easy-linsearch Fast, less sensitive homology search
easy-cluster Slower, sensitive clustering
easy-linclust Fast, linear-time, less sensitive clustering
easy-taxonomy Taxonomic classification
easy-rbh Find reciprocal best hits

They are simple to use, and each module prints its own help with --help:
mmseqs easy-search --help
mmseqs easy-linsearch --help
mmseqs easy-cluster --help
mmseqs easy-linclust --help
mmseqs easy-taxonomy --help
mmseqs easy-rbh --help


## 2. Downloading databases

First check which databases are available. Running the command without arguments prints the list:

mmseqs databases

Usage: mmseqs databases <o:sequenceDB> [options]

Name Type Taxonomy Url

UniRef100 Aminoacid yes https://www.uniprot.org/help/uniref
UniRef90 Aminoacid yes https://www.uniprot.org/help/uniref
UniRef50 Aminoacid yes https://www.uniprot.org/help/uniref
UniProtKB Aminoacid yes https://www.uniprot.org/help/uniprotkb
UniProtKB/TrEMBL Aminoacid yes https://www.uniprot.org/help/uniprotkb
UniProtKB/Swiss-Prot Aminoacid yes https://uniprot.org
NR Aminoacid yes https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
NT Nucleotide - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
GTDB Aminoacid yes https://gtdb.ecogenomic.org
PDB Aminoacid - https://www.rcsb.org
PDB70 Profile - https://github.com/soedinglab/hh-suite
Pfam-A.full Profile - https://pfam.xfam.org
Pfam-A.seed Profile - https://pfam.xfam.org
Pfam-B Profile - https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
CDD Profile - https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
eggNOG Profile - http://eggnog5.embl.de
VOGDB Profile - https://vogdb.org
dbCAN2 Profile - http://bcb.unl.edu/dbCAN2
SILVA Nucleotide yes https://www.arb-silva.de
Resfinder Nucleotide - https://cge.cbs.dtu.dk/services/ResFinder
Kalamari Nucleotide yes https://github.com/lskatz/Kalamari


Downloading a specific database

For example, to download the Swiss-Prot database:
mmseqs databases UniProtKB/Swiss-Prot outpath/swissprot tmp


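The downloaded database is written to the path you specified. Here `outpath` is the directory and `swissprot` is the name prefix shared by the database files, not a subdirectory. When you use the database, pass the path outpath/swissprot.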


![](https://img-blog.csdnimg.cn/direct/1a3315696de44365850d1c587acca0be.png) 


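Of course, you can also download a FASTA file yourself and build the database manually.


## 3. Creating a database


Use MMseqs2 to create a database that contains the protein sequences you want to work with. To create a database, run the following command.

First convert the reference FASTA file into the corresponding MMseqs2 database files: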
mmseqs createdb <sequences.fasta> <database_name>

Here `<sequences.fasta>` is your protein sequence file and `<database_name>` is the name you choose for the database.

#######################################################################################
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB


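## 4. Indexing the database


Optionally, you can precompute an index for the target database. The index does not change alignment quality; it is stored on disk to reduce the overhead of repeated searches. To build it, run:

mmseqs createindex <database_name> <tmp_dir>

Here `<database_name>` is the database you created earlier and `<tmp_dir>` is a directory for temporary files.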


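## 5. Running the search


You can now search your protein sequences with MMseqs2. The `search` module works on databases created with `createdb`, not on FASTA files directly:

mmseqs search <queryDB> <targetDB> <resultDB> <tmp_dir>

Here `<queryDB>` is the database built from your query sequences, `<targetDB>` is the database to search against, `<resultDB>` is the result database that will be written, and `<tmp_dir>` is a directory for temporary files.

To work with FASTA input and plain text output, use the `easy-search` module instead. For example, the command below searches the input file QUERY.fasta against the Swiss-Prot database and writes the results to alnRes.m8.

I recommend using absolute paths for the input file, the database, the output file, and the tmp directory.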

mmseqs easy-search examples/QUERY.fasta swissprot alnRes.m8 tmp

The result should look familiar, since it is in BLAST tabular format:
k141_759496_length_1110_cov_3.0000_1 A8BQB4 0.258 337 187 0 117 369 1084 1420 2.200E-12 73
k141_759496_length_1110_cov_3.0000_1 Q2PQH8 0.258 337 187 0 117 369 1084 1420 3.903E-12 72
k141_759496_length_1110_cov_3.0000_1 P35574 0.252 337 188 0 117 369 1106 1442 6.921E-12 72
k141_759496_length_1110_cov_3.0000_1 P35573 0.244 337 191 0 117 369 1083 1419 1.205E-10 68
k141_759496_length_1110_cov_3.0000_1 Q06625 0.345 83 51 0 117 195 1067 1149 8.270E-08 59
k141_399534_length_2355_cov_6.0000_2 Q8ZL58 0.680 372 119 0 3 374 24 395 4.669E-169 533
k141_399534_length_2355_cov_6.0000_2 Q9RKF7 0.349 352 226 0 6 357 1 348 1.247E-56 207
k141_399534_length_2355_cov_6.0000_2 H2IFX0 0.317 353 237 0 7 359 5 352 7.486E-51 190
k141_399534_length_2355_cov_6.0000_2 Q97U96 0.339 344 222 0 25 361 20 363 6.501E-50 187
k141_399534_length_2355_cov_6.0000_2 P11444 0.303 357 242 0 6 362 5 352 1.706E-43 168
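
Because the default output has twelve columns in a fixed order (query, target, sequence identity, alignment length, mismatches, gap openings, query start and end, target start and end, E-value, bit score), it is easy to load for downstream filtering. The following is a minimal Python sketch, not part of MMseqs2: the `Hit` record and the `parse_m8` and `best_hit_per_query` helpers are names I made up for illustration. It splits on any whitespace, so it accepts both the tab-separated file and the space-separated listing shown above.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One row of the default 12-column BLAST-tab output."""
    query: str
    target: str
    fident: float   # sequence identity as a fraction, e.g. 0.258
    alnlen: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    tstart: int
    tend: int
    evalue: float
    bits: float


def parse_m8(lines: Iterable[str]) -> List[Hit]:
    """Parse BLAST-tab lines; rows without exactly 12 fields are skipped."""
    hits = []
    for line in lines:
        f = line.split()
        if len(f) != 12:
            continue
        hits.append(Hit(f[0], f[1], float(f[2]),
                        int(f[3]), int(f[4]), int(f[5]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10]), float(f[11])))
    return hits


def best_hit_per_query(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the hit with the lowest E-value for each query."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query not in best or hit.evalue < best[hit.query].evalue:
            best[hit.query] = hit
    return best


# Four rows copied from the listing above.
sample = [
    "k141_759496_length_1110_cov_3.0000_1 A8BQB4 0.258 337 187 0 117 369 1084 1420 2.200E-12 73",
    "k141_759496_length_1110_cov_3.0000_1 Q06625 0.345 83 51 0 117 195 1067 1149 8.270E-08 59",
    "k141_399534_length_2355_cov_6.0000_2 Q8ZL58 0.680 372 119 0 3 374 24 395 4.669E-169 533",
    "k141_399534_length_2355_cov_6.0000_2 Q9RKF7 0.349 352 226 0 6 357 1 348 1.247E-56 207",
]

hits = parse_m8(sample)
best = best_hit_per_query(hits)
for query, hit in best.items():
    print(query, hit.target, hit.evalue)
```

For a real run, replace `sample` with an open file handle, for example `parse_m8(open("alnRes.m8"))`.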

Other modules follow the same argument pattern:

mmseqs search queryDB targetDB resultDB tmp


Using the easy workflow modules

mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnResult.m8 tmp

mmseqs easy-cluster examples/DB.fasta clusterRes tmp

mmseqs easy-linclust examples/DB.fasta clusterRes tmp
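
As far as I know, `easy-cluster` and `easy-linclust` write a two-column TSV next to the FASTA outputs, named after the output prefix (here that would be `clusterRes_cluster.tsv`), with the representative sequence ID in the first column and a cluster member ID in the second. Treat the file name and layout as an assumption and check the files in your own output directory. Under that assumption, a minimal Python sketch for grouping members by representative could look like this; `group_clusters` and the sequence IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_clusters(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group member IDs by representative ID from 'rep<TAB>member' rows."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)


# Hypothetical rows; a real run would read the cluster TSV instead.
sample_rows = [
    "seqA\tseqA\n",
    "seqA\tseqB\n",
    "seqC\tseqC\n",
]

clusters = group_clusters(sample_rows)
for rep, members in clusters.items():
    print(rep, len(members), ",".join(members))
```

Note that the representative usually appears as a member of its own cluster, so the member count includes it.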


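Working with MMseqs2 results


MMseqs2 contains close to a hundred modules, including format converters. For example, `convertalis` converts an alignment result DB into BLAST-style formats:

mmseqs convertalis queryDB targetDB alnRes alnRes.tab

By default the conversion uses --format-mode 0.
You can select a different format, for example --format-mode 4: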

mmseqs convertalis queryDB targetDB alnRes alnRes.tab --format-mode 4
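
The database-based workflow in this article chains three modules: `createdb`, `search`, and `convertalis`. The sketch below only assembles those commands in order, using the same arguments as the examples in the text; `build_pipeline` and `run_pipeline` are hypothetical helper names, and actually running the commands requires `mmseqs` on your PATH.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(query_fasta: str, target_fasta: str,
                   out_tab: str, tmp_dir: str = "tmp") -> List[List[str]]:
    """Return the mmseqs calls for a FASTA-to-BLAST-tab search, in order."""
    return [
        ["mmseqs", "createdb", query_fasta, "queryDB"],
        ["mmseqs", "createdb", target_fasta, "targetDB"],
        ["mmseqs", "search", "queryDB", "targetDB", "alnRes", tmp_dir],
        ["mmseqs", "convertalis", "queryDB", "targetDB", "alnRes", out_tab],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline("examples/QUERY.fasta", "examples/DB.fasta", "alnRes.tab")
for cmd in commands:
    print(shlex.join(cmd))  # dry run: show the commands without executing them
```

Calling `run_pipeline(commands)` executes the steps. Printing them first is a cheap way to check the paths before a long search starts.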


By default (`--format-mode 0`), `alnRes.tab` will contain the alignment results in BLAST tabular format (comparable to BLAST's `-m 8` or `-outfmt 6`) with 12 columns: (1,2) identifiers for query and target sequences/profiles, (3) sequence identity, (4) alignment length, (5) number of mismatches, (6) number of gap openings, (7-8, 9-10) domain start and end-position in query and in target, (11) E-value, and (12) bit score.


The option `--format-output` defines a custom output format. For example, the format string `--format-output "query,target,evalue,qaln,taln"` prints the query and target identifiers, e-value of the alignment and the alignments.


Column headers can be added to the output with `--format-mode 4`. This mode also supports choosing a custom output format.
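
With a header line in place, the table can be read by column name instead of by position. Here is a minimal sketch using Python's standard `csv` module. The sample text is made up, and it assumes tab-separated output whose header uses the same names as the `--format-output` string (here `query,target,evalue`); check the first line of your own file to confirm.

```python
import csv
import io
from typing import Dict, List, TextIO

# Hypothetical output of: --format-mode 4 --format-output "query,target,evalue"
sample_output = (
    "query\ttarget\tevalue\n"
    "q1\tt1\t1.0E-20\n"
    "q1\tt2\t3.5E-05\n"
)


def read_hits(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated table with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


rows = read_hits(io.StringIO(sample_output))

# Keep only targets below an (arbitrary) E-value cutoff.
significant = [r["target"] for r in rows if float(r["evalue"]) < 1e-10]
print(significant)
```

For a real file, pass `open("alnRes.tab", newline="")` to `read_hits` instead of the `StringIO` object.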


The following fields are supported:


* **query** Query sequence identifier
* **target** Target sequence identifier
* **evalue** E-value
* **gapopen** Number of gap open events (note: this is NOT the number of gap characters)
* **pident** Percentage of identical matches
* **fident** Fraction of identical matches
* **nident** Number of identical matches
* **qstart** 1-indexed alignment start position in query sequence
* **qend** 1-indexed alignment end position in query sequence
* **qlen** Query sequence length
* **tstart** 1-indexed alignment start position in target sequence
* **tend** 1-indexed alignment end position in target sequence
* **tlen** Target sequence length
* **alnlen** Alignment length (number of aligned columns)
* **raw** Raw alignment score
* **bits** Bit score
* **cigar** Alignment as string. Each position contains either M (match), D (deletion, gap in query), or I (Insertion, gap in target)
* **qseq** Query sequence
* **tseq** Target sequence
* **qaln** Aligned query sequence with gaps
* **taln** Aligned target sequence with gaps
* **qheader** Header of Query sequence
* **theader** Header of Target sequence
* **qframe** Query frame (-3 to +3)
* **tframe** Target frame (-3 to +3)
* **mismatch** Number of mismatches
* **qcov** Fraction of query sequence covered by alignment
* **tcov** Fraction of target sequence covered by alignment
* **empty** Dash column '-'
* **taxid** Taxonomical identifier (needs mmseqs tax db)
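
The note on **gapopen** distinguishes gap open events from gap characters. The sketch below makes the difference concrete by deriving both numbers from a **cigar** string. It assumes a run-length encoded string such as `5M2D3M` using only the M, D, and I operations listed above; the encoding, the `cigar_stats` helper, and the example string are assumptions for illustration, so compare against your own output before relying on it.

```python
import re
from typing import Dict

# One run = a positive count followed by one operation letter.
_CIGAR_RUN = re.compile(r"(\d+)([MDI])")


def cigar_stats(cigar: str) -> Dict[str, int]:
    """Count aligned columns, gap open events and gap characters."""
    runs = _CIGAR_RUN.findall(cigar)
    # Reject strings that contain anything besides well-formed runs.
    if "".join(count + op for count, op in runs) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return {
        # every run contributes its length to the alignment length
        "alnlen": sum(int(count) for count, _ in runs),
        # each D or I run is one gap open event ...
        "gapopen": sum(1 for _, op in runs if op in "DI"),
        # ... but may span several gap characters
        "gapchars": sum(int(count) for count, op in runs if op in "DI"),
    }


stats = cigar_stats("5M2D3M1I4M")
print(stats)
```

In this example there are two gap open events but three gap characters, because the deletion run spans two columns.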

