install
conda install -c conda-forge -c bioconda mmseqs2
# list available modules
mmseqs -h
# list options of module
mmseqs createdb
# or
mmseqs createdb -h
Easy Workflows in MMseqs2
The easy workflows run a simple search or clustering directly on FASTA/FASTQ files. There are three modules:
easy-search / easy-cluster / easy-linclust.
- easy-search
# searching with a FASTA/FASTQ file against another FASTA/FASTQ file or a pre-built MMseqs2 target database
mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnResult.m8 tmp
# examples/QUERY.fasta is the query file
# examples/DB.fasta is the target database
# alnResult.m8 is the file for alignment results
# tmp is a temporary directory for intermediate files
- easy-cluster
# clusters entries from a FASTA/FASTQ file using the cascaded clustering algorithm
mmseqs easy-cluster examples/DB.fasta clusterRes tmp
# examples/DB.fasta is the input file
# clusterRes is the output
# tmp is for temporary files
- easy-linclust
# for larger datasets, easy-linclust offers an efficient clustering workflow whose runtime scales linearly with input size
mmseqs easy-linclust examples/DB.fasta clusterRes tmp
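The alnResult.m8 file from the easy-search example is tab-separated. By default MMseqs2 writes twelve BLAST-tab style columns (query, target, sequence identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, bit score). A minimal sketch, assuming that default layout, which keeps only the best-scoring hit per query. The sample rows and the best_hits helper are invented for illustration and are not part of MMseqs2.

```shell
# Hypothetical sample in the default 12-column layout (all values invented)
printf 'q1\tt1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n'  > sample.m8
printf 'q1\tt2\t0.80\t90\t18\t0\t1\t90\t5\t94\t1e-20\t95\n'    >> sample.m8
printf 'q2\tt3\t0.70\t80\t24\t0\t3\t82\t1\t80\t1e-10\t60\n'    >> sample.m8

# best_hits FILE: for each query (column 1), print the line with the
# highest bit score (column 12); output is sorted for reproducibility
best_hits() {
  awk -F'\t' '
    !($1 in score) || $12 + 0 > score[$1] { score[$1] = $12 + 0; line[$1] = $0 }
    END { for (q in line) print line[q] }
  ' "$1" | LC_ALL=C sort
}

best_hits sample.m8
```

On a real run, replace sample.m8 with alnResult.m8. If the output columns were changed with --format-output, the column numbers in the awk script need to be adjusted.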
search database
# Before searching, convert the FASTA files containing the query sequences and the target sequences into sequence DBs.
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB
# run the search
mmseqs search queryDB targetDB resultDB tmp
cluster
# Before clustering, convert your FASTA database into the MMseqs2 database (DB) format
mmseqs createdb examples/DB.fasta DB
# clustering of your database DB
mmseqs cluster DB DB_clu tmp
# this produces the result database files DB_clu and DB_clu.index
# generate a TSV-formatted file from the clustering result
mmseqs createtsv DB DB DB_clu DB_clu.tsv
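DB_clu.tsv has two tab-separated columns: the cluster representative and one member per line, with the representative also listed as a member of its own cluster. A minimal sketch for counting the members of each cluster. The identifiers below are invented stand-ins for a real DB_clu.tsv, and cluster_sizes is a hypothetical helper name.

```shell
# Hypothetical stand-in for DB_clu.tsv: representative <TAB> member
printf 'seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\nseqE\tseqE\nseqE\tseqF\n' > sample_clu.tsv

# cluster_sizes FILE: print "representative <TAB> member count",
# largest cluster first, ties broken by representative name
cluster_sizes() {
  awk -F'\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}

cluster_sizes sample_clu.tsv
```

A count of 1 marks a singleton, that is, a sequence that clustered with nothing else at the chosen thresholds.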
- Updating a clustered database
# create the old sequence DB and cluster it
mmseqs createdb DB_trimmed.fasta DB_trimmed
mmseqs cluster DB_trimmed DB_trimmed_clu tmp
# create the new sequence DB
mmseqs createdb DB.fasta DB_new
# update the old clustering with the new sequences
mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_new_updated DB_update_clu tmp
# DB_update_clu now contains the freshly updated clustering of DB_new
# clusterupdate also creates a new sequence database, DB_new_updated,
# whose identifiers are consistent with the previous version
Points that need extra attention:
- cluster mode
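- The Greedy Set cover (--cluster-mode 0) algorithm is an approximation for the NP-complete optimization problem called set cover.
  It picks representative sequences and compares the other sequences against them; a sequence that meets the similarity threshold against a representative is placed in that representative's cluster. The representatives do not meet the similarity threshold with one another.
- Connected component (--cluster-mode 1) uses transitive connection to cover more remote homologs.
  A sequence is placed in a cluster as soon as it is similar to any current member of that cluster, not only to the representative.
- Greedy incremental (--cluster-mode 2) works analogously to the CD-HIT clustering algorithm.
  Sequences are first sorted by length, and mutually dissimilar sequences are chosen as representatives. Each remaining sequence is compared with the first representative and joins its cluster if the threshold is met; otherwise it is compared with the second representative, and so on, until it has joined a cluster.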
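To make the transitive behaviour of --cluster-mode 1 concrete, here is a minimal sketch of connected-component clustering on a list of sequence pairs that already pass the similarity threshold. It is a plain union-find in awk for illustration only, not the MMseqs2 implementation. The edge list and the connected_components helper are invented.

```shell
# Hypothetical pairs that meet the similarity threshold (one pair per line).
# A-B, B-C and C-D form a chain: A and D are never directly similar.
printf 'A\tB\nB\tC\nC\tD\nE\tF\n' > edges.tsv

# connected_components FILE: print "cluster label <TAB> member" for every
# sequence, where the label is the root of the member's component
connected_components() {
  awk -F'\t' '
    # follow parent links until reaching a root (a node that is its own parent)
    function find(x) {
      if (!(x in parent)) parent[x] = x
      while (parent[x] != x) x = parent[x]
      return x
    }
    {
      a = find($1); b = find($2)
      # merge the two components; keep the alphabetically smaller root as label
      if (a != b) { if ((a "") < (b "")) parent[b] = a; else parent[a] = b }
    }
    END { for (x in parent) print find(x) "\t" x }
  ' "$1" | LC_ALL=C sort
}

connected_components edges.tsv
```

In this sketch A, B, C and D end up in one cluster even though A and D share no direct edge. Under modes 0 and 2, by contrast, a member has to be directly linked to its representative, so long chains like this one tend to be split.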