MMseqs2

Latest recommended article published on 2025-01-01 15:45:37

All_Will_Be_Fine噻

Categories: VHH, NGS. Tags: NSG

Original source: https://mmseqs.com/latest/userguide.pdf

install

conda install -c conda-forge -c bioconda mmseqs2

# list available modules
mmseqs -h
# list options of module
mmseqs createdb
# or
mmseqs createdb -h

Easy Workflows in MMseqs2

The easy workflows run a simple search or clustering directly on FASTA/FASTQ files. There are three main modules:
easy-search, easy-cluster, and easy-linclust.

easy-search

# searching with a FASTA/FASTQ file against another FASTA/FASTQ file or a pre-built MMseqs2 target database
mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnResult.m8 tmp
# examples/QUERY.fasta is the query file
# examples/DB.fasta is the target database
# alnResult.m8 is the file for alignment results
# tmp is a temporary directory for intermediate files
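
By default, alnResult.m8 is a tab-separated file in BLAST-tab style: query, target, sequence identity, alignment length, mismatches, gap openings, query start and end, target start and end, E-value, and bit score. The following is a minimal Python sketch, assuming that default column order (it no longer applies if --format-output is used), that keeps the highest-scoring hit per query. The function name best_hits and the sample rows are made up for illustration.

```python
import csv
import io

# Default easy-search column order (assumption; differs if --format-output is set).
M8_COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def best_hits(handle):
    """Return {query: row}, keeping the hit with the highest bit score per query."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < len(M8_COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(M8_COLUMNS, fields))
        row["fident"] = float(row["fident"])
        row["evalue"] = float(row["evalue"])
        row["bits"] = float(row["bits"])
        current = best.get(row["query"])
        if current is None or row["bits"] > current["bits"]:
            best[row["query"]] = row
    return best


# Made-up rows standing in for the contents of alnResult.m8.
example = io.StringIO(
    "q1\tt1\t0.950\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tt2\t0.800\t90\t18\t0\t1\t90\t5\t94\t1e-30\t120\n"
    "q2\tt3\t0.700\t80\t24\t0\t3\t82\t1\t80\t1e-20\t95\n"
)
hits = best_hits(example)
for query, row in hits.items():
    print(query, row["target"], row["fident"], row["bits"])
```

For a real run, replace the in-memory example with `open("alnResult.m8")`.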

easy-cluster

# clusters entries from a FASTA/FASTQ file using the cascaded clustering algorithm
mmseqs easy-cluster examples/DB.fasta clusterRes tmp
# examples/DB.fasta is the input file
# clusterRes is the output
# tmp is for temporary files
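
With the prefix clusterRes, easy-cluster writes clusterRes_cluster.tsv, clusterRes_rep_seq.fasta, and clusterRes_all_seqs.fasta. The TSV file has two columns, the cluster representative and a cluster member, with one line per member. As a rough sketch under that assumption, the Python below groups the members by representative; read_clusters and the sequence names are hypothetical.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Group a two-column (representative, member) TSV into {representative: [members]}."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up lines standing in for clusterRes_cluster.tsv.
example = io.StringIO("seqA\tseqA\nseqA\tseqB\nseqC\tseqC\nseqA\tseqD\n")
clusters = read_clusters(example)

# Print the clusters from largest to smallest.
for representative, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(representative, len(members), ",".join(members))
```

The same reader should also work on a TSV produced by createtsv from a clustering result, since it uses the same two-column layout.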

easy-linclust

# For larger datasets, easy-linclust offers an efficient clustering workflow that scales linearly with input size
mmseqs easy-linclust examples/DB.fasta clusterRes tmp

search database

# Before searching, convert the FASTA files containing the query and target sequences into sequence DBs.
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB
# run the search
mmseqs search queryDB targetDB resultDB tmp

cluster

# Before clustering, convert your FASTA database into the MMseqs2 database (DB) format
mmseqs createdb examples/DB.fasta DB
# cluster the database DB
mmseqs cluster DB DB_clu tmp
# this produces the result database files DB_clu and DB_clu.index
# generate a TSV-formatted file from the clustering result
# (DB is passed twice: once as the query DB and once as the target DB)
mmseqs createtsv DB DB DB_clu DB_clu.tsv

Updating a clustered database

# create the old sequence DB and cluster it
mmseqs createdb DB_trimmed.fasta DB_trimmed
mmseqs cluster DB_trimmed DB_trimmed_clu tmp
# create the new sequence DB
mmseqs createdb DB.fasta DB_new
# update the old clustering with the new sequences
mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_new_updated DB_update_clu tmp
# DB_update_clu now contains the freshly updated clustering of DB_new
# clusterupdate also creates a new sequence database, DB_new_updated,
# whose identifiers are consistent with the previous version

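Points that need extra attention:

cluster mode
- Greedy set cover (--cluster-mode 0) is an approximation algorithm for the NP-complete
  optimization problem called set cover.
  It repeatedly picks the sequence with the most remaining similar sequences as a representative sequence; every sequence that meets the similarity threshold against that representative joins its cluster.
  The representative sequences themselves do not meet the similarity threshold with one another.
- Connected component (--cluster-mode 1) uses transitive connections to cover more remote homologs.
  This is the most inclusive mode: a sequence joins a cluster as soon as it is similar to any current member of that cluster, not only to the representative.
- Greedy incremental (--cluster-mode 2) works analogously to the CD-HIT clustering algorithm.
  Sequences are first sorted by length, longest first. Each sequence is then compared with the existing representatives in order: it joins the first cluster whose representative meets the threshold, and if none does, it becomes a new representative.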
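
The difference between the three modes is easiest to see on a toy example. The Python below is a simplified sketch, not the MMseqs2 implementation: the similarity pairs and lengths are hand-written assumptions (in MMseqs2 they would come from alignments filtered by identity and coverage), and the function names are hypothetical.

```python
def set_cover(neighbors):
    """Mode 0 sketch: pick the sequence with the most unassigned neighbors as
    representative, assign those neighbors to it, and repeat."""
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # sorted() makes tie-breaking deterministic (first name wins).
        rep = max(sorted(unassigned), key=lambda s: len(neighbors[s] & unassigned))
        members = (neighbors[rep] & unassigned) | {rep}
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters


def connected_component(neighbors):
    """Mode 1 sketch: everything reachable through a chain of similar pairs
    ends up in the same cluster."""
    seen = set()
    clusters = {}
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters[start] = sorted(component)
    return clusters


def greedy_incremental(neighbors, lengths):
    """Mode 2 sketch (CD-HIT-like): process longest first; join the first
    similar representative, otherwise become a new representative."""
    clusters = {}
    for seq in sorted(neighbors, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            if rep in neighbors[seq]:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Toy input: a chain A-B-C-D where only adjacent sequences are similar.
pairs = [("A", "B"), ("B", "C"), ("C", "D")]
lengths = {"A": 120, "B": 100, "C": 90, "D": 80}
neighbors = {s: set() for s in lengths}
for a, b in pairs:
    neighbors[a].add(b)
    neighbors[b].add(a)

print("mode 0:", set_cover(neighbors))
print("mode 1:", connected_component(neighbors))
print("mode 2:", greedy_incremental(neighbors, lengths))
```

On this chain, the set-cover sketch yields two clusters (B with A and C, and D alone), the connected-component sketch merges all four sequences into one cluster, and the greedy incremental sketch yields A with B and C with D.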