MMseqs2

Install

conda install -c conda-forge -c bioconda mmseqs2

# list available modules
mmseqs -h
# list options of module
mmseqs createdb
# or
mmseqs createdb -h

Easy Workflows in MMseqs2

Simple search and clustering of FASTA/FASTQ files is covered by three modules:
easy-search, easy-cluster, and easy-linclust.

  • easy-search
# searching with a FASTA/FASTQ file against another FASTA/FASTQ file or a pre-built MMseqs2 target database
mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnResult.m8 tmp
# examples/QUERY.fasta is the query file
# examples/DB.fasta is the target database
# alnResult.m8 is the file for alignment results
# tmp is a temporary directory for intermediate files
  • easy-cluster
# clusters entries from a FASTA/FASTQ file using the cascaded clustering algorithm
mmseqs easy-cluster examples/DB.fasta clusterRes tmp
# examples/DB.fasta is the input file
# clusterRes is the output
# tmp is for temporary files
  • easy-linclust
# for larger datasets, easy-linclust offers an efficient clustering workflow whose runtime scales linearly with input size
mmseqs easy-linclust examples/DB.fasta clusterRes tmp
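The alnResult.m8 file written by easy-search is a tab-separated table. By default it has twelve BLAST-tab-like columns: query, target, sequence identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, and bit score. The following is a minimal sketch of post-filtering such a file with awk. The file sample.m8 and all values in it are made up for illustration, the thresholds are arbitrary, and the sketch assumes identity is reported as a fraction between 0 and 1 (some versions or output options report a percentage instead).

```shell
# Hypothetical sample in the 12-column layout (made-up values, not real MMseqs2 output)
printf 'q1\tt1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n' > sample.m8
printf 'q1\tt2\t0.40\t80\t48\t2\t5\t84\t3\t82\t1e-3\t45\n' >> sample.m8
printf 'q2\tt3\t0.88\t120\t14\t1\t1\t120\t2\t121\t1e-40\t180\n' >> sample.m8

# Keep hits with identity >= 0.8 (column 3) and E-value <= 1e-5 (column 11);
# adding 0 forces awk to compare the fields as numbers
awk -F'\t' '($3 + 0) >= 0.8 && ($11 + 0) <= 1e-5 {print $1 "\t" $2}' sample.m8 > filtered.tsv
cat filtered.tsv
```

Replacing sample.m8 with alnResult.m8 applies the same filter to a real result file.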

Search a database

# Before searching, convert the FASTA files containing the query sequences and the target sequences into sequence DBs.
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB
# run the search
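mmseqs search queryDB targetDB resultDB tmp
# resultDB is in the MMseqs2 database format; convert it to a BLAST-tab formatted file
mmseqs convertalis queryDB targetDB resultDB resultDB.m8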

Cluster

# Before clustering, convert your FASTA database into the MMseqs2 database (DB) format
mmseqs createdb examples/DB.fasta DB
# cluster the database DB
mmseqs cluster DB DB_clu tmp
# this produces the result database files DB_clu and DB_clu.index
# generate a TSV-formatted file from the result database
# (the sequence DB is passed twice, once as query DB and once as target DB)
mmseqs createtsv DB DB DB_clu DB_clu.tsv
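The TSV produced by createtsv for a clustering lists one cluster member per line: the first column is the representative sequence, the second column is the member. A quick way to get cluster sizes from it could look like the sketch below. The file sample_clu.tsv and its identifiers are hypothetical stand-ins for DB_clu.tsv.

```shell
# Hypothetical two-column TSV: representative <TAB> member (made-up identifiers)
printf 'seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\nseqE\tseqE\nseqE\tseqF\n' > sample_clu.tsv

# Count members per representative, then sort by size (largest first),
# breaking ties by representative name
TAB="$(printf '\t')"
awk -F'\t' '{n[$1]++} END {for (r in n) print r "\t" n[r]}' sample_clu.tsv \
  | sort -t "$TAB" -k2,2nr -k1,1 > cluster_sizes.tsv
cat cluster_sizes.tsv
```

Pointing the awk command at DB_clu.tsv instead of the sample gives the size distribution of a real clustering.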
  • Updating a clustered database
# create the old sequence DB and cluster it
mmseqs createdb DB_trimmed.fasta DB_trimmed
mmseqs cluster DB_trimmed DB_trimmed_clu tmp
# create the new sequence DB
mmseqs createdb DB.fasta DB_new
# update the old clustering with the sequences of the new DB
mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_new_updated DB_update_clu tmp
# DB_update_clu now contains the updated clustering of DB_new
# clusterupdate also creates a new sequence database, DB_new_updated,
# whose identifiers are consistent with the previous version

Additional points to note:

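  • cluster mode
    • The Greedy Set cover (--cluster-mode 0) algorithm is an approximation for the NP-complete
      optimization problem called set cover.
      In practice, it picks representative sequences and compares the other sequences against them; a sequence that meets the similarity threshold with a representative is placed in that representative's cluster.
      The representative sequences themselves do not meet the similarity threshold with one another.
    • Connected component (--cluster-mode 1) uses transitive connection to cover more remote homologs.
      This is the truly greedy approach: a sequence that is similar to any current member of a cluster is counted as part of that cluster.
    • Greedy incremental (--cluster-mode 2) works analogous to CD-HIT clustering algorithm.
      Sequences are first sorted by length, and representatives (mutually dissimilar sequences) are selected. Each remaining sequence is then compared with the first representative and joins its cluster if the threshold is met; otherwise it is compared with the second representative, and so on, until it joins a cluster.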
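To make the difference between modes 0 and 1 concrete, here is a toy sketch in awk. It is not the MMseqs2 implementation, only an illustration of the two ideas on a hypothetical edge list (edges.tsv) in which each line is a pair of sequences that already passed the similarity threshold. The four sequences form a chain A-B-C-D, so A and D are linked only transitively.

```shell
# Hypothetical similarity graph: a chain A-B-C-D (made-up identifiers)
printf 'A\tB\nB\tC\nC\tD\n' > edges.tsv

# Connected component: merge everything that is linked directly or transitively
awk -F'\t' '
function find(x) { while (parent[x] != x) x = parent[x]; return x }
{
  if (!($1 in parent)) parent[$1] = $1
  if (!($2 in parent)) parent[$2] = $2
  a = find($1); b = find($2)
  if (a != b) { if (a < b) parent[b] = a; else parent[a] = b }
}
END { for (s in parent) print find(s) "\t" s }
' edges.tsv | sort > connected.tsv

# Greedy set cover: repeatedly pick the uncovered sequence with the most
# uncovered neighbours as representative; it absorbs only its direct neighbours
awk -F'\t' '
{ nbr[$1, $2] = 1; nbr[$2, $1] = 1; seen[$1] = 1; seen[$2] = 1 }
END {
  remaining = 0
  for (s in seen) remaining++
  while (remaining > 0) {
    best = ""; bestn = -1
    for (s in seen) {
      if (covered[s]) continue
      n = 0
      for (t in seen) if (!covered[t] && ((s, t) in nbr)) n++
      # ties are broken by name so the result does not depend on awk iteration order
      if (n > bestn || (n == bestn && s < best)) { best = s; bestn = n }
    }
    print best "\t" best; covered[best] = 1; remaining--
    for (t in seen) if (!covered[t] && ((best, t) in nbr)) {
      print best "\t" t; covered[t] = 1; remaining--
    }
  }
}
' edges.tsv | sort > setcover.tsv

echo "connected component:"; cat connected.tsv
echo "greedy set cover:";    cat setcover.tsv
```

On this chain, the connected-component sketch puts all four sequences into one cluster, while the set-cover sketch makes B a representative covering A and C and leaves D as its own cluster. That matches the descriptions above: mode 1 follows transitive links, mode 0 only groups sequences directly similar to a representative.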