OrthoMCL: Software for Finding Homologous Genes

Latest recommended article published on 2022-09-20 14:34:11

hs6605015

Article link: https://blog.csdn.net/hs6605015/article/details/106910777

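OrthoMCL (http://orthomcl.org/orthomcl/) is mainly used to find orthologous and paralogous genes. It is designed for finding orthologs between reasonably complete genomes. Using OrthoMCL involves 13 main steps; see doc/OrthoMCLEngine/Main/UserGuide.txt for reference. To make OrthoMCL easier to run, create a working directory such as "my_orthomcl_dir".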
1> Configure OrthoMCL
Copy orthomcl.config.template into your working directory (my_orthomcl_dir), then edit the file to match the MySQL database name, user name, and password you have set up.

There are two main threshold parameters:
percentMatchCutoff:
BLAST similarities with a percent match less than this value are ignored.
evalueExponentCutoff:
BLAST similarities with e-value exponents greater than this value are ignored.
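
As an illustration of the edited configuration, here is a minimal sketch for MySQL. The key names are the ones I understand the 2.0.9 template to contain, so check them against your own copy. The database name orthomcl, the login my_user, and the password my_password are hypothetical placeholders, and the file name orthomcl.config.sketch is chosen only so that nothing existing gets overwritten.

```shell
# Write a sketch of an OrthoMCL configuration for MySQL.
# Assumptions: database "orthomcl", user "my_user" and password "my_password"
# are placeholders; replace them with your own values.
cat > orthomcl.config.sketch <<'EOF'
dbVendor=mysql
dbConnectString=dbi:mysql:orthomcl
dbLogin=my_user
dbPassword=my_password
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
evalueExponentCutoff=-5
oracleIndexTblSpc=NONE
EOF

# Show the two threshold settings described above.
grep -E '^(percentMatchCutoff|evalueExponentCutoff)=' orthomcl.config.sketch
```

With these values, hits covering less than 50 percent of the protein or with an e-value exponent above -5 would be ignored.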

2> Use orthomclInstallSchema to set up the Oracle or MySQL database
Usage:
orthomclInstallSchema config_file sql_log_file table_suffix
For example, run the following in the my_orthomcl_dir directory:
…/orthomclSoftware-v2.0.9/bin/orthomclInstallSchema orthomcl.config.template orthomcl.config.log

3> Use orthomclAdjustFasta to convert the input files into the format OrthoMCL requires
Usage:
orthomclAdjustFasta taxon_code fasta_file id_field
Here, the proteome files of two species, Ustilago maydis and Saccharomyces cerevisiae, were downloaded from Ensembl.
…/orthomclSoftware-v2.0.9/bin/orthomclAdjustFasta Ust Ustilago_maydis.fasta 1
…/orthomclSoftware-v2.0.9/bin/orthomclAdjustFasta Sac Saccharomyces_cerevisiae.fasta 1
This produces two files: Ust.fasta and Sac.fasta. For convenience in the later steps, place them in my_orthomcl_dir/compliantFasta/.
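Parameters: taxon_code is a short, unique abbreviation for the species (three or four letters); fasta_file is the input protein FASTA file; id_field is the number, counting from 1, of the field in the definition line that holds the unique protein ID.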
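
To show what the conversion does to the definition lines, the following sketch emulates the header rewrite with awk on a toy file. It is not the OrthoMCL tool itself. The protein IDs and sequences are made up, and splitting fields on spaces or "|" is my reading of the tool's behaviour.

```shell
# Toy input: two proteins with Ensembl-like definition lines (IDs are made up).
printf '%s\n' \
  '>UM00001P0 pep:known supercontig:1' \
  'MSTNPKPQRK' \
  '>UM00002P0 pep:known supercontig:1' \
  'MKVLAAGIVG' > toy_input.fasta

# Emulate the header rewrite: keep field number id_field of each definition
# line (fields split on spaces or "|") and prefix it with the taxon code.
awk -v taxon="Ust" -v id_field=1 '
  /^>/ {
    sub(/^>[ ]*/, "")               # drop ">" and any spaces after it
    split($0, fields, "[ |]+")      # split the definition line into fields
    print ">" taxon "|" fields[id_field]
    next
  }
  { print }                         # sequence lines pass through unchanged
' toy_input.fasta > toy_Ust.fasta

cat toy_Ust.fasta
```

Each header becomes taxon_code|protein_id, which is the form the later steps rely on to tell species apart.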

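4> Use orthomclFilterFasta to filter out poor-quality sequences
Usage:
orthomclFilterFasta input_dir min_length max_percent_stops [good_proteins_file poor_proteins_file]
For example, run:
…/orthomclSoftware-v2.0.9/bin/orthomclFilterFasta ./compliantFasta/ 10 20
Here 10 and 20 are the min_length and max_percent_stops values suggested in the user guide. This produces two files: goodProteins.fasta and poorProteins.fasta.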
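Parameters: input_dir is the directory holding the compliant .fasta files; min_length is the minimum allowed protein length; max_percent_stops is the maximum allowed percentage of stop codons; the two optional arguments name the output files.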

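5> BLAST alignment
Run an all-against-all comparison on the goodProteins.fasta file from the previous step. NCBI BLAST is recommended; ncbi-blast-2.2.28+ is used here.
Run the commands:
~/Universal_softwore_src/ncbi-blast-2.2.28+/bin/makeblastdb -in goodProteins.fasta -dbtype prot -out goodProteins.fasta
~/Universal_softwore_src/ncbi-blast-2.2.28+/bin/blastp -db goodProteins.fasta -query goodProteins.fasta -outfmt 7 -out goodProteins_blastp.out
This writes the tab-delimited output file goodProteins_blastp.out. The alignment file should be in tab-delimited format; the option that selects it may differ between BLAST versions, and for this version it is -outfmt 7.
The file needs further processing before the later steps can use it: keep only the hit lines and discard the comment lines.
This can be done with:
grep -v -P "^#" goodProteins_blastp.out > goodProteins_v1_blastp.out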
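
The comment-stripping step can be tried on a toy file first. This sketch writes a small stand-in for -outfmt 7 output (all IDs and numbers are made up), removes the lines that start with "#", and checks that the remaining lines have the 12 standard tabular columns.

```shell
# Toy stand-in for "-outfmt 7" output: comment lines start with "#",
# hit lines are tab-delimited. All values below are made up.
{
  printf '# BLASTP 2.2.28+\n'
  printf '# Query: Ust|P1\n'
  printf 'Ust|P1\tUst|P1\t100.00\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n'
  printf 'Ust|P1\tSac|Q7\t45.30\t117\t60\t2\t3\t118\t5\t120\t2e-30\t110\n'
  printf '# BLAST processed 1 queries\n'
} > toy_blastp.out

# Keep only the hit lines; plain grep suffices, -P is not needed here.
grep -v '^#' toy_blastp.out > toy_v1_blastp.out

# Print the distinct column counts of the remaining lines.
awk -F '\t' '{ print NF }' toy_v1_blastp.out | sort -u
```

Note that -outfmt 6 writes the same tabular columns without comment lines, so with that option the grep step is unnecessary.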

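6> Use orthomclBlastParser to parse the BLAST results from the previous step; the default thresholds are e-value 1e-5 and coverage 50%
Usage:
orthomclBlastParser blast_file fasta_files_dir
Run the command:
…/orthomclSoftware-v2.0.9/bin/orthomclBlastParser goodProteins_v1_blastp.out ./compliantFasta > similarSequences.txt
The parser writes to standard output, so the redirect is what produces the similarSequences.txt file.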
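Parameters: blast_file is the BLAST output in tabular format; fasta_files_dir is the directory of compliant FASTA files, from which the protein lengths are read.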

7> Use orthomclLoadBlast to load the BLAST results into the MySQL database
Usage:
orthomclLoadBlast config_file similar_seqs_file
Run the command:
…/orthomclSoftware-v2.0.9/bin/orthomclLoadBlast orthomcl.config.template similarSequences.txt
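Parameters: config_file is the configuration file from step 1; similar_seqs_file is the similarSequences.txt file from step 6.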

8> Use orthomclPairs to compute pairs from the data in the SimilarSequences table of the database
Usage:
orthomclPairs config_file log_file cleanup=[yes|no|only|all] <startAfter=TAG>
Run the command:
…/orthomclSoftware-v2.0.9/bin/orthomclPairs orthomcl.config.template pairs.log cleanup=yes
By default, three tables are created in MySQL: PotentialOrthologs, PotentialInParalogs, and PotentialCoOrthologs.
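Parameters: config_file is the configuration file; log_file receives the progress log; cleanup controls the temporary tables (yes removes them as the run proceeds, no keeps them, only just cleans up and does nothing else, all additionally clears the result tables); the optional startAfter=TAG resumes a run after a step that has already completed.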

9> Use orthomclDumpPairsFiles to export the pairs tables from the database
Usage:
orthomclDumpPairsFiles config_file
Run the command:
…/orthomclSoftware-v2.0.9/bin/orthomclDumpPairsFiles orthomcl.config.template
Parameters: config_file is the same configuration file used in the previous steps.

This produces the mclInput file and the pairs directory, which contains three files:
ortholog.txt, coortholog.txt, inparalog.txt.
Each file has three columns: protein A, protein B, and their normalized score (see the OrthoMCL Algorithm Document).
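
As a quick way to inspect such a three-column file, the sketch below builds a toy stand-in and ranks the pairs by normalized score. The IDs and scores are made up, and tab-separated columns are an assumption about the file layout.

```shell
# Toy stand-in for a pairs file: protein A, protein B, normalized score,
# separated by tabs. IDs and scores are made up.
printf 'Ust|P1\tSac|Q7\t1.250\n'  > toy_ortholog.txt
printf 'Ust|P2\tSac|Q3\t0.830\n' >> toy_ortholog.txt
printf 'Ust|P5\tSac|Q9\t1.020\n' >> toy_ortholog.txt

# Rank the pairs by the score in column 3, highest first.
# LC_ALL=C makes sort treat "." as the decimal point.
LC_ALL=C sort -k3,3nr toy_ortholog.txt > toy_ortholog.sorted.txt

cat toy_ortholog.sorted.txt
```

The same command should work on the real files in the pairs directory, provided they follow the layout assumed here.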

10> Use the mcl program to cluster the results of the previous step
Run the command:
mcl ./mclInput --abc -I 1.5 -o ./mclOutput
See the mcl documentation for details on the parameters.

11> Use orthomclMclToGroups to convert the mcl output into groups.txt
Usage (this is also the command to run):
orthomclMclToGroups my_prefix 1 < mclOutput > groups.txt
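Parameters: my_prefix is an arbitrary string prepended to every group ID; 1 is the number given to the first group, so the groups are named my_prefix1, my_prefix2, and so on.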

groups.txt is the final result file. Each line in it represents a putative protein family.
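
To get an overview of the result, the groups can be summarized with awk. The sketch below assumes each line has the form "group_id: member member ...", with members written as taxon|protein_id. The toy group and protein IDs are made up, and the taxon codes Ust and Sac are the ones used in step 3.

```shell
# Toy stand-in for groups.txt (group and protein IDs are made up).
printf '%s\n' \
  'my_prefix1: Ust|P1 Sac|Q7 Sac|Q8' \
  'my_prefix2: Ust|P2 Ust|P4' \
  'my_prefix3: Ust|P5 Sac|Q9' > toy_groups.txt

# For each group, report its size and how many members come from each taxon.
awk '
  {
    group = $1
    sub(/:$/, "", group)            # strip the trailing colon from the ID
    ust = 0; sac = 0
    for (i = 2; i <= NF; i++) {
      split($i, part, "|")          # part[1] is the taxon code
      if (part[1] == "Ust") ust++
      else if (part[1] == "Sac") sac++
    }
    print group "\t" (NF - 1) "\tUst=" ust "\tSac=" sac
  }
' toy_groups.txt > toy_group_sizes.tsv

cat toy_group_sizes.tsv
```

Groups containing members from both species are candidate ortholog groups, while a group with members from a single species, such as my_prefix2 in the toy data, contains only paralogs.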