纯python统计基于orthofinder得到的系统发育树的关注点位置的树型结构

原创

已于 2023-05-27 16:39:25 修改 · 972 阅读

2 ·

CC 4.0 BY-SA版权

文章标签：

#python #开发语言

于 2023-05-17 18:49:46 首次发布

本文详细介绍了如何使用Orthofinder工具找到直系同源基因并构建单基因树，然后通过一系列Python脚本对物种或类群在系统发育树中的位置进行统计分析，包括树型结构的简化、标准化物种名和统计不同拓扑结构的频率，以解决系统发育争议。

对于某一个物种或某类物种在整个系统发育树中的位置存在一定争议的情况，使用直系同源基因构建单基因树，并对该物种或该类物种所在结构进行统计是可以对争议起到一定的解决作用的，在此留下全套流程和大家交流。

主要分为几步：

使用orthofinder进行直系同源基因的寻找和单基因树的构建。（软件安装请自寻教程）
树型统计
2.1 对orthofinder跑出来的单基因树进行合并，并简化
2.2 树型统计
2.3 标签还原

1. 使用orthofinder进行直系同源基因的寻找和单基因树的构建

将待测物种的蛋白序列放到一个文件夹中：~/liuwei/01.tree.230513/01.Orthofinder/DataSet
将每个物种的蛋白序列放到一个文件夹中
在~/liuwei/01.tree.230513/01.Orthofinder目录下跑
脚本为：

orthofinder -f DataSet -t 80 -a 1

跑完结果在
~/liuwei/01.tree.230513/01.Orthofinder/DataSet/OrthoFinder/Results_May15_4
结果文件描述请参考
https://www.jianshu.com/p/3bcd965605f5
https://www.jianshu.com/p/a93ce87ff2d6
OrthoFinder 2.0 原理及所涉及的相关概念
orthofiner对每个cluser内的基因都建了一个树，使用的模型暂时还没去看，了解的可以留言，这个结果放在文件夹Resolved_Gene_Trees中了。
我们需要自己把其中的单基因树抓取出来并放到一个文件中去，orthofiner的结果也给出了单基因的list，文件为Orthogroups/Orthogroups_SingleCopyOrthologues.txt

使用脚本01.cat_files.py将Resolved_Gene_Trees中的单基因树抓出来并放到一个文件中：

import sys
f1 = open("~/liuwei/01.tree.230513/01.Orthofinder/DataSet/OrthoFinder/Results_May15_4/Orthogroups/Orthogroups_SingleCopyOrthologues.txt","r")  # SingleCopyOrthologues clusterID list
f3 = open("all_Single_Copy_Orthologue_Trees.txt","w")
for i in f1:
	i = i.strip()
	f2 = open(f"Resolved_Gene_Trees/{
     
     i}_tree.txt","r")
	content = f2.readline()
	f3.write(content+"\n")
	f2.close()
f3.close()
f1.close()

这样得到了all_Single_Copy_Orthologue_Trees.txt，内含所有单基因树。

——————————————————————————————
到这里单基因树的文件已经得到，后续我们要做的是为统计结构精简一下单基因树。

由于这个单基因树内的每个单基因树对应的单基因id不一致，因此我们只需要留下物种id就行，随后写了脚本进行简化基因id和物种id。

为了方便管理，我新建了一个test_py文件夹，后续工作主要在此文件夹下进行，各位请自行注意脚本的文件路径。

因为开始的思路没有那么清晰，所以开始我是先精简了一下基因id，将基因id去掉，只保留了物种id（蛋白文件名）。
脚本如下：
ps：我可能对…/SpeciesIDs.txt进行过修饰，必然把点"." 修改成了下划线"_"

import re
f1 = open("all_Single_Copy_Orthologue_Trees.txt","r")
f3 = open("../SpeciesIDs.txt","r")
SpeciesIDs = []
for i in f3:
        i=i.strip()
        SpeciesIDs.append(i)
new_Trees = []
for i in f1:
        for j in SpeciesIDs:
                i = re.sub(f'{
     
     j}_[^:]*:','{}:'.format(j),i)
        new_Trees.append(i)
f2 = open("all_concise_Trees.txt","w")
for i in new_Trees:     
        f2.write(i)

随后因为还是很难看，所以直接把物种名标准化了，这里很重要，因为后续的脚本对标准化的名字其实是有要求的，一定要形如M1或M23，ps：M后面的数字不能达3位，你要能自己看脚本也能自己修改，几位其实很easy，只是我懒得改了。
首先建立远物种名和新物种名一一对应文件SpeciesIDs.txt，形如这种，改成以M开头，从小到大依次命名。

1817_protein    M1
1824_protein    M2
21178_protein   M3
21183_protein   M4
21184_protein   M5
21185_protein   M6
21186_protein   M7
21187_protein   M8
21189_protein   M9
21192_protein   M10
2612_protein    M11
2613_protein    M12
Ascim1_GeneCatalog_proteins_20121221_aa M13
Ascni1_GeneCatalog_proteins_20141120_aa M14
Chove1_GeneCatalog_proteins_20131210_aa M15
D10A_protein    M16
D11A_protein    M17
D12A_protein    M18
D13A_protein    M19
D14A_protein    M20
D15A_protein    M21
D16A_protein    M22
D17A_protein    M23
D18A_protein    M24

随后使用脚本changeSpeciesIDs.py将上一步得到的all_concise_Trees.txt中的物种名基于上述文件标准化。

python changeSpeciesIDs.py all_concise_Trees.txt SpeciesIDs.txt

#this script follows the concise_Trees.py shortNameTrees.txt
import sys
#f1 = open("all_concise_Trees.txt","r")
#f2 = open("SpeciesIDs.txt","r")
#f3 = open("shortNameTrees.txt","w")
f1 = open(sys.argv[1],"r")
f2 = open(sys.argv[2],"r")
f3 = open(sys.argv[3],"w")
name = {
   
   }