2020-11-21 Building a tree from single-copy orthologous genes based on OrthoFinder results

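This post describes how to take the single-copy orthologous gene sequences from the OrthoFinder output, run multiple sequence alignment on them, and build phylogenetic trees with IQ-TREE and RAxML-NG. It covers sequence extraction, format conversion, best-fit model selection, tree construction, and tree beautification, drawing on several online tutorials and resources.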
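  1. Multiple sequence alignment of single-copy orthologous genes
    After OrthoFinder has run, the sequences of the required orthologous genes are in the Single_Copy_Orthologue_Sequences folder of the results. Each file is aligned with MAFFT L-INS-i (linsi):
for i in *.fa; do linsi "$i" > "$i.1"; done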
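  2. Extract conserved blocks
conda install -c bioconda gblocks
for i in *.1; do Gblocks "$i" -t=p -e=.2; done
# -t=p: protein sequences; -e=.2: output files get the extension .2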
  3. Merge the sequences
for i in *.2; do seqkit sort "$i" > "$i.3"; seqkit seq "$i.3" -w 0 > "$i.3.4"; done
mkdir new
mv /home/yuanchao/yuanchao/working/Single_Copy_Orthologue_Sequences/*.4  new/
cd new && ls
paste -d " " *.4 > all.fa # fails on Linux with "too many open files"
ulimit -a # show the current limits
ulimit -n 2048 # raise the number of files that can be open at once
paste -d " " *.4 > all.fa
sed -i 's/ //g' all.fa # remove the spaces that paste inserted between the files
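
The paste approach only works if every file lists the species in the same order, and it needs all files open at once. As an alternative, here is a minimal Python sketch that concatenates the trimmed alignments one file at a time, keyed by species. The `species_of` rule (the text before the first `_` in the sequence ID) and all function names are assumptions made for illustration; OrthoFinder does not guarantee that ID layout, so adapt the rule to your own headers.

```python
from collections import defaultdict


def read_fasta(path):
    """Return a list of (sequence ID, sequence) pairs from one FASTA file."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # The ID is the first word of the header line.
                name, chunks = line[1:].split()[0], []
            elif line:
                # Drop any spaces inside the sequence line.
                chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def species_of(seq_id):
    """Hypothetical rule: the species code is the text before the first '_'."""
    return seq_id.split("_")[0]


def concatenate(paths):
    """Concatenate alignments per species, opening one file at a time."""
    merged = defaultdict(list)
    species_seen = None
    for path in sorted(paths):
        pairs = read_fasta(path)
        records = {species_of(seq_id): seq for seq_id, seq in pairs}
        if len(records) != len(pairs):
            raise ValueError(f"{path}: a species appears more than once")
        if species_seen is None:
            species_seen = set(records)
        elif set(records) != species_seen:
            raise ValueError(f"{path}: species set differs from the first file")
        if len({len(seq) for seq in records.values()}) != 1:
            raise ValueError(f"{path}: sequences are not aligned to one length")
        for species, seq in records.items():
            merged[species].append(seq)
    return {species: "".join(parts) for species, parts in merged.items()}


def write_fasta(supermatrix, path):
    """Write one concatenated sequence per species, sorted by species code."""
    with open(path, "w") as out:
        for species in sorted(supermatrix):
            out.write(f">{species}\n{supermatrix[species]}\n")
```

Calling `write_fasta(concatenate(glob.glob("*.4")), "all.fa")` after `import glob` would give an all.fa whose headers are already species codes, and the checks fail loudly if a file is missing a species or is not aligned.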

Python script to convert the FASTA file to PHYLIP format

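import re

# Read every FASTA record as (name, sequence); whitespace inside a sequence is dropped.
with open('all.fa', 'r') as fin:
    sequences = [(m.group(1), ''.join(m.group(2).split()))
                 for m in re.finditer(r'(?m)^>([^ \n]+)[^\n]*([^>]*)', fin.read())]

# Relaxed PHYLIP: a "<taxa> <sites>" header, then one "name sequence" line per taxon.
with open('all.phy', 'w') as fout:
    fout.write('%d %d\n' % (len(sequences), len(sequences[0][1])))
    for item in sequences:
        fout.write('%-20s %s\n' % item)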

Then manually change the sequence names in the .phy file to the species names (or short codes). There is one sequence per species.
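
The renaming can also be scripted. This is a small sketch, assuming the relaxed PHYLIP layout written by the script above (a header line, then one `name sequence` line per taxon) and a list of species labels given in the same order as the sequences in the file. The function name `rename_phylip` is made up for this example.

```python
def rename_phylip(text, new_names):
    """Replace the taxon names of a relaxed, sequential PHYLIP file.

    new_names must be in the same order as the sequences in the file.
    """
    lines = text.strip().splitlines()
    header, rows = lines[0], lines[1:]
    n_taxa = int(header.split()[0])
    if len(rows) != n_taxa or len(new_names) != n_taxa:
        raise ValueError("number of names does not match the PHYLIP header")
    if any(" " in name for name in new_names):
        raise ValueError("names must not contain spaces")
    out = [header]
    for row, name in zip(rows, new_names):
        sequence = row.split()[-1]  # the sequence is the last field on the line
        out.append(f"{name:<20} {sequence}")
    return "\n".join(out) + "\n"
```

For example, `rename_phylip(open("all.phy").read(), ["S.humi", "S.terrae", ...])` returns the new file content as a string, which can then be written back to disk. Check the order of the sequences in all.phy before building the list.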
4. Find the best-fit model with IQ-TREE

nohup iqtree -s all.phy -m MF -cmax 15 -nt AUTO -ntmax 12 &
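# -m MF: run model selection only, without building a tree
# -cmax 15: test up to 15 FreeRate (+R) rate categories (default 10)
# -nt AUTO: choose the number of threads automatically (the most common setting)
# -ntmax: cap the number of threads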

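The best-fit model is reported in all.phy.log. nohup.out only holds the screen output, and later nohup runs in the same directory write to the same file.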
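
Rather than reading through the log by eye, the model name can be pulled out of all.phy.log with a few lines of Python. This sketch assumes the log contains a line of the form `Best-fit model: ... chosen according to ...`, which is what IQ-TREE 2.0.3 prints in the log attached later in this post. The function name `best_fit_model` is made up for illustration.

```python
import re


def best_fit_model(log_text):
    """Return (model, criterion) from an IQ-TREE ModelFinder log, or None."""
    match = re.search(
        r"^Best-fit model: (\S+) chosen according to (\S+)",
        log_text,
        re.MULTILINE,
    )
    return (match.group(1), match.group(2)) if match else None
```

Typical use would be `best_fit_model(open("all.phy.log").read())`, and the returned model string can be passed to the `-m` or `--model` option of the tree-building step.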
5. Build the tree with RAxML-NG

nohup raxml-ng --all --msa all.phy --model LG+F+R4  --bs-trees 1000 --prefix all &

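After a week of waiting, the run was finally killed because the server was overloaded.
The nohup output is attached below. I switched to IQ-TREE to try again. The -o option adds an outgroup to build a rooted tree: (1) it can be given during the run, or (2) it can be added after the run, which gives an unrooted tree that looks very much like a rooted one.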

nohup iqtree -s all.phy --threads-max 10 --mem 90% --prefix IQTREE -B 1000 --boot-trees -T AUTO --bnni -m LG+F+R4 &
IQ-TREE multicore version 2.0.3 for Linux 64-bit built Apr 26 2020
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    habri325 (AVX, 15 GB RAM)
Command: iqtree -s all.phy -m MF -cmax 15 -nt AUTO -ntmax 12
Seed:    744102 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Sat Nov 21 15:44:26 2020
Kernel:  AVX - auto-detect threads (12 CPU cores detected)

Reading alignment file all.phy ... Phylip format detected
Alignment most likely contains protein sequences
Alignment has 8 sequences with 420934 columns, 61795 distinct patterns
67465 parsimony-informative, 83916 singleton sites, 269553 constant sites
                   Gap/Ambiguity  Composition  p-value
   1  S.acidaminiphila     0.00%    failed      0.00%
   2  S.pictorum           0.00%    failed      0.00%
   3  S.nitritireducens    0.00%    failed      0.00%
   4  S.koreensis          0.00%    failed      0.00%
   5  S.humi               0.00%    failed      0.00%
   6  S.terrae             0.00%    failed      0.00%
   7  S.daejeonensis       0.00%    failed      0.00%
   8  S.tumulicola         0.00%    failed      0.00%
****  TOTAL                0.00%  8 sequences failed composition chi2 test (p-value<5%; df=19)
NOTE: minimal branch length is reduced to 0.000000237567 for long alignment

NOTE: Restoring information from model checkpoint file all.phy.model.gz

CHECKPOINT: Initial tree restored
Measuring multi-threading efficiency up to 12 CPU cores
6 trees examined
Threads: 1 / Time: 13.217 sec / Speedup: 1.000 / Efficiency: 100% / LogL: -2929098
Threads: 2 / Time: 7.015 sec / Speedup: 1.884 / Efficiency: 94% / LogL: -2929098
Threads: 3 / Time: 4.894 sec / Speedup: 2.701 / Efficiency: 90% / LogL: -2929098
Threads: 4 / Time: 3.770 sec / Speedup: 3.506 / Efficiency: 88% / LogL: -2929098
Threads: 5 / Time: 3.109 sec / Speedup: 4.251 / Efficiency: 85% / LogL: -2929098
Threads: 6 / Time: 2.717 sec / Speedup: 4.865 / Efficiency: 81% / LogL: -2929098
Threads: 7 / Time: 3.689 sec / Speedup: 3.583 / Efficiency: 51% / LogL: -2929098
BEST NUMBER OF THREADS: 6

Perform fast likelihood tree search using LG+I+G model...
CHECKPOINT: Tree restored, LogL: -2871359.874
NOTE: ModelFinder requires 1146 MB RAM!
ModelFinder will test up to 756 protein models (sample size: 420934) ...
 No. Model         -LnL         df  AIC          AICc         BIC
  1  LG            2971137.838  13  5942301.676  5942301.677  5942444.029
  2  LG+I          2881791.465  14  5763610.930  5763610.931  5763764.233
  3  LG+G4         2872276.674  14  5744581.348  5744581.349  5744734.651
  4  LG+I+G4       2871305.049  15  5742640.098  5742640.099  5742804.351
  5  LG+R2         2873005.079  15  5746040.158  5746040.159  5746204.411
  6  LG+R3         2870678.753  17  5741391.506  5741391.507  5741577.660
  7  LG+R4         2870611.477  19  5741260.954  5741260.956  5741469.008
  8  LG+R5         2870604.991  21  5741251.981  5741251.984  5741481.936
 25  LG+F+R4       2829569.034  38  5659214.069  5659214.076  5659630.178
 43  WAG+R4        2871364.011  19  5742766.022  5742766.024  5742974.076
 61  WAG+F+R4      2835224.156  38  5670524.312  5670524.319  5670940.421
 79  JTT+R4        2875686.834  19  5751411.668  5751411.670  5751619.722
 97  JTT+F+R4      2841354.810  38  5682785.620  5682785.627  5683201.729
115  JTTDCMut+R4   2875894.446  19  5751826.891  5751826.893  5752034.946
133  JTTDCMut+F+R4 2841090.966  38  5682257.933  5682257.940  5682674.042
151  DCMut+R4      2889976.695  19  5779991.389  5779991.391  5780199.444
169  DCMut+F+R4    2841790.114  38  5683656.228  5683656.235  5684072.337
187  VT+R4         2881648.434  19  5763334.868  5763334.870  5763542.922
205  VT+F+R4       2844131.422  38  5688338.844  5688338.851  5688754.953
223  PMB+R4        2902260.100  19  5804558.201  5804558.202  5804766.255
241  PMB+F+R4      2861418.816  38  5722913.632  5722913.639  5723329.740
259  Blosum62+R4   2903945.114  19  5807928.229  5807928.230  5808136.283
277  Blosum62+F+R4 2859138.231  38  5718352.462  5718352.469  5718768.571
295  Dayhoff+R4    2889826.740  19  5779691.480  5779691.481  5779899.534
313  Dayhoff+F+R4  2841662.701  38  5683401.402  5683401.409  5683817.511
331  mtREV+R4      3048557.789  19  6097153.578  6097153.580  6097361.633
349  mtREV+F+R4    2876055.838  38  5752187.676  5752187.683  5752603.785
367  mtART+R4      3100254.416  19  6200546.831  6200546.833  6200754.886
385  mtART+F+R4    2911425.836  38  5822927.671  5822927.678  5823343.780
403  mtZOA+R4      3014825.921  19  6029689.842  6029689.844  6029897.897
421  mtZOA+F+R4    2875761.982  38  5751599.964  5751599.971  5752016.072
439  mtMet+R4      3107742.091  19  6215522.183  6215522.185  6215730.237
457  mtMet+F+R4    2878591.450  38  5757258.899  5757258.906  5757675.008
475  mtVer+R4      3084101.909  19  6168241.817  6168241.819  6168449.872
493  mtVer+F+R4    2900814.652  38  5801705.303  5801705.310  5802121.412
511  mtInv+R4      3151604.022  19  6303246.044  6303246.046  6303454.098
529  mtInv+F+R4    2864072.476  38  5728220.951  5728220.958  5728637.060
547  mtMAM+R4      3093891.668  19  6187821.336  6187821.338  6188029.390
565  mtMAM+F+R4    2917600.974  38  5835277.948  5835277.955  5835694.057
583  HIVb+R4       2947877.165  19  5895792.331  5895792.333  5896000.385
601  HIVb+F+R4     2895358.762  38  5790793.524  5790793.531  5791209.633
619  HIVw+R4       3088006.815  19  6176051.630  6176051.632  6176259.685
637  HIVw+F+R4     2956556.097  38  5913188.193  5913188.200  5913604.302
655  FLU+R4        2968803.671  19  5937645.342  5937645.344  5937853.397
673  FLU+F+R4      2872026.423  38  5744128.846  5744128.853  5744544.955
691  rtREV+R4      2907635.553  19  5815309.107  5815309.109  5815517.161
709  rtREV+F+R4    2843280.734  38  5686637.469  5686637.476  5687053.577
727  cpREV+R4      2899268.272  19  5798574.545  5798574.546  5798782.599
745  cpREV+F+R4    2849736.637  38  5699549.275  5699549.282  5699965.384
Akaike Information Criterion:           LG+F+R4
Corrected Akaike Information Criterion: LG+F+R4
Bayesian Information Criterion:         LG+F+R4
Best-fit model: LG+F+R4 chosen according to BIC

All model information printed to all.phy.model.gz
CPU time for ModelFinder: 56290.041 seconds (15h:38m:10s)
Wall-clock time for ModelFinder: 9545.140 seconds (2h:39m:5s)

NOTE: 305 MB RAM (0 GB) is required!
Estimate model parameters (epsilon = 0.010)
1. Initial log-likelihood: -2829569.034
2. Current log-likelihood: -2829568.969
3. Current log-likelihood: -2829568.893
4. Current log-likelihood: -2829568.820
5. Current log-likelihood: -2829568.706
6. Current log-likelihood: -2829568.646
7. Current log-likelihood: -2829568.607
8. Current log-likelihood: -2829568.561
9. Current log-likelihood: -2829568.504
10. Current log-likelihood: -2829568.470
11. Current log-likelihood: -2829568.449
12. Current log-likelihood: -2829568.412
13. Current log-likelihood: -2829568.381
14. Current log-likelihood: -2829568.363
15. Current log-likelihood: -2829568.330
16. Current log-likelihood: -2829568.302
17. Current log-likelihood: -2829568.285
18. Current log-likelihood: -2829568.272
19. Current log-likelihood: -2829568.255
20. Current log-likelihood: -2829568.225
21. Current log-likelihood: -2829568.208
22. Current log-likelihood: -2829568.192
23. Current log-likelihood: -2829568.181
24. Current log-likelihood: -2829568.169
25. Current log-likelihood: -2829568.155
26. Current log-likelihood: -2829568.145
27. Current log-likelihood: -2829568.135
Optimal log-likelihood: -2829568.120
Site proportion and rates:  (0.368,0.037) (0.287,0.230) (0.284,1.927) (0.060,6.187)
Parameters optimization took 27 rounds (44.638 sec)
BEST SCORE FOUND : -2829568.120
Total tree length: 0.989

Total number of iterations: 0
CPU time used for tree search: 0.000 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.000 sec (0h:0m:0s)
Total CPU time used: 265.375 sec (0h:4m:25s)
Total wall-clock time used: 44.742 sec (0h:0m:44s)

Analysis results written to: 
  IQ-TREE report:                all.phy.iqtree
  Tree used for ModelFinder:     all.phy.treefile
  Screen log file:               all.phy.log

Date and Time: Sat Nov 21 18:24:17 2020
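
The AIC, AICc, and BIC columns in the ModelFinder table above can be reproduced from the -LnL and df columns and the sample size (420934 alignment columns) with the standard definitions: AIC = 2(-lnL) + 2k, AICc = AIC + 2k(k+1)/(n-k-1), and BIC = 2(-lnL) + k ln(n). A short sketch of the arithmetic follows; the function name is an assumption for illustration, and the last digit can differ slightly because the log rounds -LnL to three decimals.

```python
import math


def information_criteria(neg_lnl, df, n_sites):
    """Return (AIC, AICc, BIC) from -lnL, free parameters, and sample size."""
    aic = 2 * neg_lnl + 2 * df
    aicc = aic + 2 * df * (df + 1) / (n_sites - df - 1)
    bic = 2 * neg_lnl + df * math.log(n_sites)
    return aic, aicc, bic


# Values for LG+F+R4 copied from the ModelFinder table in the log.
aic, aicc, bic = information_criteria(2829569.034, 38, 420934)
print(f"AIC={aic:.3f} AICc={aicc:.3f} BIC={bic:.3f}")
```

With over 400,000 sites the penalty terms are tiny compared with the likelihood differences between models, which is why all three criteria agree on LG+F+R4 here.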

RAxML-NG v. 0.9.0 released on 20.05.2019 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

RAxML-NG was called at 21-Nov-2020 20:43:09 as follows:

raxml-ng --all --msa all.phy --model LG+F+R4 --bs-trees 1000 --prefix all

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 1000
  random seed: 1605962589
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX
  parallelization: PTHREADS (6 threads), thread pinning: OFF

[00:00:00] Reading alignment from file: all.phy
[00:00:00] Loaded alignment with 8 taxa and 420934 sites

Alignment comprises 1 partitions and 61795 patterns

Partition 0: noname
Model: LG+FC+R4
Alignment sites / patterns: 420934 / 61795
Gaps: 0.00 %
Invariant sites: 64.04 %


NOTE: Binary MSA file already exists: all.raxml.rba

[00:00:00] Generating 10 random starting tree(s) with 8 taxa
[00:00:00] Generating 10 parsimony starting tree(s) with 8 taxa
[00:00:00] Data distribution: max. partitions/sites/weight per thread: 1 / 10300 / 824000

Starting ML tree search with 20 distinct starting trees

[00:15:09] ML tree search #1, logLikelihood: -2829562.100183
[00:31:09] ML tree search #2, logLikelihood: -2829561.100633
[00:49:39] ML tree search #3, logLikelihood: -2829560.619507
[01:11:31] ML tree search #4, logLikelihood: -2829560.290105
[01:23:49] ML tree search #5, logLikelihood: -2829560.131473
[01:45:53] ML tree search #6, logLikelihood: -2829560.292220
[02:05:08] ML tree search #7, logLikelihood: -2829560.418453
[02:17:25] ML tree search #8, logLikelihood: -2829560.457298
[02:34:13] ML tree search #9, logLikelihood: -2829560.632220
[02:49:21] ML tree search #10, logLikelihood: -2829560.813941
[03:01:08] ML tree search #11, logLikelihood: -2829561.097202
[03:17:09] ML tree search #12, logLikelihood: -2829561.012665
[03:31:23] ML tree search #13, logLikelihood: -2829560.630028
[03:46:48] ML tree search #14, logLikelihood: -2829561.110779
[04:03:49] ML tree search #15, logLikelihood: -2829561.122770
[04:15:38] ML tree search #16, logLikelihood: -2829561.052122
[04:29:16] ML tree search #17, logLikelihood: -2829560.673882
[04:45:12] ML tree search #18, logLikelihood: -2829561.212994
[04:49:47] ML tree search #19, logLikelihood: -2829572.187337
[05:03:54] ML tree search #20, logLikelihood: -2829560.698702

[05:03:54] ML tree search completed, best tree logLH: -2829560.131473

[05:03:54] Starting bootstrapping analysis with 1000 replicates.

[05:16:36] Bootstrap tree #1, logLikelihood: -2827652.629756
[05:29:50] Bootstrap tree #2, logLikelihood: -2828368.350447
[05:41:06] Bootstrap tree #3, logLikelihood: -2827059.810637
[05:55:49] Bootstrap tree #4, logLikelihood: -2821164.263529
[05:59:14] Bootstrap tree #5, logLikelihood: -2828186.516885
[06:07:33] Bootstrap tree #6, logLikelihood: -2832066.338951
[06:16:47] Bootstrap tree #7, logLikelihood: -2830408.948682
[06:35:45] Bootstrap tree #8, logLikelihood: -2826606.049605
[06:45:56] Bootstrap tree #9, logLikelihood: -2824189.220799
[06:58:09] Bootstrap tree #10, logLikelihood: -2833159.271852
[07:07:42] Bootstrap tree #11, logLikelihood: -2830956.390698
[07:16:48] Bootstrap tree #12, logLikelihood: -2831951.009978
[07:27:07] Bootstrap tree #13, logLikelihood: -2824713.889968
[07:41:53] Bootstrap tree #14, logLikelihood: -2832560.039860
[07:51:20] Bootstrap tree #15, logLikelihood: -2831965.682622
[08:02:11] Bootstrap tree #16, logLikelihood: -2821715.959956
[08:17:24] Bootstrap tree #17, logLikelihood: -2831571.075277
[08:27:58] Bootstrap tree #18, logLikelihood: -2826622.888111
[08:38:11] Bootstrap tree #19, logLikelihood: -2830657.381464
[08:48:49] Bootstrap tree #20, logLikelihood: -2832895.960579
[08:58:08] Bootstrap tree #21, logLikelihood: -2829531.033391
[09:07:18] Bootstrap tree #22, logLikelihood: -2836531.631576
[09:19:59] Bootstrap tree #23, logLikelihood: -2824095.302481
[09:27:56] Bootstrap tree #24, logLikelihood: -2828452.211052
[09:37:03] Bootstrap tree #25, logLikelihood: -2830375.502659
[09:47:29] Bootstrap tree #26, logLikelihood: -2829193.600428
[09:59:53] Bootstrap tree #27, logLikelihood: -2828623.158640
[10:15:07] Bootstrap tree #28, logLikelihood: -2824712.105612
[10:35:06] Bootstrap tree #29, logLikelihood: -2827839.057275
[10:46:49] Bootstrap tree #30, logLikelihood: -2826220.188318
[11:01:57] Bootstrap tree #31, logLikelihood: -2825971.189481
[11:14:23] Bootstrap tree #32, logLikelihood: -2828886.396345
[11:29:28] Bootstrap tree #33, logLikelihood: -2829393.489736
[11:42:29] Bootstrap tree #34, logLikelihood: -2832751.181423
[11:56:18] Bootstrap tree #35, logLikelihood: -2821253.655157
[12:06:40] Bootstrap tree #36, logLikelihood: -2833806.546337
[12:16:56] Bootstrap tree #37, logLikelihood: -2825327.136226
[12:36:54] Bootstrap tree #38, logLikelihood: -2826549.141106
[12:49:32] Bootstrap tree #39, logLikelihood: -2822592.653724
[12:58:19] Bootstrap tree #40, logLikelihood: -2830434.203491
[13:05:10] Bootstrap tree #41, logLikelihood: -2828413.981928
[13:15:35] Bootstrap tree #42, logLikelihood: -2827830.227101
[13:28:44] Bootstrap tree #43, logLikelihood: -2825499.957294
[13:41:27] Bootstrap tree #44, logLikelihood: -2826461.011279
[14:00:23] Bootstrap tree #45, logLikelihood: -2824301.157453
[14:13:55] Bootstrap tree #46, logLikelihood: -2824617.927624
[14:23:55] Bootstrap tree #47, logLikelihood: -2831936.384558
[14:35:13] Bootstrap tree #48, logLikelihood: -2824255.779613
[14:47:43] Bootstrap tree #49, logLikelihood: -2831075.671365
[14:57:34] Bootstrap tree #50, logLikelihood: -2830255.666543
[15:09:35] Bootstrap tree #51, logLikelihood: -2831726.309364
[15:23:10] Bootstrap tree #52, logLikelihood: -2824781.251735
[15:39:06] Bootstrap tree #53, logLikelihood:
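
The timestamps in the RAxML-NG log above are enough to estimate how long all 1000 bootstrap replicates would have taken. Bootstrapping started at 05:03:54 and replicate #52 finished at 15:23:10, which is about 12 minutes per replicate, so 1000 replicates would need a bit over eight days on these 6 threads. A minimal sketch of that extrapolation follows. It assumes the `[hh:mm:ss]` prefix and the message wording seen in this log, and the function name is made up for illustration.

```python
import re

# Matches lines such as "[05:16:36] Bootstrap tree #1, logLikelihood: ...".
STAMP = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\] (.*)$")
TREE = re.compile(r"Bootstrap tree #(\d+),")


def estimate_bootstrap_seconds(log_text, total_replicates):
    """Extrapolate the total bootstrap wall-clock time from a partial log."""
    start = None
    last_time, last_index = None, 0
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue
        hours, minutes, secs, message = match.groups()
        elapsed = int(hours) * 3600 + int(minutes) * 60 + int(secs)
        if message.startswith("Starting bootstrapping"):
            start = elapsed
            continue
        tree = TREE.match(message)
        if tree:
            last_time, last_index = elapsed, int(tree.group(1))
    if start is None or last_index == 0:
        return None  # no bootstrap progress found in the log
    per_replicate = (last_time - start) / last_index
    return per_replicate * total_replicates
```

Dividing the result by 86400 gives days. Running this on the first few dozen replicates of a job is a cheap way to decide early whether to lower the replicate count or switch to a faster bootstrap method.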