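1. Multiple sequence alignment of the single-copy orthologous genes
After OrthoFinder finishes, the Single_Copy_Orthologue_Sequences folder in its results contains the orthologous gene sequences needed here. Each file is aligned with MAFFT L-INS-i (linsi):
for i in *.fa; do linsi "$i" > "$i.1"; done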
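2. Extract the conserved blocks
conda install -c bioconda gblocks
for i in *.1; do Gblocks "$i" -t=p -e=.2; done # -t=p: protein sequences; -e=.2: suffix appended to the output file name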
3. Concatenate the sequences
for i in *.2; do seqkit sort "$i" > "$i.3"; seqkit seq "$i.3" -w 0 > "$i.3.4"; done # sort records by ID, then write each sequence on a single line
mkdir new
mv /home/yuanchao/yuanchao/working/Single_Copy_Orthologue_Sequences/*.4 new/
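cd new && ls
paste -d " " *.4 > all.fa # fails on Linux with "Too many open files"
ulimit -a # show the current limits
ulimit -n 2048 # raise the number of files that may be open at once
paste -d " " *.4 > all.fa
sed -i 's/ //g' all.fa # remove the spaces that paste inserted between genes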
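The paste approach above only works if every file lists the species in the same order, and it needs the open-file limit raised. As an alternative, here is a minimal Python sketch that reads one file at a time and joins the genes by sorted-ID rank. It assumes (this is not stated in the original) that sorting the IDs puts the species in the same order in every file, which is the same assumption the seqkit sort step relies on. The function names `read_fasta` and `concatenate` and the placeholder headers `seq1`, `seq2`, ... are illustrative only.

```python
import glob


def read_fasta(path):
    """Return a list of (id, sequence) pairs; whitespace inside sequences is dropped."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header as the ID.
                name = (line[1:].split() or [""])[0]
                records.append([name, []])
            elif records:
                records[-1][1].append("".join(line.split()))
    return [(name, "".join(chunks)) for name, chunks in records]


def concatenate(paths):
    """Concatenate alignments column-wise, pairing sequences by sorted-ID rank."""
    merged = None
    for path in sorted(paths):
        records = sorted(read_fasta(path))  # same idea as sorting by ID with seqkit
        lengths = {len(seq) for _, seq in records}
        if len(lengths) != 1:
            raise ValueError(f"{path}: sequences differ in length, not an alignment")
        if merged is None:
            merged = [[] for _ in records]
        elif len(records) != len(merged):
            raise ValueError(f"{path}: expected {len(merged)} sequences")
        for slot, (_, seq) in zip(merged, records):
            slot.append(seq)
    return ["".join(parts) for parts in (merged or [])]


if __name__ == "__main__":
    paths = glob.glob("*.4")
    if paths:
        # Only one input file is open at a time, so the open-file limit is not hit.
        with open("all.fa", "w") as out:
            for index, row in enumerate(concatenate(paths), start=1):
                out.write(f">seq{index}\n{row}\n")
```

The length and count checks catch the cases where paste would silently produce a misaligned supermatrix, for example when one gene file is missing a species.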
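Python script to convert the FASTA file to PHYLIP:
import re

# Read every record: group 1 is the first word of the header, group 2 the sequence lines.
with open('all.fa', 'r') as fin:
    sequences = [(m.group(1), ''.join(m.group(2).split()))
                 for m in re.finditer(r'(?m)^>([^ \n]+)[^\n]*([^>]*)', fin.read())]

# Write the header "<number of sequences> <alignment length>", then one line per sequence.
with open('all.phy', 'w') as fout:
    fout.write('%d %d\n' % (len(sequences), len(sequences[0][1])))
    for item in sequences:
        fout.write('%-20s %s\n' % item)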
Then manually change the sequence names in the .phy file to the species names (or short codes). There is one sequence per species.
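The renaming can also be scripted. The sketch below is only an illustration: it assumes the one-line-per-sequence layout written by the conversion script above, and it assumes you supply the species names in the same order as the rows of the file, which has to be checked by hand. The function name `rename_phylip` is hypothetical.

```python
def rename_phylip(text, names):
    """Replace the name column of a one-line-per-sequence PHYLIP text with `names`."""
    lines = text.strip().splitlines()
    header, rows = lines[0], lines[1:]
    n_taxa, n_sites = (int(value) for value in header.split())
    if len(rows) != n_taxa or len(names) != n_taxa:
        raise ValueError("number of names must match the number of sequences")
    out = [f"{n_taxa} {n_sites}"]
    for name, row in zip(names, rows):
        seq = row.split()[-1]  # the sequence is the last whitespace-separated field
        if len(seq) != n_sites:
            raise ValueError(f"{name}: length {len(seq)} != {n_sites}")
        if any(ch.isspace() for ch in name):
            raise ValueError(f"{name!r}: names must not contain whitespace")
        out.append(f"{name:<20} {seq}")
    return "\n".join(out) + "\n"
```

A typical use would be to read all.phy into a string, call `rename_phylip(text, names)` with a list of eight species codes in row order, and write the result back to a new file so the original is kept for comparison.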
4. Find the best-fit model with IQ-TREE
nohup iqtree -s all.phy -m MF -cmax 15 -nt AUTO -ntmax 12 &
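# -m MF: run model selection only, without building a tree
# -cmax 15: maximum number of FreeRate (+R) categories to test (default 10); raising it widens the search rather than speeding it up
# -nt AUTO: choose the number of threads automatically (the most common setting)
# -ntmax: upper limit on the number of threads
The best-fit model is reported in all.phy.log. nohup.out holds the screen output of the run; later nohup commands started in the same directory append to the same file, so rename it if it should be kept separately.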
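Rather than scrolling through the log, the model name can be pulled out with a short script. This is a minimal sketch that relies only on the "Best-fit model: ... chosen according to ..." line that IQ-TREE prints in the log reproduced further down; the function name `best_fit_model` is illustrative.

```python
import re


def best_fit_model(log_text):
    """Return (model, criterion) from an IQ-TREE log, or None if the line is absent."""
    match = re.search(
        r"^Best-fit model:\s*(\S+)\s+chosen according to\s+(\S+)",
        log_text,
        flags=re.MULTILINE,
    )
    return (match.group(1), match.group(2)) if match else None
```

Reading all.phy.log into a string and passing it to `best_fit_model` returns the model string that is then given to `--model` or `-m` in the tree-building step.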
5. Build the tree with RAxML-NG
nohup raxml-ng --all --msa all.phy --model LG+F+R4 --bs-trees 1000 --prefix all &
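After a week of waiting, the run had to be terminated because the server was overloaded.
The nohup output is attached below. Next attempt: switch to IQ-TREE. The -o option adds an outgroup to build a rooted tree. It can be supplied (1) during the run or (2) after the run has finished; either way the result is an unrooted tree that looks very much like a rooted one.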
nohup iqtree -s all.phy --threads-max 10 --mem 90% --prefix IQTREE -B 1000 --boot-trees -T AUTO --bnni -m LG+F+R4 &
IQ-TREE multicore version 2.0.3 for Linux 64-bit built Apr 26 2020
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.
Host: habri325 (AVX, 15 GB RAM)
Command: iqtree -s all.phy -m MF -cmax 15 -nt AUTO -ntmax 12
Seed: 744102 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Sat Nov 21 15:44:26 2020
Kernel: AVX - auto-detect threads (12 CPU cores detected)
Reading alignment file all.phy ... Phylip format detected
Alignment most likely contains protein sequences
Alignment has 8 sequences with 420934 columns, 61795 distinct patterns
67465 parsimony-informative, 83916 singleton sites, 269553 constant sites
Gap/Ambiguity Composition p-value
1 S.acidaminiphila 0.00% failed 0.00%
2 S.pictorum 0.00% failed 0.00%
3 S.nitritireducens 0.00% failed 0.00%
4 S.koreensis 0.00% failed 0.00%
5 S.humi 0.00% failed 0.00%
6 S.terrae 0.00% failed 0.00%
7 S.daejeonensis 0.00% failed 0.00%
8 S.tumulicola 0.00% failed 0.00%
**** TOTAL 0.00% 8 sequences failed composition chi2 test (p-value<5%; df=19)
NOTE: minimal branch length is reduced to 0.000000237567 for long alignment
NOTE: Restoring information from model checkpoint file all.phy.model.gz
CHECKPOINT: Initial tree restored
Measuring multi-threading efficiency up to 12 CPU cores
6 trees examined
Threads: 1 / Time: 13.217 sec / Speedup: 1.000 / Efficiency: 100% / LogL: -2929098
Threads: 2 / Time: 7.015 sec / Speedup: 1.884 / Efficiency: 94% / LogL: -2929098
Threads: 3 / Time: 4.894 sec / Speedup: 2.701 / Efficiency: 90% / LogL: -2929098
Threads: 4 / Time: 3.770 sec / Speedup: 3.506 / Efficiency: 88% / LogL: -2929098
Threads: 5 / Time: 3.109 sec / Speedup: 4.251 / Efficiency: 85% / LogL: -2929098
Threads: 6 / Time: 2.717 sec / Speedup: 4.865 / Efficiency: 81% / LogL: -2929098
Threads: 7 / Time: 3.689 sec / Speedup: 3.583 / Efficiency: 51% / LogL: -2929098
BEST NUMBER OF THREADS: 6
Perform fast likelihood tree search using LG+I+G model...
CHECKPOINT: Tree restored, LogL: -2871359.874
NOTE: ModelFinder requires 1146 MB RAM!
ModelFinder will test up to 756 protein models (sample size: 420934) ...
No. Model -LnL df AIC AICc BIC
1 LG 2971137.838 13 5942301.676 5942301.677 5942444.029
2 LG+I 2881791.465 14 5763610.930 5763610.931 5763764.233
3 LG+G4 2872276.674 14 5744581.348 5744581.349 5744734.651
4 LG+I+G4 2871305.049 15 5742640.098 5742640.099 5742804.351
5 LG+R2 2873005.079 15 5746040.158 5746040.159 5746204.411
6 LG+R3 2870678.753 17 5741391.506 5741391.507 5741577.660
7 LG+R4 2870611.477 19 5741260.954 5741260.956 5741469.008
8 LG+R5 2870604.991 21 5741251.981 5741251.984 5741481.936
25 LG+F+R4 2829569.034 38 5659214.069 5659214.076 5659630.178
43 WAG+R4 2871364.011 19 5742766.022 5742766.024 5742974.076
61 WAG+F+R4 2835224.156 38 5670524.312 5670524.319 5670940.421
79 JTT+R4 2875686.834 19 5751411.668 5751411.670 5751619.722
97 JTT+F+R4 2841354.810 38 5682785.620 5682785.627 5683201.729
115 JTTDCMut+R4 2875894.446 19 5751826.891 5751826.893 5752034.946
133 JTTDCMut+F+R4 2841090.966 38 5682257.933 5682257.940 5682674.042
151 DCMut+R4 2889976.695 19 5779991.389 5779991.391 5780199.444
169 DCMut+F+R4 2841790.114 38 5683656.228 5683656.235 5684072.337
187 VT+R4 2881648.434 19 5763334.868 5763334.870 5763542.922
205 VT+F+R4 2844131.422 38 5688338.844 5688338.851 5688754.953
223 PMB+R4 2902260.100 19 5804558.201 5804558.202 5804766.255
241 PMB+F+R4 2861418.816 38 5722913.632 5722913.639 5723329.740
259 Blosum62+R4 2903945.114 19 5807928.229 5807928.230 5808136.283
277 Blosum62+F+R4 2859138.231 38 5718352.462 5718352.469 5718768.571
295 Dayhoff+R4 2889826.740 19 5779691.480 5779691.481 5779899.534
313 Dayhoff+F+R4 2841662.701 38 5683401.402 5683401.409 5683817.511
331 mtREV+R4 3048557.789 19 6097153.578 6097153.580 6097361.633
349 mtREV+F+R4 2876055.838 38 5752187.676 5752187.683 5752603.785
367 mtART+R4 3100254.416 19 6200546.831 6200546.833 6200754.886
385 mtART+F+R4 2911425.836 38 5822927.671 5822927.678 5823343.780
403 mtZOA+R4 3014825.921 19 6029689.842 6029689.844 6029897.897
421 mtZOA+F+R4 2875761.982 38 5751599.964 5751599.971 5752016.072
439 mtMet+R4 3107742.091 19 6215522.183 6215522.185 6215730.237
457 mtMet+F+R4 2878591.450 38 5757258.899 5757258.906 5757675.008
475 mtVer+R4 3084101.909 19 6168241.817 6168241.819 6168449.872
493 mtVer+F+R4 2900814.652 38 5801705.303 5801705.310 5802121.412
511 mtInv+R4 3151604.022 19 6303246.044 6303246.046 6303454.098
529 mtInv+F+R4 2864072.476 38 5728220.951 5728220.958 5728637.060
547 mtMAM+R4 3093891.668 19 6187821.336 6187821.338 6188029.390
565 mtMAM+F+R4 2917600.974 38 5835277.948 5835277.955 5835694.057
583 HIVb+R4 2947877.165 19 5895792.331 5895792.333 5896000.385
601 HIVb+F+R4 2895358.762 38 5790793.524 5790793.531 5791209.633
619 HIVw+R4 3088006.815 19 6176051.630 6176051.632 6176259.685
637 HIVw+F+R4 2956556.097 38 5913188.193 5913188.200 5913604.302
655 FLU+R4 2968803.671 19 5937645.342 5937645.344 5937853.397
673 FLU+F+R4 2872026.423 38 5744128.846 5744128.853 5744544.955
691 rtREV+R4 2907635.553 19 5815309.107 5815309.109 5815517.161
709 rtREV+F+R4 2843280.734 38 5686637.469 5686637.476 5687053.577
727 cpREV+R4 2899268.272 19 5798574.545 5798574.546 5798782.599
745 cpREV+F+R4 2849736.637 38 5699549.275 5699549.282 5699965.384
Akaike Information Criterion: LG+F+R4
Corrected Akaike Information Criterion: LG+F+R4
Bayesian Information Criterion: LG+F+R4
Best-fit model: LG+F+R4 chosen according to BIC
All model information printed to all.phy.model.gz
CPU time for ModelFinder: 56290.041 seconds (15h:38m:10s)
Wall-clock time for ModelFinder: 9545.140 seconds (2h:39m:5s)
NOTE: 305 MB RAM (0 GB) is required!
Estimate model parameters (epsilon = 0.010)
1. Initial log-likelihood: -2829569.034
2. Current log-likelihood: -2829568.969
3. Current log-likelihood: -2829568.893
4. Current log-likelihood: -2829568.820
5. Current log-likelihood: -2829568.706
6. Current log-likelihood: -2829568.646
7. Current log-likelihood: -2829568.607
8. Current log-likelihood: -2829568.561
9. Current log-likelihood: -2829568.504
10. Current log-likelihood: -2829568.470
11. Current log-likelihood: -2829568.449
12. Current log-likelihood: -2829568.412
13. Current log-likelihood: -2829568.381
14. Current log-likelihood: -2829568.363
15. Current log-likelihood: -2829568.330
16. Current log-likelihood: -2829568.302
17. Current log-likelihood: -2829568.285
18. Current log-likelihood: -2829568.272
19. Current log-likelihood: -2829568.255
20. Current log-likelihood: -2829568.225
21. Current log-likelihood: -2829568.208
22. Current log-likelihood: -2829568.192
23. Current log-likelihood: -2829568.181
24. Current log-likelihood: -2829568.169
25. Current log-likelihood: -2829568.155
26. Current log-likelihood: -2829568.145
27. Current log-likelihood: -2829568.135
Optimal log-likelihood: -2829568.120
Site proportion and rates: (0.368,0.037) (0.287,0.230) (0.284,1.927) (0.060,6.187)
Parameters optimization took 27 rounds (44.638 sec)
BEST SCORE FOUND : -2829568.120
Total tree length: 0.989
Total number of iterations: 0
CPU time used for tree search: 0.000 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.000 sec (0h:0m:0s)
Total CPU time used: 265.375 sec (0h:4m:25s)
Total wall-clock time used: 44.742 sec (0h:0m:44s)
Analysis results written to:
IQ-TREE report: all.phy.iqtree
Tree used for ModelFinder: all.phy.treefile
Screen log file: all.phy.log
Date and Time: Sat Nov 21 18:24:17 2020
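The AIC, AICc and BIC columns in the ModelFinder table above can be reproduced from the -LnL and df columns with the standard formulas, using the 420934 alignment columns as the sample size. The sketch below is a worked check, not IQ-TREE's own code; the function name is illustrative. For plain LG the 13 free parameters are the 2 x 8 - 3 = 13 branch lengths of an unrooted tree with 8 taxa.

```python
import math


def information_criteria(neg_lnl, df, n_sites):
    """Return (AIC, AICc, BIC) from -lnL, the number of free parameters and sample size."""
    aic = 2 * df + 2 * neg_lnl
    aicc = aic + 2 * df * (df + 1) / (n_sites - df - 1)
    bic = df * math.log(n_sites) + 2 * neg_lnl
    return aic, aicc, bic


if __name__ == "__main__":
    # Values taken from the table rows for LG and LG+F+R4.
    for model, neg_lnl, df in [("LG", 2971137.838, 13), ("LG+F+R4", 2829569.034, 38)]:
        aic, aicc, bic = information_criteria(neg_lnl, df, 420934)
        print(f"{model:10s} AIC={aic:.3f} AICc={aicc:.3f} BIC={bic:.3f}")
```

Because the log prints -LnL to three decimals, the recomputed values can differ from the table in the last digit. BIC penalizes each extra parameter by ln(420934), roughly 12.95, instead of 2, yet LG+F+R4 still wins under all three criteria here.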
RAxML-NG v. 0.9.0 released on 20.05.2019 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml
RAxML-NG was called at 21-Nov-2020 20:43:09 as follows:
raxml-ng --all --msa all.phy --model LG+F+R4 --bs-trees 1000 --prefix all
Analysis options:
run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
start tree(s): random (10) + parsimony (10)
bootstrap replicates: 1000
random seed: 1605962589
tip-inner: OFF
pattern compression: ON
per-rate scalers: OFF
site repeats: ON
branch lengths: proportional (ML estimate, algorithm: NR-FAST)
SIMD kernels: AVX
parallelization: PTHREADS (6 threads), thread pinning: OFF
[00:00:00] Reading alignment from file: all.phy
[00:00:00] Loaded alignment with 8 taxa and 420934 sites
Alignment comprises 1 partitions and 61795 patterns
Partition 0: noname
Model: LG+FC+R4
Alignment sites / patterns: 420934 / 61795
Gaps: 0.00 %
Invariant sites: 64.04 %
NOTE: Binary MSA file already exists: all.raxml.rba
[00:00:00] Generating 10 random starting tree(s) with 8 taxa
[00:00:00] Generating 10 parsimony starting tree(s) with 8 taxa
[00:00:00] Data distribution: max. partitions/sites/weight per thread: 1 / 10300 / 824000
Starting ML tree search with 20 distinct starting trees
[00:15:09] ML tree search #1, logLikelihood: -2829562.100183
[00:31:09] ML tree search #2, logLikelihood: -2829561.100633
[00:49:39] ML tree search #3, logLikelihood: -2829560.619507
[01:11:31] ML tree search #4, logLikelihood: -2829560.290105
[01:23:49] ML tree search #5, logLikelihood: -2829560.131473
[01:45:53] ML tree search #6, logLikelihood: -2829560.292220
[02:05:08] ML tree search #7, logLikelihood: -2829560.418453
[02:17:25] ML tree search #8, logLikelihood: -2829560.457298
[02:34:13] ML tree search #9, logLikelihood: -2829560.632220
[02:49:21] ML tree search #10, logLikelihood: -2829560.813941
[03:01:08] ML tree search #11, logLikelihood: -2829561.097202
[03:17:09] ML tree search #12, logLikelihood: -2829561.012665
[03:31:23] ML tree search #13, logLikelihood: -2829560.630028
[03:46:48] ML tree search #14, logLikelihood: -2829561.110779
[04:03:49] ML tree search #15, logLikelihood: -2829561.122770
[04:15:38] ML tree search #16, logLikelihood: -2829561.052122
[04:29:16] ML tree search #17, logLikelihood: -2829560.673882
[04:45:12] ML tree search #18, logLikelihood: -2829561.212994
[04:49:47] ML tree search #19, logLikelihood: -2829572.187337
[05:03:54] ML tree search #20, logLikelihood: -2829560.698702
[05:03:54] ML tree search completed, best tree logLH: -2829560.131473
[05:03:54] Starting bootstrapping analysis with 1000 replicates.
[05:16:36] Bootstrap tree #1, logLikelihood: -2827652.629756
[05:29:50] Bootstrap tree #2, logLikelihood: -2828368.350447
[05:41:06] Bootstrap tree #3, logLikelihood: -2827059.810637
[05:55:49] Bootstrap tree #4, logLikelihood: -2821164.263529
[05:59:14] Bootstrap tree #5, logLikelihood: -2828186.516885
[06:07:33] Bootstrap tree #6, logLikelihood: -2832066.338951
[06:16:47] Bootstrap tree #7, logLikelihood: -2830408.948682
[06:35:45] Bootstrap tree #8, logLikelihood: -2826606.049605
[06:45:56] Bootstrap tree #9, logLikelihood: -2824189.220799
[06:58:09] Bootstrap tree #10, logLikelihood: -2833159.271852
[07:07:42] Bootstrap tree #11, logLikelihood: -2830956.390698
[07:16:48] Bootstrap tree #12, logLikelihood: -2831951.009978
[07:27:07] Bootstrap tree #13, logLikelihood: -2824713.889968
[07:41:53] Bootstrap tree #14, logLikelihood: -2832560.039860
[07:51:20] Bootstrap tree #15, logLikelihood: -2831965.682622
[08:02:11] Bootstrap tree #16, logLikelihood: -2821715.959956
[08:17:24] Bootstrap tree #17, logLikelihood: -2831571.075277
[08:27:58] Bootstrap tree #18, logLikelihood: -2826622.888111
[08:38:11] Bootstrap tree #19, logLikelihood: -2830657.381464
[08:48:49] Bootstrap tree #20, logLikelihood: -2832895.960579
[08:58:08] Bootstrap tree #21, logLikelihood: -2829531.033391
[09:07:18] Bootstrap tree #22, logLikelihood: -2836531.631576
[09:19:59] Bootstrap tree #23, logLikelihood: -2824095.302481
[09:27:56] Bootstrap tree #24, logLikelihood: -2828452.211052
[09:37:03] Bootstrap tree #25, logLikelihood: -2830375.502659
[09:47:29] Bootstrap tree #26, logLikelihood: -2829193.600428
[09:59:53] Bootstrap tree #27, logLikelihood: -2828623.158640
[10:15:07] Bootstrap tree #28, logLikelihood: -2824712.105612
[10:35:06] Bootstrap tree #29, logLikelihood: -2827839.057275
[10:46:49] Bootstrap tree #30, logLikelihood: -2826220.188318
[11:01:57] Bootstrap tree #31, logLikelihood: -2825971.189481
[11:14:23] Bootstrap tree #32, logLikelihood: -2828886.396345
[11:29:28] Bootstrap tree #33, logLikelihood: -2829393.489736
[11:42:29] Bootstrap tree #34, logLikelihood: -2832751.181423
[11:56:18] Bootstrap tree #35, logLikelihood: -2821253.655157
[12:06:40] Bootstrap tree #36, logLikelihood: -2833806.546337
[12:16:56] Bootstrap tree #37, logLikelihood: -2825327.136226
[12:36:54] Bootstrap tree #38, logLikelihood: -2826549.141106
[12:49:32] Bootstrap tree #39, logLikelihood: -2822592.653724
[12:58:19] Bootstrap tree #40, logLikelihood: -2830434.203491
[13:05:10] Bootstrap tree #41, logLikelihood: -2828413.981928
[13:15:35] Bootstrap tree #42, logLikelihood: -2827830.227101
[13:28:44] Bootstrap tree #43, logLikelihood: -2825499.957294
[13:41:27] Bootstrap tree #44, logLikelihood: -2826461.011279
[14:00:23] Bootstrap tree #45, logLikelihood: -2824301.157453
[14:13:55] Bootstrap tree #46, logLikelihood: -2824617.927624
[14:23:55] Bootstrap tree #47, logLikelihood: -2831936.384558
[14:35:13] Bootstrap tree #48, logLikelihood: -2824255.779613
[14:47:43] Bootstrap tree #49, logLikelihood: -2831075.671365
[14:57:34] Bootstrap tree #50, logLikelihood: -2830255.666543
[15:09:35] Bootstrap tree #51, logLikelihood: -2831726.309364
[15:23:10] Bootstrap tree #52, logLikelihood: -2824781.251735
[15:39:06] Bootstrap tree #53, logLikelihood:
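The elapsed-time stamps in the RAxML-NG log above are enough to estimate how long the full bootstrap would have taken. The following sketch extrapolates linearly from the first and last "Bootstrap tree" lines it finds; it assumes the replicates keep running at the same average pace, and the function name `bootstrap_eta_hours` is illustrative.

```python
import re

# Matches lines such as "[05:16:36] Bootstrap tree #1, logLikelihood: ..."
STAMP = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\] Bootstrap tree #(\d+),", re.MULTILINE)


def bootstrap_eta_hours(log_text, total_replicates):
    """Extrapolate the total bootstrap time in hours from elapsed-time stamps."""
    points = [
        (int(n), int(h) * 3600 + int(m) * 60 + int(s))
        for h, m, s, n in STAMP.findall(log_text)
    ]
    if len(points) < 2:
        raise ValueError("need at least two bootstrap lines")
    (first_n, first_t), (last_n, last_t) = points[0], points[-1]
    seconds_per_replicate = (last_t - first_t) / (last_n - first_n)
    return seconds_per_replicate * total_replicates / 3600
```

In the log shown here the replicates take roughly 12 minutes each on 6 threads, so 1000 replicates work out to about 200 hours, which is consistent with the run still being unfinished after a week.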