Megahit
Many assemblers are available; three are introduced below:
MEGAHIT download
https://github.com/voutcn/megahit
git clone https://github.com/voutcn/megahit.git
cd megahit
make
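The plain `make` above matches older MEGAHIT releases. The log further down reports v1.2.8, and to the best of my knowledge the 1.2.x series builds with CMake, so here is a hedged sketch of that route. The function name `build_megahit` is made up for illustration, `-j4` is an arbitrary choice, and the steps should be checked against the README of whichever version you clone.

```shell
# Sketch of a CMake-based build for MEGAHIT 1.2.x (assumed steps; verify
# against the upstream README). Nothing runs until the function is called.
build_megahit() (
    set -e                                 # stop at the first failing step
    git clone https://github.com/voutcn/megahit.git
    cd megahit
    git submodule update --init            # fetch bundled third-party code
    mkdir -p build
    cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release    # configure an optimized build
    make -j4                               # -j4 is an arbitrary job count
)
```

Calling `build_megahit` in an empty working directory would run the whole sequence; the function body is wrapped in a subshell so the `cd` calls do not change your current directory.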
Download addresses for the other two assemblers
SOAPdenovo download
http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/
metaSPAdes download
http://spades.bioinf.spbau.ru/release3.11.0/
Download the evaluation tool QUAST
git clone https://github.com/ablab/quast.git -b release_4.5
export PYTHONPATH=$(pwd)/quast/libs/
Test data
cd megahit/
Bacterial_F1_1.pe.fq
Bacterial_F1_2.pe.fq
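Before assembling, it can be worth confirming that the two mate files contain the same number of reads. The sketch below assumes the common FASTQ layout of four lines per record; `count_reads` and `check_pairs` are hypothetical helper names, not part of MEGAHIT.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Report both counts and succeed only when the two mate files agree.
check_pairs() (
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
)
```

For example, `check_pairs Bacterial_F1_1.pe.fq Bacterial_F1_2.pe.fq && echo OK` prints the two counts and then `OK` only if they match.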
Start the assembly
megahit -1 Bacterial_F1_1.pe.fq -2 Bacterial_F1_2.pe.fq -o combined
2019-09-26 16:31:55 - MEGAHIT v1.2.8
2019-09-26 16:31:55 - Using megahit_core with POPCNT and BMI2 support
2019-09-26 16:31:55 - Convert reads to binary library
2019-09-26 16:31:55 - INFO sequence/io/sequence_lib.cpp : 77 - Lib 0 (/home/ZQK/Data/megahit_data/Bacterial_F1_1.pe.fq,/home/ZQK/Data/megahit_data/Bacterial_F1_2.pe.fq): pe, 160126 reads, 250 max length
2019-09-26 16:31:55 - INFO utils/utils.h : 152 - Real: 0.3188 user: 0.2668 sys: 0.0560 maxrss: 22892
2019-09-26 16:31:55 - k-max reset to: 141
2019-09-26 16:31:55 - Start assembly. Number of CPU threads 56
2019-09-26 16:31:55 - k list: 21,29,39,59,79,99,119,141
2019-09-26 16:31:55 - Memory used: 304044370329
2019-09-26 16:31:55 - Extract solid (k+1)-mers for k = 21
2019-09-26 16:31:56 - Build graph for k = 21
2019-09-26 16:31:57 - Assemble contigs from SdBG for k = 21
2019-09-26 16:32:00 - Local assembly for k = 21
2019-09-26 16:32:00 - Extract iterative edges from k = 21 to 29
2019-09-26 16:32:01 - Build graph for k = 29
2019-09-26 16:32:01 - Assemble contigs from SdBG for k = 29
2019-09-26 16:32:02 - Local assembly for k = 29
2019-09-26 16:32:02 - Extract iterative edges from k = 29 to 39
2019-09-26 16:32:02 - Build graph for k = 39
2019-09-26 16:32:02 - Assemble contigs from SdBG for k = 39
2019-09-26 16:32:03 - Local assembly for k = 39
2019-09-26 16:32:03 - Extract iterative edges from k = 39 to 59
2019-09-26 16:32:03 - Build graph for k = 59
2019-09-26 16:32:04 - Assemble contigs from SdBG for k = 59
2019-09-26 16:32:04 - Local assembly for k = 59
2019-09-26 16:32:04 - Extract iterative edges from k = 59 to 79
2019-09-26 16:32:05 - Build graph for k = 79
2019-09-26 16:32:05 - Assemble contigs from SdBG for k = 79
2019-09-26 16:32:05 - Local assembly for k = 79
2019-09-26 16:32:06 - Extract iterative edges from k = 79 to 99
2019-09-26 16:32:06 - Build graph for k = 99
2019-09-26 16:32:06 - Assemble contigs from SdBG for k = 99
2019-09-26 16:32:07 - Local assembly for k = 99
2019-09-26 16:32:07 - Extract iterative edges from k = 99 to 119
2019-09-26 16:32:07 - Build graph for k = 119
2019-09-26 16:32:08 - Assemble contigs from SdBG for k = 119
2019-09-26 16:32:08 - Local assembly for k = 119
2019-09-26 16:32:09 - Extract iterative edges from k = 119 to 141
2019-09-26 16:32:09 - Build graph for k = 141
2019-09-26 16:32:09 - Assemble contigs from SdBG for k = 141
2019-09-26 16:32:10 - Merging to output final contigs
2019-09-26 16:32:10 - 177 contigs, total 70612 bp, min 200 bp, max 470 bp, avg 398 bp, N50 445 bp
2019-09-26 16:32:10 - ALL DONE. Time elapsed: 15.146550 seconds
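For ease of demonstration, the test files contain only a small part of the original data. The original author's run took 15 min, while on my server it took only 4 min. The original data were analyzed with three mainstream assemblers to compare the running time and memory consumed.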
View the results
less combined/final.contigs.fa
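Paging through the FASTA is fine for a quick look, but a one-line summary is easier to compare against the last lines of the MEGAHIT log. The following is a minimal sketch; `fasta_summary` is a hypothetical helper, and it only assumes a standard FASTA file in which header lines start with `>`.

```shell
# Print contig count, total bp, and min/max contig length for a FASTA file.
# Sequences wrapped over several lines are handled.
fasta_summary() {
    awk '
        function record(l) {
            n++; total += l
            if (n == 1 || l < min) min = l
            if (l > max) max = l
        }
        /^>/ { if (seen) record(len); seen = 1; len = 0; next }
             { len += length($0) }
        END  {
            if (seen) record(len)
            printf "%d contigs, total %d bp, min %d bp, max %d bp\n", n, total, min, max
        }
    ' "$1"
}
```

Usage: `fasta_summary combined/final.contigs.fa`. The numbers should agree with the contig summary line that MEGAHIT prints just before "ALL DONE".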
Evaluate the assembly
Run QUAST
cd assembly
mkdir quast-evaluation
cd quast-evaluation
ln -fs ../combined/final.contigs.fa megahit.contigs.fa
../../quast/quast.py megahit.contigs.fa -o megahit-report
cat megahit-report/report.txt
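The report includes N50 and N75. As a cross-check, Nx can be computed by hand: sort the contig lengths from longest to shortest, accumulate them, and report the length at which the running sum first reaches x% of the total. The sketch below does this with `awk` and `sort`; `fasta_nx` is a hypothetical helper. Note that QUAST by default bases these statistics on contigs of at least 500 bp, so pass 500 as the third argument when comparing.

```shell
# Print Nx for a FASTA file.
#   $1 = FASTA file, $2 = x (default 50), $3 = minimum contig length (default 0)
fasta_nx() (
    fasta=$1
    x=${2:-50}
    min_len=${3:-0}
    # Step 1: emit one length per contig, skipping contigs below min_len.
    awk -v min_len="$min_len" '
        /^>/ { if (seen && len >= min_len) print len; seen = 1; len = 0; next }
             { len += length($0) }
        END  { if (seen && len >= min_len) print len }
    ' "$fasta" |
    # Step 2: longest first.
    sort -rn |
    # Step 3: walk down until the running sum reaches x% of the total.
    awk -v x="$x" '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                acc += lens[i]
                if (acc * 100 >= total * x) { print lens[i]; exit }
            }
        }
    '
)
```

Usage: `fasta_nx megahit.contigs.fa 50 500` for N50 and `fasta_nx megahit.contigs.fa 75 500` for N75.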
Download the metaSPAdes assembly, evaluate it, and compare
curl -LO https://osf.io/h29jk/download
mv download metaspades.contigs.fa.gz
gunzip metaspades.contigs.fa.gz
../../quast/quast.py metaspades.contigs.fa -o metaspades-report
cat metaspades-report/report.txt
# look at the two reports in parallel
paste *report/report.txt
The results are as follows:
Assembly                    megahit.contigs   metaspades.contigs
# contigs (>= 0 bp)         7904              4112
# contigs (>= 1000 bp)      2763              1843
# contigs (>= 5000 bp)      582               583
# contigs (>= 10000 bp)     191               244
# contigs (>= 25000 bp)     18                43
# contigs (>= 50000 bp)     2                 17
Total length (>= 0 bp)      13222363          12090326
Total length (>= 1000 bp)   11149439          11320830
Total length (>= 5000 bp)   5893043           7955570
Total length (>= 10000 bp)  3186708           5596677
Total length (>= 25000 bp)  663719            2500084
Total length (>= 50000 bp)  112488            1603525
# contigs                   3847              2280
Largest contig              61397             261464
Total length                11895322          11615922
GC (%)                      46.29             46.27
N50                         4924              9303
N75                         2524              3937
L50                         594               266
L75                         1455              754
# N's per 100 kbp           0.00              0.00
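When only a few rows matter, they can be pulled out of each report instead of reading the full table. This is a rough sketch that assumes the layout shown above (metric name first, value last on the line) and therefore only works for single-word metrics such as N50, N75, L50 and L75; `report_metric` is a hypothetical helper.

```shell
# Print "<report file> <metric> <value>" (tab-separated) for one single-word
# metric, from every report file given after the metric name.
report_metric() (
    metric=$1
    shift
    awk -v m="$metric" '$1 == m { print FILENAME "\t" m "\t" $NF }' "$@"
)
```

Usage: `report_metric N50 megahit-report/report.txt metaspades-report/report.txt`.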
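N50 and N75 are both better in the metaSPAdes result. If you have the computing resources and are not short of time, metaSPAdes is recommended. If you have no server with terabyte-scale memory and the project schedule is tight, go straight to MEGAHIT to get results.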