The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies
Aleksey V. Zimin, Steven L. Salzberg
Abstract:
The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8–15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.