Improving quality of high-throughput sequencing reads


5.3.4 Evaluation Results for PacBio Error Correction Tools

Because PacBio reads have a high error rate, error correction outputs can still contain many uncorrected bases.

Therefore, most PacBio error correction tools generate two types of reads:

(1) trimmed reads that only contain corrected regions in input reads and

(2) untrimmed reads that include both corrected and uncorrected regions in input reads.

While PBcR produces only trimmed reads, LSC and Proovread generate both trimmed and untrimmed reads, which were assessed separately.

For LoRDEC, trimmed reads were generated from the untrimmed reads using lordec-trim-split, which is included in the LoRDEC package.
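To make the trimmed/untrimmed distinction concrete, the following is a minimal sketch of splitting one untrimmed read into its corrected regions. It assumes a convention in which corrected bases are written in uppercase and uncorrected bases in lowercase; that convention, the function name trim_split, and the min_len parameter are illustrative assumptions, not the actual implementation of lordec-trim-split.

```python
import re


def trim_split(read: str, min_len: int = 1) -> list[str]:
    """Return the maximal uppercase runs of a read.

    Assumption: uppercase = corrected base, lowercase = uncorrected base.
    Runs shorter than min_len are discarded.
    """
    return [m.group(0) for m in re.finditer(r"[A-Z]+", read)
            if len(m.group(0)) >= min_len]


# One untrimmed read yields two trimmed reads here.
untrimmed = "acgtACGTTGCAggtTTGACCAac"
print(trim_split(untrimmed))  # ['ACGTTGCA', 'TTGACCA']
```

Note that one untrimmed read can yield several shorter trimmed reads, which is why trimming tends to raise accuracy while lowering read length statistics.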

Accuracy of PacBio Error Correction Tools

Figure 5.5A compares the percentage similarity of the outputs of the PacBio error correction methods for P1. The percentage similarity of the input reads was 76.6 percent before error correction, and every output improved on this number. Among the four tools, all except LSC achieved a percentage similarity of over 95 percent for the trimmed reads. For the untrimmed reads, LoRDEC and Proovread generated more accurate reads than LSC. Except for the untrimmed LoRDEC reads, Illumina read coverage had almost no impact on percentage similarity.
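As a rough illustration of the metric, percentage similarity can be sketched as the share of alignment columns in which the read and the reference carry the same base. This excerpt does not define the metric, so the formula below (matches divided by alignment columns, with gaps counted as differences) and the function name are assumptions.

```python
def percent_similarity(aligned_read: str, aligned_ref: str) -> float:
    """Percentage of alignment columns where read and reference match.

    Both inputs are rows of one pairwise alignment, with '-' for gaps.
    Assumed definition: 100 * matching columns / total columns.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_read:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_read, aligned_ref)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_read)


# 10 columns: 8 matches, 1 deletion in the read, 1 insertion in the read.
print(percent_similarity("ACG-TTAGCA", "ACGATT-GCA"))  # 80.0
```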

Figure 5.5B and Figure 5.5C show the read coverage and NG50 of the outputs of the compared tools. The two charts have similar shapes, and the values tended to be high where the percentage similarity in Figure 5.5A was low. The trimmed LoRDEC reads and the PBcR outputs improved substantially as Illumina read coverage increased. The trimmed reads from Proovread also improved, but the values saturated at 30X coverage.
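The two length-based metrics can be sketched as follows. The code uses the common definitions: read coverage as total read bases divided by genome length, and NG50 as the length of the read at which the cumulative length of reads, sorted from longest to shortest, first reaches half of the genome length. Whether the evaluation counted all bases or only aligned bases is not stated in this excerpt, so treat these definitions and the function names as assumptions.

```python
def read_coverage(read_lengths: list[int], genome_length: int) -> float:
    """Total read bases divided by genome length (assumed definition)."""
    return sum(read_lengths) / genome_length


def ng50(read_lengths: list[int], genome_length: int) -> int:
    """Length L such that reads of length >= L cover half the genome.

    Returns 0 when the reads together cover less than half the genome.
    """
    half = genome_length / 2
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


genome = 400
whole = [100, 80, 60, 40, 20]
split = [50, 50, 40, 40, 30, 30, 20, 20, 10, 10]  # same bases, shorter reads

print(read_coverage(whole, genome), ng50(whole, genome))  # 0.75 60
print(read_coverage(split, genome), ng50(split, genome))  # 0.75 30
```

In this toy example, splitting reads leaves coverage unchanged but halves NG50, so the two metrics do not always move together.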

Figure 5.6 compares percentage similarity, read coverage, and NG50 for P2-40X and P2-40X-EF, the error-free version of P2-40X. Both the trimmed Proovread reads and the trimmed LoRDEC reads showed high percentage similarity. The percentage similarity and read coverage of the untrimmed Proovread reads were almost the same as those of the trimmed Proovread reads. However, the NG50 of the trimmed Proovread reads was shorter than that of the untrimmed Proovread reads. LoRDEC generated trimmed reads with high percentage similarity, but it removed so many bases that the read coverage and NG50 of the read set became much lower than those of the original input reads.

For all three metrics, P2-40X-EF made no meaningful difference compared with P2-40X. This indicates that sequencing errors in Illumina reads matter little when Illumina read coverage is about 40X.

Alignment Results for PacBio Error Correction Tools

We aligned the input PacBio reads and their error correction results using BWA with the "-x pacbio" option and evaluated the alignment results. Before error correction, over 95 percent of the P1 PacBio reads and over 98 percent of the P2 PacBio reads could already be aligned to the reference sequences, so the alignment rate did not improve much after error correction.

Figure 5.7 shows the ratio of the number of reads that were aligned without any mismatches or indels to the total number of corrected reads. The ratio was 0 for both P1 and P2 before error correction, and some error correction methods improved it substantially. For P1, over 50 percent of the trimmed reads from PBcR and Proovread could be aligned to the reference sequence without any differences. Proovread also showed a good result for P2, whereas PBcR generated much worse results for P2 than for P1. The ratio for the LSC trimmed reads for P1 was 0.3 percent, and no untrimmed LSC read could be aligned to the reference sequence without differences. Among the untrimmed corrected reads, the reads from Proovread had the best quality: 4.3 percent and 14.5 percent of them could be aligned without mismatches or indels for P1 and P2, respectively.
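One way to compute such a ratio from SAM output is sketched below. It treats a read as perfectly aligned when it is mapped, its CIGAR string contains only M or = operations (no indels and no clipping), and its NM tag is 0. Treating soft-clipped reads as imperfect, skipping secondary and supplementary records, and requiring an NM tag are assumptions of this sketch; the excerpt does not say how the evaluation handled these cases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_perfect(sam_line: str) -> bool:
    """True if the record is mapped with no mismatches, indels, or clipping."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":       # unmapped read
        return False
    ops = {op for _, op in CIGAR_OP.findall(cigar)}
    if not ops <= {"M", "="}:            # any indel or clipping disqualifies
        return False
    return "NM:i:0" in fields[11:]       # edit distance must be zero


def perfect_ratio(sam_lines) -> float:
    """Perfectly aligned reads divided by all reads (primary records only)."""
    total = perfect = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):   # blank or header line
            continue
        if int(line.split("\t")[1]) & 0x900:  # secondary or supplementary
            continue
        total += 1
        perfect += is_perfect(line)
    return perfect / total if total else 0.0


records = [
    "@SQ\tSN:chr1\tLN:100",
    "r1\t0\tchr1\t1\t60\t8M\t*\t0\t0\tACGTACGT\t*\tNM:i:0",
    "r2\t0\tchr1\t5\t60\t4M1I3M\t*\t0\t0\tACGTAACG\t*\tNM:i:1",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "r4\t0\tchr1\t9\t60\t2S6M\t*\t0\t0\tACGTACGT\t*\tNM:i:0",
]
print(perfect_ratio(records))  # 0.25
```

Unmapped reads stay in the denominator, matching the text's "total number of corrected reads".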

Memory Usage and Runtime of PacBio Error Correction Tools

Memory usage of the PacBio error correction methods is summarized in Figure 5.8A. LoRDEC was the most memory-efficient method and could correct all the reads with under 1 GB of memory. Memory usage of LSC was sensitive to Illumina read coverage: correcting P1-40X required twice as much memory as correcting P1-20X. PBcR corrected errors with relatively little memory for P1, but its memory usage increased fourfold from P1 to P2. Memory usage of Proovread was constant for all the inputs because Proovread splits PacBio reads into small chunks (20 MB in the experiments).

Runtime of the tools is shown in Figure 5.8B. LoRDEC was much faster than the others, and the difference grew as the genome size and Illumina read coverage increased. Runtime of LSC was moderate for P1, but it could not finish error correction for P2 even when allowed 40 times its P1 runtime. Runtime of PBcR was sensitive to both genome length and Illumina read coverage. Proovread was the slowest of the assessed tools for P1, but it was less sensitive to genome size than PBcR and became the second fastest for P2.

References

http://xueshu.baidu.com/usercenter/paper/show?paperid=8524fa6c3ae1a8b4c629611b77a17501
