0066 - [Data QC] - An Analysis of the Sources of Duplication in High-Throughput Sequencing Output

I. What Are Duplicated Reads

1. When we talk about duplicated reads in NGS data, the intuitive understanding is usually this: duplicated reads arise during NGS library construction when PCR over-amplification causes the same template DNA fragment to be sequenced repeatedly, yielding identical reads.

2. But this does not hold up to scrutiny. Think about it carefully and it becomes puzzling.
Isn't PCR precisely a tool for producing copies? Otherwise it wouldn't be called PCR. Apart from PCR-free library preparation, most NGS library construction protocols include a PCR step. Does that mean all of those NGS datasets are flawed?

That cannot be. Perhaps:
PCR can produce duplicate sequences, but it cannot produce extra ones out of thin air. Suppose a genome has two fragments, A and B. If PCR yields 1000 A and 1000 B, that is correct; if it yields 1000 A + 1000 A + 1000 B, is the extra 1000 A the duplicated data? But how could PCR conjure up 1000 extra reads of fragment A from nowhere? What would have to go wrong with PCR to produce such a result?

PCR does not behave that way. Perhaps:
After PCR, A and B yield 1500 A and 1000 B, and the extra 500 A are the duplicated data? But isn't that just what everyone calls PCR bias?

So what exactly is "over-amplification"?

3. The strict definition is this:
Duplicated reads are the result of PCR making multiple identical copies of the same molecule.
The criterion for identifying such identical copies is: the reads have the same start position, the same end position, and the same base sequence between them (call it the "three-sames" rule). If any one of the three — start, end, or the sequence in between — differs, the reads come from different molecules and are called unique reads.
The fraction of copied molecules among all molecules is the duplication rate: duplication rate = 1 - unique reads / total reads.
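The "three-sames" rule and the rate formula above can be sketched in a few lines of Python (the coordinates and sequences below are made-up illustrations):

```python
def duplication_rate(reads):
    """reads: iterable of (start, end, sequence) tuples.
    Two reads are duplicates when start, end, AND the sequence
    in between are all identical (the "three-sames" rule)."""
    total = 0
    unique = set()
    for start, end, seq in reads:
        total += 1
        unique.add((start, end, seq))
    # duplication rate = 1 - unique reads / total reads
    return 1 - len(unique) / total if total else 0.0

reads = [
    (100, 250, "ACGT"),  # fragment A
    (100, 250, "ACGT"),  # PCR copy of A -> counted as duplicate
    (100, 250, "ACGA"),  # same coordinates, one base differs -> unique
    (300, 450, "TTGC"),  # fragment B
]
print(duplication_rate(reads))  # 1 - 3/4 = 0.25
```

Deduplication then simply keeps one read per `(start, end, sequence)` key and discards the rest.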

4. PCR exists precisely to make identical copies of DNA fragments. For the ideal NGS analysis, should we really remove the sequencing data of every daughter strand produced by PCR — cancel out the effect of PCR entirely and restore the data to a PCR-free state?

Yes.
Suppose a genome has two fragments, A and B. No matter how many reads PCR produces — say n copies of A and m copies of B — the analysis keeps only 1 A and 1 B (the unique reads) for assembly and discards the other (n-1) A and (m-1) B. Those (n-1) A and (m-1) B are treated as duplicated reads, even though they are the normal products of a normal PCR.

Therefore,
the current algorithms are really a simplified treatment: they remove all duplicated reads and keep only the fully non-redundant ones. The algorithms cannot distinguish "false duplicates" (artificial bias toward repeated sequences) from "true duplicates" (naturally occurring identical fragments).

Therefore,
for NGS data, duplicated data is a bioinformatics concept, not a molecular biology one; it is defined by convention, not generated naturally by the biochemical reactions of library construction and high-throughput sequencing.

II. Factors That Affect the Duplication Rate

1. Diversity of template molecule species (complexity).
With the same number of cycles and the same amplification efficiency, the more diverse the template molecules at the start of PCR, the fewer identical copies there are at the end, the lower their proportion, and the lower the dup rate. For NGS library construction it is best to keep the number of PCR cycles at six or fewer, to ensure that the PCR product retains sufficient complexity.

2. Diversity of base composition of the template molecules (complexity).
Different base compositions amplify with different ease. Molecules that amplify easily will dominate the sequencing data.

3. Ligation efficiency.
With the same molecular diversity and the same PCR conditions, the higher the template-to-adapter ligation efficiency during library construction, the lower the dup rate of the NGS data.

4. Fragment length and randomness of fragmentation.
Random shearing by sonication or enzymatic fragmentation exists precisely to obtain molecular diversity. The key word is random: DNA fragments produced by one or more restriction endonucleases are less diverse than randomly sheared fragments.
Fragment length must also be appropriate. The shorter the fragments, the easier they are to amplify, which exacerbates PCR bias, lowers the complexity of the PCR product, and ultimately raises the dup rate.

5. Stringency of bead washing.
Bead capture of DNA is essentially charge-based adsorption. Different DNA sequences have different charge densities. If the washing conditions are not stringent, bead capture becomes biased, which means reduced molecular diversity and thus affects the dup rate.
The same reasoning applies to membrane-based capture.

6. Effectiveness of blocking during probe hybridization.
If repetitive sequences such as LINEs and Alu elements are not effectively blocked during probe hybridization, the dup rate will inevitably rise and the yield of usable data will fall.
If the adaptors are not effectively blocked, the proportion of off-target data will inevitably rise. Per unit mass of DNA, the on-target fraction shrinks, so low-abundance molecules within the on-target fraction are more likely to be depleted or lost from the sequencing data; that is, the molecular diversity of the on-target fraction decreases, which affects the dup rate.

7. Cluster PCR.
Besides library PCR, cluster generation on the flow cell is also a PCR process, and one that is easy to overlook.
Objection: one template molecule forms only one cluster through cluster PCR and yields only one read. Cluster PCR does not increase the number of reads, so it does not affect the dup rate.
Answer: if cluster PCR causes clusters to be lost, it does affect the dup rate. Low-abundance molecules may fail to form clusters at all, reducing the number of unique molecules and thereby affecting the dup rate.
An appropriate cluster density not only yields the best data output but also a lower dup rate. On both the ILMN and PGM platforms we want clusters to be monoclonal; polyclonal clusters, and clusters that overlap one another, are filtered out by the base-calling software. The direct consequence of an excessive cluster density is therefore lower data output, lower cluster diversity across the whole flow cell, and a higher dup rate. The first step of cluster generation is the hybridization of template DNA molecules to the oligos on the flow cell, which is a random event. The greater the diversity and complexity of the template molecules, and the more uniform their proportions, the lower the dup rate. The ideal extreme is that every cluster is a monoclonal copy of a single template molecule, which gives the best possible dup rate.
Note: on Illumina platforms before the HiSeq X Ten, cluster generation on the flow cell was mutually exclusive: clusters could grow right next to each other but would not overlap. As long as the optical detection system had sufficient resolution, cluster signals would not overlap. On platforms after the HiSeq X Ten the flow cell is patterned with wells and clusters grow inside them; I would welcome expert comment on the likelihood of polyclonal clusters in that design.

8. Poor reagent quality.
For example, a bad batch of SBS sequencing reagent can push the dup rate of a WES run up to 30%.

9. The dup rate has the weakest relationship with probes.
The extreme example is amplicon (PCR product) sequencing: the dup rate can be very high, yet no probes are involved at all. During probe hybridization, the factor that most affects molecular diversity is the ratio of probe molecules to target molecules, followed by hybridization time. The probe-to-target ratio should be at least 100:1 (within a given volume; there is a volume requirement); above this ratio the probes can capture the target molecules. In current probe-based NGS this ratio is very high, and the probes are highly redundant. The goal of hybridization capture is maximal molecular inclusiveness: to capture as many sequences that differ from the reference as possible. Only by increasing inclusiveness can we cover the widest range of variant types; hence the probes should be long and the hybridization time long. NGS hybridization is not about specificity but about higher yield — about capturing more "inaccuracy". After all, the more "accurate" the capture, the more the sequencing data matches the reference (that is, the probe sequence), in which case there would be no point in sequencing and nothing new to discover. Therefore a short hybridization time cannot accommodate the maximal range of sequences, reduces molecular diversity, and in turn affects the dup rate.

III. Template Molecular Diversity Is Paramount
In short, the dup rate is inversely related to the diversity of the template molecules; every step and factor that affects molecular diversity affects the dup rate.

Beyond the factors above, the nature of the sample also affects template diversity. It is common knowledge that FFPE samples have a high dup rate. Another example is single-cell sequencing: some regions of single-cell DNA simply never show up in the sequencing results. Single-cell whole-genome sequencing achieves 80%–90% coverage, whereas bulk (multi-cell) whole-genome sequencing can exceed 99%, and the reason is the difference in molecular diversity. In bulk sequencing, if a region fails to amplify in one cell, it may amplify in another; a single cell has only two chances, and once they are lost, they are gone.

Other factors include the quality of the template DNA and the habits of the bench scientist. For example, to obtain 200 ng of template DNA, one person pipettes 0.1 uL while another pipettes 5 uL; the molecular diversity of these two samples is not the same.

Reference:
http://blog.sina.com.cn/s/blog_8de3399d0102wy3f.html



FastQC official documentation

Reference (official site):
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html

Duplicate Sequences

Summary

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).

This module counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication.

[Figure: FastQC Sequence Duplication Levels plot]

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information in the final plot any sequences with more than 10 duplicates are placed into grouped bins to give a clear impression of the overall duplication level without having to show each individual duplication value.

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.
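The counting strategy described above can be sketched in Python. This is a simplified illustration, not FastQC's actual Java implementation; the 75 bp/50 bp truncation rule and the 100,000-sequence tracking limit are taken from the text.

```python
from collections import Counter

def fastqc_style_duplication(seqs, track_limit=100_000):
    """Count duplication levels the way the text describes:
    only sequences first seen among the first `track_limit` reads
    are tracked, reads longer than 75 bp are truncated to 50 bp,
    and every later occurrence of a tracked key is still counted."""
    counts = Counter()
    for i, seq in enumerate(seqs):
        key = seq[:50] if len(seq) > 75 else seq
        if key in counts or i < track_limit:
            counts[key] += 1
    return counts

# Tiny demonstration with an artificially small tracking limit:
counts = fastqc_style_duplication(["AAAA", "AAAA", "CCCC", "AAAA"], track_limit=2)
print(counts)  # Counter({'AAAA': 3}) -- "CCCC" first appears after the limit
```

Binning the resulting counts by duplication level (1x, 2x, ... >10x) gives the data behind the blue and red lines in the plot.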

The plot shows the proportion of the library which is made up of sequences in each of the different duplication level bins. There are two lines on the plot. The blue line takes the full sequence set and shows how its duplication levels are distributed. In the red plot the sequences are de-duplicated and the proportions shown are the proportions of the deduplicated set which come from different duplication levels in the original data.

In a properly diverse library most sequences should fall into the far left of the plot in both the red and blue lines. A general level of enrichment, indicating broad oversequencing in the library will tend to flatten the lines, lowering the low end and generally raising other categories. More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot. These high duplication peaks will most often appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up an insignificant proportion of the deduplicated set. If peaks persist in the blue trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication.

The module also calculates an expected overall loss of sequence were the library to be deduplicated. This headline figure is shown at the top of the plot and gives a reasonable impression of the potential overall level of loss.

Warning

This module will issue a warning if non-unique sequences make up more than 20% of the total.

Failure

This module will issue an error if non-unique sequences make up more than 50% of the total.
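The two thresholds above amount to a simple three-way flag on the non-unique fraction of the library (a trivial sketch, not FastQC's own code):

```python
def duplication_status(non_unique_fraction):
    """FastQC-style flags: warn above 20% non-unique, fail above 50%."""
    if non_unique_fraction > 0.50:
        return "fail"
    if non_unique_fraction > 0.20:
        return "warn"
    return "pass"

print(duplication_status(0.15))  # pass
print(duplication_status(0.35))  # warn
print(duplication_status(0.60))  # fail
```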

Common reasons for warnings

The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module.

In general there are two potential types of duplicate in a library, technical duplicates arising from PCR artefacts, or biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected. From a sequence level there is no way to distinguish between these two types and both will be reported as duplicates here.

A warning or error in this module is simply a statement that you have exhausted the diversity in at least part of your library and are re-sequencing the same sequences. In a supposedly diverse library this would suggest that the diversity has been partially or completely exhausted and that you are therefore wasting sequencing capacity. However in some library types you will naturally tend to over-sequence parts of the library and therefore generate duplication and will therefore expect to see warnings or error from this module.

In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create a large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.
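The random-barcoding idea mentioned above (unique molecular identifiers, UMIs) can be sketched as follows: reads with identical coordinates are treated as technical (PCR) duplicates only when they also carry the same UMI, while identical coordinates with different UMIs are kept as distinct source molecules. This is a minimal illustration; real UMI deduplicators also handle sequencing errors within the UMI itself.

```python
def dedup_with_umi(reads):
    """reads: iterable of (start, end, umi, sequence) tuples.
    Keep one read per (start, end, umi) key: same coordinates but a
    different UMI means a different original molecule, so it is kept."""
    seen = set()
    kept = []
    for start, end, umi, seq in reads:
        key = (start, end, umi)
        if key not in seen:
            seen.add(key)
            kept.append((start, end, umi, seq))
    return kept

reads = [
    (100, 250, "ACG", "AAAA"),  # molecule 1
    (100, 250, "ACG", "AAAA"),  # PCR copy of molecule 1 -> removed
    (100, 250, "TTG", "AAAA"),  # same position, different UMI -> kept
]
print(len(dedup_with_umi(reads)))  # 2
```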
