Analysis of Commonly Flagged Modules
Per Base Sequence Content
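- Distribution of the four bases (A, T, G, C) at each read position. FastQC counts the bases at every position across all reads; in a normal library the four lines run roughly parallel and close together. A bias in part of the plot suggests overrepresented sequences.
- For example, slight differences in base frequencies over the first 10 positions may indicate contamination [open question: what causes the base proportions to fluctuate at the start of the read?].
- If at any position the difference between A and T, or between G and C, exceeds 10%, the module reports "WARN"; above 20% it reports "FAIL". AT content is usually higher than GC, with A and T at about 28% each and G and C at about 22% each, although the exact values vary by organism.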
- It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
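The per-position check described above can be sketched in a few lines of Python. This is a minimal illustration, not FastQC's own code: the function names `per_base_content` and `flag_per_base_content` and the toy reads are made up for this example, and the 10%/20% thresholds are the ones quoted in the notes.

```python
from collections import Counter

BASES = "ACGT"

def per_base_content(reads):
    """Return one dict per read position with the percentage of A, C, G and T."""
    max_len = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(max_len)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if base in BASES:  # ignore N and other ambiguity codes
                counts[pos][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        table.append({b: 100.0 * c[b] / total if total else 0.0 for b in BASES})
    return table

def flag_per_base_content(table, warn=10.0, fail=20.0):
    """Apply the WARN/FAIL thresholds to the largest A-T or G-C gap at any position."""
    worst = max(
        (max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"])) for p in table),
        default=0.0,
    )
    if worst > fail:
        return "FAIL"
    if worst > warn:
        return "WARN"
    return "PASS"

# Toy reads (hypothetical data): every position holds one of each base.
reads = ["ACGT", "CGTA", "GTAC", "TACG"]
print(flag_per_base_content(per_base_content(reads)))
```

On real RNA-Seq data the first positions would typically push `worst` over the threshold, which is the random-hexamer effect described above rather than a sign of a damaged library.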
Error Reasons
- Overrepresented sequences
- Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 12 bp of each run. This is due to a biased selection of random primers, but doesn't represent any individually biased sequences. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn't seem to adversely affect the ability to measure expression.
- Biased composition libraries
Per Sequence GC Content Error
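- Deviations in the shape of the curve are usually caused by library contamination or by a biased subset of reads (overrepresented reads).
- A shape that is close to normal but shifted away from the theoretical distribution suggests a systematic bias. The module reports "WARN" when more than 15% of reads deviate from the theoretical distribution and "FAIL" when more than 30% do.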
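To make the 15%/30% rule concrete, here is a rough Python sketch of one way to measure how far an observed GC distribution sits from a normal curve. It rests on assumptions that the notes do not confirm: the theoretical curve is centred on the observed mode, its spread is estimated around that mode, and the deviation is the summed absolute difference per 1% GC bin. The function names are hypothetical.

```python
import math
from collections import Counter

def gc_percent(read):
    """GC content of one read, rounded to a whole percent (0..100)."""
    read = read.upper()
    gc = sum(1 for base in read if base in "GC")
    return round(100.0 * gc / len(read))

def gc_deviation_percent(reads):
    """Summed |observed - expected| over GC bins, as a percentage of all reads."""
    observed = Counter(gc_percent(r) for r in reads if r)
    total = sum(observed.values())
    if total == 0:
        return 0.0
    # Assumption: centre the reference curve on the most common GC value.
    mode = max(observed, key=observed.get)
    variance = sum(n * (gc - mode) ** 2 for gc, n in observed.items()) / total
    if variance == 0:
        return 0.0  # every read has the same GC content
    sd = math.sqrt(variance)
    deviation = 0.0
    for gc in range(101):
        density = math.exp(-0.5 * ((gc - mode) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        deviation += abs(observed.get(gc, 0) - total * density)
    return 100.0 * deviation / total

def flag_gc(deviation, warn=15.0, fail=30.0):
    """Apply the WARN/FAIL thresholds quoted in the notes."""
    if deviation > fail:
        return "FAIL"
    if deviation > warn:
        return "WARN"
    return "PASS"

# Toy input (hypothetical): two sharply separated GC populations.
reads = ["AAAA"] * 5 + ["GGGG"] * 5
print(flag_gc(gc_deviation_percent(reads)))
```

A bimodal input like this one is the pattern expected from a contaminated library: a second population of reads with its own GC content cannot be covered by a single normal curve.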
Sequence Duplication Levels
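- The duplication level, i.e. how many copies of each sequence are present.
- The deeper the sequencing, the more likely some duplication becomes, and this is normal. A very high level of duplication, however, suggests that a bias may be present.
- The x-axis is the duplication level (number of copies) and the y-axis is the number of duplicated reads, with the total number of unique reads taken as 100%.
- Raw data files are large and counting every read would be very slow, so FastQC takes the first 200,000 reads of the FASTQ file (recent FastQC releases document 100,000) and counts how often each of them occurs in the whole data set. Reads with 10 or more duplicates are pooled into a single bin, so the curve may rise slightly at the far right [?]. When non-unique reads make up more than 20% of the total, the module reports "WARN"; above 50% it reports "FAIL".
- This module helps judge library complexity: too many PCR cycles or too little starting template will both reduce the complexity of the library.
- A large number of duplicated sequences, i.e. low library complexity, may be related to overexpression of a particular gene.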
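The counting scheme above can be sketched as follows. This is a simplified illustration, not FastQC's implementation. The function names are hypothetical, the 200,000 limit and the 75 bp to 50 bp truncation are taken from these notes, and the "non-unique" percentage is assumed here to mean the share of tracked reads that deduplication would remove.

```python
from collections import Counter

def duplication_levels(reads, tracked_limit=200_000, long_read=75, keep=50):
    """Count copies of sequences first seen within the first `tracked_limit` reads."""
    counts = Counter()
    for index, read in enumerate(reads):
        # Long reads are truncated so that errors near the 3' end
        # do not hide true duplicates.
        key = read[:keep] if len(read) > long_read else read
        if key in counts:
            counts[key] += 1          # keep counting to the end of the file
        elif index < tracked_limit:
            counts[key] = 1           # new sequences are only added early on
    # Histogram: duplication level -> number of distinct sequences; 10 means "10 or more".
    levels = Counter(min(n, 10) for n in counts.values())
    total_reads = sum(counts.values())
    distinct = len(counts)
    # Assumption: "non-unique" = reads beyond the first copy of each sequence.
    pct = 100.0 * (total_reads - distinct) / total_reads if total_reads else 0.0
    return dict(levels), pct

def flag_duplication(pct, warn=20.0, fail=50.0):
    """Apply the WARN/FAIL thresholds quoted in the notes."""
    if pct > fail:
        return "FAIL"
    if pct > warn:
        return "WARN"
    return "PASS"

# Toy reads (hypothetical data)
levels, pct = duplication_levels(["AAAA", "AAAA", "CCCC", "GGGG"])
print(levels, pct, flag_duplication(pct))
```

Pooling everything at 10 or more copies into one bin is what can make the right-hand end of the plot rise: that last bin collects all highly duplicated sequences, however many copies they have.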
Overrepresented sequences
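- Overrepresented sequences: the number of times a single sequence is repeated. A transcriptome contains a very large number of transcripts, so a single sequence, however abundant, is unlikely to account for even a small fraction (say 1%) of the whole. When it does, either that transcript is expressed at an extremely high level or the sample is contaminated. Read sequences that exceed 1% of the whole transcriptome are listed.
- A sequence that occurs in large numbers is called over-represented. FastQC's criterion is more than 0.1% of all reads (this triggers "WARN"; a sequence above 1% triggers "FAIL"). As in the duplication analysis, only the first 200,000 reads of the FASTQ file are tracked to keep the computation manageable, so an over-represented read may be missed if it does not appear among them. Reads longer than 75 bp are also truncated to 50 bp. If a contaminant file is supplied on the command line with -c contaminant_file, each over-represented sequence is searched against that file for a matching hit (at least 20 bp long with at most one mismatch).
- The table shows the sequence of reads that are at least 20 bp long and make up more than 0.1% of the total, which helps identify contamination (e.g. vector or adapter sequences).
- If the GC content distribution plot fails, this table helps trace the source: known vectors or adapters are listed by name; if the sequence is not recognised, copy it and run BLAST.
- If BLAST shows that the sequence is a gene, that supports the hypothesis of gene overexpression.
- Data from Illumina NovaSeq and NextSeq tend to contain polyG sequences, because these two platforms use two fluorescent signals and the absence of signal is called as G. Use fastp to trim polyG tails during quality filtering.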
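The checks in this section can be approximated with a short Python sketch. It is illustrative only. The function names are hypothetical, the thresholds (0.1%, first 200,000 reads, 75 bp to 50 bp truncation, 20 bp with at most one mismatch) are the ones quoted in these notes, and the contaminant search is a brute-force simplification rather than FastQC's own matching routine. The polyG helper only measures a trailing run of G; actual trimming is left to fastp, as recommended above.

```python
from collections import Counter

def overrepresented(reads, threshold_pct=0.1, tracked_limit=200_000):
    """List (sequence, count, percent of all reads) above the threshold."""
    counts = Counter()
    total = 0
    for index, read in enumerate(reads):
        total += 1
        key = read[:50] if len(read) > 75 else read  # truncate long reads
        if key in counts:
            counts[key] += 1
        elif index < tracked_limit:
            counts[key] = 1  # only sequences seen early are tracked
    return [
        (seq, n, 100.0 * n / total)
        for seq, n in counts.most_common()
        if 100.0 * n / total > threshold_pct
    ]

def matches_contaminant(seq, contaminant, min_len=20, max_mismatch=1):
    """True if any min_len window of seq matches the contaminant with few mismatches."""
    for i in range(len(seq) - min_len + 1):
        window = seq[i:i + min_len]
        for j in range(len(contaminant) - min_len + 1):
            target = contaminant[j:j + min_len]
            mismatches = sum(1 for a, b in zip(window, target) if a != b)
            if mismatches <= max_mismatch:
                return True
    return False

def poly_g_tail_length(read):
    """Length of the run of G at the 3' end of a read."""
    return len(read) - len(read.rstrip("G"))

# Toy data (hypothetical sequences, not real adapters)
reads = ["CCCC"] * 997 + ["AAAA"] * 3
for seq, n, pct in overrepresented(reads):
    print(seq, n, round(pct, 2))
print(poly_g_tail_length("ACGTGGGG"))
```

Sequences reported by `overrepresented` that do not match any known contaminant are the ones worth copying into BLAST, as described in the notes.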