1。为什么明明全部是151M,mapping quality却为0?
以序列“A00928:207:HYLCHDSXY:2:1466:11415:24330”为例,进行分析。
(base) [xxzhang@mu02 V1C]$ grep "A00928:207:HYLCHDSXY:2:1466:11415:24330" result_T45.sam
A00928:207:HYLCHDSXY:2:1466:11415:24330 99 chr16 29568232 0 151M = 29568468 387 GCCTTGGTTGTCTCTTGCATAAAATGAGGATGATACCTATCTAACATGTAAGGTTGTTGTGAGGCTTAAATAAGATAATGAATTATAAAGAAACACTTAGTTTGCAGGGTATGTTAAAAGTTATTATGACCATAATACAAAATAAACTCAG FFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFF:FFF:FFFF:FFFFFFFFF,FFFFF:FFFFF NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:151 XA:Z:chr16,+30307936,151M,0;chr16,+21486536,151M,3;chr16,+21918022,151M,3;chr16,-22452551,151M,3;chr16,+18898941,151M,3;
******************************
A00928:207:HYLCHDSXY:2:1466:11415:24330 147 chr16 29568468 0 151M = 29568232 -387 TTAAATACTTCACTGAAGAAAGCAGGTAATCTGTGGCTACCAACAATGTGTTTCCAACTAGGAAGCATCTGATTTCAAACTAAACTGTAAAACCAGGAGCAGCGTCCCCATTGGTTAGAAATAAGGTTTCTAAATCCCCTAAGTTATGAAC FFFF:,FFFFFF:F:FFFFFFFFFFFFFFFFFFFF,FFFFFFF:FFFFFFFFFFFFFF:F,FFF:F,FFFFFFFFFFFFFFFFFFFFFFFF,FFFF:F,:FFF,FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFF:FF:FFFFFFFFFF NM:i:1 MD:Z:103T47 MC:Z:151M AS:i:146 XS:i:146 XA:Z:chr16,-30308172,151M,1;chr16,-21918260,151M,3;chr16,-21486774,151M,3;chr16,+22452313,151M,4;
从序列比对结果中可以看到,这段序列可以比对到基因组的多个位置,且mapping的效果都是151M。
最终,我们的比对软件,随机的选择其中的一个作为比对结果最好的区域,并且在结果中只显示了这个比对结果。
以上是,重复序列比对的一种情况。这个也是自己之前没整明白,但是现在整明白的地方。
处理方法:========================> 根据mapping quality 去除。(在htseq-count 计数的时候有这样的一个参数)
可以看一下,最终的结果中在chr16 29568468 位点是否有reads存在?
A00928:207:HYLCHDSXY:2:1109:18114:23719 83 chr17 43278459 0 151M = 43278196 -414 GAGCCGGGGCTCCGCTGGGGGCGACTTCCTTGTTCGTATCGAGCCAGCGAAAAGACAGAACCGGAAGAGACCGGGGGCGAAGGCGACAGGGGTCTGTGGAAGAGACCTGTCGGCGGAGAGCGGTCCACGTTTTCCTGGAGAAAGACGAGGC FFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF,FFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:151
A00928:207:HYLCHDSXY:2:1109:18114:23719 163 chr17 43278196 0 151M = 43278459 414 GTCCCACTCTCATCCACATTCAAGTCGCGGTGAGAGCCCCAGCCTCGCTCCTTGCCCCATTCCCTCTGTCTCGTCCACAGCGCTATTGACGCCCTTACACTCTCGGGCTGATTTCTTATTCTCCGCCTTTGAAAAGGGAAATCTTACACCC FFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:151
这里并没有显示其他的匹配区域,为什么mapping质量依然很低呢?
我们来看一下,这个optional field的主要的注释说明。
MD
2。默认的输出结果中,是否有secondary以及supplementary reads,如果有,怎样的表现方式?
A00928:207:HYLCHDSXY:2:1110:24542:35383 2163 chr18 61893866 60 75H76M = 61894387 554 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTCCCGGGACCACCTCCCCCGTGCCCCGCCAGAC FFFFFFFFFF,FFFFFFFFFFFFFFFFFFF::FFFFFFFFFFF:FFFF:FFFFFFFFFFFFFF,FFFFFFFFFF,F NM:i:0 MD:Z:76 MC:Z:43H108M AS:i:76 XS:i:0 SA:Z:chr18,61894387,+,74S77M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 2179 chr18 61893866 60 106H45M = 61894387 522 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTC FFFFFFFFF:FFF:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFF NM:i:0 MD:Z:45 MC:Z:74H77M AS:i:45 XS:i:0 SA:Z:chr18,61894387,-,43S108M,60,0;
可以看到,一些supplementary reads 在剪切首尾后,能够很好的匹配,目前怀疑这一段是未去除接头的比对结果。
>(base) [xxzhang@mu02 V1C]$ grep "A00928:207:HYLCHDSXY:2:1110:24542:35383" result_T45.sam
A00928:207:HYLCHDSXY:2:1110:24542:35383 99 chr18 61894387 60 74S77M = 61894387 108 GTCTGGCGGGGCACGGGGGAGGTGGTCCCGGGAGTTGCGTGTCTCCTGAGAGCTGGCGGAGTTGGGCGGGCGGCCGTTCGCGGTCGGACACCGCCGCCAGGCGCTCGGCACCGAGCGAGAGCTTTTTGTGTCACTGAGACTCCGGCACCAA F,FFFFFFFFFF,FFFFFFFFFFFFFF:FFFF:FFFFFFFFFFF::FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF,:FFFFFF,F:FFF:FFFFFFF:FFFFFF::FFF,F:::FFF,,FFFFFFFF:F,FFFF,FFFFFFF,FFF, NM:i:0 MD:Z:77 MC:Z:43S108M AS:i:77 XS:i:0 SA:Z:chr18,61893866,-,75S76M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 2163 chr18 61893866 60 75H76M = 61894387 554 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTCCCGGGACCACCTCCCCCGTGCCCCGCCAGAC FFFFFFFFFF,FFFFFFFFFFFFFFFFFFF::FFFFFFFFFFF:FFFF:FFFFFFFFFFFFFF,FFFFFFFFFF,F NM:i:0 MD:Z:76 MC:Z:43H108M AS:i:76 XS:i:0 SA:Z:chr18,61894387,+,74S77M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 147 chr18 61894387 60 43S108M = 61894387 -108 GAGTTGCGTGTCTCCTGAGAGCTGGCGGAGTTGGGCGGGCGGCCGTTCGCGGTCGGACACCGCCGCCAGGCGCTCGGCACCGAGCGAGAGCTTTTTGTGTCACTGAGACTCCGGCACCAAAGGGGGTGGGGGGAGAGGCTGGGCTCCGCTC FFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:FFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF NM:i:0 MD:Z:108 MC:Z:74S77M AS:i:108 XS:i:0 SA:Z:chr18,61893866,+,106S45M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 2179 chr18 61893866 60 106H45M = 61894387 522 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTC FFFFFFFFF:FFF:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFF NM:i:0 MD:Z:45 MC:Z:74H77M AS:i:45 XS:i:0 SA:Z:chr18,61894387,-,43S108M,60,0;
如果去除接头之后,是否会改变这条序列的比对情况呢?
不是,这个supplementary序列并没有消失。
(base) [xxzhang@mu02 V1C]$ grep "A00928:207:HYLCHDSXY:2:1110:24542:35383" result_trim.sam
A00928:207:HYLCHDSXY:2:1110:24542:35383 99 chr18 61894387 60 74S77M = 61894387 108 GTCTGGCGGGGCACGGGGGAGGTGGTCCCGGGAGTTGCGTGTCTCCTGAGAGCTGGCGGAGTTGGGCGGGCGGCCGTTCGCGGTCGGACACCGCCGCCAGGCGCTCGGCACCGAGCGAGAGCTTTTTGTGTCACTGAGACTCCGGCACCAA F,FFFFFFFFFF,FFFFFFFFFFFFFF:FFFF:FFFFFFFFFFF::FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF,:FFFFFF,F:FFF:FFFFFFF:FFFFFF::FFF,F:::FFF,,FFFFFFFF:F,FFFF,FFFFFFF,FFF, NM:i:0 MD:Z:77 MC:Z:43S108M AS:i:77 XS:i:0 SA:Z:chr18,61893866,-,75S76M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 2163 chr18 61893866 60 75H76M = 61894387 554 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTCCCGGGACCACCTCCCCCGTGCCCCGCCAGAC FFFFFFFFFF,FFFFFFFFFFFFFFFFFFF::FFFFFFFFFFF:FFFF:FFFFFFFFFFFFFF,FFFFFFFFFF,F NM:i:0 MD:Z:76 MC:Z:43H108M AS:i:76 XS:i:0 SA:Z:chr18,61894387,+,74S77M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 147 chr18 61894387 60 43S108M = 61894387 -108 GAGTTGCGTGTCTCCTGAGAGCTGGCGGAGTTGGGCGGGCGGCCGTTCGCGGTCGGACACCGCCGCCAGGCGCTCGGCACCGAGCGAGAGCTTTTTGTGTCACTGAGACTCCGGCACCAAAGGGGGTGGGGGGAGAGGCTGGGCTCCGCTC FFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:FFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF NM:i:0 MD:Z:108 MC:Z:74S77M AS:i:108 XS:i:0 SA:Z:chr18,61893866,+,106S45M,60,0;
A00928:207:HYLCHDSXY:2:1110:24542:35383 2179 chr18 61893866 60 106H45M = 61894387 522 CGGCCGCCCGCCCAACTCCGCCAGCTCTCAGGAGACACGCAACTC FFFFFFFFF:FFF:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFF NM:i:0 MD:Z:45 MC:Z:74H77M AS:i:45 XS:i:0 SA:Z:chr18,61894387,-,43S108M,60,0;
处理方法 ========== > 这里给我一个提示就是说,mapping的时候不考虑 二次匹配或者切断后匹配的reads,虽然他们的mapping质量有时候确实很高。
3。所以,目前考虑的处理策略是比对的时候,去除低mapping质量以及secondary、supplementary的reads。
但是,需要注意的是,我们去除的这些低质量的reads,可能就是有“信号”的重复序列的reads。
htseq-count -f bam -t CDS -s no --secondary-alignments ignore --supplementary-alignments ignore -a 60 result_trim.sort.bam /home/xxzhang/workplace/RepeatAnnoation/fastq/V1C/vertify_test/repeatfamily_c6.gtf >result_trim_2.txt