用pysam实现在bam 比对文件中获取某一段区间内的unique的mapping数目,代码如下:
def getting_unique_mapping_in_region(bam_file, chr, start, end):
import pysam
sam = pysam.AlignmentFile(bam_file, 'rb')
region_set =set()
counter = 0
for read in sam.fetch(chr, start, end):
region_set.add(read.query_name)
counter = counter +1
print '{} unique reads in [{}, {},{}]'.format(len(region_set), chr, start, end)
print '{} alignment in [{},{},{}]'.format(counter, chr, start, end)
此外,还可以从全基因组上计算非重叠区间的unique mapping的 reads数目,编程小窍门如下:
1. 采用3个for循环,第一个是对总的染色体数目进行循环,第二获取每个染色体上的区间,第三,对每个区间进行数目的统计,获取区间的时候注意,最后一个区间的长度,start 是上面一个区间的位置-1, stop 是染色体的总长度。 必须要注意了,否则,pysam.fetch() 的时候报错。
def CountNonOverMappingIntervalUniqueReads(bam_file, bin_size):
import pysam
samfile = pysam.AlignmentFile(bam_file, "rb")
print samfile.references
print samfile.lengths
window = bin_size
print "ref\tstart\tstop\tuniq_reads"
for i in range(len(samfile.references)):
refname = samfile.references[i]
seqlen = samfile.lengths[i]
for j in range(1, seqlen, window):
if j+window-1 < samfile.lengths[i]:
stop = j + window - 1
else:
stop = samfile.lengths[i]
region_set = set()
for read in samfile.fetch(refname, j, stop):
region_set.add(read.query_name)
print("{}\t{}\t{}\t{}".format(refname, j, stop, len(region_set)))
参考如下: