Converting read_counts to FPKM (from a GTF file and a read_counts file) (exon-based)

离子回旋


Article link: https://blog.csdn.net/qq_26012913/article/details/110205935

A newer version of this post is available at https://blog.csdn.net/qq_26012913/article/details/111939262?spm=1001.2014.3001.5501

First, we extract the exon records from the GTF file:

awk -F'\t' '$3 == "exon"' genome.gtf > genome_exon.gtf

Next, we sum the lengths of each gene's exons in genome_exon.gtf, which gives the gene length we need:

python count_genelen_from_gft.py genome_exon.gtf gene.len

The code of count_genelen_from_gft.py is as follows:

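import sys

gtf_path = sys.argv[1]  # exon-only GTF file
out_path = sys.argv[2]  # output: gene ID <TAB> summed exon length

current_gene = None
exon_lengths = []
with open(gtf_path, 'r') as gtf, open(out_path, 'w') as out:
    for line in gtf:
        if line.startswith('#') or not line.strip():
            continue
        # The gene ID is taken to be the last double-quoted attribute value on the line
        gene = line.split('"')[-2]
        fields = line.split('\t')
        length = abs(int(fields[4]) - int(fields[3])) + 1
        # Records of one gene must be contiguous; a new ID closes the previous gene
        if current_gene is not None and gene != current_gene:
            out.write("{0}\t{1}\n".format(current_gene, sum(exon_lengths)))
            exon_lengths = []
        current_gene = gene
        exon_lengths.append(length)
    # Write the last gene, which would otherwise be dropped from the output
    if current_gene is not None:
        out.write("{0}\t{1}\n".format(current_gene, sum(exon_lengths)))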

This gives us the length of each gene, stored in the file gene.len, e.g.:

MIM04M24Gene00599	2898
MIM04M24Gene00600	1035
MIM04M24Gene08324	588
MIM04M24Gene08325	468
MIM04M26Gene00001	1770
MIM04M26Gene00002	930
MIM04M26Gene00003	594
MIM04M26Gene00004	426
MIM04M26Gene00005	1002
MIM04M26Gene00006	792
MIM04M26Gene00007	1125
MIM04M26Gene00008	4041
MIM04M26Gene00009	6537
MIM04M26Gene00010	309
MIM04M26Gene00011	1293
MIM04M26Gene00012	282
MIM04M26Gene00013	765
MIM04M26Gene00014	1680
MIM04M26Gene00015	1134
MIM04M26Gene00016	648

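One caveat: count_genelen_from_gft.py adds up every exon record of a gene, so if the GTF annotates several transcripts per gene, exons shared between transcripts are counted more than once. The following is a minimal sketch of an alternative (not the method used in this post) that merges overlapping exons first. The helper name `union_length` is hypothetical, and the intervals are assumed to be 1-based inclusive (start, end) pairs as in GTF.

```python
def union_length(intervals):
    """Number of bases covered by 1-based inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Two overlapping exons plus one separate exon (made-up coordinates)
exons = [(100, 200), (150, 250), (400, 450)]
print(union_length(exons))                 # merged length
print(sum(e - s + 1 for s, e in exons))    # plain sum, as in the script
```

With these made-up coordinates the merged length is 202 bases, while the plain sum is 253. Which definition of gene length is appropriate depends on the annotation and on how the read counts were produced.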
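We also need to prepare a file with the number of mapped reads for each sample (called mapped_gene_number.txt in the commands in this post). Its contents look like this: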

Total Mapped reads      reads number
A1A     18836863
A1B     15478037
A1C     19394549
A2A     19976617
A2B     15964986
A2C     19685810
A3A     18080220
A3B     16627794
A3C     20205794
A4A     16867356
A4B     16409921
A4C     19966924
A5A     17322230
A5B     15118648
A5C     19086094
A6A     17352130
A6B     16489332
A6C     19940296

Next, here is my read_counts matrix file, which I named raw_counts.matrix.
Example contents:

        A1A     A1B     A1C     A2A     A2B     A2C     A3A     A3B     A3C     A4A     A4B     A4C     A5A     A5B     A5C     A6A     A6B     A6C
MIM04M24Gene00599       334     179     300     532     261     376     238     284     312     306     191     260     105     187     191     204     177
MIM04M24Gene00600       98      58      80      134     84      122     44      47      65      20      23      27      9       16      16      51      12
MIM04M24Gene08324       13      7       16      19      11      16      15      12      30      19      16      16      11      8       15      29      16
MIM04M24Gene08325       18      18      13      18      21      25      37      30      45      26      32      36      23      22      28      56      31
MIM04M26Gene00001       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
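Caculate_FPKM.py pairs the mapped-read totals with the matrix columns purely by position, so the two files must list the samples in the same order. Below is a minimal sketch of a sanity check for this. It assumes both files are tab-separated and that the first line of the mapped-reads file is a header, as in the examples above; `check_sample_order` is a hypothetical helper, not part of the original scripts.

```python
def check_sample_order(reads_lines, matrix_header):
    """Return True if sample names appear in the same order in both inputs."""
    reads_samples = []
    for line in reads_lines[1:]:          # assumption: first line is a header
        line = line.strip()
        if line:
            reads_samples.append(line.split("\t")[0])
    # The matrix header starts with an empty cell above the gene ID column
    matrix_samples = matrix_header.strip().split("\t")
    return reads_samples == matrix_samples


reads_lines = [
    "Total Mapped reads\treads number",
    "A1A\t18836863",
    "A1B\t15478037",
]
header = "\tA1A\tA1B"
print(check_sample_order(reads_lines, header))
```

In practice the two arguments would come from reading mapped_gene_number.txt and the first line of raw_counts.matrix.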

With these three files as input, a script can produce the FPKM matrix:

python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

The contents of Caculate_FPKM.py are pasted below:

import sys

reads_path, len_path, counts_path, out_path = sys.argv[1:5]

# Mapped reads per sample, in the same order as the matrix columns.
# Sample lines start with "A" in this dataset; the header line is skipped.
mapped_reads = []
with open(reads_path, 'r') as f:
    for line in f:
        if line.startswith('A'):
            mapped_reads.append(int(line.rstrip('\n').split('\t')[1]))

# Gene ID -> summed exon length in bp
gene_len = {}
with open(len_path, 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        gene_len[fields[0]] = int(fields[1])

with open(counts_path, 'r') as f, open(out_path, 'w') as out:
    for line in f:
        line = line.rstrip('\n')
        # Gene rows start with "M" in this dataset; anything else is the header
        if not line.startswith('M'):
            out.write(line + '\n')
            continue
        fields = line.split('\t')
        gene, counts = fields[0], fields[1:]
        if gene not in gene_len:
            continue  # no length available, so FPKM cannot be computed
        fpkm = []
        for count, total in zip(counts, mapped_reads):
            try:
                # FPKM = count * 1e6 / (mapped reads * gene length in kb)
                fpkm.append(int(count) * 1000000.0 / (total * (gene_len[gene] / 1000.0)))
            except ZeroDivisionError:
                fpkm.append(0)
        out.write(gene + '\t' + '\t'.join(str(v) for v in fpkm) + '\n')
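As a quick check of the formula used in the script, here is a small worked example with values taken from the sample files above: gene MIM04M24Gene00599 (length 2898 in gene.len) in sample A1A (read count 334, mapped reads 18836863). The function name `fpkm` is only for illustration and does not appear in the original script.

```python
def fpkm(count, mapped_reads, gene_length_bp):
    """FPKM = count * 1e6 / (mapped reads * gene length in kb)."""
    if mapped_reads == 0 or gene_length_bp == 0:
        return 0.0  # mirrors the script's handling of division by zero
    return count * 1e6 / (mapped_reads * (gene_length_bp / 1000.0))


value = fpkm(334, 18836863, 2898)
print(round(value, 3))
```

The result is about 6.12, which is what the first cell of the MIM04M24Gene00599 row in FPKM.matrix should contain.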

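Finally, here is a complete, foolproof pipeline: just put the GTF file, the mapped_reads file, the read_counts file, and the two Python scripts in one directory and run it.

The full script is as follows: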

awk -F'\t' '$3 == "exon"' genome.gtf > genome_exon.gtf
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

I hope this helps. If you run into problems, you can email me at 1193226980@qq.com
