shell + Python3 | 解析理解 gencode gtf 基因组注释文件

9 篇文章 0 订阅
5 篇文章 0 订阅

怎么理解基因组注释文件(gtf)的内容?

1.输入文件:GRCh38.p13.gtf

  • 905M,1995529行=1,995,529行。
$ ls -lth *gtf
-rw-r--r--. 1 wangjl jinlab 905M May  1  2023 GRCh38.p13.gtf

$ wc GRCh38.p13.gtf
  1995529  82989334 948602107 GRCh38.p13.gtf

$ head GRCh38.p13.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 42 (Ensembl 108)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-07-20
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

2. 第三列是基因组位置: exon/CDS/UTR等

(1) 按频率统计location

$ grep -v '^#' GRCh38.p13.gtf | awk '{print $3}' | sort | uniq -c | sort -k 1nr
 852230 exon
 653960 CDS
 180584 UTR
 117681 transcript
  64207 start_codon
  64065 stop_codon
  62696 gene
    101 Selenocysteine

(2) 查看文件第一个基因

$ grep -v '^#' GRCh38.p13.gtf | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | head -n 20
chr1	11869	14409	+	gene	"DDX11L2";	2;
chr1	11869	14409	+	transcript	"lncRNA";	"DDX11L2";
chr1	11869	12227	+	exon	"lncRNA";	"DDX11L2";
chr1	12613	12721	+	exon	"lncRNA";	"DDX11L2";
chr1	13221	14409	+	exon	"lncRNA";	"DDX11L2";
chr1	12010	13670	+	gene	"DDX11L1";	2;
chr1	12010	13670	+	transcript	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	12010	12057	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	12179	12227	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	12613	12697	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	12975	13052	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	13221	13374	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	13453	13670	+	exon	"transcribed_unprocessed_pseudogene";	"DDX11L1";
chr1	14404	29570	-	gene	"WASH7P";	2;
chr1	14404	29570	-	transcript	"unprocessed_pseudogene";	"WASH7P";
chr1	29534	29570	-	exon	"unprocessed_pseudogene";	"WASH7P";
chr1	24738	24891	-	exon	"unprocessed_pseudogene";	"WASH7P";
chr1	18268	18366	-	exon	"unprocessed_pseudogene";	"WASH7P";
chr1	17915	18061	-	exon	"unprocessed_pseudogene";	"WASH7P";
chr1	17606	17742	-	exon	"unprocessed_pseudogene";	"WASH7P";

可见,对于基因 DDX11L1

  • 第一级是 gene
  • 第二级是它的两个转录本 transcript
  • 第三级是该转录本的外显子

(3) 找一个蛋白编码基因:+链上的 CD5

$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD5\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}'
chr11	61102489	61127852	+	gene	"CD5";	2;
chr11	61102489	61127852	+	transcript	"protein_coding";	"CD5";
chr11	61102489	61102615	+	exon	"protein_coding";	"CD5";
chr11	61102561	61102615	+	CDS	"protein_coding";	"CD5";
chr11	61102561	61102563	+	start_codon	"protein_coding";	"CD5";
chr11	61115056	61115094	+	exon	"protein_coding";	"CD5";
chr11	61115056	61115094	+	CDS	"protein_coding";	"CD5";
chr11	61118175	61118480	+	exon	"protein_coding";	"CD5";
chr11	61118175	61118480	+	CDS	"protein_coding";	"CD5";
chr11	61118915	61118977	+	exon	"protein_coding";	"CD5";
chr11	61118915	61118977	+	CDS	"protein_coding";	"CD5";
chr11	61119234	61119575	+	exon	"protein_coding";	"CD5";
chr11	61119234	61119575	+	CDS	"protein_coding";	"CD5";
chr11	61121611	61121904	+	exon	"protein_coding";	"CD5";
chr11	61121611	61121904	+	CDS	"protein_coding";	"CD5";
chr11	61122907	61123032	+	exon	"protein_coding";	"CD5";
chr11	61122907	61123032	+	CDS	"protein_coding";	"CD5";
chr11	61123884	61123937	+	exon	"protein_coding";	"CD5";
chr11	61123884	61123937	+	CDS	"protein_coding";	"CD5";
chr11	61125032	61125151	+	exon	"protein_coding";	"CD5";
chr11	61125032	61125151	+	CDS	"protein_coding";	"CD5";
chr11	61125751	61125841	+	exon	"protein_coding";	"CD5";
chr11	61125751	61125836	+	CDS	"protein_coding";	"CD5";
chr11	61125837	61125839	+	stop_codon	"protein_coding";	"CD5";
chr11	61126288	61127852	+	exon	"protein_coding";	"CD5";
chr11	61102489	61102560	+	UTR	"protein_coding";	"CD5";
chr11	61125837	61125841	+	UTR	"protein_coding";	"CD5";
chr11	61126288	61127852	+	UTR	"protein_coding";	"CD5";

经过和IGV图对照,发现从上到下,分别对应着基因的5-3方向。

可以忽略其中的CDS位置。

$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD5\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | grep -v  "CDS"
chr11	61102489	61127852	+	gene	"CD5";	2;
chr11	61102489	61127852	+	transcript	"protein_coding";	"CD5";
chr11	61102489	61102615	+	exon	"protein_coding";	"CD5";
chr11	61102561	61102563	+	start_codon	"protein_coding";	"CD5";
chr11	61115056	61115094	+	exon	"protein_coding";	"CD5";
chr11	61118175	61118480	+	exon	"protein_coding";	"CD5";
chr11	61118915	61118977	+	exon	"protein_coding";	"CD5";
chr11	61119234	61119575	+	exon	"protein_coding";	"CD5";
chr11	61121611	61121904	+	exon	"protein_coding";	"CD5";
chr11	61122907	61123032	+	exon	"protein_coding";	"CD5";
chr11	61123884	61123937	+	exon	"protein_coding";	"CD5";
chr11	61125032	61125151	+	exon	"protein_coding";	"CD5";
chr11	61125751	61125841	+	exon	"protein_coding";	"CD5";
chr11	61125837	61125839	+	stop_codon	"protein_coding";	"CD5";
chr11	61126288	61127852	+	exon	"protein_coding";	"CD5";
chr11	61102489	61102560	+	UTR	"protein_coding";	"CD5";
chr11	61125837	61125841	+	UTR	"protein_coding";	"CD5";
chr11	61126288	61127852	+	UTR	"protein_coding";	"CD5";

对比IGV图,可知:

  • 层级结构是: 1) gene, 2)转录本,3)外显子
    1. UTR,其实第一个是5UTR,后两个是3UTR,结果都排在了exon后面。
  • stop_coden在倒数第二个外显子上,所以3UTR分成了两段:exon倒数第二最后一段,和最后一个exon。

(4) 找一个蛋白编码基因:-链上的 CD7

直接过滤掉CDS行:

$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD7\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | grep -v "CDS"
chr17	82314868	82317608	-	gene	"CD7";	1;
chr17	82314868	82317577	-	transcript	"protein_coding";	"CD7";
chr17	82317414	82317577	-	exon	"protein_coding";	"CD7";
chr17	82317493	82317495	-	start_codon	"protein_coding";	"CD7";
chr17	82316667	82316981	-	exon	"protein_coding";	"CD7";
chr17	82314868	82316409	-	exon	"protein_coding";	"CD7";
chr17	82316141	82316143	-	stop_codon	"protein_coding";	"CD7";
chr17	82317496	82317577	-	UTR	"protein_coding";	"CD7";
chr17	82314868	82316143	-	UTR	"protein_coding";	"CD7";
chr17	82314873	82317608	-	transcript	"protein_coding";	"CD7";
chr17	82317414	82317608	-	exon	"protein_coding";	"CD7";
chr17	82317493	82317495	-	start_codon	"protein_coding";	"CD7";
chr17	82316667	82316981	-	exon	"protein_coding";	"CD7";
chr17	82316195	82316409	-	exon	"protein_coding";	"CD7";
chr17	82314873	82315431	-	exon	"protein_coding";	"CD7";
chr17	82315321	82315323	-	stop_codon	"protein_coding";	"CD7";
chr17	82317496	82317608	-	UTR	"protein_coding";	"CD7";
chr17	82314873	82315323	-	UTR	"protein_coding";	"CD7";
chr17	82314874	82317552	-	transcript	"protein_coding";	"CD7";
chr17	82316667	82317552	-	exon	"protein_coding";	"CD7";
chr17	82316761	82316763	-	start_codon	"protein_coding";	"CD7";
chr17	82316195	82316409	-	exon	"protein_coding";	"CD7";
chr17	82314874	82315431	-	exon	"protein_coding";	"CD7";
chr17	82315321	82315323	-	stop_codon	"protein_coding";	"CD7";
chr17	82316764	82317552	-	UTR	"protein_coding";	"CD7";
chr17	82314874	82315323	-	UTR	"protein_coding";	"CD7";
chr17	82315975	82317558	-	transcript	"protein_coding";	"CD7";
chr17	82317414	82317558	-	exon	"protein_coding";	"CD7";
chr17	82316667	82317011	-	exon	"protein_coding";	"CD7";
chr17	82316761	82316763	-	start_codon	"protein_coding";	"CD7";
chr17	82315975	82316409	-	exon	"protein_coding";	"CD7";
chr17	82316141	82316143	-	stop_codon	"protein_coding";	"CD7";
chr17	82317414	82317558	-	UTR	"protein_coding";	"CD7";
chr17	82316764	82317011	-	UTR	"protein_coding";	"CD7";
chr17	82315975	82316143	-	UTR	"protein_coding";	"CD7";
  • 还是4个层级结构:1)gene;2)transcript 4个;3) exon; 4) UTR跟在最后
  • 如果有start_coden和stop_coden,会在exon之间显示。
  • 从上到下,对应着RNA中基因的5-3,也就是从IGV的右侧到左侧(-链上的基因)
  • 第一个UTR是5UTR,第二个是3UTR.

3. 获取基因范围

$ grep -v '^#' GRCh38.p13.gtf | grep "protein_coding" | awk '{if($3=="gene")print $1"\t"$4"\t"$5"\t"$7"\t"$14}' |sed -e 's/;//' -e 's/\"//g' | head
chr1	65419	71585	+	OR4F5
chr1	450740	451678	-	OR4F29
chr1	685716	686654	-	OR4F16
chr1	923923	944575	+	SAMD11
chr1	944203	959309	-	NOC2L
chr1	960584	965719	+	KLHL17
chr1	966482	975865	+	PLEKHN1
chr1	975198	982117	-	PERM1
chr1	998962	1000172	-	HES4
chr1	1001138	1014540	+	ISG15

4. 获取3UTR区域的方法

  • shell: gtf 文件,保留每个转录本的stop和UTR,strand
  • Python: 对于+,UTR>=stop的保留;对于-,UTR<=stop的保留

(1) shell 获取 bed 文件

$ grep -v '^#' GRCh38.p13.gtf | grep -P '\"CD7\"' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | awk '{if($5=="UTR" || $5=="stop_codon") print $0}' | head
chr17	82316141	82316143	-	stop_codon	"protein_coding";	"CD7";
chr17	82317496	82317577	-	UTR	"protein_coding";	"CD7";
chr17	82314868	82316143	-	UTR	"protein_coding";	"CD7";
chr17	82315321	82315323	-	stop_codon	"protein_coding";	"CD7";
chr17	82317496	82317608	-	UTR	"protein_coding";	"CD7";
chr17	82314873	82315323	-	UTR	"protein_coding";	"CD7";
chr17	82315321	82315323	-	stop_codon	"protein_coding";	"CD7";
chr17	82316764	82317552	-	UTR	"protein_coding";	"CD7";
chr17	82314874	82315323	-	UTR	"protein_coding";	"CD7";
chr17	82316141	82316143	-	stop_codon	"protein_coding";	"CD7";

格式化:去掉"和;

$ grep -v '^#' GRCh38.p13.gtf | grep "protein_coding" | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$14"\t"$16}' | awk '{if($5=="UTR" || $5=="stop_codon") print $0}' | sed -e 's/;//g' -e 's/\"//g' | head
chr1	70006	70008	+	stop_codon	protein_coding	OR4F5
chr1	65419	65433	+	UTR	protein_coding	OR4F5
chr1	65520	65564	+	UTR	protein_coding	OR4F5
chr1	70006	71585	+	UTR	protein_coding	OR4F5
chr1	450740	450742	-	stop_codon	protein_coding	OR4F29
chr1	450740	450742	-	UTR	protein_coding	OR4F29
chr1	685716	685718	-	stop_codon	protein_coding	OR4F16
chr1	685716	685718	-	UTR	protein_coding	OR4F16
chr1	944151	944153	+	stop_codon	protein_coding	SAMD11
chr1	923923	924431	+	UTR	protein_coding	SAMD11

(2) 步骤2:Python 获取UTR区域

一个转录本只有一个stop,接着怎么判断紧邻的UTR是否是UTR3?
伪代码如下:

如果是 stop,则记录 stop_codon=当前行
如果不是stop(就只能是 UTR了),则要求基因名和上一个stop行一致:stop_codon[6] == arr[6]:
	if +:
		if UTR[1]>=stop_codon[1]
			save UTR3
		else:
			save UTR5
	if -:
		UTR[2]<=stop_codon[2]
			save UTR3
		else:
			save UTR5
如果基因名和上一个stop不一致,则该行只能是UTR5.

注意 Note:

  • start codon 3个碱基属于cds区域,不属于UTR5区;
  • 而 stop codon 3个碱基属于UTR3区。
  • 3
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值