摘要
在RNA-seq项目中,需要将差异基因比对到各个数据库当中,生成相应的注释结果和图像,便于深度挖掘信息。COG(Cluster of Orthologous Groups ofproteins 同源蛋白簇)数据库可以帮助了解蛋白功能甚至进化关系(细/真菌)。此次记录一下COG分类图的绘制方法
环境与方法
R version 3.6.1 (2019-07-05)
文档准备
分类简称及描述
# Code Name
J Translation, ribosomal structure and biogenesis
A RNA processing and modification
K Transcription
L Replication, recombination and repair
B Chromatin structure and dynamics
D Cell cycle control, cell division, chromosome partitioning
Y Nuclear structure
V Defense mechanisms
T Signal transduction mechanisms
M Cell wall/membrane/envelope biogenesis
N Cell motility
Z Cytoskeleton
W Extracellular structures
U Intracellular trafficking, secretion, and vesicular transport
O Posttranslational modification, protein turnover, chaperones
X Mobilome: prophages, transposons
C Energy production and conversion
G Carbohydrate transport and metabolism
E Amino acid transport and metabolism
F Nucleotide transport and metabolism
H Coenzyme transport and metabolism
I Lipid transport and metabolism
P Inorganic ion transport and metabolism
Q Secondary metabolites biosynthesis, transport and catabolism
R General function prediction only
S Function unknown
比对结果
这里使用将差异基因的序列比对到eggnog数据库分析出来的结果
红色方框为COG分类注释结果
使用代码
对注释结果进行汇总
import sys
fun_file = sys.argv[1] #分类文件输入
anno_list = sys.argv[2] #注释文件输入
Describe_file = open(fun_file,'r')
Class_file = open(anno_list,'r')
Describe_line = Describe_file.readlines()
Class_line = Class_file.readlines()
newgtf_name = 'DrawAnnotationPic.R'
desktop_path = './'
file_path = desktop_path+newgtf_name+'.txt'
file = open(file_path,'w')
for i in Describe_line[1:]:
anno_ID = i.split('\t')
describe = anno_ID[1].replace('\n','')
count = 0
for j in Class_line:
COG_ID = j.split('\t')
#print(COG_ID)
if len(COG_ID) > 2:
#print(COG_ID[-2])
#print(anno_ID[0])
if anno_ID[0] == COG_ID[-2]:
count += 1
else:
continue
COG_list = anno_ID[0]+"\t"+describe+"\t"+str(count)+"\n"
#print(COG_list)
file.write(COG_list)
file.close()
Class_file.close()
Describe_file.close()
得到汇总文件
绘制图像
args<-commandArgs(T)
if(length(args)<2){
cat("[usage:] ")
quit("no")
}
cazyclass<-args[1]
outfile<-args[2]
library(ggplot2)
dat<-read.table(cazyclass,header=F,sep='\t')
pdf(file=paste(outfile,".pdf",sep=""),width=6*1.5,height=6)
colnames(dat)<-c("Motif","Description","Number")
ggplot(dat,aes(x=dat$Motif,y=dat$Number,fill=paste(dat$Motif,':',dat$Description)))+geom_bar(stat="identity")+theme(legend.title=element_blank(),legend.text=element_text(size=6),legend.key=element_rect(size=6))+labs(x='Function class',y='Number of matched genes')
dev.off()
结果展示
总结
这里再提供另外两个思路供大家去探索,一是将eggnog的注释结果改为在本地对cog数据库进行比对(之前尝试过分析,时间较长,操作有些复杂);二是将得到的统计结果放到一些生信公司提供的云工具上进行分析(联川、百迈客等),或许比自己分析更快。