对于大多数的细菌来说做,查找未知基因的功能都是一个问题,我在做细菌的转录组分析时就碰到了这个问题,得到差异基因之后无法系统的进行下一步分析,GO数据库只提供很少的物种选择,kegg虽然提供了不少这种信息,但是对于大多数的菌群来说还是不够,所以对于部分数据来说直接从NCBI或者Ensemble数据库中给的gtf注释文件中提取基因的注释信息也是一种方法。
一、基因组及注释文件GTF文件的下载
1、NCBI下载
以本次我研究的Staphylococcus aureus为例,从NCBI-Genome中直接搜索Staphylococcus aureus,然后进入下载页面,从下载页面中能看出NCBI提供的Reference数据,包括参考基因组以及GTF注释信息,如果没有的可以复制参考基因组链接然后删除最后的文件名进入到FTP下载页面查看提供的文件信息;
2、Ensemble 下载
Ensemble提供的数据要比NCBI要少一些,可以直接搜索Ensemble进入数据库,然后在下侧导航栏进入到Ensemble Bacteria 数据库或者直接点击链接进入,进入后可以直接根据物种或者基因组信息进行检索,然后直接点击可直接进入FTP下载页面,然后下载所需的信息;Ensembl Bacteriahttp://bacteria.ensembl.org/
二、提取GTF注释信息
以NCBI数据为例,首先在linux中使用shell去除‘#’号开头的说明文件
less GCF_001900185.1_ASM190018v1_genomic.gtf | grep -v '#' > less GCF_001900185.1_ASM190018v1_genomic.txt
然后在R中进行提取信息
#读取文件中包含注释信息的信息
gtf_data <- read.delim("GCF_001900185.1_ASM190018v1_genomic.txt", sep = "\t", header = F)
gtf_data <- gtf_data[gtf_data$V3 == 'CDS', ]
gtf_data$V9 <- stringr::str_remove(string = gtf_data$V9,pattern = 'gbkey CDS; ')
#构建注释信息表格,首先判断注释文件中GO_term的数量,然后再进行提取
summary_df <- data_frame()
for (i in 1:nrow(gtf_data)){
GO_num <- stringr::str_count(string = gtf_data[i,'V9'], pattern = 'Ontology_term')
print(GO_num)
print(i)
data <- stringr::str_split(string = gtf_data[i,'V9'], pattern = ";")[[1]]
df <- data_frame()
if(GO_num > 0){
for (j in 1:GO_num){
gene_ID = stringr::str_remove(data[1], pattern = 'gene_id ')
GO_ID = stringr::str_remove(data[j+2], pattern = 'Ontology_term ')
if (isEmpty(data[GO_num+3] %>% stringr::str_subset(pattern = "gene"))) {
gene = data %>% stringr::str_subset(pattern = "locus_tag") %>% str_remove(pattern = "locus_tag ")
GO_term = data[j+GO_num+2]
}else{
gene = data[GO_num+3] %>% str_remove(pattern = "gene ")
GO_term = data[j+GO_num+3]
}
inference_COORDINATES = ""
product = ifelse(isEmpty(data %>% stringr::str_subset(pattern = "product")),"", data %>% stringr::str_subset(pattern = "product")) %>% str_remove(pattern = "product ")
protein_id = ifelse(isEmpty(data %>% stringr::str_subset(pattern = "protein_id")),"", data %>% stringr::str_subset(pattern = "protein_id")) %>% str_remove(pattern = "protein_id ")
df <- rbind(df, data.frame(gene_ID, GO_ID, GO_term, inference_COORDINATES,gene, product, protein_id))
}
}else{
gene_ID = stringr::str_remove(data[1], pattern = 'gene_id ')
GO_ID = ""
GO_term = ""
inference_COORDINATES = ifelse(isEmpty(data %>% stringr::str_subset(pattern = "inference COORDINATES")),"", data %>% stringr::str_subset(pattern = "inference COORDINATES")) %>% str_remove(pattern = "inference COORDINATES: ")
gene = ifelse(isEmpty(data[3] %>% stringr::str_subset(pattern = "gene")),data %>% stringr::str_subset(pattern = "locus_tag"), data[3]) %>% str_remove(pattern = "gene ") %>% str_remove(pattern = "locus_tag ")
product = ifelse(isEmpty(data %>% stringr::str_subset(pattern = "product")),"", data %>% stringr::str_subset(pattern = "product")) %>% str_remove(pattern = "product ")
protein_id = ifelse(isEmpty(data %>% stringr::str_subset(pattern = "protein_id")),"", data %>% stringr::str_subset(pattern = "protein_id")) %>% str_remove(pattern = "protein_id ")
df <- rbind(df, data_frame(gene_ID, GO_ID, GO_term,inference_COORDINATES,gene, product, protein_id))
}
summary_df <- rbind(summary_df, df)
}
最后得到的注释表格,对于有GO_term注释信息的文件,提取其GO信息,没有的则提取其预测的信息,根据此表格能够进一步的对关键基因进行统计,观察这些基因都富集在GO的什么通路上;