(17) CAFE之后的特异节点基因家族富集分析

琉璃月月呀

已于 2024-05-22 16:48:29 修改

阅读量213

点赞数

文章标签： spring mybatis nio dubbo

于 2023-10-11 11:33:42 首次发布

本文链接：https://blog.csdn.net/weixin_44562189/article/details/133762876

版权

主要参考了这篇比较基因组学分析3：特异节点基因家族富集分析（非模式物种GO/KEEG富集分析） - 简书

1.获取目的基因

主要是目标基因与背景基因的选择，目标基因可以是该节点显著扩张Increase的基因家族包含的所有物种基因，背景基因一般为该节点包含的Orthogroup

1.1 获取<6>显著扩张的基因（以南极石耳U.antarctica为例）

补充：
Orthogroups.GeneCount.tsv：记录了每个 Orthogroup 中基因在物种间的分布情况，可以用于分析同源基因在物种间的收缩和扩张（直系同源组中基因数 >= 2）
Orthogroups.txt：记录了 MCL 中成功聚类的每个 Orthogroup 所包含的基因，OrhtoMCL的输出格式（直系同源组中基因数 >= 1）

cd /ifs1/User/wuqi/Qxy/knowngenome/CAFE/
mkdir node6analysis/
cd node6analysis/
根据之前cafe_out.out结果，传输到本地，在WPS中手动筛选出node6 U.antarctica的显著扩张的基因名字，写入node6significant_name.expand中（图）
cp /ifs1/User/wuqi/Qxy/knowngenome/orthofinder/OrthoFinder/Results_Aug01/Orthogroups/Orthogroups.txt ./
grep -f node6significant_name.expand Orthogroups.txt |sed "s/ /\n/g" | grep -E "thamnodes|squamosa|indica|antarctica|cylindrica|maculata|virginis|decussata|aprina|arctica|subpolyphylla|grisea|freyi|spodochroa|vellea|loboperipherica|torrefacta|phaea|lyngei|esculenta|yunnana|muehlenbergii|deusta|caroliniana|scalaris" | sort | uniq > node6significant.expand.genes
#上面代码的作用是从Orthogroups.txt中提取出node6significant_name.expand所有的基因家族，且在每个物种中的对应基因信息（图）

显著扩张的基因名字

基因家族在每个物种的对应信息

1.2 node6背景基因选择

从orthofinder结果文件Orthogroups.GeneCount.tsv出发，手动挑出U.antarctica的orthogroups同源基因（就是筛选出U.antarctica列不为零的基因家族的名称）
写入node6.orthogroups文件中去

grep -f node6.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "thamnodes|squamosa|indica|antarctica|cylindrica|maculata|virginis|decussata|aprina|arctica|subpolyphylla|grisea|freyi|spodochroa|vellea|loboperipherica|torrefacta|phaea|lyngei|esculenta|yunnana|muehlenbergii|deusta|caroliniana|scalaris" | sort | uniq > node6.genes

是进行富集分析所需要的基因号，接下来就是构建背景基因的GO和KEGG注释，由于是无参物种，所以我们需要自己构建

2 背景基因的GO和KEGG注释

2.1 GO/KEGG注释

先是去了orthofinder文件夹下把25个物种的蛋白质序列cat到一起，生成25allproteins.fasta，然后转到当前工作目录下
seqkit grep -f node6.genes 25allproteins.fasta > node6background.fa #提取背景基因，需要严格注意基因数是否一致175511，但是这个名字可以再该一下
emapper.py -m diamond -i node6background.fa -o node6 --cpu 20 #这是线上跑的代码，但是在服务器上没法跑
遂转移到本地来跑，登录http://eggnog-mapper.embl.de/提交蛋白序列（要求<100000条序列），输入邮箱后会发送邮件，点击邮件然后点击start开始运行
注1：因为要求序列<100000条，所以可以用TBtools分割，也可以用spilt.py脚本分割
注2：跑的速度还行，大概100000条序列3——5小时
注3：跑完之后把前几行注释和最后几行注释删掉，就保留标题和数据（就类似于一个表格一样）然后如果是分别做的eggnog注释，还要合在一起，得到node6.emapper.annotations文件

spilt.py脚本内容如下：

# 打开原始文件，读取基因名称列表
with open('node6.genes', 'r') as f: genes_list = f.readlines()

# 求出基因名称总数，计算拆分位置（向下取整）
total_genes = len(genes_list)
split_point = total_genes // 2

# 按照拆分位置将基因名称列表分为两个子列表
genes_list1 = genes_list[:split_point]
genes_list2 = genes_list[split_point:]

# 将两个子列表写入独立的文件
with open('node6_part1.genes', 'w') as f: f.writelines(genes_list1)

with open('node6_part2.genes', 'w') as f: f.writelines(genes_list2)

该脚本作用是对于单行的文件（比如一行是一个基因名称啊什么的）修改一下该脚本打开文件的名称，保存运行后能够将该文件分成两半然后再运行下面代码，分别得到相应蛋白序列，去网站注释

seqkit grep -f node6_part1.genes 25allproteins.fasta > node6background1.fa
seqkit grep -f node6_part2.genes 25allproteins.fasta > node6background2.fa

2.2GO注释库构建

在~/Qxy/go目录下，已有go-basic.obo文件，这是GO的注释信息
然后运行下列代码，提取相对应信息
grep "^id:" go-basic.obo |awk '{print $2}' > GO.id #获得GO的ID号
grep "^name:" go-basic.obo |awk '{print $2}' > GO.name #获得GO的名字
grep "^namespace:" go-basic.obo |awk '{print $2}' > GO.class #获得GO的类别
paste GO.id GO.name GO.class -d "\t" > GO.library #获得GO的ID、名字、类别的对应信息，应该最终要保留的是这个文件
ln -s /ifs1/User/wuqi/Qxy/go/GO.library /ifs1/User/wuqi/Qxy/knowngenome/CAFE/node6analysis/

2.3KEGG注释库构建

在~/Qxy/目录下，新建文件夹，并将在本地https://www.genome.jp/kegg-bin/get_htext?ko00001网站上下载的ko00001.json文件上传
mkdir KEGG/ 
cd KEGG/ 
vim kegg_script.R #这是一个从ko00001.json中提取有效信息的R脚本，结果文件为kegg_info.RData和KEGG.library，要将KEGG.library引超链接至/ifs1/User/wuqi/Qxy/knowngenome/CAFE/node6analysis
chmod +x kegg_script.R
Rscript kegg_script.R
ln -s /ifs1/User/wuqi/Qxy/KEGG/KEGG.library /ifs1/User/wuqi/Qxy/knowngenome/CAFE/node6analysis/
ln -s /ifs1/User/wuqi/Qxy/KEGG/kegg_info.RData /ifs1/User/wuqi/Qxy/knowngenome/CAFE/node6analysis/

注：kegg_script.R脚本内容如下

注：kegg_script.R脚本内容如下
# 代码来自：http://www.genek.tv/course/225/task/4861/show
library(jsonlite)
library(purrr)
library(RCurl)
library(dplyr)
library(stringr)
 kegg <- function(json = "ko00001.json") {
    pathway2name <- tibble(Pathway = character(), Name = character())
    ko2pathway <- tibble(Ko = character(), Pathway = character())
    
    kegg <- fromJSON(json)
    
    for (a in seq_along(kegg[["children"]][["children"]])) {
      A <- kegg[["children"]][["name"]][[a]]
      
      for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
        B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
        
        for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
          pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
          
          pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
          pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
          pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
          
          kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
          
          kos <- str_match(kos_info, "K[0-9]*")[,1]
          
          ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
        }
      }
    }
    colnames(ko2pathway) <- c("KO","Pathway")
    save(pathway2name, ko2pathway, file = "kegg_info.RData")
    write.table(pathway2name,"KEGG.library",sep="\t",row.names = F)
  }
  
  
kegg(json = "ko00001.json")

2.4背景注释构建

#目的是将自己的基因集和注释库的相结合，思路是先载入包，输入数据文件，分别生成GO注释、KEGG注释等信息

2.5GO/KEGG富集分析

还是在/ifs1/User/wuqi/Qxy/knowngenome/CAFE/node6analysis这个目录下，该目录下有node6.emapper.annotations、KEGG.library、GO.library、kegg_info.RData

vim all.R
chmod +x all.R
Rscript all.R  #直接运行这个代码，别分段跑了，咱全着来！！！

结果文件是：node6_GO.rda、node6_KEGG.rda、node6expand.xls、node6significant.expand.KEGG.xls
将Excel表传输到本地查看有无重复行，有无动物相关注释（删掉）

脚本内容如下：

library(clusterProfiler)
library(dplyr)
library(stringr)
options(stringsAsFactors = F)
load("kegg_info.RData")
##===============================STEP1:GO注释生成=======================
#自己构建的话，首先需要读入文件
egg <- read.delim("node6.emapper.annotations",header = T,sep="\t")
egg[egg==""]<-NA  #将空行变成NA，方便下面的去除
#从文件中挑出基因query与eggnog注释信息
##gene_info <- egg %>% 
##  dplyr::select(GID = query, GENENAME = eggNOG_OGs) %>% na.omit()
#挑出query_name与GO注释信息
gterms <- egg %>%
  dplyr::select(query, GOs) %>% na.omit()
gene_ids <- egg$query
eggnog_lines_with_go <- egg$GOs!= ""
eggnog_lines_with_go
eggnog_annoations_go <- str_split(egg[eggnog_lines_with_go,]$GOs, ",")
gene2go <- data.frame(gene = rep(gene_ids[eggnog_lines_with_go],
                                    times = sapply(eggnog_annoations_go, length)),
                         term = unlist(eggnog_annoations_go))
names(gene2go) <- c('gene_id', 'ID')
go2name <- read.delim('GO.library', header = FALSE, stringsAsFactors = FALSE)
names(go2name) <- c('ID', 'Description', 'Ontology')

go_anno <- merge(gene2go, go2name, by = 'ID', all.x = TRUE)

## 将GO注释信息保存
save(go_anno,file = "node6_GO.rda")

##===============================STEP2:KEGG注释生成=======================
gene2ko <- egg %>%
  dplyr::select(GID = query, KO = KEGG_ko) %>%
  na.omit()
pathway2name <- read.delim("KEGG.library")
colnames(pathway2name)<-c("Pathway","Name")
gene2ko$KO <- str_replace(gene2ko$KO, "ko:","")
gene2pathway <- gene2ko %>% left_join(ko2pathway, by = "KO") %>% 
  dplyr::select(GID, Pathway) %>%
  na.omit()
kegg_anno<- merge(gene2pathway,pathway2name,by = 'Pathway', all.x = TRUE)[,c(2,1,3)]
colnames(kegg_anno) <- c('gene_id','pathway_id','pathway_description')
save(kegg_anno,file = "node6_KEGG.rda")
##===============================STEP3:GO富集分析=================
#目标基因列表(全部基因)
gene_select <- read.delim(file = 'node6significant.expand.genes', stringsAsFactors = FALSE,header = F)$V1
#GO 富集分析
#默认以所有注释到 GO 的基因为背景集，也可通过 universe 参数输入背景集
#默认以 p<0.05 为标准，Benjamini 方法校正 p 值，q 值阈值 0.2
#默认输出 top500 富集结果
#如果想输出所有富集结果（不考虑 p 值阈值等），将 p、q 等值设置为 1 即可
#或者直接在 enrichResult 类对象中直接提取需要的结果
go_rich <- enricher(gene = gene_select,
                    TERM2GENE = go_anno[c('ID', 'gene_id')], 
                    TERM2NAME = go_anno[c('ID', 'Description')], 
                    pvalueCutoff = 1, 
                    pAdjustMethod = 'BH', 
                    qvalueCutoff = 1
                   )
#输出默认结果，即根据上述 p 值等阈值筛选后的

tmp <- merge(go_rich, go2name[c('ID', 'Ontology')], by = 'ID')
tmp <- tmp[c(10, 1:9)]
tmp <- tmp[order(tmp$pvalue), ]
write.table(tmp, 'node6expand.xls', sep = '\t', row.names = FALSE, quote = FALSE)

##===============================STEP4:KEGG注释=================
gene_select <- read.delim('node6significant.expand.genes', stringsAsFactors = FALSE,header = F)$V1
#KEGG 富集分析
#默认以所有注释到 KEGG 的基因为背景集，也可通过 universe 参数指定其中的一个子集作为背景集
#默认以 p<0.05 为标准，Benjamini 方法校正 p 值，q 值阈值 0.2
#默认输出 top500 富集结果
kegg_rich <- enricher(gene = gene_select,
                      TERM2GENE = kegg_anno[c('pathway_id', 'gene_id')], 
                      TERM2NAME = kegg_anno[c('pathway_id', 'pathway_description')], 
                      pvalueCutoff = 1, 
                      pAdjustMethod = 'BH', 
                      qvalueCutoff = 1, 
                      maxGSSize = 500)

#输出默认结果，即根据上述 p 值等阈值筛选后的
write.table(kegg_rich, 'node6significant.expand.KEGG.xls', sep = '\t', row.names = FALSE, quote = FALSE)

补充：得到的node6expand.xls和node6significant.expand.KEGG.xls两个表解读
node6expand.xls是GO注释表
GO（Gene Ontology）是一种用于对基因和蛋白质功能进行注释的标准化分类系统。
GO将基因和蛋白质的功能分为三个主要方面，分别是生物过程（biological process）、细胞组分（cellular component）和分子功能（molecular function）。这三个方面覆盖了生物体内不同层次的功能。
生物过程（Biological Process）：描述基因或蛋白质在生物学中所参与的生物过程、生物学任务或事件。例如，细胞分裂、免疫反应、代谢途径等都属于生物过程。
细胞组分（Cellular Component）：描述基因或蛋白质在细胞中定位的位置或组分。它指的是细胞内的不同结构和组织，例如核、细胞膜、线粒体等。
分子功能（Molecular Function）：描述基因或蛋白质所具有的特定的分子功能或活性。这可以是某种化学反应、某种酶活性、某种结合能力等。例如，酶活性、DNA结合、受体活性等都属于分子功能。

2.6GO/KEGG作图

GO作图

vim GOplot.R
chmod +x GOplot.R
Rscript GOplot.R

脚本内容如下：
library(ggplot2)
pathway = read.delim("node6significant.expand.GO.xls",header = T,sep="\t")
pathway$GeneRatio<- pathway[,10]/14728 ##这个14728是GeneRatio的那个数值，自己修改

pathway$log<- -log10(pathway[,6])

library(dplyr)
library(ggrepel)
GO <- arrange(pathway,pathway[,6])
GO_dataset <- GO[1:10,]
#按照PValue从低到高排序[升序]
GO_dataset$Description<- factor(GO_dataset$Description,levels = rev(GO_dataset$Description))
GO_dataset$GeneRatio <- as.numeric(GO_dataset$GeneRatio)
GO_dataset$GeneRatio
GO_dataset$log
#图片背景设定
mytheme <- theme(axis.title=element_text(face="bold", size=8,colour = 'black'), #坐标轴标题
                 axis.text.y = element_text(face="bold", size=6,colour = 'black'), #坐标轴标签
                 axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1,size=8),
                 axis.line = element_line(size=0.5, colour = 'black'), #轴线
                 panel.background = element_rect(color='black'), #绘图区边框
                 plot.title = element_text(face="bold", size=8,colour = 'black',hjust = 0.8),
                 legend.key = element_blank() #关闭图例边框
)

#绘制GO气泡图
p <- ggplot(GO_dataset,aes(x=GeneRatio,y=Description,colour=log,size=Count,shape=Ontology))+
  geom_point()+
  scale_size(range=c(2, 8))+
  scale_colour_gradient(low = "#52c2eb",high = "#EA4F30")+
  theme_bw()+
  labs(x='Fold Enrichment',y='GO Terms', #自定义x、y轴、标题内容
       title = 'Enriched GO Terms')+
  labs(color=expression(-log[10](pvalue)))+
  theme(legend.title=element_text(size=8), legend.text = element_text(size=14))+
  theme(axis.title.y = element_text(margin = margin(r = 50)),axis.title.x = element_text(margin = margin(t = 20)))+
  theme(axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1))
plot <- p+mytheme
plot
#保存图片
ggsave(plot,filename = "node6GO.pdf",width = 210,height = 210,units = "mm",dpi=300)
ggsave(plot,filename = "node6GO.png",width = 210,height = 210,units = "mm",dpi=300)

KEGG作图

vim KEGG_plot.R
chmod +x KEGG_plot.R
Rscript KEGG_plot.R

library(ggplot2)
pathway = read.delim("node6significant.expand.KEGG.xls",header = T,sep="\t")
pathway$GeneRatio<- pathway[,9]/72255

pathway$log<- -log10(pathway[,5])

library(dplyr)
library(ggrepel)
GO <- arrange(pathway,pathway[,5])
GO_dataset <- GO[1:10,]

#Pathway列最好转化成因子型，否则作图时ggplot2会将所有Pathway按字母顺序重排序
#将Pathway列转化为因子型
GO_dataset$Description<- factor(GO_dataset$Description,levels = rev(GO_dataset$Description))
GO_dataset$GeneRatio <- as.numeric(GO_dataset$GeneRatio)
GO_dataset<- arrange(GO_dataset,GO_dataset[,5])
GO_dataset$GeneRatio
GO_dataset$log
#图片背景设定
mytheme <- theme(axis.title=element_text(face="bold", size=8,colour = 'black'), #坐标轴标题
                 axis.text.y = element_text(face="bold", size=6,colour = 'black'), #坐标轴标签
                 axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1,size=8),
                 axis.line = element_line(size=0.5, colour = 'black'), #轴线
                 panel.background = element_rect(color='black'), #绘图区边框
                 plot.title = element_text(face="bold", size=8,colour = 'black',hjust = 0.8),
                 legend.key = element_blank() #关闭图例边框
)

#绘制KEGG气泡图
p <- ggplot(GO_dataset,aes(x=GeneRatio,y=Description,colour=log,size=Count))+
  geom_point()+
  scale_size(range=c(2, 8))+
  scale_colour_gradient(low = "#52c2eb",high = "#EA4F30")+
  theme_bw()+
  labs(x='Fold Enrichment',y='KEEG Terms', #自定义x、y轴、标题内容
       title = 'Enriched KEEG Terms')+
  labs(color=expression(-log[10](pvalue)))+
  theme(legend.title=element_text(size=8), legend.text = element_text(size=14))+
  theme(axis.title.y = element_text(margin = margin(r = 50)),axis.title.x = element_text(margin = margin(t = 20)))+
  theme(axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1))
plot <- p+mytheme
plot
#保存图片
ggsave(plot,filename = "node6KEGG.pdf",width = 210,height = 210,units = "mm",dpi=300)
ggsave(plot,filename = "node6KEGG.png",width = 210,height = 210,units = "mm",dpi=300)

注：在服务器上作图GO失败，KEGG出来字太小，故将node6significant.expand.GO.xls和node6significant.expand.KEGG.xls结果文件下载至本地进行作图

2.6.1 查看EXCEL表，去除重复的，去除非真菌功能的，比如会有什么胰岛素调节等等

2.6.2 计算fold enrichment富集倍数，fold enrichment = GeneRatio / BgRatio，这个要手动去分列计算一下

ONTOLOGY：区分是BP，MF还是CC 生物过程（Biological Process）细胞组分（Cellular Component）分子功能（Molecular Function）
ID：具体的GO条目的ID号
Description：GO条目的描述
GeneRatio：这里是一个分数，分子是富集到这个GO条目上的gene的数目，分母是所有输入的做富集分析的gene的数目，可以是差异表达分析得到的gene
BgRatio：Background Ratio. 这里也是一个分数，分母是所有编码蛋白的基因中有GO注释的gene的数目，分子是注释到这个GO条目上面的gene的数目
pvalue：富集的p值
p.adjust：校正之后的p值
qvalue：q值
geneID：输入的做富集分析的gene中富集到这个GO条目上面的具体的gene名字
Count：输入的做富集分析的gene中富集到这个GO条目上面的gene的数目

然后在微生信http://bioinformatics.com.cn/plot_basic_gopathway_enrichment_bubbleplot_081上进行富集分析图绘制
需要整理表格：第一列基因描述、第二列富集倍数、第三列 P值、第四列基因数目
更改下面 colorbar说明：填-log10(pvalue) 颜色是上（大）69C4FF蓝下（小）FF540E红
其他的适当修改，然后生成并保存

图解读：横轴是富集倍数，越大越多，气泡大小是基因数目，越大越多，颜色是P值，越红越小越显著，纵轴是基因描述