WGCNA

WGCNA分析,简单全面的最新教程 - 简书 (jianshu.com)icon-default.png?t=N7T8https://www.jianshu.com/p/e9cc3f43441d

一文看懂WGCNA 分析(2019更新版) - 云+社区 - 腾讯云 (tencent.com)icon-default.png?t=N7T8https://cloud.tencent.com/developer/article/1516749

安装WGCNA编辑包出现如下问题怎么解决? - 组学大讲堂问答社区 (omicsclass.com)icon-default.png?t=N7T8https://www.omicsclass.com/question/3995

一、数据处理

1.表达数据是从网上下载的不同批次的数据,要先对数据进行去除批次效应

        用limma包的removeBatchEffect功能。

        Q:有个问题是数据是否要先经过log处理

多种批次效应去除的方法比较-转自生信技能树 - 简书 (jianshu.com)icon-default.png?t=N7T8https://www.jianshu.com/p/61b2238a24b9

生信小白--关于批次效应的学习及实战 - 简书 (jianshu.com)icon-default.png?t=N7T8https://www.jianshu.com/p/4c0734c3473c

2.表达数据的筛选问题

        无向性网络的power值最好不要超过15;

        首先尝试基因的样本表达均值<1,或者差异系数<1的基因删除;但软阈值为24,过高,不可取;

        其次尝试用mad值筛选,去除后25%,power值为14,勉强可取

Q:差异系数的计算问题

3.样本的性状数据问题

        目前思考查看文献内对各样本的处理是什么,先统计是否是赤霉病相关,白粉、条锈病相关,再根据其处理分一下类

        根据三种病的类型分了表型信息,做了表型信息与模块的相关性分析:

2022.3.29 进行野生二粒小麦的WGCNA分析——只针对各个组织时期(17)的表达量

有一个样本的表达量均为0,已剔除。共有16个样本

第一步:获得样本聚类树与处理间相关性

#######此版本适用于组织的表达量的WGCNA
rm(list=ls())
library(WGCNA)
library(reshape2)
library(limma)
library(stringr)
#setwd("/users/songwn/user/ZY/database/NLR/WGCNA")
options(stringsAsFactors = FALSE)
exprMat <- "periodTPM.csv"
type = "unsigned"#无方向性
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)# 关联样品性状的二元变量时,设置

####################导入数据###############
dataExpr <- read.csv(exprMat, sep=',', row.names=1,header=T, 
                     quote="", comment="", check.names=F)
index <- rownames(dataExpr)
dataExpr <- as.data.frame(lapply(dataExpr,as.numeric))
sum(is.na(dataExpr))
dataExpr[is.na(dataExpr)] <-0
row.names(dataExpr) <- index
#y2 <- dataExpr
###########消除批次效应##########

#batch <- c(rep("salt",27),rep("drout",16),rep("period",17))
#batch <- c(rep("period",17))
#y2 <- removeBatchEffect(dataExpr,batch = batch)

#pdf("test3.pdf",width = 20,height = 10)
#sampleTree <- hclust(dist(t(y2)),method = "average")
#plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")
#dev.off()
####################筛选基因和样本############
#剔除基因表达丰度低,结果不合适
#dataExpr <- y2[which(rowSums(y2)>165),]

#用mad筛选,取前75%
m.mad = apply(dataExpr,1,mad)
dataExpr <- dataExpr[which(m.mad >
                       max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
datExpr <- as.data.frame(t(dataExpr))
###筛选掉异常的样本
#A=adjacency(t(datExpr),type="signed")  # 计算相似性矩阵
#k=as.numeric(apply(A,2,sum))-1   ## 计算网络的连接度
#Z.k=scale(k)  # 连接度标准化
#thresholdZ.k=- 2.5   # 设置连接度筛选阀值, 这部分是关键,筛选掉异常的样本
#outlierColor=ifelse(Z.k<thresholdZ.k,"red","black")  # 将异常样本进行标注
#remove.samples= Z.k<thresholdZ.k | is.na(Z.k)  # 删除异常样本

#datExpr=datExpr[!remove.samples,]

nGenes=ncol(datExpr)
nSamples=nrow(datExpr)

sampleTree = hclust(dist(datExpr),method = "average")
sizeGrwindow(12,9)

par(cex=0.6)
par(mar=c(0,4,2,0))
plot(sampleTree,main="Sample clustering",sub="",xlab="",cex.lab=1.5,cex.axis=1.5,cex.main=2)

####################获取处理的文件##################
#trait=colnames(dataExpr)
#write.csv(trait,file = "character.csv",row.names = F)
####################################################

traitData=read.csv("periodCharacter.csv",row.names=1,header = T)
#######################以下为处理表达矩阵的
#miRNASample=rownames(datExpr)
#traitRows=match(miRNASample,traitData$Treat)
#datTraits=traitData[traitRows,-1]
#rownames(datTraits)=traitData[traitRows,1]

#collectGarbage()


traitColors= numbers2colors(traitData,signed=FALSE)
pdf("sample_L.pdf",width = 25,height = 12)
plotDendroAndColors(sampleTree,traitColors,groupLabels = names(traitData),main="Sample dendrogram and treat heatmap")
dev.off()

save(datExpr,traitData,sampleTree,traitColors,file = "01-data_sample_L.RData")

注意事项:

在导入数据时,要先储存一下df的行名,因为将NA替换成0的操作过程中,行名会丢失

因为只分析各个时期的表达量,不存在批次效应

无需筛异常样本,筛完后样本数为0 

第二步:获得net与TOM矩阵

#对野生小麦的表达矩阵进行WGCNA分析,基因聚类与模块划分
#常规表达矩阵即可,即基因在行,样品在列,进入分析前做一个转置。
#RPKM、FPKM或其它标准化方法影响不大,推荐使用Deseq2的varianceStabilizingTransformation或log2(x+1)对标准化后的数据做个转换
rm(list=ls())
library(WGCNA)
library(reshape2)
library(limma)
library(stringr)
setwd("/users/songwn/user/ZY/database/NLR/WGCNA/miRNA")
options(stringsAsFactors = FALSE)
#exprMat <- "allTreatTPM.csv"
exprMat <- "periodTPM.csv"
type = "unsigned"#无方向性
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)# 关联样品性状的二元变量时,设置

####################导入数据###############
dataExpr <- read.csv(exprMat, sep=',', row.names=1, header=T, 
                       quote="", comment="", check.names=F)
index <- rownames(dataExpr)
dataExpr <- as.data.frame(lapply(dataExpr,as.numeric))
#sum(is.na(dataExpr))
dataExpr[is.na(dataExpr)] <-0
row.names(dataExpr) <- index
###########消除批次效应##########
#chara <- read.csv("character.csv",sep = ",",header = T)
#batch <- c(rep("salt",27),rep("drout",16),rep("period",17))
#y2 <- removeBatchEffect(dataExpr,batch = batch)

#pdf("test3.pdf",width = 20,height = 10)
#sampleTree <- hclust(dist(t(y2)),method = "average")
#plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")
#dev.off()
####################筛选基因和样本############
#剔除基因表达丰度低,结果不合适
#dataExpr <- y2[which(rowSums(y2)>165),]

###用mad筛选,取前75%
#m.mad = apply(y2 ,1,mad)
#dataExpr <- y2[which(m.mad >
                       #max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
#datExpr <- as.data.frame(t(dataExpr))
##################组织表达矩阵用以下版本筛选,且不筛异常样本
m.mad = apply(dataExpr,1,mad)
dataExpr <- dataExpr[which(m.mad >
                             max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
datExpr <- as.data.frame(t(dataExpr))
#筛选掉异常的样本
#A=adjacency(t(datExpr),type="signed")  # 计算相似性矩阵
#k=as.numeric(apply(A,2,sum))-1   ## 计算网络的连接度
#Z.k=scale(k)  # 连接度标准化
#thresholdZ.k=- 2.5   # 设置连接度筛选阀值, 这部分是关键,筛选掉异常的样本
#outlierColor=ifelse(Z.k<thresholdZ.k,"red","black")  # 将异常样本进行标注
#remove.samples= Z.k<thresholdZ.k | is.na(Z.k)  # 删除异常样本

#datExpr=datExpr[!remove.samples,]

nGenes=ncol(datExpr)
nSamples=nrow(datExpr)

##样本聚类
##sampleTree <- hclust(dist(datExpr),method = "average")
##plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")

##筛选软阈值
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector=powers, networkType=type, verbose=5)
# 横轴是Soft threshold (power),纵轴是无标度网络的评估参数,数值越高,
# 网络越符合无标度特征 (non-scale)
pdf('powerL.pdf',width = 14,height = 9)
par(mfrow = c(1,2))
cex1 = 0.8
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",
     ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red")
# 筛选标准。R-square=0.85
abline(h=0.8,col="red")
# Soft threshold与平均连通性
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, 
     cex=cex1, col="red")
dev.off()

power=sft$powerEstimate

if (is.na(power)){
  power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
                     ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
                           ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
                                   ifelse(type == "unsigned", 6, 12))       
                     )
  )
}

##网络构建
# corType: pearson or bicor
# numericLabels: 返回数字而不是颜色作为模块的名字,后面可以再转换为颜色
# saveTOMs:最耗费时间的计算,存储起来,供后续使用
# mergeCutHeight: 合并模块的阈值,越大模块越少
cor <- WGCNA::cor
net = blockwiseModules(datExpr, power = power, maxBlockSize = nGenes,
                       TOMType = type, minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs=TRUE, corType = corType, 
                       maxPOutliers=maxPOutliers, loadTOMs=TRUE,
                       saveTOMFileBase = paste0("TOM_L.RData"),
                       verbose = 3)

# 根据模块中基因数目的多少,降序排列,依次编号为 `1-最大模块数`。
# **0 (grey)**表示**未**分入任何模块的基因。 
table(net$colors)
cor <- stats::cor
pdf('net_L.pdf',width = 14,height = 9)
moduleLabels = net$colors
moduleColors = labels2colors(moduleLabels)
# Plot the dendrogram and the module colors underneath
# 如果对结果不满意,还可以recutBlockwiseTrees,节省计算时间
plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
dev.off()

MEs=net$MEs
geneTree=net$dendrograms[[1]];
save(MEs,moduleLabels,power,moduleColors,geneTree,file="02-network_L-auto.RData")
#save(power,file="02-network_L-auto.RData")

没有合适的power值,取的经验值:9

第三步:获得模块与处理的相关性,以及各个处理与模块的GS和MM值

#对野生二粒小麦进行WGCNA分析,针对模块与表型的相关性,寻找感兴趣的模块与表型寻找hub基因

rm(list = ls())
library(WGCNA)
options(stringsAsFactors = FALSE)

lname=load(file = "01-data_sample_L.RData")
lname
lname=load(file="02-network_L-auto.RData")
lname

nGenes=ncol(datExpr)
nSamples=nrow(datExpr)
#datTraits[45,"period"] <- 0
MEs=moduleEigengenes(datExpr,moduleColors)$eigengenes
MEs=orderMEs(MEs)

moduleTraitCor=cor(MEs,traitData,use = "p")

moduleTraitPvalue=corPvalueStudent(moduleTraitCor,nSamples)

sizeGrWindow(10,6) #绘图框大小
textMatrix=paste(signif(moduleTraitCor,2)," (",signif(moduleTraitPvalue,1),")",sep="")
dim(textMatrix)=dim(moduleTraitCor)


pdf("Module-trait relationships_L.pdf",width = 20,height = 10)
par(mar=c(9,12,3,3))
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(traitData),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50),
               xColorWidth = 1,
               setStdMargins = FALSE,#默认参数出图
               #yColorWidth=2,
               
               textMatrix = textMatrix,
               cex.text = 0.75,
               zlim=c(-1,1),
               main=paste("Module-trait relationships"))
dev.off()

############ 寻找hubgene ###########

#####################循环的414过程###########################
#sapply(traitData$L007_R1_DS_1_cm_1,mode)
#sapply(traitData[[A]],mode)  两个方法的内容是一样的,数据类型也一样
#GS_name=paste("GS",A,sep = ".")
#A <- "L007_R1_DS_1_cm_1"
####################################

#以下循环可行
list <- names(traitData)
for (i in list){
  
  i_file=as.data.frame(traitData[[i]])
  names(i_file)=i #对CM的列名进行更换,上一步操作后的列名为“datTrait$CM”
  
  modNames=substring(names(MEs),3)#获取模块颜色名称,从ME后开始取
  geneModuleMembership=as.data.frame(cor(datExpr,MEs,use = "p"))
  MMPvalue=as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership),nSamples))
  names(geneModuleMembership)=paste("MM",modNames,sep = "")
  names(MMPvalue)=paste("p.MM",modNames,sep="")
  geneTraitSignificance=as.data.frame(cor(datExpr,i_file,use = "p"))
  GSPvalue=as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
  names(geneTraitSignificance)=paste("GS.",names(i_file),sep = "")
  names(GSPvalue)=paste("p.GS.",names(i_file),sep="")
  
  probes=names(datExpr)
  geneInfo = data.frame(geneID=probes,
                        moduleColor=moduleColors,
                        geneTraitSignificance,
                        GSPvalue)
  modOrder = order(-abs(cor(MEs,i_file,use = "p")))
  for(mod in 1:ncol(geneModuleMembership)){
    oldnames=names(geneInfo)
    geneInfo=data.frame(geneInfo,geneModuleMembership[,modOrder[mod]],MMPvalue[,modOrder[mod]])
    names(geneInfo)=c(oldnames,paste("MM.",modNames[modOrder[mod]],sep = ""),paste("p.MM.",modNames[modOrder[mod]],sep = ""))
    print(mod)
  }
  GS_name=paste("GS",i,sep = ".")
  geneOrder = order(geneInfo$moduleColor,-abs(geneInfo[[GS_name]]))
  geneInfo=geneInfo[geneOrder,]
  write.csv(geneInfo,file = paste("geneInfo_",i,".csv",sep = ""))
  
}

##################################
############################也可分布进行,比较麻烦,代码如下####################
L007_R1_DS_1_cm_1=as.data.frame(traitData$L007_R1_DS_1_cm_1)
names(L007_R1_DS_1_cm_1)="L007_R1_DS_1_cm_1" #对CM的列名进行更换,上一步操作后的列名为“datTrait$CM”

modNames=substring(names(MEs),3)#获取模块颜色名称,从ME后开始取
geneModuleMembership=as.data.frame(cor(datExpr,MEs,use = "p"))
MMPvalue=as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership),nSamples))
names(geneModuleMembership)=paste("MM",modNames,sep = "")
names(MMPvalue)=paste("p.MM",modNames,sep="")
geneTraitSignificance=as.data.frame(cor(datExpr,L007_R1_DS_1_cm_1,use = "p"))
GSPvalue=as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
names(geneTraitSignificance)=paste("GS.",names(L007_R1_DS_1_cm_1),sep = "")
names(GSPvalue)=paste("p.GS.",names(L007_R1_DS_1_cm_1),sep="")

#########输出各模块基因及相关结果#########
probes=names(datExpr)
geneInfo = data.frame(geneID=probes,
                      moduleColor=moduleColors,
                      geneTraitSignificance,
                      GSPvalue)
modOrder = order(-abs(cor(MEs,L007_R1_DS_1_cm_1,use = "p")))
for(mod in 1:ncol(geneModuleMembership)){
  oldnames=names(geneInfo)
  geneInfo=data.frame(geneInfo,geneModuleMembership[,modOrder[mod]],MMPvalue[,modOrder[mod]])
  names(geneInfo)=c(oldnames,paste("MM.",modNames[modOrder[mod]],sep = ""),paste("p.MM.",modNames[modOrder[mod]],sep = ""))
  print(mod)
}
geneOrder = order(geneInfo$moduleColor,-abs(geneInfo$GS.L007_R1_DS_1_cm_1))
geneInfo=geneInfo[geneOrder,]
write.csv(geneInfo,file = "geneInfo_L007_R1_DS_1_cm_1.csv")

##################################观察具体模块与处理的相关性##########################
module="midnightblue"
column=match(module,modNames)
moduleGenes=moduleColors==module

sizeGrWindow(7,7)
par(mfrow=c(1,1))
pdf("midnightblue module membership vs. TX gene significance.pdf",width = 10,height = 6)
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
                   abs(geneTraitSignificance[moduleGenes,1]),
                   xlab = paste("Module Membership in",module,"module"),
                   ylab = "Gene significance for TX",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2,
                   cex.lab = 1.2,
                   cex.axis = 1.2,
                   col = module)
dev.off()

####################################################

 计算每个处理(共16个)与各个模块的GS和MM值时,可用循环进行计算

R中的循环格式:

R语言:for循环使用小结_LandH的Blog的博客-CSDN博客_r语言for语句icon-default.png?t=N7T8https://blog.csdn.net/u013084616/article/details/73740250

 paste用法:

R中获取dataframe某列值的方法:

dataframe$X     等同于   dataframe[[X]]

                        R的一些基础命令

取列表子集、 列表中含有列表

> x = list(id = 1:4, height = 170, gender = "male")
> x
$id
[1] 1 2 3 4

$height
[1] 170

$gender
[1] "male"
x[1]       #取到名称 和 内容
x["id"]      #取到名称 和 内容
$id
[1] 1 2 3 4

x[[1]]      #只取内容
x[["id"]]    #只取内容
> [1] 1 2 3 4

> x = list( a = list(1,2,3,4), b = c("a","d","e"))
> x
$a
$a[[1]]
[1] 1

$a[[2]]
[1] 2

$a[[3]]
[1] 3

$a[[4]]
[1] 4


$b
[1] "a" "d" "e"

> x[[1]][[2]]     #取元素里的内容
> [1] 2
> x[[1]][2]      #取元素
[[1]]
[1] 2
> x[[c(1,3)]]     #取元素
[1] 3
> x[[c(2,2)]]      #取元素
[1] "d"

R语言基础之第二部分 操纵数据/取子集 - 简书 (jianshu.com)icon-default.png?t=N7T8https://www.jianshu.com/p/acc88b353bbe

变量名称和字符串: 

R学习笔记(二)变量名称和字符串的转换 - 知乎 (zhihu.com)icon-default.png?t=N7T8https://zhuanlan.zhihu.com/p/30383865

 R中针对dataframe的一些操作,如修改某数值:

R语言data.frame的常用操作总结 - HuskySir - 博客园 (cnblogs.com)icon-default.png?t=N7T8https://www.cnblogs.com/huskysir/p/10841595.html

> df[3,'Chinese'] <- 65 #将3号同学的语文成绩修改为65
#将语文成绩低于20的同学的语文成绩修改为20
> df[which(df$Chinese < 20),'Chinese'] <- 20
> df
       ID Class Chinese Math English
1   1     2      65   59      23
2   2     2      37   38      45
3   3     3      65   76      67
4   4     1      90   49      87
5   5     2      20   71      34
6   6     3      89   99      46
7   7     1      94   38      87
8   8     2      66   77      95
9   9     3      62   93      43
10 10     1      20   21      76
11 11     2      20   65      23
12 12     3      20   12      94

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值