WGCNA分析,简单全面的最新教程 - 简书 (jianshu.com)https://www.jianshu.com/p/e9cc3f43441d
一文看懂WGCNA 分析(2019更新版) - 云+社区 - 腾讯云 (tencent.com)https://cloud.tencent.com/developer/article/1516749
安装WGCNA编辑包出现如下问题怎么解决? - 组学大讲堂问答社区 (omicsclass.com)https://www.omicsclass.com/question/3995
一、数据处理
1.表达数据是从网上下载的不同批次的数据,要先对数据进行去除批次效应
用limma包的removeBatchEffect功能。
Q:有个问题是数据是否要先经过log处理
多种批次效应去除的方法比较-转自生信技能树 - 简书 (jianshu.com)https://www.jianshu.com/p/61b2238a24b9
生信小白--关于批次效应的学习及实战 - 简书 (jianshu.com)https://www.jianshu.com/p/4c0734c3473c
2.表达数据的筛选问题
无向性网络的power值最好不要超过15;
首先尝试基因的样本表达均值<1,或者差异系数<1的基因删除;但软阈值为24,过高,不可取;
其次尝试用mad值筛选,去除后25%,power值为14,勉强可取
Q:差异系数的计算问题
3.样本的性状数据问题
目前思考查看文献内对各样本的处理是什么,先统计是否是赤霉病相关,白粉、条锈病相关,再根据其处理分一下类
根据三种病的类型分了表型信息,做了表型信息与模块的相关性分析:
2022.3.29 进行野生二粒小麦的WGCNA分析——只针对各个组织时期(17)的表达量
有一个样本的表达量均为0,已剔除。共有16个样本
第一步:获得样本聚类树与处理间相关性
#######此版本适用于组织的表达量的WGCNA
rm(list=ls())
library(WGCNA)
library(reshape2)
library(limma)
library(stringr)
#setwd("/users/songwn/user/ZY/database/NLR/WGCNA")
options(stringsAsFactors = FALSE)
exprMat <- "periodTPM.csv"
type = "unsigned"#无方向性
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)# 关联样品性状的二元变量时,设置
####################导入数据###############
dataExpr <- read.csv(exprMat, sep=',', row.names=1,header=T,
quote="", comment="", check.names=F)
index <- rownames(dataExpr)
dataExpr <- as.data.frame(lapply(dataExpr,as.numeric))
sum(is.na(dataExpr))
dataExpr[is.na(dataExpr)] <-0
row.names(dataExpr) <- index
#y2 <- dataExpr
###########消除批次效应##########
#batch <- c(rep("salt",27),rep("drout",16),rep("period",17))
#batch <- c(rep("period",17))
#y2 <- removeBatchEffect(dataExpr,batch = batch)
#pdf("test3.pdf",width = 20,height = 10)
#sampleTree <- hclust(dist(t(y2)),method = "average")
#plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")
#dev.off()
####################筛选基因和样本############
#剔除基因表达丰度低,结果不合适
#dataExpr <- y2[which(rowSums(y2)>165),]
#用mad筛选,取前75%
m.mad = apply(dataExpr,1,mad)
dataExpr <- dataExpr[which(m.mad >
max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
datExpr <- as.data.frame(t(dataExpr))
###筛选掉异常的样本
#A=adjacency(t(datExpr),type="signed") # 计算相似性矩阵
#k=as.numeric(apply(A,2,sum))-1 ## 计算网络的连接度
#Z.k=scale(k) # 连接度标准化
#thresholdZ.k=- 2.5 # 设置连接度筛选阀值, 这部分是关键,筛选掉异常的样本
#outlierColor=ifelse(Z.k<thresholdZ.k,"red","black") # 将异常样本进行标注
#remove.samples= Z.k<thresholdZ.k | is.na(Z.k) # 删除异常样本
#datExpr=datExpr[!remove.samples,]
nGenes=ncol(datExpr)
nSamples=nrow(datExpr)
sampleTree = hclust(dist(datExpr),method = "average")
sizeGrwindow(12,9)
par(cex=0.6)
par(mar=c(0,4,2,0))
plot(sampleTree,main="Sample clustering",sub="",xlab="",cex.lab=1.5,cex.axis=1.5,cex.main=2)
####################获取处理的文件##################
#trait=colnames(dataExpr)
#write.csv(trait,file = "character.csv",row.names = F)
####################################################
traitData=read.csv("periodCharacter.csv",row.names=1,header = T)
#######################以下为处理表达矩阵的
#miRNASample=rownames(datExpr)
#traitRows=match(miRNASample,traitData$Treat)
#datTraits=traitData[traitRows,-1]
#rownames(datTraits)=traitData[traitRows,1]
#collectGarbage()
traitColors= numbers2colors(traitData,signed=FALSE)
pdf("sample_L.pdf",width = 25,height = 12)
plotDendroAndColors(sampleTree,traitColors,groupLabels = names(traitData),main="Sample dendrogram and treat heatmap")
dev.off()
save(datExpr,traitData,sampleTree,traitColors,file = "01-data_sample_L.RData")
注意事项:
在导入数据时,要先储存一下df的行名,因为将NA替换成0的操作过程中,行名会丢失
因为只分析各个时期的表达量,不存在批次效应
无需筛异常样本,筛完后样本数为0
第二步:获得net与TOM矩阵
#对野生小麦的表达矩阵进行WGCNA分析,基因聚类与模块划分
#常规表达矩阵即可,即基因在行,样品在列,进入分析前做一个转置。
#RPKM、FPKM或其它标准化方法影响不大,推荐使用Deseq2的varianceStabilizingTransformation或log2(x+1)对标准化后的数据做个转换
rm(list=ls())
library(WGCNA)
library(reshape2)
library(limma)
library(stringr)
setwd("/users/songwn/user/ZY/database/NLR/WGCNA/miRNA")
options(stringsAsFactors = FALSE)
#exprMat <- "allTreatTPM.csv"
exprMat <- "periodTPM.csv"
type = "unsigned"#无方向性
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)# 关联样品性状的二元变量时,设置
####################导入数据###############
dataExpr <- read.csv(exprMat, sep=',', row.names=1, header=T,
quote="", comment="", check.names=F)
index <- rownames(dataExpr)
dataExpr <- as.data.frame(lapply(dataExpr,as.numeric))
#sum(is.na(dataExpr))
dataExpr[is.na(dataExpr)] <-0
row.names(dataExpr) <- index
###########消除批次效应##########
#chara <- read.csv("character.csv",sep = ",",header = T)
#batch <- c(rep("salt",27),rep("drout",16),rep("period",17))
#y2 <- removeBatchEffect(dataExpr,batch = batch)
#pdf("test3.pdf",width = 20,height = 10)
#sampleTree <- hclust(dist(t(y2)),method = "average")
#plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")
#dev.off()
####################筛选基因和样本############
#剔除基因表达丰度低,结果不合适
#dataExpr <- y2[which(rowSums(y2)>165),]
###用mad筛选,取前75%
#m.mad = apply(y2 ,1,mad)
#dataExpr <- y2[which(m.mad >
#max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
#datExpr <- as.data.frame(t(dataExpr))
##################组织表达矩阵用以下版本筛选,且不筛异常样本
m.mad = apply(dataExpr,1,mad)
dataExpr <- dataExpr[which(m.mad >
max(quantile(m.mad, probs=seq(0, 1, 0.25))[2],0.01)),]
datExpr <- as.data.frame(t(dataExpr))
#筛选掉异常的样本
#A=adjacency(t(datExpr),type="signed") # 计算相似性矩阵
#k=as.numeric(apply(A,2,sum))-1 ## 计算网络的连接度
#Z.k=scale(k) # 连接度标准化
#thresholdZ.k=- 2.5 # 设置连接度筛选阀值, 这部分是关键,筛选掉异常的样本
#outlierColor=ifelse(Z.k<thresholdZ.k,"red","black") # 将异常样本进行标注
#remove.samples= Z.k<thresholdZ.k | is.na(Z.k) # 删除异常样本
#datExpr=datExpr[!remove.samples,]
nGenes=ncol(datExpr)
nSamples=nrow(datExpr)
##样本聚类
##sampleTree <- hclust(dist(datExpr),method = "average")
##plot(sampleTree,main="Sample clustering to detect outliers",sub="",xlab="")
##筛选软阈值
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector=powers, networkType=type, verbose=5)
# 横轴是Soft threshold (power),纵轴是无标度网络的评估参数,数值越高,
# 网络越符合无标度特征 (non-scale)
pdf('powerL.pdf',width = 14,height = 9)
par(mfrow = c(1,2))
cex1 = 0.8
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",
ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")
# 筛选标准。R-square=0.85
abline(h=0.8,col="red")
# Soft threshold与平均连通性
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,
cex=cex1, col="red")
dev.off()
power=sft$powerEstimate
if (is.na(power)){
power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
ifelse(type == "unsigned", 6, 12))
)
)
}
##网络构建
# corType: pearson or bicor
# numericLabels: 返回数字而不是颜色作为模块的名字,后面可以再转换为颜色
# saveTOMs:最耗费时间的计算,存储起来,供后续使用
# mergeCutHeight: 合并模块的阈值,越大模块越少
cor <- WGCNA::cor
net = blockwiseModules(datExpr, power = power, maxBlockSize = nGenes,
TOMType = type, minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs=TRUE, corType = corType,
maxPOutliers=maxPOutliers, loadTOMs=TRUE,
saveTOMFileBase = paste0("TOM_L.RData"),
verbose = 3)
# 根据模块中基因数目的多少,降序排列,依次编号为 `1-最大模块数`。
# **0 (grey)**表示**未**分入任何模块的基因。
table(net$colors)
cor <- stats::cor
pdf('net_L.pdf',width = 14,height = 9)
moduleLabels = net$colors
moduleColors = labels2colors(moduleLabels)
# Plot the dendrogram and the module colors underneath
# 如果对结果不满意,还可以recutBlockwiseTrees,节省计算时间
plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
MEs=net$MEs
geneTree=net$dendrograms[[1]];
save(MEs,moduleLabels,power,moduleColors,geneTree,file="02-network_L-auto.RData")
#save(power,file="02-network_L-auto.RData")
没有合适的power值,取的经验值:9
第三步:获得模块与处理的相关性,以及各个处理与模块的GS和MM值
#对野生二粒小麦进行WGCNA分析,针对模块与表型的相关性,寻找感兴趣的模块与表型寻找hub基因
rm(list = ls())
library(WGCNA)
options(stringsAsFactors = FALSE)
lname=load(file = "01-data_sample_L.RData")
lname
lname=load(file="02-network_L-auto.RData")
lname
nGenes=ncol(datExpr)
nSamples=nrow(datExpr)
#datTraits[45,"period"] <- 0
MEs=moduleEigengenes(datExpr,moduleColors)$eigengenes
MEs=orderMEs(MEs)
moduleTraitCor=cor(MEs,traitData,use = "p")
moduleTraitPvalue=corPvalueStudent(moduleTraitCor,nSamples)
sizeGrWindow(10,6) #绘图框大小
textMatrix=paste(signif(moduleTraitCor,2)," (",signif(moduleTraitPvalue,1),")",sep="")
dim(textMatrix)=dim(moduleTraitCor)
pdf("Module-trait relationships_L.pdf",width = 20,height = 10)
par(mar=c(9,12,3,3))
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(traitData),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
xColorWidth = 1,
setStdMargins = FALSE,#默认参数出图
#yColorWidth=2,
textMatrix = textMatrix,
cex.text = 0.75,
zlim=c(-1,1),
main=paste("Module-trait relationships"))
dev.off()
############ 寻找hubgene ###########
#####################循环的414过程###########################
#sapply(traitData$L007_R1_DS_1_cm_1,mode)
#sapply(traitData[[A]],mode) 两个方法的内容是一样的,数据类型也一样
#GS_name=paste("GS",A,sep = ".")
#A <- "L007_R1_DS_1_cm_1"
####################################
#以下循环可行
list <- names(traitData)
for (i in list){
i_file=as.data.frame(traitData[[i]])
names(i_file)=i #对CM的列名进行更换,上一步操作后的列名为“datTrait$CM”
modNames=substring(names(MEs),3)#获取模块颜色名称,从ME后开始取
geneModuleMembership=as.data.frame(cor(datExpr,MEs,use = "p"))
MMPvalue=as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership),nSamples))
names(geneModuleMembership)=paste("MM",modNames,sep = "")
names(MMPvalue)=paste("p.MM",modNames,sep="")
geneTraitSignificance=as.data.frame(cor(datExpr,i_file,use = "p"))
GSPvalue=as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
names(geneTraitSignificance)=paste("GS.",names(i_file),sep = "")
names(GSPvalue)=paste("p.GS.",names(i_file),sep="")
probes=names(datExpr)
geneInfo = data.frame(geneID=probes,
moduleColor=moduleColors,
geneTraitSignificance,
GSPvalue)
modOrder = order(-abs(cor(MEs,i_file,use = "p")))
for(mod in 1:ncol(geneModuleMembership)){
oldnames=names(geneInfo)
geneInfo=data.frame(geneInfo,geneModuleMembership[,modOrder[mod]],MMPvalue[,modOrder[mod]])
names(geneInfo)=c(oldnames,paste("MM.",modNames[modOrder[mod]],sep = ""),paste("p.MM.",modNames[modOrder[mod]],sep = ""))
print(mod)
}
GS_name=paste("GS",i,sep = ".")
geneOrder = order(geneInfo$moduleColor,-abs(geneInfo[[GS_name]]))
geneInfo=geneInfo[geneOrder,]
write.csv(geneInfo,file = paste("geneInfo_",i,".csv",sep = ""))
}
##################################
############################也可分布进行,比较麻烦,代码如下####################
L007_R1_DS_1_cm_1=as.data.frame(traitData$L007_R1_DS_1_cm_1)
names(L007_R1_DS_1_cm_1)="L007_R1_DS_1_cm_1" #对CM的列名进行更换,上一步操作后的列名为“datTrait$CM”
modNames=substring(names(MEs),3)#获取模块颜色名称,从ME后开始取
geneModuleMembership=as.data.frame(cor(datExpr,MEs,use = "p"))
MMPvalue=as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership),nSamples))
names(geneModuleMembership)=paste("MM",modNames,sep = "")
names(MMPvalue)=paste("p.MM",modNames,sep="")
geneTraitSignificance=as.data.frame(cor(datExpr,L007_R1_DS_1_cm_1,use = "p"))
GSPvalue=as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance),nSamples))
names(geneTraitSignificance)=paste("GS.",names(L007_R1_DS_1_cm_1),sep = "")
names(GSPvalue)=paste("p.GS.",names(L007_R1_DS_1_cm_1),sep="")
#########输出各模块基因及相关结果#########
probes=names(datExpr)
geneInfo = data.frame(geneID=probes,
moduleColor=moduleColors,
geneTraitSignificance,
GSPvalue)
modOrder = order(-abs(cor(MEs,L007_R1_DS_1_cm_1,use = "p")))
for(mod in 1:ncol(geneModuleMembership)){
oldnames=names(geneInfo)
geneInfo=data.frame(geneInfo,geneModuleMembership[,modOrder[mod]],MMPvalue[,modOrder[mod]])
names(geneInfo)=c(oldnames,paste("MM.",modNames[modOrder[mod]],sep = ""),paste("p.MM.",modNames[modOrder[mod]],sep = ""))
print(mod)
}
geneOrder = order(geneInfo$moduleColor,-abs(geneInfo$GS.L007_R1_DS_1_cm_1))
geneInfo=geneInfo[geneOrder,]
write.csv(geneInfo,file = "geneInfo_L007_R1_DS_1_cm_1.csv")
##################################观察具体模块与处理的相关性##########################
module="midnightblue"
column=match(module,modNames)
moduleGenes=moduleColors==module
sizeGrWindow(7,7)
par(mfrow=c(1,1))
pdf("midnightblue module membership vs. TX gene significance.pdf",width = 10,height = 6)
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
abs(geneTraitSignificance[moduleGenes,1]),
xlab = paste("Module Membership in",module,"module"),
ylab = "Gene significance for TX",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2,
cex.lab = 1.2,
cex.axis = 1.2,
col = module)
dev.off()
####################################################
计算每个处理(共16个)与各个模块的GS和MM值时,可用循环进行计算
R中的循环格式:
R语言:for循环使用小结_LandH的Blog的博客-CSDN博客_r语言for语句https://blog.csdn.net/u013084616/article/details/73740250
paste用法:
R中获取dataframe某列值的方法:
dataframe$X 等同于 dataframe[[X]]
R的一些基础命令
取列表子集、 列表中含有列表
> x = list(id = 1:4, height = 170, gender = "male")
> x
$id
[1] 1 2 3 4
$height
[1] 170
$gender
[1] "male"
x[1] #取到名称 和 内容
x["id"] #取到名称 和 内容
$id
[1] 1 2 3 4
x[[1]] #只取内容
x[["id"]] #只取内容
> [1] 1 2 3 4
> x = list( a = list(1,2,3,4), b = c("a","d","e"))
> x
$a
$a[[1]]
[1] 1
$a[[2]]
[1] 2
$a[[3]]
[1] 3
$a[[4]]
[1] 4
$b
[1] "a" "d" "e"
> x[[1]][[2]] #取元素里的内容
> [1] 2
> x[[1]][2] #取元素
[[1]]
[1] 2
> x[[c(1,3)]] #取元素
[1] 3
> x[[c(2,2)]] #取元素
[1] "d"
R语言基础之第二部分 操纵数据/取子集 - 简书 (jianshu.com)https://www.jianshu.com/p/acc88b353bbe
变量名称和字符串:
R学习笔记(二)变量名称和字符串的转换 - 知乎 (zhihu.com)https://zhuanlan.zhihu.com/p/30383865
R中针对dataframe的一些操作,如修改某数值:
R语言data.frame的常用操作总结 - HuskySir - 博客园 (cnblogs.com)https://www.cnblogs.com/huskysir/p/10841595.html
> df[3,'Chinese'] <- 65 #将3号同学的语文成绩修改为65
#将语文成绩低于20的同学的语文成绩修改为20
> df[which(df$Chinese < 20),'Chinese'] <- 20
> df
ID Class Chinese Math English
1 1 2 65 59 23
2 2 2 37 38 45
3 3 3 65 76 67
4 4 1 90 49 87
5 5 2 20 71 34
6 6 3 89 99 46
7 7 1 94 38 87
8 8 2 66 77 95
9 9 3 62 93 43
10 10 1 20 21 76
11 11 2 20 65 23
12 12 3 20 12 94