知识分享| 转录组个性化分析(6)——WGCNA

       近年来,很多SCI高分文章中都使用了WGCNA,那么WGCNA是什么,它可以应用于哪些研究方向,又如何进行WGCNA分析,其分析结果如何看?现在就带着这些问题,跟着小易一起学习探讨吧!

1  WGCNA介绍

       WGCNA(weighted gene co-expression network analysis,权重基因共表达网络分析)是一种分析多个样本基因表达模式的分析方法,可将表达模式相似的基因进行聚类,并分析模块与特定性状或表型之间的关联关系,因此在疾病以及其他性状与基因关联分析等方面的研究中被广泛应用。

WGCNA与趋势分析都为基因共表达分析方法,与趋势分析相比,WGCNA 有以下优势:

a. 聚类方法:使用权重共表达策略(无尺度分布),更加符合生物学现象;

b. 基因间的关系:能呈现基因间的相互作用关系,且能找出处于调控网络中心的 hub 基因;

c. 样本数:适合于大样本量,且越多越好。而趋势分析 5 个点以上就会使结果非常复杂,准确性 下降,且最多只能分析 8 个点;

d. 与表型的关联:可以利用模块特征值和 hub 基因与特定性状、表型进行关联分析,更准确地分 析生物学问题。

2  数据要求

       由于WGCNA是基于相关系数的表达调控网络分析方法。当样本数过低的时候,相关系数的计算是不可靠的,得到的调控网络价值不大。所以,我们推荐的样本数如下:

a. 当独立样本数≥8(非重复样本)时,可以考虑基于Pearson相关系数的WGCNA共表达网络的方法(效果看实际情况而定);

b. 当样本数≥15(可以包含生物学重复)时,WGCNA方法会有更好的效果。

c. 当样品数<8时,不建议进行该项分析。

d. 该方法对于不同材料或不同组织进行分析更有意义,对于不同时间点处理相同样品意义不大。

3  分析代码

a. wgcna01_1.R,wgcna01_2.R是用来构建网络和相关图,适用于性状关联和无性状关联的WGCNA网络构建。因为这两步很费时,就拆开了并行跑。

用法:Rscript wgcna01_1.R expre.xls;

b. wgcna02.R 是可视化基因网络图。

Rscript wgcna02.R trait.txt(optional)

代码:wgcna01_1

# Description: To perform co‐expression analysis based on WGCNA Rpackage
# Version: 1.0, mics.com, 2018‐10‐09
# Usage: Rscript wgcna.R rpkm.xls outdir
args <‐ commandArgs(TRUE);
exp_data=args[1]
# Part 1. Basic Analysis
# 1. Data input, cleaning and pre‐processing
# 1.1 Loading expression data
# Load the package
library(WGCNA);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the data set
data = read.table(exp_data, head=T, row.names=1);
multiExpr = t(data);
# 1.2 Checking data for excessive missing values
gsg = goodSamplesGenes(multiExpr, verbose = 3);
#
if (!gsg$allOK)
{
# Remove the offending genes and samples from the data:
multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]
}

nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)

if(nGenes > 30000)
{
library(genefilter)
data = read.table(exp_data, head=T, row.names=1)
f1 <‐ pOverA(0.8,0.5)
f2 <‐ function(x) (IQR(x) > 0.5)
ff <‐ filterfun(f1, f2)
selected <‐ genefilter(data, ff)
sum(selected)
esetSub <‐ data[selected, ]
multiExpr <‐ as.data.frame(t(esetSub))
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
}

save(multiExpr, file = "01‐dataInput_Expr.RData")

####Plot 1 samples cluster
sampleTree = hclust(dist(multiExpr), method = "average");
sizeGrWindow(12,9)
pdf(file = "sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers",
sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
dev.off()

#########################################remove outlier samples(depend
on samples totally)
# Plot a line to show the cut
#abline(h = 15, col = "red");
# Determine cluster under the line
#clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
#table(clust)
# clust 1 contains the samples we want to keep.
#keepSamples = (clust==1)
#multiExpr = multiExpr[keepSamples, ]
#nGenes = ncol(multiExpr)
#nSamples = nrow(multiExpr)
# 2. Step‐by‐step network construction and module detection
#=====================================================================
=========
# 2.1 Choosing the soft‐thresholding power: analysis of network
# Choose a set of soft‐thresholding powers
powers = c(c(1:10), seq(from = 12, to=30, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )
save(sft, file = "01‐sftpower.RData")

####Plot 2 selected power
pdf(file ="Network_power.pdf", width = 9, height = 5);
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale‐free topology fit index as a function of the soft‐thresholdingpower
plot(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology ModelFit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
abline(h=0.85,col="red")
# Mean connectivity as a function of the soft‐thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,
cex=cex1,col="red")
dev.off()
# Select power
softPower <‐ sft$powerEstimate
####Plot 3 检验选定的β值下记忆网络是否逼近 scale free
# 基因多的时候使用下面的代码:
k <‐ softConnectivity(datE=multiExpr,power=softPower)
pdf(file = "Check_Scalefree.pdf", width = 10, height = 5);
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k,main="Check Scale free topology\n")
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# 2.2 One‐step network construction and module detection
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
#2.2.1 获得临近矩阵,根据β值获得临近矩阵和拓扑矩阵,:
softPower <‐ sft$powerEstimate
adjacency = adjacency(multiExpr, power = softPower);
# 将临近矩阵转为 Tom 矩阵
TOM = TOMsimilarity(adjacency);
# 计算基因之间的相异度
dissTOM = 1‐TOM
save(TOM, file = "TOMsimilarity_adjacency.RData")
#2.2.2a Clustering using TOM
geneTree = hclust(as.dist(dissTOM),method="average");
minModuleSize = 30
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
minClusterSize = minModuleSize);
table(dynamicMods)
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
#Plot 4 plot the dendrogram and colors underneath
pdf(file = "GeneDendrogramColors.pdf", width = 12, height = 9);
sizeGrWindow(12,9)
plotDendroAndColors(geneTree, dynamicColors,
"Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
dev.off()
#2.2.2b Merging of modules whose expression profiles are very similar
MEList = moduleEigengenes(multiExpr, colors = dynamicColors)
MEs = MEList$eigengenes
MEDiss = 1‐cor(MEs);
METree = hclust(as.dist(MEDiss), method = "average");
#Plot 5 merge the dendrogram and colors underneath for similar module
pdf(file = "Clustering_similar_module.pdf", width = 12, height = 9);
sizeGrWindow(12, 9)
plot(METree, main = "Clustering of similar module eigengenes",
xlab = "", sub = "")
MEDissThres = 0.25
abline(h=MEDissThres, col = "red")
dev.off()
merge = mergeCloseModules(multiExpr, dynamicColors, cutHeight =
MEDissThres, verbose = 3)
mergedColors = merge$colors;
mergedMEs = merge$newMEs;
#Plot 6 plot the gene dendrogram again, with the original and merged
module colors underneath
pdf(file = "Merged_GeneDendrogramColors.pdf", width = 12, height = 9)
sizeGrWindow(12, 9)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()
moduleColors = mergedColors
colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)‐1;
MEs = mergedMEs;
save(MEs, moduleLabels, moduleColors, geneTree, file = "02‐networkConstruction‐stepByStep.RData")

 代码:wgcna01_2

args <‐ commandArgs(TRUE);
exp_data=args[1]
# Part 1.1 Tom Network visualization
# 1. Data input, cleaning and pre‐processing
# 1.1 Loading expression data
# Load the package
library(WGCNA);
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the data set
data = read.table(exp_data, head=T, row.names=1);
multiExpr = t(data)
# 1.2 Checking data for excessive missing values
gsg = goodSamplesGenes(multiExpr, verbose = 3);
#
if (!gsg$allOK)
{
# Remove the offending genes and samples from the data:
multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
if(nGenes > 30000)
{
library(genefilter)
data = read.table(exp_data, head=T, row.names=1)
f1 <‐ pOverA(0.8,0.5)
f2 <‐ function(x) (IQR(x) > 0.5)
ff <‐ filterfun(f1, f2)
selected <‐ genefilter(data, ff)
sum(selected)
esetSub <‐ data[selected, ]
multiExpr <‐ as.data.frame(t(esetSub))
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
}
# 1.3 Choosing the soft‐thresholding power: analysis of network
topology
# Choose a set of soft‐thresholding powers
powers = c(c(1:10), seq(from = 12, to=30, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )
softPower <‐ sft$powerEstimate
# 1.4 Visualizing the gene network
# Calculate topological overlap anew: this could be done more
efficiently by saving the TOM
TOM = TOMsimilarityFromExpr(multiExpr, power = softPower);
save(TOM, file = "TOMsimilarityFromExpr_multiExpr.power.RData")
dissTOM = 1‐TOM
nSelect = 500
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
#Plot 1 Visualizing the gene network in all modules for Tom matrix
pdf(file="Tom_GeneNetworkHeatmap.pdf", width = 9, height = 9);
sizeGrWindow(9,9)
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Gene Network
heatmap plot")
dev.off()

代码:wgcna02

library(WGCNA)
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
lname1= load("01‐dataInput_Expr.RData")
lname2=load("01‐sftpower.RData")
lname3=load("02‐networkConstruction‐stepByStep.RData")
args <‐ commandArgs(TRUE);
trait_data=args[1]
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# 3.2 Visualizing the network of module eigengenes
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# Recalculate module eigengenes
MEs = moduleEigengenes(multiExpr, moduleColors)$eigengenes
MET = orderMEs(MEs)
#Plot 3.2.a plot the relationships among the module eigengenes
pdf(file="Module_Eigengene_network.pdf", width = 9, height = 9);
par(mfrow=c(1,2))
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro =
c(0,4,2,0),plotHeatmaps = FALSE)
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap =
c(3,4,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
#Plot 3.2.b plot MEs corelation between different module
library(PerformanceAnalytics)
Matrix_MEs <‐ as.matrix(MEs)
pdf(file="Correlation_modules.pdf",width = 9, height = 9)
chart.Correlation(Matrix_MEs,histogram = TRUE,pch=19)
dev.off()
#=====================================================================
=========
# 4.Find all genes for each module and Exporting to Cytoscape
#=====================================================================
=========
# Recalculate topological overlap if needed
# Select modules
Allmodules = unique(moduleColors);
multiExpr <‐ as.data.frame(multiExpr)
probes = names(multiExpr) ###"probes means gene names"
# Select module probes
for (i in Allmodules){
module=i
inModule = is.finite(match(moduleColors, module));
modProbes = probes[inModule];
#4.1 Write all gene from each modules
write.table(modProbes,paste("allgenefrom_module",module,".txt",
sep=""),quote = FALSE,row.names = F,col.names = FALSE,sep="\t")
#4.2 Select the corresponding Topological Overlap for Cytoscape
input files
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can
read
cyt = exportNetworkToCytoscape(modTOM,
edgeFile = paste("CytoscapeInput‐edges‐", module, ".txt",sep=""),
nodeFile = paste("CytoscapeInput‐nodes‐", module, ".txt",sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
altNodeNames = modProbes,
nodeAttr = moduleColors[inModule])
}
#=====================================================================
=========
# 5. Find the hub gene for all modules
#=====================================================================
=========
hubs <‐ chooseTopHubInEachModule(multiExpr, moduleColors,
omitColors = "grey",
power = 14,
type = "unsigned")
H <‐ as.matrix(hubs)
write.table(H,"hubGenes_foreachModule.txt",quote = FALSE,row.names =
T,col.names = FALSE,sep="\t")
# Part 6. External Analysis
if(file.exists(trait_data)){
#=====================================================================
=========
# 6.1 Loading trait data
#=====================================================================
=========
traitData = read.table(trait_data);
# Form a data frame analogous to expression data that will hold the
clinical traits.
nGenes =ncol(multiExpr);
nSamples = nrow(multiExpr);
sampleName=rownames(multiExpr)
traitRows = match(sampleName,traitData$Sample) #colum name for all
sample must be called "Sample"
datTraits = traitData[traitRows, ‐1];
rownames(datTraits) = traitData[traitRows, 1];
save(datTraits,file="01‐dataInput_Trait.RData")

collectGarbage();
# 6.2 Quantifying module–trait associations
MEs0 = moduleEigengenes(multiExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
####Plot1 Module−trait relationships in heatmap plot
sizeGrWindow(10,6)
pdf(file = "Module_trait_relationships.pdf", width = 12, height = 9);
#Will display correlations and their p‐values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep ="");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
#Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(‐1,1),
main = paste("Module‐trait relationships"))
dev.off()
}

 4  结果说明

4.1 GeneDendrogramColors.pdf:基因进化树(基因共表达模块分析)

       横坐标:基因共表达模块分析的结果展示,不同的颜色代表不同的基因模块,灰色属于未知模块的基因

       纵坐标:基因间的不相似性系数,进化树的每一个树枝代表一个基因

4.2 NetworkHeatmap.pdf :共表达模块基因相关系数热图

       图上和图左是基因系统发育树,每一个树枝代表一个基因,不同颜色亮带表示不同的module,每一个亮点表示每个基因与其他基因的相关性强弱,每个点的颜色越深(白→黄→红)代表行和列对应的两个基因间的连通性越强。P值采用student's t test计算所得,P值越小代表基因与模块相关性的显著性越强。

4.3 EigengeneDendrogramHeatmap.pdf :共表达模块的聚类图和热图

       聚类图:对每个基因模块的特征向量基因进行聚类,

       热图:我们将所有模块两两间的相关性进行分析,并绘制热图,颜色越红,相关性越高

4.4 ModuleSampleRelation.pdf : 性状相关特征模块分析(此分析需结合生理数据,如没有生理数据则没有此结果)

       每一列代表一个性状,每一行代表一个基因模块。每个格子里的数字代表模块与性状的相关性,该数值越接近1,表示模块与性状正相关性越强;越接近-1,表示模块与性状负相关性越强。括号里的数字代表显著性P value,该数值越小,表示显著性越强。P值采用student's t test计算所得,P值越小代表性状与模块相关性的显著性越强。

  • 0
    点赞
  • 29
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
WGCNA (Weighted Gene Co-expression Network Analysis) 是一种基于基因共表达网络的数据分析方法,旨在揭示基因之间的关系以及它们与样本性状之间的关联。Python是一种流行的编程语言,可以用于实现WGCNA。 Python中有许多库可以用来进行WGCNA分析,如scipy、networkx和numpy等。首先,我们需要根据基因表达数据构建基因共表达网络。可以使用scipy库中的函数计算基因之间的相关性,然后根据相关性构建共表达网络。接下来,可以使用networkx库对网络进行分析,例如计算基因的度中心性和介数中心性等指标。这些指标可以帮助我们了解网络中的重要基因。 在WGCNA中,为了实现模块化,通常会将相似的基因分到同一个模块中。在Python中,可以使用numpy库中的函数执行这个步骤。首先,通过hierarchical clustering算法对基因进行聚类分析,然后使用动态切割算法将聚类结果划分为不同的模块。这些模块可以表示不同的生物学功能模块。 最后,可以使用WGCNA分析来探索基因模块与样本性状之间的关联。可以使用scipy库中的统计函数计算基因模块与样本性状之间的相关性,并进行统计学显著性检验。这可以帮助我们找到与样本性状相关的基因模块,并了解这些基因模块在样本分类和特征选择中的重要性。 总之,Python是一种强大的工具,可以用于实现WGCNA分析。通过使用Python中的一些库和函数,我们可以构建基因共表达网络,进行模块化分析,并揭示基因模块与样本性状之间的关联。这有助于我们在基因表达数据中发现重要的生物学信息。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值