「转录组」WGCNA实战原理两不误

最新推荐文章于 2024-07-29 20:44:07 发布

陈有朴

最新推荐文章于 2024-07-29 20:44:07 发布

阅读量1.8k

点赞数

分类专栏：生物信息学-软件文章标签：线性代数矩阵 r语言生物信息学

本文链接：https://blog.csdn.net/qq524730309/article/details/126594619

版权

生物信息学-软件专栏收录该内容

12 篇文章 1 订阅

订阅专栏

表达矩阵的处理

后续分析所用到的数据，均为FPKM标准化后的表达矩阵。

从流程上对WGCNA进行解读

1）当对芯片数据或者RNA-Seq等数据完成分析之后，我们可以得到一张表达矩阵

一行为一个样本，一列为一个gene。

2）读取表达矩阵之后，对其进行adjacency矩阵的计算

adjacency矩阵，基于gene之间相关性的矩阵。每一个单元代表了2个gene之间的关联性（similarity）。

adjacency矩阵的构建，涉及到软阈值的选择，即构建一个幂函数，选择一个指数（power），强化强相关性，弱化弱相关性。

「adjacency矩阵的拓展」

adjacency矩阵有2种计算方式，

1）unsigned： $a_{ij} = |cor(x_{x}, x_{j})|^{β}$

2）signed： $a_{ij}=(0.5*(1+cor))^{β}$

第一种情况，认为具有高负相关的gene之间，是有联系的。

第二种情况，认为具有高负相关的gene之间，是没有联系的。

举个例子，

假设cor = -1，β=2，unsigned情况下，计算得到的adjacency为1，即gene之间高度关联。signed情况下，计算得到的adjacency为0，即gene之间无关联。

3）TOM矩阵的构建

引入一个指标，即topological overlaps的计算，用于定义该gene是否有多个高关联性的gene（用于gene module的构建）
$Cor(x_{i},x_{j}) -> TOM(x_{i}, x_{j})$

「TOM矩阵的拓展」

但是针对unsigned类型的adjacency矩阵，可能会出现无法判断几个基因之间的关联关系。

比如现有i，j，k 3个gene，计算得到它们之间的关联关系为 $(+, +, -)$ ，

1）i和j之间为正相关；2）i和k之间为正相关；3）j和k之间为负相关。

这就矛盾了，说明上面的关联关系可能受到了一些噪音的影响等等。

此时，引入signed TOM，其相较于unsigned TOM能够更好的解决gene之间的冲突情况。

Note：需要注意的是，当使用signed adjacency进行分析时，不会出现上述矛盾。

同时，这些矛盾作者都已经考虑到了，但还是有一定区别，

TOMsimilarity()中，默认设置为TOMType = "unsigned"
TOMsimilarityFromExpr()中，默认设置为TOMType = "signed"

4）TOM dissimilarity矩阵的构建

TOM dissimilarity矩阵代表了某个gene与其他gene之间的距离。

用上述矩阵进行聚类分析，得到gene module。

5）初步构建gene module

以TOM dissimilarity矩阵作为输入，进行聚类分析。

代码如下，

# Turn data expression into topological overlap matrix
power=sft$powerEstimate   # 使用sft$powerEstimate调用预估出的软阈值
TOM = TOMsimilarityFromExpr(datExpr, power = power)
dissTOM = 1-TOM
# Plot gene tree
geneTree = hclust(as.dist(dissTOM), method = "average");   # 用于后续cutreeDynamic()，对gene tree进行裁剪
# pdf(file = "3-gene_cluster.pdf", width = 12, height = 9);
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);
# dev.off()

6）使用`cutreeDynamic()`进行聚类分析的优化

使用pairwise eigengene，进一步计算得到eigengene dissimilarity，用于聚类分析，筛选指标，最终合并gene module。

Note：eigengene为gene表达模式的指标（PCA降维得到的第一个主成分）

cutreeDynamic()代码如下，

cutreeDynamic
labels2colors
plotDendroAndColors

# Module identification using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM, deepSplit = 2, 
                            pamRespectsDendro = FALSE, minClusterSize = 30);
table(dynamicMods)
length(table(dynamicMods)) 
# Convert numeric labels into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
# pdf(file = "4-module_tree.pdf", width = 8, height = 6);
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",dendroLabels = FALSE,
                    hang = 0.03,addGuide = TRUE, guideHang = 0.05,main = "Gene dendrogram and module colors")
# dev.off()

绘制图如下，

gene module合并代码如下 -> 基于cutreeDynamic()的分析结果进行聚类分析的gene module的合并

mergeCloseModules：合并gene module
plotDendroAndColors

# Merge close modules
MEDissThres=0.25
abline(h=MEDissThres, col = "red")
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3) 
mergedColors = merge$colors  
mergedMEs = merge$newMEs  
# Plot merged module tree
# pdf(file = "5-merged_Module_Tree.pdf", width = 12, height = 9)  
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), 
                    c("Dynamic Tree Cut", "Merged dynamic"), dendroLabels = FALSE, 
                    hang = 0.03, addGuide = TRUE, guideHang = 0.05)  
# dev.off()
# merge$oldMEs，为数据框，行为样本，列为对应的gene module，其中的数值代表了它们之间的关联度
write.table(merge$oldMEs,file="oldMEs.txt");
write.table(merge$newMEs,file="newMEs.txt");

绘制图如下，可以看到有一些gene module被合并了，

7）将gene module与表型特征相联系

使用标准化的eigengene进行计算。
$C or (ME s, T r ai t s)$
代码如下，

moduleEigengenes，计算每个gene module的特征值
moduleTraitCor = cor(MEs, datTraits, use = "p")
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)

# Define numbers of genes and samples
nGenes = ncol(datExpr);
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, mergedColors)$eigengenes
MEs = orderMEs(MEs0)
# r$> MEs[1:5, 1:5]
#           MElightcyan1 MEdarkolivegreen MEgreenyellow       MEcyan     MEwhite
# LH38_Z1_1    0.3093871      -0.01665288     0.1583588 -0.017809267 -0.01305706
# LH38_Z1_2    0.3318247      -0.01679628     0.1629585 -0.010450719  0.04613664
# LH38_Z1_3    0.3087032      -0.01762908     0.1527085 -0.018456470 -0.03661408
# LH38_Z2_1    0.3050361      -0.01431782     0.1589569 -0.020199995  0.06161998
# LH38_Z2_2    0.3422982      -0.01466196     0.1722949 -0.006663298  0.08192540

# 一些数据处理部分
# Read microbial data as traits
bac_traits = read.table("traits_file/b_order_234.txt", header = T, sep = "\t")
rownames(bac_traits) = bac_traits[, 1]
bac_traits = bac_traits[, -1]
# r$> bac_traits[1:5, 1:5]
#           Pseudomonadales Enterobacteriales Xanthomonadales Burkholderiales Verrucomicrobiales
# LH38_Z1_1      0.02120943       0.006338742      0.07261663      0.05920385         0.02674949
# LH38_Z1_2      0.04192444       0.009089757      0.06880071      0.05583164         0.02156440
# LH38_Z1_3      0.01393256       0.004525862      0.06961207      0.05100152         0.02189402
# LH39_Z1_1      0.11288033       0.013045132      0.07138692      0.04186106         0.01743154
# LH39_Z1_2      0.01214503       0.003410243      0.08236562      0.03733519         0.01953600


rownames(MEs) = paste(substr(rownames(MEs), 1, nchar(rownames(MEs))-1), rep(c("1", "2", "3"), 60), sep = "")
# sample names should be consistent in eigen genes and traits !!!!
bac_traits = bac_traits[match(rownames(MEs), rownames(bac_traits)), ]
table(rownames(MEs) == rownames(bac_traits))

# Calculate pearson correlation coefficients between module eigen-genes and traits
moduleTraitCor = cor(MEs, bac_traits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
write.table(moduleTraitCor,file="moduleTrait_correlation.txt");
write.table(moduleTraitPvalue,file="moduleTrait_pValue.txt");

结果图可视化代码如下，

sizeGrWindow(10,6)
# Will display correlations and their p-values
# 合并
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
# pdf("module-traits-bacteria-order.pdf", width = 100, height = 30)
par(mar = c(15, 12, 5, 5));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(bac_traits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = greenWhiteRed(50),
               textMatrix = textMatrix,  # 矩阵单元格上需要显示的信息
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))
# dev.off()

结果图如下，

Note：右边的图例，代表相关性程度

8）Hub gene的鉴定

2种方法，

与目标module有强关联性的gene（gene significance）
使用module membership来鉴定关键gene， $C or (i, ME)$

一般先绘制出以module membership为x，gene significance为y的散点图。

代码如下，

# 以相关性最高的模块为例
which(moduleTraitCor == max(moduleTraitCor, na.rm = TRUE), arr.ind = TRUE)
#            row col
# MEskyblue3  25  30

Nitrosomonadales <- as.data.frame(bac_traits[, 30])
names(Nitrosomonadales) = "Nitrosomonadales"

modNames = substring(names(MEs), 3) # 去除ME前缀

# 计算module membership
# 使用的是WGCNA自带cor函数，使用皮尔逊计算相关性
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
geneModuleMembership[1:5, 1:5]
#                MElightcyan1 MEdarkolivegreen MEgreenyellow      MEcyan       MEwhite
# Zm00001d001763 -0.045547074      -0.10334193   -0.20030129  0.22671189  0.1501121833
# Zm00001d001766  0.004752275      -0.01548396   -0.11130522 -0.22346048  0.0178421845
# Zm00001d001770 -0.286230379      -0.23340743   -0.39930773  0.22791746  0.0925346831
# Zm00001d001774  0.029505956      -0.06714112   -0.06323784 -0.34302212  0.1677206174
# Zm00001d001775 -0.029767437      -0.07642270   -0.13732818 -0.07876773 -0.0008678361

MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
MMPvalue[1:5, 1:5]

names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

# 计算gene significance
geneTraitSignificance = as.data.frame(cor(datExpr, Nitrosomonadales, use = "p"));
# head(geneTraitSignificance)
# nrow(geneTraitSignificance)
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(Nitrosomonadales), sep="");
names(GSPvalue) = paste("p.GS.", names(Nitrosomonadales), sep="");


module = "skyblue3"
column = match(module, modNames);  # match(x, y)，找到y中x的索引位置
moduleGenes = mergedColors==module;
# length(mergedColors)
# nrow(geneModuleMembership)
# nrow(geneTraitSignificance)
# table(mergedColors)
table(moduleGenes)

sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for Nitrosomonadales",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

散点图结果如下，可以看到module membership值越高的gene，其gene significance也越高。

其他

绘制gene module eigengenes与samples之间的热图

library("pheatmap")

# Heatmap of old module eigen-genes and samples
pdf(file="oldMEs.pdf",heigh=80,width=20)
row.names(merge$oldMEs)=names(data0)  # oldMEs，是一个矩阵
pheatmap(merge$oldMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
dev.off()

# Heatmap of new module eigen-genes and samples
# pdf(file="newMEs.pdf",heigh=60,width=20)
row.names(merge$newMEs)=names(data0)
pheatmap(merge$newMEs,cluster_col=T,cluster_row=T,show_rownames=T,show_colnames=T,fontsize=6)
# dev.off()