Mfuzz模糊聚类
硬聚类和软聚类/模糊聚类的区别
硬聚类 (Hard) //软聚类 or 模糊聚类 (Soft / Fuzzy) | |
❓ 定义: 每个对象只属于一个 cluster //每个对象可以部分属于多个 cluster | |
📌 表达形式 :一个样本 → 指定 cluster,如 cluster 3/一个样本 → cluster1: 0.2, cluster2: 0.7, cluster3: 0.1 | |
🎯 适用场景: 分组清晰明确,如分型、细胞类型、K-means 等///连续谱系、时间趋势、转录调控网络,表达相似但边界模糊 | |
📦 常见算法: K-means、层次聚类、DBSCAN 等Fuzzy C-Means(如 Mfuzz)、GMM(高斯混合模型)等 | |
📊 输出: 一个类标签//一个隶属度矩阵(membership) | |
📈 表达趋势图: 每组一条线每组中心线 + 多条带 membership 的线 |
WGCNA vs. 模糊聚类(Mfuzz)
共同点:
🎯 都是无监督聚类方法
🔍 都是基于表达相似性来定义模块/簇
🧪 都常用于转录组数据(bulk RNA-seq)和基因共表达分析
WGCNAMfuzz | |
比较 WGCNA/ Mfuzz 模糊聚类 | |
💡 原理:构建基因共表达网络,按拓扑结构找模块/用 Fuzzy C-Means 算法按表达趋势划分 fuzzy cluster | |
🎯 聚焦目标:网络模块(module)→ 强相关的基因组/表达趋势一致的基因簇(cluster) | |
📊 数据输入:多样本表达矩阵(一般 >15 样本///时间序列或 group 表达均值(少样本也可) | |
🔄 关系类型:基因与基因之间的边 → 构建网络/基因 vs. cluster 的 membership 值 | |
📈 聚类输出:每个 module 对应一组基因(软聚类形式,1 gene 只属于 1 module)///每个 gene 拥有多个 cluster 的隶属度(soft membership) | |
🧠 是否模糊聚类:❌ 不是(虽然有人称它为“软聚类”,但 gene 只能属于一个 module) ✅ 是,典型 fuzzy clustering,每个 gene 属于多个 cluster | |
📌 是否能捕捉动态模式:一般不强调时间趋势/// ✅ 专为时间序列设计,适合分析动态响应 | |
📦 R 包:WGCNA// Mfuzz |
# ===== 安装和加载必要的包 =====
library(tidyverse)
library(Biobase)
library(Mfuzz)
library(readxl)
# ===== 读取数据 =====
group_info <- read_excel("group.xlsx") # 列: sample, group
expr_data <- read_excel("expression_data.xlsx") # 第一列是 Gene name,其余为表达矩阵
expr_data <- expr_data %>%
mutate(RowMean = rowMeans(select(., -1), na.rm = TRUE))
######仅取平均值最大的行
expr_data <- expr_data %>%
group_by(Gene_name) %>%
filter(RowMean == max(RowMean)) %>%
ungroup()
expr_data <- na.omit(expr_data)
# 设置行名并转为表达矩阵
expr_matrix <- expr_data %>%
rename(gene = Gene_name) %>%
column_to_rownames("gene") %>%
as.matrix()
# === 2. 清洗 group 信息 ===
group_info_clean <- group_info %>%
filter(!is.na(group), !is.na(sample)) %>%
mutate(
group = str_trim(group),
sample = str_trim(sample),
group = str_to_upper(group)
) %>%
filter(group %in% c("组别")) %>%
mutate(group = factor(group, levels = c("组别"))) # 强制顺序
# 提取 sample 列名
expr_samples <- colnames(expr_matrix)
# === 4. 仅保留在 group_info 中出现的样本 ===
valid_samples <- intersect(expr_samples, group_info_clean$sample)
expr_data_matched <- expr_matrix
group_info_matched <- group_info_clean %>%
filter(sample %in% valid_samples) %>%
arrange(factor(group, levels = c("组别")))
# 重新按 sample 顺序排列表达矩阵列
expr_data_matched <- expr_data_matched[, group_info_matched$sample]
# ✅ 检查是否匹配成功
stopifnot(all(colnames(expr_data_matched) == group_info_matched$sample))
# ✅ group_mean 现在可直接用于 Mfuzz 后续流程
# 转为 ExpressionSet
eset <- ExpressionSet(assayData = expr_data_matched)
# ===== 2. 数据预处理 =====
# 去除缺失比例过高的行
eset <- filter.NA(eset, thres = 0.25)
# 用均值填充剩余 NA
eset <- fill.NA(eset, mode = "mean")
# 过滤表达量变化太小的基因(标准差太低)
eset <- filter.std(eset, min.std = 0.1)
# 标准化
eset <- standardise(eset)
# ===== 3. Mfuzz 聚类 =====
# 自动估计模糊度参数 m
m <- mestimate(eset)
#m 决定了聚类的“模糊程度”,默认 1.25–2.0
#越大 → 越模糊(每个基因属于多个 cluster)
#越小 → 趋向硬聚类(只属于某一个)
# 设置聚类数量 c(建议试试不同值,6~9 较常见)
set.seed(1234)
c <- 6
cl <- mfuzz(eset, c = c, m = m)
# ===== 4. 绘图(表达模式图)=====
# 1. 打开 PDF 图形设备
cairo_pdf("DEG_change_trend_std_0.1_c9.pdf", width = 12, height = 10, bg = "transparent")
custom_palette <- c(
"#ADAFB1", # 灰蓝
"#D8D9DA", # 淡灰
"#ECB884", # 柔橙
"#E4E45F", # 淡黄
"#4758A2", # 靛蓝
"#E08D8B" # 珊瑚粉
)
mfuzz.plot(
eset,
cl,
mfrow = c(3, 3),
new.window = FALSE,
time.labels = colnames(exprs(eset)),
col = custom_palette
)
# 3. 关闭图形设备,保存图像
dev.off()
# ===== 可选:导出每个 cluster 的基因 =====
cluster_membership <- tibble(
gene = rownames(cl$membership),
cluster = cl$cluster
)
write.csv(cluster_membership, "mfuzz_cluster_assignment_2.csv", row.names = FALSE)
write.csv(cluster_membership, "mfuzz_cluster_assignment_C9.csv", row.names = FALSE)
library(tidyverse)
library(ggrepel)
# 设定阈值
membership_threshold <- 0.7
# 提取表达矩阵
expr_mat <- exprs(eset) %>%
as.data.frame() %>%
rownames_to_column("gene")
# 整理 membership 到 tidy 格式
membership_df <- as.data.frame(cl$membership) %>%
rownames_to_column("gene") %>%
pivot_longer(-gene, names_to = "cluster", values_to = "membership")
# 合并表达和membership信息
combined_df <- expr_mat %>%
pivot_longer(-gene, names_to = "Condition", values_to = "Expression") %>%
left_join(membership_df, by = "gene")
# 设置 Condition 顺序
combined_df$Condition <- factor(combined_df$Condition, levels = c("组别"))
# 4. 筛选高 membership 的数据
high_membership_df <- combined_df %>% filter(membership > membership_threshold)
# 5. 绘图
p1<-ggplot(high_membership_df, aes(x = Condition, y = Expression, group = gene)) +
geom_line(alpha = 0.3, color = "#E4E45F") +
geom_text_repel(
data = high_membership_df %>%
group_by(gene, cluster) %>%
filter(Expression == max(Expression)) %>%
slice(1),
aes(label = gene),
size = 2.5,
box.padding = 0.3,
max.overlaps = Inf
) +
facet_wrap(~ cluster, ncol = 3, scales = "free_y") +
theme_minimal(base_size = 12) +
theme(strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Gene Expression Trends (Mfuzz Clustering)",
x = "Condition", y = "Standardized Expression")
cairo_pdf("cluster_annotated_markers_0.7.pdf", width = 10, height = 8, bg = "transparent")
print(p1)
dev.off()
# 感兴趣的基因
genes_of_interest <- c("ApoE")
# 查看 membership(隶属度)
cl$membership[genes_of_interest, ]
tibble(
gene = genes_of_interest,
assigned_cluster = cl$cluster[genes_of_interest],
as.data.frame(cl$membership[genes_of_interest, ])
)
完整的是6个cluster,这里仅展示3个: