kmeans 最佳聚类个数 | 轮廓系数（越大越好）

biomooc

已于 2024-12-16 19:27:09 修改

阅读量783

点赞数 3

分类专栏： R 数学与统计可视化文章标签： kmeans 聚类

于 2024-11-26 16:04:25 首次发布

本文链接：https://blog.csdn.net/wangjunliang/article/details/144060572

版权

R 同时被 3 个专栏收录

78 篇文章

订阅专栏

可视化

41 篇文章

订阅专栏

数学与统计

10 篇文章

订阅专栏

轮廓系数越大，表示簇内实例之间紧凑，簇间距离大，这正是聚类的标准概念。

簇内的样本应该尽可能相似。
不同簇之间应该尽可能不相似。

目的：鸢尾花数据进行kmeans聚类，最佳聚类个数是多少？

plot(iris[,1:4], col=iris$Species)
在这里插入图片描述

1. 标准化很重要

假设已经知道最佳是3类，

使用原始数据做kmeans，和原始标签不一致的很多。
如果做了标准化，kmeans的分类结果和原始标签一模一样。

(1). raw dat (错了好多)

dat=iris[, 1:4]
rownames(dat) = paste0("obs", 1:nrow(dat))
dat[1:3,]

km_model <- kmeans( dat, centers = 3)

# 获取分类结果
predictions <- km_model$cluster
table(predictions)

dat$origin=iris$Species
dat$pred=predictions

table(dat$origin, dat$pred)
#           1  2  3
#setosa      0  0 50
#versicolor 48  2  0
#virginica  14 36  0

plot(dat$Sepal.Length, dat$Sepal.Width, col=dat$origin, pch=19)
plot(dat$Sepal.Length, dat$Sepal.Width, col=dat$pred, pch=19)

(2). normalized dat (几乎全对)

dat=iris[, 1:4]
rownames(dat) = paste0("obs", 1:nrow(dat))
dat[1:3,]

dat=apply(dat, 1, function(x){
  x/sum(x) * 1e4
}) |> t()  |> as.data.frame()
head(dat)

# 行作为观测值
km_model <- kmeans( dat, centers = 3)

# 获取分类结果
predictions <- km_model$cluster
table(predictions)

dat$origin=iris$Species
dat$pred=predictions

table(dat$origin, dat$pred)
#             1  2  3
#setosa     50  0  0
#versicolor  0 45  5
#virginica   0  0 50

2. 最佳分类数

(0) 预处理

dat=iris[, 1:4]
rownames(dat) = paste0("obs", 1:nrow(dat))
dat[1:3,]

dat=apply(dat, 1, function(x){
  x/sum(x) * 1e4
}) |> t()  |> as.data.frame()
head(dat)

(1) factoextra - silhouette: n=2

library(factoextra)
tmp = factoextra::fviz_nbclust( dat, kmeans, method = "silhouette")
#str(tmp)
tmp #图

# fviz_nbclust(dat, kmeans, method = "silhouette", k.max = 20)

在这里插入图片描述

(2) 碎石图: n=2

# 在一个循环中进行15次的kmeans聚类分析
{
totalwSS=vector(mode = "numeric", 15)
for (i in 1:15){
  t1= kmeans(dat, i)
  totalwSS[i] <- t1$tot.withinss
}
# 聚类碎石图 - 使用plot函数绘制total_wss与no-of-clusters的数值。
plot(x=1:15,                         # x= 类数量, 1 to 15
     totalwSS,                      #每个类的total_wss值
     col="navy", lwd=2,
     type="b"                       # 绘制两点，并将它们连接起来
)
}

在这里插入图片描述

(3) silhouette 画图: n=2?

逐个画:

# 逐个画轮廓系数
library(cluster)
dis = dist(dat) #行之间的距离
#
n=3
kclu <- kmeans(dat, centers = 3, nstart=25)
kclu.sil=sortSilhouette( silhouette(kclu$cluster, dist = dis) )
plot(kclu.sil, 
     col =1:n, #c("red", "orange", "blue"), 
     main="")

#
n=4
#library(cluster)
#dis = dist(dat) #行之间的距离
kclu <- kmeans(dat, centers = n, nstart=25)
kclu.sil=sortSilhouette( silhouette(kclu$cluster, dist = dis) )
plot(kclu.sil, 
     col =1:n, # c("red", "orange", "blue"), 
     main="")
#
#
n=8
#library(cluster)
#dis = dist(dat) #行之间的距离
kclu <- kmeans(dat, centers = n, nstart=25)
kclu.sil=sortSilhouette( silhouette(kclu$cluster, dist = dis) )
plot(kclu.sil, 
     col =1:n, # c("red", "orange", "blue"), 
     main="")
#

在这里插入图片描述

批量计算：

silhouette_score <- function(k){
  km <- kmeans(dat, centers = k, nstart=25)
  ss <- silhouette(km$cluster, dist(dat))
  mean(ss[, 3])
}
k <- 2:15
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, 
     type='b',
     xlab='Number of clusters', ylab='Average Silhouette Scores', 
     frame=FALSE)

在这里插入图片描述

最大是2，其次是3类。

根据本文图1，忽略颜色，只看数值分布，确实最佳是2类。

用标准化后的数据呢？
plot(dat, col=iris$Species, main="Normalized data")
在这里插入图片描述

plot(dat,main="Normalized data")

结论不变：如果忽略颜色，依旧是很清晰的2类。
在这里插入图片描述

(4) pam 是一种更稳定的 kmeans

Partitioning Around Medoids:
Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of K-means.

# 最佳分类数：
Ks=sapply(2:15, function(i){
  summary(silhouette(pam(dat, k=i)))$avg.width
})
plot(2:15,Ks,xlab="k",ylab="av. silhouette",type="b", pch=19)


效果：
t1=pam(dat, k=3)
> table(t1$clustering, iris$Species)   
    setosa versicolor virginica
  1     50          0         0
  2      0         44         0
  3      0          6        50
还是有几个错的。

3. 对 infercnv结果进行按细胞聚类

(2) 确定最佳分类数

cnv_table <- read.table(paste0("infercnv.observations.txt"), header=T)
dat=t(cnv_table)

# ii) 这个较快
totalwSS=vector(mode = "numeric", 15)
for (i in 1:15){
  t1= kmeans(dat, i)
  totalwSS[i] <- t1$tot.withinss
  print(i)
}
pdf(paste0(outputRoot, keyword, "_00_5_inferCNV_topSD1000-kmeans.Elbow.pdf"), width=5, height = 4)
# 聚类碎石图 - 使用plot函数绘制total_wss与no-of-clusters的数值。
plot(x=1:15,                         # x= 类数量, 1 to 15
     totalwSS,                      #每个类的total_wss值
     col="navy", lwd=2,
     type="b"                       # 绘制两点，并将它们连接起来
)
dev.off()

碎石图的拐点附近为最佳分类数。

(3) 用cnv全基因对细胞进行聚类

# 这一行是核心代码，很快
kclus = kmeans( t(cnv_table), 5) #默认对行做聚类 # 21:17->21:20
table(kclus$cluster)
#    1     2     3     4     5 
# 6497  2654  4943 12020  2669

为了重画热图，添加颜色bar：

t1=kclus$cluster
annotation_col2=annotation_col
annotation_col2$cnvcluster= paste0("kmeans", t1[rownames(annotation_col)])
annotation_col2=annotation_col2[order(annotation_col2$cnvcluster),]
annotation_col2$cnvcluster=factor(annotation_col2$cnvcluster)

table(annotation_col2$cnvcluster)
# kmeans1  kmeans2  kmeans3  kmeans4  kmeans5 kmeansNA 
#    6497     2654     4943    12020     2669      500

ann_colors2=ann_colors
names(ann_colors2$cnvcluster)=levels(annotation_col2$cnvcluster)
ann_colors2
#$celltype
#        c0       c10       c11       c14       c16       c17       c19       c22        c3        c4        c5        c6        c7        c9 
#"#DC143C" "#0000FF" "#20B2AA" "#FFA500" "#9370DB" "#98FB98" "#F08080" "#1E90FF" "#7CFC00" "#FFFF00" "#808000" "#FF00FF" "#FA8072" "#7B68EE" 
#
#$cnvcluster
#   kmeans1     kmeans2     kmeans3     kmeans4     kmeans5    kmeansNA 
#"#E64B35FF" "#4DBBD5FF" "#00A087FF" "#3C5488FF" "#F39B7FFF" "#8491B4FF" 




# 取共有细胞：
common.cid=intersect(rownames(annotation_col2), colnames(cnv_table))
annotation_col2.common = annotation_col2[which( rownames(annotation_col2) %in% common.cid),]

ph.plotAll3 = pheatmap( t(cnv_table)[rownames(annotation_col2.common), ], border_color = NA,
                        scale = "none",
                        show_rownames = F, show_colnames = F,
                        cluster_rows = F, cluster_cols = F,
                        
                        annotation_row = annotation_col2, annotation_colors = ann_colors2,
                        
                        main="CNV score raw:\nall gene, k-means=5")
CairoPNG(paste0(outputRoot, keyword, "_00_4_inferCNV_allGene-withSeuratCluster3.pheatmap.png"), width=800, height =470 )
grid::grid.newpage()
grid::grid.draw(ph.plotAll3$gtable) #col is gene; row is cell;
dev.off()

# 保存cnv cluster结果：
write.table(annotation_col2, paste0(outputRoot, keyword, "_00_4_inferCNV_cnvCluster-kmeans.df.txt") )
table(annotation_col2$celltype, annotation_col2$cnvcluster)

End