Cluster analysis in R: determine the optimal number of clusters

When doing k-means cluster analysis in R, choosing the best number of clusters is a challenge. The optimal number of clusters for a dataset can be estimated with a variety of methods: looking for an elbow in the sum of squared errors (SSE), partitioning around medoids, the Calinski criterion, the Bayesian Information Criterion, Affinity Propagation, the Gap Statistic, clustergrams, the NbClust package, and more. Interactively selecting clusters from a dendrogram and running the Elbow method with parallel processing are also effective strategies.

This article is a translation of: Cluster analysis in R: determine the optimal number of clusters

Being a newbie in R, I'm not very sure how to choose the best number of clusters for a k-means analysis. After plotting a subset of the data below, how many clusters would be appropriate? How can I perform cluster dendrogram analysis?

n = 1000
kk = 10    
x1 = runif(kk)
y1 = runif(kk)
z1 = runif(kk)    
x4 = sample(x1,length(x1))
y4 = sample(y1,length(y1)) 
randObs <- function()
{
  ix = sample( 1:length(x4), 1 )
  iy = sample( 1:length(y4), 1 )
  rx = rnorm( 1, x4[ix], runif(1)/8 )
  ry = rnorm( 1, y4[iy], runif(1)/8 )  # draw y around its own sampled centre
  return( c(rx,ry) )
}  
x = c()
y = c()
for ( k in 1:n )
{
  rPair  =  randObs()
  x  =  c( x, rPair[1] )
  y  =  c( y, rPair[2] )
}
z <- rnorm(n)
d <- data.frame( x, y, z )

#1

Reference: https://stackoom.com/question/12W1D/R中的聚类分析-确定最佳聚类数


#2

If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.

First, some reproducible data (the data in the Q are... unclear to me):

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)

[figure: scatter plot of the simulated data d]

One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html and http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:

mydata <- d
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))   # total SS for k = 1
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

We might conclude that 4 clusters would be indicated by this method: [figure: SSE scree plot with an elbow around k = 4]

Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.

library(fpc)
pamk.best <- pamk(d)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(d, pamk.best$nc))

[figures: pam clustering and silhouette diagnostic plots]

# we could also do:
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[[k]] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# still 4

Three. Calinski criterion: another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
# 5 clusters!

[figure: cascadeKM plot of group memberships and Calinski criterion values]

Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models.

# See http://www.jstatsoft.org/v18/i06/paper
# http://www.stat.washington.edu/research/reports/2006/tr504.pdf
#
library(mclust)
# Run the function to see how many clusters
# it finds to be optimal, set it to search for
# at least 1 model and up to 20.
d_clust <- Mclust(as.matrix(d), G=1:20)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
# 4 clusters
plot(d_clust)

[figures: mclust BIC, classification, and uncertainty plots]
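
On a non-interactive run, plot(d_clust) prompts for which plot to draw. Assuming mclust version 5 or later, the BIC curve alone can be requested directly (a small sketch, not part of the original answer):

# draw only the BIC-vs-k curve, one line per parameterized covariance model
plot(d_clust, what = "BIC")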

Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800

library(apcluster)
d.apclus <- apcluster(negDistMat(r=2), d)
cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
# 4
heatmap(d.apclus)
plot(d.apclus, d)

[figures: apcluster heatmap and cluster plot]

Six. The Gap Statistic for estimating the number of clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:

library(cluster)
clusGap(d, kmeans, 10, B = 100, verbose = interactive())

Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering Gap statistic ["clusGap"].
B=100 simulated reference sets, k = 1..10
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW   E.logW        gap     SE.sim
 [1,] 5.991701 5.970454 -0.0212471 0.04388506
 [2,] 5.152666 5.367256  0.2145907 0.04057451
 [3,] 4.557779 5.069601  0.5118225 0.03215540
 [4,] 3.928959 4.880453  0.9514943 0.04630399
 [5,] 3.789319 4.766903  0.9775842 0.04826191
 [6,] 3.747539 4.670100  0.9225607 0.03898850
 [7,] 3.582373 4.590136  1.0077628 0.04892236
 [8,] 3.528791 4.509247  0.9804556 0.04701930
 [9,] 3.442481 4.433200  0.9907197 0.04935647
[10,] 3.445291 4.369232  0.9239414 0.05055486
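
The clusGap object also has a base plot method in the cluster package; a minimal sketch, assuming the result above is stored in a variable rather than printed directly:

gs <- clusGap(d, kmeans, K.max = 10, B = 100)
plot(gs, main = "Gap statistic")   # gap curve with standard-error bars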

Here's the output from Edwin Chen's implementation of the gap statistic: [figure: Edwin Chen's gap statistic plot]

Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment; see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.
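
The linked post provides a complete clustergram() implementation. As a rough, self-contained sketch of the idea (an approximation written for illustration here, not the linked code), one can track each observation's cluster mean as k increases:

# order observations along the first principal component, then, for each k,
# plot each observation's cluster mean so you can watch assignments split and merge
clustergram_sketch <- function(data, k.range = 2:8) {
  pc1 <- prcomp(data)$x[, 1]
  means <- sapply(k.range, function(k) {
    cl <- kmeans(data, centers = k, nstart = 10)$cluster
    ave(pc1, cl)                    # each observation -> its cluster's mean
  })
  matplot(k.range, t(means), type = "l", lty = 1, col = rgb(0, 0, 0, 0.2),
          xlab = "number of clusters (k)", ylab = "cluster mean (PC1)")
}
clustergram_sketch(scale(d))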

Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.

library(NbClust)
nb <- NbClust(d, diss=NULL, distance = "euclidean",
        method = "kmeans", min.nc=2, max.nc=15, 
        index = "alllong", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Looks like 3 is the most frequently determined number of clusters
# and curiously, four clusters is not in the output at all!

[figure: histogram of the optimal k chosen by the NbClust indices]

If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
http://www.statmethods.net/advstats/cluster.html
http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/
And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html

Here are a few examples:

d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist))           # apply hierarchical clustering and plot

[figure: hclust dendrogram of d]

# a Bayesian clustering method, good for high-dimension data, more details:
# http://vahid.probstat.ca/paper/2012-bclust.pdf
install.packages("bclust")
library(bclust)
x <- as.matrix(d)
d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
viplot(imp(d.bclus)$var)
plot(d.bclus)
ditplot(d.bclus)
dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,
        varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
# I just include the dendrogram here

[figure: bclust dendrogram]

Also for high-dimension data there is the pvclust library, which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (it won't work on such low-dimensional data as in my example):

library(pvclust)
library(MASS)
data(Boston)
boston.pv <- pvclust(Boston)
plot(boston.pv)

[figure: pvclust dendrogram of the Boston data with bootstrap p-values]

Does any of that help?


#3

It's hard to add something to such an elaborate answer. Though I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.

d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist)) 
clusters <- identify(hclust(d_dist))

identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the rownames (as opposed to cutree).
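
For contrast, a quick sketch of the cutree behaviour mentioned above, using the same d_dist as before:

hc <- hclust(d_dist)
# one cluster label per observation, named by rownames --
# unlike identify(), which returns a list of row indices per clicked cluster
head(cutree(hc, k = 4), 10)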


#4

To determine the optimal k for a clustering method, I usually use the Elbow method together with parallel processing to avoid it being too time-consuming. Sample code:

Elbow method

# requires the GMD package for css.hclust() and elbow.batch()
elbow.k <- function(mydata){
  dist.obj <- dist(mydata)
  hclust.obj <- hclust(dist.obj)
  css.obj <- css.hclust(dist.obj, hclust.obj)
  elbow.obj <- elbow.batch(css.obj)
  k <- elbow.obj$k
  return(k)
}
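
A minimal usage sketch on the data d from above; note that elbow.k depends on the GMD package, which has at times been archived on CRAN, so this assumes GMD is installed:

library(GMD)
k <- elbow.k(scale(d))
cat("Elbow method suggests k =", k, "\n")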

Running the Elbow method in parallel

library(parallel)                  # detectCores, makeCluster, parSapply
no_cores <- detectCores()
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(GMD))
# data.clustering is an object from the calling environment
clusterExport(cl, list("data.clustering", "elbow.k"))
start.time <- Sys.time()
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
end.time <- Sys.time()
stopCluster(cl)
cat('Time to find k using Elbow method is', (end.time - start.time),
    'seconds with k value:', k.clusters)

It works well.


#5

Splendid answer from Ben. However, I'm surprised that the Affinity Propagation (AP) method has been suggested here just to find the number of clusters for the k-means method, when in general AP does a better job of clustering the data. Please see the scientific paper supporting this method in Science:

Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315.5814 (2007): 972-976.

So if you are not biased toward k-means, I suggest using AP directly; it will cluster the data without requiring the number of clusters in advance:

library(apcluster)
apclus = apcluster(negDistMat(r=2), data)
show(apclus)

If negative Euclidean distances are not appropriate, then you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:

sim = corSimMat(data, method="spearman")
apclus = apcluster(s=sim)

Please note that those similarity functions in the apcluster package are provided only for convenience. In fact, the apcluster() function in R will accept any matrix of correlations. The same as before with corSimMat() can be done with this:

sim = cor(data, method="spearman")

or

sim = cor(t(data), method="spearman")

depending on what you want to cluster on your matrix (rows or columns): cor(data) gives similarities between columns, while cor(t(data)) gives similarities between rows.


#6

The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data splits.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> hc.complete = hclust(dist(x), method="complete")
> plot(hc.complete)

[figure: complete-linkage dendrogram of x]

Depending on how many classes you need, you can cut your dendrogram as follows:

> cutree(hc.complete,k = 2)
 [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
[26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2

If you type ?cutree you will see the definitions. If your data set has three classes it would simply be cutree(hc.complete, k = 3). The equivalent of cutree(hc.complete, k = 2) is cutree(hc.complete, h = 4.9).
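
A quick sketch to check that equivalence; the cut height 4.9 is read off this particular dendrogram, so it only holds for this data and seed:

# TRUE if cutting at height 4.9 reproduces the same 2-cluster split
identical(cutree(hc.complete, k = 2), cutree(hc.complete, h = 4.9))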
