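In hierarchical clustering, every instance or observation starts out as its own cluster. The algorithm then repeatedly merges the two closest clusters into one, until all observations have been merged into a single cluster.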
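Hierarchical clustering works especially well when nested clusters and a meaningful hierarchy are wanted, which is common in the biological sciences. Its drawback is that it is greedy: once an observation has been assigned to a cluster, it cannot be reassigned later. It is also hard to apply to large samples of hundreds or even thousands of observations, since it works from the full matrix of pairwise distances, which grows quadratically with the number of observations.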
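The merge process described above can be traced on a toy example. The following is a minimal sketch in base R, assuming five made-up one-dimensional points (the `toy` vector and its labels are illustrative and not taken from any dataset). `hclust` records each merge in its `merge` component, where negative numbers denote single observations and positive numbers denote clusters formed at an earlier step.

```r
# Five hypothetical 1-D points (illustrative values only)
toy <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

# Pairwise Euclidean distances, then average-linkage agglomeration
toy.dist <- dist(toy)
toy.fit <- hclust(toy.dist, method = "average")

# One row per merge; n observations need n - 1 merges to reach a single cluster
print(toy.fit$merge)

# Distance at which each merge happened
print(toy.fit$height)

# Cutting the same tree into 2 and 3 clusters gives nested partitions:
# every 3-cluster group sits entirely inside one 2-cluster group
print(table(k2 = cutree(toy.fit, k = 2), k3 = cutree(toy.fit, k = 3)))
```

Because the tree is built by merging only, the 3-cluster solution can never move a point across the boundary of the 2-cluster solution, which is the "no reassignment" limitation in practice.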
Below, we try a hierarchical cluster analysis in R using a built-in dataset from the flexclust package:
# Hierarchical clustering on a dataset shipped with the flexclust package
library(flexclust) # provides the nutrient dataset used for clustering below
rm(list=ls())
data(nutrient,package = 'flexclust')
head(nutrient,4)
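d <- dist(nutrient,method = 'euclidean') # pairwise distances on the raw (unscaled) data
d1 <- as.matrix(d)[1:5,1:5] # distances among the first five observations, for inspection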
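row.names(nutrient) <- tolower(row.names(nutrient)) # lowercase labels for the dendrogram
nutrient.scaled <- scale(nutrient) # standardize each variable to mean 0, sd 1
d <- dist(nutrient.scaled) # Euclidean distances on the standardized data
fit.average <- hclust(d,method = 'average') # average-linkage agglomeration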
plot(fit.average,hang = -1,cex=.8,main = 'average linkage clustering')
library(NbClust)
devAskNewPage(ask = TRUE)
nc <- NbClust(nutrient.scaled,distance = 'euclidean',
min.nc = 2,max.nc = 15,method = 'average')
table(nc$Best.nc[1,]) # how many criteria favor each number of clusters
barplot(table(nc$Best.nc[1,]),
xlab = 'number of clusters', ylab = 'number of criteria',
main = "number of clusters chosen by 26 criteria")
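clusters <- cutree(fit.average,k = 5) # cut the tree into 5 clusters
table(clusters) # cluster sizes
aggregate(nutrient,by = list(cluster=clusters),median) # cluster medians in original units
aggregate(as.data.frame(nutrient.scaled),by = list(cluster = clusters),median) # medians in standardized units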
plot(fit.average,hang = -1,cex=0.8,
main = 'average linkage clustering\n5 cluster solution')
rect.hclust(fit.average,k=5)
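The script above standardizes the variables with `scale()` before calling `dist()`. A minimal sketch of why that step matters, assuming a small made-up data frame (the `foods` values and row labels are hypothetical, and the column names are only illustrative): when variables sit on very different numeric ranges, the raw Euclidean distance is driven almost entirely by the variable with the largest range.

```r
# Hypothetical data: two variables on very different scales
foods <- data.frame(
  energy = c(340, 245, 420, 115),   # large numeric range
  iron   = c(2.6, 2.7, 0.5, 2.5),   # small numeric range
  row.names = c("w", "x", "y", "z")
)

# Raw distances: differences in energy swamp differences in iron
raw.dist <- as.matrix(dist(foods))

# After standardization each column has mean 0 and sd 1,
# so both variables contribute on a comparable footing
foods.scaled <- scale(foods)
scaled.dist <- as.matrix(dist(foods.scaled))

print(round(raw.dist, 2))
print(round(scaled.dist, 2))
```

Comparing the two printed matrices shows how the relative closeness of the rows can change once the variables are put on a common scale, which in turn can change the order of merges in `hclust`.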