3.1 Hierarchical Clustering
The `method` argument can be "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
In other words, these are the familiar linkage rules, such as complete linkage and the other hierarchical variants.
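As a quick illustration of the `method` argument (on hypothetical toy data, not the lecture's), different linkage rules use the same distances but merge clusters differently; on well-separated data they all recover the same grouping:

```r
# Toy data: two clearly separated groups on a line (illustrative only)
set.seed(1)
x <- c(rnorm(5, mean = 0, sd = 0.1), rnorm(5, mean = 5, sd = 0.1))
d <- dist(matrix(x, ncol = 1))

# Same distance matrix, different merging rules
hcComplete <- hclust(d, method = "complete")  # farthest-pair linkage
hcAverage  <- hclust(d, method = "average")   # UPGMA
hcSingle   <- hclust(d, method = "single")    # nearest-pair linkage

# With well-separated groups, cutting each tree into 2 clusters
# gives the same partition
gComplete <- cutree(hcComplete, k = 2)
gSingle   <- cutree(hcSingle,  k = 2)
```

On messier data the choice matters more: single linkage tends to "chain" clusters together, while complete linkage prefers compact ones.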
Can we find things that are close together?
Clustering organizes things that are close into groups
- How do we define close?
- How do we group things?
- How do we visualize the grouping?
- How do we interpret the grouping?
(A link to a clustering paper went here; it is inaccessible from mainland China without a VPN.)
Hierarchical clustering
- An agglomerative approach
- Find closest two things
- Put them together
- Find next closest
- Requires
- A defined distance
- A merging approach
- Produces
- A tree showing how close things are to each other
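The loop described above (find the closest pair, merge, repeat) is exactly what `hclust` does; a minimal sketch on hypothetical toy data shows the two things it requires and the tree it produces:

```r
# Toy data (illustrative, not the lecture's)
set.seed(10)
df <- data.frame(x = rnorm(6), y = rnorm(6))

d  <- dist(df)   # the defined distance (euclidean by default)
hc <- hclust(d)  # the merging approach (complete linkage by default)

# hc$merge records which points/clusters were joined at each step;
# hc$height gives the distance at which each merge happened.
# For complete linkage the heights are nondecreasing, so the tree
# can be read bottom-up: closest things merge first.
hc$merge
hc$height

plot(hc)         # the tree showing how close things are to each other
```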
How do we define close?
- Most important step
- Garbage in -> garbage out
- Distance or similarity
- Continuous - euclidean distance
- Continuous - correlation similarity
- Binary - manhattan distance
- Pick a distance/similarity that makes sense for your problem
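To see why the choice matters, here is a small illustration (hypothetical vectors, not the lecture's data) of the same pair of points under the definitions above; note that two vectors can be far apart in euclidean terms yet perfectly similar by correlation:

```r
# Two toy vectors: b is exactly 2 * a
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 8)

euclid <- sqrt(sum((a - b)^2))  # euclidean distance: sqrt(30), about 5.48
manhat <- sum(abs(a - b))       # manhattan (city-block) distance: 10
corsim <- cor(a, b)             # correlation similarity: exactly 1

# The same distances via dist(), which is what hclust consumes
dEuclid <- dist(rbind(a, b))                       # euclidean by default
dManhat <- dist(rbind(a, b), method = "manhattan")
```

So a correlation-based measure would call `a` and `b` "close" (same shape), while euclidean distance calls them far apart (different magnitude); pick whichever notion of "close" matches your problem.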
Hierarchical clustering - dist
- Important parameters: x,method
## 1 2 3 4 5 6 7 8 9
## 2 0.34121
## 3 0.57494 0.24103
## 4 0.26382 0.52579 0.71862
## 5 1.69425 1.35818 1.11953 1.80667
## 6 1.65813 1.31960 1.08339 1.78081 0.08150
## 7 1.49823 1.16621 0.92569 1.60132 0.21110 0.21667
## 8 1.99149 1.69093 1.45649 2.02849 0.61704 0.69792 0.65063
## 9 2.13630 1.83168 1.67836 2.35676 1.18350 1.11500 1.28583 1.76461
## 10 2.06420 1.76999 1.63110 2.29239 1.23848 1.16550 1.32063 1.83518 0.14090
## 11 2.14702 1.85183 1.71074 2.37462 1.28154 1.21077 1.37370 1.86999 0.11624
## 12 2.05664 1.74663 1.58659 2.27232 1.07701 1.00777 1.17740 1.66224 0.10849
(The remaining rows of the distance matrix are omitted here.)
## Hierarchical clustering - #1
```{r dependson="createData",echo=FALSE, fig.height=4,fig.width=8}
suppressMessages(library(fields))
dataFrame <- data.frame(x=x,y=y)
rdistxy <- rdist(dataFrame)
diag(rdistxy) <- diag(rdistxy) + 1e5

# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy),arr.ind=TRUE)
par(mfrow=c(1,2),mar=rep(0.2,4))

# Plot the points with the minimum pair overlaid
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[ind[1,]],y[ind[1,]],col="orange",pch=19,cex=2)

# Make a cluster and cut it at the right height
distxy <- dist(dataFrame)
hcluster <- hclust(distxy)
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro,h=(hcluster$height[1]+0.00001))
plot(cutDendro$lower[[11]],yaxt="n")
```
(Side note: the big circle drawn in the next chunk is nothing special, just an enlarged `pch=1` point.)

## Hierarchical clustering - #2
```{r dependson="createData",echo=FALSE}
library(fields)
dataFrame <- data.frame(x=x,y=y)
rdistxy <- rdist(dataFrame)
diag(rdistxy) <- diag(rdistxy) + 1e5

# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy),arr.ind=TRUE)
par(mar=rep(0.2,4))

# Plot the points with the minimum pair overlaid
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[ind[1,]],y[ind[1,]],col="orange",pch=19,cex=2)
points(mean(x[ind[1,]]),mean(y[ind[1,]]),col="black",cex=3,lwd=3,pch=3)
points(mean(x[ind[1,]]),mean(y[ind[1,]]),col="orange",cex=5,lwd=3,pch=1)

# Find the index of the points with the next-smallest distance
# (note to self: this way of picking ind is worth learning from)
ind <- which(rdistxy == rdistxy[order(rdistxy)][3],arr.ind=TRUE)
par(mfrow=c(1,3),mar=rep(0.2,4))

# Plot the points with the minimum pair overlaid
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[c(5,6)],y[c(5,6)],col="orange",pch=19,cex=2)
points(x[ind[1,]],y[ind[1,]],col="red",pch=19,cex=2)

# Make dendrogram plots
distxy <- dist(dataFrame)
hcluster <- hclust(distxy)
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro,h=(hcluster$height[2]))
plot(cutDendro$lower[[10]],yaxt="n")
plot(cutDendro$lower[[5]],yaxt="n")
```