Exploratory Data Analysis: Week 3

This post takes a close look at hierarchical clustering: how to define "close", the different merging strategies, and how to draw nicer dendrograms. It also covers the basic idea behind K-means clustering and how to choose its parameters, and touches on dimension reduction, handling missing values, and the use of colour in data visualization, stressing that choosing colours well matters for how clearly data are communicated.

3.1 Hierarchical Clustering

The method argument of hclust can be "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC), or "centroid" (= UPGMC).

These are the familiar linkage strategies, such as complete-link hierarchical clustering; the sketch below compares two of them.
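As a quick illustration (a sketch on the built-in mtcars data, not from the lecture), switching method changes how the distance between merged groups is measured:

```{r}
# Same distance matrix, two different linkage rules (illustrative only)
d <- dist(mtcars)                             # Euclidean distances between cars
hcComplete <- hclust(d, method = "complete")  # farthest-pair linkage
hcAverage  <- hclust(d, method = "average")   # UPGMA: mean pairwise distance
par(mfrow = c(1, 2))
plot(hcComplete, main = "complete linkage")
plot(hcAverage, main = "average linkage (UPGMA)")
```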

Can we find things that are close together?

Clustering organizes things that are close into groups

  • How do we define close?
  • How do we group things?
  • How do we visualize the grouping?
  • How do we interpret the grouping?
http://scholar.google.com/scholar?hl=en&q=cluster+analysis&btnG=&as_sdt=1%2C21&as_sdtp=

A Google Scholar search for cluster-analysis papers; it will not load from networks where Google is blocked.

Hierarchical clustering

  • An agglomerative (bottom-up) approach
    • Find closest two things
    • Put them together
    • Find next closest
  • Requires
    • A defined distance
    • A merging approach
  • Produces
    • A tree showing how close things are to each other (see the end-to-end sketch below)
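A minimal end-to-end sketch of those three pieces in R (simulated data of my own, not the lecture's exact values):

```{r}
set.seed(42)
pts <- data.frame(x = rnorm(12, rep(1:3, each = 4), 0.2),
                  y = rnorm(12, rep(c(1, 2, 1), each = 4), 0.2))
d  <- dist(pts)    # a defined distance (Euclidean by default)
hc <- hclust(d)    # a merging approach (complete linkage by default)
plot(hc)           # a tree showing how close things are to each other
```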

How do we define close?

  • Most important step
    • Garbage in -> garbage out
  • Distance or similarity
    • Continuous - Euclidean distance
    • Continuous - correlation similarity
    • Binary - Manhattan distance
  • Pick a distance/similarity that makes sense for your problem (compared in the sketch below)
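A sketch of what each choice looks like in R (made-up data, purely illustrative):

```{r}
m <- matrix(rnorm(20), nrow = 4)   # 4 observations, 5 features
dist(m, method = "euclidean")      # continuous data: Euclidean distance
dist(m, method = "manhattan")      # e.g. for binary data: Manhattan distance
1 - cor(t(m))                      # correlation similarity turned into a distance
```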
http://rafalab.jhsph.edu/688/lec/lecture5-clustering.pdf



Hierarchical clustering - dist

  • Important parameters: x,method
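The chunks in this post depend on a createData chunk that never made it into the page. Here is a sketch that recreates data of the same shape (the seed and parameters are my assumptions, chosen to match the lecture's three loose clusters of four points):

```{r createData}
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
```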
```{r dependson="createData"}
dataFrame <- data.frame(x = x, y = y)
dist(dataFrame)
```

```
##          1       2       3       4       5       6       7       8       9
## 2  0.34121                                                                
## 3  0.57494 0.24103                                                        
## 4  0.26382 0.52579 0.71862                                                
## 5  1.69425 1.35818 1.11953 1.80667                                        
## 6  1.65813 1.31960 1.08339 1.78081 0.08150                                
## 7  1.49823 1.16621 0.92569 1.60132 0.21110 0.21667                        
## 8  1.99149 1.69093 1.45649 2.02849 0.61704 0.69792 0.65063                
## 9  2.13630 1.83168 1.67836 2.35676 1.18350 1.11500 1.28583 1.76461        
## 10 2.06420 1.76999 1.63110 2.29239 1.23848 1.16550 1.32063 1.83518 0.14090
## 11 2.14702 1.85183 1.71074 2.37462 1.28154 1.21077 1.37370 1.86999 0.11624
## 12 2.05664 1.74663 1.58659 2.27232 1.07701 1.00777 1.17740 1.66224 0.10849
```
There is a lot more output below; it is omitted here.


## Hierarchical clustering - #1

```{r dependson="createData",echo=FALSE, fig.height=4,fig.width=8}
suppressMessages(library(fields))
dataFrame <- data.frame(x=x,y=y)
rdistxy <- rdist(dataFrame)
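# Inflate the diagonal so a point is never its own nearest neighbour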
diag(rdistxy) <- diag(rdistxy) + 1e5

# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy),arr.ind=TRUE)
par(mfrow=c(1,2),mar=rep(0.2,4))
# Plot the points with the minimum overlayed
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[ind[1,]],y[ind[1,]],col="orange",pch=19,cex=2)

# Make a cluster and cut it at the right height
distxy <- dist(dataFrame)
hcluster <- hclust(distxy)
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro,h=(hcluster$height[1]+0.00001) )
plot(cutDendro$lower[[11]],yaxt="n")
```
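cut displays subtrees, but to turn the tree into actual group labels the usual tool is cutree (a sketch, reusing hcluster from the chunk above):

```{r}
clusters <- cutree(hcluster, k = 3)   # or cut at a height with h =
table(clusters)
plot(x, y, col = clusters, pch = 19, cex = 2)
```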

## Hierarchical clustering - #2

```{r dependson="createData",echo=FALSE}
library(fields)
dataFrame <- data.frame(x=x,y=y)
rdistxy <- rdist(dataFrame)
diag(rdistxy) <- diag(rdistxy) + 1e5

# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy),arr.ind=TRUE)
par(mar=rep(0.2,4))
# Plot the points with the minimum overlayed
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[ind[1,]],y[ind[1,]],col="orange",pch=19,cex=2)
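# The merged pair is now represented by a single point at the mean
# (x, y) of the two closest points, marked with a cross inside a circle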
points(mean(x[ind[1,]]),mean(y[ind[1,]]),col="black",cex=3,lwd=3,pch=3)
points(mean(x[ind[1,]]),mean(y[ind[1,]]),col="orange",cex=5,lwd=3,pch=1)
```

So the big circle here is nothing more than pch = 1 drawn at a larger size. Speechless. Could it get any more basic?


## Hierarchical clustering - #3

```{r dependson="createData",echo=FALSE}
# Find the index of the points with the second-smallest distance
ind <- which(rdistxy == rdistxy[order(rdistxy)][3],arr.ind=TRUE)
par(mfrow=c(1,3),mar=rep(0.2,4))
# Plot the points with the already-merged pair and the next candidates overlayed
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
points(x[c(5,6)],y[c(5,6)],col="orange",pch=19,cex=2)
points(x[ind[1,]],y[ind[1,]],col="red",pch=19,cex=2)

# Make dendrogram plots
distxy <- dist(dataFrame)
hcluster <- hclust(distxy)
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro,h=(hcluster$height[2]) )
plot(cutDendro$lower[[10]],yaxt="n")
plot(cutDendro$lower[[5]],yaxt="n")
```
The way ind is computed here is well worth studying.
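Why the third-smallest entry? rdistxy is the full symmetric matrix, so every off-diagonal distance appears twice: the two smallest sorted values are the closest pair (5 and 6) counted in both triangles, and the third value belongs to the next-closest pair. A sketch:

```{r}
sort(rdistxy)[1:4]
# [1] and [2] are the 5-6 distance counted twice;
# [3] (and its twin [4]) is the second merge candidate
```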


Prettier dendrograms

```{r}
myplclust <- function(hclust, lab = hclust$labels,
                      lab.col = rep(1, length(hclust$labels)),
                      hang = 0.1, ...) {
    ## Modification of plclust for plotting hclust objects *in colour*.
    ## Copyright Eva KF Chan 2009. Arguments -- hclust: hclust object;
    ## lab: leaf labels; lab.col: label colours (NA = device foreground);
    ## hang: as in hclust & plclust. Draws a dendrogram with coloured leaves.
    y <- rep(hclust$height, 2)
    x <- as.numeric(hclust$merge)
    y <- y[which(x < 0)]   # keep heights at which a leaf (negative id) merges
    x <- x[which(x < 0)]
    x <- abs(x)
    y <- y[order(x)]
    x <- x[order(x)]
    plot(hclust, labels = FALSE, hang = hang, ...)
    text(x = x, y = y[hclust$order] - (max(hclust$height) * hang),
         labels = lab[hclust$order], col = lab.col[hclust$order],
         srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}
```
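Typical usage, colouring each leaf by the true group of the simulated point (the lab and lab.col values assume the three groups of four from createData):

```{r}
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
myplclust(hClustering, lab = rep(1:3, each = 4), lab.col = rep(1:3, each = 4))
```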