STAT3888 CLUSTERING

  • Agglomerative: Begin with all observations in singleton clusters, and sequentially merge clusters until all observations are in one cluster.

  • Divisive: Begin with all observations in one cluster, and sequentially divide until all observations are in singleton clusters.

  • Partitioning clustering: subdivide the data set into k (pre-specified) groups, e.g. k-means (sensitive to outliers) and k-medoids (less sensitive).

  • Hierarchical clustering is an alternative to partitioning clustering for identifying groups in the dataset. It does not require the number of clusters to be pre-specified. The process is illustrated by a tree diagram called a dendrogram; observations can be divided into groups by cutting the dendrogram at a desired similarity level (see the hclust sketch after this list).
  • Fuzzy clustering/soft clustering: an item can belong to more than one cluster. Each item has a set of membership coefficients corresponding to its degree of membership in each cluster.
  • Model-based clustering: the data are viewed as coming from a distribution that is a mixture of two or more components. The best-fitting model is chosen and the number of clusters is estimated from it.
  • Density-based clustering: can find clusters of different shapes and sizes in data containing noise and outliers.
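
A minimal agglomerative example in base R, assuming the built-in USArrests data, to show the dendrogram-cutting idea from the hierarchical-clustering bullet above:

# hierarchical (agglomerative) clustering and dendrogram cutting
d  <- dist(scale(USArrests))          # Euclidean distances on scaled data
hc <- hclust(d, method = "complete")  # complete-linkage agglomeration
plot(hc)                              # dendrogram
cutree(hc, k = 4)                     # cut the tree into 4 groups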

Similarity

For continuous variables: note the scale of the variables (standardise before computing distances).

 Jaccard similarity coefficient (for binary values)
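
J = \frac{|A\bigcap B|}{|A\bigcup B|}  (intersection over union)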

 Dice matching coefficient for nominal values

DMC = \frac{2|A\bigcap B|}{|A| + |B|}

For mixed-variable data:

  • Gower, Wishart, Podani, Huang, and Harikumar.

# kmed package for mixed data (a is a data matrix); two kinds of distance
library(kmed)
matching(a, a)  # simple matching distance
cooccur(a)      # co-occurrence distance

# distmix() handles mixed numeric/binary/categorical variables;
# method is one of "gower", "wishart", "podani", "huang", "harikumar"
distmix(mtcars, 
        method = "gower", 
        idnum = c(1:7, 10, 11),  # column indices of the numeric variables
        idbin = c(8, 9),         # column indices of the binary variables
        idcat = c())[1:N, 1:N]   # no categorical columns; preview an N x N corner


# Distance for continuous variables (kmed), then k-medoids clustering

library(kmed)
num <- as.matrix(iris[, 1:4])
# "mrw" is one of distNumeric's methods (Manhattan weighted by range)
mrwdist <- distNumeric(num, num, method = "mrw")
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
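
A quick way to inspect the fit, assuming fastkmed's usual output components cluster and medoid:

table(result$cluster, iris$Species)  # compare the k-medoid clusters with the species
result$medoid                        # row indices of the chosen medoids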

K-means

Idea: we want to partition the observations into k clusters such that the total within-cluster variation, summed over all clusters, is as small as possible.
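
Formally, with squared Euclidean distance, this means choosing cluster assignments C_1, \dots, C_K to minimise

\sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2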

 Algorithm:

1. Randomly assign a number, from 1 to k, to each of the observations. (These serve as initial cluster assignments for the observations).

2. Iterate until the cluster assignments stop changing.

        a. For each of the clusters, compute the cluster centroid. The kth cluster centroid is the vector of covariate means for the observations in the kth cluster.

        b. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

Convergence to a local minimum is guaranteed, but the result depends (usually heavily) on the initial random cluster assignment in Step 1, so run the algorithm multiple times with different initial configurations.
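
A minimal by-hand sketch of these two steps in base R, for illustration only (it assumes the built-in USArrests data, scaled, and ignores the empty-cluster corner case; in practice just call kmeans()):

set.seed(1)
X <- scale(USArrests)
k <- 4
assign <- sample(seq_len(k), nrow(X), replace = TRUE)  # step 1: random labels
repeat {
  # step 2a: centroid of each cluster (rows = clusters, columns = variables)
  centroids <- apply(X, 2, function(col) tapply(col, assign, mean))
  # step 2b: reassign each observation to its nearest centroid (Euclidean)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  new_assign <- as.integer(apply(d, 1, which.min))
  if (all(new_assign == assign)) break  # stop when assignments no longer change
  assign <- new_assign
}
table(assign)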

# all variables given the same weight (standardise first)
library(here)
dat <- read.csv(here("data","USArrests.csv"), header = TRUE)
df <- scale(dat[,2:5]) 
# kmeans(x, centers, iter.max, nstart): data, number of clusters (or initial
# centres), maximum iterations, number of random starting partitions
km.res <- kmeans(df, centers = 4, iter.max = 10, nstart = 25)  # 4 clusters, cf. fviz_nbclust below

library("factoextra")
# convenient solution to estimate the optimal number of clusters.
fviz_nbclust(df, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
# compute the mean of each variable by cluster using the original data:
aggregate(dat[,2:5], by=list(cluster=km.res$cluster), mean)

# add the cluster labels to the original data
dd <- cbind(dat[,2:5], cluster = km.res$cluster)

# visualise the cluster assignments for different choices of k
library(ggplot2)
# flea data (available e.g. in the tourr package)
set.seed(31)
flea$km2 <- kmeans(scale(flea[,c(1,4)]), 2, nstart=5)$cluster
flea$km3 <- kmeans(scale(flea[,c(1,4)]), 3, nstart=5)$cluster
flea$km4 <- kmeans(scale(flea[,c(1,4)]), 4, nstart=5)$cluster
flea$km5 <- kmeans(scale(flea[,c(1,4)]), 5, nstart=5)$cluster
# plot one solution (here k = 2); swap km2 for km3/km4/km5 to compare
ggplot(data=flea) + 
  geom_point(aes(x=tars1, y=aede1, colour=factor(km2),
                 shape=factor(km2))) +
  scale_colour_brewer("", palette="Dark2") + 
  xlab("") + ylab("") +
  theme_bw() +
  theme(aspect.ratio=1, legend.position = "none")

Cluster statistics

Within-cluster SS: smaller is better. The other statistics computed below (wb.ratio, ch, pearsongamma, dunn, dunn2) can also be compared across choices of k.

# template to reuse: choose the number of clusters based on these statistics
library(fpc)
library(dplyr)  # for %>%
library(tidyr)  # for gather()
set.seed(31)
f.km <- NULL; f.km.stats <- NULL
for (k in 2:10) {
    res <- kmeans(scale(flea[,c(1,4)]), k, nstart=5)
    cl <- res$cluster
    x <- cluster.stats(dist(scale(flea[,c(1,4)])), cl)
    f.km <- cbind(f.km, cl)
    f.km.stats <- rbind(f.km.stats,
        c(x$within.cluster.ss,
        x$wb.ratio,
        x$ch,
        x$pearsongamma,
        x$dunn,
        x$dunn2))
}
colnames(f.km.stats) <- c("within.cluster.ss","wb.ratio","ch","pearsongamma","dunn","dunn2")
f.km.stats <- data.frame(f.km.stats)
f.km.stats$cl <- 2:10
f.km.stats.m <- f.km.stats %>%
    gather(stat, value, -cl)
ggplot(data=f.km.stats.m) +
    geom_line(aes(x=cl, y=value)) +
    xlab("# clusters") + ylab("") +
    facet_wrap(~stat,
        ncol=3,
        scales = "free_y") +
    theme_bw()
# reduce dimensionality with PCA, visualise the data in a scatterplot, and colour each point by its cluster assignment
fviz_cluster(km.res,
    data = df,
    palette = c("#2E9FDF","#00AFBB","#E7B800","#FC4E07"),
    ellipse.type = "euclid",
    star.plot = TRUE,
    repel = TRUE,
    ggtheme = theme_minimal()
)

Mixture of normal distributions

A model whose density is a weighted sum (i.e., a mixture) of simpler densities, where the weights sum to 1.

p(x; \mu, \sigma^2, w) = \sum_{k = 1}^{K} w_k \, \phi(x; \mu_k, \sigma_k^2)

 where \phi(\cdot\,; \mu_k, \sigma_k^2) is the normal density with mean \mu_k and variance \sigma_k^2, and w_k \in [0,1] with \sum_{k=1}^{K} w_k = 1.

 Useful when:

1. The data clearly consists of more than 1 subpopulation.

2. The data is non-normal but a mixture of normals.

Cases where k-means does not work well:

3. Clusters are not spherical in shape (correlation of variables within cluster where the correlation can be different for each cluster).

4. When there is a large difference in proportions of observations in each cluster.

Fitting the model

The mixture of normals model is fit by a method called Expectation Maximization (EM). The algorithm treats the cluster assignments as random variables and uses conditional probability to estimate the assignment of each point to each cluster. The algorithm consists of two steps, an expectation step and a maximization step.

  • The expectation step uses conditional probability to estimate the probability that a point belongs to a cluster.

  • The maximization step estimates the mean and variance for each cluster.

The algorithm iterates between these two steps until convergence. Like K-means this algorithm depends on the initial conditions of the algorithm and is only guaranteed to converge to a local maximizer of the likelihood. This means that the algorithm needs to be run multiple times to converge to a "good enough" solution.
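
Concretely, for a univariate mixture of K normals (notation as above), the standard updates are:

E-step (responsibilities): \gamma_{ik} = \frac{w_k \, \phi(x_i; \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} w_j \, \phi(x_i; \mu_j, \sigma_j^2)}

M-step: w_k = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_k = \frac{\sum_{i} \gamma_{ik} x_i}{\sum_{i} \gamma_{ik}}, \qquad \sigma_k^2 = \frac{\sum_{i} \gamma_{ik} (x_i - \mu_k)^2}{\sum_{i} \gamma_{ik}}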

library(mixtools)
data(faithful)
# fit a 2-component normal mixture to the waiting times;
# lambda, mu and sigma give starting values for EM
wait1 <- normalmixEM(
  faithful$waiting, 
  lambda = .5, 
  mu = c(55, 80), 
  sigma = 5)
plot(wait1, loglik=FALSE, density=TRUE, 
  cex.axis=1.4, cex.lab=1.4, cex.main=1.8,
  main2="Time between Old Faithful eruptions", 
  xlab2="Minutes")
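
The fitted object stores the estimated weights, means, standard deviations and posterior membership probabilities, e.g.:

summary(wait1)         # estimated lambda (weights), mu, sigma and log-likelihood
head(wait1$posterior)  # posterior probability of each component for each point
table(apply(wait1$posterior, 1, which.max))  # hard cluster assignments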

Selecting the number of clusters

1. BIC. Disadvantages:

  • We don't know the value of K that maximizes the likelihood, since we don't know whether the EM estimates maximize the likelihood (we only get a local maximizer).

  • If any component weight w_k is too small.

  • If two components have almost the same means and variances.

  • The model components in the mixture are misspecified.
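
As a reminder of the criterion itself: with maximised log-likelihood \log \hat{L}, d free parameters and n observations,

\mathrm{BIC} = -2 \log \hat{L} + d \log n

and the number of clusters with the smallest BIC is preferred (some packages, e.g. mclust, report 2\log\hat{L} - d\log n instead, so that larger is better there).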
