
# What Is Clustering

Clustering partitions unlabeled data into subsets

• The resulting subsets (called clusters) should consist of internally similar, but externally distinctive data points
• Various measures of "similarity" and "distinctiveness" of data points exist (e.g. Euclidean distance in continuous spaces)

• Sometimes, such measures rely on subjective designs (e.g. apple, banana, monkey)

• Various representations of the clusters:
– generative (a.k.a. prototypical, distributional, …)

– discriminative

# Clustering Methods

## • Hierarchical

### – agglomerative (bottom-up)

(1) Assume each data point forms its own cluster
(2) Find the pair of clusters closest to each other (all pairwise distances are known)
(3) Combine these clusters into one cluster
(4) Update the distance matrix
(5) Repeat (2) to (4) until only one cluster is left, unless some other stopping condition is satisfied earlier

Possible definitions of the distance between two clusters:
– smallest distance between any pair of members (single linkage)
– largest distance between any pair of members (complete linkage)
– distance between the centroids (means)
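The agglomerative steps above can be sketched in pure Python. This is a minimal illustration (single linkage shown; the other two distance definitions would replace the `min(...)` line), and all function and variable names are my own, not from the original notes:

```python
from math import dist

def agglomerative(points, n_clusters=1):
    # (1) Each data point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # (2) Find the closest pair of clusters, using single linkage:
        # the smallest distance between any pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # (3) Merge the pair into one cluster; (4) here distances are
        # recomputed on the fly instead of updating a stored matrix.
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], n_clusters=2))
# → [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Recomputing pairwise distances inside the loop keeps the sketch short; a real implementation would maintain and update the distance matrix as step (4) describes.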

### – divisive (top-down): the reverse of the agglomerative approach

(1) Assume all data points form a single cluster
(2) For each existing cluster containing more than one data point, find the pair of points which are farthest apart
(3) Select the cluster whose farthest pair is separated by the greatest distance; select this pair
(4) Create two new clusters: one for each member of the selected pair
(5) Distribute the remaining points of the selected cluster between the two new clusters according to similarity
(6) Repeat (2) to (5) until each cluster contains exactly one point,
unless some stopping condition is satisfied earlier
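A pure-Python sketch of these divisive steps, under the assumption that "distribute according to similarity" means assigning each remaining point to the nearer of the two seed points (helper names are my own):

```python
from math import dist

def divisive(points, n_clusters):
    # (1) All data points start in a single cluster.
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        # (2)-(3) In each splittable cluster find the farthest pair,
        # then pick the cluster whose pair is farthest apart overall.
        best = None
        for idx, c in enumerate(clusters):
            if len(c) < 2:
                continue
            d, a, b = max((dist(p, q), p, q)
                          for i, p in enumerate(c) for q in c[i + 1:])
            if best is None or d > best[0]:
                best = (d, idx, a, b)
        if best is None:          # every cluster is a single point
            break
        _, idx, a, b = best
        old = clusters.pop(idx)
        # (4) Seed two new clusters with the selected pair.
        left, right = [a], [b]
        # (5) Assign each remaining point to the nearer seed.
        for p in old:
            if p in (a, b):
                continue
            (left if dist(p, a) <= dist(p, b) else right).append(p)
        clusters += [left, right]
    return clusters
```

Calling `divisive([(0, 0), (0, 1), (5, 5), (5, 6)], n_clusters=2)` splits on the farthest pair `(0, 0)`–`(5, 6)` and recovers the two natural groups.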

## • K-means

(1) Guess (or assume) initial locations of the cluster centers
(2) For each data point, compute its distance to each of the centers
(3) Assign each point to its nearest center
(4) Re-compute the locations of the involved centers as the means of their assigned data points
(5) Repeat (2) to (4) until no assignments change

K-means is still fairly computationally expensive: the distance from every point to every center must be recomputed in each iteration, although no full distance matrix needs to be updated.
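Steps (1)–(5) can be sketched in pure Python. In this minimal illustration the initial centers are simply the first k points rather than a true random guess, and all names are my own:

```python
from math import dist

def kmeans(points, k, iters=100):
    centers = [list(p) for p in points[:k]]   # (1) initial guess
    assign = None
    for _ in range(iters):
        # (2)-(3) assign each point to its nearest center
        new_assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:              # (5) stop when stable
            break
        assign = new_assign
        # (4) recompute each center as the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centers
```

Note that each iteration recomputes all point-to-center distances, which is exactly the cost the remark above refers to.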

## • Density-based

### – mixture models

Fit to the data a set of (fairly simple and tractable) additive component models, for instance Gaussians:
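The mixture density being fitted can be written as a weighted sum of Gaussian components (the standard formulation; the formula itself is not spelled out in the original notes):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Here the $\pi_k$ are the prior (mixing) probabilities and each component is a Gaussian with its own mean and covariance.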

### Expectation-Maximization Algorithm

EM starts similarly to K-means, but besides guessing the initial centers (means), EM also needs initial guesses for the (co)variances and the prior probabilities (they can be estimated as relative frequencies of data belonging to each cluster).

#### E-step of the E-M procedure:

– Given the current values of the Gaussians' parameters, we compute the expectation of how likely each data point is to belong to each of the Gaussians.
(Unlike K-means, a point may now belong to several clusters at once, as long as its membership probabilities sum to 1.)

#### The M-step:

– We compute new estimates of the Gaussians' parameters which maximize the likelihood of the recently recomputed cluster-membership pattern.

Then we repeat the E and M steps until the model matches the data closely enough (or until the data-point assignments stabilize).

EM is guaranteed to converge, but not necessarily to the global optimum.
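The E/M loop above can be sketched for a one-dimensional two-component Gaussian mixture. This is a pure-Python illustration with hard-coded initial guesses; the function and variable names are my own:

```python
from math import exp, pi, sqrt

def gauss(x, mu, var):
    # 1-D Gaussian density
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em(data, mus, variances, priors, iters=50):
    for _ in range(iters):
        # E-step: soft membership of each point in each Gaussian;
        # unlike K-means, the responsibilities per point sum to 1.
        resp = []
        for x in data:
            w = [p * gauss(x, m, v)
                 for p, m, v in zip(priors, mus, variances)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate means, variances, and priors to
        # maximize the likelihood under the current responsibilities.
        for k in range(len(mus)):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (x - mus[k]) ** 2
                               for r, x in zip(resp, data)) / nk
            priors[k] = nk / len(data)
    return mus, variances, priors

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, variances, priors = em(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the means converge near 1.0 and 5.0 with equal priors; with different initial guesses EM may settle in a worse local optimum, which is the convergence caveat noted above.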

### Occam's Razor

Other things being equal, simple theories are preferable to complex ones.

### Minimum Description Length principle (MDL)

postulates selecting the model that needs the smallest amount of memory to encode, i.e. minimize the sum of the following two factors:
– the amount of memory required to describe the data using the model

– the amount of memory required to describe the model and its parameters
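In symbols, MDL picks the model $M$ minimizing the total description length (the standard formulation, not written out in the original notes):

```latex
M^{*} = \arg\min_{M} \big[\, L(D \mid M) + L(M) \,\big]
```

where $L(D \mid M)$ is the code length of the data given the model and $L(M)$ is the code length of the model and its parameters.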

The number of Gaussian-mixture parameters to fit grows very fast with the dimensionality of the data space. This increases the risk of over-fitting and can reduce the statistical significance of the learned models. Possible remedies:

– Put constraints on the shapes of the covariances (e.g. restrict the analysis to diagonal matrices only) (I don't quite understand this part yet)
– Reduce the dimensionality of the data (using feature selection & reduction beforehand)
– Exploit structural dependencies between the attributes of the data,
e.g. using Bayesian Networks, Gaussian Mixture Trees, or other graphical models (some of these will be covered in the follow-up course)