What is Clustering

Clustering partitions unlabeled data into subsets

• The resulting subsets (called clusters) should consist of data points that are internally similar but externally distinctive
• Various measures of "similarity" and "distinctiveness" of data points exist (example: Euclidean distance, in continuous spaces)

• Sometimes such measures rely on subjective design choices (e.g. apple, banana, monkey)

• Various representations of the clusters:
– generative (a.k.a. prototypical, distributional, …)

– discriminative

Clustering Methods

• Hierarchical

– agglomerative (bottom-up): merging clusters from the bottom up

(1) Assume each data point forms its own cluster
(2) Find the pair of clusters closest to each other (all pairwise distances are known)
(3) Combine these two clusters into one
(4) Update the distance matrix
(5) Repeat (2) to (4) until only one cluster is left, unless some other stopping condition is satisfied earlier

These steps require a notion of distance between two clusters, e.g.:

– smallest distance between any pair of members (single linkage)
– largest distance between any pair of members (complete linkage)
– distance between the centroids (means)
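The agglomerative steps above can be sketched in a few lines. This minimal version uses the first criterion (single linkage); the `agglomerative` helper and the toy points are illustrative, not from the notes:

```python
import numpy as np

def agglomerative(points, n_clusters=1):
    """Bottom-up merging with single linkage (smallest member-pair distance)."""
    clusters = [[i] for i in range(len(points))]   # (1) each point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):             # (2) find the closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)             # (3) merge them into one cluster;
                                                   # (4) distances are recomputed above
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(pts, n_clusters=2))            # → [[0, 1], [2, 3]]
```

Stopping at `n_clusters` rather than at a single cluster illustrates the "other stopping condition" mentioned in step (5).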

– divisive (top-down): splitting from the top down, the reverse of the agglomerative approach

(1) Assume all data points form a single cluster
(2) For each existing cluster containing more than one data point, find the pair of points that are farthest apart
(3) Select the cluster whose pair is separated by the greatest distance; select this pair
(4) Create two new clusters, one for each member of the selected pair
(5) Distribute the remaining points of the selected cluster between the two new clusters by similarity
(6) Repeat (2) to (5) until each cluster has exactly one point in it,
unless some stopping condition is satisfied earlier
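A matching sketch of the divisive procedure (again a toy `divisive` helper; the remaining points are distributed to whichever seed of the split pair is nearer):

```python
import numpy as np

def divisive(points, n_clusters):
    """Top-down splitting: repeatedly break the cluster with the widest point pair."""
    clusters = [list(range(len(points)))]          # (1) one all-inclusive cluster
    while len(clusters) < n_clusters:
        best = None                                # (2)-(3) widest pair over all clusters
        for c_idx, c in enumerate(clusters):
            for x in range(len(c)):
                for y in range(x + 1, len(c)):
                    d = np.linalg.norm(points[c[x]] - points[c[y]])
                    if best is None or d > best[0]:
                        best = (d, c_idx, c[x], c[y])
        _, c_idx, p, q = best
        old = clusters.pop(c_idx)
        new_p, new_q = [p], [q]                    # (4) each pair member seeds a cluster
        for i in old:                              # (5) distribute the remaining points
            if i not in (p, q):
                near_p = (np.linalg.norm(points[i] - points[p])
                          <= np.linalg.norm(points[i] - points[q]))
                (new_p if near_p else new_q).append(i)
        clusters += [new_p, new_q]                 # (6) repeat until enough clusters
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(sorted(sorted(c) for c in divisive(pts, 2)))  # → [[0, 1], [2, 3]]
```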

– K-means

(1) Guess (or assume) initial locations of the cluster centers
(2) For each data point, compute its distance to each of the centers
(3) Assign the point to the nearest center
(4) Re-compute the locations of the involved centers as the means of the assigned data points
(5) Repeat (2) to (4) until no assignments change

K-means is still fairly computationally expensive: the distance from every point to every center must be recomputed in each pass, although no full distance matrix needs to be maintained.
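The K-means loop above, as a small NumPy sketch (the `kmeans` helper and the toy data are illustrative):

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Lloyd's iteration for the steps above; `centers` is the initial guess (1)."""
    points = np.asarray(points, dtype=float)
    centers = np.array(centers, dtype=float)
    assign = None
    for _ in range(max_iter):
        # (2)-(3): distance of every point to every center; take the nearest
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                  # (5) no assignment changed
        assign = new_assign
        for k in range(len(centers)):              # (4) centers = means of assigned points
            members = points[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return assign, centers

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assign, centers = kmeans(pts, centers=[[0.0, 0.0], [5.0, 5.0]])
print(assign.tolist())                             # → [0, 0, 1, 1]
```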

• Density-based

– mixture models

Fit to the data a set of (fairly simple and tractable) additive component models. For instance, using Gaussians:

p(x) = Σ_k π_k · N(x; μ_k, Σ_k), with mixing weights π_k summing to 1

Expectation-Maximization Algorithm

EM starts similarly to K-means, but besides guessing the initial centers, EM also needs initial guesses for the (co)variances and the prior probabilities (these can be estimated as the relative frequencies of data belonging to each cluster).

E-step of the E-M procedure:

– Given the current values of the Gaussians' parameters, we compute the expectation of how likely it is for each data point to belong to each of the Gaussians.
(Unlike K-means, a point may now belong to several clusters at once, as long as its membership probabilities sum to 1.)

the M-step:

– We compute new estimates of the Gaussians' parameters that maximize the likelihood of the recently recomputed cluster-membership pattern.

Then we repeat the E and M steps until the model matches the data closely enough (or until the data-point assignments stabilize).

EM is guaranteed to converge in the end, but not necessarily to the global optimum.
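Putting the E and M steps together for a one-dimensional, two-Gaussian mixture (a sketch; `em_gmm_1d`, the synthetic data, and the initial guesses are all illustrative):

```python
import numpy as np

def em_gmm_1d(x, mu, var, pi, n_iter=50):
    """E-M for a 1-D Gaussian mixture; mu, var, pi are the initial guesses."""
    x = np.asarray(x, dtype=float)
    mu, var, pi = (np.asarray(a, dtype=float).copy() for a in (mu, var, pi))
    for _ in range(n_iter):
        # E-step: soft responsibilities -- each point may belong to every
        # Gaussian, with membership weights summing to 1 per point
        dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the likelihood of these memberships
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
mu, var, pi = em_gmm_1d(x, mu=[1.0, 9.0], var=[1.0, 1.0], pi=[0.5, 0.5])
print(mu.round(1))   # close to the true means 0 and 10
```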

Occam’s Razor:
Other things being equal, simple theories are preferable to complex ones

Minimum Description Length principle (MDL)

postulates selecting models that need the smallest amount of memory to encode:
Minimize the sum of the following two factors:
– Amount of memory required to describe the data using the model

– Amount of memory required to describe the model and its parameters
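A toy illustration of the two-part sum, using the common (n_params/2)·ln n approximation for the cost of encoding real-valued parameters (scores in nats; the `mixture_nll`/`mdl` helpers and the numbers are made up for illustration):

```python
import numpy as np

def mixture_nll(x, mu, var, pi):
    """Cost of describing the data using the model: negative log-likelihood."""
    mu, var, pi = (np.asarray(a, dtype=float) for a in (mu, var, pi))
    dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
            / np.sqrt(2 * np.pi * var))
    return -np.log(dens.sum(axis=1)).sum()

def mdl(x, mu, var, pi):
    """Data cost + model cost, with (n_params/2) * ln(n) as the model cost."""
    n_params = 2 * len(mu) + (len(mu) - 1)   # means, variances, free mixing weights
    return mixture_nll(x, mu, var, pi) + 0.5 * n_params * np.log(len(x))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)                # data that truly is one Gaussian
one = mdl(x, [x.mean()], [x.var()], [1.0])
two = mdl(x, [x.mean() - 0.1, x.mean() + 0.1], [x.var(), x.var()], [0.5, 0.5])
print(one < two)                             # the simpler model scores better
```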

The number of Gaussian-mixture parameters to fit grows very fast with the dimensionality of the data space. This increases the risk of over-fitting and can reduce the statistical significance of the learned models. Possible remedies:

– Put constraints on the shapes of the covariance matrices (e.g. restrict the analysis to diagonal matrices only, etc.)
– Reduce the dimensionality of the data (using feature selection & reduction beforehand)
– Exploit structural dependencies between attributes of the data,
e.g. using Bayesian Networks, Gaussian Mixture Trees, or other Graphical Models (some of these will be covered in the follow-up course)
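To make the parameter growth, and the saving from the diagonal-covariance restriction, concrete (`gmm_param_count` is an illustrative counting helper):

```python
def gmm_param_count(d, k, diagonal=False):
    """Free parameters of a k-component Gaussian mixture in d dimensions:
    k*d means, k covariance matrices, and k-1 independent mixing weights."""
    cov = d if diagonal else d * (d + 1) // 2    # symmetric matrix vs. its diagonal
    return k * d + k * cov + (k - 1)

print(gmm_param_count(2, 3))                     # → 17: modest in low dimensions
print(gmm_param_count(100, 3))                   # → 15452: full covariances dominate
print(gmm_param_count(100, 3, diagonal=True))    # → 602: the diagonal restriction
```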