Clustering
Clustering groups similar data objects into clusters using unsupervised techniques.
K-means Clustering
Given a collection of m objects, each with n measurable attributes, k-means proceeds in four steps:
- Choose the value of k, and the k initial guesses for the centroids.
- Compute the distance from each data point (xi, yi) to each centroid, and assign each point to the closest centroid. All points assigned to the same centroid form a cluster, so k clusters are formed in total; Euclidean distance is used as the measure.
- Update the centroid of each cluster to become the center of gravity of the cluster, i.e. compute a new centroid for each cluster formed in step 2. For a cluster of m points, the center of gravity is the mean of its points: ((1/m) Σ xi, (1/m) Σ yi).
- Repeat steps 2 and 3 until convergence.
When does the algorithm count as converged?
Convergence is reached when the computed centroids no longer change, or when the centroids and the assigned points oscillate back and forth between two consecutive iterations. The latter can happen when one or more points are equidistant from several computed centroids.
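The four steps and the convergence test above can be put into code directly. A minimal illustrative sketch in Python (the course's own examples use R's kmeans(); the function and variable names here are my own):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means for tuples of coordinates; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centroid guesses.
    centroids = rng.sample(points, k)
    assign = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the center of gravity (mean) of its cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:  # keep the old centroid if its cluster went empty
                new_centroids.append(centroids[j])
        # Step 4: converged once the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign
```

On two well-separated groups of points, this converges to one centroid per group within a few iterations.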
Determining the Number of Clusters
What should the value of k be, i.e. how many clusters should we use?
We use a heuristic based on the within-cluster sum of squares (WSS) to determine a reasonably optimal value of k.
In the example from the book, shown in the figure, the WSS values tell us how many clusters are most appropriate: the bend ("elbow") of the WSS curve is at k = 3, so 3 clusters fit best.
A concrete R example will appear in the exercises.
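The WSS criterion itself is simple: sum, over every point, the squared distance to its assigned centroid. A small Python sketch (the data points and centroids are made up for illustration) shows why the elbow appears: going from k = 1 to the "right" k drops WSS sharply, after which extra clusters help little:

```python
import math

def wss(points, centroids, assign):
    """Within-cluster sum of squares: total squared distance from
    each point to the centroid of the cluster it is assigned to."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assign))

# Three tight, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (0, 11)]

# k = 1: a single centroid at the overall mean -> large WSS.
mean = tuple(sum(c) / len(points) for c in zip(*points))
wss1 = wss(points, [mean], [0] * len(points))

# k = 3: one centroid per group -> WSS drops sharply (the elbow).
cents3 = [(0, 0.5), (10, 0.5), (0, 10.5)]
assign3 = [0, 0, 1, 1, 2, 2]
wss3 = wss(points, cents3, assign3)
```

Plotting such WSS values against k = 1, 2, 3, ... and looking for the bend is exactly the elbow heuristic described above.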
Diagnostics
A principle
– If using more clusters does not better distinguish the groups, it is almost certainly better to go with fewer clusters
In other words, prefer the smallest number of clusters that still distinguishes the groups well.
Reasons to Choose and Cautions
- Identify any highly correlated attributes: use a scatterplot matrix to find them, and keep only one attribute from each correlated group.
- Units of measure can affect the clustering result: changing an attribute's units yields slightly different clusters.
- Rescaling attributes affects the clustering result: a common choice is to divide each attribute by its standard deviation.
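The rescaling in the last bullet is easy to express in code. A sketch (the function name rescale and the example data are my own) that divides each column by its standard deviation, so that attributes measured in large units do not dominate the Euclidean distance:

```python
import statistics

def rescale(rows):
    """Divide each attribute (column) by its standard deviation, so no
    attribute dominates the distance purely through its units."""
    cols = list(zip(*rows))
    sds = [statistics.pstdev(col) for col in cols]  # assumes no constant column
    return [tuple(v / sd for v, sd in zip(row, sds)) for row in rows]

# Age in years vs. income in dollars: raw distances are dominated by income.
data = [(25, 40000), (30, 42000), (60, 41000)]
scaled = rescale(data)  # every column now has standard deviation 1
```

After rescaling, a 5-year age difference and a 5000-dollar income difference contribute on comparable scales.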
The k-means algorithm is sensitive to the starting positions of the initial centroids. It is therefore important to run the analysis several times for any particular value of k, to ensure the clustering result has the overall smallest WSS. As mentioned earlier, in R this is done with the nstart option of the kmeans() function; here it is run 25 times (nstart = 25) to choose the initial centroids.
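The nstart idea can be mimicked outside R as well: run k-means from several random initializations and keep the solution with the smallest WSS. A compact Python sketch (helper names are my own; a compact single run is inlined just so the block is self-contained):

```python
import math
import random

def kmeans_once(points, k, rng, iters=50):
    """One k-means run from a random initialization; returns (wss, centroids)."""
    cents = rng.sample(points, k)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, cents[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else cents[j])
        if new == cents:
            break
        cents = new
    # WSS of the final solution: squared distance to the nearest centroid.
    wss = sum(min(math.dist(p, c) for c in cents) ** 2 for p in points)
    return wss, cents

def kmeans_restarts(points, k, nstart=25, seed=0):
    """Like R's kmeans(..., nstart = 25): keep the lowest-WSS run."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(nstart))
```

Because each restart may land in a different local minimum, keeping the minimum-WSS run guards against a single unlucky initialization.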
Additional Algorithms
Another clustering method: k-modes
k-modes measures the distance between two records as the number of components in which their attribute values differ.
• What is the distance between (a,b,e,d) and (d,d,d,d)? They differ in the first three components and agree in the last, so the distance is 3.
Sometimes it is better to convert categorical (or symbolic) data to numerical values, e.g. {hot, warm, cold} to {1, 0, -1}.
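The mismatch-count distance is essentially a one-liner. A Python sketch (function name is my own):

```python
def mismatch_distance(a, b):
    """k-modes distance: the number of positions at which two
    categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

# The example from above: three of the four components differ.
d = mismatch_distance(('a', 'b', 'e', 'd'), ('d', 'd', 'd', 'd'))  # 3
```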
Density Based Clustering
Density-based clustering locates regions of high density that are separated from one another by regions of low density.
In other words, clusters are dense regions of the data space, separated from each other by regions of lower object density.
Key characteristics of density-based clustering:
– Discovers clusters of arbitrary shape
– Handles noise
– Needs a density parameter as a termination condition
DBSCAN
This part is fairly straightforward.
Given a density threshold (MinPts) and a radius (Eps), the points in a dataset are classified into three types: core point, border point, and noise point
Core point: a point whose density >= MinPts.
A border point is not a core point but falls within the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
In the example (MinPts = 6): the density of A is 7 >= 6, so A is a core point; the density of B is 4 < 6, so B is not a core point; the density of C is 3 < 6 and C does not fall within the neighborhood of any core point, so C is not even a border point, i.e. it is noise.
Steps of DBSCAN clustering
• Step 1: Label each point as either core, border, or noise point.
• Step 2: Mark each group of Eps-connected core points as a separate cluster.
• Step 3: Assign each border point to one of the clusters of its associated core points.
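The three steps above can be sketched directly. A minimal Python implementation (O(n²) neighborhood search; density counts the point itself, consistent with the example above; all names are my own):

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns (labels, kinds): labels[i] is a cluster
    id (-1 for noise); kinds[i] is 'core', 'border', or 'noise'."""
    n = len(points)
    # Eps-neighborhoods (including the point itself).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Step 1: label each point as core, border, or noise.
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    kinds = []
    for i in range(n):
        if core[i]:
            kinds.append('core')
        elif any(core[j] for j in nbrs[i]):
            kinds.append('border')
        else:
            kinds.append('noise')
    # Step 2: group Eps-connected core points into clusters (BFS).
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            labels[i] = cid
            queue = deque([i])
            while queue:
                p = queue.popleft()
                for j in nbrs[p]:
                    if core[j] and labels[j] == -1:
                        labels[j] = cid
                        queue.append(j)
            cid += 1
    # Step 3: attach each border point to a cluster of one of its core neighbors.
    for i in range(n):
        if kinds[i] == 'border':
            labels[i] = next(labels[j] for j in nbrs[i] if core[j])
    return labels, kinds
```

On a tight group of points plus one isolated point, the group becomes one cluster and the isolated point is labeled noise.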
DBSCAN:
• Resistant to noise and outliers
• Can handle clusters of different shapes and sizes
• Computational complexity is similar to K-means
When DBSCAN does not work well
• Varying densities: can be overcome by using sampling.
• Sparse and high-dimensional data: can be overcome by using topology-preserving dimension reduction techniques.
References
- Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015
- Data Mining: The Textbook, by Charu C. Aggarwal, Springer, 2015
- C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
- Computer Vision: A Modern Approach (2nd Edition), by David A. Forsyth and Jean Ponce, Pearson, 2011
Images are from the lecture slides and my own notes; the Chinese-language images come from the web.