Data Science and Big Data Analytics Study Notes 5: Clustering


Clustering is an unsupervised technique that groups similar data objects into clusters.

K-means Clustering

Given a collection of m objects, each with n measurable attributes, the algorithm proceeds in four steps:

  1. Choose the value of k, and the k initial guesses for the centroids.
  2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid. All points assigned to the same centroid form a cluster, so k clusters are formed in total. Distance is measured with the Euclidean distance: for points p = (p1, ..., pn) and q = (q1, ..., qn), d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2).
  3. Update the centroid of each cluster to become the center of gravity of the cluster formed in Step 2. The center of gravity is the mean of the cluster's points: each coordinate of the new centroid is the average of that coordinate over all points in the cluster.
  4. Repeat Steps 2 and 3 until convergence.
    When is convergence reached? Convergence is reached when the computed centroids no longer change, or when the centroids and the assigned points oscillate back and forth between two consecutive iterations. The latter can happen when one or more points are equidistant from two or more computed centroids.
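The four steps above can be sketched in code. The book's examples use R; the following NumPy version is only an illustrative sketch, with the function name and random initialization scheme chosen here for demonstration:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids (k distinct data points at random).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the center of gravity of its cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

For two well-separated groups of points, the loop converges in a few iterations and labels each group as its own cluster.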

Determining the Number of Clusters

What should the value of k be? How many clusters should we use?
A heuristic based on the Within Sum of Squares (WSS) is used to determine a reasonably optimal value of k. WSS is the sum, over all points, of the squared distance from each point to the centroid of its assigned cluster.
Plotting WSS against k, the "elbow" of the curve, i.e. the value of k beyond which adding more clusters yields only a small reduction in WSS, indicates the most suitable number of clusters. In the example from the book, the elbow occurs at k = 3, so 3 clusters is the most appropriate choice.

A concrete R example will be shown in the exercises.

Diagnostics

A principle:
– If using more clusters does not better distinguish the groups, it is almost certainly better to go with fewer clusters.
In other words, among equally good solutions, the optimal number of clusters is the smallest one.

Reasons to Choose and Cautions

  1. Identify any highly correlated attributes
    Use a scatterplot matrix to find highly correlated attributes, and keep only one of each correlated group.
  2. Units of measure could affect the clustering result
    Changing the units of an attribute yields slightly different clusters.
  3. Rescaling attributes affects the clustering result
    – For example, divide each attribute by its standard deviation.
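The rescaling in point 3 is a one-liner; a minimal sketch (the function name is mine):

```python
import numpy as np

def rescale(points):
    """Divide each attribute (column) by its standard deviation."""
    points = np.asarray(points, dtype=float)
    return points / points.std(axis=0)
```

After rescaling, every attribute has standard deviation 1, so no single attribute dominates the distance computation.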

The k-means algorithm is sensitive to the initial positions of the centroids. It is therefore important to run k-means several times for a given k, to ensure the clustering result achieves the overall minimum WSS. As mentioned earlier, this can be done in R with the nstart option of the kmeans() function; for example, nstart = 25 runs the algorithm with 25 different sets of initial centroids and keeps the best result.
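The same idea exists outside R; for instance, scikit-learn's KMeans has an analogous n_init parameter. A sketch (the toy data is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)

# Analogous to kmeans(data, centers=3, nstart=25) in R: run 25 random
# initializations and keep the solution with the smallest WSS (inertia_).
model = KMeans(n_clusters=3, n_init=25, random_state=0).fit(points)
print(model.inertia_)  # the minimum WSS among the 25 runs
```

With well-separated pairs like these, all 25 runs almost certainly agree on the same optimal partition.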

Additional Algorithms

Other clustering methods: k-modes
For categorical attributes, k-modes uses the number of differences in the respective components of the attributes as the distance.
• What is the distance between (a,b,e,d) and (d,d,d,d)? They differ in the first three components, so the distance is 3.
Sometimes it is better to convert categorical (or symbolic) data to numerical values, e.g. {hot, warm, cold} to {1, 0, -1}.
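The mismatch distance used by k-modes is easy to state in code (a minimal sketch; the function name is mine):

```python
def kmodes_distance(a, b):
    """Number of positions where two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))

print(kmodes_distance(('a', 'b', 'e', 'd'), ('d', 'd', 'd', 'd')))  # 3
```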

Density Based Clustering

Density-based clustering locates regions of high density that are separated from one another by regions of low density.

In other words, clusters are dense regions in the data space, separated by regions of lower density.
Main features of density-based clustering:

– Discovers clusters of arbitrary shape
– Handles noise
– Needs density parameters as a termination condition

DBSCAN

This part is fairly easy to understand.

Given a density threshold (MinPts) and a radius (Eps), the points in a dataset are classified into three types: core point, border point, and noise point
Core point: a point whose density (the number of points within distance Eps of it) is >= MinPts.
A border point is not a core point but falls within the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
In the book's illustration (MinPts = 6): the density of A is 7 >= 6, so A is a core point; the density of B is 4 < 6, but B falls within the neighborhood of a core point, so B is a border point; the density of C is 3 < 6 and C does not fall within the neighborhood of any core point, so C is a noise point.
Steps of DBSCAN clustering:
• Step 1: Label each point as either a core, border, or noise point.
• Step 2: Mark each group of Eps-connected core points as a separate cluster.
• Step 3: Assign each border point to one of the clusters of its associated core points.
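Step 1 can be sketched directly from the definitions; this brute-force version computes each point's density as the number of points within Eps (the point itself included, as in the example above), then applies the three rules (the function name is mine):

```python
import numpy as np

def label_points(points, eps, min_pts):
    """Step 1 of DBSCAN: label each point 'core', 'border', or 'noise'."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances; a point's density counts the point itself.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dists <= eps
    density = neighbors.sum(axis=1)
    core = density >= min_pts
    labels = []
    for i in range(len(points)):
        if core[i]:
            labels.append('core')
        elif neighbors[i][core].any():   # within Eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels
```

A tight group of points yields core points, a point hanging on the edge of the group becomes a border point, and an isolated point is noise.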
Strengths of DBSCAN:
• Resistant to noise and outliers
• Can handle clusters of different shapes and sizes
• Computational complexity is similar to that of k-means
When DBSCAN does not work well:
• Varying densities
  – Can be overcome by using sampling
• Sparse and high-dimensional data
  – Can be overcome by using topology-preserving dimension reduction techniques
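For completeness, a hypothetical scikit-learn call (its min_samples parameter plays the role of MinPts and eps of Eps; noise points are labeled -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away noise point.
blob1 = rng.normal((0, 0), 0.1, size=(20, 2))
blob2 = rng.normal((5, 5), 0.1, size=(20, 2))
data = np.vstack([blob1, blob2, [[10.0, -10.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit(data).labels_
print(sorted(set(labels)))  # two clusters (0, 1) and noise (-1)
```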

参考书目

  1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015

  2. Data Mining: The Textbook by Charu C. Aggarwal, Springer 2015

  3. C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

  4. Computer Vision: A Modern Approach (2nd Edition), by David A. Forsyth and Jean Ponce, Pearson, 2011.

Images are from the course slides and my own notes. The Chinese-language images are from the web.
