Clustering
Clustering groups similar data objects into clusters using unsupervised techniques.
K-means Clustering
Given a collection of m objects, each with n measurable attributes, k-means proceeds in four steps:
- Choose the value of k, and the k initial guesses for the centroids.
- Compute the distance from each data point (xi, yi) to each centroid, and assign each point to the closest centroid. All points assigned to the same centroid form a cluster, so k clusters are formed in total; Euclidean distance is used as the measure.
- Update the centroid of each cluster to become the center of gravity of the cluster, i.e. compute a new centroid for each cluster formed in step 2. For a cluster of m points, the center of gravity is the mean of its points: ((1/m) Σ xi, (1/m) Σ yi).
- Repeat steps 2 and 3 until convergence.
When does the algorithm count as converged?
Convergence is reached when the computed centroids no longer change, or when the centroids and the assigned points oscillate back and forth between two consecutive iterations. The latter can happen when one or more points are equidistant from several computed centroids.
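The four steps and the convergence test above can be put into code directly. A minimal illustrative sketch in Python (the course's own examples use R's kmeans(); the function and variable names here are my own):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means for tuples of coordinates; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centroid guesses.
    centroids = rng.sample(points, k)
    assign = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the center of gravity (mean) of its cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:  # keep the old centroid if its cluster went empty
                new_centroids.append(centroids[j])
        # Step 4: converged once the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign
```

On two well-separated groups of points, this converges to one centroid per group within a few iterations.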
Determining the Number of Clusters
What should the value of k be, i.e. how many clusters should we use?
We use a heuristic based on the within-cluster sum of squares (WSS) to determine a reasonably optimal value of k.
In the example from the book, shown in the figure, the WSS values tell us how many clusters are most appropriate: the bend ("elbow") of the WSS curve is at k = 3, so 3 clusters fit best.
A concrete R example will appear in the exercises.
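The WSS criterion itself is simple: sum, over every point, the squared distance to its assigned centroid. A small Python sketch (the data points and centroids are made up for illustration) shows why the elbow appears: going from k = 1 to the "right" k drops WSS sharply, after which extra clusters help little:

```python
import math

def wss(points, centroids, assign):
    """Within-cluster sum of squares: total squared distance from
    each point to the centroid of the cluster it is assigned to."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assign))

# Three tight, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (0, 11)]

# k = 1: a single centroid at the overall mean -> large WSS.
mean = tuple(sum(c) / len(points) for c in zip(*points))
wss1 = wss(points, [mean], [0] * len(points))

# k = 3: one centroid per group -> WSS drops sharply (the elbow).
cents3 = [(0, 0.5), (10, 0.5), (0, 10.5)]
assign3 = [0, 0, 1, 1, 2, 2]
wss3 = wss(points, cents3, assign3)
```

Plotting such WSS values against k = 1, 2, 3, ... and looking for the bend is exactly the elbow heuristic described above.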
Diagnostics
A principle
– If using more clusters does not better distinguish the groups, it is almost certainly better to go with fewer clusters
In other words, prefer the smallest number of clusters that still distinguishes the groups well.
Reasons to Choose and Cautions
- Identify any highly correlated attributes: use a scatterplot matrix to find them, and keep only one attribute from each correlated group.
- Units of measure can affect the clustering result: changing an attribute's units yields slightly different clusters.
- Rescaling attributes affects the clustering result: a common choice is to divide each attribute by its standard deviation.
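The rescaling in the last bullet is easy to express in code. A sketch (the function name rescale and the example data are my own) that divides each column by its standard deviation, so that attributes measured in large units do not dominate the Euclidean distance:

```python
import statistics

def rescale(rows):
    """Divide each attribute (column) by its standard deviation, so no
    attribute dominates the distance purely through its units."""
    cols = list(zip(*rows))
    sds = [statistics.pstdev(col) for col in cols]  # assumes no constant column
    return [tuple(v / sd for v, sd in zip(row, sds)) for row in rows]

# Age in years vs. income in dollars: raw distances are dominated by income.
data = [(25, 40000), (30, 42000), (60, 41000)]
scaled = rescale(data)  # every column now has standard deviation 1
```

After rescaling, a 5-year age difference and a 5000-dollar income difference contribute on comparable scales.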
The k-means algorithm is sensitive to the starting positions of the initial centroids. It is therefore important to run the analysis several times for any particular value of k, to ensure the clustering result has the overall smallest WSS. As mentioned earlier, in R this is done with the nstart option of the kmeans() function; here it is run 25 times (nstart = 25) to choose the initial centroids.
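The nstart idea can be mimicked outside R as well: run k-means from several random initializations and keep the solution with the smallest WSS. A compact Python sketch (helper names are my own; a compact single run is inlined just so the block is self-contained):

```python
import math
import random

def kmeans_once(points, k, rng, iters=50):
    """One k-means run from a random initialization; returns (wss, centroids)."""
    cents = rng.sample(points, k)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, cents[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else cents[j])
        if new == cents:
            break
        cents = new
    # WSS of the final solution: squared distance to the nearest centroid.
    wss = sum(min(math.dist(p, c) for c in cents) ** 2 for p in points)
    return wss, cents

def kmeans_restarts(points, k, nstart=25, seed=0):
    """Like R's kmeans(..., nstart = 25): keep the lowest-WSS run."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(nstart))
```

Because each restart may land in a different local minimum, keeping the minimum-WSS run guards against a single unlucky initialization.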
Additional Algorithms
Another clustering method: k-modes
k-modes measures the distance between two records as the number of components in which their attribute values differ.
• What is the distance between (a,b,e,d) and (d,d,d,d)? They differ in the first three components and agree in the last, so the distance is 3.
Sometimes it is better to convert categorical (or symbolic) data to numerical values, e.g. {hot, warm, cold} to {1, 0, -1}.
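The mismatch-count distance is essentially a one-liner. A Python sketch (function name is my own):

```python
def mismatch_distance(a, b):
    """k-modes distance: the number of positions at which two
    categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

# The example from above: three of the four components differ.
d = mismatch_distance(('a', 'b', 'e', 'd'), ('d', 'd', 'd', 'd'))  # 3
```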
Density Based Clustering
Density-based clustering locates regions of high density that are separated from one another by regions of low density.
In other words, clusters are dense regions of the data space, separated from each other by regions of lower object density.
Key characteristics of density-based clustering:
– Discovers clusters of arbitrary shape
– Handles noise
– Needs a density parameter as a termination condition
DBSCAN
This part is fairly straightforward.
Given a density threshold (MinPts) and a radius (Eps), the points in a dataset are classified into three types: core point, border point, and noise point
Core point: a point whose density >= MinPts.
A border point is not a core point but falls within the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
In the example (MinPts = 6): the density of A is 7 >= 6, so A is a core point; the density of B is 4 < 6, so B is not a core point; the density of C is 3 < 6 and C does not fall within the neighborhood of any core point, so C is not even a border point, i.e. it is noise.
Steps of DBSCAN clustering
• Step 1: Label each point as either core, border, or noise point.
• Step 2: Mark each group of Eps-connected core points as a separate cluster.
• Step 3: Assign each border point to one of the clusters of its associated core points.
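The three steps above can be sketched directly. A minimal Python implementation (O(n²) neighborhood search; density counts the point itself, consistent with the example above; all names are my own):

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns (labels, kinds): labels[i] is a cluster
    id (-1 for noise); kinds[i] is 'core', 'border', or 'noise'."""
    n = len(points)
    # Eps-neighborhoods (including the point itself).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Step 1: label each point as core, border, or noise.
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    kinds = []
    for i in range(n):
        if core[i]:
            kinds.append('core')
        elif any(core[j] for j in nbrs[i]):
            kinds.append('border')
        else:
            kinds.append('noise')
    # Step 2: group Eps-connected core points into clusters (BFS).
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            labels[i] = cid
            queue = deque([i])
            while queue:
                p = queue.popleft()
                for j in nbrs[p]:
                    if core[j] and labels[j] == -1:
                        labels[j] = cid
                        queue.append(j)
            cid += 1
    # Step 3: attach each border point to a cluster of one of its core neighbors.
    for i in range(n):
        if kinds[i] == 'border':
            labels[i] = next(labels[j] for j in nbrs[i] if core[j])
    return labels, kinds
```

On a tight group of points plus one isolated point, the group becomes one cluster and the isolated point is labeled noise.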
DBSCAN:
• Resistant to noise and outliers
• Can handle clusters of different shapes and sizes
• Computational complexity is similar to K-means
When DBSCAN does not work well
• Varying densities: can be overcome by using sampling.
• Sparse and high-dimensional data: can be overcome by using topology-preserving dimension reduction techniques.
References
- Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 27 Jan. 2015
- Data Mining: The Textbook, by Charu C. Aggarwal, Springer, 2015
- C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
- Computer Vision: A Modern Approach (2nd Edition), by David A. Forsyth and Jean Ponce, Pearson, 2011
Images are from the lecture slides and my own notes; the Chinese-language images come from the web.