Cluster Analysis (1)

Author: hyk今天写算法了吗

Last modified: 2022-06-29 18:30:30

Category: Machine Learning and Deep Learning. Tags: clustering algorithms, machine learning

First published: 2022-01-25 18:11:37

Article link: https://blog.csdn.net/m0_52000372/article/details/122690011

1. Concepts

1.1 Definition of clustering

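Clustering partitions a data set into classes, or clusters, according to a chosen criterion such as distance, so that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. In other words, after clustering, data of the same class sit close together and data of different classes are kept well apart.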
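
One common way to make "as similar as possible within a cluster" precise, assuming squared Euclidean distance as the criterion, is the within-cluster sum of squared errors (the SSE mentioned in section 1.3). The symbols below are introduced here for illustration and do not come from the original text: $K$ is the number of clusters, $C_k$ is the set of points in cluster $k$, and $\mu_k$ is its mean.

```latex
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller SSE means tighter clusters. Other criteria, such as density or graph connectivity, lead to different formalizations.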

1.2 Clustering versus classification

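Clustering: groups similar data together without caring what label each group carries; the only goal is to bring similar data together. Clustering is an unsupervised learning method.
Classification: separates data into distinct classes. A classifier is first learned from a labeled training set and is then used to predict unknown data. Classification is a supervised learning method.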
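
The contrast can be sketched on a toy one-dimensional data set. Everything in the block is an assumption made for illustration: the height values, the label names, and the two deliberately simple methods (splitting at the widest gap for clustering, a nearest-class-mean rule for classification). It is not a method taken from the original post.

```python
from statistics import mean


def cluster_by_largest_gap(values):
    """Unsupervised: split 1-D values into two groups at the widest gap.

    No labels are used, and the group ids 0 and 1 carry no meaning.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    i = gaps.index(max(gaps))
    cut = (ordered[i] + ordered[i + 1]) / 2
    return [0 if v < cut else 1 for v in values]


def train_nearest_mean(values, labels):
    """Supervised: learn one mean per labeled class from the training data."""
    classes = sorted(set(labels))
    return {c: mean(v for v, l in zip(values, labels) if l == c) for c in classes}


def predict(class_means, value):
    """Assign the class whose learned mean is closest to the new value."""
    return min(class_means, key=lambda c: abs(value - class_means[c]))


# Hypothetical data: six heights in centimetres.
heights_cm = [150.0, 152.0, 155.0, 180.0, 183.0, 185.0]

# Clustering needs only the data.
cluster_ids = cluster_by_largest_gap(heights_cm)

# Classification needs the data plus a label for every training sample.
labels = ["short", "short", "short", "tall", "tall", "tall"]
model = train_nearest_mean(heights_cm, labels)
prediction = predict(model, 178.0)

print(cluster_ids, prediction)
```

The clustering step returns anonymous group ids, while the classifier returns one of the label names it was trained on.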

1.3 General clustering workflow

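Data preparation: standardize the features and reduce dimensionality
Feature selection: choose the most useful of the original features and store them in a vector
Feature extraction: transform the selected features into new, more salient features
Clustering: measure similarity with a chosen distance function and obtain the clusters
Evaluation of clustering results: analyze the clustering results, for example with the sum of squared errors (SSE)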
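
The steps above can be sketched as a scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the author's implementation: the data are synthetic, and the choices of `VarianceThreshold` for feature selection, `PCA` for feature extraction, and `KMeans` for the clustering step are picked only to give each stage a concrete stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): 3 blobs in 5 dimensions,
# plus one constant column that carries no information.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X = np.hstack([X, np.ones((X.shape[0], 1))])

pipeline = make_pipeline(
    StandardScaler(),                  # data preparation: standardize features
    VarianceThreshold(threshold=0.0),  # feature selection: drop constant columns
    PCA(n_components=2),               # feature extraction: 2 new features
    KMeans(n_clusters=3, n_init=10, random_state=0),  # clustering
)

labels = pipeline.fit_predict(X)

# Evaluation: KMeans exposes the SSE as `inertia_`, measured in the
# 2-D feature space produced by the earlier steps.
sse = pipeline[-1].inertia_
print(f"clusters found: {len(set(labels.tolist()))}, SSE: {sse:.2f}")
```

Because the SSE always shrinks as the number of clusters grows, it is usually compared across several values of `n_clusters` and not read in isolation.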

2. Clustering methods

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
| --- | --- | --- | --- | --- |
| K-Means | number of clusters | Very large `n_samples`, medium `n_clusters` with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with `n_samples` | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with `n_samples` | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium `n_samples`, small `n_clusters` | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large `n_samples` and `n_clusters` | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large `n_samples` and `n_clusters` | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large `n_samples`, medium `n_clusters` | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large `n_clusters` and `n_samples` | Large dataset, outlier removal, data reduction | Euclidean distance between points |
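
The "flat geometry" versus "non-flat geometry" distinction in the table can be tried out with a short sketch. The two-moons data set, the `eps` and `min_samples` values, and the use of the adjusted Rand index as a score are all assumptions chosen for illustration. The comparison is only meaningful here because the synthetic data come with known groups.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a simple example of non-flat geometry.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means takes the number of clusters as its parameter.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN takes a neighborhood size (`eps`) and a density threshold
# (`min_samples`); it labels any noise points as -1.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the known groups.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f}")
```

On data like this, K-Means tends to cut each moon in half because it separates points by distance to a center, while DBSCAN follows the dense curved shapes. On compact, roughly spherical clusters the ranking can easily reverse, so the result should not be read as one method being better in general.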