Scikit Learn - Clustering Methods

Here, we will study the clustering methods in Sklearn, which help identify similarity among data samples.

Clustering methods are among the most useful unsupervised ML techniques; they are used to find similarity and relationship patterns among data samples. They then group those samples into clusters based on feature similarity. Clustering is important because it determines the intrinsic grouping in the unlabeled data at hand.

The Scikit-learn library has sklearn.cluster to perform clustering of unlabeled data. Under this module, scikit-learn provides the following clustering methods −

KMeans

This algorithm computes the centroids and iterates until it finds the optimal centroids. It requires the number of clusters to be specified, which is why it assumes they are already known. The main logic of this algorithm is to cluster the data by separating the samples into n groups of equal variance, minimizing a criterion known as the inertia. The number of clusters identified by the algorithm is represented by 'K'.

Scikit-learn has the sklearn.cluster.KMeans module to perform K-Means clustering. While computing cluster centers and the value of inertia, the parameter named sample_weight allows the sklearn.cluster.KMeans module to assign more weight to some samples.

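For instance, a minimal sketch of weighted fitting (the toy data and weights below are invented purely for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Two obvious groups of points (hypothetical toy data)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
weights = np.array([1, 3, 1, 1, 1, 1])   # the heavier sample [1, 4] pulls its cluster center upwards
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X, sample_weight=weights)
print(kmeans.cluster_centers_)   # weighted means of each cluster
print(kmeans.inertia_)           # weighted sum of squared distances to centers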

Affinity Propagation

This algorithm is based on the concept of 'message passing' between different pairs of samples until convergence. It does not require the number of clusters to be specified before running the algorithm. Its biggest disadvantage is a time complexity of the order O(N²T).

Scikit-learn has the sklearn.cluster.AffinityPropagation module to perform Affinity Propagation clustering.

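A minimal sketch (toy data invented for illustration) showing how exemplars emerge without specifying the number of clusters:

from sklearn.cluster import AffinityPropagation
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# damping (between 0.5 and 1) stabilises the message-passing updates
af = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print(af.cluster_centers_indices_)  # indices of the exemplar samples
print(af.labels_)                   # cluster assignment for every sample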

Mean Shift

This algorithm mainly discovers blobs in a smooth density of samples. It assigns the data points to clusters iteratively by shifting points towards the highest density of data points. Rather than requiring the number of clusters in advance, it sets the number of clusters automatically, relying instead on a parameter named bandwidth that dictates the size of the region to search through.

Scikit-learn has the sklearn.cluster.MeanShift module to perform Mean Shift clustering.

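A minimal sketch (toy data invented for illustration); note that the bandwidth can be estimated from the data rather than hand-tuned:

from sklearn.cluster import MeanShift, estimate_bandwidth
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 0], [8, 7], [8, 8], [7, 7]])
bw = estimate_bandwidth(X, quantile=0.5)   # size of the region to search through
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)
print(len(np.unique(ms.labels_)))          # number of clusters found automatically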

Spectral Clustering

Before clustering, this algorithm uses the eigenvalues, i.e. the spectrum, of the similarity matrix of the data to perform dimensionality reduction into fewer dimensions. This algorithm is not advisable when there is a large number of clusters.

Scikit-learn has the sklearn.cluster.SpectralClustering module to perform Spectral clustering.

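A minimal sketch (toy data invented for illustration), using the default RBF affinity to build the similarity matrix:

from sklearn.cluster import SpectralClustering
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]])
sc = SpectralClustering(n_clusters=2, assign_labels='kmeans', random_state=0)
print(sc.fit_predict(X))   # labels obtained after the spectral embedding step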

Hierarchical Clustering

This algorithm builds nested clusters by merging or splitting the clusters successively. This cluster hierarchy is represented as a dendrogram, i.e. a tree. It falls into the following two categories −

Agglomerative hierarchical algorithms − In this kind of hierarchical algorithm, every data point starts as a single cluster. Pairs of clusters are then successively agglomerated. This uses the bottom-up approach.

Divisive hierarchical algorithms − In this hierarchical algorithm, all data points start as one big cluster. Clustering then involves dividing, using a top-down approach, the one big cluster into various small clusters.

Scikit-learn has the sklearn.cluster.AgglomerativeClustering module to perform Agglomerative Hierarchical clustering.

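A minimal sketch of the bottom-up (agglomerative) variant (toy data invented for illustration):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# ward linkage merges, at each step, the pair of clusters that least increases variance
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
print(agg.fit_predict(X))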

DBSCAN

It stands for “Density-Based Spatial Clustering of Applications with Noise”. This algorithm is based on the intuitive notion of “clusters” and “noise”: clusters are dense regions in the data space, separated by regions of lower data-point density.

Scikit-learn has the sklearn.cluster.DBSCAN module to perform DBSCAN clustering. There are two important parameters, namely min_samples and eps, used by this algorithm to define density.

A higher value of the parameter min_samples or a lower value of the parameter eps indicates the higher density of data points that is necessary to form a cluster.

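A minimal sketch (toy data invented for illustration; the last point is a deliberate outlier):

from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
# eps: neighborhood radius; min_samples: points needed to form a dense region
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)   # noise points are labelled -1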

OPTICS

It stands for “Ordering Points To Identify the Clustering Structure”. This algorithm also finds density-based clusters in spatial data. Its basic working logic is similar to that of DBSCAN.

It addresses a major weakness of the DBSCAN algorithm, namely the problem of detecting meaningful clusters in data of varying density, by ordering the points of the database in such a way that spatially closest points become neighbors in the ordering.

Scikit-learn has the sklearn.cluster.OPTICS module to perform OPTICS clustering.

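A minimal sketch (toy data invented for illustration), also showing the reachability ordering that distinguishes OPTICS from DBSCAN:

from sklearn.cluster import OPTICS
import numpy as np

X = np.array([[1, 2], [2, 5], [3, 6], [8, 7], [8, 8], [7, 3]])
opt = OPTICS(min_samples=2).fit(X)
print(opt.labels_)
print(opt.reachability_[opt.ordering_])   # spatially close points end up adjacent in the ordering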

BIRCH

It stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is used to perform hierarchical clustering over large data sets. For the given data, it builds a tree called a CF Tree, i.e. a Clustering Feature Tree.

The advantage of the CF Tree is that the data nodes, called CF (Clustering Feature) nodes, hold the information necessary for clustering, which avoids having to hold the entire input data in memory.

Scikit-learn has the sklearn.cluster.Birch module to perform BIRCH clustering.

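A minimal sketch (toy data invented for illustration) showing the parameters that shape the CF tree:

from sklearn.cluster import Birch
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# threshold: max radius of a subcluster; branching_factor: max children per CF node
brc = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
brc.fit(X)             # builds the CF tree incrementally
print(brc.predict(X))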

Comparing Clustering Algorithms

The following table gives a comparison (based on parameters, scalability and metric) of the clustering algorithms in scikit-learn.

Sr.No | Algorithm Name          | Parameters                            | Scalability                                  | Metric Used
1     | K-Means                 | No. of clusters                       | Very large n_samples                         | Distance between points
2     | Affinity Propagation    | Damping                               | Not scalable with n_samples                  | Graph distance
3     | Mean-Shift              | Bandwidth                             | Not scalable with n_samples                  | Distance between points
4     | Spectral Clustering     | No. of clusters                       | Medium with n_samples, small with n_clusters | Graph distance
5     | Hierarchical Clustering | Distance threshold or no. of clusters | Large n_samples, large n_clusters            | Distance between points
6     | DBSCAN                  | Size of neighborhood                  | Very large n_samples, medium n_clusters      | Nearest-point distance
7     | OPTICS                  | Minimum cluster membership            | Very large n_samples, large n_clusters       | Distance between points
8     | BIRCH                   | Threshold, branching factor           | Large n_samples, large n_clusters            | Euclidean distance between points

K-Means Clustering on the Scikit-learn Digits Dataset

In this example, we will apply K-means clustering to the digits dataset. The algorithm will identify similar digits without using the original label information. The implementation is done in a Jupyter notebook.


%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

Output


(1797, 64)

This output shows that the digits dataset has 1797 samples with 64 features.

Example

Now, perform the K-Means clustering as follows −


kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output


(10, 64)

This output shows that K-means clustering created 10 cluster centers, each with 64 features.

Example


fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Output

The output below shows images of the cluster centers learned by K-Means clustering.

[Image: cluster centers learned by K-Means]

Next, the Python script below will match the learned cluster labels (from K-Means) with the true labels found in them −


from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   mask = (clusters == i)
   labels[mask] = mode(digits.target[mask])[0]

We can also check the accuracy with the help of the command below.


from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

Output


0.7935447968836951

Complete Implementation Example


%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   mask = (clusters == i)
   labels[mask] = mode(digits.target[mask])[0]
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

Translated from: https://www.tutorialspoint.com/scikit_learn/scikit_learn_clustering_methods.htm
