Scikit-learn: Clustering

http://blog.csdn.net/pipisorry/article/details/53185758

Comparing the results of different clustering algorithms

sklearn clustering comparison example

[Figure: plot_cluster_comparison — A comparison of the clustering algorithms in scikit-learn]

Overview of the clustering methods

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
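The scalability column above points to mini-batching for very large n_samples; a minimal sketch with MiniBatchKMeans, where the synthetic data, n_clusters and batch_size are illustrative assumptions:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(10000, 10)  # synthetic data (assumed)
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
mbk.fit(X)  # updates centroids from mini-batches instead of the full dataset
print(mbk.cluster_centers_.shape)  # (8, 10)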

k-means principles

sklearn KMeans prototype: [sklearn.cluster.KMeans]

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')

Note: the default init method is in fact k-means++.

Attributes

cluster_centers_ : ndarray of shape (n_clusters, n_features)
Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

labels_ : ndarray of shape (n_samples,)
Labels of each point.

inertia_ : float
Sum of squared distances of samples to their closest cluster center.

n_iter_ : int
Number of iterations run.
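A quick sketch of reading these attributes after fitting (the toy data below is an assumption):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])  # toy data (assumed)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # (2, 2) array of centroid coordinates
print(km.labels_)           # cluster index assigned to each of the 6 samples
print(km.inertia_)          # within-cluster sum of squared distances
print(km.n_iter_)           # iterations run before stopping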

The k-means objective in sklearn

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)

[2.3. Clustering — scikit-learn 0.24.2 documentation]

[Clustering algorithms: kmeans]
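To connect the formula with the API, inertia_ can be checked against the within-cluster sum of squares computed directly; a minimal sketch reusing X and km from the attribute example above:

wcss = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()  # squared distance to assigned center, summed
print(km.inertia_, wcss)  # the two values should agree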


sklearn clustering code examples

K-means clustering

k-means example 1

from sklearn.cluster import KMeans

data = [[1, 2, 3], [4, 5, 6], [1, 3, 4], [5, 6, 7]]  # 2-D array of shape (n_samples, n_features)
num_clusters = 2
km_cluster = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1, tol=1e-6)
# km_cluster = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km_cluster.fit(data)
# print("labels:\n", km_cluster.labels_)
print("cluster_centers:\n", km_cluster.cluster_centers_)  # numpy.ndarray

cluster_centers:
 [[4.5 5.5 6.5]
 [1.  2.5 3.5]]
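A fitted model can also assign unseen samples to the nearest learned centroid; the new sample values below are assumptions:

new_points = [[1, 2, 4], [6, 6, 6]]    # unseen samples (assumed values)
print(km_cluster.predict(new_points))  # nearest-centroid index for each sample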

k-means example 2

[MachineLearning/Kmeans.py at master · pipilove/MachineLearning · GitHub]

DBSCAN clustering

import os
import pickle
import pwd

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN


def Dist(x, y):
    # Geodesic distance in meters. geopy expects (latitude, longitude) while
    # the rows passed in below are (longitude, latitude), hence the [::-1].
    # geopy's vincenty() was removed in geopy 2.0; geodesic() replaces it.
    from geopy import distance
    return distance.geodesic(x[::-1], y[::-1]).meters


CWD = os.path.dirname(os.path.abspath(__file__))  # assumed working directory
DBSCAN_R = 500      # eps: neighborhood radius in meters (assumed value)
DBSCAN_MIN_S = 5    # min_samples: core-point threshold (assumed value)

df = pd.read_pickle(os.path.join(CWD, 'middlewares/df.pkl'))

ll = df[['longitude', 'latitude']].values
x, y = ll[:, 0], ll[:, 1]

print('starting dbscan...')
dbscaner = DBSCAN(eps=DBSCAN_R, min_samples=DBSCAN_MIN_S, metric=Dist, n_jobs=-1).fit(ll)
pickle.dump(dbscaner, open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'wb'))
print('dbscan dumping end...')

dbscaner = pickle.load(open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'rb'))
labels = dbscaner.labels_
# print(set(labels))
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(labels))))
for k, col in zip(set(labels), colors):
    marker = '.'
    if k == -1:  # -1 marks noise points in DBSCAN
        col = 'k'
        marker = 'x'
    inds_k = labels == k
    plt.scatter(x[inds_k], y[inds_k], marker=marker, color=col)
if pwd.getpwuid(os.geteuid()).pw_name == 'pi':
    plt.savefig('./1.png')
elif pwd.getpwuid(os.geteuid()).pw_name == 'pipi':
    plt.show()
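A common follow-up is counting the clusters and noise points found (DBSCAN marks noise with the label -1):

n_noise = int((labels == -1).sum())  # samples DBSCAN left unassigned
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f'{n_clusters} clusters, {n_noise} noise points')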

[DBSCAN]


Evaluating clustering models

Clustering tendency (before clustering)

The Hopkins statistic assesses whether a dataset contains meaningful, non-random structure worth clustering. If a dataset is generated from random, uniformly distributed points, a clustering algorithm will still produce a result, but that result is meaningless; clustering presupposes that the data are non-uniformly distributed. The statistic lies in [0, 1]: values in [0.01, 0.3] indicate regularly spaced data, a value around 0.5 indicates uniformly distributed data, and values in [0.7, 0.99] indicate a strong clustering tendency. A sketch of computing it follows the link below.

[A Python implementation of the Hopkins Statistic]
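sklearn does not ship a Hopkins statistic, so here is a minimal sketch; the function name, the sampling fraction, and the bounding-box sampling scheme are all assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_ratio=0.1, random_state=0):
    # Compare nearest-neighbor distances of m sampled real points (w)
    # against m uniform points drawn in the data's bounding box (u).
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    m = max(1, int(sample_ratio * n))
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]  # skip the zero self-distance

    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    # ~0.5 for uniform data; approaches 1 for strongly clustered data
    return u.sum() / (u.sum() + w.sum())

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # clearly clustered toy data
print(hopkins_statistic(X))  # expected to fall in the strong-tendency range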

Unsupervised evaluation methods (after clustering)

Silhouette Coefficient

A higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

  • a: the mean distance between a sample and all other points in the same class.
  • b: the mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s = \frac{b - a}{\max(a, b)}

Example:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)  # example data (assumed)
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.silhouette_score(X, labels, metric='euclidean'))

Calinski-Harabasz Index

Also known as the Variance Ratio Criterion, a higher Calinski-Harabasz score relates to a model with better defined clusters.

For k clusters, the Calinski-Harabasz score is given as the ratio of the between-clusters dispersion mean to the within-cluster dispersion:

s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}

where B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T

B_k = \sum_q n_q (c_q - c) (c_q - c)^T

with N the total number of points in the data, C_q the set of points in cluster q, c_q the center of cluster q, c the center of the data, and n_q the number of points in cluster q.

Example:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)  # example data (assumed)
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))

Davies-Bouldin Index
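A lower Davies-Bouldin index relates to a model with better separation between clusters, and zero is the lowest possible score. A minimal sketch, reusing X, labels and metrics from the example above:

print(metrics.davies_bouldin_score(X, labels))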

Contingency Matrix
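The contingency matrix reports the intersection cardinality of every (true class, predicted cluster) pair, so unlike the metrics above it requires ground-truth labels. A minimal sketch with assumed toy labels:

from sklearn.metrics.cluster import contingency_matrix

labels_true = ['a', 'a', 'a', 'b', 'b', 'b']  # ground truth (assumed)
labels_pred = [0, 0, 1, 1, 2, 2]              # cluster assignment (assumed)
print(contingency_matrix(labels_true, labels_pred))
# row i, column j: number of samples in true class i assigned to cluster j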

[Clustering performance evaluation]

[Clustering · sklearn Chinese documentation]

[How to evaluate the quality of clustering results?]

from: Scikit-learn: Clustering

ref: 
