http://blog.csdn.net/pipisorry/article/details/53185758
Comparison of different clustering methods
Comparing different clustering algorithms in sklearn
Overview of different clustering methods
Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
---|---|---|---|---|
K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
Birch | branching factor, threshold, optional global clusterer. | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction. | Euclidean distance between points |
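The table is easiest to digest by running a few of the listed estimators side by side. A minimal sketch follows (the synthetic blobs dataset and all parameter values are illustrative assumptions, not from the original post):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# synthetic data: 3 Gaussian blobs, 500 points, 2 features
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

estimators = {
    'KMeans': KMeans(n_clusters=3, n_init=10, random_state=0),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3),
    'DBSCAN': DBSCAN(eps=0.8, min_samples=5),  # eps chosen by eye for these blobs
}
for name, est in estimators.items():
    labels = est.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)  # DBSCAN labels noise as -1
    print('%s: found %d clusters' % (name, n_found))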
K-means principle
sklearn KMeans prototype: [sklearn.cluster.KMeans]
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')
Note: the default init method is in fact k-means++.
Attributes
cluster_centers_ : ndarray of shape (n_clusters, n_features)
    Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
labels_ : ndarray of shape (n_samples,)
    Labels of each point.
inertia_ : float
    Sum of squared distances of samples to their closest cluster center.
n_iter_ : int
    Number of iterations run.
The k-means learning objective in sklearn
The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$
[2.3. Clustering — scikit-learn 0.24.2 documentation]
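The fitted estimator exposes this objective directly as inertia_. A small sketch (data and names are mine) verifying that inertia_ equals the within-cluster sum of squares recomputed by hand:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# squared distance of each sample to its assigned center
manual = sum(np.sum((X[i] - km.cluster_centers_[km.labels_[i]])**2) for i in range(len(X)))
assert np.isclose(manual, km.inertia_)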
sklearn clustering code examples
K-means clustering
K-means example 1
from sklearn.cluster import KMeans

data = [[1, 2, 3], [4, 5, 6], [1, 3, 4], [5, 6, 7]]  # 2-D array of shape n_samples * n_features
num_clusters = 2
km_cluster = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1, tol=1e-6)
# km_cluster = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km_cluster.fit(data)
# print("labels:\n", km_cluster.labels_)
print("cluster_centers:\n", km_cluster.cluster_centers_)  # numpy.ndarray
cluster_centers:
[[4.5 5.5 6.5]
[1. 2.5 3.5]]
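Once fitted, the same estimator can assign new samples to the nearest learned center; a small follow-up sketch (the new samples are hypothetical):

new_data = [[2, 3, 4], [6, 7, 8]]  # hypothetical unseen samples
print("predicted labels:", km_cluster.predict(new_data))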
K-means example 2
[MachineLearning/Kmeans.py at master · pipilove/MachineLearning · GitHub]
DBSCAN clustering
import os
import pickle
import pwd

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from geopy import distance

# Placeholder constants: the original script defines these elsewhere.
CWD = os.path.dirname(os.path.abspath(__file__))
DBSCAN_R = 500     # eps: neighborhood radius in meters (example value)
DBSCAN_MIN_S = 5   # min_samples (example value)

def Dist(x, y):
    # geodesic replaces distance.vincenty, which was removed in geopy 2.0.
    # geopy expects (latitude, longitude); rows of ll are (longitude, latitude), so swap.
    return distance.geodesic((x[1], x[0]), (y[1], y[0])).meters

df = pd.read_pickle(os.path.join(CWD, 'middlewares/df.pkl'))
ll = df[['longitude', 'latitude']].values
x, y = ll[:, 0], ll[:, 1]

print('starting dbscan...')
dbscaner = DBSCAN(eps=DBSCAN_R, min_samples=DBSCAN_MIN_S, metric=Dist, n_jobs=-1).fit(ll)
pickle.dump(dbscaner, open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'wb'))
print('dbscan dumping end...')
dbscaner = pickle.load(open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'rb'))
labels = dbscaner.labels_
# print(set(labels))
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(labels))))
for k, col in zip(set(labels), colors):
    marker = '.'
    if k == -1:
        # DBSCAN marks noise with label -1; draw noise as black crosses
        col = 'k'
        marker = 'x'
    inds_k = labels == k
    plt.scatter(x[inds_k], y[inds_k], marker=marker, color=col)
if pwd.getpwuid(os.geteuid()).pw_name == 'pi':
    plt.savefig('./1.png')  # headless device: save to file
elif pwd.getpwuid(os.geteuid()).pw_name == 'pipi':
    plt.show()
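A quick way to summarize the fit, not in the original script, is to count clusters and noise points directly from labels:

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
n_noise = int((labels == -1).sum())
print('%d clusters, %d noise points' % (n_clusters, n_noise))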
[DBSCAN]
Clustering model evaluation
Clustering tendency (before clustering)
The Hopkins statistic assesses whether a given dataset contains meaningful, non-random structure that can be clustered. If a dataset consists of uniformly random points, a clustering algorithm will still produce a result, but that result is meaningless; clustering presupposes that the data are non-uniformly distributed. The statistic lies in [0, 1]: values in [0.01, 0.3] indicate regularly spaced data, a value around 0.5 indicates uniformly distributed data, and values in [0.7, 0.99] indicate a strong clustering tendency.
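sklearn does not ship a Hopkins implementation; below is a minimal sketch under the convention used in the text (values near 1 mean strong clustering tendency). The function name, the ~10% sample size, and sampling uniform points from the bounding box are all assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)  # sample roughly 10% of the points
    nn = NearestNeighbors().fit(X)
    # u: distances from m uniform random points in the bounding box to their nearest data point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0].sum()
    # w: nearest-neighbor distances from m real points (column 1 skips the point itself)
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1].sum()
    return u / (u + w)  # ~0.5 for uniform data, -> 1 for strongly clustered data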
Unsupervised clustering evaluation (after clustering)
Silhouette Coefficient
A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
- a: the mean distance between a sample and all other points in the same class.
- b: the mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:

$$s = \frac{b - a}{\max(a, b)}$$
Example:

from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X = datasets.load_iris().data  # example feature matrix; any (n_samples, n_features) array works
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
metrics.silhouette_score(X, labels, metric='euclidean')
Calinski-Harabasz Index (CH Index)
Also known as the Variance Ratio Criterion; a higher Calinski-Harabasz score relates to a model with better-defined clusters.
For k clusters, the Calinski-Harabasz score s is given as the ratio of the between-clusters dispersion mean to the within-cluster dispersion:

$$s(k) = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{N - k}{k - 1}$$

where N is the number of samples, B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined by:

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T, \qquad B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T$$

with C_q the set of points in cluster q, c_q the center of cluster q, c the center of the data, and n_q the number of points in cluster q.
Example:

from sklearn import metrics
from sklearn.cluster import KMeans

# X as in the previous example
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
metrics.calinski_harabasz_score(X, labels)
Davies-Bouldin Index
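A one-line gloss for the stub above: a lower Davies-Bouldin index relates to a model with better separation between clusters, and zero is the lowest possible score. A minimal sketch, reusing X and labels from the examples above:

from sklearn import metrics
metrics.davies_bouldin_score(X, labels)  # lower is better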
Contingency Matrix
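Filling in the stub above: the contingency matrix reports the intersection cardinality for every (true cluster, predicted cluster) pair, which is useful when ground-truth labels are available. A minimal sketch with made-up toy labels:

from sklearn.metrics.cluster import contingency_matrix
labels_true = ['a', 'a', 'a', 'b', 'b', 'b']
labels_pred = [0, 0, 1, 1, 2, 2]
contingency_matrix(labels_true, labels_pred)
# array([[2, 1, 0],
#        [0, 1, 2]])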
[Clustering performance evaluation]
from: sklearn: clustering