Python [Minimal] Clustering Algorithms (KMeans + DBSCAN + MeanShift)

Link: https://blog.csdn.net/Yellow_python/article/details/81461056?utm_source=copy

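1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
1.2 DBSCAN: density-based
1.3 Mean Shift (3-D visualization)
2. Clustering evaluation: Silhouette Coefficient
2.1 KMeans clustering evaluation
2.2 DBSCAN clustering evaluation
2.3 MeanShift clustering evaluation
4. Appendix
4.1 Glossary (English to Chinese)
4.2 Datasets
4.2.1 Dataset 1
4.2.2 Dataset 2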
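1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
The time complexity of K-Means is O(nkt), which makes it suitable for mining large datasets:
n: the number of objects in the dataset
t: the number of iterations
k: the number of clusters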
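To see where the O(nkt) cost comes from, here is a minimal sketch of the plain Lloyd iteration in NumPy. It is only an illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization and the fixed iteration count `t` are assumptions made for this example. Each of the t rounds builds an n x k distance matrix, which is the n * k * t work in the formula.

```python
import numpy as np

def kmeans_sketch(X, k, t, seed=0):
    """Plain Lloyd iteration (hypothetical helper): t rounds over n points and k centers."""
    rng = np.random.default_rng(seed)
    # start from k distinct samples picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(t):
        # n x k matrix of Euclidean distances: the O(n * k) step, repeated t times
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign every sample to its nearest center
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster happens to be empty
                centers[j] = members.mean(axis=0)
    return labels, centers

X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
labels, centers = kmeans_sketch(X, k=2, t=10)
print(labels)
print(centers)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialization and a convergence check on top of this loop.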

Create the data

import numpy as np
X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])

Clustering

from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)  # create the KMeans object and set the number of clusters
km.fit(X)  # fit the model to the data
labels = km.labels_  # clustering result (one cluster label per sample)
print(labels)
centers = km.cluster_centers_  # cluster centers
print(centers)

Visualization

import matplotlib.pyplot as mp
for x, l in zip(X, labels):  # samples, colored by cluster label
    if l == 0:
        mp.scatter(x[0], x[1], c='r')
    else:
        mp.scatter(x[0], x[1], c='g')
for i in range(len(centers)):  # cluster centers
    if i == 0:
        mp.scatter(centers[i][0], centers[i][1], c='r', marker='x', s=99)
    else:
        mp.scatter(centers[i][0], centers[i][1], c='g', marker='x', s=99)
mp.show()

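1.2 DBSCAN: density-based
Density-Based Spatial Clustering of Applications with Noise
Advantages:
1. The number of clusters does not need to be known in advance
2. It can discover clusters of arbitrary shape
3. It can identify noise points
4. It is insensitive to the order of the samples, although a border sample lying between clusters may be assigned to whichever cluster is discovered first
Disadvantages:
1. It does not handle high-dimensional data well
2. It does not handle datasets with varying density well
3. Clustering quality is poor when the sample density is uneven and the distances between clusters differ greatly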
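Two of the advantages listed above (no preset number of clusters, and noise detection) can be checked on a tiny hand-made dataset. This is a small sketch with made-up points and parameter values chosen for the illustration: two dense 2 x 2 squares and one isolated point. scikit-learn marks noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data for this sketch: two dense groups and one isolated point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],   # dense group A
              [8, 8], [8, 9], [9, 8], [9, 9],   # dense group B
              [4, 20]])                         # isolated point

# eps=1.5 covers the diagonal of each square (about 1.414)
labels = DBSCAN(eps=1.5, min_samples=3).fit(X).labels_

# the number of clusters is discovered, not passed in; -1 is not a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels)      # [ 0  0  0  0  1  1  1  1 -1]
print(n_clusters)  # 2
```

The isolated point has no neighbour within `eps`, so it is neither a core point nor reachable from one, and it ends up labeled -1.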

Create the data

from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=[[1, 1], [9, 9], [7, 3]])

DBSCAN: density-based clustering

from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=1, min_samples=3).fit(X).labels_
print(labels)

Visualization

import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    mp.scatter(x[0], x[1], c=colors[l])  # noise (label -1) maps to colors[-1], i.e. 'black'
mp.show()

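1.3 Mean Shift (3-D visualization)
Mean shift finds the maxima (modes) of the kernel density estimate and uses them as cluster centroids, then assigns each sample to its nearest centroid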
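The two steps in that description (climb to the density modes, then assign by nearest centroid) can be sketched in a few lines of NumPy. This is a simplified flat-kernel version written for illustration, not what `sklearn.cluster.MeanShift` does internally: the helper name `mean_shift_sketch`, the fixed iteration count and the rule for merging nearby modes are all assumptions of this example.

```python
import numpy as np

def mean_shift_sketch(X, bandwidth, n_iter=50):
    """Flat-kernel mean shift (hypothetical helper, illustration only)."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            # shift the point to the mean of all samples within `bandwidth` of it
            near = X[np.linalg.norm(X - modes[i], axis=1) <= bandwidth]
            modes[i] = near.mean(axis=0)
    # modes that ended up within one bandwidth of each other count as one centroid
    centers = []
    for m in modes:
        if all(np.linalg.norm(m - c) > bandwidth for c in centers):
            centers.append(m)
    centers = np.array(centers)
    # nearest-centroid rule: each sample gets the label of its closest centroid
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return labels, centers

# Made-up data: two 2 x 2 squares far apart
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [8, 8], [8, 9], [9, 8], [9, 9]])
labels, centers = mean_shift_sketch(X, bandwidth=2.0)
print(labels)   # [0 0 0 0 1 1 1 1]
print(centers)  # [[0.5 0.5] [8.5 8.5]]
```

Every point of a square sees the same four neighbours, so all of them move to the square's center after one step and stay there.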

Read the data from the web

import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # Assumed pattern: each dataset of Appendix 4.2 is expected inside a <pre><code> block of the page
    data = re.findall(r'<pre><code>([\s\S]+?)</code></pre>', r.text)[1].strip()  # [1]: second block, dataset 2 (3-D)
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()

Mean shift

from sklearn.cluster import MeanShift
labels = MeanShift().fit(X).labels_

Visualization

import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3-D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # 3-D axes attached to the figure
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=150, alpha=0.3)
mp.show()

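2. Clustering evaluation: Silhouette Coefficient

$a(i)$: the mean distance from sample $i$ to the other samples in its own cluster
$b(i)$: the inter-cluster dissimilarity of sample $i$, that is, the smallest mean distance from $i$ to the samples of any other cluster
$s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$: the silhouette coefficient of sample $i$

$s(i)$ close to 1: sample $i$ is clustered well
$s(i)$ close to -1: sample $i$ would fit better in another cluster
$s(i)$ close to 0: sample $i$ lies on the boundary between two clusters

from sklearn import metrics
score = metrics.silhouette_score(X, labels)  # mean of s(i) over all samples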
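To make the definitions of a(i) and b(i) concrete, the score can be computed by hand and compared with `metrics.silhouette_score`. The sketch below uses the standard formula s(i) = (b(i) - a(i)) / max(a(i), b(i)) with Euclidean distance; the helper name `silhouette_by_hand` and the six sample points are made up for this example.

```python
import numpy as np
from sklearn import metrics

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient, written out step by step (illustration only)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:  # a cluster of one sample: s(i) is taken to be 0
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members (dist[i, i] = 0, so divide by size - 1)
        a = dist[i, same].sum() / (same.sum() - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
by_hand = silhouette_by_hand(X, labels)
print(by_hand, metrics.silhouette_score(X, labels))  # the two values should agree
```

Both clusters are tight and far apart here, so a(i) is small compared with b(i) and the score lands close to 1.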

2.1 KMeans clustering evaluation

Read the data from the web

import requests, re, numpy as np
import matplotlib.pyplot as mp
from sklearn.cluster import KMeans
from sklearn import metrics
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # Assumed pattern: each dataset of Appendix 4.2 is expected inside a <pre><code> block of the page
    data = re.findall(r'<pre><code>([\s\S]+?)</code></pre>', r.text)[0].strip()  # [0]: first block, dataset 1 (2-D)
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()
m, n = 2, 6  # numbers of clusters to try: m, ..., n - 1
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
for i in range(m, n):
    # KMeans clustering
    labels = KMeans(n_clusters=i).fit(X).labels_
    # visualization: one subplot per value of n_clusters
    mp.subplot(1, n - m, i - m + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # clustering evaluation: Silhouette Coefficient
    score = metrics.silhouette_score(X, labels)
    print('n_clusters = %d, silhouette score:' % i, score)
mp.tight_layout()
mp.show()

Printed output
n_clusters = 2, silhouette score: 0.6094103841500139
n_clusters = 3, silhouette score: 0.4249285827871494
n_clusters = 4, silhouette score: 0.3447569550742587
n_clusters = 5, silhouette score: 0.34076078057327047
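In the output above the score peaks at n_clusters = 2, so that is the value the silhouette coefficient favors for dataset 1. Picking the maximum can be automated, as in the rough sketch below. It runs on synthetic data from `make_blobs` rather than on dataset 1, and the blob centers, `n_init` and `random_state` values are assumptions chosen so that the example is reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

# Synthetic data (an assumption for this sketch): three well-separated 2-D blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).labels_
    scores[k] = metrics.silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the k with the highest silhouette score
print(scores)
print('best n_clusters:', best_k)
```

The silhouette score is only one heuristic; it tends to favor compact, well-separated clusters, so it is worth looking at the plots as well.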
2.2 DBSCAN clustering evaluation

Create the data

import numpy as np
import matplotlib.pyplot as mp
from sklearn.cluster import DBSCAN
from sklearn import metrics
X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
radii = [1.414, 1.415, 2]  # eps values to compare
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
for i in range(3):
    # DBSCAN: density-based clustering
    labels = DBSCAN(eps=radii[i], min_samples=2).fit(X).labels_
    # visualization: one subplot per eps; noise (label -1) is drawn in colors[-1], 'black'
    mp.subplot(1, 3, i + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # clustering evaluation: Silhouette Coefficient
    score = metrics.silhouette_score(X, labels)
    print('eps = %.3f, silhouette score:' % radii[i], score)
mp.tight_layout()
mp.show()

Printed output
eps = 1.414, silhouette score: 0.36739772676132704
eps = 1.415, silhouette score: 0.6018738849706604
eps = 2.000, silhouette score: 0.6431136276704154
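The jump between eps = 1.414 and eps = 1.415 is explained by sqrt(2) ≈ 1.41421: all samples sit on an integer grid, so diagonal neighbours are exactly sqrt(2) apart, just outside a radius of 1.414 and just inside 1.415. The short sketch below counts clusters and noise points for each radius to make this visible; the `summary` dictionary is only a convenience introduced for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])

summary = {}
for eps in (1.414, 1.415, 2):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_noise = int((labels == -1).sum())                    # samples labeled -1
    n_clusters = len(set(labels)) - (1 if n_noise else 0)  # -1 is not a cluster
    summary[eps] = (n_clusters, n_noise)
    print('eps = %.3f: %d clusters, %d noise points' % (eps, n_clusters, n_noise))
```

At eps = 1.414 four samples ([1, 4], [5, 3], [4, 2], [5, 1]) have no neighbour at distance 1 and are marked as noise, which drags the score down. At eps = 1.415 the diagonal links appear and every sample joins one of three clusters. At eps = 2 the two lower clusters merge, leaving two clusters.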
2.3 MeanShift clustering evaluation

# Create the data ----------------------------------------------------------
from sklearn.datasets import make_blobs
centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

# Mean shift ---------------------------------------------------------------
from sklearn.cluster import MeanShift, estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=50)  # bandwidth (quantile, number of samples used)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
# cluster labels
labels = ms.labels_
# cluster centers
centers = ms.cluster_centers_
print(centers)

# Clustering evaluation ----------------------------------------------------
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
print('Silhouette score: %.2f' % score)

# Visualization ------------------------------------------------------------
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3-D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # 3-D axes attached to the figure
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
# clustered samples
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=120, alpha=0.2)
# cluster centers
for i in range(len(centers)):
    ax.scatter(centers[i][0], centers[i][1], centers[i][2], c=colors[i], s=200, marker='x')
mp.show()
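The `quantile` passed to `estimate_bandwidth` largely decides how many clusters MeanShift finds: a larger quantile gives a larger bandwidth, and a larger bandwidth smooths neighbouring density peaks into one. The sketch below tries three quantiles on the same `make_blobs` data as above. The quantile values are arbitrary picks for this illustration, and `estimate_bandwidth` is called without `n_samples`, so it uses all samples.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=[[0, 0, 0], [6, 4, 1], [9, 9, 9]],
                  cluster_std=2, random_state=0)

results = []
for quantile in (0.1, 0.2, 0.5):
    bw = estimate_bandwidth(X, quantile=quantile)  # larger quantile -> larger bandwidth
    ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
    n_found = len(ms.cluster_centers_)             # number of clusters found
    results.append((quantile, bw, n_found))
    print('quantile = %.1f  bandwidth = %.2f  clusters = %d' % (quantile, bw, n_found))
```

Comparing the silhouette score across a few bandwidths, as done for `eps` in section 2.2, is one way to choose among them.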

4. Appendix
4.1 Glossary (English to Chinese)
cluster: n. 簇; v. 群聚
radius: 半径 (plural: radii)
cyan: 蓝绿色
density: 密度
spatial: 空间的
distance: 距离
silhouette: 轮廓
coefficient: 系数; 合作的
shift: n. 移动; vi. 转换; vt. 转移
bandwidth: 带宽
quantile: n. 分位数; 分位点
4.2 Datasets
4.2.1 Dataset 1
1.093,1.227
-1.386,-2.334
1.040,1.181
-1.663,-0.969
-1.273,-0.990
1.920,1.882
1.106,0.759
-1.382,-0.594
1.707,0.892
-0.024,2.170
-0.287,-0.810
0.456,1.031
-0.597,-0.756
-1.059,-1.398
0.298,-0.198
1.086,1.873
-1.041,0.028
-0.632,-0.447
-0.654,-1.125
-1.417,-1.090
-1.462,-0.676
0.694,0.292
1.125,1.586
1.241,0.589
1.214,1.424
1.519,0.555
0.758,1.733
2.121,0.414
0.301,1.540
0.130,-1.809
1.738,1.721
2.362,0.127
-1.704,0.166
-0.625,-1.961
-0.537,-0.506
-1.600,-1.927
-1.776,-0.840
0.512,-0.036
-1.621,-0.591
0.430,-0.433
-0.532,1.392
-0.324,-1.648
-1.790,-1.277
-0.447,-0.809
-0.821,-0.204
1.587,2.345
1.893,2.138
1.570,0.909
1.006,2.072
-0.762,-1.656
0.791,1.094
1.684,0.259
0.768,0.819
2.058,1.240
1.188,0.488
-1.024,-1.701
1.279,0.078
-1.482,-1.414
-1.735,-0.493
-0.486,-1.391
-1.261,0.110
0.121,-0.456
-1.248,-1.448
-1.762,-0.418
0.022,1.278
0.619,0.782
0.983,1.257
-1.447,-1.496
-1.161,-0.519
0.371,0.148
0.463,1.232
0.154,-0.112
0.597,0.784
-0.686,-1.103
0.938,1.246
0.032,0.872
0.248,1.466
-1.517,0.146
0.467,-0.188
-0.774,-1.660
-1.405,-0.981
-0.900,-0.619
-0.430,-0.947
1.457,1.073
-1.212,-1.825
-1.688,-1.263
0.694,0.737
-1.299,0.158
1.266,1.200
-1.444,-0.074
1.896,0.877
0.813,1.034
0.478,0.653
-1.895,-0.736
1.027,0.888
0.358,1.633
-1.548,-0.330
1.076,1.241
-0.432,-1.093
1.437,1.077
4.2.2 Dataset 2
1.0050,1.8073,2.7386
1.9259,2.4306,0.2451
1.6545,0.4596,1.1321
-0.9050,2.0691,1.0404
2.7505,-0.4984,0.6483
-0.0468,2.2348,-0.5511
0.4628,0.6909,1.4954
-0.6743,-0.5436,-0.6257
1.9558,3.7735,2.3948
1.9286,3.3877,-0.5870
1.2800,1.3642,1.4622
0.3422,2.5313,2.7598
2.1371,0.4341,-0.0924
-0.3058,2.8154,-0.8745
-0.8878,0.9490,2.2726
0.5742,1.4715,0.0139
2.2790,1.9328,1.6814
3.1572,2.8172,-0.5836
3.0747,-0.7420,2.2908
1.9663,1.6197,2.9873
3.9519,-0.0063,-0.9588
1.7085,1.0218,-0.8038
1.0407,0.4408,1.0324
1.3979,-0.4051,0.8771
2.3624,2.4067,3.8003
1.5867,1.1362,0.9736
2.2561,-0.8329,3.8295
0.4978,2.5666,3.7187
2.3976,0.5715,1.1566
0.6931,2.8798,1.4282
0.9098,0.4515,2.6086
-0.9297,-0.6534,1.8047
2.7165,2.5808,3.1228
3.8165,2.4418,1.5569
2.9530,0.2010,2.6209
1.7720,3.0011,3.9736
3.1591,0.6074,1.1515
0.7855,2.1894,1.6889
1.5885,0.4375,2.8251
1.9275,1.2649,1.3579
1.6962,4.5295,4.3640
1.4627,3.3351,2.4612
2.2010,1.2483,4.5353
0.6686,3.0405,3.1677
2.9674,1.9180,1.7931
-0.6761,4.8257,1.4628
0.9636,2.0809,3.3889
0.0456,2.9660,1.7052
2.6762,9.3597,9.7102
2.2002,9.2157,6.9957
1.3057,8.5777,9.4119
0.9371,10.8166,9.4066
1.0637,7.9897,7.1191
0.7614,10.0867,7.1291
-0.1443,7.4889,8.3340
0.9175,8.2982,6.8904
2.9528,9.4759,7.1891
3.8934,10.4592,5.9079
3.4689,8.6936,8.2476
0.4932,10.5898,7.8731
3.7129,8.9430,5.2596
1.3458,10.4512,6.7899
0.9601,8.7501,7.9669
1.0073,8.2057,5.9963
4.6192,9.2501,10.2311
4.2770,9.6612,8.8755
3.0643,7.9322,10.9946
1.4359,8.8022,9.7678
4.9768,7.8666,7.7231
2.2322,9.6188,7.1282
2.3977,6.1008,9.5358
2.8024,7.9230,7.7847
1.7787,9.6656,8.5730
2.5588,9.2667,5.3680
2.2381,7.4282,8.5095
-0.3990,9.5905,7.2105
1.6016,6.4013,5.2564
0.2722,8.6830,5.5263
-0.0809,7.4194,8.4713
0.8066,6.9738,6.9004
9.1410,8.5563,6.4202
9.9706,8.4647,4.7648
9.2004,5.5402,5.7502
7.7044,7.2929,5.9979
10.0255,6.8514,4.1539
7.9756,8.3657,3.8871
7.5363,5.7118,5.9797
8.0823,5.8570,3.7665
7.1929,10.9547,5.5867
7.1595,10.1112,3.8669
8.6083,8.6523,5.0490
6.2183,9.3562,4.0778
8.5655,7.1762,2.7066
6.6339,10.6751,2.1566
6.3456,8.5746,4.9021
6.7410,7.2915,3.9987
7.1588,9.6255,7.4575
8.4862,9.7134,5.3465
7.8276,7.9110,6.6756
5.9208,8.8276,7.3346
8.7747,7.0975,5.5719
6.1980,8.1578,4.7254
5.8346,7.6449,6.6972
5.8241,6.9158,4.8532
8.8481,8.2460,4.6843
9.9423,8.5196,2.1585
8.1127,6.3995,4.4851
7.7731,9.5787,5.9392
8.5547,6.3210,3.8450
6.6214,9.6489,2.2730
6.8169,7.2050,5.9877
6.8193,7.5884,3.0205
9.6705,9.1018,4.9061
9.7746,8.8037,2.3269
9.8415,7.0949,5.7011
8.2221,8.3008,5.6954
10.5455,7.2361,2.2893
7.9733,8.6451,2.1530
7.8287,6.6419,4.5996
8.0243,6.2347,3.9297
9.2898,10.9877,7.0184
9.2897,10.3033,4.8531
9.2510,7.3422,7.1935
6.6630,10.3948,6.8442
9.7234,7.7998,5.5406
6.7309,10.5378,5.5303
6.9013,7.0826,6.2990
7.4853,7.9020,4.5881
10.9317,10.9432,5.1278
10.3645,10.3791,4.4309
9.4693,8.6197,5.1706
8.4627,10.2630,5.3924
10.1341,7.2184,3.0029
8.4236,10.5672,4.8410
8.2941,7.5425,5.4322
7.2079,7.7771,3.5309
10.9911,9.9904,4.8578
10.0575,10.4442,2.7242
10.9198,8.1909,4.7242
7.8456,10.1644,4.4754
9.3154,7.7046,1.6364
8.5910,10.6952,1.2513
8.1458,7.0717,4.8291
8.3870,7.6361,1.0139
