Python [Minimal] Clustering Algorithms (KMeans + DBSCAN + MeanShift)

WitsMakeMen


Link: https://blog.csdn.net/Yellow_python/article/details/81461056?utm_source=copy

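1. Minimal clustering code
1.1. K-Means: based on Euclidean distance
1.2. DBSCAN: density-based
1.3. Mean Shift (3D visualization)
2. Clustering evaluation: Silhouette Coefficient
2.1. Evaluating KMeans clustering
2.2. Evaluating DBSCAN clustering
2.3. Evaluating MeanShift clustering
4. Appendix
4.1. Translations
4.2. Datasets
4.2.1. Dataset 1
4.2.2. Dataset 2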
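1. Minimal clustering code
1.1. K-Means: based on Euclidean distance
K-Means runs in O(nkt) time, which makes it suitable for mining large datasets:
n: number of objects in the dataset
t: number of iterations
k: number of clusters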
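
To make the O(nkt) cost concrete, here is a minimal NumPy sketch of the classic Lloyd iteration that K-Means is usually based on. It is an illustration only, not scikit-learn's implementation: the function name `kmeans_lloyd`, the random initialization, and the `max_iter` and `seed` parameters are assumptions made for this example. Each pass builds an n x k distance matrix, and the loop runs t times.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; returns (labels, centers, iterations used)."""
    rng = np.random.default_rng(seed)
    # Simple initialization (an assumption of this sketch): k distinct random samples.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for t in range(1, max_iter + 1):
        # Assignment step: n x k distance matrix, the O(nk) part of every iteration.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers, t

X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
labels, centers, t = kmeans_lloyd(X, k=2)
print(labels)
print(centers)
print('iterations:', t)
```

The result depends on the initial centers, which is why library implementations typically restart from several initializations and keep the best run.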

# Create the data
import numpy as np
X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])

# Clustering
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)  # create the KMeans object and set the number of clusters
km.fit(X)  # fit the model to the data
labels = km.labels_  # clustering result (one cluster label per sample)
print(labels)
centers = km.cluster_centers_  # cluster centers
print(centers)

# Visualization
import matplotlib.pyplot as mp
for x, l in zip(X, labels):  # samples, colored by cluster label
    if l == 0:
        mp.scatter(x[0], x[1], c='r')
    else:
        mp.scatter(x[0], x[1], c='g')
for i in range(len(centers)):  # cluster centers
    if i == 0:
        mp.scatter(centers[i][0], centers[i][1], c='r', marker='x', s=99)
    else:
        mp.scatter(centers[i][0], centers[i][1], c='g', marker='x', s=99)
mp.show()

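1.2. DBSCAN: density-based
Density-Based Spatial Clustering of Applications with Noise
Advantages:
1. The number of clusters does not need to be known in advance
2. It can find clusters of arbitrary shape
3. It can identify noise points
4. It is largely insensitive to the order of the samples. A border sample lying between two clusters, however, may end up in whichever cluster is discovered first
Disadvantages:
1. It does not work well on high-dimensional data
2. It does not handle datasets whose density varies
3. Clustering quality is poor when the sample density is uneven and the spacing between clusters differs greatly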
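
Two of the advantages listed above, arbitrary cluster shapes and noise detection, can be seen on a small synthetic example. The data here is made up for illustration (two concentric rings plus one far-away point), and the `eps` and `min_samples` values are chosen to suit that data rather than taken from the original article.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two concentric rings and a single distant outlier.
angles = np.linspace(0, 2 * np.pi, 60, endpoint=False)
inner = np.c_[np.cos(angles), np.sin(angles)]  # ring of radius 1
outer = 3 * inner                              # ring of radius 3
outlier = np.array([[10.0, 10.0]])
X = np.vstack([inner, outer, outlier])

labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# scikit-learn marks noise with the label -1; every other label is a cluster id.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print('clusters:', n_clusters, 'noise points:', n_noise)
```

Each ring comes out as its own cluster even though the inner ring lies entirely inside the outer one, a layout that a centroid-based method such as K-Means cannot separate.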

# Create the data
from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator path no longer exists in recent scikit-learn
X, _ = make_blobs(n_samples=100, centers=[[1, 1], [9, 9], [7, 3]])

# DBSCAN: density-based clustering
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=1, min_samples=3).fit(X).labels_
print(labels)

# Visualization
import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    mp.scatter(x[0], x[1], c=colors[l])  # noise has label -1, so it picks the last color, 'black'
mp.show()

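1.3. Mean Shift (3D visualization)
Find the local maxima of the kernel density estimate and use them as cluster centroids, then assign each sample to its nearest centroid.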
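
The idea can be sketched in a few lines of NumPy. This is a simplified illustration with a flat kernel, not scikit-learn's `MeanShift` (which adds seeding and merging of nearby modes); the helper `mean_shift_point`, the toy data, and the bandwidth value are assumptions made for this example.

```python
import numpy as np

def mean_shift_point(x, X, bandwidth, max_iter=300, tol=1e-6):
    """Move x to the mean of the samples within `bandwidth` until it stops moving."""
    for _ in range(max_iter):
        in_window = np.linalg.norm(X - x, axis=1) <= bandwidth
        new_x = X[in_window].mean(axis=0)
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x

# Hypothetical toy data: two tight groups far apart.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

# Shift every sample to a density peak, then keep the distinct peaks as centroids.
modes = np.array([mean_shift_point(x, X, bandwidth=2.0) for x in X])
centers = np.unique(modes.round(3), axis=0)

# Nearest-centroid rule: label each sample with the closest peak.
labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
print(centers)
print(labels)
```

The number of clusters is not passed in; it falls out of how many distinct peaks the shifted points reach, which in turn depends on the bandwidth.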

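# Read the data from the web
import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # The markup that delimited the capture group in this pattern was lost when the page was extracted;
    # only the group itself survives, so the delimiters must be restored before this runs as intended.
    data = re.findall(r'([\s\S]+?)', r.text)[1].strip()  # [1]: second match on the page
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()

# Mean shift
from sklearn.cluster import MeanShift
labels = MeanShift().fit(X).labels_

# Visualization
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=150, alpha=0.3)
mp.show()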

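2. Clustering evaluation: Silhouette Coefficient

$a(i)$: mean distance from sample $i$ to the other samples in its own cluster
$b(i)$: inter-cluster dissimilarity of sample $i$, i.e. the smallest mean distance from sample $i$ to the samples of any other cluster
$s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$: silhouette coefficient of sample $i$, with $-1 \le s(i) \le 1$

$s(i)$ close to 1: sample $i$ is well clustered
$s(i)$ close to -1: sample $i$ would fit better in another cluster
$s(i)$ close to 0: sample $i$ lies on the boundary between two clusters

from sklearn import metrics
score = metrics.silhouette_score(X, labels)  # mean of s(i) over all samples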
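
To see what the score is made of, the per-sample values can be computed by hand from a(i) and b(i) and checked against scikit-learn. This is a sketch for illustration: the helper name `silhouette_manual` and the hand-picked labels are assumptions, and Euclidean distance is used because it is the default metric of `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_manual(X, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b) with Euclidean distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a(i) excludes the sample itself
        if not same.any():
            continue  # a cluster with a single member gets s(i) = 0 by convention
        a = dist[i, same].mean()
        # b(i): smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]], dtype=float)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0])  # hand-picked split, for illustration

s = silhouette_manual(X, labels)
print(s.round(3))
print(s.mean(), silhouette_score(X, labels))
```

The mean of the hand-computed values should agree with `silhouette_score`, and the individual values with `silhouette_samples`.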

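2.1. Evaluating KMeans clustering

# Read the data from the web
import requests, re, numpy as np
import matplotlib.pyplot as mp
from sklearn import metrics
from sklearn.cluster import KMeans
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # The markup that delimited the capture group in this pattern was lost when the page was extracted;
    # only the group itself survives, so the delimiters must be restored before this runs as intended.
    data = re.findall(r'([\s\S]+?)', r.text)[0].strip()  # [0]: first match on the page
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()
m, n = 2, 6  # try n_clusters = m .. n-1
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
for i in range(m, n):
    # KMeans clustering
    labels = KMeans(n_clusters=i).fit(X).labels_
    # Visualization: one subplot per value of n_clusters
    mp.subplot(1, n - m, i - m + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Clustering evaluation: Silhouette Coefficient
    score = metrics.silhouette_score(X, labels)
    print('Silhouette score for n_clusters = %d:' % i, score)
mp.tight_layout()
mp.show()

Printed results
Silhouette score for n_clusters = 2: 0.6094103841500139
Silhouette score for n_clusters = 3: 0.4249285827871494
Silhouette score for n_clusters = 4: 0.3447569550742587
Silhouette score for n_clusters = 5: 0.34076078057327047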
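
A common way to use such a list of scores is to keep the number of clusters with the highest silhouette. The sketch below does that on synthetic data rather than on the article's dataset: the blob centers, `cluster_std`, and the fixed `random_state` and `n_init` values are assumptions chosen so the example is reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [20, 0]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).labels_
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
for k, v in scores.items():
    print('n_clusters = %d: %.3f' % (k, v))
print('best n_clusters:', best_k)
```

The highest score is a heuristic, not a guarantee: it favors compact, well-separated clusters and can mislead when the true groups are elongated or uneven in size.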
2.2. Evaluating DBSCAN clustering

# Create the data
import numpy as np
import matplotlib.pyplot as mp
from sklearn import metrics
from sklearn.cluster import DBSCAN
X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
radii = [1.414, 1.415, 2]  # neighborhood radii (eps) to compare
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
for i in range(3):
    # DBSCAN: density-based clustering
    labels = DBSCAN(eps=radii[i], min_samples=2).fit(X).labels_
    # Visualization: one subplot per eps (noise, label -1, is drawn in the last color, 'black')
    mp.subplot(1, 3, i + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Clustering evaluation: Silhouette Coefficient (noise points, label -1, are scored as one more cluster)
    score = metrics.silhouette_score(X, labels)
    print('Silhouette score for eps = %.3f:' % radii[i], score)
mp.tight_layout()
mp.show()

Printed results
Silhouette score for eps = 1.414: 0.36739772676132704
Silhouette score for eps = 1.415: 0.6018738849706604
Silhouette score for eps = 2.000: 0.6431136276704154
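
The jump between eps = 1.414 and eps = 1.415 comes from diagonal neighbors on the integer grid: they sit at distance sqrt(2) ≈ 1.41421, just above the first radius and just below the second. The following sketch, which reuses the 13 points above, counts how many sample pairs fall within each radius and how many samples DBSCAN labels as noise; the `results` dictionary is simply a name chosen for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])

# Distances between all distinct pairs of samples (upper triangle only).
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
pair_dist = dist[np.triu_indices(len(X), k=1)]

results = {}
for eps in (1.414, 1.415):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    results[eps] = {
        'pairs': int(np.sum(pair_dist <= eps)),  # pairs that count as neighbors
        'noise': int(np.sum(labels == -1)),      # samples left unclustered
    }
    print('eps = %.3f' % eps, results[eps])
```

With the smaller radius, the samples whose only close neighbors are diagonal ones end up as noise. Since `silhouette_score` has no special handling for the label -1, those scattered points are scored as one loose cluster, which pulls the score down.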
2.3. Evaluating MeanShift clustering

# Create the data ----------------------------------------------------------------------------------------
from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator path no longer exists in recent scikit-learn
centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

# Mean shift ---------------------------------------------------------------------------------------------
from sklearn.cluster import MeanShift, estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=50)  # bandwidth (quantile, number of samples used)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
# Cluster labels
labels = ms.labels_
# Cluster centers
centers = ms.cluster_centers_
print(centers)

# Clustering evaluation ----------------------------------------------------------------------------------
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
print('Silhouette score: %.2f' % score)

# Visualization ------------------------------------------------------------------------------------------
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
# Clustered samples
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=120, alpha=0.2)
# Cluster centers
for i in range(len(centers)):
    ax.scatter(centers[i][0], centers[i][1], centers[i][2], c=colors[i], s=200, marker='x')
mp.show()
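
The `quantile` argument of `estimate_bandwidth` is the main knob in the code above: a larger quantile gives a larger bandwidth, and a larger bandwidth tends to merge clusters. The sketch below compares a few values on the same synthetic blobs. The quantile values are picked for illustration, and, unlike the code above, the bandwidth is estimated from all samples rather than a subsample of 50.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

results = {}
for q in (0.1, 0.2, 0.5):
    bw = estimate_bandwidth(X, quantile=q, random_state=0)
    ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
    results[q] = (bw, len(ms.cluster_centers_))
    print('quantile = %.1f  bandwidth = %.2f  clusters = %d' % (q, bw, results[q][1]))
```

Comparing the cluster counts against the silhouette score for each bandwidth is one way to choose a value when the number of clusters is not known in advance.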