无监督学习-聚类
2.DBSCAN方法及应用
DBSCAN密度聚类
DBSCAN算法是一种基于密度的聚类算法:聚类的时候不需要预先指定簇的个数
最终的簇的个数不定
DBSCAN算法将数据点分为三类:核心点:在半径Eps内含有超过MinPts数目的点
边界点:在半径Eps内点的数量小于MinPts,但是落在核心点的邻域内
噪音点:既不是核心点也不是边界点的点
DBSCAN算法流程:将所有点标记为核心点、边界点和噪声点;
删除噪声点;
为距离在Eps之内的所有核心点之间赋予一条边;
每组连通的核心点形成一个簇;
将每个边界点指派到一个与之关联的核心点的簇中(哪一个核心点的半径范围之内)
DBSCAN的应用实例
数据介绍:
现有大学校园网的日志数据,290条大学生的校园网使用情况数据,数据包括用户ID,设备的MAC地址,IP地址,开始上网时间,停止上网时间,上网时长,校园网套餐等。 利用已有数据,分析学生上网的模式。
实验目的:
通过DBSCAN聚类,分析学生上网时间和上网时长的模式。
技术路线:
sklearn.cluster.DBSCAN
1. 建立工程,导入sklearn相关包
import numpy as np
from sklearn.cluster import DBSCAN
DBSCAN主要参数:eps: 两个样本被看作邻居节点的最大距离
min_samples: 簇的样本数
metric:距离计算方式
例:sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
2. 读入数据并进行处理
3-1. 上网时间聚类,创建DBSCAN算法实例,并进行训练,获得标签:
4. 输出标签,查看结果:
Labels:
[ 0 -1 0 1 -1 1 0 1 2 -1 1 0 1 1 3 -1 -1 3 -1 1 1 -1 1 3
4 -1 1 1 2 0 2 2 -1 0 1 0 0 0 1 3 -1 0 1 1 0 0 2 -1
1 3 1 -1 3 -1 3 0 1 1 2 3 3 -1 -1 -1 0 1 2 1 -1 3 1 1
2 3 0 1 -1 2 0 0 3 2 0 1 -1 1 3 -1 4 2 -1 -1 0 -1 3 -1
0 2 1 -1 -1 2 1 1 2 0 2 1 1 3 3 0 1 2 0 1 0 -1 1 1
3 -1 2 1 3 1 1 1 2 -1 5 -1 1 3 -1 0 1 0 0 1 -1 -1 -1 2
2 0 1 1 3 0 0 0 1 4 4 -1 -1 -1 -1 4 -1 4 4 -1 4 -1 1 2
2 3 0 1 0 -1 1 0 0 1 -1 -1 0 2 1 0 2 -1 1 1 -1 -1 0 1
1 -1 3 1 1 -1 1 1 0 0 -1 0 -1 0 0 2 -1 1 -1 1 0 -1 2 1
3 1 1 -1 1 0 0 -1 0 0 3 2 0 0 5 -1 3 2 -1 5 4 4 4 -1
5 5 -1 4 0 4 4 4 5 4 4 5 5 0 5 4 -1 4 5 5 5 1 5 5
0 5 4 4 -1 4 4 5 4 0 5 4 -1 0 5 5 5 -1 4 5 5 5 5 4
4]
Noise raito: 22.15% # 噪声数据的比例
Estimated number of clusters: 6 # 簇的个数
Silhouette Coefficient: 0.710 # 聚类效果评价指标
Cluster 0 :
[22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
Cluster 1 :
[23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23]
Cluster 2 :
[20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
Cluster 3 :
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
Cluster 4 :
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Cluster 5 :
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
5.画直方图,分析实验结果:转换直方图分析
观察:上网时间大多聚集在22:00和23:00
6. 数据分布 vs 聚类:
3-2. 上网时长聚类,创建DBSCAN算法实例,并进行训练,获得标签:
4-2. 输出标签,查看结果:
本节完整代码:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
"""@author: antenna@license: (C) Copyright 2019, Antenna.@contact: lilyef2000@gmail.com@software:@file: dbscan-test.py@time: 2019/4/9 18:47@desc:"""
import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt
mac2id = dict()
onlinetimes = []
f = open('TestData.txt', encoding='utf-8')
for line in f:
mac = line.split(',')[2]
onlinetime = int(line.split(',')[6])
starttime = int(line.split(',')[4].split(' ')[1].split(':')[0])
if mac not in mac2id:
mac2id[mac] = len(onlinetimes)
onlinetimes.append((starttime, onlinetime))
else:
onlinetimes[mac2id[mac]] = [(starttime, onlinetime)]
real_X = np.array(onlinetimes).reshape((-1, 2))
X = real_X[:, 0:1]
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)
labels = db.labels_
print('Labels:')
print(labels)
raito = len(labels[labels[:] == -1]) / len(labels)
print('Noise raito:', format(raito, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:%d' % n_clusters_)
print("Silhouette Coefficient:%0.3f" % metrics.silhouette_score(X, labels))
for i in range(n_clusters_):
print('Cluster ', i, ':')
print(list(X[labels == i].flatten()))
plt.hist(X, 24)
plt.show()