Data Mining: K-means Clustering Exercise

qq_39409944

Category: Machine Learning

Article link: https://blog.csdn.net/qq_39409944/article/details/80308225

K-means clustering is an unsupervised learning method that partitions a dataset into k disjoint subsets, each of which is called a "cluster".

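Steps:

1. Choose k, the number of clusters the dataset should be divided into.

2. Randomly select k data points from the dataset as the initial centroids.

3. For every data point, compute its distance to each of the k centroids and assign the point to the cluster of the nearest centroid.

4. After this assignment, compute the mean of each cluster and use it as that cluster's new centroid.

5. If the new centroids differ from the previous ones by less than a chosen threshold (that is, the centroids have barely moved and the algorithm has converged), the clustering is considered final and the algorithm terminates.

6. Otherwise, repeat steps 3 to 5.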
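
The six steps above can be sketched directly in NumPy. This is only an illustrative sketch, not the scikit-learn implementation used in the rest of the post: the function names `kmeans` and `nearest_centroid`, the parameters `tol`, `max_iter`, and `seed`, and the tiny `sample` dataset are all assumptions made for the example.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, tol=1e-4, max_iter=100, seed=0):
    """Plain K-means following steps 1 to 6 (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        labels = nearest_centroid(points, centroids)
        # Step 4: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids have essentially stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
        # Step 6: otherwise loop back to step 3.
    return nearest_centroid(points, centroids), centroids


# Made-up sample: two tight, well-separated groups of three points each.
sample = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, centroids = kmeans(sample, k=2)
print(labels)
print(centroids)
```

Which group ends up as label 0 depends on the random initial centroids, so only the grouping, not the label numbers, is meaningful.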

Code

# -*- coding: utf-8 -*-
"""
Created on 2018-05-09
k-means
@author:Wendy
"""

"""
Part 1: imports
Import the KMeans estimator from sklearn.cluster
"""

from sklearn.cluster import KMeans

"""
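Part 2: dataset
x is a two-column matrix of basketball player statistics
Column 1 is the player's assists per minute: assists_per_minute
Column 2 is the player's points per minute: points_per_minute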
"""

x = [[0.0888,0.5885],  
     [0.1399,0.8291],  
     [0.0747,0.4974],  
     [0.0983,0.5772],  
     [0.1276,0.5703],  
     [0.1671,0.5835],  
     [0.1906,0.5276],  
     [0.1061,0.5523],  
     [0.2446,0.4007],  
     [0.1670,0.4770],  
     [0.2485,0.4313],  
     [0.1227,0.4909],  
     [0.1240,0.5668],  
     [0.1461,0.5113],  
     [0.2315,0.3788],  
     [0.0494,0.5590],  
     [0.1107,0.4799],  
     [0.2521,0.5735],  
     [0.1007,0.6318],  
     [0.1067,0.4326],  
     [0.1956,0.4280]     
    ]  
print(x)


"""
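Part 3: K-means clustering
clf = KMeans(n_clusters=3) sets the number of clusters to 3
y_pred = clf.fit_predict(x) fits the model to x and stores each row's cluster label in y_pred
"""

clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(x)

# Print the estimator together with its parameters
print(clf)
# Print the cluster labels: 21 rows, one label (0, 1, or 2) per row of x, i.e. per player
print(y_pred)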


"""
Part 4: visualization
Import matplotlib
"""

import numpy as np
import matplotlib.pyplot as plt

# Extract the first and second columns
X = np.array(x, dtype=np.float64)
x = [n[0] for n in X]
print(x)
y = [n[1] for n in X]
print(y)

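# Scatter plot: x values, y values, c=y_pred colors each point by its cluster label;
# marker sets the symbol: 'o' is a circle, '*' is a star, 'x' is a cross
scatter = plt.scatter(x, y, c=y_pred, marker='x')

# Title
plt.title('kmeans-basketball data')

# Axis labels
plt.xlabel('assists_per_minute')
plt.ylabel('points_per_minute')

# Legend with one entry per cluster (legend_elements needs Matplotlib 3.1 or later);
# passing only ['A','B','C'] would label just the single scatter artist
plt.legend(scatter.legend_elements()[0], ['A', 'B', 'C'])

# Show the figure
plt.show()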

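Output (from a Python 2 run; the numbering of the cluster labels can change between runs because random_state is not set)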

[[0.0888, 0.5885], [0.1399, 0.8291], [0.0747, 0.4974], [0.0983, 0.5772], [0.1276, 0.5703], [0.1671, 0.5835], [0.1906, 0.5276], [0.1061, 0.5523], [0.2446, 0.4007], [0.167, 0.477], [0.2485, 0.4313], [0.1227, 0.4909], [0.124, 0.5668], [0.1461, 0.5113], [0.2315, 0.3788], [0.0494, 0.559], [0.1107, 0.4799], [0.2521, 0.5735], [0.1007, 0.6318], [0.1067, 0.4326], [0.1956, 0.428]]
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
[1 2 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 0]
[0.088800000000000004, 0.1399, 0.074700000000000003, 0.098299999999999998, 0.12759999999999999, 0.1671, 0.19059999999999999, 0.1061, 0.24460000000000001, 0.16700000000000001, 0.2485, 0.1227, 0.124, 0.14610000000000001, 0.23150000000000001, 0.049399999999999999, 0.11070000000000001, 0.25209999999999999, 0.1007, 0.1067, 0.1956]
[0.58850000000000002, 0.82909999999999995, 0.49740000000000001, 0.57720000000000005, 0.57030000000000003, 0.58350000000000002, 0.52759999999999996, 0.55230000000000001, 0.4007, 0.47699999999999998, 0.43130000000000002, 0.4909, 0.56679999999999997, 0.51129999999999998, 0.37880000000000003, 0.55900000000000005, 0.47989999999999999, 0.57350000000000001, 0.63180000000000003, 0.43259999999999998, 0.42799999999999999]
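
Because the initial centroids are chosen randomly, rerunning the script can renumber the clusters. As a hedged sketch of how to make a run repeatable and inspect what was learned, the example below fixes `random_state` (the value 0 is an arbitrary choice for illustration) and prints the fitted centroids and the within-cluster sum of squared distances; it reuses the 21 rows from the listing above.

```python
from sklearn.cluster import KMeans

# The 21 (assists_per_minute, points_per_minute) rows from the listing above.
x = [[0.0888, 0.5885], [0.1399, 0.8291], [0.0747, 0.4974],
     [0.0983, 0.5772], [0.1276, 0.5703], [0.1671, 0.5835],
     [0.1906, 0.5276], [0.1061, 0.5523], [0.2446, 0.4007],
     [0.1670, 0.4770], [0.2485, 0.4313], [0.1227, 0.4909],
     [0.1240, 0.5668], [0.1461, 0.5113], [0.2315, 0.3788],
     [0.0494, 0.5590], [0.1107, 0.4799], [0.2521, 0.5735],
     [0.1007, 0.6318], [0.1067, 0.4326], [0.1956, 0.4280]]

# A fixed random_state makes the centroid initialization repeatable.
clf = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = clf.fit_predict(x)

print(y_pred)                # one cluster label per player
print(clf.cluster_centers_)  # one (assists, points) centroid per cluster
print(clf.inertia_)          # sum of squared distances to the nearest centroid
```

The labels may still be numbered differently from the output shown above; what matters is which players end up grouped together.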

Reading a local file for clustering, with improved matplotlib plotting

# -*- coding: utf-8 -*-

"""
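Read the local file into the list data;
its columns are later combined into the matrix x
"""

data = []
# Raw string so the backslashes in the Windows path are taken literally
with open(r'd:\ml data\data.txt', 'r') as f:
    for line in f:
        line = line.rstrip()  # strip the trailing newline
        result = ' '.join(line.split())  # collapse runs of whitespace into single spaces
        # Five values per row, e.g. '0.0888 201 36.02 28 0.5885'; convert each string to float
        s = [float(v) for v in result.split(' ')]
        #print(s)
        data.append(s)  # store the row

#print('Full dataset')
#print(data)
#print(type(data))

print('Column 1 and column 5')
L2 = [n[0] for n in data]
print(L2)
L5 = [n[4] for n in data]
print(L5)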

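'''
X is the two-column matrix of basketball player statistics
96 rows in total, two columns taken from each row
Column 1 is the player's assists per minute: assists_per_minute
Column 5 is the player's points per minute: points_per_minute
'''
# Combine the two columns into a list of [assists, points] rows.
# Pairing them directly keeps every row in its original order; going through
# dict(zip(L2, L5)) would drop players that share the same assists value.
print('Two columns combined into a two-column matrix')
x = [[a, p] for a, p in zip(L2, L5)]
print(x)
print(type(x))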

""" 
K-means clustering
clf = KMeans(n_clusters=3) sets the number of clusters to 3
y_pred = clf.fit_predict(x) fits the model to x and stores each row's cluster label in y_pred
"""

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(x)
print(clf)
# Print the cluster labels: 96 rows, one label (0, 1, or 2) per row of x, i.e. per player
print(y_pred)

""" 
Visualization
matplotlib is the plotting package
import matplotlib.pyplot as plt: "as" gives the module the short alias plt
"""

import numpy as np
import matplotlib.pyplot as plt

k = np.array(x)
# Extract the first and second columns; n[0] is the first-column value of a row
x = [n[0] for n in k]
print(x)
y = [n[1] for n in k]
print(y)

# Scatter plot arguments: x values, y values, c=y_pred for cluster colors,
# marker for the symbol ('o' circle, '*' star, 'x' cross)
#plt.scatter(x, y, c=y_pred, marker='x')

# Coordinates of the points in each cluster
x1, y1 = [], []
x2, y2 = [], []
x3, y3 = [], []

# Collect the points labeled 0, 1, and 2 into (x1, y1), (x2, y2), and (x3, y3) respectively
for point, label in zip(k, y_pred):
    if label == 0:
        x1.append(point[0])
        y1.append(point[1])
    elif label == 1:
        x2.append(point[0])
        y2.append(point[1])
    elif label == 2:
        x3.append(point[0])
        y3.append(point[1])
    
  
 
# One color and marker per cluster: red crosses, green circles, blue stars
plot1, = plt.plot(x1, y1, 'rx')
plot2, = plt.plot(x2, y2, 'go')
plot3, = plt.plot(x3, y3, 'b*')


# Title
plt.title("Kmeans-Basketball Data")

# Axis labels
plt.xlabel("assists_per_minute")
plt.ylabel("points_per_minute")

# Legend with one entry per cluster
plt.legend((plot1, plot2, plot3), ('A', 'B', 'C'), fontsize=10)

plt.show()
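
As an alternative way to read the file, `numpy.loadtxt` returns all rows as a 2-D array, so columns 1 and 5 can be selected by index without building intermediate lists. The sketch below is only an illustration: it writes a small temporary file standing in for data.txt, and apart from the first row (taken from the comment in the listing above) every value in `sample_text` is a made-up placeholder. Two placeholder rows deliberately share the same first-column value to show that array indexing keeps both, whereas a dictionary keyed on that column would not.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for data.txt: five whitespace-separated columns per row.
# Only the first row comes from the post; the other rows are placeholders.
sample_text = """\
0.0888 201 36.02 28 0.5885
0.1399 198 39.32 30 0.8291
0.1399 190 30.00 25 0.4000
0.2446 185 28.50 27 0.4007
0.0494 210 20.10 24 0.5590
0.2485 188 33.00 29 0.4313
"""

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(sample_text)
    path = handle.name

try:
    # loadtxt splits on any run of whitespace and converts each field to float.
    table = np.loadtxt(path)
finally:
    os.remove(path)

# Keep column 1 (assists per minute) and column 5 (points per minute).
features = table[:, [0, 4]]
print(features.shape)

# A dict keyed on column 1 has fewer entries than rows when values repeat.
print(len(dict(zip(table[:, 0], table[:, 4]))))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)
```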