Implementing the k-means algorithm from Zhou Zhihua's Machine Learning book in Python

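Hello, all

The previous post implemented a decision tree in Python. This post implements the k-means algorithm in Python, and the next one will implement k-means with map-reduce.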

The algorithm procedure is as follows:

[Figure: algorithm procedure, image not preserved]

The code is as follows:

# coding=utf-8
import copy as cp
import math
import random as rd

import matplotlib.pyplot as plt
import numpy as np


'''
@author :chenyuqing
@mail   :chen_yu_qin_g@163.com
'''


def load_data(path):
    '''
    :param path: path to the data file, one sample per line, values separated by whitespace
    :return: numpy array holding the samples, one row per sample
    '''
    data_set = []
    with open(path, encoding='utf-8') as file_object:
        for line in file_object:
            line_arr = line.strip().split()  # handles tabs as well as spaces
            if not line_arr:  # skip blank lines
                continue
            data_set.append([float(x) for x in line_arr])  # convert strings to numbers
    return np.array(data_set)


def my_kmeans(k, data_set):
    '''
    :param k: number of clusters
    :param data_set: numpy array of two-dimensional samples
    :return: list of k clusters, each a list of [x, y] points
    '''
    # pick k distinct samples as the initial mean vectors
    sample_data_index = rd.sample(range(len(data_set)), k)
    start_list = [data_set[i].tolist() for i in sample_data_index]  # current mean vectors
    end_list = None  # mean vectors of the previous iteration
    end_result = [[] for _ in range(k)]  # clusters after assignment

    while start_list != end_list:  # stop once the mean vectors no longer change
        end_result = [[] for _ in range(k)]  # reset the clusters on every iteration
        for point in data_set:
            temp_distance = float("inf")
            temp_result = 0
            for j in range(k):
                distance = math.sqrt((point[0] - start_list[j][0]) ** 2
                                     + (point[1] - start_list[j][1]) ** 2)
                if distance < temp_distance:
                    temp_distance = distance
                    temp_result = j  # index of the nearest mean vector so far
            end_result[temp_result].append(point.tolist())
        end_list = cp.deepcopy(start_list)
        for i in range(k):
            if not end_result[i]:  # keep the old mean vector if a cluster is empty
                continue
            # round the means to 6 decimals so floating-point noise cannot
            # keep the convergence test above from ever succeeding
            start_list[i][0] = round(sum(x[0] for x in end_result[i]) / len(end_result[i]), 6)
            start_list[i][1] = round(sum(x[1] for x in end_result[i]) / len(end_result[i]), 6)
    print("the result is :\n", end_result)
    return end_result


if __name__ == '__main__':
    print("------------my kmeans-----------")
    path = "./西瓜数据集4.0.txt"
    data_set = load_data(path=path)
    print(data_set)
    result = my_kmeans(3, data_set=data_set)
    print(result[0])
    print(result[1])
    print(result[2])

    one_x = [x[0] for x in result[0]]
    one_y = [x[1] for x in result[0]]

    two_x = [x[0] for x in result[1]]
    two_y = [x[1] for x in result[1]]

    three_x = [x[0] for x in result[2]]
    three_y = [x[1] for x in result[2]]

    # one marker and color per cluster
    plt.scatter(one_x, one_y, s=20, marker='o', color='m')
    plt.scatter(two_x, two_y, s=20, marker='+', color='c')
    plt.scatter(three_x, three_y, s=20, marker='*', color='r')
    plt.show()
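
The assignment and update steps in the script can also be written without explicit Python loops over the samples. The following is only a minimal sketch of that idea with NumPy, not the author's code: the function name `kmeans_numpy` and the parameters `seed` and `max_iter` are hypothetical and were added for illustration.

```python
import numpy as np


def kmeans_numpy(data, k, seed=0, max_iter=100):
    """Vectorized k-means sketch; `seed` and `max_iter` are illustrative parameters."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # pick k distinct samples as the initial mean vectors
    means = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # distances[i, j] = Euclidean distance from sample i to mean vector j
        distances = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)  # nearest mean vector for each sample
        # recompute each mean vector; keep the old one if its cluster is empty
        new_means = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):  # converged: the means stopped moving
            break
        means = new_means
    return labels, means


if __name__ == '__main__':
    # two well-separated groups of three points each
    points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    labels, means = kmeans_numpy(points, 2)
    print(labels)
    print(means)
```

Using `np.allclose` for the stopping test plays the same role as rounding the means in the script: it tolerates tiny floating-point differences between two consecutive iterations.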

The results are as follows:

[[0.697, 0.46], [0.744, 0.376], [0.634, 0.264], [0.608, 0.318], [0.639, 0.161], [0.657, 0.198], [0.719, 0.103], [0.748, 0.232], [0.714, 0.346], [0.751, 0.489], [0.725, 0.445]]
[[0.403, 0.237], [0.243, 0.267], [0.36, 0.37], [0.339, 0.241], [0.282, 0.257], [0.483, 0.312], [0.478, 0.437], [0.525, 0.369], [0.532, 0.472], [0.473, 0.376], [0.446, 0.459]]
[[0.556, 0.215], [0.481, 0.149], [0.437, 0.211], [0.666, 0.091], [0.245, 0.057], [0.343, 0.099], [0.593, 0.042], [0.359, 0.188]]
11
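
As a quick sanity check on the three clusters printed above, the sketch below recomputes each cluster's size and mean vector, together with the sum of squared distances from every point to its cluster mean, which is the quantity k-means tries to reduce. The helper names `mean_vector` and `squared_error` are illustrative and do not appear in the original script.

```python
# The three clusters exactly as printed in the results.
clusters = [
    [[0.697, 0.46], [0.744, 0.376], [0.634, 0.264], [0.608, 0.318],
     [0.639, 0.161], [0.657, 0.198], [0.719, 0.103], [0.748, 0.232],
     [0.714, 0.346], [0.751, 0.489], [0.725, 0.445]],
    [[0.403, 0.237], [0.243, 0.267], [0.36, 0.37], [0.339, 0.241],
     [0.282, 0.257], [0.483, 0.312], [0.478, 0.437], [0.525, 0.369],
     [0.532, 0.472], [0.473, 0.376], [0.446, 0.459]],
    [[0.556, 0.215], [0.481, 0.149], [0.437, 0.211], [0.666, 0.091],
     [0.245, 0.057], [0.343, 0.099], [0.593, 0.042], [0.359, 0.188]],
]


def mean_vector(cluster):
    """Component-wise mean of a list of [x, y] points."""
    n = len(cluster)
    return [sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n]


def squared_error(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mx, my = mean_vector(cluster)
        total += sum((x - mx) ** 2 + (y - my) ** 2 for x, y in cluster)
    return total


sizes = [len(c) for c in clusters]
print(sizes)  # [11, 11, 8]
print([mean_vector(c) for c in clusters])
print(squared_error(clusters))
```

The sizes add up to 30, the number of rows in the data set. Because the initial mean vectors are chosen at random, another run may produce a different partition and a different squared error.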

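Result plot:

[Figure: scatter plot of the three clusters, image not preserved]

The data used is watermelon data set 4.0: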

0.697 0.46
0.744 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0.437 0.211
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.36 0.37
0.593 0.042
0.719 0.103
0.359 0.188
0.339 0.241
0.282 0.257
0.748 0.232
0.714 0.346
0.483 0.312
0.478 0.437
0.525 0.369
0.751 0.489
0.532 0.472
0.473 0.376
0.725 0.445
0.446 0.459
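
If the rows above are saved to a text file, NumPy can also parse them directly. This is a small sketch that assumes the values are separated by whitespace (spaces or tabs); it reads the first three rows from an in-memory string instead of a file so that it runs on its own.

```python
import io

import numpy as np

# First three rows of the data set, as listed above.
raw = """0.697 0.46
0.744 0.376
0.634 0.264
"""

# np.loadtxt splits on any whitespace by default and accepts file-like objects.
data = np.loadtxt(io.StringIO(raw))
print(data.shape)  # (3, 2)
```

Passing a file path instead of the `io.StringIO` object loads the full data set in the same way.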



A follow-up post will rewrite this as a MapReduce program. Stay tuned.

Thanks
