Machine Learning in Action, Chapter 2, Section 2.1: k-Nearest Neighbors

csdn_lzw

Category: Machine Learning in Action. Tags: machine learning, numpy, python, KNN

Article link: https://blog.csdn.net/csdn_lzw/article/details/53350685

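The Machine Learning in Action blog series is mainly about implementing and understanding the code in the book, so the posts are essentially reading notes. Reading alone is not enough for a hands-on book: once you start typing the code, you run into plenty of odd problems. The posts are rough and should be read alongside the book. I am learning as I go and looking things up along the way, so please point out any mistakes in the comments. The book's code and data are widely available online; please download them yourself.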

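The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distance between its feature values and those of labeled samples.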

2.1.1 Importing the data

# coding=utf-8
from numpy import *
import operator
def createDataSet():
    # array is NumPy's multidimensional array type
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

Result in the interactive interpreter (with the code above saved as kNN.py):

>>> import kNN
>>> group , labels = kNN.createDataSet()
>>> group
array([[ 1. ,  1.1],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 0. ,  0.1]])
>>> labels
['A', 'A', 'B', 'B']
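
To make "measuring the distance between feature values" concrete, here is a minimal sketch, written for Python 3, that computes the Euclidean distance from a query point to each of the four sample points. The sample data is copied from createDataSet; the names query and distances are illustrative and do not come from the book.

```python
import numpy as np

# The four sample points and labels from createDataSet
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

query = np.array([0.0, 0.0])  # hypothetical point to classify

# Euclidean distance from the query to every row of group
distances = np.sqrt(((group - query) ** 2).sum(axis=1))

for label, dist in zip(labels, distances):
    print(label, round(float(dist), 3))
```

The two points labeled B come out closest to the query, which is why a vote among the nearest neighbors of [0, 0] favors B.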

2.1.2 Implementing the kNN classification algorithm

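def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of samples in the data set
    # Compute the Euclidean distance from inX to every sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Sort sample indices by distance, nearest first
    sortedDistIndicies = distances.argsort()
    # Count the class labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works in Python 2 and 3; the book uses iteritems(), which is Python 2 only
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the most frequent class among the k nearest samples
    return sortedClassCount[0][0]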

Example call in the interpreter:

>>> kNN.classify0([0,0],group,labels,3)
'B'
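
For readers on Python 3, the same steps (distance, sort, vote) can be sketched more compactly. This is an alternative written for illustration, not the book's code: the name classify_knn is hypothetical, NumPy broadcasting stands in for tile, and collections.Counter stands in for the manual vote dictionary.

```python
from collections import Counter

import numpy as np


def classify_knn(in_x, data_set, labels, k):
    """Return the majority label among the k rows of data_set nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile() is not needed
    diff = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the top label
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([0, 0], group, labels, 3))  # prints B
```

With k = 3 the nearest neighbors of [0, 0] are two B points and one A point, so the vote returns B, matching the interpreter session above.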

2.1.3 Notes on the functions used

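The shape function returns the dimensions of an array (the L suffix in the output below marks a Python 2 long integer).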

>>> from numpy import *
>>> shape([1])
(1L,)
>>> shape([[1],[2]])
(2L, 1L)
>>> shape([[1,2]])
(1L, 2L)
>>> shape(3)
()
>>> e = eye(3)
>>> e
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
>>> e.shape
(3L, 3L)
>>> e.shape[0]
3L

The tile function builds a new array by repeating an array a given number of times along each axis.

>>> from numpy import*
>>> a = [0,1,2]
>>> b = tile(a,2)
>>> b
array([0, 1, 2, 0, 1, 2])
>>> b = tile(a,[1,2])
>>> b
array([[0, 1, 2, 0, 1, 2]])
>>> b = tile(a,[2,1])
>>> b
array([[0, 1, 2],
       [0, 1, 2]])
>>> b = tile(a,[2,2])
>>> b
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
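
In the classifier, tile is used to copy the query point once per sample so that the two arrays can be subtracted. The sketch below, written for Python 3 with a hypothetical query point in_x, shows that NumPy broadcasting gives the same difference matrix without building the tiled copy.

```python
import numpy as np

data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
in_x = np.array([0.5, 0.5])  # hypothetical query point

# tile stacks in_x once per row of data_set, giving a (4, 2) array
tiled = np.tile(in_x, (data_set.shape[0], 1))
diff_tile = tiled - data_set

# Broadcasting performs the same row-wise subtraction without the copy
diff_broadcast = in_x - data_set

print(tiled.shape)
print(np.array_equal(diff_tile, diff_broadcast))
```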

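The sum function (NumPy's version, which from numpy import * puts in place of the built-in sum) adds up array elements, optionally along one axis.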

>>> from numpy import*
>>> sum([1,2])
3
>>> sum([[1,2],[2,3],[3,4]])
15
>>> sum([[1,2],[2,3],[3,4]],axis=0)  # sum down each column
array([6, 9])
>>> sum([[1,2],[2,3],[3,4]],axis=1)  # sum across each row
array([3, 5, 7])
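
One detail worth checking is which sum is being called. The sketch below, written for Python 3, calls NumPy's sum explicitly as np.sum and compares it with the built-in sum, which fails on a nested list. The variable names are illustrative.

```python
import builtins

import numpy as np

rows = [[1, 2], [2, 3], [3, 4]]

total = np.sum(rows)              # all elements
col_sums = np.sum(rows, axis=0)   # one sum per column
row_sums = np.sum(rows, axis=1)   # one sum per row
print(total, col_sums, row_sums)

# The built-in sum starts from 0 and cannot add a list to an int
try:
    builtins.sum(rows)
    builtin_ok = True
except TypeError:
    builtin_ok = False
print(builtin_ok)
```

Importing NumPy as np keeps the two functions apart, which avoids surprises when a script mixes plain lists and arrays.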

The argsort function returns the indices that would sort the array values in ascending order.

>>> from numpy import*
>>> x = [4,2,5]
>>> argsort(x)
array([1, 0, 2], dtype=int64)
>>> x = ([[2,3],[-5,8]])
>>> argsort(x,axis=0)   # sort within each column
array([[1, 0],
       [0, 1]], dtype=int64)
>>> argsort(x,axis=1)   # sort within each row
array([[0, 1],
       [0, 1]], dtype=int64)
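
In the classifier, argsort is what turns a list of distances into "the k nearest samples". A minimal sketch for Python 3, using rounded, illustrative distance values rather than output from the book:

```python
import numpy as np

distances = np.array([1.49, 1.41, 0.0, 0.1])  # illustrative values
labels = ['A', 'A', 'B', 'B']

order = np.argsort(distances)  # indices from nearest to farthest
k = 3
nearest_labels = [labels[i] for i in order[:k]]
print(order.tolist())
print(nearest_labels)
```

Because argsort returns positions rather than sorted values, the same positions can be used to look up the labels list.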

The range function (the list output below is Python 2 behavior)

>>> range(1,5)
[1, 2, 3, 4]
>>> range(5)
[0, 1, 2, 3, 4]
>>> range(1,5,2)
[1, 3]
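
In Python 3, range returns a lazy range object instead of a list, so the interpreter no longer echoes the elements. A short sketch of the equivalent calls, with the illustrative name lazy:

```python
lazy = range(1, 5)

print(lazy)                  # shows the range object, not its elements
print(list(lazy))            # list() materializes the values
print(list(range(1, 5, 2)))  # start, stop, step
```

Loops such as for i in range(k) behave the same in both versions, so the classifier code needs no change on this point.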

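Dictionary: d = {key1 : value1, key2 : value2}. Each key is separated from its value by a colon, the pairs are separated by commas, and the whole thing is enclosed in curly braces.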

ab = {'Swaroop': 'swaroopch@byteofpython.info',
      'Larry': 'larry@wall.org',
      'Matsumoto': 'matz@ruby-lang.org',
      'Spammer': 'spammer@hotmail.com'}
print "Swaroop's address is %s" % ab['Swaroop']
# Adding a key/value pair
ab['Guido'] = 'guido@python.org'
# Deleting a key/value pair
del ab['Spammer']
print 'There are %d contacts in the address-book' % len(ab)
for name, address in ab.items():
    print 'Contact %s at %s' % (name, address)
if 'Guido' in ab:  # or, in Python 2 only: ab.has_key('Guido')
    print "Guido's address is %s" % ab['Guido']

Swaroop's address is swaroopch@byteofpython.info
There are 4 contacts in the address-book
Contact Swaroop at swaroopch@byteofpython.info
Contact Matsumoto at matz@ruby-lang.org
Contact Larry at larry@wall.org
Contact Guido at guido@python.org
Guido's address is guido@python.org

The dictionary get() method returns the value for the given key, or a default value if the key is not in the dictionary. Syntax: dict.get(key, default=None)

# coding=utf-8
info = {'Name': 'Zara', 'Age': 27}
print "Value: %s" % info.get('Age')
print "Value: %s" % info.get('Sex')         # key missing, no default given
print "Value: %s" % info.get('Sex', 'guy')  # key missing, default supplied

Value: 27
Value: None
Value: guy
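
In the classifier, get with a default of 0 is what makes vote counting work without first checking whether a label has been seen. A minimal sketch for Python 3, with illustrative labels and names:

```python
nearest_labels = ['B', 'B', 'A']  # illustrative labels of the k nearest points

class_count = {}
for label in nearest_labels:
    # get returns 0 the first time a label is seen, so no KeyError occurs
    class_count[label] = class_count.get(label, 0) + 1

print(class_count)
```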

Several ways to iterate over a dictionary, including the iteritems function (Python 2 only; in Python 3 use items())

# coding=utf-8
info = {'Name': 'Zara', 'Age': 27}
print '****Method one****'
for key in info:
    print key, info[key]
    print key + str(info[key])
print '****Method two****'
for (k, v) in info.items():
    print "dict[%s]=" % k, v
print '****Method three****'
# In Python 2, items() returns a list, while iteritems() returns an iterator
# that yields one (key, value) pair at a time until none are left
for k, v in info.iteritems():
    print "dict[%s]=" % k, v
print '****Method four****'
for i in info.iteritems():
    print i

****Method one****
Age 27
Age27
Name Zara
NameZara
****Method two****
dict[Age]= 27
dict[Name]= Zara
****Method three****
dict[Age]= 27
dict[Name]= Zara
****Method four****
('Age', 27)
('Name', 'Zara')
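
Python 3 dictionaries no longer have an iteritems method; items() returns a view that is iterated lazily and covers the same use. A short sketch of the Python 3 equivalent, with illustrative variable names:

```python
info = {'Name': 'Zara', 'Age': 27}

# items() returns a view of (key, value) pairs in Python 3
for key, value in info.items():
    print("info[%s]=%s" % (key, value))

# list() turns the view into a list of tuples when one is needed
pairs = list(info.items())
print(pairs)

# iteritems is gone in Python 3
has_iteritems = hasattr(info, 'iteritems')
print(has_iteritems)
```

Python 3.7 and later also keep dictionaries in insertion order, so the pairs come out in a different order from the Python 2 output shown above.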

The operator.itemgetter function

>>> import operator
>>> a = [[1,2],[3,4],'hh']
>>> b = operator.itemgetter(2,1)  # fetch the items at index 2 and index 1 of the object
>>> b(a)
('hh', [3, 4])
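
In the classifier, itemgetter(1) is applied to (label, count) pairs so that sorting uses the count. A minimal sketch for Python 3, with an illustrative pair:

```python
import operator

pair = ('B', 2)  # a (label, count) pair such as those produced by dict.items()

get_count = operator.itemgetter(1)
print(get_count(pair))

# With several indices, itemgetter returns a tuple in the order requested
swapped = operator.itemgetter(1, 0)(pair)
print(swapped)
```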

The sorted function: sorted(iterable, cmp=None, key=None, reverse=False). This is the Python 2 signature; Python 3 removed the cmp parameter.

>>> students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(students, key=lambda student : student[2])
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(students, key=lambda student : student[2],reverse=0)
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(students, key=lambda student : student[2],reverse=1)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(students, key=operator.itemgetter(2)) 
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(students, key=operator.itemgetter(1,2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

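key is a function that selects which part of each element to sort by.
sorted(students, key=operator.itemgetter(1,2)) sorts by the second field first and then, for ties, by the third field.
reverse is a boolean: the default False sorts in ascending order, and True sorts in descending order.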
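
Putting key and reverse together gives the last step of the classifier: ranking the vote counts from highest to lowest. The sketch below is written for Python 3 with illustrative counts; the max variant is an alternative not used in the book.

```python
import operator

class_count = {'A': 1, 'B': 2}  # illustrative vote counts

# Sort (label, count) pairs by count, highest first
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)
print(ranked[0][0])

# When only the winner is needed, max avoids sorting the whole list
winner = max(class_count.items(), key=operator.itemgetter(1))[0]
print(winner)
```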
