I. Overview
1. kNN (k-nearest neighbors) is a classification algorithm.
2. It is an instance-based, "lazy" learning method: no explicit model is built, and all of the work is deferred to prediction time.
II. Algorithm Details
1. Steps (a minimal sketch follows the list):
(1) To classify an unknown instance, take all instances with known class labels as the reference set.
(2) Choose the parameter k.
(3) Compute the distance between the unknown instance and every known instance (e.g., the Euclidean distance between two points).
(4) Select the k nearest known instances.
(5) By the majority-vote rule, assign the unknown instance to the class held by most of its k nearest neighbors.
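The steps above condense to a few lines of code. Below is a minimal sketch of steps (3)-(5), not the full implementation given later in this article; the 2-D sample points and the query point are made up purely for illustration.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # step (3): Euclidean distance from the query to every known instance
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # step (4): indices of the k nearest known instances
    nearest = np.argsort(dists)[:k]
    # step (5): majority vote among the labels of the k nearest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# made-up 2-D sample data: two instances of class 0, two of class 1
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # prints 0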
2. Strengths and weaknesses:
Strengths: simple, easy to understand, easy to implement; with a suitable choice of k it gains robustness to noisy data.
Weaknesses: storing all known instances requires a large amount of memory;
the prediction cost is high (every known instance must be compared with the instance being classified);
when the class distribution is imbalanced, e.g., when one class dominates with far more instances than the others, a new unknown instance is easily assigned to that dominant class simply because its instances are so numerous, even though the unknown instance may not actually be close to it.
3. Improvement:
Take distance into account and weight each neighbor's vote by its distance, so that closer neighbors count for more (see the example below).
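In scikit-learn this improvement is a one-line change: KNeighborsClassifier accepts weights='distance', which weights each neighbor's vote by the inverse of its distance instead of counting every vote equally. A brief sketch (the value of k and the test instance are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# weights='uniform' (the default) counts every neighbor's vote equally;
# weights='distance' weights votes by 1/distance, so closer neighbors dominate
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(iris.data, iris.target)
print(knn.predict([[5.9, 3.0, 5.1, 1.8]]))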
III. Implementation
1. Using the ready-made classifier from scikit-learn:
# Import the required modules
from sklearn import neighbors
from sklearn import datasets
# Instantiate the kNN classifier
knn = neighbors.KNeighborsClassifier()
# Load the training data set
iris = datasets.load_iris()
print(iris)
# Build the model from the feature vectors and the target vector
knn.fit(iris.data, iris.target)
# Predict the target for a test instance
predictedLabel = knn.predict([[0.1, 0.2, 0.3, 0.4]])
print(predictedLabel)
Output (abridged):
{'DESCR': 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    ...', 'target_names': array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10'), 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': array([0, 0, 0, ..., 2, 2, 2]), 'data': array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       ...,
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])}
[0]
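The test point [0.1, 0.2, 0.3, 0.4] above is an arbitrary vector, so the printed label [0] says little about model quality. A more meaningful check, assuming scikit-learn's train_test_split is available, holds out part of the iris data and measures accuracy, mirroring what the hand-written version in the next part does (test_size and random_state are arbitrary choices):
from sklearn import neighbors, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# hold out one third of the data for testing, mirroring split = 0.67 below
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0)
knn = neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of test instances classified correctly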
2. Implementing the kNN classifier algorithm in Python from scratch
First download the Iris data set from http://archive.ics.uci.edu/ml/:
click "Iris",
click "Data Folder",
download iris.data.
Code:
# -*- coding:utf-8 -*-
# Import the required modules
import csv
import random
import math
import operator

# Load the data set and randomly split it into a training set and a testing set
def load_dataset(fileName, split, trainingData=[], testingData=[]):
    with open(fileName, "rt") as csvfile:
        lines = csv.reader(csvfile)  # read the contents of the csv file
        dataset = list(lines)        # wrap the rows in a list
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])  # convert the four features to float
            # split is the boundary: rows whose random number falls below split
            # go to the training set, the rest go to the testing set
            if random.random() < split:
                trainingData.append(dataset[x])
            else:
                testingData.append(dataset[x])

# Compute the distance between the test instance and a known instance
def cal_instance(instance_1, instance_2, dimen):
    distance = 0
    for x in range(dimen):
        distance += pow((instance_1[x] - instance_2[x]), 2)  # sum the squared differences over all dimensions
    return math.sqrt(distance)  # Euclidean distance

# Select the k known instances nearest to the test instance
def select_neark(trainingSet, testInstance, k):
    distance = []  # stores (training instance, distance to the test instance) pairs
    length = len(testInstance) - 1  # number of features (the last column is the class label)
    for x in range(len(trainingSet)):
        dst = cal_instance(testInstance, trainingSet[x], length)  # distance from each training instance to the test instance
        distance.append((trainingSet[x], dst))
    distance.sort(key=operator.itemgetter(1))  # sort by distance
    neighbors = []
    for x in range(k):
        neighbors.append(distance[x][0])  # keep the k nearest training instances
    return neighbors

# Classify the test instance by majority vote
def classify(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]  # the class label of this neighbor
        if response in classVotes:
            classVotes[response] += 1  # existing class: add one vote
        else:
            classVotes[response] = 1   # new class: record it with one vote
    # sort the vote counts in descending order and return the most frequent class
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Compute the prediction accuracy
def precision(testSet, prediction):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == prediction[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Main program
def main():
    trainingSet = []
    testingSet = []
    split = 0.67
    load_dataset(r"E:\demo_py\python\machine_learning\iris.data.txt", split, trainingSet, testingSet)
    print("trainSet: " + repr(len(trainingSet)))
    print("testSet: " + repr(len(testingSet)))
    prediction = []
    k = 3
    for x in range(len(testingSet)):
        neighbors = select_neark(trainingSet, testingSet[x], k)  # find the k nearest neighbors
        result = classify(neighbors)  # classify the test instance by its neighbors
        prediction.append(result)     # append the predicted label
        print("prediction: " + repr(result) + " actually: " + repr(testingSet[x][-1]))
    accuracy = precision(testingSet, prediction)
    print("accuracy: " + repr(accuracy) + "%")  # repr() converts the value to a string
main()
Output:
trainSet: 96
testSet: 54
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-virginica' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-versicolor' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
accuracy: 96.29629629629629%
Note a Python 2 / Python 3 incompatibility in this code: Python 3 removed the dict method iteritems(), so items() must be used in its place.
Remark:
The difference between items() in Python 3 and iteritems() in Python 2.x:
In Python 2.x, items() returns a copy of the list of all items (key/value pairs) in the dictionary, which costs extra memory.
iteritems() returns an iterator over all items (key/value pairs) in the dictionary, which costs no extra memory.
In Python 3.x, both iteritems() and viewitems() have been removed, and items() now gives the same result that viewitems() gave in 2.x. Replace iteritems() with items() in Python 3; the result can be traversed with a for loop, as the snippet below shows.
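A minimal demonstration of the Python 3 behavior (the vote counts in the dictionary are made up):
import operator

# made-up vote counts, shaped like those produced by classify() above
classVotes = {'Iris-setosa': 1, 'Iris-versicolor': 2}

# Python 3: items() returns a view that a for loop can iterate over directly
for label, votes in classVotes.items():
    print(label, votes)

# the same call classify() makes; under Python 2, iteritems() would avoid
# building an intermediate list here
sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
print(sortedVotes[0][0])  # prints Iris-versicolor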