The k-Nearest Neighbors algorithm (KNN):
KNN classifies by measuring the distance between feature vectors. The idea: if the majority of the k samples most similar to a given sample in feature space belong to one class, then that sample belongs to the same class. K is usually an integer no larger than 20. In KNN, the selected neighbors are all objects that have already been correctly classified, and the class of the sample under consideration is decided only by the classes of its nearest one or few samples. As the figure below illustrates, should the green circle be classified as a red triangle or a blue square? If K=3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if K=5, blue squares make up 3/5, so the green circle is assigned to the blue-square class.
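The vote in that figure can be mimicked numerically. A minimal sketch, assuming a hypothetical neighbor ordering (closest first) that matches the description above:

```python
from collections import Counter

# Hypothetical labels of the green circle's 5 nearest neighbors,
# ordered from closest to farthest, matching the figure described above.
neighbors = ['triangle', 'triangle', 'square', 'square', 'square']

for k in (3, 5):
    winner = Counter(neighbors[:k]).most_common(1)[0][0]
    print('k=%d -> %s' % (k, winner))
# k=3 -> triangle (2 of 3 votes)
# k=5 -> square   (3 of 5 votes)
```

Note that the prediction flips between k=3 and k=5, which is why the choice of K matters.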
The idea of KNN: with the training data and labels known, feed in a test sample, compare its features against the features of every sample in the training set, and find the K training samples most similar to it; the class of the test sample is then the class that occurs most often among those K samples. The algorithm:
1) compute the distance between the test sample and every training sample;
2) sort the distances in ascending order;
3) take the K points with the smallest distances;
4) count how often each class occurs among those K points;
5) return the most frequent class among those K points as the predicted class of the test sample.
In short:
1) compute distances: given a test object, compute its distance to every object in the training set;
2) find neighbors: keep the k nearest training objects as the test object's neighbors;
3) classify: count the frequency of each class among those K points, and return the most frequent class as the prediction for the test point.
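The steps above can be sketched with the standard library alone (the sample points here are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(test_point, train_points, train_labels, k):
    # Step 1: distance from the test point to every training point
    dists = [math.dist(test_point, p) for p in train_points]
    # Steps 2-3: indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Steps 4-5: most frequent label among the k neighbors
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

train = [(1.5, 2.0), (1.2, 0.2), (1.6, 1.1), (0.2, 2.1), (0.15, 1.4), (0.3, 3.0)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_classify((1.5, 0.4), train, labels, 3))  # prints A
```

The numpy implementations below follow the same five steps, just vectorized.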
Strengths and weaknesses of KNN:
1. Strengths
1.1 Simple, easy to understand and implement; no parameters to estimate and no training phase;
1.2 Well suited to classifying rare events (e.g. building a churn-prediction model when the churn rate is very low, say below 0.5%);
1.3 Particularly well suited to multi-class problems (multi-modal, where objects carry multiple class labels); for example, when classifying gene function from gene features, kNN has been reported to outperform SVM.
2. Weaknesses
2.1 A lazy algorithm: classifying a test sample is computationally expensive and memory-hungry, so scoring is slow;
2.2 Poor interpretability: it cannot produce explicit rules the way a decision tree can.
Example 1: no text files involved; the samples are given directly.
Create a file knn.py:
# coding: utf-8
from numpy import *

# give training data and their labels
def createDataSet():
    group = array([[1.5,2.0],[1.2,0.2],[1.6,1.1],[0.2,2.1],[0.15,1.4],[0.3,3.0]])
    labels = ['A','A','A','B','B','B']
    return group, labels

# classify by knn
def classify(input, dataSet, label, k):
    dataSize = dataSet.shape[0]
    diff = tile(input, (dataSize, 1)) - dataSet
    sqdiff = diff ** 2
    squareDist = sum(sqdiff, axis=1)  # sum along each row, giving one squared distance per sample
    dist = squareDist ** 0.5
    sortedDistIndex = argsort(dist)  # indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        voteLabel = label[sortedDistIndex[i]]
        # count how often each class appears among the k nearest samples
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # pick the class that appears most often
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            classes = key
    return classes
Then create a file knn_test.py:
# -*- coding: utf-8 -*-
import sys
sys.path.append("F://python_test")
import knn
from numpy import *

dataSet, labels = knn.createDataSet()
input = array([1.5, 0.4])
K = 3
output = knn.classify(input, dataSet, labels, K)
print("training data is:", input, "classify output is:", output)
The run prints:
>>>
training data is: [1.5 0.4] classify output is: A
Example 2: reading samples from text files with Python.
We use kNN to classify a database of handwritten digits 0-9, with roughly 200 samples per digit. Each sample is stored in a txt file: the handwritten image itself is a 32x32 binary image, and after conversion each txt file likewise contains 32x32 digits. The unpacked database has two directories: trainingDigits holds about 2000 training samples, and testDigits holds about 900 test samples.
Database download link: http://download.csdn.net/detail/piaoxuezhong/9745648
Create a script knn.py containing four functions: one implementing the kNN classifier, one converting each sample's txt file into a vector, one loading the whole database, and one running the test.
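Before the full listing, the txt-to-vector step can be checked on synthetic data. The 32x32 layout comes from the database description above; the stripe pattern below is made up for illustration:

```python
import numpy as np

# Hypothetical stand-in for one digit file: 32 lines of 32 '0'/'1' characters
# (a vertical stripe of 8 ones per row). A real file has the same shape.
lines = ['0' * 12 + '1' * 8 + '0' * 12 for _ in range(32)]

# Flatten the 32x32 grid row by row into a single 1x1024 vector,
# exactly as img2vector does below.
vector = np.zeros((1, 1024))
for row, line in enumerate(lines):
    for col in range(32):
        vector[0, row * 32 + col] = int(line[col])

print(vector.shape)       # (1, 1024)
print(int(vector.sum()))  # 256, i.e. 8 ones per row * 32 rows
```

Each sample thus becomes one 1024-dimensional point, and distances between samples are ordinary Euclidean distances in that space.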
from numpy import *
import os

# classify by knn
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] gives the number of rows
    # step 1: calculate Euclidean distance
    diff = tile(newInput, (numSamples, 1)) - dataSet  # subtract element-wise
    squaredDiff = diff ** 2  # square the differences
    squaredDist = sum(squaredDiff, axis=1)  # sum is performed by row
    distance = squaredDist ** 0.5
    # step 2: sort the distances in ascending order
    sortedDistIndices = argsort(distance)
    classCount = {}  # dictionary: label -> vote count
    for i in range(k):
        # step 3: take the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]
        # step 4: count the times each label occurs
        # when voteLabel is not yet in classCount, get() returns 0
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # step 5: return the label with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

# convert one 32x32 text image to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = zeros((1, rows * cols))
    fileIn = open(filename)
    for row in range(rows):
        lineStr = fileIn.readline()
        for col in range(cols):
            imgVector[0, row * cols + col] = int(lineStr[col])
    fileIn.close()
    return imgVector

# load dataSet
def loadDataSet():
    # step 1: get the training set
    print("Getting training set...")
    dataSetDir = 'F://python_test//digits//'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')
    numSamples = len(trainingFileList)
    train_x = zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]
        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)
        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        train_y.append(label)
    # step 2: get the testing set
    print("Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')
    numSamples = len(testingFileList)
    test_x = zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]
        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)
        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        test_y.append(label)
    return train_x, train_y, test_x, test_y

# test handwriting classification
def testHandWritingClass():
    # step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()
    # step 2: training (kNN is lazy, so there is nothing to train)
    print("step 2: training...")
    pass
    # step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples
    # step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))
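As an aside, the `tile` call in `kNNClassify` can be replaced by numpy broadcasting, which subtracts the input from every row without materializing the tiled copy. A small equivalence check (toy data, not the digit database):

```python
import numpy as np

dataSet = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
newInput = np.array([2.0, 2.0])

# explicit replication, as kNNClassify does with tile()
diff_tile = np.tile(newInput, (dataSet.shape[0], 1)) - dataSet
# broadcasting produces the same differences directly
diff_bcast = newInput - dataSet

print(np.array_equal(diff_tile, diff_bcast))  # True
distance = ((diff_bcast ** 2).sum(axis=1)) ** 0.5
print(distance)  # one Euclidean distance per training row
```

Either form works; broadcasting simply avoids the intermediate tiled array.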
Also create a knn_test.py file to run the kNN test:
import knn
knn.testHandWritingClass()
Running it (F5) prints:
>>>
step 1: load data...
Getting training set...
Getting testing set...
step 2: training...
step 3: testing...
step 4: show the result...
The classify accuracy is: 98.84%