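How the k-nearest neighbors (k-NN) algorithm works: every sample in the training set carries a label. Given a new, unlabeled input, the algorithm compares the input's features with those of the training samples and assigns the input the most suitable label.
In practice, it picks the k training samples most similar to the input and uses the label that occurs most often among those k as the label of the new input.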
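General procedure:
Step 1: compute the distance dist between the unknown sample and every training sample
Step 2: sort the distances dist
Step 3: take the k points with the smallest distance to the current point
Step 4: count how often each class occurs among those k points
Step 5: assign the most frequent class label to the unknown sample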
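For reference, the steps above can be written compactly. This is only a sketch, and the notation is not from the original post: x is the unknown sample with n features, x^{(i)} and y^{(i)} are the i-th of m training samples and its label, and N_k(x) is the set of indices of the k nearest training samples. The distance shown is Euclidean, which is what the code in this post uses; other metrics are possible.

```latex
d\left(x, x^{(i)}\right) = \sqrt{\sum_{j=1}^{n} \left(x_j - x^{(i)}_j\right)^2}, \qquad i = 1, \dots, m

\hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\left[\, y^{(i)} = c \,\right]
```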
import operator
import numpy as np

def classify1(inX, group, labels, k=3):
    m = group.shape[0]
    # Compute the Euclidean distance from inX to every training sample
    dataInx = np.tile(inX, (m, 1)) - group
    dataInx = dataInx ** 2
    dataSum = dataInx.sum(axis=1)
    dataSum = dataSum ** 0.5
    # Pick the most common label among the k nearest samples
    dataSorted = dataSum.argsort()
    classCount = {}
    for i in range(k):
        classIndex = dataSorted[i]
        classCount[labels[classIndex]] = 1 + classCount.get(labels[classIndex], 0)
    #print(classCount)
    classCountSorted = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return classCountSorted[0][0]
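The same five steps can also be sketched with NumPy broadcasting and collections.Counter, which avoids np.tile and the manual dictionary. This is an illustrative variant, not the function used in the rest of this post: the name knn_predict and the toy data are made up for the example.

```python
from collections import Counter
import numpy as np

def knn_predict(x, samples, labels, k=3):
    # Step 1: Euclidean distance from x to every training sample (broadcasting)
    dists = np.sqrt(((samples - x) ** 2).sum(axis=1))
    # Steps 2-3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the labels of the k nearest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters labelled 'A' and 'B'
samples = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_predict(np.array([0.1, 0.2]), samples, labels, k=3))  # B
print(knn_predict(np.array([0.9, 1.2]), samples, labels, k=3))  # A
```

With k=3 and two classes a tie cannot occur; for an even k or more classes, ties are possible and this sketch simply returns the tied label that was encountered first.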
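Recognizing handwritten digits with the k-nearest neighbors algorithm
The handwritten digits come as txt files such as 0_1.txt, where the leading 0 means the file contains the digit 0. Each file holds a 32*32 matrix of characters. The first step is to convert a txt file into a NumPy array.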
import os
import numpy as np

def imgToVect(filename):
    # Flatten the 32*32 character grid into a 1*1024 row vector
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, i*32+j] = int(lineStr[j])
    return returnVect
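imgToVect returns a 1*1024 matrix. Next, read the training data into an m*1024 matrix trainMat and a label list hwlabels. Then read the test data and classify each test sample with the classify1 function above.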
hwlabels = []
trainingFileList = os.listdir('trainingDigits')
m = len(trainingFileList)
trainMat = np.zeros((m, 1024))
for i in range(m):
    filename = trainingFileList[i]
    # The digit before the underscore is the label
    trainIndex = filename.split('_')[0]
    hwlabels.append(int(trainIndex))
    #print('trainingDigits/%s'%i)
    trainMat[i, :] = imgToVect('trainingDigits/%s' % filename)
testFileList = os.listdir('testDigits')
errorNum = 0.0
m = len(testFileList)
for i in range(m):
    fileName = testFileList[i].split('_')[0]
    testMat = imgToVect('testDigits/%s' % testFileList[i])
    #print(testMat)
    resultData = classify1(testMat, trainMat, hwlabels, 3)
    if resultData != int(fileName):
        errorNum += 1
        print('the real:%s,the classify is %s' % (fileName, resultData))
print('the error rate is %f' % (errorNum/len(testFileList)))
The final line prints the error rate, which is roughly 1.2%.
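The file conversion and the label-from-filename convention described in this section can be checked without the real data set. The sketch below writes one synthetic 32*32 file to a temporary directory and reads it back. The helper names load_digit and label_from_name, the file name 1_0.txt, and the grid contents are all hypothetical and only serve the illustration; unlike imgToVect, load_digit returns a flat vector of length 1024.

```python
import os
import tempfile
import numpy as np

def load_digit(path):
    # Read a 32*32 grid of '0'/'1' characters into a flat vector of length 1024
    with open(path) as fr:
        rows = [[int(ch) for ch in line.strip()[:32]] for line in fr if line.strip()]
    return np.array(rows[:32], dtype=float).reshape(-1)

def label_from_name(filename):
    # '0_1.txt' -> 0: the digit is the part before the first underscore
    return int(filename.split('_')[0])

# Synthetic file (made-up content): a vertical bar in column 16 of every row
grid = ["0" * 16 + "1" + "0" * 15 for _ in range(32)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1_0.txt")
    with open(path, "w") as fw:
        fw.write("\n".join(grid) + "\n")
    vec = load_digit(path)
    label = label_from_name(os.path.basename(path))

print(vec.shape, int(vec.sum()), label)  # (1024,) 32 1
```

A check like this is a cheap way to catch off-by-one mistakes in the row and column indexing before running the full training loop.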
Using sklearn's neighbors.KNeighborsClassifier
from sklearn import neighbors
neigh = neighbors.KNeighborsClassifier(n_neighbors = 1)
def hendWrite1():
    hwlabels = []
    trainingFileList = os.listdir('trainingDigits')
    m = len(trainingFileList)
    trainMat = np.zeros((m, 1024))
    for i in range(m):
        filename = trainingFileList[i]
        trainIndex = filename.split('_')[0]
        hwlabels.append(int(trainIndex))
        #print('trainingDigits/%s'%i)
        trainMat[i, :] = imgToVect('trainingDigits/%s' % filename)
    neigh.fit(trainMat, hwlabels)
    testFileList = os.listdir('testDigits')
    errorNum = 0.0
    m = len(testFileList)
    for i in range(m):
        fileName = testFileList[i].split('_')[0]
        testMat = imgToVect('testDigits/%s' % testFileList[i])
        #print(testMat)
        # Classify with sklearn's k-NN; predict returns an array, so take its only element
        resultData = neigh.predict(testMat)[0]
        if resultData != int(fileName):
            errorNum += 1
            print('the real:%s,the classify is %s' % (fileName, resultData))
    print('the error rate is %f' % (errorNum/len(testFileList)))

hendWrite1()
The resulting error rate is again roughly 1.2%.
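KNeighborsClassifier can also classify a whole test matrix in one call, and its score method returns the mean accuracy, so the error rate is one minus that value. The following is a minimal sketch on made-up toy data rather than the digit files; the arrays and the choice of n_neighbors=3 are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): two small clusters with labels 1 and 0
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array([1, 1, 0, 0])
X_test = np.array([[0.1, 0.2], [0.9, 1.2]])
y_test = np.array([0, 1])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)                     # classifies every test row at once
error_rate = 1.0 - clf.score(X_test, y_test)   # score() is the mean accuracy
print(pred, error_rate)
```

On this toy data both test points are classified correctly, so the error rate is 0.0. Applied to the digits, the same pattern would mean stacking all test vectors into one matrix instead of calling predict once per file.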