How it works
KNN
- K-nearest-neighbors: the K-nearest-neighbor algorithm.
- KNN finds the K nearest neighbors of the point to be predicted in feature space, then decides that point's label from the labels of those K neighbors. This is the KNN, or K-nearest-neighbor, method.
- Take the figure below as an example:
- The green point is the one to be predicted. With K=3, two of the three neighbors are red and one is blue, so the green point is labeled red; with K=5, two of the five neighbors are red and three are blue, so it is labeled blue. KNN classification therefore depends heavily on the choice of K, the number of nearest neighbors.
- Neighbor-based classifiers are especially well suited to data with irregular class boundaries.
- A large K makes the model less sensitive to noise, but a large K also tends to blur class boundaries.
- How, then, is the distance between two points computed?
- For distance metrics, see the link: Distance calculation
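The distances used below (Manhattan, Euclidean, and the general Minkowski metric) can be sketched in a few lines; the `minkowski` helper here is our own illustration, not a sklearn function:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```

For p=1 this reduces to the sum of absolute coordinate differences, and for p=2 to the familiar straight-line distance.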
KNN in sklearn
Module
- from sklearn.neighbors import KNeighborsClassifier
Signature
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
Parameters:
The parameters that most often need tuning are:
n_neighbors, weights
==========================================
n_neighbors: default 5; this is the K in KNN.
weights: how neighboring points are weighted.
uniform: every neighbor carries the same weight;
distance: closer neighbors carry more weight, farther neighbors less;
callable: a user-defined function that takes an array of distances and returns an array of weights of the same shape.
algorithm: the algorithm used to find the nearest neighbors.
auto: choose automatically based on the input data;
ball_tree: use a ball tree;
kd_tree: use a k-d tree;
brute: use brute-force search;
leaf_size: leaf size passed to BallTree or KDTree (default 30); it affects construction and query speed as well as memory usage.
p: the power parameter of the Minkowski metric.
p=1: Manhattan distance (manhattan_distance);
p=2: Euclidean distance (euclidean_distance);
other values of p: the general Minkowski distance.
metric: a string or a callable; defaults to 'minkowski', the Minkowski metric.
metric_params: additional keyword arguments for the metric.
n_jobs: default 1; the number of parallel jobs used for the neighbor search. A value of -1 means one job per CPU core. This setting does not affect fit(), only predict(). In practice, with n_jobs=4 a single CPU runs at 100% during fit(), while four CPUs run at 100% during predict().
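As a sketch of how `n_neighbors` and `weights` interact, the following toy example (our own made-up data, not the MNIST experiment below) shows `uniform` and `distance` weighting disagreeing on the same query point, because the lone class-0 neighbor is much closer than the two class-1 neighbors:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [6], [7]]  # one class-0 point, two class-1 points
y = [0, 1, 1]
q = [[1]]            # query point, very close to the class-0 point

uni = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
dis = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

print(uni.predict(q)[0])  # 1: two of the three neighbors are class 1
print(dis.predict(q)[0])  # 0: the class-0 neighbor (distance 1) outweighs
                          #    both class-1 neighbors (distances 5 and 6)
```

With distance weighting, each neighbor's vote is scaled by the inverse of its distance, so 1/1 beats 1/5 + 1/6.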
Code example
- The following code evaluates KNN on the MNIST dataset. From the test results:
- uniform and distance weights perform about the same: 0.97 accuracy for uniform versus 0.96 for distance;
- K = 1, 3, 5, and 7 all give similar accuracy, in the 0.96-0.97 range;
- computing on raw grayscale values and on binarized pixels also gives similar results: 0.97 for grayscale versus 0.96 after binarization.
Conclusion: that KNN performs this well suggests the classes in this dataset are inherently well separated, with relatively clean boundaries; that is exactly the setting in which KNN classification works well.
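The binarization mentioned in the last bullet is just a threshold on each pixel; a minimal sketch (the `binarize` helper is illustrative, equivalent in spirit to the `make_pixels_01` function in the full code below):

```python
def binarize(image, threshold=0):
    """Map every grayscale pixel above the threshold to 1, the rest to 0."""
    return [0 if p <= threshold else 1 for p in image]

print(binarize([0, 12, 255, 0, 3]))  # [0, 1, 1, 0, 1]
```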
Test results:
/usr/bin/python2.7 /root/PycharmProjects/ml3m/knn.py /home/gx/code/ml/mnistDataUnzipped
Parameters: 2
file path: /home/gx/code/ml/mnistDataUnzipped
====> Step 1: Loading images
magic_num: 2051 image_num: 60000 row_num: 28 column_num: 28
Total num of images is: 60000
label_magic_num: 2049 label_num: 60000
Total num of labels is: 60000
magic_num: 2051 image_num: 10000 row_num: 28 column_num: 28
Total num of images is: 10000
label_magic_num: 2049 label_num: 10000
Total num of labels is: 10000
====> Start training
test_knn is running...
----> KNN Classifier: neighbor num: 3 weights: distance
Training time: 47.7752809525
Predicting time: 257.274169922
precision recall f1-score support
0 0.99 0.97 0.98 1001
1 1.00 0.97 0.98 1173
2 0.97 0.98 0.98 1013
3 0.97 0.97 0.97 1009
4 0.97 0.98 0.97 971
5 0.96 0.96 0.96 892
6 0.99 0.98 0.98 962
7 0.97 0.96 0.96 1033
8 0.95 0.99 0.97 935
9 0.96 0.96 0.96 1011
avg / total 0.97 0.97 0.97 10000
====> Training done
Process finished with exit code 0
# -*- coding: utf-8 -*-  # so that Python accepts non-ASCII characters in the source
# Uses Python 2.7 syntax
import sys
import struct
import time
from sklearn import neighbors  # KNeighborsClassifier, RadiusNeighborsClassifier, ...
from sklearn import metrics

mnist_file_list = ['/train-images-idx3-ubyte',  # training images
                   '/train-labels-idx1-ubyte',  # training labels
                   '/t10k-images-idx3-ubyte',   # test images
                   '/t10k-labels-idx1-ubyte']   # test labels

# Read the command-line arguments
argCount = len(sys.argv)
print "Parameters: ", argCount    # number of arguments
print "file path: ", sys.argv[1]  # path to the dataset directory

# Build the full paths to the dataset files
train_images_path = sys.argv[1] + mnist_file_list[0]
train_label_path = sys.argv[1] + mnist_file_list[1]
test_images_path = sys.argv[1] + mnist_file_list[2]
test_label_path = sys.argv[1] + mnist_file_list[3]
#print "Train images path: ", train_images_path
#print "Train label path: ", train_label_path
#print "Test images path: ", test_images_path
#print "Test label path: ", test_label_path
# Read an IDX image file and its matching label file
def read_images(image_path, label_path):
    # Open and read both files; the with-blocks close the handles automatically,
    # so no explicit close() is needed
    with open(image_path, 'rb') as handle:
        buf = handle.read()
    with open(label_path, 'rb') as handle_label:
        buf_label = handle_label.read()
    # Parse the image-file header: four 32-bit integers, '>' means big-endian
    format_str = '>IIII'
    index = 0
    magic_num, image_num, row_num, column_num = struct.unpack_from(format_str, buf, index)
    print "magic_num:", magic_num, " image_num: ", image_num, " row_num: ", row_num, "column_num: ", column_num
    image_pixels_num = row_num * column_num
    index += struct.calcsize(format_str)  # size in bytes of the packed header
    images = []  # one entry per image
    # Read every image as a flat tuple of unsigned bytes
    image_format_str = str(image_pixels_num) + 'B'
    for i in range(image_num):
        image = struct.unpack_from(image_format_str, buf, index)
        index += struct.calcsize(image_format_str)
        images.append(image)
    print "Total num of images is: ", len(images)
    # Parse the label file: a two-integer header followed by one byte per label
    format_str = '>II'
    index = 0
    label_magic_num, label_num = struct.unpack_from(format_str, buf_label, index)
    print "label_magic_num:", label_magic_num, " label_num: ", label_num
    label_format_str = 'B'
    index += struct.calcsize(format_str)
    labels = []
    for i in range(label_num):
        label = struct.unpack_from(label_format_str, buf_label, index)
        index += struct.calcsize(label_format_str)
        labels.append(label[0])
    print "Total num of labels is: ", len(labels)
    ret = {"magic_num": magic_num, "image_num": image_num, "row_num": row_num, "column_num": column_num, 'image_data': images, 'labels': labels}
    return ret
# Print one image as ASCII art
def showImage(image, label, rows, columns):
    # Note: Python is strict about layout; one extra tab here raises
    # "IndentationError: unexpected indent"
    print "==== ==== ===="
    for j in range(rows):
        for k in range(columns):
            offset = j * columns + k  # row-major index into the flat pixel tuple
            value = image[offset]
            if value == 0:
                print ' ',
            else:
                print '1',  # the trailing comma suppresses the newline
        print ""
    print ""
    print '==== ==== ===='
    print 'Label is: ', label
print "====> Step 1: Loading images"
train_data = read_images(train_images_path, train_label_path)  # training set
test_data = read_images(test_images_path, test_label_path)     # test set
#print train_data['magic_num'], train_data['image_num'], train_data['row_num'], train_data['column_num'], type(train_data['image_data'])
#index = 195
#showImage(train_data['image_data'][index], train_data['labels'][index], train_data['row_num'], train_data['column_num'])
#index = 444
#showImage(test_data['image_data'][index], test_data['labels'][index], test_data['row_num'], test_data['column_num'])
# Train and evaluate one KNN classifier
def knn(X, Y, Tx, Ty, neighbor_num, weight):
    print "----> KNN Classifier: neighbor num: ", neighbor_num, " weights: ", weight
    knnClassifier = neighbors.KNeighborsClassifier(weights=weight, n_neighbors=neighbor_num, n_jobs=4)
    start = time.time()
    knnClassifier.fit(X, Y)
    end = time.time()
    print "Training time: ", end - start
    start = time.time()
    Z = knnClassifier.predict(Tx)
    end = time.time()
    print "Predicting time: ", end - start
    print metrics.classification_report(Ty, Z)  # argument order is (y_true, y_pred)
    # Show the misclassified samples
    #for i in range(len(Z)):
    #    if Z[i] != Ty[i]:
    #        print Z[i], " - ", Ty[i], " - ", i
def test_knn(X, Y, Tx, Ty):
    print "test_knn is running..."
    knn(X, Y, Tx, Ty, 1, 'distance')  # 1 neighbor
    knn(X, Y, Tx, Ty, 3, 'distance')  # 3 neighbors
    knn(X, Y, Tx, Ty, 5, 'distance')  # 5 neighbors
    #knn(X, Y, Tx, Ty, 7, 'distance') # 7 neighbors
# Binarize pixels: map every nonzero grayscale value to 1
def make_pixels_01(X):
    X_new = []
    for i in range(len(X)):
        old = X[i]
        new_one = []
        for j in range(len(old)):
            if old[j] == 0:
                new_one.append(old[j])
            else:
                new_one.append(1)
        X_new.append(new_one)
    return X_new

print "====> Start training"
# Nearest-neighbor classifier
X = train_data['image_data']  # train_data['image_data']*2 would double the training data
Y = train_data['labels']      # train_data['labels']*2 would double the training labels
Tx = test_data['image_data']
Ty = test_data['labels']
X_01 = make_pixels_01(X)
Tx_01 = make_pixels_01(Tx)
#test_knn(X, Y, Tx, Ty)       # raw grayscale values
test_knn(X_01, Y, Tx_01, Ty)  # binarized pixels
print "====> Training done"