k近邻法之kd树

最新推荐文章于 2024-02-22 16:47:31 发布

Godxv

最新推荐文章于 2024-02-22 16:47:31 发布

阅读量525

点赞数

分类专栏：机器学习统计学习方法 Python 算法文章标签： KNN 统计学习方法 k近邻

算法同时被 3 个专栏收录

8 篇文章 0 订阅

订阅专栏

机器学习

7 篇文章 0 订阅

订阅专栏

统计学习方法

7 篇文章 0 订阅

订阅专栏

k近邻算法

给定一个训练数据集，对新的输入实例，在训练数据集中找到跟它最近的k个实例，根据这k个实例的类判断它自己的类（一般采用多数表决的方法）。

这里写图片描述

k近邻模型

模型有3个要素——距离度量方法、k值的选择和分类决策规则。

模型

当3要素确定的时候，对任何实例（训练或输入），它所属的类都是确定的，相当于将特征空间分为一些子空间。

这里写图片描述

距离度量

对n维实数向量空间 $R^n$ ，经常用 $L_p$ 距离或曼哈顿Minkowski距离。

$L_p$ 距离定义如下：

这里写图片描述

当p=2时，称为欧氏距离：

这里写图片描述

当p=1时，称为曼哈顿距离：

这里写图片描述

当p=∞，它是各个坐标距离的最大值，即：

这里写图片描述

用图表示如下：

这里写图片描述

k值的选择

k较小，容易被噪声影响，发生过拟合。

k较大，较远的训练实例也会对预测起作用，容易发生错误。

分类决策规则

使用0-1损失函数衡量，那么误分类率是：

这里写图片描述

Nk是近邻集合，要使左边最小，右边的

这里写图片描述

必须最大，所以多数表决=经验最小化。

k近邻法的实现：kd树

算法核心在于怎么快速搜索k个近邻出来，朴素做法是线性扫描，不可取，这里介绍的方法是kd树。

构造kd树

对数据集T中的子集S初始化S=T，取当前节点node=root取维数的序数i=0，对S递归执行：

找出S的第i维的中位数对应的点，通过该点，且垂直于第i维坐标轴做一个超平面。该点加入node的子节点。该超平面将空间分为两个部分，对这两个部分分别重复此操作（S=S’，++i，node=current），直到不可再分。

例子

这里写图片描述

Python代码

短短几行即可搞定：

T = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
class node:
def __init__(self, point):
self.left = None
self.right = None
self.point = point
pass
def median(lst):
m = len(lst) / 2
return lst[m], m
def build_kdtree(data, d):
data = sorted(data, key=lambda x: x[d])
p, m = median(data)
tree = node(p)
del data[m]
print data, p
if m > 0:tree.left = build_kdtree(data[:m], not d)
if len(data) > 1:tree.right = build_kdtree(data[m:], not d)
return tree
kd_tree =build_kdtree(T, 0)
print kd_tree

可视化

可视化的话则要费点功夫保存中间结果，并恰当地展示出来

# -*-coding:utf-8 -*-
#Filename: kdtree.py
# Author: hankcs
# Date:2015/2/4 15:01
import copy
import itertools
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib import animation
T = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
def draw_point(data):
X, Y = [], []
for p in data:
X.append(p[0])
Y.append(p[1])
plt.plot(X, Y, 'bo')
def draw_line(xy_list):
for xy in xy_list:
x, y = xy
plt.plot(x, y, 'g', lw=2)
def draw_square(square_list):
currentAxis = plt.gca()
colors = itertools.cycle(["r", "b", "g", "c", "m", "y", '#EB70AA', '#0099FF'])
for square in square_list:
currentAxis.add_patch(
Rectangle((square[0][0], square[0][1]), square[1][0] - square[0][0], square[1][1] - square[0][1],
color=next(colors)))
def median(lst):
m = len(lst) / 2
return lst[m], m
history_quare= []
def build_kdtree(data, d, square):
history_quare.append(square)
data = sorted(data, key=lambda x: x[d])
p, m = median(data)
del data[m]
print data, p
if m >= 0:
sub_square = copy.deepcopy(square)
if d == 0:
sub_square[1][0] = p[0]
else:
sub_square[1][1] = p[1]
history_quare.append(sub_square)
if m > 0:build_kdtree(data[:m], not d,sub_square)
if len(data) > 1:
sub_square = copy.deepcopy(square)
if d == 0:
sub_square[0][0] = p[0]
else:
sub_square[0][1] = p[1]
build_kdtree(data[m:], not d,sub_square)
build_kdtree(T, 0, [[0, 0], [10, 10]])
print history_quare
# drawan animation to show how it works, the data comes from history
# firstset up the figure, the axis, and the plot element we want to animate
fig =plt.figure()
ax =plt.axes(xlim=(0, 2), ylim=(-2, 2))
line, = ax.plot([], [], 'g', lw=2)
label = ax.text([], [], '')
#initialization function: plot the background of each frame
def init():
plt.axis([0, 10, 0, 10])
plt.grid(True)
plt.xlabel('x_1')
plt.ylabel('x_2')
plt.title('build kd tree (www.hankcs.com)')
draw_point(T)
currentAxis = plt.gca()
colors =itertools.cycle(["#FF6633", "g", "#3366FF", "c", "m", "y", '#EB70AA', '#0099FF', '#66FFFF'])
#animation function. this is called sequentially
def animate(i):
square = history_quare[i]
currentAxis.add_patch(
Rectangle((square[0][0], square[0][1]), square[1][0] - square[0][0], square[1][1] - square[0][1], color=next(colors)))
return
# callthe animator. blit=true means only re-draw the parts that havechanged.
anim = animation.FuncAnimation(fig, animate, init_func = init, frames = len(history_quare), interval = 1000, repeat = False, blit = False)
plt.show()
anim.save('kdtree_build.gif', fps=2, writer='imagemagick')