Algorithm Description
The input of the k-nearest-neighbour (k-NN) algorithm is the feature vector of an instance, corresponding to a point in feature space; the output is the class of the instance. The k-NN method assumes a training data set in which the class of every instance is already determined. For a new instance, it predicts the class by voting among the classes of the instance's k nearest neighbours.
3.1 The k-Nearest-Neighbour Algorithm
Algorithm 3.1
Input: a training data set $T$ and the feature vector $x$ of an instance, where
$T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$,
$x_i\in X\subseteq \mathbb{R}^n$ is the feature vector of an instance, $y_i\in Y=\{c_1,c_2,\dots,c_K\}$ is its class, $i=1,2,\dots,N$; $x=(x^{(1)},x^{(2)},\dots,x^{(M)})$, where $x^{(i)}$ is the i-th component of the feature vector and M is the number of components;
Output: the class $y$ of the instance $x$.
(1) Using the given distance metric, find the k points in the training set $T$ nearest to $x$; the neighbourhood of $x$ covering these k points is denoted $N_k(x)$.
(2) Decide the class $y$ of $x$ from $N_k(x)$ according to the classification decision rule:
$$y=\arg\max_{c_j}\sum_{x_i\in N_k(x)} I(y_i=c_j),\quad i=1,2,\dots,N;\ j=1,2,\dots,K$$
where $I$ is the indicator function: $I(y_i=c_j)=1$ when $y_i=c_j$ and 0 otherwise.
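The two steps above can be sketched as a minimal brute-force classifier (the function name knn_predict and the use of Euclidean distance are my own choices, not fixed by the text):

```python
import numpy as np
from collections import Counter

def knn_predict(T_x, T_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training instance
    dists = np.linalg.norm(T_x - x, axis=1)
    # indices of the k nearest neighbours, i.e. N_k(x)
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    return Counter(T_y[i] for i in nearest).most_common(1)[0][0]

T_x = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
T_y = ['a', 'a', 'b', 'b']
print(knn_predict(T_x, T_y, np.array([1.1, 0.9]), k=3))  # → a
```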
3.2 The k-Nearest-Neighbour Model
3.2.1 Model
3.2.2 Distance
The distance between two instances in feature space reflects how similar they are. The feature space of the k-NN model is generally the n-dimensional real vector space $\mathbb{R}^n$. The distance used is generally the Euclidean distance, but the more general $L_p$ distance (Minkowski distance) can also be used.
Let the feature space $X$ be the n-dimensional real vector space $\mathbb{R}^n$, and let $x_i,x_j\in X$ with $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})$ and $x_j=(x_j^{(1)},x_j^{(2)},\dots,x_j^{(n)})$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^p\right)^{\frac{1}{p}}$$
For p = 2 it is the Euclidean distance:
$$L_2(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^2\right)^{\frac{1}{2}}$$
For p = 1 it is the Manhattan distance:
$$L_1(x_i,x_j)=\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|$$
As $p\rightarrow\infty$, it is the maximum of the coordinate-wise differences:
$$L_{\infty}(x_i,x_j)=\max_l \left|x_i^{(l)}-x_j^{(l)}\right|,\quad l=1,2,\dots,n$$
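These three special cases can be checked numerically with a small helper (lp_distance is a name chosen here, not from the text):

```python
import numpy as np

def lp_distance(xi, xj, p):
    """Minkowski (L_p) distance between two vectors; p=np.inf gives L_inf."""
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if np.isinf(p):
        return diff.max()           # L_inf: largest coordinate difference
    return (diff ** p).sum() ** (1.0 / p)

xi, xj = (1.0, 1.0), (4.0, 5.0)     # coordinate differences: 3 and 4
print(lp_distance(xi, xj, 2))       # Euclidean:  5.0
print(lp_distance(xi, xj, 1))       # Manhattan:  7.0
print(lp_distance(xi, xj, np.inf))  # Chebyshev:  4.0
```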
3.2.3 Choice of k
A small k makes the model more complex and sensitive to noise (prone to overfitting); a large k makes the model simpler but can blur class boundaries. In practice an optimal k is usually selected by cross-validation.
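The selection procedure can be sketched as follows (a minimal version using leave-one-out cross-validation on synthetic two-cluster data; all names here are my own, not from the text):

```python
import numpy as np
from collections import Counter

def knn_vote(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    idx = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    return Counter(train_y[j] for j in idx).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out cross-validation accuracy of k-NN for a given k."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                    # hold out point i
        rest_y = [lab for j, lab in enumerate(y) if j != i]
        if knn_vote(X[mask], rest_y, X[i], k) == y[i]:
            hits += 1
    return hits / len(X)

# two synthetic, well-separated clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 4])
y = ['a'] * 20 + ['b'] * 20

# keep the candidate k with the highest cross-validated accuracy
best_k = max((1, 3, 5, 7), key=lambda k: loo_accuracy(X, y, k))
print(best_k)
```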
3.2.4 Classification Decision Rule
The decision rule is usually majority voting: the input instance is assigned the class that occurs most often among its k nearest neighbours.
3.3 Implementing k-NN: the kd Tree
3.3.1 Constructing a kd Tree
Algorithm 3.2: constructing a balanced kd tree
Input: a data set $T=\{x_1,x_2,\dots,x_N\}$ in k-dimensional space, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})$.
Output: a kd tree.
(1) Begin by constructing the root node, which corresponds to the hyper-rectangle of the k-dimensional space containing $T$.
Select a dimension $x^{(l)}$ (typically the dimensions are cycled through as depth increases, or the dimension of greatest variance is chosen, as in the code below). Taking the median of the $x^{(l)}$ coordinates of all instances in $T$ as the split point, partition the hyper-rectangle into two subregions.
The root generates two children of depth 1: the left child corresponds to the subregion whose points have $x^{(l)}$ coordinates smaller than the split point's, the right child to the subregion whose points have $x^{(l)}$ coordinates greater than or equal to it.
Instances lying on the splitting hyperplane are stored at the node itself.
(2) Repeat step (1) on each subregion until no instances remain in the subregions.
3.3.2 Searching a kd Tree
Algorithm 3.3: nearest-neighbour search with a kd tree
Input: a constructed kd tree; a target point x;
Output: the nearest neighbour of x.
(1) Find the leaf node (region) containing the target point: starting from the root, recursively visit the children; if the coordinate of x along the split dimension is smaller than the split point's, move to the left child, otherwise to the right child, until a leaf is reached.
(2) Take this leaf node as the "current nearest point".
(3) Recursively back up; at each node:
(a) If the instance stored at the node is closer to the target point than the current nearest point, make it the current nearest point.
(b) The current nearest point necessarily lies in the region of one child of this node. Check whether the region of the other child contains a closer point; concretely, check whether that region intersects the hypersphere centred at the target point whose radius is the distance between the target point and the current nearest point.
If they intersect, a closer point may exist in the other child's region: move to that child and continue the nearest-neighbour search recursively;
if they do not intersect, keep backing up.
(4) The search ends when the root is reached; the final "current nearest point" is the nearest neighbour of x.
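Steps (1)–(4) above can be sketched with a small self-contained kd tree (my own minimal version: nodes are plain tuples and the split axis simply cycles with depth rather than following a variance rule):

```python
import numpy as np

def build(points, depth=0):
    """Build a kd-tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])              # cycle the split axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # median along this axis
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or (np.linalg.norm(np.subtract(point, target))
                        < np.linalg.norm(np.subtract(best, target))):
        best = point                           # step (3a): update best
    axis = depth % len(target)
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    best = nearest(near, target, depth + 1, best)   # step (1): descend
    # step (3b): does the splitting plane cut the current best hypersphere?
    if abs(target[axis] - point[axis]) < np.linalg.norm(
            np.subtract(best, target)):
        best = nearest(far, target, depth + 1, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))   # → (8, 1)
```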
Code
The following code was tested under Python 3.
(1) Building the kd tree
A figure first; the result is quite interesting.
The input here is a set of 2-dimensional vectors; higher dimensions are also supported, as the code adapts to the dimensionality.
import numpy as np
import matplotlib.pyplot as plt
import copy
import math

"""
X, feature vectors
Y, class of X
D, dimension of each of the vectors.
"""

# Construct the initial data to be classified
D = 2
NUM = 50
C = [ 'g', 'r', 'b' ]
#X = np.array([ (3,5), (2,4), (1,1), (5,2), (1,5), (4,1) ])
X = np.random.rand(NUM, D)
Y = [ C[i] for i in np.random.randint(0, len(C), NUM) ]
class KD_Node:
    cur_trav = None         # cursor for traversal
    x_min = 0
    x_max = 1
    y_min = 0
    y_max = 1

    def __init__(self,
                 point=None, split=None, color=None,
                 L=None, R=None, father=None,
                 scope=None):
        """
        Initiate a kd-tree node.
        point:  datum of this node
        split:  dimension along which this node splits
        L:      left son
        R:      right son
        father: father of this node; None for the root
        scope:  rectangular area covered by this node
        """
        self.point = point
        self.split = split
        self.color = color
        self.left = L
        self.right = R
        self.father = father
        self.flag_trav = 0      # traversal flags:
                                #   bit 0: this node visited
                                #   bit 1: left subtree visited
                                #   bit 2: right subtree visited
        # use None as default to avoid a shared mutable default argument
        self.scope = scope if scope is not None else {}
                                # paint scope:
                                #   x0/x1: min/max of x
                                #   y0/y1: min/max of y

    def clear_trav(self):
        KD_Node.cur_trav = None
        self.flag_trav = 0
        if self.left:
            self.left.clear_trav()
        if self.right:
            self.right.clear_trav()

    def __iter__(self):
        return self

    def __next__(self):
        # traverse the tree without recursion
        cursor = None
        if KD_Node.cur_trav is None:    # first call: start from this node
            KD_Node.cur_trav = self
        cursor = KD_Node.cur_trav
        while 1:
            if cursor.flag_trav & 0X07 == 0X07:  # all three bits set:
                                                 # this subtree is finished
                if cursor.father is None:
                    raise StopIteration
                else:
                    cursor = cursor.father
            elif cursor.flag_trav & 0X01 == 0:   # bit 0 unset:
                cursor.flag_trav |= 0X01         # mark this node visited
                break                            # BREAK! return current
            elif cursor.flag_trav & 0X02 == 0:   # bit 1 unset:
                cursor.flag_trav |= 0X02         # descend into the left son
                if cursor.left is not None:
                    cursor = cursor.left
                else:                            # no left son, skip
                    continue
            elif cursor.flag_trav & 0X04 == 0:   # bit 2 unset:
                cursor.flag_trav |= 0X04         # descend into the right son
                if cursor.right is not None:
                    cursor = cursor.right
                else:
                    continue
        KD_Node.cur_trav = cursor
        return KD_Node.cur_trav
def CreateKDT(node=None, data=None, color=None, father=None):
    """
    Build a kd-(sub)tree recursively.
    INPUT:  node,   the node to build (None initially)
            data,   array of points, e.g. [ (3,5), (2,4), (1,1) ]
            father, the father node
    OUTPUT: the root node of the built (sub)tree
    """
    global C
    if len(data) > 0:
        var = np.var(data, axis=0)          # variance of each dimension
        split = np.argmax(var)              # split along the dimension
                                            # with the largest variance
        pos = int(len(data) / 2)
        pos_list = np.argpartition(data[:, split], pos)
        point = data[pos_list[pos]]         # median point for this node
        color = C[np.random.randint(0, len(C))]
        cur_scope = {}                      # scope
        if not father:
            # the root covers the whole drawing area
            cur_scope = { 'x0': KD_Node.x_min, 'x1': KD_Node.x_max,
                          'y0': KD_Node.y_min, 'y1': KD_Node.y_max }
        else:                               # shrink the father's scope
            cur_scope = copy.deepcopy(father.scope)
            if father.split == 0:
                if point[0] < father.point[0]:
                    cur_scope['x1'] = father.point[0]
                else:
                    cur_scope['x0'] = father.point[0]
            elif father.split == 1:
                if point[1] < father.point[1]:
                    cur_scope['y1'] = father.point[1]
                else:
                    cur_scope['y0'] = father.point[1]
        node = KD_Node(point=point, split=split, color=color, father=father,
                       scope=cur_scope)
        if len(data[pos_list[:pos]]) != 0:
            node.left = CreateKDT(node=node.left,
                                  data=data[pos_list[:pos]],
                                  color=color,
                                  father=node)
        if len(data[pos_list[(pos+1):]]) != 0:
            node.right = CreateKDT(node=node.right,
                                   data=data[pos_list[(pos+1):]],
                                   color=color,
                                   father=node)
    return node
def get_split_pos(data, split):
    """Return the position of the median along the split dimension."""
    pos = len(data) // 2    # integer division: len(data)/2 would be a float
    return pos

def preorder(node, depth=-1):
    """
    Pre-order traversal of a kd (sub)tree.
    """
    print(node)
    if node:
        if node.left:
            preorder(node.left)
        if node.right:
            preorder(node.right)
def draw_KDT(kd):
    """
    Plot each datum as a point and draw every splitting line.
    """
    x_min = kd.x_min
    x_max = kd.x_max
    y_min = kd.y_min
    y_max = kd.y_max
    plt.figure(figsize=(6, 6))
    plt.xlabel("$x^{(1)}$")
    plt.ylabel("$x^{(2)}$")
    plt.title("Machine Learning: KD Tree")
    plt.xlim(int(x_min), math.ceil(x_max))
    plt.ylim(int(y_min), math.ceil(y_max))
    ax = plt.gca()
    ax.set_aspect(1)
    plt.plot([x_min, x_max, x_max, x_min, x_min],
             [y_min, y_min, y_max, y_max, y_min])
    line_from = []    # endpoints of each splitting line
    line_to = []
    for node in kd:
        if node.split == 0:
            line_from = [node.point[0], node.scope['y0']]
            line_to = [node.point[0], node.scope['y1']]
        if node.split == 1:
            line_from = [node.scope['x0'], node.point[1]]
            line_to = [node.scope['x1'], node.point[1]]
        plt.plot([line_from[0], line_to[0]],
                 [line_from[1], line_to[1]],
                 'k-', linewidth=1)
        plt.scatter(node.point[0], node.point[1], color=node.color)
    plt.show()
def find_knn(root, x):
    # nearest-neighbour search is not implemented yet
    pass

def main():
    kd = None
    kd = CreateKDT(kd, X)
    #kd.clear_trav()
    draw_KDT(kd)

if __name__ == "__main__":
    main()
Reference:
[1] http://blog.csdn.net/u010551621/article/details/44813299