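When using a k-NN model, we need the Euclidean distance from every point in the test set to every point in the training set, that is, the pairwise Euclidean distances between the rows of two matrices. A k-NN implementation typically computes these in one of three ways: with two loops, with one loop, or with no loops.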
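Using two loops
Loop over both the test set and the training set, compute the Euclidean distance between each pair of points, and store it in the dists matrix. This version is not optimized in any way.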
def compute_distances_two_loops(self, X):
    # np is NumPy (import numpy as np); self.X_train holds the training data.
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):  # range replaces Python 2's xrange
        for j in range(num_train):
            #####################################################################
            # TODO:
            # Compute the l2 distance between the ith test point and the jth
            # training point, and store the result in dists[i, j]. You should
            # not use a loop over dimension.
            #####################################################################
            dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
            #####################################################################
            # END OF YOUR CODE
            #####################################################################
    return dists
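To make the two-loop method easy to try outside the class, here is a minimal standalone sketch. The function name `pairwise_distances_two_loops` and the toy data are assumptions made for illustration and are not part of the original assignment code.

```python
import numpy as np

def pairwise_distances_two_loops(X_test, X_train):
    """Hypothetical standalone version of the two-loop method."""
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Difference vector between test point i and training point j.
            diff = X_test[i] - X_train[j]
            dists[i, j] = np.sqrt(np.sum(diff ** 2))
    return dists

# Toy data chosen so the distances come from 3-4-5 right triangles.
X_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
X_test = np.array([[3.0, 4.0], [0.0, 0.0]])
dists = pairwise_distances_two_loops(X_test, X_train)
print(dists)  # rows: [5, 0, 5] and [0, 5, 10]
```

The result has one row per test point and one column per training point, which is the layout the later versions must reproduce.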
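Using one loop
By keeping the training set as a matrix and computing the distances from one test point to all training points at once, the algorithm can be reduced to a single loop.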
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):  # range replaces Python 2's xrange
        #######################################################################
        # TODO:
        # Compute the l2 distance between the ith test point and all training
        # points, and store the result in dists[i, :].
        #######################################################################
        # X[i] is broadcast against every row of self.X_train.
        dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
        #######################################################################
        # END OF YOUR CODE
        #######################################################################
    return dists
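The single-loop version relies on NumPy broadcasting: subtracting a vector of shape `(D,)` from a matrix of shape `(N, D)` subtracts it from every row. The following small sketch, with toy data assumed purely for illustration, shows the shapes involved in one iteration of the loop.

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # shape (3, 2)
x = np.array([3.0, 4.0])                                   # one test point, shape (2,)

# Broadcasting subtracts x from every row of X_train; the result has shape (3, 2).
diff = X_train - x

# Summing the squared differences over axis=1 collapses the feature dimension,
# leaving one distance per training point, shape (3,).
row = np.sqrt(np.sum(diff ** 2, axis=1))
print(diff.shape, row)
```

Each iteration therefore fills one whole row of the distance matrix.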
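Using no loops
The most efficient approach represents both the training set and the test set as matrices and replaces the loops above with matrix operations. This requires being very comfortable with the rules of matrix arithmetic. The rest of this note focuses on how to compute the Euclidean distances between two matrices.
Let the test-set matrix $P$ have size $M \times D$ and the training-set matrix $C$ have size $N \times D$ (the test set contains $M$ points and the training set contains $N$ points, each a $D$-dimensional feature vector).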
Let $P_i$ denote the $i$-th row of $P$ and $C_j$ the $j$-th row of $C$:

$$P_i = \begin{pmatrix} P_{i1} & P_{i2} & \cdots & P_{iD} \end{pmatrix}$$

$$C_j = \begin{pmatrix} C_{j1} & C_{j2} & \cdots & C_{jD} \end{pmatrix}$$
First, compute the distance dist(i, j) between $P_i$ and $C_j$:
$$
\begin{aligned}
d(P_i, C_j) &= \sqrt{(P_{i1} - C_{j1})^2 + (P_{i2} - C_{j2})^2 + \cdots + (P_{iD} - C_{jD})^2} \\
&= \sqrt{(P_{i1}^2 + P_{i2}^2 + \cdots + P_{iD}^2) + (C_{j1}^2 + C_{j2}^2 + \cdots + C_{jD}^2) - 2\,(P_{i1}C_{j1} + P_{i2}C_{j2} + \cdots + P_{iD}C_{jD})} \\
&= \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2\,P_i C_j^T}
\end{aligned}
$$
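The identity above can be checked numerically for a single pair of vectors. This is only a quick sanity check with randomly generated data; the variable names `p`, `c`, `direct`, and `expanded` are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(5)  # plays the role of P_i
c = rng.standard_normal(5)  # plays the role of C_j

# Left-hand side: distance computed directly from the difference vector.
direct = np.sqrt(np.sum((p - c) ** 2))

# Right-hand side: ||p||^2 + ||c||^2 - 2 * p . c, then the square root.
expanded = np.sqrt(p @ p + c @ c - 2.0 * (p @ c))

print(np.isclose(direct, expanded))
```

The two values agree up to floating-point rounding, which is what allows the squared norms and the dot product to be computed separately.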
We can generalize this to a formula for the $i$-th row of the distance matrix:
$$
\begin{aligned}
\mathrm{dist}[i] &= \sqrt{\begin{pmatrix} \|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2 \end{pmatrix} + \begin{pmatrix} \|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \end{pmatrix} - 2\,P_i \begin{pmatrix} C_1^T & C_2^T & \cdots & C_N^T \end{pmatrix}} \\
&= \sqrt{\begin{pmatrix} \|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2 \end{pmatrix} + \begin{pmatrix} \|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \end{pmatrix} - 2\,P_i C^T}
\end{aligned}
$$
Generalizing once more gives the formula for the entire distance matrix (the square root is taken element-wise):
$$
\mathrm{dist} = \sqrt{
\begin{pmatrix}
\|P_1\|^2 & \|P_1\|^2 & \cdots & \|P_1\|^2 \\
\|P_2\|^2 & \|P_2\|^2 & \cdots & \|P_2\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|P_M\|^2 & \|P_M\|^2 & \cdots & \|P_M\|^2
\end{pmatrix}
+
\begin{pmatrix}
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2
\end{pmatrix}
- 2\,P C^T
}
$$
Expressed as Python code:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:
    # Compute the l2 distance between all test points and all training
    # points without using any explicit loops, and store the result in
    # dists.
    #
    # You should implement this function using only basic array operations;
    # in particular you should not use functions from scipy.
    #
    # HINT: Try to formulate the l2 distance using matrix multiplication
    #       and two broadcast sums.
    #########################################################################
    test_sq = np.transpose([np.sum(np.square(X), axis=1)])  # (num_test, 1)
    train_sq = np.sum(np.square(self.X_train), axis=1)      # (num_train,)
    cross = np.dot(X, self.X_train.T)                       # (num_test, num_train)
    dists = np.sqrt(-2 * cross + train_sq + test_sq)
    #########################################################################
    # END OF YOUR CODE
    #########################################################################
    return dists
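As a check on the derivation, the sketch below implements the matrix formula as a standalone function and compares it with a brute-force reference. The function name `pairwise_distances_no_loops` and the random data are assumptions for illustration. The `np.maximum(..., 0.0)` clamp is an addition that is not in the code above: rounding error can make a squared distance slightly negative (for example when a test point coincides with a training point), and `np.sqrt` would then return NaN.

```python
import numpy as np

def pairwise_distances_no_loops(P, C):
    """Hypothetical standalone version of the no-loop method."""
    p_sq = np.sum(P ** 2, axis=1, keepdims=True)  # (M, 1): ||P_i||^2 as a column
    c_sq = np.sum(C ** 2, axis=1)                 # (N,):   ||C_j||^2 as a row
    # Broadcasting (M, 1) + (N,) builds the two repeated-norm matrices implicitly.
    sq = p_sq + c_sq - 2.0 * (P @ C.T)            # (M, N) squared distances
    # Clamp tiny negative values caused by floating-point rounding (an addition).
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 3))  # M = 4 test points, D = 3 features
C = rng.standard_normal((6, 3))  # N = 6 training points

fast = pairwise_distances_no_loops(P, C)
# Brute-force reference: materialize every difference vector, shape (M, N, D).
slow = np.sqrt(np.sum((P[:, None, :] - C[None, :, :]) ** 2, axis=2))
print(fast.shape, np.allclose(fast, slow))
```

The brute-force reference allocates an array of shape `(M, N, D)`, whereas the matrix formulation only ever holds `(M, N)` intermediates, which is one reason it scales better.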