k-Nearest Neighbor (kNN)
The complete assignment code is available in the Github repository.
1. Introduction
The kNN classifier consists of two stages:
- During training, the classifier takes the training data and simply memorizes it (a minimal class sketch follows below);
- During testing, kNN compares each test image against all training images, selects the k most similar training examples, and predicts a label from their labels, thereby classifying each test image;
- The value of k is selected by cross-validation.
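For reference, here is a minimal sketch of the classifier class this assignment builds on. The name and method layout follow the starter code's `KNearestNeighbor` class; the distance and voting methods it dispatches to are filled in throughout section 2.

```python
import numpy as np

class KNearestNeighbor:
    """Sketch of the assignment's kNN classifier: train() only memorizes the data."""

    def train(self, X, y):
        # "Training" a kNN classifier is just remembering the training data.
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        # Pick one of the distance implementations from section 2.1,
        # then vote among the k nearest neighbors (section 2.2).
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        else:
            dists = self.compute_distances_two_loops(X)
        return self.predict_labels(dists, k=k)
```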
2. Implementing the kNN Classifier
We now want to classify the test data with the kNN classifier. This process can be broken into two steps:
- First, compute the distances between all test examples and all training examples;
- Given these distances, for each test example find the k nearest training examples and let them vote on the label.
2.1 Computing Distances
Below is the implementation using two explicit loops (L2 distance):
```python
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            #####################################################################
            # TODO:                                                             #
            # Compute the l2 distance between the ith test point and the jth    #
            # training point, and store the result in dists[i, j]. You should   #
            # not use a loop over dimension, nor use np.linalg.norm().          #
            #####################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists
```
Below is the implementation using a single loop (L2 distance):
Broadcasting is used here, which eliminates one of the loops.
```python
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        #######################################################################
        # TODO:                                                               #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                        #
        # Do not use np.linalg.norm().                                        #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # X[i] -> shape (D,), self.X_train -> shape (num_train, D); broadcasting
        dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists
```
Below is the fully vectorized implementation with no explicit loops (L2 distance):
First, look at a single element of the dists array:
$$
\begin{aligned}
dists[i][j] & =\sqrt{\sum^{D-1}_{k=0}\left(X[i][k]-X\_train[j][k]\right)^2} \\
& =\sqrt{(X[i]-X\_train[j])\,(X[i]-X\_train[j])^T} \\
& =\sqrt{X[i]\,X[i]^T-2\,X[i]\,X\_train[j]^T+X\_train[j]\,X\_train[j]^T},
\end{aligned}
$$
Therefore,
$$
dists=\sqrt{XX^T-2\,X\,X\_train^T+X\_train\,X\_train^T}
$$
where, in the implementation, $XX^T$ stands for the row-wise squared norms of the test points (shape (num_test, 1)), $X\_train\,X\_train^T$ for the row-wise squared norms of the training points (shape (1, num_train)), and the three terms are combined by broadcasting into a (num_test, num_train) matrix, as the shape assertions in the code below check.
```python
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations;#
    # in particular you should not use functions from scipy,               #
    # nor use np.linalg.norm().                                            #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication   #
    # and two broadcast sums.                                              #
    #########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    d1 = np.sum(np.square(X), axis=1, keepdims=True)
    d2 = -2 * np.dot(X, self.X_train.T)
    d3 = np.sum(np.square(self.X_train.T), axis=0, keepdims=True)
    assert d1.shape == (num_test, 1)
    assert d2.shape == (num_test, num_train)
    assert d3.shape == (1, num_train)
    dists = np.sqrt(d1 + d2 + d3)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists
```
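As a sanity check, mirroring what the assignment notebook does, the three implementations should produce essentially identical distance matrices. A sketch of that check, where `classifier` is assumed to be trained and `X_test` to hold the flattened test images:

```python
import numpy as np

dists_two = classifier.compute_distances_two_loops(X_test)
dists_one = classifier.compute_distances_one_loop(X_test)
dists_no = classifier.compute_distances_no_loops(X_test)

# The Frobenius norm of the difference should be (numerically) zero.
print('one loop vs two loops: %f' % np.linalg.norm(dists_two - dists_one, ord='fro'))
print('no loops vs two loops: %f' % np.linalg.norm(dists_two - dists_no, ord='fro'))
```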
2.2 Predicting Labels
Tips:
- The assignment notes state that when the vote among the k nearest neighbors is tied, the smaller label should be chosen. This requirement is already satisfied here: np.bincount() returns an array of occurrence counts for every value from 0 up to the maximum value in the input array, and np.argmax returns the index of the largest count (the first one if several are tied), which is exactly the smaller label (a small example follows).
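A toy illustration of this tie-breaking behavior (the labels below are made up, not assignment data):

```python
import numpy as np

closest_y = np.array([2, 0, 2, 0, 1])  # labels 0 and 2 each get two votes
counts = np.bincount(closest_y)        # [2, 1, 2]: counts[label] = number of votes
print(np.argmax(counts))               # 0 -> the smaller of the tied labels wins
```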
```python
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        closest_y = self.y_train[np.argsort(dists[i])[0:k]].astype(np.int32)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        y_pred[i] = np.argmax(np.bincount(closest_y))

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return y_pred
```
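With the distances and the voting in place, test-set accuracy can be measured the way the notebook does, e.g. for k = 5; `dists`, `classifier`, and the ground-truth labels `y_test` are assumed to already exist:

```python
import numpy as np

y_test_pred = classifier.predict_labels(dists, k=5)

# Fraction of test images whose predicted label matches the ground truth.
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / y_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, y_test.shape[0], accuracy))
```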
3. Cross-Validation
We have implemented the k-Nearest Neighbor classifier, but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.
Tips:
- np.array_split() splits an array into the specified number of folds and returns a list of arrays;
- In each round, one fold is used as the validation set and the remaining folds form the training set. Rebuilding the training set mainly relies on np.vstack, which stacks the list back into a single array (a small example follows). Be careful with rank-1 arrays, which need special handling such as flatten; use rank-1 arrays as little as possible to avoid unnecessary bugs, and if you must use them, add assert statements to confirm their shapes.
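A toy illustration of these two helpers (small shapes chosen for readability, not the CIFAR-10 sizes):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)              # 6 samples, 2 features
folds = np.array_split(X, 3, axis=0)         # list of 3 arrays, each of shape (2, 2)
print([f.shape for f in folds])

i = 1                                         # hold fold i out as the validation set
X_tr = np.vstack(folds[:i] + folds[i + 1:])  # remaining folds stacked back: (4, 2)
print(X_tr.shape)
```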
```python
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# X_train.shape = (num_train, D)
# y_train.shape = (num_train,)
X_train_folds = np.array_split(X_train, num_folds, axis=0)  # list of arrays, each (num_train/num_folds, D)
y_train_folds = np.array_split(y_train, num_folds, axis=0)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and the#
# last fold as a validation set. Store the accuracies for all fold and all    #
# values of k in the k_to_accuracies dictionary.                              #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # Use fold i as the validation set and the remaining folds for training.
        X_train_new = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
        y_train_new = np.vstack(y_train_folds[0:i] + y_train_folds[i+1:]).flatten()
        X_test_new = X_train_folds[i]
        y_test_new = y_train_folds[i]
        assert X_train_new.shape == (4000, 3072)
        assert y_train_new.shape == (4000,)
        assert X_test_new.shape == (1000, 3072)
        assert y_test_new.shape == (1000,)

        classifier.train(X_train_new, y_train_new)
        y_test_pred = classifier.predict(X_test_new, k=k)
        num_correct = np.sum(y_test_pred == y_test_new)
        accuracy = float(num_correct) / X_test_new.shape[0]
        accuracies.append(accuracy)
    k_to_accuracies[k] = accuracies

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    sum_of_accuracy = 0
    for accuracy in k_to_accuracies[k]:
        sum_of_accuracy += accuracy
        print('k = %d, accuracy = %f' % (k, accuracy))
    print('k = %d, mean accuracy = %f' % (k, sum_of_accuracy / num_folds))
```
The best choice is k = 10, which gives a cross-validation accuracy of about 0.282.
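As the notebook's final step, the selected k can then be used to retrain on the full training data and evaluate on the test set. A sketch of that step (with best_k = 10, the accuracy should come out slightly above 28%):

```python
import numpy as np

best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / y_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, y_test.shape[0], accuracy))
```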
5. Inline Questions in the Assignment
5.1 Inline Question 1
Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)
- What in the data is the cause behind the distinctly bright rows?
- What causes the columns?
$\color{blue}{\textit{Your Answer:}}$
- A distinctly bright row means that the corresponding test sample is far from every training sample; it is probably an outlier;
- A bright column means that the corresponding training sample is far from every test sample; it is probably a bad sample in the training data.
5.2 Inline Question 2
We can also use other distance metrics such as L1 distance.
For pixel values
p
i
j
(
k
)
p_{ij}^{(k)}
pij(k) at location
(
i
,
j
)
(i,j)
(i,j) of some image
I
k
I_k
Ik,
the mean
μ
\mu
μ across all pixels over all images is
μ
=
1
n
h
w
∑
k
=
1
n
∑
i
=
1
h
∑
j
=
1
w
p
i
j
(
k
)
\mu=\frac{1}{nhw}\sum_{k=1}^n\sum_{i=1}^{h}\sum_{j=1}^{w}p_{ij}^{(k)}
μ=nhw1k=1∑ni=1∑hj=1∑wpij(k)
And the pixel-wise mean
μ
i
j
\mu_{ij}
μij across all images is
μ
i
j
=
1
n
∑
k
=
1
n
p
i
j
(
k
)
.
\mu_{ij}=\frac{1}{n}\sum_{k=1}^np_{ij}^{(k)}.
μij=n1k=1∑npij(k).
The general standard deviation
σ
\sigma
σ and pixel-wise standard deviation
σ
i
j
\sigma_{ij}
σij is defined similarly.
Which of the following preprocessing steps will not change the performance of a Nearest Neighbor classifier that uses L1 distance? Select all that apply.
- Subtracting the mean $\mu$ ($\tilde{p}_{ij}^{(k)}=p_{ij}^{(k)}-\mu$.)
- Subtracting the per pixel mean $\mu_{ij}$ ($\tilde{p}_{ij}^{(k)}=p_{ij}^{(k)}-\mu_{ij}$.)
- Subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$.
- Subtracting the pixel-wise mean $\mu_{ij}$ and dividing by the pixel-wise standard deviation $\sigma_{ij}$.
- Rotating the coordinate axes of the data.
$\color{blue}{\textit{Your Answer:}}$
1,2,3,4
$\color{blue}{\textit{Your Explanation:}}$
To analyze the question, write down the following formulas:
$$
\begin{aligned}
& dist[m][n]=\sum_{i=1}^h\sum_{j=1}^w \left| p_{ij}^{(m)} - p_{ij}^{(n)} \right| \\
& \mu=\frac{1}{nhw}\sum_{k=1}^n\sum_{i=1}^{h}\sum_{j=1}^{w}p_{ij}^{(k)} =\text{constant} \\
& \mu_{ij}=\frac{1}{n}\sum_{k=1}^np_{ij}^{(k)} \\
& \sigma=\sqrt{\frac{\sum_{k=1}^n\sum_{i=1}^{h}\sum_{j=1}^{w}\left(p_{ij}^{(k)} -\mu \right)^2}{nhw}}=\text{constant} \\
& \sigma_{ij} = \sqrt{\frac{\sum_{k=1}^n \left( p^{(k)}_{ij} -\mu_{ij} \right)^2}{n}}
\end{aligned}
$$
Substituting option 1 into the $dist$ formula, the $\mu$ terms cancel, so the L1 distances are unchanged.
Substituting option 3 into the $dist$ formula simply rescales the L1 distances by $\frac{1}{\sigma}$, which does not affect the nearest-neighbor ordering.
The $dist$ formula is already element-wise, so substituting options 2 and 4 likewise leaves the ordering unaffected; moreover, 2 and 4 are exactly data standardization, turning the data into zero mean and unit variance, so they do not change performance either.
For option 5, a simple counterexample: take points $A(0,0)$, $B(1,1)$, $C(1.5,0)$. Before rotation, $d_{L1}(A,B)=2$ and $d_{L1}(A,C)=1.5$, so $C$ is the nearest neighbor of $A$; after rotating the coordinate axes by 45° the distances become $d_{L1}(A,B)=\sqrt{2}\approx 1.41$ and $d_{L1}(A,C)\approx 2.12$, so the nearest neighbor flips to $B$. L1 distance is not rotation-invariant, so option 5 can change performance.
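A quick numeric check of this counterexample (rotating the points is equivalent to rotating the coordinate axes for distance purposes):

```python
import numpy as np

def l1(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return np.abs(p - q).sum()

A, B, C = np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.5, 0.0])
print(l1(A, B), l1(A, C))            # 2.0 1.5 -> C is A's nearest neighbor

theta = np.pi / 4                    # rotate by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(l1(R @ A, R @ B), l1(R @ A, R @ C))  # ~1.414 ~2.121 -> now B is closest
```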
5.3 Inline Question 3
Which of the following statements about $k$-Nearest Neighbor ($k$-NN) are true in a classification setting, and for all $k$? Select all that apply.
- The decision boundary of the k-NN classifier is linear.
- The training error of a 1-NN will always be lower than that of 5-NN.
- The test error of a 1-NN will always be lower than that of a 5-NN.
- The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
- None of the above.
$\color{blue}{\textit{Your Answer:}}$
2,4
$\color{blue}{\textit{Your Explanation:}}$
- kNN is not a linear classifier, so statement 1 is clearly false.
- Statement 2 is true: if the training set itself is used as the test set, then for k = 1 the nearest neighbor of any point x is the point itself, so the training error is 0; for 5-NN, a 0 error rate is only a lower bound.
- Statement 3 is clearly false for a test set; for example, in this programming assignment the best value of k was 10 rather than 1 (here 10 plays the same role as 5 in the statement).
- Statement 4 is true: the time to classify a test example scales as $O(n_{train})$ with the size of the training set.