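1. Key points of the lecture:
a. The difference between the L1 and L2 distances, and how each is computed for images
The L1 distance between two images is the sum of the absolute differences of corresponding pixels. The L2 distance is the Euclidean distance over all pixels: the square root of the sum of the squared per-pixel differences. Because each difference is squared, L2 weights large differences more heavily, so it is the better fit when the differences between images are large.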
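To make the two definitions concrete, here is a minimal sketch that computes both distances for a pair of tiny images. The helper names `l1_distance` and `l2_distance` and the pixel values are made up for illustration; they are not part of the assignment code.

```python
import numpy as np

def l1_distance(img_a, img_b):
    """Sum of absolute per-pixel differences."""
    # Cast to float first: subtracting uint8 arrays directly would wrap around.
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    return np.sum(np.abs(diff))

def l2_distance(img_a, img_b):
    """Square root of the sum of squared per-pixel differences."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    return np.sqrt(np.sum(diff ** 2))

# Two hypothetical 2x2 grayscale images (values chosen for illustration).
img_a = np.array([[10, 20], [30, 40]], dtype=np.uint8)
img_b = np.array([[13, 20], [30, 44]], dtype=np.uint8)

print(l1_distance(img_a, img_b))  # 7.0  (3 + 0 + 0 + 4)
print(l2_distance(img_a, img_b))  # 5.0  (sqrt(9 + 16))
```

The cast to float matters for real image data, which is often stored as unsigned 8-bit integers.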
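b. The concept of KNN
The label of a point is decided by the labels of its K nearest points: the label that occurs most often among those K points becomes the label of the point.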
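The voting rule can be sketched on a toy 2-D data set. The function `knn_predict` and the sample points are hypothetical, and the sketch assumes the labels are non-negative integers so that `np.bincount` can count them.

```python
import numpy as np

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean (L2) distance from the query to every training point.
    dists = np.sqrt(np.sum((train_points - query) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Count how often each label occurs among the k neighbors.
    votes = np.bincount(train_labels[nearest])
    # argmax returns the first maximum, so a tie goes to the smaller label.
    return int(np.argmax(votes))

# Hypothetical training data: three points near the origin (label 0),
# two points near (5, 5) (label 1).
train_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = np.array([0, 0, 0, 1, 1])

print(knn_predict(train_points, train_labels, np.array([4.5, 5.0]), k=3))  # 1
```

With k=3 the two nearest neighbors carry label 1 and the third carries label 0, so the vote is 2 to 1 for label 1.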
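c. Validation set:
The test set must not be used to measure accuracy while tuning hyperparameters, because the chosen hyperparameters would then depend on the test set. The test set is only for the final evaluation, where it measures the model's true accuracy. Instead, part of the training set is split off into a separate validation set, and accuracy during tuning is measured on that set.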
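A hold-out split of this kind might look like the sketch below. The array names, the random data, and the 80/20 ratio are assumptions for illustration only; the course notes do not prescribe a particular split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 100 samples with 4 features and 3 classes.
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

num_val = 20  # assumed validation size (20% of the data)

# Shuffle the indices once, then carve off the first num_val as validation.
perm = rng.permutation(X.shape[0])
val_idx, train_idx = perm[:num_val], perm[num_val:]

X_val, y_val = X[val_idx], y[val_idx]
X_tr, y_tr = X[train_idx], y[train_idx]

print(X_tr.shape, X_val.shape)  # (80, 4) (20, 4)
```

Hyperparameters are then compared on `X_val` and `y_val` only, and the test set stays untouched until the final evaluation.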
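d. Cross-validation:
When the training set is small, splitting off a validation set may leave too little training data. In that case the training set is divided into N folds; each round uses one fold as the validation set and the remaining folds as training data, and this is repeated N times. Every hyperparameter setting therefore gets N accuracy values, one for each of folds 1, 2, ..., N serving as the validation set. The drawback of this approach is its higher computational cost.
With enough data, a single validation set is the usual choice; with little data, cross-validation is.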
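Code:
1. Three ways to compute the distances (the functions below are methods of the classifier class and assume numpy is imported as np):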
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            #####################################################################
            # TODO:                                                             #
            # Compute the l2 distance between the ith test point and the jth    #
            # training point, and store the result in dists[i, j]. You should   #
            # not use a loop over dimension.                                    #
            #####################################################################
            # Square the per-dimension differences, sum them, take the root.
            dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j] - X[i])))
            #####################################################################
            #                       END OF YOUR CODE                            #
            #####################################################################
    return dists
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        #######################################################################
        # TODO:                                                               #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                        #
        #######################################################################
        # X[i] broadcasts against every row of self.X_train; summing over
        # axis=1 gives one distance per training point. Assign to row i only
        # (dists[i, :]), not to the slice dists[i:] of all remaining rows.
        dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        #######################################################################
        #                         END OF YOUR CODE                            #
        #######################################################################
    return dists
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    # ||x - t||^2 = ||t||^2 + ||x||^2 - 2 * x.t, built from two broadcast
    # sums (a row vector and a column vector) and one matrix product.
    dists += np.sum(np.square(self.X_train), axis=1).reshape(1, num_train)
    dists += np.sum(np.square(X), axis=1).reshape(num_test, 1)
    dists -= 2 * np.dot(X, self.X_train.T)
    # Rounding can leave tiny negative values; clamp them so sqrt gives no NaN.
    dists = np.sqrt(np.maximum(dists, 0))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists
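The loop-free version rests on the identity ||x − t||² = ||x||² + ||t||² − 2·x·t. As a sanity check, the sketch below compares a nested-loop computation with the matrix form on random data. The standalone functions `dists_two_loops` and `dists_no_loops` are simplified stand-ins written for this check, not the class methods themselves, and the data is random rather than the data set used in the course.

```python
import numpy as np

def dists_two_loops(X, X_train):
    """Reference implementation: one distance at a time."""
    dists = np.zeros((X.shape[0], X_train.shape[0]))
    for i in range(X.shape[0]):
        for j in range(X_train.shape[0]):
            dists[i, j] = np.sqrt(np.sum((X[i] - X_train[j]) ** 2))
    return dists

def dists_no_loops(X, X_train):
    """Vectorized form using ||x||^2 + ||t||^2 - 2 x.t."""
    sq_test = np.sum(X ** 2, axis=1)[:, np.newaxis]         # shape (num_test, 1)
    sq_train = np.sum(X_train ** 2, axis=1)[np.newaxis, :]  # shape (1, num_train)
    sq = sq_test + sq_train - 2.0 * (X @ X_train.T)
    # Clamp tiny negative rounding errors before taking the square root.
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 5))  # 6 hypothetical training points
X = rng.normal(size=(4, 5))        # 4 hypothetical test points

# Frobenius norm of the difference between the two distance matrices.
difference = np.linalg.norm(dists_two_loops(X, X_train) - dists_no_loops(X, X_train))
print(difference < 1e-8)
```

The two matrices should agree up to floating-point rounding, while the vectorized form replaces num_test × num_train Python-level iterations with a single matrix product.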
2. Predicting labels:
from collections import Counter  # needed by predict_labels; belongs at the top of the module

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.
    - k: The number of nearest neighbors that vote for the predicted label.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        # argsort orders the training indices by distance; keep the first k.
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        label_counts = Counter(closest_y.tolist())
        # most_common() breaks ties by insertion order, not by label value, so
        # pick the highest count explicitly and prefer the smaller label on ties.
        y_pred[i] = min(label_counts, key=lambda label: (-label_counts[label], label))
        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################
    return y_pred
3. Running cross-validation:
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# TODO: #
# Split up the training data into folds. After splitting, X_train_folds and #
# y_train_folds should each be lists of length num_folds, where #
# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #
# Hint: Look up the numpy array_split function. #
################################################################################
# Your code
# Split into num_folds parts instead of hard-coding 5.
X_train_folds.extend(np.array_split(X_train, num_folds))
y_train_folds.extend(np.array_split(y_train, num_folds))
################################################################################
# END OF YOUR CODE #
################################################################################
# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
################################################################################
# TODO: #
# Perform k-fold cross validation to find the best value of k. For each #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all #
# values of k in the k_to_accuracies dictionary. #
################################################################################
# Your code
for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # Fold i is the validation set; all other folds form the training set.
        X_train_cv = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
        y_train_cv = np.hstack(y_train_folds[0:i] + y_train_folds[i+1:])
        X_valid_cv = X_train_folds[i]
        y_valid_cv = y_train_folds[i]
        classifier.train(X_train_cv, y_train_cv)
        dists = classifier.compute_distances_no_loops(X_valid_cv)
        # Fraction of validation points whose predicted label is correct.
        accuracy = float(np.sum(classifier.predict_labels(dists, k) == y_valid_cv)) / y_valid_cv.shape[0]
        accuracies.append(accuracy)
    k_to_accuracies[k] = accuracies
################################################################################
# END OF YOUR CODE #
################################################################################
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
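Once the per-fold accuracies are collected, one common way to choose k is to take the value with the highest mean accuracy across folds. The sketch below assumes a dictionary shaped like k_to_accuracies. The helper `best_k_by_mean` is hypothetical, and the numbers in `example_accuracies` are invented for illustration; they are not results from the run above.

```python
import numpy as np

def best_k_by_mean(k_to_acc):
    """Return the k whose accuracies have the highest mean, plus all the means."""
    mean_acc = {k: float(np.mean(accs)) for k, accs in k_to_acc.items()}
    # Sort by (-mean, k) so that a tie in the mean resolves to the smaller k.
    best_k = min(mean_acc, key=lambda k: (-mean_acc[k], k))
    return best_k, mean_acc

# Made-up accuracies, three folds per k (illustration only).
example_accuracies = {
    1: [0.26, 0.25, 0.27],
    5: [0.28, 0.29, 0.27],
    10: [0.27, 0.28, 0.26],
}

best_k, mean_acc = best_k_by_mean(example_accuracies)
print(best_k)  # 5
```

Looking at the spread of the fold accuracies as well as the mean can help when several values of k score almost the same.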
Ubuntu's input method gets a thumbs-down from me.