I studied other people's solutions online and then rewrote my own. Since array_split returns a list, many people online reach for np.vstack and the like to stitch the folds back together; I found that unnecessarily cumbersome, so I simply convert the list into a NumPy array and work with that directly.
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# TODO: #
# Split up the training data into folds. After splitting, X_train_folds and #
# y_train_folds should each be lists of length num_folds, where #
# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #
# Hint: Look up the numpy array_split function. #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
X_train_folds = np.array(np.array_split(X_train, num_folds, axis=0))
y_train_folds = np.array(np.array_split(y_train, num_folds, axis=0))
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
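To sanity-check the split, here is a minimal sketch with made-up small shapes standing in for the real (5000, 3072) training set, showing that wrapping np.array_split in np.array stacks the folds into one 3-D array:

```python
import numpy as np

# Toy stand-ins for X_train / y_train: 10 examples, 4 features.
X_toy = np.arange(40).reshape(10, 4)
y_toy = np.arange(10)

folds = 5
X_folds = np.array(np.array_split(X_toy, folds, axis=0))
y_folds = np.array(np.array_split(y_toy, folds, axis=0))

print(X_folds.shape)  # (5, 2, 4): num_folds x examples-per-fold x features
print(y_folds.shape)  # (5, 2)
```

Note this only works cleanly because the number of examples divides evenly by num_folds, so every fold has the same shape and np.array can stack them; with ragged folds you would get an object array (or an error on recent NumPy versions).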
# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
################################################################################
# TODO: #
# Perform k-fold cross validation to find the best value of k. For each #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all #
# values of k in the k_to_accuracies dictionary. #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
for ki in k_choices:
    k_to_accuracies[ki] = []
    for fi in range(num_folds):
        # Use every fold except fold fi as training data.
        x_train_tem = np.delete(X_train_folds, fi, axis=0).reshape(-1, X_train_folds.shape[2])
        # Flatten the labels back to 1-D: predict_labels uses np.bincount,
        # which expects a label vector, not an (N, 1) column.
        y_train_tem = np.delete(y_train_folds, fi, axis=0).reshape(-1)
        # Fold fi serves as the validation set.
        x_test_tem = X_train_folds[fi]
        y_test_tem = y_train_folds[fi]
        classifier = KNearestNeighbor()
        classifier.train(x_train_tem, y_train_tem)
        dists = classifier.compute_distances_no_loops(x_test_tem)
        y_test_pred = classifier.predict_labels(dists, ki)
        num_correct = np.sum(y_test_pred == y_test_tem)
        accuracy = float(num_correct) / x_test_tem.shape[0]
        k_to_accuracies[ki].append(accuracy)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
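A common follow-up (not shown in the starter code) is to average the per-fold accuracies and pick the k with the best mean. A sketch, using made-up accuracy values in place of the real k_to_accuracies:

```python
import numpy as np

# Hypothetical per-fold accuracies standing in for the real results.
k_to_accuracies = {
    1:  [0.26, 0.27, 0.25, 0.26, 0.28],
    5:  [0.27, 0.29, 0.28, 0.28, 0.29],
    10: [0.27, 0.28, 0.27, 0.28, 0.28],
}

# Mean accuracy across folds for each k, then the argmax over k.
mean_acc = {k: np.mean(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_acc, key=mean_acc.get)
print(best_k)  # 5 for these made-up numbers
```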
Many people find the cross-validation part confusing, and the starter code shares some of the blame: the "test" set here is actually a validation set, i.e. still part of the training data. We have 5000 training examples, split into 5 folds as above, so in each round 4 folds are training data and the remaining 1 fold is the validation ("test") set. My approach is blunt and simple: at iteration fi, fold fi becomes the validation set. For example, when fi = 1, out of folds 0, 1, 2, 3, 4 we remove fold 1 and train on folds 0, 2, 3, 4, which I implement by deleting row fi from the stacked array. X_train_folds starts with shape (5, 1000, 3072); after deleting one fold it becomes (4, 1000, 3072). But the classifier expects data shaped (num_examples, 3072), so we keep the last dimension of 3072 and merge the first two, letting reshape's -1 infer the size automatically. The validation data works the same way, except that it is a single fold and so already has one fewer dimension, which is why its handling differs slightly.
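The shape bookkeeping above can be checked on a toy array; the small dimensions below are scaled-down stand-ins for the real (5, 1000, 3072):

```python
import numpy as np

# (num_folds, examples-per-fold, feature-dim) toy folds array.
folds = np.arange(5 * 2 * 3).reshape(5, 2, 3)

fi = 1
train_part = np.delete(folds, fi, axis=0)            # drop fold 1 -> (4, 2, 3)
train_flat = train_part.reshape(-1, folds.shape[2])  # merge first two dims -> (8, 3)
val_part = folds[fi]                                 # fold 1 alone -> (2, 3)

print(train_part.shape, train_flat.shape, val_part.shape)
```

np.delete does not modify folds in place; it returns a copy with fold fi removed, so the full folds array is intact for the next iteration of the loop.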