CS231n-assignment1

Latest recommended article published on 2023-03-18 15:38:37

搞么子啊小子

Reference: http://www.cnblogs.com/daihengchen/p/5754383.html

KNN

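Reference: http://blog.csdn.net/zhyh1435589631/article/details/54236643

Analysis of the core kNN implementation code

2.2.3.1 Overview of the KNearestNeighbor class

In essence, this is a class made up of several member functions. To get predictions, the user only needs to call train and then predict.
compute_distances_two_loops, compute_distances_one_loop, and compute_distances_no_loops each compute the distances between the data set X to be predicted and the stored training set self.X_train; predict_labels then uses those distances to make the kNN prediction.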

class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    ...

  def predict(self, X, k=1, num_loops=0):
    ...

  def compute_distances_two_loops(self, X):
    ...

  def compute_distances_one_loop(self, X):
    ...

  def compute_distances_no_loops(self, X):
    ...

  def getNormMatrix(self, x, lines_num):
    ... 

  def predict_labels(self, dists, k=1):
    ...
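
The train/predict calling pattern described above can be sketched with a deliberately tiny stand-in. This is not the assignment's class: TinyKNN is a hypothetical name, it only supports 1-nearest-neighbor, and the data points are made up for illustration.

```python
import numpy as np

class TinyKNN(object):
    """Hypothetical stand-in for KNearestNeighbor (1-nearest-neighbor only)."""

    def train(self, X, y):
        # "Training" a kNN classifier just memorizes the data.
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # Pairwise differences, shape (num_test, num_train, D).
        diff = X[:, np.newaxis, :] - self.X_train[np.newaxis, :, :]
        # L2 distances, shape (num_test, num_train).
        dists = np.sqrt(np.sum(np.square(diff), axis=2))
        # Label of the closest training point for each test point.
        return self.y_train[np.argmin(dists, axis=1)]

clf = TinyKNN()
clf.train(np.array([[0.0, 0.0], [10.0, 10.0]]), np.array([0, 1]))
pred = clf.predict(np.array([[1.0, 1.0], [9.0, 8.0]]))
print(pred)  # [0 1]
```

The real class separates the distance computation from the voting step so that the three distance implementations can be swapped in and compared.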
 

2.2.3.2 compute_distances_two_loops

This function computes the Euclidean distance between each test point and each training point using two nested for loops.

$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
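
As a quick sanity check of the formula, the following minimal sketch computes the L2 distance between two made-up vectors without looping over dimensions. The names a and b and their values are illustrative assumptions, not data from the assignment.

```python
import numpy as np

# Two made-up 3-dimensional points (illustrative values only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Square the per-dimension differences, sum them, then take the square root.
d = np.sqrt(np.sum(np.square(a - b)))
print(d)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
```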

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists
 

2.2.3.3 compute_distances_one_loop

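The code filled in here is essentially the same as in the previous section, except that axis = 1 is added to specify the direction of the sum: subtracting X[i,:] from self.X_train broadcasts over all training rows, and summing along axis 1 yields one squared distance per training point.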

def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      dists[i,:] = np.sqrt(np.sum(np.square(self.X_train-X[i,:]),axis = 1)) 
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists
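
To make the role of axis = 1 concrete, here is a small sketch on made-up data. The arrays and the names diff, row_dists, and total are illustrative assumptions. It shows how broadcasting subtracts one test point from every training row, and how the axis argument changes what np.sum returns.

```python
import numpy as np

# Made-up data: 3 training points and 1 test point, each with 2 features.
X_train = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [6.0, 8.0]])
x = np.array([0.0, 0.0])

# Broadcasting subtracts x from every row of X_train, giving shape (3, 2).
diff = X_train - x

# axis=1 sums across the feature dimension: one distance per training point
# (here 0, 5 and 10).
row_dists = np.sqrt(np.sum(np.square(diff), axis=1))

# Without axis, everything collapses into a single scalar (here 125.0).
total = np.sum(np.square(diff))
print(row_dists.shape, total)
```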
 

2.2.3.4 compute_distances_no_loops

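Although the formula in this part is short, deriving it takes some mathematical background. Reference: http://blog.csdn.net/geekmanong/article/details/51524402
Let the test-set matrix be P, of size M×D, and the training-set matrix be C, of size N×D.
Let $P_i$ denote the i-th row of P and, likewise, $C_j$ the j-th row of C: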
$P_i = [P_{i1}\ P_{i2}\ \dots\ P_{iD}], \quad C_j = [C_{j1}\ C_{j2}\ \dots\ C_{jD}]$
First, compute the distance between $P_i$ and $C_j$:
$d(P_i, C_j) = \sqrt{(P_{i1} - C_{j1})^2 + (P_{i2} - C_{j2})^2 + \dots + (P_{iD} - C_{jD})^2} = \sqrt{(P_{i1}^2 + P_{i2}^2 + \dots + P_{iD}^2) + (C_{j1}^2 + C_{j2}^2 + \dots + C_{jD}^2) - 2 * (P_{i1} C_{j1} + P_{i2} C_{j2} + \dots + P_{iD} C_{jD})} = \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2 * P_i C_j'}$
Generalizing, each row of the result matrix is:
$Res(i) = \sqrt{(\|P_i\|^2\ \|P_i\|^2\ \dots\ \|P_i\|^2) + (\|C_1\|^2\ \|C_2\|^2\ \dots\ \|C_N\|^2) - 2 * P_i (C_1'\ C_2'\ \dots\ C_N')} = \sqrt{(\|P_i\|^2\ \|P_i\|^2\ \dots\ \|P_i\|^2) + (\|C_1\|^2\ \|C_2\|^2\ \dots\ \|C_N\|^2) - 2 * P_i C'}$
It follows that the full result matrix is:
$Res = \sqrt{\begin{pmatrix} \|P_1\|^2 & \|P_1\|^2 & \dots & \|P_1\|^2 \\ \|P_2\|^2 & \|P_2\|^2 & \dots & \|P_2\|^2 \\ \vdots & \vdots & \ddots & \vdots \\ \|P_M\|^2 & \|P_M\|^2 & \dots & \|P_M\|^2 \end{pmatrix} + \begin{pmatrix} \|C_1\|^2 & \|C_2\|^2 & \dots & \|C_N\|^2 \\ \|C_1\|^2 & \|C_2\|^2 & \dots & \|C_N\|^2 \\ \vdots & \vdots & \ddots & \vdots \\ \|C_1\|^2 & \|C_2\|^2 & \dots & \|C_N\|^2 \end{pmatrix} - 2 P C'} = \sqrt{\begin{pmatrix} \|P_1\|^2 \\ \|P_2\|^2 \\ \vdots \\ \|P_M\|^2 \end{pmatrix}_{M \times 1} * \begin{pmatrix} 1 & 1 & \dots & 1 \end{pmatrix}_{1 \times N} + \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}_{M \times 1} * \begin{pmatrix} \|C_1\|^2 & \|C_2\|^2 & \dots & \|C_N\|^2 \end{pmatrix}_{1 \times N} - 2 P_{M \times D} C'_{N \times D}}$
Translated into Python code, this becomes:

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
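    # Squared norms of the test rows, kept as a column so they broadcast over columns.
    sq1 = np.sum(np.square(X), axis=1, keepdims=True)   # shape (num_test, 1)
    # Squared norms of the training rows, broadcast over rows.
    sq2 = np.sum(np.square(self.X_train), axis=1)       # shape (num_train,)
    s = np.dot(X, self.X_train.T)                       # shape (num_test, num_train)
    dists = np.sqrt(sq1 + sq2 - 2 * s)
    # Equivalent form using the getNormMatrix helper below:
    # dists = np.sqrt(self.getNormMatrix(X, num_train).T + self.getNormMatrix(self.X_train, num_test) - 2 * np.dot(X, self.X_train.T))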
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

  def getNormMatrix(self, x, lines_num):
    """
    Get a lines_num x size(x, 1) matrix
    """ 
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis = 1) 
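
One way to gain confidence in the expansion is to compare it numerically against the brute-force double loop. The sketch below does this on small random matrices. The sizes and seed are arbitrary choices, and the np.maximum clamp is an added precaution (an assumption, not part of the original solution) against tiny negative values caused by floating-point round-off.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the check is repeatable
P = rng.rand(4, 3)               # made-up "test" matrix, M=4, D=3
C = rng.rand(5, 3)               # made-up "training" matrix, N=5, D=3

# Vectorized form: ||P_i||^2 + ||C_j||^2 - 2 * P C'
p_sq = np.sum(np.square(P), axis=1, keepdims=True)   # shape (M, 1)
c_sq = np.sum(np.square(C), axis=1)                  # shape (N,)
sq = p_sq + c_sq - 2 * np.dot(P, C.T)                # shape (M, N)
# Assumption: clamp round-off negatives before taking the square root.
res = np.sqrt(np.maximum(sq, 0.0))

# Brute-force reference using two explicit loops.
ref = np.zeros((4, 5))
for i in range(4):
    for j in range(5):
        ref[i, j] = np.sqrt(np.sum(np.square(P[i] - C[j])))

print(np.allclose(res, ref))  # True
```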
 

2.2.3.5 predict_labels

Based on the computed distances, pick the k nearest training points as voters and let them hold a majority vote over their labels.

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      kids = np.argsort(dists[i])
      closest_y = self.y_train[kids[:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred
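
The voting step can be traced by hand on a single row of distances. In this sketch the distances and labels are made up so that two labels tie, which shows why np.argmax on the np.bincount result breaks ties toward the smaller label. Note that np.bincount assumes the labels are non-negative integers.

```python
import numpy as np

# Made-up distances from one test point to 5 training points, and their labels.
dist_row = np.array([0.9, 0.1, 0.4, 0.3, 0.8])
y_train = np.array([0, 1, 2, 2, 1])

k = 4
nearest = np.argsort(dist_row)[:k]   # indices of the k smallest distances: 1, 3, 2, 4
closest_y = y_train[nearest]         # labels of those neighbors: 1, 2, 2, 1
votes = np.bincount(closest_y)       # votes[c] = number of neighbors with label c
pred = np.argmax(votes)              # first maximum wins, so the tie goes to label 1
print(pred)  # 1
```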

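2.2.4 Cross-validation code analysis

Cross-validation splits the training data into several groups (folds) and builds multiple training/validation splits from them in order to select good hyperparameters.
As shown in the figure, the data can be divided into 5 folds: each of fold 1, 2, ..., 5 is used in turn as the validation set while the remaining folds serve as the training set, and the results are used to choose the hyperparameter.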
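
The fold rotation can be sketched on a toy data set. The arrays X and y below are made up (10 samples, 2 features) purely to show the shapes involved; they are not the assignment's data.

```python
import numpy as np

# Made-up data set: 10 samples with 2 features each, plus integer labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

num_folds = 5
X_folds = np.array_split(X, num_folds)   # list of 5 arrays, each of shape (2, 2)
y_folds = np.array_split(y, num_folds)

for i in range(num_folds):
    # Fold i is held out for validation; the other folds form the training set.
    X_val, y_val = X_folds[i], y_folds[i]
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    print(i, X_tr.shape, X_val.shape)    # training (8, 2), validation (2, 2)
```

Using np.concatenate on the list of remaining folds avoids hard-coding the fold sizes, so it also works when the folds are not all the same length.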

2.2.4.1 Selecting among different values of k

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)


################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):
        # Train on every fold except fold i; validate on fold i.
        # np.concatenate keeps the labels 1-D and does not assume equal fold sizes.
        Xtr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = X_train_folds[i]
        yte = y_train_folds[i]

        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
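
Once the per-fold accuracies are collected, one simple way to choose k is to average them across folds and take the best mean. The sketch below illustrates that selection rule only. The dictionary holds toy numbers invented for the example, not results from running the assignment, and mean_acc and best_k are illustrative names.

```python
import numpy as np

# Toy accuracies for illustration only; in practice use the dictionary
# filled in by the cross-validation loop.
k_to_accuracies = {
    1: np.array([0.2, 0.4]),
    5: np.array([0.6, 0.8]),
    10: np.array([0.5, 0.5]),
}

# Average across folds, then pick the k with the highest mean accuracy.
mean_acc = {k: float(np.mean(v)) for k, v in k_to_accuracies.items()}
best_k = max(mean_acc, key=mean_acc.get)
print(best_k)  # 5
```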