cs231n kNN Implementation

Similarity Measures

Computing the distance between two vectors is also called measuring the similarity between samples: it reflects how close or how far apart objects of a given kind are. Before introducing distances, first consider one concept.

  • Norm: can be understood simply and intuitively as the length of a vector, i.e. the distance from the vector to the origin of the coordinate system.
    (1) L1 norm: ||x||_1 is the sum of the absolute values of the elements of x.
    L_1 = \sum_{i=1}^{n} |x_i|    (31)

    (2) L2 norm: the square root of the sum of the squares of the elements of x. The L2 norm is also called the Euclidean norm or the Frobenius norm.
    L_2 = \sqrt{\sum_{i=1}^{n} x_i^2}    (32)
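As a quick sanity check, both norms can be computed either directly from the definitions or via `np.linalg.norm` (the vector below is chosen purely for illustration):

```python
import numpy as np

x = np.array([3.0, -4.0])
l1 = np.sum(np.abs(x))        # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))  # L2 norm: square root of the sum of squares
print(l1, l2)                                   # 7.0 5.0
print(np.linalg.norm(x, 1), np.linalg.norm(x))  # same values via NumPy
```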
  • The meaning of common distances, with Python implementations
    (1) Euclidean distance
    The Euclidean distance between two n-dimensional column vectors A(x_1, x_2, ..., x_n) and B(y_1, y_2, ..., y_n) is
    d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (33)

    Written in matrix form:
    d = \sqrt{(A - B)^T (A - B)}    (34)

    Python implementation of the Euclidean distance:
    import numpy as np
    a = np.array([1, 2, 3]).reshape(3, 1)
    b = np.array([4, 5, 6]).reshape(3, -1)
    print(a.shape, b.shape)
    '''
    (3, 1) (3, 1)
    '''
    c = a - b
    print(c.T)
    print(np.sqrt(np.dot(c.T, c)))
    print(np.linalg.norm(c))
    '''
    [[-3 -3 -3]]
    [[5.19615242]]
    5.196152422706632
    '''

    (2) Manhattan distance
    The Manhattan distance, also called the city-block distance, is computed as:
    d = \sum_{k=1}^{n} |x_k - y_k|    (35)

    Python implementation:

    print(np.sum(np.abs(a - b)))

    (3) Chebyshev distance
    In chess, the king can move to any of the eight adjacent squares, so the minimum number of moves from square (x_1, y_1) to square (x_2, y_2) is
    \max(|x_2 - x_1|, |y_2 - y_1|)    (36)

    Python implementation:

    print(np.max(np.abs(a - b)))
Cross-Validation

Cross-validation splits the training data into k folds, forming multiple train/validation pairs that are used to select good hyperparameters. When splitting, each subset should preserve the data distribution as much as possible, i.e. it should be obtained by stratified sampling from the training set. Each round then uses the union of k-1 folds as the training set and the remaining fold as the validation set, and the final score is the mean of the k validation results. After tuning the hyperparameters with cross-validation, the model is evaluated once on the held-out test set.
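The fold-splitting described above can be sketched with NumPy alone (the array contents below are illustrative):

```python
import numpy as np

# Minimal k-fold splitting sketch: 10 toy samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
k = 5
X_folds = np.array_split(X, k)
y_folds = np.array_split(y, k)
for i in range(k):
    X_val, y_val = X_folds[i], y_folds[i]                 # held-out fold
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])  # remaining k-1 folds
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    # train on (X_tr, y_tr), evaluate on (X_val, y_val) here
    assert len(X_tr) + len(X_val) == len(X)
```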

Bias and Variance

Take a regression task as an example. For a test sample x, let y_D be the label of x in the dataset, y the true label of x, and f(x; D) the prediction on x of the model f learned on training set D.
The expected prediction of the learning algorithm is:

\bar{f}(x) = E_D[f(x; D)]    (42)

The variance induced by different training sets of the same size is:

var(x) = E_D[(f(x; D) - \bar{f}(x))^2]    (43)

The noise is:

\epsilon^2 = E_D[(y_D - y)^2]    (44)

The gap between the expected prediction and the true label is called the bias:

bias^2(x) = (\bar{f}(x) - y)^2    (45)

The bias-variance decomposition of the generalization error:

E(f; D) = bias^2(x) + var(x) + \epsilon^2    (46)

Bias measures how far the algorithm's expected prediction deviates from the true value, i.e. it characterizes the fitting ability of the algorithm itself. Variance measures how much the learned model changes when a training set of fixed size is perturbed, i.e. it characterizes the effect of data perturbations. Noise expresses the lower bound on the expected generalization error attainable by any algorithm on the current task, i.e. it characterizes the difficulty of the problem itself.
The figure below, from Hung-yi Lee (李宏毅)'s lecture slides, illustrates the difference between variance and bias quite vividly.
(figure omitted)
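The decomposition can also be checked numerically. Below is a minimal Monte Carlo sketch in which the quadratic target, the linear model, and all constants are made up for illustration: many training sets D are drawn, an underfitting linear model is fit on each, and the squared bias plus variance of its predictions at one test point is compared against its expected squared error measured on the noise-free target (so the noise term drops out of the identity):

```python
import numpy as np

# Hypothetical setup: true target y = x^2, noisy labels y_D, degree-1 polyfit.
rng = np.random.default_rng(0)
x_test, y_true = 0.5, 0.5 ** 2
n_trials, n_train, sigma = 2000, 20, 0.1

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(-1, 1, n_train)
    y_d = x ** 2 + rng.normal(0, sigma, n_train)  # noisy training labels y_D
    coef = np.polyfit(x, y_d, deg=1)              # underfitting linear model
    preds[t] = np.polyval(coef, x_test)           # f(x; D) for this draw of D

f_bar = preds.mean()                  # expected prediction  (42)
var = preds.var()                     # variance term        (43)
bias2 = (f_bar - y_true) ** 2         # squared bias term    (45)
mse = ((preds - y_true) ** 2).mean()  # error vs. the noise-free target
print(np.isclose(mse, bias2 + var))   # True: mse = bias^2 + var here
```

With the noisy label y_D in place of y on the left-hand side, the extra \epsilon^2 term of equation (46) would reappear.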

kNN Algorithm

k-nearest neighbors classifies by measuring distances between feature vectors. The basic idea: if most of a sample's k nearest neighbors in feature space belong to some class, then the sample belongs to that class too.
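Before the full implementation, the rule itself fits in a few lines (the 1-D training points and labels below are made up for illustration):

```python
import numpy as np

# Toy k-NN: predict by majority vote among the k nearest training points.
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])
x = np.array([0.15])                                 # query point
k = 3
dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances
nearest = np.argsort(dists)[:k]                      # indices of k nearest
pred = np.argmax(np.bincount(y_train[nearest]))      # majority vote
print(pred)  # 0: the three closest neighbors are all class 0
```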
Main code

classifiers.py
 
import numpy as np


class KNearestNeighbor:
    def __init__(self):
        self.X_train = None
        self.y_train = None

    def train(self, X, y):
        """
        Train the classifier. For kNN this step just memorizes the training data.
        :param X: training set, a numpy array
        :param y: labels; the digits 0-9 denote the ten classes
        :return: None
        """
        self.X_train = X
        self.y_train = y

    def computer_distance_two_loop(self, X):
        """
        Compute the Euclidean distance from every test point to every training point.
        :param X: test set
        :return: distance matrix dists
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                dists[i][j] = np.linalg.norm(X[i] - self.X_train[j])
        return dists

    def computer_distance_one_loop(self, X):
        """
        Compute the Euclidean distance from every test point to every training point.
        :param X: test set
        :return: distance matrix dists
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            # broadcasting compares row i against all training points at once
            dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        return dists

    def computer_distance_no_loop(self, X):
        """
        Compute the Euclidean distance from every test point to every training point,
        using the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        :param X: test set
        :return: distance matrix dists
        """
        p = np.sum(np.square(X), axis=1).reshape(X.shape[0], -1)
        c = np.sum(np.square(self.X_train), axis=1).reshape(1, -1)
        cp = np.dot(X, self.X_train.T)
        dists = p + c - 2 * cp
        return np.sqrt(dists)

    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.
        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.
        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            kids = np.argsort(dists[i])
            closest_y = self.y_train[kids[:k]]
            # majority vote among the k nearest labels
            count = 0
            label = 0
            for j in closest_y:
                tmp = 0
                for kk in closest_y:
                    tmp += (kk == j)
                if tmp > count:
                    count = tmp
                    label = j
            y_pred[i] = label
            # equivalently, for 1-D integer labels:
            # y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.computer_distance_no_loop(X)
        elif num_loops == 1:
            dists = self.computer_distance_one_loop(X)
        elif num_loops == 2:
            dists = self.computer_distance_two_loop(X)
        else:
            raise ValueError("Invalid value for num_loops: {}".format(num_loops))
        return self.predict_labels(dists, k)
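The no-loop version above relies on the algebraic expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. A quick check on random data (shapes chosen only for illustration) confirms it matches the explicit double loop:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 4))
X_test = rng.normal(size=(3, 4))

# expansion (the no-loop version)
p = np.sum(X_test ** 2, axis=1).reshape(-1, 1)    # ||a||^2, column vector
c = np.sum(X_train ** 2, axis=1).reshape(1, -1)   # ||b||^2, row vector
d_fast = np.sqrt(p + c - 2 * X_test @ X_train.T)  # broadcasts to (3, 5)

# explicit double loop for comparison
d_slow = np.zeros((3, 5))
for i in range(3):
    for j in range(5):
        d_slow[i, j] = np.linalg.norm(X_test[i] - X_train[j])

print(np.allclose(d_fast, d_slow))  # True
```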

main.py

import numpy as np
import time
from load_datasets import load_CIFAR10
import matplotlib.pyplot as plt
from classifiers import KNearestNeighbor


def show_datasets(xtr, ytr):
    """
    Display a few sample images.
    :param xtr: training-set pixels
    :param ytr: training-set labels
    :return: None

    np.flatnonzero(a) returns the indices of the nonzero elements of a:
        a = np.array([1, 2, 3])
        print(a == 1)                  # [ True False False]
        print(np.flatnonzero(a == 1))  # [0]
    np.random.choice(a, size=None, replace=True, p=None) samples from a if a is
    array-like; if a is an integer it samples from np.arange(a). size is the
    number of draws, replace=True allows repeats, and p gives per-element
    probabilities.
    """
    classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']  # the ten classes
    num_classes = len(classes)
    sample_per_class = 7
    for y, cls in enumerate(classes):
        idxs = np.flatnonzero(ytr == y)
        idxs = np.random.choice(idxs, sample_per_class, replace=False)
        for i, idx in enumerate(idxs):
            plt_idx = i * num_classes + y + 1
            plt.subplot(sample_per_class, num_classes, plt_idx)
            plt.imshow(xtr[idx].astype('uint8'))
            plt.axis('off')
            if i == 0:
                plt.title(cls)
    plt.savefig("fig1.png")
    plt.show()


def subsample_data(xtr, ytr, xte, yte):  # take a subset of the data for a quick run
    xtr = np.reshape(xtr, (xtr.shape[0], -1))
    xte = np.reshape(xte, (xte.shape[0], -1))
    num_train = 5000
    mask = range(num_train)
    x_train_set, y_train_set = xtr[mask], ytr[mask]
    num_test = 500
    mask = range(num_test)
    x_test_set, y_test_set = xte[mask], yte[mask]
    x_train_set = np.reshape(x_train_set, (x_train_set.shape[0], -1))
    x_test_set = np.reshape(x_test_set, (x_test_set.shape[0], -1))
    return x_train_set, y_train_set, x_test_set, y_test_set


if __name__ == '__main__':
    Xtr, Ytr, Xte, Yte = load_CIFAR10()
    show_datasets(Xtr, Ytr)
    x_train, y_train, x_test, y_test = subsample_data(Xtr, Ytr, Xte, Yte)
    classifier = KNearestNeighbor()
    classifier.train(x_train, y_train)
    '''
    start_time = time.time()
    dists = classifier.computer_distance_two_loop(x_test)
    print("Two Loop {}s".format(time.time() - start_time))
    start_time = time.time()
    dists = classifier.computer_distance_one_loop(x_test)
    print("One Loop {}s".format(time.time() - start_time))
    start_time = time.time()
    dists = classifier.computer_distance_no_loop(x_test)
    print("No Loop {}s".format(time.time() - start_time))
    '''
    '''
    Two Loop 14.4767839909s
    One Loop 21.5460550785s
    No Loop 0.262940168381s
    Accuracy: 0.29
    '''
    num_folds = 5  # five-fold cross-validation
    k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
    X_train_folds = np.array_split(x_train, num_folds)
    y_train_folds = np.array_split(y_train, num_folds)
    k_to_accuracies = {}
    for k in k_choices:
        k_to_accuracies[k] = np.zeros(num_folds)
        for i in range(num_folds):
            Xtr = np.array(X_train_folds[:i] + X_train_folds[i + 1:])
            ytr = np.array(y_train_folds[:i] + y_train_folds[i + 1:])
            Xte = np.array(X_train_folds[i])
            yte = np.array(y_train_folds[i])
            Xtr = np.reshape(Xtr, (x_train.shape[0] * 4 // 5, -1))
            ytr = np.reshape(ytr, (y_train.shape[0] * 4 // 5, -1))
            Xte = np.reshape(Xte, (x_train.shape[0] // 5, -1))
            yte = np.reshape(yte, (y_train.shape[0] // 5, -1))
            classifier.train(Xtr, ytr)
            yte_pred = classifier.predict(Xte, k)
            yte_pred = np.reshape(yte_pred, (yte_pred.shape[0], -1))
            num_correct = np.sum(yte_pred == yte)
            accuracy = float(num_correct) / len(yte)
            k_to_accuracies[k][i] = accuracy
    for k in k_choices:
        accuracies = k_to_accuracies[k]
        plt.scatter([k] * len(accuracies), accuracies)
    accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
    accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
    plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
    plt.title('Cross-validation on k')
    plt.xlabel('k')
    plt.ylabel('Cross-validation accuracy')
    plt.savefig("a.png")
    plt.show()
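One step the script above leaves implicit is actually selecting a hyperparameter: after cross-validation, pick the k whose mean validation accuracy is highest. A small sketch (the accuracy values below are made up):

```python
import numpy as np

# Hypothetical cross-validation results: k -> per-fold accuracies.
k_to_accuracies = {
    1: [0.26, 0.27, 0.28, 0.28, 0.27],
    5: [0.29, 0.30, 0.28, 0.29, 0.30],
    10: [0.28, 0.29, 0.28, 0.28, 0.29],
}
# choose the k with the highest mean validation accuracy
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
print(best_k)  # 5
```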

Cross-validation results:
(figure omitted)
