Mathematical Derivation + Pure Python Implementation of Machine Learning Algorithms 3: k-Nearest Neighbors


Mathematical derivation

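Derivation for the k-nearest neighbors model:
As a classification and regression algorithm with no explicit training or learning phase, k-nearest neighbors (kNN) is a rather distinctive method among supervised machine learning algorithms. It is distinctive because, unlike other models, kNN has no loss function, no optimization algorithm, and no training procedure. Given a set of instances and their class labels, a new instance is classified according to the classes of its k nearest instances. Compared with other machine learning models and algorithms, kNN is therefore a very simple method overall.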
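Basic theory of k-nearest neighbors
The most intuitive description of the kNN algorithm: given a training dataset and a new input instance, find the k training instances nearest to that instance; the new instance is assigned to whichever class the majority of those k instances belong to.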

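Key points of the algorithm

  1. Finding the instances nearest to the new instance. This raises the question of how to find them, that is, which distance metric to use in the feature space.
  2. The k instances: how should the value of k be chosen?
  3. The class that the majority of the k instances belong to. This is clearly a majority-vote rule, although other rules can also be used, so the third key point is the classification decision rule.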
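
The three key points above can be sketched for a single query point as follows. This is only a minimal illustration: the function name `knn_classify` and the toy data `X_toy` / `y_toy` are assumptions made for this example and are not part of the implementation later in the article.

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Key point 1: distance metric (Euclidean distance here)
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Key point 2: keep the indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Key point 3: decision rule (majority vote over the neighbors' labels)
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(np.array([0.5, 0.5]), X_toy, y_toy, k=3))  # 0
print(knn_classify(np.array([5.5, 5.5]), X_toy, y_toy, k=3))  # 1
```

Each of the three commented steps can be swapped out independently, which is why they are treated as separate design choices.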

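1) First, the distance metric. In kNN the distance metric can also be viewed as a similarity measure: it reflects how similar two instance points are in the feature space. Distance metrics commonly used in machine learning include Euclidean distance, Manhattan distance, cosine distance, and Chebyshev distance. kNN most often uses Euclidean distance, also known as the L2 distance, which is computed as follows: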

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$
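
For comparison, the four distance metrics named above can be written in a few lines of NumPy. The helper names and the vectors `a` and `b` are made up for illustration; cosine distance is taken here as one minus the cosine similarity, which is a common convention but an assumption as far as this article is concerned.

```python
import numpy as np

def euclidean(a, b):
    # L2 distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # L1 distance: sum of absolute differences
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    # L-infinity distance: largest absolute difference over all coordinates
    return np.max(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on the angle between the vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])   # coordinate differences are (3, 4, 0)

print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(chebyshev(a, b))        # 4.0
print(cosine_distance(a, b))
```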

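2) Next, the choice of k. In general, the value of k has a major effect on the classification result. A small k means predicting from the training instances in a small neighborhood, so only training instances close to the input instance affect the prediction. The prediction then becomes very sensitive to individual nearby points, the classifier is less robust to noise, and it tends to overfit, so k should generally not be too small. A large k means predicting from the training instances in a large neighborhood; distant, less similar instances then also vote, the classification error can grow, the model as a whole becomes simpler, and some degree of underfitting results. In practice, a suitable k is therefore usually chosen by cross-validation.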
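
A tiny one-dimensional example can make the effect of k visible. The data below is invented for illustration: the point at 2.5 is deliberately given the "wrong" label 1 to play the role of noise, and `knn_predict_1d` is a hypothetical helper, not the article's function.

```python
import numpy as np
from collections import Counter

def knn_predict_1d(query, x_train, y_train, k):
    # Sort training points by absolute distance to the query, then vote
    order = np.argsort(np.abs(x_train - query))
    return Counter(y_train[order[:k]].tolist()).most_common(1)[0][0]

# Class 0 lives around 0..3, class 1 around 7..10; the point 2.5 is label noise
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 2.5, 7.0, 8.0, 9.0, 10.0])
y_toy = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1])

query = 2.6
for k in (1, 3, 9):
    print(k, knn_predict_1d(query, x_toy, y_toy, k))
# 1 1   (k too small: the single noisy neighbor decides)
# 3 0   (moderate k: the noise is outvoted)
# 9 1   (k = all points: the overall majority class always wins)
```

With k = 9 the prediction no longer depends on the query at all, which is the extreme form of the underfitting described above.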

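3) Finally, the classification decision rule. This is usually majority voting, which is easy to understand and needs little further explanation here. To summarize, kNN essentially partitions the feature space based on the distance metric and the value of k. Once the training data, the distance metric, the value of k, and the decision rule are fixed, the class of any new input instance is uniquely determined.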
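
As a rough sketch of what "other rules" might look like, the snippet below contrasts plain majority voting with a distance-weighted vote in which closer neighbors count more. The weighted rule and both helper names are illustrative assumptions; the article itself only uses majority voting.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    # Every neighbor gets one vote
    return Counter(labels).most_common(1)[0][0]

def distance_weighted_vote(labels, dists, eps=1e-12):
    # Every neighbor votes with weight 1 / distance (eps avoids division by zero)
    scores = defaultdict(float)
    for label, d in zip(labels, dists):
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)

# Hypothetical neighbors: one very close point of class 1, two distant points of class 0
labels = [1, 0, 0]
dists = [0.1, 2.0, 2.5]

print(majority_vote(labels))                  # 0
print(distance_weighted_vote(labels, dists))  # 1
```

Either rule is deterministic, so the class of a new instance is still uniquely determined once the rule is fixed.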


NumPy implementation

  • Import the required packages and set the plotting parameters:
import numpy as np
from collections import Counter
import random
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle
plt.rcParams['figure.figsize'] = (10.0, 8.0) 
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
  • Using the iris dataset as an example:
iris = datasets.load_iris()
X, y = shuffle(iris.data, iris.target, random_state=13)
X = X.astype(np.float32)
# Simple split into training and test sets
offset = int(X.shape[0] * 0.7)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
y_train = y_train.reshape((-1,1))
y_test = y_test.reshape((-1,1))

print('X_train=', X_train.shape)
print('X_test=', X_test.shape)
print('y_train=', y_train.shape)
print('y_test=', y_test.shape)

X_train= (105, 4)
X_test= (45, 4)
y_train= (105, 1)
y_test= (45, 1)

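  • Define the L2 distance function:
def compute_distances(X, X_train):
    # Pairwise L2 distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    M = np.dot(X, X_train.T)               # cross terms, shape (num_test, num_train)
    te = np.square(X).sum(axis=1)          # squared norms of the test rows
    tr = np.square(X_train).sum(axis=1)    # squared norms of the training rows
    # Broadcast te as a column and tr as a row; clamp tiny negative
    # round-off values to zero so that np.sqrt never returns NaN
    dists = np.sqrt(np.maximum(-2 * M + tr + te[:, np.newaxis], 0))
    return dists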
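
The vectorized computation relies on the expansion ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2. One way to convince yourself that it is correct is to compare it against a plain double loop on random data. The two function names below are hypothetical and the random matrices are stand-ins, not the iris data.

```python
import numpy as np

def pairwise_l2_naive(A, B):
    # Reference implementation: one explicit distance per (row of A, row of B) pair
    out = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            out[i, j] = np.sqrt(np.sum((A[i] - B[j]) ** 2))
    return out

def pairwise_l2_vectorized(A, B):
    cross = A @ B.T                                   # a.b for every pair
    a_sq = np.square(A).sum(axis=1)[:, np.newaxis]    # ||a||^2 as a column
    b_sq = np.square(B).sum(axis=1)[np.newaxis, :]    # ||b||^2 as a row
    # Clamp small negative values caused by floating-point round-off
    return np.sqrt(np.maximum(a_sq - 2.0 * cross + b_sq, 0.0))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
B = rng.normal(size=(7, 4))

print(pairwise_l2_vectorized(A, B).shape)  # (5, 7)
print(np.allclose(pairwise_l2_naive(A, B), pairwise_l2_vectorized(A, B)))  # True
```

The vectorized form replaces the two Python loops with one matrix product, which is the main reason it is preferred here.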
  • Compute the distances between the test instances and the training instances:
dists = compute_distances(X_test, X_train)
print(dists.shape)

(45,105)

  • Visualize the distance matrix:
plt.imshow(dists, interpolation='none')
plt.show()
  • Define the prediction function using the majority-vote decision rule; k defaults to 1 here:
def predict_labels(y_train, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)    
    for i in range(num_test):

        closest_y = []
        # np.argsort returns the indices that sort the distances in ascending order
        labels = y_train[np.argsort(dists[i, :])].flatten()
        closest_y = labels[0:k]

        c = Counter(closest_y)
        y_pred[i] = c.most_common(1)[0][0]    
    return y_pred
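
Since the prediction function hinges on `np.argsort`, here is a small standalone illustration of that step. The distance row and the label column are made-up values that mimic one row of `dists` and an `(n, 1)`-shaped `y_train`.

```python
import numpy as np

# One hypothetical row of the distance matrix: distances to 4 training points
row = np.array([0.9, 0.2, 0.5, 0.1])
# Labels stored as a column vector, as in the article's y_train
y_col = np.array([[0], [1], [2], [1]])

order = np.argsort(row)            # indices from nearest to farthest
print(order)                       # [3 1 2 0]

labels = y_col[order].flatten()    # labels reordered by increasing distance
print(labels)                      # [1 1 2 0]

k = 3
print(labels[:k])                  # [1 1 2] -> majority vote gives class 1
```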
  • Predict the test-set classes and compute the accuracy:
y_test_pred = predict_labels(y_train, dists, k=1)
y_test_pred = y_test_pred.reshape((-1, 1))
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / X_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, X_test.shape[0], accuracy))

Got 44/45 correct => accuracy:0.977778

  • Use 5-fold cross-validation to select the best k:
num_folds = 5
k_choices = [1, 3, 5, 8]

X_train_folds = []
y_train_folds = []

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
for k in k_choices:    
    for fold in range(num_folds): 
        # Hold out one fold of the training set as the validation set
        validation_X_test = X_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
        # Compute distances from the validation fold to the remaining folds
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k=k)
        temp_y_test_pred = temp_y_test_pred.reshape((-1, 1))
        # Compute the classification accuracy
        num_correct = np.sum(temp_y_test_pred == validation_y_test)
        num_test = validation_X_test.shape[0]
        accuracy = float(num_correct) / num_test
        k_to_accuracies[k] = k_to_accuracies.get(k, []) + [accuracy]


# Print the accuracy for every k and every fold
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

k =1, accuracy = 0.904762
k =1, accuracy = 1.000000
k =1, accuracy = 0.952381
k =1, accuracy = 0.857143
k =1, accuracy = 0.952381
k =3, accuracy = 0.857143
k =3, accuracy = 1.000000
k =3, accuracy = 0.952381
k =3, accuracy = 0.857143
k =3, accuracy = 0.952381
k =5, accuracy = 0.857143
k =5, accuracy = 1.000000
k =5, accuracy = 0.952381
k =5, accuracy = 0.904762
k =5, accuracy = 0.952381
k =8, accuracy = 0.904762
k =8, accuracy = 1.000000
k =8, accuracy = 0.952381
k =8, accuracy = 0.904762
k =8, accuracy = 0.952381
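
To make the fold bookkeeping concrete, the sketch below applies the same `np.array_split` plus list-concatenation pattern to a plain index array instead of the real data. The array `idx` is a stand-in for the 105 training rows.

```python
import numpy as np

idx = np.arange(105)                 # stand-in for the 105 training rows
folds = np.array_split(idx, 5)       # a list of 5 arrays
print([len(f) for f in folds])       # [21, 21, 21, 21, 21]

fold = 1
val_idx = folds[fold]                                          # held-out fold
train_idx = np.concatenate(folds[:fold] + folds[fold + 1:])    # the other four folds
print(val_idx[0], val_idx[-1])       # 21 41
print(train_idx.shape)               # (84,)

# Unlike np.split, np.array_split also accepts sizes that do not divide evenly
print([len(f) for f in np.array_split(np.arange(10), 3)])      # [4, 3, 3]
```

Each validation fold therefore holds 21 instances and each temporary training set 84, which explains why the accuracies above are all multiples of 1/21.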

  • Visualize the classification accuracy for different values of k:
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()


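With the extended candidate list [1, 3, 5, 8, 10, 12, 15, 20, 50, 100] used in the class implementation further down, the accuracy varies little for k up to about 20, begins to decline for k between 20 and 50, and falls off a cliff beyond 50 (at k = 100 the mean accuracy is roughly 0.29). This is mainly a consequence of the small amount of data: each cross-validation run trains on only 84 instances, so k = 100 exceeds the number of available neighbors and every prediction reduces to a vote over all training labels.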
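
The collapse at large k can be reproduced without the iris data. In the sketch below, `X_fold` and `y_fold` are random stand-ins for an 84-row training fold with three classes of hypothetical sizes 30, 28 and 26; once k is at least the number of training rows, every query receives the same label.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X_fold = rng.normal(size=(84, 4))                    # hypothetical 84-row training fold
y_fold = np.array([0] * 30 + [1] * 28 + [2] * 26)    # hypothetical class counts

def predict_one(query, X, y, k):
    # Same slicing pattern as labels[0:k]: asking for more neighbors
    # than exist simply returns all of them
    order = np.argsort(np.sqrt(((X - query) ** 2).sum(axis=1)))
    return Counter(y[order[:k]].tolist()).most_common(1)[0][0]

queries = rng.normal(size=(10, 4))
preds = {predict_one(q, X_fold, y_fold, k=100) for q in queries}
print(preds)  # {0}
```

With three roughly balanced classes, always predicting the most frequent one yields an accuracy near one third, which is in line with the k = 100 figures reported for the class implementation.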

  • Check the best value of k:
best_k = k_choices[np.argmax(accuracies_mean)]
print('Best k:', best_k)

Best k: 10

  • Wrap everything in a kNN class:
import numpy as np
from collections import Counter
import random
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle

# Set the plotting parameters
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


class KNearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    # Define the L2 distance function
    def compute_distances(self, X):
        num_test = X.shape[0]
        print('num_test=', num_test)
        # num_test= 21
        num_train = self.X_train.shape[0]
        print('num_train=', num_train)
        # num_train= 84

        M = np.dot(X, self.X_train.T)
        print('M=', M.shape)
        # M= (21, 84)
        te = np.square(X).sum(axis=1)
        print('te=', te.shape)
        # te= (21,)
        tr = np.square(self.X_train).sum(axis=1)
        print('tr=', tr.shape)
        # tr= (84,)
        # Broadcast te as a column and tr as a row; clamp tiny negative
        # round-off values to zero so that np.sqrt never returns NaN
        dists = np.sqrt(np.maximum(-2 * M + tr + te[:, np.newaxis], 0))
        print('dists=', dists.shape)
        # dists= (21, 84)
        return dists

    # Define the prediction function using the majority-vote decision rule; k defaults to 1
    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        print('num_test=', num_test)
        # num_test= 21
        y_pred = np.zeros(num_test)
        print('y_pred=', y_pred.shape)
        # y_pred = (21,)

        for i in range(num_test):
            closest_y = []
            # np.argsort returns the indices that sort the values in ascending order
            labels = self.y_train[np.argsort(dists[i, :])].flatten()
            print('labels=', labels.shape)
            # labels= (84,)
            closest_y = labels[0:k]
            print('closest_y=', closest_y)
            # closest_y = [2], the labels of the k nearest neighbors
            c = Counter(closest_y)
            print('c=', c)
            # c = Counter({2: 1})
            y_pred[i] = c.most_common(1)[0][0]
            print('y_pred=', y_pred)
            # y_pred= [1. 1. 0. 2. 2. 0. 2. 2. 0. 1. 1. 1. 1. 0. 2. 0. 2. 2. 1. 0. 1.] (21 values)
        return y_pred

    # Use 5-fold cross-validation to select the best k
    def cross_validation(self, X_train, y_train):
        num_folds = 5
        k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

        X_train_folds = []
        y_train_folds = []

        X_train_folds = np.array_split(X_train, num_folds)
        print('X_train_folds=', X_train_folds)
        # 5 * 21 * 4
        y_train_folds = np.array_split(y_train, num_folds)
        print('y_train_folds=', y_train_folds)
        # 5 * 21 * 1

        k_to_accuracies = {}
        for k in k_choices:
            for fold in range(num_folds):
                # Hold out one fold of the training set as the validation set

                # Fold number `fold`
                validation_X_test = X_train_folds[fold]
                print('validation_X_test=', validation_X_test)
                # 1 * 21 * 4
                validation_y_test = y_train_folds[fold]
                print('validation_y_test=', validation_y_test)
                # 1 * 21 * 1

                # All folds except fold number `fold`
                temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
                print('temp_X_train=', temp_X_train)
                # 1 * 84 * 4
                temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
                print('temp_y_train=', temp_y_train)
                # 1 * 84 * 1

                self.train(temp_X_train, temp_y_train)

                # Compute distances from the validation fold to the remaining folds
                temp_dists = self.compute_distances(validation_X_test)
                temp_y_test_pred = self.predict_labels(temp_dists, k=k)
                temp_y_test_pred = temp_y_test_pred.reshape((-1, 1))

                # Compute the classification accuracy
                num_correct = np.sum(temp_y_test_pred == validation_y_test)
                num_test = validation_X_test.shape[0]
                accuracy = float(num_correct) / num_test
                # dict.get() returns the value for the key, or the default [] if the key is missing
                k_to_accuracies[k] = k_to_accuracies.get(k, []) + [accuracy]
                print('accuracy=', accuracy)
                # accuracy = 0.23809523809523808
                print('k_to_accuracies=', k_to_accuracies)
                # k_to_accuracies = {1: [0.9047619047619048, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
                #                    3: [0.8571428571428571, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
                #                    5: [0.8571428571428571, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
                #                    8: [0.9047619047619048, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
                #                    10: [0.9523809523809523, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
                #                    12: [0.9523809523809523, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
                #                    15: [0.9523809523809523, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
                #                    20: [0.9523809523809523, 1.0, 0.9523809523809523, 0.7619047619047619, 0.9523809523809523],
                #                    50: [1.0, 1.0, 0.9047619047619048, 0.7619047619047619, 0.9047619047619048],
                #                    100: [0.2857142857142857, 0.38095238095238093, 0.3333333333333333, 0.23809523809523808]}

        # Print the accuracy for every k and every fold
        for k in sorted(k_to_accuracies):
            for accuracy in k_to_accuracies[k]:
                print('k = %d, accuracy = %f' % (k, accuracy))

        # Mean accuracy for each k
        accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])

        # Report the best k (the printed message reads "the best k is <k>")
        best_k = k_choices[np.argmax(accuracies_mean)]
        print('最佳k值为{}'.format(best_k))

        return best_k

    # Prepare the iris dataset
    def create_train_test(self):
        iris = datasets.load_iris()
        data = iris.data
        target = iris.target
        X, y = shuffle(data, target, random_state=13)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))

        # Simple split into training and test sets
        offset = int(X.shape[0] * 0.7)
        # First 70%
        X_train, y_train = X[:offset], y[:offset]
        # Remaining 30%
        X_test, y_test = X[offset:], y[offset:]

        y_train = y_train.reshape((-1, 1))
        y_test = y_test.reshape((-1, 1))

        print('X_train=', X_train.shape)
        # X_train= (105, 4)
        print('X_test=', X_test.shape)
        # X_test= (45, 4)
        print('y_train=', y_train.shape)
        # y_train= (105, 1)
        print('y_test=', y_test.shape)
        # y_test= (45, 1)

        return X_train, y_train, X_test, y_test


if __name__ == '__main__':
    knn_classifier = KNearestNeighbor()
    X_train, y_train, X_test, y_test = knn_classifier.create_train_test()
    best_k = knn_classifier.cross_validation(X_train, y_train)

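    # cross_validation() leaves the classifier holding only the last 84-row fold,
    # so store the full training set again before evaluating on the test set
    knn_classifier.train(X_train, y_train)

    # Distances between the test instances and the training instances
    dists = knn_classifier.compute_distances(X_test)
    print(dists.shape)

    # Predict the test-set classes and compute the accuracy
    y_test_pred = knn_classifier.predict_labels(dists, k=best_k)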
    y_test_pred = y_test_pred.reshape((-1, 1))
    num_correct = np.sum(y_test_pred == y_test)
    accuracy = float(num_correct) / X_test.shape[0]
    print('Got %d / %d correct => accuracy: %f' % (num_correct, X_test.shape[0], accuracy))

k = 1, accuracy = 0.904762
k = 1, accuracy = 1.000000
k = 1, accuracy = 0.952381
k = 1, accuracy = 0.857143
k = 1, accuracy = 0.952381
k = 3, accuracy = 0.857143
k = 3, accuracy = 1.000000
k = 3, accuracy = 0.952381
k = 3, accuracy = 0.857143
k = 3, accuracy = 0.952381
k = 5, accuracy = 0.857143
k = 5, accuracy = 1.000000
k = 5, accuracy = 0.952381
k = 5, accuracy = 0.904762
k = 5, accuracy = 0.952381
k = 8, accuracy = 0.904762
k = 8, accuracy = 1.000000
k = 8, accuracy = 0.952381
k = 8, accuracy = 0.904762
k = 8, accuracy = 0.952381
k = 10, accuracy = 0.952381
k = 10, accuracy = 1.000000
k = 10, accuracy = 0.952381
k = 10, accuracy = 0.904762
k = 10, accuracy = 0.952381
k = 12, accuracy = 0.952381
k = 12, accuracy = 1.000000
k = 12, accuracy = 0.952381
k = 12, accuracy = 0.857143
k = 12, accuracy = 0.952381
k = 15, accuracy = 0.952381
k = 15, accuracy = 1.000000
k = 15, accuracy = 0.952381
k = 15, accuracy = 0.857143
k = 15, accuracy = 0.952381
k = 20, accuracy = 0.952381
k = 20, accuracy = 1.000000
k = 20, accuracy = 0.952381
k = 20, accuracy = 0.761905
k = 20, accuracy = 0.952381
k = 50, accuracy = 1.000000
k = 50, accuracy = 1.000000
k = 50, accuracy = 0.904762
k = 50, accuracy = 0.761905
k = 50, accuracy = 0.904762
k = 100, accuracy = 0.285714
k = 100, accuracy = 0.380952
k = 100, accuracy = 0.333333
k = 100, accuracy = 0.238095
k = 100, accuracy = 0.190476
最佳k值为10
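
To see why k = 10 comes out on top, the per-fold accuracies printed above can be averaged by hand. The dictionary below simply copies those printed values; nothing is recomputed from the data.

```python
import numpy as np

# Per-fold accuracies copied from the printed output above
k_to_accuracies = {
    1:   [0.904762, 1.000000, 0.952381, 0.857143, 0.952381],
    3:   [0.857143, 1.000000, 0.952381, 0.857143, 0.952381],
    5:   [0.857143, 1.000000, 0.952381, 0.904762, 0.952381],
    8:   [0.904762, 1.000000, 0.952381, 0.904762, 0.952381],
    10:  [0.952381, 1.000000, 0.952381, 0.904762, 0.952381],
    12:  [0.952381, 1.000000, 0.952381, 0.857143, 0.952381],
    15:  [0.952381, 1.000000, 0.952381, 0.857143, 0.952381],
    20:  [0.952381, 1.000000, 0.952381, 0.761905, 0.952381],
    50:  [1.000000, 1.000000, 0.904762, 0.761905, 0.904762],
    100: [0.285714, 0.380952, 0.333333, 0.238095, 0.190476],
}

k_sorted = sorted(k_to_accuracies)
means = np.array([np.mean(k_to_accuracies[k]) for k in k_sorted])
for k, m in zip(k_sorted, means):
    print('k = %3d, mean accuracy = %.4f' % (k, m))

best_k = k_sorted[int(np.argmax(means))]
print(best_k)  # 10
```

The mean for k = 10 is about 0.9524, slightly ahead of k = 8, 12 and 15 at about 0.9429, so the margin between the top candidates is small on a dataset of this size.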


