Gaussian Optimization of the KNN Algorithm

MarToony|名角

Category: Machine Learning. Tags: python, machine learning

Article link: https://blog.csdn.net/m0_38052500/article/details/107296750

Why weighting is needed

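One weakness of the basic KNN algorithm is its sensitivity to class imbalance. If the classes in the training set have very different sample counts, they do not get an equal chance to take part in the vote of the k neighbors: samples from the larger class tend to make up most of the k neighbors. For example, suppose the training set contains 1 sample of the Chinese character "一" and 99 samples of "二", and I supply a test image whose true class is "一" with k set to 5. No matter how closely the test image resembles the single "一" training sample, "二" always holds the majority among the k neighbors, with a share of at least 80%. An algorithm that behaves this way has failed.
To avoid this, we can use weighted KNN, whose idea is that neighbors closer to the test sample get larger weights.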
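The imbalance problem can be reproduced on toy data. The sketch below is only an illustration under assumptions: the 1-D points, the helper names `knn_votes` and `gaussian_weight`, and the choice of sigma = 1 are all made up for the example and are not the article's data or code. It compares a plain majority vote with a distance-weighted vote using a Gaussian weight of the kind the article introduces later.

```python
import math
from collections import Counter


def gaussian_weight(distance, sigma=1.0):
    """Weight that shrinks as the distance grows (hypothetical helper)."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))


def knn_votes(train, query, k, weighted):
    """Return per-class vote totals from the k nearest 1-D training points."""
    neighbours = sorted((abs(x - query), label) for x, label in train)[:k]
    votes = Counter()
    for distance, label in neighbours:
        votes[label] += gaussian_weight(distance) if weighted else 1.0
    return votes


# Toy 1-D data: a single "one" sample and 99 "two" samples.
train = [(0.0, "one")] + [(5.0 + 0.01 * j, "two") for j in range(99)]
query = 0.1  # sits almost on top of the lone "one" sample

plain = knn_votes(train, query, k=5, weighted=False)
weighted = knn_votes(train, query, k=5, weighted=True)

print(plain.most_common(1)[0][0])     # winner of the plain majority vote
print(weighted.most_common(1)[0][0])  # winner of the weighted vote
```

With these toy numbers the plain vote is decided by the four far-away "two" samples, while the weighted vote is decided by the single nearby "one" sample.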

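You may have noticed that every class in my training set has about 270 samples, so the classes are fairly balanced and the problem described above does not arise.
[Figure: number of training samples per class]
Indeed it does not. But as I mentioned earlier, I want this course to use handwritten character recognition as a thread for exploring model training methods. So although weighting is not really needed here, it does no harm to try it.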

Weighting methods

Implementation with the Gaussian function

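A Gaussian function is used to weight samples according to their distance: as the distance between a training sample and the test sample increases, the weight of that sample decreases.
Closer neighbors are given larger weights (you are closer to me, so I consider you more similar to me and give you a larger weight), farther neighbors get correspondingly smaller weights, and the result is taken as a weighted average (for classification, a weighted vote).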

Introduction to the Gaussian function

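The Gaussian function is widely used: in statistics it describes the normal distribution; in signal processing it defines the Gaussian filter; in image processing the two-dimensional Gaussian kernel is commonly used for Gaussian blur; and in mathematics it is mainly used to solve the heat equation and the diffusion equation and to define the Weierstrass transform.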

The probability density function of the normal distribution is a Gaussian function.

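The formula is:
$$f(x) = a \exp\left(-\frac{(x-b)^2}{2c^2}\right)$$
where a is the height of the curve's peak, b is the position of the center of the peak, and c is the standard deviation, which controls the width of the bell shape.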
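The roles of the three parameters a, b and c described above can be checked numerically. This is a minimal sketch; the function name `gaussian` and the sample values are chosen only for illustration.

```python
import math


def gaussian(x, a=1.0, b=0.0, c=1.0):
    """General Gaussian: a * exp(-(x - b)^2 / (2 * c^2))."""
    return a * math.exp(-((x - b) ** 2) / (2 * c ** 2))


# At the center x = b the value equals the peak height a.
peak = gaussian(3.0, a=2.0, b=3.0, c=1.5)

# One standard deviation away from the center the value is a * exp(-1/2).
one_width_away = gaussian(4.5, a=2.0, b=3.0, c=1.5)

# The curve is symmetric about x = b.
left, right = gaussian(2.0, 2.0, 3.0, 1.5), gaussian(4.0, 2.0, 3.0, 1.5)

print(peak, one_width_away, left, right)
```

The weight function used later in the article corresponds to the special case a = 1, b = 0, c = sigma.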

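Its graph is:
[Figure: Gaussian curves for several values of sigma]
As sigma increases, the peak of the Gaussian (in its normalized, unit-area form) becomes lower and the whole curve becomes flatter, so its smoothing effect on an image becomes more pronounced.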

Implementing Gaussian weighting

import operator

import numpy as np


def Gaussian(distance, sigma=9.0):
    """Take a distance and return its weight."""
    weight = np.exp(-distance**2 / (2 * sigma**2))
    return weight

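# Test function
def test_foronePng(test_imgarr, data_lst, type_lst, k=4, height=20, width=20,sigma=10):
    """
    input: the flattened test image, the training set as a 2-D array, the training labels as a 1-D array.
    output: the predicted class of the test image.
    """
    # 1. Euclidean distance (O_Distances is a helper not shown in this article)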
    o_distances = O_Distances(data_lst, test_imgarr)
    sortedDistancesIndicies = o_distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = type_lst[sortedDistancesIndicies[i]]
        # Weighted vote: add the Gaussian weight instead of a plain count of 1
        weight = Gaussian(o_distances[sortedDistancesIndicies[i]],sigma)
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + weight*1
        
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    k_data = [o_distances[sortedDistancesIndicies[i]] for i in range(k)]
    k_type = [str(type_lst[sortedDistancesIndicies[i]])+"-{}".format(i) for i in range(k)]
    k_data_type = dict(zip(k_type,k_data))
    return sortedClassCount[0][0] , k_data_type
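The helper `O_Distances` is called above but not shown in this article. The following is only an assumed version, written so that it returns a NumPy array of Euclidean distances that supports `.argsort()` as the calling code expects; the author's actual helper may differ.

```python
import numpy as np


def O_Distances(data_lst, test_imgarr):
    """Assumed helper: Euclidean distance from one test vector to every row of data_lst."""
    data = np.asarray(data_lst, dtype=float)
    test = np.asarray(test_imgarr, dtype=float).ravel()
    # Broadcasting subtracts the test vector from every training row.
    return np.sqrt(((data - test) ** 2).sum(axis=1))


# Tiny made-up training set with three 2-D samples.
train = np.array([[0, 0], [3, 4], [6, 8]])
distances = O_Distances(train, np.array([0, 0]))
order = distances.argsort()  # indices of training samples, nearest first

print(distances)
print(order)
```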

The main function sweeps over the sigma parameter and plots the resulting accuracy curve.

if __name__ == '__main__':
    # Parameters
    k = 5
    height = 20
    width = 20


    # 1. Load the serialized training set
    # getArrDataSet does this, relying on the two helpers getObjectPath and img2vector
    pngDataSets = getArrDataSet(r'./train', height, width)
    data_lst = []
    for arr in pngDataSets.data:
        data_lst.append(arr)
    # Convert the samples to an array
    data_lst = np.array(data_lst)
    # Labels as an array
    type_lst = pngDataSets.type.values

    # Prepare the test data
    pngPaths = getObjectPath(r"./test")

    sigma_dict = {}

    for sigma in np.linspace(1,15,29):
        i = 0  # number of correct predictions
        count = 0  # total number of test samples
        print(sigma)
        for eachPath in pngPaths:
            filename = os.path.basename(eachPath)
            second_pngPaths = getObjectPath(eachPath)
            for second_eachPath in second_pngPaths:
                count += 1
                if count % 50 == 0: print(count)
                # Convert the test image to the standard array form
                test_imgarr = img2vector(second_eachPath, height, width)

                testName,k_data_type = test_foronePng(test_imgarr, data_lst, type_lst, k, height, width,sigma)
                if testName == filename:
                    i += 1
                # else:
                #     k_data_type["True"] = filename
                #     print(k_data_type)
        # Evaluation metrics
        accuracy = i / count  # accuracy
        ScoreF1 = accuracy * 1 * 2 / (accuracy + 1)  # F1 score, treating accuracy as precision and taking recall as 1
        sigma_dict[sigma] = accuracy

        print("Accuracy:{}".format(round(accuracy, 4)))
        print("F1-Score:{}".format(round(ScoreF1, 4)))

    plt.figure(figsize=(10,10))
    plt.plot(list(sigma_dict.keys()),list(sigma_dict.values()))
    for a, b in zip(list(sigma_dict.keys()), list(sigma_dict.values())):
        plt.text(a, b, (round(float(b*1000))%10000), ha='center', va='bottom', fontsize=5)
    plt.show()
    print(list(sigma_dict.keys()))
    print(list(sigma_dict.values()))
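The F1 value printed by the loop above is derived from accuracy alone (it plugs accuracy in as precision and takes recall as 1). If per-class precision and recall are wanted, a macro-averaged F1 could be computed from the true and predicted labels. The sketch below is an assumption-laden illustration: the function `macro_f1` and the label lists are hypothetical and are not part of the article's code.

```python
from collections import Counter


def macro_f1(true_labels, predicted_labels):
    """Average of the per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed a sample of the true class
    scores = []
    for label in set(true_labels) | set(predicted_labels):
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels in the same style as the article's folder names.
y_true = ["00002", "00002", "00006", "00006"]
y_pred = ["00002", "00006", "00006", "00006"]
score = macro_f1(y_true, y_pred)
print(score)
```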

Results of Gaussian weighting

[Figure: accuracy versus sigma]
Whatever value sigma takes, the accuracy essentially never exceeds 0.9451.

Analysis and optimization of the Gaussian weighting results

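In fact, looking at two things reveals something new.
The first is that the Gaussian function becomes flatter as the sigma parameter increases.
[Figure: Gaussian curves for sigma = 1 to 8]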

import numpy as np
import matplotlib.pyplot as plt

def Gaussian(distance, sigma = 1.0):
    """Take a distance and return its weight."""
    weight = np.exp(-distance**2/(2*sigma**2))
    return weight

x = np.arange(-10,10)
print(x)

y = [Gaussian(x_one) for x_one in x]

plt.figure(figsize=(12,4))

# One panel per sigma value from 1 to 8, laid out on a 2 x 4 grid
for sigma in range(1, 9):
    plt.subplot(2, 4, sigma)
    plt.plot(x, [Gaussian(x_one, sigma) for x_one in x])

plt.show()

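Look specifically at the case sigma = 1.
[Figure: Gaussian curve for sigma = 1]
The curve is steepest within the interval [-2, 2]. If two distance values both fall on one side of this interval, say in [0, 2], then after passing through the Gaussian function the sample with the smaller distance gets a much larger value than the sample with the larger distance, so when the votes per class are counted, the class of the closer sample receives a higher weight.
But what happens in practice? This is the second thing: the distance values actually fed into the Gaussian function lie between about 7 and 12.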

{'00002-0': 7.745966692414834, '00002-1': 7.9372539331937721, '00002-2': 8.1853527718724504, '00002-3': 8.5440037453175304, 'True': '00006'}
{'00005-0': 9.8488578017961039, '00005-1': 9.9498743710661994, '00005-2': 10.0, '00009-3': 10.148891565092219, 'True': '00006'}
{'00002-0': 10.392304845413264, '00006-1': 10.488088481701515, '00006-2': 10.723805294763608, '00002-3': 10.816653826391969, 'True': '00006'}
{'00002-0': 11.090536506409418, '00002-1': 11.357816691600547, '00005-2': 11.532562594670797, '00009-3': 11.575836902790225, 'True': '00006'}

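Comparing these with the Gaussian curve for sigma = 1, the weights over the interval [7, 12] are all vanishingly close to zero and practically indistinguishable on the curve, so the weighting is effectively doing nothing.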
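The statement about steep and flat regions can be made concrete with the derivative of the weight function used in the code, w(d) = exp(-d^2 / (2 sigma^2)). The short derivation below is standard calculus and is added only as a reference; the symbol w is introduced here for convenience.

```latex
w(d) = \exp\!\left(-\frac{d^{2}}{2\sigma^{2}}\right),
\qquad
w'(d) = -\frac{d}{\sigma^{2}}\, w(d)

% Setting w''(d) = 0 gives d = \sigma, so the curve is steepest there:
\max_{d \ge 0} \lvert w'(d) \rvert
  = \lvert w'(\sigma) \rvert
  = \frac{e^{-1/2}}{\sigma} \approx \frac{0.61}{\sigma}

% For \sigma = 1 and d = 7 the curve is essentially flat:
\lvert w'(7) \rvert = 7\, e^{-24.5} \approx 1.6 \times 10^{-10}
```

So for sigma = 1 the steepest point is at d = 1, and at the observed distances of 7 and above both the weight and its slope are negligible in absolute terms.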

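To make the Gaussian weighting take effect, each distance value passed to the Gaussian function needs to be divided by 6.
This gives the result in the left plot below (the right plot is the result without this extra processing).
[Figure: accuracy versus sigma; left: distances divided by 6, right: unprocessed distances]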
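The effect of the division can be seen directly on the distances listed earlier. The sketch below uses the four distances from the first listed sample and an assumed sigma of 1; the helper name `gaussian_weight` is hypothetical. It also checks an identity that follows from the formula: dividing the distance by 6 gives the same weight as multiplying sigma by 6, since exp(-(d/6)^2 / (2 sigma^2)) = exp(-d^2 / (2 (6 sigma)^2)).

```python
import math


def gaussian_weight(distance, sigma=1.0):
    """Same formula as the article's Gaussian(), written with the math module."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))


# The four nearest distances from the first sample listed in the article.
distances = [7.745966692414834, 7.9372539331937721,
             8.1853527718724504, 8.5440037453175304]

raw = [gaussian_weight(d, sigma=1.0) for d in distances]         # unscaled input
scaled = [gaussian_weight(d / 6, sigma=1.0) for d in distances]  # input divided by 6

print(raw)
print(scaled)

# Dividing the distance by 6 is equivalent to multiplying sigma by 6.
same = all(
    math.isclose(gaussian_weight(d / 6, 1.0), gaussian_weight(d, 6.0))
    for d in distances
)
print(same)
```

Read this way, sweeping sigma over a range with the scaled input covers the same weights as sweeping a sigma six times larger with the raw input.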

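The comparison shows that with the scaled input the accuracy reaches 0.945 one step earlier in the sigma sweep than with the unscaled input, so scaling the input does produce the intended weighting effect. Looking at either plot on its own, however: the larger sigma is, the more the input distances fall in the flat part of the Gaussian, and the weaker the weighting effect becomes (the effect being that closer samples get higher weights and farther samples lower weights). The smaller sigma is, the stronger the weighting effect. Yet a close look at the left plot shows that as the weighting effect grows stronger, the model's accuracy actually drops. This suggests that Gaussian weighting hinders accuracy on this data set. **Gaussian weighting is therefore not suitable here.** Still, being unsuitable here does not mean it is unsuitable in every situation; it depends on the specific case.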

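Comparing the left and right plots also shows that dividing the input by 6 noticeably reduces the accuracy drop caused by Gaussian weighting, which suggests that the division by 6 above is reasonable.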

Implementing weighting with the sigmoid function

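The experiment above used the Gaussian function for weighting. It can work because of a property of the Gaussian function: on some interval the curve is steep, which widens the gap between the weights of two samples.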

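This reminded me of the sigmoid function, which serves as an activation function in deep neural networks. Used here as a weighting function, it also performed reasonably well in experiments, although it likewise failed to get past the 0.945 mark.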

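The sigmoid function is as follows:

$$S(x) = \frac{1}{1 + e^{-\alpha x}}$$
[Figure: the sigmoid activation function]
As alpha increases, the curve becomes steeper and the weighting effect becomes stronger.
The code is as follows: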

def sigmoid(each_distance,alpha = 10):
    weight = 1/(1+np.exp(-each_distance*alpha))
    return weight

def test_foronePng(test_imgarr, data_lst, type_lst, k=4, height=20, width=20,sigma_or_alpha=10):
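    """
    input: the flattened test image, the training set as a 2-D array, the training labels as a 1-D array.
    output: the predicted class of the test image.
    """
    # 1. Euclidean distance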
    o_distances = O_Distances(data_lst, test_imgarr)
    sortedDistancesIndicies = o_distances.argsort()
    classCount = {}
    # print(o_distances[sortedDistancesIndicies])
    # The mean distance of the k nearest neighbours is the same for every i, so compute it once
    avg_distance = np.average(o_distances[sortedDistancesIndicies[:k]])
    for i in range(k):
        # Likewise, pngDataSets.type[sortedDistancesIndicies[i]] would give the label of the i-th nearest sample
        voteIlabel = type_lst[sortedDistancesIndicies[i]]

        ## Weighting scheme =====================
        # weight = Gaussian(o_distances[sortedDistancesIndicies[i]]/6,sigma_or_alpha)

        # Sigmoid scheme: centre the distance on the mean and flip the sign,
        # so that closer-than-average neighbours get a weight above 0.5
        each_distance = -(o_distances[sortedDistancesIndicies[i]] - avg_distance)
        weight = sigmoid(each_distance,sigma_or_alpha)

        # Add the weight instead of a plain count of 1
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + weight*1
        # classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1


    # Sort the (label, vote total) pairs by vote total, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

    k_data = [o_distances[sortedDistancesIndicies[i]] for i in range(k)]
    k_type = [str(type_lst[sortedDistancesIndicies[i]])+"-{}".format(i) for i in range(k)]
    k_data_type = dict(zip(k_type,k_data))

    return sortedClassCount[0][0] , k_data_type
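To see how alpha changes the sigmoid weights, the centering step used in the function above can be applied to a handful of distances. This is a sketch under assumptions: the helper `sigmoid_weights` is hypothetical, the distances are the four values from the first sample listed earlier in the article, and the two alpha values are picked only for contrast.

```python
import math


def sigmoid(x, alpha=10.0):
    """Same formula as the article's sigmoid(), written with the math module."""
    return 1.0 / (1.0 + math.exp(-x * alpha))


def sigmoid_weights(distances, alpha):
    """Centre each distance on the mean, flip the sign, then apply the sigmoid."""
    mean = sum(distances) / len(distances)
    return [sigmoid(-(d - mean), alpha) for d in distances]


distances = [7.745966692414834, 7.9372539331937721,
             8.1853527718724504, 8.5440037453175304]

gentle = sigmoid_weights(distances, alpha=0.1)  # nearly uniform weights
sharp = sigmoid_weights(distances, alpha=10)    # strongly favours the closest neighbours

print([round(w, 3) for w in gentle])
print([round(w, 3) for w in sharp])
```

Because the distances are centered on their mean, a neighbor at exactly the mean distance always gets a weight of 0.5, whatever alpha is; alpha only controls how quickly the weights move toward 0 or 1 on either side.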

The main function is as follows:


if __name__ == '__main__':
    # Parameters
    k = 4
    height = 20
    width = 20


    # 1. Load the serialized training set
    # getArrDataSet does this, relying on the two helpers getObjectPath and img2vector
    pngDataSets = getArrDataSet(r'./train', height, width)
    data_lst = []
    for arr in pngDataSets.data:
        data_lst.append(arr)
    # Convert the samples to an array
    data_lst = np.array(data_lst)
    # Labels as an array
    type_lst = pngDataSets.type.values

    # Prepare the test data
    pngPaths = getObjectPath(r"./test")

    sigma_dict = {}

    for sigma in [0.1,0.2,0.3,0.5,0.7,1,1.5,2,2.5,3,3.5,4,4.5,5,6,7,8,9,10]:

        # sigma = 1
        i = 0  # number of correct predictions
        count = 0  # total number of test samples
        print(sigma)
        for eachPath in pngPaths:
            filename = os.path.basename(eachPath)
            second_pngPaths = getObjectPath(eachPath)
            for second_eachPath in second_pngPaths:
                count += 1
                if count % 50 == 0: print(count)
                # Convert the test image to the standard array form
                test_imgarr = img2vector(second_eachPath, height, width)

                testName,k_data_type = test_foronePng(test_imgarr, data_lst, type_lst, k, height, width,sigma)
                if testName == filename:
                    i += 1
                # else:
                #     k_data_type["True"] = filename
                #     print(k_data_type)
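        # Evaluation metrics (the loop variable is named sigma but is used as the sigmoid's alpha here)
        accuracy = i / count  # accuracy
        ScoreF1 = accuracy * 1 * 2 / (accuracy + 1)  # F1 score, treating accuracy as precision and taking recall as 1
        sigma_dict[sigma] = accuracy

        print("Accuracy:{}".format(round(accuracy, 4)))
        print("F1-Score:{}".format(round(ScoreF1, 4)))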

        # break

    plt.figure(figsize=(10,10))
    plt.plot(list(sigma_dict.keys()),list(sigma_dict.values()))
    for a, b in zip(list(sigma_dict.keys()), list(sigma_dict.values())):
        plt.text(a, b, (round(float(b*1000))%1000), ha='center', va='bottom', fontsize=10)
    plt.show()
    print(sigma_dict)