1. Algorithm Overview
1.1 Introduction
K-nearest neighbors (KNN): for a given sample, find the K samples closest to it in feature space; the sample is assigned to whichever class holds the majority among those K neighbors.
The core of KNN lies in the distance measure between samples and the choice of K. It is mainly used for classification and regression.
1.2 General Workflow
- Choose the number of neighbors k: picking a suitable k, which is a hyperparameter of the KNN algorithm.
- Compute distances: for the sample to be predicted, compute its distance to every sample in the training set. Common distance metrics include Euclidean, Manhattan, and Minkowski distance.
- Find the nearest neighbors: based on the computed distances, select the k training samples closest to the query sample.
- Vote: decide the class of the query sample by majority vote (or distance-weighted vote) over the labels of the k nearest neighbors.
- Output: return the voted class as the prediction for the query sample.
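The steps above can be sketched in a few lines of NumPy (a minimal illustration with made-up toy data, using Euclidean distance and a plain majority vote; `knn_predict` is a hypothetical helper name):

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Minimal sketch of the workflow: distances -> k nearest -> majority vote."""
    # step 2: Euclidean distance from the query to every training sample
    dists = np.linalg.norm(train_x - query, axis=1)
    # step 3: indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # step 4: majority vote over their labels
    votes = Counter(train_y[nearest])
    # step 5: return the winning class
    return votes.most_common(1)[0][0]

# two small clusters: class 0 near the origin, class 1 near (5, 5)
train_x = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.]])
train_y = np.array([0, 0, 0, 1, 1])
print(knn_predict(train_x, train_y, np.array([0.5, 0.5]), k=3))  # query near the class-0 cluster -> 0
```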
1.3 Choosing K
(1) The value of K can change the predicted class. For example:
with K = 3, among the 3 nearest samples there is 1 red circle vs. 2 blue triangles, so the query sample is labeled blue triangle;
with K = 5, among the 5 nearest samples there are 3 red circles vs. 2 blue triangles, so the query sample is labeled red circle.
(2) Too large a K causes underfitting, because the vote is dominated by distant, less relevant samples; too small a K causes overfitting, because the prediction becomes sensitive to individual noisy points.
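This trade-off shows up directly on a toy dataset (using scikit-learn's built-in iris data purely for illustration, not the MNIST data used later):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k=1 memorizes the training set: near-perfect training accuracy, but the
# decision boundary follows every noisy point (overfitting risk)
acc_k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)

# k = n_samples: every query sees all points, so the vote degenerates into
# the global majority class (underfitting)
acc_kn = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y).score(X, y)

print(acc_k1, acc_kn)
```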
2. Implementation
2.1 Loading and Preprocessing the Dataset
The MNIST handwritten-digit dataset is downloaded directly through torchvision;
one sample can be printed to inspect the data (similar to the figure below):
2.2 KNN Implementation
2.2.1 Hand-Written KNN
(1) Code:
def knn(K):
    train_x, train_y, test_x, test_y = load_data()
    cnt = 0
    for i in range(len(test_x)):
        x = test_x[i]
        y = test_y[i]
        vec = get_vec(K, x, train_x, train_y)
        weight = []  # (weight, label) pairs
        sum_distance = 0.0
        for j in range(K):
            sum_distance += vec[j][0]  # sum of the K nearest distances
        for j in range(K):
            weight.append([1 - vec[j][0] / sum_distance, vec[j][1]])  # closer neighbors get larger weights
        # accumulate the weights of neighbors that share the same label
        num = []  # collect the labels that appear
        for j in range(K):
            num.append(weight[j][1])
        num = list(set(num))  # deduplicate
        final_res = []
        for j in range(len(num)):
            res = 0.0
            for k in range(len(weight)):
                if weight[k][1] == num[j]:  # sum the weights of neighbors with this label
                    res += weight[k][0]
            final_res.append([res, num[j]])
        final_res = sorted(final_res, key=lambda e: e[0], reverse=True)  # sort by weight, descending
        if y == final_res[0][1]:
            cnt = cnt + 1
        print('prediction: %d' % final_res[0][1], 'ground truth: %d' % y)
    print('accuracy:', cnt / len(test_x))
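The helpers `load_data` and `get_vec` are not shown in the report. From how `get_vec` is used above, it must return the K nearest neighbors of `x` as (distance, label) pairs sorted ascending by distance; a plausible sketch (a hypothetical reconstruction, not the original code) is:

```python
import numpy as np

def get_vec(K, x, train_x, train_y):
    # hypothetical reconstruction: Euclidean distance from x to every training
    # sample, paired with that sample's label, sorted ascending by distance
    dists = np.linalg.norm(train_x - x, axis=1)
    order = np.argsort(dists)[:K]
    return [[float(dists[j]), int(train_y[j])] for j in order]

# tiny sanity check with made-up points
train_x = np.array([[0., 0.], [3., 4.], [6., 8.]])
train_y = np.array([0, 1, 1])
print(get_vec(2, np.array([0., 0.]), train_x, train_y))  # [[0.0, 0], [5.0, 1]]
```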
(2) Result:
2.2.2 Using scikit-learn Directly
(1) Code:
if __name__ == '__main__':
    K = 10
    train_x, train_y, test_x, test_y = load_data()
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(train_x, train_y)
    acc = knn.score(test_x, test_y)
    print('accuracy:', acc)
(2) Result:
3. Analysis of Results
1. Strengths and weaknesses of KNN
KNN is simple to understand and easy to implement, but its prediction cost is high on large datasets, since every query must be compared against all training samples; handwritten-digit recognition with KNN is therefore far slower than a deep-learning approach. KNN is also sensitive to feature scaling and to the dimensionality of the data.
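The sensitivity to feature scaling can be seen in a tiny synthetic example (made-up numbers, min-max scaling used for illustration): the nearest neighbor changes once the features are put on a comparable scale.

```python
import numpy as np

# two features on very different scales, e.g. income and age
X = np.array([[30000., 25.],
              [31000., 60.],
              [60000., 26.]])

# without scaling, Euclidean distance is dominated by the income feature
d01 = np.linalg.norm(X[0] - X[1])  # close in income, far in age
d02 = np.linalg.norm(X[0] - X[2])  # far in income, close in age

# after min-max scaling each feature to [0, 1], age carries comparable weight
Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
s01 = np.linalg.norm(Xs[0] - Xs[1])
s02 = np.linalg.norm(Xs[0] - Xs[2])

print(d01 < d02, s01 < s02)  # True False -- the nearest neighbor flips
```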
2. Possible improvements
1) Introduce cross-validation to select K
2) Switch to a deep neural network
(1) Code (PyTorch implementation):
import torch
from torchvision import transforms
from torchvision import datasets
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torch.optim as optim

batch_size = 64

# dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_data = datasets.MNIST(root='../data/minist', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_data = datasets.MNIST(root='../data/minist', train=False, download=True, transform=transform)  # train=False selects the test split
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

# design model
class InceptionA(torch.nn.Module):
    def __init__(self, in_channels) -> None:
        super(InceptionA, self).__init__()
        self.branch1x1 = torch.nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch5x5_1 = torch.nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch5x5_2 = torch.nn.Conv2d(16, 24, kernel_size=5, padding=2)
        self.branch3x3_1 = torch.nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch3x3_2 = torch.nn.Conv2d(16, 24, kernel_size=3, padding=1)
        self.branch3x3_3 = torch.nn.Conv2d(24, 24, kernel_size=3, padding=1)
        self.pooling = torch.nn.Conv2d(in_channels, 24, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)
        branch5x5 = self.branch5x5_1(x)
        branch5x5 = self.branch5x5_2(branch5x5)
        branch3x3 = self.branch3x3_1(x)
        branch3x3 = self.branch3x3_2(branch3x3)
        branch3x3 = self.branch3x3_3(branch3x3)
        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.pooling(branch_pool)
        outputs = [branch1x1, branch3x3, branch5x5, branch_pool]
        return torch.cat(outputs, dim=1)  # concatenate along the channel dimension: 16+24+24+24 = 88 channels

class Net(torch.nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = torch.nn.Conv2d(88, 20, kernel_size=5)
        self.incep1 = InceptionA(in_channels=10)
        self.incep2 = InceptionA(in_channels=20)
        self.mp = torch.nn.MaxPool2d(2)
        self.fc = torch.nn.Linear(1408, 10)

    def forward(self, x):
        in_size = x.size(0)
        x = F.relu(self.mp(self.conv1(x)))  # 1 -> 10 channels
        x = self.incep1(x)                  # 10 -> 88 channels
        x = F.relu(self.mp(self.conv2(x)))  # 88 -> 20 channels
        x = self.incep2(x)                  # 20 -> 88 channels; 88 * 4 * 4 = 1408 features after flattening
        x = x.view(in_size, -1)
        x = self.fc(x)
        return x

model = Net()

# optimizer & loss
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

# train
def train(epoch):
    running_loss = 0.0
    for batch_idx, data in enumerate(train_loader, 0):  # batch_idx is the index of the current mini-batch
        inputs, target = data
        optimizer.zero_grad()
        y_pred = model(inputs)
        loss = criterion(y_pred, target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if batch_idx % 300 == 299:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, batch_idx + 1, running_loss / 300))
            running_loss = 0.0

def test():
    correct, total = 0, 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            outputs = model(images)
            _, predicted = torch.max(outputs.data, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print('accuracy on test set: %d %%' % (100 * correct / total))

if __name__ == '__main__':
    for epoch in range(10):
        train(epoch)
        test()
(2) Result (higher accuracy than the KNN algorithm):
4. Model Performance Evaluation
1. Confusion matrix:
The following values are computed from the confusion matrix:
A quick worked example of recall and precision:
(1) All samples [5 positives in this example]
(2) Choose a threshold: with the threshold set to 0.6 [i.e., samples with confidence above 0.6 are predicted positive], 3 samples are predicted positive
1) Precision: 3 samples are predicted positive, but only 2 of them [red box in the figure above] are truly positive, so precision = 2/3
2) Recall: there are 5 positive samples [blue box], of which 2 are correctly predicted, so recall = 2/5
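The same numbers can be checked in code (toy labels and scores invented to match the example: 5 positives, threshold 0.6, 2 true positives among 3 predicted positives):

```python
import numpy as np

# hypothetical data matching the worked example above
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0])              # 5 positives, 3 negatives
scores = np.array([0.9, 0.7, 0.3, 0.2, 0.1, 0.8, 0.4, 0.1])

y_pred = (scores > 0.6).astype(int)  # threshold at 0.6 -> 3 predicted positives

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # 2 true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # 1 false positive
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # 3 missed positives

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/5
print(precision, recall)
```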
2. ROC and PR Curves
1. ROC curve
(1) Concept
The ROC curve is commonly used to evaluate binary classifiers. It plots the true positive rate (TPR, also called recall) on the vertical axis against the false positive rate (FPR) on the horizontal axis, computed as:
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
The closer the curve is to the top-left corner, the better the model. However, under class imbalance, a large number of negative samples keeps FPR almost unchanged, so the curve can look deceptively optimistic.
(2) Plotting the ROC curve
1) Minimal example:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# TPR, FPR at every threshold
fpr, tpr, threshold = roc_curve(data_labels, confidence_scores)
plt.figure()
plt.title('ROC Curve')
plt.xlabel('FPR')
plt.ylabel('TPR')
roc_auc = auc(fpr, tpr)  # keep a distinct name so the auc() function is not shadowed
plt.plot(fpr, tpr, label='Class 0 (AUC = %0.2f)' % roc_auc)
plt.legend(loc='lower left')
plt.show()
2) Visualization:
3) ROC curves for the handwritten digits:
Code:
# ROC curve for each of the 10 digit classes
y_pred = knn.predict_proba(test_x)  # compute once, outside the loop
for i in range(10):
    # one-vs-rest: class i against all other digits
    fpr, tpr, threshold = roc_curve((test_y == i).astype(int), y_pred[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label='Class %d (AUC = %0.2f)' % (i, roc_auc))
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curves for the 10 Classes')
plt.legend(loc='lower left')  # place legend at the lower left
plt.show()
Visualization:
2. PR curve
(1) Concept
The PR curve plots precision against recall. (Both the PR and ROC curves use TPR/recall; the difference is that the ROC curve pairs it with FPR, while the PR curve pairs it with precision, so both PR axes focus on the positive class. This makes the PR curve better suited to class-imbalanced problems, where the positive class is the main concern.)
The closer the curve is to the top-right corner, the better the model.
(2) Plotting the PR curve
1) Minimal example [only to illustrate the computation]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Toy data: since the goal is just to understand the computation, confidence
# scores and labels are given directly instead of using the trained model
confidence_scores = np.array([0.9, 0.46, 0.78, 0.37, 0.6, 0.4, 0.2, 0.16])
data_labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
# note: do NOT sort confidence_scores on its own -- that would break the pairing
# with data_labels; precision_recall_curve sorts by score internally
print(f'confidence scores: {confidence_scores}')

# precision, recall, thresholds
precision, recall, threshold = precision_recall_curve(data_labels, confidence_scores)
print('precision:', precision)
print('recall:', recall)
print('thresholds:', threshold)

plt.figure()
plt.title('PR Curve')
plt.xlabel('recall')
plt.ylabel('precision')
plt.grid()  # add grid lines
plt.plot(recall, precision)  # draw the PR curve
plt.show()
2) Visualization [the curve is jagged because the data are just a few hand-made points]:
3) PR curves for the handwritten digits
Code:
import numpy as np
import torchvision.transforms as transforms
from torchvision import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_curve, roc_curve, auc  # PR and ROC computations
import matplotlib.pyplot as plt

def load_data():
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])  # a one-element tuple still needs the trailing comma
    dataset_train = datasets.MNIST(root='../data/minist', train=True, download=True, transform=transform)
    dataset_test = datasets.MNIST(root='../data/minist', train=False, download=True, transform=transform)
    X_train = dataset_train.data.numpy()
    X_test = dataset_test.data.numpy()
    X_train = np.reshape(X_train, (60000, 784))  # flatten each 28x28 image into a 784-vector
    X_test = np.reshape(X_test, (10000, 784))
    Y_train = dataset_train.targets.numpy()
    Y_test = dataset_test.targets.numpy()
    return X_train, Y_train, X_test, Y_test

if __name__ == '__main__':
    K = 10
    train_x, train_y, test_x, test_y = load_data()
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(train_x, train_y)
    y_pred = knn.predict_proba(test_x)  # compute once, outside the loop
    plt.figure()
    # PR curve for each of the 10 digit classes
    for i in range(10):
        # one-vs-rest: precision, recall, thresholds for class i
        precision, recall, _ = precision_recall_curve((test_y == i).astype(int), y_pred[:, i])
        pr_auc = auc(recall, precision)
        plt.plot(recall, precision, label='Class %d (AUC = %0.2f)' % (i, pr_auc))
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('PR Curves for the 10 Classes')
    plt.legend(loc='lower left')  # place legend at the lower left
    plt.show()
The resulting PR curves are shown below: