KNN Algorithm: Implementing a Binary Classification Task

还不秃顶的计科生

Last modified: 2024-08-23 16:17:11

Category: Machine Learning. Tags: Algorithms

First published: 2024-08-23 15:26:03

Article link: https://blog.csdn.net/weixin_74009895/article/details/141355680

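Basic concepts:

KNN stands for K nearest neighbors, i.e. the K-nearest-neighbors algorithm.

Core idea: "Stay near vermilion and you turn red; stay near ink and you turn black." A sample is assumed to resemble the samples closest to it.

(1) Classification (are you black or white?)

Basic principle:

① Find the K training samples at the smallest distance

② Let those K neighbors vote, and the majority wins

③ Plain voting versus distance-weighted voting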
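The difference between plain and weighted voting can be sketched as follows. This is only an illustration: the function names, the 1/distance weighting, and the sample neighbors are assumptions made for the example, not something taken from a particular library.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the K nearest samples."""
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]

def weighted_vote(neighbors, eps=1e-9):
    """Each neighbor votes with weight 1/distance, so closer samples count more."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)

# Hypothetical neighbors: one very close "black" sample, two farther "red" ones
neighbors = [(0.1, "black"), (2.0, "red"), (2.5, "red")]

print(majority_vote(neighbors))  # red: 2 votes against 1
print(weighted_vote(neighbors))  # black: weight of about 10 against 0.5 + 0.4
```

With plain voting the two red neighbors win; with weighted voting the single very close black neighbor outweighs them.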

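(2) Regression (estimating how much money is in your pocket from the wealth of the people around you)

Mean and weighted-mean methods: the prediction is the average of the K neighbors' values, optionally weighted so that closer neighbors count more.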
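A minimal sketch of the two regression rules, assuming the K nearest neighbors have already been found and that the weights are 1/distance (one common choice, used here only for illustration). The numbers are made up.

```python
def knn_mean(neighbors):
    """neighbors: list of (distance, value) pairs; plain average of the K values."""
    return sum(value for _, value in neighbors) / len(neighbors)

def knn_weighted_mean(neighbors, eps=1e-9):
    """Average of the K values, each weighted by 1/distance."""
    weights = [1.0 / (distance + eps) for distance, _ in neighbors]
    total = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return total / sum(weights)

# Hypothetical neighbors: (distance to you, money in their pocket)
neighbors = [(1.0, 100.0), (2.0, 200.0), (4.0, 400.0)]

print(knn_mean(neighbors))           # plain mean: 700 / 3
print(knn_weighted_mean(neighbors))  # pulled toward the closest neighbor's 100
```

The weighted mean is lower than the plain mean here because the closest neighbor has the smallest value.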

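(3) Choosing K

① If K is too small, any single neighbor can have a huge influence on the result, which leads to overfitting (trusting individual data points too much).

For example, suppose we want to classify the blue square in the figure. It clearly belongs to the red class, but with K=1 it is wrongly classified as black, because its single nearest neighbor happens to be a noisy black point.

② If K is too large, in the extreme case covering every sample, the neighbors no longer matter: the prediction is simply whichever class is most common in the whole dataset. This leads to underfitting (the result barely depends on the features of the sample).

③ K should therefore be neither too large nor too small. It is a hyperparameter, and it can be tuned by comparing candidate values on validation data.

Also, for binary classification (a vote between two classes), prefer an odd K so that the two classes can never receive the same number of votes.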
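One way to tune K is to score several candidates with leave-one-out validation and keep the best one. The sketch below is a pure-Python illustration on made-up toy data (two clusters plus one noisy point); the helper names are hypothetical, and on a real dataset a library routine for cross-validation would normally be used instead.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label); majority label of the k nearest samples."""
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each sample from all the other samples."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(data)

# Toy data: a red cluster, a black cluster, and one noisy black point among the reds
data = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.8, 1.1), "red"), ((1.1, 1.3), "red"),
    ((1.0, 1.05), "black"),  # noise
    ((4.0, 4.0), "black"), ((4.2, 3.9), "black"), ((3.8, 4.1), "black"), ((4.1, 4.3), "black"),
]

# Odd candidates only, so a two-class vote cannot tie
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

On this toy data K=1 is hurt by the noisy point and K=7 is dominated by the larger class, while the intermediate values score best.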

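(4) Three distance formulas

① Euclidean distance

This is the straight-line distance between two points from middle- and high-school math, d(x, y) = sqrt(Σ_i (x_i - y_i)^2). It is the red line in the figure.

② Manhattan distance

This is the sum of the absolute differences of the coordinates, d(x, y) = Σ_i |x_i - y_i|.

It corresponds to the two black lines in the figure.

③ Minkowski distance

The general form is d(x, y) = (Σ_i |x_i - y_i|^p)^(1/p).

When p=1, the Minkowski distance is the Manhattan distance;

when p=2, the Minkowski distance is the Euclidean distance.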
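The relationship between the three distances can be checked with a few lines of Python. The function name `minkowski_distance` is chosen for this example only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points a and b."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

print(minkowski_distance(a, b, 1))  # 7.0, Manhattan: 3 + 4
print(minkowski_distance(a, b, 2))  # 5.0, Euclidean: sqrt(9 + 16)
```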

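(5) Formula parameters (classification)

The concrete computation steps:

1. Prepare the data: start from a set of labeled training samples, each with a set of features and a class label. These samples are what the classifier is built on.

2. Compute distances: for a sample to be classified, compute its distance to every training sample. Common distance metrics include the Euclidean, Manhattan, and Minkowski distances.

3. Select the nearest neighbors: based on the computed distances, take the k training samples closest to the sample being classified. k is a parameter chosen in advance.

4. Vote: decide the class of the sample by a vote over the neighbors' class labels. The vote can be simple (every neighbor has equal weight) or weighted (weights depend on distance).

5. Output the result: assign the sample to the class with the most votes; this is the final classification.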
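The five steps above can be sketched end to end in plain Python. This is an illustrative implementation under stated assumptions, not the code used later in this post: the function name `knn_classify`, the Euclidean metric, the 1/distance weights, and the training values are all made up for the example.

```python
import math
from collections import defaultdict

def knn_classify(train_X, train_y, query, k=3, weighted=False):
    # Step 2: distance from the query to every training sample (Euclidean here)
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k nearest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: simple vote (weight 1) or distance-weighted vote (weight 1/d)
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    # Step 5: the label with the most votes is the prediction
    return max(votes, key=votes.get)

# Step 1: labeled training data (made-up values; 1 = fraud, 0 = normal)
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_classify(train_X, train_y, (1.8, 1.6), k=3))                 # 0
print(knn_classify(train_X, train_y, (6.2, 6.4), k=3, weighted=True))  # 1
```

Note that nothing is learned ahead of time: all the work happens when a query arrives, which is why prediction cost grows with the size of the training set.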

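The advantages of the k-NN algorithm include being simple to use, needing no explicit training phase (the model just stores the training data), and being fairly tolerant of individual outliers as long as K is not too small.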

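(6) Implementation

① Example

Here I take the dataset "电信银行卡诈骗的数据分析" (an analysis of telecom bank-card fraud data) as the example, and evaluate the model with precision, accuracy, recall, and F1 score.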
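For reference, the four metrics can be computed by hand from the counts of true/false positives and negatives. The sketch below assumes labels are 0/1 with 1 as the positive (fraud) class; the function name and the sample labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # fraud correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # normal correctly passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # normal wrongly flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # fraud missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 3 fraud cases and 5 normal ones
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.75; precision, recall, and F1 all 2/3
```

On imbalanced fraud data, accuracy alone can look good even when many fraud cases are missed, which is why precision and recall are reported alongside it.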

② Code

1. Training the model

The trained model is saved to a file named knn_classifier_model.joblib.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import joblib  # used to save and load the model
import matplotlib.pyplot as plt

# Load the dataset
file_path = '电信银行卡诈骗的数据分析.csv'
data = pd.read_csv(file_path)

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])  # Drop target column 'Fraud'
y = data['Fraud']  # Target column

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a K-Nearest Neighbors classifier with k=5 (default)
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Save the trained model to a file using joblib
model_filename = 'knn_classifier_model.joblib'
joblib.dump(knn_classifier, model_filename)
print(f"Model saved to {model_filename}")

# After training, you can load the model from the saved file
loaded_model = joblib.load(model_filename)
print("Model loaded for testing")

# Predict on validation and test sets using the loaded model
y_val_pred = loaded_model.predict(X_val)
y_test_pred = loaded_model.predict(X_test)

# Define a function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Calculate metrics for validation and test sets
val_metrics = calculate_metrics(y_val, y_val_pred)
test_metrics = calculate_metrics(y_test, y_test_pred)

# Create a DataFrame to display the results
metrics_df = pd.DataFrame({
    'Dataset': ['Validation', 'Test'],
    'Accuracy': [val_metrics[0], test_metrics[0]],
    'Precision': [val_metrics[1], test_metrics[1]],
    'Recall': [val_metrics[2], test_metrics[2]],
    'F1 Score': [val_metrics[3], test_metrics[3]],
    'Specificity': [val_metrics[4], test_metrics[4]]
})

# Print the DataFrame to show the metrics
print(metrics_df)

# Plot the metrics for better visualization
metrics_df.set_index('Dataset').plot(kind='bar', figsize=(10, 6))
plt.title("KNN Model Metrics on Validation and Test Sets")
plt.ylabel("Score")
plt.xticks(rotation=0)
plt.legend(loc='best')
plt.show()

2. Testing the model

import numpy as np
import joblib  # used to load the model

# Load the trained model
model_filename = 'knn_classifier_model.joblib'
loaded_model = joblib.load(model_filename)

# Assume the new data has the same format as the training data, with 7 features
# Feature order: Distance1, Distance2, Ratio, Repeat, Card, Pin, Online
new_data = np.array([[57.877857, 0.31114, 1.94594, 1, 1, 1, 0],  # sample 1
                     [44.199036, 0.566463, 2.222763, 1, 1, 1, 0]])  # sample 2

# Predict on the new data with the loaded model
new_predictions = loaded_model.predict(new_data)

# Print the predictions
print("Predicted classes for new data:", new_predictions)