We use a telecom customer-churn dataset, Orange_Telecom_Churn_Data.csv (stored in the current directory). We first read in the dataset and do some preprocessing, then use a K-nearest-neighbors model to predict whether a customer will churn based on their attributes.
Step 1:
- Read the dataset into the variable data and inspect its first 5 rows.
- Drop the "state", "area_code", and "phone_number" columns.
# Read the dataset into data and inspect its first 5 rows
import pandas as pd
data = pd.read_csv("Orange_Telecom_Churn_Data.csv")
data.head(5)
 | state | account_length | area_code | phone_number | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | ... | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
5 rows × 21 columns
# Drop the "state", "area_code", and "phone_number" columns
data.drop(columns=["state", "area_code", "phone_number"], inplace=True)
data
 | account_length | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
1 | 107 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
2 | 137 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
3 | 84 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
4 | 75 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | 50 | no | yes | 40 | 235.7 | 127 | 40.07 | 223.0 | 126 | 18.96 | 297.5 | 116 | 13.39 | 9.9 | 5 | 2.67 | 2 | False |
4996 | 152 | no | no | 0 | 184.2 | 90 | 31.31 | 256.8 | 73 | 21.83 | 213.6 | 113 | 9.61 | 14.7 | 2 | 3.97 | 3 | True |
4997 | 61 | no | no | 0 | 140.6 | 89 | 23.90 | 172.8 | 128 | 14.69 | 212.4 | 97 | 9.56 | 13.6 | 4 | 3.67 | 1 | False |
4998 | 109 | no | no | 0 | 188.8 | 67 | 32.10 | 171.7 | 92 | 14.59 | 224.4 | 89 | 10.10 | 8.5 | 6 | 2.30 | 0 | False |
4999 | 86 | no | yes | 34 | 129.4 | 102 | 22.00 | 267.1 | 104 | 22.70 | 154.8 | 100 | 6.97 | 9.3 | 16 | 2.51 | 0 | False |
5000 rows × 18 columns
Step 2:
- Some columns hold categorical data, e.g. 'intl_plan', 'voice_mail_plan', and 'churned'; these need to be converted to numeric values.
# Binarize the yes/no and True/False columns into 0/1
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    data[col] = lb.fit_transform(data[col])
data.head(5)
 | account_length | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128 | 0 | 1 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 |
1 | 107 | 0 | 1 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 |
2 | 137 | 0 | 0 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 |
3 | 84 | 1 | 0 | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
4 | 75 | 1 | 0 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |
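As a side note, since these are simple binary columns, the same encoding can be done without sklearn, using pandas' map and astype so the 0/1 assignment is explicit rather than implied by LabelBinarizer's class ordering. A minimal sketch on a toy DataFrame (illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the three categorical columns of the churn data
toy = pd.DataFrame({
    "intl_plan": ["no", "yes", "no"],
    "voice_mail_plan": ["yes", "no", "no"],
    "churned": [False, True, False],
})

# Map yes/no strings to 0/1 explicitly
for col in ["intl_plan", "voice_mail_plan"]:
    toy[col] = toy[col].map({"no": 0, "yes": 1})
# Booleans convert directly with astype
toy["churned"] = toy["churned"].astype(int)
print(toy)
```

This makes the encoding self-documenting: "yes" is always 1, regardless of which labels happen to appear in a given column.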
Step 3:
- Use the "churned" column as the prediction target, assigned to y_data; assign all remaining columns to X_data as the feature data.
- Scale X_data with a scaling method.
# Build X_data and y_data
y_data = data["churned"]
X_data = data.drop(columns=["churned"])
# Scale X_data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_data = scaler.fit_transform(X_data)
X_data
array([[ 0.69894149, -0.32324017, 1.66712012, ..., -0.58423577,
-0.0955088 , -0.43667564],
[ 0.16984882, -0.32324017, 1.66712012, ..., -0.58423577,
1.24598231, -0.43667564],
[ 0.92569549, -0.32324017, -0.5998368 , ..., 0.22991664,
0.69597096, -1.20223603],
...,
[-0.98911606, -0.32324017, -0.5998368 , ..., -0.17715957,
1.20573758, -0.43667564],
[ 0.2202386 , -0.32324017, -0.5998368 , ..., 0.63699285,
-0.63210525, -1.20223603],
[-0.35924384, -0.32324017, 1.66712012, ..., 4.70775494,
-0.35039211, -1.20223603]])
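As a sanity check on the scaling step: StandardScaler should leave every column with mean approximately 0 and standard deviation approximately 1. A small sketch with random stand-in data (not the real X_data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random data standing in for X_data: arbitrary mean and spread
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

X_scaled = StandardScaler().fit_transform(X)

# Each column is now centered and has unit standard deviation
print(X_scaled.mean(axis=0))  # ~0 in every column
print(X_scaled.std(axis=0))   # ~1 in every column
```

Scaling matters for KNN in particular, because distances would otherwise be dominated by the columns with the largest raw magnitudes (e.g. total_day_minutes vs. the 0/1 plan flags).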
Step 4:
- Create a K-nearest-neighbors model with k=3 and fit it to X_data and y_data.
# Create and train a 3-NN model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_data, y_data)
KNeighborsClassifier(n_neighbors=3)
Step 5:
- Use the trained K-nearest-neighbors model to predict on the same dataset, X_data, and evaluate the accuracy of the predictions.
# Predict and evaluate
y_pred = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred))*100, "%")
Accuracy: 93.96 %
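One caveat worth flagging: Step 5 evaluates the model on the very data it was trained on, which gives an optimistic accuracy estimate. The usual practice is to hold out a test set. A sketch of that pattern with synthetic stand-in data (make_classification here is illustrative, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for the scaled churn features
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=0)

# Hold out 30% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, knn.predict(X_tr))
test_acc = accuracy_score(y_te, knn.predict(X_te))
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The gap between the two numbers is a rough measure of how much the training-set score overstates real performance.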
Step 6:
- Build another model, also with n_neighbors=3, but weight the K neighbors' contributions to the prediction by distance. Compute this model's accuracy on X_data as well.
- Build one more K-nearest-neighbors model: use uniform weights, but set the exponent of the Minkowski distance to 1 (p=1), i.e. use the Manhattan distance.
# n_neighbors=3, weights='distance'
knn = KNeighborsClassifier(n_neighbors = 3, weights="distance")
knn.fit(X_data, y_data)
y_pred1 = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred1))*100, "%")
Accuracy: 100.0 %
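The 100% figure above is expected rather than impressive: when predicting a point that is itself in the training set, that point is its own nearest neighbor at distance 0, and with weights="distance" scikit-learn gives zero-distance neighbors all the weight, so every training label is reproduced exactly. A small demonstration on pure noise, where no real pattern exists to learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random features and random labels: there is nothing to generalize
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)

knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Predicting the training set itself: each point's zero-distance
# "neighbor" (itself) dominates, so every label comes back unchanged
print((knn.predict(X) == y).mean())
```

So the 100% training accuracy here tells us nothing about how the model would do on new customers.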
# n_neighbors=3, p=1
knn = KNeighborsClassifier(n_neighbors = 3, p = 1)
knn.fit(X_data, y_data)
y_pred2 = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred2))*100, "%")
Accuracy: 94.16 %
Step 7:
- Vary K from 1 to 20 to train 20 different K-nearest-neighbors models, using uniform weights (the default). The Minkowski distance exponent (p) can be set to 1 or 2 (just keep it consistent). Store each model's accuracy together with its k value in a list or dictionary.
- Plot accuracy against k. What do you observe at k=1? Why?
K = list(range(1, 21))
accuracy = []
from sklearn.metrics import accuracy_score
for i in K:
    knn = KNeighborsClassifier(n_neighbors=i, p=1)
    knn.fit(X_data, y_data)
    y_pred = knn.predict(X_data)
    accuracy.append(float(accuracy_score(y_data, y_pred)))
accuracy
[1.0,
0.9252,
0.9416,
0.9178,
0.9334,
0.9152,
0.9282,
0.909,
0.9172,
0.905,
0.9124,
0.9024,
0.9084,
0.8988,
0.9054,
0.8972,
0.902,
0.894,
0.8994,
0.8906]
K_accuracy = dict(zip(K, accuracy))
print(K_accuracy)
{1: 1.0, 2: 0.9252, 3: 0.9416, 4: 0.9178, 5: 0.9334, 6: 0.9152, 7: 0.9282, 8: 0.909, 9: 0.9172, 10: 0.905, 11: 0.9124, 12: 0.9024, 13: 0.9084, 14: 0.8988, 15: 0.9054, 16: 0.8972, 17: 0.902, 18: 0.894, 19: 0.8994, 20: 0.8906}
import matplotlib.pyplot as plt
K_label = [str(i) for i in K]
plt.bar(K_label, accuracy)
plt.xlabel("K")
plt.ylabel("accuracy_score")
plt.show()
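At k=1 the plot shows accuracy 1.0: each training point is its own nearest neighbor, so scoring the models on their own training data is guaranteed to be perfect at k=1 and says nothing about generalization. A common fix is to score each k with cross-validation instead. A sketch with synthetic stand-in data (make_classification, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for X_data / y_data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Score each k by 5-fold cross-validation rather than training accuracy
cv_scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k, p=1)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "cv accuracy:", round(cv_scores[best_k], 4))
```

Under cross-validation, k=1 typically no longer wins, because each fold is scored on points the model never saw during fitting.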