We use a telecom customer-churn dataset, Orange_Telecom_Churn_Data.csv (stored in the current directory). We first read in the dataset and do some preprocessing, then use a K-nearest-neighbors model to predict whether a customer will churn based on their attributes.
Step 1:
- Read the dataset into the variable data and inspect its first 5 rows.
- Drop the "state", "area_code", and "phone_number" columns.
# Read the dataset into data and inspect its first 5 rows
import pandas as pd
data = pd.read_csv("Orange_Telecom_Churn_Data.csv")
data.head(5)
 | state | account_length | area_code | phone_number | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | ... | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
5 rows × 21 columns
# Drop the "state", "area_code", and "phone_number" columns
data.drop(columns=["state", "area_code", "phone_number"], inplace=True)
data
 | account_length | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
1 | 107 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
2 | 137 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
3 | 84 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
4 | 75 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | 50 | no | yes | 40 | 235.7 | 127 | 40.07 | 223.0 | 126 | 18.96 | 297.5 | 116 | 13.39 | 9.9 | 5 | 2.67 | 2 | False |
4996 | 152 | no | no | 0 | 184.2 | 90 | 31.31 | 256.8 | 73 | 21.83 | 213.6 | 113 | 9.61 | 14.7 | 2 | 3.97 | 3 | True |
4997 | 61 | no | no | 0 | 140.6 | 89 | 23.90 | 172.8 | 128 | 14.69 | 212.4 | 97 | 9.56 | 13.6 | 4 | 3.67 | 1 | False |
4998 | 109 | no | no | 0 | 188.8 | 67 | 32.10 | 171.7 | 92 | 14.59 | 224.4 | 89 | 10.10 | 8.5 | 6 | 2.30 | 0 | False |
4999 | 86 | no | yes | 34 | 129.4 | 102 | 22.00 | 267.1 | 104 | 22.70 | 154.8 | 100 | 6.97 | 9.3 | 16 | 2.51 | 0 | False |
5000 rows × 18 columns
Step 2:
- Some columns hold categorical data, e.g. 'intl_plan', 'voice_mail_plan', and 'churned'; these need to be converted to numeric values.
# Binarize the yes/no and True/False columns into 0/1
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    data[col] = lb.fit_transform(data[col])
data.head(5)
 | account_length | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128 | 0 | 1 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 |
1 | 107 | 0 | 1 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 |
2 | 137 | 0 | 0 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 |
3 | 84 | 1 | 0 | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
4 | 75 | 1 | 0 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |
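As a side note, since these are simple binary columns, the same encoding can be done without sklearn, using pandas' map and astype so the 0/1 assignment is explicit rather than implied by LabelBinarizer's class ordering. A minimal sketch on a toy DataFrame (illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the three categorical columns of the churn data
toy = pd.DataFrame({
    "intl_plan": ["no", "yes", "no"],
    "voice_mail_plan": ["yes", "no", "no"],
    "churned": [False, True, False],
})

# Map yes/no strings to 0/1 explicitly
for col in ["intl_plan", "voice_mail_plan"]:
    toy[col] = toy[col].map({"no": 0, "yes": 1})
# Booleans convert directly with astype
toy["churned"] = toy["churned"].astype(int)
print(toy)
```

This makes the encoding self-documenting: "yes" is always 1, regardless of which labels happen to appear in a given column.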
Step 3:
- Use the "churned" column as the prediction target, assigned to y_data; assign all remaining columns to X_data as the feature data.
- Scale X_data with a scaling method.
# Build X_data and y_data
y_data = data["churned"]
X_data = data.drop(columns=["churned"])
# Scale X_data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_data = scaler.fit_transform(X_data)
X_data
array([[ 0.69894149, -0.32324017, 1.66712012, ..., -0.58423577,
-0.0955088 , -0.43667564],
[ 0.16984882, -0.32324017, 1.66712012, ..., -0.58423577,
1.24598231, -0.43667564],
[ 0.92569549, -0.32324017, -0.5998368 , ..., 0.22991664,
0.69597096, -1.20223603],
...,
[-0.98911606, -0.32324017, -0.5998368 , ..., -0.17715957,
1.20573758, -0.43667564],
[ 0.2202386 , -0.32324017, -0.5998368 , ..., 0.63699285,
-0.63210525, -1.20223603],
[-0.35924384, -0.32324017, 1.66712012, ..., 4.70775494,
-0.35039211, -1.20223603]])
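As a sanity check on the scaling step: StandardScaler should leave every column with mean approximately 0 and standard deviation approximately 1. A small sketch with random stand-in data (not the real X_data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random data standing in for X_data: arbitrary mean and spread
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

X_scaled = StandardScaler().fit_transform(X)

# Each column is now centered and has unit standard deviation
print(X_scaled.mean(axis=0))  # ~0 in every column
print(X_scaled.std(axis=0))   # ~1 in every column
```

Scaling matters for KNN in particular, because distances would otherwise be dominated by the columns with the largest raw magnitudes (e.g. total_day_minutes vs. the 0/1 plan flags).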
Step 4:
- Create a K-nearest-neighbors model with k=3 and fit it to X_data and y_data.
# Create and train a 3-NN model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_data, y_data)
KNeighborsClassifier(n_neighbors=3)
Step 5:
- Use the trained K-nearest-neighbors model to predict on the same dataset, X_data, and evaluate the accuracy of the predictions.
# Predict and evaluate
y_pred = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred))*100, "%")
Accuracy: 93.96 %
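One caveat worth flagging: Step 5 evaluates the model on the very data it was trained on, which gives an optimistic accuracy estimate. The usual practice is to hold out a test set. A sketch of that pattern with synthetic stand-in data (make_classification here is illustrative, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for the scaled churn features
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=0)

# Hold out 30% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, knn.predict(X_tr))
test_acc = accuracy_score(y_te, knn.predict(X_te))
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The gap between the two numbers is a rough measure of how much the training-set score overstates real performance.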
Step 6:
- Build another model, also with n_neighbors=3, but weight the K neighbors' contributions to the prediction by distance. Compute this model's accuracy on X_data as well.
- Build one more K-nearest-neighbors model: use uniform weights, but set the exponent of the Minkowski distance to 1 (p=1), i.e. use the Manhattan distance.
# n_neighbors=3, weights='distance'
knn = KNeighborsClassifier(n_neighbors = 3, weights="distance")
knn.fit(X_data, y_data)
y_pred1 = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred1))*100, "%")
Accuracy: 100.0 %
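The 100% figure above is expected rather than impressive: when predicting a point that is itself in the training set, that point is its own nearest neighbor at distance 0, and with weights="distance" scikit-learn gives zero-distance neighbors all the weight, so every training label is reproduced exactly. A small demonstration on pure noise, where no real pattern exists to learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random features and random labels: there is nothing to generalize
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)

knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Predicting the training set itself: each point's zero-distance
# "neighbor" (itself) dominates, so every label comes back unchanged
print((knn.predict(X) == y).mean())
```

So the 100% training accuracy here tells us nothing about how the model would do on new customers.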
# n_neighbors=3, p=1
knn = KNeighborsClassifier(n_neighbors = 3, p = 1)
knn.fit(X_data, y_data)
y_pred2 = knn.predict(X_data)
from sklearn.metrics import accuracy_score
print("Accuracy:", float(accuracy_score(y_data, y_pred2))*100, "%")
Accuracy: 94.16 %
Step 7:
- Vary K from 1 to 20 to train 20 different K-nearest-neighbors models, using uniform weights (the default). The Minkowski distance exponent (p) can be set to 1 or 2 (just keep it consistent). Store each model's accuracy together with its k value in a list or dictionary.
- Plot accuracy against k. What do you observe at k=1? Why?
K = list(range(1, 21))
accuracy = []
from sklearn.metrics import accuracy_score
for i in K:
    knn = KNeighborsClassifier(n_neighbors=i, p=1)
    knn.fit(X_data, y_data)
    y_pred = knn.predict(X_data)
    accuracy.append(float(accuracy_score(y_data, y_pred)))
accuracy
[1.0,
0.9252,
0.9416,
0.9178,
0.9334,
0.9152,
0.9282,
0.909,
0.9172,
0.905,
0.9124,
0.9024,
0.9084,
0.8988,
0.9054,
0.8972,
0.902,
0.894,
0.8994,
0.8906]
K_accuracy = dict(zip(K, accuracy))
print(K_accuracy)
{1: 1.0, 2: 0.9252, 3: 0.9416, 4: 0.9178, 5: 0.9334, 6: 0.9152, 7: 0.9282, 8: 0.909, 9: 0.9172, 10: 0.905, 11: 0.9124, 12: 0.9024, 13: 0.9084, 14: 0.8988, 15: 0.9054, 16: 0.8972, 17: 0.902, 18: 0.894, 19: 0.8994, 20: 0.8906}
import matplotlib.pyplot as plt
K_label = [str(i) for i in K]
plt.bar(K_label, accuracy)
plt.xlabel("K")
plt.ylabel("accuracy_score")
plt.show()
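At k=1 the plot shows accuracy 1.0: each training point is its own nearest neighbor, so scoring the models on their own training data is guaranteed to be perfect at k=1 and says nothing about generalization. A common fix is to score each k with cross-validation instead. A sketch with synthetic stand-in data (make_classification, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for X_data / y_data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Score each k by 5-fold cross-validation rather than training accuracy
cv_scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k, p=1)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "cv accuracy:", round(cv_scores[best_k], 4))
```

Under cross-validation, k=1 typically no longer wins, because each fold is scored on points the model never saw during fitting.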