How to Build, Tune, and Evaluate a K-Nearest Neighbors Classification Model

We use a telecom customer churn dataset, Orange_Telecom_Churn_Data.csv (stored in the current directory). We first read the dataset, perform some preprocessing, and then use a K-nearest neighbors model to predict whether a customer will churn based on their attributes.

Step 1:

  • Read the dataset into the variable data and view its first 5 rows.
  • Drop the "state", "area_code" and "phone_number" columns.
# Read the dataset into the variable data and view its first 5 rows
import pandas as pd

data = pd.read_csv("Orange_Telecom_Churn_Data.csv")

data.head(5)
  state  account_length  area_code phone_number intl_plan voice_mail_plan  number_vmail_messages  ...  total_intl_charge  number_customer_service_calls  churned
0    KS             128        415     382-4657        no             yes                     25  ...               2.70                              1    False
1    OH             107        415     371-7191        no             yes                     26  ...               3.70                              1    False
2    NJ             137        415     358-1921        no              no                      0  ...               3.29                              0    False
3    OH              84        408     375-9999       yes              no                      0  ...               1.78                              2    False
4    OK              75        415     330-6626       yes              no                      0  ...               2.73                              3    False

5 rows × 21 columns

# Drop the "state", "area_code" and "phone_number" columns
data.drop(columns = ["state", "area_code", "phone_number"], inplace = True)

data
      account_length intl_plan voice_mail_plan  number_vmail_messages  total_day_minutes  ...  total_intl_charge  number_customer_service_calls  churned
0                128        no             yes                     25              265.1  ...               2.70                              1    False
1                107        no             yes                     26              161.6  ...               3.70                              1    False
2                137        no              no                      0              243.4  ...               3.29                              0    False
3                 84       yes              no                      0              299.4  ...               1.78                              2    False
4                 75       yes              no                      0              166.7  ...               2.73                              3    False
...              ...       ...             ...                    ...                ...  ...                ...                            ...      ...
4995              50        no             yes                     40              235.7  ...               2.67                              2    False
4996             152        no              no                      0              184.2  ...               3.97                              3     True
4997              61        no              no                      0              140.6  ...               3.67                              1    False
4998             109        no              no                      0              188.8  ...               2.30                              0    False
4999              86        no             yes                     34              129.4  ...               2.51                              0    False

5000 rows × 18 columns

Step 2:

  • Some columns hold categorical values, namely 'intl_plan', 'voice_mail_plan' and 'churned'; they need to be converted to numeric data (an equivalent pandas-only conversion is sketched after the output below).
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

# Each of these columns is binary, so LabelBinarizer maps it to a single 0/1 column
for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    data[col] = lb.fit_transform(data[col])
data.head(5)
   account_length  intl_plan  voice_mail_plan  number_vmail_messages  total_day_minutes  ...  total_intl_charge  number_customer_service_calls  churned
0             128          0                1                     25              265.1  ...               2.70                              1        0
1             107          0                1                     26              161.6  ...               3.70                              1        0
2             137          0                0                      0              243.4  ...               3.29                              0        0
3              84          1                0                      0              299.4  ...               1.78                              2        0
4              75          1                0                      0              166.7  ...               2.73                              3        0
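
Since each of these three columns is binary, an equivalent conversion can be done with pandas alone. This is only an illustrative alternative to the LabelBinarizer loop above (run one or the other, not both); it assumes the columns still hold their raw values, 'yes'/'no' for the plan columns and True/False for churned.

# Alternative sketch (not part of the original solution): map the binary columns with pandas only
for col in ['intl_plan', 'voice_mail_plan']:
    data[col] = data[col].map({'no': 0, 'yes': 1})   # assumes raw 'yes'/'no' values
data['churned'] = data['churned'].astype(int)        # True/False -> 1/0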

Step 3:

  • Assign the "churned" column to y_data as the prediction target, and assign all remaining columns to X_data as the feature data.
  • Scale X_data with a scaling method (KNN is distance-based, so unscaled features with large ranges would otherwise dominate the distance computation).
# Build X_data (features) and y_data (target)
y_data = data["churned"]

X_data = data.drop(columns = ["churned"])
# Scale X_data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler = scaler.fit(X_data)

X_data = scaler.transform(X_data)

X_data
array([[ 0.69894149, -0.32324017,  1.66712012, ..., -0.58423577,
        -0.0955088 , -0.43667564],
       [ 0.16984882, -0.32324017,  1.66712012, ..., -0.58423577,
         1.24598231, -0.43667564],
       [ 0.92569549, -0.32324017, -0.5998368 , ...,  0.22991664,
         0.69597096, -1.20223603],
       ...,
       [-0.98911606, -0.32324017, -0.5998368 , ..., -0.17715957,
         1.20573758, -0.43667564],
       [ 0.2202386 , -0.32324017, -0.5998368 , ...,  0.63699285,
        -0.63210525, -1.20223603],
       [-0.35924384, -0.32324017,  1.66712012, ...,  4.70775494,
        -0.35039211, -1.20223603]])
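
As a quick sanity check (not in the original notebook), after StandardScaler each feature column should have mean approximately 0 and standard deviation approximately 1:

# Sanity check: standardized features have mean ~0 and std ~1
import numpy as np

print(np.round(X_data.mean(axis=0), 3))
print(np.round(X_data.std(axis=0), 3))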

Step 4:

  • Create a K-nearest neighbors model with k = 3 and fit it on X_data and y_data.
# Create a 3-NN model and train it
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)

knn.fit(X_data, y_data)
KNeighborsClassifier(n_neighbors=3)
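
To see what the fitted model works with, kneighbors returns the distances to, and indices of, each query point's 3 nearest training samples; the prediction is a majority vote over those neighbors' labels. This inspection is an added sketch, not part of the original exercise.

# Sketch (not in the original): inspect the 3 nearest neighbors of the first sample
distances, indices = knn.kneighbors(X_data[:1])

print(distances)                       # distances to the 3 closest training points
print(indices)                         # their row positions in X_data
print(y_data.iloc[indices[0]].values)  # their labels; the majority vote is the prediction
# Note: the closest neighbor is the sample itself (distance 0), because it is part of the training set.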

Step 5:

  • Use the K-nearest neighbors model trained in the previous step to predict on the same dataset, X_data, and measure the accuracy of those predictions (a held-out evaluation is sketched after the accuracy result).
# Predict and evaluate
y_pred = knn.predict(X_data)

from sklearn.metrics import accuracy_score

print("Accuracy:", float(accuracy_score(y_data, y_pred))*100, "%")
Accuracy: 93.96 %
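
Note that 93.96% is measured on the same data the model was fitted on, so it is an optimistic estimate. As a hedged sketch of how one might estimate generalization instead (the split below and its result are not part of the original exercise), using scikit-learn's train_test_split:

# Sketch (not in the original): evaluate on a held-out test set instead of the training data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.2, random_state=42, stratify=y_data)

knn_holdout = KNeighborsClassifier(n_neighbors=3)
knn_holdout.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, knn_holdout.predict(X_test)))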

Step 6:

  • Build another model with n_neighbors=3, but weight the K neighbors by distance when aggregating their votes. Again compute this model's prediction accuracy on X_data.
  • Build one more K-nearest neighbors model that keeps uniform weights but sets the exponent of the Minkowski distance to 1 (p=1), i.e. uses the Manhattan distance.
# n_neighbors=3, weights='distance'
knn = KNeighborsClassifier(n_neighbors = 3, weights="distance")

knn.fit(X_data, y_data)

y_pred1 = knn.predict(X_data)

from sklearn.metrics import accuracy_score

print("Accuracy:", float(accuracy_score(y_data, y_pred1))*100, "%")
Accuracy: 100.0 %
# n_neighbors=3, p=1
knn = KNeighborsClassifier(n_neighbors = 3, p = 1)

knn.fit(X_data, y_data)

y_pred2 = knn.predict(X_data)

from sklearn.metrics import accuracy_score

print("Accuracy:", float(accuracy_score(y_data, y_pred2))*100, "%")
Accuracy: 94.16 %

Step 7:

  • Vary k from 1 to 20 and train 20 different K-nearest neighbors models, using the (default) uniform weights. The Minkowski exponent p may be set to 1 or 2, as long as it is kept consistent. Store each model's accuracy together with its k value in a list or dictionary.
  • Plot accuracy against k. What do you observe at k = 1? Why?
K = list(range(1, 21))

accuracy = []

# Train one model per k and record its accuracy on X_data
for k in K:

    knn = KNeighborsClassifier(n_neighbors = k, p = 1)

    knn.fit(X_data, y_data)

    y_pred_k = knn.predict(X_data)

    accuracy.append(float(accuracy_score(y_data, y_pred_k)))
    
accuracy
[1.0,
 0.9252,
 0.9416,
 0.9178,
 0.9334,
 0.9152,
 0.9282,
 0.909,
 0.9172,
 0.905,
 0.9124,
 0.9024,
 0.9084,
 0.8988,
 0.9054,
 0.8972,
 0.902,
 0.894,
 0.8994,
 0.8906]
K_accuracy = dict(zip(K, accuracy))

print(K_accuracy)
{1: 1.0, 2: 0.9252, 3: 0.9416, 4: 0.9178, 5: 0.9334, 6: 0.9152, 7: 0.9282, 8: 0.909, 9: 0.9172, 10: 0.905, 11: 0.9124, 12: 0.9024, 13: 0.9084, 14: 0.8988, 15: 0.9054, 16: 0.8972, 17: 0.902, 18: 0.894, 19: 0.8994, 20: 0.8906}
import matplotlib.pyplot as plt

K_label = [str(i) for i in K]

plt.bar(K_label, accuracy)

plt.xlabel("K")

plt.ylabel("accuracy_score")

plt.show()

[Bar chart: training-set accuracy for k = 1 to 20]

Observation: at k = 1 the accuracy is exactly 1.0. Because the model is evaluated on the very data it was fitted on, each point's single nearest neighbor is the point itself (at distance 0), so every prediction trivially matches the true label. The same effect explains the 100% accuracy of the distance-weighted model in Step 6: the zero-distance self-match receives effectively infinite weight. As k grows, the vote is taken over a wider neighborhood and the training-set accuracy gradually declines. In other words, k = 1 only looks best because we are measuring on the training set; it tells us nothing about how well the model generalizes.
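
A more reliable way to choose k is to score each candidate model on data it was not fitted on, for example with cross-validation. The following is only an illustrative sketch and not part of the original exercise; it assumes scikit-learn's cross_val_score with 5 folds.

# Sketch (not in the original): pick k by 5-fold cross-validation instead of training accuracy
from sklearn.model_selection import cross_val_score

cv_accuracy = []
for k in K:
    knn = KNeighborsClassifier(n_neighbors=k, p=1)
    scores = cross_val_score(knn, X_data, y_data, cv=5, scoring="accuracy")
    cv_accuracy.append(scores.mean())

best_k = K[cv_accuracy.index(max(cv_accuracy))]
print("Best k by cross-validation:", best_k)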
