Kaggle数据集之电信客户流失数据分析(三)之决策树分类

一、导入数据

import pandas as pd
df=pd.read_csv(r"D:\PycharmProjects\ku_pandas\WA_Fn-UseC_-Telco-Customer-Churn.csv")
df
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No

7043 rows × 21 columns

本数据集描述了电信用户是否流失以及其相关信息,共包含7043条数据,共21个字段,分别介绍如下:

  • customerID : 用户ID
  • gender:性别(Female & Male)
  • SeniorCitizen :老年用户(1表示是,0表示不是)
  • Partner :伴侣用户(Yes or No)
  • Dependents :亲属用户(Yes or No)
  • tenure : 在网时长(0-72月)
  • PhoneService : 是否开通电话服务业务(Yes or No)
  • MultipleLines: 是否开通了多线业务(Yes 、No or No phoneservice 三种)
  • InternetService:是否开通互联网服务(No, DSL数字网络,fiber optic光纤网络 三种)
  • OnlineSecurity:是否开通网络安全服务(Yes,No,No internetserive 三种)
  • OnlineBackup:是否开通在线备份业务(Yes,No,No internetserive 三种)
  • DeviceProtection:是否开通了设备保护业务(Yes,No,No internetserive 三种)
  • TechSupport:是否开通了技术支持服务(Yes,No,No internetserive 三种)
  • StreamingTV:是否开通网络电视(Yes,No,No internetserive 三种)
  • StreamingMovies:是否开通网络电影(Yes,No,No internetserive 三种)
  • Contract:签订合同方式(按月,一年,两年)
  • PaperlessBilling:是否开通电子账单(Yes or No)
  • PaymentMethod:付款方式(bank transfer,credit card,electronic check,mailed check)
  • MonthlyCharges:月费用
  • TotalCharges:总费用
  • Churn:该用户是否流失(Yes or No)

二、数据描述分析

1. 查看数据

df.shape #显示数据的格式
(7043, 21)
df.dtypes #输出每一列对应的数据类型
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
df.isnull().sum().values.sum() #查找缺失值
0
df.nunique() #查看每一列有几个不同值
customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64
# 查看表格某列中有多少个不同值,并计算每个不同值在该列中有多少重复值
df.Churn.value_counts()  #value_counts()是Series拥有的方法,一般在DataFrame中使用时,需要指定对哪一列或行使用
No     5174
Yes    1869
Name: Churn, dtype: int64

说明一共有1869个流失客户,5174个非流失客户

2. 数据清洗

(1). 简化属性值

  • 将InternetService中的DSL数字网络,fiber optic光纤网络替换为Yes
  • 将MultipleLines中的No phoneservice替换成No
  • 将TotalCharges转换为数字型
# 将InternetService中的DSL数字网络,fiber optic光纤网络替换为Yes
# 将MultipleLines中的No phoneservice替换成No
replace_list=['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
for i in replace_list:
    df[i]=df[i].str.replace('No internet service','No')
df['InternetService']=df['InternetService'].str.replace('Fiber optic','Yes')
df['InternetService']=df['InternetService'].str.replace('DSL','Yes')
df['MultipleLines']=df['MultipleLines'].str.replace('No phone service','No')
# 将TotalCharges转换为数字型
df.TotalCharges=pd.to_numeric(df.TotalCharges,errors="coerce") #.to_numeric()将参数转换为数字类型,其中coerce表示无效的解析将设置为NaN
df.TotalCharges.dtypes
dtype('float64')

3. 数据可视化

具体见

三、用决策树分类

根据(二)中的可视化结果,有11个特征与客户流失率的高低相关,分别是

  • SeniorCitizen :是否老年用户
  • Partner :是否伴侣用户
  • Dependents :是否亲属用户
  • tenure: 在网时长
  • InternetService:是否开通互联网服务
  • OnlineSecurity:是否开通网络安全服务
  • TechSupport:是否开通了技术支持服务
  • Contract:签订合同方式 (按月,一年,两年)
  • PaperlessBilling:是否开通电子账单(Yes or No)
  • PaymentMethod:付款方式(bank transfer,credit card,electronic check,mailed check)
  • MonthlyCharges:月费用

接下来通过样本数据训练一个决策树模型,使模型能够根据输入特征预测客户是否为潜在的流失对象。

1. 特征工程

df
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No Yes No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No Yes Yes ... Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No Yes Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No Yes Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Yes No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes Yes Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Yes No ... Yes No Yes Yes One year Yes
  • 7
    点赞
  • 101
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值