Building a Composite Estimator in sklearn (Telco Customer Churn Model)

0. Experiment Environment and Objectives

1. Environment

  • Win10
  • anaconda3
  • JupyterLab

2. Objectives

Master the basic use of Pipeline and ColumnTransformer in sklearn.

I. Data Description

File name: WA_Fn-UseC_-Telco-Customer-Churn.csv

Fields:

  • customerID: customer ID
  • gender: whether the customer is a male or a female
  • SeniorCitizen: whether the customer is a senior citizen or not (1, 0)
  • Partner: whether the customer has a partner or not (Yes, No)
  • Dependents: whether the customer has dependents or not (Yes, No)
  • tenure: number of months the customer has stayed with the company
  • PhoneService: whether the customer has a phone service or not (Yes, No)
  • MultipleLines: whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService: the customer's internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup: whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection: whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport: whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV: whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies: whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract: the contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling: whether the customer has paperless billing or not (Yes, No)
  • PaymentMethod: the customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • MonthlyCharges: the amount charged to the customer monthly
  • TotalCharges: the total amount charged to the customer
  • Churn: whether the customer churned or not (Yes or No)

How to obtain the data:

  • Download from Kaggle
  • Download from the Baidu Netdisk link at the end of this post

II. Model Construction

1. Importing the Data

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)  # show all columns instead of truncating the display
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', na_values={'tenure':0, 'TotalCharges':' '})

Common ways to get a first look at the data (a combined sketch of the read_csv parameters follows this list):

  • head/tail, mainly to check:
    • whether the file encoding is correct (read_csv parameter: encoding)
    • whether the CSV field separator is set correctly (read_csv parameter: sep)
    • whether there are comment lines, footnotes, etc. (read_csv parameters: comment, skiprows, skipfooter)
    • whether the data contains a header row with field names (read_csv parameters: header, names)
    • whether the data contains a row index column (read_csv parameter: index_col)
  • info
    • check whether any column has missing values and whether each column's dtype is reasonable
  • describe
    • check the basic statistics of the numeric columns (max, min, mean, median, etc.) to spot noisy values or skewed distributions
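For reference, the read_csv parameters mentioned above can be combined as in the sketch below; the parameter values shown (utf-8, ',', '#', ...) are illustrative assumptions, not what this particular file requires.

import pandas as pd

# Illustrative use of the read_csv parameters listed above (values are assumptions)
df_demo = pd.read_csv(
    'WA_Fn-UseC_-Telco-Customer-Churn.csv',
    encoding='utf-8',   # file encoding
    sep=',',            # field separator
    comment='#',        # treat lines starting with '#' as comments
    skiprows=0,         # number of leading rows to skip
    header=0,           # row that holds the column names
    index_col=None      # column to use as the row index (None = default RangeIndex)
)
df_demo.head()       # first rows
df_demo.describe()   # basic statistics of the numeric columns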

Reading the file directly and applying these methods reveals the following problems:

df1 = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

Issue 1: the TotalCharges (total charges) column has dtype object; it should normally be float.
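One way to see which values are responsible for the object dtype is to coerce the column to numeric and look at the rows that fail to parse (a sketch; pd.to_numeric with errors='coerce' turns unparseable entries into NaN):

# Rows whose TotalCharges cannot be parsed as a number
bad = pd.to_numeric(df1['TotalCharges'], errors='coerce').isna()
df1.loc[bad, ['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']]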

df1.describe()
	SeniorCitizen	tenure	MonthlyCharges
count	7043.000000	7043.000000	7043.000000
mean	0.162147	32.371149	64.761692
std	0.368612	24.559481	30.090047
min	0.000000	0.000000	18.250000
25%	0.000000	9.000000	35.500000
50%	0.000000	29.000000	70.350000
75%	0.000000	55.000000	89.850000
max	1.000000	72.000000	118.750000

Issue 2: the tenure column contains 0 values; a customer's tenure should be at least 1 month.

With these findings, it becomes clear why the data was read with the parameter na_values={'tenure': 0, 'TotalCharges': ' '}: it declares the missing-value markers for the tenure and TotalCharges columns respectively.
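Equivalently, the same cleanup can be done after reading, as in this sketch (applied to df1, the frame read without na_values):

# Post-hoc alternative to na_values: coerce TotalCharges to numeric and treat tenure == 0 as missing
df1['TotalCharges'] = pd.to_numeric(df1['TotalCharges'], errors='coerce')
df1.loc[df1['tenure'] == 0, 'tenure'] = np.nan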

2. Handling Missing Data

Using the observation methods introduced above, check the missing values in the data:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7032 non-null   float64
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7032 non-null   float64
 20  Churn             7043 non-null   object 
dtypes: float64(3), int64(1), object(17)
memory usage: 1.1+ MB

The tenure and TotalCharges columns each have 11 missing values. We can take a closer look at the affected rows:

df[df.isna().any(axis=1)]
	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
488	4472-LVYGI	Female	0	Yes	Yes	0	No	No phone service	DSL	Yes	No	Yes	Yes	Yes	No	Two year	Yes	Bank transfer (automatic)	52.55		No
753	3115-CZMZD	Male	0	No	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.25		No
936	5709-LVOEQ	Female	0	Yes	Yes	0	Yes	No	DSL	Yes	Yes	Yes	No	Yes	Yes	Two year	No	Mailed check	80.85		No
1082	4367-NUYAO	Male	0	Yes	Yes	0	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.75		No
1340	1371-DWPAZ	Female	0	Yes	Yes	0	No	No phone service	DSL	Yes	Yes	Yes	Yes	Yes	No	Two year	No	Credit card (automatic)	56.05		No
3331	7644-OMVMY	Male	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	19.85		No
3826	3213-VVOLG	Male	0	Yes	Yes	0	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.35		No
4380	2520-SGTTA	Female	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.00		No
5218	2923-ARZLG	Male	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	One year	Yes	Mailed check	19.70		No
6670	4075-WKNIU	Female	0	Yes	Yes	0	Yes	Yes	DSL	No	Yes	Yes	Yes	Yes	No	Two year	No	Mailed check	73.35		No
6754	2775-SEFEE	Male	0	No	Yes	0	Yes	Yes	DSL	Yes	Yes	No	Yes	No	No	Two year	Yes	Bank transfer (automatic)	61.90		No

Note that whenever tenure (months with the company) is missing, TotalCharges (total charges) is missing as well, while MonthlyCharges (monthly charges) has a value. A reasonable guess is that these rows come from newly joined customers, which suggests the following fill strategy (a quick sanity check follows the code):

  • fill missing tenure values with 1
  • fill missing TotalCharges values with the corresponding MonthlyCharges
df['tenure'] = df['tenure'].fillna(1)                                 # new customers: at least 1 month
df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'])  # first month's charge
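
A quick sanity check (sketch) that no missing values remain after the fill:

df.isna().sum().sum()   # expected: 0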

3. Building the Churn Model

Looking back at the data, there are 7,043 samples, 20 features, and 1 label.
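This can be confirmed from the shape of the frame:

df.shape   # (7043, 21): 20 feature columns plus the Churn label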

1) Data Transformation

Some of the 20 features are strings and need to be encoded appropriately. The plan is to handle the features as follows (a toy illustration of the two encoders follows this list):

  • apply ordinal encoding to categorical columns (dtype object) with exactly 2 distinct values
  • apply one-hot encoding to categorical columns with more than 2 distinct values
  • standardize the numeric columns (whether this is needed depends on the model; a tree-based model does not require it)
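As referenced above, here is a toy sketch of what the two encoders do; the miniature DataFrame is made up purely for illustration.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

toy = pd.DataFrame({'Partner': ['Yes', 'No', 'Yes'],
                    'Contract': ['Month-to-month', 'One year', 'Two year']})
OrdinalEncoder().fit_transform(toy[['Partner']])
# array([[1.], [0.], [1.]]) -- each category becomes an integer code
OneHotEncoder(sparse_output=False).fit_transform(toy[['Contract']])   # sparse=False on scikit-learn < 1.2
# array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]) -- one indicator column per category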

2) Feature Selection

Features can be selected with filter, embedded, or wrapper methods. Here a filter method is used: features are screened by their chi-square statistic, as sketched below.
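A minimal, self-contained sketch of chi-square filtering with SelectKBest; the toy matrix is randomly generated (chi2 requires non-negative inputs), and in the churn pipeline the encoded categorical features play this role.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(100, 6))   # non-negative toy features
y_toy = rng.integers(0, 2, size=100)        # toy binary labels
SelectKBest(chi2, k=2).fit_transform(X_toy, y_toy).shape   # keeps the 2 best-scoring features -> (100, 2)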

3) Model Construction

Using a decision tree as the classifier, the whole Pipeline is built below, and grid search is used to find the best parameters. The complete code:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# separate the feature matrix and the label
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# collect column names by type; customerID (the first column) is excluded
cols_n = [col for col in X.columns[1:] if X[col].dtype != object]   # numeric feature columns
cols_c = [col for col in X.columns[1:] if X[col].dtype == object]   # categorical feature columns
cols_c_2 = [col for col in cols_c if X[col].nunique() == 2]         # categorical columns with 2 categories
cols_c_x = [col for col in cols_c if col not in cols_c_2]           # categorical columns with more than 2 categories

pipe = Pipeline(
[
    ('features', ColumnTransformer(
    [
        ('del_id', 'drop', ['customerID']),                    # drop the obviously uninformative ID column
        ('order_enc', OrdinalEncoder(), cols_c_2),              # ordinal-encode 2-category columns
        ('onehot_enc', OneHotEncoder(sparse=False), cols_c_x)   # one-hot encode >2-category columns (sparse_output=False on scikit-learn >= 1.2)
    ]
    # columns not listed above (the numeric ones) are dropped by the default remainder='drop'
    )),
    ('chi2', SelectKBest(chi2)),        # chi-square feature selection; k is tuned by the grid search
    ('clf', DecisionTreeClassifier())
]
)

param_grid = {
    'chi2__k': range(5, 30),        # number of features kept by SelectKBest (step name is 'chi2')
    'clf__max_depth': range(2, 20)  # maximum depth of the decision tree
}

# n_jobs=-1 uses all available CPU cores to speed up the search
gsv = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)
gsv.fit(X,y)

Inspect the best score and parameters:

gsv.best_score_,gsv.best_params_
(0.7864533598941867, {'chi2__k': 19, 'clf__max_depth': 9})
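
Because GridSearchCV refits the best pipeline on the full data by default (refit=True), the fitted object can be used directly for prediction, for example:

preds = gsv.predict(X.head())      # predictions for the first few customers
best_pipe = gsv.best_estimator_    # the refitted Pipeline with the best parameters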

sklearn provides three main tools for building composite models: Pipeline, FeatureUnion, and ColumnTransformer. They compare as follows (a FeatureUnion sketch follows the comparison):

  • Pipeline
    • What it does: applies a series of transformers sequentially, followed by a final estimator.
    • When to use it: when building a machine learning pipeline that transforms the data and then predicts.
  • ColumnTransformer
    • What it does: applies different transformers to different subsets of columns in parallel and concatenates their outputs.
    • When to use it: when different transformations are to be applied to different subsets of columns; use together with Pipeline.
  • FeatureUnion
    • What it does: applies different transformers to the same input data in parallel and concatenates their outputs.
    • When to use it: when different transformations are to be applied to the same input data; use together with Pipeline.
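
Since FeatureUnion is not used in the churn pipeline above, here is a minimal sketch of how it composes with Pipeline; PCA and the already-encoded, non-negative matrix X_num are assumptions for illustration only.

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

union = FeatureUnion([
    ('pca', PCA(n_components=3)),        # 3 principal components of the input
    ('kbest', SelectKBest(chi2, k=5)),   # 5 best features by chi-square score
])
model = Pipeline([('features', union), ('clf', DecisionTreeClassifier())])
# model.fit(X_num, y)   # X_num is assumed to be an encoded, non-negative feature matrix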

III. Experiment Data

Link: https://pan.baidu.com/s/1RlIlRLDI63sECfO-N5VsCg?pwd=zf01
Access code: zf01