Building composite estimators in sklearn (a telecom customer churn model)
Part 0: Environment and objectives
1. Environment
- Win10
- anaconda3
- JupyterLab
2. Objective
Learn the basic usage of Pipeline and ColumnTransformer in sklearn.
Part 1: Data description
File name: WA_Fn-UseC_-Telco-Customer-Churn.csv
Fields:
- customerID: Customer ID
- gender: Whether the customer is a male or a female
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (Yes or No)
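Where to get it:
- Download from Kaggle
- Download from the Baidu Netdisk link at the end of this article
Part 2: Building the model
1. Loading the data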
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)  # show every column instead of truncating wide frames
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', na_values={'tenure':0, 'TotalCharges':' '})
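Common ways to inspect the data
- head/tail, mainly to check the following:
  - whether the file encoding is correct (read_csv parameter: encoding)
  - whether the csv separator is set correctly (read_csv parameter: sep)
  - whether there are comment lines, footers, and so on (read_csv parameters: comment, skiprows, skipfooter)
  - whether the data contains a header row with the field names (read_csv parameters: header, names)
  - whether the data contains a row index (read_csv parameter: index_col)
  - …
- info
  - check whether any field has missing values and whether each field's dtype is reasonable
- describe
  - check the maximum, minimum, mean, median, and other basic statistics of each numeric field, to spot noisy values, skewed distributions, and similar problems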
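The checks listed above can be tried on a small in-memory file before touching the real one. The sketch below is only an illustration: the CSV text, its `;` separator, and its leading comment line are made up for the example and do not describe the Telco file.

```python
import io

import pandas as pd

# Hypothetical toy CSV (not the real Telco file): a comment line,
# a ';' separator, an id column, and a blank TotalCharges value.
raw = io.StringIO(
    "# exported from a billing system\n"
    "id;tenure;TotalCharges\n"
    "a1;1;29.85\n"
    "a2;34;1889.5\n"
    "a3;0; \n"
)

# sep, comment and index_col are the read_csv parameters discussed above
toy = pd.read_csv(raw, sep=";", comment="#", index_col="id")

print(toy.head())      # check separator, header row and index
toy.info()             # check non-null counts and dtypes
print(toy.describe())  # check min / max / mean of the numeric fields
```

In this toy frame, `info` reports a non-numeric dtype for TotalCharges (because of the blank value) and `describe` shows a minimum tenure of 0, the same two symptoms that turn up in the real data.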
Applying these methods to the data as read with default settings reveals the following problems:
df1 = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
Problem 1. The field TotalCharges has dtype object, whereas it should normally be float.
df1.describe()
       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000
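Problem 2. The minimum of the field tenure (months with the company) is 0, whereas a customer should normally have at least 1 month.
These findings explain why the data was read with the parameter na_values={'tenure':0, 'TotalCharges':' '}: it specifies the missing-value markers for the tenure and TotalCharges fields respectively.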
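To see the effect of a per-column `na_values` dictionary in isolation, here is a minimal sketch on a made-up three-row CSV (the rows are invented for illustration; only the column names mirror the real file). It reads the same text twice, once with defaults and once with the markers used above.

```python
import io

import pandas as pd

# Hypothetical toy data: row a2 has tenure 0 and a blank TotalCharges
csv_text = (
    "id,tenure,TotalCharges\n"
    "a1,1,29.85\n"
    "a2,0, \n"
    "a3,34,1889.5\n"
)

# Default read: the blank string keeps TotalCharges from being parsed as float
plain = pd.read_csv(io.StringIO(csv_text))

# Per-column missing-value markers, as in the article's read_csv call
marked = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"tenure": 0, "TotalCharges": " "},
)

print(plain.dtypes)
print(marked.dtypes)
print(marked.isna().sum())
```

Because the markers are given per column, a 0 in any other column would be left alone; only tenure treats 0 as missing.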
2. Handling missing data
Using the inspection methods described above, check the data for missing values:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7032 non-null float64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7032 non-null float64
20 Churn 7043 non-null object
dtypes: float64(3), int64(1), object(17)
memory usage: 1.1+ MB
The fields tenure and TotalCharges each have 11 missing values. The affected rows can be examined further:
df[df.isna().any(axis=1)]
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
753 3115-CZMZD Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 No
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
3331 7644-OMVMY Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 No
3826 3213-VVOLG Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 No
4380 2520-SGTTA Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 No
5218 2923-ARZLG Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 No
6670 4075-WKNIU Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 No
6754 2775-SEFEE Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 No
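Whenever tenure (months with the company) is missing, TotalCharges is missing too, yet MonthlyCharges has a value. A plausible guess is that these rows belong to customers who have only just joined, which suggests the following way to fill the gaps:
- set missing tenure values to 1
- fill missing TotalCharges values with MonthlyCharges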
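# Assign the result back: fillna(inplace=True) on an attribute-accessed column may not update df in recent pandas versions
df['tenure'] = df['tenure'].fillna(1)
df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'])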
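Filling one column from another works because `fillna` accepts a Series and aligns it on the index, so each missing value is replaced by the value from the same row. A minimal sketch on a hypothetical three-row frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the middle row mimics a customer who has just joined
toy = pd.DataFrame(
    {
        "tenure": [1.0, np.nan, 34.0],
        "MonthlyCharges": [29.85, 52.55, 56.95],
        "TotalCharges": [29.85, np.nan, 1889.5],
    }
)

# A scalar fills every gap with the same value
toy["tenure"] = toy["tenure"].fillna(1)
# A Series fills each gap with the value from the same row (index alignment)
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["MonthlyCharges"])

print(toy)
```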
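3. Building the churn model
The data above contains 7043 samples, 20 feature columns (counting customerID), and 1 label.
1) Data transformation
Some of the 20 features are strings and need to be encoded appropriately. We would like to process the features as follows:
- apply ordinal encoding to categorical columns (dtype object) with 2 distinct values
- apply one-hot encoding to categorical columns with more than 2 distinct values
- standardize the numeric fields (whether standardization is needed depends on the model)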
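The three rules above map onto one `ColumnTransformer` with three branches. The following is a minimal sketch on a hypothetical four-row frame; the column names are borrowed from the dataset but the values are invented, and `sparse_threshold=0` is set only so that the toy result is always a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy frame: one binary, one multi-valued and one numeric column
toy = pd.DataFrame(
    {
        "Partner": ["Yes", "No", "No", "Yes"],
        "Contract": ["Month-to-month", "One year", "Two year", "One year"],
        "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    }
)

ct = ColumnTransformer(
    [
        ("ordinal", OrdinalEncoder(), ["Partner"]),       # 2 values -> 0/1
        ("onehot", OneHotEncoder(), ["Contract"]),        # 3 values -> 3 columns
        ("scale", StandardScaler(), ["MonthlyCharges"]),  # zero mean, unit variance
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = ct.fit_transform(toy)
print(encoded.shape)
print(encoded)
```

One caveat worth keeping in mind: standardized values can be negative, and the chi-squared test used for feature selection requires non-negative inputs, so the scaling branch and chi2 do not combine directly.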
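2) Feature selection
Features can be selected with filter, embedded, or wrapper methods. Here a filter method is used: features are screened by their chi-squared statistic.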
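As a small illustration of chi-squared filtering, the sketch below runs `SelectKBest(chi2, k=1)` on a hand-made array (hypothetical data, chosen so that the first column copies the label and the other two carry little information). chi2 expects non-negative features such as counts or 0/1 indicators, which is what encoded categorical columns provide.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: column 0 equals the label, columns 1 and 2 are weak
X_toy = np.array(
    [
        [1, 1, 1],
        [1, 0, 1],
        [1, 1, 0],
        [0, 0, 0],
        [0, 1, 1],
        [0, 0, 0],
    ]
)
y_toy = np.array([1, 1, 1, 0, 0, 0])

selector = SelectKBest(chi2, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

print(selector.scores_)        # one chi-squared score per column
print(selector.get_support())  # boolean mask of the columns that were kept
```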
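3) Model construction
A decision tree is used here to illustrate how the whole Pipeline is built, with grid search to find the best parameters. The complete code follows: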
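from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
# Separate the features from the label
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Collect column names by type; customerID (the first column) is skipped
# Numeric feature names
cols_n = [col for col in X.columns[1:] if pd.api.types.is_numeric_dtype(X[col])]
# Categorical (string) feature names
cols_ = [col for col in X.columns[1:] if not pd.api.types.is_numeric_dtype(X[col])]
# Categorical features with exactly 2 distinct values
cols_c_2 = [col for col in cols_ if X[col].nunique() == 2]
# Categorical features with more than 2 distinct values
cols_c_x = [col for col in cols_ if col not in cols_c_2]
# Note: cols_n is not passed to the ColumnTransformer, so with the default
# remainder='drop' the numeric columns do not reach the model
pipe = Pipeline(
    [
        ('features', ColumnTransformer(
            [
                ('del_id', 'drop', ['customerID']),  # drop the identifier, which carries no predictive meaning
                ('order_enc', OrdinalEncoder(), cols_c_2),  # ordinal-encode the 2-valued features
                # one-hot encode the features with more than 2 values
                # (on scikit-learn < 1.2 the parameter is named sparse instead of sparse_output)
                ('onehot_enc', OneHotEncoder(sparse_output=False), cols_c_x)
            ]
        )),
        ('chi2', SelectKBest(chi2)),
        ('clf', DecisionTreeClassifier())
    ]
)
# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'chi2__k': range(5, 30),
    'clf__max_depth': range(2, 20)
}
# n_jobs=-1 uses all CPU cores so that the search runs faster
gsv = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)
gsv.fit(X, y)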
Inspect the best score and the best parameters
gsv.best_score_,gsv.best_params_
(0.7864533598941867, {'chi2__k': 19, 'clf__max_depth': 9})
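The keys in `best_params_` follow the pattern `<step name>__<parameter name>`, which is how GridSearchCV reaches into the steps of a Pipeline. The sketch below shows the convention on synthetic data from `make_classification`. Everything in it is an assumption made for illustration: the step names `select` and `clf`, the small grid, and the use of `f_classif` in place of chi2 (the synthetic features can be negative, which chi2 does not accept).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Telco dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

toy_pipe = Pipeline(
    [
        ("select", SelectKBest(f_classif)),  # f_classif swapped in for chi2
        ("clf", DecisionTreeClassifier(random_state=0)),
    ]
)

# "<step name>__<parameter name>" addresses a parameter of one step
toy_grid = {"select__k": [2, 4, 8], "clf__max_depth": [2, 4]}

search = GridSearchCV(toy_pipe, param_grid=toy_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

A key that does not match any step name makes the fit fail with an invalid-parameter error, so `toy_pipe.get_params().keys()` is a handy way to list the valid names before launching a long search.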
sklearn provides three tools for building composite models: Pipeline, FeatureUnion, and ColumnTransformer. They compare as follows:
| | What does it do? | When to use it? |
| --- | --- | --- |
| Pipeline | Apply a series of transformers sequentially and then a final estimator | When building a machine learning pipeline that transforms the data and then predicts |
| ColumnTransformer | Apply different transformers to different subsets of columns in parallel and concatenate the output of these parallel transformations | When different data transformations are to be applied to different subsets of columns. Use together with Pipeline. |
| FeatureUnion | Apply different transformers on the same input data in parallel and concatenate the output of these parallel transformations | When different data transformations are to be applied to the same input data. Use together with Pipeline. |
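FeatureUnion is the only one of the three that the churn model does not use, so here is a minimal sketch of it on a hypothetical six-row array. The choice of StandardScaler, PCA, and LogisticRegression is purely illustrative: both transformers receive the same two input columns and their outputs are placed side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with 2 numeric features
X_toy = np.array(
    [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Both transformers see the same input; their outputs are concatenated
union = FeatureUnion(
    [
        ("scaled", StandardScaler()),  # 2 output columns
        ("pca", PCA(n_components=1)),  # 1 output column
    ]
)

model = Pipeline([("features", union), ("clf", LogisticRegression())])
model.fit(X_toy, y_toy)

combined = union.fit_transform(X_toy)
print(combined.shape)  # 2 scaled columns + 1 principal component
print(model.predict(X_toy))
```

ColumnTransformer differs in that each branch receives only its own subset of columns, which is why it suits mixed categorical and numeric tables like the churn data.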
Part 3: Lab data
Link: https://pan.baidu.com/s/1RlIlRLDI63sECfO-N5VsCg?pwd=zf01
Extraction code: zf01