Building Composite Estimators in sklearn (Telecom Customer Churn Model)

JJustRight

Last modified: 2022-05-10 08:34:40


First published: 2022-05-04 23:39:14

Article link: https://blog.csdn.net/tangyi2008/article/details/124580082


Part 0. Environment and Objectives
- 1. Environment
- 2. Objectives
Part 1. Data Description
Part 2. Model Building
Part 3. Experiment Data

Part 0. Environment and Objectives

1. Environment

Win10
anaconda3
JupyterLab

2. Objectives

Learn the basic use of Pipeline and ColumnTransformer in sklearn.

Part 1. Data Description

File name: WA_Fn-UseC_-Telco-Customer-Churn.csv

Fields:

customerID: customer ID
gender: whether the customer is male or female
SeniorCitizen: whether the customer is a senior citizen (1, 0)
Partner: whether the customer has a partner (Yes, No)
Dependents: whether the customer has dependents (Yes, No)
tenure: number of months the customer has stayed with the company
PhoneService: whether the customer has a phone service (Yes, No)
MultipleLines: whether the customer has multiple lines (Yes, No, No phone service)
InternetService: the customer's internet service provider (DSL, Fiber optic, No)
OnlineSecurity: whether the customer has online security (Yes, No, No internet service)
OnlineBackup: whether the customer has online backup (Yes, No, No internet service)
DeviceProtection: whether the customer has device protection (Yes, No, No internet service)
TechSupport: whether the customer has tech support (Yes, No, No internet service)
StreamingTV: whether the customer has streaming TV (Yes, No, No internet service)
StreamingMovies: whether the customer has streaming movies (Yes, No, No internet service)
Contract: the contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: whether the customer has paperless billing (Yes, No)
PaymentMethod: the customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: the amount charged to the customer monthly
TotalCharges: the total amount charged to the customer
Churn: whether the customer churned (Yes or No)

Where to get it:

Download from Kaggle
Download from the Baidu Netdisk link at the end of this article

Part 2. Model Building

1. Loading the data

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)  # show all columns without truncation
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', na_values={'tenure':0, 'TotalCharges':' '})

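Common ways to inspect the data

head/tail, mainly to check the following:
Whether the file encoding is correct (read_csv parameter: encoding)
Whether the CSV delimiter is set correctly (read_csv parameter: sep)
Whether there are comment lines, footers and the like (read_csv parameters: comment, skiprows, skipfooter)
Whether the data contains a header row with the field names (read_csv parameters: header, names)
Whether the data contains a row index (read_csv parameter: index_col)
…

info
Check whether any field has missing values and whether each field's dtype is reasonable.

describe
Check the maximum, minimum, mean, median and other basic statistics of each numeric field, in order to spot noisy values, skewed distributions and similar problems.

Applying these methods to the data as read directly, without any extra arguments, reveals the following problems: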
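
As an illustration of the checks above, here is a minimal sketch that reads a small in-memory CSV and then calls head, info and describe. The CSV content and the names raw and demo are made up for this example; only the read_csv parameters come from the list above.

```python
import io
import pandas as pd

# Hypothetical inline CSV: one comment line, ';' as the separator, an id column.
raw = "\n".join([
    "# exported 2022-05-04",
    "id;name;charge",
    "a1;Ann;10.5",
    "b2;Bob;20.0",
])

# comment='#' skips the comment line, sep=';' sets the delimiter,
# header=0 takes the first remaining line as the column names,
# index_col='id' uses the id column as the row index.
demo = pd.read_csv(io.StringIO(raw), sep=';', comment='#', header=0, index_col='id')

print(demo.head())      # first rows: check the separator, header and index
demo.info()             # non-null counts and dtype of each column
print(demo.describe())  # summary statistics of the numeric columns
```

If the separator or header were set wrongly, head would typically show everything squeezed into a single column or the field names appearing as a data row, which is why it is worth looking at first.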
df1 = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
Problem 1. The TotalCharges field has dtype object, whereas it should normally be float.
df1.describe()
	SeniorCitizen	tenure	MonthlyCharges
count	7043.000000	7043.000000	7043.000000
mean	0.162147	32.371149	64.761692
std	0.368612	24.559481	30.090047
min	0.000000	0.000000	18.250000
25%	0.000000	9.000000	35.500000
50%	0.000000	29.000000	70.350000
75%	0.000000	55.000000	89.850000
max	1.000000	72.000000	118.750000
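Problem 2. The minimum of the tenure field (months with the company) is 0, whereas a customer should normally have a tenure of at least 1 month.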

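These findings explain why the argument na_values={'tenure':0, 'TotalCharges':' '} was passed when loading the data: it specifies, column by column, which value marks a missing entry in tenure and in TotalCharges.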
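
The effect of a per-column na_values dictionary can be sketched on a tiny in-memory CSV. The three rows below are invented for illustration (they only mimic the shape of the real file), and the names rows, plain, clean and coerced are hypothetical. The last line shows pd.to_numeric with errors='coerce' as one possible alternative when the file has already been read without na_values.

```python
import io
import pandas as pd

# Hypothetical rows: the first one has tenure 0 and a single blank as TotalCharges.
rows = [
    "customerID,tenure,MonthlyCharges,TotalCharges",
    "A,0,52.55, ",
    "B,2,20.0,40.0",
    "C,10,30.0,300.0",
]
text = "\n".join(rows)

# Read as-is: the blank keeps TotalCharges from being parsed as a number.
plain = pd.read_csv(io.StringIO(text))
print(plain.dtypes)

# Read with per-column missing-value markers, as in the article.
clean = pd.read_csv(io.StringIO(text), na_values={'tenure': 0, 'TotalCharges': ' '})
print(clean.dtypes)
print(clean.isna().sum())

# Alternative after the fact: coerce unparseable strings to NaN.
coerced = pd.to_numeric(plain['TotalCharges'], errors='coerce')
```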

2. Handling missing data

Using the inspection methods introduced above, check the data for missing values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7032 non-null   float64
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7032 non-null   float64
 20  Churn             7043 non-null   object 
dtypes: float64(3), int64(1), object(17)
memory usage: 1.1+ MB

The tenure and TotalCharges fields each have 11 missing entries. The rows with missing data can be examined further:

df[df.isna().any(axis=1)]

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
488	4472-LVYGI	Female	0	Yes	Yes	0	No	No phone service	DSL	Yes	No	Yes	Yes	Yes	No	Two year	Yes	Bank transfer (automatic)	52.55		No
753	3115-CZMZD	Male	0	No	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.25		No
936	5709-LVOEQ	Female	0	Yes	Yes	0	Yes	No	DSL	Yes	Yes	Yes	No	Yes	Yes	Two year	No	Mailed check	80.85		No
1082	4367-NUYAO	Male	0	Yes	Yes	0	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.75		No
1340	1371-DWPAZ	Female	0	Yes	Yes	0	No	No phone service	DSL	Yes	Yes	Yes	Yes	Yes	No	Two year	No	Credit card (automatic)	56.05		No
3331	7644-OMVMY	Male	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	19.85		No
3826	3213-VVOLG	Male	0	Yes	Yes	0	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.35		No
4380	2520-SGTTA	Female	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.00		No
5218	2923-ARZLG	Male	0	Yes	Yes	0	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	One year	Yes	Mailed check	19.70		No
6670	4075-WKNIU	Female	0	Yes	Yes	0	Yes	Yes	DSL	No	Yes	Yes	Yes	Yes	No	Two year	No	Mailed check	73.35		No
6754	2775-SEFEE	Male	0	No	Yes	0	Yes	Yes	DSL	Yes	Yes	No	Yes	No	No	Two year	Yes	Bank transfer (automatic)	61.90		No

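Whenever tenure (months with the company) is missing, TotalCharges is missing as well, while MonthlyCharges does have a value. A reasonable guess is that these rows belong to customers who have only just joined, which suggests the following way to fill the gaps:

Set the missing values of tenure to 1
Fill the missing values of TotalCharges with MonthlyCharges

# Assign the result back instead of calling fillna(..., inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to df
df['tenure'] = df['tenure'].fillna(1)
df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'])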
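
The two filling rules can be checked on a small frame before applying them to the full data. This is only a sketch: the values are invented, demo is a hypothetical name, and the assignment form is used so that the result does not depend on how inplace behaves on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: the first row mimics a customer who has just joined.
demo = pd.DataFrame({
    'tenure': [np.nan, 2.0, 10.0],
    'MonthlyCharges': [52.55, 20.0, 30.0],
    'TotalCharges': [np.nan, 40.0, 300.0],
})

# Rule 1: a missing tenure becomes 1 month.
demo['tenure'] = demo['tenure'].fillna(1)
# Rule 2: a missing TotalCharges takes the MonthlyCharges of the same row
# (fillna with a Series aligns on the row index).
demo['TotalCharges'] = demo['TotalCharges'].fillna(demo['MonthlyCharges'])

print(demo)
print(demo.isna().sum())  # every count should now be zero
```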

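3. Building the churn model

From the data shown above, there are 7043 samples, 20 features and 1 label.

1) Data transformation

Some of the 20 features are strings and need a suitable encoding. We want to process the features as follows:

Categorical columns (dtype object) with exactly 2 values: ordinal encoding
Categorical columns with more than 2 values: one-hot encoding
Numeric columns: standardization (whether this is needed depends on the model)

2) Feature selection

Features can be selected with filter, embedded or wrapper methods. Here a filter method is used: features are screened by their chi-squared statistic.

3) Model construction

A decision tree is used as the example for building the whole Pipeline, with a grid search to find the best parameters. The complete code is: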
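
To make the three treatments concrete, here is a minimal sketch on a hypothetical three-row frame. The column names are borrowed from the dataset, but the values and the names demo, ordinal, onehot and scaled are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

# Hypothetical sample: one binary column, one 3-valued column, one numeric column.
demo = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-month', 'Two year', 'One year'],
    'MonthlyCharges': [20.0, 50.0, 80.0],
})

# Ordinal encoding: categories are sorted, so 'No' -> 0 and 'Yes' -> 1.
ordinal = OrdinalEncoder().fit_transform(demo[['Partner']])

# One-hot encoding: one indicator column per category (sorted alphabetically).
# The default output is a sparse matrix, converted here for printing.
onehot = OneHotEncoder().fit_transform(demo[['Contract']]).toarray()

# Standardization: subtract the mean and divide by the standard deviation.
scaled = StandardScaler().fit_transform(demo[['MonthlyCharges']])

print(ordinal.ravel())
print(onehot)
print(scaled.ravel())
```

Note that standardized values can be negative, which matters if a chi-squared filter is applied afterwards, since chi2 in sklearn only accepts non-negative features.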
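
The chi-squared filter can be sketched with SelectKBest on toy data. The arrays below are invented: the first column copies the label and the second is only weakly related to it, so keeping k=1 feature should retain the first column. The names X_demo, y_demo and selector are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical binary features; chi2 requires non-negative values.
y_demo = np.array([1, 1, 0, 0, 1, 0])
X_demo = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [1, 1],
    [0, 0],
])

# Score every feature against the label and keep the single best one.
selector = SelectKBest(chi2, k=1)
X_new = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)         # chi-squared statistic of each feature
print(selector.get_support())   # boolean mask of the features that were kept
print(X_new.shape)
```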

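from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# Separate the features from the label
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Collect the column names of each type; customerID (column 0) is excluded
# Numeric feature columns (not listed in the ColumnTransformer below, so they are
# dropped by its default remainder='drop')
cols_n = [col for col in X.columns[1:] if X[col].dtype != object]
cols_ = [col for col in X.columns[1:] if X[col].dtype == object]
# Categorical columns with exactly 2 categories
cols_c_2 = [col for col in cols_ if X[col].nunique() == 2]
# Categorical columns with more than 2 categories
cols_c_x = [col for col in cols_ if col not in cols_c_2]

pipe = Pipeline(
[
    ( 'features', ColumnTransformer(
    [
        ('del_id', 'drop', ['customerID']),  # drop a feature that clearly carries no meaning
        ('order_enc', OrdinalEncoder(), cols_c_2),  # ordinal-encode the 2-category features
        # one-hot encode the features with more than 2 categories
        # (in scikit-learn >= 1.2 the parameter is named sparse_output instead of sparse)
        ('onehot_enc', OneHotEncoder(sparse=False), cols_c_x)
    ]
    )),
    ('chi2', SelectKBest(chi2)),
    ('clf', DecisionTreeClassifier())
]
)

# Keys are '<step name>__<parameter>'; the step is named 'chi2', so the key is 'chi2__k'
param_grid = {
    'chi2__k':range(5,30),
    'clf__max_depth': range(2,20)
}

# n_jobs=-1 uses all CPU cores so that the search runs faster
gsv = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)
gsv.fit(X,y)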
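
One detail of ColumnTransformer worth seeing in isolation is what happens to columns that are not listed in any transformer: by default (remainder='drop') they are discarded, while remainder='passthrough' appends them unchanged. The sketch below uses a hypothetical three-row frame; demo, ct_drop and ct_keep are names chosen for this example only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample with one categorical and two numeric columns.
demo = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'tenure': [1, 2, 10],
    'MonthlyCharges': [52.55, 20.0, 30.0],
})

# Default remainder='drop': only the encoded Partner column survives.
ct_drop = ColumnTransformer([('order_enc', OrdinalEncoder(), ['Partner'])])
out_drop = ct_drop.fit_transform(demo)

# remainder='passthrough': the unlisted numeric columns are kept as they are.
ct_keep = ColumnTransformer(
    [('order_enc', OrdinalEncoder(), ['Partner'])],
    remainder='passthrough',
)
out_keep = ct_keep.fit_transform(demo)

print(out_drop.shape)  # (3, 1)
print(out_keep.shape)  # (3, 3)
```

Whether numeric columns should be passed through, scaled, or left out is a modelling choice; this sketch only shows the mechanism.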

Inspect the best score and the best parameters

gsv.best_score_,gsv.best_params_

(0.7864533598941867, {'chi2__k': 19, 'clf__max_depth': 9})
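
The keys in best_params_ follow the pattern '<step name>__<parameter>', where the step name is the one given when the Pipeline was built. The following is a self-contained sketch of that convention on synthetic data, not a reproduction of the churn experiment: make_classification, MinMaxScaler, the grid values and the names demo_pipe, demo_grid and search are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),  # rescale the training data to [0, 1]; chi2 needs non-negative input
    ('chi2', SelectKBest(chi2)),
    ('clf', DecisionTreeClassifier(random_state=0)),
])

# Each key is '<step name>__<parameter of that step>'.
demo_grid = {'chi2__k': [2, 4, 6], 'clf__max_depth': [2, 3, 4]}

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
# get_params lists every valid key, which helps catch a misspelled step name.
print([name for name in demo_pipe.get_params() if '__' in name][:5])
```

A key whose prefix does not match any step name makes the search fail with an invalid-parameter error, so listing get_params() first is a cheap sanity check.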

sklearn provides three tools for building composite models: Pipeline, FeatureUnion and ColumnTransformer. The table below compares them:


	What does it do?	When to use it?
Pipeline	Apply a series of transformers sequentially and then a final estimator	When building a machine learning pipeline that transforms the data and then predicts
ColumnTransformer	Apply different transformers to different subsets of columns in parallel and concatenate the output of these parallel transformations	When different data transformations are to be applied to different subsets of columns. Use together with Pipeline.
FeatureUnion	Apply different transformers on the same input data in parallel and concatenate the output of these parallel transformations	When different data transformations are to be applied to the same input data. Use together with Pipeline.
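
Since the article's own code covers Pipeline and ColumnTransformer, here is a minimal sketch of the third tool, FeatureUnion. The random data and the choice of PCA and SelectKBest with f_classif as the two branches are assumptions made for illustration; X_demo, y_demo and union are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Hypothetical data: 20 samples, 5 numeric features, alternating binary labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 5)
y_demo = np.array([0, 1] * 10)

# Both transformers receive the same full input;
# their outputs are concatenated column-wise.
union = FeatureUnion([
    ('pca', PCA(n_components=2)),            # 2 derived components
    ('kbest', SelectKBest(f_classif, k=1)),  # 1 original feature
])
X_union = union.fit_transform(X_demo, y_demo)

print(X_union.shape)  # (20, 3): 2 PCA columns + 1 selected column
```

The contrast with ColumnTransformer is that every branch here sees all five columns, whereas ColumnTransformer hands each branch only its own subset.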

Part 3. Experiment Data

Link: https://pan.baidu.com/s/1RlIlRLDI63sECfO-N5VsCg?pwd=zf01
Extraction code: zf01
