[One-Week Algorithm Practice, Advanced] Task 2: Feature Engineering

Latest recommended article published on 2023-04-01 11:13:27

XiongLY0


Article link: https://blog.csdn.net/bear507/article/details/86696246


Import the packages used in this task:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
                            confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'

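Preparing the data

Loading the data

Original dataset download link: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is a financial dataset (already processed, not the raw data). The goal is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Here we load the dataset that was already cleaned in the previous post ([One-Week Algorithm Practice, Advanced] Task 1: Data Preprocessing):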

data_processed = pd.read_csv('data_processed.csv')
data_processed.head()

	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	...	loans_latest_day	一线城市	三线城市	latest_query_time_month	latest_query_time_weekday	loans_latest_time_month	loans_latest_time_weekday
0	0.01	0.99	0.0	0.90	0.55	0.313	17.0	27.0	26.0	3.0	...	18.0	1	0	4	2	4	3
1	0.02	0.94	2000.0	1.28	1.00	0.458	19.0	30.0	14.0	4.0	...	2.0	1	0	5	3	5	5
2	0.04	0.96	0.0	1.00	1.00	0.114	13.0	68.0	22.0	1.0	...	6.0	1	0	5	5	5	1
3	0.00	0.96	2000.0	0.13	0.57	0.777	22.0	14.0	6.0	3.0	...	4.0	0	1	5	5	5	3
4	0.01	0.99	0.0	0.46	1.00	0.175	13.0	66.0	42.0	1.0	...	120.0	1	0	4	6	1	6

5 rows × 89 columns

Splitting the data

Split the raw data into the feature set and the labels:

label = data_processed['status']
data = data_processed.drop(['status'], axis=1)

Standardization

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

data_scaled.head()

	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	...	loans_latest_day	一线城市	三线城市	二线城市	其他城市	境外	latest_query_time_month	latest_query_time_weekday	loans_latest_time_month	loans_latest_time_weekday
0	-0.281946	0.613274	-0.493153	-0.019542	-1.295506	-0.315097	-0.120837	-0.090540	0.256651	0.361774	...	-0.696716	0.632505	-0.538091	-0.170886	-0.029907	-0.181666	-0.211985	-0.674125	-0.188457	-0.015516
1	-0.044836	0.267090	0.007392	-0.019011	0.993048	0.534560	0.323209	0.040603	-0.467915	1.483449	...	-0.996528	0.632505	-0.538091	-0.170886	-0.029907	-0.181666	0.513432	-0.161836	0.135729	1.005817
2	0.429385	0.405564	-0.493153	-0.019402	0.993048	-1.481178	-1.008929	1.701755	0.015129	-1.881576	...	-0.921575	0.632505	-0.538091	-0.170886	-0.029907	-0.181666	0.513432	0.862743	0.135729	-1.036849
3	-0.519056	0.405564	0.007392	-0.020619	-1.193793	2.403805	0.989278	-0.658829	-0.950958	0.361774	...	-0.959052	-1.581015	1.858422	-0.170886	-0.029907	-0.181666	0.513432	0.862743	0.135729	-0.015516
4	-0.281946	0.613274	-0.493153	-0.020157	0.993048	-1.123736	-1.008929	1.614326	1.222738	-1.881576	...	1.214585	0.632505	-0.538091	-0.170886	-0.029907	-0.181666	-0.211985	1.375032	-1.161015	1.516484

5 rows × 88 columns
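As a quick illustration of what StandardScaler does, the sketch below standardizes a small made-up array (the values are hypothetical, not taken from the dataset) and compares the result with the manual formula z = (x - mean) / std, assuming the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std()
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the one-hot city columns in the output above are no longer 0/1.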

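Feature selection

By IV value

IV stands for Information Value. Only the calculation of IV is covered here; see the references for details.

First compute the WOE (Weight of Evidence) of each group $i$:
$WOE_i = \ln(\frac{p(y_i)}{p(n_i)})= \ln(\frac{y_i/y_T}{n_i/n_T})$
where $p(y_i)$ is the share of all overdue customers (status=1) in the sample that fall into this group, and $p(n_i)$ is the share of all non-overdue customers (status=0) that fall into this group. $y_i$ is the number of overdue customers in the group, $y_T$ is the total number of overdue customers, $n_i$ is the number of non-overdue customers in the group, and $n_T$ is the total number of non-overdue customers.

The IV of each group is then:
$IV_i = (p(y_i)-p(n_i))\times WOE_i = (y_i/y_T - n_i/n_T)\times \ln(\frac{y_i/y_T}{n_i/n_T})$
and the IV of the feature is the sum over all of its groups, $IV = \sum_i IV_i$.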
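The two formulas can be traced on a small example. The counts below are hypothetical (three bins, invented for illustration); the sketch computes WOE and IV for each bin and sums the per-bin IV to get the IV of the feature.

```python
import numpy as np

# Hypothetical counts per bin
overdue = np.array([10, 30, 60])          # y_i, customers with status=1
not_overdue = np.array([190, 170, 140])   # n_i, customers with status=0

p_y = overdue / overdue.sum()             # y_i / y_T
p_n = not_overdue / not_overdue.sum()     # n_i / n_T

woe = np.log(p_y / p_n)                   # WOE_i
iv_per_bin = (p_y - p_n) * woe            # IV_i
iv_total = iv_per_bin.sum()               # IV of the whole feature

print(woe)
print(iv_per_bin)
print(iv_total)
```

Each IV_i is non-negative because (p(y_i) - p(n_i)) and WOE_i always have the same sign, so bins can only add to the total.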

The IV value of a feature indicates its predictive power, as shown in the table below.

IV	Predictive power
<0.03	None
0.03~0.09	Low
0.1~0.29	Medium
0.3~0.49	High
>=0.5	Very high
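For convenience, the table can be wrapped in a small helper. This is only a sketch: `iv_strength` is a hypothetical name, and values falling in the gaps of the table (for example between 0.09 and 0.1) are assumed to belong to the lower band.

```python
def iv_strength(iv):
    """Map an IV value to a predictive-power label using the table's boundaries."""
    if iv < 0.03:
        return 'none'
    if iv < 0.1:
        return 'low'
    if iv < 0.3:
        return 'medium'
    if iv < 0.5:
        return 'high'
    return 'very high'

# Illustrative values only
for value in [0.01, 0.07, 0.27, 0.38, 0.92]:
    print(value, iv_strength(value))
```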

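Binning

Before computing IV, the data must first be binned. Binning methods are either supervised (chi-square, minimum entropy) or unsupervised (equal-width, equal-frequency, clustering). We use chi-square binning here; the other methods are described in the references.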
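For comparison with the supervised approach used in this post, here is a minimal sketch of the two simplest unsupervised methods, using pandas `cut` (equal-width) and `qcut` (equal-frequency). The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical, strongly skewed feature values
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100])

# Equal-width: every bin spans the same range of values
equal_width = pd.cut(values, bins=4)
# Equal-frequency: every bin holds roughly the same number of samples
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

On skewed data, equal-width binning puts almost every sample into the first bin, while equal-frequency binning spreads them evenly. Neither looks at the label, which is what supervised methods such as chi-square binning add.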

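Initialization stage

Sort the instances by attribute value; each instance (in the code below, each distinct value) starts in its own group.

Merge stage

(1) Compute the chi-square value of every pair of adjacent groups:
$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(A_{ij}-E_{ij})^2}{E_{ij}}$
where $A_{ij}$ is the number of instances of class $j$ in group $i$, and $E_{ij}$ is the expected frequency of $A_{ij}$, i.e. $E_{ij} = \frac{R_i C_j}{N}$, with $R_i$ the number of instances in group $i$, $C_j$ the number of instances of class $j$ across the two groups, and $N$ their total.

(2) Merge the pair of adjacent groups with the smallest chi-square value.

(3) Repeat (1) and (2) until every chi-square value is at least the preset threshold, or the number of groups satisfies a given condition (for example, at least 5 and at most 8 groups).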
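To make step (1) concrete, the sketch below computes the chi-square value of one pair of adjacent groups by hand and compares it with `scipy.stats.chi2_contingency`. The counts are hypothetical, and the expected frequency is taken as row total times column total divided by the grand total.

```python
import numpy as np
from scipy import stats

# Hypothetical class counts: rows are two adjacent groups, columns are the two classes
observed = np.array([[20, 5],
                     [15, 10]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()   # E_ij

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' continuity correction so the values match
chi2_scipy = stats.chi2_contingency(observed, correction=False)[0]

# Critical value at the 95% confidence level with 1 degree of freedom
threshold = stats.chi2.isf(0.05, df=1)

print(chi2_manual, chi2_scipy, threshold)
print('merge' if chi2_manual < threshold else 'keep separate')
```

A small chi-square value means the two groups have similar class distributions, so merging them loses little information.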

(The chiMerge function below is adapted from reference 3, with modifications.)

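def chiMerge(df, col, target, threshold=None):
    ''' Chi-square binning (ChiMerge).
    df: pandas DataFrame
    col: name of the (numeric) variable to bin
    target: name of the class-label column
    threshold: chi-square threshold; if None, it is taken from the chi-square
               distribution at the 95% confidence level.
    return: array of bin edges (the starting value of each group plus an upper bound).
    '''
    freq_tab = pd.crosstab(df[col], df[target])
    freq = freq_tab.values.copy()  # NumPy copy that can safely be modified in place
    # 1. Initialization: sort by attribute value; each distinct value is one group.
    # Append a number larger than the maximum so the last bin covers every sample.
    cutoffs = np.append(freq_tab.index.values, max(freq_tab.index.values) + 1)
    if threshold is None:
        # No threshold given: use the 95% confidence level
        # with (number of classes - 1) degrees of freedom.
        cls_num = freq.shape[-1]
        threshold = stats.chi2.isf(0.05, df=cls_num - 1)
    # 2. Merge stage
    while True:
        minvalue = np.inf
        minidx = np.inf
        # Compute the chi-square value of every pair of adjacent groups
        # (adding 1 to every cell avoids zero expected frequencies)
        for i in range(len(freq) - 1):
            v = stats.chi2_contingency(freq[i:i+2] + 1, correction=False)[0]
            # Track the smallest value and its position
            if minvalue > v:
                minvalue = v
                minidx = i
        # If the smallest chi-square value is below the threshold,
        # merge that pair of adjacent groups and continue
        if minvalue < threshold:
            freq[minidx] += freq[minidx+1]
            freq = np.delete(freq, minidx+1, 0)
            cutoffs = np.delete(cutoffs, minidx+1, 0)
        else:
            break

    return cutoffs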

Computing the IV value

def iv_value(df, col, target):
    ''' Compute the IV value of a single feature.
    df: pandas DataFrame
    col: name of the (numeric) variable
    target: name of the label column
    return: the IV value of the feature
    '''
    bins = chiMerge(df, col, target)  # bin edges
    cats = pd.cut(df[col], bins, right=False)
    # Add 1 to both numerator and denominator to avoid division by zero
    temp = (pd.crosstab(cats, df[target]) + 1) / (df[target].value_counts() + 1)
    woe = np.log(temp.iloc[:, 1] / temp.iloc[:, 0])
    iv = sum((temp.iloc[:, 1] - temp.iloc[:, 0]) * woe)

    return iv

Compute the IV value of every feature:

iv = []
data_iv = pd.concat([data_scaled, label], axis=1)

for col in data_scaled.columns:
    iv.append(iv_value(data_iv, col, 'status'))

Save the IV values to disk, reload them, and display them (in column order, not sorted):

iv = np.array(iv)
np.save('iv', iv)
iv = np.load('iv.npy')
iv

array([0.02968667, 0.06475457, 0.06981247, 0.27089581, 0.03955683,
       0.13346826, 0.00854632, 0.03929596, 0.04422897, 0.00559611,
       0.53421682, 0.        , 0.03166467, 0.38242452, 0.92400898,
       0.18871897, 0.11657733, 0.79563374, 0.        , 0.36688692,
       0.06479698, 0.08637859, 0.0315798 , 0.08726314, 0.02813494,
       0.07862981, 0.02872391, 0.00936212, 0.59139039, 0.25168984,
       0.25886425, 0.42645628, 0.32054195, 0.01342581, 0.00419829,
       0.23346355, 0.57449389, 0.        , 0.37383946, 0.14084117,
       0.50192192, 0.01717901, 0.        , 0.00990202, 0.02356634,
       0.02668144, 0.03360329, 0.02932465, 0.00517526, 0.66353628,
       0.        , 0.05768091, 0.03631875, 0.40640499, 0.01445641,
       0.00671275, 0.01300546, 0.00552671, 0.03980268, 0.03645762,
       0.0140021 , 0.65682529, 0.15289713, 0.37204304, 0.05508829,
       0.0192688 , 0.01318021, 0.01300546, 0.01037065, 0.01728017,
       0.25268217, 0.15254589, 0.00475146, 0.00671275, 0.01011964,
       0.03126195, 0.50228468, 0.11432889, 0.07337619, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03444958,
       0.00903816, 0.01497038, 0.        ])

Random forest

n_estimators : integer, optional (default=10)

n_estimators is the number of trees in the forest, i.e. the maximum number of weak learners.

Coarse tuning of n_estimators:

param = {'n_estimators': list(range(10, 1001, 50))}
g = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
g.fit(data_scaled, label)
g.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

Fine tuning of n_estimators:

param = {'n_estimators': list(range(770, 870, 10))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
forest_grid.fit(data, label)
rnd_clf = forest_grid.best_estimator_
rnd_clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

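Combined analysis

Combine the IV values with the random forest feature importances (the column names 随机森林 and IV值 in the code mean "random forest" and "IV value"):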

feature_df = pd.DataFrame(np.c_[rnd_clf.feature_importances_, iv.T], 
                          index=data.columns, columns=['随机森林', 'IV值'])
feature_df.head()

	随机森林	IV值
low_volume_percent	0.007025	0.029687
middle_volume_percent	0.009346	0.064755
take_amount_in_later_12_month_highest	0.009766	0.069812
trans_amount_increase_rate_lately	0.014802	0.270896
trans_activity_month	0.010418	0.039557

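Plot the two as curves for comparison, with the features sorted in descending order first by IV value and then by random forest score: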

feature_df.sort_values(by='IV值', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001F077ACDE80>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001F077B4E898>],
      dtype=object)

[Figure: random forest importance and IV value per feature, sorted by descending IV value]

feature_df.sort_values(by='随机森林', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001F077975B38>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001F0779A1BE0>],
      dtype=object)

[Figure: random forest importance and IV value per feature, sorted by descending random forest importance]

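The plots above show that, despite some fluctuation, the feature importance curves from the random forest and from the IV values follow broadly the same trend. Based on the IV-to-predictive-power table given earlier, and on the sharp drop in random forest importance within the first 10 ranked features, we keep the features that are in the top 15 by random forest score and also have an IV value of at least 0.3.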

rnf_sorted = feature_df.sort_values(by='随机森林', ascending=False).iloc[:15, 0]
iv_sorted = feature_df[feature_df['IV值'] >= 0.3]['IV值']
index = pd.DataFrame([rnf_sorted, iv_sorted]).dropna(axis=1).columns
index

Index(['abs', 'apply_score', 'consfin_avg_limit', 'historical_trans_amount',
       'history_fail_fee', 'latest_one_month_fail', 'loans_overdue_count',
       'loans_score', 'max_cumulative_consume_later_1_month',
       'repayment_capability', 'trans_amount_3_month',
       'trans_fail_top_count_enum_last_1_month'],
      dtype='object')
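The DataFrame-and-dropna step above is effectively an index intersection. A more direct way to express the same idea, sketched on hypothetical scores (feature names and values are made up), is `Index.intersection`:

```python
import pandas as pd

# Hypothetical scores for five features
scores = pd.DataFrame(
    {'importance': [0.30, 0.25, 0.20, 0.15, 0.10],
     'iv': [0.45, 0.05, 0.60, 0.35, 0.02]},
    index=['f1', 'f2', 'f3', 'f4', 'f5'])

top_k = scores['importance'].nlargest(3).index   # top 3 by importance
strong_iv = scores.index[scores['iv'] >= 0.3]    # IV of at least 0.3

# Keep only the features that satisfy both conditions
selected = top_k.intersection(strong_iv)
print(list(selected))
```

Only features that appear in both index objects survive, which matches the "top 15 and IV >= 0.3" rule used in this post.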

After filtering, 12 features remain. Select the corresponding columns:

data_del = data_scaled[index]
data_del.head()

	abs	apply_score	consfin_avg_limit	historical_trans_amount	history_fail_fee	latest_one_month_fail	loans_overdue_count	loans_score	max_cumulative_consume_later_1_month	repayment_capability	trans_amount_3_month	trans_fail_top_count_enum_last_1_month
0	-0.200665	0.124820	-1.201348	-0.255030	-0.427773	-0.337569	-0.098210	0.144596	-0.067183	0.020868	-0.049208	-0.346369
1	-0.090524	1.497024	0.238640	0.215237	-0.547614	-0.080162	-0.733973	1.509325	-0.073494	-0.034355	-0.274805	-0.868380
2	-0.312623	1.516627	-0.671941	-0.675385	-0.627508	-0.080162	-0.733973	1.476440	-0.262821	-0.171658	-0.321773	0.697653
3	1.359842	0.360055	0.736282	0.790524	0.331218	-0.337569	0.537552	-0.019829	0.471049	-0.237850	0.505738	-0.346369
4	-0.315531	-0.698503	0.042759	-0.522714	0.291271	-0.337569	1.173315	-1.055708	-0.172665	-0.144424	-0.282697	0.697653

Building the models

Use sklearn to split the dataset into a training set and a test set at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(data_del, label, test_size=0.3, random_state=2018)

Check the sizes of the training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]

[(3133, 12), (3133,), (1343, 12), (1343,)]
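The sizes follow from the 4476 rows in total (3133 + 1343): with test_size=0.3 the test set gets the ceiling of 0.3 × 4476 = 1342.8, i.e. 1343 rows, and the training set gets the rest. A minimal check on placeholder arrays (zeros standing in for the real data) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 3133 + 1343   # total number of rows implied by the shapes above
X_demo = np.zeros((n_samples, 12))   # placeholder features
y_demo = np.zeros(n_samples)         # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018)

print(X_tr.shape, X_te.shape)
```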

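Model tuning

Build and train the seven models used in this task: four ensemble models (XGBoost, LightGBM, GBDT, random forest) and three non-ensemble models (logistic regression, SVM, decision tree). All seven are tuned with grid search, using five-fold cross-validation.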
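As a minimal sketch of the pattern used for every model below, here is a grid search with five-fold cross-validation on synthetic data (generated with `make_classification`, not the loan dataset). The parameter grid is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     random_state=2018)

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={'C': [0.01, 0.1, 1.0]},
                         cv=5)   # five-fold cross-validation
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate
print(demo_grid.best_params_, demo_grid.best_score_)
# cv_results_ holds one mean score per candidate
print(demo_grid.cv_results_['mean_test_score'])
```

Note that `best_score_` is a cross-validation average on the training data, so it is not directly comparable to a score on a held-out test set.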

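Logistic regression

Selected parameters:

C : float, default: 1.0

The smaller C is, the stronger the regularization. C must be a positive float.

class_weight : dict or 'balanced', default: None

Weights for each class; if not given, all classes have the same weight. The "balanced" mode uses the values of y to automatically set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).

solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default: 'liblinear'.

The optimization algorithm. For small datasets 'liblinear' is a good choice, while 'sag' and 'saga' train faster on large ones. For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle the multinomial loss; 'liblinear' is limited to one-versus-rest training.

'newton-cg', 'lbfgs' and 'sag' support only the l2 penalty, whereas 'liblinear' and 'saga' also support the l1 penalty.

Note that the fast convergence of 'sag' and 'saga' is only guaranteed on features with approximately the same scale. The data can be preprocessed with a scaler from sklearn.preprocessing.

max_iter : int, default: 100

Maximum number of iterations.

View the best parameters and score: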
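The "balanced" formula quoted for class_weight can be checked on a toy label vector. The labels below are hypothetical; the sketch compares the manual formula with `sklearn.utils.class_weight.compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 non-overdue (0) and 2 overdue (1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

n_samples, n_classes = len(y_demo), 2
# n_samples / (n_classes * np.bincount(y))
manual_weights = n_samples / (n_classes * np.bincount(y_demo))

sk_weights = compute_class_weight(class_weight='balanced',
                                  classes=np.array([0, 1]), y=y_demo)

print(manual_weights, sk_weights)
```

The rarer class receives the larger weight, so errors on overdue customers would cost more during training.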

param_grid = {
    'C': np.arange(0.01, 0.1, 0.01),
    'solver': ['liblinear', 'lbfgs'],
    'class_weight': ['balanced', None]
}
log_grid = GridSearchCV(LogisticRegression(random_state=2018, max_iter=1000), 
                        param_grid, cv=5)
log_grid.fit(X_train, y_train)
log_grid.best_estimator_, log_grid.best_score_

(LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=2018, solver='lbfgs', tol=0.0001,
           verbose=0, warm_start=False), 0.7874241940631982)

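SVM

C: float, optional (default=1.0)

The penalty coefficient C of the objective function, which balances the classification margin against the penalty for misclassified samples.

kernel: string, optional (default='rbf')

Available choices include 'rbf', 'linear', 'poly' and 'sigmoid'.

gamma: float, optional (default='auto')

Kernel coefficient for 'rbf', 'poly' and 'sigmoid'; with 'auto', gamma = 1 / n_features.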
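To show where gamma enters, the sketch below evaluates the RBF kernel K(a, b) = exp(-gamma * ||a - b||^2) by hand with gamma = 1 / n_features and compares it with `sklearn.metrics.pairwise.rbf_kernel`. The two sample vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical samples with 4 features each
a = np.array([[1.0, 0.0, 2.0, 1.0]])
b = np.array([[0.0, 1.0, 2.0, 3.0]])

gamma = 1.0 / a.shape[1]   # the 'auto' setting: 1 / n_features
manual = np.exp(-gamma * np.sum((a - b) ** 2))

print(manual, rbf_kernel(a, b, gamma=gamma)[0, 0])
```

A larger gamma makes the kernel value fall off faster with distance, so each training sample influences a smaller neighbourhood.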

param_grid = {
    'C': np.arange(0.1, 5.2, 0.5),
    'gamma': ['auto', 0.01, 0.5],
}

svc_grid = GridSearchCV(SVC(random_state=2018, probability=True), param_grid, cv=5)
svc_grid.fit(X_train, y_train)
svc_grid.best_estimator_, svc_grid.best_score_

(SVC(C=2.1, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=True, random_state=2018, shrinking=True,
   tol=0.001, verbose=False), 0.7871050111714012)

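Decision tree

max_depth : int or None, optional (default=None)

Maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

max_features : int, float, string or None, optional (default=None)

Number of features considered when looking for the best split. With 'auto', max_features=sqrt(n_features).

class_weight : dict, list of dicts, "balanced" or None, default=None

Weights for each class; if not given, all classes have the same weight. The "balanced" mode uses the values of y to automatically set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).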

param_grid = {
    'max_depth': range(2, 8, 1),
    'min_samples_split': range(2, 11, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018), param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_

(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=2018,
             splitter='best'), 0.7749760612831152)

param_grid = {
    'min_samples_leaf': range(26, 35, 2),
    'max_features': range(2, 10, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018, max_depth=4,
                                                min_samples_split=2), 
                         param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_

(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=7, max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=30,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=2018, splitter='best'),
 0.7794446217682732)

Random forest

param = {'n_estimators': list(range(10, 1001, 50))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)

forest_grid = forest_grid.best_estimator_

param = {
    'n_estimators': list(range(forest_grid.n_estimators - 40, 
                                    forest_grid.n_estimators + 50, 10))
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)

forest_grid = forest_grid.best_estimator_

param = {
    'max_depth': range(3, 15, 2),
    'min_samples_split': range(2, 53, 10)
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, 
                                                              n_estimators=forest_grid.n_estimators),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)

param = {'min_samples_leaf': range(1, 10, 2)}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, max_features='auto',
                                                              max_depth=9,
                                                              n_estimators=860, 
                                                              min_samples_split=30),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)

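GBDT

n_estimators : int (default=100)

The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a larger value usually performs better.

learning_rate : float, optional (default=0.1)

learning_rate shrinks the contribution of each tree; there is a trade-off between learning_rate and n_estimators.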

param_grid = {
    'n_estimators': range(80, 150, 10),
    'learning_rate': [0.02, 0.01, 0.04],
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018), param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_

(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7864666453878072)

gbdt = gbdt.best_estimator_

param_grid = {
    'max_depth': range(3, 12, 2),
    'min_samples_split': range(20, 41, 5)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_

(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)

gbdt = gbdt.best_estimator_

param_grid = {
    'min_samples_leaf': range(1, 10, 2)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate, 
                                               max_depth=gbdt.max_depth,
                                               min_samples_split=gbdt.min_samples_split), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_

(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)

XGBoost

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
xgb = GridSearchCV(XGBClassifier(random_state=2018), param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_

(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=90,
        n_jobs=1, nthread=None, objective='binary:logistic',
        random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=None, silent=True, subsample=1), 0.7867858282796042)

xgb = xgb.best_estimator_

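# Note: min_samples_split is a scikit-learn tree parameter, not an XGBoost one
# (the closest XGBoost analogue is min_child_weight), so it most likely has no
# effect on the fit; the best score below is unchanged from the previous step.
param_grid = {
    'max_depth': range(1, 4, 1),
    'min_samples_split': range(1, 22, 5)
}
xgb = GridSearchCV(XGBClassifier(random_state=2018,
                                 n_estimators=xgb.n_estimators,
                                 learning_rate=xgb.learning_rate),
                   param_grid, cv=5)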
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_

(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, min_samples_split=1, missing=None,
        n_estimators=90, n_jobs=1, nthread=None,
        objective='binary:logistic', random_state=2018, reg_alpha=0,
        reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
        subsample=1), 0.7867858282796042)

LightGBM

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018), param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_

(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=-1,
         min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
         n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
         random_state=2018, reg_alpha=0.0, reg_lambda=0.0, silent=True,
         subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
 0.7826364506862432)

lgbm = lgbm.best_estimator_

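# Note: min_samples_split is not a LightGBM parameter (the closest analogue is
# min_child_samples), so any change in score most likely comes from max_depth.
param_grid = {
    'max_depth': range(1, 10, 2),
    'min_samples_split': range(10, 31, 5)
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018,
                                   n_estimators=lgbm.n_estimators,
                                   learning_rate=lgbm.learning_rate),
                    param_grid, cv=5)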
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_

(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=5,
         min_child_samples=20, min_child_weight=0.001, min_samples_split=10,
         min_split_gain=0.0, n_estimators=90, n_jobs=-1, num_leaves=31,
         objective=None, random_state=2018, reg_alpha=0.0, reg_lambda=0.0,
         silent=True, subsample=1.0, subsample_for_bin=200000,
         subsample_freq=0), 0.7832748164698372)

Model evaluation

Evaluate the seven models:

models = {'Random forest': forest_grid.best_estimator_,
          'GBDT': gbdt.best_estimator_,
          'XGBoost': xgb.best_estimator_,
          'LightGBM': lgbm.best_estimator_,
          'Logistic regression': log_grid.best_estimator_,
          'SVM': svc_grid.best_estimator_,
          'Decision tree': tree_grid.best_estimator_}

assessments = {
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1-score': [],
    'AUC': []
}

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.tight_layout()

for name, model in models.items():
    test_pre = model.predict(X_test)
    train_pre = model.predict(X_train)
    test_proba = model.predict_proba(X_test)[:,1]
    train_proba = model.predict_proba(X_train)[:,1]
    
    # sklearn metrics take (y_true, y_pred) in that order; reversing them
    # would swap precision and recall
    acc_test = accuracy_score(y_test, test_pre) * 100
    acc_train = accuracy_score(y_train, train_pre) * 100
    accuracy = 'train: %.2f%%; test: %.2f%%' % (acc_train, acc_test)
    assessments['Accuracy'].append(accuracy)
    
    pre_test = precision_score(y_test, test_pre) * 100
    pre_train = precision_score(y_train, train_pre) * 100
    precision = 'train: %.2f%%; test: %.2f%%' % (pre_train, pre_test)
    assessments['Precision'].append(precision)
    
    rec_test = recall_score(y_test, test_pre) * 100
    rec_train = recall_score(y_train, train_pre) * 100
    recall = 'train: %.2f%%; test: %.2f%%' % (rec_train, rec_test)
    assessments['Recall'].append(recall)
    
    f1_test = f1_score(y_test, test_pre) * 100
    f1_train = f1_score(y_train, train_pre) * 100
    f1 = 'train: %.2f%%; test: %.2f%%' % (f1_train, f1_test)
    assessments['F1-score'].append(f1)
    
    # One figure per model with the test and training ROC curves
    fig = plt.figure(figsize=(8, 6))
    fpr, tpr, thresholds = roc_curve(y_test, test_proba)
    plot_roc_curve(fpr, tpr, label='test set')
    fpr, tpr, thresholds = roc_curve(y_train, train_proba)
    plot_roc_curve(fpr, tpr, label='training set')
    plt.title(name)
    
    auc_test = roc_auc_score(y_test, test_proba) * 100
    auc_train = roc_auc_score(y_train, train_proba) * 100
    auc = 'train: %.2f%%; test: %.2f%%' % (auc_train, auc_test)
    assessments['AUC'].append(auc)

# ROC curves of all models on the test set
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    plot_roc_curve(fpr, tpr, label=name)

# ROC curves of all models on the training set
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_train)[:,1]
    fpr, tpr, thresholds = roc_curve(y_train, proba)
    plot_roc_curve(fpr, tpr, label=name)

ass_df = pd.DataFrame(assessments, index=models.keys())
ass_df

	AUC	Accuracy	F1-score	Precision	Recall
Random forest	train: 90.82%; test: 79.88%	train: 84.33%; test: 79.60%	train: 58.07%; test: 46.69%	train: 86.96%; test: 71.43%	train: 43.59%; test: 34.68%
GBDT	train: 87.87%; test: 79.15%	train: 84.26%; test: 78.78%	train: 57.17%; test: 44.01%	train: 88.68%; test: 68.71%	train: 42.18%; test: 32.37%
XGBoost	train: 90.41%; test: 79.28%	train: 85.06%; test: 79.23%	train: 63.03%; test: 49.18%	train: 82.10%; test: 66.50%	train: 51.15%; test: 39.02%
LightGBM	train: 86.70%; test: 79.53%	train: 82.41%; test: 78.93%	train: 49.77%; test: 41.41%	train: 86.12%; test: 72.99%	train: 35.00%; test: 28.90%
Logistic regression	train: 76.33%; test: 78.34%	train: 78.77%; test: 78.70%	train: 37.68%; test: 38.89%	train: 70.03%; test: 74.59%	train: 25.77%; test: 26.30%
SVM	train: 80.23%; test: 74.26%	train: 80.82%; test: 77.96%	train: 43.14%; test: 34.80%	train: 82.31%; test: 73.15%	train: 29.23%; test: 22.83%
Decision tree	train: 76.63%; test: 74.19%	train: 79.29%; test: 77.14%	train: 46.41%; test: 43.46%	train: 65.20%; test: 59.90%	train: 36.03%; test: 34.10%
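scikit-learn's metric functions expect the true labels as the first argument and the predictions as the second. As a small sanity check on hypothetical labels (made up for illustration), the sketch below derives precision, recall and F1 from the confusion matrix and shows that reversing the argument order swaps precision and recall while leaving F1 unchanged.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical labels and predictions (1 = overdue)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))

# With the arguments reversed, precision and recall trade places
print(precision_score(y_pred, y_true), recall_score(y_pred, y_true))
```

On an imbalanced problem like this one, recall on the overdue class is usually the number to watch, since a model can reach high accuracy while missing most overdue customers.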