【One-Week Algorithm Practice: Advanced】Task 2: Feature Engineering

Import the packages used in this task:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
                            confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'

Preparing the Data

Importing the data

Original dataset download link: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw

Note: this is a financial dataset (not raw data; it has already been processed). The goal is to predict whether a loan customer will become overdue. The "status" column is the target label: 0 means not overdue, 1 means overdue.

Here we load the dataset that was already cleaned in the previous post (【One-Week Algorithm Practice: Advanced】Task 1: Data Preprocessing):

data_processed = pd.read_csv('data_processed.csv')
data_processed.head()
(Output: preview of the first 5 rows of data_processed; 5 rows × 89 columns, truncated.)

Splitting the data

Split the original data into features and label:

label = data_processed['status']
data = data_processed.drop(['status'], axis=1)

Standardization

scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
(Output: preview of the first 5 rows of data_scaled; 5 rows × 88 standardized columns, truncated.)

Feature Selection

Based on IV values

IV stands for Information Value. Only the computation of IV is described here; see the references for more background.

First compute the WOE (Weight of Evidence) of each bin:

$$WOE_i = \ln\left(\frac{p(y_i)}{p(n_i)}\right) = \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)$$

where $p(y_i)$ is the proportion of all overdue customers (status = 1) that fall into bin $i$, and $p(n_i)$ is the proportion of all non-overdue customers (status = 0) that fall into bin $i$; $y_i$ is the number of overdue customers in bin $i$, $y_T$ the total number of overdue customers in the sample, $n_i$ the number of non-overdue customers in bin $i$, and $n_T$ the total number of non-overdue customers in the sample.

The per-bin IV follows:

$$IV_i = (p(y_i)-p(n_i))\times WOE_i = (y_i/y_T - n_i/n_T)\times \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)$$

and the IV of a feature is the sum of $IV_i$ over all of its bins.

A feature's IV value reflects its predictive power, as summarized in the table below.

IV value       Predictive power
< 0.03         almost none
0.03 ~ 0.09    low
0.1 ~ 0.29     medium
0.3 ~ 0.49     high
>= 0.5         very high
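
As a toy illustration of these formulas (the bin counts below are hypothetical and not taken from this dataset), the per-bin WOE and the feature IV can be computed directly:

# Hypothetical counts for one feature split into 3 bins
y = np.array([30, 50, 20])      # overdue customers (status = 1) per bin
n = np.array([200, 100, 50])    # non-overdue customers (status = 0) per bin

p_y = y / y.sum()               # p(y_i): share of all overdue customers in each bin
p_n = n / n.sum()               # p(n_i): share of all non-overdue customers in each bin

woe = np.log(p_y / p_n)         # WOE_i per bin
iv = ((p_y - p_n) * woe).sum()  # feature IV = sum of per-bin IV_i
woe, iv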

Binning

Before IV values can be computed, the data must first be binned. Binning methods fall into supervised approaches (chi-square, minimum entropy) and unsupervised ones (equal width, equal frequency, clustering). We use chi-square binning (ChiMerge); the other methods are described in the references.
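
For contrast with ChiMerge, here is a quick sketch of the two simplest unsupervised alternatives using pandas (the column and the bin count of 5 are arbitrary choices for illustration):

# Unsupervised binning of a single numeric column (illustration only)
n_bins = 5
equal_width = pd.cut(data['historical_trans_amount'], bins=n_bins)                  # equal-width bins
equal_freq = pd.qcut(data['historical_trans_amount'], q=n_bins, duplicates='drop')  # equal-frequency bins
equal_width.value_counts().sort_index(), equal_freq.value_counts().sort_index()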

  1. Initialization phase

First sort the instances by the attribute value; each distinct value starts as its own group.

  2. Merge phase

(1) Compute the chi-square statistic of every pair of adjacent groups:

$$\chi^2 = \sum^{2}_{i=1}\sum^{2}_{j=1}\frac{(A_{ij}-E_{ij})^2}{E_{ij}}$$

where $A_{ij}$ is the number of instances of class $j$ in group $i$, and $E_{ij}$ is the expected frequency of $A_{ij}$.

(2) Merge the pair of adjacent groups with the smallest chi-square value.

(3) Repeat (1) and (2) until every computed chi-square value is at or above a preset threshold, or until the number of groups satisfies a stopping condition (e.g. at least 5 and at most 8 groups).

(The chiMerge function below is adapted from reference [3], with modifications.)

def chiMerge(df, col, target, threshold=None):
    ''' Chi-square (ChiMerge) binning.
    df: pandas DataFrame containing the data
    col: name of the (numeric) column to bin
    target: name of the class label column
    threshold: chi-square threshold; if not given, it is set from the 95% confidence
               level with (number of classes - 1) degrees of freedom
    return: array of bin boundaries (the left edge of each bin, plus one value just
            above the maximum), suitable for pd.cut(..., right=False).
    '''
    freq_tab = pd.crosstab(df[col], df[target])
    freq = freq_tab.values  # convert to a numpy array for the computations below
    # 1. Initialization: the crosstab index sorts the instances by attribute value,
    #    and every distinct value starts as its own group.
    # Append a value larger than the maximum so the final bins cover all sample values.
    cutoffs = np.append(freq_tab.index.values, max(freq_tab.index.values) + 1)
    if threshold is None:
        # No chi-square threshold given: use the 95% confidence level
        # with (number of classes - 1) degrees of freedom.
        cls_num = freq.shape[-1]
        threshold = stats.chi2.isf(0.05, df=cls_num - 1)
    # 2. Merge phase
    while True:
        minvalue = np.inf
        minidx = np.inf
        # compute the chi-square statistic of every pair of adjacent groups
        for i in range(len(freq) - 1):
            # adding 1 to the counts avoids zero-frequency errors in chi2_contingency
            v = stats.chi2_contingency(freq[i:i+2] + 1, correction=False)[0]
            # keep track of the smallest value
            if minvalue > v:
                minvalue = v
                minidx = i
        # if the smallest chi-square value is below the threshold,
        # merge that pair of adjacent groups and continue
        if minvalue < threshold:
            freq[minidx] += freq[minidx+1]
            freq = np.delete(freq, minidx+1, 0)
            cutoffs = np.delete(cutoffs, minidx+1, 0)
        else:
            break

    return cutoffs
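
A usage sketch for a single column (regional_mobility is an arbitrary low-cardinality feature chosen only for illustration):

# Chi-square bin boundaries for one feature; data_processed still contains the status column
chiMerge(data_processed, 'regional_mobility', 'status')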

Computing IV values

def iv_value(df, col, target):
    ''' Compute the IV value of a single feature.
    df: pandas DataFrame containing the data
    col: name of the (numeric) column to evaluate
    target: name of the label column
    return: the IV value of the feature
    '''
    bins = chiMerge(df, col, target)  # bin boundaries from chi-square binning
    cats = pd.cut(df[col], bins, right=False)
    # add 1 to numerator and denominator to avoid division by zero
    temp = (pd.crosstab(cats, df[target]) + 1) / (df[target].value_counts() + 1)
    woe = np.log(temp.iloc[:, 1] / temp.iloc[:, 0])
    iv = sum((temp.iloc[:, 1] - temp.iloc[:, 0]) * woe)
    
    return iv

Compute the IV value of every feature:

iv = []
data_iv = pd.concat([data_scaled, label], axis=1)

for col in data_scaled.columns:
    iv.append(iv_value(data_iv, col, 'status'))

Save the IV values to disk and display them:

iv = np.array(iv)
np.save('iv', iv)
iv = np.load('iv.npy')
iv
array([0.02968667, 0.06475457, 0.06981247, 0.27089581, 0.03955683,
       0.13346826, 0.00854632, 0.03929596, 0.04422897, 0.00559611,
       0.53421682, 0.        , 0.03166467, 0.38242452, 0.92400898,
       0.18871897, 0.11657733, 0.79563374, 0.        , 0.36688692,
       0.06479698, 0.08637859, 0.0315798 , 0.08726314, 0.02813494,
       0.07862981, 0.02872391, 0.00936212, 0.59139039, 0.25168984,
       0.25886425, 0.42645628, 0.32054195, 0.01342581, 0.00419829,
       0.23346355, 0.57449389, 0.        , 0.37383946, 0.14084117,
       0.50192192, 0.01717901, 0.        , 0.00990202, 0.02356634,
       0.02668144, 0.03360329, 0.02932465, 0.00517526, 0.66353628,
       0.        , 0.05768091, 0.03631875, 0.40640499, 0.01445641,
       0.00671275, 0.01300546, 0.00552671, 0.03980268, 0.03645762,
       0.0140021 , 0.65682529, 0.15289713, 0.37204304, 0.05508829,
       0.0192688 , 0.01318021, 0.01300546, 0.01037065, 0.01728017,
       0.25268217, 0.15254589, 0.00475146, 0.00671275, 0.01011964,
       0.03126195, 0.50228468, 0.11432889, 0.07337619, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03444958,
       0.00903816, 0.01497038, 0.        ])
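
The array above is in the original column order; to view the features ranked by IV value in descending order, a small sketch:

iv_series = pd.Series(iv, index=data_scaled.columns, name='IV')
iv_series.sort_values(ascending=False).head(10)   # ten highest-IV features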

Based on random forest importance

  • n_estimators : integer, optional (default=10)

The maximum number of weak learners, i.e. the number of trees in the forest.

A coarse grid search over n_estimators:

param = {'n_estimators': list(range(10, 1001, 50))}
g = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
g.fit(data_scaled, label)
g.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

A finer grid search over n_estimators:

param = {'n_estimators': list(range(770, 870, 10))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                       param_grid=param, cv=5)
forest_grid.fit(data_scaled, label)
rnd_clf = forest_grid.best_estimator_
rnd_clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
            oob_score=False, random_state=2018, verbose=0,
            warm_start=False)

Combined analysis

Combine the IV values and the random forest feature importances into one table:

feature_df = pd.DataFrame(np.c_[rnd_clf.feature_importances_, iv.T], 
                          index=data.columns, columns=['随机森林', 'IV值'])
feature_df.head()
                                        随机森林       IV值
low_volume_percent                      0.007025  0.029687
middle_volume_percent                   0.009346  0.064755
take_amount_in_later_12_month_highest   0.009766  0.069812
trans_amount_increase_rate_lately       0.014802  0.270896
trans_activity_month                    0.010418  0.039557

Plot the two against each other, once with the features sorted by descending IV value and once by descending random forest importance:

feature_df.sort_values(by='IV值', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)
[Figure: random forest importance and IV value per feature, sorted by descending IV value]

feature_df.sort_values(by='随机森林', ascending=False)\
          .plot(figsize=(8, 6), subplots=True, use_index=False)
[Figure: random forest importance and IV value per feature, sorted by descending random forest importance]

As the two figures show, despite some fluctuation, the feature-importance curves from the random forest and from the IV values follow essentially the same trend. Based on the IV/predictive-power table given earlier, and on the cliff-like drop in the random forest scores within roughly the first ten features, we keep the features that rank in the top 15 by random forest importance and also have an IV value of at least 0.3 as the selected feature set.

rnf_sorted = feature_df.sort_values(by='随机森林', ascending=False).iloc[:15, 0]
iv_sorted = feature_df[feature_df['IV值'] >= 0.3]['IV值']
index = pd.DataFrame([rnf_sorted, iv_sorted]).dropna(axis=1).columns
index
Index(['abs', 'apply_score', 'consfin_avg_limit', 'historical_trans_amount',
       'history_fail_fee', 'latest_one_month_fail', 'loans_overdue_count',
       'loans_score', 'max_cumulative_consume_later_1_month',
       'repayment_capability', 'trans_amount_3_month',
       'trans_fail_top_count_enum_last_1_month'],
      dtype='object')

After filtering, 12 features remain; build the dataset restricted to these features:

data_del = data_scaled[index]
data_del.head()
(Output: preview of the first 5 rows of data_del; 5 rows × 12 columns.)

Model Building

Use sklearn's train_test_split to divide the data into training and test sets at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(data_del, label, test_size=0.3, random_state=2018)

Check the shapes of the training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]

Model Tuning

Build and train the seven models used in this task: four ensemble models (XGBoost, LightGBM, GBDT, random forest) and three non-ensemble models (logistic regression, SVM, decision tree). All seven are tuned with grid search, using 5-fold cross-validation.

Logistic regression

Selected parameters:

  • C : float, default: 1.0

Inverse of regularization strength; smaller values of C mean stronger regularization. C must be a positive float.

  • class_weight : dict or 'balanced', default: None

Weights associated with the classes. If not given, all classes are assumed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).

  • solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default: 'liblinear'

Algorithm to use in the optimization problem. 'liblinear' is a good choice for small datasets, while 'sag' and 'saga' are faster on large ones. For multiclass problems only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle the multinomial loss; 'liblinear' is limited to one-versus-rest schemes.

'newton-cg', 'lbfgs' and 'sag' only support the l2 penalty; 'liblinear' and 'saga' also support the l1 penalty.

Note that fast convergence of 'sag' and 'saga' is only guaranteed on features with approximately the same scale; the data can be preprocessed with a scaler from sklearn.preprocessing.

  • max_iter : int, default: 100

Maximum number of iterations.
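
As a quick illustration of the 'balanced' formula above, computed on the label Series loaded earlier (this assumes status is stored as the integers 0/1):

# class weight = n_samples / (n_classes * np.bincount(y))
counts = np.bincount(label)
weights = len(label) / (len(counts) * counts)
dict(enumerate(weights))   # the minority class (status = 1) receives the larger weight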

View the best parameters and cross-validation score:

param_grid = {
    'C': np.arange(0.01, 0.1, 0.01),
    'solver': ['liblinear', 'lbfgs'],
    'class_weight': ['balanced', None]
}
log_grid = GridSearchCV(LogisticRegression(random_state=2018, max_iter=1000), 
                        param_grid, cv=5)
log_grid.fit(X_train, y_train)
log_grid.best_estimator_, log_grid.best_score_
(LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=2018, solver='lbfgs', tol=0.0001,
           verbose=0, warm_start=False), 0.7874241940631982)

SVM

  • C : float, optional (default=1.0)

Penalty parameter of the error term; it trades off the width of the margin against misclassified samples.

  • kernel : string, optional (default='rbf')

Available choices include 'rbf', 'linear', 'poly' and 'sigmoid'.

  • gamma : float, optional (default='auto')

Kernel coefficient for 'rbf', 'poly' and 'sigmoid'; the default 'auto' uses gamma = 1 / n_features.

param_grid = {
    'C': np.arange(0.1, 5.2, 0.5),
    'gamma': ['auto', 0.01, 0.5],
}

svc_grid = GridSearchCV(SVC(random_state=2018, probability=True), param_grid, cv=5)
svc_grid.fit(X_train, y_train)
svc_grid.best_estimator_, svc_grid.best_score_
(SVC(C=2.1, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=True, random_state=2018, shrinking=True,
   tol=0.001, verbose=False), 0.7871050111714012)

Decision tree

  • max_depth : int or None, optional (default=None)

The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

  • max_features : int, float, string or None, optional (default=None)

The number of features to consider when looking for the best split; 'auto' sets max_features = sqrt(n_features).

  • class_weight : dict, list of dicts, "balanced" or None, default=None

Weights associated with the classes. If not given, all classes are assumed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).

param_grid = {
    'max_depth': range(2, 8, 1),
    'min_samples_split': range(2, 11, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018), param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=2018,
             splitter='best'), 0.7749760612831152)
param_grid = {
    'min_samples_leaf': range(26, 35, 2),
    'max_features': range(2, 10, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018, max_depth=4,
                                                min_samples_split=2), 
                         param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
             max_features=7, max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=30,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=2018, splitter='best'),
 0.7794446217682732)

Random forest

param = {'n_estimators': list(range(10, 1001, 50))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_

param = {
    'n_estimators': list(range(forest_grid.n_estimators - 40, 
                                    forest_grid.n_estimators + 50, 10))
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_

param = {
    'max_depth': range(3, 15, 2),
    'min_samples_split': range(2, 53, 10)
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, 
                                                              n_estimators=forest_grid.n_estimators),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)
param = {'min_samples_leaf': range(1, 10, 2)}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, max_features='auto',
                                                              max_depth=9,
                                                              n_estimators=860, 
                                                              min_samples_split=30),
                           param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=9, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=30,
             min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
             oob_score=False, random_state=2018, verbose=0,
             warm_start=False), 0.7918927545483562)

GBDT

  • n_estimators : integer, optional (default=100)

The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a larger value usually gives better performance.

  • learning_rate : float, optional (default=0.1)

learning_rate shrinks the contribution of each tree; there is a trade-off between learning_rate and n_estimators.

param_grid = {
    'n_estimators': range(80, 150, 10),
    'learning_rate': [0.02, 0.01, 0.04],
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018), param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7864666453878072)
gbdt = gbdt.best_estimator_

param_grid = {
    'max_depth': range(3, 12, 2),
    'min_samples_split': range(20, 41, 5)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)
gbdt = gbdt.best_estimator_

param_grid = {
    'min_samples_leaf': range(1, 10, 2)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018, 
                                               n_estimators=gbdt.n_estimators,
                                               learning_rate=gbdt.learning_rate, 
                                               max_depth=gbdt.max_depth,
                                               min_samples_split=gbdt.min_samples_split), 
                    param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.02, loss='deviance', max_depth=5,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=35,
               min_weight_fraction_leaf=0.0, n_estimators=110,
               presort='auto', random_state=2018, subsample=1.0, verbose=0,
               warm_start=False), 0.7874241940631982)

XGBoost

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
xgb = GridSearchCV(XGBClassifier(random_state=2018), param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, missing=None, n_estimators=90,
        n_jobs=1, nthread=None, objective='binary:logistic',
        random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=None, silent=True, subsample=1), 0.7867858282796042)
xgb = xgb.best_estimator_

param_grid = {
    'max_depth': range(1, 4, 1),
    'min_samples_split': range(1, 22, 5)
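    # Note: min_samples_split is not an XGBClassifier parameter; it is passed through as an
    # extra keyword and does not affect training (the closest XGBoost analogues are
    # min_child_weight and gamma).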
}
xgb = GridSearchCV(XGBClassifier(random_state=2018, 
                                               n_estimators=xgb.n_estimators,
                                               learning_rate=xgb.learning_rate), 
                    param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
        max_depth=3, min_child_weight=1, min_samples_split=1, missing=None,
        n_estimators=90, n_jobs=1, nthread=None,
        objective='binary:logistic', random_state=2018, reg_alpha=0,
        reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
        subsample=1), 0.7867858282796042)

LightGBM

param_grid = {
    'n_estimators': range(70, 150, 10),
    'learning_rate': [0.02, 0.1, 0.2],
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018), param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=-1,
         min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
         n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
         random_state=2018, reg_alpha=0.0, reg_lambda=0.0, silent=True,
         subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
 0.7826364506862432)
lgbm = lgbm.best_estimator_

param_grid = {
    'max_depth': range(1, 10, 2),
    'min_samples_split': range(10, 31, 5)
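    # Note: min_samples_split is not an LGBMClassifier parameter either; LightGBM's closest
    # analogue is min_child_samples (minimum number of samples in a leaf).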
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018, 
                                               n_estimators=lgbm.n_estimators,
                                               learning_rate=lgbm.learning_rate), 
                    param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
         importance_type='split', learning_rate=0.02, max_depth=5,
         min_child_samples=20, min_child_weight=0.001, min_samples_split=10,
         min_split_gain=0.0, n_estimators=90, n_jobs=-1, num_leaves=31,
         objective=None, random_state=2018, reg_alpha=0.0, reg_lambda=0.0,
         silent=True, subsample=1.0, subsample_for_bin=200000,
         subsample_freq=0), 0.7832748164698372)

Model Evaluation

Evaluate the seven tuned models:

models = {'随机森林': forest_grid.best_estimator_,
          'GBDT': gbdt.best_estimator_,
          'XGBoost': xgb.best_estimator_,
          'LightGBM': lgbm.best_estimator_,
          '逻辑回归': log_grid.best_estimator_,
          'SVM': svc_grid.best_estimator_,
          '决策树': tree_grid.best_estimator_}

assessments = {
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1-score': [],
    'AUC': []
} 
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.tight_layout()
for name, model in models.items():
    test_pre = model.predict(X_test)
    train_pre = model.predict(X_train)
    test_proba = model.predict_proba(X_test)[:,1]
    train_proba = model.predict_proba(X_train)[:,1]
    
    acc_test = accuracy_score(y_test, test_pre) * 100
    acc_train = accuracy_score(y_train, train_pre) * 100
    accuracy = '训练集:%.2f%%;测试集:%.2f%%' % (acc_train, acc_test)
    assessments['Accuracy'].append(accuracy)
    
    pre_test = precision_score(y_test, test_pre) * 100
    pre_train = precision_score(y_train, train_pre) * 100
    precision = '训练集:%.2f%%;测试集:%.2f%%' % (pre_train, pre_test)
    assessments['Precision'].append(precision)
    
    rec_test = recall_score(y_test, test_pre) * 100
    rec_train = recall_score(y_train, train_pre) * 100
    recall = '训练集:%.2f%%;测试集:%.2f%%' % (rec_train, rec_test)
    assessments['Recall'].append(recall)
    
    f1_test = f1_score(y_test, test_pre) * 100
    f1_train = f1_score(y_train, train_pre) * 100
    f1 = '训练集:%.2f%%;测试集:%.2f%%' % (f1_train, f1_test)
    assessments['F1-score'].append(f1)
    
    fig = plt.figure(figsize=(8, 6))
    fpr, tpr, thresholds = roc_curve(y_test, test_proba)
    plot_roc_curve(fpr, tpr, label='测试集')
    fpr, tpr, thresholds = roc_curve(y_train, train_proba)
    plot_roc_curve(fpr, tpr, label='训练集')
    plt.title(name)
    
    auc_test = roc_auc_score(y_test, test_proba) * 100
    auc_train = roc_auc_score(y_train, train_proba) * 100
    auc = '训练集:%.2f%%;测试集:%.2f%%' % (auc_train, auc_test)
    assessments['AUC'].append(auc)
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    plot_roc_curve(fpr, tpr, label=name)
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_train)[:,1]
    fpr, tpr, thresholds = roc_curve(y_train, proba)
    plot_roc_curve(fpr, tpr, label=name)
ass_df = pd.DataFrame(assessments, index=models.keys())
ass_df
Each entry reads 训练集 (train) / 测试集 (test):

随机森林: AUC 90.82% / 79.88%; Accuracy 84.33% / 79.60%; F1-score 58.07% / 46.69%; Precision 43.59% / 34.68%; Recall 86.96% / 71.43%
GBDT: AUC 87.87% / 79.15%; Accuracy 84.26% / 78.78%; F1-score 57.17% / 44.01%; Precision 42.18% / 32.37%; Recall 88.68% / 68.71%
XGBoost: AUC 90.41% / 79.28%; Accuracy 85.06% / 79.23%; F1-score 63.03% / 49.18%; Precision 51.15% / 39.02%; Recall 82.10% / 66.50%
LightGBM: AUC 86.70% / 79.53%; Accuracy 82.41% / 78.93%; F1-score 49.77% / 41.41%; Precision 35.00% / 28.90%; Recall 86.12% / 72.99%
逻辑回归: AUC 76.33% / 78.34%; Accuracy 78.77% / 78.70%; F1-score 37.68% / 38.89%; Precision 25.77% / 26.30%; Recall 70.03% / 74.59%
SVM: AUC 80.23% / 74.26%; Accuracy 80.82% / 77.96%; F1-score 43.14% / 34.80%; Precision 29.23% / 22.83%; Recall 82.31% / 73.15%
决策树: AUC 76.63% / 74.19%; Accuracy 79.29% / 77.14%; F1-score 46.41% / 43.46%; Precision 36.03% / 34.10%; Recall 65.20% / 59.90%

ROC curves:

[Figures: ROC curves on the training and test sets for each model, grouped into the four ensemble models and the three non-ensemble models]

Comparing the ROC curves of all models:

[Figures: ROC curves of all seven models compared, on the training set and on the test set]

Summary

Compared with the earlier results (see 【数据分析实践】Task 1.3 模型调优), every model improves slightly after the feature processing and feature selection, and overfitting is also reduced.

Due to time constraints the hyperparameters were not tuned in more detail; further gains will require more effort on both tuning and feature engineering.

References

Task description: feature selection: select features with IV values and with random forest importance respectively, then evaluate with the 7 models from 【算法实践】 (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM).

[1] https://blog.csdn.net/sscc_learning/article/details/78591210, 【评分卡】评分卡入门与创建原则——分箱、WOE、IV、分值分配

[2] https://blog.csdn.net/pylady/article/details/78882220, 特征工程之分箱

[3] https://mp.weixin.qq.com/s?__biz=MzIxNzc1NDgzMw==&mid=2247484031&idx=1&sn=dc6f97982ac958653ba8af8cf75ec0d0&chksm=97f5bfc1a08236d75b13b4e456334e07d4bbff209c9449adf8ce1aae45a52fcb04954584c2ce&mpshare=1&scene=23&srcid=0127eIvjcmFdJMnR2fdaJnFX#rd, python 评分卡建模—实现 WOE 编码及 IV 值计算

[4] https://blog.csdn.net/RuDing/article/details/78332192, Gradient Boosting(GBM) 调参指南
