Import the packages used in this task:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
confusion_matrix, f1_score, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
%matplotlib inline
plt.rc('font', family='SimHei', size=14)
plt.rcParams['axes.unicode_minus']=False
%config InlineBackend.figure_format = 'retina'
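Preparing the data
Loading the data
Original dataset download: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Note: this dataset is financial data (not raw; it has already been processed). The goal is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Here we load the dataset that was already cleaned in the previous post ([One-Week Algorithm Practice, Advanced] Task 1: Data Preprocessing):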
data_processed = pd.read_csv('data_processed.csv')
data_processed.head()
low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | loans_latest_day | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.01 | 0.99 | 0.0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | ... | 18.0 | 1 | 0 | 0 | 0 | 0 | 4 | 2 | 4 | 3 |
1 | 0.02 | 0.94 | 2000.0 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | ... | 2.0 | 1 | 0 | 0 | 0 | 0 | 5 | 3 | 5 | 5 |
2 | 0.04 | 0.96 | 0.0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | ... | 6.0 | 1 | 0 | 0 | 0 | 0 | 5 | 5 | 5 | 1 |
3 | 0.00 | 0.96 | 2000.0 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | ... | 4.0 | 0 | 1 | 0 | 0 | 0 | 5 | 5 | 5 | 3 |
4 | 0.01 | 0.99 | 0.0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | ... | 120.0 | 1 | 0 | 0 | 0 | 0 | 4 | 6 | 1 | 6 |
5 rows × 89 columns
Splitting the data
Split the raw data into the feature matrix and the label:
label = data_processed['status']
data = data_processed.drop(['status'], axis=1)
Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | loans_latest_day | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.281946 | 0.613274 | -0.493153 | -0.019542 | -1.295506 | -0.315097 | -0.120837 | -0.090540 | 0.256651 | 0.361774 | ... | -0.696716 | 0.632505 | -0.538091 | -0.170886 | -0.029907 | -0.181666 | -0.211985 | -0.674125 | -0.188457 | -0.015516 |
1 | -0.044836 | 0.267090 | 0.007392 | -0.019011 | 0.993048 | 0.534560 | 0.323209 | 0.040603 | -0.467915 | 1.483449 | ... | -0.996528 | 0.632505 | -0.538091 | -0.170886 | -0.029907 | -0.181666 | 0.513432 | -0.161836 | 0.135729 | 1.005817 |
2 | 0.429385 | 0.405564 | -0.493153 | -0.019402 | 0.993048 | -1.481178 | -1.008929 | 1.701755 | 0.015129 | -1.881576 | ... | -0.921575 | 0.632505 | -0.538091 | -0.170886 | -0.029907 | -0.181666 | 0.513432 | 0.862743 | 0.135729 | -1.036849 |
3 | -0.519056 | 0.405564 | 0.007392 | -0.020619 | -1.193793 | 2.403805 | 0.989278 | -0.658829 | -0.950958 | 0.361774 | ... | -0.959052 | -1.581015 | 1.858422 | -0.170886 | -0.029907 | -0.181666 | 0.513432 | 0.862743 | 0.135729 | -0.015516 |
4 | -0.281946 | 0.613274 | -0.493153 | -0.020157 | 0.993048 | -1.123736 | -1.008929 | 1.614326 | 1.222738 | -1.881576 | ... | 1.214585 | 0.632505 | -0.538091 | -0.170886 | -0.029907 | -0.181666 | -0.211985 | 1.375032 | -1.161015 | 1.516484 |
5 rows × 88 columns
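As a quick check of what `StandardScaler` does, the sketch below standardizes a small made-up array (the values are illustrative and not taken from the dataset) and compares the result with the manual formula (x - mean) / std, assuming the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, for illustration only)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaled = StandardScaler().fit_transform(X)

# Manual standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0), which StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so features measured on very different scales become comparable.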
Feature selection
By IV
IV stands for Information Value. Only the way IV is computed is described here; see the references for details.
First compute the WOE (Weight of Evidence):
$$WOE_i = \ln\left(\frac{p(y_i)}{p(n_i)}\right) = \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)$$
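where $p(y_i)$ is the share of all overdue customers (status=1) in the sample that fall in this group, and $p(n_i)$ is the share of all non-overdue customers (status=0) in the sample that fall in this group. $y_i$ is the number of overdue customers in this group, $y_T$ is the total number of overdue customers in the sample, $n_i$ is the number of non-overdue customers in this group, and $n_T$ is the total number of non-overdue customers in the sample.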
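To make the definition concrete, here is a minimal worked example of the WOE of a single group. All counts are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical counts for one group (not from the dataset)
y_i, y_T = 30, 100   # overdue customers: in this group / in the whole sample
n_i, n_T = 70, 900   # non-overdue customers: in this group / in the whole sample

p_y = y_i / y_T      # share of all overdue customers that fall in this group
p_n = n_i / n_T      # share of all non-overdue customers that fall in this group

woe = np.log(p_y / p_n)
print(round(woe, 4))
```

A positive WOE means the group holds a larger share of the overdue customers than of the non-overdue ones; a negative WOE means the opposite.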
This gives the formula for IV:
$$IV_i = (p(y_i)-p(n_i))\times WOE_i = (y_i/y_T - n_i/n_T)\times \ln\left(\frac{y_i/y_T}{n_i/n_T}\right)$$
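Summing $IV_i$ over all groups gives the IV of the feature, which indicates its predictive power as shown in the table below.
IV | Predictive power |
---|---|
<0.03 | None |
0.03~0.09 | Low |
0.1~0.29 | Medium |
0.3~0.49 | High |
>=0.5 | Very high |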
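The two formulas and the table can be combined into a short sketch. The per-group counts are made up, both function names are hypothetical, and the cut points 0.03, 0.1, 0.3 and 0.5 are an assumption about how to close the small gaps between the ranges in the table.

```python
import numpy as np

def iv_from_counts(y_counts, n_counts):
    """Total IV from per-group overdue (y) and non-overdue (n) counts."""
    y = np.asarray(y_counts, dtype=float)
    n = np.asarray(n_counts, dtype=float)
    p_y = y / y.sum()            # p(y_i) for every group
    p_n = n / n.sum()            # p(n_i) for every group
    woe = np.log(p_y / p_n)      # WOE_i
    return float(np.sum((p_y - p_n) * woe))  # sum of IV_i over the groups

def predictive_power(iv):
    """Map an IV to a label from the table (cut points are an assumption)."""
    if iv < 0.03:
        return 'none'
    if iv < 0.1:
        return 'low'
    if iv < 0.3:
        return 'medium'
    if iv < 0.5:
        return 'high'
    return 'very high'

# Hypothetical feature with three groups
iv_demo = iv_from_counts([10, 30, 60], [400, 300, 200])
print(round(iv_demo, 4), predictive_power(iv_demo))
```

If both classes are distributed identically over the groups, every WOE is 0 and so is the IV.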
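Binning
Before computing IV, the data must be binned. Binning methods are either supervised (chi-square, minimum entropy) or unsupervised (equal-width, equal-frequency, clustering). We use chi-square binning; the other methods are described in the references.
- Initialization
Sort the instances by attribute value; each distinct value starts as its own group.
- Merging
(1) Compute the chi-square value of every pair of adjacent groups.
(2) Merge the adjacent pair with the smallest chi-square value.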
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(A_{ij}-E_{ij})^2}{E_{ij}}$$
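where $A_{ij}$ is the number of instances of class $j$ in group $i$, and $E_{ij}$ is the expected frequency of $A_{ij}$ (the total of row $i$ times the total of column $j$, divided by the grand total).
(3) Repeat (1) and (2) until every computed chi-square value is no lower than the preset threshold, or until the number of groups meets a given condition (for example, at least 5 and at most 8 groups).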
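The chi-square value for one pair of adjacent groups can be checked by hand. The sketch below uses a made-up 2x2 table of counts, computes the statistic from the formula, and compares it with `scipy.stats.chi2_contingency` and with the 95% threshold for one degree of freedom.

```python
import numpy as np
from scipy import stats

# Rows: two adjacent groups; columns: counts of status=0 and status=1
# (hypothetical counts, for illustration only)
A = np.array([[40, 10],
              [30, 20]])

# Expected counts: E_ij = (total of row i) * (total of column j) / grand total
E = A.sum(axis=1, keepdims=True) * A.sum(axis=0, keepdims=True) / A.sum()
chi2_manual = ((A - E) ** 2 / E).sum()

# Same statistic from SciPy (without Yates' continuity correction)
chi2_scipy = stats.chi2_contingency(A, correction=False)[0]

# 95% threshold with (number of classes - 1) = 1 degree of freedom
threshold = stats.chi2.isf(0.05, df=1)

print(chi2_manual, chi2_scipy, threshold)
print('merge' if chi2_manual < threshold else 'keep separate')
```

A small chi-square value means the two groups have similar class distributions, which is why the pair with the smallest value is merged first.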
(The chiMerge function is adapted from reference [3], with modifications.)
def chiMerge(df, col, target, threshold=None):
    ''' Chi-square (ChiMerge) binning
    df: pandas DataFrame
    col: name of the (numeric) variable to bin
    target: name of the class-label column
    threshold: chi-square threshold; if None, it is set from the 95% confidence level.
    return: array with the left edge of each group, followed by a final upper edge.
    '''
    freq_tab = pd.crosstab(df[col], df[target])
    freq = freq_tab.values.copy()  # NumPy array; copied because it is merged in place below
    # 1. Initialization: sort by attribute value; each distinct value is its own group.
    # Append a value larger than the maximum so that the final bins cover every sample
    cutoffs = np.append(freq_tab.index.values, max(freq_tab.index.values) + 1)
    if threshold is None:
        # No threshold given: use the 95% confidence level
        # (degrees of freedom = number of classes - 1).
        cls_num = freq.shape[-1]
        threshold = stats.chi2.isf(0.05, df=cls_num - 1)
    # 2. Merging
    while True:
        minvalue = np.inf
        minidx = np.inf
        # Compute the chi-square value of every pair of adjacent groups
        for i in range(len(freq) - 1):
            # The +1 keeps every cell count positive
            v = stats.chi2_contingency(freq[i:i+2] + 1, correction=False)[0]
            # Track the minimum
            if minvalue > v:
                minvalue = v
                minidx = i
        # If the smallest chi-square value is below the threshold,
        # merge that pair of adjacent groups and keep looping
        if minvalue < threshold:
            freq[minidx] += freq[minidx+1]
            freq = np.delete(freq, minidx+1, 0)
            cutoffs = np.delete(cutoffs, minidx+1, 0)
        else:
            break
    return cutoffs
Computing IV
def iv_value(df, col, target):
    ''' Compute the IV of a single feature
    df: pandas DataFrame
    col: name of the (numeric) variable
    target: name of the label column
    return: the IV of the feature
    '''
    bins = chiMerge(df, col, target)  # bin edges
    cats = pd.cut(df[col], bins, right=False)
    # Add 1 to both numerator and denominator to avoid division by zero
    temp = (pd.crosstab(cats, df[target]) + 1) / (df[target].value_counts() + 1)
    woe = np.log(temp.iloc[:, 1] / temp.iloc[:, 0])
    iv = sum((temp.iloc[:, 1] - temp.iloc[:, 0]) * woe)
    return iv
Compute the IV of every feature:
iv = []
data_iv = pd.concat([data_scaled, label], axis=1)
for col in data_scaled.columns:
    iv.append(iv_value(data_iv, col, 'status'))
Save the IV values, reload them, and display them (in column order):
iv = np.array(iv)
np.save('iv', iv)
iv = np.load('iv.npy')
iv
array([0.02968667, 0.06475457, 0.06981247, 0.27089581, 0.03955683,
0.13346826, 0.00854632, 0.03929596, 0.04422897, 0.00559611,
0.53421682, 0. , 0.03166467, 0.38242452, 0.92400898,
0.18871897, 0.11657733, 0.79563374, 0. , 0.36688692,
0.06479698, 0.08637859, 0.0315798 , 0.08726314, 0.02813494,
0.07862981, 0.02872391, 0.00936212, 0.59139039, 0.25168984,
0.25886425, 0.42645628, 0.32054195, 0.01342581, 0.00419829,
0.23346355, 0.57449389, 0. , 0.37383946, 0.14084117,
0.50192192, 0.01717901, 0. , 0.00990202, 0.02356634,
0.02668144, 0.03360329, 0.02932465, 0.00517526, 0.66353628,
0. , 0.05768091, 0.03631875, 0.40640499, 0.01445641,
0.00671275, 0.01300546, 0.00552671, 0.03980268, 0.03645762,
0.0140021 , 0.65682529, 0.15289713, 0.37204304, 0.05508829,
0.0192688 , 0.01318021, 0.01300546, 0.01037065, 0.01728017,
0.25268217, 0.15254589, 0.00475146, 0.00671275, 0.01011964,
0.03126195, 0.50228468, 0.11432889, 0.07337619, 0. ,
0. , 0. , 0. , 0. , 0.03444958,
0.00903816, 0.01497038, 0. ])
Random forest
- n_estimators : integer, optional (default=10)
n_estimators: the number of trees in the forest, that is, the maximum number of weak learners.
Coarse tuning of n_estimators:
param = {'n_estimators': list(range(10, 1001, 50))}
g = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
param_grid=param, cv=5)
g.fit(data_scaled, label)
g.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False)
Fine tuning of n_estimators:
param = {'n_estimators': list(range(770, 870, 10))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
param_grid=param, cv=5)
forest_grid.fit(data, label)
rnd_clf = forest_grid.best_estimator_
rnd_clf
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=810, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False)
Combined analysis
Combine the IV values with the random forest feature importances:
feature_df = pd.DataFrame(np.c_[rnd_clf.feature_importances_, iv.T],
index=data.columns, columns=['随机森林', 'IV值'])
feature_df.head()
随机森林 | IV值 | |
---|---|---|
low_volume_percent | 0.007025 | 0.029687 |
middle_volume_percent | 0.009346 | 0.064755 |
take_amount_in_later_12_month_highest | 0.009766 | 0.069812 |
trans_amount_increase_rate_lately | 0.014802 | 0.270896 |
trans_activity_month | 0.010418 | 0.039557 |
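For readers unfamiliar with `feature_importances_`, the sketch below fits a small random forest on synthetic data from `make_classification` (not the loan dataset; the column names `f0` to `f5` are hypothetical) and ranks the features by importance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 2 columns are the informative ones
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
cols = ['f%d' % i for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(clf.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the importances are normalized, they rank features relative to each other rather than measuring predictive power on an absolute scale, which is one reason to cross-check them against IV.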
Plot the two for comparison, sorted in descending order first by IV and then by random forest score:
feature_df.sort_values(by='IV值', ascending=False)\
.plot(figsize=(8, 6), subplots=True, use_index=False)
[Figure: random forest score and IV per feature, sorted by IV in descending order]
feature_df.sort_values(by='随机森林', ascending=False)\
.plot(figsize=(8, 6), subplots=True, use_index=False)
[Figure: random forest score and IV per feature, sorted by random forest score in descending order]
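As the plots above show, although both curves fluctuate, the feature-importance curves given by random forest and by IV follow broadly the same trend. Given the IV-to-predictive-power table above, and the fact that the random forest score drops off sharply across roughly the first ten ranked features, we keep the features that rank in the top 15 by random forest score and also have an IV of at least 0.3.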
rnf_sorted = feature_df.sort_values(by='随机森林', ascending=False).iloc[:15, 0]
iv_sorted = feature_df[feature_df['IV值'] >= 0.3]['IV值']
index = pd.DataFrame([rnf_sorted, iv_sorted]).dropna(axis=1).columns
index
Index(['abs', 'apply_score', 'consfin_avg_limit', 'historical_trans_amount',
'history_fail_fee', 'latest_one_month_fail', 'loans_overdue_count',
'loans_score', 'max_cumulative_consume_later_1_month',
'repayment_capability', 'trans_amount_3_month',
'trans_fail_top_count_enum_last_1_month'],
dtype='object')
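The same "top-k by one score and above a threshold on the other" selection can also be written with `Index.intersection`. This is an alternative sketch on a tiny made-up frame (the names `feature_df_demo`, `rf` and `iv` are hypothetical), not the code used above.

```python
import pandas as pd

# Hypothetical scores for four features
feature_df_demo = pd.DataFrame(
    {'rf': [0.30, 0.05, 0.25, 0.40],
     'iv': [0.50, 0.60, 0.10, 0.35]},
    index=['a', 'b', 'c', 'd'])

# Top 3 features by random forest score
top_rf = feature_df_demo['rf'].sort_values(ascending=False).index[:3]
# Features whose IV is at least 0.3
high_iv = feature_df_demo.index[feature_df_demo['iv'] >= 0.3]

# Keep only the features that satisfy both conditions
selected = top_rf.intersection(high_iv)
print(sorted(selected))
```

Here `b` has a high IV but a low forest score and `c` the reverse, so only `a` and `d` survive.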
After filtering, 12 features remain. The data restricted to these features:
data_del = data_scaled[index]
data_del.head()
abs | apply_score | consfin_avg_limit | historical_trans_amount | history_fail_fee | latest_one_month_fail | loans_overdue_count | loans_score | max_cumulative_consume_later_1_month | repayment_capability | trans_amount_3_month | trans_fail_top_count_enum_last_1_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.200665 | 0.124820 | -1.201348 | -0.255030 | -0.427773 | -0.337569 | -0.098210 | 0.144596 | -0.067183 | 0.020868 | -0.049208 | -0.346369 |
1 | -0.090524 | 1.497024 | 0.238640 | 0.215237 | -0.547614 | -0.080162 | -0.733973 | 1.509325 | -0.073494 | -0.034355 | -0.274805 | -0.868380 |
2 | -0.312623 | 1.516627 | -0.671941 | -0.675385 | -0.627508 | -0.080162 | -0.733973 | 1.476440 | -0.262821 | -0.171658 | -0.321773 | 0.697653 |
3 | 1.359842 | 0.360055 | 0.736282 | 0.790524 | 0.331218 | -0.337569 | 0.537552 | -0.019829 | 0.471049 | -0.237850 | 0.505738 | -0.346369 |
4 | -0.315531 | -0.698503 | 0.042759 | -0.522714 | 0.291271 | -0.337569 | 1.173315 | -1.055708 | -0.172665 | -0.144424 | -0.282697 | 0.697653 |
Building the models
Use sklearn to split the data into a training set and a test set at a 7:3 ratio, with random seed 2018:
X_train, X_test, y_train, y_test = train_test_split(data_del, label, test_size=0.3, random_state=2018)
Check the sizes of the training set and the test set:
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]
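The split above does not stratify by label. When the classes are imbalanced, one optional variant is to pass `stratify=`, which keeps the class ratio the same in both parts. The sketch below illustrates this on made-up data; it is not what this post does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 25% positive
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 75 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo)

# Shapes follow the 7:3 ratio; the positive rate stays close to 25% in both parts
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```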
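Model tuning
Build and train the seven models used here: four ensemble models (XGBoost, LightGBM, GBDT, random forest) and three non-ensemble models (logistic regression, SVM, decision tree). Each of the 7 models is tuned with grid search, using 5-fold cross-validation.
Logistic regression
Selected parameters:
- C : float, default: 1.0
The smaller C is, the stronger the regularization. C must be a positive float.
- class_weight : dict or 'balanced', default: None
The weight of each class; if not given, all classes have the same weight. The 'balanced' mode uses the values of y to set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).
- solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default: 'liblinear'.
The optimization algorithm. For small datasets 'liblinear' is a good choice, while 'sag' and 'saga' train faster on large ones. For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle the multinomial loss; 'liblinear' can only train in one-versus-rest mode.
'newton-cg', 'lbfgs' and 'sag' support only the l2 penalty; 'liblinear' and 'saga' also support the l1 penalty.
Note that fast convergence of 'sag' and 'saga' is only guaranteed on features with roughly the same scale. The data can be preprocessed with a scaler from sklearn.preprocessing.
- max_iter : int, default: 100
Maximum number of iterations.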
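The 'balanced' formula can be checked numerically. The sketch below computes n_samples / (n_classes * np.bincount(y)) by hand on made-up labels and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 75 negatives, 25 positives
y_demo = np.array([0] * 75 + [1] * 25)

# Manual formula: n_samples / (n_classes * np.bincount(y))
n_classes = 2
manual = len(y_demo) / (n_classes * np.bincount(y_demo))

# The same weights from scikit-learn
sk = compute_class_weight(class_weight='balanced',
                          classes=np.array([0, 1]), y=y_demo)

print(manual, sk)
```

The rarer class receives the larger weight, so misclassifying an overdue customer costs more during training.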
Best parameters and score:
param_grid = {
'C': np.arange(0.01, 0.1, 0.01),
'solver': ['liblinear', 'lbfgs'],
'class_weight': ['balanced', None]
}
log_grid = GridSearchCV(LogisticRegression(random_state=2018, max_iter=1000),
param_grid, cv=5)
log_grid.fit(X_train, y_train)
log_grid.best_estimator_, log_grid.best_score_
(LogisticRegression(C=0.02, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=2018, solver='lbfgs', tol=0.0001,
verbose=0, warm_start=False), 0.7874241940631982)
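SVM
- C: float, optional (default=1.0)
The penalty coefficient C of the objective function, which trades off the classification margin against misclassified samples.
- kernel: string, optional (default='rbf')
Available kernels include RBF, Linear, Poly and Sigmoid.
- gamma: float, optional (default='auto')
The kernel coefficient for 'Poly', 'RBF' and 'Sigmoid'; the default is gamma = 1 / n_features.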
param_grid = {
'C': np.arange(0.1, 5.2, 0.5),
'gamma': ['auto', 0.01, 0.5],
}
svc_grid = GridSearchCV(SVC(random_state=2018, probability=True), param_grid, cv=5)
svc_grid.fit(X_train, y_train)
svc_grid.best_estimator_, svc_grid.best_score_
(SVC(C=2.1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=True, random_state=2018, shrinking=True,
tol=0.001, verbose=False), 0.7871050111714012)
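Decision tree
- max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
- max_features : int, float, string or None, optional (default=None)
The number of features considered when looking for the best split; with 'auto', max_features=sqrt(n_features).
- class_weight : dict, list of dicts, "balanced" or None, default=None
The weight of each class; if not given, all classes have the same weight. The "balanced" mode uses the values of y to set weights inversely proportional to the class frequencies in the input data, i.e. n_samples / (n_classes * np.bincount(y)).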
param_grid = {
'max_depth': range(2, 8, 1),
'min_samples_split': range(2, 11, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018), param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=2018,
splitter='best'), 0.7749760612831152)
param_grid = {
'min_samples_leaf': range(26, 35, 2),
'max_features': range(2, 10, 1)
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=2018, max_depth=4,
min_samples_split=2),
param_grid, cv=5)
tree_grid.fit(X_train, y_train)
tree_grid.best_estimator_, tree_grid.best_score_
(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
max_features=7, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=30,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=2018, splitter='best'),
0.7794446217682732)
Random forest
param = {'n_estimators': list(range(10, 1001, 50))}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_
param = {
'n_estimators': list(range(forest_grid.n_estimators - 40,
forest_grid.n_estimators + 50, 10))
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018),
param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False), 0.7883817427385892)
forest_grid = forest_grid.best_estimator_
param = {
'max_depth': range(3, 15, 2),
'min_samples_split': range(2, 53, 10)
}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018,
n_estimators=forest_grid.n_estimators),
param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=9, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=30,
min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False), 0.7918927545483562)
param = {'min_samples_leaf': range(1, 10, 2)}
forest_grid = GridSearchCV(estimator = RandomForestClassifier(random_state=2018, max_features='auto',
max_depth=9,
n_estimators=860,
min_samples_split=30),
param_grid=param, cv=5)
forest_grid.fit(X_train, y_train)
forest_grid.best_estimator_, forest_grid.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=9, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=30,
min_weight_fraction_leaf=0.0, n_estimators=860, n_jobs=1,
oob_score=False, random_state=2018, verbose=0,
warm_start=False), 0.7918927545483562)
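GBDT
- n_estimators : integer, optional (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a larger value usually gives better results.
- learning_rate : float, optional (default=0.1)
learning_rate shrinks the contribution of each tree; there is a trade-off between learning_rate and n_estimators.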
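The trade-off between `learning_rate` and `n_estimators` can be inspected with `staged_predict`, which yields the predictions after each boosting stage. The sketch below runs on synthetic data from `make_classification` (not the loan dataset), and the two learning rates are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

curves = {}
for lr in (0.02, 0.2):
    gb = GradientBoostingClassifier(n_estimators=100, learning_rate=lr,
                                    random_state=0)
    gb.fit(X_tr, y_tr)
    # Test accuracy after each boosting stage (1 .. n_estimators)
    curves[lr] = [accuracy_score(y_te, pred)
                  for pred in gb.staged_predict(X_te)]

for lr, accs in curves.items():
    print(lr, 'stage 10: %.3f' % accs[9], 'stage 100: %.3f' % accs[-1])
```

Comparing the two accuracy curves stage by stage shows how many trees each learning rate needs before its test accuracy levels off.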
param_grid = {
'n_estimators': range(80, 150, 10),
'learning_rate': [0.02, 0.01, 0.04],
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018), param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.02, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=110,
presort='auto', random_state=2018, subsample=1.0, verbose=0,
warm_start=False), 0.7864666453878072)
gbdt = gbdt.best_estimator_
param_grid = {
'max_depth': range(3, 12, 2),
'min_samples_split': range(20, 41, 5)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018,
n_estimators=gbdt.n_estimators,
learning_rate=gbdt.learning_rate),
param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.02, loss='deviance', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=35,
min_weight_fraction_leaf=0.0, n_estimators=110,
presort='auto', random_state=2018, subsample=1.0, verbose=0,
warm_start=False), 0.7874241940631982)
gbdt = gbdt.best_estimator_
param_grid = {
'min_samples_leaf': range(1, 10, 2)
}
gbdt = GridSearchCV(GradientBoostingClassifier(random_state=2018,
n_estimators=gbdt.n_estimators,
learning_rate=gbdt.learning_rate,
max_depth=gbdt.max_depth,
min_samples_split=gbdt.min_samples_split),
param_grid, cv=5)
gbdt.fit(X_train, y_train)
gbdt.best_estimator_, gbdt.best_score_
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.02, loss='deviance', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=35,
min_weight_fraction_leaf=0.0, n_estimators=110,
presort='auto', random_state=2018, subsample=1.0, verbose=0,
warm_start=False), 0.7874241940631982)
XGBoost
param_grid = {
'n_estimators': range(70, 150, 10),
'learning_rate': [0.02, 0.1, 0.2],
}
xgb = GridSearchCV(XGBClassifier(random_state=2018), param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=90,
n_jobs=1, nthread=None, objective='binary:logistic',
random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=True, subsample=1), 0.7867858282796042)
xgb = xgb.best_estimator_
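param_grid = {
    'max_depth': range(1, 4, 1),
    # Note: min_samples_split is a scikit-learn tree parameter, not an XGBoost one,
    # so searching over it does not change the model (min_child_weight is the
    # closest XGBoost counterpart)
    'min_samples_split': range(1, 22, 5)
}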
xgb = GridSearchCV(XGBClassifier(random_state=2018,
n_estimators=xgb.n_estimators,
learning_rate=xgb.learning_rate),
param_grid, cv=5)
xgb.fit(X_train, y_train)
xgb.best_estimator_, xgb.best_score_
(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
max_depth=3, min_child_weight=1, min_samples_split=1, missing=None,
n_estimators=90, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=2018, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=1), 0.7867858282796042)
LightGBM
param_grid = {
'n_estimators': range(70, 150, 10),
'learning_rate': [0.02, 0.1, 0.2],
}
lgbm = GridSearchCV(LGBMClassifier(random_state=2018), param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.02, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
random_state=2018, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
0.7826364506862432)
lgbm = lgbm.best_estimator_
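param_grid = {
    'max_depth': range(1, 10, 2),
    # Note: min_samples_split is a scikit-learn tree parameter, not a LightGBM one
    # (min_child_samples is the closest LightGBM counterpart)
    'min_samples_split': range(10, 31, 5)
}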
lgbm = GridSearchCV(LGBMClassifier(random_state=2018,
n_estimators=lgbm.n_estimators,
learning_rate=lgbm.learning_rate),
param_grid, cv=5)
lgbm.fit(X_train, y_train)
lgbm.best_estimator_, lgbm.best_score_
(LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.02, max_depth=5,
min_child_samples=20, min_child_weight=0.001, min_samples_split=10,
min_split_gain=0.0, n_estimators=90, n_jobs=-1, num_leaves=31,
objective=None, random_state=2018, reg_alpha=0.0, reg_lambda=0.0,
silent=True, subsample=1.0, subsample_for_bin=200000,
subsample_freq=0), 0.7832748164698372)
Model evaluation
Evaluate the seven models:
models = {'随机森林': forest_grid.best_estimator_,
          'GBDT': gbdt.best_estimator_,
          'XGBoost': xgb.best_estimator_,
          'LightGBM': lgbm.best_estimator_,
          '逻辑回归': log_grid.best_estimator_,
          'SVM': svc_grid.best_estimator_,
          '决策树': tree_grid.best_estimator_}
assessments = {
'Accuracy': [],
'Precision': [],
'Recall': [],
'F1-score': [],
'AUC': []
}
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.tight_layout()
for name, model in models.items():
    test_pre = model.predict(X_test)
    train_pre = model.predict(X_train)
    test_proba = model.predict_proba(X_test)[:,1]
    train_proba = model.predict_proba(X_train)[:,1]
    # sklearn metrics take (y_true, y_pred); the order matters for precision and recall
    acc_test = accuracy_score(y_test, test_pre) * 100
    acc_train = accuracy_score(y_train, train_pre) * 100
    accuracy = '训练集:%.2f%%;测试集:%.2f%%' % (acc_train, acc_test)
    assessments['Accuracy'].append(accuracy)
    pre_test = precision_score(y_test, test_pre) * 100
    pre_train = precision_score(y_train, train_pre) * 100
    precision = '训练集:%.2f%%;测试集:%.2f%%' % (pre_train, pre_test)
    assessments['Precision'].append(precision)
    rec_test = recall_score(y_test, test_pre) * 100
    rec_train = recall_score(y_train, train_pre) * 100
    recall = '训练集:%.2f%%;测试集:%.2f%%' % (rec_train, rec_test)
    assessments['Recall'].append(recall)
    f1_test = f1_score(y_test, test_pre) * 100
    f1_train = f1_score(y_train, train_pre) * 100
    f1 = '训练集:%.2f%%;测试集:%.2f%%' % (f1_train, f1_test)
    assessments['F1-score'].append(f1)
    # One figure per model: ROC curves on the test set and the training set
    fig = plt.figure(figsize=(8, 6))
    fpr, tpr, thresholds = roc_curve(y_test, test_proba)
    plot_roc_curve(fpr, tpr, label='测试集')
    fpr, tpr, thresholds = roc_curve(y_train, train_proba)
    plot_roc_curve(fpr, tpr, label='训练集')
    plt.title(name)
    auc_test = roc_auc_score(y_test, test_proba) * 100
    auc_train = roc_auc_score(y_train, train_proba) * 100
    auc = '训练集:%.2f%%;测试集:%.2f%%' % (auc_train, auc_test)
    assessments['AUC'].append(auc)
# All models on the test set
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    plot_roc_curve(fpr, tpr, label=name)
# All models on the training set
fig = plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_train)[:,1]
    fpr, tpr, thresholds = roc_curve(y_train, proba)
    plot_roc_curve(fpr, tpr, label=name)
ass_df = pd.DataFrame(assessments, index=models.keys())
ass_df
AUC | Accuracy | F1-score | Precision | Recall | |
---|---|---|---|---|---|
随机森林 | 训练集:90.82%;测试集:79.88% | 训练集:84.33%;测试集:79.60% | 训练集:58.07%;测试集:46.69% | 训练集:86.96%;测试集:71.43% | 训练集:43.59%;测试集:34.68% |
GBDT | 训练集:87.87%;测试集:79.15% | 训练集:84.26%;测试集:78.78% | 训练集:57.17%;测试集:44.01% | 训练集:88.68%;测试集:68.71% | 训练集:42.18%;测试集:32.37% |
XGBoost | 训练集:90.41%;测试集:79.28% | 训练集:85.06%;测试集:79.23% | 训练集:63.03%;测试集:49.18% | 训练集:82.10%;测试集:66.50% | 训练集:51.15%;测试集:39.02% |
LightGBM | 训练集:86.70%;测试集:79.53% | 训练集:82.41%;测试集:78.93% | 训练集:49.77%;测试集:41.41% | 训练集:86.12%;测试集:72.99% | 训练集:35.00%;测试集:28.90% |
逻辑回归 | 训练集:76.33%;测试集:78.34% | 训练集:78.77%;测试集:78.70% | 训练集:37.68%;测试集:38.89% | 训练集:70.03%;测试集:74.59% | 训练集:25.77%;测试集:26.30% |
SVM | 训练集:80.23%;测试集:74.26% | 训练集:80.82%;测试集:77.96% | 训练集:43.14%;测试集:34.80% | 训练集:82.31%;测试集:73.15% | 训练集:29.23%;测试集:22.83% |
决策树 | 训练集:76.63%;测试集:74.19% | 训练集:79.29%;测试集:77.14% | 训练集:46.41%;测试集:43.46% | 训练集:65.20%;测试集:59.90% | 训练集:36.03%;测试集:34.10% |
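scikit-learn's metric functions take the true labels first and the predictions second. The sketch below, on made-up labels, derives precision, recall and F1 from the confusion matrix and checks them against `precision_score`, `recall_score` and `f1_score`; it also shows that swapping the two arguments exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical labels: 4 positives, of which the model finds only 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))

# Swapping the arguments silently swaps the two metrics
print(precision_score(y_pred, y_true), recall_score(y_pred, y_true))
```

Accuracy and F1 are symmetric in their two arguments, so only precision and recall are affected by the order.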
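ROC curves (figures not preserved): one panel for the ensemble models, one for the non-ensemble models.
Combined ROC comparison (figures not preserved): one panel for the training set, one for the test set.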
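As a reminder of how the ROC curve and AUC relate, the sketch below computes both on four made-up scores: `roc_curve` returns the points of the curve, and the area under those points equals `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve, one per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the curve by the trapezoidal rule
area = auc(fpr, tpr)

print(area, roc_auc_score(y_true, scores))
```

AUC is also the probability that a randomly chosen positive gets a higher score than a randomly chosen negative: here 3 of the 4 positive-negative pairs are ordered correctly, giving 0.75.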
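Summary
Compared with the earlier results (see [Data Analysis Practice] Task 1.3: Model Tuning), after feature processing and feature selection every model improves slightly, and overfitting is also reduced.
Because of time constraints the tuning was not more thorough; improving the results further will require more work on both tuning and feature engineering.
References
Task description: Feature selection: select features using IV and using random forest. Then evaluate the 7 models from [Algorithm Practice] (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM).
[1] https://blog.csdn.net/sscc_learning/article/details/78591210, [Scorecards] Introduction to scorecards and how to build them: binning, WOE, IV, score allocation
[2] https://blog.csdn.net/pylady/article/details/78882220, Feature engineering: binning
[3] https://mp.weixin.qq.com/s?__biz=MzIxNzc1NDgzMw==&mid=2247484031&idx=1&sn=dc6f97982ac958653ba8af8cf75ec0d0&chksm=97f5bfc1a08236d75b13b4e456334e07d4bbff209c9449adf8ce1aae45a52fcb04954584c2ce&mpshare=1&scene=23&srcid=0127eIvjcmFdJMnR2fdaJnFX#rd, Scorecard modeling in Python: implementing WOE encoding and IV computation
[4] https://blog.csdn.net/RuDing/article/details/78332192, Gradient Boosting (GBM) tuning guide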