Model Tuning (Predicting Whether Loan Users Will Be Overdue)
Given financial data, predict whether a loan user will be overdue.
(status is the label: 0 = not overdue, 1 = overdue.)
Task 6 (Model Tuning): use grid search to tune each model, and evaluate with 5-fold cross-validation.
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# Preview the dataset
data = pd.read_csv('data.csv')
# Drop duplicates
data.drop_duplicates(inplace=True)
print(data.shape)
data.head()
Load the features and get the label
import pickle
# Load the features
with open('feature.pkl', 'rb') as f:
    X = pickle.load(f)
# Check whether the positive and negative classes are balanced
y = data.status
y.value_counts()
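As an aside, a normalized count makes the degree of imbalance explicit; a one-line sketch using the y defined above:
# Fraction of each class; the overdue class (1) is the minority
print(y.value_counts(normalize=True))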
1. Dataset split
# Split into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2333)
# Standardize the features
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
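One caveat: the grid searches below run 5-fold CV on X_train, which was already scaled with statistics from all of X_train, so a small amount of information leaks across the folds. Wrapping the scaler and the model in a Pipeline keeps the scaling inside each fold; a minimal illustrative sketch (the model__ prefix routes parameters to the wrapped step):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# The scaler is re-fit on each CV training fold, so no fold statistics leak
pipe = Pipeline([('scale', StandardScaler()), ('model', LogisticRegression())])
param = {'model__C': [0.01, 0.1, 1]}
gsearch = GridSearchCV(pipe, param_grid=param, scoring='roc_auc', cv=5)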
2. Model evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predicted labels and positive-class probabilities
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]
    # Accuracy
    print('[Accuracy]', end=' ')
    print('Train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('Test:', '%.4f' % accuracy_score(y_test, y_test_pred))
    # AUC: can be computed with roc_auc_score, or with roc_curve + auc
    print('[AUC]', end=' ')
    print('Train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('Test:', '%.4f' % roc_auc_score(y_test, y_test_proba))
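As the comment above notes, the same AUC can also be obtained from roc_curve plus auc; a quick sketch of the equivalence, assuming clf is any fitted classifier from this post:
from sklearn.metrics import roc_curve, auc
# roc_curve returns the ROC curve's FPR/TPR points; auc integrates them with
# the trapezoidal rule, matching roc_auc_score exactly
y_test_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_test_proba)
print('%.4f' % auc(fpr, tpr))  # same value as roc_auc_score(y_test, y_test_proba)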
Import the packages
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
3. LR model
lr = LogisticRegression()
# Note: scikit-learn >= 0.22 defaults to solver='lbfgs', which does not support
# the 'l1' penalty; pass solver='liblinear' (or 'saga') there to search this grid.
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
gsearch = GridSearchCV(lr, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best CV score on train:', gsearch.best_score_)
print('Score on test:', gsearch.score(X_test, y_test))
Output
Best parameters: {'C': 0.1, 'penalty': 'l1'}
Best CV score on train: 0.7839975682166337
Score on test: 0.7994893464372945
lr = LogisticRegression(C = 0.1, penalty = 'l1')
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7959 Test: 0.8073
[AUC] Train: 0.8016 Test: 0.7995
4. SVM model
The grid could be set as wide as 'gamma': [0.001, 0.01, 0.1, 1, 10, 100] and 'C': [0.001, 0.01, 0.1, 1, 10, 100]. For time reasons, the grid searches below use smaller ranges.
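If the full grid is wanted without the full cost, RandomizedSearchCV samples a fixed number of candidates instead of trying all 36 combinations; a hedged sketch, not part of the original tuning:
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV
# Evaluate only n_iter randomly chosen (gamma, C) pairs from the full grid
param_dist = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
rsearch = RandomizedSearchCV(svm.SVC(probability=True), param_distributions=param_dist,
                             n_iter=10, scoring='roc_auc', cv=5, random_state=2333)
rsearch.fit(X_train, y_train)
print(rsearch.best_params_, rsearch.best_score_)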
# 1) Linear SVM
svm_linear = svm.SVC(kernel = 'linear', probability=True)
param = {'C':[0.01,0.1,1]}
gsearch = GridSearchCV(svm_linear, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best CV score on train:', gsearch.best_score_)
print('Score on test:', gsearch.score(X_test, y_test))
Output
Best parameters: {'C': 0.01}
Best CV score on train: 0.7849963207712166
Score on test: 0.810812878176418
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7848 Test: 0.7912
[AUC] Train: 0.8044 Test: 0.8108
# 2) Polynomial SVM
svm_poly = svm.SVC(kernel = 'poly', probability=True)
param = {'C':[0.01,0.1,1]}
gsearch = GridSearchCV(svm_poly, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best CV score on train:', gsearch.best_score_)
print('Score on test:', gsearch.score(X_test, y_test))
Output
Best parameters: {'C': 0.01}
Best CV score on train: 0.7359832819347378
Score on test: 0.7259879405573931
svm_poly = svm.SVC(C = 0.01, kernel = 'poly', probability=True)
svm_poly.fit(X_train, y_train)
model_metrics(svm_poly, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7538 Test: 0.7547
[AUC] Train: 0.8690 Test: 0.7260
# 3) RBF (Gaussian) SVM
svm_rbf = svm.SVC(probability=True)
param = {'gamma': [0.01, 0.1, 1, 10],
         'C': [0.01, 0.1, 1]}
gsearch = GridSearchCV(svm_rbf, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best CV score on train:', gsearch.best_score_)
print('Score on test:', gsearch.score(X_test, y_test))
Output
Best parameters: {'C': 0.01, 'gamma': 0.01}
Best CV score on train: 0.7370458508629731
Score on test: 0.7247746108112956
svm_rbf = svm.SVC(gamma=0.01, C=0.01, probability=True)
svm_rbf.fit(X_train, y_train)
model_metrics(svm_rbf, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7475 Test: 0.7526
[AUC] Train: 0.8614 Test: 0.7944
# 4) Sigmoid SVM
svm_sigmoid = svm.SVC(kernel = 'sigmoid',probability=True)
param = {'C':[0.01,0.1,1]}
gsearch = GridSearchCV(svm_sigmoid, param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('Best parameters:', gsearch.best_params_)
print('Best CV score on train:', gsearch.best_score_)
print('Score on test:', gsearch.score(X_test, y_test))
Output
Best parameters: {'C': 0.01}
Best CV score on train: 0.7747312761692854
Score on test: 0.7803266494690364
svm_sigmoid = svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True)
svm_sigmoid.fit(X_train, y_train)
model_metrics(svm_sigmoid, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7475 Test: 0.7526
[AUC] Train: 0.7615 Test: 0.7803
5. Decision tree model
1) First, grid-search the maximum depth max_depth together with min_samples_split, the minimum number of samples required to split an internal node.
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=8,min_samples_split=300,min_samples_leaf=20, max_features='sqrt' ,random_state =2333),
param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'max_depth': 11, 'min_samples_split': 100}, 0.7061428632294259)
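Since grid_scores_ was removed in scikit-learn 0.20, per-candidate results are inspected through cv_results_ instead; a minimal sketch for the search above:
import pandas as pd
# Each row of cv_results_ is one parameter combination; rank_test_score
# orders them by mean CV AUC
results = pd.DataFrame(gsearch.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score').head())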
2) Next, tune min_samples_split together with min_samples_leaf, the minimum number of samples required at a leaf node.
param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=11,min_samples_split=100,min_samples_leaf=20, max_features='sqrt',random_state =2333),
param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'min_samples_leaf': 80, 'min_samples_split': 550}, 0.7118895755089746)
3) Then grid-search the maximum number of features, max_features.
param = {'max_features':range(7,20,2)}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=11,min_samples_split=550,min_samples_leaf=80, max_features='sqrt',random_state =2333),
param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'max_features': 19}, 0.7050922939313528)
Inspect the final result
dt = DecisionTreeClassifier(max_depth=11,min_samples_split=550,min_samples_leaf=80,max_features=19,random_state =2333)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.7538 Test: 0.7561
[AUC] Train: 0.7155 Test: 0.6836
6. XGBoost model
1. max_depth = 5: values between 3 and 10 work well for this parameter. I start from 5, but other choices are fine; anything in 4-6 is a reasonable starting point.
2. min_child_weight = 1: a fairly small value is chosen because this is a rather imbalanced classification problem, so some leaf nodes will have small weight sums (see the scale_pos_weight sketch after this list).
3. gamma = 0: the starting value can also be another small number, say 0.1 to 0.2; this parameter is tuned again later anyway.
4. subsample, colsample_bytree = 0.8: the most common starting value; typical values range from 0.5 to 0.9.
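Since item 2 above mentions class imbalance: a common companion setting is scale_pos_weight, the negative-to-positive ratio. The tuning below keeps scale_pos_weight=1, so this is only an illustrative aside:
# XGBoost multiplies the gradient of positive samples by scale_pos_weight,
# so neg/pos rebalances the classes during training
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
print('suggested scale_pos_weight:', neg / pos)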
First, look at the results with the default parameters.
xgb0 = XGBClassifier()
xgb0.fit(X_train, y_train)
model_metrics(xgb0, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.8530 Test: 0.8024
[AUC] Train: 0.9155 Test: 0.7932
1) Start with the learning rate and the number of iterations (n_estimators).
Pick a relatively small learning rate and grid-search for the best number of iterations. Here the learning rate is initialized to 0.1 and n_estimators is searched.
param_test = {'n_estimators':range(20,200,20)}
# Note: the iid argument was removed in scikit-learn >= 0.24; drop it there.
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=5,
                                                 min_child_weight=1, gamma=0, subsample=0.8,
                                                 colsample_bytree=0.8, objective= 'binary:logistic',
                                                 nthread=4, scale_pos_weight=1, seed=27),
                       param_grid = param_test, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'n_estimators': 20}, 0.7822329511588559)
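As an alternative to grid-searching n_estimators, xgboost's built-in cv with early stopping finds the number of rounds directly; a hedged sketch with the same starting parameters (eta is the native name for learning_rate):
import xgboost as xgb
# Grow trees round by round and stop once the 5-fold CV AUC has not improved
# for 20 rounds; the returned history is truncated at the best iteration
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'eta': 0.1, 'max_depth': 5, 'min_child_weight': 1, 'gamma': 0,
          'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}
cv_res = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                metrics='auc', early_stopping_rounds=20, seed=27)
print('best n_estimators:', cv_res.shape[0])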
2) Tune max_depth and min_child_weight
param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'max_depth': 5, 'min_child_weight': 5}, 0.7852805853095239)
The ideal value is 5 for max_depth and 5 for min_child_weight. We can search more finely around these values to find the best ones.
param_test = {'max_depth':[3,4,5], 'min_child_weight':[3,4,5]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'max_depth': 5, 'min_child_weight': 5}, 0.7852805853095239)
3) Tune gamma
param_test = {'gamma':[i/10 for i in range(1,6)]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=5, gamma=0, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'gamma': 0.4}, 0.7858965997168706)
4) Tune subsample and colsample_bytree
param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=5, gamma=0.4, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'colsample_bytree': 0.9, 'subsample': 0.9}, 0.7859006376180202)
This suggests 0.9 as the ideal value for both subsample and colsample_bytree. Now search around these values with a step of 0.05.
param_test = { 'subsample':[i/100 for i in range(85,101,5)], 'colsample_bytree':[i/100 for i in range(85,101,5)]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=5, gamma=0.4, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'colsample_bytree': 0.95, 'subsample': 0.9}, 0.7868599289367876)
5) Tune the regularization parameters
# 'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]
# 'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=20, max_depth=5,
min_child_weight=5, gamma=0.4, subsample=0.95,
colsample_bytree=0.9, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'reg_alpha': 1}, 0.7855783149845719)
6) Continue looping from step 1) to refine; the intermediate steps are omitted here.
The final chosen model parameters are:
#XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=5,
# gamma=0.4, subsample=0.5, colsample_bytree=0.9, reg_alpha=1, objective= 'binary:logistic',
# nthread=4,scale_pos_weight=1, seed=27)
# score - 0.7936928753627137
7) Go back to step 1): lower the learning rate and re-tune the number of iterations
param_test = {'n_estimators':range(20,200,20)}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.01, n_estimators=60, max_depth=3,
min_child_weight=5, gamma=0.4, subsample=0.5,
colsample_bytree=0.9, reg_alpha=1, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, seed=27),
param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
Output: ({'n_estimators': 180}, 0.7842467214840007)
Compared with the step 1) result, this is better, so the result with the smaller learning rate is kept. One could keep looping from step 1), but no further tuning is done here.
Inspect the final result
xgb = XGBClassifier(learning_rate =0.01, n_estimators=180, max_depth=3, min_child_weight=5,
gamma=0.4, subsample=0.5, colsample_bytree=0.9, reg_alpha=1,
objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.8028 Test: 0.7940
[AUC] Train: 0.8257 Test: 0.7953
7. LightGBM model
As with XGBoost, tune the boosting-framework parameters first, then the individual learner's parameters.
(Details omitted.)
The tuning procedure is sketched below; the grids could be made finer.
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'gamma':[i/10 for i in range(1,6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# Note: these are XGBoost-style names. LightGBM accepts min_child_weight as an
# alias, but gamma is not a LightGBM parameter (its counterpart is min_split_gain)
# and is merely passed through to the booster.
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=60, max_depth=3,
                                                  min_child_weight=11, gamma=0.1, subsample=0.5,
                                                  colsample_bytree=0.8, reg_alpha = 1e-5,
                                                  nthread=4, scale_pos_weight=1, seed=27),
                       param_grid = param_test, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_
gsearch.best_params_, gsearch.best_score_
The coarse-tuning result is as follows:
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11,
gamma=0.1, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
nthread=4,scale_pos_weight=1, seed=27)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)
Output
[Accuracy] Train: 0.8242 Test: 0.8066
[AUC] Train: 0.8715 Test: 0.7970
Results comparison
| Model | Parameters | Accuracy | AUC |
| --- | --- | --- | --- |
| LR | LogisticRegression(C=0.1, penalty='l1') | Train: 0.7959 / Test: 0.8073 | Train: 0.8016 / Test: 0.7995 |
| svm_linear | svm.SVC(C=0.01, kernel='linear', probability=True) | Train: 0.7848 / Test: 0.7912 | Train: 0.8044 / Test: 0.8108 |
| svm_poly | svm.SVC(C=0.01, kernel='poly', probability=True) | Train: 0.7538 / Test: 0.7547 | Train: 0.8690 / Test: 0.7260 |
| svm_rbf | svm.SVC(gamma=0.01, C=0.01, probability=True) | Train: 0.7475 / Test: 0.7526 | Train: 0.8614 / Test: 0.7944 |
| svm_sigmoid | svm.SVC(C=0.01, kernel='sigmoid', probability=True) | Train: 0.7475 / Test: 0.7526 | Train: 0.7615 / Test: 0.7803 |
| Decision tree | DecisionTreeClassifier(max_depth=11, min_samples_split=550, min_samples_leaf=80, max_features=19) | Train: 0.7538 / Test: 0.7561 | Train: 0.7155 / Test: 0.6836 |
| XGBoost | XGBClassifier(learning_rate=0.01, n_estimators=180, max_depth=3, min_child_weight=5, gamma=0.4, subsample=0.5, colsample_bytree=0.9, reg_alpha=1, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27) | Train: 0.8028 / Test: 0.7940 | Train: 0.8257 / Test: 0.7953 |
| LightGBM | LGBMClassifier(learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0.1, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, nthread=4, scale_pos_weight=1, seed=27) | Train: 0.8242 / Test: 0.8066 | Train: 0.8715 / Test: 0.7970 |
Problems encountered
Even after tuning, XGBoost is no better than its default parameters. Partly this is because the tuning was not fine-grained, but is it also because the score is averaged over 5 folds? (GridSearchCV's best_score_ is the mean AUC over the five validation folds, a different estimate than the AUC on the single held-out test set, so the two numbers are not directly comparable; a fairer comparison is sketched below.)
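To compare on equal footing, the same 5-fold CV AUC can be computed for the default XGBClassifier on the training set; an illustrative sketch:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
# Mean 5-fold CV AUC of the untuned model, directly comparable to the
# best_score_ values from the grid searches above
scores = cross_val_score(XGBClassifier(), X_train, y_train, scoring='roc_auc', cv=5)
print('default XGB CV AUC: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))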
More
The code is available on GitHub: https://github.com/libihan/Exercise-ML