Data description
We use the same data as before, data_all.csv.
1 Task
- Use grid search to tune 7 models (with 5-fold cross-validation during tuning), and evaluate each model.
2 K-fold cross-validation & grid search & GridSearchCV
- K-fold cross-validation: the initial sample (X, y) is split into K folds. One fold is held out as validation data (test set) and the other K-1 folds are used for training (train set). The procedure is repeated K times so that each fold serves as validation exactly once; the K results are then averaged (or combined in some other way) to yield a single estimate.
- Grid search: a hyper-parameter tuning technique based on exhaustive search. Every candidate parameter combination is tried in a loop, and the best-performing combination is the final answer; the principle is like finding the maximum in an array. (Why "grid"? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, all combinations form a 3×4 table. Each cell is one grid point, and the loop walks through every cell, hence "grid search".)
- GridSearchCV: put simply, you list the candidate values for the parameters you want to tune, and the program exhaustively runs the model with every combination for you. Besides carrying out the cross-validation itself, GridSearchCV also returns the best hyper-parameters and the corresponding best model.
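Put together, the idea above can be sketched in a few lines. This uses synthetic make_classification data purely for illustration (not the data_all.csv used later), and tunes only one parameter:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 3 candidate values of C -> a 3-cell "grid"; cv=5 means each candidate
# is scored by 5-fold cross-validation on the training data.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # best hyper-parameters found
print(grid.best_score_)   # mean CV accuracy of the best candidate
```

After fitting, the `grid` object itself behaves like the refitted best model, so `grid.predict(...)` uses the winning parameters directly.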
3 Model performance before and after tuning
Model | AUC (default parameters) | AUC (after tuning) | Parameter grid searched |
---|---|---|---|
Logistic Regression | 0.7657 | 0.7797 | {'C': [0.1, 1, 2, 3], 'penalty': ['l1', 'l2']} |
SVM | 0.7678 | 0.7754 | {'C': [0.1, 1, 2, 3], 'kernel': ['linear', 'poly', 'rbf']} |
Decision Tree | 0.5956 | 0.7015 | {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 6], 'splitter': ['best', 'random'], 'max_features': ['log2', 'sqrt', 'auto']} |
Random Forest | 0.7199 | 0.7542 | {'n_estimators': range(1, 200), 'max_features': ['log2', 'sqrt', 'auto']} |
GBDT | 0.7633 | 0.7673 | {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)} |
XGBoost | 0.7709 | 0.7730 | {'eta': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6), 'min_child_weight': range(1, 6)} |
LightGBM | 0.7657 | 0.7750 | {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6), 'n_estimators': range(30, 50, 5)} |
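The "default parameters" column of the table is obtained by fitting each model as-is and scoring AUC on the held-out set. A minimal sketch of that baseline measurement, on synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)

# Fit with default hyper-parameters, then score AUC on the held-out set
clf = DecisionTreeClassifier(random_state=2018).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print('default-parameter AUC:', auc)
```

The same loop over the other six classifiers gives the full baseline column.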
4 Complete code with comments
import pandas as pd
import numpy as np  # needed for np.arange in the parameter grids below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')
# Load the data
data_all = pd.read_csv('data_all.csv')
print('data shape:', data_all.shape)
# Split into train and test sets
X = data_all.drop(['status'], axis=1)
y = data_all['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features (z-score scaling)
sc = StandardScaler()
sc.fit(X_train)  # estimate each feature's mean and standard deviation
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Grid-search helper with 5-fold cross-validation
def gridsearch(model, parameters):
    # model selection uses accuracy; AUC is additionally reported on the test set
    grid = GridSearchCV(model, parameters, scoring='accuracy', cv=5)
    grid = grid.fit(X_train_std, y_train)
    # use decision_function where available (e.g. SVC); otherwise predicted probabilities
    if hasattr(model, 'decision_function'):
        y_predict_pro = grid.decision_function(X_test_std)
    else:
        y_predict_pro = grid.predict_proba(X_test_std)[:, 1]
    print('best score:', grid.best_score_)
    print(grid.best_params_)
    print('test score:', grid.score(X_test_std, y_test))
    print('AUC:', metrics.roc_auc_score(y_test, y_predict_pro))
# Logistic regression
print('Logistic Regression:')
# C: positive float, default 1.0; inverse of regularization strength (smaller = stronger regularization)
# penalty: 'l1' or 'l2', default 'l2'; type of regularization
parameters = {'C': [0.1, 1, 2, 3], 'penalty': ['l1', 'l2']}
lr = LogisticRegression(solver='liblinear', random_state=2018)  # liblinear supports both l1 and l2
lr.fit(X_train_std, y_train)
gridsearch(lr, parameters)
print('')
#SVM
print('SVM:')
parameters = {'C':[0.1,1,2,3],'kernel':['linear','poly','rbf']}
svc = SVC(random_state=2018)
svc.fit(X_train_std,y_train)
gridsearch(svc,parameters)
print('')
# Decision tree
print('Decision Tree:')
parameters = {'criterion': ['gini', 'entropy'], 'max_depth': [1,2,3,4,5,6], 'splitter': ['best', 'random'],
'max_features': ['log2', 'sqrt', 'auto']}
clf = DecisionTreeClassifier(random_state=2018)
clf.fit(X_train_std,y_train)
gridsearch(clf,parameters)
print('')
# Random forest
print('Random Forest:')
parameters = {'n_estimators': range(1,200), 'max_features': ['log2', 'sqrt', 'auto']}
rfc = RandomForestClassifier(random_state=2018)
rfc.fit(X_train_std,y_train)
gridsearch(rfc,parameters)
print('')
#GBDT
print('GBDT:')
parameters = {'n_estimators': range(1,100,10),'learning_rate': np.arange(0.1, 1, 0.1)}
gbdt = GradientBoostingClassifier(random_state=2018)
gbdt.fit(X_train_std,y_train)
gridsearch(gbdt,parameters)
print('')
#XGBoost
print('XGBoost:')
parameters = {'eta': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1,6,1), 'min_child_weight': range(1,6,1)}
xgbs = XGBClassifier(random_state=2018)
xgbs.fit(X_train_std,y_train)
gridsearch(xgbs,parameters)
print('')
#LightGBM
print('LightGBM:')
parameters = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6), 'n_estimators': range(30, 50, 5)}
lgbm = LGBMClassifier(random_state=2018)
lgbm.fit(X_train_std,y_train)
gridsearch(lgbm,parameters)
print('')
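Note that the helper above selects hyper-parameters by accuracy while the table reports AUC; GridSearchCV can also select by AUC directly via scoring='roc_auc', so the chosen model matches the reported metric. A sketch of that variation on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=2018)

# scoring='roc_auc' makes cross-validation rank candidates by AUC
# instead of accuracy, so best_score_ is a mean CV AUC.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.1, 1, 10]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)
print('best CV AUC:', grid.best_score_)
```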
5 Code output
Logistic Regression:
best score: 0.7929065223925459
{'C': 0.1, 'penalty': 'l1'}
test score: 0.7890679747722494
AUC: 0.779722596058548
SVM:
best score: 0.7956116621581004
{'C': 2, 'kernel': 'linear'}
test score: 0.7778556412053259
AUC: 0.7754399966615547
Decision Tree:
best score: 0.7646528403967539
{'criterion': 'gini', 'max_depth': 3, 'max_features': 'log2', 'splitter': 'best'}
test score: 0.7491240364400841
AUC: 0.701477783689608
Random Forest:
best score: 0.8004207995190863
{'max_features': 'sqrt', 'n_estimators': 185}
test score: 0.7820602662929222
AUC: 0.7541613199378213
GBDT:
best score: 0.7965133754132853
{'learning_rate': 0.1, 'n_estimators': 41}
test score: 0.7876664330763841
AUC: 0.7673247055386894
XGBoost:
best score: 0.7968139464983469
{'eta': 0.1, 'max_depth': 3, 'min_child_weight': 4}
test score: 0.7883672039243167
AUC: 0.7729883258739945
LightGBM:
best score: 0.7971145175834085
{'learning_rate': 0.2, 'max_depth': 1, 'n_estimators': 45}
test score: 0.7813594954449895
AUC: 0.774961399225898
6 Conclusion
After tuning, the AUC improved for every model. The decision tree gained the most (from 0.5956 to 0.7015), while the boosted models, which start from strong defaults, improved only slightly.