Task 6. Titanic Model Tuning

This post uses GridSearchCV to tune the hyperparameters of logistic regression, SVM, decision tree, and random forest classifiers and find the best model. The decision tree achieved the best result, with a grid-search score of 0.835 on the training set and an accuracy of 0.765 on the test set.

Use GridSearchCV to run a grid search for the optimal hyperparameters.

Imports

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Estimators used below
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

1. Logistic Regression

seed = 0  # random_state must be an int (or None); the original `seed=()` raises a ValueError
clf = LogisticRegression()
# Grid search for the best parameters
# (an empty dict fits only the defaults; add entries such as clf__C=[...] to actually search)
param_grid = dict()
# Build the classification pipeline
pipeline = Pipeline([('clf', clf)])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring='accuracy',
                           cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                                     random_state=seed)).fit(x_train, y_train)
# Report the results
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)

print('-----grid search end------------')
print('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train, cv=3, scoring='accuracy')
print(scores.mean(), scores)
print('on test set')
scores = cross_val_score(grid_search.best_estimator_, x_test, y_test, cv=3, scoring='accuracy')
print(scores.mean(), scores)

Output

Best score: 0.821
Pipeline(memory=None,
     steps=[('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])
-----grid search end------------
on all train set
0.8234856930509104 [0.82692308 0.78846154 0.85507246]
on test set
0.8059508947149396 [0.81111111 0.80898876 0.79775281]
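The `param_grid` above is empty, so the search only ever fits the default LogisticRegression. A minimal sketch of what an actual grid could look like, assuming the pipeline step is named 'clf' (pipeline parameters are addressed as `clf__<param>`) and using synthetic stand-in data in place of the Titanic x_train/y_train:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([('clf', LogisticRegression(solver='liblinear'))])
# Illustrative grid: regularization strength and penalty type
param_grid = dict(clf__C=[0.01, 0.1, 1, 10],
                  clf__penalty=['l1', 'l2'])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=cv).fit(X, y)
print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
```

The grid values here are illustrative choices, not the ones used in this post; `solver='liblinear'` is picked because it supports both l1 and l2 penalties.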

2. SVM

clf = SVC()
# Grid search for the best parameters
param_grid = dict()
# Build the classification pipeline
pipeline = Pipeline([('clf', clf)])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring='accuracy',
                           cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                                     random_state=seed)).fit(x_train, y_train)
# Report the results
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)

print('-----grid search end------------')
print('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train, cv=3, scoring='accuracy')
print(scores.mean(), scores)
print('on test set')
scores = cross_val_score(grid_search.best_estimator_, x_test, y_test, cv=3, scoring='accuracy')
print(scores.mean(), scores)

Output

Best score: 0.787
Pipeline(memory=None,
     steps=[('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
          kernel='rbf', max_iter=-1, probability=False, random_state=None,
          shrinking=True, tol=0.001, verbose=False))])
-----grid search end------------
on all train set
0.7672565960609439 [0.78846154 0.74519231 0.76811594]
on test set
0.727632126508531 [0.72222222 0.74157303 0.71910112]
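The same pattern applies to the SVM. A hypothetical grid over C, the RBF kernel width gamma, and the kernel type (again with synthetic stand-in data, not the Titanic set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([('clf', SVC())])
# Illustrative grid: C trades margin width against training error,
# gamma sets the RBF kernel width (ignored by the linear kernel)
param_grid = dict(clf__C=[0.1, 1, 10],
                  clf__gamma=['scale', 0.01, 0.1],
                  clf__kernel=['rbf', 'linear'])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=cv).fit(X, y)
print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
```

SVMs are sensitive to feature scale, so in practice a scaler step before 'clf' in the pipeline (tuned jointly by the same search) would be worth trying.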

3. Decision Tree

clf = DecisionTreeClassifier()
# Grid search for the best parameters
param_grid = dict()
# Build the classification pipeline
pipeline = Pipeline([('clf', clf)])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring='accuracy',
                           cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                                     random_state=seed)).fit(x_train, y_train)
# Report the results
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)

print('-----grid search end------------')
print('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train, cv=3, scoring='accuracy')
print(scores.mean(), scores)
print('on test set')
scores = cross_val_score(grid_search.best_estimator_, x_test, y_test, cv=3, scoring='accuracy')
print(scores.mean(), scores)

Output

Best score: 0.835
Pipeline(memory=None,
     steps=[('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
          max_features=None, max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, presort=False, random_state=None,
          splitter='best'))])
-----grid search end------------
on all train set
0.8154573888269541 [0.78846154 0.8125 0.84541063]
on test set
0.7649604660840615 [0.75555556 0.75280899 0.78651685]
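For a decision tree, the natural knobs to search are the ones that limit tree growth, since an unconstrained tree tends to overfit. A sketch with hypothetical grid values and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([('clf', DecisionTreeClassifier(random_state=0))])
# Illustrative grid: depth and leaf size both cap model complexity
param_grid = dict(clf__max_depth=[3, 5, None],
                  clf__min_samples_leaf=[1, 5, 10])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=cv).fit(X, y)
print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
```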

4. Random Forest

clf = RandomForestClassifier(n_estimators=100)
# Grid search for the best parameters
param_grid = dict()
# Build the classification pipeline
pipeline = Pipeline([('clf', clf)])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring='accuracy',
                           cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                                     random_state=seed)).fit(x_train, y_train)
# Report the results
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)

print('-----grid search end------------')
print('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train, cv=3, scoring='accuracy')
print(scores.mean(), scores)
print('on test set')
scores = cross_val_score(grid_search.best_estimator_, x_test, y_test, cv=3, scoring='accuracy')
print(scores.mean(), scores)

Output

Best score: 0.810
Pipeline(memory=None,
     steps=[('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
          max_depth=None, max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
          oob_score=False, random_state=None, verbose=0,
          warm_start=False))])
-----grid search end------------
on all train set
0.8058420042115694 [0.76923077 0.80288462 0.84541063]
on test set
0.7835622138992925 [0.78888889 0.76404494 0.79775281]
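For the random forest, the search usually covers the forest size and the per-tree constraints. A sketch with hypothetical grid values and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([('clf', RandomForestClassifier(random_state=0))])
# Illustrative grid: number of trees, tree depth, features considered per split
param_grid = dict(clf__n_estimators=[50, 100],
                  clf__max_depth=[5, None],
                  clf__max_features=['sqrt', None])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=cv).fit(X, y)
print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
```

Note that `max_features='sqrt'` replaces the deprecated `'auto'` seen in the output above; on larger grids, passing `n_jobs=-1` to GridSearchCV parallelizes the fits.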
(Running XGBoost still crashes the server; I'll look into it after this task is finished.)

References:
https://www.jianshu.com/p/c4e24a6a9633
https://blog.csdn.net/qq_39422642/article/details/78566763
