Data Mining Project (Part 5)

This document covers the fifth part of the data mining project and walks through model tuning on the loan data. It draws on related CSDN articles and the loan.ipynb notebook in the Datawhale_data_mining project on GitHub, and builds on the earlier steps of data preprocessing, feature engineering, and model training.

Objective: model tuning
Use grid search to tune the five models (using 5-fold cross-validation during tuning), evaluate each model, and show the output of the code.

Grid search is a hyperparameter-tuning technique based on exhaustive search: loop over every candidate combination of parameter values, evaluate each one, and keep the combination that performs best. In principle it works like finding the maximum value in an array. (Why "grid" search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, all the combinations can be laid out as a 3x4 table. Each cell is one grid point, and the loop visits and evaluates every cell, hence the name grid search.)
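The exhaustive enumeration described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the `score` function is a hypothetical stand-in for "fit the model and cross-validate" on each grid cell.

```python
from itertools import product

# Hypothetical candidate values for two parameters, matching the 3x4 example above.
param_grid = {'a': [0.1, 1.0, 10.0], 'b': [1, 2, 3, 4]}

def score(a, b):
    # Stand-in for cross-validated model performance; peaks at a=1.0, b=2.
    return -((a - 1.0) ** 2) - ((b - 2) ** 2)

# Exhaustively evaluate every cell of the grid and keep the best-scoring one.
best_params, best_score = None, float('-inf')
for a, b in product(param_grid['a'], param_grid['b']):
    s = score(a, b)
    if s > best_score:
        best_params, best_score = {'a': a, 'b': b}, s

print(best_params)  # {'a': 1.0, 'b': 2}
```

`GridSearchCV`, used below, does exactly this enumeration but replaces the toy `score` with k-fold cross-validation and refits the best model automatically.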
```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Shared settings: 5-fold cross-validation, accuracy scoring, use all CPU cores
cv = 5
scoring = 'accuracy'
n_jobs = -1

# Logistic regression tuning
# (liblinear supports both the 'l1' and 'l2' penalties)
model = LogisticRegression(solver='liblinear')
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.0001, 0.001, 0.01, 0.1, 1.0]}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# SVM tuning
model = SVC()
param_grid = {'C': [0.001, 0.01, 0.1, 1.0]}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# Decision tree tuning
model = DecisionTreeClassifier()
param_grid = {'max_depth': range(1, 15, 2), 'min_samples_split': range(50, 200, 20)}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# Random forest tuning
model = RandomForestClassifier()
param_grid = {'n_estimators': [5, 10, 25, 50, 100],
              'criterion': ['gini', 'entropy'],
              'max_features': [1, 2, 3, 4],
              'warm_start': [True, False]
              }
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# XGBoost tuning
model = XGBClassifier()
param_grid = {'booster': ['gbtree', 'gblinear'], 'max_depth': range(3, 6)}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')
```
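The five tuning blocks above repeat the same fit/score/report pattern, which can be factored into one helper. The sketch below is illustrative only: since `X_selected`, `X_test_selected`, `y_train`, and `y_test` come from the earlier parts of the project, a synthetic dataset stands in for the loan data here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

def tune(model, param_grid, X_train, y_train, X_test, y_test,
         cv=5, scoring='accuracy', n_jobs=-1):
    """Grid-search one model with k-fold CV and report train/test scores."""
    gs = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring,
                      n_jobs=n_jobs, return_train_score=True)
    gs.fit(X_train, y_train)
    print(f'train score: {gs.best_score_:.2f}, '
          f'test score: {gs.score(X_test, y_test):.2f}, '
          f'best params: {gs.best_params_}')
    return gs

# Synthetic stand-in data; in the project this would be X_selected etc.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gs = tune(LogisticRegression(solver='liblinear'),
          {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0]},
          X_tr, y_tr, X_te, y_te)
```

Each of the five models then becomes a one-line `tune(...)` call, which keeps the parameter grids and the reporting logic in one place.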

References
https://blog.csdn.net/weixin_43891494/article/details/88384562
https://blog.csdn.net/chen19830/article/details/88375795
https://github.com/highroom/Datawhale_data_mining/blob/master/task5/loan.ipynb
