Data Mining Project (Part 5)

This document covers the fifth part of the data mining project and walks through model tuning on the loan data. It draws on related CSDN articles and the loan.ipynb notebook in the Datawhale_data_mining project on GitHub, and builds on the earlier steps of data preprocessing, feature engineering, and model training.

Objective: model tuning
Use grid search to tune the five models (using 5-fold cross-validation during tuning), evaluate each model, and show the output of the code.

Grid search is a hyperparameter-tuning technique based on exhaustive search: loop over every candidate combination of parameter values, evaluate each one, and keep the combination that performs best. In principle it works like finding the maximum value in an array. (Why "grid" search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, all the combinations can be laid out as a 3x4 table. Each cell is one grid point, and the loop visits and evaluates every cell, hence the name grid search.)
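The exhaustive enumeration described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the `score` function is a hypothetical stand-in for "fit the model and cross-validate" on each grid cell.

```python
from itertools import product

# Hypothetical candidate values for two parameters, matching the 3x4 example above.
param_grid = {'a': [0.1, 1.0, 10.0], 'b': [1, 2, 3, 4]}

def score(a, b):
    # Stand-in for cross-validated model performance; peaks at a=1.0, b=2.
    return -((a - 1.0) ** 2) - ((b - 2) ** 2)

# Exhaustively evaluate every cell of the grid and keep the best-scoring one.
best_params, best_score = None, float('-inf')
for a, b in product(param_grid['a'], param_grid['b']):
    s = score(a, b)
    if s > best_score:
        best_params, best_score = {'a': a, 'b': b}, s

print(best_params)  # {'a': 1.0, 'b': 2}
```

`GridSearchCV`, used below, does exactly this enumeration but replaces the toy `score` with k-fold cross-validation and refits the best model automatically.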
```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Shared settings: 5-fold cross-validation, accuracy scoring, use all CPU cores
cv = 5
scoring = 'accuracy'
n_jobs = -1

# Logistic regression tuning
# (liblinear supports both the 'l1' and 'l2' penalties)
model = LogisticRegression(solver='liblinear')
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.0001, 0.001, 0.01, 0.1, 1.0]}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# SVM tuning
model = SVC()
param_grid = {'C': [0.001, 0.01, 0.1, 1.0]}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# Decision tree tuning
model = DecisionTreeClassifier()
param_grid = {'max_depth': range(1, 15, 2), 'min_samples_split': range(50, 200, 20)}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# Random forest tuning
model = RandomForestClassifier()
param_grid = {'n_estimators': [5, 10, 25, 50, 100],
              'criterion': ['gini', 'entropy'],
              'max_features': [1, 2, 3, 4],
              'warm_start': [True, False]
              }
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')

# XGBoost tuning
model = XGBClassifier()
param_grid = {'booster': ['gbtree', 'gblinear'], 'max_depth': range(3, 6)}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring, n_jobs=n_jobs, return_train_score=True)
grid_search.fit(X_selected, y_train)
train_score = grid_search.best_score_
test_score = grid_search.score(X_test_selected, y_test)
best_params = grid_search.best_params_
print(f'train score: {train_score:.2f}, test score: {test_score:.2f}, best params: {best_params}')
```
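The five tuning blocks above repeat the same fit/score/report pattern, which can be factored into one helper. The sketch below is illustrative only: since `X_selected`, `X_test_selected`, `y_train`, and `y_test` come from the earlier parts of the project, a synthetic dataset stands in for the loan data here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

def tune(model, param_grid, X_train, y_train, X_test, y_test,
         cv=5, scoring='accuracy', n_jobs=-1):
    """Grid-search one model with k-fold CV and report train/test scores."""
    gs = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring=scoring,
                      n_jobs=n_jobs, return_train_score=True)
    gs.fit(X_train, y_train)
    print(f'train score: {gs.best_score_:.2f}, '
          f'test score: {gs.score(X_test, y_test):.2f}, '
          f'best params: {gs.best_params_}')
    return gs

# Synthetic stand-in data; in the project this would be X_selected etc.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gs = tune(LogisticRegression(solver='liblinear'),
          {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0]},
          X_tr, y_tr, X_te, y_te)
```

Each of the five models then becomes a one-line `tune(...)` call, which keeps the parameter grids and the reporting logic in one place.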

References
https://blog.csdn.net/weixin_43891494/article/details/88384562
https://blog.csdn.net/chen19830/article/details/88375795
https://github.com/highroom/Datawhale_data_mining/blob/master/task5/loan.ipynb
