文章目录
0 简介
今天学长向大家分享一个毕业设计项目
毕业设计 基于机器学习的乳腺癌数据分析
项目运行效果:
毕业设计 机器学习乳腺数据挖掘分析
项目获取:
https://gitee.com/sinonfin/algorithm-sharing
模型评估
1. 机器学习常用分类模型:
1.最近邻 (KNN Classifier)
2.Logistic回归 (Logistic Regression Classifier)
3.高斯朴素贝叶斯(GaussianNB)
4.多项分布朴素贝叶斯(Multinomial Naive Bayes Classifier )
5.决策树(Decision Tree Classifier)
6.集成算法(Ensemble methods)
- 梯度提升决策树(GBDT(Gradient Boosting Decision Tree) Classifier)
- 自适应推举算法(AdaBoost)(AdaBoost Classifier)
- 随机森林 (Random Forest Classifier)
- Bagging
7.支持向量机(SVM Classifier)
2.分类模型的评估:
-
模型评估指标
准确率,精确率和召回率,F1分数,均方误差、根均方误差、绝对百分比误差,ROC曲线 -
模型评估方法
Holdout检验,交叉验证,自助法,超参数调优 -
优化过拟合与欠拟合
- 降低过拟合风险的方法:
(1).
从数据入手,获得更多的训练数据。使用更多的训练数据是解决过拟合问题最有效的手段,因为更多的样本能够让模型学习到更多更有效的特征,减少噪音的影响,当然,直接增加实验数据一般是很困难的,但是可以通过一定的规则来扩充训练数据。比如,在图像分类的问题上,可以通过图像的平移、旋转、缩放等方式扩充数据;更进一步地,可以使用生成式对抗网络来合成大量的新训练数据
(2).
降低模型复杂度。在数据较少时,模型过于复杂是产生过拟合的主要因素,适当降低模型复杂度可以避免拟合过多的采样噪音。例如,在神经网络中减少网络层数、神经元个数等;在决策树模型中降低树的深度、进行剪枝等
(3). 正则化方法
(4). 集成学习方法。集成学习是把多个模型集成在一起,来降低单一模型的过拟合风险
* 降低欠拟合风险方法
(1).添加新特征。当特征不足或现有特征与样本标签的相关性不强时,模型容易出现不拟合,通过挖掘’上下文特征’‘ID类特征’'组合特征’等新的特征,往往能够取得更好的效果,在深度学习的潮流中,有很多类型可以帮组完成特征工程,如因子分解机
(2).增加模型复杂度。简单模型的学习能力较差,通过增加模型的复杂度可以使模型拥有更强的拟合能力,例如,在线性模型中添加高次项,在神经网络模型中增加网络层数或神经元个数等
(3). 减少正则化系数。正则化是用来防止过拟合的,但当模型出现欠拟合现象时,则需要针对性地减少正则化系数
1. 导入扩展库
import time
from sklearn import metrics
import pickle as pickle
import pandas as pd
from sklearn import tree
from sklearn.tree import export_graphviz
import graphviz
from IPython.display import Image
import pydotplus
import os
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from common.utils import plot_learning_curve
from common.utils import plot_param_curve
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import ShuffleSplit
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei'] #设置简黑字体
mpl.rcParams['axes.unicode_minus'] = False # 解决‘-’bug
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
2. 准备训练数据
cancer = load_breast_cancer() #载入数据
df = pd.DataFrame(cancer.data,columns=cancer.feature_names)
df['target'] = cancer.target
x = cancer.data
y = cancer.target
print('data:',x.shape)
print('target:',y.shape)
# 打印前五行数据
df.head()
data: (569, 30)
target: (569,)
| mean radius| mean texture| mean perimeter| mean area| mean smoothness| mean
compactness| mean concavity| mean concave points| mean symmetry| mean fractal
dimension| …| worst texture| worst perimeter| worst area| worst smoothness|
worst compactness| worst concavity| worst concave points| worst symmetry|
worst fractal dimension| target
—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—|—
0| 17.99| 10.38| 122.80| 1001.0| 0.11840| 0.27760| 0.3001| 0.14710| 0.2419|
0.07871| …| 17.33| 184.60| 2019.0| 0.1622| 0.6656| 0.7119| 0.2654| 0.4601|
0.11890| 0
1| 20.57| 17.77| 132.90| 1326.0| 0.08474| 0.07864| 0.0869| 0.07017| 0.1812|
0.05667| …| 23.41| 158.80| 1956.0| 0.1238| 0.1866| 0.2416| 0.1860| 0.2750|
0.08902| 0
2| 19.69| 21.25| 130.00| 1203.0| 0.10960| 0.15990| 0.1974| 0.12790| 0.2069|
0.05999| …| 25.53| 152.50| 1709.0| 0.1444| 0.4245| 0.4504| 0.2430| 0.3613|
0.08758| 0
3| 11.42| 20.38| 77.58| 386.1| 0.14250| 0.28390| 0.2414| 0.10520| 0.2597|
0.09744| …| 26.50| 98.87| 567.7| 0.2098| 0.8663| 0.6869| 0.2575| 0.6638|
0.17300| 0
4| 20.29| 14.34| 135.10| 1297.0| 0.10030| 0.13280| 0.1980| 0.10430| 0.1809|
0.05883| …| 16.67| 152.20| 1575.0| 0.1374| 0.2050| 0.4000| 0.1625| 0.2364|
0.07678| 0
5 rows × 31 columns
# 查看数据描述
df.info()
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 target 569 non-null int32
dtypes: float64(30), int32(1)
memory usage: 135.7 KB
数据未包含空值
# 打印数据类别及每种类别的个数
df['target'].value_counts()
1 357
0 212
Name: target, dtype: int64
# 查看对数值属性的概括
df.describe()
# 画出数据分布直方图
df.hist(bins=50,figsize=(20,15))
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.33)
训练集
df_train = pd.DataFrame(x_train,columns=cancer.feature_names)
df_train['target'] = y_train
df_train
测试集
df_test = pd.DataFrame(x_test,columns=cancer.feature_names)
df_test['target'] = y_test
df_test
3.创建模型
# Multinomial Naive Bayes Classifier
def mul_naive_bayes_classifier(train_x, train_y):
model = MultinomialNB(alpha=0.01)
model.fit(train_x, train_y)
return model
def naive_bayes_classifier(train_x, train_y):
model = GaussianNB(priors=None)
model.fit(train_x, train_y)
return model
# KNN Classifier
def knn_classifier(train_x, train_y):
model = KNeighborsClassifier()
model.fit(train_x, train_y)
return model
# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
model = LogisticRegression(penalty='l2')
model.fit(train_x, train_y)
return model
# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
model = RandomForestClassifier(n_estimators=8)
model.fit(train_x, train_y)
return model
# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
model = DecisionTreeClassifier()
model.fit(train_x, train_y)
return model
# GBDT(Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
model = GradientBoostingClassifier(n_estimators=200)
model.fit(train_x, train_y)
return model
# SVM Classifier
def svm_classifier(train_x, train_y):
model = SVC(kernel='rbf', probability=True)
model.fit(train_x, train_y)
return model
def adaboost_classifier(train_x, train_y):
model = AdaBoostClassifier(DecisionTreeClassifier(),algorithm="SAMME", n_estimators=7, learning_rate=0.4)
model.fit(train_x, train_y)
return model
def bagging_classifier(train_x, train_y):
model = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True)
model.fit(train_x,train_y)
return model
4.测试模型
test_classifiers = ['NB(高斯朴素贝叶斯)','MNB(多项式分布朴素贝叶斯)', 'KNN(最近邻)', 'LR(Logistic回归)', 'RF(随机森林)', 'DT(决策树)', 'SVM(支持向量机)', 'GBDT(梯度提升决策树)','Adaboost','Bagging']
classifiers = {
'GBDT(梯度提升决策树)':gradient_boosting_classifier,
'Adaboost':adaboost_classifier,
'Bagging':bagging_classifier,
'NB(高斯朴素贝叶斯)':naive_bayes_classifier,
'MNB(多项式分布朴素贝叶斯)':mul_naive_bayes_classifier,
'KNN(最近邻)':knn_classifier,
'LR(Logistic回归)':logistic_regression_classifier,
'RF(随机森林)':random_forest_classifier,
'DT(决策树)':decision_tree_classifier,
'SVM(支持向量机)':svm_classifier
}
for classifier in test_classifiers:
print('******************* %s ********************' % classifier)
start_time = time.time()
model = classifiers[classifier](x_train, y_train)
print(model)
print('training took %fs!' % (time.time() - start_time))
predict = model.predict(x_test)
# if model_save_file != None:
# model_save[classifier] = model )
score = metrics.precision_score(y_test, predict)
recall = metrics.recall_score(y_test, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * score, 100 * recall))
accuracy = metrics.accuracy_score(y_test, predict)
print('accuracy: %.2f%%' % (100 * accuracy))
c_matrix = confusion_matrix(
y_test, # array, Gound true (correct) target values
predict, # array, Estimated targets as returned by a classifier
labels=[0,1], # array, List of labels to index the matrix.
sample_weight=None # array-like of shape = [n_samples], Optional sample weights
)
print('\nclassification_report:')
print(classification_report( y_test,predict,labels=[0,1]))
print('\nconfusion_matrix:')
print(c_matrix)
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
title = classifier+' Learning Curves'
start = time.clock()
plot_learning_curve(plt, model,title,cancer.data, cancer.target, ylim=(0.5, 1.01), cv=cv)
print('elaspe: {0:.6f}'.format(time.clock()-start))
curve1 = plot_roc_curve(model, x_train, y_train, alpha=0.8,name=classifier)
curve1.figure_.suptitle("乳腺癌 ROC")
#画出决策树
if classifier == 'DT(决策树)':
dot_data = export_graphviz(model,
out_file = None,
# feature_names = iris_feature_name,
# class_names = iris_target_name,
filled=True,
rounded=True
)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))
plt.show()
print()
******************* NB(高斯朴素贝叶斯) ********************
GaussianNB(priors=None, var_smoothing=1e-09)
training took 0.004002s!
precision: 91.67%, recall: 94.02%
accuracy: 90.96%
classification_report:
precision recall f1-score support
0 0.90 0.86 0.88 71
1 0.92 0.94 0.93 117
accuracy 0.91 188
macro avg 0.91 0.90 0.90 188
weighted avg 0.91 0.91 0.91 188
confusion_matrix:
[[ 61 10]
[ 7 110]]
elaspe: 0.299883
******************* MNB(多项式分布朴素贝叶斯) ********************
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
training took 0.008931s!
precision: 88.19%, recall: 95.73%
accuracy: 89.36%
classification_report:
precision recall f1-score support
0 0.92 0.79 0.85 71
1 0.88 0.96 0.92 117
accuracy 0.89 188
macro avg 0.90 0.87 0.88 188
weighted avg 0.90 0.89 0.89 188
confusion_matrix:
[[ 56 15]
[ 5 112]]
elaspe: 0.272553
******************* KNN(最近邻) ********************
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
training took 0.006923s!
precision: 93.28%, recall: 94.87%
accuracy: 92.55%
classification_report:
precision recall f1-score support
0 0.91 0.89 0.90 71
1 0.93 0.95 0.94 117
accuracy 0.93 188
macro avg 0.92 0.92 0.92 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 63 8]
[ 6 111]]
elaspe: 1.937058
******************* LR(Logistic回归) ********************
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
training took 0.132035s!
precision: 95.73%, recall: 95.73%
accuracy: 94.68%
classification_report:
precision recall f1-score support
0 0.93 0.93 0.93 71
1 0.96 0.96 0.96 117
accuracy 0.95 188
macro avg 0.94 0.94 0.94 188
weighted avg 0.95 0.95 0.95 188
confusion_matrix:
[[ 66 5]
[ 5 112]]
elaspe: 5.063377
******************* RF(随机森林) ********************
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=8,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
training took 0.044998s!
precision: 94.83%, recall: 94.02%
accuracy: 93.09%
classification_report:
precision recall f1-score support
0 0.90 0.92 0.91 71
1 0.95 0.94 0.94 117
accuracy 0.93 188
macro avg 0.93 0.93 0.93 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 65 6]
[ 7 110]]
elaspe: 1.873387
******************* DT(决策树) ********************
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
training took 0.014005s!
precision: 93.16%, recall: 93.16%
accuracy: 91.49%
classification_report:
precision recall f1-score support
0 0.89 0.89 0.89 71
1 0.93 0.93 0.93 117
accuracy 0.91 188
macro avg 0.91 0.91 0.91 188
weighted avg 0.91 0.91 0.91 188
confusion_matrix:
[[ 63 8]
[ 8 109]]
elaspe: 0.448771
******************* SVM(支持向量机) ********************
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
verbose=False)
training took 0.028140s!
precision: 90.48%, recall: 97.44%
accuracy: 92.02%
classification_report:
precision recall f1-score support
0 0.95 0.83 0.89 71
1 0.90 0.97 0.94 117
accuracy 0.92 188
macro avg 0.93 0.90 0.91 188
weighted avg 0.92 0.92 0.92 188
confusion_matrix:
[[ 59 12]
[ 3 114]]
elaspe: 1.027975
******************* GBDT(梯度提升决策树) ********************
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
training took 0.996242s!
precision: 94.07%, recall: 94.87%
accuracy: 93.09%
classification_report:
precision recall f1-score support
0 0.91 0.90 0.91 71
1 0.94 0.95 0.94 117
accuracy 0.93 188
macro avg 0.93 0.93 0.93 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 64 7]
[ 6 111]]
elaspe: 39.072309
******************* Adaboost ********************
AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
learning_rate=0.4, n_estimators=7, random_state=None)
training took 0.025010s!
precision: 93.22%, recall: 94.02%
accuracy: 92.02%
classification_report:
precision recall f1-score support
0 0.90 0.89 0.89 71
1 0.93 0.94 0.94 117
accuracy 0.92 188
macro avg 0.92 0.91 0.91 188
weighted avg 0.92 0.92 0.92 188
confusion_matrix:
[[ 63 8]
[ 7 110]]
elaspe: 0.960197
******************* Bagging ********************
BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
training took 0.106950s!
precision: 94.02%, recall: 94.02%
accuracy: 92.55%
classification_report:
precision recall f1-score support
0 0.90 0.90 0.90 71
1 0.94 0.94 0.94 117
accuracy 0.93 188
macro avg 0.92 0.92 0.92 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 64 7]
[ 7 110]]
elaspe: 4.000736
当使用默认参数时, GBDT(梯度提升决策树)的准确率和召回率最高,同时耗费的时间也最长;相对的MNB(多项式分布朴素贝叶斯)耗费的时间最短。
5.参数调优
各个分类模型的默认参数
KNN Classifier
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
userscript.html?id=1cfc3476-717c-41b6-b4e7-1a24541c7949:24
Logistic Regression Classifier
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Random Forest Classifier
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=8,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Decision Tree Classifier
max_depth(树的深度)
max_leaf_nodes(叶子结点的数目)
max_features(最大特征数目)
min_samples_leaf(叶子结点的最小样本数)
min_samples_split(中间结点的最小样本树)
min_weight_fraction_leaf(叶子节点的样本权重占总权重的比例)
min_impurity_split(最小不纯净度)也可以调整
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
[sklearn决策树之剪枝参数_数据结构与算法_The Zen of Data Analysis-CSDN博客](https://blog.csdn.net/gracejpw/article/details/102239574)
GBDT(Gradient Boosting Decision Tree) Classifier
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
AdaBoost
AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
learning_rate=0.4, n_estimators=7, random_state=None)
Bagging
BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
SVM
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
verbose=False)
def grid_search(model,param_grid,train_x,train_y,cv=5):
grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs = -1, verbose=1) # cv:交叉验证参数,默认是None, 使用三折交叉验证,指定 fold数量, default = 3
grid_search.fit(train_x, train_y)
best_parameters = grid_search.best_estimator_.get_params()
# for para, val in list(best_parameters.items()):
# print(para, val)
print('最优参数:',best_parameters)
return grid_search.best_estimator_
# 调整参数的字典
common_classifiers = ['KNN(最近邻)', 'LR(Logistic回归)', 'DT(决策树)', 'SVM(支持向量机)' ]
ensem_classifiers = ['RF(随机森林)','GBDT(梯度提升决策树)','Adaboost']
basic_classifiers = {
'KNN(最近邻)':KNeighborsClassifier(),
'LR(Logistic回归)':LogisticRegression(penalty='l2'),
'DT(决策树)': DecisionTreeClassifier() ,
'SVM(支持向量机)': SVC(kernel='rbf', probability=True),
'GBDT(梯度提升决策树)': GradientBoostingClassifier(n_estimators=200),
'RF(随机森林)': RandomForestClassifier(n_estimators=8) ,
'Adaboost': AdaBoostClassifier(DecisionTreeClassifier(),algorithm="SAMME", n_estimators=7, learning_rate=0.4)
}
grid_params = {
'KNN(最近邻)':[
{'weights':['uniform'],'n_neighbors':np.arange(4,8,1)},
{'weights':['distance'],'n_neighbors':np.arange(4,8,1)},
],
'LR(Logistic回归)':[
{'C':[0.01,0.1,1.0,10.0,100.0],'penalty':['l1']},
{'C':[0.01,0.1,1.0,10.0,100.0],'penalty':['l2'],'solver':['liblinear','newton-cg','sag','lbfgs']},
],
'DT(决策树)':[
{'min_samples_split':np.arange(1,15,1),'min_samples_leaf':np.arange(1,15,1),'splitter':['random']},
{'min_samples_split':np.arange(1,15,1),'min_samples_leaf':np.arange(1,15,1),'splitter':['best']},
],
'SVM(支持向量机)':[
{'C': [1e-1, 1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
}
ensem_params = {
'GBDT(梯度提升决策树)':{'n_estimators':np.arange(20,500,50),'max_depth':np.arange(3,14,2), 'min_samples_split':np.arange(2,10,2)},#'min_samples_split':list(range(800,1900,200)), 'min_samples_leaf':list(range(60,101,10))
'RF(随机森林)':{'n_estimators':np.arange(10,71,10),'max_depth':np.arange(3,14,2), 'min_samples_split':np.arange(80,150,20), 'min_samples_leaf':np.arange(10,60,10)},
'Adaboost':{'n_estimators':np.arange(1,11,1),'learning_rate':np.arange(0.1,1,0.1)}
}
from sklearn.metrics import roc_curve, auc, roc_auc_score
for classifier in common_classifiers:
print('******************* %s ********************' % classifier)
start_time = time.time()
model = basic_classifiers[classifier]
clf = grid_search(model,grid_params[classifier],x_train,y_train,cv=5)
print('training took %fs!' % (time.time() - start_time))
print(clf)
clf.fit(x_train,y_train)
predict = clf.predict(x_test)
score = metrics.precision_score(y_test, predict)
recall = metrics.recall_score(y_test, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * score, 100 * recall))
accuracy = metrics.accuracy_score(y_test, predict)
print('accuracy: %.2f%%' % (100 * accuracy))
c_matrix = confusion_matrix(
y_test, # array, Gound true (correct) target values
predict, # array, Estimated targets as returned by a classifier
labels=[0,1], # array, List of labels to index the matrix.
sample_weight=None # array-like of shape = [n_samples], Optional sample weights
)
print('\nclassification_report:')
print(classification_report( y_test,predict,labels=[0,1]))
print('\nconfusion_matrix:')
print(c_matrix)
print()
curve1 = plot_roc_curve(clf, x_train, y_train, alpha=0.8,name=classifier)
curve1.figure_.suptitle("乳腺癌 ROC")
plt.show()
******************* KNN(最近邻) ********************
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 2.7s finished
最优参数: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 6, 'p': 2, 'weights': 'uniform'}
training took 2.806710s!
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=6, p=2,
weights='uniform')
precision: 94.92%, recall: 97.39%
accuracy: 95.21%
classification_report:
precision recall f1-score support
0 0.96 0.92 0.94 73
1 0.95 0.97 0.96 115
accuracy 0.95 188
macro avg 0.95 0.95 0.95 188
weighted avg 0.95 0.95 0.95 188
confusion_matrix:
[[ 67 6]
[ 3 112]]
******************* LR(Logistic回归) ********************
Fitting 5 folds for each of 25 candidates, totalling 125 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed: 3.8s finished
最优参数: {'C': 100.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
training took 3.939845s!
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
precision: 95.83%, recall: 100.00%
accuracy: 97.34%
classification_report:
precision recall f1-score support
0 1.00 0.93 0.96 73
1 0.96 1.00 0.98 115
accuracy 0.97 188
macro avg 0.98 0.97 0.97 188
weighted avg 0.97 0.97 0.97 188
confusion_matrix:
[[ 68 5]
[ 0 115]]
******************* DT(决策树) ********************
Fitting 5 folds for each of 392 candidates, totalling 1960 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 312 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 1960 out of 1960 | elapsed: 4.6s finished
最优参数: {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 4, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': None, 'splitter': 'random'}
training took 4.746069s!
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=4,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='random')
precision: 95.54%, recall: 93.04%
accuracy: 93.09%
classification_report:
precision recall f1-score support
0 0.89 0.93 0.91 73
1 0.96 0.93 0.94 115
accuracy 0.93 188
macro avg 0.93 0.93 0.93 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 68 5]
[ 8 107]]
******************* SVM(支持向量机) ********************
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 7.4min
[Parallel(n_jobs=-1)]: Done 75 out of 75 | elapsed: 10.5min finished
最优参数: {'C': 10, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': True, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
training took 698.401171s!
SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
verbose=False)
precision: 96.58%, recall: 98.26%
accuracy: 96.81%
classification_report:
precision recall f1-score support
0 0.97 0.95 0.96 73
1 0.97 0.98 0.97 115
accuracy 0.97 188
macro avg 0.97 0.96 0.97 188
weighted avg 0.97 0.97 0.97 188
confusion_matrix:
[[ 69 4]
[ 2 113]]
集成学习调参
for classifier in ensem_classifiers:
print('******************* %s ********************' % classifier)
start_time = time.time()
model = basic_classifiers[classifier]
clf = grid_search(model,ensem_params[classifier],x_train,y_train,cv=5)
print('training took %fs!' % (time.time() - start_time))
print(clf)
clf.fit(x_train,y_train)
predict = clf.predict(x_test)
score = metrics.precision_score(y_test, predict)
recall = metrics.recall_score(y_test, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * score, 100 * recall))
accuracy = metrics.accuracy_score(y_test, predict)
print('accuracy: %.2f%%' % (100 * accuracy))
c_matrix = confusion_matrix(
y_test, # array, Gound true (correct) target values
predict, # array, Estimated targets as returned by a classifier
labels=[0,1], # array, List of labels to index the matrix.
sample_weight=None # array-like of shape = [n_samples], Optional sample weights
)
print('\nclassification_report:')
print(classification_report( y_test,predict,labels=[0,1]))
print('\nconfusion_matrix:')
print(c_matrix)
print()
curve1 = plot_roc_curve(clf, x_train, y_train, alpha=0.8,name=classifier)
curve1.figure_.suptitle("乳腺癌 ROC")
plt.show()
******************* RF(随机森林) ********************
Fitting 5 folds for each of 840 candidates, totalling 4200 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 12.4s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 27.4s
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 55.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 2.2min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 3.0min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 3.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 4.8min
[Parallel(n_jobs=-1)]: Done 4200 out of 4200 | elapsed: 5.0min finished
最优参数: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 3, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 10, 'min_samples_split': 120, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 30, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
training took 301.064679s!
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=3, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=10, min_samples_split=120,
min_weight_fraction_leaf=0.0, n_estimators=30,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
precision: 94.17%, recall: 94.17%
accuracy: 92.55%
classification_report:
precision recall f1-score support
0 0.90 0.90 0.90 68
1 0.94 0.94 0.94 120
accuracy 0.93 188
macro avg 0.92 0.92 0.92 188
weighted avg 0.93 0.93 0.93 188
confusion_matrix:
[[ 61 7]
[ 7 113]]
******************* GBDT(梯度提升决策树) ********************
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 11.9s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 3.8min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 5.2min finished
最优参数: {'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 420, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
training took 314.633900s!
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=420,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
precision: 96.75%, recall: 99.17%
accuracy: 97.34%
classification_report:
precision recall f1-score support
0 0.98 0.94 0.96 68
1 0.97 0.99 0.98 120
accuracy 0.97 188
macro avg 0.98 0.97 0.97 188
weighted avg 0.97 0.97 0.97 188
confusion_matrix:
[[ 64 4]
[ 1 119]]
******************* Adaboost ********************
Fitting 5 folds for each of 90 candidates, totalling 450 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 280 tasks | elapsed: 1.4s
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 2.0s finished
最优参数: {'algorithm': 'SAMME', 'base_estimator__ccp_alpha': 0.0, 'base_estimator__class_weight': None, 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': None, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 1, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': None, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best'), 'learning_rate': 0.7000000000000001, 'n_estimators': 3, 'random_state': None}
training took 2.143543s!
AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
learning_rate=0.7000000000000001, n_estimators=3,
random_state=None)
precision: 95.04%, recall: 95.83%
accuracy: 94.15%
classification_report:
precision recall f1-score support
0 0.93 0.91 0.92 68
1 0.95 0.96 0.95 120
accuracy 0.94 188
macro avg 0.94 0.94 0.94 188
weighted avg 0.94 0.94 0.94 188
confusion_matrix:
[[ 62 6]
[ 5 115]]
经过对比发现,通过网格寻优对参数进行调参后,模型的准确率有所上升。
最后
项目获取: