Information Management Graduation Project Showcase (with Algorithms): Breast Cancer Data Analysis Based on Machine Learning

Article link: https://blog.csdn.net/noopier/article/details/135368425

Contents

- 0 Introduction
- Model evaluation
  - KNN Classifier
  - Logistic Regression Classifier
  - Random Forest Classifier
  - Decision Tree Classifier
  - GBDT(Gradient Boosting Decision Tree) Classifier
  - AdaBoost
  - Bagging
  - SVM
- Closing remarks

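0 Introduction

Today I am sharing a graduation project with you:

Graduation project: breast cancer data analysis based on machine learning

Project demo:

Graduation project: machine learning data mining and analysis of breast cancer data

Project source: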

https://gitee.com/sinonfin/algorithm-sharing

Model evaluation

1. Common machine learning classification models:

1. k-nearest neighbors (KNN Classifier)

2. Logistic regression (Logistic Regression Classifier)

3. Gaussian naive Bayes (GaussianNB)

4. Multinomial naive Bayes (Multinomial Naive Bayes Classifier)

5. Decision tree (Decision Tree Classifier)

6. Ensemble methods

Gradient boosting decision tree (GBDT (Gradient Boosting Decision Tree) Classifier)
Adaptive boosting (AdaBoost Classifier)
Random forest (Random Forest Classifier)
Bagging

7. Support vector machine (SVM Classifier)

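2. Evaluating classification models:

Evaluation metrics
Accuracy, precision and recall, F1 score, mean squared error, root mean squared error, mean absolute percentage error, ROC curve
Evaluation methods
Holdout validation, cross-validation, bootstrapping, hyperparameter tuning
Addressing overfitting and underfitting
- Ways to reduce the risk of overfitting:

(1). Start from the data and obtain more training data. Using more training data is the most effective remedy for overfitting, because more samples let the model learn more, and more useful, features while reducing the influence of noise. Collecting additional experimental data directly is usually difficult, but the training set can be expanded by applying fixed rules. In image classification, for example, the data can be augmented by translating, rotating, and scaling images; going further, generative adversarial networks can synthesize large amounts of new training data.

(2). Reduce model complexity. When data is scarce, an overly complex model is the main cause of overfitting; lowering the complexity appropriately keeps the model from fitting too much sampling noise. For example, use fewer layers or neurons in a neural network, or limit the depth and apply pruning in a decision tree.

(3). Regularization.

(4). Ensemble learning. Ensemble learning combines multiple models to reduce the overfitting risk of any single model.

- Ways to reduce the risk of underfitting:

(1). Add new features. When there are too few features, or the existing features correlate only weakly with the labels, the model tends to underfit. Mining new features such as "context features", "ID features", and "combined features" often gives better results. In deep learning, many model types can help with feature engineering, for example factorization machines.

(2). Increase model complexity. Simple models have limited learning capacity; a more complex model can fit the data better. For example, add higher-order terms to a linear model, or add layers or neurons to a neural network.

(3). Reduce the regularization coefficient. Regularization is meant to prevent overfitting, so when the model underfits, the regularization coefficient should be reduced accordingly.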
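The effect of model complexity described above can be sketched with a decision tree whose depth is varied. This is only an illustration under assumptions that are not part of the original project: the depths tried, `random_state=0`, and the stratified split are all chosen here for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# max_depth=1 is a very simple model; max_depth=None lets the tree grow fully.
scores = {}
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A training score far above the test score hints at overfitting, while low scores on both hint at underfitting.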

1. Import libraries

import time 
from sklearn import metrics 
import pickle as pickle 
import pandas as pd
from sklearn import tree
from sklearn.tree import export_graphviz
import graphviz
from IPython.display import Image  
import pydotplus
import os

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB 
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC 
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from common.utils import plot_learning_curve
from common.utils import plot_param_curve

from sklearn.metrics import roc_curve, auc
from sklearn.metrics import plot_roc_curve  # available up to scikit-learn 1.1; later releases use RocCurveDisplay.from_estimator
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import ShuffleSplit

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']   # use the SimHei font so Chinese labels render
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign rendering correctly with this font
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
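The imports above pull `plot_learning_curve` from a local `common.utils` module that the post does not show. A minimal sketch of what such a helper might look like follows, assuming it wraps scikit-learn's `learning_curve`; the body, the `train_sizes` default, and the axis labels are guesses made here, and only the call signature is taken from how the function is used later in the post.

```python
import numpy as np
from sklearn.model_selection import learning_curve


def plot_learning_curve(plt, estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """Hypothetical stand-in for common.utils.plot_learning_curve."""
    # Fit the estimator on growing subsets and score each with cross-validation.
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes
    )
    train_mean = train_scores.mean(axis=1)
    test_mean = test_scores.mean(axis=1)

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.grid(True)
    plt.plot(sizes, train_mean, "o-", label="Training score")
    plt.plot(sizes, test_mean, "o-", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
```

The project's real helper may differ, for instance by shading the standard deviation around each curve.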

2. Prepare the training data

cancer = load_breast_cancer() # load the dataset
df = pd.DataFrame(cancer.data,columns=cancer.feature_names)
df['target'] = cancer.target

x = cancer.data
y = cancer.target

print('data:',x.shape)
print('target:',y.shape)

# print the first five rows
df.head()

data: (569, 30)
target: (569,)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | … | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | … | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | … | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | … | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | … | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |

5 rows × 31 columns

# inspect the column summary

df.info()

RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    int32  
dtypes: float64(30), int32(1)
memory usage: 135.7 KB

The data contains no null values.

# print the classes and the number of samples in each
df['target'].value_counts()

1    357
0    212
Name: target, dtype: int64
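The class counts above give a useful reference point: a model that always predicts the majority class. The short sketch below works that baseline out from the printed counts; it is plain arithmetic added here for illustration, not part of the original notebook. In `load_breast_cancer`, label 1 is benign and label 0 is malignant.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 357, 0: 212}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"samples: {total}")
print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier evaluated later should clear this baseline of roughly 63% accuracy to be considered useful.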

# summary statistics for the numeric attributes

df.describe()

# plot a histogram of each feature's distribution
df.hist(bins=50,figsize=(20,15))

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.33)
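The split above is drawn without a seed, so each run produces a different test set; the two result listings later in the post report different class supports (71/117 and 73/115), which is consistent with that. A sketch of a reproducible, stratified variant follows. The `random_state=42` value and the use of `stratify` are assumptions introduced here, not settings from the original project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
print("share of class 1 (train, test):", round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With `test_size=0.33` on 569 samples the test set holds 188 rows, matching the support totals in the reports below.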

Training set

df_train = pd.DataFrame(x_train,columns=cancer.feature_names)
df_train['target'] = y_train

df_train

Test set

df_test = pd.DataFrame(x_test,columns=cancer.feature_names)
df_test['target'] = y_test

df_test

3. Create the models

# Multinomial Naive Bayes Classifier 
def mul_naive_bayes_classifier(train_x, train_y): 
    model = MultinomialNB(alpha=0.01) 
    model.fit(train_x, train_y) 
    return model 

def naive_bayes_classifier(train_x, train_y): 
    model = GaussianNB(priors=None)
    model.fit(train_x, train_y) 
    return model 

# KNN Classifier 
def knn_classifier(train_x, train_y): 
    model = KNeighborsClassifier() 
    model.fit(train_x, train_y) 
    return model 

# Logistic Regression Classifier 
def logistic_regression_classifier(train_x, train_y): 
    model = LogisticRegression(penalty='l2') 
    model.fit(train_x, train_y) 
    return model 
  
# Random Forest Classifier 
def random_forest_classifier(train_x, train_y): 
    model = RandomForestClassifier(n_estimators=8) 
    model.fit(train_x, train_y) 
    return model 
  
# Decision Tree Classifier 
def decision_tree_classifier(train_x, train_y): 
    model = DecisionTreeClassifier() 
    model.fit(train_x, train_y) 
    return model 
  
# GBDT(Gradient Boosting Decision Tree) Classifier 
def gradient_boosting_classifier(train_x, train_y): 
    model = GradientBoostingClassifier(n_estimators=200) 
    model.fit(train_x, train_y) 
    return model

# SVM Classifier 
def svm_classifier(train_x, train_y): 
    model = SVC(kernel='rbf', probability=True) 
    model.fit(train_x, train_y) 
    return model 

def adaboost_classifier(train_x, train_y): 
    model = AdaBoostClassifier(DecisionTreeClassifier(),algorithm="SAMME", n_estimators=7, learning_rate=0.4)
    model.fit(train_x, train_y)
    return model

def bagging_classifier(train_x, train_y): 
    model = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True)
    model.fit(train_x,train_y)
    return model

4. Test the models

test_classifiers = ['NB(高斯朴素贝叶斯)','MNB(多项式分布朴素贝叶斯)', 'KNN(最近邻)', 'LR(Logistic回归)', 'RF(随机森林)', 'DT(决策树)', 'SVM(支持向量机)', 'GBDT(梯度提升决策树)','Adaboost','Bagging'] 
classifiers = {
    'GBDT(梯度提升决策树)':gradient_boosting_classifier,
    'Adaboost':adaboost_classifier,
    'Bagging':bagging_classifier,
    'NB(高斯朴素贝叶斯)':naive_bayes_classifier,  
    'MNB(多项式分布朴素贝叶斯)':mul_naive_bayes_classifier,
    'KNN(最近邻)':knn_classifier,
    'LR(Logistic回归)':logistic_regression_classifier,
    'RF(随机森林)':random_forest_classifier,
    'DT(决策树)':decision_tree_classifier,
    'SVM(支持向量机)':svm_classifier
}

for classifier in test_classifiers:
    print('******************* %s ********************' % classifier)
    start_time = time.time()
    model = classifiers[classifier](x_train, y_train)
    print(model)
    print('training took %fs!' % (time.time() - start_time))
    predict = model.predict(x_test)
#     if model_save_file != None: 
#         model_save[classifier] = model )
    score = metrics.precision_score(y_test, predict) 
    recall = metrics.recall_score(y_test, predict)
    print('precision: %.2f%%, recall: %.2f%%' % (100 * score, 100 * recall)) 
    accuracy = metrics.accuracy_score(y_test, predict) 
    print('accuracy: %.2f%%' % (100 * accuracy))
    c_matrix = confusion_matrix(
        y_test,   # array, ground-truth (correct) target values
        predict,  # array, estimated targets as returned by a classifier
        labels=[0,1],  # array, list of labels to index the matrix
        sample_weight=None  # array-like of shape = [n_samples], optional sample weights
    )
    print('\nclassification_report:')
    print(classification_report( y_test,predict,labels=[0,1]))

    print('\nconfusion_matrix:')
    print(c_matrix)
    
    cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
    title = classifier+' Learning Curves'
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    plot_learning_curve(plt, model,title,cancer.data, cancer.target, ylim=(0.5, 1.01), cv=cv)

    print('elaspe: {0:.6f}'.format(time.perf_counter()-start))
    
    curve1 = plot_roc_curve(model, x_train, y_train,  alpha=0.8,name=classifier)
    curve1.figure_.suptitle("乳腺癌 ROC")
    
    # draw the decision tree
    if classifier == 'DT(决策树)':
        dot_data = export_graphviz(model,
                                out_file = None,
                                # feature_names = iris_feature_name,
                                # class_names = iris_target_name,
                                filled=True,
                                rounded=True
                               )
        graph = pydotplus.graph_from_dot_data(dot_data)
        display(Image(graph.create_png()))
    plt.show()
    print()
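The loop above draws each ROC curve from `x_train` and `y_train`, that is, from data the model has already seen. As a complementary sketch, the AUC can also be computed on the held-out test set with `roc_curve` and `auc`, both of which are imported earlier. The choice of GaussianNB and the seeded split are assumptions made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0, stratify=y
)

model = GaussianNB().fit(x_train, y_train)

# Predicted probability of class 1 for every held-out sample.
test_scores = model.predict_proba(x_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, test_scores)
test_auc = auc(fpr, tpr)
print(f"held-out AUC: {test_auc:.3f}")
```

A held-out AUC is usually the more informative number, since a training-set curve can look nearly perfect for models that memorize their data.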

******************* NB(高斯朴素贝叶斯) ********************
GaussianNB(priors=None, var_smoothing=1e-09)
training took 0.004002s!
precision: 91.67%, recall: 94.02%
accuracy: 90.96%

classification_report:
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        71
           1       0.92      0.94      0.93       117

    accuracy                           0.91       188
   macro avg       0.91      0.90      0.90       188
weighted avg       0.91      0.91      0.91       188

confusion_matrix:
[[ 61  10]
 [  7 110]]
elaspe: 0.299883


******************* MNB(多项式分布朴素贝叶斯) ********************
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
training took 0.008931s!
precision: 88.19%, recall: 95.73%
accuracy: 89.36%

classification_report:
              precision    recall  f1-score   support

           0       0.92      0.79      0.85        71
           1       0.88      0.96      0.92       117

    accuracy                           0.89       188
   macro avg       0.90      0.87      0.88       188
weighted avg       0.90      0.89      0.89       188

confusion_matrix:
[[ 56  15]
 [  5 112]]
elaspe: 0.272553


******************* KNN(最近邻) ********************
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
training took 0.006923s!
precision: 93.28%, recall: 94.87%
accuracy: 92.55%

classification_report:
              precision    recall  f1-score   support

           0       0.91      0.89      0.90        71
           1       0.93      0.95      0.94       117

    accuracy                           0.93       188
   macro avg       0.92      0.92      0.92       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 63   8]
 [  6 111]]
elaspe: 1.937058


******************* LR(Logistic回归) ********************
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
training took 0.132035s!
precision: 95.73%, recall: 95.73%
accuracy: 94.68%

classification_report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        71
           1       0.96      0.96      0.96       117

    accuracy                           0.95       188
   macro avg       0.94      0.94      0.94       188
weighted avg       0.95      0.95      0.95       188

confusion_matrix:
[[ 66   5]
 [  5 112]]
elaspe: 5.063377


******************* RF(随机森林) ********************
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=8,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
training took 0.044998s!
precision: 94.83%, recall: 94.02%
accuracy: 93.09%

classification_report:
              precision    recall  f1-score   support

           0       0.90      0.92      0.91        71
           1       0.95      0.94      0.94       117

    accuracy                           0.93       188
   macro avg       0.93      0.93      0.93       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 65   6]
 [  7 110]]
elaspe: 1.873387


******************* DT(决策树) ********************
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
training took 0.014005s!
precision: 93.16%, recall: 93.16%
accuracy: 91.49%

classification_report:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89        71
           1       0.93      0.93      0.93       117

    accuracy                           0.91       188
   macro avg       0.91      0.91      0.91       188
weighted avg       0.91      0.91      0.91       188

confusion_matrix:
[[ 63   8]
 [  8 109]]
elaspe: 0.448771


******************* SVM(支持向量机) ********************
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
training took 0.028140s!
precision: 90.48%, recall: 97.44%
accuracy: 92.02%

classification_report:
              precision    recall  f1-score   support

           0       0.95      0.83      0.89        71
           1       0.90      0.97      0.94       117

    accuracy                           0.92       188
   macro avg       0.93      0.90      0.91       188
weighted avg       0.92      0.92      0.92       188

confusion_matrix:
[[ 59  12]
 [  3 114]]
elaspe: 1.027975


******************* GBDT(梯度提升决策树) ********************
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
training took 0.996242s!
precision: 94.07%, recall: 94.87%
accuracy: 93.09%

classification_report:
              precision    recall  f1-score   support

           0       0.91      0.90      0.91        71
           1       0.94      0.95      0.94       117

    accuracy                           0.93       188
   macro avg       0.93      0.93      0.93       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 64   7]
 [  6 111]]
elaspe: 39.072309


******************* Adaboost ********************
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                                                         random_state=None,
                                                         splitter='best'),
                   learning_rate=0.4, n_estimators=7, random_state=None)
training took 0.025010s!
precision: 93.22%, recall: 94.02%
accuracy: 92.02%

classification_report:
              precision    recall  f1-score   support

           0       0.90      0.89      0.89        71
           1       0.93      0.94      0.94       117

    accuracy                           0.92       188
   macro avg       0.92      0.91      0.91       188
weighted avg       0.92      0.92      0.92       188

confusion_matrix:
[[ 63   8]
 [  7 110]]
elaspe: 0.960197


******************* Bagging ********************
BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,
                                                        splitter='best'),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)
training took 0.106950s!
precision: 94.02%, recall: 94.02%
accuracy: 92.55%

classification_report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90        71
           1       0.94      0.94      0.94       117

    accuracy                           0.93       188
   macro avg       0.92      0.92      0.92       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 64   7]
 [  7 110]]
elaspe: 4.000736


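With the untuned settings used above, LR (logistic regression) reached the highest accuracy in this run (94.68%) and SVM the highest recall (97.44%). GBDT (gradient boosting decision tree) took the longest, both to train (about 1.0 s) and to produce its learning curve (about 39 s). The naive Bayes models were the fastest: GaussianNB trained in about 0.004 s, and MNB (multinomial naive Bayes) had the shortest learning-curve time (about 0.27 s). Because the train/test split is not seeded, this ranking can shift between runs.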
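The precision, recall, accuracy, and F1 figures in the listings above all follow from the confusion matrix. As a worked check, the sketch below recomputes them by hand from the matrix printed for GaussianNB; the variable names are chosen here for illustration.

```python
# Confusion matrix printed for GaussianNB above, with labels=[0, 1]:
# rows are true classes, columns are predicted classes.
c_matrix = [[61, 10],
            [7, 110]]

tn, fp = c_matrix[0]
fn, tp = c_matrix[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {100 * precision:.2f}%, recall: {100 * recall:.2f}%")
print(f"accuracy: {100 * accuracy:.2f}%")
print(f"f1: {f1:.2f}")
```

The results agree with the reported values for that model: precision 91.67%, recall 94.02%, accuracy 90.96%, and an F1 score of 0.93 for class 1.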

5. Hyperparameter tuning
Default parameters of each classification model

KNN Classifier

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                         metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                         weights='uniform')


Logistic Regression Classifier

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Random Forest Classifier

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=8,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Decision Tree Classifier

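max_depth (maximum depth of the tree)
max_leaf_nodes (maximum number of leaf nodes)
max_features (maximum number of features considered for a split)
min_samples_leaf (minimum number of samples at a leaf node)
min_samples_split (minimum number of samples required to split an internal node)
min_weight_fraction_leaf (minimum fraction of the total sample weight required at a leaf node)
min_impurity_split (impurity threshold below which a node is no longer split) can also be tuned; later scikit-learn releases deprecate it in favor of min_impurity_decrease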

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
                       
[Pruning parameters of sklearn decision trees (Data Structures and Algorithms, The Zen of Data Analysis, CSDN blog)](https://blog.csdn.net/gracejpw/article/details/102239574)
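To see how the parameters listed above constrain a tree, the sketch below compares a default tree with one limited by `max_depth` and `min_samples_leaf`. The specific values (3 and 5) and `random_state=0` are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default settings: the tree grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained settings: at most 3 levels, at least 5 samples per leaf.
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

for name, tree in (("default", full_tree), ("constrained", pruned_tree)):
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

A smaller tree is easier to read and less prone to overfitting, at the cost of a coarser fit to the training data.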

GBDT(Gradient Boosting Decision Tree) Classifier

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

AdaBoost

AdaBoostClassifier(algorithm='SAMME',  
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,  
                                                         class_weight=None,  
                                                         criterion='gini',  
                                                         max_depth=None,  
                                                         max_features=None,  
                                                         max_leaf_nodes=None,  
                                                         min_impurity_decrease=0.0,  
                                                         min_impurity_split=None,  
                                                         min_samples_leaf=1,  
                                                         min_samples_split=2,  
                                                         min_weight_fraction_leaf=0.0,  
                                                         presort='deprecated',  
                                                         random_state=None,  
                                                         splitter='best'),  
                   learning_rate=0.4, n_estimators=7, random_state=None)

Bagging

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,
                                                        splitter='best'),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

SVM

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

def grid_search(model,param_grid,train_x,train_y,cv=5):
    # cv is the number of cross-validation folds and is passed through explicitly
    # (when left unset, scikit-learn 0.22 and later use 5 folds; earlier releases used 3)
    search = GridSearchCV(model, param_grid=param_grid, cv=cv, n_jobs = -1, verbose=1)
    search.fit(train_x, train_y) 
    best_parameters = search.best_estimator_.get_params() 
#     for para, val in list(best_parameters.items()): 
#         print(para, val) 

    print('最优参数:',best_parameters)  # '最优参数' means 'best parameters'
    return search.best_estimator_
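The helper above prints every constructor parameter of the best estimator. A sketch of a related pattern follows: `best_params_` reports only the tuned keys, and `best_score_` gives the mean cross-validated score. The KNN grid mirrors the one used in this post, while the seeded split and the single-process run (no `n_jobs`) are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0, stratify=y
)

param_grid = {"n_neighbors": np.arange(4, 8), "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
search.fit(x_train, y_train)

# best_params_ holds only the tuned keys; best_estimator_.get_params()
# returns every constructor parameter, tuned or not.
print("tuned:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 4))
print("held-out accuracy:", round(search.best_estimator_.score(x_test, y_test), 4))
```

Since `GridSearchCV` refits the best configuration on the full training data by default, `best_estimator_` can be used for prediction directly.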

# dictionaries of the parameters to tune
common_classifiers = ['KNN(最近邻)', 'LR(Logistic回归)',  'DT(决策树)', 'SVM(支持向量机)' ] 
ensem_classifiers = ['RF(随机森林)','GBDT(梯度提升决策树)','Adaboost']
basic_classifiers = {
    'KNN(最近邻)':KNeighborsClassifier(),
    'LR(Logistic回归)':LogisticRegression(penalty='l2'),
    'DT(决策树)': DecisionTreeClassifier() ,
    'SVM(支持向量机)': SVC(kernel='rbf', probability=True),
    'GBDT(梯度提升决策树)': GradientBoostingClassifier(n_estimators=200),
    'RF(随机森林)': RandomForestClassifier(n_estimators=8) ,
    'Adaboost': AdaBoostClassifier(DecisionTreeClassifier(),algorithm="SAMME", n_estimators=7, learning_rate=0.4)
}
grid_params = {
    'KNN(最近邻)':[
        {'weights':['uniform'],'n_neighbors':np.arange(4,8,1)},
        {'weights':['distance'],'n_neighbors':np.arange(4,8,1)},
    ],
    'LR(Logistic回归)':[
        {'C':[0.01,0.1,1.0,10.0,100.0],'penalty':['l1'],'solver':['liblinear']},  # the default lbfgs solver does not support the l1 penalty
        {'C':[0.01,0.1,1.0,10.0,100.0],'penalty':['l2'],'solver':['liblinear','newton-cg','sag','lbfgs']},
    ],
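    'DT(决策树)':[
        # note: scikit-learn rejects min_samples_split=1 (an integer value must be at least 2), so those candidates fail to fit
        {'min_samples_split':np.arange(1,15,1),'min_samples_leaf':np.arange(1,15,1),'splitter':['random']},
        {'min_samples_split':np.arange(1,15,1),'min_samples_leaf':np.arange(1,15,1),'splitter':['best']},
    ],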
    'SVM(支持向量机)':[
      {'C': [1e-1, 1, 10, 100, 1000], 'kernel': ['linear']},
      {'C': [1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]
    
}

ensem_params = {
    'GBDT(梯度提升决策树)':{'n_estimators':np.arange(20,500,50),'max_depth':np.arange(3,14,2), 'min_samples_split':np.arange(2,10,2)},#'min_samples_split':list(range(800,1900,200)), 'min_samples_leaf':list(range(60,101,10))
    'RF(随机森林)':{'n_estimators':np.arange(10,71,10),'max_depth':np.arange(3,14,2), 'min_samples_split':np.arange(80,150,20), 'min_samples_leaf':np.arange(10,60,10)},
    'Adaboost':{'n_estimators':np.arange(1,11,1),'learning_rate':np.arange(0.1,1,0.1)}
}
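The size of each grid decides how many fits a search performs. The sketch below counts the candidates with scikit-learn's `ParameterGrid`, using grids that mirror the shapes of the ones above; the short dictionary keys are names picked for this example. Counting does not fit any model, so solver compatibility plays no role here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
grids = {
    "KNN": [
        {"weights": ["uniform"], "n_neighbors": np.arange(4, 8, 1)},
        {"weights": ["distance"], "n_neighbors": np.arange(4, 8, 1)},
    ],
    "LR": [
        {"C": c_values, "penalty": ["l1"]},
        {"C": c_values, "penalty": ["l2"],
         "solver": ["liblinear", "newton-cg", "sag", "lbfgs"]},
    ],
    "DT": [
        {"min_samples_split": np.arange(1, 15, 1),
         "min_samples_leaf": np.arange(1, 15, 1), "splitter": [s]}
        for s in ("random", "best")
    ],
    "SVM": [
        {"C": [1e-1, 1, 10, 100, 1000], "kernel": ["linear"]},
        {"C": [1e-1, 1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
    ],
}

# A list of dicts is searched as the union of the per-dict Cartesian products.
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, n in n_candidates.items():
    print(f"{name}: {n} candidates, {5 * n} fits with 5-fold cross-validation")
```

The counts for KNN, LR, and DT (8, 25, and 392) match the "Fitting 5 folds for each of ... candidates" lines in the search logs that follow.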

from sklearn.metrics import roc_curve, auc, roc_auc_score

for classifier in common_classifiers:
    print('******************* %s ********************' % classifier)
    start_time = time.time()
    model = basic_classifiers[classifier]
    clf = grid_search(model,grid_params[classifier],x_train,y_train,cv=5) 
    print('training took %fs!' % (time.time() - start_time))
    print(clf)
    clf.fit(x_train,y_train)
    predict = clf.predict(x_test)

    score = metrics.precision_score(y_test, predict) 
    recall = metrics.recall_score(y_test, predict)
    print('precision: %.2f%%, recall: %.2f%%' % (100 * score, 100 * recall)) 
    accuracy = metrics.accuracy_score(y_test, predict) 
    print('accuracy: %.2f%%' % (100 * accuracy))
    c_matrix = confusion_matrix(
        y_test,   # array, ground-truth (correct) target values
        predict,  # array, estimated targets as returned by a classifier
        labels=[0,1],  # array, list of labels to index the matrix
        sample_weight=None  # array-like of shape = [n_samples], optional sample weights
    )
    print('\nclassification_report:')
    print(classification_report( y_test,predict,labels=[0,1]))

    print('\nconfusion_matrix:')
    print(c_matrix)
    print()
    curve1 = plot_roc_curve(clf, x_train, y_train,  alpha=0.8,name=classifier)
    curve1.figure_.suptitle("乳腺癌 ROC")
    plt.show()

******************* KNN(最近邻) ********************
Fitting 5 folds for each of 8 candidates, totalling 40 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    2.7s finished

最优参数: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 6, 'p': 2, 'weights': 'uniform'}
training took 2.806710s!
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')
precision: 94.92%, recall: 97.39%
accuracy: 95.21%

classification_report:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94        73
           1       0.95      0.97      0.96       115

    accuracy                           0.95       188
   macro avg       0.95      0.95      0.95       188
weighted avg       0.95      0.95      0.95       188

confusion_matrix:
[[ 67   6]
 [  3 112]]

[Figure: ROC curve for KNN]
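
The KNN precision, recall, and accuracy printed above can be re-derived by hand from the confusion matrix. The following is a minimal sketch, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, ordered as `labels=[0,1]`) and treating label 1 as the positive class; the variable names are illustrative only.

```python
# Confusion matrix reported for KNN above: rows = true label, columns = predicted label.
c_matrix = [[67, 6],
            [3, 112]]

tn, fp = c_matrix[0]   # true label 0: correctly predicted 0, wrongly predicted 1
fn, tp = c_matrix[1]   # true label 1: wrongly predicted 0, correctly predicted 1

precision = tp / (tp + fp)                  # of all predicted positives, how many are right
recall = tp / (tp + fn)                     # of all actual positives, how many are found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all samples classified correctly

print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
print('accuracy: %.2f%%' % (100 * accuracy))
```

The printed values agree with the KNN log lines (94.92%, 97.39%, 95.21%), which is a quick sanity check that the metrics and the matrix describe the same predictions.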

******************* LR(Logistic回归) ********************
Fitting 5 folds for each of 25 candidates, totalling 125 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:    3.8s finished

最优参数: {'C': 100.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
training took 3.939845s!
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
precision: 95.83%, recall: 100.00%
accuracy: 97.34%

classification_report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        73
           1       0.96      1.00      0.98       115

    accuracy                           0.97       188
   macro avg       0.98      0.97      0.97       188
weighted avg       0.97      0.97      0.97       188

confusion_matrix:
[[ 68   5]
 [  0 115]]

[Figure: ROC curve for LR]

******************* DT(决策树) ********************
Fitting 5 folds for each of 392 candidates, totalling 1960 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 312 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 1960 out of 1960 | elapsed:    4.6s finished

最优参数: {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 4, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': None, 'splitter': 'random'}
training took 4.746069s!
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='random')
precision: 95.54%, recall: 93.04%
accuracy: 93.09%

classification_report:
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        73
           1       0.96      0.93      0.94       115

    accuracy                           0.93       188
   macro avg       0.93      0.93      0.93       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 68   5]
 [  8 107]]

[Figure: ROC curve for DT]

******************* SVM(支持向量机) ********************
Fitting 5 folds for each of 15 candidates, totalling 75 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed: 10.5min finished

最优参数: {'C': 10, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': True, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
training took 698.401171s!
SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
precision: 96.58%, recall: 98.26%
accuracy: 96.81%

classification_report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96        73
           1       0.97      0.98      0.97       115

    accuracy                           0.97       188
   macro avg       0.97      0.96      0.97       188
weighted avg       0.97      0.97      0.97       188

confusion_matrix:
[[ 69   4]
 [  2 113]]

[Figure: ROC curve for SVM]
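
The `macro avg` and `weighted avg` rows of `classification_report` differ only in how the per-class scores are combined: the macro average treats both classes equally, while the weighted average weights each class by its support. A small sketch using the SVM confusion matrix above illustrates this; the helper `per_class` is a hypothetical name introduced here, not part of the original project.

```python
# Confusion matrix reported for SVM above: rows = true label, columns = predicted label.
c_matrix = [[69, 4],
            [2, 113]]

def per_class(matrix, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: samples predicted as k
    support = sum(matrix[k])                     # row sum: samples whose true label is k
    precision = tp / predicted_k
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

rows = [per_class(c_matrix, k) for k in (0, 1)]
total = sum(r[3] for r in rows)

macro_recall = sum(r[1] for r in rows) / len(rows)          # unweighted mean over classes
weighted_recall = sum(r[1] * r[3] for r in rows) / total    # mean weighted by support

for k, (p, r, f1, support) in enumerate(rows):
    print('%d  %.2f  %.2f  %.2f  %d' % (k, p, r, f1, support))
print('macro recall: %.2f, weighted recall: %.2f' % (macro_recall, weighted_recall))
```

With the classes imbalanced (73 versus 115 samples), the two averages round to 0.96 and 0.97 respectively, matching the recall column of the SVM report.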

Ensemble learning hyperparameter tuning

for classifier in ensem_classifiers:
    print('******************* %s ********************' % classifier)
    start_time = time.time()
    model = basic_classifiers[classifier]
    # Grid search with 5-fold cross-validation over this ensemble's parameter grid
    clf = grid_search(model, ensem_params[classifier], x_train, y_train, cv=5)
    print('training took %fs!' % (time.time() - start_time))
    print(clf)
    # Fit the returned estimator on the training set, then predict on the held-out test set
    clf.fit(x_train, y_train)
    predict = clf.predict(x_test)

    precision = metrics.precision_score(y_test, predict)
    recall = metrics.recall_score(y_test, predict)
    print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
    accuracy = metrics.accuracy_score(y_test, predict)
    print('accuracy: %.2f%%' % (100 * accuracy))
    c_matrix = confusion_matrix(
        y_test,   # ground-truth (correct) target values
        predict,  # estimated targets returned by the classifier
        labels=[0, 1],  # label order used to index the matrix
        sample_weight=None  # optional per-sample weights, shape [n_samples]
    )
    print('\nclassification_report:')
    print(classification_report(y_test, predict, labels=[0, 1]))

    print('\nconfusion_matrix:')
    print(c_matrix)

    print()
    # Note: this ROC curve is drawn on the training data; pass x_test, y_test for a held-out curve.
    # plot_roc_curve was removed in scikit-learn 1.2; newer versions use RocCurveDisplay.from_estimator.
    curve1 = plot_roc_curve(clf, x_train, y_train, alpha=0.8, name=classifier)
    curve1.figure_.suptitle("乳腺癌 ROC")  # figure title: "Breast cancer ROC"
    plt.show()

******************* RF(随机森林) ********************
Fitting 5 folds for each of 840 candidates, totalling 4200 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   27.4s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:   55.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 4200 out of 4200 | elapsed:  5.0min finished

最优参数: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 3, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 10, 'min_samples_split': 120, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 30, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
training took 301.064679s!
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=120,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
precision: 94.17%, recall: 94.17%
accuracy: 92.55%

classification_report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90        68
           1       0.94      0.94      0.94       120

    accuracy                           0.93       188
   macro avg       0.92      0.92      0.92       188
weighted avg       0.93      0.93      0.93       188

confusion_matrix:
[[ 61   7]
 [  7 113]]

[Figure: ROC curve for RF]
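
The "840 candidates, totalling 4200 fits" line in the RF log follows from how an exhaustive grid search enumerates parameters: the number of candidates is the product of the sizes of the value lists, and each candidate is fitted once per cross-validation fold. The sketch below reproduces that arithmetic. The grid shown is hypothetical: its value lists are invented so that the sizes multiply to 840, and they are not the `ensem_params` used in the project.

```python
from itertools import product

# Hypothetical grid (NOT the project's ensem_params): sizes chosen so 7 * 6 * 5 * 4 = 840.
hypothetical_grid = {
    'n_estimators': [10, 20, 30, 40, 50, 60, 70],   # 7 values
    'max_depth': [3, 5, 7, 9, 11, 13],              # 6 values
    'min_samples_split': [20, 40, 60, 80, 120],     # 5 values
    'min_samples_leaf': [5, 10, 15, 20],            # 4 values
}
cv = 5  # number of cross-validation folds, as in the loop above

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(hypothetical_grid, values))
    for values in product(*hypothetical_grid.values())
]
total_fits = len(candidates) * cv

print('Fitting %d folds for each of %d candidates, totalling %d fits'
      % (cv, len(candidates), total_fits))
```

Because the count grows multiplicatively, adding one more parameter with a few values can multiply the search time several times over, which is consistent with the RF and GBDT searches each taking about five minutes here.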

******************* GBDT(梯度提升决策树) ********************
Fitting 5 folds for each of 240 candidates, totalling 1200 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed:  5.2min finished

最优参数: {'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 420, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
training took 314.633900s!
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=420,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
precision: 96.75%, recall: 99.17%
accuracy: 97.34%

classification_report:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        68
           1       0.97      0.99      0.98       120

    accuracy                           0.97       188
   macro avg       0.98      0.97      0.97       188
weighted avg       0.97      0.97      0.97       188

confusion_matrix:
[[ 64   4]
 [  1 119]]

[Figure: ROC curve for GBDT]

******************* Adaboost ********************
Fitting 5 folds for each of 90 candidates, totalling 450 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:    2.0s finished

最优参数: {'algorithm': 'SAMME', 'base_estimator__ccp_alpha': 0.0, 'base_estimator__class_weight': None, 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': None, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 1, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': None, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best'), 'learning_rate': 0.7000000000000001, 'n_estimators': 3, 'random_state': None}
training took 2.143543s!
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=None,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                                                         random_state=None,
                                                         splitter='best'),
                   learning_rate=0.7000000000000001, n_estimators=3,
                   random_state=None)
precision: 95.04%, recall: 95.83%
accuracy: 94.15%

classification_report:
              precision    recall  f1-score   support

           0       0.93      0.91      0.92        68
           1       0.95      0.96      0.95       120

    accuracy                           0.94       188
   macro avg       0.94      0.94      0.94       188
weighted avg       0.94      0.94      0.94       188

confusion_matrix:
[[ 62   6]
 [  5 115]]

[Figure: ROC curve for Adaboost]

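Comparing the results before and after tuning shows that grid-search hyperparameter tuning improved the models' accuracy. On the test sets above, tuned accuracy ranges from 92.55% (RF) to 97.34% (LR and GBDT).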
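
To make the comparison easier to read, the tuned test-set accuracies reported in this section can be collected and ranked. This is only a convenience sketch: the numbers are copied from the logs above, and the dictionary name is illustrative. Note that the two groups were evaluated on different test splits (class supports 73/115 for the common classifiers and 68/120 for the ensembles), so the ranking across groups is approximate.

```python
# Test-set accuracy (%) copied from the tuned runs reported in this section.
tuned_accuracy = {
    'KNN': 95.21,
    'LR': 97.34,
    'DT': 93.09,
    'SVM': 96.81,
    'RF': 92.55,
    'GBDT': 97.34,
    'Adaboost': 94.15,
}

# Sort from highest to lowest accuracy; ties keep their insertion order.
ranked = sorted(tuned_accuracy.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranked:
    print('%-8s %.2f%%' % (name, acc))
```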

Final notes

Get the project:

https://gitee.com/sinonfin/algorithm-sharing