Kaggle Employee Attrition Prediction Case Study (2)


Links 🚪

For the case background and the data analysis and preprocessing, see "Kaggle Employee Attrition Prediction Case Study (1)".
For model evaluation, see "Kaggle Employee Attrition Prediction Case Study (3)".

Background

As explained in the first article, this case is a typical binary classification problem. Here we will apply several common models: logistic regression (LogisticRegression), decision tree classification (DecisionTreeClassifier), and random forest classification (RandomForestClassifier).

Note that this article is a study log. In the modeling below, each model is trained and evaluated on three datasets (the original data, the undersampled data, and the oversampled data) so that the effects of undersampling and oversampling can be compared.
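For reference, the two resampling ideas can be sketched with plain pandas. This is a minimal toy illustration only; the real X_train_undersample / X_train_oversample sets were built in part 1, and the column name 'left' here is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy imbalanced dataset: 10% positives (left), like an attrition set
rng = np.random.RandomState(42)
df = pd.DataFrame({'feat': rng.randn(100),
                   'left': [1] * 10 + [0] * 90})

pos = df[df['left'] == 1]
neg = df[df['left'] == 0]

# Undersampling: shrink the majority class down to the minority size
undersampled = pd.concat([pos, neg.sample(len(pos), random_state=42)])

# Oversampling: repeat minority rows (with replacement) up to the majority size
oversampled = pd.concat([neg, pos.sample(len(neg), replace=True, random_state=42)])

print(undersampled['left'].value_counts().to_dict())  # {1: 10, 0: 10}
print(oversampled['left'].value_counts().to_dict())   # {0: 90, 1: 90}
```

Libraries such as imblearn offer more refined versions (e.g. SMOTE), but the balancing principle is the same.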

Modeling

Logistic Regression (LogisticRegression)

# Import the required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score, classification_report

We use cross-validation to search for the best regularization parameter. The procedure is wrapped in a function so that it can be reused directly with different models and datasets, which is very convenient.

import numpy as np   # assumed already imported in part 1; repeated here for completeness
import pandas as pd

# Evaluate each candidate C value with 5-fold cross-validation
def printing_Kfold_scores(x_train_XXX, y_train_XXX):
    fold = KFold(5, shuffle = False)
    
    # Candidate regularization strengths; a larger C means a weaker penalty (see the sklearn docs)
    c_param_range = [0.01, 0.1, 1, 10, 100]
    
    # Table used to collect the results
    result_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range
    
    # K-fold cross-validation yields two index sets per split: training = indices[0], validation = indices[1]
    j = 0
    # Loop over the candidate parameters
    for c_param in c_param_range:
        print('-----------------------------')
        print('Regularization strength C:', c_param)
        print('-----------------------------')
        print('')
        
        recall_accs = []
        
        # Run the cross-validation step by step
        for iteration, indices in enumerate(fold.split(x_train_XXX)):
            # Build the model with the given parameter; penalty 'l2' is also worth trying
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')
            
            # Fit the model; be careful with the indices: training always uses index 0 for both X and y
            lr.fit(x_train_XXX.iloc[indices[0],:], y_train_XXX.iloc[indices[0],:].values.ravel())
            
            # Predict on the validation fold, index 1
            y_pred_XXXsample = lr.predict(x_train_XXX.iloc[indices[1],:].values)
            
            # Evaluate: recall_score takes the true values and the predictions
            recall_acc = recall_score(y_train_XXX.iloc[indices[1],:].values, y_pred_XXXsample)
            
            # Save each fold's score so we can average them later
            recall_accs.append(recall_acc)
            print('Iteration {}: recall = {}'.format(iteration, recall_acc))
            
        # After all folds, compute the mean result
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1 
        print('')
        print('Mean recall', np.mean(recall_accs))
        print('')
        
    # The best parameter is simply the one with the highest mean recall
    best_c = result_table.loc[result_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    
    # Report the best result
    print('************************************************')
    print('Best C parameter = ', best_c)
    print('************************************************')
    
    return best_c
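As an aside, cross_val_score (imported above but not actually used) can produce the same per-fold recall scores in a single call. A minimal sketch on synthetic data, where make_classification merely stands in for the real training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, random_state=42)

lr = LogisticRegression(C=10, penalty='l1', solver='liblinear')
scores = cross_val_score(lr, X, y, cv=5, scoring='recall')  # one recall score per fold
print(scores, scores.mean())
```

The manual loop above is still useful for learning purposes, since it makes the train/validation index handling explicit.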
print('Undersampled data:')
best_c_undersample = printing_Kfold_scores(X_train_undersample, y_train_undersample)
print('Oversampled data:')
best_c_oversample = printing_Kfold_scores(X_train_oversample, y_train_oversample)
print('Original (non-resampled) data:')
best_c = printing_Kfold_scores(X_train, y_train)

Undersampled data:
-----------------------------
Regularization strength C: 0.01
-----------------------------

Iteration 0: recall = 0.0
Iteration 1: recall = 0.0
Iteration 2: recall = 0.0
Iteration 3: recall = 0.0
Iteration 4: recall = 0.0

Mean recall 0.0

-----------------------------
Regularization strength C: 0.1
-----------------------------

Iteration 0: recall = 0.8333333333333334
Iteration 1: recall = 0.4411764705882353
Iteration 2: recall = 0.8
Iteration 3: recall = 0.5151515151515151
Iteration 4: recall = 0.48484848484848486

Mean recall 0.6149019607843137

-----------------------------
Regularization strength C: 1
-----------------------------

Iteration 0: recall = 0.8666666666666667
Iteration 1: recall = 0.6764705882352942
Iteration 2: recall = 0.8
Iteration 3: recall = 0.6363636363636364
Iteration 4: recall = 0.696969696969697

Mean recall 0.7352941176470589

-----------------------------
Regularization strength C: 10
-----------------------------

Iteration 0: recall = 0.8666666666666667
Iteration 1: recall = 0.7647058823529411
Iteration 2: recall = 0.8333333333333334
Iteration 3: recall = 0.696969696969697
Iteration 4: recall = 0.6666666666666666

Mean recall 0.765668449197861

-----------------------------
Regularization strength C: 100
-----------------------------

Iteration 0: recall = 0.8666666666666667
Iteration 1: recall = 0.7647058823529411
Iteration 2: recall = 0.8
Iteration 3: recall = 0.6666666666666666
Iteration 4: recall = 0.6666666666666666

Mean recall 0.7529411764705882

************************************************
Best C parameter =  10.0
************************************************
Oversampled data:
-----------------------------
Regularization strength C: 0.01
-----------------------------

Iteration 0: recall = 0.0
Iteration 1: recall = 0.0
Iteration 2: recall = 0.0
Iteration 3: recall = 0.0
Iteration 4: recall = 0.0

Mean recall 0.0

-----------------------------
Regularization strength C: 0.1
-----------------------------

Iteration 0: recall = 0.7631578947368421
Iteration 1: recall = 0.7131147540983607
Iteration 2: recall = 0.746031746031746
Iteration 3: recall = 0.7580645161290323
Iteration 4: recall = 0.6608695652173913

Mean recall 0.7282476952426744

-----------------------------
Regularization strength C: 1
-----------------------------

Iteration 0: recall = 0.7982456140350878
Iteration 1: recall = 0.819672131147541
Iteration 2: recall = 0.7857142857142857
Iteration 3: recall = 0.7983870967741935
Iteration 4: recall = 0.6695652173913044

Mean recall 0.7743168690124824

-----------------------------
Regularization strength C: 10
-----------------------------

Iteration 0: recall = 0.7631578947368421
Iteration 1: recall = 0.7704918032786885
Iteration 2: recall = 0.7777777777777778
Iteration 3: recall = 0.75
Iteration 4: recall = 0.6782608695652174

Mean recall 0.7479376690717052

-----------------------------
Regularization strength C: 100
-----------------------------

Iteration 0: recall = 0.7631578947368421
Iteration 1: recall = 0.7704918032786885
Iteration 2: recall = 0.7857142857142857
Iteration 3: recall = 0.75
Iteration 4: recall = 0.6782608695652174

Mean recall 0.7495249706590068

************************************************
Best C parameter =  1.0
************************************************
Original (non-resampled) data:
-----------------------------
Regularization strength C: 0.01
-----------------------------

Iteration 0: recall = 0.0
Iteration 1: recall = 0.0
Iteration 2: recall = 0.0
Iteration 3: recall = 0.0
Iteration 4: recall = 0.0

Mean recall 0.0

-----------------------------
Regularization strength C: 0.1
-----------------------------

Iteration 0: recall = 0.0
Iteration 1: recall = 0.0
Iteration 2: recall = 0.03125
Iteration 3: recall = 0.02631578947368421
Iteration 4: recall = 0.02857142857142857

Mean recall 0.017227443609022557

-----------------------------
Regularization strength C: 1
-----------------------------

Iteration 0: recall = 0.3
Iteration 1: recall = 0.375
Iteration 2: recall = 0.28125
Iteration 3: recall = 0.39473684210526316
Iteration 4: recall = 0.3142857142857143

Mean recall 0.33305451127819546

-----------------------------
Regularization strength C: 10
-----------------------------

Iteration 0: recall = 0.3333333333333333
Iteration 1: recall = 0.4375
Iteration 2: recall = 0.40625
Iteration 3: recall = 0.39473684210526316
Iteration 4: recall = 0.4857142857142857

Mean recall 0.4115068922305764

-----------------------------
Regularization strength C: 100
-----------------------------

Iteration 0: recall = 0.3333333333333333
Iteration 1: recall = 0.4375
Iteration 2: recall = 0.40625
Iteration 3: recall = 0.39473684210526316
Iteration 4: recall = 0.4857142857142857

Mean recall 0.4115068922305764

************************************************
Best C parameter =  10.0
************************************************
import itertools

def plot_confusion_matrix(cm, classes,
                          title = 'Confusion matrix',
                          cmap = plt.cm.Blues):
    """Helper that draws a confusion matrix."""
    plt.imshow(cm, interpolation='nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 0)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                horizontalalignment = 'center',
                color = 'white' if cm[i, j] > thresh else 'black')
    plt.tight_layout()
    # With sklearn's confusion_matrix, rows are true labels and columns are predictions
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
def print_result(best_c, X_train_XXXsample, y_train_XXXsample, X_test_XXXsample, y_test_XXXsample):
    """Fit a model and print its evaluation report."""
    lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
    lr.fit(X_train_XXXsample, y_train_XXXsample.values.ravel())
    y_pred_XXXsample = lr.predict(X_test_XXXsample.values)

    # Compute the required metrics
    cnf_matrix = confusion_matrix(y_test_XXXsample, y_pred_XXXsample)
    np.set_printoptions(precision = 2)

    print('Recall: ', cnf_matrix[1,1] / (cnf_matrix[1,0] + cnf_matrix[1,1]))
    print(f"Accuracy Score: {accuracy_score(y_test_XXXsample, y_pred_XXXsample) * 100:.2f}%")
    print(classification_report(y_test_XXXsample, y_pred_XXXsample, target_names=['0','1']))

    # Plot the confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix,
                         classes=class_names,
                         )
    plt.show()

Results on the original (non-resampled) data:

best_c = 10
print_result(best_c,X_train,y_train,X_test,y_test)

(confusion matrix figure)

Results on the undersampled data:

best_c_undersample = 10
print('Undersampled data:')
print('Training set:')
print_result(best_c_undersample, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample)
print('*' * 40)
print('Test set:')
print_result(best_c_undersample, X_train_undersample, y_train_undersample, X_test, y_test)

(confusion matrix figures: training and test sets)

Results on the oversampled data:

best_c_oversample = 1
print('Oversampled data:')
print('Training set:')
print_result(best_c_oversample, X_train_oversample, y_train_oversample, X_test_oversample, y_test_oversample)
print('Test set:')
print_result(best_c_oversample, X_train_oversample, y_train_oversample, X_test, y_test)

(confusion matrix figures: training and test sets)

How to read the confusion matrix:

(1) Bottom-left quadrant: the model wrongly predicted 1 (left) as 0 (stayed), i.e. false negatives.

(2) Top-right quadrant: the model wrongly predicted 0 (stayed) as 1 (left), i.e. false positives.

(3) Top-left quadrant: the model correctly predicted 0 (stayed), i.e. true negatives.

(4) Bottom-right quadrant: the model correctly predicted 1 (left), i.e. true positives.
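The four quadrants can also be read off programmatically with `ravel()`. A toy example with made-up labels (1 = left, 0 = stayed):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)              # 2 1 1 2
print('recall =', tp / (tp + fn))  # 2/3, same as recall_score
```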


Summary:

  1. On the undersampled data, the model achieves 73% recall and 73% accuracy on the training set, but its test-set performance is actually better: both recall and accuracy rise to 80%.
  2. On the oversampled data, the model reaches 80% recall and 79% accuracy on the training set. Compared with undersampling, the model clearly learns better from the oversampled data, possibly because the larger number of training samples improves the fit. On the test set it achieves 84% recall and 79% accuracy, again better.
  3. On the original (non-resampled) data, the model's recall is only 45% with 89% accuracy, clearly the worst of the three.

Decision Tree Classification (DecisionTreeClassifier)

# Import the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# GridSearchCV will be used for parameter tuning
from sklearn.model_selection import GridSearchCV

First, let's look at the results with the default parameters:

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print a fitted model's scores on the training or test set, for easy reuse below."""
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        
        cnf_matrix = confusion_matrix(y_train, pred)
        np.set_printoptions(precision = 2)
        # Plot the confusion matrix
        class_names = [0, 1]
        plt.figure()
        plot_confusion_matrix(cnf_matrix,
                             classes=class_names,)
        plt.show()
        
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        
        cnf_matrix = confusion_matrix(y_test, pred)
        np.set_printoptions(precision = 2)
        # Plot the confusion matrix
        class_names = [0, 1]
        plt.figure()
        plot_confusion_matrix(cnf_matrix,
                             classes=class_names,)
        plt.show()

Default parameters:

Results on the undersampled data:

# Build the model with its default parameters
tree_clf = DecisionTreeClassifier(random_state=42)
# Fit the model on the training data
tree_clf.fit(X_train_undersample, y_train_undersample)
print('Undersampled data:')
print('Training set:')
print_score(tree_clf, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample, train=True)
print('Test set:')
print_score(tree_clf, X_train_undersample, y_train_undersample, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)

Results on the oversampled data:

tree_clf.fit(X_train_oversample, y_train_oversample)
print('Oversampled data:')
print('Training set:')
print_score(tree_clf, X_train_oversample, y_train_oversample, X_test_oversample, y_test_oversample, train=True)
print('Test set:')
print_score(tree_clf, X_train_oversample, y_train_oversample, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)

Results on the original (non-resampled) data:

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

print('Training set:')
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print('Test set:')
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)
We find that, comparatively, the decision tree performs best on the undersampled data: 94% recall and 73.7% precision.
Let's not dwell on these numbers yet; next we tune the parameters with a grid search to find the best configuration.


Finding the best parameters:

For an explanation of the decision tree parameters, see the article 《机器学习——决策树,DecisionTreeClassifier参数详解,决策树可视化查看树结构》.

params = {
    "criterion":("gini", "entropy"), 
    "splitter":("best", "random"), 
    "max_depth":(list(range(1, 20))), 
    "min_samples_split":[2, 3, 4], 
    "min_samples_leaf":list(range(1, 20)), 
}

tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=3)
tree_cv.fit(X_train_undersample, y_train_undersample)
best_params = tree_cv.best_params_
print(f"Best parameters (undersampled): {best_params}")

The best parameters for the undersampled data:

Fitting 3 folds for each of 4332 candidates, totalling 12996 fits
Best parameters (undersampled): {'criterion': 'entropy', 'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'random'}
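The 4332 candidates reported above are simply the size of the parameter grid defined earlier; a quick sanity check of the arithmetic:

```python
# criterion (2) x splitter (2) x max_depth (19) x min_samples_split (3) x min_samples_leaf (19)
n_candidates = 2 * 2 * len(list(range(1, 20))) * 3 * len(list(range(1, 20)))
print(n_candidates)      # 4332
print(n_candidates * 3)  # 12996 total fits with cv=3
```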

Plugging the best parameters back into the model (random_state=42 added to match the search, since splitter='random' is otherwise non-deterministic):

tree_clf = DecisionTreeClassifier(**best_params, random_state=42)
tree_clf.fit(X_train_undersample, y_train_undersample)
print('Undersampled data:')
print('Training set:')
print_score(tree_clf, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample, train=True)
print('Test set:')
print_score(tree_clf, X_train_undersample, y_train_undersample, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)

Summary:

Configuration                  Performance
Default parameters             recall 94%, precision 73%
Grid-searched parameters       recall 82%, precision 78.9%

Overall the two configurations perform roughly on par: one has better recall (it would rather flag a hundred innocents than let one leaver slip by, at the cost of precision), while the other has higher precision.
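This recall/precision trade-off can also be steered without retraining, by moving the classifier's decision threshold. A minimal sketch on synthetic data (make_classification stands in for the attrition set; the numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~20% positives, like an attrition set
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of class 1 (leaving)

recalls, precisions = [], []
for thresh in (0.3, 0.5, 0.7):
    pred = (proba >= thresh).astype(int)
    recalls.append(recall_score(y_te, pred))
    precisions.append(precision_score(y_te, pred, zero_division=0))
    print(f"threshold {thresh}: recall {recalls[-1]:.2f}, precision {precisions[-1]:.2f}")
```

Lowering the threshold trades precision for recall, the same behaviour as the high-recall default tree above.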



Random Forest Classification (RandomForestClassifier)

For an explanation of this model's parameters, see the article 《sklearn.ensemble.RandomForestClassifier 随机深林参数详解》. This article demonstrates two tuning approaches, RandomizedSearchCV (randomized search with cross-validation) and GridSearchCV (exhaustive grid search with cross-validation), both to reinforce the concepts and to compare the two. For a detailed comparison of the two methods, see 《GridSearchCV 与 RandomizedSearchCV 调参》 and 《随机搜索RandomizedSearchCV原理》.

GridSearchCV

Results on the undersampled data:

# Import the required packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Candidate parameter values
n_estimators = [100, 500, 1000, 1500]
max_features = ['auto', 'sqrt']
max_depth = [2, 3, 5]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4, 10]
bootstrap = [True, False]

params_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}
# Build the model and the grid search over the candidate parameters
rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="f1", cv=3, verbose=2, n_jobs=-1)

# Search for the best parameters on the undersampled data
rf_cv.fit(X_train_undersample, y_train_undersample.values.ravel())
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")

The best parameters found:

Fitting 3 folds for each of 768 candidates, totalling 2304 fits
Best parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 1500}
# Build the model with the best parameters and report the results
rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(X_train_undersample, y_train_undersample.values.ravel())

print_score(rf_clf, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample, train=True)
print_score(rf_clf, X_train_undersample, y_train_undersample, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)

Results on the oversampled data:

# Candidate parameter values (same grid as above)
n_estimators = [100, 500, 1000, 1500]
max_features = ['auto', 'sqrt']
max_depth = [2, 3, 5]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4, 10]
bootstrap = [True, False]

params_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}
# Build the model and the grid search
rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="f1", cv=3, verbose=2, n_jobs=-1)
# Search for the best parameters on the oversampled data
rf_cv.fit(X_train_oversample, y_train_oversample.values.ravel())
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")

The best parameters found:

Fitting 3 folds for each of 768 candidates, totalling 2304 fits
Best parameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
# Build the model with the best parameters and report the results
rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(X_train_oversample, y_train_oversample.values.ravel())

print_score(rf_clf, X_train_oversample, y_train_oversample, X_test_oversample, y_test_oversample, train=True)
print_score(rf_clf, X_train_oversample, y_train_oversample, X_test, y_test, train=False)

(confusion matrix figures: training and test sets)

Results on the original (non-resampled) data:

(code omitted; results below)
(confusion matrix figures: training and test sets)

Summary:

The best result comes from the undersampled data after tuning: 91% recall and 83% precision.



RandomizedSearchCV

As a baseline, a default random forest on the original data:

from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train.values.ravel())

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Results on the undersampled data:

from sklearn.model_selection import RandomizedSearchCV
# Candidate parameter values
n_estimators = [100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

rf_clf = RandomForestClassifier(random_state=42)

rf_cv = RandomizedSearchCV(estimator=rf_clf, scoring='f1', param_distributions=random_grid, n_iter=100, cv=3, 
                               verbose=2, random_state=42, n_jobs=-1)
# Search for the best parameters
rf_cv.fit(X_train_undersample, y_train_undersample.values.ravel())
rf_best_params = rf_cv.best_params_
# Print the best configuration
print(f"Best parameters: {rf_best_params}")

The best parameters found:

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best parameters: {'n_estimators': 1800, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 110, 'bootstrap': True}
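The 100 sampled candidates cover only a small fraction of this search space; the full grid would be:

```python
import numpy as np

n_depth = len([int(x) for x in np.linspace(10, 110, num=11)]) + 1  # 11 values + None
# n_estimators (11) x max_features (2) x max_depth (12) x min_samples_split (3)
# x min_samples_leaf (3) x bootstrap (2)
full_grid = 11 * 2 * n_depth * 3 * 3 * 2
print(full_grid)        # 4752 combinations
print(100 / full_grid)  # RandomizedSearchCV sampled only about 2% of them
```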
rf_clf = RandomForestClassifier(**rf_best_params)
rf_clf.fit(X_train_undersample, y_train_undersample.values.ravel())

print_score(rf_clf, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample, train=True)
print_score(rf_clf, X_train_undersample, y_train_undersample, X_test, y_test, train=False)

(confusion matrix figure)

Results on the oversampled data:

# Search for the best parameters on the oversampled data
rf_cv.fit(X_train_oversample, y_train_oversample.values.ravel())
rf_best_params = rf_cv.best_params_
print(f"Best parameters: {rf_best_params}")

The best parameters found:

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
# Fit with the best parameters and report train/test results
rf_clf = RandomForestClassifier(**rf_best_params)
rf_clf.fit(X_train_oversample, y_train_oversample.values.ravel())

print_score(rf_clf, X_train_oversample, y_train_oversample, X_test_oversample, y_test_oversample, train=True)
print_score(rf_clf, X_train_oversample, y_train_oversample, X_test, y_test, train=False)

(confusion matrix figure)

Results on the original (non-resampled) data:

rf_cv.fit(X_train, y_train.values.ravel())
rf_best_params = rf_cv.best_params_
print(f"Best parameters: {rf_best_params}")

The best parameters found:

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 90, 'bootstrap': False}
rf_clf = RandomForestClassifier(**rf_best_params)
rf_clf.fit(X_train, y_train.values.ravel())

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

(confusion matrix figure)

Summary:

The best result again comes from the undersampled data after tuning: 92.85% recall and 84.35% precision.



Default parameters, no tuning:

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train_undersample, y_train_undersample.values.ravel())

print_score(rf_clf, X_train_undersample, y_train_undersample, X_test_undersample, y_test_undersample, train=True)
print_score(rf_clf, X_train_undersample, y_train_undersample, X_test, y_test, train=False)

Overall Summary

Logistic regression:
  - recall 80%, precision 80% (undersampled data)
  - recall 84%, precision 79% (oversampled data)
Decision tree:
  - recall 94%, precision 73% (undersampled data, default parameters)
  - recall 82%, precision 78.9% (undersampled data, tuned)
Random forest:
  - recall 84%, precision 83% (undersampled data, default parameters)
  - recall 91%, precision 83% (undersampled data, GridSearchCV tuning)
  - recall 92.85%, precision 84.35% (undersampled data, RandomizedSearchCV tuning)
©️2021 CSDN