In this exercise we use a telecom customer-churn dataset, Orange_Telecom_Churn_Data.csv. We first read in the data, do some preprocessing, and then use several models to predict whether a customer will churn based on their characteristics.
Step 1: Read in and preprocess the data
- Read in the dataset and inspect its basic information.
- Drop the columns that are useless for prediction: "state", "area_code" and "phone_number".
- Convert the values of the 'intl_plan' and 'voice_mail_plan' columns to booleans: 'yes' becomes True, 'no' becomes False.
# Read in the dataset and inspect its basic information
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('Orange_Telecom_Churn_Data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   state                          5000 non-null   object
 1   account_length                 5000 non-null   int64
 2   area_code                      5000 non-null   int64
 3   phone_number                   5000 non-null   object
 4   intl_plan                      5000 non-null   object
 5   voice_mail_plan                5000 non-null   object
 6   number_vmail_messages          5000 non-null   int64
 7   total_day_minutes              5000 non-null   float64
 8   total_day_calls                5000 non-null   int64
 9   total_day_charge               5000 non-null   float64
 10  total_eve_minutes              5000 non-null   float64
 11  total_eve_calls                5000 non-null   int64
 12  total_eve_charge               5000 non-null   float64
 13  total_night_minutes            5000 non-null   float64
 14  total_night_calls              5000 non-null   int64
 15  total_night_charge             5000 non-null   float64
 16  total_intl_minutes             5000 non-null   float64
 17  total_intl_calls               5000 non-null   int64
 18  total_intl_charge              5000 non-null   float64
 19  number_customer_service_calls  5000 non-null   int64
 20  churned                        5000 non-null   bool
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 786.3+ KB
# Drop the "state", "area_code" and "phone_number" columns
data = data.drop(["state", "area_code", "phone_number"], axis=1)
# Convert the 'intl_plan' and 'voice_mail_plan' columns to booleans
data["intl_plan"] = data["intl_plan"].map(lambda x: x == 'yes')
data["voice_mail_plan"] = data["voice_mail_plan"].map(lambda x: x == 'yes')
Step 2: Build X and y
- Use all columns except "churned" as X, and the "churned" column as y.
- Check the class counts in y.
- Split the data into training and test sets.
- Check the class counts in the training and test sets separately.
# Build X and y
X = data[data.columns[:-1]]
y = data.churned
# Check the class counts in y
print(y.value_counts())
False 4293
True 707
Name: churned, dtype: int64
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Check the class counts in the training and test sets
print(y_train.value_counts())
print(y_test.value_counts())
False 2996
True 504
Name: churned, dtype: int64
False 1297
True 203
Name: churned, dtype: int64
Step 3: Random forest
- Train random forests with several different numbers of trees over a range of values, and compute each forest's out-of-bag (OOB) error, i.e., the error estimated on the training samples left out of each tree's bootstrap sample.
- Plot the OOB error as a function of the number of trees.
- Use grid search with cross-validation to automatically find a good number of trees for the random forest.
- Predict on the test data and report accuracy, precision, recall and F1 score.
# Train random forests with different numbers of trees and compute each forest's OOB error
from sklearn.ensemble import RandomForestClassifier
nsimu = 21
error_rate = [0] * nsimu
ntree = [0] * nsimu
for i in range(1, nsimu):
    rfc = RandomForestClassifier(n_estimators=i * 10, min_samples_split=10, max_depth=None,
                                 criterion='gini', oob_score=True)
    rfc.fit(X_train, y_train)
    error_rate[i] = 1 - rfc.oob_score_  # OOB error = 1 - OOB accuracy
    ntree[i] = i * 10
# Plot the OOB error as a function of the number of trees
plt.figure(figsize=(10, 6))
plt.scatter(x=ntree[1:nsimu], y=error_rate[1:nsimu], s=60, c='red')
plt.title("Number of trees in the random forest vs. OOB error (criterion: 'gini')", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("OOB error rate", fontsize=15)
plt.show()
# Use grid search with cross-validation to find a good number of trees automatically
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': range(175, 200)}
grid = GridSearchCV(rfc, param_grid, cv=14, scoring="accuracy")
grid.fit(X_train,y_train)
grid.best_params_
{'n_estimators': 187}
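Since GridSearchCV refits the best parameter combination on the whole training set by default (refit=True), the tuned forest is already available on the grid object; the next cell retrains an equivalent model explicitly, but this shortcut works too:
# the refitted best model can be used directly, without retraining
print(grid.best_score_)                             # mean cross-validated accuracy of the best setting
print(grid.best_estimator_.score(X_test, y_test))   # accuracy on the held-out test set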
# Predict on the test data
rfc_new = RandomForestClassifier(n_estimators=187, min_samples_split=10, max_depth=None, criterion='gini')
rfc_new.fit(X_train, y_train)
rfc_new_pred = rfc_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, rfc_new_pred)
cr = classification_report(y_test, rfc_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.957
              precision    recall  f1-score   support

       False       0.96      0.99      0.98      1297
        True       0.92      0.74      0.82       203

    accuracy                           0.96      1500
   macro avg       0.94      0.87      0.90      1500
weighted avg       0.96      0.96      0.95      1500
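A fitted random forest also exposes impurity-based feature importances, which hint at what drives churn; a small sketch over rfc_new (the ranking will vary somewhat between runs):
# rank the ten most important features in the fitted forest
importances = pd.Series(rfc_new.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))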
Step 4: AdaBoost
- Use grid search with cross-validation to train a good AdaBoost model; parameters worth tuning include the number of trees and the learning rate.
- Predict on the test data and report accuracy, precision, recall and F1 score.
# Use grid search with cross-validation to train a good AdaBoost model,
# tuning the number of trees and the learning rate
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, confusion_matrix
param_grid = {'n_estimators': range(100, 601, 100), 'learning_rate': [0.1, 0.3, 0.5, 0.7]}
treecla = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=1)
# note: in scikit-learn >= 1.2 the base_estimator argument is named estimator
ada = AdaBoostClassifier(base_estimator=treecla, random_state=1)
grid = GridSearchCV(ada, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
grid.best_params_
{'learning_rate': 0.3, 'n_estimators': 200}
# Predict on the test data
ada_new = AdaBoostClassifier(base_estimator=treecla, learning_rate=0.3, n_estimators=200, random_state=1)
ada_new.fit(X_train, y_train)
ada_new_pred = ada_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
cm = confusion_matrix(y_test, ada_new_pred)
cr = classification_report(y_test, ada_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.877
              precision    recall  f1-score   support

       False       0.90      0.97      0.93      1297
        True       0.59      0.29      0.39       203

    accuracy                           0.88      1500
   macro avg       0.74      0.63      0.66      1500
weighted avg       0.86      0.88      0.86      1500
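With depth-1 stumps, AdaBoost recalls only 29% of the churners, far below the random forest. One plausible follow-up is to let the grid search also tune the depth of the base tree through scikit-learn's nested parameter syntax; a sketch reusing ada from above (the base_estimator__ prefix matches the base_estimator argument used here; since scikit-learn 1.2 both are named estimator):
# tune the base tree's depth together with the boosting parameters
param_grid_ada = {'n_estimators': range(100, 401, 100),
                  'learning_rate': [0.1, 0.3, 0.5],
                  'base_estimator__max_depth': [1, 2, 3]}  # nested parameter of treecla
grid_ada = GridSearchCV(ada, param_grid_ada, cv=5, scoring="accuracy")
grid_ada.fit(X_train, y_train)
print(grid_ada.best_params_)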
Step 5: Gradient Boosting
- Use grid search with cross-validation to train a good Gradient Boosting model; parameters worth tuning include the number of trees, the learning rate, the subsampling rate and the maximum number of features.
- Predict on the test data, report accuracy, precision, recall and F1 score, and compare with the AdaBoost model.
# Use grid search with cross-validation to train a good Gradient Boosting model,
# tuning the number of trees, the learning rate, the subsampling rate and the maximum number of features
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'learning_rate': [0.1, 0.3, 0.5, 0.7], 'max_features': range(1, 5),
              'subsample': [0.1, 0.3, 0.5, 0.7], 'n_estimators': range(100, 401, 100)}
gbc = GradientBoostingClassifier(random_state=1)
grid = GridSearchCV(gbc, param_grid, cv=4, scoring="accuracy")
grid.fit(X_train, y_train)
grid.best_params_
{'learning_rate': 0.1,
'max_features': 4,
'n_estimators': 300,
'subsample': 0.7}
# Predict on the test data
gbc_new = GradientBoostingClassifier(learning_rate=0.1, max_features=4, subsample=0.7,
                                     n_estimators=300, random_state=1)
gbc_new.fit(X_train, y_train)
gbc_new_pred = gbc_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, gbc_new_pred)
cr = classification_report(y_test, gbc_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
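Compared with AdaBoost (accuracy 0.877, churner recall 0.29), Gradient Boosting does far better (accuracy 0.955, churner recall 0.77), essentially matching the random forest. To see how the test accuracy evolves over boosting iterations, GradientBoostingClassifier offers staged_predict; a small sketch on the fitted gbc_new:
# test accuracy after each boosting stage of the fitted model
from sklearn.metrics import accuracy_score
staged_acc = [accuracy_score(y_test, pred) for pred in gbc_new.staged_predict(X_test)]
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("Boosting iterations", fontsize=15)
plt.ylabel("Test accuracy", fontsize=15)
plt.show()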
Step 6: Stacking
- Pick three different models from the random forest, AdaBoost and Gradient Boosting models trained above, stack them into a single model, and fit it on the training data.
- Predict on the test data.
- Report accuracy, precision, recall and F1 score on the test set, and analyze and compare the results.
# Stack the random forest, AdaBoost and Gradient Boosting models trained above
# into a single model and fit it on the training data
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
SC = StackingClassifier(estimators=[("rf", rfc_new), ("ada", ada_new), ("gbc", gbc_new)],
                        final_estimator=LogisticRegression())
SC = SC.fit(X_train, y_train)
# Predict on the test data
y_predict = SC.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
cm = confusion_matrix(y_test, y_predict)
cr = classification_report(y_test, y_predict)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
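The stacked model ties Gradient Boosting (0.955), sits just below the random forest (0.957), and clearly beats AdaBoost (0.877); with three strong, highly correlated base learners, the logistic-regression meta-learner has little room to improve on the best of them. A small sketch that puts the four test accuracies side by side, using the predictions computed above:
# side-by-side test accuracy of the four models
from sklearn.metrics import accuracy_score
for name, pred in [("Random Forest", rfc_new_pred),
                   ("AdaBoost", ada_new_pred),
                   ("Gradient Boosting", gbc_new_pred),
                   ("Stacking", y_predict)]:
    print(f"{name:<18}{accuracy_score(y_test, pred):.3f}")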