How to Build Random Forest, Adaptive Boosting (AdaBoost), and Gradient Boosting Models

In this exercise we use a telecom customer-churn dataset, Orange_Telecom_Churn_Data.csv. We first read in the data and do some preprocessing, then use several models to predict, from each customer's attributes, whether that customer will churn.

Step 1: Read and preprocess the data

  • Read in the dataset and inspect its basic information.
  • Drop the columns that are useless for prediction: "state", "area_code", and "phone_number".
  • Convert the 'intl_plan' and 'voice_mail_plan' columns to boolean: replace 'yes' with True and 'no' with False.
# Read in the dataset and inspect its basic information
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('Orange_Telecom_Churn_Data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   state                          5000 non-null   object 
 1   account_length                 5000 non-null   int64  
 2   area_code                      5000 non-null   int64  
 3   phone_number                   5000 non-null   object 
 4   intl_plan                      5000 non-null   object 
 5   voice_mail_plan                5000 non-null   object 
 6   number_vmail_messages          5000 non-null   int64  
 7   total_day_minutes              5000 non-null   float64
 8   total_day_calls                5000 non-null   int64  
 9   total_day_charge               5000 non-null   float64
 10  total_eve_minutes              5000 non-null   float64
 11  total_eve_calls                5000 non-null   int64  
 12  total_eve_charge               5000 non-null   float64
 13  total_night_minutes            5000 non-null   float64
 14  total_night_calls              5000 non-null   int64  
 15  total_night_charge             5000 non-null   float64
 16  total_intl_minutes             5000 non-null   float64
 17  total_intl_calls               5000 non-null   int64  
 18  total_intl_charge              5000 non-null   float64
 19  number_customer_service_calls  5000 non-null   int64  
 20  churned                        5000 non-null   bool   
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 786.3+ KB
# Drop the "state", "area_code", and "phone_number" columns
data = data.drop(["state", "area_code", "phone_number"], axis = 1)
# Convert the 'intl_plan' and 'voice_mail_plan' columns to boolean
data["intl_plan"] = data.intl_plan.map(lambda x: True if x=='yes' else False)

data["voice_mail_plan"] = data.voice_mail_plan.map(lambda x: True if x=='yes' else False)

Step 2: Build X and y

  • Use every column except "churned" as X, and the "churned" column as y.
  • Check the class counts in y.
  • Split the data into a training set and a test set.
  • Check the class counts in the training set and in the test set.
# Build X and y
X = data[data.columns[:-1]]

y = data.churned
# Check the class counts in y
print(y.value_counts())
False    4293
True      707
Name: churned, dtype: int64
# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
# Check the class counts in the training and test sets
print(y_train.value_counts())

print(y_test.value_counts())
False    2996
True      504
Name: churned, dtype: int64
False    1297
True      203
Name: churned, dtype: int64

Step 3: Random Forest

  • Set the number of trees to several values in a range, train a random forest for each value, and compute each forest's out-of-bag (OOB) error.
  • Plot the OOB error as a function of the number of trees.
  • Use grid search with cross-validation to automatically find the best number of trees for the random forest.
  • Predict on the test data and report accuracy, precision, recall, and F1 score.
# Train random forests with different numbers of trees and compute each forest's out-of-bag (OOB) error
from sklearn.ensemble import RandomForestClassifier

nsimu = 21
error_rate = [0]*nsimu
ntree = [0]*nsimu
for i in range(1, nsimu):
    rfc = RandomForestClassifier(n_estimators=i*10, min_samples_split=10, max_depth=None, criterion='gini',
                                 oob_score=True)
    rfc.fit(X_train, y_train)
    error_rate[i] = 1 - rfc.oob_score_   # OOB error = 1 - OOB accuracy
    ntree[i] = i*10
# Plot the OOB error as a function of the number of trees

plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu], y=error_rate[1:nsimu], s=60, c='red')
plt.title("Number of trees in the Random Forest vs. OOB error rate (criterion: 'gini')", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("OOB error rate", fontsize=15)
Text(0, 0.5, 'OOB error rate')

[Figure: OOB error rate plotted against the number of trees in the random forest]
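Refitting a new forest from scratch for every candidate tree count is the most expensive way to trace this curve. A cheaper sketch uses warm_start, where each call to fit only adds trees to the existing forest (random_state here is an illustrative addition; the loop above did not fix one):

# Grow one forest incrementally and read the OOB error after each batch of 10 trees
rfc_ws = RandomForestClassifier(min_samples_split=10, criterion='gini',
                                oob_score=True, warm_start=True, random_state=101)
ws_errors = []
for n in range(10, 210, 10):
    rfc_ws.set_params(n_estimators=n)   # raise the tree count; previously fitted trees are kept
    rfc_ws.fit(X_train, y_train)
    ws_errors.append((n, 1 - rfc_ws.oob_score_))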

# Use grid search with cross-validation to find the best number of trees for the random forest
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': range(175,200)}

grid = GridSearchCV(rfc, param_grid, cv = 14, scoring = "accuracy")

grid.fit(X_train,y_train)

grid.best_params_
{'n_estimators': 187}
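Since GridSearchCV refits the best configuration on the full training set by default (refit=True), the tuned forest can also be pulled straight from the grid object; the explicit refit below is kept for clarity but is optional. A small sketch:

# The grid's best estimator is already fitted on X_train with the winning n_estimators
best_rf = grid.best_estimator_
print(best_rf.score(X_test, y_test))   # test accuracy of the grid's best forest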
# Predict on the test data
rfc_new = RandomForestClassifier(n_estimators= 187, min_samples_split=10, max_depth=None, criterion='gini')

rfc_new.fit(X_train, y_train)

rfc_new_pred = rfc_new.predict(X_test)
# Report accuracy, precision, recall, and F1 score on the test set
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, rfc_new_pred)
cr = classification_report(y_test, rfc_new_pred)
print ("Accuracy of prediction:",round((cm[0,0]+cm[1,1])/cm.sum(),3))
print(cr)
Accuracy of prediction: 0.957
              precision    recall  f1-score   support

       False       0.96      0.99      0.98      1297
        True       0.92      0.74      0.82       203

    accuracy                           0.96      1500
   macro avg       0.94      0.87      0.90      1500
weighted avg       0.96      0.96      0.95      1500
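Beyond the headline metrics, the fitted forest exposes impurity-based feature importances, which indicate which customer attributes the trees split on most. A quick sketch:

# Rank features by the forest's impurity-based importance scores
importances = pd.Series(rfc_new.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))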

Step 4: AdaBoost

  • Use grid search with cross-validation to train the best AdaBoost model you can; parameters worth tuning include the number of trees and the learning rate.
  • Predict on the test data and report accuracy, precision, recall, and F1 score.
# Use grid search with cross-validation to train the best AdaBoost model; tune the number of trees and the learning rate
# (note: scikit-learn >= 1.2 renames the base_estimator argument used below to estimator)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report,confusion_matrix

param_grid = {'n_estimators': range(100, 601, 100), 'learning_rate': [0.1, 0.3, 0.5, 0.7]}
treecla = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=1,
                              random_state=1)
ada = AdaBoostClassifier(base_estimator=treecla, random_state=1)

grid = GridSearchCV(ada, param_grid, cv = 5, scoring = "accuracy")

grid.fit(X_train,y_train)

grid.best_params_
{'learning_rate': 0.3, 'n_estimators': 200}
# Predict on the test data
ada_new = AdaBoostClassifier(base_estimator=treecla, learning_rate= 0.3, n_estimators= 200,random_state=1)

ada_new.fit(X_train, y_train)

ada_new_pred = ada_new.predict(X_test)
# Report accuracy, precision, recall, and F1 score on the test set
cm = confusion_matrix(y_test, ada_new_pred)
cr = classification_report(y_test, ada_new_pred)
print ("Accuracy of prediction:",round((cm[0,0]+cm[1,1])/cm.sum(),3))
print(cr)
Accuracy of prediction: 0.877
              precision    recall  f1-score   support

       False       0.90      0.97      0.93      1297
        True       0.59      0.29      0.39       203

    accuracy                           0.88      1500
   macro avg       0.74      0.63      0.66      1500
weighted avg       0.86      0.88      0.86      1500
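With depth-1 stumps, AdaBoost trails the random forest by a wide margin (accuracy 0.877 vs. 0.957; recall on churners 0.29 vs. 0.74). Two things worth checking are whether more boosting rounds would still help and whether slightly deeper base trees (max_depth=2 or 3) work better; the first can be read off staged_predict, which yields test predictions after every round. A sketch:

from sklearn.metrics import accuracy_score

# Test accuracy after each boosting round of the fitted AdaBoost model
staged_acc = [accuracy_score(y_test, pred) for pred in ada_new.staged_predict(X_test)]
plt.figure(figsize=(10,6))
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("Boosting rounds", fontsize=15)
plt.ylabel("Test accuracy", fontsize=15)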

Step 5: Gradient Boost

  • Use grid search with cross-validation to train the best Gradient Boosting model you can; parameters worth tuning include the number of trees, the learning rate, the subsample fraction, and the maximum number of features.
  • Predict on the test data, report accuracy, precision, recall, and F1 score, and compare with the AdaBoost model.
# Use grid search with cross-validation to train the best Gradient Boosting model; tune the number of trees, learning rate, subsample, and max_features
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'learning_rate': [0.1, 0.3, 0.5, 0.7], 'max_features': range(1, 5), 
              'subsample': [0.1, 0.3, 0.5, 0.7], 'n_estimators': range(100, 401, 100)}

gbc = GradientBoostingClassifier(random_state=1)

grid = GridSearchCV(gbc, param_grid, cv = 4, scoring = "accuracy")

grid.fit(X_train,y_train)

grid.best_params_
{'learning_rate': 0.1,
 'max_features': 4,
 'n_estimators': 300,
 'subsample': 0.7}
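The grid above contains 4 × 4 × 4 × 4 = 256 parameter combinations, so with cv = 4 GridSearchCV trains over a thousand models. If that is too slow, a randomized search over the same space is a common shortcut; a sketch (n_iter=40 is an illustrative budget, not a tuned value):

from sklearn.model_selection import RandomizedSearchCV

# Sample 40 random combinations from the same parameter space instead of trying all 256
rand_search = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                                 param_distributions=param_grid,
                                 n_iter=40, cv=4, scoring="accuracy", random_state=1)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)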
# Predict on the test data
gbc_new = GradientBoostingClassifier(learning_rate= 0.1, max_features= 4, subsample= 0.7, n_estimators= 300, random_state=1)

gbc_new.fit(X_train, y_train)

gbc_new_pred = gbc_new.predict(X_test)
# Report accuracy, precision, recall, and F1 score on the test set
from sklearn.metrics import classification_report,confusion_matrix

cm = confusion_matrix(y_test, gbc_new_pred)
cr = classification_report(y_test, gbc_new_pred)
print ("Accuracy of prediction:",round((cm[0,0]+cm[1,1])/cm.sum(),3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
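Gradient boosting essentially matches the tuned random forest (accuracy 0.955 vs. 0.957) with slightly better recall on churners. As an alternative to grid-searching n_estimators, scikit-learn can stop boosting early once a held-out validation split stops improving; a minimal sketch, with illustrative values for the early-stopping parameters:

# Stop adding trees once 10 consecutive rounds fail to improve on a 10% validation split
gbc_es = GradientBoostingClassifier(learning_rate=0.1, max_features=4, subsample=0.7,
                                    n_estimators=1000, validation_fraction=0.1,
                                    n_iter_no_change=10, random_state=1)
gbc_es.fit(X_train, y_train)
print(gbc_es.n_estimators_)   # number of boosting rounds actually fitted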

Step 6: Stacking

  • Take three different models from the random forest, AdaBoost, and Gradient Boosting models trained above, stack them into a single model, and fit it to the training data.
  • Predict on the test data.
  • Report accuracy, precision, recall, and F1 score on the test set, and compare the results.
# Take three of the models trained above (random forest, AdaBoost, Gradient Boosting), stack them, and fit the training data
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

SC = StackingClassifier(estimators=[("rf", rfc_new), ("ada", ada_new), ("gbc", gbc_new)], final_estimator=LogisticRegression())
SC = SC.fit(X_train, y_train)
# Predict on the test data
y_predict = SC.predict(X_test)
# Report accuracy, precision, recall, and F1 score on the test set, then compare
cm = confusion_matrix(y_test, y_predict)
cr = classification_report(y_test, y_predict)
print ("Accuracy of prediction:",round((cm[0,0]+cm[1,1])/cm.sum(),3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
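On this split the stacked model reproduces the gradient-boosting numbers almost exactly (accuracy 0.955, churn-class F1 0.82) and does not overtake the tuned random forest (0.957); with three strongly correlated tree ensembles as base learners, there is little diversity for the meta-learner to exploit. A small sketch that puts the four test scores side by side for easier comparison:

from sklearn.metrics import f1_score

# Summarize test accuracy and churn-class F1 for the four fitted models
models = {"Random Forest": rfc_new, "AdaBoost": ada_new,
          "Gradient Boosting": gbc_new, "Stacking": SC}
summary = pd.DataFrame({name: {"accuracy": m.score(X_test, y_test),
                               "churn_F1": f1_score(y_test, m.predict(X_test))}
                        for name, m in models.items()}).T
print(summary.round(3))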