In this exercise we use a telecom customer-churn dataset, Orange_Telecom_Churn_Data.csv. We first read in the data, do some preprocessing, and then use several models to predict whether a customer will churn based on their characteristics.
Step 1: Read in and preprocess the data
- Read in the dataset and inspect its basic information.
- Drop the columns that are useless for prediction: "state", "area_code" and "phone_number".
- Convert the values of the 'intl_plan' and 'voice_mail_plan' columns to booleans: 'yes' becomes True, 'no' becomes False.
# Read in the dataset and inspect its basic information
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('Orange_Telecom_Churn_Data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   state                          5000 non-null   object
 1   account_length                 5000 non-null   int64
 2   area_code                      5000 non-null   int64
 3   phone_number                   5000 non-null   object
 4   intl_plan                      5000 non-null   object
 5   voice_mail_plan                5000 non-null   object
 6   number_vmail_messages          5000 non-null   int64
 7   total_day_minutes              5000 non-null   float64
 8   total_day_calls                5000 non-null   int64
 9   total_day_charge               5000 non-null   float64
 10  total_eve_minutes              5000 non-null   float64
 11  total_eve_calls                5000 non-null   int64
 12  total_eve_charge               5000 non-null   float64
 13  total_night_minutes            5000 non-null   float64
 14  total_night_calls              5000 non-null   int64
 15  total_night_charge             5000 non-null   float64
 16  total_intl_minutes             5000 non-null   float64
 17  total_intl_calls               5000 non-null   int64
 18  total_intl_charge              5000 non-null   float64
 19  number_customer_service_calls  5000 non-null   int64
 20  churned                        5000 non-null   bool
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 786.3+ KB
# Drop the "state", "area_code" and "phone_number" columns
data = data.drop(["state", "area_code", "phone_number"], axis=1)
# Convert the 'intl_plan' and 'voice_mail_plan' columns to booleans
data["intl_plan"] = data["intl_plan"].map(lambda x: x == 'yes')
data["voice_mail_plan"] = data["voice_mail_plan"].map(lambda x: x == 'yes')
Step 2: Build X and y
- Use all columns except "churned" as X, and the "churned" column as y.
- Check the class counts in y.
- Split the data into training and test sets.
- Check the class counts in the training and test sets separately.
# Build X and y
X = data[data.columns[:-1]]
y = data.churned
# Check the class counts in y
print(y.value_counts())
False 4293
True 707
Name: churned, dtype: int64
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Check the class counts in the training and test sets
print(y_train.value_counts())
print(y_test.value_counts())
False 2996
True 504
Name: churned, dtype: int64
False 1297
True 203
Name: churned, dtype: int64
Step 3: Random forest
- Train random forests with several different numbers of trees over a range of values, and compute each forest's out-of-bag (OOB) error, i.e., the error estimated on the training samples left out of each tree's bootstrap sample.
- Plot the OOB error as a function of the number of trees.
- Use grid search with cross-validation to automatically find a good number of trees for the random forest.
- Predict on the test data and report accuracy, precision, recall and F1 score.
# Train random forests with different numbers of trees and compute each forest's OOB error
from sklearn.ensemble import RandomForestClassifier
nsimu = 21
error_rate = [0] * nsimu
ntree = [0] * nsimu
for i in range(1, nsimu):
    rfc = RandomForestClassifier(n_estimators=i * 10, min_samples_split=10, max_depth=None,
                                 criterion='gini', oob_score=True)
    rfc.fit(X_train, y_train)
    error_rate[i] = 1 - rfc.oob_score_  # OOB error = 1 - OOB accuracy
    ntree[i] = i * 10
# Plot the OOB error as a function of the number of trees
plt.figure(figsize=(10, 6))
plt.scatter(x=ntree[1:nsimu], y=error_rate[1:nsimu], s=60, c='red')
plt.title("Number of trees in the random forest vs. OOB error (criterion: 'gini')", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("OOB error rate", fontsize=15)
plt.show()
# Use grid search with cross-validation to find a good number of trees automatically
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': range(175, 200)}
grid = GridSearchCV(rfc, param_grid, cv=14, scoring="accuracy")
grid.fit(X_train,y_train)
grid.best_params_
{'n_estimators': 187}
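Since GridSearchCV refits the best parameter combination on the whole training set by default (refit=True), the tuned forest is already available on the grid object; the next cell retrains an equivalent model explicitly, but this shortcut works too:
# the refitted best model can be used directly, without retraining
print(grid.best_score_)                             # mean cross-validated accuracy of the best setting
print(grid.best_estimator_.score(X_test, y_test))   # accuracy on the held-out test set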
# Predict on the test data
rfc_new = RandomForestClassifier(n_estimators=187, min_samples_split=10, max_depth=None, criterion='gini')
rfc_new.fit(X_train, y_train)
rfc_new_pred = rfc_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, rfc_new_pred)
cr = classification_report(y_test, rfc_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.957
              precision    recall  f1-score   support

       False       0.96      0.99      0.98      1297
        True       0.92      0.74      0.82       203

    accuracy                           0.96      1500
   macro avg       0.94      0.87      0.90      1500
weighted avg       0.96      0.96      0.95      1500
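A fitted random forest also exposes impurity-based feature importances, which hint at what drives churn; a small sketch over rfc_new (the ranking will vary somewhat between runs):
# rank the ten most important features in the fitted forest
importances = pd.Series(rfc_new.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))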
Step 4: AdaBoost
- Use grid search with cross-validation to train a good AdaBoost model; parameters worth tuning include the number of trees and the learning rate.
- Predict on the test data and report accuracy, precision, recall and F1 score.
# Use grid search with cross-validation to train a good AdaBoost model,
# tuning the number of trees and the learning rate
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, confusion_matrix
param_grid = {'n_estimators': range(100, 601, 100), 'learning_rate': [0.1, 0.3, 0.5, 0.7]}
treecla = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=1)
# note: in scikit-learn >= 1.2 the base_estimator argument is named estimator
ada = AdaBoostClassifier(base_estimator=treecla, random_state=1)
grid = GridSearchCV(ada, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
grid.best_params_
{'learning_rate': 0.3, 'n_estimators': 200}
# Predict on the test data
ada_new = AdaBoostClassifier(base_estimator=treecla, learning_rate=0.3, n_estimators=200, random_state=1)
ada_new.fit(X_train, y_train)
ada_new_pred = ada_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
cm = confusion_matrix(y_test, ada_new_pred)
cr = classification_report(y_test, ada_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.877
              precision    recall  f1-score   support

       False       0.90      0.97      0.93      1297
        True       0.59      0.29      0.39       203

    accuracy                           0.88      1500
   macro avg       0.74      0.63      0.66      1500
weighted avg       0.86      0.88      0.86      1500
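With depth-1 stumps, AdaBoost recalls only 29% of the churners, far below the random forest. One plausible follow-up is to let the grid search also tune the depth of the base tree through scikit-learn's nested parameter syntax; a sketch reusing ada from above (the base_estimator__ prefix matches the base_estimator argument used here; since scikit-learn 1.2 both are named estimator):
# tune the base tree's depth together with the boosting parameters
param_grid_ada = {'n_estimators': range(100, 401, 100),
                  'learning_rate': [0.1, 0.3, 0.5],
                  'base_estimator__max_depth': [1, 2, 3]}  # nested parameter of treecla
grid_ada = GridSearchCV(ada, param_grid_ada, cv=5, scoring="accuracy")
grid_ada.fit(X_train, y_train)
print(grid_ada.best_params_)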
Step 5: Gradient Boosting
- Use grid search with cross-validation to train a good Gradient Boosting model; parameters worth tuning include the number of trees, the learning rate, the subsampling rate and the maximum number of features.
- Predict on the test data, report accuracy, precision, recall and F1 score, and compare with the AdaBoost model.
# Use grid search with cross-validation to train a good Gradient Boosting model,
# tuning the number of trees, the learning rate, the subsampling rate and the maximum number of features
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'learning_rate': [0.1, 0.3, 0.5, 0.7], 'max_features': range(1, 5),
              'subsample': [0.1, 0.3, 0.5, 0.7], 'n_estimators': range(100, 401, 100)}
gbc = GradientBoostingClassifier(random_state=1)
grid = GridSearchCV(gbc, param_grid, cv=4, scoring="accuracy")
grid.fit(X_train, y_train)
grid.best_params_
{'learning_rate': 0.1,
'max_features': 4,
'n_estimators': 300,
'subsample': 0.7}
# Predict on the test data
gbc_new = GradientBoostingClassifier(learning_rate=0.1, max_features=4, subsample=0.7,
                                     n_estimators=300, random_state=1)
gbc_new.fit(X_train, y_train)
gbc_new_pred = gbc_new.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, gbc_new_pred)
cr = classification_report(y_test, gbc_new_pred)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
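Compared with AdaBoost (accuracy 0.877, churner recall 0.29), Gradient Boosting does far better (accuracy 0.955, churner recall 0.77), essentially matching the random forest. To see how the test accuracy evolves over boosting iterations, GradientBoostingClassifier offers staged_predict; a small sketch on the fitted gbc_new:
# test accuracy after each boosting stage of the fitted model
from sklearn.metrics import accuracy_score
staged_acc = [accuracy_score(y_test, pred) for pred in gbc_new.staged_predict(X_test)]
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("Boosting iterations", fontsize=15)
plt.ylabel("Test accuracy", fontsize=15)
plt.show()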
Step 6: Stacking
- Pick three different models from the random forest, AdaBoost and Gradient Boosting models trained above, stack them into a single model, and fit it on the training data.
- Predict on the test data.
- Report accuracy, precision, recall and F1 score on the test set, and analyze and compare the results.
# Stack the random forest, AdaBoost and Gradient Boosting models trained above
# into a single model and fit it on the training data
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
SC = StackingClassifier(estimators=[("rf", rfc_new), ("ada", ada_new), ("gbc", gbc_new)],
                        final_estimator=LogisticRegression())
SC = SC.fit(X_train, y_train)
# Predict on the test data
y_predict = SC.predict(X_test)
# Report accuracy, precision, recall and F1 score on the test set
cm = confusion_matrix(y_test, y_predict)
cr = classification_report(y_test, y_predict)
print("Accuracy of prediction:", round((cm[0, 0] + cm[1, 1]) / cm.sum(), 3))
print(cr)
Accuracy of prediction: 0.955
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1297
        True       0.89      0.77      0.82       203

    accuracy                           0.96      1500
   macro avg       0.93      0.88      0.90      1500
weighted avg       0.95      0.96      0.95      1500
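The stacked model ties Gradient Boosting (0.955), sits just below the random forest (0.957), and clearly beats AdaBoost (0.877); with three strong, highly correlated base learners, the logistic-regression meta-learner has little room to improve on the best of them. A small sketch that puts the four test accuracies side by side, using the predictions computed above:
# side-by-side test accuracy of the four models
from sklearn.metrics import accuracy_score
for name, pred in [("Random Forest", rfc_new_pred),
                   ("AdaBoost", ada_new_pred),
                   ("Gradient Boosting", gbc_new_pred),
                   ("Stacking", y_predict)]:
    print(f"{name:<18}{accuracy_score(y_test, pred):.3f}")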