Car Insurance Customer Classification Problem

静听山水

Category: Machine Learning

Original source: https://www.kaggle.com/kondla/carinsurance

Code: https://www.kaggle.com/manibhask/cleaning-visualizing-and-modeling-cold-call-data
Data: https://www.kaggle.com/kondla/carinsurance

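Let's look at the dataset's features and what each attribute means. The table below briefly describes the dataset; the example values indicate whether each variable is continuous, categorical, or discrete.

Feature	Description	Example
Id	Unique identifier	“1” … “5000”
Age	Age of the customer
Job	Customer's job	“admin.”, “blue-collar”, etc.
Marital	Customer's marital status	“divorced”, “married”, “single”
Education	Customer's education level	“primary”, “secondary”, etc.
Default	Whether the customer has credit in default	“yes” - 1, “no” - 0
Balance	Average yearly balance (USD)
HHInsurance	Whether the household is insured	“yes” - 1, “no” - 0
CarLoan	Whether the customer has a car loan	“yes” - 1, “no” - 0
Communication	Contact communication type	“cellular”, “telephone”, “NA”
LastContactMonth	Month of the last contact	“jan”, “feb”, etc.
LastContactDay	Day of the month of the last contact
CallStart	Start time of the last call (HH:MM:SS)	12:43:15
CallEnd	End time of the last call (HH:MM:SS)	12:43:15
NoOfContacts	Number of contacts made with this customer during this campaign
DaysPassed	Number of days since the customer was last contacted; -1 means the customer has not been contacted before
PrevAttempts	Number of contacts made with this customer before this campaign
Outcome	Outcome of the previous marketing campaign	“failure”, “other”, “success”, “NA”
CarInsurance	Whether the customer bought car insurance	“yes” - 1, “no” - 0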

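Data Wrangling

Data wrangling is the process of converting data from one form into another so that it is easier to understand. Here the data comes as a CSV file, so let's load it into a data frame with Python's data science libraries. Well, I never thought it would look this simple!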

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
%matplotlib inline
from sklearn.model_selection import train_test_split,cross_val_score,KFold,cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score,confusion_matrix,precision_recall_curve,roc_curve
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier,RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.neighbors  import KNeighborsClassifier
from sklearn import svm,tree

df = pd.read_csv('../data/carInsurance_train.csv',index_col = 'Id')

df.head()

	Age	Job	Marital	Education	Default	Balance	HHInsurance	CarLoan	Communication	LastContactDay	LastContactMonth	NoOfContacts	DaysPassed	PrevAttempts	Outcome	CallStart	CallEnd	CarInsurance
Id
1	32	management	single	tertiary	0	1218	1	0	telephone	28	jan	2	-1	0	NaN	13:45:20	13:46:30	0
2	32	blue-collar	married	primary	0	1156	1	0	NaN	26	may	5	-1	0	NaN	14:49:03	14:52:08	0
3	29	management	single	tertiary	0	637	1	0	cellular	3	jun	1	119	1	failure	16:30:24	16:36:04	1
4	25	student	single	primary	0	373	1	0	cellular	11	may	2	-1	0	NaN	12:06:43	12:20:22	1
5	30	management	married	tertiary	0	2694	0	0	cellular	3	jun	1	-1	0	NaN	14:35:44	14:38:56	0

df.shape

(4000, 18)

df.columns

Index(['Age', 'Job', 'Marital', 'Education', 'Default', 'Balance',
       'HHInsurance', 'CarLoan', 'Communication', 'LastContactDay',
       'LastContactMonth', 'NoOfContacts', 'DaysPassed', 'PrevAttempts',
       'Outcome', 'CallStart', 'CallEnd', 'CarInsurance'],
      dtype='object')

df.describe()

	Age	Default	Balance	HHInsurance	CarLoan	LastContactDay	NoOfContacts	DaysPassed	PrevAttempts	CarInsurance
count	4000.000000	4000.000000	4000.000000	4000.00000	4000.000000	4000.000000	4000.000000	4000.000000	4000.000000	4000.000000
mean	41.214750	0.014500	1532.937250	0.49275	0.133000	15.721250	2.607250	48.706500	0.717500	0.401000
std	11.550194	0.119555	3511.452489	0.50001	0.339617	8.425307	3.064204	106.685385	2.078647	0.490162
min	18.000000	0.000000	-3058.000000	0.00000	0.000000	1.000000	1.000000	-1.000000	0.000000	0.000000
25%	32.000000	0.000000	111.000000	0.00000	0.000000	8.000000	1.000000	-1.000000	0.000000	0.000000
50%	39.000000	0.000000	551.500000	0.00000	0.000000	16.000000	2.000000	-1.000000	0.000000	0.000000
75%	49.000000	0.000000	1619.000000	1.00000	0.000000	22.000000	3.000000	-1.000000	0.000000	1.000000
max	95.000000	1.000000	98417.000000	1.00000	1.000000	31.000000	43.000000	854.000000	58.000000	1.000000

df.dtypes

Age                  int64
Job                 object
Marital             object
Education           object
Default              int64
Balance              int64
HHInsurance          int64
CarLoan              int64
Communication       object
LastContactDay       int64
LastContactMonth    object
NoOfContacts         int64
DaysPassed           int64
PrevAttempts         int64
Outcome             object
CallStart           object
CallEnd             object
CarInsurance         int64
dtype: object

Next we describe the non-numeric variables: the count, the number of distinct categories, the most frequent category, and its frequency.

df.describe(include=['O'])

	Job	Marital	Education	Communication	LastContactMonth	Outcome	CallStart	CallEnd
count	3981	4000	3831	3098	4000	958	4000	4000
unique	11	3	3	2	12	3	3777	3764
top	management	married	secondary	cellular	may	failure	15:27:56	10:22:30
freq	893	2304	1988	2831	1049	437	3	3

Outlier Analysis

https://blog.csdn.net/weixin_42056745/article/details/90516835

https://blog.csdn.net/yuxeaotao/article/details/79876377

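The box plot shows that Balance spans a wide range and has many outliers. Most of them form a continuous run of points, but the maximum lies far beyond all the other values, so we delete that record to keep it from distorting the models.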

sns.boxplot(x='Balance',data=df,palette='hls');

[Figure: box plot of Balance (output_15_0.png)]

df.Balance.max()

df[df['Balance'] == 98417]

	Age	Job	Marital	Education	Default	Balance	HHInsurance	CarLoan	Communication	LastContactDay	LastContactMonth	NoOfContacts	DaysPassed	PrevAttempts	Outcome	CallStart	CallEnd	CarInsurance
Id
1743	59	management	married	tertiary	0	98417	0	0	telephone	20	nov	5	-1	0	NaN	10:51:42	10:54:07	0

df.index[1742]

# Drop the row whose index label belongs to the outlier (position 1742, Id 1743)
df_new = df.drop(df.index[1742])
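
Box plot whiskers usually follow the 1.5 × IQR rule, so the same outliers can also be flagged numerically. The following is only a minimal sketch on made-up balances (the `balance` values are hypothetical, with the last one mimicking the extreme maximum). Note that the article itself removes only the single largest value, not every point outside the whiskers.

```python
import pandas as pd

# Hypothetical balances; the last value mimics an extreme outlier
balance = pd.Series([100, 250, 400, 550, 700, 900, 1200, 98417])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = balance.quantile(0.25), balance.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of the points that fall outside the fences
is_outlier = (balance < lower) | (balance > upper)
outliers = balance[is_outlier]
cleaned = balance[~is_outlier]

print(outliers.tolist())
print(len(cleaned))
```

On real data this rule often flags many points, so whether to drop all of them or only the most extreme one remains a judgment call.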

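Handling Missing Values

Missing values are a major issue in data analysis, and handling them is a hurdle of its own. pandas represents missing data as NaN and leaves it out of calculations and plots. Likewise, a predictive model cannot be built until the missing values have been handled. In our case, missing values occur mainly in the Outcome and Communication fields; Job and Education also have a small number.

Fields such as Job and Education have very few missing values, which can be imputed with pandas' backfill / forward-fill (pad) methods. Outcome and Communication have many missing values, so their NaN values are filled with the string 'none'.

fillna: https://blog.csdn.net/weixin_39549734/article/details/81221276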

df_new.isnull().sum()

Age                    0
Job                   19
Marital                0
Education            169
Default                0
Balance                0
HHInsurance            0
CarLoan                0
Communication        902
LastContactDay         0
LastContactMonth       0
NoOfContacts           0
DaysPassed             0
PrevAttempts           0
Outcome             3041
CallStart              0
CallEnd                0
CarInsurance           0
dtype: int64

# ffill() (equivalent to the older fillna(method='pad')) fills each missing value with the previous non-missing value

df_new['Job'] = df_new['Job'].ffill()
df_new['Education'] = df_new['Education'].ffill()

df_new['Communication'] = df_new['Communication'].fillna('none')
df_new['Outcome'] = df_new['Outcome'].fillna('none')

df_new['Outcome'].value_counts()

none       3041
failure     437
success     326
other       195
Name: Outcome, dtype: int64

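After filling the missing values of Outcome with none, the frequency of none is exactly 3041, which matches the number of missing values counted earlier.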
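
To make the two imputation strategies concrete, here is a small sketch on a made-up frame (the `toy` data is hypothetical). It uses `Series.ffill()`, the forward fill that `method='pad'` refers to, for a sparse gap, and a constant placeholder where missingness is itself informative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in both columns
toy = pd.DataFrame({
    "Job": ["management", np.nan, "student", np.nan],
    "Outcome": [np.nan, "failure", np.nan, "success"],
})

# Forward fill: each NaN takes the previous non-missing value
toy["Job"] = toy["Job"].ffill()

# Constant fill: NaN becomes its own category, 'none'
toy["Outcome"] = toy["Outcome"].fillna("none")

print(toy)
```

One caveat: a forward fill cannot fill a NaN in the very first row, because there is no earlier value to copy.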

df_new.isnull().sum()

Age                 0
Job                 0
Marital             0
Education           0
Default             0
Balance             0
HHInsurance         0
CarLoan             0
Communication       0
LastContactDay      0
LastContactMonth    0
NoOfContacts        0
DaysPassed          0
PrevAttempts        0
Outcome             0
CallStart           0
CallEnd             0
CarInsurance        0
dtype: int64

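Visualization

Visualization is an important part of data science; without it, results are hard to take in at a glance. The numbers in a table are exact, but reading through the details to reach a conclusion is a pain point. Charts make these tasks much easier for non-technical readers, and executives and managers like reports in visual form so that they can make complex decisions more easily. Below is a pairplot, which plots the fields of interest against each other in pairs. The pairplot variables were chosen from a heatmap as those that influence the outcome.

**Key takeaways from the pairplot**

* People aged 30-60 are more likely to buy car insurance [plot (1,1)].
* People who have a car loan or household insurance are somewhat less likely to buy [the bimodal plots at positions (3,3) and (4,4)].
* As the number of days passed (the time since they were last contacted) increases, people tend to respond positively [plot (7,6)].
* When people are contacted too often, their inclination to buy drops sharply after more than 20 contacts [bottom row of plots].
* The more contacts made with a customer during this campaign, the better the result, i.e. car insurance purchases increase [plot (7,5)].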

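df_sub = ['Age','Balance','HHInsurance', 'CarLoan','NoOfContacts','DaysPassed','PrevAttempts','CarInsurance']  # all of these are numeric variables
sns.pairplot(df_new[df_sub],hue='CarInsurance',height=1.5);   # note that df_sub includes the target variable CarInsurance; newer seaborn versions use height instead of size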

[Figure: pairplot of the numeric variables, colored by CarInsurance (output_33_1.png)]

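PairGrid lets us look at the relationship between CarInsurance, Balance and categorical variables such as Education, Marital and Job. Students and retirees buy car insurance most often [plot (1,3)], and people with tertiary education and single people are also more inclined to buy [plots (1,1) and (1,2)]. The bottom row shows which groups have a higher average yearly balance. Because CarInsurance only takes the values 0 and 1, each bar in the top row is the share of customers who bought insurance, while each Balance bar is a group mean.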

# Newer seaborn versions use height instead of size
g = sns.PairGrid(df_new,
                 x_vars=["Education","Marital", "Job"],
                 y_vars=["CarInsurance", "Balance"],
                 aspect=.75, height=6)
plt.xticks(rotation=90)
g.map(sns.barplot, palette="pastel");

[Figure: bar plots of CarInsurance and Balance by Education, Marital and Job (output_35_1.png)]

In the violin plot, the bulge near y = 1 shows that March, September, October and December are the best months for selling car insurance.

sns.violinplot(x="LastContactMonth",y='CarInsurance',data=df_new);

[Figure: violin plot of CarInsurance by LastContactMonth (output_37_0.png)]
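
The reading of the violin plot can be cross-checked numerically: since CarInsurance is 0/1, its mean per month is the purchase rate for that month. A minimal sketch on made-up rows (the `toy` frame is hypothetical; on the real data one would group `df_new` in the same way):

```python
import pandas as pd

# Hypothetical contacts: two in March, four in May
toy = pd.DataFrame({
    "LastContactMonth": ["mar", "mar", "may", "may", "may", "may"],
    "CarInsurance":     [1,     1,     0,     0,     1,     0],
})

# Mean of a 0/1 column per group = share of buyers in that group
rate = (toy.groupby("LastContactMonth")["CarInsurance"]
           .mean()
           .sort_values(ascending=False))

print(rate)
```

It is worth looking at the group sizes as well, since a high rate based on only a handful of calls is weak evidence.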

sns.countplot(x="Outcome",hue='CarInsurance',data=df_new);

[Figure: count plot of Outcome, split by CarInsurance (output_38_0.png)]

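Feature Engineering

Feature engineering is a fundamental part of any machine learning problem. Our data has continuous variables such as Age and Balance that need to be binned. We use the quantile-based cut function to split the continuous Age and Balance variables into 5 bins each.

pd.qcut: https://blog.csdn.net/starter_____/article/details/79327997

# qcut splits each attribute into 5 equal-frequency bins, labeled 0, 1, 2, 3, 4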
df_new['AgeBinned'] = pd.qcut(df_new['Age'], 5 , labels = False)
df_new['BalanceBinned'] = pd.qcut(df_new['Balance'], 5,labels = False)
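
To see what equal-frequency binning does, here is a small sketch on ten made-up ages (the `ages` values are hypothetical). With `labels=False`, `pd.qcut` returns the integer bin index, and each of the 5 bins receives the same number of rows.

```python
import pandas as pd

# Hypothetical ages, already sorted for readability
ages = pd.Series([18, 22, 25, 29, 33, 38, 42, 47, 55, 63])

# Five quantile bins; labels=False returns the bin index 0..4
age_bin = pd.qcut(ages, 5, labels=False)

print(age_bin.tolist())                              # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(age_bin.value_counts().sort_index().tolist())  # [2, 2, 2, 2, 2]
```

Unlike `pd.cut`, which uses equal-width intervals, the bin edges here adapt to the data, so a skewed variable such as Balance still yields balanced bins.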

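CallStart and CallEnd pose a peculiar problem: they are stored as object (string) columns, but the datetime functions handle them easily. Converting both to datetime and subtracting one from the other gives the actual CallTime, which can then be binned as above.

# Convert CallStart and CallEnd to the datetime type (the explicit format avoids format guessing)
df_new['CallStart'] = pd.to_datetime(df_new['CallStart'], format='%H:%M:%S')
df_new['CallEnd'] = pd.to_datetime(df_new['CallEnd'], format='%H:%M:%S')

# End time minus start time gives the actual call duration in seconds
df_new['CallTime'] = (df_new['CallEnd'] - df_new['CallStart']).dt.total_seconds()

# Bin the call duration
df_new['CallTimeBinned'] = pd.qcut(df_new['CallTime'], 5,labels = False)

# Drop the original columns that the binned versions replace, to keep the data frame tidy
df_new.drop(['Age','Balance','CallStart','CallEnd','CallTime'],axis = 1,inplace = True)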
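
As a quick check of the duration arithmetic, the sketch below recomputes the call length for the first two rows shown in `df.head()` above. It uses `pd.to_timedelta` instead of `pd.to_datetime`, which is an alternative of my choosing rather than the article's method; both give the same difference in seconds.

```python
import pandas as pd

# CallStart / CallEnd of the first two rows from df.head()
calls = pd.DataFrame({
    "CallStart": ["13:45:20", "14:49:03"],
    "CallEnd":   ["13:46:30", "14:52:08"],
})

# Parse the HH:MM:SS strings as durations since midnight and subtract
duration = (pd.to_timedelta(calls["CallEnd"])
            - pd.to_timedelta(calls["CallStart"])).dt.total_seconds()

print(duration.tolist())  # [70.0, 185.0]
```

Either approach assumes a call never crosses midnight; otherwise the difference would come out negative.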

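Categorical variables can also take part in model building, provided they are first converted to dummy variables. This process adds more columns to the data frame.

get_dummies usage: https://blog.csdn.net/maymay_/article/details/80198468

# Use get_dummies to create a binary indicator column for each value of each categorical column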
Job = pd.get_dummies(data = df_new['Job'],prefix = "Job")
Marital= pd.get_dummies(data = df_new['Marital'],prefix = "Marital")
Education= pd.get_dummies(data = df_new['Education'],prefix="Education")
Communication = pd.get_dummies(data = df_new['Communication'],prefix = "Communication")
LastContactMonth = pd.get_dummies(data = df_new['LastContactMonth'],prefix= "LastContactMonth")
Outcome = pd.get_dummies(data = df_new['Outcome'],prefix = "Outcome")

# Drop the categorical columns that now have dummy variables
df_new.drop(['Job','Marital','Education','Communication','LastContactMonth','Outcome'],axis=1,inplace=True)

# Concatenate all the columns we need
df = pd.concat([df_new,Job,Marital,Education,Communication,LastContactMonth,Outcome],axis=1)

df.columns

Index(['Default', 'HHInsurance', 'CarLoan', 'LastContactDay', 'NoOfContacts',
       'DaysPassed', 'PrevAttempts', 'CarInsurance', 'AgeBinned',
       'BalanceBinned', 'CallTimeBinned', 'Job_admin.', 'Job_blue-collar',
       'Job_entrepreneur', 'Job_housemaid', 'Job_management', 'Job_retired',
       'Job_self-employed', 'Job_services', 'Job_student', 'Job_technician',
       'Job_unemployed', 'Marital_divorced', 'Marital_married',
       'Marital_single', 'Education_primary', 'Education_secondary',
       'Education_tertiary', 'Communication_cellular', 'Communication_none',
       'Communication_telephone', 'LastContactMonth_apr',
       'LastContactMonth_aug', 'LastContactMonth_dec', 'LastContactMonth_feb',
       'LastContactMonth_jan', 'LastContactMonth_jul', 'LastContactMonth_jun',
       'LastContactMonth_mar', 'LastContactMonth_may', 'LastContactMonth_nov',
       'LastContactMonth_oct', 'LastContactMonth_sep', 'Outcome_failure',
       'Outcome_none', 'Outcome_other', 'Outcome_success'],
      dtype='object')
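
The effect of dummy encoding is easiest to see on a single small column. This is a minimal sketch with made-up values (the `toy` frame is hypothetical): each category becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical marital statuses
toy = pd.DataFrame({"Marital": ["single", "married", "divorced", "married"]})

# One indicator column per category, named <prefix>_<value>
dummies = pd.get_dummies(toy["Marital"], prefix="Marital")

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # exactly one indicator is set per row
```

Because the indicators of one variable always sum to 1, they are perfectly collinear; for linear models such as logistic regression, `pd.get_dummies(..., drop_first=True)` is an option worth considering, although the article keeps all columns.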

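A model is usually evaluated by training it on data with known outputs (labeled data) so that it can learn from them, and then testing it on held-out data it has not seen, which lets us measure its predictive accuracy. That is the reason for the train/test split.

train_test_split: https://blog.csdn.net/Lynn_mg/article/details/83062630

X= df.drop(['CarInsurance'],axis=1).values
y=df['CarInsurance'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=42, stratify = y) # stratify=y keeps the class proportions of y in both splits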
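
What `stratify=y` buys us can be illustrated on synthetic labels. This is a sketch under the assumption of a 60/40 class mix, roughly the ratio in this dataset; the arrays `X_demo` and `y_demo` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 negatives and 40 positives
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

# Both splits keep the 40% positive rate of the full data
print(yd_train.mean(), yd_test.mean())
```

Without stratification the positive rate in a small test set can drift by a few points from one random seed to the next, which makes accuracy figures harder to compare.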

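Building and Validating Predictive Models

**Predictive models**

scikit-learn ships many classification algorithms, and we use most of those that fit this problem. Our classifiers are:

1. kNN
2. Logistic Regression
3. SVM
4. Decision Tree
5. Random Forest
6. AdaBoost
7. XGBoost (in the code this is scikit-learn's GradientBoostingClassifier, not the XGBoost library)

**Cross-validation**

Cross-validation repeatedly splits the data into training and validation sets to evaluate a model's performance. In K-fold cross-validation, K is the number of folds the data is divided into; in each round one fold is held out for validation and the remaining K-1 folds are used for training. Here K = 10, so each model's cross-validation score is the mean accuracy over 10 folds.

The best models are **Random Forest and XGBoost**; both do the job well, with good accuracy scores.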
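
The fold mechanics can be made visible with a tiny sketch. It uses plain `KFold` on 20 made-up samples (`X_demo` is hypothetical); note that when an integer such as `cv=10` is passed to `cross_val_score` for a classifier, scikit-learn uses stratified folds instead, so this only approximates what the calls in this article do.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 hypothetical samples
X_demo = np.arange(20).reshape(-1, 1)

kf = KFold(n_splits=10, shuffle=True, random_state=0)

sizes = []
validated = []
for train_idx, val_idx in kf.split(X_demo):
    # 9 folds for training, 1 fold for validation in every round
    sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())

print(sizes[0], len(sizes))
print(sorted(validated) == list(range(20)))  # each sample is validated exactly once
```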

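# The confusion-matrix plotting code below is adapted from the scikit-learn documentation
# Define a helper that plots a confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Optionally convert the counts to per-row (true label) fractions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell value, in white on dark cells and black on light cells
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')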

# confusion_matrix orders the classes as 0, 1, so label 0 (did not buy) must come first
class_names = ['Failure','Success']

knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train,y_train)
print ("kNN Accuracy is %2.2f" % accuracy_score(y_test, knn.predict(X_test)))

# 10-fold cross-validation
score_knn = cross_val_score(knn, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_knn)
y_pred= knn.predict(X_test)
print(classification_report(y_test, y_pred))


cm = confusion_matrix(y_test,y_pred)
# Plot the confusion matrix
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

kNN Accuracy is 0.76
Cross Validation Score = 0.75
              precision    recall  f1-score   support

           0       0.75      0.90      0.82       479
           1       0.78      0.55      0.65       321

    accuracy                           0.76       800
   macro avg       0.76      0.72      0.73       800
weighted avg       0.76      0.76      0.75       800

[Figure: confusion matrix for kNN (output_58_1.png)]
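
The precision and recall columns of `classification_report` follow directly from the four cells of the confusion matrix. Here is a small worked sketch on ten made-up predictions (`y_true_demo` and `y_pred_demo` are hypothetical, not taken from any model above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# For binary labels 0/1, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted buyers, how many really bought
recall = tp / (tp + fn)      # of the real buyers, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)                # 4 1 2 3
print(precision, recall, accuracy)   # 0.75 0.6 0.7
```

For a cold-calling campaign, recall on class 1 is arguably the figure to watch, because every missed buyer is a lost sale.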

#Logistic Regression Classifier
LR = LogisticRegression()
LR.fit(X_train,y_train)
print ("Logistic Accuracy is %2.2f" % accuracy_score(y_test, LR.predict(X_test)))
score_LR = cross_val_score(LR, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_LR)
y_pred = LR.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion matrix for LR
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Logistic Accuracy is 0.83
Cross Validation Score = 0.81
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       479
           1       0.80      0.78      0.79       321

    accuracy                           0.83       800
   macro avg       0.82      0.82      0.82       800
weighted avg       0.83      0.83      0.83       800

[Figure: confusion matrix for Logistic Regression (output_59_5.png)]

SVM = svm.SVC()
SVM.fit(X_train, y_train)
print ("SVM Accuracy is %2.2f" % accuracy_score(y_test, SVM.predict(X_test)))
score_svm = cross_val_score(SVM, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_svm)
y_pred = SVM.predict(X_test)
print(classification_report(y_test,y_pred))
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

SVM Accuracy is 0.67
Cross Validation Score = 0.66
              precision    recall  f1-score   support

           0       0.66      0.91      0.77       479
           1       0.70      0.31      0.43       321

    accuracy                           0.67       800
   macro avg       0.68      0.61      0.60       800
weighted avg       0.68      0.67      0.63       800

[Figure: confusion matrix for SVM (output_60_1.png)]

# Decision Tree Classifier
DT = tree.DecisionTreeClassifier(random_state = 0,class_weight="balanced",
    min_weight_fraction_leaf=0.01)
DT = DT.fit(X_train,y_train)
print ("Decision Tree Accuracy is %2.2f" % accuracy_score(y_test, DT.predict(X_test)))
score_DT = cross_val_score(DT, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_DT)
y_pred = DT.predict(X_test)
print(classification_report(y_test, y_pred))
# Confusion Matrix for Decision Tree
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Decision Tree Accuracy is 0.82
Cross Validation Score = 0.81
              precision    recall  f1-score   support

           0       0.88      0.81      0.84       479
           1       0.74      0.83      0.79       321

    accuracy                           0.82       800
   macro avg       0.81      0.82      0.81       800
weighted avg       0.82      0.82      0.82       800

[Figure: confusion matrix for Decision Tree (output_61_1.png)]

#Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
rfc.fit(X_train, y_train)
print ("Random Forest Accuracy is %2.2f" % accuracy_score(y_test, rfc.predict(X_test)))
score_rfc = cross_val_score(rfc, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_rfc)
y_pred = rfc.predict(X_test)
print(classification_report(y_test,y_pred ))
#Confusion Matrix for Random Forest
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

Random Forest Accuracy is 0.86
Cross Validation Score = 0.84
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       479
           1       0.80      0.86      0.83       321

    accuracy                           0.86       800
   macro avg       0.85      0.86      0.85       800
weighted avg       0.86      0.86      0.86       800

[Figure: confusion matrix for Random Forest (output_62_1.png)]

#AdaBoost Classifier
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("AdaBoost Accuracy= %2.2f" % accuracy_score(y_test,ada.predict(X_test)))
score_ada = cross_val_score(ada, X, y, cv=10).mean()
print("Cross Validation Score = %2.2f" % score_ada)
y_pred = ada.predict(X_test)
print(classification_report(y_test,y_pred ))
#Confusion Matrix for AdaBoost
cm = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

AdaBoost Accuracy= 0.83
Cross Validation Score = 0.82
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       479
           1       0.82      0.73      0.77       321

    accuracy                           0.83       800
   macro avg       0.83      0.81      0.82       800
weighted avg       0.83      0.83      0.83       800

[Figure: confusion matrix for AdaBoost (output_63_1.png)]

# Gradient Boosting classifier (scikit-learn's GradientBoostingClassifier, called XGBoost in this article)
xgb = GradientBoostingClassifier(n_estimators=1000,learning_rate=0.01)
xgb.fit(X_train,y_train)
print ("GradientBoost Accuracy= %2.2f" % accuracy_score(y_test,xgb.predict(X_test)))
score_xgb = cross_val_score(xgb, X, y, cv=10).mean()
# Print score_xgb; the original line printed score_ada, so the cross-validation score recorded below is AdaBoost's
print("Cross Validation Score = %2.2f" % score_xgb)
y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))
# Confusion matrix for the Gradient Boosting classifier
cm_xg = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(cm_xg, classes=class_names, title='Confusion matrix')

GradientBoost Accuracy= 0.85
Cross Validation Score = 0.82
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       479
           1       0.82      0.79      0.81       321

    accuracy                           0.85       800
   macro avg       0.84      0.84      0.84       800
weighted avg       0.85      0.85      0.85       800

[Figure: confusion matrix for Gradient Boosting (output_64_1.png)]

ROC Curves

The ROC curves of all models are plotted together. The curves for Gradient Boosting (XGBoost) and Random Forest lie closest to the top-left corner, which indicates that these are the best predictive models.

ROC: https://blog.csdn.net/kMD8d5R/article/details/98552574?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-4.nonecase

#Obtaining False Positive Rate, True Positive Rate and Threshold for all classifiers
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
LR_fpr, LR_tpr, thresholds = roc_curve(y_test, LR.predict_proba(X_test)[:,1])
#SVM_fpr, SVM_tpr, thresholds = roc_curve(y_test, SVM.predict_proba(X_test)[:,1])
DT_fpr, DT_tpr, thresholds = roc_curve(y_test, DT.predict_proba(X_test)[:,1])
rfc_fpr, rfc_tpr, thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, thresholds = roc_curve(y_test, xgb.predict_proba(X_test)[:,1])
#Plotting ROC Curves for all classifiers
plt.plot(fpr, tpr, label='KNN' )
plt.plot(LR_fpr, LR_tpr, label='Logistic Regression')
#plt.plot(SVM_fpr, SVM_tpr, label='SVM')
plt.plot(DT_fpr, DT_tpr, label='Decision Tree')
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest')
plt.plot(ada_fpr, ada_tpr, label='AdaBoost')
plt.plot(xgb_fpr, xgb_tpr, label='GradientBoosting')
# Plot Base Rate ROC
plt.plot([0,1],[0,1],label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curves for all classifiers (output_67_0.png)]
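
"Closest to the top-left corner" can be summarized in a single number, the area under the ROC curve (AUC). The sketch below uses four made-up scores (`y_true_demo` and `scores_demo` are hypothetical); for the models above one could pass `y_test` and the same `predict_proba(X_test)[:,1]` columns to `roc_auc_score`, which is not imported in the article's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of class 1
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the area under it
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
auc_demo = roc_auc_score(y_true_demo, scores_demo)

print(fpr_demo, tpr_demo)
print(auc_demo)  # 0.75
```

An AUC of 0.5 corresponds to the diagonal base-rate line and 1.0 to a perfect ranking, so comparing AUC values is a compact way to rank the classifiers.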

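Feature Importance

Important features are identified with models such as logistic regression and decision trees, both of which rank features clearly. The chart below shows the most important variables as determined by ExtraTreesClassifier; the top 10 are:

CallTimeBinned
LastContactDay
BalanceBinned
NoOfContacts
Outcome_success
AgeBinned
HHInsurance
Communication_none
DaysPassed
Outcome_none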

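# Apply recursive feature elimination (RFE) with a logistic regression model
modell = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(modell, n_features_to_select=5)
rfe = rfe.fit(X_train,y_train)
# Show the ranking of the variables (1 = selected)
rfe.ranking_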

array([10,  9, 16, 41, 34, 42, 32, 36, 33,  3, 24, 15, 17, 27, 28, 20, 22,
       29,  2, 40, 26, 38, 21, 37, 31, 30, 25, 19,  1, 18, 39,  7, 11, 35,
        4,  5, 23,  1,  6,  8,  1,  1, 13, 12, 14,  1])
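
The raw `ranking_` array is hard to read without feature names. Here is a minimal sketch of pairing ranks with names on synthetic data (the `make_classification` data and the names `f0`..`f7` are hypothetical stand-ins); in this article the same pairing could be done with the columns of `df.drop(['CarInsurance'], axis=1)`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 8 features
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)
names = [f"f{i}" for i in range(8)]

# Keep 3 features; the others are eliminated one at a time
rfe_toy = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_toy.fit(X_toy, y_toy)

# Rank 1 marks the selected features; larger ranks were dropped earlier
ranking = pd.Series(rfe_toy.ranking_, index=names).sort_values()
selected = ranking[ranking == 1].index.tolist()

print(ranking)
print(selected)
```

In the output above there are five entries equal to 1, matching the 5 features requested, and the largest rank (42) belongs to the feature that was eliminated first.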

# Fit an ExtraTreesClassifier model
model = ExtraTreesClassifier()
model.fit(X_train, y_train)


print(model.feature_importances_)
importances = model.feature_importances_
feat_names = df.drop(['CarInsurance'],axis=1).columns

# Sort the features by importance and display them as a bar chart
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color='lightblue',  align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()

[0.00268756 0.0313893  0.01658237 0.06520325 0.04730283 0.01636741
 0.01347456 0.04461972 0.04937946 0.25765298 0.01216092 0.01304826
 0.00579582 0.00493788 0.01324779 0.0097079  0.00640912 0.0091661
 0.00601809 0.01457932 0.00655431 0.01135298 0.01605904 0.01331409
 0.00989652 0.01628276 0.01403667 0.01703035 0.02387476 0.00606056
 0.01792573 0.01598123 0.00329203 0.0090054  0.00783392 0.01389945
 0.01580459 0.01121934 0.01654914 0.00965812 0.01079382 0.00951623
 0.00932224 0.02123694 0.00591377 0.04785536]

[Figure: feature importances from ExtraTreesClassifier with their cumulative sum (output_70_1.png)]

rfc = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=10,class_weight="balanced")
y_proba = cross_val_predict(rfc, X, y, cv=10, n_jobs=-1, method='predict_proba')
results = pd.DataFrame({'y': y, 'y_proba': y_proba[:,1]})
results = results.sort_values(by='y_proba', ascending=False).reset_index(drop=True)
results.index = results.index + 1
results.index = results.index / len(results.index) * 100

sns.set_style('darkgrid')
pred = results
pred['Lift Curve'] = pred.y.cumsum() / pred.y.sum() * 100
pred['Baseline'] = pred.index
base_rate = y.sum() / len(y) * 100
pred[['Lift Curve', 'Baseline']].plot(style=['-', '--', '--'])
pd.Series(data=[0, 100, 100], index=[0, base_rate, 100]).plot(style='--')
plt.title('Cumulative Gains')
plt.xlabel('% of Customers Contacted')
plt.ylabel("% of Positive Results")
plt.legend(['Lift Curve', 'Baseline', 'Ideal']);
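
The cumulative gains chart above answers the question "if we call only the top x% of customers ranked by predicted probability, what share of all buyers do we reach?". The arithmetic behind it can be sketched on ten made-up customers (`y_true_demo` and `y_score_demo` are hypothetical):

```python
import numpy as np

# Hypothetical outcomes and predicted purchase probabilities
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_score_demo = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

# Rank customers from most to least likely to buy
order = np.argsort(-y_score_demo)

# Cumulative share of all buyers reached after each additional call
gains = np.cumsum(y_true_demo[order]) / y_true_demo.sum() * 100
pct_contacted = np.arange(1, len(y_true_demo) + 1) / len(y_true_demo) * 100

print(pct_contacted[3], gains[3])  # 40.0 75.0
```

In this toy ranking, calling the top 40% of customers reaches 75% of the buyers, whereas calling 40% at random would be expected to reach about 40%; that gap is what the curve above the baseline represents.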