Project: Using Machine Learning to Analyze Why Telecom Customers Churn

Carrie_Lei

Published 2024-09-28 21:38:20

Article link: https://blog.csdn.net/finly4599/article/details/142612426

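Data source

The "Telecom Customer Churn Data" dataset on DataFountain: https://sso.datafountain.cn/

Goals of the analysis

Characteristics of customers who churn
How the features in the dataset correlate with churn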

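Approach

1. Data collection

Customer profile data: collect basic customer information (such as age, gender, region, and plan type).
Customer behavior data: collect call records, time spent online, top-up records, complaint records, and so on.
Churn label: mark which customers churned within a given time window (for example, cancelled their subscription or stopped topping up).

2. Data preprocessing

Data cleaning: handle missing values and outliers to ensure data quality.
Feature engineering:
- Encoding: apply one-hot encoding or label encoding to categorical variables (such as plan type).
- Feature selection: use correlation analysis and feature selection algorithms (such as recursive feature elimination, RFE) to pick the important features.
- Derived features: for example, compute each customer's average spend or the time since their last top-up.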
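
The encoding and feature selection steps listed above can be sketched as follows. This is only an illustration on made-up data: the `plan` column, its values, and the synthetic matrix from `make_classification` are assumptions, not part of the churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy frame with one nominal column (hypothetical values, not the real dataset).
toy = pd.DataFrame({"plan": ["basic", "premium", "basic", "family"]})

# One-hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(toy["plan"], prefix="plan", dtype=int)

# Label encoding: one integer code per category (this implies an order).
label_codes = toy["plan"].astype("category").cat.codes

# RFE on synthetic data: repeatedly drop the weakest feature until 3 remain.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
selected_mask = selector.support_  # True for the features RFE kept
```

One-hot encoding is usually the safer choice for unordered categories, because integer codes suggest an order that linear models will try to use.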

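3. Model selection

Classification models:
- Logistic regression: a baseline model that gives a quick read on how each feature relates to churn.
- Decision tree / random forest: capture non-linear relationships and provide feature importance estimates.
- Support vector machine (SVM): suited to high-dimensional data and able to handle both linear and non-linear problems.
- XGBoost: a gradient-boosted tree model with strong performance, suited to complex patterns.

4. Model training and evaluation

Train/test split: divide the data into a training set and a test set (for example, 70% training and 30% testing).
Model training: train on the training set and tune hyperparameters to improve performance.
Model evaluation: assess the model with accuracy, recall, F1-score, the ROC curve, and similar metrics. Consider cross-validation to get a more stable estimate and to detect overfitting.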
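
A minimal sketch of this split-train-evaluate loop, assuming synthetic data from `make_classification` in place of the churn table and logistic regression as the baseline model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 25% positives).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.75],
                           random_state=0)

# 70/30 split; stratify keeps the churn share equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

metrics = {
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}

# 5-fold cross-validation on the training part only.
cv_f1 = cross_val_score(LogisticRegression(max_iter=1000),
                        X_train, y_train, cv=5, scoring="f1")
```

A large gap between the cross-validated F1 and the test F1 would be a hint that the model is overfitting or that the split was unlucky.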

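5. Result analysis and interpretation

Feature importance analysis: identify the main factors behind churn to help the operations team design interventions.
Model interpretation: use techniques such as SHAP values or LIME to explain the model's decisions and keep them transparent.

6. Prediction and application

Churn prediction: apply the trained model to new customers to flag likely churners early.
Retention strategy: design retention measures based on the reasons for churn, such as promotional offers or better customer service.

7. Ongoing monitoring and optimization

Model monitoring: check the model's predictive performance regularly and update it as customer behavior changes.
Feedback loop: feed actual churn outcomes back into the model to keep improving it.

Implementation

Loading the data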

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import time

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)

# Turn off scientific notation so numbers are easier to read
np.set_printoptions(precision=3, suppress=True)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Load the data
data = pd.read_csv('./datas/Telco-Customer-Churn.csv')

# Look at the first 10 rows
data.head(10)

[Figure: output of data.head(10)]

# Show column information
data.info()

[Figure: output of data.info()]

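Understanding the data

Field	Meaning
customerID	Customer ID, unique, string, e.g. 7590-VHVEG
gender	Customer gender, one of two values, string: Female/Male
SeniorCitizen	Whether the customer is a senior citizen, one of two values, stored as 0/1
Partner	Whether the customer has a partner, one of two values, string: Yes/No
Dependents	Whether the customer has dependents, one of two values, string: Yes/No
tenure	Number of months with the company, int, e.g. 34
PhoneService	Whether phone service is active, one of two values, string: Yes/No
MultipleLines	Whether multiple lines are active, one of three values, string: Yes, No, No phone service
InternetService	Internet service type, one of three values, string: DSL, Fiber optic, No
OnlineSecurity	Whether online security is active, one of three values, string: Yes, No, No internet service
OnlineBackup	Whether online backup is active, one of three values, string: Yes, No, No internet service
DeviceProtection	Whether device protection is active, one of three values, string: Yes, No, No internet service
TechSupport	Whether tech support is active, one of three values, string: Yes, No, No internet service
StreamingTV	Whether streaming TV is active, one of three values, string: Yes, No, No internet service
StreamingMovies	Whether streaming movies is active, one of three values, string: Yes, No, No internet service
Contract	Contract term, one of three values, string: Month-to-month, One year, Two year
PaperlessBilling	Whether paperless billing is used, one of two values, string: Yes/No
PaymentMethod	Payment method, one of four values, string: Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check
MonthlyCharges	Monthly charges, float, e.g. 29.85
TotalCharges	Total charges, float, e.g. 1889.35
Churn	Whether the customer churned, one of two values, string: Yes/No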

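Data preprocessing

# Check each column for missing values
data.isnull().any()

[Figure: output of data.isnull().any()]
There are no missing values.

# Count duplicate rows
data.duplicated().sum()

[Figure: output of data.duplicated().sum()]
There are no duplicate rows.
However, the info() output above lists TotalCharges as object even though it should be a float. Convert TotalCharges to float64 and look again.

# Coerce to numeric; astype raises an error here, so unparseable entries are turned into NaN instead
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'].dtype
data.info()

[Figure: output of data.info() after the conversion]
After the conversion to float64, TotalCharges turns out to contain missing values.
Plot histograms of TotalCharges, overall and split by Churn, to decide how to fill them.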

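import seaborn as sns

sns.set(style='darkgrid', font_scale=1.3)
# Three histograms: all customers, churned customers, retained customers
# (histplot is used because distplot is deprecated in recent seaborn releases)
plt.figure(figsize=(14, 5))
plt.subplot(1, 3, 1)
plt.title('total charges')
sns.histplot(data['TotalCharges'].dropna(), kde=True, stat='density')

plt.subplot(1, 3, 2)
plt.title('churn = yes & total charges')
sns.histplot(data[data['Churn'] == 'Yes']['TotalCharges'].dropna(), kde=True, stat='density')

plt.subplot(1, 3, 3)
plt.title('churn = no & total charges')
sns.histplot(data[data['Churn'] == 'No']['TotalCharges'].dropna(), kde=True, stat='density')

[Figure: TotalCharges histograms for all, churned, and retained customers]
The plots show that TotalCharges is skewed in every group, so the missing values are filled with the median rather than the mean.

data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
data.isnull().sum()

[Figure: output of data.isnull().sum()]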
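
The conversion and fill used above can be tried on a tiny example. The values below are made up for illustration; the blank string stands in for whatever non-numeric entries the real column contains.

```python
import pandas as pd

# Hypothetical values; the blank string mimics a non-numeric TotalCharges entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15", "7895.15"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
charges = pd.to_numeric(raw, errors="coerce")
n_missing = int(charges.isna().sum())

# For a right-skewed column the median sits below the mean,
# so a median fill is pulled less by the few very large bills.
median_value = charges.median()
mean_value = charges.mean()
filled = charges.fillna(median_value)
```

Here the median is 998.825 while the mean is about 2480.66, which shows how far a single large value drags the mean in a skewed column.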

Outlier handling

# Summary statistics of the numeric features
data.describe()

[Figure: output of data.describe()]
SeniorCitizen only takes the values 0 and 1, so box plots are drawn for the other numeric columns to check for outliers.

import seaborn as sns
import matplotlib.pyplot as plt    # plotting
# Render figures inline in Jupyter Notebook
%matplotlib inline

# Box plots of the numeric features
fig = plt.figure(figsize=(15, 6))    # create the figure
flierprops = {"marker": "o", "markerfacecolor": "steelblue"}

# One horizontal box plot per feature, stacked in three rows
for pos, col in enumerate(['tenure', 'MonthlyCharges', 'TotalCharges'], start=1):
    ax = fig.add_subplot(3, 1, pos)
    ax.boxplot(list(data[col]), vert=False, showmeans=True, flierprops=flierprops)
    ax.set_title(col)

plt.tight_layout(pad=1.5)    # spacing between subplots
plt.show()    # display the box plots

[Figure: box plots of tenure, MonthlyCharges, and TotalCharges]
The box plots show no outliers in the numeric features.
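
The same check can be done numerically. The sketch below applies the usual 1.5 × IQR rule that box plots draw by default; the helper name `iqr_outliers` and the sample values are hypothetical, and quartile conventions can differ slightly between libraries.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return arr[(arr < lower) | (arr > upper)]

# Hypothetical tenure-like values: nothing falls outside the fences.
no_outliers = iqr_outliers([1, 5, 12, 24, 36, 48, 60, 72])
# Adding one extreme value produces a flagged point.
with_outlier = iqr_outliers([1, 5, 12, 24, 36, 48, 60, 72, 500])
```

An empty result for a column corresponds to a box plot with no points drawn beyond the whiskers.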

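Visual analysis

1. Share of churned customers

churn_value = data["Churn"].value_counts()

plt.figure(figsize=(10, 6))
plt.pie(churn_value, labels=churn_value.index, explode=(0, 0.1), autopct='%1.2f%%')    # labels follow the order of the counts
plt.title("Churn=Yes%")
plt.show()

[Figure: pie chart of Churn = No vs. Churn = Yes]
The chart shows that churned customers make up 26.54% of the data, so the classes are imbalanced.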
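
A short sketch of why this imbalance matters for evaluation. The label counts below are made up to roughly match the reported share; they are not the real dataset.

```python
import pandas as pd

# Hypothetical labels with roughly the imbalance reported above.
churn = pd.Series(["No"] * 735 + ["Yes"] * 265)

share = churn.value_counts(normalize=True)
churn_rate = share["Yes"]

# A "model" that always predicts the majority class.
majority = share.idxmax()
always_majority = pd.Series([majority] * len(churn))

# It looks accurate because most customers do not churn...
baseline_accuracy = (always_majority == churn).mean()
# ...but it never catches a single churner.
caught = ((always_majority == "Yes") & (churn == "Yes")).sum()
baseline_recall = caught / (churn == "Yes").sum()
```

This is why recall and F1-score, rather than accuracy alone, are the more informative metrics for the churn class.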
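2. Correlation between features

[Figure: correlation heatmap of the features]
The heatmap shows positive correlations among InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies.
PhoneService and MultipleLines are also positively correlated.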
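
The code behind this heatmap is not shown in the post. One possible way to build such a matrix, offered only as an assumption, is to integer-code each categorical column and take the Pearson correlation; the helper name `categorical_corr` and the toy rows are hypothetical.

```python
import pandas as pd

def categorical_corr(frame):
    """Integer-code every column, then return the Pearson correlation matrix."""
    coded = frame.apply(lambda col: pd.factorize(col)[0])
    return coded.corr()

# Toy rows using value names from the field table (not real customers).
toy = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No", "DSL", "No", "Fiber optic"],
    "OnlineSecurity": ["Yes", "No", "No internet service", "No",
                       "No internet service", "Yes"],
    "gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
})
corr = categorical_corr(toy)
```

The resulting matrix can be passed to `sns.heatmap`. Integer codes impose an arbitrary order on unordered categories, so the values are a rough guide rather than a precise measure of association.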
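3. Churn versus the other features

df_onehot = pd.get_dummies(data.iloc[:, 1:21], dtype=int)    # one-hot encode every column except customerID
df_onehot.head()

plt.figure(figsize=(15, 6))
df_onehot.corr()['Churn_Yes'].sort_values(ascending=False).plot(kind='bar')
plt.title('Correlation between Churn and variables')

[Figure: bar chart of each variable's correlation with Churn_Yes]
The chart shows that Churn_Yes has almost no correlation with PhoneService or gender, so these two features can be ignored.
4. SeniorCitizen, Partner, and Dependents versus Churn

# Effect of senior-citizen status, having a partner, and having dependents on churn
baseCols = ['SeniorCitizen', 'Partner', 'Dependents']

for i in baseCols:
    cnt = pd.crosstab(data[i], data['Churn'])    # contingency table of the feature against the target
    cnt.plot.bar(stacked=True)    # stacked bars make the churn share of each value visible
    plt.show()    # display the figure

[Figure: stacked bar charts of Churn by SeniorCitizen, Partner, and Dependents]
SeniorCitizen is related to churn: senior citizens churn at a higher rate than younger customers.
Having a partner is related to churn: customers without a partner churn at a higher rate than those with one.

Having dependents is also related to churn: customers without dependents churn at a higher rate than those with dependents.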
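
Stacked counts make the comparison visual. To read the churn rate of each group directly, the contingency table can be normalized by row. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows, only to show the mechanics.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column is the churn rate.
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")
senior_rate = rates.loc[1, "Yes"]
non_senior_rate = rates.loc[0, "Yes"]
```

In this toy data the senior group churns at 0.5 and the non-senior group at 0.25; on the real data the same call would give the rates that the stacked bars only hint at.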
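5. tenure versus Churn

# Churn rate by tenure (months with the company)
# Line chart
groupDf = data[['tenure', 'Churn']].copy()    # only these two columns are needed; copy so the original frame is not modified
groupDf['Churn'] = groupDf['Churn'].map({'Yes': 1, 'No': 0})    # encode the target as 1/0 so it can be summed
pctDf = groupDf.groupby(['tenure']).sum() / groupDf.groupby(['tenure']).count()    # churn rate for each tenure value
pctDf = pctDf.reset_index()    # turn the index back into a column

plt.figure(figsize=(10, 5))
plt.plot(pctDf['tenure'], pctDf['Churn'], label='Churn percentage')    # draw the line chart
plt.legend()    # show the legend
plt.show()

[Figure: churn rate by tenure]
Tenure is related to churn: new customers are more likely to leave, and the churn rate trends downward as tenure grows.

pctDf.head()

[Figure: output of pctDf.head()]
The table shows that once tenure exceeds 2 months, the retention rate is higher than the churn rate.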
6. MultipleLines versus Churn

# Multiple-lines service: one pie chart per value
fig, axes = plt.subplots(1, 3, figsize=(15, 6))    # create the figure

for ax, value in zip(axes, ['Yes', 'No', 'No phone service']):
    # Churn counts for this group, always ordered No, Yes so the labels match the slices
    counts = data.loc[data['MultipleLines'] == value, 'Churn'].value_counts().reindex(['No', 'Yes'], fill_value=0)
    ax.pie(counts, labels=['No', 'Yes'], autopct='%1.2f%%', explode=(0, 0.1))
    ax.set_title('Churn of (MultipleLines = {})'.format(value))

plt.tight_layout(pad=0.5)    # spacing between subplots
plt.show()    # display the pie charts

[Figure: churn pie charts by MultipleLines]
The charts show that having multiple lines has very little effect on churn, and that the MultipleLines = No and MultipleLines = No phone service groups look almost the same, so the two values can be merged in later analysis.
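7. InternetService versus Churn

# Internet service
cnt = pd.crosstab(data['InternetService'], data['Churn'])    # contingency table of the feature against the target
cnt.plot.barh(stacked=True, figsize=(15, 6))    # stacked bars make the churn share of each value visible
plt.show()    # display the figure

[Figure: stacked bar chart of Churn by InternetService]
The chart shows that customers without internet service are the smallest group and churn the least, while fiber-optic customers are the largest group and churn the most.

8. OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies versus Churn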

# Internet add-on services
internetCols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for i in internetCols:
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))    # create the figure
    fig.suptitle(i)

    # Left to right: subscribed, not subscribed, no internet service
    for ax, value in zip(axes, ['Yes', 'No', 'No internet service']):
        counts = data.loc[data[i] == value, 'Churn'].value_counts().reindex(['No', 'Yes'], fill_value=0)
        ax.pie(counts, labels=['No', 'Yes'], autopct='%1.2f%%', explode=(0, 0.1))

    plt.tight_layout()    # spacing between subplots
    plt.show()    # display the pie charts

[Figure: churn pie charts for each internet add-on service]

The charts show that for all six add-ons (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies) the customers without internet service have the same churn rate, 7.4%. It is the same group of customers in every chart, because these add-ons are only available to customers who have internet service. Comparing the first two pies for each add-on, the churn rate is lower among subscribers, which suggests that customers with more bundled services churn less. For StreamingTV and StreamingMovies the churn rate differs little between subscribers and non-subscribers, so these two features can also be ignored.
9. Contract versus Churn

# Contract term: one pie chart per value
fig, axes = plt.subplots(1, 3, figsize=(15, 4))    # create the figure

for ax, value in zip(axes, ['Month-to-month', 'One year', 'Two year']):
    # Churn counts for this group, always ordered No, Yes so the labels match the slices
    counts = data.loc[data['Contract'] == value, 'Churn'].value_counts().reindex(['No', 'Yes'], fill_value=0)
    ax.pie(counts, labels=['No', 'Yes'], autopct='%1.2f%%', explode=(0, 0.1))
    ax.set_title('Churn of (Contract = {})'.format(value))

plt.tight_layout(pad=0.5)    # spacing between subplots
plt.show()    # display the pie charts

[Figure: churn pie charts by Contract]
The charts show that the longer the contract term, the lower the churn rate.
10. PaperlessBilling versus Churn

# Paperless billing: one pie chart per value
fig, axes = plt.subplots(1, 2, figsize=(10, 4))    # create the figure

for ax, value in zip(axes, ['Yes', 'No']):
    # Churn counts for this group, always ordered No, Yes so the labels match the slices
    counts = data.loc[data['PaperlessBilling'] == value, 'Churn'].value_counts().reindex(['No', 'Yes'], fill_value=0)
    ax.pie(counts, labels=['No', 'Yes'], autopct='%1.2f%%', explode=(0, 0.1))
    ax.set_title('Churn of (PaperlessBilling = {})'.format(value))

plt.tight_layout(pad=0.5)    # spacing between subplots
plt.show()    # display the pie charts

[Figure: churn pie charts by PaperlessBilling]
The charts show that customers on paperless billing churn at a higher rate than those who are not.
11. PaymentMethod versus Churn

# Payment method: bank transfer (automatic), credit card (automatic), electronic check, mailed check
methods = ['Bank transfer (automatic)', 'Credit card (automatic)', 'Electronic check', 'Mailed check']
fig, axes = plt.subplots(2, 2, figsize=(10, 8))    # create the figure

for ax, value in zip(axes.ravel(), methods):
    # Churn counts for this group, always ordered No, Yes so the labels match the slices
    counts = data.loc[data['PaymentMethod'] == value, 'Churn'].value_counts().reindex(['No', 'Yes'], fill_value=0)
    ax.pie(counts, labels=['No', 'Yes'], autopct='%1.2f%%', explode=(0, 0.1))
    ax.set_title('Churn of (PaymentMethod = {})'.format(value))

plt.tight_layout(pad=0.5)    # spacing between subplots
plt.show()    # display the pie charts

[Figure: churn pie charts by PaymentMethod]

The charts show that customers who pay by electronic check have the highest churn rate.
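12. MonthlyCharges and TotalCharges versus Churn

# Kernel density estimate of monthly charges
# (kdeplot replaces distplot(hist=False), which is deprecated in recent seaborn releases)
plt.figure(figsize=(10, 5))    # create the figure

negDf = data[data['Churn'] == 'No']
sns.kdeplot(negDf['MonthlyCharges'], label='No')
posDf = data[data['Churn'] == 'Yes']
sns.kdeplot(posDf['MonthlyCharges'], label='Yes')

plt.legend()    # show which curve is which
plt.show()    # display the figure

[Figure: density of MonthlyCharges for churned and retained customers]
The plot shows that retained customers are most concentrated at monthly charges of 0 to 40, while churned customers are most concentrated between 70 and 140.

# Kernel density estimate of total charges
plt.figure(figsize=(10, 5))    # create the figure

negDf = data[data['Churn'] == 'No']
sns.kdeplot(negDf['TotalCharges'], label='No')
posDf = data[data['Churn'] == 'Yes']
sns.kdeplot(posDf['TotalCharges'], label='Yes')

plt.legend()    # show which curve is which
plt.show()    # display the figure

[Figure: density of TotalCharges for churned and retained customers]

The plot shows that retention rises as total charges increase. This matches intuition: the longer a customer has been with the company, the less likely they are to leave.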

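Feature selection

Based on the visual analysis above, drop the uninformative features gender, StreamingTV, StreamingMovies, and PhoneService. customerID is a unique identifier, so drop it as well.

churn_var = data.iloc[:, 2:20].copy()    # SeniorCitizen through TotalCharges; this skips customerID and gender
churn_var.drop("PhoneService", axis=1, inplace=True)
churn_var.drop("StreamingTV", axis=1, inplace=True)
churn_var.drop("StreamingMovies", axis=1, inplace=True)
churn_var.head()

[Figure: output of churn_var.head()]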

Handling features on very different scales

Categorical data

# First merge some feature values
data.loc[data['MultipleLines']=='No phone service', 'MultipleLines'] = 'No'

internetCols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for i in internetCols:
    data.loc[data[i]=='No internet service', i] = 'No'
print(data[data['MultipleLines']=='No phone service'].shape[0])
print(data[data['OnlineSecurity']=='No internet service'].shape[0])

0
0

# Pick the columns whose values are 'Yes' and 'No' and replace them with 1 and 0
encodeCols = list(data.columns[3: 17].drop(['tenure', 'PhoneService', 'InternetService', 'StreamingTV', 'StreamingMovies', 'Contract']))
for i in encodeCols:
    data[i] = data[i].map({'Yes': 1, 'No': 0})    # 1 for 'Yes', 0 for 'No'
# Encode the target variable as well
data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

# One-hot encode the remaining unordered categorical features
onehotCols = ['InternetService', 'Contract', 'PaymentMethod']
churnDf = data['Churn'].to_frame()    # set the target column aside for the later concat
featureDf = data.drop(['Churn'], axis=1)    # all feature columns

for i in onehotCols:
    onehotDf = pd.get_dummies(featureDf[i], prefix=i, dtype=int)    # dtype=int gives 0/1 columns instead of booleans
    featureDf = pd.concat([featureDf, onehotDf], axis=1)    # append the encoded columns to the feature frame

data = pd.concat([featureDf, churnDf], axis=1)    # put the target back so it is the last column
data = data.drop(onehotCols, axis=1)    # drop the original categorical columns
data.head()

[Figure: output of data.head() after encoding]

Continuous data: standardization

from sklearn.preprocessing import StandardScaler    # import the scaler
scaler = StandardScaler()
data[['tenure']] = scaler.fit_transform(data[['tenure']])
data[['MonthlyCharges']] = scaler.fit_transform(data[['MonthlyCharges']])
data[['TotalCharges']] = scaler.fit_transform(data[['TotalCharges']])

data[['tenure', 'MonthlyCharges', 'TotalCharges']].head()    # look at the scaled numeric features

[Figure: first rows of the standardized numeric features]

data[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()

[Figure: summary statistics of the standardized numeric features]
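
What the scaler computes can be checked by hand: each value has the column mean subtracted and is divided by the column's population standard deviation. A small sketch with made-up tenure values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

tenure = np.array([[1.0], [12.0], [24.0], [48.0], [72.0]])  # hypothetical months

scaled = StandardScaler().fit_transform(tenure)

# Same result by hand: subtract the mean, divide by the population std (ddof=0).
manual = (tenure - tenure.mean(axis=0)) / tenure.std(axis=0)
```

After scaling, each column has mean 0 and standard deviation 1, which keeps a large-valued column such as TotalCharges from dominating distance-based or gradient-based models.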

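Handling class imbalance

Oversample the minority class with SMOTE

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# How x and y were built is not shown here; this assumes the features kept in the
# selection step above, so the column count may differ from the shapes reported below
y = data['Churn']
x = data.drop(columns=['customerID', 'gender', 'PhoneService', 'StreamingTV', 'StreamingMovies', 'Churn'])

model_smote = SMOTE()
x, y = model_smote.fit_resample(x, y)    # fit_sample was renamed fit_resample in imbalanced-learn; a DataFrame input keeps its column names
# Split into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

Features after oversampling: (10348, 17)  Training features: (7243, 17)  Test features: (3105, 17)
Labels after oversampling: (10348,)  Training labels: (7243,)  Test labels: (3105,)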
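
Oversampling before the split, as above, lets synthetic rows end up in the test set, which can make the scores look better than they would on real customers. A possible alternative is to split first and oversample only the training part. The sketch below uses plain random oversampling from scikit-learn instead of SMOTE to stay dependency-free, and synthetic data instead of the churn table; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (class 1 is the minority).
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.75],
                           random_state=0)

# 1) Split first, so the test set keeps only real rows and the real class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2) Oversample the minority class in the training part only.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]
extra = resample(minority, replace=True,
                 n_samples=len(majority) - len(minority), random_state=0)

X_train_bal = np.vstack([X_train, extra])
y_train_bal = np.concatenate([y_train,
                              np.ones(len(extra), dtype=y_train.dtype)])
```

With imbalanced-learn the same ordering applies: call `fit_resample` on the training split only and leave the test split untouched.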

Model selection and training

Models used: logistic regression, SVC, random forest, and LightGBM

from sklearn.linear_model import LogisticRegression as LR
from sklearn.svm import SVC as SVC
from sklearn.ensemble import RandomForestClassifier as RF
from lightgbm import LGBMClassifier as LGB

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

def kFold_cv(X, y, classifier, **kwargs):
    """
    :param X: features
    :param y: target variable
    :param classifier: classifier class
    :param **kwargs: parameters passed to the classifier
    :return: out-of-fold predictions
    """
    kf = KFold(n_splits=5, shuffle=True)    # 5-fold cross-validation
    y_pred = np.zeros(len(y))    # initialize the prediction array
    start = time.time()
    for train_index, test_index in kf.split(X):
        X_train = X[train_index]
        X_test = X[test_index]
        y_train = y[train_index]    # split the data for this fold
        clf = classifier(**kwargs)
        clf.fit(X_train, y_train)    # train the model
        y_pred[test_index] = clf.predict(X_test)    # predict on the held-out fold
    print("used time : {}".format(time.time()-start))
    return y_pred

# Get X and Y
# Load the data
data = pd.read_csv("./processed_data/processed_smote.csv")
X = data.iloc[:,:-1]
Y = data.iloc[:,-1]
print(X.shape)
print(Y.shape)
print(data.head())

# Run K-fold cross-validation
lr_pred = kFold_cv(x_train.values,y_train.values,LR,penalty='l2',C=1.0)
svc_pred = kFold_cv(x_train.values,y_train.values,SVC,C=1.0)
rf_pred = kFold_cv(x_train.values,y_train.values, RF,n_estimators=100,max_depth=10)
lgb_pred = kFold_cv(x_train.values,y_train.values,LGB,learning_rate=0.1,n_estimators=500,max_depth=10)

# Collect the scores of each model
names = ['LR', 'SVC', 'RandomForest', 'LGB']
pred = [lr_pred, svc_pred, rf_pred, lgb_pred]
scores = {}
for name, y_hat in zip(names, pred):
    r = recall_score(y_train.values, y_hat)
    p = precision_score(y_train.values, y_hat)
    f1 = f1_score(y_train.values, y_hat)
    scores[name] = [r, p, f1]

# One column per model, one row per metric
scoreDf = pd.DataFrame(scores, index=['Recall', 'Precision', 'F1-score'])
scoreDf

[Figure: recall, precision, and F1-score of the four models]
The table shows that the LGB model performs best.
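
The hand-written K-fold loop above could also be expressed with scikit-learn's built-in helpers. This is a sketch on synthetic data with a random forest as the example model; the stratified folds and fixed seeds are additions for repeatability, not something the original loop does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Stratified folds keep the class ratio in every fold; the seed makes runs repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)

# One out-of-fold prediction per row, like y_pred in the hand-written loop.
oof_pred = cross_val_predict(clf, X, y, cv=cv)

scores = {
    "Recall": recall_score(y, oof_pred),
    "Precision": precision_score(y, oof_pred),
    "F1-score": f1_score(y, oof_pred),
}
```

Fixing `random_state` matters when comparing models: without it each model is scored on a different random partition of the data.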

lgb = LGB(learning_rate=0.1, n_estimators=500, max_depth=10)    # LGB is the LGBMClassifier alias imported above
lgb.fit(x_train, y_train)
y_train_pred = lgb.predict(x_train)
y_test_pred = lgb.predict(x_test)
print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

[Figure: classification reports for the training and test sets]

# Feature importance
import matplotlib.pyplot as plt
import lightgbm 

fig, ax = plt.subplots(figsize=(20,20))
lightgbm.plot_importance(lgb,ax=ax,height=0.5,grid=False)
plt.title("Feature importances")
plt.show()
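
The plot above relies on LightGBM's built-in importance scores. A model-agnostic cross-check is permutation importance, shown here as a sketch with scikit-learn on synthetic data; the random forest and the data are stand-ins, not the model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the two informative features are columns 0 and 1.
X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0, scoring="f1")
ranking = result.importances_mean.argsort()[::-1]  # most important first
```

Because it is measured on held-out data, permutation importance reflects how much the predictions actually depend on a feature, which can differ from how often a tree model splits on it.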