Telecom Customer Churn Prediction: Will Your Customers Churn?


小步积


Category: Data Analysis. Tags: customer churn prediction, logistic regression, visualization

Article link: https://blog.csdn.net/lvhuike/article/details/106861549


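Purpose of this post

With the rapid development of communication technology and the sharp growth in the number of subscribers, the telecom market is approaching saturation and competition among operators keeps intensifying, so operators pay more and more attention to customer churn. Using the data customers generate to predict who is likely to churn, and then working to retain those customers, helps an operator hold on to market share and profit. Churn prediction is therefore an important problem for the telecom industry.
This post analyzes telecom customer churn from two angles: the association between individual features and churn, and a logistic regression model. It addresses two questions: what kind of customer tends to churn, and will a given customer churn. The answers are, respectively, a profile of churn-prone customers and a churn prediction model.

Keywords: customer churn prediction, logistic regression, machine learning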

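1. Loading the data

Data sources:
1. Kaggle
2. Baidu Netdisk link: https://pan.baidu.com/s/1APmQrOz2mTCislWqUiFSdA
Extraction code: 6ice

Development environment:
Language: Python. Tool: Jupyter Notebook. Main libraries: pandas, numpy, sklearn.

First load the file, then check whether any feature has missing values and whether each feature's data type matches expectations, in preparation for preprocessing. After that, group the features by type so that the association between each group and churn can be studied later.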

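# Imports used throughout this post
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_ = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df = df_.copy()  # work on a copy so the raw data stays untouched

# Inspect the column data types
df.info()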

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

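The output shows 7043 rows and 21 columns, and no column has missing values. However, TotalCharges should be a numeric variable rather than object, so it has to be converted to a numeric type during preprocessing.
Most models cannot handle string data directly and only accept numeric input. Looking at the distinct values of each discrete variable helps us decide whether to encode it as an ordinal variable or as a nominal variable.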

# List the distinct values of each discrete variable
# (the ID column and the continuous columns are excluded)
del_col = ["customerID", "tenure", "MonthlyCharges", "TotalCharges"]
col_dict = {col: df[col].unique().tolist() for col in df.columns if col not in del_col}
col_dict

{'gender': ['Female', 'Male'],
 'SeniorCitizen': [0, 1],
 'Partner': ['Yes', 'No'],
 'Dependents': ['No', 'Yes'],
 'PhoneService': ['No', 'Yes'],
 'MultipleLines': ['No phone service', 'No', 'Yes'],
 'InternetService': ['DSL', 'Fiber optic', 'No'],
 'OnlineSecurity': ['No', 'Yes', 'No internet service'],
 'OnlineBackup': ['Yes', 'No', 'No internet service'],
 'DeviceProtection': ['No', 'Yes', 'No internet service'],
 'TechSupport': ['No', 'Yes', 'No internet service'],
 'StreamingTV': ['No', 'Yes', 'No internet service'],
 'StreamingMovies': ['No', 'Yes', 'No internet service'],
 'Contract': ['Month-to-month', 'One year', 'Two year'],
 'PaperlessBilling': ['Yes', 'No'],
 'PaymentMethod': ['Electronic check',
  'Mailed check',
  'Bank transfer (automatic)',
  'Credit card (automatic)'],
 'Churn': ['No', 'Yes']}

# Descriptive statistics of the numeric variables
df.describe()

	SeniorCitizen	tenure	MonthlyCharges
count	7043.000000	7043.000000	7043.000000
mean	0.162147	32.371149	64.761692
std	0.368612	24.559481	30.090047
min	0.000000	0.000000	18.250000
25%	0.000000	9.000000	35.500000
50%	0.000000	29.000000	70.350000
75%	0.000000	55.000000	89.850000
max	1.000000	72.000000	118.750000

2. Data preprocessing

This section walks through the routine preprocessing steps.

2.1 Removing duplicates

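print("Shape before deduplication: {0}".format(df.shape))
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
print("Shape after deduplication: {0}".format(df.shape))

Shape before deduplication: (7043, 21)
Shape after deduplication: (7043, 21)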

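The shape is unchanged, so the data contains no duplicate rows. If duplicates do exist they must be removed, and the index should be reset afterwards.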
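As a minimal sketch of the "remove, then reset the index" advice above, the snippet below uses a small made-up frame (the values are illustrative and not taken from the dataset). It also shows why the result has to be assigned: `drop_duplicates` leaves the original frame untouched.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative data only)
toy = pd.DataFrame({"customerID": ["a", "b", "b", "c"], "tenure": [1, 5, 5, 9]})

# drop_duplicates returns a new frame; reset_index(drop=True) renumbers rows 0..n-1
deduped = toy.drop_duplicates().reset_index(drop=True)

print(toy.shape, deduped.shape)
print(deduped.index.tolist())
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (0, 1, 3 here), which can be confusing in later positional operations.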

2.2 Missing values

df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

According to isnull(), no feature contains missing values.
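One caveat: `isnull()` only flags NaN/None, so a cell holding an empty or whitespace-only string still counts as present. The sketch below, on a made-up two-column frame, counts such cells separately. The frame and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy frame: the middle TotalCharges value is a single space, not NaN
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "No", "Yes"],
})

# Standard missing-value check: reports zero for every column
nan_counts = toy.isnull().sum()

# Count cells that are empty or whitespace-only after casting to str
blank_counts = toy.apply(lambda s: s.astype(str).str.strip().eq("").sum())

print(nan_counts.to_dict())
print(blank_counts.to_dict())
```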

2.3 Converting TotalCharges to numeric

df['TotalCharges'] = pd.to_numeric(df["TotalCharges"])

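This raises an error: ValueError: Unable to parse string " " at position 488.
The message shows that the column contains blank values. Let's see how many there are: if there are many, fill them with the mean or another value; if there are only a few, drop them.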

# Number of blank values
(df['TotalCharges']==" ").sum()

11

# Share of blank values among all rows
(df['TotalCharges']==" ").sum()/df.shape[0]

0.001561834445548772

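These two results show that TotalCharges contains 11 blank values (" "), about 0.16% of all rows. That share is very small, so we simply delete these 11 rows; remember to refresh the index after deleting.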
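An alternative way to locate unparseable entries, sketched here on a made-up Series rather than the real column, is `pd.to_numeric(..., errors="coerce")`, which turns anything it cannot parse into NaN instead of raising. This is not the approach the post uses below, only a hedged illustration of the same cleanup.

```python
import pandas as pd

# Toy column with two blank entries (illustrative values only)
raw = pd.Series(["29.85", " ", "1889.5", " "], name="TotalCharges")

# errors="coerce" converts unparseable strings to NaN instead of raising ValueError
numeric = pd.to_numeric(raw, errors="coerce")

n_bad = int(numeric.isna().sum())   # how many entries failed to parse
share_bad = n_bad / len(raw)        # their share of all rows

# Drop the failed rows and renumber the index
cleaned = numeric.dropna().reset_index(drop=True)

print(n_bad, share_bad)
print(cleaned.tolist())
```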

# Drop the rows whose TotalCharges is blank
df.drop(df[df["TotalCharges"]==" "].index, axis=0, inplace=True)
# Reset the index; it is best to do this after deleting rows
df.index = range(df.shape[0])
df['TotalCharges'] = pd.to_numeric(df["TotalCharges"])

2.4 Outliers

Combine quantiles with visualization to check whether the continuous variables contain outliers.

# Quantiles at 0%, 20%, 40%, 60%, 80% and 100%
range_ = list(np.linspace(0,1,6))
df.describe(percentiles=range_)

	SeniorCitizen	tenure	MonthlyCharges	TotalCharges
count	7032.000000	7032.000000	7032.000000	7032.000000
mean	0.162400	32.421786	64.798208	2283.300441
std	0.368844	24.545260	30.085974	2266.771362
min	0.000000	1.000000	18.250000	18.800000
0%	0.000000	1.000000	18.250000	18.800000
20%	0.000000	6.000000	25.050000	267.070000
40%	0.000000	20.000000	58.920000	944.170000
50%	0.000000	29.000000	70.350000	1397.475000
60%	0.000000	40.000000	79.150000	2048.950000
80%	0.000000	60.800000	94.300000	4475.410000
100%	1.000000	72.000000	118.750000	8684.800000
max	1.000000	72.000000	118.750000	8684.800000

df.plot(kind='scatter', x='tenure', y='MonthlyCharges')

[Figure: scatter plot of tenure vs. MonthlyCharges]

df.plot(kind='scatter', x='tenure', y='TotalCharges')

[Figure: scatter plot of tenure vs. TotalCharges]
The quantiles and the plots show no obvious outliers in the continuous variables.
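The visual check can be complemented with a numeric rule. The sketch below applies the common 1.5 × IQR rule, which is a swapped-in technique and not something the post itself uses. The helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative values: five plausible monthly charges plus one extreme point
toy = pd.Series([18.25, 35.5, 70.35, 89.85, 118.75, 900.0])
mask = iqr_outlier_mask(toy)

print(toy[mask].tolist())
```

On the real data, one would call the helper on each of tenure, MonthlyCharges and TotalCharges and inspect how many rows are flagged.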

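2.5 Feature scaling: standardization

Tree-based models reach good accuracy without scaling, but because we will also build other models that do require scaled inputs, we scale the data here. The continuous variables, "MonthlyCharges" and "TotalCharges" in particular, range from tens to thousands. Scaling keeps features with very large ranges from dominating the model and can improve the accuracy of some models.
There are two common kinds of scaling, normalization (min-max) and standardization; this post uses standardization, the more common of the two. Normalization is very sensitive to outliers, and for algorithms such as PCA, clustering, logistic regression, support vector machines, and neural networks, standardization is usually the better choice.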
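To make the two kinds of scaling concrete, here is a minimal NumPy sketch on made-up values. Standardization computes (x - mean) / std using the population standard deviation, which is also what `StandardScaler` uses; min-max normalization computes (x - min) / (max - min). The helper names are hypothetical.

```python
import numpy as np

def standardize(a: np.ndarray) -> np.ndarray:
    # Zero mean, unit variance (population std, ddof=0)
    return (a - a.mean()) / a.std()

def min_max(a: np.ndarray) -> np.ndarray:
    # Rescale to the [0, 1] interval
    return (a - a.min()) / (a.max() - a.min())

x = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
z = standardize(x)
m = min_max(x)

# Add one extreme value and look at where the ordinary points end up
x_out = np.append(x, 10000.0)
m_out = min_max(x_out)

print(z.round(3), m)
print(m_out[:5].round(4))
```

With the extreme value present, min-max squeezes the five ordinary points into a sliver near 0, which illustrates the sensitivity to outliers mentioned above.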

# Scale the continuous variables only; the discrete variables do not need it
scaler_ = ["tenure", "MonthlyCharges", "TotalCharges"]
df_nor = df.copy()
df_nor[scaler_] = StandardScaler().fit_transform(df_nor[scaler_])

df_nor.head()

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	-1.280248	No	No phone service	DSL	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	-1.161694	-0.994194	No
1	5575-GNVDE	Male	No	No	0.064303	Yes	No	DSL	Yes	...	Yes	No	No	No	One year	No	Mailed check	-0.260878	-0.173740	No
2	3668-QPYBK	Male	No	No	-1.239504	Yes	No	DSL	Yes	...	No	No	No	No	Month-to-month	Yes	Mailed check	-0.363923	-0.959649	Yes
3	7795-CFOCW	Male	No	No	0.512486	No	No phone service	DSL	Yes	...	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	-0.747850	-0.195248	No
4	9237-HQITU	Female	No	No	-1.239504	Yes	No	Fiber optic	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	0.196178	-0.940457	Yes

5 rows × 21 columns

2.6 Encoding / dummy variables

None of the discrete variables in this dataset is ordinal, so we convert them to dummy variables with one-hot encoding.
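As a small illustration of what one-hot encoding does, the sketch below runs `pd.get_dummies` on a made-up frame with one categorical and one numeric column: each category becomes its own indicator column, and the numeric column passes through unchanged.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 30, 24],
})

# String columns are expanded into one indicator column per category
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each row has exactly one active indicator among the `Contract_*` columns, which is why the column count grows with the number of categories.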

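# Convert the categorical variables to numeric ones with one-hot encoding
df_oh0 = df_nor.iloc[:, 1:]  # drop the customerID column
print("Shape before one-hot encoding: {0}".format(df_oh0.shape))
df_oh1 = pd.get_dummies(df_oh0)
print("Shape after one-hot encoding: {0}".format(df_oh1.shape))

Shape before one-hot encoding: (7032, 20)
Shape after one-hot encoding: (7032, 47)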

# Churn_Yes and Churn_No carry the same information, so drop Churn_No
range_ = df_oh1.columns.tolist()
range_.remove("Churn_No")
df_oh = df_oh1.loc[:, range_]
df_oh.head()

	SeniorCitizen	tenure	MonthlyCharges	TotalCharges	gender_Female
