Quick Start: Simple Customer Churn Modeling and Analysis

playwrighter

Article link: https://blog.csdn.net/qq_15378385/article/details/112107176

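1. Preparing the customer churn dataset
The customer churn dataset is a data file that records a telecom company's current and former customers. It has 1 output variable and 20 input variables. The output variable indicates whether the customer has churned. It is conceptually Boolean (True/False), although the CSV stores it as the strings 'True.' and 'False.'. The input variables describe each customer's phone plan and calling behavior: state (the US state, such as KS or OH), account length, area code, phone number, whether the customer has an international plan, whether the customer has a voice mail plan, number of voice mail messages, day minutes, day calls, day charge, evening minutes, evening calls, evening charge, night minutes, night calls, night charge, international minutes, international calls, international charge, and number of customer service calls.
Dataset URL: https://raw.githubusercontent.com/EricChiang/churn/master/data/churn.csv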

# Import the packages used in this article
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# statsmodels provides classes and functions for estimating many different statistical models
import statsmodels.api as sm
import statsmodels.formula.api as smf

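The code below reads the CSV with pandas and normalizes the column headings (lowercase, spaces replaced with underscores, apostrophes and question marks removed). It then creates a new column, churn01, and fills it with 1 or 0 based on the values in the churn column, using numpy's where function. Every value in churn is either 'True.' or 'False.' (strings with a trailing period), so churn01 is 1 where churn is 'True.' and 0 where churn is 'False.'.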

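# Read the dataset with pandas
churn = pd.read_csv('churn.csv', sep=',', header=0)
# Normalize the headings: spaces -> underscores, drop apostrophes and '?', lowercase
churn.columns = [heading.lower() for heading in
                 churn.columns.str.replace(' ', '_').str.replace("'", "").str.strip('?')]
# The churn column holds the strings 'True.' / 'False.', so compare against 'True.'
churn['churn01'] = np.where(churn['churn'] == 'True.', 1., 0.)
churn.head(16)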

	state	account_length	area_code	phone	intl_plan	vmail_plan	vmail_message	day_mins	day_calls	day_charge	...	eve_charge	night_mins	night_calls	night_charge	intl_mins	intl_calls	intl_charge	custserv_calls	churn	churn01
0	KS	128	415	382-4657	no	yes	25	265.1	110	45.07	...	16.78	244.7	91	11.01	10.0	3	2.70	1	False.	0.0
1	OH	107	415	371-7191	no	yes	26	161.6	123	27.47	...	16.62	254.4	103	11.45	13.7	3	3.70	1	False.	0.0
2	NJ	137	415	358-1921	no	no	0	243.4	114	41.38	...	10.30	162.6	104	7.32	12.2	5	3.29	0	False.	0.0
3	OH	84	408	375-9999	yes	no	0	299.4	71	50.90	...	5.26	196.9	89	8.86	6.6	7	1.78	2	False.	0.0
4	OK	75	415	330-6626	yes	no	0	166.7	113	28.34	...	12.61	186.9	121	8.41	10.1	3	2.73	3	False.	0.0
5	AL	118	510	391-8027	yes	no	0	223.4	98	37.98	...	18.75	203.9	118	9.18	6.3	6	1.70	0	False.	0.0
6	MA	121	510	355-9993	no	yes	24	218.2	88	37.09	...	29.62	212.6	118	9.57	7.5	7	2.03	3	False.	0.0
7	MO	147	415	329-9001	yes	no	0	157.0	79	26.69	...	8.76	211.8	96	9.53	7.1	6	1.92	0	False.	0.0
8	LA	117	408	335-4719	no	no	0	184.5	97	31.37	...	29.89	215.8	90	9.71	8.7	4	2.35	1	False.	0.0
9	WV	141	415	330-8173	yes	yes	37	258.6	84	43.96	...	18.87	326.4	97	14.69	11.2	5	3.02	0	False.	0.0
10	IN	65	415	329-6603	no	no	0	129.1	137	21.95	...	19.42	208.8	111	9.40	12.7	6	3.43	4	True.	1.0
11	RI	74	415	344-9403	no	no	0	187.7	127	31.91	...	13.89	196.0	94	8.82	9.1	5	2.46	0	False.	0.0
12	IA	168	408	363-1107	no	no	0	128.8	96	21.90	...	8.92	141.1	128	6.35	11.2	2	3.02	1	False.	0.0
13	MT	95	510	394-8006	no	no	0	156.6	88	26.62	...	21.05	192.3	115	8.65	12.3	5	3.32	3	False.	0.0
14	IA	62	415	366-9238	no	no	0	120.7	70	20.52	...	26.11	203.0	99	9.14	13.1	6	3.54	4	False.	0.0
15	NY	161	415	351-7269	no	no	0	332.9	67	56.59	...	27.01	160.6	128	7.23	5.4	9	1.46	4	True.	1.0

16 rows × 22 columns
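The heading cleanup and the np.where mapping can be checked without downloading the file. The following is a minimal sketch on a hypothetical three-row frame. The raw headings ("Account Length", "Int'l Plan", "Churn?") are assumptions inferred from the cleanup chain, not copied from the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the raw headings are assumed, not read from churn.csv.
raw = pd.DataFrame({
    "Account Length": [128, 65, 161],
    "Int'l Plan": ["no", "no", "no"],
    "Churn?": ["False.", "True.", "True."],
})

# Same cleanup chain as in the article: underscores, no apostrophes, no '?', lowercase.
raw.columns = [heading.lower() for heading in
               raw.columns.str.replace(" ", "_").str.replace("'", "").str.strip("?")]

# Map the string 'True.' (with the trailing period) to 1.0 and everything else to 0.0.
raw["churn01"] = np.where(raw["churn"] == "True.", 1.0, 0.0)

print(list(raw.columns))
print(raw["churn01"].tolist())
```

Note that np.where sends every value other than 'True.' to 0, so a misspelled or missing label would silently count as "not churned".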

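2. Building a simple model with logistic regression
In this dataset the dependent variable is binary: it indicates whether the customer has churned. The predicted values therefore have to lie between 0 and 1, and logistic regression satisfies that requirement. Logistic regression measures the relationship between the independent variables and a binary dependent variable by estimating probabilities with the logistic function, which is the inverse of the logit function. The logistic function maps any continuous value to a value between 0 and 1. This is required because the predicted values represent probabilities, and probabilities must lie between 0 and 1.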
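To make the relationship between the two functions concrete, here is a small sketch in plain numpy. The function names logit and logistic are chosen for illustration and are not part of statsmodels.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def logistic(x):
    """Inverse of logit: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = logistic(scores)

print(probs)         # every value lies strictly between 0 and 1
print(logit(probs))  # recovers the original scores up to floating-point error
```

A score of 0 corresponds to a probability of 0.5. Large negative scores approach 0 and large positive scores approach 1.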

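The analysis uses a new variable, total_charges, which is the sum of the day, evening, night, and international charges. One use of it is a summary of the customer service calls: split the data into 5 groups by equal-width binning on total_charges, then compute 5 statistics for each group (count, minimum, mean, maximum, and standard deviation). The model-fitting code in this section does not perform that summary. Its first statement creates total_charges. The rest selects account_length, custserv_calls, and total_charges as the independent variables, adds a constant (intercept) column, fits a logistic regression with churn01 as the dependent variable, and prints the model summary, the coefficients, and their standard errors.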
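The binned summary described above might look roughly like the sketch below. It runs on synthetic random data so that it works without churn.csv. The frame name demo and its values are made up for illustration, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the churn data (random values, NOT the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "total_charges": rng.uniform(20.0, 100.0, size=200),
    "custserv_calls": rng.integers(0, 10, size=200),
})

# Equal-width binning: an integer argument makes pd.cut split the
# observed range of total_charges into 5 intervals of equal width.
bins = pd.cut(demo["total_charges"], 5)

# Five statistics of custserv_calls within each bin.
summary = demo.groupby(bins, observed=False)["custserv_calls"].agg(
    ["count", "min", "mean", "max", "std"])
print(summary)
```

With the real data, demo would be replaced by the churn DataFrame once total_charges has been created.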

# Total charges = day + evening + night + international charges
churn['total_charges'] = churn['day_charge'] + churn['eve_charge'] + \
                         churn['night_charge'] + churn['intl_charge']
dependent_variable = churn['churn01']
independent_variables = churn[['account_length', 'custserv_calls', 'total_charges']]
# Add an intercept column named 'const' as the first column
independent_variables_with_constant = sm.add_constant(independent_variables, prepend=True)
# Fit the logistic regression by maximum likelihood
logit_model = sm.Logit(dependent_variable, independent_variables_with_constant).fit()
print(logit_model.summary2())
# print("\nQuantities you can extract from the result:\n%s" % dir(logit_model))
print("\nCoefficients:\n%s" % logit_model.params)
print("\nCoefficient Std Errors:\n%s" % logit_model.bse)

Optimization terminated successfully.
         Current function value: 0.363480
         Iterations 7
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.122     
Dependent Variable: churn01          AIC:              2430.9594 
Date:               2020-04-07 16:12 BIC:              2455.4060 
No. Observations:   3333             Log-Likelihood:   -1211.5   
Df Model:           3                LL-Null:          -1379.1   
Df Residuals:       3329             LLR p-value:      2.2343e-72
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     7.0000                                       
-----------------------------------------------------------------
                  Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
-----------------------------------------------------------------
const            -7.2205   0.3944 -18.3093 0.0000 -7.9935 -6.4476
account_length    0.0012   0.0013   0.9274 0.3537 -0.0014  0.0038
custserv_calls    0.4443   0.0366  12.1290 0.0000  0.3725  0.5161
total_charges     0.0729   0.0054  13.4479 0.0000  0.0623  0.0835
=================================================================


Coefficients:
const            -7.220520
account_length    0.001222
custserv_calls    0.444323
total_charges     0.072914
dtype: float64

Coefficient Std Errors:
const             0.394363
account_length    0.001317
custserv_calls    0.036633
total_charges     0.005422
dtype: float64
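Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiation. The sketch below copies the coefficients from the printed output into a Series named params (a name chosen here) and converts them to odds ratios. It is an interpretation aid, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed model output.
params = pd.Series({
    "const": -7.220520,
    "account_length": 0.001222,
    "custserv_calls": 0.444323,
    "total_charges": 0.072914,
})

# exp(coefficient) is the multiplicative change in the odds of churn for a
# one-unit increase in that variable, with the other variables held fixed.
odds_ratios = np.exp(params.drop("const"))
print(odds_ratios.round(3))
```

Under this model, each additional customer service call multiplies the odds of churn by roughly 1.56. The account_length coefficient has a p-value of 0.3537 in the summary table, so its odds ratio is not distinguishable from 1 at conventional significance levels.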

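3. Predicting with the fitted model
Finally, use the fitted model to make predictions. The first 16 rows of the churn dataset stand in for new observations here, so the variable y_predicted holds 16 predicted values, each one a predicted probability of churn. To make the output easier to read, the predicted values are rounded to two decimal places.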

# Use the first 16 rows of the dataset as stand-in "new" observations
new_observations = churn.loc[churn.index.isin(range(16)), independent_variables.columns]
new_observations_with_constant = sm.add_constant(new_observations, prepend=True)
# predict() returns the predicted probability of churn for each row
y_predicted = logit_model.predict(new_observations_with_constant)
y_predicted_rounded = [round(score, 2) for score in y_predicted]
print(y_predicted_rounded)

[0.25, 0.09, 0.08, 0.2, 0.12, 0.1, 0.49, 0.03, 0.22, 0.24, 0.2, 0.05, 0.03, 0.19, 0.26, 0.81]
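As a sanity check, a prediction can be reproduced by hand from the printed coefficients. The sketch below does this for row 0 of the table shown earlier (account_length 128, custserv_calls 1, and the four charge columns). The variable names are chosen here for illustration.

```python
import math

# Coefficients copied from the printed model output.
const, b_account, b_calls, b_charges = -7.220520, 0.001222, 0.444323, 0.072914

# Values for row 0, read from the head(16) table.
account_length = 128
custserv_calls = 1
total_charges = 45.07 + 16.78 + 11.01 + 2.70  # day + eve + night + intl charges

# Linear score on the log-odds scale, then the logistic function.
linear_score = (const + b_account * account_length
                + b_calls * custserv_calls + b_charges * total_charges)
probability = 1.0 / (1.0 + math.exp(-linear_score))
print(round(probability, 2))
```

The result should agree with the first entry of the printed list, up to the rounding of the coefficients.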
