Predicting Customer Churn with Neural Networks and ML Models

This article walks through a project that predicts customer churn using neural networks and machine learning models. Exploratory data analysis shows that factors such as month-to-month contracts and fiber-optic internet service have a significant impact on churn. Random forest, logistic regression, and neural network models were optimized through stratified cross-validation and hyperparameter tuning. The results show that logistic regression performed best on the imbalanced dataset, with total charges and tenure as the most important features. Finally, strategies such as discount offers and contract adjustments are proposed to reduce customer churn.

This story is a walk-through of a notebook I uploaded on Kaggle. Originally it used only machine learning models; since then I have added a couple of basic neural network models. The churn prediction topic has been covered extensively by blogs on Medium and notebooks on Kaggle; however, very few of them use neural networks, and the application of neural networks to structured data is itself seldom covered in the literature. I learned neural networks through the deeplearning.ai specialization on Coursera and the TensorFlow documentation for Keras.

Introduction

Customer attrition, or customer churn, occurs when customers or subscribers stop doing business with a company or service. Churn is a critical metric because retaining existing customers is far more cost-effective than acquiring new ones: it saves sales and marketing costs, and you have already earned the trust and loyalty of existing customers.

There are various ways to calculate this metric: churn rate may represent the total number of customers lost, the percentage of customers lost relative to the company’s total customer count, the value of recurring business lost, or the percentage of recurring value lost. In this dataset, however, churn is recorded as a binary variable for each customer, and calculating a rate is not the goal. The objective here is to identify and quantify the factors that influence churn.

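Although calculating a rate is not the goal, the class balance of the binary target is still worth a quick look, since the dataset is imbalanced. A minimal sketch, assuming the standard Kaggle Telco Customer Churn CSV and its Churn column:

import pandas as pd

# Assumed file name of the Kaggle Telco Customer Churn CSV
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# 'Churn' is a Yes/No flag per customer; its mean gives the overall churn proportion,
# which is not the objective here but is a useful check of class balance
churn_rate = (df["Churn"] == "Yes").mean()
print(f"Overall churn rate: {churn_rate:.1%}")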

This is a fairly easy, beginner-level project with relatively few variables. It is not an ideal application for neural networks, since the number of training examples is comparatively small, but it is a convenient setting for understanding how they work.

Exploratory Data Analysis

The data cleaning steps are mostly skipped here. Missing values were few, confined to the Total Charges column, and the affected rows were dropped. No features were dropped for multi-collinearity, as only a small number of features are present.

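A minimal sketch of that cleaning step, assuming the Kaggle Telco column name TotalCharges (it is read as a string with a few blank entries):

# Coerce TotalCharges to numeric; blank strings become NaN and those few rows are dropped
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)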

[Figure: Familiarizing yourself with the features.]

The first step in data analysis is familiarizing yourself with the data variables, features, and the target. This dataset contains 20 features and one target variable. The customer ID feature is a string identifier and is therefore not useful for prediction.

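A short sketch of this first look, assuming the Kaggle Telco column names (customerID is the identifier dropped here, and cat_feats is reused by the plotting function further below):

# Drop the string identifier and collect the categorical feature names
df = df.drop(columns=["customerID"])
target = "Churn"
cat_feats = [c for c in df.columns if df[c].dtype == "object" and c != target]

# Count the unique values per categorical feature
for col in cat_feats:
    print(f"{col}: {df[col].nunique()} unique values -> {list(df[col].unique())}")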

[Figure: Unique values of categorical features with exactly 3 classes.]

Among the categorical features, some are binary and some have exactly three unique values. On examination, only Contract and Internet service take a different set of unique values; the other three-class features are service add-ons whose classes are Yes, No, and ‘No internet service’. The ‘No internet service’ class could be merged into ‘No’, as is done in some notebooks on Kaggle. However, dummy variables seem the better encoding option, since merging would lose the information that a customer with internet service has chosen not to opt for that particular add-on. If the number of features were larger, label encoding or mapping would be considered instead, as one-hot encoding would then produce a large sparse matrix.

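A minimal sketch of the dummy-variable encoding described above; drop_first=True is my assumption to avoid redundant columns, not necessarily what the notebook uses:

# One-hot encode the categorical features and map the binary target to 0/1
X = pd.get_dummies(df.drop(columns=["Churn"]), drop_first=True)
y = df["Churn"].map({"No": 0, "Yes": 1})
print(X.shape)  # with ~20 raw features the resulting matrix stays small and manageable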

It is important to check the distribution of these features to see whether the feature values are evenly represented or whether the data carries biases. A function such as the one below is used to plot the distributions.

import matplotlib.pyplot as plt

def srt_dist(df=df, cols=cat_feats):
    # One pie chart per categorical feature, showing class shares in percent
    fig, axes = plt.subplots(8, 2, squeeze=True)
    axes = axes.flatten()
    for i, j in zip(cols, axes):
        (df[i].value_counts() * 100.0 / len(df)).plot.pie(
            autopct='%.1f%%', figsize=(10, 37), fontsize=15, ax=j)
        j.yaxis.label.set_size(15)

srt_dist()
[Figure: Distributions of some of the categorical features.]

It is observed that very few customers are senior citizens, only 30% have dependents, and only 10% have no phone service. Thus, correlations drawn from these variables should be treated with some doubt.

[Figure: Tenure by contract term.]

Naturally, month-to-month contract customers have lower tenure than customers on two-year contracts.

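That comparison can be reproduced numerically with a simple group-by (a sketch assuming the Telco column names Contract and tenure):

import matplotlib.pyplot as plt

# Summarize tenure (in months) by contract term
print(df.groupby("Contract")["tenure"].describe())

# Or visualize it as a box plot per contract type
df.boxplot(column="tenure", by="Contract", figsize=(8, 5))
plt.show()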
