Data Analysis 3: A Telecom Customer Churn Prediction Example

错落星辰.

Article link: https://blog.csdn.net/qq_46068895/article/details/107064286

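Summary (auto-generated by CSDN): This post walks through the data preprocessing steps for telecom customer churn prediction: dropping irrelevant features, standardizing numeric values, checking for outliers with box plots, and replacing and encoding field values. It then splits the dataset, trains models with random forest, decision tree, linear support vector machine, K-nearest neighbors, and naive Bayes, evaluates their performance, and concludes that the random forest model performs best.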

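4. Data Preprocessing

4-1 Dropping columns

From the earlier results (see Data Analysis 1 and 2), customerID is just a random string identifying each customer and has no effect on the model, so I drop the customerID column here. gender and PhoneService are only weakly correlated with churn and can be ignored as well.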

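# Data preprocessing
# Keep columns 2..19 by position: this skips the first two columns
# (customerID and gender, assuming the column order used in parts 1 and 2)
# .copy() avoids pandas' SettingWithCopyWarning on the in-place drop below
df1=df.iloc[:,2:20].copy()
df1.drop("PhoneService",axis=1,inplace=True)
# Keep the customer IDs in a separate Series
df_id=df["customerID"]
df1.head()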
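
Positional slicing with iloc depends on the exact column order of the file. As an alternative, here is a minimal sketch that drops the same columns by name. The toy DataFrame and its values are made up for illustration; only the column names come from this post.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names follow the article.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "gender": ["Female", "Male"],
    "PhoneService": ["Yes", "No"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
})

# Keep the IDs aside, then drop the unwanted columns by name.
# drop() returns a new DataFrame, so df itself is left unchanged.
df_id = df["customerID"]
df1 = df.drop(columns=["customerID", "gender", "PhoneService"])

print(list(df1.columns))
```

Dropping by name keeps working if the columns are reordered, at the cost of listing each column explicitly.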

Data preview:
[Image: output of df1.head()]

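Standardize the customer's tenure (time with the company), monthly charges, and total charges by removing the mean and scaling to unit variance. After standardization each feature has mean 0 and variance 1, so the predictions are not dominated by features whose values happen to be large.

4-2 Standardization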

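# Standardize the numeric columns
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# Fit the scaler on the three numeric columns and show the standardized array;
# df1 itself is not modified by this call
scaler.fit_transform(df1[['tenure','MonthlyCharges','TotalCharges']])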

After standardization:
[Image: standardized values of tenure, MonthlyCharges, and TotalCharges]
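
To make the transformation concrete: each value x is replaced by z = (x - mean) / std. Below is a minimal pure-Python sketch of that formula. The tenure values are hypothetical, and the use of the population standard deviation (dividing by n) is my understanding of what StandardScaler does, not something stated in this post.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]

tenure = [1, 34, 2, 45, 8]  # hypothetical tenure values, in months
z = standardize(tenure)

# The standardized values have mean ~0 and standard deviation ~1.
print(round(fmean(z), 6), round(pstdev(z), 6))
```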

4-3 Box plots

Use box plots to check whether the data contains outliers:
Approach 1:

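from matplotlib import pyplot
# Write the standardized values back into df1 (the scaler was fitted above)
df1[['tenure','MonthlyCharges','TotalCharges']]=scaler.transform(df1[['tenure','MonthlyCharges','TotalCharges']])
# One box plot per feature, stacked vertically
df1[['tenure','MonthlyCharges','TotalCharges']].plot(kind='box',subplots=True,layout=(3,1),sharex=False,fontsize=8)
pyplot.show()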

[Image: box plots of tenure, MonthlyCharges, and TotalCharges]
Approach 2:

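# Use box plots to check for outliers
import seaborn as sns
# Apply this transform only if the columns were not already standardized in approach 1
df1[['tenure','MonthlyCharges','TotalCharges']]=scaler.transform(df1[['tenure','MonthlyCharges','TotalCharges']])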
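
Independent of the plotting library, a box plot conventionally marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below applies that rule directly. The helper name iqr_outliers and the sample values are hypothetical, and this is an illustration of the rule, not the code from this post.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

charges = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample
print(iqr_outliers(charges))  # [100]
```

An empty result for a column corresponds to a box plot with no points drawn beyond the whiskers.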
