A Simple User Churn Early-Warning Walkthrough

上课不要摸鱼江

Category: Machine Learning. Tags: python, machine learning

Article link: https://blog.csdn.net/qq_43653405/article/details/107875825

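In practice, many game companies send gift packs to returning users or phone them. Here I build a similar user churn early-warning model, which helps a company or vendor react in time. These are notes on my learning process, kept for later review and lookup.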

Getting started

First, import the libraries and load the data.

from __future__ import division
import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()  # all feature (column) names

print("Column names:")
print(col_names)

to_show = col_names[:6] + col_names[-6:]  # example: take the first six and the last six features

print("\nSample data:")
churn_df[to_show].head(6)

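Data preprocessing

Values such as False and True in the data cannot be fed to a model directly, so they need to be converted to integers.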

churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)  # replace 'True.' with 1 and everything else ('False.') with 0

# We don't need these columns
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)  # drop the columns we don't need

# 'yes'/'no' has to be converted to boolean values
# NumPy converts these from boolean to 1. and 0. later
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

# Pull out features for future use
features = churn_feat_space.columns

X = churn_feat_space.values.astype(float)  # np.float was removed from recent NumPy; the builtin float is equivalent

The features are measured in different units, so standardize them.

# This is important
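# Handles features measured on different scales with very different value ranges
from sklearn.preprocessing import StandardScaler  # standardization
scaler = StandardScaler()  # create the scaler
X = scaler.fit_transform(X)  # fit on X and transform it
# If positive and negative samples are imbalanced, undersampling or oversampling can be used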
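
To make the standardization step concrete, here is a minimal sketch of what `StandardScaler` computes per column: subtract the mean, then divide by the standard deviation. The small matrix `X_demo` and its values are made up for illustration and are not taken from `churn.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with made-up values: column 0 could be minutes, column 1 a call count
X_demo = np.array([[265.1, 1.0],
                   [161.6, 1.0],
                   [243.4, 0.0],
                   [299.4, 2.0]])

scaled = StandardScaler().fit_transform(X_demo)

# The same result by hand: per-column mean and population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))                 # both computations agree
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates distance-based models such as KNN or SVM just because of its unit.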

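Trying several classifiers

We may want to try several classifiers such as SVM, RF and KNN, so here is a function that gives them a common interface.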

# Cross-validation
from sklearn.model_selection import KFold

# We don't know up front which classifier to use, so this helper accepts any classifier and cross-validates it
# X: standardized data  y: whether the user churned  clf_class: classifier class  **kwargs: classifier parameters
def run_cv(X,y,clf_class,**kwargs):
    # Construct a KFold object (the current API takes n_splits, not the sample count)
    kf = KFold(n_splits=5,shuffle=True)
    y_pred = y.copy()

    # Iterate through folds
    for train_index, test_index in kf.split(X):  # each train/test split
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)  # train on the training folds
        y_pred[test_index] = clf.predict(X_test)  # predict the held-out fold
    return y_pred
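
For comparison, scikit-learn ships a helper, `cross_val_predict`, that collects out-of-fold predictions in much the same way as the hand-written loop. The sketch below is not the approach used in this post; it runs on synthetic data (`X_demo`, `y_demo` are made up) so it works without `churn.csv`.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn data (assumption: 200 rows, 4 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Every sample is predicted exactly once, by a model that never saw it in training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred_cv = cross_val_predict(KNeighborsClassifier(), X_demo, y_demo, cv=cv)

print(y_pred_cv.shape)
print(np.mean(y_pred_cv == y_demo))  # out-of-fold accuracy
```

Writing the loop by hand, as above, keeps the mechanics visible; the helper is mainly a shortcut once the idea is clear.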

Then import the three classifiers and call the function.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN

# Accuracy function
def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X,y,RF)))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))

Support vector machines:
0.925
Random forest:
0.955
K-nearest-neighbors:
0.895

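The accuracy looks quite high, but accuracy is often misleading, especially when most users do not churn. Recall is a better metric here (it is built from the TP, TN, FP, FN counts covered in an earlier post), and better still is an area-based metric such as mAP, obtained by integrating under the curve.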
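
A small sketch of why accuracy and recall can disagree, using scikit-learn's `confusion_matrix` and `recall_score`. The labels below are made up (10 users, 2 of whom churn) and are not results from the churn data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = churned, 0 = stayed
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

accuracy_demo = (tp + tn) / (tp + tn + fp + fn)
recall_demo = tp / (tp + fn)  # share of actual churners that were caught

print(accuracy_demo)  # 0.9
print(recall_demo)    # 0.5
print(recall_score(y_true_demo, y_hat_demo))  # 0.5, same as the manual value
```

The model is right 90% of the time yet misses half of the churners, which is exactly the group the business cares about.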

Choosing a threshold

Here is a function that estimates how likely each user is to churn.

# Estimate how likely each user is to churn
def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)  # current API: n_splits replaces n_folds
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf.split(X):  # a KFold object is no longer iterable; call split()
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Predict probabilities, not classes
        y_prob[test_index] = clf.predict_proba(X_test)  # use predict_proba directly
    return y_prob

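A real decision process sets a threshold: if a user's predicted churn probability exceeds it, the company takes action for that user. The threshold is chosen by listing every predicted probability that occurs on the held-out data together with the observed churn rate among the users assigned that probability.
For example, 0.7 is a reasonable choice in this case.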
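
Once a threshold is picked, applying it is a single comparison. The sketch below uses made-up probabilities (`pred_churn_demo`); whether a user sitting exactly on the threshold is included (`>=` versus `>`) is my assumption here and is a business choice rather than something the data dictates.

```python
import numpy as np

# Made-up churn probabilities for six users (multiples of 0.1, as with 10 trees)
pred_churn_demo = np.array([0.0, 0.2, 0.7, 0.9, 0.5, 0.8])
threshold = 0.7  # the value suggested in the text

# Boolean mask of users to contact; ">=" includes users exactly at the threshold
flagged = pred_churn_demo >= threshold
flagged_users = np.flatnonzero(flagged)  # row indices of the flagged users

print(flagged_users)  # [2 3 5]
```

A lower threshold reaches more real churners at the cost of contacting more users who would have stayed anyway.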

import warnings
warnings.filterwarnings('ignore')

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
# print(pred_prob[0])
pred_churn = pred_prob[:,1]
is_churn = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.Series(pred_churn).value_counts()  # top-level pd.value_counts is deprecated in recent pandas
# print(counts)

# calculate true probabilities
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)  # convert once, after the loop has filled the dict

# pandas-fu
counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
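
The same kind of table (predicted probability, how often it occurs, observed churn rate) can also be built with a `groupby`. This is only an alternative sketch on made-up data (`demo` is hypothetical), not the code that produced the numbers in this post.

```python
import pandas as pd

# Made-up predictions and outcomes for eight users
demo = pd.DataFrame({
    "pred_prob": [0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8],
    "is_churn":  [0,   0,   0,   1,   1,   1,   1,   0],
})

# For each predicted probability: how many users got it, and what share actually churned
table = (demo.groupby("pred_prob")["is_churn"]
             .agg(count="count", true_prob="mean")
             .reset_index())

print(table)
```

In this toy data, users scored 0.1 churned 25% of the time and users scored 0.8 churned 75% of the time; when `true_prob` tracks `pred_prob` like this, the predicted probabilities are usable for picking a threshold.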

Errors encountered along the way

'DataFrame' object has no attribute 'as_matrix'
Cause: newer pandas versions removed as_matrix.
Fix: change df.as_matrix() to df.values.
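
A quick sketch of the replacement on a made-up frame (`df_demo` is hypothetical). Besides `.values`, pandas also provides `DataFrame.to_numpy()`, which returns the same array here.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})  # made-up frame

arr_values = df_demo.values      # the fix used in this post
arr_numpy = df_demo.to_numpy()   # equivalent method-style accessor

print(np.array_equal(arr_values, arr_numpy))
print(arr_numpy.dtype, arr_numpy.shape)
```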
