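In practice, many game companies send gift packs to returning users or call them directly. This post builds a similar churn early-warning model, which helps a company or vendor react before a user leaves. I'm writing down the learning process so it's easy to review and look things up later.
Getting started
First, import the libraries and load the data.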
from __future__ import division
import pandas as pd
import numpy as np
churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()  # all column (feature) names
print("Column names:")
print(col_names)
to_show = col_names[:6] + col_names[-6:]  # example: take the first six and last six columns
print("\nSample data:")
churn_df[to_show].head(6)
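Data preprocessing
The model cannot work with string values such as 'False.' and 'True.', so they need to be converted to integers.
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)  # encode the label: 'True.' -> 1, anything else -> 0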
# We don't need these columns
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)  # drop the unneeded columns
# 'yes'/'no' has to be converted to boolean values
# NumPy converts these from boolean to 1. and 0. later
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'
# Pull out features for future use
features = churn_feat_space.columns
X = churn_feat_space.values.astype(float)
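The features are measured in different units, so standardize them.
# This is important
# Handles features that are on different scales and differ greatly in magnitude
from sklearn.preprocessing import StandardScaler  # standardization
scaler = StandardScaler()  # create the scaler
X = scaler.fit_transform(X)  # fit on X and transform it
# If positive and negative samples are imbalanced, down-sampling or up-sampling can help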
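The down-sampling idea mentioned in the last comment can be sketched roughly as follows. This is a minimal illustration, not part of the original pipeline: `downsample_majority` is a hypothetical helper name, the toy arrays are made up, and it assumes the labels contain only 0 and 1.

```python
import numpy as np

def downsample_majority(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts.

    Hypothetical helper for illustration; assumes y contains only 0 and 1.
    """
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    # Order the two index arrays by size: smaller class first
    minority, majority = sorted([idx_pos, idx_neg], key=len)
    # Sample the larger class down to the size of the smaller one
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.sort(np.concatenate([minority, kept_majority]))
    return X[keep], y[keep]

# Toy data (made up): 8 negatives, 2 positives
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
X_bal, y_bal = downsample_majority(X_demo, y_demo)
print(y_bal.sum(), len(y_bal))
```

If something like this were used together with cross-validation, it would probably make sense to resample only the training folds, so the held-out fold keeps the real class balance.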
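Trying several classifiers
We may want to try several classifiers such as SVM, RF, and KNN, so here is a function that gives them a common interface.
# Cross-validation
from sklearn.model_selection import KFold
# We don't know up front which classifier to use, so this function takes any classifier and cross-validates it
# X: standardized features  y: whether the user churned  clf_class: classifier class  **kwargs: classifier parameters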
def run_cv(X,y,clf_class,**kwargs):
    # Construct a KFold object; n_splits is the number of folds
    kf = KFold(n_splits=5,shuffle=True)
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf.split(X):  # each train/test split
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)  # train on the training folds
        y_pred[test_index] = clf.predict(X_test)  # predict the held-out fold
    return y_pred
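For comparison, scikit-learn's `cross_val_predict` produces the same kind of out-of-fold predictions as the hand-written loop above. The sketch below is only an illustration: it runs on synthetic data from `make_classification` rather than the real churn.csv, and the `_demo` names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the churn features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Each sample is predicted by a model trained on the folds that exclude it
y_pred_demo = cross_val_predict(clf, X_demo, y_demo, cv=kf)
print("accuracy: %.3f" % np.mean(y_pred_demo == y_demo))
```

Passing `method="predict_proba"` to `cross_val_predict` returns class probabilities instead of class labels.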
Next, import the three classifiers and call the function.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
# Accuracy function
def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)
print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X,y,RF)))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))
Support vector machines:
0.925
Random forest:
0.955
K-nearest-neighbors:
0.895
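The accuracy looks quite high, but accuracy is often misleading, especially when positive and negative samples are imbalanced. Recall is a better metric here (see the TP, TN, FP, FN discussion in an earlier post), and better still is an area-under-the-curve measure such as mAP.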
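To see why accuracy can mislead, here is a small sketch on made-up labels (not the churn data). It uses `confusion_matrix`, `recall_score`, and `precision_score` from `sklearn.metrics`; the `_demo`, `_lazy`, and `_model` names are invented for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (made up for illustration): 1 = churned, 0 = stayed
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "model" that predicts "stayed" for everyone
y_pred_lazy = np.zeros_like(y_true_demo)
print("accuracy:", np.mean(y_true_demo == y_pred_lazy))
print("recall:", recall_score(y_true_demo, y_pred_lazy))

# A model that catches one of the two churners and raises one false alarm
y_pred_model = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_model).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("recall = TP / (TP + FN):", recall_score(y_true_demo, y_pred_model))
print("precision = TP / (TP + FP):", precision_score(y_true_demo, y_pred_model))
```

The first model scores 80% accuracy while finding none of the churners, which is exactly the case recall exposes.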
Choosing a threshold
Here is a function that estimates how likely each user is to churn.
# Estimate each user's churn probability
def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Predict probabilities, not classes
        y_prob[test_index] = clf.predict_proba(X_test)  # one column per class
    return y_prob
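A real decision process sets a threshold: if a user's predicted churn probability exceeds it, the company takes action for that user. To choose the threshold, list each distinct predicted probability on the held-out data together with the actual churn rate among the users who received it.
e.g. in this example 0.7 is a reasonable choice.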
import warnings
warnings.filterwarnings('ignore')
# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
# print(pred_prob[0])
pred_churn = pred_prob[:,1]
is_churn = y == 1
# Number of times a predicted probability is assigned to an observation
counts = pd.Series(pred_churn).value_counts()
# print(counts)
# Calculate true probabilities
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)
# pandas-fu
counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
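Once a threshold is picked, applying it could look roughly like the sketch below. The probabilities and labels are made up for illustration, and using `>=` rather than `>` at the boundary is an assumption, since the text does not say which one is intended.

```python
import numpy as np

# Made-up out-of-fold churn probabilities and true labels, for illustration only
pred_churn_demo = np.array([0.1, 0.9, 0.7, 0.3, 0.8, 0.2, 0.7, 0.0, 1.0, 0.6])
is_churn_demo = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1], dtype=bool)

threshold = 0.7  # the value suggested in the text

# Flag every user whose predicted churn probability reaches the threshold
flagged = pred_churn_demo >= threshold

n_flagged = int(flagged.sum())
# Share of flagged users who really churned (precision at this threshold)
hit_rate = is_churn_demo[flagged].mean()
# Share of all churners that the flag catches (recall at this threshold)
coverage = flagged[is_churn_demo].mean()
print(n_flagged, hit_rate, coverage)
```

Lowering the threshold would typically flag more users and catch more churners, at the cost of contacting more users who were going to stay anyway.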
Errors encountered along the way
DataFrame object has no attribute 'as_matrix'
Cause: newer versions of pandas removed as_matrix.
Fix: change df.as_matrix() to df.values.
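A quick way to check the replacement is sketched below, on a made-up two-column frame (`df_demo` is not from the post). `to_numpy()` is another option offered by newer pandas.

```python
import pandas as pd

# Tiny made-up frame, for illustration only
df_demo = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

arr_values = df_demo.values      # attribute, works in old and new pandas
arr_numpy = df_demo.to_numpy()   # method, available since pandas 0.24
print(arr_numpy.dtype, arr_numpy.shape)
```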