Sparkify Project Write-up

Latest recommended article published on 2022-08-07 11:09:11

novelan

Tags: machine learning, data analysis, big data, deep learning, data mining

Article link: https://blog.csdn.net/novelan/article/details/108325284

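The Sparkify Project

This is the capstone project of the Udacity Nanodegree. It runs in a Jupyter notebook under Anaconda and is exported as an .ipynb file.
Links to the project files and source code follow: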

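1. Project Overview

1.1 Introduction

The project uses Spark to explore user behavior data from Sparkify, a digital music streaming platform (similar to NetEase Cloud Music or QQ Music), covering 2018-10-01 to 2018-12-01. By analyzing user behavior and user attributes, it extracts features that may help predict whether a user will churn, and then builds a churn prediction model.

1.2 Dataset

To keep analysis and modeling fast, the project works on medium-sparkify-event-data.json (128MB), a small subset of the full dataset (12GB). Once the analysis is complete, the whole workflow is deployed to the Amazon cloud.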

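1.3 Variables

Variable	Category	Description
artist	Song info	Artist name
auth	Page info	How the user accessed the platform
firstName	User info	User's first name
gender	User info	User's gender: F = female, M = male
itemInSession	Page info	Sequence number of the event within the session
lastName	User info	User's last name
length	Song info	Song length (seconds)
level	Event	User level: free or paid
location	Session info	User's location during the session
method	Page info	HTTP method, GET or PUT
page	Page info	Type of user action
registration	User info	User's registration time (epoch milliseconds)
sessionId	Session info	Session ID
song	Song info	Song title
status	Page info	HTTP status code: 2xx = Successful, 3xx = Redirection, 4xx = Client Error
ts	Page info	Time at which the action occurred (epoch milliseconds)
userAgent	Session info	Client environment (browser)
userId	User info	User ID, unique per user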

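1.5 Problem Statement

The problem is to predict accurately which users are at risk of churning, so that they can be retained with incentives before they leave. This requires a machine learning model: analyze the data with Spark, engineer features, train the model on historical data, and predict whether each user churns or stays. The project is therefore a binary classification problem.

1.6 Evaluation Metric

Churned users make up about 22% of all users, so the two classes are clearly imbalanced. The F1 score, the harmonic mean of precision and recall, is used as the evaluation metric; Section 5 (Modeling) explains why.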

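2. Data Assessment and Cleaning

A Spark session is created and the subset medium-sparkify-event-data.json is loaded as the project's data. The data is assessed for completeness, validity, accuracy, and consistency, and the findings determine the cleaning steps.

2.1 Data Assessment

1. Check the data types in the dataset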

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)

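registration and ts are stored as long in the source data, but they represent points in time and should be datetime (or date) values.

2. Check descriptive statistics for each feature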


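import pandas as pd

def build_feat_info(df):
    """Summarize each column of a pandas DataFrame: counts, distinct values, missing values."""
    return pd.DataFrame(data={
        "value_count": [df[x].count() for x in df.columns],               # non-null values
        "value_distinct": [df[x].unique().shape[0] for x in df.columns],  # distinct values (NaN counts as one)
        "num_nans": [df[x].isnull().sum() for x in df.columns],
        "percent_nans": [round(df[x].isnull().sum() / df[x].shape[0], 3) for x in df.columns]  # fraction missing
    }, index=df.columns)

# toPandas() collects the whole Spark DataFrame onto the driver; acceptable for the 128MB subset
data_df = df.toPandas()
feat_info = build_feat_info(data_df)
feat_info.sort_values('percent_nans', axis=0, ascending=False)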

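Figure: feature statistics

artist, length, and song have the same number of missing values (20.4% each); all three describe the song being played.
firstName, gender, lastName, location, registration, and userAgent have the same number of missing values (2.9% each); all of them describe the user.
userId shows no missing values by count, but its minimum value is the empty string, which is examined below.

3. Pick one song-related feature, such as artist, and find which user actions never occur among the records where it is null: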

+--------+
|    page|
+--------+
|NextSong|
+--------+
df.filter("page == 'NextSong'").count()
432877

4. Select the records whose userId is the empty string and list the user actions they contain:

+-------------------+
|               page|
+-------------------+
|               Home|
|              About|
|Submit Registration|
|              Login|
|           Register|
|               Help|
|              Error|
+-------------------+

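Missing song information simply means the user was not playing a song in that event, so no song data was generated; these records are kept.
Records with missing user information are the ones whose userId is the empty string. As the table above shows, they probably belong to unregistered visitors. They carry no information for this analysis and are removed.

Assessment summary:

Convert ts and registration to the correct data type
Keep the records with missing song information
Remove the records with missing user information

2.2 Data Cleaning

Based on the assessment above, the data is cleaned to produce a new dataset, valid_user_log.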

import datetime
from pyspark.sql import functions as func
from pyspark.sql.types import TimestampType

# Remove duplicate rows
df = df.dropDuplicates()

# Remove records whose userId is the empty string
valid_user_log = df.filter(df.userId != "")

# Convert ts and registration (epoch milliseconds) to timestamps
time_change = func.udf(lambda x: datetime.datetime.fromtimestamp(x/1000.0), returnType=TimestampType())
valid_user_log = valid_user_log.withColumn("use_dt",time_change(valid_user_log.ts))
valid_user_log = valid_user_log.withColumn("reg_dt",time_change(valid_user_log.registration))
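The conversion above calls `datetime.fromtimestamp` without a timezone, so the result depends on the local timezone of the machine that runs it. The following is a minimal standard-library sketch of the same conversion with the timezone made explicit. The helper name `ms_to_datetime` is hypothetical, and the UTC+8 offset is an assumption, chosen only because it reproduces the wall-clock times printed in this write-up.

```python
from datetime import datetime, timedelta, timezone

def ms_to_datetime(ms, tz=timezone.utc):
    """Convert epoch milliseconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=tz)

# A ts value that appears in this write-up (a Cancellation Confirmation event)
ts = 1540877026000

utc_dt = ms_to_datetime(ts)
# Assumed offset: UTC+8 matches the local times shown in the article's output
local_dt = ms_to_datetime(ts, timezone(timedelta(hours=8)))

print(utc_dt.isoformat())    # 2018-10-30T05:23:46+00:00
print(local_dt.isoformat())  # 2018-10-30T13:23:46+08:00
```

Making the timezone explicit keeps day-level aggregations reproducible across machines.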

Figure: feature statistics after cleaning
Time span of user activity and range of registration dates:

+-------------------+-------------------+
|      start_us_date|        end_us_date|
+-------------------+-------------------+
|2018-10-01 08:00:11|2018-12-01 08:01:06|
+-------------------+-------------------+
+-------------------+-------------------+
|     start_reg_date|       end_reg_date|
+-------------------+-------------------+
|2017-11-05 11:56:33|2018-11-24 23:37:54|
+-------------------+-------------------+

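3. Data Analysis

3.1 Defining Churn

valid_user_log contains 448 unique users. All of them played music, which confirms that music playback is the platform's main use. More than 90% of users also visited the home page, gave a thumbs up, added songs to a playlist, or added friends.
52% of users upgraded to the paid tier, 22% downgraded back to the free tier, and another 22% cancelled their accounts, as shown below:
Figure: share of users performing each action
Cancellations among paid and free users:
Figure: cancellation rate for paid and free users

Actions performed by paid users but never by free users: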

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
+----------------+

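Which of these actions best represents churn? Taken literally, Cancellation Confirmation means the user confirmed the cancellation of the account, while Downgrade means the user moved from the paid tier to the free tier. Both paid and free users have Cancellation Confirmation events, whereas Downgrade appears only for paid users. Churn is therefore defined as having a Cancellation Confirmation event.

Code defining churn:

from pyspark.sql.types import IntegerType

# Collect the IDs of all users with a Cancellation Confirmation event
churned_users = valid_user_log.select("userId").filter("page == 'Cancellation Confirmation'").dropDuplicates()
churned_user_list = churned_users.toPandas()['userId'].tolist()

# Flag every record of a churned user with Churn = 1
flag_churn = func.udf(lambda x: 1 if x in churned_user_list else 0, IntegerType())
valid_user_log = valid_user_log.withColumn("Churn", flag_churn(valid_user_log["userId"]))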
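The labeling rule can be illustrated without Spark. The sketch below is a plain-Python analogue, not the notebook's code: the function name `flag_churn_users` and the toy event log are hypothetical. It uses a set for the membership test, which is also a cheap improvement over a list when there are many churned users.

```python
def flag_churn_users(events):
    """Return {userId: 0 or 1}; 1 if the user ever reached 'Cancellation Confirmation'."""
    churned = {e["userId"] for e in events
               if e["page"] == "Cancellation Confirmation"}
    return {e["userId"]: int(e["userId"] in churned) for e in events}

# Hypothetical toy log: one user cancels, the other only downgrades
events = [
    {"userId": "200002", "page": "NextSong"},
    {"userId": "200002", "page": "Cancellation Confirmation"},
    {"userId": "125", "page": "NextSong"},
    {"userId": "125", "page": "Downgrade"},
]

labels = flag_churn_users(events)
print(labels)  # {'200002': 1, '125': 0}
```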

The last 10 records of one churned user:

[Row(userId='200002', ts=1540875879000, use_dt=datetime.datetime(2018, 10, 30, 13, 4, 39), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540876079000, use_dt=datetime.datetime(2018, 10, 30, 13, 7, 59), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540876372000, use_dt=datetime.datetime(2018, 10, 30, 13, 12, 52), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540876493000, use_dt=datetime.datetime(2018, 10, 30, 13, 14, 53), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540876681000, use_dt=datetime.datetime(2018, 10, 30, 13, 18, 1), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540876723000, use_dt=datetime.datetime(2018, 10, 30, 13, 18, 43), page='Add to Playlist', Churn=1),
 Row(userId='200002', ts=1540876983000, use_dt=datetime.datetime(2018, 10, 30, 13, 23, 3), page='NextSong', Churn=1),
 Row(userId='200002', ts=1540877020000, use_dt=datetime.datetime(2018, 10, 30, 13, 23, 40), page='Settings', Churn=1),
 Row(userId='200002', ts=1540877021000, use_dt=datetime.datetime(2018, 10, 30, 13, 23, 41), page='Cancel', Churn=1),
 Row(userId='200002', ts=1540877026000, use_dt=datetime.datetime(2018, 10, 30, 13, 23, 46), page='Cancellation Confirmation', Churn=1)]

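3.2 Exploratory Analysis

Exploratory analysis compares retained and churned users on specific behaviors, such as how often a given action occurs in a fixed period or how many songs are played.

Questions:

How are the number of sessions and the average songs played per session distributed across users?
How many songs did each user play in the most recent month?
How do gender, level, and downgrade behavior differ between churned and retained users?
How is time on the platform distributed?
How much do users like the platform or its music, measured by thumbs-up counts?
How strong is users' social need, measured by friends added?
How often do users add songs to a playlist?

Q1. Gender distribution of churned and retained users

Number of users and page views (PV) by gender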

+------+---+------+-------+
|gender| AU|    PV|  perPv|
+------+---+------+-------+
|     F|198|225393|1138.35|
|     M|250|302612|1210.45|
+------+---+------+-------+

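Churn rate by gender
Figure: churn rate by gender

There are more male users than female users, the average male user has slightly more page views than the average female user, and male users churn at a slightly lower rate.

Q2. Churned and retained users by level

Figure: churn rate by level

Paid users have the higher churn rate.

Q3. Churned and retained users by downgrade behavior

Figure: churn rate for users with and without a Downgrade event

Users without a downgrade event churn at a slightly higher rate than users with one.

Q4. Session counts and average songs played per session for churned and retained users

sessionId identifies a user's session. When a user opens the site in a browser, the session stores the attributes and configuration needed for that visit. Sessions expire after a period of inactivity, and the user gets a new session on the next login, so the number of distinct sessionId values roughly measures how many times the user logged in during the period.

Distribution of session counts for churned and retained users
Figure: sessionCount distribution
Average songs played per session for churned and retained users
Figure: songs played per session

Retained users logged in more often than churned users.
The average songs played per session is more spread out for churned users; outliers aside, retained users play slightly more songs per session overall.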
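The two session features discussed in Q4 can be sketched in plain Python. This is an illustrative analogue under stated assumptions, not the notebook's Spark code: the function name `session_stats` and the toy events are hypothetical, and a "song played" is assumed to be an event whose page is NextSong.

```python
from collections import defaultdict

def session_stats(events):
    """Per user: number of distinct sessions and average NextSong events per session."""
    per_user = defaultdict(dict)  # userId -> {sessionId: songs played}
    for e in events:
        sessions = per_user[e["userId"]]
        sessions.setdefault(e["sessionId"], 0)  # count sessions even without songs
        if e["page"] == "NextSong":
            sessions[e["sessionId"]] += 1
    return {
        user: {
            "sessCount": len(sessions),
            "avg_sess_songplay": round(sum(sessions.values()) / len(sessions), 2),
        }
        for user, sessions in per_user.items()
    }

# Hypothetical toy log
events = [
    {"userId": "a", "sessionId": 1, "page": "NextSong"},
    {"userId": "a", "sessionId": 1, "page": "Home"},
    {"userId": "a", "sessionId": 1, "page": "NextSong"},
    {"userId": "a", "sessionId": 2, "page": "NextSong"},
    {"userId": "b", "sessionId": 3, "page": "Home"},
]

stats = session_stats(events)
print(stats["a"])  # {'sessCount': 2, 'avg_sess_songplay': 1.5}
print(stats["b"])  # {'sessCount': 1, 'avg_sess_songplay': 0.0}
```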

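Q5. Songs played in the month before each user's last event, for churned and retained users

The dataset covers only two months of activity, so recent playback is measured over the most recent month.

Code:

from pyspark.sql import Window

# Songs played per user per day. songplay is a per-event indicator column created
# elsewhere in the notebook (presumably 1 for NextSong events, 0 otherwise)
day_song = valid_user_log \
            .groupby("userId",func.date_trunc("day",valid_user_log.use_dt).alias("day_date")) \
            .agg(func.sum("songplay").alias("amt_song"), func.max("Churn").alias("churn")) \
            .orderBy("userId","day_date")
# Rolling sum over the 30 days ending on each day (day_date cast to long is in seconds)
window_recent1month = Window.partitionBy("userId","churn") \
            .orderBy(func.column("day_date").cast("long")) \
            .rangeBetween(-29*86400, 0)
day_song = day_song.withColumn("moving1month_sum", func.sum("amt_song").over(window_recent1month))
# Sort explicitly: row order is not guaranteed after the window operation
day_song_pd = day_song.toPandas().sort_values(["userId", "day_date"])

# Keep each user's last day, i.e. the 30-day total up to the last event
recent1month_songplay = day_song_pd.groupby("userId").apply(lambda x:x.iloc[-1])
recent1month_songplay.head(3)

Figure: songs played in the last 30 days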

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8,6))
# Log-spaced bins, because play counts span several orders of magnitude
bin_edges2 = 10 ** np.arange(0.8, np.log10(recent1month_songplay.moving1month_sum.max())+0.1, 0.1)
tick_locs2 = [10, 30, 100, 300, 1000]
# sns.distplot is deprecated in recent seaborn releases; sns.histplot is its replacement there
sns.distplot(recent1month_songplay.query("churn == 1")["moving1month_sum"], kde=False, norm_hist=True, bins=bin_edges2, ax=ax, label="churn", hist_kws={"alpha":0.7})
sns.distplot(recent1month_songplay.query("churn == 0")["moving1month_sum"], kde=False, norm_hist=True, bins=bin_edges2, ax=ax, label="retain", hist_kws={"alpha":0.7});
ax.set_xscale("log")
ax.set_xticks(tick_locs2)
ax.set_xticklabels(tick_locs2)
ax.set(title="total songplay in recent 1 month before last event".title(),
       xlabel="Total Songplay", ylabel="Normalized Frequency")
ax.legend(loc=1);

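Figure: distribution of songs played in the last month

Overall, retained users played more songs recently than churned users. Because the dataset covers only two months, the overall average per-session playback already reflects recent playback reasonably well. On a larger dataset with a longer time span, it would be worth computing playback over the last 3 months, 6 months, or year, depending on the problem.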
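The trailing 30-day window used in Q5 can be checked on a small example. The following plain-Python sketch mirrors the idea of a range window covering the current day and the 29 days before it; the function name `trailing_sum` and the daily counts are hypothetical.

```python
from datetime import date, timedelta

def trailing_sum(daily_counts, days=30):
    """For each day, sum the counts over the `days`-day window ending on that day."""
    result = {}
    for day in sorted(daily_counts):
        start = day - timedelta(days=days - 1)  # 29 days back for a 30-day window
        result[day] = sum(n for d, n in daily_counts.items() if start <= d <= day)
    return result

# Hypothetical daily song counts for one user
plays = {
    date(2018, 10, 1): 10,
    date(2018, 10, 20): 5,
    date(2018, 10, 30): 7,
    date(2018, 11, 5): 3,
}

sums = trailing_sum(plays)
last_day = max(sums)
print(last_day, sums[last_day])  # 2018-11-05 15
```

The value on the user's last active day (15 here, because 2018-10-01 falls outside the window) corresponds to what the analysis keeps for each user.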

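Q6. Time on the platform for churned and retained users

Measured as the number of days from registration to the user's last event

Figure: cumulative days on the platform for churned and retained users

Retained users have been on the platform longer than churned users; long-standing users are stickier.

Q7. How much churned and retained users like the platform

More than 95% of users have given a Thumbs Up, so the number of Thumbs Up events per user is used to compare how much the two groups like the platform.

Figure: distribution of thumbs-up counts

Retained users give slightly more thumbs up to the platform or its music than churned users.

Q8. Adding friends and adding songs to a playlist, for churned and retained users

Most users also add friends and add songs to playlists. How often they do so reflects, to some extent, needs beyond plain music playback, such as social needs and music curation.

On this platform, the more friends a user adds and the more songs a user adds to playlists, the more likely the user is to be retained.

Q9. Relationship between liking, days on the platform, and songs played

Figure: relationship between liking, days on the platform, and songs played

The figure shows that songs played and thumbs-up count are positively correlated: the more thumbs up a user gives, the more songs the user plays.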

4. Feature Engineering

Based on the exploratory analysis, the following features may help predict churn:

Feature	Type	Description
gendel_female	Categorical	Gender: male is 0, female is 1
level_paid	Categorical	Level: free user is 0, paid user is 1
churn	Categorical	Label: retained user is 0, churned user is 1
avg_sess_songplay	Numeric	Average number of songs played per session
sessCount	Numeric	Total number of sessions
amt_thumbsup	Numeric	Total number of thumbs up given by the user in this period
amt_add_playlist	Numeric	Total number of songs added to a playlist by the user in this period
amt_add_friends	Numeric	Total number of friends added by the user in this period
daysUsing	Numeric	Number of days between registration and the user's last event

4.1 Assembling the Features

The features listed in the table are extracted from the source data and merged into a single DataFrame that serves as the modeling dataset. See the GitHub repository for the source code.

features.show(5)
+------+-------------+----------+-----+------------+---------------+----------------+---------+---------+-----------------+
|userId|gendel_female|level_paid|churn|amt_thumbsup|amt_add_friends|amt_add_playlist|sessCount|daysUsing|avg_sess_songplay|
+------+-------------+----------+-----+------------+---------------+----------------+---------+---------+-----------------+
|100010|            1|         0|    1|           4|              3|               1|        2|       14|             48.0|
|200002|            0|         1|    1|          15|              2|               6|        5|       53|             62.0|
|   296|            1|         1|    1|           8|              2|               3|        5|       27|             22.4|
|   125|            0|         0|    0|           3|              3|               2|        3|      105|            20.67|
|   124|            1|         1|    1|         102|             26|              45|       17|      112|           107.41|
+------+-------------+----------+-----+------------+---------------+----------------+---------+---------+-----------------+

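5. Modeling

The project brief: split the full dataset into training, validation, and test sets; try several of the machine learning methods covered in the course; evaluate each method and tune parameters as needed; pick the best-performing model and report its results on held-out data. Because churned users are a small part of the dataset, the brief suggests the F1 score as the metric to optimize.

5.1 Why F1 Score Is the Optimization Metric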

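The $F_1$ score is the harmonic mean of $Precision$ and $Recall$:

$F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$

$TP$ is the number of true positives (churned users predicted as churned). $FP$ is the number of false positives (retained users predicted as churned). $FN$ is the number of false negatives (churned users predicted as retained).
Because few users actually churn, the counts for that class are small whichever way they are predicted, so $FN$ tends to be small. Because many users are actually retained, the counts for that class are large, so $FP$ can be large.
If only $Precision$ or only $Recall$ were optimized, this imbalance would distort the picture: $Precision$ can be pulled down by a large $FP$, while $Recall$ can look high because $FN$ is small. Neither metric alone reflects the model's quality, so $F_1$ or $F_\beta$ is the better choice.
$F_1$ is chosen over $F_\beta$ because the project does not cover the cost or effect of the marketing or incentive measures that would be used to win back churning users, so there is no basis for weighting $Precision$ over $Recall$ or the reverse.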
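To make the difference between F1 and F-beta concrete, here is a small sketch of the general F-beta formula, $(1+\beta^2) \cdot Precision \cdot Recall / (\beta^2 \cdot Precision + Recall)$, of which F1 is the case $\beta = 1$. The function name `f_beta` and the precision and recall values are hypothetical and chosen only for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model: precision 0.5, recall 0.8
p, r = 0.5, 0.8
print(round(f_beta(p, r), 3))            # F1:   0.615
print(round(f_beta(p, r, beta=2.0), 3))  # F2:   0.714 (leans toward recall)
print(round(f_beta(p, r, beta=0.5), 3))  # F0.5: 0.541 (leans toward precision)
```

Without cost information there is no principled way to pick a beta other than 1, which matches the reasoning above.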

5.2 Splitting the Data

train, validation, test = features.randomSplit([0.7,0.15, 0.15], seed=42)

print("The whole data's shape : {}\nTrain's shape : {}\nValidation's shape : {}\nTest's shape : {}" \
      .format(features.count(), train.count(), validation.count(), test.count()))
      
>>>
The whole data's shape : 448
Train's shape : 319
Validation's shape : 68
Test's shape : 61
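The split sizes above (319 / 68 / 61) are close to, but not exactly, 70% / 15% / 15% of 448. That is expected when every row is assigned to a split by an independent random draw. The sketch below illustrates that idea in plain Python; it is an assumption-laden analogue rather than Spark's implementation, the function name `random_split` is hypothetical, and it will not reproduce the exact counts above.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row to a split by one random draw against cumulative weights."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bounds, e.g. 0.7, 0.85, 1.0

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

parts = random_split(range(448), [0.7, 0.15, 0.15])
print([len(part) for part in parts])
```

The sizes always add up to 448, while the individual counts vary with the seed.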

# Share of churned users among the unique users in the training set
train.select(func.round(func.mean("churn"),2).alias("churn_pct")).show()

+---------+
|churn_pct|
+---------+
|     0.23|
+---------+

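Churned users make up only 23% of the training set. This imbalance between churned and retained users hampers learning and leads to poor predictions, so the training set is resampled.

5.3 Resampling

Retained users are randomly sampled from the training set until their number equals the number of churned users, and the two groups together form the new training set.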

def resample(df, minor, major):
    '''
    Undersample the majority class so both classes have the same number of rows.

    INPUTS:
    df - (dataframe) dataset to be balanced
    minor - (int) label value of the minority class
    major - (int) label value of the majority class

    OUTPUTS:
    bal_df - (dataframe) balanced dataset with as many majority rows as minority rows
    '''

    minor_records = df.filter(df.churn == minor)
    minor_record_count = minor_records.count()
    # Shuffle the majority rows before limit(): sample(fraction=1.0) keeps every row,
    # so on its own it does not randomize which rows limit() returns
    major_records = df.filter(df.churn == major) \
        .orderBy(func.rand(seed=42)) \
        .limit(minor_record_count)

    bal_df = minor_records.unionByName(major_records)

    return bal_df

bal_train = resample(train, 1, 0)
# round() without a scale rounds to 0 decimals, so a churn share of 0.5 is displayed as 1.0
bal_train.select(func.round(func.mean(bal_train.churn)).alias("train_churn_pct")).show()
+---------------+
|train_churn_pct|
+---------------+
|            1.0|
+---------------+
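The undersampling step can be sketched in plain Python to show what a balanced result looks like. This is an illustrative analogue, not the Spark function above: the function name `undersample` and the toy rows (23% churn, like the training set) are hypothetical.

```python
import random

def undersample(rows, label_key="churn", minor=1, major=0, seed=42):
    """Keep all minority rows plus an equally sized random sample of majority rows."""
    minority = [r for r in rows if r[label_key] == minor]
    majority = [r for r in rows if r[label_key] == major]
    rng = random.Random(seed)
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return minority + kept

# Hypothetical data: 100 users, 23 of them churned
rows = [{"userId": i, "churn": int(i < 23)} for i in range(100)]

balanced = undersample(rows)
share = sum(r["churn"] for r in balanced) / len(balanced)
print(len(balanced), share)  # 46 0.5
```

Only 46 of the 100 rows survive, which shows how much majority-class data undersampling discards.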

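5.4 Building the Pipeline and Training the Model

The dataset is preprocessed and a learner is chosen to train the model. So that the workflow can be reused on a larger dataset, the whole procedure is wrapped in functions:

Separate features and label
Preprocess the data, e.g. feature scaling
Choose a learner
Build the pipeline
Train the model on the training set
Predict
Evaluate

The pipeline implements the first four steps. MinMaxScaler is used to rescale the features to a common range: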

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

def create_pipeline(label_col, feature_cols, clf):

    '''
    INPUTS:

    label_col - (string) label column name
    feature_cols - (list) list of feature column names
    clf - (classifier) the learning algorithm to train and predict with

    OUTPUTS:

    pipeline - (pipeline) the assembled learning steps
    '''

    # Index the label, assemble the feature columns into one vector, then rescale it
    labelIndexer = StringIndexer(inputCol=label_col, outputCol="label")
    featureIndexer = VectorAssembler(inputCols=feature_cols, outputCol="rawFeatures")
    scaler = MinMaxScaler(inputCol="rawFeatures", outputCol="features")

    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, scaler, clf])

    return pipeline
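For reference, min-max scaling maps each feature linearly onto [0, 1] using that feature's minimum and maximum. The sketch below shows the arithmetic on one column in plain Python. The function name `min_max_scale` is hypothetical, the values are the daysUsing column from the features.show(5) output, and mapping a constant column to 0.5 is an assumption intended to mirror Spark's default behavior.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: assumed to map to the midpoint of the target range
        return [0.5 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# daysUsing values taken from the features.show(5) output
days_using = [14, 53, 27, 105, 112]

scaled = min_max_scale(days_using)
print([round(v, 3) for v in scaled])  # [0.0, 0.398, 0.133, 0.929, 1.0]
```

Scaling matters most for logistic regression here; tree-based models such as random forest are largely insensitive to it.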

The model is trained on the training set, then used to predict on the validation set and evaluated, with F1 score as the metric:

import time
from pyspark.mllib.evaluation import MulticlassMetrics

def train_predict(training_data, test_data, pipeline):
    '''
    INPUTS:

    training_data - (dataframe) balanced dataset used to train the model
    test_data - (dataframe) dataset the trained model predicts on
    pipeline - (pipeline) the learning stages defined above

    OUTPUTS:

    result - (dict) training and prediction runtimes (seconds) and F1 scores on both datasets
    '''
    result = dict()
    
    # Train model and calculate runtime.
    startTime = time.time()
    model = pipeline.fit(training_data)
    endTime = time.time()
    
    train_run_time = round(endTime - startTime, 2)
    result["train_runtime"] = train_run_time
    
    # Make predictions.
    startTime = time.time()
    trainPredictions = model.transform(training_data)
    testPredictions = model.transform(test_data)
    endTime = time.time()
    
    pred_run_time = round(endTime - startTime, 2)
    result["pred_runtime"] = pred_run_time
    
    # Calculate f1 score on training data and test data
    train_predictionAndLabels = trainPredictions.select("prediction","label").rdd.map(lambda row: (row[0], row[1]))
    test_predictionAndLabels = testPredictions.select("prediction","label").rdd.map(lambda row: (row[0], row[1]))
    
    train_metrics = MulticlassMetrics(train_predictionAndLabels)
    test_metrics = MulticlassMetrics(test_predictionAndLabels)
    
    train_f1Score = train_metrics.fMeasure(label=1.0, beta=1.0)
    test_f1Score = test_metrics.fMeasure(label=1.0, beta=1.0)
    result["train_f1"] = train_f1Score
    result["test_f1"] = test_f1Score
    
    return result

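5.5 Initializing the Algorithms

This is a binary classification problem: predict from a user's features whether the user churns. The common classification algorithms listed below are each run with default parameters, and their predictions on the validation set are compared using F1 score.

Common classification algorithms:

Logistic regression
Decision tree
Random forest
Naive Bayes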

from pyspark.ml.classification import (LogisticRegression, DecisionTreeClassifier,
                                       RandomForestClassifier, NaiveBayes)

# Build the list of classifiers, all with default parameters
LR_clf = LogisticRegression()
DT_clf = DecisionTreeClassifier()
RF_clf = RandomForestClassifier()
NB_clf = NaiveBayes()
learner_list = [LR_clf, DT_clf, RF_clf, NB_clf]

# Feature and label columns passed to the helper functions defined above
feature_cols = ['gendel_female','level_paid','amt_thumbsup','amt_add_friends',
                'amt_add_playlist','sessCount','daysUsing','avg_sess_songplay']
label_col = "churn"

# Train and evaluate every classifier
results = dict()
start_point_time = time.time()
for learner in learner_list:
    clf_name = learner.__class__.__name__
    print("This is {} started".format(clf_name))
    pipeline = create_pipeline(label_col=label_col, feature_cols=feature_cols, clf=learner)
    # Fit on the balanced training set, evaluate on the validation set
    result = train_predict(bal_train, validation, pipeline)
    results[clf_name] = result

end_point_time = time.time()
print("Running time : {} minutes".format((end_point_time - start_point_time)/60))

evaluation_df = pd.DataFrame(results)
evaluation_df

Results:

This is LogisticRegression started
This is DecisionTreeClassifier started
This is RandomForestClassifier started
This is NaiveBayes started
Running time : 322.1041066328684 minutes

	LogisticRegression	DecisionTreeClassifier	RandomForestClassifier	NaiveBayes
train_runtime	2860.10	2694.77	2501.14	2408.14
pred_runtime	0.72	0.85	0.88	0.71
train_f1	0.6345	0.7383	0.8281	0.6
test_f1	0.3846	0.3929	0.4878	0.3774

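Random forest scores higher than the other algorithms both in fitting the training set (train_f1) and in predicting on the validation set (test_f1), so it is chosen for this project and tuned with a grid search to select the best model.

5.6 Grid Search Tuning

Following the evaluation above, random forest is selected and its parameters are tuned. 3-fold cross-validation on the training set is used to optimize the model and limit overfitting.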

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

start_time = time.time()

print("Unoptimized RF model: F1 score on validation data is {:.2f}".format(evaluation_df.loc["test_f1","RandomForestClassifier"]))
pipeline_rf = create_pipeline(label_col=label_col, feature_cols=feature_cols, clf=RF_clf)
paramGrid_rf = ParamGridBuilder() \
    .addGrid(RF_clf.maxDepth, [3, 5, 7]) \
    .addGrid(RF_clf.minInstancesPerNode, [2, 4, 5]) \
    .addGrid(RF_clf.numTrees, [4, 6, 8]) \
    .addGrid(RF_clf.seed, [42]) \
    .build()

# Note: cross-validation selects parameters by area under the ROC curve, not by F1
crossval_rf = CrossValidator(estimator=pipeline_rf,
                          estimatorParamMaps=paramGrid_rf,
                          evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                          numFolds=3)

cvModel_rf = crossval_rf.fit(bal_train)

pred_tuned_rf = cvModel_rf.transform(validation)
pred_tuned_rdd_rf = pred_tuned_rf.select("prediction","label").rdd.map(lambda row: (row[0], row[1]))
metrics_tuned_rf = MulticlassMetrics(pred_tuned_rdd_rf)
f1score_tuned_rf = metrics_tuned_rf.fMeasure(label=1.0, beta=1.0)


end_time= time.time()
print("Optimized RF model: F1 score on validation data is {:.2f}".format(f1score_tuned_rf))
print("RF tuned running time is {:.2f} minutes".format((end_time - start_time)/60))

Unoptimized RF model: F1 score on validation data is 0.49
Optimized RF model: F1 score on validation data is 0.53
RF tuned running time is 229.27 minutes

The best model parameters are:

maxDepth: 7
minInstancesPerNode: 4
numTrees: 4
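The size of the search explains part of the long tuning time. The sketch below enumerates the same grid in plain Python to count the combinations. The variable names are hypothetical, and the count of model fits assumes that each combination is trained once per fold.

```python
from itertools import product

# Same values as the parameter grid in the article (seed is fixed, so it adds no combinations)
grid = {
    "maxDepth": [3, 5, 7],
    "minInstancesPerNode": [2, 4, 5],
    "numTrees": [4, 6, 8],
}

# Every combination of the three parameters
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
num_folds = 3

print(len(combos), len(combos) * num_folds)  # 27 81
print({"maxDepth": 7, "minInstancesPerNode": 4, "numTrees": 4} in combos)  # True
```

With 27 combinations and 3 folds, the search trains on the order of 81 pipelines before the final model is refit.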

5.7 Evaluation Results

5.7.1 F1score

testPredictions = cvModel_rf.transform(test)
test_predictionAndLabels = testPredictions.select("prediction","label").rdd.map(lambda row: (row[0], row[1]))
test_metrics = MulticlassMetrics(test_predictionAndLabels)
test_f1 = test_metrics.fMeasure(label=1.0, beta=1.0)
print("Optimized RF model: F1 score on test data is {:.2f}".format(test_f1))

Optimized RF model: F1 score on test data is 0.18

5.7.2 Confusion Matrix

confMatrix = test_metrics.confusionMatrix().toArray()
confMatrix_pd = pd.DataFrame(confMatrix,columns=[0, 1])
confMatrix_pd = confMatrix_pd.astype('int')

Figure: confusion matrix

print("Optimized model to predict whether the user is churned, more evaluation data as follows:")
print("Precision rate is {:.2f}\nRecall rate is {:.2f}\nAccuracy is {:.2f}".format(test_metrics.precision(label=1.0),
                                                                                   test_metrics.recall(label=1.0),
                                                                                   test_metrics.accuracy))

Optimized model to predict whether the user is churned, more evaluation data as follows:
Precision rate is 0.12
Recall rate is 0.43
Accuracy is 0.56

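The model performs poorly on the test set: the F1 score is only 18%. The confusion matrix shows that precision for the churn class is only 12%, meaning that only 12% of the users predicted to churn actually churned. The model therefore fails to solve the problem the project set out to address.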
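The relationship between the confusion matrix and the reported metrics can be sketched as follows. The actual confusion matrix is not reproduced in the text, so the counts below are hypothetical: they were chosen only because they sum to the 61 test users and round to the reported precision, recall, accuracy, and F1. The function name `binary_metrics` is also hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive (churn) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts (not from the article), consistent with the reported figures
m = binary_metrics(tp=3, fp=23, fn=4, tn=31)
print({k: round(v, 2) for k, v in m.items()})
# {'precision': 0.12, 'recall': 0.43, 'f1': 0.18, 'accuracy': 0.56}
```

Under these assumed counts, the low F1 is driven by false positives: many retained users are flagged as churners, while fewer than half of the actual churners are caught.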

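6. Summary

6.1 Analysis Workflow

Spark and Python were used to assess, clean, and visually explore the data.
The exploratory analysis framed the project as a binary classification problem (retained vs. churned users), and its findings guided the extraction of key features for predicting churn.
A Pipeline was built and four suitable classifiers were tried; Random Forest gave the best results.
Random Forest was tuned with a grid search and cross-validation, and the best model was applied to the test set.
Finally, the best model's predictions on the test set were evaluated with F1 score and a confusion matrix.

6.2 Problems and Challenges

6.2.1 Problems

The model predicts poorly on the test set: F1 is only 18% and precision for the churn class is only 12%. Since the goal was a model that predicts churn accurately, the analysis did not meet its goal.
A likely cause is the small amount of data, particularly for churned users, who make up only 22% of all users. To balance the classes, the retained users had to be undersampled, so in effect 56% of the data went unused, which has a large impact on the model.
After training, the model did noticeably better on the validation set (F1 of 0.53) than on the test set (F1 of 0.18), which also suggests that with so few unique users the samples differ substantially from one another.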