- In the previous post we did some simple data analysis; now let's build a prediction task: predicting which users are likely to leave the Dataquest learning platform, using logistic regression. We don't want to model what someone does at the exact moment they leave; we care about what they do on the screens leading up to it, so that we can step in and help before they go. So we take the last 5 events of each session and treat them as events at risk of the user leaving, while all other events are treated as not at risk.
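- The labeling rule itself is a one-liner. Here is a toy illustration (toy_session is made up just for this sketch); the real target column is built the same way later, inside the per-session loop:
import pandas as pd
# A toy session with 8 events: the last 5 are labeled as at risk, the rest are not.
toy_session = pd.DataFrame({"event_type": ["started-screen"] * 8})
toy_session["target"] = toy_session.index.isin(toy_session.tail(5).index)
print(toy_session["target"].tolist())
'''
[False, False, False, True, True, True, True, True]
'''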
Remove Columns
- For prediction purposes the id column of an event carries no information, so we drop it:
'''
columns: ['created_at', 'event_type', 'id', 'mission', 'sequence', 'session_id', 'type']
'''
# drop a row (0), or column (1)
event_frame = event_frame.drop("id", axis=1)
Convert Text Fields
- Categorical variables such as event_type are easiest to handle by converting them to numeric codes, so they can be fed to a machine learning algorithm.
- Split the data into individual sessions by session_id, then sort each session by created_at in ascending order.
- predictor_frame will hold all of the predictor columns; we start by adding an event_type column:
import pandas as pd

# Split the data into groups, one per session.
groups = event_frame.groupby("session_id")
# Make a dictionary that maps events to codes.
event_codes = {
'started-mission': 1,
'started-screen': 2,
'run-code': 3,
'next-screen': 4,
'get-answer': 5,
'reset-code': 6,
'interactive-mode-start': 7,
'show-hint': 8,
'open-forum': 9
}
all_predictors = []
for name, group in groups:
# Sort the group
    group = group.sort_values("created_at", ascending=True)
# Replace the values in the event_type column with their corresponding codes.
# The .apply method applies a function to each item in a series or dataframe in turn.
# The lambda function will return the result of looking up the event_type
# value in the event_codes dictionary.
event_type = group["event_type"].apply(lambda x: event_codes[x])
# Make a dataframe
predictor_frame = pd.DataFrame({
"event_type": event_type
})
# Add the predictor frame to the list of predictor frames
# We'll concatenate everything in this list later to make our predictor frame.
all_predictors.append(predictor_frame)
'''
predictor_frame : DataFrame (<class 'pandas.core.frame.DataFrame'>)
event_type
7441 3
7442 3
7443 8
7444 3
7445 4
7446 2
'''
'''
all_predictors : list (<class 'list'>)
[ event_type
0 1
1 2
2 2
3 2
4 2
...
'''
- With grouped data like this, we first append each group's DataFrame to a list, and later concatenate the list into one complete DataFrame (see the one-line sketch below).
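- For reference, the concatenation itself is a single pd.concat call (it appears again later when we build the train and test sets); the all_events name below is just for this sketch:
# Stack the per-session frames into one frame along the row axis.
all_events = pd.concat(all_predictors, axis=0)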
Adding More Columns
- We keep adding columns to the predictors, for example session_time: the number of seconds between the current event and the first event of the session, computed from the difference in created_at values:
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
# Convert the created_at column to a datetime type, so we can do math with it.
group['created_at'] = group['created_at'].astype('datetime64[ns]')
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
# Find the total seconds between the current event and the first event in the session.
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds())
})
all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[ event_type session_time
0 1 0.000
1 2 0.013
2 2 18.551
3 2 20.722
4 2 33.783
5 3 190.410
6 4 200.283
7 2 201.776
8 3 364.327
'''
Number Of Previous Events Column
- session_events: the number of events that come before the current event within its session:
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
group['created_at'] = group['created_at'].astype('datetime64[ns]')
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
# Generate a sequence the same length as the number of rows in the group.
        # It starts at 0 and ends at the length of the group minus one.
        # Because the group is sorted by created_at in ascending order,
        # this is also a count of the number of previous events.
"session_events": range(group.shape[0])
})
all_predictors.append(predictor_frame)
'''
...
34 4 34 969.435
35 2 35 969.899
36 1 36 986.034
37 2 37 986.281,
event_type session_events session_time
38 3 0 0.000
39 3 1 14.140
40 3 2 25.562
41 4 3 30.110
...
'''
Number Of Events On Current Screen
- For each event in a session, count how many events have already occurred on the same screen; the sequence column identifies the screen an event belongs to.
- Walk through the events and keep track of the current screen: the first event on a screen gets a count of 0, the second gets 1 (the user is still on the same screen), and so on. As soon as an event's screen differs from the previous event's screen, the counter resets to 0.
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
    group['created_at'] = group['created_at'].astype('datetime64[ns]')
    # Compute how many events occurred on each screen.
screen_events = []
counter = 0
prev_sequence = None
for sequence in group["sequence"]:
if sequence == prev_sequence:
counter += 1
else:
counter = 0
prev_sequence = sequence
screen_events.append(counter)
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
"session_events": range(group.shape[0]),
"screen_events": screen_events
})
all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[ event_type screen_events session_events session_time
0 1 0 0 0.000
1 2 0 1 0.013
2 2 0 2 18.551
3 2 0 3 20.722
4 2 0 4 33.783
5 3 1 5 190.410
6 4 2 6 200.283
7 2 0 7 201.776
...
'''
More Predictors
There are more predictors that could be added, for example (a rough sketch follows this list):
- how many screens the current mission has
- how many times the code on the current screen has been run
- how many times show answer and show hint have been used
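- As a rough sketch of how two of these could be computed (the column names code_runs_on_screen and hints_so_far are made up here, and sequence alone is used to identify the screen, mirroring the screen_events counter above):
for name, group in event_frame.groupby("session_id"):
    group = group.sort_values("created_at", ascending=True)
    extra = pd.DataFrame({
        # Running count of run-code events on the current screen so far.
        "code_runs_on_screen": (group["event_type"] == "run-code")
            .astype(int).groupby(group["sequence"]).cumsum(),
        # Running count of get-answer and show-hint events in the session so far.
        "hints_so_far": group["event_type"].isin(["get-answer", "show-hint"]).cumsum(),
    })
    # These columns could then be added to predictor_frame in the main loop above.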
Creating Our Target Variable
- For each session, set target to True for the last five events and False for all the others.
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
    group['created_at'] = group['created_at'].astype('datetime64[ns]')
    # Compute how many events occurred on each screen.
screen_events = []
counter = 0
prev_sequence = None
for sequence in group["sequence"]:
if sequence == prev_sequence:
counter += 1
else:
counter = 0
prev_sequence = sequence
screen_events.append(counter)
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
"session_events": range(group.shape[0]),
"screen_events": screen_events,
# We generate our target by seeing if a row is in the last 5 rows in the group.
# The group index is a series that has a unique value for each row in the group.
# We're seeing if the index is in the last 5 indices.
"target": group.index.isin(group.tail(5).index)
})
all_predictors.append(predictor_frame)
Train And Test Split
- We now have one DataFrame per session, stored in a list. First we split the list into a training part and a test part, then concatenate the DataFrames inside each part. Because the per-session DataFrames have different lengths, the final event-level split is not exactly 70/30, but that does not matter much:
import math
# Decide where to split the data -- we want the first 70% in the training set.
train_thresh = math.floor(len(all_predictors) * .7)
# Split the list of dataframes
train = all_predictors[:train_thresh]
test = all_predictors[train_thresh:]
# Concatenate all the items in the split lists.
# Concatenate them along the row axis.
# This results in one big dataframe.
train = pd.concat(train, axis=0)
test = pd.concat(test, axis=0)
# Around 6000 rows in the training data.
print(train.shape[0])
# Around 1000 rows in the test data.
print(test.shape[0])
'''
6196
1251
'''
Training The Algorithm
from sklearn.linear_model import LogisticRegression
predictors = ['event_type', 'screen_events', 'session_events', 'session_time']
clf = LogisticRegression()
clf.fit(train[predictors], train["target"])
# Make predictions of the probability that the row is a 0 or a 1.
predictions = clf.predict_proba(test[predictors])
Measuring Error
- Compute the AUC score:
from sklearn.metrics import roc_auc_score
# This is the score of our classifier. We want to compare our target against
# the probability that the row is a 1 (the second column of the predictions).
print(roc_auc_score(test["target"], predictions[:,1]))
'''
0.61255321887
'''
Conclusions
- The classifier is not very accurate yet, but we can improve it by adding extra information, such as the predictors mentioned earlier, and by trying other algorithms; random forests, for example, usually perform well. A quick sketch of that swap is shown below.
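- A minimal sketch of swapping in a random forest on the same features (the hyperparameters below are chosen only for illustration):
from sklearn.ensemble import RandomForestClassifier
# Reuse the same predictors, train/test split, and AUC metric as above.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(train[predictors], train["target"])
forest_predictions = forest.predict_proba(test[predictors])
print(roc_auc_score(test["target"], forest_predictions[:, 1]))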