Dataquest User Churn Prediction

  • The previous post did some simple exploratory analysis; now we turn to a prediction task: using logistic regression to predict which users are likely to leave the Dataquest learning platform. We don't want to know what someone does at the exact moment they leave; we care about what they do in the screens leading up to leaving, so that we can intervene and keep them on the platform. We therefore take the last 5 events of each session and label them as carrying churn risk, while all other events are labeled as carrying no churn risk.

Remove Columns

  • The id column of an event carries no predictive signal, so we drop it:
'''
columns: ['created_at', 'event_type', 'id', 'mission', 'sequence', 'session_id', 'type']
'''
# drop a row (0), or column (1)
event_frame = event_frame.drop("id", axis=1)

Convert Text Fields

  • For a categorical variable such as event_type, the simplest approach is to convert it to numeric codes so it can be used as input to a machine learning algorithm.
  • Split the data into individual sessions by session_id, then sort each session by created_at in ascending order.
  • predictor_frame stores all the predictor columns; start by adding an event_type column:
import pandas as pd

# Split the data into groups.
groups = event_frame.groupby("session_id")

# Make a dictionary that maps events to codes.
event_codes = {
    'started-mission': 1,
    'started-screen': 2,
    'run-code': 3,
    'next-screen': 4,
    'get-answer': 5,
    'reset-code': 6,
    'interactive-mode-start': 7,
    'show-hint': 8,
    'open-forum': 9
}

all_predictors = []
for name, group in groups:
    # Sort the group
    group = group.sort_values("created_at")

    # Replace the values in the event_type column with their corresponding codes.
    # The .apply method applies a function to each item in a series or dataframe in turn.
    # The lambda function will return the result of looking up the event_type
    # value in the event_codes dictionary.
    event_type = group["event_type"].apply(lambda x: event_codes[x])
    # Make a dataframe
    predictor_frame = pd.DataFrame({
                "event_type": event_type
            })
    # Add the predictor frame to the list of predictor frames
    # We'll concatenate everything in this list later to make our predictor frame.
    all_predictors.append(predictor_frame)
'''
predictor_frame : DataFrame (<class 'pandas.core.frame.DataFrame'>)
event_type

7441     3
7442     3
7443     8
7444     3
7445     4
7446     2
'''
'''
 all_predictors : list (<class 'list'>)
 [    event_type
 0            1
 1            2
 2            2
 3            2
 4            2
 ...
'''
  • For grouped data like this, first append each group's DataFrame to a list, then concatenate the DataFrames in the list into one complete DataFrame.
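The group-then-concatenate pattern can be sketched on a toy event log (the session ids and events below are made up purely for illustration):

```python
import pandas as pd

# Toy event log with two sessions (made-up data, just to show the pattern).
events = pd.DataFrame({
    "session_id": ["a", "a", "b", "b", "b"],
    "event_type": ["started-mission", "run-code",
                   "started-screen", "run-code", "next-screen"],
})
codes = {"started-mission": 1, "started-screen": 2,
         "run-code": 3, "next-screen": 4}

frames = []
for name, group in events.groupby("session_id"):
    # One small DataFrame per session, holding the numeric event codes.
    frames.append(pd.DataFrame({"event_type": group["event_type"].map(codes)}))

# Concatenate the per-session frames back into one DataFrame along the rows.
combined = pd.concat(frames, axis=0)
print(list(combined["event_type"]))  # [1, 3, 2, 3, 4]
```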

Adding More Columns

  • Next, add more columns to the predictors, for example session_time: the number of seconds between the current event and the session's first event, computed from differences in created_at:
groups = event_frame.groupby("session_id")

all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at")

    # Convert the created_at column to a datetime type, so we can do math with it.
    group['created_at'] = group['created_at'].astype('datetime64[ns]')
    predictor_frame = pd.DataFrame({
                "event_type": group["event_type"].apply(lambda x: event_codes[x]),
                # Find the total seconds between the current event and the first event in the session.
                "session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds())
            })

    all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[    event_type  session_time
 0            1         0.000
 1            2         0.013
 2            2        18.551
 3            2        20.722
 4            2        33.783
 5            3       190.410
 6            4       200.283
 7            2       201.776
 8            3       364.327
'''
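The same seconds-since-session-start computation can also be done without .apply, by subtracting the first timestamp from the whole column at once (the timestamps below are invented for illustration):

```python
import pandas as pd

# Made-up timestamps for one session, already sorted ascending.
created_at = pd.to_datetime(pd.Series([
    "2015-06-09 09:00:00",
    "2015-06-09 09:00:18",
    "2015-06-09 09:03:10",
]))

# Subtracting a Timestamp from a datetime Series yields a timedelta Series,
# whose .dt.total_seconds() gives seconds since the first event.
session_time = (created_at - created_at.iloc[0]).dt.total_seconds()
print(list(session_time))  # [0.0, 18.0, 190.0]
```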

Number Of Previous Events Column

  • session_events: the number of events that precede the current event within its session:
groups = event_frame.groupby("session_id")

all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at")
    group['created_at'] = group['created_at'].astype('datetime64[ns]')

    predictor_frame = pd.DataFrame({
                "event_type": group["event_type"].apply(lambda x: event_codes[x]),
                "session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
                # Generate a sequence the same length as the number of rows in the group.
                # It will start at 0, and end at the length of the group.
                # Because of the fact that the group is sorted in ascending order,
                # this is also a counter of number of previous events.
                "session_events": range(group.shape[0])
            })

    all_predictors.append(predictor_frame)
'''
...
 34           4              34       969.435
 35           2              35       969.899
 36           1              36       986.034
 37           2              37       986.281,
     event_type  session_events  session_time
 38           3               0         0.000
 39           3               1        14.140
 40           3               2        25.562
 41           4               3        30.110
 ...
'''
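As an aside, the same 0-based per-session counter can be produced directly with groupby().cumcount(), assuming the frame is already sorted by created_at within each session:

```python
import pandas as pd

# Minimal stand-in frame: two events in session "a", three in session "b".
events = pd.DataFrame({"session_id": ["a", "a", "b", "b", "b"]})

# cumcount numbers the rows of each group 0, 1, 2, ... in order, which is
# exactly "how many events came before this one in the session".
session_events = events.groupby("session_id").cumcount()
print(list(session_events))  # [0, 1, 0, 1, 2]
```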

Number Of Events On Current Screen

  • Count how many times the current event's screen has already been seen within the session; the sequence column stores the screen of the current event.
  • Walk through the events while tracking the current screen: the first event on a screen gets a count of 0, the second gets 1 (the user is still on the same screen), and so on. Whenever the next event's screen differs from the previous one, the counter restarts at 0.
groups = event_frame.groupby("session_id")

all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at")
    group['created_at'] = group['created_at'].astype('datetime64[ns]')

    # Compute how many events occurred on each screen.
    screen_events = []
    counter = 0
    prev_sequence = None
    for sequence in group["sequence"]:
        if sequence == prev_sequence:
            counter += 1
        else:
            counter = 0
        prev_sequence = sequence
        screen_events.append(counter) 
    predictor_frame = pd.DataFrame({
                "event_type": group["event_type"].apply(lambda x: event_codes[x]),
                "session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
                "session_events": range(group.shape[0]),
                "screen_events": screen_events
            })

    all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[    event_type  screen_events  session_events  session_time
 0            1              0               0         0.000
 1            2              0               1         0.013
 2            2              0               2        18.551
 3            2              0               3        20.722
 4            2              0               4        33.783
 5            3              1               5       190.410
 6            4              2               6       200.283
 7            2              0               7       201.776
 ...
'''
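The explicit counter loop above can also be vectorized: label each run of consecutive identical sequence values with its own id, then count within runs. This is an alternative sketch, not what the original code does:

```python
import pandas as pd

# Screens for one session; the counter must reset whenever the screen changes.
seq = pd.Series([10, 10, 10, 20, 20, 10])

# A new run starts whenever the value differs from the previous row, so the
# cumulative sum of change-points labels each run of identical screens.
run_id = (seq != seq.shift()).cumsum()

# Counting within each run reproduces the loop's reset-on-change counter.
screen_events = seq.groupby(run_id).cumcount()
print(list(screen_events))  # [0, 1, 2, 0, 1, 0]
```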

More Predictors

There are more predictors that could be added, for example:

  • how many screens the current mission has
  • how many times the code on the current screen has been run
  • how many times show answer and show hint have been used
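As one example, the "times the code on the current screen has been run" predictor could be sketched like this (the column names follow the frames above; the data is invented):

```python
import pandas as pd

# Invented single-session slice; "sequence" identifies the screen.
group = pd.DataFrame({
    "sequence":   [1, 1, 1, 2, 2],
    "event_type": ["started-screen", "run-code", "run-code",
                   "started-screen", "run-code"],
})

# Count of run-code events up to and including the current row, per screen:
# flag run-code rows, then take a cumulative sum within each screen.
is_run = (group["event_type"] == "run-code").astype(int)
code_runs = is_run.groupby(group["sequence"]).cumsum()
print(list(code_runs))  # [0, 1, 2, 0, 1]
```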

Creating Our Target Variable

  • For each session, set target to True for the last five events and False for all the others.
groups = event_frame.groupby("session_id")

all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at")
    group['created_at'] = group['created_at'].astype('datetime64[ns]')

    # Compute how many events occurred on each screen.
    screen_events = []
    counter = 0
    prev_sequence = None
    for sequence in group["sequence"]:
        if sequence == prev_sequence:
            counter += 1
        else:
            counter = 0
        prev_sequence = sequence
        screen_events.append(counter)

    predictor_frame = pd.DataFrame({
                "event_type": group["event_type"].apply(lambda x: event_codes[x]),
                "session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
                "session_events": range(group.shape[0]),
                "screen_events": screen_events,
                # We generate our target by seeing if a row is in the last 5 rows in the group.
                # The group index is a series that has a unique value for each row in the group.
                # We're seeing if the index is in the last 5 indices.
                "target": group.index.isin(group.tail(5).index)
            })

    all_predictors.append(predictor_frame)
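The tail(5) / isin trick can be checked on a toy group:

```python
import pandas as pd

# Toy 8-row session: only the last 5 rows should be marked as the target.
group = pd.DataFrame({"event_type": range(8)})

# tail(5) keeps the last five rows; isin tests each index label for membership.
target = group.index.isin(group.tail(5).index)
print(list(target))  # [False, False, False, True, True, True, True, True]
```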

Train And Test Split

  • We now have one DataFrame per session, stored in a list. First split the list into a training portion and a test portion, then concatenate the DataFrames within each portion. Because the DataFrames have different lengths, the final event-level split is not exactly 70/30, but that doesn't matter much:
import math

# Decide where to split the data -- we want the first 70% in the training set.
train_thresh = math.floor(len(all_predictors) * .7)

# Split the list of dataframes
train = all_predictors[:train_thresh]
test = all_predictors[train_thresh:]

# Concatenate all the items in the split lists.
# Concatenate them along the row axis.
# This results in one big dataframe.
train = pd.concat(train, axis=0)
test = pd.concat(test, axis=0)

# Around 6000 rows in the training data.
print(train.shape[0])

# Around 1000 rows in the test data.
print(test.shape[0])
'''
6196
1251
'''

Training The Algorithm

from sklearn.linear_model import LogisticRegression

predictors = ['event_type', 'screen_events', 'session_events', 'session_time']
clf = LogisticRegression()
clf.fit(train[predictors], train["target"])

# Make predictions of the probability that the row is a 0 or a 1.
predictions = clf.predict_proba(test[predictors])

Measuring Error

  • Compute the AUC:
from sklearn.metrics import roc_auc_score

# This is the score of our classifier.  We want to compare our target against 
# the probability that the row is a 1 (the second column of the predictions).
print(roc_auc_score(test["target"], predictions[:,1]))
'''
0.61255321887
'''
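For intuition, AUC is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one: 0.5 is chance, 1.0 is perfect. A tiny worked example with made-up scores:

```python
from sklearn.metrics import roc_auc_score

# Four rows: two negatives, two positives, with invented predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the 4 positive/negative pairs, 3 are ranked correctly: 3/4 = 0.75.
print(roc_auc_score(y_true, y_score))  # 0.75
```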

Conclusions

  • The classifier is not very accurate yet, but accuracy can likely be improved by adding extra information, such as the predictors suggested above. It is also worth trying other algorithms; random forests, for instance, often produce good results.
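Swapping in a random forest is essentially a one-line change, since it exposes the same fit / predict_proba interface. A sketch on synthetic data (the real predictor frame would replace make_classification):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four-column predictor frame.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same fit / predict_proba interface as LogisticRegression.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))
```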