- In the previous post we did some simple data analysis; now let's build a prediction task: predicting which users are likely to leave the Dataquest learning platform, using logistic regression. We don't want to model what someone does at the exact moment they leave; we care about what they do on the screens leading up to it, so that we can step in and help before they go. So we take the last 5 events of each session and treat them as events at risk of the user leaving, while all other events are treated as not at risk.
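- The labeling rule itself is a one-liner. Here is a toy illustration (toy_session is made up just for this sketch); the real target column is built the same way later, inside the per-session loop:
import pandas as pd
# A toy session with 8 events: the last 5 are labeled as at risk, the rest are not.
toy_session = pd.DataFrame({"event_type": ["started-screen"] * 8})
toy_session["target"] = toy_session.index.isin(toy_session.tail(5).index)
print(toy_session["target"].tolist())
'''
[False, False, False, True, True, True, True, True]
'''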
Remove Columns
- For prediction purposes the id column of an event carries no information, so we drop it:
'''
columns: ['created_at', 'event_type', 'id', 'mission', 'sequence', 'session_id', 'type']
'''
# drop a row (0), or column (1)
event_frame = event_frame.drop("id", axis=1)
Convert Text Fields
- Categorical variables such as event_type are easiest to handle by converting them to numeric codes, so they can be fed to a machine learning algorithm.
- Split the data into individual sessions by session_id, then sort each session by created_at in ascending order.
- predictor_frame will hold all of the predictor columns; we start by adding an event_type column:
import pandas as pd

# Split the data into groups, one per session.
groups = event_frame.groupby("session_id")
# Make a dictionary that maps events to codes.
event_codes = {
'started-mission': 1,
'started-screen': 2,
'run-code': 3,
'next-screen': 4,
'get-answer': 5,
'reset-code': 6,
'interactive-mode-start': 7,
'show-hint': 8,
'open-forum': 9
}
all_predictors = []
for name, group in groups:
# Sort the group
    group = group.sort_values("created_at", ascending=True)
# Replace the values in the event_type column with their corresponding codes.
# The .apply method applies a function to each item in a series or dataframe in turn.
# The lambda function will return the result of looking up the event_type
# value in the event_codes dictionary.
event_type = group["event_type"].apply(lambda x: event_codes[x])
# Make a dataframe
predictor_frame = pd.DataFrame({
"event_type": event_type
})
# Add the predictor frame to the list of predictor frames
# We'll concatenate everything in this list later to make our predictor frame.
all_predictors.append(predictor_frame)
'''
predictor_frame : DataFrame (<class 'pandas.core.frame.DataFrame'>)
event_type
7441 3
7442 3
7443 8
7444 3
7445 4
7446 2
'''
'''
all_predictors : list (<class 'list'>)
[ event_type
0 1
1 2
2 2
3 2
4 2
...
'''
- With grouped data like this, we first append each group's DataFrame to a list, and later concatenate the list into one complete DataFrame (see the one-line sketch below).
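- For reference, the concatenation itself is a single pd.concat call (it appears again later when we build the train and test sets); the all_events name below is just for this sketch:
# Stack the per-session frames into one frame along the row axis.
all_events = pd.concat(all_predictors, axis=0)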
Adding More Columns
- We keep adding columns to the predictors, for example session_time: the number of seconds between the current event and the first event of the session, computed from the difference in created_at values:
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
# Convert the created_at column to a datetime type, so we can do math with it.
group['created_at'] = group['created_at'].astype('datetime64[ns]')
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
# Find the total seconds between the current event and the first event in the session.
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds())
})
all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[ event_type session_time
0 1 0.000
1 2 0.013
2 2 18.551
3 2 20.722
4 2 33.783
5 3 190.410
6 4 200.283
7 2 201.776
8 3 364.327
'''
Number Of Previous Events Column
- session_events: the number of events that come before the current event within its session:
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
group['created_at'] = group['created_at'].astype('datetime64[ns]')
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
# Generate a sequence the same length as the number of rows in the group.
        # It starts at 0 and ends at the length of the group minus one.
        # Because the group is sorted by created_at in ascending order,
        # this is also a count of the number of previous events.
"session_events": range(group.shape[0])
})
all_predictors.append(predictor_frame)
'''
...
34 4 34 969.435
35 2 35 969.899
36 1 36 986.034
37 2 37 986.281,
event_type session_events session_time
38 3 0 0.000
39 3 1 14.140
40 3 2 25.562
41 4 3 30.110
...
'''
Number Of Events On Current Screen
- For each event in a session, count how many events have already occurred on the same screen; the sequence column identifies the screen an event belongs to.
- Walk through the events and keep track of the current screen: the first event on a screen gets a count of 0, the second gets 1 (the user is still on the same screen), and so on. As soon as an event's screen differs from the previous event's screen, the counter resets to 0.
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
    group['created_at'] = group['created_at'].astype('datetime64[ns]')
    # Compute how many events occurred on each screen.
screen_events = []
counter = 0
prev_sequence = None
for sequence in group["sequence"]:
if sequence == prev_sequence:
counter += 1
else:
counter = 0
prev_sequence = sequence
screen_events.append(counter)
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
"session_events": range(group.shape[0]),
"screen_events": screen_events
})
all_predictors.append(predictor_frame)
'''
all_predictors
list (<class 'list'>)
[ event_type screen_events session_events session_time
0 1 0 0 0.000
1 2 0 1 0.013
2 2 0 2 18.551
3 2 0 3 20.722
4 2 0 4 33.783
5 3 1 5 190.410
6 4 2 6 200.283
7 2 0 7 201.776
...
'''
More Predictors
There are more predictors that could be added, for example (a rough sketch follows this list):
- how many screens the current mission has
- how many times the code on the current screen has been run
- how many times show answer and show hint have been used
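- As a rough sketch of how two of these could be computed (the column names code_runs_on_screen and hints_so_far are made up here, and sequence alone is used to identify the screen, mirroring the screen_events counter above):
for name, group in event_frame.groupby("session_id"):
    group = group.sort_values("created_at", ascending=True)
    extra = pd.DataFrame({
        # Running count of run-code events on the current screen so far.
        "code_runs_on_screen": (group["event_type"] == "run-code")
            .astype(int).groupby(group["sequence"]).cumsum(),
        # Running count of get-answer and show-hint events in the session so far.
        "hints_so_far": group["event_type"].isin(["get-answer", "show-hint"]).cumsum(),
    })
    # These columns could then be added to predictor_frame in the main loop above.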
Creating Our Target Variable
- For each session, set target to True for the last five events and False for all the others.
groups = event_frame.groupby("session_id")
all_predictors = []
for name, group in groups:
    group = group.sort_values("created_at", ascending=True)
    group['created_at'] = group['created_at'].astype('datetime64[ns]')
    # Compute how many events occurred on each screen.
screen_events = []
counter = 0
prev_sequence = None
for sequence in group["sequence"]:
if sequence == prev_sequence:
counter += 1
else:
counter = 0
prev_sequence = sequence
screen_events.append(counter)
predictor_frame = pd.DataFrame({
"event_type": group["event_type"].apply(lambda x: event_codes[x]),
"session_time": group["created_at"].apply(lambda x: (x - group["created_at"].iloc[0]).total_seconds()),
"session_events": range(group.shape[0]),
"screen_events": screen_events,
# We generate our target by seeing if a row is in the last 5 rows in the group.
# The group index is a series that has a unique value for each row in the group.
# We're seeing if the index is in the last 5 indices.
"target": group.index.isin(group.tail(5).index)
})
all_predictors.append(predictor_frame)
Train And Test Split
- We now have one DataFrame per session, stored in a list. First we split the list into a training part and a test part, then concatenate the DataFrames inside each part. Because the per-session DataFrames have different lengths, the final event-level split is not exactly 70/30, but that does not matter much:
import math
# Decide where to split the data -- we want the first 70% in the training set.
train_thresh = math.floor(len(all_predictors) * .7)
# Split the list of dataframes
train = all_predictors[:train_thresh]
test = all_predictors[train_thresh:]
# Concatenate all the items in the split lists.
# Concatenate them along the row axis.
# This results in one big dataframe.
train = pd.concat(train, axis=0)
test = pd.concat(test, axis=0)
# Around 6000 rows in the training data.
print(train.shape[0])
# Around 1000 rows in the test data.
print(test.shape[0])
'''
6196
1251
'''
Training The Algorithm
from sklearn.linear_model import LogisticRegression
predictors = ['event_type', 'screen_events', 'session_events', 'session_time']
clf = LogisticRegression()
clf.fit(train[predictors], train["target"])
# Make predictions of the probability that the row is a 0 or a 1.
predictions = clf.predict_proba(test[predictors])
Measuring Error
- Compute the AUC score:
from sklearn.metrics import roc_auc_score
# This is the score of our classifier. We want to compare our target against
# the probability that the row is a 1 (the second column of the predictions).
print(roc_auc_score(test["target"], predictions[:,1]))
'''
0.61255321887
'''
Conclusions
- The classifier is not very accurate yet, but we can improve it by adding extra information, such as the predictors mentioned earlier, and by trying other algorithms; random forests, for example, usually perform well. A quick sketch of that swap is shown below.
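- A minimal sketch of swapping in a random forest on the same features (the hyperparameters below are chosen only for illustration):
from sklearn.ensemble import RandomForestClassifier
# Reuse the same predictors, train/test split, and AUC metric as above.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(train[predictors], train["target"])
forest_predictions = forest.predict_proba(test[predictors])
print(roc_auc_score(test["target"], forest_predictions[:, 1]))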