Logistic Regression Classification with sklearn

Data cleaning and feature selection with Pandas + Logistic Regression classification with sklearn

(Notes from a Data Mining assignment)
For the basics of LR, see here

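Data Description and Analysis

We have a dataset that records the video-stream events of students watching a course's videos on the academic affairs website, and the task is to predict from it whether a student fails the course. (Is there really a relationship between the two..)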
DataSet

user_id: Identifies the individual who is performing the action.

session: This 32-character value is a key that identifies the user’s session. All browser events include a value for the session. Other mobile events do not include a session value.

load_video: This tag appears when the video is rendered and ready to play.

play_video: This tag appears when a user selects the video player’s play control.

pause_video: This tag appears when a user selects the video player’s pause control.

seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.

stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.

speed_change_video: This tag appears when a user selects a different playing speed for the video.

event_time: The time that this event occurs. Gives the UTC time at which the event was emitted in ‘YYYY-MM-DDThh:mm:ss.xxxxxx’ format.

new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.

old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.

old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.

new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.

grade: Final performance status, 0 for not pass and 1 for pass
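To make the field descriptions a bit more concrete, here is a minimal sketch of reading a single seek_video event. The event values are made up, and the `event_type` key is taken from the column name used in the reference code further down; the real files may be laid out differently.

```python
from datetime import datetime

# Hypothetical seek_video event using the field names described above.
event = {
    "event_type": "seek_video",
    "event_time": "2017-10-01T08:30:15.123456",
    "old_time": 42.0,
    "new_time": 120.5,
}

# Parse the 'YYYY-MM-DDThh:mm:ss.xxxxxx' UTC timestamp.
timestamp = datetime.strptime(event["event_time"], "%Y-%m-%dT%H:%M:%S.%f")
# Positive values mean the user jumped forward, negative values backward.
seek_offset = event["new_time"] - event["old_time"]

print(timestamp.hour, seek_offset)  # 8 78.5
```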

Training Environment

OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3
A typical example is run as:

python lr.py

Feature Selection

  1. The number of videos the student has watched.
  2. The number of times the student played videos.
  3. The number of times the student paused videos while watching.
  4. The number of times a video stopped (reached its end) while the student was watching.
  5. The number of times the student changed the video speed while watching.
  6. The number of sessions per student (how many times the student opened the browser to watch videos).

PS: These are of course very simple features; the time fields and other columns in the dataset are not used at all.
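The counting features above can be sketched in a few lines of pandas. This is only an illustration on a made-up event log: the `video_id` column name, the sample rows and the helper `count_features` are assumptions, not part of the assignment data.

```python
import pandas as pd

# Hypothetical event log; the real column layout may differ.
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2, 3],
    "video_id":   ["a", "a", "b", "b", "a", "a", "c"],
    "session":    ["s1", "s1", "s2", "s2", "s3", "s3", "s4"],
    "event_type": ["play_video", "pause_video", "play_video",
                   "stop_video", "play_video", "seek_video", "load_video"],
})


def count_features(df):
    """Build one row per user with simple count features."""
    # Distinct videos and distinct sessions per user.
    video_count = df.groupby("user_id")["video_id"].nunique().rename("video_count")
    session_count = df.groupby("user_id")["session"].nunique().rename("session_count")
    # One column per event type; users without that event type get 0.
    event_counts = pd.crosstab(df["user_id"], df["event_type"])
    return video_count.to_frame().join(session_count).join(event_counts).reset_index()


features = count_features(events)
print(features)
```

Compared with filtering and merging one event type at a time, the crosstab yields all event-type columns in one step and fills missing combinations with 0.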

Model Selection (LR, of course)

Use the logistic regression model.

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes).
In logistic regression, the dependent variable is binary, i.e. it only contains data coded as 1 (TRUE, success, pass, etc.) or 0 (FALSE, failure, not pass, etc.).
The goal of logistic regression is to find the best-fitting (yet reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability that the characteristic of interest is present.
Binary class L2 penalized logistic regression minimizes the following cost function:
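$$\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)$$
Here the labels are coded as $y_i \in \{-1, +1\}$, $w$ is the weight vector, $c$ the intercept, and $C$ the inverse regularization strength (form as given in the scikit-learn documentation).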
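As a rough numeric illustration, the sketch below evaluates this L2-penalized objective the way the scikit-learn documentation states it: half the squared weight norm plus C times the summed log-loss, with labels coded as -1/+1. The function name and the toy data are made up for the example.

```python
import numpy as np


def l2_logistic_cost(w, c, X, y, C=1.0):
    """0.5 * w.w + C * sum(log(1 + exp(-y * (X @ w + c)))), y in {-1, +1}."""
    margins = y * (X @ w + c)
    # logaddexp(0, -m) computes log(1 + exp(-m)) without overflow.
    return 0.5 * np.dot(w, w) + C * np.sum(np.logaddexp(0.0, -margins))


# Toy data (hypothetical): two features, four samples.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# With w = 0 and c = 0 every sample contributes log(2).
zero_cost = l2_logistic_cost(np.zeros(2), 0.0, X, y)
# A weight vector that separates the two classes should cost less.
better_cost = l2_logistic_cost(np.array([1.0, 0.0]), -1.5, X, y)

print(zero_cost, better_cost)
```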

Default parameters of LogisticRegression in sklearn

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

For training we can simply use the default parameters; they can of course also be tuned to suit the dataset.
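One possible way to do such tuning is a small grid search over C. The sketch below is not what the assignment script does; it runs on synthetic data from make_classification, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six count features (not the real dataset).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)

# Candidate values for the inverse regularization strength C (arbitrary grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Smaller C means stronger regularization, so the grid spans both directions around the default of 1.0.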

Output

0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871

             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858
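To put these numbers in context, a short calculation using only the values printed above: the mean of the ten accuracies, and the accuracy a trivial predictor would reach on the last split by always answering with the majority class. The baseline comparison is an addition here, not something the original script prints.

```python
# Accuracies printed by the ten random train/test splits above.
accuracies = [
    0.860396039604, 0.866336633663, 0.890099009901, 0.869306930693,
    0.869306930693, 0.880198019802, 0.862376237624, 0.870297029703,
    0.892079207921, 0.887128712871,
]
mean_accuracy = sum(accuracies) / len(accuracies)

# Class supports from the classification report of the last split.
neg_support, pos_support = 827, 183
# Accuracy of a trivial model that always predicts the majority class (neg).
majority_baseline = neg_support / (neg_support + pos_support)

print(round(mean_accuracy, 4))      # 0.8748
print(round(majority_baseline, 4))  # 0.8188
```

With roughly 82% of the test samples in one class, accuracy alone flatters the model; the per-class precision and recall in the report are the more informative figures.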

The resulting P/R plot (the AUC = 0.5 in the title is the hard-coded value passed to plot_pr, not a computed area):
P/R curve
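For reference, the area under a P/R curve can be computed rather than typed in. A minimal sketch with made-up labels and scores, using two summaries that scikit-learn provides:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                      # trapezoidal area
avg_prec = average_precision_score(y_true, y_score)  # step-wise summary

print(pr_auc, avg_prec)
```

The two numbers usually differ slightly, because the trapezoidal rule interpolates linearly between points while average precision does not.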

Reference Code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time

start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')


# Draw the precision/recall (P/R) curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()


# Clean the raw event log and build one feature row per user
def data_cleaning(df):
    # Feature: number of distinct (user_id, second column) pairs per student, i.e. videos per student
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')
    # Feature for session
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')
    # Feature for video event type
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')
    # select event_type == play_video
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)
    # rename columns
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)
    # merge the columns by key = user_id
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')
    # replace NaN with 0 (users who never triggered a given event type)
    feature_df = feature_df.fillna(0)
    return feature_df

trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')

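# training model: repeat a random 80/20 split testNum times
average = 0
testNum = 10
for i in range(0, testNum):
    # Columns 1..6 are the features (watchVideoTimes, sessionCount and the play/pause/seek/stop
    # counts; speed_change_times at index 7 is not selected); column 8 is the label merged in
    # from TrainLabel.csv.
    X_train, X_test, y_train, y_test = train_test_split(trainingFeature.iloc[:, 1:7], trainingFeature.iloc[:, 8],
                                                        test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    # fraction of correct predictions, i.e. the accuracy on this split
    p = np.mean(y_pred == y_test)
    print(p)
    average += p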

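# precision and recall (computed on the last split only)
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = (answer > 0.5).astype(int)
print(classification_report(y_test, report, target_names=['neg', 'pos']))
# 'average' accumulates np.mean(y_pred == y_test), i.e. accuracy rather than precision
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# NOTE: 0.5 is a hard-coded value for the plot title, not a computed AUC;
# auc(recall, precision) would give the actual area under the P/R curve.
plot_pr(0.5, precision, recall, "pos")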

# predict testing data
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
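The repeated random splits in the script could also be expressed with scikit-learn's cross-validation helpers. The sketch below is an alternative under stated assumptions, not the method used above: it runs on synthetic, imbalanced data, and the fold count of 10 simply mirrors testNum.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the real feature table (assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y,
                         cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Unlike independent random splits, each sample lands in a test fold exactly once, so the mean is less sensitive to a lucky or unlucky draw.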

References

  1. http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression Sklearn documentation
  2. Li Hang, Statistical Learning Methods (统计学习方法)
  3. https://czep.net/stat/mlelr.pdf Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation