Logistic Regression Classification with sklearn

小鹅鹅


Article link: https://blog.csdn.net/asd136912/article/details/78636327


Data Cleaning and Feature Selection with Pandas + Logistic Regression Classification with sklearn

(Notes from a Data Mining assignment.)
For the basics of LR, see here.

Data description and analysis

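We have a dataset that records the video-stream events of students watching the videos of one course on the academic-affairs website, and we use it to predict whether each student fails the course. (Is there really a connection between the two?)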
DataSet

user_id: Identifies the individual who is performing the action.

session: This 32-character value is a key that identifies the user’s session. All browser events include a value for the session. Other mobile events do not include a session value.

load_video: This tag appears when the video is rendered and ready to play.

play_video: This tag appears when a user selects the video player’s play control.

pause_video: This tag appears when a user selects the video player’s pause control.

seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.

stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.

speed_change_video: This tag appears when a user selects a different playing speed for the video.

event_time: The time that this event occurs. Gives the UTC time at which the event was emitted in ‘YYYY-MM-DDThh:mm:ss.xxxxxx’ format.

new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.

old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.

old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.

new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.

grade: Final performance status, 0 for not pass and 1 for pass
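
As a rough illustration of the field formats above, the sketch below builds a few made-up rows and parses them with pandas. The rows are invented for the example, and `seek_distance` is a hypothetical helper column that is not part of the dataset; the `event_type` column name is the one used by the reference code later in this post.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; they do not come from the real dataset.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_type": ["play_video", "seek_video", "pause_video"],
    "event_time": [
        "2017-01-01T08:00:00.000000",
        "2017-01-01T08:01:30.500000",
        "2017-01-02T09:15:00.250000",
    ],
    # old_time / new_time are only filled in for seek_video events
    "old_time": [np.nan, 12.0, np.nan],
    "new_time": [np.nan, 95.0, np.nan],
})

# Parse the 'YYYY-MM-DDThh:mm:ss.xxxxxx' UTC timestamps into datetime values
events["event_time"] = pd.to_datetime(events["event_time"], format="%Y-%m-%dT%H:%M:%S.%f")

# Hypothetical derived column: how far (in seconds) each seek jumped
events["seek_distance"] = events["new_time"] - events["old_time"]
print(events)
```

Rows that are not seek_video events simply carry NaN in `seek_distance`.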

Training environment

OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3
A typical example is run as:

python lr.py

Feature selection

The number of videos that the student has watched.
The number of times the student plays the videos.
The number of times the student pauses the videos while watching.
The number of times the student stops the videos while watching.
The number of times the student changes the video speed while watching.
The number of sessions of the student (the number of times the student opens the browser to watch the videos).

PS: These are of course very simple features; the time fields in the dataset are not used at all.
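
Per-user counts of this kind can be sketched compactly with `pd.crosstab` and `groupby(...).nunique()`. This is not the code used in this post (that is in the reference code section), only a minimal illustration on made-up rows; the `video_id` column name and the output names `video_count` and `session_count` are assumptions for the example.

```python
import pandas as pd

# Made-up event log; 'video_id' is a hypothetical column name.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "video_id": ["v1", "v1", "v2", "v1", "v1"],
    "session": ["s1", "s1", "s2", "s3", "s3"],
    "event_type": ["play_video", "pause_video", "play_video",
                   "play_video", "stop_video"],
})

# One column per event type, one row per user, cells hold event counts
event_counts = pd.crosstab(events["user_id"], events["event_type"])

# Distinct videos and distinct sessions per user
videos = events.groupby("user_id")["video_id"].nunique().rename("video_count")
sessions = events.groupby("user_id")["session"].nunique().rename("session_count")

# Align everything on user_id; users lacking an event type get 0
features = pd.concat([videos, sessions, event_counts], axis=1).fillna(0).reset_index()
print(features)
```

Note that `nunique()` ignores missing values, so the counts can differ slightly from an approach that keeps rows with a missing session.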

Model selection (LR, of course)

Use the logistic regression model.

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).
The goal of logistic regression is to find the best-fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.
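Binary class L2 penalized logistic regression minimizes the following cost function (as given in the scikit-learn documentation, with labels $y_i \in \{-1, 1\}$, weight vector $w$, intercept $c$ and inverse regularization strength $C$):
$$\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(X_i^T w + c\right)\right) + 1\right)$$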
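
To make the cost function concrete, here is a minimal NumPy sketch of the L2 penalized objective in the form the scikit-learn documentation gives it: half the squared norm of the weights plus C times the summed logistic loss, with labels coded as -1 and +1. The function name `l2_logistic_cost` and the toy data are assumptions for illustration; this is not how sklearn evaluates the objective internally.

```python
import numpy as np


def l2_logistic_cost(w, c, X, y, C=1.0):
    """L2 penalized logistic cost for labels y in {-1, +1}.

    w: weight vector, c: intercept, C: inverse regularization strength.
    """
    margins = y * (X @ w + c)
    # log(1 + exp(-m)) evaluated stably as logaddexp(0, -m)
    loss = np.sum(np.logaddexp(0.0, -margins))
    return 0.5 * (w @ w) + C * loss


# Toy data, made up for the example
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])

# With w = 0 and c = 0 every sample contributes log(2) to the loss
cost_at_zero = l2_logistic_cost(np.zeros(2), 0.0, X, y)
print(cost_at_zero)
```

A smaller C shrinks the loss term relative to the penalty, which corresponds to stronger regularization.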

Default parameter values of LogisticRegression in sklearn

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

We can train with the default parameters directly, or tune the parameters to suit the dataset.
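
One common way to tune the parameters is a grid search over the inverse regularization strength `C`. The sketch below is an assumption-laden illustration rather than part of the original assignment: the data is synthetic (`make_classification`), and the candidate values for `C` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six per-student features
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Arbitrary candidate values for the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["C"]
print("best C:", best_C, "cv accuracy:", search.best_score_)
```

`search.best_estimator_` is refit on all of `X` and can be used for prediction afterwards.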

Output

0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871

             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858
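
The support column of the report (827 neg, 183 pos) shows that the classes are imbalanced, so it helps to compare the accuracies above with what a trivial predictor that always outputs the majority class would score on this split. A small arithmetic sketch using only the two support values from the report:

```python
# Support values taken from the classification report above
neg_support, pos_support = 827, 183

total = neg_support + pos_support
# Accuracy of always predicting the majority class
baseline_accuracy = max(neg_support, pos_support) / total
print("majority-class baseline accuracy: %.4f" % baseline_accuracy)
```

The accuracies printed above (roughly 0.86 to 0.89) should be read against this baseline of about 0.82 rather than against 0.5.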

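Plotting the P/R curve (the AUC = 0.5 in the title is the value hard-coded in the plot_pr call of the reference code, not a computed area):
P/R curve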
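
The area under the P/R curve can be computed from the curve itself instead of being typed in by hand. A minimal sketch with made-up labels and scores, using `precision_recall_curve` and `auc` from `sklearn.metrics` (both are already imported in the reference code):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Toy labels and predicted probabilities, made up for the example
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Trapezoidal area under the precision/recall curve
pr_auc = auc(recall, precision)
print("P/R AUC: %.4f" % pr_auc)
```

In the reference code the same idea would amount to passing `auc(recall, precision)` to `plot_pr` in place of the constant.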

Reference code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time

start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')


# Draw R/P Curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()


# do data cleaning job
def data_cleaning(df):
    # Feature for video number for one student
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')
    # Feature for session
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')
    # Feature for video event type
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')
    # select event_type == play_video
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)
    # rename columns
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)
    # merge the columns by key = user_id
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')
    # replace NaN with 0
    feature_df = feature_df.fillna(0)
    return feature_df

trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')

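# training model: repeat a random 80/20 split testNum times and accumulate the accuracy
average = 0
testNum = 10
for i in range(0, testNum):
    # NOTE: iloc[:, 1:7] selects the six feature columns from watchVideoTimes to
    # stop_video_times; speed_change_times (column 7) is not included.
    # Column 8 is assumed to hold the grade label from TrainLabel.csv.
    X_train, X_test, y_train, y_test = train_test_split(trainingFeature.iloc[:, 1:7], trainingFeature.iloc[:, 8],
                                                        test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    p = np.mean(y_pred == y_test)  # accuracy on the held-out split
    print(p)
    average += p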

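# precision and recall, evaluated with the model and split from the last iteration
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = (answer > 0.5).astype(int)
print(classification_report(y_test, report, target_names=['neg', 'pos']))
# 'average' accumulates accuracy (fraction of correct predictions), not precision
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# NOTE: 0.5 is a hard-coded value for the plot title; auc(recall, precision) would give the actual area
plot_pr(0.5, precision, recall, "pos")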

# predict testing data
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
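
As an alternative to repeating a random 80/20 split ten times, stratified k-fold cross-validation keeps the pass/fail ratio the same in every fold and uses each sample for testing exactly once. The sketch below is not part of the original assignment; it runs on synthetic, imbalanced data from `make_classification`, and the fold count and class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the per-student features, roughly 80% / 20% classes
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)

# 10 folds, each preserving the class ratio
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y,
                         cv=cv, scoring="accuracy")

mean_accuracy = scores.mean()
print("fold accuracies:", scores)
print("mean accuracy:", mean_accuracy)
```

Passing `scoring="f1"` instead would focus the comparison on the minority class, which accuracy tends to hide on imbalanced data.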

References

Sklearn documentation, http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Li Hang, Statistical Learning Methods (统计学习方法)
Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation, https://czep.net/stat/mlelr.pdf
