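Case Background
The "online_shoppers_intention" dataset records whether online shoppers' browsing sessions converted into purchases. It contains 10 numerical attributes and 8 categorical attributes, of which "Revenue" serves as the class label. Randomly split the dataset into a training set (80%) and a test set (20%), then classify it.
Data Preprocessing
- Import libraries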
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score
import time
- Read the data
os.chdir('E:\\input')
df=pd.read_csv('online_shoppers_intention.csv')
- Explore the data
df.columns
['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue']
df.dtypes
- Handle missing values
- Check for missing values
df.isna().sum()
Every column is complete; there are no missing values.
- Split into training and test sets
x=df.drop('Revenue',axis=1)
y=df['Revenue']
x=pd.get_dummies(x)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
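One point worth checking in the split above: by default, pd.get_dummies only expands object and category columns, so integer-coded columns such as OperatingSystems, Browser, Region and TrafficType would stay numeric. If they are meant to be treated as categories (an assumption here, based on the dataset description of 8 categorical attributes), they can be listed explicitly. The sketch below uses a small synthetic frame with hypothetical values in place of the real CSV, and also passes stratify so both splits keep a similar class ratio, which the original code does not do.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the real data (hypothetical values)
demo = pd.DataFrame({
    'PageValues': [0.0, 12.5, 0.0, 3.2, 0.0, 8.1, 0.0, 0.0, 5.5, 0.0],
    'Month': ['Feb', 'Mar', 'May', 'Nov', 'Nov', 'May', 'Mar', 'Feb', 'Nov', 'May'],
    'Browser': [1, 2, 2, 1, 3, 1, 2, 3, 1, 2],
    'Weekend': [False, True, False, False, True, False, False, True, False, False],
    'Revenue': [False, True, False, False, False, True, False, False, True, False],
})

x_demo = demo.drop('Revenue', axis=1)
y_demo = demo['Revenue']

# Name the categorical columns explicitly so the integer-coded one
# (Browser) is one-hot encoded as well as the string column (Month)
x_demo = pd.get_dummies(x_demo, columns=['Month', 'Browser'])

# stratify keeps the proportion of Revenue classes similar in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(list(x_demo.columns))
```

On the real data, the columns list would name Month, VisitorType and the four integer-coded columns.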
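Logistic Regression
Requirements
Build a logistic regression model with an L1 penalty to classify the dataset, and use grid search with 5-fold cross-validation to determine the optimal penalty factor. With the optimal penalty factor, evaluate the model's predictive performance on the training set and on the test set separately (confusion matrix, accuracy, F1-score, AUC, etc.).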
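For reference, the L1-penalized objective for the binary case, as scikit-learn's documentation has formulated it, looks roughly like the following. The notation is an assumption introduced here: w is the coefficient vector, c the intercept, x_i a feature row, and the labels are recoded to -1 and +1.

```latex
\min_{w,\,c}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right), \qquad y_i \in \{-1, +1\}
```

Because C multiplies the loss term rather than the penalty, it acts as the inverse of the penalty strength: a smaller C means stronger regularization and more coefficients driven to exactly zero.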
Model parameters
LogisticRegression().get_params()
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'warn',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'warn',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
Hyperparameter tuning
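# liblinear supports the L1 penalty; the lbfgs default in newer scikit-learn does not
logit=LogisticRegression(penalty='l1',solver='liblinear')
parameters={'C':np.arange(0.1,30,1)}
logit_cv=GridSearchCV(logit,param_grid=parameters,cv=5)
# fit on the training split only so the test set stays unseen (the original run used x, y)
logit_cv.fit(x_train,y_train)
print(logit_cv.best_params_) # original run: C = 22.1
print(logit_cv.best_score_)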
best_params_: {'C': 22.1}
best_score_: 0.884
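The evaluation step named in the requirements (confusion matrix, accuracy, F1-score and AUC on both splits) could be sketched as below. This is a minimal sketch rather than the author's code: it runs on synthetic data from make_classification, uses a short hypothetical grid of C values to stay fast, and the evaluate helper is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (hypothetical); with the real data,
# use x_train, x_test, y_train, y_test from the split above instead
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# 5-fold grid search over C, fit on the training split only
search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid={'C': [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_tr, y_tr)
best_model = search.best_estimator_  # refit on all of X_tr with the best C


def evaluate(model, X, y):
    """Return the metrics listed in the requirements for one data split."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    return {
        'confusion_matrix': confusion_matrix(y, pred),
        'accuracy': accuracy_score(y, pred),
        'f1': f1_score(y, pred),
        'auc': roc_auc_score(y, prob),
    }


train_metrics = evaluate(best_model, X_tr, y_tr)
test_metrics = evaluate(best_model, X_te, y_te)
for name, metrics in [('train', train_metrics), ('test', test_metrics)]:
    print(name, metrics)
```

AUC is computed from predicted probabilities rather than hard labels, since it measures how well the scores rank positive sessions above negative ones. On an imbalanced label such as Revenue, F1 and AUC tend to be more informative than accuracy alone.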