[Exercises] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression (student-mat.csv)

赛博机器喵

Last modified: 2024-08-19 21:10:38

Column: Chen Qiang - Machine Learning - Python. Tags: machine learning, python, study notes

First published: 2024-08-15 20:46:09

Article link: https://blog.csdn.net/2201_76026029/article/details/141215818

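Series contents

[Study notes] Chen Qiang - Machine Learning - Python - Ch4 Linear Regression
[Study notes] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression
[Exercises] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression (SAheart.csv)
[Study notes] Chen Qiang - Machine Learning - Python - Ch6 Multinomial Logistic Regression
[Study notes and exercises] Chen Qiang - Machine Learning - Python - Ch7 Discriminant Analysis
[Study notes] Chen Qiang - Machine Learning - Python - Ch8 Naive Bayes
[Study notes] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression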

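Preface

These notes are mainly a reminder for my future self, shared so that fellow learners can consult them. If you have a different view or a suggestion, friendly discussion is welcome.

All code and data used in these notes can be downloaded from Prof. Chen Qiang's personal homepage.

Reference book: Chen Qiang. Machine Learning and Python Applications (机器学习及Python应用). Beijing: Higher Education Press, 2021.

For the mathematical background, see Prof. Chen Qiang's slides (PPT).

Also consulted:
"Machine learning: penalized regression, implemented in Python (with complete code)" by the user 带我去滑雪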

I. The dataset student-mat.csv

student-mat.csv is the Portuguese secondary-school mathematics grades dataset from the UCI Machine Learning Repository.
Variables:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

Response variables:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)

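II. Exercises

1. Load the data

(1) The fields in this csv file are separated by semicolons (";"), so load the data with pd.read_table('student-mat.csv', sep=';'), then inspect the shape of the data frame and its first 5 observations.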

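# Import modules
import numpy as np
import pandas as pd
# Load the data (adjust the path to wherever student-mat.csv is stored)
student_mat = pd.read_table(r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\student-mat.csv', sep=';')
# Basic information about the dataset
print(student_mat.shape)
student_mat.head()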

Output: (395, 33), i.e. 395 observations and 33 variables.
[Figure: first 5 rows of student_mat]

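`Note: pd.read_table()`

pd.read_table() reads a delimited text file and works much like pd.read_csv().
The difference is the default separator: pd.read_table() uses a tab (\t), while pd.read_csv() uses a comma (,).
pd.read_table() is therefore meant for tab-separated files, but setting sep lets it read files with any other delimiter.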

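# Basic syntax and parameters
import pandas as pd
# Read the file
df = pd.read_table(
    'file_path.txt',
    sep=',',  # field separator; the default is \t (tab); set it when the file uses another one, e.g. a comma (,) or a semicolon (;)
    header=None,  # the default is 'infer' (detected automatically); use None when the file has no header row
    names=['col1', 'col2', 'date_column'],  # column names; supply them here when the file has no header row
    index_col=0,  # column to use as the index; a column name or a column position
    usecols=['col1', 'col2', 'date_column'],  # columns to read; a list of column names or column positions
    dtype={'col1': float, 'col2': int},  # data type per column, e.g. int or float
    parse_dates=['date_column'],  # columns to parse as dates; a list of column names or column positions
    skiprows=2,  # skip the first two rows; an integer (skip the first N rows) or a list of row numbers to skip
    na_values=['NA', 'N/A'],  # values to treat as missing; a single value, a list or a dict
    encoding='utf-8',  # file encoding (utf-8, latin1, ascii, ...)
    comment='#')  # comment character; the rest of a line after it is ignored, so lines starting with it are skipped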
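
To make the note concrete, here is a minimal sketch on an in-memory, semicolon-separated sample. The three rows are made up for illustration and are not taken from student-mat.csv. It suggests that once sep is passed explicitly, pd.read_table() and pd.read_csv() return the same data frame.

```python
import io
import pandas as pd

# Made-up semicolon-separated sample (not the real student-mat.csv)
text = "school;sex;age\nGP;F;18\nGP;M;17\nMS;F;16\n"

# Same parser, different default separator: pass sep=';' to both
df_table = pd.read_table(io.StringIO(text), sep=';')
df_csv = pd.read_csv(io.StringIO(text), sep=';')

print(df_table.shape)           # (3, 3)
print(df_table.equals(df_csv))  # True
```

Without sep=';', both functions would read each line as a single field, because neither a tab nor a comma appears in the sample.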

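2. Define X and y

(2) Drop the variables G1 and G2 from the dataset: they are the grades of the first two periods of the same semester and are highly correlated with G3.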

# Drop the variables G1 and G2
final = student_mat.drop(columns=['G1', 'G2'])
print("Shape after dropping G1 and G2:", final.shape)
# Define X and y (G3 is the last column)
X_raw = final.iloc[:, :-1]
y = final.iloc[:, -1]
print("X_raw:", X_raw.shape)

Output: Shape after dropping G1 and G2: (395, 31)
X_raw: (395, 30)
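
Slicing by position with iloc relies on G3 being the last column. As a hedged alternative, the sketch below selects the response by name on a made-up four-row frame (the values are invented; only the column names come from the dataset) and checks that both approaches give the same X and y.

```python
import pandas as pd

# Invented toy rows; only the column names mirror student-mat.csv
toy = pd.DataFrame({
    'age': [18, 17, 15, 16],
    'sex': ['F', 'M', 'F', 'M'],
    'G1': [5, 7, 15, 6],
    'G2': [6, 8, 14, 10],
    'G3': [6, 10, 15, 10],
})

final_toy = toy.drop(columns=['G1', 'G2'])

# Selection by name
X_by_name = final_toy.drop(columns='G3')
y_by_name = final_toy['G3']

# Selection by position, as in the exercise
X_by_pos = final_toy.iloc[:, :-1]
y_by_pos = final_toy.iloc[:, -1]

print(X_by_name.equals(X_by_pos), y_by_name.equals(y_by_pos))  # True True
```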

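3. Plot a histogram of the response variable (sns.histplot())

(3) Plot a histogram of the response variable G3.

import seaborn as sns  # seaborn has not been imported so far
sns.histplot(y)

[Figure: histogram of G3]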

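4. Create dummy variables (pd.get_dummies())

(4) Use pd.get_dummies() to turn every categorical variable in the data matrix into dummy variables.

# Check the data type of each column of X_raw
X_raw.dtypes

[Figure: dtypes of the columns of X_raw]
Columns of dtype object usually hold strings or other non-numeric data; here they are the categorical variables.

# Use pd.get_dummies() to turn the categorical variables in the data matrix into dummy variables
X_dummies = pd.get_dummies(X_raw)
X_dummies.head()

[Figure: first 5 rows of X_dummies]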

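`Note: pd.get_dummies()`

pd.get_dummies() is the pandas function that converts categorical variables into dummy variables (one-hot encoding). Each category of a categorical variable becomes a new column whose entries, 0 or 1, indicate whether the observation belongs to that category.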

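# Basic syntax and parameters
import pandas as pd
data = pd.DataFrame({'sex': ['F', 'M', 'F']})  # hypothetical input, for illustration only
# One-hot encode data
dummies = pd.get_dummies(
    data,  # input data; a list, Series or DataFrame
    prefix=None,  # prefix for the new column names; default None (the original column name is used)
    prefix_sep='_',  # separator between the prefix and the category name; default '_'
    dummy_na=False,  # whether to add a column for NaN; default False; with True, NaN is treated as a category of its own
    columns=None,  # columns to encode; if not given, all string (object) and categorical columns are converted
    sparse=False,  # whether to store the dummy columns as sparse arrays, which saves memory when most entries are 0; default False
    drop_first=False,  # whether to drop the first category of each variable, which avoids perfect multicollinearity in some models; default False
    dtype=None)  # dtype of the new columns; the default is bool since pandas 2.0 (np.uint8 in earlier versions)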
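
A small sketch of what drop_first changes, on a made-up three-row frame (values invented, column names borrowed from the dataset). dtype=int is passed only so that the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Invented toy rows; only the column names mirror student-mat.csv
toy = pd.DataFrame({
    'age': [18, 17, 16],
    'sex': ['F', 'M', 'F'],
    'guardian': ['mother', 'father', 'other'],
})

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category of each variable dropped

print(list(full.columns))
# ['age', 'sex_F', 'sex_M', 'guardian_father', 'guardian_mother', 'guardian_other']
print(list(reduced.columns))
# ['age', 'sex_M', 'guardian_mother', 'guardian_other']
```

With the full encoding, the dummies of one variable always sum to 1 (sex_F + sex_M = 1 in every row), so they are perfectly collinear with an intercept. Penalized regressions can still be fit on such columns, which is why the exercise keeps all of them.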

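5. Standardize the variables (StandardScaler)

(5) Standardize all feature variables.

# Standardize all feature variables (sklearn's StandardScaler)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # initialize the StandardScaler object
X = scaler.fit_transform(X_dummies)

# Check that every column has mean 0 and standard deviation 1
print(np.mean(X, axis=0))  # axis=0 computes the mean column by column
np.std(X, axis=0)

[Figure: column means and standard deviations of X]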
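
For reference, a minimal sketch (synthetic data, not student-mat.csv) of what StandardScaler computes: each column has its mean subtracted and is divided by its population standard deviation (ddof=0, which is also the default of np.std).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(50, 4))  # synthetic features

X_scaled = StandardScaler().fit_transform(X_toy)

# Manual version: (x - column mean) / column standard deviation with ddof=0
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))           # True
print(np.allclose(X_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_scaled.std(axis=0), 1.0))    # True
```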

6. Plot the ridge regression coefficient path

(6) Using the grid np.logspace(-3, 6, 100) for the penalty parameter α, plot the coefficient path of ridge regression.

from sklearn.linear_model import Ridge
# 1. Define the grid for alpha
alphas = np.logspace(-3, 6, 100)
# 2. Loop over alpha and collect the fitted regression coefficients
coefs = []  # empty list that stores the coefficients for each alpha
for alpha in alphas:  # go through every alpha in the grid
    model = Ridge(alpha=alpha)  # one Ridge model per alpha
    model.fit(X, y)
    coefs.append(model.coef_)
# 3. Plot the coefficient path
import matplotlib.pyplot as plt
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('Coefficients')
plt.title('Ridge Coefficient Path')
plt.axhline(0, linestyle='--', linewidth=1, color='k')
plt.legend(X_dummies.columns)  # the 56 coefficients correspond to the columns of X_dummies, not X_raw

[Figure: ridge coefficient path]
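
Why the paths shrink toward zero as alpha grows can be checked numerically. With centered features, scikit-learn's Ridge (which minimizes ||y - Xb||² + alpha·||b||² and fits an intercept by centering) should coincide with the closed form b(alpha) = (XᵀX + alpha·I)⁻¹ Xᵀ(y - ȳ). The sketch below uses synthetic data; ridge_closed_form is a helper name made up for this note.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 80, 5
X_toy = rng.normal(size=(n, p))
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)  # standardized, hence centered
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: ridge solution for centered X and an unpenalized intercept."""
    y_centered = y - y.mean()
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y_centered)

max_diffs, norms = [], []
for alpha in [0.1, 10.0, 1000.0]:
    beta_sklearn = Ridge(alpha=alpha).fit(X_toy, y_toy).coef_
    beta_manual = ridge_closed_form(X_toy, y_toy, alpha)
    max_diffs.append(np.max(np.abs(beta_sklearn - beta_manual)))
    norms.append(np.linalg.norm(beta_sklearn))

print(max_diffs)  # differences at the level of rounding error
print(norms)      # the coefficient norm decreases as alpha increases
```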

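7. Coefficients of the optimal ridge regression by 10-fold cross-validation

(7) Choose the optimal penalty parameter α by 10-fold cross-validation (with random_state=1), fit the ridge regression, and present the coefficients of the optimal ridge regression as a data frame.

# 1. RidgeCV with 10-fold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import RidgeCV
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
kfold_10 = RidgeCV(alphas=np.logspace(-3, 6, 100), cv=kfold)
kfold_10.fit(X, y)
# 2. Optimal penalty parameter from 10-fold CV
kfold_10.alpha_

Output: 284.8035868435805

# 3. Show the regression coefficients of the best model
pd.DataFrame(
    kfold_10.coef_,
    index=X_dummies.columns,  # note: neither X nor X_raw!
    columns=['Coefficient'])

Output: Coefficient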
age -0.236550
Medu 0.251025
Fedu 0.087646
traveltime -0.124284
studytime 0.229581
failures -0.726786
famrel 0.114305
freetime 0.107220
goout -0.334575
Dalc -0.078008
Walc 0.010513
health -0.139548
absences 0.198280
school_GP -0.030708
school_MS 0.030708
sex_F -0.208679
sex_M 0.208679
address_R -0.093834
address_U 0.093834
famsize_GT3 -0.120611
famsize_LE3 0.120611
Pstatus_A 0.061713
Pstatus_T -0.061713
Mjob_at_home -0.064841
Mjob_health 0.192810
Mjob_other -0.132952
Mjob_services 0.174610
Mjob_teacher -0.124117
Fjob_at_home 0.010097
Fjob_health 0.061756
Fjob_other -0.084339
Fjob_services -0.049912
Fjob_teacher 0.189040
reason_course -0.132146
reason_home -0.069339
reason_other 0.128259
reason_reputation 0.130782
guardian_father 0.016487
guardian_mother -0.002703
guardian_other -0.020767
schoolsup_no 0.159480
schoolsup_yes -0.159480
famsup_no 0.144278
famsup_yes -0.144278
paid_no -0.085224
paid_yes 0.085224
activities_no 0.044861
activities_yes -0.044861
nursery_no 0.014338
nursery_yes -0.014338
higher_no -0.171359
higher_yes 0.171359
internet_no -0.082825
internet_yes 0.082825
romantic_no 0.188334
romantic_yes -0.188334
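
What the cross-validated choice of α amounts to can be sketched by hand: score every α in the grid by its mean out-of-fold R² and keep the best one. The code below does this on synthetic data next to RidgeCV. I expect the two choices to agree, but take it as an illustration of the idea rather than a description of RidgeCV's internals.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 6))  # synthetic features
y_toy = X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(size=120)

alphas = np.logspace(-3, 3, 20)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

# Mean out-of-fold R^2 for every alpha (R^2 is the default score of a regressor)
cv_scores = [
    cross_val_score(Ridge(alpha=a), X_toy, y_toy, cv=kfold).mean()
    for a in alphas
]
best_manual = alphas[int(np.argmax(cv_scores))]

# The same grid and folds handed to RidgeCV
best_ridgecv = RidgeCV(alphas=alphas, cv=kfold).fit(X_toy, y_toy).alpha_

print(best_manual, best_ridgecv)
```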

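8. Plot the lasso coefficient path

(8) With the argument eps=1e-4, use the lasso_path() function to plot the coefficient path of lasso regression.

# 1. Use lasso_path() to get the penalty values and the corresponding regression coefficients
from sklearn.linear_model import lasso_path
alphas, coefs, _ = lasso_path(X, y, eps=1e-4)
alphas.shape, coefs.shape  # worth checking

Output: ((100,), (56, 100))

# 2. Plot the lasso coefficient path
ax = plt.gca()
ax.plot(alphas, coefs.T)  # transpose: coefs is (n_features, n_alphas), while plot expects one row per alpha
ax.set_xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficient Path')
plt.axhline(0, linestyle='--', linewidth=1, color='k')
plt.legend(X_dummies.columns)  # the 56 coefficients correspond to the columns of X_dummies, not X_raw

**Output:** [Figure: lasso coefficient path]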
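
A note on where the 100 alphas come from. As far as I understand, lasso_path() builds its own grid when none is given: the largest value is the smallest penalty that sets every coefficient to zero, max|Xᵀy| / n, and the smallest value is eps times that. The sketch below checks this on synthetic, centered data (not student-mat.csv).

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 8
X_toy = rng.normal(size=(n, p))
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)  # standardized features
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(size=n)
y_toy = y_toy - y_toy.mean()  # lasso_path does not fit an intercept

alphas, coefs, _ = lasso_path(X_toy, y_toy, eps=1e-4)

alpha_max = np.abs(X_toy.T @ y_toy).max() / n  # smallest penalty that zeroes all coefficients

# Number of coefficients that are not (numerically) zero at both ends of the path
n_nonzero_first = np.count_nonzero(np.abs(coefs[:, 0]) > 1e-8)
n_nonzero_last = np.count_nonzero(np.abs(coefs[:, -1]) > 1e-8)

print(alphas.shape, coefs.shape)                 # (100,) (8, 100)
print(np.isclose(alphas[0], alpha_max))          # True
print(np.isclose(alphas[-1] / alphas[0], 1e-4))  # True
print(n_nonzero_first, n_nonzero_last)
```

The alphas are returned in decreasing order, so the path starts with all coefficients at zero and variables enter one after another as the penalty is relaxed.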

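9. Coefficients of the optimal lasso regression by 10-fold cross-validation

(9) By 10-fold cross-validation (with random_state=1), choose the optimal penalty parameter α on the grid np.logspace(-3, 1, 100), fit the lasso regression, and present the coefficients of the optimal lasso regression as a data frame.

# 1. Choose the optimal alpha with 10-fold cross-validation (LassoCV)
from sklearn.linear_model import LassoCV
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
alphas = np.logspace(-3, 1, 100)
model_lasso_10cv = LassoCV(alphas=alphas, cv=kfold)
model_lasso_10cv.fit(X, y)
print(model_lasso_10cv.alpha_)
# 2. Show the coefficients of the optimal lasso regression
pd.DataFrame(
    model_lasso_10cv.coef_,
    index=X_dummies.columns,
    columns=['Coefficient'])

Output: 0.1668100537200059
Coefficient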
age -1.335234e-01
Medu 3.037488e-01
Fedu 0.000000e+00
traveltime -7.782359e-02
studytime 1.608208e-01
failures -1.239094e+00
famrel 7.038920e-03
freetime 3.799272e-02
goout -3.502907e-01
Dalc -0.000000e+00
Walc 0.000000e+00
health -6.823233e-02
absences 1.547020e-01
school_GP -0.000000e+00
school_MS 0.000000e+00
sex_F -3.598980e-01
sex_M 1.439074e-16
address_R -9.641718e-02
address_U 0.000000e+00
famsize_GT3 -1.640174e-01
famsize_LE3 3.597685e-17
Pstatus_A 9.131875e-03
Pstatus_T -3.597685e-17
Mjob_at_home -0.000000e+00
Mjob_health 2.895078e-01
Mjob_other -0.000000e+00
Mjob_services 3.410783e-01
Mjob_teacher -0.000000e+00
Fjob_at_home 0.000000e+00
Fjob_health 0.000000e+00
Fjob_other -0.000000e+00
Fjob_services -0.000000e+00
Fjob_teacher 1.367522e-01
reason_course -1.042161e-01
reason_home -0.000000e+00
reason_other 4.090333e-02
reason_reputation 5.860171e-02
guardian_father 0.000000e+00
guardian_mother -0.000000e+00
guardian_other 0.000000e+00
schoolsup_no 2.318489e-01
schoolsup_yes -1.798842e-16
famsup_no 1.753731e-01
famsup_yes -0.000000e+00
paid_no -0.000000e+00
paid_yes 0.000000e+00
activities_no 0.000000e+00
activities_yes -0.000000e+00
nursery_no 0.000000e+00
nursery_yes -0.000000e+00
higher_no -1.939954e-01
higher_yes 1.295167e-15
internet_no -5.816817e-02
internet_yes 0.000000e+00
romantic_no 3.038162e-01
romantic_yes -0.000000e+00
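
Some entries in this table, such as 1.439074e-16 for sex_M, are zero up to floating-point error. A quick way to list the variables lasso actually kept is to filter with a small tolerance. The sketch uses a few rows copied from the output above; the threshold 1e-8 is my own choice, not something the book prescribes.

```python
import pandas as pd

# A few rows copied from the LassoCV output above
coef = pd.Series({
    'Medu': 3.037488e-01,
    'Fedu': 0.000000e+00,
    'failures': -1.239094e+00,
    'sex_F': -3.598980e-01,
    'sex_M': 1.439074e-16,
    'higher_no': -1.939954e-01,
    'higher_yes': 1.295167e-15,
})

tol = 1e-8  # assumed threshold below which a coefficient counts as zero
selected = coef[coef.abs() > tol]

# Order the kept variables by absolute size, largest first
ranked = selected.reindex(selected.abs().sort_values(ascending=False).index)

print(list(selected.index))  # ['Medu', 'failures', 'sex_F', 'higher_no']
print(list(ranked.index))    # ['failures', 'sex_F', 'Medu', 'higher_no']
```

On the full model this would be applied to pd.Series(model_lasso_10cv.coef_, index=X_dummies.columns).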

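10. Coefficients of the optimal elastic net regression

(10) Consider the grid np.logspace(-3, 1, 100) for the penalty parameter α and the grid [0.001, 0.01, 0.1, 0.5, 1] for the tuning parameter l1_ratio. Choose the optimal α and l1_ratio by 10-fold cross-validation (with random_state=1), fit the elastic net regression, report the in-sample goodness of fit (R²), and present the coefficients of the optimal elastic net regression as a data frame.

from sklearn.linear_model import ElasticNetCV
# Choose the optimal alpha and l1_ratio by cross-validation (ElasticNetCV)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
alphas = np.logspace(-3, 1, 100)
model_enet_10cv = ElasticNetCV(
    alphas=alphas, cv=kfold,
    # this grid starts at 0.0001 rather than the 0.001 stated in the exercise; the outputs below were obtained with it
    l1_ratio=[0.0001, 0.01, 0.1, 0.5, 1]
)
model_enet_10cv.fit(X, y)
# Optimal penalty parameter alpha
model_enet_10cv.alpha_

Output: 0.7390722033525783

# Optimal weight of the L1 penalty
model_enet_10cv.l1_ratio_

Output: 0.0001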
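
An l1_ratio this close to 0 means the selected elastic net is almost a ridge regression. scikit-learn's ElasticNet minimizes (1/(2n))·||y - Xb||² + α·l1_ratio·||b||₁ + 0.5·α·(1 - l1_ratio)·||b||², while Ridge minimizes ||y - Xb||² + α_R·||b||². Multiplying the first objective by 2n shows that, for tiny l1_ratio, it matches ridge with α_R ≈ n·α·(1 - l1_ratio). With n = 395 and α ≈ 0.739 that gives roughly 292, close to the 284.8 chosen by RidgeCV earlier. The sketch below checks the correspondence on synthetic data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

rng = np.random.default_rng(0)
n, p = 200, 6
X_toy = rng.normal(size=(n, p))  # synthetic features
y_toy = X_toy @ np.array([1.5, -1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=n)

alpha, l1_ratio = 0.7, 1e-4

# Elastic net with an almost pure L2 penalty (tight tolerance so coordinate descent converges fully)
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, tol=1e-10, max_iter=100_000)
enet.fit(X_toy, y_toy)

# Ridge with the rescaled penalty n * alpha * (1 - l1_ratio)
ridge = Ridge(alpha=n * alpha * (1 - l1_ratio)).fit(X_toy, y_toy)

max_gap = np.max(np.abs(enet.coef_ - ridge.coef_))
print(max_gap)  # small; what remains is due to the tiny L1 term
```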

# Show the regression coefficients of the best elastic net model
pd.DataFrame(model_enet_10cv.coef_,
             index=X_dummies.columns,
             columns=['Coefficient'])

Output: Coefficient
age -0.234207
Medu 0.249271
Fedu 0.088374
traveltime -0.123689
studytime 0.226776
failures -0.719687
famrel 0.113137
freetime 0.105266
goout -0.330888
Dalc -0.077127
Walc 0.008907
health -0.138091
absences 0.195197
school_GP -0.029868
school_MS 0.029867
sex_F -0.206908
sex_M 0.206907
address_R -0.093429
address_U 0.093429
famsize_GT3 -0.119904
famsize_LE3 0.119903
Pstatus_A 0.061574
Pstatus_T -0.061574
Mjob_at_home -0.065240
Mjob_health 0.191109
Mjob_other -0.132072
Mjob_services 0.172806
Mjob_teacher -0.121191
Fjob_at_home 0.009670
Fjob_health 0.061303
Fjob_other -0.083404
Fjob_services -0.049209
Fjob_teacher 0.186866
reason_course -0.131521
reason_home -0.068707
reason_other 0.127367
reason_reputation 0.130086
guardian_father 0.016938
guardian_mother -0.002386
guardian_other -0.021881
schoolsup_no 0.158340
schoolsup_yes -0.158340
famsup_no 0.143039
famsup_yes -0.143039
paid_no -0.084988
paid_yes 0.084988
activities_no 0.044180
activities_yes -0.044180
nursery_no 0.013812
nursery_yes -0.013812
higher_no -0.171041
higher_yes 0.171041
internet_no -0.082580
internet_yes 0.082580
romantic_no 0.187161
romantic_yes -0.187161
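
Exercise (10) also asks for the in-sample goodness of fit, which the run above does not print. With the fitted model it would be model_enet_10cv.score(X, y). The sketch below uses synthetic data (so no value for student-mat.csv is claimed) and a shortened grid of my own choosing, and confirms that score() is the R² of the predictions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 5))  # synthetic features
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(size=150)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = ElasticNetCV(
    alphas=np.logspace(-3, 1, 30),  # shortened grid, for illustration only
    l1_ratio=[0.1, 0.5, 1],
    cv=kfold,
)
model.fit(X_toy, y_toy)

# In-sample R^2: score() on the same data the model was fit on
r2_in_sample = model.score(X_toy, y_toy)
r2_manual = r2_score(y_toy, model.predict(X_toy))

print(np.isclose(r2_in_sample, r2_manual))  # True
```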

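11. Goodness of fit on the training and test sets

(11) With random_state=0, randomly hold out 100 observations as the test set, fit the optimal elastic net regression, and compute the goodness of fit (R²) on the training set and on the test set.

# Split into training and test sets to check whether the model overfits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=0)  # hold out 100 observations as the test set
# Fit the optimal elastic net regression
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
alphas = np.logspace(-3, 1, 100)
model = ElasticNetCV(
    alphas=alphas, cv=kfold,
    l1_ratio=[0.0001, 0.01, 0.1, 0.5, 1])
model.fit(X_train, y_train)
# Show the optimal penalty parameter alpha
model.alpha_

Output: 2.0565123083486534

# R² on the training set
print("Training R^2:", model.score(X_train, y_train))
# R² on the test set
print("Test R^2:", model.score(X_test, y_test))

Output:
Training R^2: 0.3183454821508972
Test R^2: 0.043896742263800026

Note: the outputs in this section were recorded from a run that passed train_size=100, which trains on 100 observations and tests on the other 295. With test_size=100, as the exercise asks, the numbers will differ.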
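
The exercise asks to hold out 100 observations as the test set, which corresponds to test_size=100 in train_test_split. Passing train_size=100 does the opposite. A quick check of the resulting shapes on placeholder arrays with 395 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as student-mat.csv
X_toy = np.arange(395 * 2).reshape(395, 2)
y_toy = np.arange(395)

# test_size=100: 295 rows for training, 100 held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=100, random_state=0)
print(X_tr.shape, X_te.shape)    # (295, 2) (100, 2)

# train_size=100: only 100 rows for training, 295 for testing
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, train_size=100, random_state=0)
print(X_tr2.shape, X_te2.shape)  # (100, 2) (295, 2)
```

Training on only 100 rows with 56 features leaves little information per coefficient, which by itself can push the test R² toward zero.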
