Predicting Abalone Age with Regression Models (1)

1 Exploratory Analysis of the Dataset

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv(r"abalone_dataset.csv")
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 |
# Check the number of samples and features in the dataset
data.shape
(4176, 9)
# Inspect the data and check for missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
sex               4176 non-null object
length            4176 non-null float64
diameter          4176 non-null float64
height            4176 non-null float64
whole weight      4176 non-null float64
shucked weight    4176 non-null float64
viscera weight    4176 non-null float64
shell weight      4176 non-null float64
rings             4176 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.7+ KB

This is a pandas DataFrame with 4176 rows and 9 columns; the dtypes are float64 (7 columns), int64 (1 column), and object (1 column), and every column has 4176 non-null entries, so there are no missing values.

data.describe()
|       | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|-------|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| count | 4176.000000 | 4176.000000 | 4176.000000 | 4176.000000 | 4176.00000 | 4176.000000 | 4176.000000 | 4176.000000 |
| mean  | 0.524009 | 0.407892 | 0.139527 | 0.828818 | 0.35940 | 0.180613 | 0.238852 | 9.932471 |
| std   | 0.120103 | 0.099250 | 0.041826 | 0.490424 | 0.22198 | 0.109620 | 0.139213 | 3.223601 |
| min   | 0.075000 | 0.055000 | 0.000000 | 0.002000 | 0.00100 | 0.000500 | 0.001500 | 1.000000 |
| 25%   | 0.450000 | 0.350000 | 0.115000 | 0.441500 | 0.18600 | 0.093375 | 0.130000 | 8.000000 |
| 50%   | 0.545000 | 0.425000 | 0.140000 | 0.799750 | 0.33600 | 0.171000 | 0.234000 | 9.000000 |
| 75%   | 0.615000 | 0.480000 | 0.165000 | 1.153250 | 0.50200 | 0.253000 | 0.329000 | 11.000000 |
| max   | 0.815000 | 0.650000 | 1.130000 | 2.825500 | 1.48800 | 0.760000 | 1.005000 | 29.000000 |

The dataset contains 4176 samples, each with 9 features. The rings column is the number of shell rings, which serves as a proxy for age and is the prediction target. Apart from the categorical feature sex, all features are continuous.

Examine the distribution of values in the sex column.

# Distribution of values in the sex column
# Sex is male (M), female (F), or infant (I)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.countplot(x = "sex",data = data)
<matplotlib.axes._subplots.AxesSubplot at 0x185dcc26da0>

(Figure: bar chart of sample counts per sex category)

# Count the occurrences of each value in the sex column
data['sex'].value_counts()
M    1527
I    1342
F    1307
Name: sex, dtype: int64

For the continuous features, we can use seaborn's distplot function to draw a histogram of each feature's distribution. We lay out the histograms of the 8 continuous features in a 4-row, 2-column grid of subplots.

i = 1  # subplot counter
plt.figure(figsize=(16,8))
for col in data.columns[1:]:
    plt.subplot(4,2,i)
    i = i + 1
    # distplot is deprecated in recent seaborn; sns.histplot(data[col], kde=True) is the modern equivalent
    sns.distplot(data[col])
plt.tight_layout()

(Figure: histograms with density curves for the eight continuous features)

The figure shows the estimated distribution (histogram with a fitted density curve) of each of the eight continuous features: length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings.

Official documentation for sns.pairplot(): seaborn.pairplot — seaborn 0.12.2 documentation

By default, this function creates a grid of axes such that each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal is treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column. It is also possible to show a subset of variables, or to plot different variables on the rows and columns.

sns.pairplot(data,hue="sex")
<seaborn.axisgrid.PairGrid at 0x185dcba7668>

(Figure: pairwise scatter-plot matrix of the features, colored by sex)

From the pairwise scatter plots of the continuous features we can draw some basic observations:

- From the first row, abalone length (length) has a clear linear relationship with diameter and height, and a clear nonlinear relationship with each of the four weight features.
- From the last row, the ring count (rings) is positively correlated with every feature; its relationship with height appears the most linear.
- From the histograms on the diagonal, infant abalone (sex = "I") take visibly smaller values on every feature than adult abalone, while male (sex = "M") and female (sex = "F") abalone show no obvious difference in their feature distributions.

To quantify the linear relationships between features, we compute the correlation matrix and visualize it with a heatmap.

# data.corr() computes the pairwise correlation matrix of the numeric columns.
# The correlation coefficient measures the strength and direction of the linear
# relationship between two variables: it ranges from -1 (perfect negative
# correlation) through 0 (no linear correlation) to 1 (perfect positive correlation).
# Note: in pandas >= 2.0, call data.corr(numeric_only=True), since the
# object-typed sex column is no longer silently dropped.
corr_df = data.corr()
corr_df
|                | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|----------------|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| length         | 1.000000 | 0.986813 | 0.827552 | 0.925255 | 0.897905 | 0.903010 | 0.897697 | 0.557123 |
| diameter       | 0.986813 | 1.000000 | 0.833705 | 0.925452 | 0.893159 | 0.899726 | 0.905328 | 0.575005 |
| height         | 0.827552 | 0.833705 | 1.000000 | 0.819209 | 0.774957 | 0.798293 | 0.817326 | 0.558109 |
| whole weight   | 0.925255 | 0.925452 | 0.819209 | 1.000000 | 0.969403 | 0.966372 | 0.955351 | 0.540818 |
| shucked weight | 0.897905 | 0.893159 | 0.774957 | 0.969403 | 1.000000 | 0.931956 | 0.882606 | 0.421256 |
| viscera weight | 0.903010 | 0.899726 | 0.798293 | 0.966372 | 0.931956 | 1.000000 | 0.907647 | 0.504274 |
| shell weight   | 0.897697 | 0.905328 | 0.817326 | 0.955351 | 0.882606 | 0.907647 | 1.000000 | 0.628031 |
| rings          | 0.557123 | 0.575005 | 0.558109 | 0.540818 | 0.421256 | 0.504274 | 0.628031 | 1.000000 |
fig,ax = plt.subplots(figsize=(12,12))
# Draw the heatmap
ax = sns.heatmap(corr_df,linewidths=.5,
                cmap="Greens",
                annot=True,
                xticklabels=corr_df.columns,
                yticklabels=corr_df.index)
ax.xaxis.set_label_position('top')
ax.xaxis.tick_top()

(Figure: heatmap of the feature correlation matrix)

2 Abalone Data Preprocessing

2.1 One-hot encode the sex feature so the model can use it as dummy variables

We use pandas' get_dummies function to one-hot encode the sex feature.

One-hot encoding is a technique that converts categorical variables into a form that machine-learning algorithms can consume. During preprocessing, when a dataset contains non-numeric features, one-hot encoding is a common way to turn them into numeric ones.

Concretely, one-hot encoding creates a new binary column (or "dimension") for each category: the column is 1 when the row belongs to that category and 0 otherwise. For example, a color feature with three possible values (red, green, and blue) becomes three new columns, 'red', 'green', and 'blue'; an instance whose color is red gets a 1 in the 'red' column and 0 in the 'green' and 'blue' columns.
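The color example above can be reproduced directly with pandas' get_dummies (a toy sketch with made-up data, separate from the abalone dataset):

```python
import pandas as pd

# A hypothetical 'color' column, mirroring the red/green/blue example above
toy = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
encoded = pd.get_dummies(toy["color"], prefix="color")
print(encoded)
# Each row has exactly one 1 (or True), in the column matching its category
```

Note that get_dummies sorts the new columns alphabetically (color_blue, color_green, color_red), and recent pandas versions emit them with a boolean dtype rather than 0/1 integers.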

sex_onehot = pd.get_dummies(data["sex"],prefix="sex")
data[sex_onehot.columns] = sex_onehot
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 |

2.2 Add a constant feature equal to 1

data["ones"] = 1
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M | ones |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 | 1 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 | 1 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 | 1 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 | 1 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 | 1 |

2.3 Compute age from the ring count

Roughly once a year, an abalone adds a distinct growth ring to its shell, much like a tree adds annual rings. In this dataset the quantity we want to predict is abalone age, which is obtained by adding 1.5 to the ring count rings.

data["age"] = data["rings"] + 1.5
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M | ones | age |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|------|-----|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 | 1 | 8.5 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 | 1 | 10.5 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 | 1 | 11.5 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 | 1 | 8.5 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 | 1 | 9.5 |

2.4 Select features

Set the prediction target to the age column, then build two feature groups: one that includes the constant ones column and one that does not. For the sex-related columns we use only sex_F and sex_M, since sex_I is implied when both are 0.

y = data["age"] # target variable
features_with_ones = ["length","diameter","height","whole weight","shucked weight",
                      "viscera weight","shell weight","sex_F","sex_M","ones"]
features_without_ones = ["length","diameter","height","whole weight","shucked weight",
                      "viscera weight","shell weight","sex_F","sex_M"]
X = data[features_with_ones]
data.columns
Index(['sex', 'length', 'diameter', 'height', 'whole weight', 'shucked weight',
       'viscera weight', 'shell weight', 'rings', 'sex_F', 'sex_I', 'sex_M',
       'ones', 'age'],
      dtype='object')

2.5 Split the abalone dataset into training and test sets

Randomly split the dataset, with 80% of the samples for training and the remaining 20% for testing.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=111)

3 Implementing Linear Regression and Ridge Regression

3.1 Linear regression with NumPy

If the matrix $X^TX$ is full rank (its determinant is nonzero), the closed-form solution of simple linear regression is $w = (X^TX)^{-1}X^Ty$. We implement a function linear_regression that takes the training-set features and labels and returns the vector of regression coefficients, using numpy's np.linalg.det and np.linalg.inv to compute the matrix determinant and the matrix inverse, respectively.

import numpy as np
def linear_regression(X,y):
    # Note: np.zeros (not np.zeros_like(X.shape[1])), so w starts as a length-d zero vector
    w = np.zeros(X.shape[1])
    if np.linalg.det(X.T.dot(X)) != 0:  # solve only when X^T X is invertible
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w
# Train the linear regression implemented above on the abalone training set
w1=linear_regression(X_train,y_train)
w1 = pd.DataFrame(data=w1,index=X.columns,columns=['numpy_w'])
w1.round(decimals=2)
|                | numpy_w |
|----------------|---------|
| length         | 2.06 |
| diameter       | 8.71 |
| height         | 9.67 |
| whole weight   | 9.49 |
| shucked weight | -20.62 |
| viscera weight | -9.76 |
| shell weight   | 7.31 |
| sex_F          | 0.82 |
| sex_M          | 0.88 |
| ones           | 4.43 |

So the fitted model is:

y = 2.06 × length + 8.71 × diameter + 9.67 × height + 9.49 × whole weight − 20.62 × shucked weight − 9.76 × viscera weight + 7.31 × shell weight + 0.82 × sex_F + 0.88 × sex_M + 4.43

3.2 Linear regression with sklearn

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train[features_without_ones],y_train)
print(lr.coef_)
[  2.06057733   8.70746395   9.66850111   9.49481388 -20.62032761
  -9.76075413   7.31239635   0.81998531   0.88332439]
w_lr = []
w_lr.extend(lr.coef_)
w_lr.append(lr.intercept_)
w1['lr_sklearn_w']=w_lr
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w |
|----------------|---------|--------------|
| length         | 2.06 | 2.06 |
| diameter       | 8.71 | 8.71 |
| height         | 9.67 | 9.67 |
| whole weight   | 9.49 | 9.49 |
| shucked weight | -20.62 | -20.62 |
| viscera weight | -9.76 | -9.76 |
| shell weight   | 7.31 | 7.31 |
| sex_F          | 0.82 | 0.82 |
| sex_M          | 0.88 | 0.88 |
| ones           | 4.43 | 4.43 |

3.3 Ridge regression with NumPy

def ridge_regression(X, y, ridge_lambda):
    penalty_matrix = np.eye(X.shape[1])
    # Do not penalize the intercept: the last column of X is the all-ones
    # column, so its diagonal entry in the penalty matrix is set to 0
    penalty_matrix[X.shape[1] - 1][X.shape[1] - 1] = 0
    # No rank check is needed here: with lambda > 0 the penalized matrix is invertible
    w = np.linalg.inv(X.T.dot(X) + ridge_lambda * penalty_matrix).dot(X.T).dot(y)
    return w
w2 = ridge_regression(X_train, y_train, 1.0)
print(w2)
[  3.67238445   6.01243275   6.95275313   6.96992097 -17.45804929
  -5.89717411   9.49827839   0.88409529   0.92098443   4.76604926]
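The ridge implementation above drops the rank check on the grounds that the penalized matrix is always invertible. A quick numpy check on a deliberately rank-deficient design matrix (made-up numbers, purely illustrative) shows why a positive penalty on the diagonal guarantees this:

```python
import numpy as np

# A rank-deficient design: the third column duplicates the first,
# so X.T @ X is singular and plain least squares has no unique solution.
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [3.0, 1.0, 3.0]])
gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # 2 — not full rank

# With any lambda > 0, X^T X + lambda*I is positive definite: every
# eigenvalue of X^T X (all >= 0) is shifted up by lambda, so the
# penalized matrix is always invertible.
lam = 1.0
ridge_gram = gram + lam * np.eye(3)
print(np.linalg.eigvalsh(ridge_gram).min())  # ≈ lam: the zero eigenvalue shifted up
w_exists = np.linalg.inv(ridge_gram)  # no longer raises LinAlgError
```

In the notebook's version the intercept's diagonal entry is left unpenalized; invertibility still holds there because the remaining columns of the abalone design matrix are linearly independent of the ones column.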
w1['numpy_ridge_w']=w2
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w | numpy_ridge_w |
|----------------|---------|--------------|---------------|
| length         | 2.06 | 2.06 | 3.67 |
| diameter       | 8.71 | 8.71 | 6.01 |
| height         | 9.67 | 9.67 | 6.95 |
| whole weight   | 9.49 | 9.49 | 6.97 |
| shucked weight | -20.62 | -20.62 | -17.46 |
| viscera weight | -9.76 | -9.76 | -5.90 |
| shell weight   | 7.31 | 7.31 | 9.50 |
| sex_F          | 0.82 | 0.82 | 0.88 |
| sex_M          | 0.88 | 0.88 | 0.92 |
| ones           | 4.43 | 4.43 | 4.77 |

This code stores w2 in a new numpy_ridge_w column of the coefficient DataFrame w1, then displays all values rounded to two decimal places.

3.4 Ridge regression with sklearn

For comparison, we also fit sklearn's ridge regression with the regularization coefficient likewise set to 1.

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train[features_without_ones],y_train)
w_ridge = []
w_ridge.extend(ridge.coef_)
w_ridge.append(ridge.intercept_)
w1["ridge_sklearn_w"] = w_ridge
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w | numpy_ridge_w | ridge_sklearn_w |
|----------------|---------|--------------|---------------|-----------------|
| length         | 2.06 | 2.06 | 3.67 | 3.67 |
| diameter       | 8.71 | 8.71 | 6.01 | 6.01 |
| height         | 9.67 | 9.67 | 6.95 | 6.95 |
| whole weight   | 9.49 | 9.49 | 6.97 | 6.97 |
| shucked weight | -20.62 | -20.62 | -17.46 | -17.46 |
| viscera weight | -9.76 | -9.76 | -5.90 | -5.90 |
| shell weight   | 7.31 | 7.31 | 9.50 | 9.50 |
| sex_F          | 0.82 | 0.82 | 0.88 | 0.88 |
| sex_M          | 0.88 | 0.88 | 0.92 | 0.92 |
| ones           | 4.43 | 4.43 | 4.77 | 4.77 |

3.5 Ridge trace analysis

alphas = np.logspace(-10,10,20)
coef = pd.DataFrame()
for alpha in alphas:
    ridge_clf = Ridge(alpha=alpha)
    ridge_clf.fit(X_train[features_without_ones],y_train)
    df = pd.DataFrame([ridge_clf.coef_],columns=X_train[features_without_ones].columns)
    df['alpha'] = alpha
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    coef = pd.concat([coef, df], ignore_index=True)
coef.round(decimals=2)

|    | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | sex_F | sex_M | alpha |
|----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|
| 0  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 1  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 2  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 3  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 4  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 5  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 6  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 7  | 2.07 | 8.69 | 9.66 | 9.48 | -20.61 | -9.75 | 7.32 | 0.82 | 0.88 | 0.000000e+00 |
| 8  | 2.19 | 8.52 | 9.56 | 9.38 | -20.50 | -9.60 | 7.43 | 0.82 | 0.88 | 3.000000e-02 |
| 9  | 3.01 | 7.29 | 8.60 | 8.45 | -19.42 | -8.18 | 8.38 | 0.84 | 0.90 | 3.000000e-01 |
| 10 | 3.76 | 4.66 | 4.57 | 5.07 | -13.83 | -2.97 | 9.60 | 0.98 | 0.97 | 3.360000e+00 |
| 11 | 1.68 | 1.68 | 1.11 | 2.59 | -3.50 | -0.01 | 3.69 | 1.25 | 1.07 | 3.793000e+01 |
| 12 | 0.52 | 0.46 | 0.22 | 1.67 | 0.20 | 0.31 | 0.80 | 0.86 | 0.68 | 4.281300e+02 |
| 13 | 0.12 | 0.10 | 0.04 | 0.47 | 0.16 | 0.10 | 0.16 | 0.21 | 0.16 | 4.832930e+03 |
| 14 | 0.01 | 0.01 | 0.00 | 0.05 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 5.455595e+04 |
| 15 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.158482e+05 |
| 16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.951928e+06 |
| 17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.847600e+07 |
| 18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.858668e+08 |
| 19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.000000e+10 |
import matplotlib.pyplot as plt
%matplotlib inline
# Plot the ridge trace
# Configure fonts so CJK characters and minus signs render correctly
plt.rcParams['font.sans-serif'] = ['SimHei','Times New Roman']
plt.rcParams['axes.unicode_minus'] = False

plt.rcParams['figure.dpi'] = 300 # resolution
plt.figure(figsize=(9, 6))
for feature in X_train.columns[:-1]:
    plt.plot('alpha',feature,data=coef)
ax = plt.gca()
ax.set_xscale('log')
plt.legend(loc='upper right')
plt.xlabel(r'$\alpha$',fontsize=15)
plt.ylabel('coefficient',fontsize=15)
plt.show()

(Figure: ridge trace — coefficients versus α on a log scale)

4 Building an Abalone Age Model with LASSO

The LASSO objective function:

$(Xw - y)^T(Xw - y) + \lambda \lVert w \rVert_1$

As $\lambda$ grows, LASSO drives the feature coefficients to exactly zero one by one, so it can be used for feature selection; ridge regression, in contrast, only shrinks the coefficients toward zero without making them exactly zero.
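This contrast between LASSO and ridge can be sketched on a tiny synthetic dataset (made-up data, separate from the abalone analysis):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends only on the first two of five features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# LASSO sets the three irrelevant coefficients exactly to zero,
# while ridge merely shrinks them toward (but not to) zero.
print(lasso_demo.coef_)
print(ridge_demo.coef_)
```

The exact zeros in the LASSO solution are what make its regularization path, plotted below for the abalone data, useful for feature selection.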

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
lasso.fit(X_train[features_without_ones],y_train)
print(lasso.coef_)
print(lasso.intercept_)
[  3.02481093   2.92014492   0.           4.64027437 -13.70757191
  -0.          11.64433053   0.92165332   0.91028056]
6.333645071936747

This code fits a LASSO model with sklearn: import the Lasso class from sklearn.linear_model, create an instance with alpha set to 0.01, fit it on X_train[features_without_ones] and y_train, then print the fitted coefficients and intercept.

# The LASSO regularization path
coef = pd.DataFrame()
for alpha in np.linspace(0.0001,0.2,20):
    lasso_clf = Lasso(alpha=alpha)
    lasso_clf.fit(X_train[features_without_ones],y_train)
    df = pd.DataFrame([lasso_clf.coef_],columns=X_train[features_without_ones].columns)
    df['alpha'] = alpha
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    coef = pd.concat([coef, df], ignore_index=True)
coef.head()
# Plot the LASSO path
plt.figure(figsize=(9, 6),dpi=600)
for feature in X_train.columns[:-1]:
    plt.plot('alpha',feature,data=coef)
plt.legend(loc='upper right')
plt.xlabel(r'$\alpha$',fontsize=15)
plt.ylabel('coefficient',fontsize=15)
plt.show()

(Figure: LASSO regularization path — coefficients versus α)

coef
|    | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | sex_F | sex_M | alpha |
|----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|
| 0  | 2.059038 | 8.661868 | 9.536137 | 9.412378 | -20.534236 | -9.567578 | 7.387413 | 0.820628 | 0.883397 | 0.000100 |
| 1  | 3.140930 | 2.426052 | 0.000000 | 4.543720 | -13.375741 | -0.000000 | 11.691004 | 0.928439 | 0.911697 | 0.010621 |
| 2  | 0.731511 | 0.000000 | 0.000000 | 2.914207 | -7.761864 | 0.000000 | 12.087232 | 1.012229 | 0.909445 | 0.021142 |
| 3  | 0.000000 | 0.000000 | 0.000000 | 0.950032 | -2.635708 | 0.000000 | 11.544105 | 1.021525 | 0.858244 | 0.031663 |
| 4  | 0.000000 | 0.000000 | 0.000000 | 0.455492 | -0.000000 | 0.000000 | 9.056605 | 0.988677 | 0.788507 | 0.042184 |
| 5  | 0.000000 | 0.000000 | 0.000000 | 1.684796 | -0.000000 | 0.000000 | 4.528911 | 0.922586 | 0.705904 | 0.052705 |
| 6  | 0.000000 | 0.000000 | 0.000000 | 2.913148 | -0.000000 | 0.000000 | 0.004297 | 0.856548 | 0.623360 | 0.063226 |
| 7  | 0.000000 | 0.000000 | 0.000000 | 2.925570 | -0.000000 | 0.000000 | 0.000000 | 0.750680 | 0.523345 | 0.073747 |
| 8  | 0.000000 | 0.000000 | 0.000000 | 2.936428 | -0.000000 | 0.000000 | 0.000000 | 0.645180 | 0.423615 | 0.084268 |
| 9  | 0.000000 | 0.000000 | 0.000000 | 2.947427 | -0.000000 | 0.000000 | 0.000000 | 0.539520 | 0.323771 | 0.094789 |
| 10 | 0.000000 | 0.000000 | 0.000000 | 2.958268 | -0.000000 | 0.000000 | 0.000000 | 0.434040 | 0.224055 | 0.105311 |
| 11 | 0.000000 | 0.000000 | 0.000000 | 2.969234 | -0.000000 | 0.000000 | 0.000000 | 0.328418 | 0.124238 | 0.115832 |
| 12 | 0.000000 | 0.000000 | 0.000000 | 2.980144 | -0.000000 | 0.000000 | 0.000000 | 0.222859 | 0.024465 | 0.126353 |
| 13 | 0.000000 | 0.000000 | 0.000000 | 2.957719 | -0.000000 | 0.000000 | 0.000000 | 0.168073 | 0.000000 | 0.136874 |
| 14 | 0.000000 | 0.000000 | 0.000000 | 2.924666 | -0.000000 | 0.000000 | 0.000000 | 0.129582 | 0.000000 | 0.147395 |
| 15 | 0.000000 | 0.000000 | 0.000000 | 2.891614 | -0.000000 | 0.000000 | 0.000000 | 0.091091 | 0.000000 | 0.157916 |
| 16 | 0.000000 | 0.000000 | 0.000000 | 2.858562 | 0.000000 | 0.000000 | 0.000000 | 0.052600 | 0.000000 | 0.168437 |
| 17 | 0.000000 | 0.000000 | 0.000000 | 2.825507 | 0.000000 | 0.000000 | 0.000000 | 0.014110 | 0.000000 | 0.178958 |
| 18 | 0.000000 | 0.000000 | 0.000000 | 2.785384 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.189479 |
| 19 | 0.000000 | 0.000000 | 0.000000 | 2.741172 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.200000 |

5 Evaluating the Age-Prediction Models

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


# Predictions from the linear regression model
y_test_pred_lr = lr.predict(X_test.iloc[:,:-1])
print(round(mean_absolute_error(y_test, y_test_pred_lr), 4))

# Predictions from the ridge regression model
y_test_pred_ridge = ridge.predict(X_test[features_without_ones])
print(round(mean_absolute_error(y_test, y_test_pred_ridge), 4))

# Predictions from the LASSO model
y_test_pred_lasso = lasso.predict(X_test[features_without_ones])
print(round(mean_absolute_error(y_test, y_test_pred_lasso), 4))

1.6127
1.6246
1.6737

This code computes the mean absolute error (MAE) of the linear, ridge, and LASSO regression models on the test set: each trained model predicts on the test set, the MAE between the predictions and the true values is computed, and the result is rounded to four decimal places and printed.
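As a sanity check on the metric itself, MAE can be computed by hand with numpy on a few hypothetical values (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted and true ages, purely for illustration
y_true = np.array([9.5, 11.5, 8.5, 10.5])
y_pred = np.array([10.0, 11.0, 9.5, 10.0])

# MAE is the average absolute deviation between predictions and true values
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.625
```

This matches what sklearn's mean_absolute_error computes, so a test-set MAE of about 1.61 means the model's age predictions are off by roughly 1.6 years on average.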

# R² score
print(round(r2_score(y_test,y_test_pred_lr),4))
print(round(r2_score(y_test,y_test_pred_ridge),4))
print(round(r2_score(y_test,y_test_pred_lasso),4))
0.5455
0.5394
0.5154

Likewise, this code computes the R² score of the linear, ridge, and LASSO regression models on the test set and prints it rounded to four decimal places.

5.2 Residual plot

A residual plot is a diagnostic for regression models. If the points scatter randomly around 0, the fit is good; if the residuals show some structure, the fit is inadequate and the model should be revised.

plt.figure(figsize=(9, 6),dpi=600)
y_train_pred_ridge = ridge.predict(X_train[features_without_ones])
plt.scatter(y_train_pred_ridge,y_train_pred_ridge - y_train,c="g",alpha=0.6)
plt.scatter(y_test_pred_ridge,y_test_pred_ridge - y_test,c="r",alpha=0.6)
plt.hlines(y=0,xmin=0,xmax=30,color="b",alpha=0.6)
plt.ylabel("Residuals")
plt.xlabel("Predict")
<matplotlib.text.Text at 0x185da280dd8>

(Figure: residual plot for the ridge model — training residuals in green, test residuals in red)

In the residual plot, the test-set points (red) largely overlap the training-set points (green), indicating that the model generalizes well.
