Predicting Abalone Age with Regression Models (1)

1 Exploratory Analysis of the Dataset

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv(r"abalone_dataset.csv")
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 |
# Check the number of samples and features in the dataset
data.shape
(4176, 9)
# Inspect the data and check for missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
sex               4176 non-null object
length            4176 non-null float64
diameter          4176 non-null float64
height            4176 non-null float64
whole weight      4176 non-null float64
shucked weight    4176 non-null float64
viscera weight    4176 non-null float64
shell weight      4176 non-null float64
rings             4176 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.7+ KB

This is a pandas DataFrame with 4176 rows and 9 columns; the dtypes are float64 (7 columns), int64 (1 column), and object (1 column), and every column has 4176 non-null entries, so there are no missing values.

data.describe()
|       | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|-------|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| count | 4176.000000 | 4176.000000 | 4176.000000 | 4176.000000 | 4176.00000 | 4176.000000 | 4176.000000 | 4176.000000 |
| mean  | 0.524009 | 0.407892 | 0.139527 | 0.828818 | 0.35940 | 0.180613 | 0.238852 | 9.932471 |
| std   | 0.120103 | 0.099250 | 0.041826 | 0.490424 | 0.22198 | 0.109620 | 0.139213 | 3.223601 |
| min   | 0.075000 | 0.055000 | 0.000000 | 0.002000 | 0.00100 | 0.000500 | 0.001500 | 1.000000 |
| 25%   | 0.450000 | 0.350000 | 0.115000 | 0.441500 | 0.18600 | 0.093375 | 0.130000 | 8.000000 |
| 50%   | 0.545000 | 0.425000 | 0.140000 | 0.799750 | 0.33600 | 0.171000 | 0.234000 | 9.000000 |
| 75%   | 0.615000 | 0.480000 | 0.165000 | 1.153250 | 0.50200 | 0.253000 | 0.329000 | 11.000000 |
| max   | 0.815000 | 0.650000 | 1.130000 | 2.825500 | 1.48800 | 0.760000 | 1.005000 | 29.000000 |

The dataset contains 4176 samples, each with 9 features. The rings column is the number of shell rings, which serves as a proxy for age and is the prediction target. Apart from the categorical feature sex, all features are continuous.

Examine the distribution of values in the sex column.

# Distribution of values in the sex column
# Sex is male (M), female (F), or infant (I)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.countplot(x = "sex",data = data)
<matplotlib.axes._subplots.AxesSubplot at 0x185dcc26da0>

(Figure: bar chart of sample counts per sex category)

# Count the occurrences of each value in the sex column
data['sex'].value_counts()
M    1527
I    1342
F    1307
Name: sex, dtype: int64

For the continuous features, we can use seaborn's distplot function to draw a histogram of each feature's distribution. We lay out the histograms of the 8 continuous features in a 4-row, 2-column grid of subplots.

i = 1  # subplot counter
plt.figure(figsize=(16,8))
for col in data.columns[1:]:
    plt.subplot(4,2,i)
    i = i + 1
    # distplot is deprecated in recent seaborn; sns.histplot(data[col], kde=True) is the modern equivalent
    sns.distplot(data[col])
plt.tight_layout()

(Figure: histograms with density curves for the eight continuous features)

The figure shows the estimated distribution (histogram with a fitted density curve) of each of the eight continuous features: length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings.

Official documentation for sns.pairplot(): seaborn.pairplot — seaborn 0.12.2 documentation

By default, this function creates a grid of axes such that each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal is treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column. It is also possible to show a subset of variables, or to plot different variables on the rows and columns.

sns.pairplot(data,hue="sex")
<seaborn.axisgrid.PairGrid at 0x185dcba7668>

(Figure: pairwise scatter-plot matrix of the features, colored by sex)

From the pairwise scatter plots of the continuous features we can draw some basic observations:

- From the first row, abalone length (length) has a clear linear relationship with diameter and height, and a clear nonlinear relationship with each of the four weight features.
- From the last row, the ring count (rings) is positively correlated with every feature; its relationship with height appears the most linear.
- From the histograms on the diagonal, infant abalone (sex = "I") take visibly smaller values on every feature than adult abalone, while male (sex = "M") and female (sex = "F") abalone show no obvious difference in their feature distributions.

To quantify the linear relationships between features, we compute the correlation matrix and visualize it with a heatmap.

# data.corr() computes the pairwise correlation matrix of the numeric columns.
# The correlation coefficient measures the strength and direction of the linear
# relationship between two variables: it ranges from -1 (perfect negative
# correlation) through 0 (no linear correlation) to 1 (perfect positive correlation).
# Note: in pandas >= 2.0, call data.corr(numeric_only=True), since the
# object-typed sex column is no longer silently dropped.
corr_df = data.corr()
corr_df
|                | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|----------------|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| length         | 1.000000 | 0.986813 | 0.827552 | 0.925255 | 0.897905 | 0.903010 | 0.897697 | 0.557123 |
| diameter       | 0.986813 | 1.000000 | 0.833705 | 0.925452 | 0.893159 | 0.899726 | 0.905328 | 0.575005 |
| height         | 0.827552 | 0.833705 | 1.000000 | 0.819209 | 0.774957 | 0.798293 | 0.817326 | 0.558109 |
| whole weight   | 0.925255 | 0.925452 | 0.819209 | 1.000000 | 0.969403 | 0.966372 | 0.955351 | 0.540818 |
| shucked weight | 0.897905 | 0.893159 | 0.774957 | 0.969403 | 1.000000 | 0.931956 | 0.882606 | 0.421256 |
| viscera weight | 0.903010 | 0.899726 | 0.798293 | 0.966372 | 0.931956 | 1.000000 | 0.907647 | 0.504274 |
| shell weight   | 0.897697 | 0.905328 | 0.817326 | 0.955351 | 0.882606 | 0.907647 | 1.000000 | 0.628031 |
| rings          | 0.557123 | 0.575005 | 0.558109 | 0.540818 | 0.421256 | 0.504274 | 0.628031 | 1.000000 |
fig,ax = plt.subplots(figsize=(12,12))
# Draw the heatmap
ax = sns.heatmap(corr_df,linewidths=.5,
                cmap="Greens",
                annot=True,
                xticklabels=corr_df.columns,
                yticklabels=corr_df.index)
ax.xaxis.set_label_position('top')
ax.xaxis.tick_top()

(Figure: heatmap of the feature correlation matrix)

2 Abalone Data Preprocessing

2.1 One-hot encode the sex feature so the model can use it as dummy variables

We use pandas' get_dummies function to one-hot encode the sex feature.

One-hot encoding is a technique that converts categorical variables into a form that machine-learning algorithms can consume. During preprocessing, when a dataset contains non-numeric features, one-hot encoding is a common way to turn them into numeric ones.

Concretely, one-hot encoding creates a new binary column (or "dimension") for each category: the column is 1 when the row belongs to that category and 0 otherwise. For example, a color feature with three possible values (red, green, and blue) becomes three new columns, 'red', 'green', and 'blue'; an instance whose color is red gets a 1 in the 'red' column and 0 in the 'green' and 'blue' columns.
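The color example above can be reproduced directly with pandas' get_dummies (a toy sketch with made-up data, separate from the abalone dataset):

```python
import pandas as pd

# A hypothetical 'color' column, mirroring the red/green/blue example above
toy = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
encoded = pd.get_dummies(toy["color"], prefix="color")
print(encoded)
# Each row has exactly one 1 (or True), in the column matching its category
```

Note that get_dummies sorts the new columns alphabetically (color_blue, color_green, color_red), and recent pandas versions emit them with a boolean dtype rather than 0/1 integers.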

sex_onehot = pd.get_dummies(data["sex"],prefix="sex")
data[sex_onehot.columns] = sex_onehot
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 |

2.2 Add a constant feature equal to 1

data["ones"] = 1
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M | ones |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|------|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 | 1 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 | 1 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 | 1 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 | 1 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 | 1 |

2.3 Compute age from the ring count

Roughly once a year, an abalone adds a distinct growth ring to its shell, much like a tree adds annual rings. In this dataset the quantity we want to predict is abalone age, which is obtained by adding 1.5 to the ring count rings.

data["age"] = data["rings"] + 1.5
data.head()
|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings | sex_F | sex_I | sex_M | ones | age |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|-------|------|-----|
| 0 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | 0 | 0 | 1 | 1 | 8.5 |
| 1 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | 1 | 0 | 0 | 1 | 10.5 |
| 2 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | 0 | 0 | 1 | 1 | 11.5 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 1 | 0 | 1 | 8.5 |
| 4 | I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 | 0 | 1 | 0 | 1 | 9.5 |

2.4 Select features

Set the prediction target to the age column, then build two feature groups: one that includes the constant ones column and one that does not. For the sex-related columns we use only sex_F and sex_M, since sex_I is implied when both are 0.

y = data["age"] # target variable
features_with_ones = ["length","diameter","height","whole weight","shucked weight",
                      "viscera weight","shell weight","sex_F","sex_M","ones"]
features_without_ones = ["length","diameter","height","whole weight","shucked weight",
                      "viscera weight","shell weight","sex_F","sex_M"]
X = data[features_with_ones]
data.columns
Index(['sex', 'length', 'diameter', 'height', 'whole weight', 'shucked weight',
       'viscera weight', 'shell weight', 'rings', 'sex_F', 'sex_I', 'sex_M',
       'ones', 'age'],
      dtype='object')

2.5 Split the abalone dataset into training and test sets

Randomly split the dataset, with 80% of the samples for training and the remaining 20% for testing.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=111)

3 Implementing Linear Regression and Ridge Regression

3.1 Linear regression with NumPy

If the matrix $X^TX$ is full rank (its determinant is nonzero), the closed-form solution of simple linear regression is $w = (X^TX)^{-1}X^Ty$. We implement a function linear_regression that takes the training-set features and labels and returns the vector of regression coefficients, using numpy's np.linalg.det and np.linalg.inv to compute the matrix determinant and the matrix inverse, respectively.

import numpy as np
def linear_regression(X,y):
    # Note: np.zeros (not np.zeros_like(X.shape[1])), so w starts as a length-d zero vector
    w = np.zeros(X.shape[1])
    if np.linalg.det(X.T.dot(X)) != 0:  # solve only when X^T X is invertible
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w
# Train the linear regression implemented above on the abalone training set
w1=linear_regression(X_train,y_train)
w1 = pd.DataFrame(data=w1,index=X.columns,columns=['numpy_w'])
w1.round(decimals=2)
|                | numpy_w |
|----------------|---------|
| length         | 2.06 |
| diameter       | 8.71 |
| height         | 9.67 |
| whole weight   | 9.49 |
| shucked weight | -20.62 |
| viscera weight | -9.76 |
| shell weight   | 7.31 |
| sex_F          | 0.82 |
| sex_M          | 0.88 |
| ones           | 4.43 |

So the fitted model is:

y = 2.06 × length + 8.71 × diameter + 9.67 × height + 9.49 × whole weight − 20.62 × shucked weight − 9.76 × viscera weight + 7.31 × shell weight + 0.82 × sex_F + 0.88 × sex_M + 4.43

3.2 Linear regression with sklearn

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train[features_without_ones],y_train)
print(lr.coef_)
[  2.06057733   8.70746395   9.66850111   9.49481388 -20.62032761
  -9.76075413   7.31239635   0.81998531   0.88332439]
w_lr = []
w_lr.extend(lr.coef_)
w_lr.append(lr.intercept_)
w1['lr_sklearn_w']=w_lr
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w |
|----------------|---------|--------------|
| length         | 2.06 | 2.06 |
| diameter       | 8.71 | 8.71 |
| height         | 9.67 | 9.67 |
| whole weight   | 9.49 | 9.49 |
| shucked weight | -20.62 | -20.62 |
| viscera weight | -9.76 | -9.76 |
| shell weight   | 7.31 | 7.31 |
| sex_F          | 0.82 | 0.82 |
| sex_M          | 0.88 | 0.88 |
| ones           | 4.43 | 4.43 |

3.3 Ridge regression with NumPy

def ridge_regression(X, y, ridge_lambda):
    penalty_matrix = np.eye(X.shape[1])
    # Do not penalize the intercept: the last column of X is the all-ones
    # column, so its diagonal entry in the penalty matrix is set to 0
    penalty_matrix[X.shape[1] - 1][X.shape[1] - 1] = 0
    # No rank check is needed here: with lambda > 0 the penalized matrix is invertible
    w = np.linalg.inv(X.T.dot(X) + ridge_lambda * penalty_matrix).dot(X.T).dot(y)
    return w
w2 = ridge_regression(X_train, y_train, 1.0)
print(w2)
[  3.67238445   6.01243275   6.95275313   6.96992097 -17.45804929
  -5.89717411   9.49827839   0.88409529   0.92098443   4.76604926]
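The ridge implementation above drops the rank check on the grounds that the penalized matrix is always invertible. A quick numpy check on a deliberately rank-deficient design matrix (made-up numbers, purely illustrative) shows why a positive penalty on the diagonal guarantees this:

```python
import numpy as np

# A rank-deficient design: the third column duplicates the first,
# so X.T @ X is singular and plain least squares has no unique solution.
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [3.0, 1.0, 3.0]])
gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # 2 — not full rank

# With any lambda > 0, X^T X + lambda*I is positive definite: every
# eigenvalue of X^T X (all >= 0) is shifted up by lambda, so the
# penalized matrix is always invertible.
lam = 1.0
ridge_gram = gram + lam * np.eye(3)
print(np.linalg.eigvalsh(ridge_gram).min())  # ≈ lam: the zero eigenvalue shifted up
w_exists = np.linalg.inv(ridge_gram)  # no longer raises LinAlgError
```

In the notebook's version the intercept's diagonal entry is left unpenalized; invertibility still holds there because the remaining columns of the abalone design matrix are linearly independent of the ones column.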
w1['numpy_ridge_w']=w2
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w | numpy_ridge_w |
|----------------|---------|--------------|---------------|
| length         | 2.06 | 2.06 | 3.67 |
| diameter       | 8.71 | 8.71 | 6.01 |
| height         | 9.67 | 9.67 | 6.95 |
| whole weight   | 9.49 | 9.49 | 6.97 |
| shucked weight | -20.62 | -20.62 | -17.46 |
| viscera weight | -9.76 | -9.76 | -5.90 |
| shell weight   | 7.31 | 7.31 | 9.50 |
| sex_F          | 0.82 | 0.82 | 0.88 |
| sex_M          | 0.88 | 0.88 | 0.92 |
| ones           | 4.43 | 4.43 | 4.77 |

This code stores w2 in a new numpy_ridge_w column of the coefficient DataFrame w1, then displays all values rounded to two decimal places.

3.4 Ridge regression with sklearn

For comparison, we also fit sklearn's ridge regression with the regularization coefficient likewise set to 1.

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train[features_without_ones],y_train)
w_ridge = []
w_ridge.extend(ridge.coef_)
w_ridge.append(ridge.intercept_)
w1["ridge_sklearn_w"] = w_ridge
w1.round(decimals=2)
|                | numpy_w | lr_sklearn_w | numpy_ridge_w | ridge_sklearn_w |
|----------------|---------|--------------|---------------|-----------------|
| length         | 2.06 | 2.06 | 3.67 | 3.67 |
| diameter       | 8.71 | 8.71 | 6.01 | 6.01 |
| height         | 9.67 | 9.67 | 6.95 | 6.95 |
| whole weight   | 9.49 | 9.49 | 6.97 | 6.97 |
| shucked weight | -20.62 | -20.62 | -17.46 | -17.46 |
| viscera weight | -9.76 | -9.76 | -5.90 | -5.90 |
| shell weight   | 7.31 | 7.31 | 9.50 | 9.50 |
| sex_F          | 0.82 | 0.82 | 0.88 | 0.88 |
| sex_M          | 0.88 | 0.88 | 0.92 | 0.92 |
| ones           | 4.43 | 4.43 | 4.77 | 4.77 |

3.5 Ridge trace analysis

alphas = np.logspace(-10,10,20)
coef = pd.DataFrame()
for alpha in alphas:
    ridge_clf = Ridge(alpha=alpha)
    ridge_clf.fit(X_train[features_without_ones],y_train)
    df = pd.DataFrame([ridge_clf.coef_],columns=X_train[features_without_ones].columns)
    df['alpha'] = alpha
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    coef = pd.concat([coef, df], ignore_index=True)
coef.round(decimals=2)

|    | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | sex_F | sex_M | alpha |
|----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|
| 0  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 1  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 2  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 3  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 4  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 5  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 6  | 2.06 | 8.71 | 9.67 | 9.49 | -20.62 | -9.76 | 7.31 | 0.82 | 0.88 | 0.000000e+00 |
| 7  | 2.07 | 8.69 | 9.66 | 9.48 | -20.61 | -9.75 | 7.32 | 0.82 | 0.88 | 0.000000e+00 |
| 8  | 2.19 | 8.52 | 9.56 | 9.38 | -20.50 | -9.60 | 7.43 | 0.82 | 0.88 | 3.000000e-02 |
| 9  | 3.01 | 7.29 | 8.60 | 8.45 | -19.42 | -8.18 | 8.38 | 0.84 | 0.90 | 3.000000e-01 |
| 10 | 3.76 | 4.66 | 4.57 | 5.07 | -13.83 | -2.97 | 9.60 | 0.98 | 0.97 | 3.360000e+00 |
| 11 | 1.68 | 1.68 | 1.11 | 2.59 | -3.50 | -0.01 | 3.69 | 1.25 | 1.07 | 3.793000e+01 |
| 12 | 0.52 | 0.46 | 0.22 | 1.67 | 0.20 | 0.31 | 0.80 | 0.86 | 0.68 | 4.281300e+02 |
| 13 | 0.12 | 0.10 | 0.04 | 0.47 | 0.16 | 0.10 | 0.16 | 0.21 | 0.16 | 4.832930e+03 |
| 14 | 0.01 | 0.01 | 0.00 | 0.05 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 5.455595e+04 |
| 15 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.158482e+05 |
| 16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.951928e+06 |
| 17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.847600e+07 |
| 18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.858668e+08 |
| 19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.000000e+10 |
import matplotlib.pyplot as plt
%matplotlib inline
# Plot the ridge trace
# Configure fonts so CJK characters and minus signs render correctly
plt.rcParams['font.sans-serif'] = ['SimHei','Times New Roman']
plt.rcParams['axes.unicode_minus'] = False

plt.rcParams['figure.dpi'] = 300 # resolution
plt.figure(figsize=(9, 6))
for feature in X_train.columns[:-1]:
    plt.plot('alpha',feature,data=coef)
ax = plt.gca()
ax.set_xscale('log')
plt.legend(loc='upper right')
plt.xlabel(r'$\alpha$',fontsize=15)
plt.ylabel('coefficient',fontsize=15)
plt.show()

(Figure: ridge trace — coefficients versus α on a log scale)

4 Building an Abalone Age Model with LASSO

The LASSO objective function:

$(Xw - y)^T(Xw - y) + \lambda \lVert w \rVert_1$

As $\lambda$ grows, LASSO drives the feature coefficients to exactly zero one by one, so it can be used for feature selection; ridge regression, in contrast, only shrinks the coefficients toward zero without making them exactly zero.
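This contrast between LASSO and ridge can be sketched on a tiny synthetic dataset (made-up data, separate from the abalone analysis):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends only on the first two of five features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# LASSO sets the three irrelevant coefficients exactly to zero,
# while ridge merely shrinks them toward (but not to) zero.
print(lasso_demo.coef_)
print(ridge_demo.coef_)
```

The exact zeros in the LASSO solution are what make its regularization path, plotted below for the abalone data, useful for feature selection.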

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
lasso.fit(X_train[features_without_ones],y_train)
print(lasso.coef_)
print(lasso.intercept_)
[  3.02481093   2.92014492   0.           4.64027437 -13.70757191
  -0.          11.64433053   0.92165332   0.91028056]
6.333645071936747

This code fits a LASSO model with sklearn: import the Lasso class from sklearn.linear_model, create an instance with alpha set to 0.01, fit it on X_train[features_without_ones] and y_train, then print the fitted coefficients and intercept.

# The LASSO regularization path
coef = pd.DataFrame()
for alpha in np.linspace(0.0001,0.2,20):
    lasso_clf = Lasso(alpha=alpha)
    lasso_clf.fit(X_train[features_without_ones],y_train)
    df = pd.DataFrame([lasso_clf.coef_],columns=X_train[features_without_ones].columns)
    df['alpha'] = alpha
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    coef = pd.concat([coef, df], ignore_index=True)
coef.head()
# Plot the LASSO path
plt.figure(figsize=(9, 6),dpi=600)
for feature in X_train.columns[:-1]:
    plt.plot('alpha',feature,data=coef)
plt.legend(loc='upper right')
plt.xlabel(r'$\alpha$',fontsize=15)
plt.ylabel('coefficient',fontsize=15)
plt.show()

(Figure: LASSO regularization path — coefficients versus α)

coef
|    | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | sex_F | sex_M | alpha |
|----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|
| 0  | 2.059038 | 8.661868 | 9.536137 | 9.412378 | -20.534236 | -9.567578 | 7.387413 | 0.820628 | 0.883397 | 0.000100 |
| 1  | 3.140930 | 2.426052 | 0.000000 | 4.543720 | -13.375741 | -0.000000 | 11.691004 | 0.928439 | 0.911697 | 0.010621 |
| 2  | 0.731511 | 0.000000 | 0.000000 | 2.914207 | -7.761864 | 0.000000 | 12.087232 | 1.012229 | 0.909445 | 0.021142 |
| 3  | 0.000000 | 0.000000 | 0.000000 | 0.950032 | -2.635708 | 0.000000 | 11.544105 | 1.021525 | 0.858244 | 0.031663 |
| 4  | 0.000000 | 0.000000 | 0.000000 | 0.455492 | -0.000000 | 0.000000 | 9.056605 | 0.988677 | 0.788507 | 0.042184 |
| 5  | 0.000000 | 0.000000 | 0.000000 | 1.684796 | -0.000000 | 0.000000 | 4.528911 | 0.922586 | 0.705904 | 0.052705 |
| 6  | 0.000000 | 0.000000 | 0.000000 | 2.913148 | -0.000000 | 0.000000 | 0.004297 | 0.856548 | 0.623360 | 0.063226 |
| 7  | 0.000000 | 0.000000 | 0.000000 | 2.925570 | -0.000000 | 0.000000 | 0.000000 | 0.750680 | 0.523345 | 0.073747 |
| 8  | 0.000000 | 0.000000 | 0.000000 | 2.936428 | -0.000000 | 0.000000 | 0.000000 | 0.645180 | 0.423615 | 0.084268 |
| 9  | 0.000000 | 0.000000 | 0.000000 | 2.947427 | -0.000000 | 0.000000 | 0.000000 | 0.539520 | 0.323771 | 0.094789 |
| 10 | 0.000000 | 0.000000 | 0.000000 | 2.958268 | -0.000000 | 0.000000 | 0.000000 | 0.434040 | 0.224055 | 0.105311 |
| 11 | 0.000000 | 0.000000 | 0.000000 | 2.969234 | -0.000000 | 0.000000 | 0.000000 | 0.328418 | 0.124238 | 0.115832 |
| 12 | 0.000000 | 0.000000 | 0.000000 | 2.980144 | -0.000000 | 0.000000 | 0.000000 | 0.222859 | 0.024465 | 0.126353 |
| 13 | 0.000000 | 0.000000 | 0.000000 | 2.957719 | -0.000000 | 0.000000 | 0.000000 | 0.168073 | 0.000000 | 0.136874 |
| 14 | 0.000000 | 0.000000 | 0.000000 | 2.924666 | -0.000000 | 0.000000 | 0.000000 | 0.129582 | 0.000000 | 0.147395 |
| 15 | 0.000000 | 0.000000 | 0.000000 | 2.891614 | -0.000000 | 0.000000 | 0.000000 | 0.091091 | 0.000000 | 0.157916 |
| 16 | 0.000000 | 0.000000 | 0.000000 | 2.858562 | 0.000000 | 0.000000 | 0.000000 | 0.052600 | 0.000000 | 0.168437 |
| 17 | 0.000000 | 0.000000 | 0.000000 | 2.825507 | 0.000000 | 0.000000 | 0.000000 | 0.014110 | 0.000000 | 0.178958 |
| 18 | 0.000000 | 0.000000 | 0.000000 | 2.785384 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.189479 |
| 19 | 0.000000 | 0.000000 | 0.000000 | 2.741172 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.200000 |

5 Evaluating the Age-Prediction Models

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


# Predictions from the linear regression model
y_test_pred_lr = lr.predict(X_test.iloc[:,:-1])
print(round(mean_absolute_error(y_test, y_test_pred_lr), 4))

# Predictions from the ridge regression model
y_test_pred_ridge = ridge.predict(X_test[features_without_ones])
print(round(mean_absolute_error(y_test, y_test_pred_ridge), 4))

# Predictions from the LASSO model
y_test_pred_lasso = lasso.predict(X_test[features_without_ones])
print(round(mean_absolute_error(y_test, y_test_pred_lasso), 4))

1.6127
1.6246
1.6737

This code computes the mean absolute error (MAE) of the linear, ridge, and LASSO regression models on the test set: each trained model predicts on the test set, the MAE between the predictions and the true values is computed, and the result is rounded to four decimal places and printed.
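As a sanity check on the metric itself, MAE can be computed by hand with numpy on a few hypothetical values (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted and true ages, purely for illustration
y_true = np.array([9.5, 11.5, 8.5, 10.5])
y_pred = np.array([10.0, 11.0, 9.5, 10.0])

# MAE is the average absolute deviation between predictions and true values
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.625
```

This matches what sklearn's mean_absolute_error computes, so a test-set MAE of about 1.61 means the model's age predictions are off by roughly 1.6 years on average.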

# R² score
print(round(r2_score(y_test,y_test_pred_lr),4))
print(round(r2_score(y_test,y_test_pred_ridge),4))
print(round(r2_score(y_test,y_test_pred_lasso),4))
0.5455
0.5394
0.5154

Likewise, this code computes the R² score of the linear, ridge, and LASSO regression models on the test set and prints it rounded to four decimal places.

5.2 Residual plot

A residual plot is a diagnostic for regression models. If the points scatter randomly around 0, the fit is good; if the residuals show some structure, the fit is inadequate and the model should be revised.

plt.figure(figsize=(9, 6),dpi=600)
y_train_pred_ridge = ridge.predict(X_train[features_without_ones])
plt.scatter(y_train_pred_ridge,y_train_pred_ridge - y_train,c="g",alpha=0.6)
plt.scatter(y_test_pred_ridge,y_test_pred_ridge - y_test,c="r",alpha=0.6)
plt.hlines(y=0,xmin=0,xmax=30,color="b",alpha=0.6)
plt.ylabel("Residuals")
plt.xlabel("Predict")
<matplotlib.text.Text at 0x185da280dd8>

(Figure: residual plot for the ridge model — training residuals in green, test residuals in red)

In the residual plot, the test-set points (red) largely overlap the training-set points (green), indicating that the model generalizes well.
