基于身高与体重数据集与Auto数据集分别进行线性回归和Lasso回归（代码逐行讲解，超细节）

本文链接：https://blog.csdn.net/zxf2ld/article/details/124892325

本文通过散点图展示了身高与体重之间的关系，并使用线性回归模型分析了两者间的显著性。R2值约为0.6，表明身高对体重有显著影响。接着，运用Lasso回归进行特征选择，发现在众多特征中，只有少数几个特征对目标变量有显著贡献。最后，通过残差图评估了模型的拟合效果。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

身高体重

#还是先导入要用的包，没下载的要先去下载依赖包
import pandas as pd
import statsmodels.api as sm
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True) #此句的作用是用来设置用户名与API

data = pd.read_csv('weight.csv') 
#读取weight数据集，可以把数据集放在当前目录下，像我这样直接读，也可以用绝对路径读数据
data = data[data.height > 120] #选出身高大于120的数据

#废话不多说，先上图
trace = go.Scatter(x=data.height, y=data.weight, mode='markers') 
#生成绘制散点图的数据，它是Scatter()格式的，里面是字典，键就是mode,x,y,值就是对应的值

layout = go.Layout( #设置绘图的参数layout，这里其实就是把绘图的一些参数返回给layout啦
    width=900,
    height=500,
    xaxis=dict(title='Height',titlefont=dict(family='Consolas, monospace',size=15)),
    yaxis=dict(title='Weight',titlefont=dict(family='Consolas, monospace',size=15))
)
#layout中保存的其实也是一种字典的格式

data2 = [trace] #要把trace套一个[]，因为传入的参数要是列表格式的
fig = go.Figure(data=data2, layout=layout) #传入数据与参数，进行绘图

iplot(fig) #画出来图，这里是iplot不是plot

这个图可以点点看看，很好玩。

X = data.height #就是身高
y = data.weight #就是体重
X = sm.add_constant(X) #增加一个全为1的列

model = sm.OLS(y, X).fit() #OLS建模然后拟合数据，返回结果

model.summary() #查看结果摘要

R2是0.594，F统计量的P值在10的-40次方，t统计量的P值也都接近于0，从指标的分析上来看，R2虽然不算很大，但是也接近0.6了，然后P值也都比较小，所以回归还是显著的，身高与体重之间也有着显著的线性关系啦

$weight = 1.1492 * height - 130.747$

#绘图模块
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(data.height, y, 'o', label="data")
ax.plot(data.height, model.fittedvalues, 'r--.', label="OLS")
ax.legend(loc='best')
plt.show()

基于Auto数据集进行Lasso回归

数据集简介

主要包括3类指标:

汽车的各种特性.
保险风险评级：(-3, -2, -1, 0, 1, 2, 3).
每辆保险车辆年平均相对损失支付.

类别属性

make: 汽车的商标（奥迪，宝马。。。）
fuel-type: 汽油还是天然气
aspiration: 涡轮
num-of-doors: 两门还是四门
body-style: 硬顶车、轿车、掀背车、敞篷车
drive-wheels: 驱动轮
engine-location: 发动机位置
engine-type: 发动机类型
num-of-cylinders: 几个气缸
fuel-system: 燃油系统

连续指标

bore: continuous from 2.54 to 3.94.
stroke: continuous from 2.07 to 4.17.
compression-ratio: continuous from 7 to 23.
horsepower: continuous from 48 to 288.
peak-rpm: continuous from 4150 to 6600.
city-mpg: continuous from 13 to 49.
highway-mpg: continuous from 16 to 54.
price: continuous from 5118 to 45400.

需要Auto数据集的可以自行下载，希望届时您可以点一个star

https://github.com/zxf5912/data-mining-with-python.git

数据读取与分析

# 加载基础包
import numpy as np
import pandas as pd
from datetime import datetime

# 数据可视化
import matplotlib.pyplot as plt
import seaborn as sns # 可视化的包
import missingno as msno # 处理缺失值的包
%matplotlib inline

# 导入统计包
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.metrics import mean_squared_error, r2_score

# 导入机器学习依赖包
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score 
from sklearn.ensemble import RandomForestRegressor
seed = 123

# 加载数据 ( 将内容为?的值设为缺失值NaN)
data = pd.read_csv("E:/AI/ML/part3_Math/Statistic/Regression/Auto_predict/Auto-Data.csv", na_values = '?')
data.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

可以看到数据集中有这些指标

data.dtypes #查看各变量（指标）的类型

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64

# 查看数据本身的形状与开头数据
print("In total: ",data.shape)
data.head(5)

In total:  (205, 26)

这是一个比较小的数据集:只有 205 条数据, 26个特征。

data.describe() #展示数值信息的统计分析

缺失值处理

sns.set(style = "ticks") #修改背景风格
msno.matrix(data) #一个缺失值可视化函数，白色部分指的是缺失值

可以看到变量normalized-losses 缺失的比较严重哦。

data[pd.isnull(data['normalized-losses'])].head() 
#data['normalized-losses']取出变量normalized-losses，判断空元素的位置，返回布尔值，用data取出为True的位置的值

sns.set(style = "ticks")
plt.figure(figsize = (12, 5)) #生成画布
c = '#366DE8' #调节色号

plt.subplot(121) #分图设置，1行2列的第一个
cdf = ECDF(data['normalized-losses']) #生成累积分布数据，可以看出哪里有缺失值，缺失部分没有累积增加量
plt.plot(cdf.x, cdf.y, label = "statmodels", color = c); #label图例名，cdf.x得到cdf数据的横轴，即normalized-losses的取值，纵轴是对应的频率
plt.xlabel('normalized losses'); plt.ylabel('ECDF'); #添加坐标轴名

plt.subplot(122)
plt.hist(data['normalized-losses'].dropna(), 
         bins = int(np.sqrt(len(data['normalized-losses']))),
         color = c); 
#数据实际分布情况，先去除缺失值，然后bins分组，这个是根据经验的公式分的，合适就行

可以发现 80% 的 normalized losses 是低于200 并且绝大多数低于125.

一个基本的想法就是用中位数来进行填充，但是我们得来想一想，这个特征跟哪些因素可能有关呢？应该是保险的情况吧，所以我们可以分组来进行填充这样会更精确一些。

首先来看一下对于不同风险情况的统计指标:

data.groupby('symboling')['normalized-losses'].describe() #按风险等级symboling分组再分别提取normalized-losses并describe
#发现不同风险等级，其分布不同，所以应该分组进行填充

symboling不同的组，它们的均值都不同，差别还挺大的，直接用总体均值进行填充，不太合理，应该分组用各组的均值进行填充比较好

data = data.dropna(subset = ['price', 'bore', 'stroke', 'peak-rpm', 'horsepower', 'num-of-doors']) #指定缺失值较少的列，直接删除对应的行
data['normalized-losses'] = data.groupby('symboling')['normalized-losses'].transform(lambda x: x.fillna(x.mean())) 
#agg分组聚合，但是只会给出聚合值，而transform是可以新增一列，可以拿进来进行运算使用，apply的作用与transform相似，且可以传入def函数，但是耗时长
#这里就是分组进行各组的均值填充
print('In total:', data.shape)
data.head()

可以看到NaN都被填充上了，没有缺失值了。

特征相关性

cormatrix = data.corr() #查看相关系数阵
cormatrix

cormatrix *= np.tri(*cormatrix.values.shape, k=-1).T 
#tri生成下三角矩阵，k=-1就是多清零一对角线，然后转置成上三角矩阵，把对角线上的置0，其他元素为1
#cormatrix.values.shape是(M,N)，传入行列的参数通过*展开，就变成了可罗列的数值M N
#cormatrix与该矩阵对应位置相乘，得到cormatrix的上三角矩阵，让他们不是最高的。
cormatrix

可以看到变成上三角矩阵啦

cormatrix = cormatrix.stack() #将cormatrix中任意两个指标的相关系数按列排出
cormatrix

symboling  symboling            0.000000
           normalized-losses    0.593658
           wheel-base          -0.536516
           length              -0.363194
           width               -0.247741
                                  ...   
price      horsepower           0.000000
           peak-rpm            -0.000000
           city-mpg            -0.000000
           highway-mpg         -0.000000
           price                0.000000
Length: 256, dtype: float64

cormatrix = cormatrix.reindex(cormatrix.abs().sort_values(ascending=False).index).reset_index()
#cormatrix.abs().sort_values(ascending=False)先将相关系数取绝对值，然后按取值大小降序排列
#.index获取对应的索引值，这里是双重索引哦，然后进行reindex就是按照上面这个格式重新排序，reset_index则是生成一列新的数字索引

cormatrix

cormatrix.columns = ["FirstVariable", "SecondVariable", "Correlation"] #对三列分别命名
cormatrix.head(10)

city_mpg 和 highway-mpg 相关性很高，可以去掉一个啦，我们就去掉city_mpg吧. 对于这个长宽高，他们应该存在某种配对关系，给他们合体吧！curb-weight与很多指标相关性高，为了防止多重共线性，也要删除它！

data['volume'] = data.length * data.width * data.height #将长宽高合并成一个指标，体积

data.drop(['width', 'length', 'height', 
           'curb-weight', 'city-mpg'], 
          axis = 1, # 1指的是按列处理，这里的处理指的是删除
          inplace = True) #inplace = True指的是覆盖掉原来的数据

data.columns #查看data的新变量

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'engine-type', 'num-of-cylinders', 'engine-size',
       'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'highway-mpg', 'price', 'volume'],
      dtype='object')

可以发现有变量被删除了，有变量被加进来了。也可以直接画出来。

corr_all = data.corr() # 重新生成相关系数阵

# Generate a mask for the upper triangle
mask = np.zeros_like(corr_all, dtype = np.bool) #获取corr_all的形状，生成全为False的矩阵
mask[np.triu_indices_from(mask)] = True #np.triu_indices_from(mask)就是获取mask的上三角部分的索引index，然后用mask将其取出，将mask的上三角部分置为True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (11, 9)) #定制生成画布，f是指画布对象，ax是子图对象

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_all, mask = mask,
            square = True, linewidths = .5,ax = ax,cmap = "BuPu") #mask为True的地方不绘图，square=True就是画出正方块，linewidths是分隔方块的细线宽度
#ax就是绘图大小定制，cmap就是绘图色板
#plt.show()

看起来 price 跟这几个变量的相关程度比较大： wheel-base,enginine-size, bore,horsepower.

sns.pairplot(data, hue = 'fuel-type', palette = 'plasma') #返回任意两变量的散点图，hue是根据某指标对颜色进行分类，palette是色板

画个图来看看：

让我们仔细看看价格和马力变量之间的关系

sns.lmplot('price', 'horsepower', data, 
           hue = 'fuel-type', col = 'fuel-type',  row = 'num-of-doors', 
           palette = 'plasma', 
           fit_reg = True); #price作为横轴变量，horsepower是纵轴，会绘制四个子图，行是按num-of-doors分，列是按fuel-type分，颜色也是按fuel-type分，fit_reg就是绘制回归图

事实上，对于燃料的类型和数门变量，我们看到，在一辆汽车马力的增加与价格成比例的增加相关的各个层面。

预处理

如果一个特性的方差比其他的要大得多，那么它可能支配目标函数，使估计者不能像预期的那样正确地从其他特性中学习。这就是为什么我们需要首先对数据进行缩放。

对连续值进行标准化：

target = data.price #将price作为目标变量，price就是因变量

regressors = [x for x in data.columns if x not in ['price']] #取出除price以外的列，赋给regressors
features = data.loc[:, regressors] #按regressors中的列，来取data数据

num = ['symboling', 'normalized-losses', 'volume', 'horsepower', 'wheel-base',
       'bore', 'stroke','compression-ratio', 'peak-rpm'] #将连续型变量都取出

standard_scaler = StandardScaler() #标准化函数
features[num] = standard_scaler.fit_transform(features[num]) #features[num]先通过feature将num中的连续变量都取出
#standard_scaler.fit_transform对这些变量先拟合数据，再进行标准化，再返回给features[num]

features.head()

可以看到'symboling', 'normalized-losses'等变量都已经标准化了，不再是很大很大的数了，这样就不会让模型认为数据大的变量影响很大了，而是按数据的相对水平进行拟合。

对分类属性就行one-hot编码：

# categorical vars
classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system'] #取出所有分类变量

# create new dataset with only continios vars 
dummies = pd.get_dummies(features[classes]) #对分类变量进行编码，将分类变量全都拆成0-1哑变量
features = features.join(dummies).drop(classes, 
                                       axis = 1) #drop掉原来的分类变量，里面的参数是标签集，然后用join把哑变量添加到features表中，参数是数据框

# new dataset
print('In total:', features.shape)
features.head()

可以看到分类变量，被分成了多个类别的哑变量了。

划分数据集：

X_train, X_test, y_train, y_test = train_test_split(features, target, 
                                                    test_size = 0.3,
                                                    random_state = seed) #train_test_split是切分训练集与测试集的函数，test_size是测试集比例，seed是随机种子，可以按需求去设置，这里在最前面就设好了

Lasso回归

多加了一个绝对值项来惩罚过大的系数，alphas=0那就是最小二乘了

alphas = 2. ** np.arange(2, 12) #形成一个alpha数组
print(alphas)
scores = np.empty_like(alphas) #根据传入的数组生成与其形状和元素类型一致的新数组

#求得每个alpha的拟合得分
for i, a in enumerate(alphas): #i是数字的索引，a是对应的数值，但这只是为了画图。。。和交叉验证无关系
    lasso = Lasso(random_state = seed) #得到lasso模型
    lasso.set_params(alpha = a) #将a的值设为惩罚项的系数
    lasso.fit(X_train, y_train) #将训练数据代入模型进行拟合
    scores[i] = lasso.score(X_test, y_test) #计算每个i下的lasso回归的得分
    
lassocv = LassoCV(cv = 10, random_state = seed)#进行10折交叉验证来寻优，找出最优的惩罚项系数
lassocv.fit(features, target) #拟合数据
lassocv_score = lassocv.score(features, target) #得出最大拟合得分，拟合越好，该得分越高
lassocv_alpha = lassocv.alpha_ #获取到这个
plt.figure(figsize = (10, 4))
plt.plot(alphas, scores, '-ko')
plt.axhline(lassocv_score, color = c) #在lassocv_score处画出水平线
plt.xlabel(r'$\alpha$')
plt.ylabel('CV Score')
plt.xscale('log', base = 2) #x轴的刻度单位是指数级变化的，即2的2次方的下一个刻度就是2的3次方
sns.despine(offset = 15) #去边框，15就是两坐标轴离边框的距离

print('CV results:', lassocv_score, lassocv_alpha)

[   4.    8.   16.   32.   64.  128.  256.  512. 1024. 2048.]
CV results: 0.8319664321686158 342.0081418966275

上面是我们设置的正则化系数（或者叫惩罚项系数）数组，会从里面选出最好的一个来，然后后面这个是没有传入正则化系数进行拟合得到的score和alpha，得到的这个得分是0.83，正则化系数alpha是342。

这个图好像有点问题，按理说应该是取到最左侧的最高点的，不过这里也不是重点，先看后面。

coefs = pd.Series(lassocv.coef_, index = features.columns) #拿出lassocv的回归系数，索引就是列名，也就是变量名了
# prints out the number of picked/eliminated features
print("Lasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features.") #coefs != 0找出不为0的系数，按布尔值求和，可以实现计数功能，同理找出为0的系数

# takes first and last 10
coefs = pd.concat([coefs.sort_values().head(5), coefs.sort_values().tail(5)]) #找出最大的coefs与最小的coefs，其实是绝对值最大的系数们，拼接到一起

plt.figure(figsize = (10, 4))
coefs.plot(kind = "barh", color = c) #发现不为0的只有6个变量
plt.title("Coefficients in the Lasso Model")
plt.show()

Lasso picked 6 features and eliminated the other 60 features.

这里只有6个非0的系数，也就只有这6个有效变量了。

这个图展示了前五大的系数，和最小的五个系数（负数），其实就是绝对值大的十个系数，但可以发现，只有六个绝对值大于0的，中间都是0，所以也可以看出只有6个有效自变量。

model_l1 = LassoCV(alphas = alphas,cv = 10, random_state = seed).fit(X_train, y_train)  #相当于用X_train，y_train做交叉验证，可以自动分训练集与验证集
#传入alphas(惩罚项系数或正则化参数)数组，会在其中找一个最好的参数，如果不传alphas参数，会自动寻找alpha，但是效果不见得会好
y_pred_l1 = model_l1.predict(X_test) #利用训练好的模型model_l1进行测试集预测

score = model_l1.score(X_test, y_test) #根据测试集X_test算出y_pred_l1，然后与y_test比较，算出得分
print(score)

0.8277858219151173

看了一下，通过10交叉验证在设置的alpha序列这个参数中选取最优的正则化系数，拟合出的Lasso回归模型，将测试集代进去，得出score为0.827。

plt.rcParams['figure.figsize'] = (6.0, 6.0)

preds = pd.DataFrame({"preds": model_l1.predict(X_train), "true": y_train})
preds["residuals"] = preds["true"] - preds["preds"] #残差
preds.plot(x = "preds", y = "residuals", kind = "scatter", color = c) #画残差图

画一下残差图：

def MSE(y_true,y_pred):
    mse = mean_squared_error(y_true, y_pred) #计算mse
    print('MSE: %2.3f' % mse)
    return mse

def R2(y_true,y_pred):    
    r2 = r2_score(y_true, y_pred) #计算拟合优度r2
    print('R2: %2.3f' % r2)     
    return r2

MSE(y_test, y_pred_l1); R2(y_test, y_pred_l1);

MSE: 3993172.564
R2: 0.828

MSE好大，可能是因为数据绝对水平太高了，从上面的残差图也能看出来，相对误差还好，但是绝对误差没法看，所有R2看着还好。

d = {'true' : list(y_test),
     'predicted' : pd.Series(y_pred_l1)
    } #生成真实值与预测值

pd.DataFrame(d).head()

可以看一下预测值与真实值的对比，一样的，看绝对误差，就感觉很大，但是相对误差感觉还能接受。

整理分析不易，各位走过路过千万不要错过！求点赞，求收藏，求转发，求关注~

另外有任何问题，欢迎私信我解决，每一个程序猿都想为中国的开源事业出一份力~