A Baseline for Data Analysis and Modeling (Python Linear Regression: Car Price Prediction)

Latest recommended article published on 2024-06-08 15:38:10

andy_alexander


Article link: https://blog.csdn.net/weixin_42503575/article/details/91042556


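When working through a data analysis case, I used to do scattered bits here and there without a clear line of thought. Drawing on material I studied online, and using car price prediction with linear regression as the worked case, I put together an analysis framework based on my own understanding, to serve as a checklist. With it, a new task or a new data set can be worked through fairly smoothly, and switching to a different model only requires replacing the corresponding part. I hope it helps anyone who needs it.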

Preparation: importing packages

Only the commonly used packages are listed here; add others as needed.

# Core packages
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # small toolkit for visualizing missing data

# Statistics and metrics
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.metrics import mean_squared_error, r2_score

# Preprocessing, models, and model selection
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

seed = 123  # random seed

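Loading the data

In practice data can come from many sources; here it is simply read from a file. If the source changes, only this step needs to change.

A reader has shared the data on a network drive:
https://pan.baidu.com/s/1H7RWWMmb_mXXm2gKjd2E5w (extraction code: 9fbq)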

csv_dir = r'线性回归_汽车数据.csv'
# Note: na_values must be specified here; otherwise the missing values
# will not show up in the missing-value visualization later.
# data = pd.read_csv(csv_dir)
data = pd.read_csv(csv_dir, na_values='?')
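To see why `na_values='?'` matters, here is a minimal sketch on a tiny in-memory CSV. The three rows are made up for illustration and are not taken from the real file; only the `?` convention comes from the code above.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as the car data;
# '?' marks a missing value.
sample_csv = io.StringIO(
    "make,horsepower,price\n"
    "audi,102,13950\n"
    "renault,?,9295\n"
    "porsche,207,?\n"
)

raw = pd.read_csv(sample_csv)                   # '?' is kept as text
sample_csv.seek(0)
clean = pd.read_csv(sample_csv, na_values='?')  # '?' becomes NaN

print(raw.dtypes)          # horsepower and price are read as text, not numbers
print(clean.dtypes)        # horsepower and price are read as float64
print(clean.isna().sum())  # missing values are now countable per column
```

Without the parameter, a single `?` turns the whole column into text, so neither `describe()` nor a missing-value plot treats it as a numeric column with gaps.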

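Exploring the data

Following the advice in 《商业数据分析指南》, data exploration mainly consists of the following parts:

0 Understand the data types and the basic shape of the data
1 Data quality check: mainly checking the data for errors, such as misspelled category values (for example, female written as fmale)
2 Outlier detection: mainly through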
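The text breaks off before naming an outlier detection method. As one common option, and purely as an assumption rather than the author's stated choice, the interquartile range (IQR) rule flags values far outside the middle 50% of the data. The helper name `iqr_outliers` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical prices; the last value is deliberately extreme.
prices = pd.Series([5118, 7775, 10295, 13495, 16500, 17450, 45400])
outliers = iqr_outliers(prices)
print(outliers)
```

A flagged value is only a candidate: a genuinely expensive car is not an error, so each one still needs a look before it is dropped or capped.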

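Data overview

The following can be read as a data dictionary: the value ranges and types that follow from the business meaning of each field. Later checks should confirm that the data falls within these ranges.
Note that for this data set some of the listed ranges are simply those observed in the data rather than the possible ranges from a business perspective; keep the two apart.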

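The data set contains three kinds of attributes:

The various characteristics of the car.

Insurance risk rating: (-3, -2, -1, 0, 1, 2, 3).

Relative average loss payment per insured vehicle per year.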

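Categorical attributes

make: car brand (audi, bmw, ...)

fuel-type: gas (gasoline) or diesel

aspiration: air intake, standard (std) or turbo

num-of-doors: two or four

body-style: hardtop, sedan, hatchback, convertible, wagon

drive-wheels: driven wheels

engine-location: engine location

engine-type: engine type

num-of-cylinders: number of cylinders

fuel-system: fuel system

Continuous attributes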

bore: continuous from 2.54 to 3.94.

stroke: continuous from 2.07 to 4.17.

compression-ratio: continuous from 7 to 23.

horsepower: continuous from 48 to 288.

peak-rpm: continuous from 4150 to 6600.

city-mpg: continuous from 13 to 49.

highway-mpg: continuous from 16 to 54.

price: continuous from 5118 to 45400.
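The ranges in this data dictionary can be turned into an automatic check. The sketch below counts, per column, how many non-missing values fall outside the documented range. The helper name `out_of_range` and the demo rows are hypothetical; the ranges are copied from the list above.

```python
import pandas as pd

# Ranges copied from the data dictionary above.
expected_ranges = {
    'bore': (2.54, 3.94),
    'stroke': (2.07, 4.17),
    'compression-ratio': (7, 23),
    'horsepower': (48, 288),
    'peak-rpm': (4150, 6600),
    'city-mpg': (13, 49),
    'highway-mpg': (16, 54),
    'price': (5118, 45400),
}

def out_of_range(df, ranges):
    """Count non-missing values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            values = df[col].dropna()
            counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Hypothetical rows; the 300 horsepower entry is deliberately out of range.
demo = pd.DataFrame({'horsepower': [111.0, 300.0, None],
                     'price': [13495.0, 16500.0, 13950.0]})
print(out_of_range(demo, expected_ranges))
```

On the real data the same call would be `out_of_range(data, expected_ranges)`. Because some of the listed ranges were taken from this data set itself, a count of zero there confirms consistency rather than business plausibility.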

# Inspect the data types: which columns are categorical, which are numeric, and whether any need converting
data.dtypes

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object
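Rather than reading the column types off the listing by eye, the numeric and categorical columns can be separated programmatically with `select_dtypes`. This is a small sketch on made-up rows; on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical two-row frame using column names from the data set.
demo = pd.DataFrame({
    'make': ['audi', 'bmw'],
    'fuel-type': ['gas', 'diesel'],
    'horsepower': [102.0, 101.0],
    'city-mpg': [24, 23],
})

numeric_cols = demo.select_dtypes(include='number').columns.tolist()
categorical_cols = demo.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

A numeric dtype does not always mean a numeric meaning: `symboling` is stored as int64 but is a risk rating, so the resulting lists are a starting point to review, not a final answer.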

print(data.shape)
data.head(5)

(205, 26)

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	13495.0
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	16500.0
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154.0	5000.0	19	26	16500.0
3	2	164.0	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102.0	5500.0	24	30	13950.0
4	2	164.0	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115.0	5500.0	18	22	17450.0

5 rows × 26 columns

print(data.columns)
# Descriptive statistics for the numeric columns
# describe() returns a DataFrame
data_desc = data.describe()
data_desc

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

	symboling	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
count	205.000000	164.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	201.000000	201.000000	205.000000	203.000000	203.000000	205.000000	205.000000	201.000000
mean	0.834146	122.000000	98.756585	174.049268	65.907805	53.724878	2555.565854	126.907317	3.329751	3.255423	10.142537	104.256158	5125.369458	25.219512	30.751220	13207.129353
std	1.245307	35.442168	6.021776	12.337289	2.145204	2.443522	520.680204	41.642693	0.273539	0.316717	3.972040	39.714369	479.334560	6.542142	6.886443	7947.066342
min	-2.000000	65.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	2.540000	2.070000	7.000000	48.000000	4150.000000	13.000000	16.000000	5118.000000
25%	0.000000	94.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	97.000000	3.150000	3.110000	8.600000	70.000000	4800.000000	19.000000	25.000000	7775.000000
50%	1.000000	115.000000	97.000000	173.200000	65.500000	54.100000	2414.000000	120.000000	3.310000	3.290000	9.000000	95.000000	5200.000000	24.000000	30.000000	10295.000000
75%	2.000000	150.000000	102.400000	183.100000	66.900000	55.500000	2935.000000	141.000000	3.590000	3.410000	9.400000	116.000000	5500.000000	30.000000	34.000000	16500.000000
max	3.000000	256.000000	120.900000	208.100000	72.300000	59.800000	4066.000000	326.000000	3.940000	4.170000	23.000000	288.000000	6600.000000	49.000000	54.000000	45400.000000

Checking data values

For the categorical columns, list every value that occurs and check for errors or omissions.

classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system']

for each in classes:
    print(each + ':\n')
    print(list(data[each].drop_duplicates()))
    print('\n')

make:

['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo']


fuel-type:

['gas', 'diesel']


aspiration:

['std', 'turbo']


num-of-doors:

['two', 'four', nan]


body-style:

['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop']


drive-wheels:

['rwd', 'fwd', '4wd']


engine-location:

['front', 'rear']


engine-type:

['dohc', 'ohcv', 'ohc', 'l', 'rotor', 'ohcf', 'dohcv']


num-of-cylinders:

['four', 'six', 'five', 'three', 'twelve', 'two', 'eight']


fuel-system:

['mpfi', '2bbl', 'mfi', '1bbl', 'spfi', '4bbl', 'idi', 'spdi']
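Reading the lists by eye works for a handful of categories. For longer lists, the misspelling check mentioned earlier (female written as fmale) can be partly automated by flagging labels that are spelled almost alike. This is a sketch using the standard library's `difflib`; the helper name `near_duplicates` and the 0.85 cutoff are assumptions chosen for the example.

```python
import difflib

def near_duplicates(values, cutoff=0.85):
    """Return pairs of distinct labels whose spelling is suspiciously similar."""
    labels = sorted({str(v) for v in values})
    pairs = []
    for i, label in enumerate(labels):
        # Compare each label only with the labels after it, so each pair appears once.
        matches = difflib.get_close_matches(label, labels[i + 1:], n=3, cutoff=cutoff)
        pairs.extend((label, match) for match in matches)
    return pairs

# Hypothetical gender column containing one misspelled label.
print(near_duplicates(['female', 'male', 'fmale', 'female']))
```

Here `fmale` is flagged against both `female` and `male`, while the two valid labels are not paired with each other. On the car data the call would look like `near_duplicates(data['make'].dropna())`; any pair it reports still needs a human decision about which spelling is intended.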

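Handling missing values

Ways to handle missing values:
1. When few values are missing (under 1%), the rows containing NaN can simply be dropped;
2. Fill in with the mean or the mode of the existing values;
3. Fit a regression model on the known values and predict the missing ones.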
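The first two options can be sketched in a few lines of pandas. The rows below are made up for illustration, and which column gets which treatment is an assumption for the example, not the author's choice. Option 3 (predicting the missing values with a model) is not shown.

```python
import pandas as pd

# Hypothetical rows with gaps, using column names from the data set.
demo = pd.DataFrame({
    'num-of-doors': ['two', 'four', None, 'four', 'four'],
    'horsepower': [110.0, None, 150.0, 102.0, 100.0],
    'price': [13495.0, 16500.0, 16500.0, None, 17450.0],
})

# Option 1: drop the rows where a value (here the target, price) is missing.
kept = demo.dropna(subset=['price'])

# Option 2: fill a numeric column with its mean and a categorical column with its mode.
filled = kept.fillna({
    'horsepower': kept['horsepower'].mean(),
    'num-of-doors': kept['num-of-doors'].mode()[0],
})

print(filled)
```

Computing the mean and the mode after dropping rows means the fill values reflect only the rows that remain. In a modeling workflow they would normally be computed on the training split alone, to avoid leaking information from the test split.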

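To inspect the missing values, use the visualization tools provided by missingno, or count them and look at the number and proportion of missing values in each column.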
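The counting approach can be sketched as a small helper that reports, per column, how many values are missing and what share of the rows that is. The name `missing_summary` and the demo rows are hypothetical.

```python
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, largest first."""
    count = df.isna().sum()
    percent = (count / len(df) * 100).round(2)
    summary = pd.DataFrame({'missing': count, 'percent': percent})
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Hypothetical rows using column names from the data set.
demo = pd.DataFrame({
    'normalized-losses': [None, None, 164.0, 164.0],
    'num-of-doors': ['two', None, 'four', 'four'],
    'price': [13495.0, 16500.0, 13950.0, 17450.0],
})
print(missing_summary(demo))
```

On the real data the call would be `missing_summary(data)`. The `describe()` output earlier already hints at the result: `normalized-losses` has 164 non-missing values out of 205 rows, so 41 values (20%) are missing.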

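After the outliers have been handled, no missing values remain. If you follow the approach in this article, handle the missing values first.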

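# Visualize the missing values
# Available seaborn styles:
# darkgrid   dark background with grid (default)
# whitegrid  white background with grid
# dark       dark background
# white      white background
# ticks      white background with axis ticks
sns.set(style='ticks')  # set the seaborn style
# Note: when reading the CSV, the missing-value marker must be declared,
# e.g. na_values='?'; otherwise the gaps will not show up here
msno.matrix(data)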

[Figure: missing-value matrix produced by msno.matrix(data)]
