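When I worked through a data analysis case, I tended to do things in scattered pieces, a bit here and a bit there, without a clear line of thought. Drawing on what I learned online and on the case of predicting car prices with linear regression, and adding my own understanding, I built an analysis framework to use as a checklist. When facing a new task or a new data set, it lets the work proceed fairly smoothly. To switch to a different model, only the corresponding part needs to be replaced. I hope it is useful to anyone who needs it.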
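Preparation: importing packages
Only the commonly used packages are listed here; add others as needed while you work.
# Import the core packages
import numpy as np
import pandas as pd
# Import the visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
# missingno: a small toolkit for visualizing missing data
import missingno as msno
# Statistics and modeling functions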
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
seed = 123
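Getting the data
In practice data arrives in many different ways. This post only shows the simple case of reading from a file; if the source changes, this is the only place that needs to change.
A reader shared a network drive from which the data can be downloaded:
https://pan.baidu.com/s/1H7RWWMmb_mXXm2gKjd2E5w extraction code: 9fbq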
csv_dir = r'线性回归_汽车数据.csv'
# Note: na_values must be specified here, otherwise the missing values are not displayed correctly in the missing-value plot
# data = pd.read_csv(csv_dir)
data = pd.read_csv(csv_dir, na_values='?')
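To see why `na_values='?'` matters, here is a minimal sketch on a made-up three-row CSV (the rows and the `sample_csv` name are assumptions for illustration, not the real file). Without the argument, a column containing `?` is read as text; with it, the `?` cells become NaN and the column stays numeric.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the '?' placeholder used in the car data
sample_csv = io.StringIO(
    "make,horsepower,price\n"
    "audi,102,13950\n"
    "renault,?,9295\n"
    "porsche,288,?\n"
)

# Without na_values, '?' is kept as a string and the columns are not numeric
raw = pd.read_csv(sample_csv)

# Rewind the buffer and read again, this time treating '?' as missing
sample_csv.seek(0)
parsed = pd.read_csv(sample_csv, na_values='?')

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isna().sum())
```

Comparing the two `dtypes` printouts is a quick way to check whether a placeholder such as `?` slipped through as text.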
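Exploring the data
Following the advice in the book 《商业数据分析指南》, exploring the data mainly consists of these parts:
- 0 Understand the data types and the basic shape of the data
- 1 Data quality check: mainly checking the data for errors, for example whether a gender column contains misspellings such as female written as fmale, and so on
- 2 Outlier detection: mainly through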
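The spelling check in step 1 can be sketched as follows. This is only an illustration under assumptions: the `gender` column and the `allowed` list are made up, and `difflib.get_close_matches` from the standard library is my choice for suggesting the intended value, not something the book prescribes.

```python
import difflib
import pandas as pd

# Hypothetical column containing one misspelled category
gender = pd.Series(['female', 'male', 'fmale', 'male', 'female'])

# Assumed list of valid values, as it would come from a data dictionary
allowed = ['female', 'male']

# Values that appear in the data but are not in the allowed list
unexpected = sorted(set(gender.dropna()) - set(allowed))

# For each unexpected value, suggest the closest allowed spelling (if any)
suggestions = {
    value: difflib.get_close_matches(value, allowed, n=1)
    for value in unexpected
}
print(suggestions)
```

The suggestions are only hints; whether an unexpected value is a typo or a legitimate new category still needs a human decision.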
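Data overview
The descriptions below can be read as a data dictionary: value ranges and types derived from the business meaning of each field. Later checks should verify that the data falls within them.
For this data set, however, some of the listed ranges are those observed in the actual data rather than the ranges that are possible from a business point of view. Keep the two apart.
The data covers three main kinds of information:
- The various characteristics of a car.
- Insurance risk rating: (-3, -2, -1, 0, 1, 2, 3).
- Average relative loss payment per insured vehicle per year.
- Categorical attributes
  - make: the car's brand (Audi, BMW, ...)
  - fuel-type: gasoline or diesel
  - aspiration: standard or turbocharged
  - num-of-doors: two doors or four doors
  - body-style: hardtop, sedan, hatchback, convertible, wagon
  - drive-wheels: which wheels are driven
  - engine-location: where the engine is located
  - engine-type: type of engine
  - num-of-cylinders: number of cylinders
  - fuel-system: fuel system
- Continuous attributes
  - bore: continuous from 2.54 to 3.94.
  - stroke: continuous from 2.07 to 4.17.
  - compression-ratio: continuous from 7 to 23.
  - horsepower: continuous from 48 to 288.
  - peak-rpm: continuous from 4150 to 6600.
  - city-mpg: continuous from 13 to 49.
  - highway-mpg: continuous from 16 to 54.
  - price: continuous from 5118 to 45400.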
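A data dictionary like this can be turned into an automatic check. The sketch below is one possible way to do it, not the method of the original case: the ranges are copied from the list above, while the `sample` rows and the helper `out_of_range_counts` are hypothetical.

```python
import pandas as pd

# Ranges copied from the data dictionary above
expected_ranges = {
    'horsepower': (48, 288),
    'peak-rpm': (4150, 6600),
    'price': (5118, 45400),
}

# Hypothetical rows; the second horsepower value is deliberately out of range
sample = pd.DataFrame({
    'horsepower': [111.0, 500.0, 102.0],
    'peak-rpm': [5000.0, 5500.0, 5500.0],
    'price': [13495.0, 16500.0, 13950.0],
})

def out_of_range_counts(df, ranges):
    """Count the non-missing values outside [low, high] for each column."""
    counts = {}
    for column, (low, high) in ranges.items():
        values = df[column].dropna()
        counts[column] = int((~values.between(low, high)).sum())
    return counts

violations = out_of_range_counts(sample, expected_ranges)
print(violations)
```

Missing values are skipped on purpose here, so that range violations and missing data can be reported separately.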
# Inspect the data types: which columns are categorical, which are numeric, and whether any type needs converting
data.dtypes
symboling int64
normalized-losses float64
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower float64
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object
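Instead of reading the dtype list by eye, the columns can be split into numeric and non-numeric groups with `DataFrame.select_dtypes`. A minimal sketch on a made-up two-row frame (the values are illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical miniature of the car data with mixed column types
sample = pd.DataFrame({
    'make': ['audi', 'bmw'],
    'fuel-type': ['gas', 'diesel'],
    'curb-weight': [2337, 2395],
    'price': [13950.0, 16430.0],
})

# Numeric columns (int and float) versus everything else
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Applied to `data`, the second list should correspond to the `object` columns shown above, which are the ones that will need encoding before modeling.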
print(data.shape)
data.head(5)
(205, 26)
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 26 columns
print(data.columns)
# Compute descriptive statistics for the data
# The result is returned as a DataFrame
data_desc = data.describe()
data_desc
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg', 'price'],
dtype='object')
| | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 205.000000 | 164.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 201.000000 | 201.000000 | 205.000000 | 203.000000 | 203.000000 | 205.000000 | 205.000000 | 201.000000 |
mean | 0.834146 | 122.000000 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329751 | 3.255423 | 10.142537 | 104.256158 | 5125.369458 | 25.219512 | 30.751220 | 13207.129353 |
std | 1.245307 | 35.442168 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273539 | 0.316717 | 3.972040 | 39.714369 | 479.334560 | 6.542142 | 6.886443 | 7947.066342 |
min | -2.000000 | 65.000000 | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 |
25% | 0.000000 | 94.000000 | 94.500000 | 166.300000 | 64.100000 | 52.000000 | 2145.000000 | 97.000000 | 3.150000 | 3.110000 | 8.600000 | 70.000000 | 4800.000000 | 19.000000 | 25.000000 | 7775.000000 |
50% | 1.000000 | 115.000000 | 97.000000 | 173.200000 | 65.500000 | 54.100000 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5200.000000 | 24.000000 | 30.000000 | 10295.000000 |
75% | 2.000000 | 150.000000 | 102.400000 | 183.100000 | 66.900000 | 55.500000 | 2935.000000 | 141.000000 | 3.590000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 34.000000 | 16500.000000 |
max | 3.000000 | 256.000000 | 120.900000 | 208.100000 | 72.300000 | 59.800000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | 23.000000 | 288.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 |
Checking data values
For the categorical columns, list every value that occurs and check for mistakes or omissions.
classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors',
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system']
# Print the distinct values of each categorical column
for each in classes:
    print(each + ':\n')
    print(list(data[each].drop_duplicates()))
    print('\n')
make:
['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo']
fuel-type:
['gas', 'diesel']
aspiration:
['std', 'turbo']
num-of-doors:
['two', 'four', nan]
body-style:
['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop']
drive-wheels:
['rwd', 'fwd', '4wd']
engine-location:
['front', 'rear']
engine-type:
['dohc', 'ohcv', 'ohc', 'l', 'rotor', 'ohcf', 'dohcv']
num-of-cylinders:
['four', 'six', 'five', 'three', 'twelve', 'two', 'eight']
fuel-system:
['mpfi', '2bbl', 'mfi', '1bbl', 'spfi', '4bbl', 'idi', 'spdi']
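Listing distinct values shows what occurs, but not how often. As a small complement, `value_counts(dropna=False)` also reports frequencies and counts NaN as its own entry. The series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical num-of-doors column with one missing entry
doors = pd.Series(['two', 'four', np.nan, 'four', 'two', 'four'],
                  name='num-of-doors')

# dropna=False keeps NaN as a separate row in the result
counts = doors.value_counts(dropna=False)
print(counts)

# Number of missing entries, read from the NaN row
nan_count = int(counts[counts.index.isna()].iloc[0])
print(nan_count)
```

A category that appears only once or twice is often worth a second look, since it may be a typo rather than a real value.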
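Handling missing values
Ways to handle missing values:
1. When only a few values are missing (under 1%), the rows containing NaN can simply be dropped;
2. Fill in the mean or the mode of the existing values;
3. Build a regression model on the known values and predict the missing ones.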
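The three options above can be sketched on a tiny made-up frame. Everything here is an assumption for illustration: the rows are hypothetical, and for option 3 I use a one-variable least-squares line via `np.polyfit` (horsepower predicted from engine-size) as a stand-in for whatever regression model you would actually choose.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing number and one missing category
df = pd.DataFrame({
    'engine-size': [97.0, 109.0, 130.0, 152.0, 120.0],
    'horsepower': [69.0, 102.0, np.nan, 154.0, 97.0],
    'num-of-doors': ['four', 'four', 'two', None, 'four'],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill numeric columns with the mean, categorical ones with the mode
filled = df.copy()
filled['horsepower'] = filled['horsepower'].fillna(filled['horsepower'].mean())
filled['num-of-doors'] = filled['num-of-doors'].fillna(
    filled['num-of-doors'].mode()[0])

# Option 3: fit a line on the rows where horsepower is known, then predict
known = df.dropna(subset=['horsepower'])
slope, intercept = np.polyfit(known['engine-size'].to_numpy(),
                              known['horsepower'].to_numpy(), deg=1)
predicted = df.copy()
missing = predicted['horsepower'].isna()
predicted.loc[missing, 'horsepower'] = (
    slope * predicted.loc[missing, 'engine-size'] + intercept)

print(len(dropped))
print(filled)
print(predicted)
```

Which option fits depends on how much is missing and on whether the other columns carry enough information to predict the gap.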
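To see how values are missing, use the visualization tools provided by missingno, or count the missing values and the share they take up in each column.
Once the outliers have been handled, no missing values are left. With the method used in this post, however, missing values should be handled first.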
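The counting approach mentioned above can be sketched like this. The frame is a made-up miniature; on the real data you would apply the same calls to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the car data with a few gaps
sample = pd.DataFrame({
    'make': ['alfa-romero', 'alfa-romero', 'audi', 'audi'],
    'normalized-losses': [np.nan, np.nan, 164.0, 164.0],
    'price': [13495.0, 16500.0, np.nan, 13950.0],
})

# Number of missing values and their share per column
missing_count = sample.isna().sum()
missing_ratio = sample.isna().mean()

# Keep only the columns that have gaps, worst first
summary = pd.DataFrame({'missing': missing_count, 'ratio': missing_ratio})
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```

The ratio column maps directly onto the rule of thumb above: columns under 1% are candidates for dropping rows, the rest for filling or prediction.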
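# Inspect the missing values with a plot
# missing values?
# darkgrid: dark grid (default)
# whitegrid: white grid
# dark: dark background
# white: white background
# ticks: white background with axis ticks
sns.set(style='ticks')  # set the seaborn style
# Note: when reading the CSV, the missing-value marker must be declared, e.g. na_values='?', otherwise the gaps are not shown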
msno.matrix(data)