import pandas as pd
import numpy as np
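Background
This is a car sales dataset. The task this time is simple: estimate a car's price from its make, colour, odometer reading (KM), and number of doors.
Inspecting the data and cleaning up missing values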
car_sales_missing = pd.read_csv(r"\Users\Administrator\Desktop\data\car-sales-extended-missing-data.csv")
car_sales_missing.head()
| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
car_sales_missing.isnull().sum()
Make 49
Colour 50
Odometer (KM) 50
Doors 50
Price 50
dtype: int64
car_sales_missing.dtypes
Make object
Colour object
Odometer (KM) float64
Doors float64
Price float64
dtype: object
car_sales_missing["Doors"].value_counts()
4.0 811
5.0 75
3.0 64
Name: Doors, dtype: int64
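Although `Doors` is stored as `float64` in the dataset, its values are discrete (only 3, 4, and 5 occur), so it should be treated as a categorical feature and will be converted to numbers with one-hot encoding later.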
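To make the idea of treating `Doors` as a category concrete, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The `toy` frame and its values are hypothetical and not taken from the dataset, and the encoder used later in this walkthrough may well be a different one.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Doors": [4.0, 5.0, 3.0],
})

# Cast to int so the generated column names read Doors_4 rather than Doors_4.0
toy["Doors"] = toy["Doors"].astype(int)

# Listing "Doors" in `columns` forces pandas to encode it even though it is numeric
encoded = pd.get_dummies(toy, columns=["Make", "Doors"], dtype=int)
print(encoded.columns.tolist())
```

Each distinct value becomes its own 0/1 column, so the model does not read 5 doors as "more" than 3 doors in any numeric sense.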
Method 1: Filling missing values with pandas
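# Fill missing values: 'Make' and 'Colour' with 'missing', "Odometer (KM)" with its mean, "Doors" with its mode (4)
# Assign the result back rather than calling fillna(inplace=True) on a selected column, which recent pandas versions warn about
car_sales_missing['Make'] = car_sales_missing['Make'].fillna('missing')
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)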
Check the result of the fill
car_sales_missing.isna().sum()
Make 0
Colour 0
Odometer (KM) 0
Doors 0
Price 50
dtype: int64
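The same fill can also be expressed in a single call, with the mode computed from the data instead of hard-coded. The following is only a sketch on a hypothetical `toy` frame whose values are made up for illustration; `fill_values` is a name introduced here, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column; values are illustrative
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Honda"],
    "Odometer (KM)": [10000.0, 30000.0, np.nan, 20000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
})

# One fill value per column: a placeholder label, the column mean, the column mode
fill_values = {
    "Make": "missing",
    "Odometer (KM)": toy["Odometer (KM)"].mean(),
    "Doors": toy["Doors"].mode()[0],
}

# DataFrame.fillna accepts a dict that maps column names to fill values
filled = toy.fillna(value=fill_values)
print(filled)
```

Computing the mode this way keeps the fill correct if the distribution of `Doors` changes, at the cost of one extra line.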
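Drop the rows that still contain missing values. `Price` is the value we want to predict, so rows without it are dropped rather than filled.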
car_sales_missing.dropna(inplace=True)
The dataset now has no missing values.
car_sales_missing.isna().sum()
Make 0
Colour 0
Odometer (KM) 0
Doors 0
Price 0
dtype: int64
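Since only the target column is being dropped on at this point, the intent can be spelled out with the `subset` argument of `dropna`. This is a small sketch on a hypothetical `toy` frame, not the original data; `labelled` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the middle row has no price
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Price": [15000.0, np.nan, 13000.0],
})

# Drop only the rows where the target is missing, then renumber the index
labelled = toy.dropna(subset=["Price"]).reset_index(drop=True)
print(labelled)
```

Restricting the drop to `Price` guards against silently losing rows if some feature column still had gaps.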
# Split the data into features and labels, i.e. `X` and `y`
X = car_sales_missing.drop(