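Vehicle Data Analysis and Visualization in Practice
1. Introduction
The "Vehicle Sales and Market Trends Dataset" is a comprehensive collection of vehicle sales transactions. Each record includes the year, make, model, trim, body type, transmission type, vehicle identification number (VIN), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) value, selling price, and sale date. Data source: https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data/data
2. Import the required packages and load the dataset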
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, r2_score, mean_absolute_percentage_error
# Read the data
df = pd.read_csv('/车辆销售数据/car_prices.csv')
3. Data exploration
print(df.head(5))
df.info()
"""
year make ... sellingprice saledate
0 2015 Kia ... 21500.0 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
1 2015 Kia ... 21500.0 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
2 2014 BMW ... 30000.0 Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
3 2015 Volvo ... 27750.0 Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
4 2014 BMW ... 67000.0 Thu Dec 18 2014 12:30:00 GMT-0800 (PST)
[5 rows x 16 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558837 entries, 0 to 558836
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 558837 non-null int64
1 make 548536 non-null object
2 model 548438 non-null object
3 trim 548186 non-null object
4 body 545642 non-null object
5 transmission 493485 non-null object
6 vin 558833 non-null object
7 state 558837 non-null object
8 condition 547017 non-null float64
9 odometer 558743 non-null float64
10 color 558088 non-null object
11 interior 558088 non-null object
12 seller 558837 non-null object
13 mmr 558799 non-null float64
14 sellingprice 558825 non-null float64
15 saledate 558825 non-null object
dtypes: float64(4), int64(1), object(11)
memory usage: 68.2+ MB
"""
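The `saledate` column in the preview above is stored as plain text (dtype `object`), for example `Tue Dec 16 2014 12:30:00 GMT-0800 (PST)`. A minimal sketch of one way to convert such strings to timestamps follows. The helper name `parse_saledate` and the choice to normalize to UTC are assumptions for illustration, not part of the original workflow; the two sample strings are copied from the preview.

```python
import pandas as pd

# Two sample strings copied from the preview output above.
sample = pd.Series([
    "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)",
    "Thu Jan 15 2015 04:30:00 GMT-0800 (PST)",
])

def parse_saledate(s: pd.Series) -> pd.Series:
    """Hypothetical helper: parse saledate strings into UTC timestamps."""
    # Strip the trailing zone name such as " (PST)"; the numeric offset
    # ("-0800") already carries the information needed for conversion.
    cleaned = s.str.replace(r"\s*\([A-Z]+\)$", "", regex=True)
    # Values that do not match the format become NaT instead of raising.
    return pd.to_datetime(
        cleaned,
        format="%a %b %d %Y %H:%M:%S GMT%z",
        utc=True,
        errors="coerce",
    )

parsed = parse_saledate(sample)
print(parsed)
```

On the full dataset this could be applied as `df['saledate'] = parse_saledate(df['saledate'])`, after which `.dt.year` or `.dt.month` become available for grouping.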
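4. Data processing
# Data cleaning
# Count missing values per column
print(df.isna().sum())
# Count distinct values per column
print(df.nunique())
# Drop every row that has at least one missing value
df.dropna(axis=0, inplace=True)
df.info()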
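# Drop duplicate rows, keeping the first occurrence of each VIN
df = df.drop_duplicates(keep='first', subset=['vin'])
df.info()
print(df.columns.to_list())
# VIN is no longer needed; drop() returns a new DataFrame, so assign the result
df = df.drop(['vin'], axis=1)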
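The cleaning steps in this section can be checked on a tiny frame before running them on all 558,837 rows. The sketch below uses made-up rows: the VINs `A1`, `B2`, `C3` and the helper name `clean` are hypothetical. It avoids `inplace` so the original frame stays untouched. Note that `drop` returns a new DataFrame, so its result has to be assigned.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "vin": ["A1", "A1", "B2", "C3", None],
    "make": ["Kia", "Kia", "BMW", None, "Volvo"],
    "sellingprice": [21500.0, 21500.0, 30000.0, 27750.0, 67000.0],
})

# Share of missing values per column, a quick check before dropping rows.
missing_share = toy.isna().mean()
print(missing_share)

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the steps above without inplace edits."""
    out = frame.dropna(axis=0)                               # remove rows with any NaN
    out = out.drop_duplicates(subset=["vin"], keep="first")  # one row per VIN
    out = out.drop(columns=["vin"])                          # the identifier is no longer needed
    return out

cleaned = clean(toy)
print(cleaned)
```

Because each step returns a new object, `toy` still contains its `vin` column afterwards, which makes it easy to compare row counts before and after cleaning.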
"""
year 0
make 10301
model 10399
trim 10651
body 13195
transmission 65352
vin 4
state 0
condition 11820
odometer 94
color 749
interior 749
seller 0
mmr 38
sellingprice 12
saledate 12
dtype: int64
year 34
make 96
model 973
trim 1963
body 87
transmission 4
vin 550297
state 64
condition 41
odometer 172278
color 46
interior 17
seller 14263
mmr 1101
sellingprice 1887
saledate 3766
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 472325 entries, 0 to 558836
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 472325 non-null int64
1 make 472325 non-null object
2 model 472325 non-null object
3 trim 472325 non-null object
4 body 472325 non-null object
5 transmission 472325 non-null object
6 state 472325 non-null object
7 condition 472325 non-null float64
8 odometer 472325 non-null float64
9 color 472325 non-null object
10 interior 472325 non-null object
11 seller 472325 non-null object
12 mmr 472325 non-null float64
13 sellingprice 472325