二手车价格预测——task2 数据分析

最新推荐文章于 2024-07-27 12:20:46 发布

唐yi壹佰

最新推荐文章于 2024-07-27 12:20:46 发布

阅读量480

点赞数

分类专栏：二手车价格预测文章标签：机器学习可视化数据分析 python

本文链接：https://blog.csdn.net/m0_46668150/article/details/115604104

版权

文章目录

前言
一、代码示例
总结

前言

对于数据建模来说，数据分析是必不可少的一部分：

EDA的价值主要在于熟悉数据集，了解数据集，对数据集进行验证来确定所获得数据集可以用于接下来的机器学习或者深度学习使用。
当了解了数据集之后我们下一步就是要去了解变量间的相互关系以及变量与预测值之间的存在关系。

同时，我们能够熟练运用一些数据分析库的使用，例如：

数据科学库 pandas、numpy、scipy；
可视化库 matplotlib、seabon；

一、代码示例

1.引入库

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

2.读入数据

Train_data = pd.read_csv('C:\\Users\\TINKPAD\\Desktop\\python_work\\kaggle\二手车交易价格预测\\used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('C:\\Users\\TINKPAD\\Desktop\\python_work\\kaggle\二手车交易价格预测\\used_car_testB_20200421.csv', sep=' ')

## 简略观察数据(head()+shape)
Train_data.head().append(Train_data.tail())
Test_data.head().append(Test_data.tail())
Train_data.shape
Test_data.shape

3.总体数据概览

##1) 通过describe()来熟悉数据的相关统计量
Train_data.describe()
Test_data.describe()

              SaleID           name  ...          v_13          v_14
count   50000.000000   50000.000000  ...  50000.000000  50000.000000
mean   224999.500000   68505.606100  ...      0.004570     -0.007209
std     14433.901067   61032.124271  ...      1.287194      1.044718
min    200000.000000       1.000000  ...     -4.157649     -6.098192
25%    212499.750000   11315.000000  ...     -1.048722     -0.440706
50%    224999.500000   52215.000000  ...     -0.036352      0.136849
75%    237499.250000  118710.750000  ...      0.945239      0.685555
max    249999.000000  196808.000000  ...      5.635374      2.649768

[8 rows x 29 columns]

## 2) 通过info()来熟悉数据类型
Train_data.info()
Test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

4.判断数据缺失和异常

## 1) 查看每列的存在nan情况
Train_data.isnull().sum()
Test_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1504
fuelType             2924
gearbox              1968
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

# nan可视化
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

在这里插入图片描述
通过上图可以很直观的了解哪些列存在 “nan”, 并可以把nan的个数打印，主要的目的在于 nan存在的个数是否真的很大，如果很小一般选择填充，如果使用lgb等树模型可以直接空缺，让树自己去优化，但如果nan存在的过多、可以考虑删掉

## 2) 查看异常值检测
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

可以发现除了notRepairedDamage 为object类型其他都为数字这里我们把他的几个不同的值都进行显示就知道了

Train_data['notRepairedDamage'].value_counts()

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

可以看出来‘ - ’也为空缺值，因为很多模型对nan有直接的处理，这里我们先不做处理，先替换成nan

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Train_data['notRepairedDamage'].value_counts()

0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

Train_data.isnull().sum()

SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64

Test_data[</

最低0.47元/天解锁文章

唐yi壹佰

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
二手车价格预测——task2 数据分析

文章目录前言一、代码示例1.引入库2.读入数据3.总体数据概览4.判断数据缺失和异常5.了解预测值的分布6 特征分为类别特征和数字特征，并对类别特征查看unique分布7 数字特征分析8 类别特征分析9 用pandas_profiling生成数据报告总结前言对于数据建模来说，数据分析是必不可少的一部分：EDA的价值主要在于熟悉数据集，了解数据集，对数据集进行验证来确定所获得数据集可以用于接下来的机器学习或者深度学习使用。当了解了数据集之后我们下一步就是要去了解变量间的相互关系以及变量与预测值..
复制链接

扫一扫

专栏目录