Loading the data
import pandas as pd
df = pd.read_csv("./used_car_train_20200313.csv")
A quick look at the data
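Something looked off at first: with pandas' default parameters, the file was read as 150,000 rows in a single column. The cause turned out to be the delimiter. After passing the `sep` argument to `read_csv`, the columns parsed correctly.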
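The delimiter problem can be reproduced on a tiny in-memory sample. This is only a sketch: the three-column sample is made up, and the single-space separator is an assumption, since the text above does not say which `sep` value was used.

```python
import io

import pandas as pd

# Hypothetical three-column sample; the real file has 31 columns.
sample = "SaleID name price\n0 736 1850\n1 2262 3600\n"

# The default sep="," finds no commas, so each line becomes a single cell.
df_default = pd.read_csv(io.StringIO(sample))
print(df_default.shape)

# Passing the real delimiter (assumed here to be one space) splits the columns.
df_fixed = pd.read_csv(io.StringIO(sample), sep=" ")
print(df_fixed.shape)
```

Checking `df.shape` right after loading is a cheap way to catch this kind of issue before going any further.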
Next, I checked the Tianchi documentation to understand what each field means in practice.
Field | Description |
---|---|
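SaleID | Transaction ID, unique |
name | Car listing name, anonymized |
regDate | Car registration date, e.g. 20160101 means January 1, 2016 |
model | Model code, anonymized |
brand | Car brand, anonymized |
bodyType | Body type: limousine: 0, mini car: 1, van: 2, bus: 3, convertible: 4, coupe: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [0, 600] |
kilometer | Distance driven, in units of 10,000 km |
notRepairedDamage | Car has unrepaired damage: yes: 0, no: 1 |
regionCode | Region code, anonymized |
seller | Seller: individual: 0, non-individual: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Date the car was listed, i.e. when it went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features, 15 in total (v_0 through v_14) |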
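The field descriptions suggest two simple decoding steps: parsing the integer dates and mapping category codes to readable labels. A minimal sketch, using a small made-up frame and new column names (`regDate_parsed`, `gearbox_label`) that are illustrative only:

```python
import pandas as pd

# Tiny hypothetical frame that reuses column names from the field table.
df = pd.DataFrame({
    "regDate": [20040402, 20030301],
    "gearbox": [0.0, 1.0],
})

# regDate is stored as an integer such as 20160101; parse it as YYYYMMDD.
# errors="coerce" turns any value that is not a valid date into NaT.
df["regDate_parsed"] = pd.to_datetime(
    df["regDate"].astype(str), format="%Y%m%d", errors="coerce"
)

# Map the gearbox codes to the labels listed in the field table.
gearbox_labels = {0.0: "manual", 1.0: "automatic"}
df["gearbox_label"] = df["gearbox"].map(gearbox_labels)

print(df)
```

The same `map` pattern would apply to `bodyType`, `fuelType`, `seller`, and `offerType`.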
I tried exporting the output with `to_html` and pasting the result into CSDN.
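For reference, `DataFrame.to_html` returns the table as an HTML string when no buffer is passed. A short sketch on a made-up two-column frame (the real frame has 31 columns):

```python
import pandas as pd

# Hypothetical stand-in for the loaded training data.
df = pd.DataFrame({"SaleID": [0, 1, 2], "price": [1850, 3600, 6222]})

# With no buffer argument, to_html returns the markup as a string.
# index=False drops the row-index column from the rendered table.
html = df.head(3).to_html(index=False)
print(html)
```

The resulting string can be pasted into an editor that accepts raw HTML.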
df.head(3)
index | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
df.describe()
statistic | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.0 | 1.500000e+05 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | 2583.077267 | 0.000007 | 0.0 | 2.016033e+07 | 5923.327333 | 44.406268 | -0.044809 | 0.080765 | 0.078833 | 0.017875 | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | 1885.363218 | 0.002582 | 0.0 | 1.067328e+02 | 7501.998477 | 2.457548 | 3.641893 | 2.929618 | 2.026514 | 1.193661 | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.0 | 2.015062e+07 | 11.000000 | 30.451976 | -4.295589 | -4.470671 | -7.275037 | -4.364565 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | 1018.000000 | 0.000000 | 0.0 | 2.016031e+07 | 1300.000000 | 43.135799 | -3.192349 | -0.970671 | -1.462580 | -0.921191 | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | 2196.000000 | 0.000000 | 0.0 | 2.016032e+07 | 3250.000000 | 44.610266 | -3.052671 | -0.382947 | 0.099722 | -0.075910 | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | 3843.000000 | 0.000000 | 0.0 | 2.016033e+07 | 7700.000000 | 46.004721 | 4.000670 | 0.241335 | 1.565838 | 0.868758 | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | 8120.000000 | 1.000000 | 0.0 | 2.016041e+07 | 99999.000000 | 52.304178 | 7.320308 | 19.035496 | 9.854702 | 6.829352 | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
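This gives a quick overview of how each numeric column is distributed. `notRepairedDamage` is absent from the summary because it contains "-" entries and is therefore not read as a numeric column.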
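In the `count` row, `model`, `bodyType`, `fuelType`, and `gearbox` fall below 150,000, which points to missing values. A possible follow-up, sketched on a made-up frame: count the missing values per column and convert `notRepairedDamage` to numeric. Treating "-" as missing is an assumption here, not something the Tianchi field table states.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics the two issues seen in the real data.
df = pd.DataFrame({
    "bodyType": [1.0, np.nan, 1.0],
    "notRepairedDamage": ["0.0", "-", "0.0"],
})

# Missing values per column; complements the `count` row of describe().
missing = df.isnull().sum()
print(missing)

# errors="coerce" turns unparseable entries such as "-" into NaN,
# so the column becomes float64 and shows up in describe().
df["notRepairedDamage"] = pd.to_numeric(df["notRepairedDamage"], errors="coerce")
print(df.describe())
```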
This post is a work in progress and will be expanded over time.