Day-1:

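1. Understanding the data

The task is to predict rent from a set of features, which makes it a regression problem.
The dataset covers several categories of information: rental listings, residential community information, nearby amenities, second-hand housing, new housing, and so on.
This is really hard and honestly not suited to a beginner like me, but I'll just have to grit my teeth and push on!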

2. EDA

(Mostly following other people's work)

2.1 Reading the data and getting an overview

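import pandas as pd
import matplotlib.pyplot as plt

# Import the packages first, then load the training and test sets
train = pd.read_csv('./train_data.csv')
test = pd.read_csv('./test_a.csv')
# Combine train and test into a single frame (the original referenced undefined data_train/data_test)
data_all = pd.concat([train, test], ignore_index=True)
print(train.info())
print(train.describe())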

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41440 entries, 0 to 41439
Data columns (total 52 columns):
ID                    41440 non-null int64
area                  41440 non-null float64
rentType              41440 non-null object
houseType             41440 non-null object
houseFloor            41440 non-null object
totalFloor            41440 non-null int64
houseToward           41440 non-null object
houseDecoration       41440 non-null object
communityName         41440 non-null object
city                  41440 non-null object
region                41440 non-null object
plate                 41440 non-null object
buildYear             41440 non-null object
saleSecHouseNum       41440 non-null int64
subwayStationNum      41440 non-null int64
busStationNum         41440 non-null int64
interSchoolNum        41440 non-null int64
schoolNum             41440 non-null int64
privateSchoolNum      41440 non-null int64
hospitalNum           41440 non-null int64
drugStoreNum          41440 non-null int64
gymNum                41440 non-null int64
bankNum               41440 non-null int64
shopNum               41440 non-null int64
parkNum               41440 non-null int64
mallNum               41440 non-null int64
superMarketNum        41440 non-null int64
totalTradeMoney       41440 non-null int64
totalTradeArea        41440 non-null float64
tradeMeanPrice        41440 non-null float64
tradeSecNum           41440 non-null int64
totalNewTradeMoney    41440 non-null int64
totalNewTradeArea     41440 non-null int64
tradeNewMeanPrice     41440 non-null float64
tradeNewNum           41440 non-null int64
remainNewNum          41440 non-null int64
supplyNewNum          41440 non-null int64
supplyLandNum         41440 non-null int64
supplyLandArea        41440 non-null float64
tradeLandNum          41440 non-null int64
tradeLandArea         41440 non-null float64
landTotalPrice        41440 non-null int64
landMeanPrice         41440 non-null float64
totalWorkers          41440 non-null int64
newWorkers            41440 non-null int64
residentPopulation    41440 non-null int64
pv                    41422 non-null float64
uv                    41422 non-null float64
lookNum               41440 non-null int64
tradeTime             41440 non-null object
tradeMoney            41440 non-null float64
Type                  41440 non-null object
dtypes: float64(10), int64(30), object(12)
memory usage: 16.4+ MB
None
                 ID          area  ...       lookNum    tradeMoney
count  4.144000e+04  41440.000000  ...  41440.000000  4.144000e+04
mean   1.001221e+08     70.959409  ...      0.396260  8.837074e+03
std    9.376566e+04     88.119569  ...      1.653932  5.514287e+05
min    1.000000e+08      1.000000  ...      0.000000  0.000000e+00
25%    1.000470e+08     42.607500  ...      0.000000  2.800000e+03
50%    1.000960e+08     65.000000  ...      0.000000  4.000000e+03
75%    1.001902e+08     90.000000  ...      0.000000  5.500000e+03
max    1.003218e+08  15055.000000  ...     37.000000  1.000000e+08

[8 rows x 40 columns]
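
The describe() output above is truncated with "..." and only covers the 40 numeric columns. As a minimal sketch of one way to see every feature, the summary can be transposed so that each feature becomes a row, and the non-numeric columns can be summarized separately. The DataFrame `demo` and its values are hypothetical stand-ins for `train`, made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
demo = pd.DataFrame({
    "area": [42.6, 65.0, 90.0, 15055.0],
    "rentType": ["whole", "shared", "whole", "whole"],
    "tradeMoney": [2800.0, 4000.0, 5500.0, 100000000.0],
})

# Transposing puts one feature per row, so no column is elided with "...".
numeric_summary = demo.describe().T

# describe() skips non-numeric columns by default; summarize them separately.
object_summary = demo.describe(exclude=[np.number]).T

print(numeric_summary)
print(object_summary)
```

On the real data the same two calls would be applied to `train` instead of `demo`.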

2.2 Missing value analysis

cols_columns = [col for col in train.columns
               if train[col].isnull().any() ]
print(cols_columns)

Output:

['pv', 'uv']  # the two columns with missing values (41422 non-null out of 41440, so 18 missing each)
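
Knowing which columns have missing values is only half the picture; the count and share of missing rows usually decide how to handle them. A minimal sketch, assuming a small hypothetical DataFrame `demo` in place of `train` (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
demo = pd.DataFrame({
    "area": [42.6, 65.0, 90.0, 70.0],
    "pv": [1200.0, np.nan, 800.0, 950.0],
    "uv": [300.0, np.nan, 210.0, np.nan],
})

# isnull() gives a boolean frame: sum() counts the True values per column,
# and mean() gives the fraction of missing rows per column.
missing_count = demo.isnull().sum()
missing_ratio = demo.isnull().mean()

report = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
# Keep only columns that actually have missing values, worst first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```

For the real training set, the info() output shows pv and uv each have 18 missing rows out of 41440, a very small share.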

2.3 Monotonic feature analysis


def increasing(vals):
    # Count how many times a value is strictly greater than the one before it
    cnt = 0
    len_ = len(vals)
    for i in range(len_ - 1):
        if vals[i + 1] > vals[i]:
            cnt += 1  # I first thought this was a bug and changed it to >=, which flagged a pile of columns; it turned out the mistake was mine
    return cnt

fea_cols = [col for col in train.columns]
print(fea_cols)
for col in fea_cols:
    cnt = increasing(train[col].values)
    # Flag a column when at least 55% of its rows are larger than the previous row
    if cnt / train.shape[0] >= 0.55:
        print('Monotonic feature:', col)
        print('Number of increasing steps:', cnt)
        print('Ratio of increasing steps:', cnt / train.shape[0])

Output:

['ID', 'area', 'rentType', 'houseType', 'houseFloor', 'totalFloor', 'houseToward', 'houseDecoration', 'communityName', 'city', 'region', 'plate', 'buildYear', 'saleSecHouseNum', 'subwayStationNum', 'busStationNum', 'interSchoolNum', 'schoolNum', 'privateSchoolNum', 'hospitalNum', 'drugStoreNum', 'gymNum', 'bankNum', 'shopNum', 'parkNum', 'mallNum', 'superMarketNum', 'totalTradeMoney', 'totalTradeArea', 'tradeMeanPrice', 'tradeSecNum', 'totalNewTradeMoney', 'totalNewTradeArea', 'tradeNewMeanPrice', 'tradeNewNum', 'remainNewNum', 'supplyNewNum', 'supplyLandNum', 'supplyLandArea', 'tradeLandNum', 'tradeLandArea', 'landTotalPrice', 'landMeanPrice', 'totalWorkers', 'newWorkers', 'residentPopulation', 'pv', 'uv', 'lookNum', 'tradeTime', 'tradeMoney']
Monotonic feature: tradeTime
Number of increasing steps: 24085
Ratio of increasing steps: 0.5812017374517374
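
The Python loop above visits every row of every column one at a time. The same strict-increase count can be sketched with array slicing, which compares each value to its predecessor in one vectorized step. The name `increasing_fast` and the DataFrame `demo` are hypothetical, and the sketch assumes the values in a column are mutually comparable (for example all numeric).

```python
import numpy as np
import pandas as pd

def increasing_fast(vals):
    """Count positions i where vals[i + 1] > vals[i], using array slicing."""
    vals = np.asarray(vals)
    # vals[1:] is every value except the first, vals[:-1] every value except the last,
    # so the comparison lines up each value with the one before it.
    return int((vals[1:] > vals[:-1]).sum())

# Hypothetical stand-in for `train`; the values are made up for illustration.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "area": [50.0, 40.0, 60.0, 30.0, 30.0],
})

for col in demo.columns:
    cnt = increasing_fast(demo[col].values)
    print(col, cnt, cnt / demo.shape[0])
```

Because the comparison is strict, equal neighbours (the two 30.0 values in `area`) are not counted, which matches the behaviour of the loop version.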

2.4 Feature nunique distribution

print(train.nunique())  # print the number of unique values in each column

Output:

ID                    41440
area                  10353
rentType                  4
houseType               104
houseFloor                3
totalFloor               55
houseToward              10
houseDecoration           4
communityName          4236
city                      1
region                   15
plate                    66
buildYear                80
saleSecHouseNum          28
subwayStationNum         13
busStationNum            59
interSchoolNum            7
schoolNum                44
privateSchoolNum         17
hospitalNum              11
drugStoreNum             42
gymNum                   39
bankNum                  45
shopNum                  56
parkNum                  18
mallNum                  17
superMarketNum           49
totalTradeMoney         704
totalTradeArea          705
tradeMeanPrice          705
tradeSecNum             333
totalNewTradeMoney      558
totalNewTradeArea       533
tradeNewMeanPrice       557
tradeNewNum             157
remainNewNum            392
supplyNewNum            104
supplyLandNum             4
supplyLandArea           54
tradeLandNum              5
tradeLandArea            46
landTotalPrice           46
landMeanPrice            52
totalWorkers             63
newWorkers              179
residentPopulation       63
pv                      709
uv                      649
lookNum                  32
tradeTime               361
tradeMoney              836
dtype: int64
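
In the output above, city has a single unique value and ID has 41440, the same as the number of rows. Columns like these are often candidates for dropping, since a constant carries no signal and a per-row identifier cannot generalize. A minimal sketch of flagging both cases, where `demo` and its values (including the "SH" placeholder) are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
demo = pd.DataFrame({
    "ID": [100, 101, 102, 103],
    "city": ["SH", "SH", "SH", "SH"],
    "houseFloor": ["low", "mid", "high", "low"],
})

n_unique = demo.nunique()

# Columns with exactly one distinct value are constant.
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(demo)].index.tolist()

print("constant:", constant_cols)
print("ID-like:", id_like_cols)
```

Whether to actually drop such columns is a modelling decision; this only lists them.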

2.5 Label processing


import seaborn as sns  # missing from the imports above, needed for the plots

# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent
fig, axes = plt.subplots(2, 3, figsize=(20, 12))  # set the figure size
# Plot the label distribution overall, then within successive value ranges
sns.distplot(train['tradeMoney'], ax=axes[0][0])
sns.distplot(train[(train['tradeMoney'] <= 20000)]['tradeMoney'], ax=axes[0][1])
sns.distplot(train[(train['tradeMoney'] > 20000) & (train['tradeMoney'] <= 50000)]['tradeMoney'], ax=axes[0][2])
sns.distplot(train[(train['tradeMoney'] > 50000) & (train['tradeMoney'] <= 100000)]['tradeMoney'], ax=axes[1][0])
sns.distplot(train[(train['tradeMoney'] > 100000)]['tradeMoney'], ax=axes[1][1])
plt.show()
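
The plots split tradeMoney into four ranges, but they do not show how many rows fall into each one. A minimal sketch of counting rows per range with pd.cut, using the same boundaries as the plots; `demo` and its values are hypothetical, and the range labels are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
demo = pd.DataFrame({
    "tradeMoney": [0.0, 2800.0, 4000.0, 5500.0, 20000.0, 35000.0, 80000.0, 100000000.0]
})

# Same boundaries as the plots. pd.cut intervals are right-closed by default,
# so 20000 lands in the first range, matching the `<= 20000` filter.
bins = [-np.inf, 20000, 50000, 100000, np.inf]
labels = ["<=20000", "20000-50000", "50000-100000", ">100000"]

ranges = pd.cut(demo["tradeMoney"], bins=bins, labels=labels)
counts = ranges.value_counts().sort_index()
print(counts)
```

Applied to the real `train['tradeMoney']`, the counts would show how much of the data sits in the tail ranges compared with the bulk at or below 20000.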


Summary: the experts really do live up to their reputation!
