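Datawhale Rent Prediction Practice: Study Notes (Competition Task Analysis)
Understanding the data
The task for this hands-on study is city housing rent prediction; for the task description see 2019未来杯高校AI挑战赛 (the 2019 Future Cup university AI challenge). The title alone suggests a regression problem: from the feature data in the training set, we need to select suitable features, build a regression model, and validate its prediction accuracy. The model evaluation formula used by the competition (i.e. the scoring metric) is: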
$$score = 1 - \frac{\sum_{i=1}^{M}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{M}(y_i-\bar{y})^2}$$
This metric is R-Square (the coefficient of determination), which is described in terms of SSR and SST:
(1) SSR: sum of squares of the regression, i.e. the sum of squared differences between the predicted values and the mean of the original data:
$$SSR=\sum_{i=1}^{n}w_i(\hat{y}_i-\bar{y})^2$$
(2) SST: total sum of squares, i.e. the sum of squared differences between the original data and their mean:
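$$SST=\sum_{i=1}^{n}w_i(y_i-\bar{y})^2$$
The coefficient of determination is defined as the ratio of SSR to SST. For a least-squares fit with an intercept, $SST = SSR + SSE$ with $SSE=\sum_{i=1}^{n}w_i(y_i-\hat{y}_i)^2$, so this ratio is the same quantity as $1 - SSE/SST$. Its normal range is [0, 1]; the closer it is to 1, the more of the variation in y the model's variables explain and the better the model fits the data.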
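To make the metric concrete, here is a minimal sketch that computes SSE, SSR, SST and the score with NumPy, taking all weights w_i = 1. The function name `r2_score_manual` and the toy numbers are made up for illustration and are not from the competition data.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Return (score, sse, ssr, sst) for unweighted data (all w_i = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    ssr = np.sum((y_pred - y_mean) ** 2)  # regression sum of squares
    sst = np.sum((y_true - y_mean) ** 2)  # total sum of squares
    return 1.0 - sse / sst, sse, ssr, sst

# Toy values (hypothetical, not taken from the competition data)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score, sse, ssr, sst = r2_score_manual(y_true, y_pred)
print(score, sse, ssr, sst)
```

In this toy case the predictions do not come from a least-squares fit, so SSR + SSE differs from SST and SSR/SST is not the same number as 1 - SSE/SST. The 1 - SSE/SST form is the one that matches the competition score.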
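The basic data fields are:
1. Basic rental information:
ID: listing ID
area: floor area
rentType: rental type (whole unit / shared / unknown)
houseType: layout
houseFloor: floor level of the unit (high / middle / low)
totalFloor: total number of floors in the building
houseToward: orientation
houseDecoration: decoration
tradeTime: transaction date
tradeMoney: transaction rent
2. Community information:
communityName: community name
city: city
region: district
plate: plate (sub-district block)
buildYear: year the community was built
saleSecHouseNum: number of second-hand listings in the plate that month
3. Amenities:
subwayStationNum: number of subway stations in the plate
busStationNum: number of bus stations in the plate
interSchoolNum: number of international schools in the plate
schoolNum: number of public schools in the plate
privateSchoolNum: number of private schools in the plate
hospitalNum: number of general hospitals in the plate
drugStoreNum: number of pharmacies in the plate
gymNum: number of gyms in the plate
bankNum: number of banks in the plate
shopNum: number of shops in the plate
parkNum: number of parks in the plate
mallNum: number of shopping malls in the plate
superMarketNum: number of supermarkets in the plate
4. Other information:
totalTradeMoney: total second-hand housing transaction amount in the plate that month
totalTradeArea: total second-hand housing transaction area in the plate
tradeMeanPrice: mean second-hand housing transaction price in the plate
tradeSecNum: number of second-hand housing transactions in the plate that month
totalNewTradeMoney: total new-home transaction amount in the plate that month
totalNewTradeArea: total new-home transaction area in the plate that month
tradeNewMeanPrice: mean new-home transaction price in the plate that month
tradeNewNum: number of new-home transactions in the plate that month
remainNewNum: number of unsold new homes in the plate that month
supplyNewNum: number of new homes supplied in the plate that month
supplyLandNum: number of land parcels supplied in the plate that month
supplyLandArea: land area supplied in the plate that month
tradeLandNum: number of land parcels sold in the plate that month
tradeLandArea: land area sold in the plate that month
landTotalPrice: total land transaction price in the plate that month
landMeanPrice: land price per unit of floor area in the plate that month (yuan/m^2)
totalWorkers: current number of office workers in the plate
newWorkers: population inflow into the plate that month (people currently being recruited)
residentPopulation: resident population of the plate
pv: number of web page views by renters in the plate that month
uv: number of renters who viewed web pages in the plate that month
lookNum: number of offline viewings
So in this regression model, x consists of the features above, which need to be screened with dimensionality reduction methods such as principal component analysis and with statistical analysis to obtain features free of multicollinearity; y is tradeMoney.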
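One simple first check for multicollinearity is to list pairs of numeric features whose absolute Pearson correlation is very high. The sketch below is only an illustration under assumptions: the helper `highly_correlated_pairs`, the 0.9 threshold, and the toy values are all hypothetical, and a real analysis might prefer VIF or PCA.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, |corr|) for column pairs at or above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy frame with made-up numbers; the column names follow the field list above
toy = pd.DataFrame({
    "pv": [100.0, 200.0, 300.0, 400.0, 500.0],
    "uv": [11.0, 19.0, 32.0, 41.0, 49.0],  # nearly proportional to pv
    "lookNum": [3.0, 0.0, 1.0, 0.0, 2.0],
})
pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Here only the (pv, uv) pair is flagged, which suggests keeping one of the two or combining them.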
Load the data with code and take a look at the overall information:
import pandas as pd
import numpy as np
path = "C:\\Users\\w\\Downloads\\team-learning-master\\数据竞赛(房租预测)\\数据集\\train_data.csv"
data = pd.read_csv(path)
print(data.info())
print(data.describe(include='all'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41440 entries, 0 to 41439
Data columns (total 51 columns):
ID 41440 non-null int64
area 41440 non-null float64
rentType 41440 non-null object
houseType 41440 non-null object
houseFloor 41440 non-null object
totalFloor 41440 non-null int64
houseToward 41440 non-null object
houseDecoration 41440 non-null object
communityName 41440 non-null object
city 41440 non-null object
region 41440 non-null object
plate 41440 non-null object
buildYear 41440 non-null object
saleSecHouseNum 41440 non-null int64
subwayStationNum 41440 non-null int64
busStationNum 41440 non-null int64
interSchoolNum 41440 non-null int64
schoolNum 41440 non-null int64
privateSchoolNum 41440 non-null int64
hospitalNum 41440 non-null int64
drugStoreNum 41440 non-null int64
gymNum 41440 non-null int64
bankNum 41440 non-null int64
shopNum 41440 non-null int64
parkNum 41440 non-null int64
mallNum 41440 non-null int64
superMarketNum 41440 non-null int64
totalTradeMoney 41440 non-null int64
totalTradeArea 41440 non-null float64
tradeMeanPrice 41440 non-null float64
tradeSecNum 41440 non-null int64
totalNewTradeMoney 41440 non-null int64
totalNewTradeArea 41440 non-null int64
tradeNewMeanPrice 41440 non-null float64
tradeNewNum 41440 non-null int64
remainNewNum 41440 non-null int64
supplyNewNum 41440 non-null int64
supplyLandNum 41440 non-null int64
supplyLandArea 41440 non-null float64
tradeLandNum 41440 non-null int64
tradeLandArea 41440 non-null float64
landTotalPrice 41440 non-null int64
landMeanPrice 41440 non-null float64
totalWorkers 41440 non-null int64
newWorkers 41440 non-null int64
residentPopulation 41440 non-null int64
pv 41422 non-null float64
uv 41422 non-null float64
lookNum 41440 non-null int64
tradeTime 41440 non-null object
tradeMoney 41440 non-null float64
dtypes: float64(10), int64(30), object(11)
memory usage: 16.1+ MB
None
ID area rentType houseType houseFloor \
count 4.144000e+04 41440.000000 41440 41440 41440
unique NaN NaN 4 104 3
top NaN NaN 未知方式 1室1厅1卫 中
freq NaN NaN 30759 9805 15458
mean 1.001221e+08 70.959409 NaN NaN NaN
std 9.376566e+04 88.119569 NaN NaN NaN
min 1.000000e+08 1.000000 NaN NaN NaN
25% 1.000470e+08 42.607500 NaN NaN NaN
50% 1.000960e+08 65.000000 NaN NaN NaN
75% 1.001902e+08 90.000000 NaN NaN NaN
max 1.003218e+08 15055.000000 NaN NaN NaN
totalFloor houseToward houseDecoration communityName city ... \
count 41440.000000 41440 41440 41440 41440 ...
unique NaN 10 4 4236 1 ...
top NaN 南 其他 XQ01834 SH ...
freq NaN 34377 29040 358 41440 ...
mean 11.413152 NaN NaN NaN NaN ...
std 7.375203 NaN NaN NaN NaN ...
min 0.000000 NaN NaN NaN NaN ...
25% 6.000000 NaN NaN NaN NaN ...
50% 7.000000 NaN NaN NaN NaN ...
75% 16.000000 NaN NaN NaN NaN ...
max 88.000000 NaN NaN NaN NaN ...
landTotalPrice landMeanPrice totalWorkers newWorkers \
count 4.144000e+04 41440.000000 41440.000000 41440.000000
unique NaN NaN NaN NaN
top NaN NaN NaN NaN
freq NaN NaN NaN NaN
mean 1.045363e+08 724.763918 77250.235497 1137.132095
std 5.215216e+08 3224.303831 132052.508523 7667.381627
min 0.000000e+00 0.000000 600.000000 0.000000
25% 0.000000e+00 0.000000 13983.000000 0.000000
50% 0.000000e+00 0.000000 38947.000000 0.000000
75% 0.000000e+00 0.000000 76668.000000 0.000000
max 6.197570e+09 37513.062490 855400.000000 143700.000000
residentPopulation pv uv lookNum \
count 41440.000000 41422.000000 41422.000000 41440.000000
unique NaN NaN NaN NaN
top NaN NaN NaN NaN
freq NaN NaN NaN NaN
mean 294514.059459 26945.663512 3089.077085 0.396260
std 196745.147181 32174.637924 2954.706517 1.653932
min 49330.000000 17.000000 6.000000 0.000000
25% 165293.000000 7928.000000 1053.000000 0.000000
50% 245872.000000 20196.000000 2375.000000 0.000000
75% 330610.000000 34485.000000 4233.000000 0.000000
max 928198.000000 621864.000000 39876.000000 37.000000
tradeTime tradeMoney
count 41440 4.144000e+04
unique 361 NaN
top 2018/3/3 NaN
freq 543 NaN
mean NaN 8.837074e+03
std NaN 5.514287e+05
min NaN 0.000000e+00
25% NaN 2.800000e+03
50% NaN 4.000000e+03
75% NaN 5.500000e+03
max NaN 1.000000e+08
[11 rows x 51 columns]
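The data has 51 columns in total (including ID and the target tradeMoney): 11 are of object type and 40 are numeric. One-hot encoding is a candidate for handling the object features.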
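As a rough illustration of the one-hot idea, the sketch below encodes two object columns of a tiny made-up frame with `pd.get_dummies`. The rows are hypothetical; only the column names and category values follow the real data.

```python
import pandas as pd

# Toy rows (hypothetical); category values follow the dataset:
# houseFloor: 高 / 中 / 低 (high / middle / low)
# rentType: 整租 / 合租 / 未知方式 (whole unit / shared / unknown)
toy = pd.DataFrame({
    "area": [45.0, 60.0, 30.0],
    "houseFloor": ["高", "中", "低"],
    "rentType": ["整租", "合租", "未知方式"],
})

categorical_cols = ["houseFloor", "rentType"]
encoded = pd.get_dummies(toy, columns=categorical_cols)
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so high-cardinality columns such as communityName (4236 values) would blow up the feature count and probably need a different treatment.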
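Data analysis
Missing values: data.isnull().sum()
The result shows that pv (number of web page views by renters in the plate that month) and uv (number of renters who viewed web pages in the plate that month) are each missing 18 records. If these two features are not used when building the model, the missing values need no handling.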
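If pv and uv do end up in the model, one simple option is median imputation. This is an assumption for illustration, not something the reference material prescribes, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical numbers)
toy = pd.DataFrame({
    "pv": [100.0, np.nan, 300.0, 500.0],
    "uv": [10.0, 20.0, np.nan, 40.0],
})

missing_before = toy.isnull().sum()  # per-column count of missing values
filled = toy.fillna(toy.median())    # fill each column with its own median
print(missing_before)
print(filled)
```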
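Monotonic feature analysis
According to the reference material, this analysis identifies the time column as the monotonic feature. I did not really understand the point of the analysis; the reference code is attached:
def increasing(vals):
    # Count the positions where the next value is larger than the current one
    cnt = 0
    len_ = len(vals)
    for i in range(len_ - 1):
        if vals[i + 1] > vals[i]:
            cnt += 1
    return cnt

fea_cols = [col for col in data.columns]
for col in fea_cols:
    cnt = increasing(data[col].values)
    # Treat a column as monotonic when more than 55% of its steps increase
    if cnt / data.shape[0] >= 0.55:
        print('Monotonic feature:', col)
        print('Monotonic value count:', cnt)
        print('Monotonic value ratio:', cnt / data.shape[0])
Result:
Monotonic feature: tradeTime
Monotonic value count: 24085
Monotonic value ratio: 0.5812017374517374
The code treats a feature as monotonic when the share of increasing steps exceeds 0.55. As a beginner I am honestly lost here. My own guess is that the analysis is only useful for sorting the samples by time, so that a distribution plot with time on the x-axis comes out right.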
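The same check can be sketched in vectorized form. One caveat: tradeTime is stored as strings such as 2018/3/3, so comparing raw values compares text, not dates. The sketch below parses the dates first; the helper name `increasing_ratio` and the toy dates are hypothetical.

```python
import pandas as pd

def increasing_ratio(series):
    """Share of rows whose successor is strictly larger (divided by row count, as in the loop above)."""
    values = series.reset_index(drop=True)
    steps = values.iloc[1:].to_numpy() > values.iloc[:-1].to_numpy()
    return steps.sum() / len(values)

# Toy dates in the same string format as tradeTime (hypothetical rows)
toy = pd.DataFrame({"tradeTime": ["2018/1/5", "2018/3/3", "2018/2/10", "2018/10/1"]})
trade_dates = pd.to_datetime(toy["tradeTime"], format="%Y/%m/%d")

ratio = increasing_ratio(trade_dates)
sorted_toy = toy.assign(tradeTime=trade_dates).sort_values("tradeTime")
print(ratio)
print(sorted_toy)

# As plain strings, October sorts before February
print("2018/10/1" < "2018/2/10")
```

Parsing before comparing or sorting avoids that string-ordering pitfall when the samples are later plotted against time.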
Feature nunique distribution
data.nunique()
ID 41440
area 10353
rentType 4
houseType 104
houseFloor 3
totalFloor 55
houseToward 10
houseDecoration 4
communityName 4236
city 1
region 15
plate 66
buildYear 80
saleSecHouseNum 28
subwayStationNum 13
busStationNum 59
interSchoolNum 7
schoolNum 44
privateSchoolNum 17
hospitalNum 11
drugStoreNum 42
gymNum 39
bankNum 45
shopNum 56
parkNum 18
mallNum 17
superMarketNum 49
totalTradeMoney 704
totalTradeArea 705
tradeMeanPrice 705
tradeSecNum 333
totalNewTradeMoney 558
totalNewTradeArea 533
tradeNewMeanPrice 557
tradeNewNum 157
remainNewNum 392
supplyNewNum 104
supplyLandNum 4
supplyLandArea 54
tradeLandNum 5
tradeLandArea 46
landTotalPrice 46
landMeanPrice 52
totalWorkers 63
newWorkers 179
residentPopulation 63
pv 709
uv 649
lookNum 32
tradeTime 361
tradeMoney 836
dtype: int64
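# The output shows how many unique values each feature has
The unique() method can be used to inspect the unique values of a single column. Taking rentType, which has only a few, as an example: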
[in]:data.rentType.unique()
[out]:array(['未知方式', '整租', '合租', '--'], dtype=object)
[in]:data.rentType.value_counts()
[out]:未知方式 30759
整租 5472
合租 5204
-- 5
Name: rentType, dtype: int64
When one-hot encoding this column, '未知方式' (unknown) and '--' can be treated as the same category.
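A minimal sketch of that merge on a made-up Series: replace '--' with '未知方式' first, then one-hot encode. The rows are hypothetical; the category values are the ones shown in the output above.

```python
import pandas as pd

# Toy rentType column (hypothetical rows, real category values)
rent_type = pd.Series(["未知方式", "整租", "合租", "--", "整租"], name="rentType")

# Fold the '--' placeholder into the existing "unknown" category
cleaned = rent_type.replace("--", "未知方式")
dummies = pd.get_dummies(cleaned, prefix="rentType")
print(cleaned.value_counts())
print(dummies.columns.tolist())
```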
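Analysis results:
rentType: 4 values, and the vast majority are the uninformative 未知方式 (unknown);
houseType: 104 values, most of them with 3 bedrooms or fewer;
houseFloor: 3 values, fairly evenly distributed;
region: 15 values;
plate: 66 values;
houseToward: 10 values;
houseDecoration: 4 values, more than half of them 其他 (other);
buildYear: 80 values;
communityName: 4236 values, sparsely distributed;
This step prepares for the later data processing and feature engineering: first understand the meaning and distribution of each field, then handle the categorical variables differently according to their actual meaning.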
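Feature values with a frequency above 100
for feature in range(data.shape[1]):
    feature_name = data.columns[feature]
    # Count how often each value of this column occurs
    result = data.iloc[:, feature].value_counts().reset_index()
    result.columns = [feature_name, 'counts']
    # Keep only the values that occur more than 100 times
    print(result[result['counts'] > 100])
# Part of the output: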
area counts
0 90.0 303
1 89.0 266
2 10.0 203
3 60.0 202
4 50.0 193
5 40.0 188
6 88.0 182
7 12.0 172
8 80.0 169
9 70.0 167
10 55.0 156
11 13.0 156
12 85.0 144
13 56.0 138
14 15.0 137
15 14.0 135
16 52.0 113
17 30.0 112
18 57.0 110
19 78.0 110
20 35.0 108
21 42.0 108
22 54.0 108
23 58.0 108
24 37.0 106
25 65.0 105
26 45.0 104
27 53.0 103
28 51.0 101
Read together with the nunique distribution step above: values that occur fewer than 100 times can simply be lumped into a single "other" category.
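Lumping rare values into one bucket could look like the sketch below. The helper `group_rare`, the label "other", and the toy community IDs (apart from XQ01834, which appears in the output above) are hypothetical, and the threshold is scaled down to suit the tiny sample.

```python
import pandas as pd

def group_rare(series, min_count, other_label="other"):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Toy communityName column (hypothetical rows)
toy = pd.Series(
    ["XQ01834"] * 4 + ["XQ00001"] * 3 + ["XQ00002", "XQ00003"],
    name="communityName",
)
grouped = group_rare(toy, min_count=3)
print(grouped.value_counts())
```

On the real data the call would presumably use min_count=100 to match the threshold in this step.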
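Label distribution
First look at the value_counts of the label (tradeMoney), sorted by amount in ascending order to make plotting easier:
# Take the label column from the loaded data and count each distinct rent value
label = data['tradeMoney'].value_counts().reset_index()
label.columns = ['tradeMoney', 'counts']
label = label.sort_values(by='tradeMoney', ascending=True)
print(label)
# Result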
tradeMoney counts
299 0.00 9
709 100.00 1
429 140.00 3
732 150.00 1
448 160.00 3
.. ... ..
378 450000.00 4
733 10000000.00 1
635 50000000.00 1
810 99999999.99 1
[836 rows x 2 columns]
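tradeMoney clearly contains outliers, and the abnormal records need to be removed in later work. Plot the distribution with tradeMoney on the x-axis:
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only rent values up to 20000 so the extreme outliers do not flatten the plot
trademoney = label[label['tradeMoney'] <= 20000]['tradeMoney']
print(type(trademoney))
counts = label['counts']
plt.rcParams['axes.unicode_minus'] = False
# sns.set_style('darkgrid', {'font.sans-serif': ['SimHei', 'Arial']})
# distplot is deprecated since seaborn 0.11; sns.histplot(trademoney, kde=True) is the newer call
sns.distplot(trademoney)
plt.show()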
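For the outlier removal mentioned above, one common heuristic is the interquartile-range (IQR) rule. The sketch below is an assumption about how that cleanup could be done, not the method from the reference material; the helper `iqr_bounds`, the factor k=1.5, and the toy rents are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k times the IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy rents (hypothetical), including a zero and an extreme value like those in the label table
toy = pd.DataFrame({
    "tradeMoney": [0.0, 2800.0, 3500.0, 4000.0, 4500.0, 5500.0, 100000000.0]
})
low, high = iqr_bounds(toy["tradeMoney"])

# Keep rows inside the fences and drop non-positive rents explicitly
mask = (toy["tradeMoney"] >= low) & (toy["tradeMoney"] <= high) & (toy["tradeMoney"] > 0)
kept = toy[mask]
print(low, high)
print(kept)
```

The explicit `> 0` condition matters because the lower fence can be negative on skewed data, which would otherwise let zero rents through.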