Sklearn: Rental Price Prediction Model, Version 1


あずにゃん


Article link: https://blog.csdn.net/zimiao552147572/article/details/105941660


Dataset download link: https://pan.baidu.com/s/13OtaUv6j4x8dD7cgD4sL5g
Extraction code: 7tze

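5.10 Rental Price Prediction

1 Project background

Rent is determined by many factors together: renovation, location, floor plan, transport access, and market supply and demand. Renting is a fairly traditional industry, and severe information asymmetry has long been the norm.

On one side, landlords do not know the real market price and end up leaving overpriced homes vacant.

On the other side, tenants cannot find good-value homes that meet their needs. The result is a large waste of rental resources.

This project targets that pain point and provides real, anonymized rental-market data. Participants use historical records labeled with monthly rent to build a model that predicts monthly rent from basic housing information, giving the city's rental market an objective benchmark.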

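2 Task

The data covers 3 months of rental prices and basic housing information for one city, and has been anonymized.

Train a model on the housing information and monthly rent in the training set, then predict the monthly rent of each listing in the test set from its housing information.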

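3 Data

The data is split into two groups: a training set and a test set.

The training set is the data collected over the first 3 months, 150539 rows in total (sample rows appear in the train.head() output later in this article).

The test set is part of the data collected in the third month, 46000 rows in total (sample rows appear in the test.head() output later in this article). Compared with the training set, it adds an "id" field holding a unique id per listing and has no "月租金" (monthly rent) field; all other fields are the same.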

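4 Scoring

4.1 Evaluation metric

Models are scored by the root mean squared error (RMSE) between the predicted and the true monthly rent. The smaller the RMSE, the better the regression model.

The RMSE is computed as:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

where y_i is the true monthly rent of listing i, \hat{y}_i is the predicted rent, and n is the number of listings.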
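As a minimal sketch, the standard RMSE definition (the square root of the mean squared difference between true and predicted values) can be written in NumPy as follows. The function name `rmse` and the sample numbers are illustrative assumptions, not the competition's scoring code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, for illustration only
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

A lower value means the predictions sit closer to the true rents on average, with large errors penalized more heavily than small ones.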

Install the library: pip install xgboost

Preliminary data analysis

In [1]:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')  # suppress warnings

Load the data

In [2]:

train=pd.read_csv("data/train.csv")
test=pd.read_csv("data/test.csv")

Data exploration

Basic information

In [3]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 19 columns):
时间          150539 non-null int64
小区名         150539 non-null int64
小区房屋出租数量    149571 non-null float64
楼层          150539 non-null int64
总楼层         150539 non-null float64
房屋面积        150539 non-null float64
房屋朝向        150539 non-null object
居住状态        15979 non-null float64
卧室数量        150539 non-null int64
厅的数量        150539 non-null int64
卫的数量        150539 non-null int64
出租方式        19576 non-null float64
区           150522 non-null float64
位置          150522 non-null float64
地铁线路        70180 non-null float64
地铁站点        70180 non-null float64
距离          70180 non-null float64
装修情况        14604 non-null float64
月租金         150539 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 21.8+ MB

In [4]:

train.describe()

Out[4]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
count	150539.000000	150539.000000	149571.000000	150539.000000	150539.000000	150539.000000	15979.000000	150539.000000	150539.000000	150539.000000	19576.000000	150522.000000	150522.000000	70180.000000	70180.000000	70180.000000	14604.000000	150539.000000
mean	1.844871	3233.610035	0.120978	0.955852	0.406459	0.013156	2.722761	2.229854	1.303563	1.223291	0.917705	7.906731	67.937923	3.252707	57.571915	0.551246	3.600110	7.962330
std	0.704477	2020.913396	0.129586	0.851612	0.183616	0.007551	0.669594	0.893350	0.612709	0.487023	0.274820	4.010860	43.515929	1.471257	35.141576	0.246250	2.008348	6.314068
min	1.000000	0.000000	0.007812	0.000000	0.000000	0.000166	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.001667	1.000000	0.000000
25%	1.000000	1394.000000	0.039062	0.000000	0.290909	0.009268	3.000000	2.000000	1.000000	1.000000	1.000000	4.000000	33.000000	2.000000	23.000000	0.356667	2.000000	4.923599
50%	2.000000	3092.000000	0.082031	1.000000	0.418182	0.012910	3.000000	2.000000	1.000000	1.000000	1.000000	9.000000	61.000000	4.000000	59.000000	0.554167	2.000000	6.621392
75%	2.000000	5199.000000	0.156250	2.000000	0.563636	0.014896	3.000000	3.000000	2.000000	1.000000	1.000000	11.000000	102.000000	5.000000	87.000000	0.745000	6.000000	8.998302
max	3.000000	6627.000000	1.000000	2.000000	1.000000	1.000000	3.000000	11.000000	8.000000	8.000000	1.000000	14.000000	152.000000	5.000000	119.000000	1.000000	6.000000	100.000000

In [5]:

train.shape

Out[5]:

(150539, 19)

In [6]:

test.shape

Out[6]:

(46000, 19)

In [7]:

train.head()

Out[7]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [8]:

test.head()

Out[8]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况
0	1	3	3882	0.035156	1	0.436364	0.013075	东南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	3	6353	0.078125	1	0.436364	0.012248	东南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	4	3	1532	0.414062	1	0.600000	0.019695	东南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	5	3	1251	0.226562	1	0.381818	0.014730	东	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

Missing-value ratio

In [9]:

# Number of missing values in each column / total number of rows
train_missing = (train.isnull().sum()/len(train))*100

# Drop the columns that have no missing values
train_missing = train_missing.drop(
    train_missing[train_missing == 0].index).sort_values(ascending=False)

# Build the missing-ratio summary table
miss_data = pd.DataFrame({'缺失百分比': train_missing})

miss_data

Out[9]:

	缺失百分比
装修情况	90.298859
居住状态	89.385475
出租方式	86.996061
距离	53.380851
地铁站点	53.380851
地铁线路	53.380851
小区房屋出租数量	0.643023
位置	0.011293
区	0.011293

In [10]:

# Same summary for the test set. The null counts and the row count must both
# come from `test`: dividing train's null counts by len(test) yields
# percentages above 100%.
test_missing = (test.isnull().sum()/len(test))*100

# Drop the columns that have no missing values
test_missing = test_missing.drop(
    test_missing[test_missing == 0].index).sort_values(ascending=False)

# Build the missing-ratio summary table
miss_data = pd.DataFrame({'缺失百分比': test_missing})

miss_data

Target distribution

In [11]:

train['月租金'].head()

Out[11]:

0     5.602716
1    16.977929
2     8.998302
3     5.602716
4     7.300509
Name: 月租金, dtype: float64

In [12]:

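plt.figure(figsize=(20, 6))

plt.subplot(221)
plt.title('Monthly rent distribution', fontsize=18)
# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it
sns.histplot(train['月租金'], kde=True, stat="density")

plt.subplot(222)
plt.title('Monthly rent, sorted', fontsize=18)
plt.scatter(range(train.shape[0]), np.sort(train['月租金'].values))

plt.show()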

Distribution of all features

Histograms and bar charts

In [13]:

train.head()

Out[13]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [14]:

train.hist(figsize=(20,20),bins=50,grid=False)

plt.show()

Outlier analysis

Here we focus on outliers in 房屋面积 (floor area), the feature most strongly correlated with monthly rent.

In [21]:

def plot_reg(xs,y,data):

    n=len(xs)

    for i in range(n):

        plt.figure(figsize=(10,10))

        sns.regplot(x=data[xs[i]],y=data[y])

        plt.show()

In [22]:

reg_cols=['房屋面积']

plot_reg(reg_cols,"月租金",train)

Problematic data

The 房屋朝向 (orientation) column holds multiple values

In [23]:

train["房屋朝向"].head()

Out[23]:

0    东南
1     东
2    东南
3     南
4    东北
Name: 房屋朝向, dtype: object

In [24]:

# List the distinct values in the 房屋朝向 column

train['房屋朝向'].value_counts()

Out[24]:

南              41769
东南             41439
东              24749
西南             13407
北               7898
西               7559
西北              4066
南 北             3046
东北              2574
东南 南             660
东 东南             646
东 西              560
南 西南             334
东 南              309
东南 西南            175
南 西              158
东南 西北            114
西南 西              91
东 北               74
西 北               66
西 西北              64
东 东北              61
西南 西北             57
东南 东北             57
东南 南 西南           52
北 东北              49
南 西北              45
东南 西              44
南 西南 北            44
西北 北              41
西南 东北             40
东南 北              34
西南 北              32
东 西南              32
东 西北              26
东 南 西 北           24
东 东南 南            18
西北 东北             16
南 东               16
南 东北              14
东南 南 北            10
东 南 北              8
南 西 北              8
东南 西南 西北           8
东 南 西              7
东 东南 西南            6
南 西南 西             5
东 西 北              5
东南 西南 西            4
东 西北 北             4
北 南                2
西 西北 北             2
东 南 西北 北           2
东 西 东北             2
东 东南 北             2
东南 南 西南 西          1
东 东南 南 西南 西        1
西南 西 东北            1
北 西                1
Name: 房屋朝向, dtype: int64

In [25]:

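%%time

def split(text,i):
    """
    Split the string on spaces and return the item at index i,
    or NaN when the string has fewer than i+1 items.
    """
    items=text.split(" ")
    if i<len(items):
        return items[i]
    else:
        return np.nan

# One column per orientation slot; a listing has at most 5 orientations
for i in range(5):
    train['朝向_'+str(i)]=train['房屋朝向'].map(lambda x:split(x,i))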

CPU times: user 803 ms, sys: 6.96 ms, total: 810 ms
Wall time: 1.09 s

In [26]:

train.head(20)

Out[26]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	...	地铁线路	地铁站点	距离	装修情况	月租金	朝向_0	朝向_1	朝向_2	朝向_3	朝向_4
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	...	2.0	40.0	0.764167	NaN	5.602716	东南	NaN	NaN	NaN	NaN
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	...	4.0	58.0	0.709167	NaN	16.977929	东	NaN	NaN	NaN	NaN
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	...	5.0	37.0	0.572500	NaN	8.998302	东南	NaN	NaN	NaN	NaN
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	...	2.0	63.0	0.658333	NaN	5.602716	南	NaN	NaN	NaN	NaN
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	...	NaN	NaN	NaN	NaN	7.300509	东北	NaN	NaN	NaN	NaN
5	1	1192	0.039062	2	0.309091	0.012579	南	NaN	2	1	...	3.0	59.0	0.495833	NaN	4.923599	南	NaN	NaN	NaN	NaN
6	1	1122	0.125000	0	0.381818	0.010593	南	NaN	3	1	...	2.0	9.0	0.193333	NaN	6.621392	南	NaN	NaN	NaN	NaN
7	1	1251	0.128906	2	0.363636	0.018040	南	NaN	4	2	...	NaN	NaN	NaN	NaN	14.091681	南	NaN	NaN	NaN	NaN
8	1	4718	0.246094	2	0.309091	0.007850	西南	NaN	1	1	...	NaN	NaN	NaN	NaN	4.584041	西南	NaN	NaN	NaN	NaN
9	1	2654	0.218750	2	0.890909	0.020026	东南	NaN	2	1	...	4.0	58.0	0.400000	NaN	39.558574	东南	NaN	NaN	NaN	NaN
10	1	4847	0.042969	2	0.272727	0.010096	南北	NaN	2	2	...	NaN	NaN	NaN	NaN	4.923599	南	北	NaN	NaN	NaN
11	1	3069	0.031250	1	0.272727	0.031034	南	NaN	1	0	...	3.0	57.0	0.692500	NaN	24.278438	南	NaN	NaN	NaN	NaN
12	1	1407	0.015625	2	0.109091	0.020026	东南	NaN	3	2	...	NaN	NaN	NaN	NaN	6.960951	东南	NaN	NaN	NaN	NaN
13	1	623	0.039062	1	0.090909	0.023095	东南	NaN	3	2	...	1.0	86.0	0.125833	NaN	20.882852	东南	NaN	NaN	NaN	NaN
14	1	5814	0.273438	0	0.345455	0.007779	东	NaN	2	1	...	3.0	23.0	0.640833	NaN	5.263158	东	NaN	NaN	NaN	NaN
15	1	1697	0.195312	1	0.581818	0.007448	西南	NaN	1	1	...	NaN	NaN	NaN	NaN	4.923599	西南	NaN	NaN	NaN	NaN
16	1	1691	0.027344	0	0.490909	0.012413	西南	NaN	3	2	...	NaN	NaN	NaN	NaN	5.602716	西南	NaN	NaN	NaN	NaN
17	1	5895	0.031250	1	0.709091	0.014227	东南	NaN	2	1	...	4.0	58.0	0.235000	NaN	29.371817	东南	NaN	NaN	NaN	NaN
18	1	3142	0.007812	2	0.109091	0.016882	南	NaN	2	2	...	3.0	87.0	0.173333	NaN	5.602716	南	NaN	NaN	NaN	NaN
19	1	6181	0.015625	0	0.109091	0.024495	东南	NaN	4	2	...	5.0	17.0	0.927500	NaN	15.789474	东南	NaN	NaN	NaN	NaN

20 rows × 24 columns

In [27]:

names=["朝向_{}".format(i) for i in range(5)]

train[names].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 5 columns):
朝向_0    150539 non-null object
朝向_1    7078 non-null object
朝向_2    214 non-null object
朝向_3    28 non-null object
朝向_4    1 non-null object
dtypes: object(5)
memory usage: 5.7+ MB
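The per-index `map` loop above can also be expressed with pandas string methods. The sketch below is an alternative under stated assumptions, not the author's method: it uses a toy Series in place of `train['房屋朝向']`, and the number of output columns follows the longest row rather than being fixed at 5.

```python
import pandas as pd

# Toy stand-in for train['房屋朝向']; multiple orientations are space-separated
s = pd.Series(["东南", "南 北", "东 南 西 北"], name="房屋朝向")

# expand=True returns one column per token; shorter rows are padded with missing values
parts = s.str.split(" ", expand=True)
parts.columns = ["朝向_{}".format(i) for i in range(parts.shape[1])]

print(parts)
```

Because the split happens in one vectorized call, the frame is traversed once instead of once per orientation slot.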

The same community belongs to different districts

In [28]:

train.head()

Out[28]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	...	地铁线路	地铁站点	距离	装修情况	月租金	朝向_0	朝向_1	朝向_2	朝向_3	朝向_4
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	...	2.0	40.0	0.764167	NaN	5.602716	东南	NaN	NaN	NaN	NaN
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	...	4.0	58.0	0.709167	NaN	16.977929	东	NaN	NaN	NaN	NaN
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	...	5.0	37.0	0.572500	NaN	8.998302	东南	NaN	NaN	NaN	NaN
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	...	2.0	63.0	0.658333	NaN	5.602716	南	NaN	NaN	NaN	NaN
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	...	NaN	NaN	NaN	NaN	7.300509	东北	NaN	NaN	NaN	NaN

5 rows × 24 columns

In [29]:

train.columns

Out[29]:

Index(['时间', '小区名', '小区房屋出租数量', '楼层', '总楼层', '房屋面积', '房屋朝向', '居住状态', '卧室数量',
       '厅的数量', '卫的数量', '出租方式', '区', '位置', '地铁线路', '地铁站点', '距离', '装修情况', '月租金',
       '朝向_0', '朝向_1', '朝向_2', '朝向_3', '朝向_4'],
      dtype='object')

In [29]:

neighbors1=train[['小区名','区','位置']]

print(neighbors1.shape)

neighbors1.head()

(150539, 3)

Out[29]:

	小区名	区	位置
0	3072	11.0	118.0
1	3152	10.0	100.0
2	5575	12.0	130.0
3	3103	7.0	90.0
4	5182	3.0	31.0

In [30]:

# After dropping duplicate ('小区名', '位置') pairs, 5292 unique pairs remain

neighbors1 = train[['小区名', '位置']].drop_duplicates()

neighbors1.shape

Out[30]:

(5292, 2)

In [31]:

# After dropping duplicate ('小区名', '位置') pairs and rows with missing values, 5291 unique pairs remain

neighbors1 = train[['小区名', '位置']].drop_duplicates().dropna()

neighbors1.shape

Out[31]:

(5291, 2)

In [32]:

# Group neighbors1 by 小区名 and keep the communities that have more than one row

count = neighbors1.groupby('小区名')['位置'].count()

ids = count[count > 1].index

ids

Out[32]:

Int64Index([ 284,  385,  418,  701,  783, 2228, 2468, 2513, 3183, 3482, 3645,
            3967, 4054, 4071, 4471, 4767, 4859, 5320, 5699, 5844, 5968, 6122,
            6515, 6626],
           dtype='int64', name='小区名')
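The same check (which communities appear at more than one location) can be sketched with `groupby(...).nunique()`, which skips the explicit `drop_duplicates().dropna()` step. The toy frame below is an assumption for illustration; only the column names come from the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "小区名": [284, 284, 284, 385, 385, 500],
    "位置":   [102.0, 102.0, 57.0, 108.0, None, 31.0],
})

# nunique ignores NaN by default, so missing locations are not counted
n_positions = df.groupby("小区名")["位置"].nunique()

# Communities seen at more than one distinct location
multi = n_positions[n_positions > 1].index.tolist()
print(multi)
```

On the real data, `multi` would play the role of `ids` in the cell above.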

In [33]:

# Select the rows for these communities from the original data

neighbors_has_problem = train[['小区名', '位置']

                              ][train['小区名'].isin(ids)].sort_values(by='小区名')

print(neighbors_has_problem.shape)

neighbors_has_problem.head()

(521, 2)

Out[33]:

	小区名	位置
129747	284	102.0
127972	284	102.0
127314	284	102.0
126698	284	102.0
126496	284	102.0

In [34]:

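# Find the mode of 位置 for each community.
# x.mode() may return several modes, so np.max picks the largest one as the final value.
position_mode_of_neighbors = neighbors_has_problem.groupby(
    '小区名').apply(lambda x: np.max(x['位置'].mode()))

# Missing 位置 values can be filled from this table.
# A community that already appears at several locations may simply be large,
# so it is not treated as a logic error and is left unchanged.
position_mode_of_neighbors.head()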

Out[34]:

小区名
284    102.0
385    108.0
418    122.0
701    113.0
783    134.0
dtype: float64

The same community has different metro lines

In [35]:

# After dropping duplicate ('小区名', '地铁线路') pairs and missing values, 3207 unique pairs remain

lines = train[['小区名', '地铁线路']].drop_duplicates().dropna()

lines.shape

Out[35]:

(3207, 2)

In [36]:

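# Only 3138 distinct communities have a metro line, so 3207 - 3138 = 69 (community, line) pairs are extra: some communities have more than one line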

train[train['地铁线路'].notnull()].drop_duplicates(['小区名']).shape

Out[36]:

(3138, 24)

In [37]:

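# Group lines by 小区名 and keep communities with more than one row: 68 communities have multiple metro lines
# The metro line may depend on 位置: a community that spans several locations can also have different lines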

count = lines.groupby('小区名')['地铁线路'].count()

ids = count[count > 1].index

ids.shape

Out[37]:

(68,)

Relationship between location and metro line

In [38]:

train[['位置', '地铁线路']].drop_duplicates().dropna().head()

Out[38]:

	位置	地铁线路
0	118.0	2.0
1	100.0	4.0
2	130.0	5.0
3	90.0	2.0
5	143.0	3.0

In [39]:

# After dropping duplicate ('位置', '地铁线路') pairs and missing values, 184 unique pairs remain

pos_lines = train[['位置', '地铁线路']].drop_duplicates().dropna()

pos_lines.shape

Out[39]:

(184, 2)

In [40]:

# How many distinct locations have a metro line: 120

pos_lines['位置'].value_counts().head()

Out[40]:

113.0    4
100.0    4
118.0    3
63.0     3
106.0    3
Name: 位置, dtype: int64

In [41]:

# Group pos_lines by 位置 and keep locations with more than one row: 49 locations have multiple metro lines

count = pos_lines.groupby('位置')['地铁线路'].count()

ids = count[count > 1].index

ids.shape

Out[41]:

(49,)

Relationship between location and metro station

In [42]:

# After dropping duplicate ('位置', '地铁站点') pairs and missing values, 337 unique pairs remain

pos_stations = train[['位置', '地铁站点']].drop_duplicates().dropna()

print(pos_stations.shape)

pos_stations.head()

(337, 2)

Out[42]:

	位置	地铁站点
0	118.0	40.0
1	100.0	58.0
2	130.0	37.0
3	90.0	63.0
5	143.0	59.0

In [43]:

# How many distinct locations have a metro station: 120

pos_stations['位置'].value_counts().head()

Out[43]:

63.0     9
106.0    6
86.0     6
100.0    6
143.0    6
Name: 位置, dtype: int64

In [44]:

# Group pos_stations by 位置 and keep locations with more than one row: 97 locations have multiple stations

count = pos_stations.groupby('位置')['地铁站点'].count()

ids = count[count > 1].index

ids.shape

Out[44]:

(97,)

Relationship between community, location, metro line and station

In [45]:

# After dropping duplicates over the four columns 小区名, 位置, 地铁线路, 地铁站点 and missing values, 3356 unique rows remain

neighbor_pos_stations = train[['小区名', '位置',

                               '地铁线路', '地铁站点']].drop_duplicates().dropna()

neighbor_pos_stations.shape

Out[45]:

(3356, 4)

In [46]:

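# Check whether the metro line can differ when 小区名 and 位置 are the same.
# 3356 - 3209 = 147 rows share 小区名, 位置 and 地铁线路 but differ in 地铁站点.
# 3356 - 3147 = 209 rows share 小区名 and 位置 but differ in 地铁线路 or 地铁站点.
# This may be a data error or may reflect reality; it is left untouched below.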

print(neighbor_pos_stations[['小区名', '位置', '地铁线路']

                            ].drop_duplicates().dropna().shape)

print(neighbor_pos_stations[['小区名', '位置']].drop_duplicates().dropna().shape)

(3209, 3)
(3147, 2)

Check whether transfer stations exist

Group by station, then count the metro lines

In [47]:

train[['地铁线路', '地铁站点']].head()

Out[47]:

	地铁线路	地铁站点
0	2.0	40.0
1	4.0	58.0
2	5.0	37.0
3	2.0	63.0
4	NaN	NaN

In [48]:

train[['地铁线路', '地铁站点']].drop_duplicates(

).dropna().groupby('地铁站点').count().head()

Out[48]:

	地铁线路
地铁站点
1.0	1
2.0	1
3.0	1
4.0	1
5.0	1

In [49]:

# No transfer stations exist: every station belongs to exactly one metro line

train[['地铁线路', '地铁站点']].drop_duplicates(

).dropna().groupby('地铁站点').count().max(0)

Out[49]:

地铁线路    1
dtype: int64

Number of metro lines and stations per location

In [50]:

# Number of lines per location; this can be added as a new feature

a=train[['位置','地铁线路']].drop_duplicates().dropna().groupby('位置').count()

a.head()

Out[50]:

	地铁线路
位置
0.0	1
1.0	2
2.0	1
3.0	2
4.0	1

In [51]:

# Number of stations per location; this can also be added as a new feature

b = train[['位置', '地铁站点']].drop_duplicates().dropna().groupby('位置').count()

b.head()

Out[51]:

	地铁站点
位置
0.0	1
1.0	3
2.0	1
3.0	4
4.0	1

In [52]:

# Correlation between the two

al = pd.concat([a, b], axis=1)

al.head()

Out[52]:

	地铁线路	地铁站点
位置
0.0	1	1
1.0	2	3
2.0	1	1
3.0	2	4
4.0	1	1

In [53]:

al.corr()

Out[53]:

	地铁线路	地铁站点
地铁线路	1.000000	0.689305
地铁站点	0.689305	1.000000
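The comments above suggest adding the per-location line and station counts as new features. One way to attach them to each row is `Series.map`, sketched below on a toy frame. The feature names `位置线路数` and `位置站点数` are hypothetical, and `nunique` is used here in place of the `drop_duplicates().dropna().groupby().count()` chain, so a location with no metro gets 0 instead of being absent.

```python
import pandas as pd

df = pd.DataFrame({
    "位置":     [1.0, 1.0, 1.0, 2.0, 3.0],
    "地铁线路": [2.0, 4.0, 2.0, 5.0, None],
    "地铁站点": [40.0, 58.0, 41.0, 37.0, None],
})

# Distinct lines / stations seen at each location (NaN is ignored by nunique)
lines_per_pos = df.groupby("位置")["地铁线路"].nunique()
stations_per_pos = df.groupby("位置")["地铁站点"].nunique()

# Hypothetical feature names; map() looks up each row's location in the counts
df["位置线路数"] = df["位置"].map(lines_per_pos)
df["位置站点数"] = df["位置"].map(stations_per_pos)

print(df)
```

Since the two counts correlate at about 0.69 in the output above, a model may not need both; that choice is left to feature selection.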

Check whether rows with a missing location also lack a metro station

In [54]:

train[["位置", "地铁站点", "地铁线路"]].head()

Out[54]:

	位置	地铁站点	地铁线路
0	118.0	40.0	2.0
1	100.0	58.0	4.0
2	130.0	37.0	5.0
3	90.0	63.0	2.0
4	31.0	NaN	NaN

In [55]:

# Some rows have a metro station but a missing location, so the station can be used later when filling missing locations

pos_lines = train[['位置', '地铁站点']].drop_duplicates()

In [56]:

pos_lines.head()

Out[56]:

	位置	地铁站点
0	118.0	40.0
1	100.0	58.0
2	130.0	37.0
3	90.0	63.0
4	31.0	NaN

In [57]:

pos_lines['位置'].isnull().sum()

Out[57]:

In [58]:

# Number of locations per station; this can also be added as a new feature

train[['位置', '地铁站点']].drop_duplicates().dropna().groupby('地铁站点').count().head()

Out[58]:

	位置
地铁站点
1.0	4
2.0	1
3.0	5
4.0	1
5.0	5

Checking the relationship between location and district

In [59]:

# Check whether any location belongs to more than one district

train[['位置', '区']].head()

Out[59]:

	位置	区
0	118.0	11.0
1	100.0	10.0
2	130.0	12.0
3	90.0	7.0
4	31.0	3.0

In [60]:

train[['位置', '区']].drop_duplicates().dropna().groupby('位置').count().head()

Out[60]:

	区
位置
0.0	1
1.0	1
2.0	1
3.0	1
4.0	1

In [61]:

# Every location belongs to exactly one district; no location falls in two districts

train[['位置', '区']].drop_duplicates().dropna().groupby('位置').count().max()

Out[61]:

区    1
dtype: int64

The problem of too many community names

In [62]:

train['小区名'].head()

Out[62]:

0    3072
1    3152
2    5575
3    3103
4    5182
Name: 小区名, dtype: int64

In [63]:

neighbors=train['小区名'].value_counts()

In [64]:

neighbors.head()

Out[64]:

5512    1406
1085     917
5208     847
6221     815
1532     775
Name: 小区名, dtype: int64

In [65]:

# Count the communities with more than 50 listings

(neighbors > 50).sum()

Out[65]:

In [66]:

# Count the communities with more than 100 listings

(neighbors > 100).sum()

Out[66]:

Data cleaning

In [30]:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')  # suppress warnings

Load the data

Basic information about the data

In [31]:

train=pd.read_csv("data/train.csv")

train.head()

Out[31]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [32]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 19 columns):
时间          150539 non-null int64
小区名         150539 non-null int64
小区房屋出租数量    149571 non-null float64
楼层          150539 non-null int64
总楼层         150539 non-null float64
房屋面积        150539 non-null float64
房屋朝向        150539 non-null object
居住状态        15979 non-null float64
卧室数量        150539 non-null int64
厅的数量        150539 non-null int64
卫的数量        150539 non-null int64
出租方式        19576 non-null float64
区           150522 non-null float64
位置          150522 non-null float64
地铁线路        70180 non-null float64
地铁站点        70180 non-null float64
距离          70180 non-null float64
装修情况        14604 non-null float64
月租金         150539 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 21.8+ MB

In [33]:

train.shape

Out[33]:

(150539, 19)

In [34]:

# 出租方式 has many missing values

train["出租方式"].value_counts()

Out[34]:

1.0    17965
0.0     1611
Name: 出租方式, dtype: int64

In [35]:

train["装修情况"].value_counts()

Out[35]:

2.0    7379
6.0    5862
1.0     906
4.0     339
3.0     103
5.0      15
Name: 装修情况, dtype: int64

In [36]:

train["居住状态"].value_counts()

Out[36]:

3.0    13530
1.0     1981
2.0      468
Name: 居住状态, dtype: int64

Fill values used later

In [37]:

space_threshold = 0.3
dist_value_for_fill = 2  # 2 because the maximum distance is 1; no metro means "very far"
line_value_for_fill = 0
station_value_for_fill = 0
state_value_for_fill = 0  # train["居住状态"].mode().values[0]
decration_value_for_fill = -1  # train["装修情况"].mode().values[0]
rent_value_for_fill = -1  # train["出租方式"].mode().values[0]

In [38]:

# Mode of 位置 for each district; keep the one for the most common district

area_value_for_fill = train["区"].mode().values[0]

position_by_area = train.groupby('区').apply(lambda x: x["位置"].mode())

# print(position_by_area)

position_value_for_fill = position_by_area[position_by_area.index ==

                                           area_value_for_fill].values[0][0]

# print(position_value_for_fill)

In [39]:

# Mode of 小区房屋出租数量 for each community

ratio_by_neighbor = train.groupby('小区名').apply(lambda x: x["小区房屋出租数量"].mode())

index = [x[0] for x in ratio_by_neighbor.index]

ratio_by_neighbor.index = index

ratio_by_neighbor = ratio_by_neighbor.to_dict()

ratio_mode = train["小区房屋出租数量"].mode().values[0]

Handling missing values

Missing-value ratio

In [40]:

# Missing-value ratio

def ratio_of_null():

    train_missing = (train.isnull().sum()/len(train))*100

    train_missing = train_missing.drop(train_missing[train_missing==0].index).sort_values(ascending=False)

    return pd.DataFrame({'缺失百分比':train_missing})

ratio_of_null()

Out[40]:

	缺失百分比
装修情况	90.298859
居住状态	89.385475
出租方式	86.996061
距离	53.380851
地铁站点	53.380851
地铁线路	53.380851
小区房屋出租数量	0.643023
位置	0.011293
区	0.011293

Filling district and location

Find the rows whose location is missing

In [41]:

train.head()

Out[41]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [42]:

# Every row with a NaN 位置 belongs to community 3269

train[train["位置"].isna()]

Out[42]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
87169	2	3269	0.136719	1	0.290909	0.014565	西南	3.0	3	2	1	1.0	NaN	NaN	NaN	NaN	NaN	6.0	7.640068
87686	2	3269	0.050781	0	0.290909	0.006455	东	NaN	1	1	1	NaN	NaN	NaN	3.0	59.0	0.390000	NaN	4.244482
89090	2	3269	0.238281	2	0.600000	0.010180	西南	NaN	2	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	11.035654
101618	2	3269	0.082031	1	0.581818	0.020026	南	NaN	4	2	1	1.0	NaN	NaN	NaN	NaN	NaN	NaN	8.998302
102958	2	3269	0.058594	0	0.200000	0.014305	西北	NaN	2	2	1	NaN	NaN	NaN	2.0	70.0	0.950000	NaN	7.300509
105400	2	3269	0.007812	1	0.309091	0.012494	南	NaN	2	1	1	NaN	NaN	NaN	5.0	71.0	0.649167	NaN	5.602716
106243	2	3269	0.070312	1	0.600000	0.012248	南	NaN	2	2	1	NaN	NaN	NaN	2.0	65.0	0.482500	NaN	6.621392
107728	2	3269	0.070312	1	0.309091	0.011255	西	NaN	2	2	1	NaN	NaN	NaN	5.0	27.0	0.294167	NaN	8.998302
108349	2	3269	0.027344	1	0.309091	0.013737	东南	NaN	4	2	2	NaN	NaN	NaN	3.0	59.0	0.491667	NaN	8.319185
113818	2	3269	0.062500	0	0.181818	0.012271	东南	NaN	2	1	1	NaN	NaN	NaN	2.0	55.0	0.400000	NaN	7.300509
119571	2	3269	0.089844	1	0.454545	0.011178	东南	NaN	2	1	1	1.0	NaN	NaN	5.0	29.0	1.000000	NaN	4.584041
127246	3	3269	NaN	1	0.090909	0.011255	东	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.584041
132357	3	3269	0.023438	0	0.290909	0.001821	东南	NaN	1	0	1	NaN	NaN	NaN	3.0	88.0	0.325833	NaN	2.886248
137717	3	3269	0.011719	1	0.090909	0.010593	南	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.263158
140425	3	3269	0.031250	2	0.581818	0.014234	东	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.960951
141042	3	3269	0.316406	0	0.600000	0.007180	西北	3.0	2	1	1	1.0	NaN	NaN	NaN	NaN	NaN	2.0	4.074703
144922	3	3269	0.117188	1	0.600000	0.006455	东南	1.0	1	0	1	1.0	NaN	NaN	NaN	NaN	NaN	2.0	4.923599

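In [43]:

# Only a handful of rows have a NaN 位置, so drop them
train=train[train['小区名']!=3269]

# The original write-up drops these rows, but that is not recommended;
# the missing values can instead be replaced with the mode, as in the commented code below.
# test["位置"].fillna(test["位置"].mode()[0], inplace=True)
# test["区"].fillna(test["区"].mode()[0], inplace=True)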

In [44]:

ratio_of_null()

Out[44]:

	缺失百分比
装修情况	90.299757
居住状态	89.386269
出租方式	86.997914
距离	53.381565
地铁站点	53.381565
地铁线路	53.381565
小区房屋出租数量	0.642431

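Handling metro station and distance

First fill 地铁线路, 地铁站点 and 距离 from the mode within the same community and location.
The values still missing after that are kept as a feature of their own, meaning the listing has no metro nearby.

In [45]:

# Group by community and location, then take the mode of station and distance for each group.
# Columns are selected with a list: tuple-style selection on a groupby is deprecated and fails in recent pandas.
# x.mode().max() picks the largest mode of each column.
station_by_nb_pos = train[['小区名', '位置', '地铁站点', '距离']].drop_duplicates().dropna(
).groupby(['小区名', '位置'])[['地铁站点', '距离']].apply(lambda x: x.mode().max())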

station_by_nb_pos.head()

Out[45]:

		地铁站点	距离
小区名	位置
0	59.0	57.0	0.478333
1	59.0	57.0	0.563333
2	40.0	33.0	0.971667
11	24.0	103.0	0.914167
12	28.0	69.0	0.487500

In [46]:

# Same lookup keyed by community only, used when the (community, location) pair is not found
station_by_nb = train[['小区名', '地铁站点', '距离']].drop_duplicates().dropna(
).groupby('小区名')[['地铁站点', '距离']].apply(lambda x: x.mode().max())

station_by_nb.head()

Out[46]:

	地铁站点	距离
小区名
0	57.0	0.478333
1	57.0	0.563333
2	33.0	0.971667
11	103.0	0.914167
12	69.0	0.487500

In [47]:

# Metro line for each station

lines_by_station = train[['地铁站点', '地铁线路']].drop_duplicates(

).dropna().groupby('地铁站点')['地铁线路'].min()

In [48]:

def fill_stations(line, s_by_np, s_by_n, l_by_s):
    """
    s_by_np: station_by_nb_pos
    s_by_n: station_by_nb
    l_by_s: lines_by_station
    """
    # If the station is already present, return the row unchanged.
    # pd.isna is preferable to np.isnan here.
    if not pd.isna(line['地铁站点']):
        return line

    # If the (community, location) pair is in the lookup index, fill from it.
    # Test against .index: `in` on a DataFrame checks column labels.
    if (line['小区名'], line['位置']) in s_by_np.index:
        line['地铁站点'] = s_by_np.loc[(line['小区名'], line['位置']), '地铁站点']
        line['距离'] = s_by_np.loc[(line['小区名'], line['位置']), '距离']
        line['地铁线路'] = l_by_s[line['地铁站点']]
    elif line['小区名'] in s_by_n.index:
        line['地铁站点'] = s_by_n.loc[line['小区名'], '地铁站点']  # fill with the community-level mode
        line['距离'] = s_by_n.loc[line['小区名'], '距离']
        line['地铁线路'] = l_by_s[line['地铁站点']]
    else:  # community not found either: treat as its own class, i.e. no metro
        line['地铁站点'] = 0
        line['距离'] = 2  # fill distance with 2
        line['地铁线路'] = 0
    return line

train = train.apply(fill_stations, s_by_np=station_by_nb_pos,
                    s_by_n=station_by_nb, l_by_s=lines_by_station, axis=1)

ratio_of_null()

Out[48]:

	缺失百分比
装修情况	90.299757
居住状态	89.386269
出租方式	86.997914
小区房屋出租数量	0.642431
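Row-wise `apply(..., axis=1)` over roughly 150k rows can be slow. A possible vectorized alternative is a left merge against each lookup table followed by `fillna`, sketched below. The helper `fill_from_lookup` and the toy frames are hypothetical, only 地铁站点 and 距离 are filled here, and each lookup is assumed to have unique keys so the merge keeps one row per listing.

```python
import pandas as pd

def fill_from_lookup(df, lookup, keys, cols):
    """Fill NaNs in `cols` from `lookup`, a frame indexed by `keys` (unique)."""
    found = df[keys].merge(lookup.reset_index(), on=keys, how="left")
    found.index = df.index  # a left merge keeps the row order of df
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(found[col])
    return out

df = pd.DataFrame({
    "小区名":   [0, 0, 1, 2],
    "位置":     [59.0, 59.0, 40.0, 24.0],
    "地铁站点": [57.0, None, None, None],
    "距离":     [0.5, None, None, None],
})
# Toy lookups shaped like station_by_nb_pos and station_by_nb
by_nb_pos = pd.DataFrame(
    {"地铁站点": [57.0], "距离": [0.5]},
    index=pd.MultiIndex.from_tuples([(0, 59.0)], names=["小区名", "位置"]),
)
by_nb = pd.DataFrame(
    {"地铁站点": [33.0], "距离": [0.75]},
    index=pd.Index([1], name="小区名"),
)

cols = ["地铁站点", "距离"]
out = fill_from_lookup(df, by_nb_pos, ["小区名", "位置"], cols)  # most specific key first
out = fill_from_lookup(out, by_nb, ["小区名"], cols)             # then community only
out = out.fillna({"地铁站点": 0, "距离": 2})                     # no match: no metro nearby
print(out)
```

The metro line could then be derived from the filled station with a `map` against a station-to-line lookup such as `lines_by_station`, with 0 for stations filled as 0.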

Handling 小区房屋出租数量

Fill with the mode of 小区房屋出租数量 within each community

In [49]:

# Mode of 小区房屋出租数量 for each community (ties between modes are averaged)

ratio_by_neighbor = train[['小区名', '小区房屋出租数量']].dropna().groupby(

    '小区名').apply(lambda x: np.mean(x["小区房屋出租数量"].mode()))

ratio_by_neighbor.head()

Out[49]:

小区名
0    0.007812
1    0.011719
2    0.007812
4    0.017578
5    0.007812
dtype: float64

In [50]:

# Mode of 小区房屋出租数量 over all communities

ratio_mode=train["小区房屋出租数量"].mode().values[0]

ratio_mode

Out[50]:

0.015625

In [51]:

def fill_by_key(x,k,v,values,mode):

    if not pd.isna(x[v]):

        return x

    else:

        if x[k] in values.index:

            x[v]=values[x[k]]

        else:

            x[v]=mode

        return x

# train['小区房屋出租数量']=train['小区房屋出租数量'].map()

train=train.apply(fill_by_key,k="小区名",v="小区房屋出租数量",values=ratio_by_neighbor,mode=ratio_mode,axis=1)

In [52]:

ratio_of_null()

Out[52]:

	缺失百分比
装修情况	90.299757
居住状态	89.386269
出租方式	86.997914
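The keyed fill performed by `fill_by_key` can also be sketched without a row-wise `apply`, by chaining two `fillna` calls: first the community-level value via `map`, then the global mode. The toy frame below is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "小区名": [0, 0, 1, 1, 7],
    "小区房屋出租数量": [0.25, None, 0.5, 0.5, None],
})

# Per-community mode; ties are averaged, mirroring np.mean(x.mode()) in the cell above
ratio_by_neighbor = (
    df.dropna(subset=["小区房屋出租数量"])
      .groupby("小区名")["小区房屋出租数量"]
      .agg(lambda s: s.mode().mean())
)
global_mode = df["小区房屋出租数量"].mode().iloc[0]

filled = (
    df["小区房屋出租数量"]
      .fillna(df["小区名"].map(ratio_by_neighbor))  # community-level value first
      .fillna(global_mode)                           # then the global fallback
)
print(filled.tolist())
```

Community 7 has no observed value in this toy data, so it falls through to the global mode, which is the same fallback order as `fill_by_key`.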

装修情况, 居住状态, 出租方式: missing values become a separate category

In [53]:

train["出租方式"]=train["出租方式"].fillna(int(-1))

train["装修情况"]=train["装修情况"].fillna(int(-1))

train["居住状态"]=train["居住状态"].fillna(int(0))

In [54]:

ratio_of_null()

Out[54]:

Removing outlier samples

房屋面积 contains outliers; drop the samples whose 房屋面积 is abnormal

In [55]:

train['房屋面积'].head()

Out[55]:

0    0.008628
1    0.017046
2    0.010593
3    0.019199
4    0.010427
Name: 房屋面积, dtype: float64

In [56]:

print(space_threshold)

[train[train['房屋面积']>space_threshold]]

0.3

Out[56]:

[        时间   小区名  小区房屋出租数量  楼层       总楼层      房屋面积 房屋朝向  居住状态  卧室数量  厅的数量  \
 100648   2    17  0.335938   0  0.727273  1.000000   东南   0.0     1     1   
 105736   2    17  0.320312   0  0.727273  1.000000   东南   0.0     1     1   
 127221   3    17  0.339844   0  0.727273  1.000000   东南   0.0     1     1   
 150066   3  3946  0.050781   0  0.272727  0.330354    西   0.0     2     1   
 
         卫的数量  出租方式     区     位置  地铁线路   地铁站点        距离  装修情况        月租金  
 100648     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 105736     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 127221     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 150066     1  -1.0   0.0  109.0   0.0    0.0  2.000000  -1.0   5.602716  ]

In [57]:

train=train[train['房屋面积']<space_threshold]

train.shape

Out[57]:

(150518, 19)

Skew correction

The target 月租金 is widely spread and right-skewed, so apply a log transform to smooth it

In [58]:

train["log_rent"] = np.log1p(train["月租金"])  # np.log1p(x) = log(1 + x)

# Reference: https://www.cnblogs.com/wqbin/p/10346292.html
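A model trained on `log_rent` predicts in log space, so its predictions have to be mapped back before they are compared with actual rents. Assuming the model is trained on `log_rent`, a minimal sketch of the round trip uses `np.expm1`, the inverse of `np.log1p`; the array below reuses a few rent values from the tables above.

```python
import numpy as np

rent = np.array([0.0, 5.602716, 16.977929, 100.0])

log_rent = np.log1p(rent)      # log(1 + x); defined at x = 0
restored = np.expm1(log_rent)  # exp(y) - 1, the inverse of log1p

print(np.allclose(restored, rent))
```

In practice `restored` would be replaced by `np.expm1(model_predictions)` before computing RMSE against 月租金.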

In [59]:

# Before the correction

plt.figure(figsize=(10, 5))

sns.boxplot(x="月租金", data=train, orient='h')

plt.show()

In [60]:

# After the correction

plt.figure(figsize=(10, 5))

sns.boxplot(x="log_rent", data=train, orient='h')

plt.show()

In [61]:

train.head()

Out[61]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金	log_rent
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	1	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	0	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	2	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	2	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	1	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317

Handling problematic data

The 房屋朝向 column holds multiple values; here we keep only the first one

In [66]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150538
Data columns (total 21 columns):
时间          150518 non-null int64
小区名         150518 non-null int64
小区房屋出租数量    150518 non-null float64
楼层          150518 non-null int64
总楼层         150518 non-null float64
房屋面积        150518 non-null float64
房屋朝向        150518 non-null object
居住状态        150518 non-null float64
卧室数量        150518 non-null int64
厅的数量        150518 non-null int64
卫的数量        150518 non-null int64
出租方式        150518 non-null float64
区           150518 non-null float64
位置          150518 non-null float64
地铁线路        150518 non-null float64
地铁站点        150518 non-null float64
距离          150518 non-null float64
装修情况        150518 non-null float64
月租金         150518 non-null float64
log_rent    150518 non-null float64
新朝向         150518 non-null object
dtypes: float64(13), int64(6), object(2)
memory usage: 25.3+ MB

In [68]:

train["房屋朝向"].head()

Out[68]:

0    东南
1     东
2    东南
3     南
4    东北
Name: 房屋朝向, dtype: object

In [62]:

def split(text,i):

    items=text.split(" ")

    if i<len(items):

        return items[i]

    else:

        return np.nan

train['新朝向']=train['房屋朝向'].map(lambda x:split(x,0))

In [63]:

train.head()

train['新朝向'].value_counts()

Out[63]:

南     45435
东南    42590
东     26533
西南    13626
北      7950
西      7689
西北     4121
东北     2574
Name: 新朝向, dtype: int64

In [64]:

train.head()

Out[64]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金	log_rent	新朝向
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481	东南
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145	东
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415	东南
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481	南
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317	东北

5 rows × 21 columns

Saving the data

In [65]:

train.to_csv("./data/train_data_cleaning.csv",index=None)

Feature engineering

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')  # suppress warnings

Loading the data

Inspecting basic information

In [2]:

train=pd.read_csv("./data/train_data_cleaning.csv")

train.head()

Out[2]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金	log_rent	新朝向
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481	东南
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145	东
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415	东南
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481	南
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317	东北

5 rows × 21 columns

In [3]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 21 columns):
时间          150518 non-null int64
小区名         150518 non-null int64
小区房屋出租数量    150518 non-null float64
楼层          150518 non-null int64
总楼层         150518 non-null float64
房屋面积        150518 non-null float64
房屋朝向        150518 non-null object
居住状态        150518 non-null float64
卧室数量        150518 non-null int64
厅的数量        150518 non-null int64
卫的数量        150518 non-null int64
出租方式        150518 non-null float64
区           150518 non-null float64
位置          150518 non-null float64
地铁线路        150518 non-null float64
地铁站点        150518 non-null float64
距离          150518 non-null float64
装修情况        150518 non-null float64
月租金         150518 non-null float64
log_rent    150518 non-null float64
新朝向         150518 non-null object
dtypes: float64(13), int64(6), object(2)
memory usage: 24.1+ MB

Feature processing

Building new features from the bedroom, living-room and bathroom counts and the floor area

In [4]:

train["房+卫+厅"] = train["卧室数量"]+train["厅的数量"]+train["卫的数量"]

train["房/总"] = train["卧室数量"]/(train["房+卫+厅"]+1)  # the +1 keeps the denominator from being 0, which would produce inf

train["卫/总"] = train["卫的数量"]/(train["房+卫+厅"]+1)

train["厅/总"] = train["厅的数量"]/(train["房+卫+厅"]+1)

train['卧室面积'] = train['房屋面积']/(train['卧室数量']+1)

train['楼层比'] = train['楼层']/(train["总楼层"]+1)

train['户型'] = train[['卧室数量', '厅的数量', '卫的数量']].apply(

    lambda x: str(x['卧室数量'])+str(x['厅的数量'])+str(x['卫的数量']), axis=1)
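The row-wise `apply` used for 户型 is easy to read but slow on 150k rows. A possible vectorized alternative, sketched on a small made-up frame (the rows are assumptions chosen for illustration), concatenates the three columns as strings:

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the dataset.
rooms = pd.DataFrame({
    "卧室数量": [1, 2, 3],
    "厅的数量": [1, 1, 2],
    "卫的数量": [1, 2, 2],
})

# Element-wise string concatenation: bedrooms + living rooms + bathrooms.
layout = (
    rooms["卧室数量"].astype(str)
    + rooms["厅的数量"].astype(str)
    + rooms["卫的数量"].astype(str)
)

print(layout.tolist())
```

Because the code is a digit string, it is parsed back as an integer when the CSV is reloaded (the later `info()` output lists 户型 as int64), so any code with a leading zero would lose that digit unless the column is read as a string.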

In [5]:

train.head()

Out[5]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	月租金	log_rent	新朝向	房+卫+厅	房/总	卫/总	厅/总	卧室面积	楼层比	户型
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	5.602716	1.887481	东南	3	0.250000	0.250000	0.250000	0.004314	1.617647	111
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	16.977929	2.889145	东	1	0.500000	0.000000	0.000000	0.008523	0.723684	100
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	8.998302	2.302415	东南	5	0.333333	0.333333	0.166667	0.003531	0.000000	212
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	5.602716	1.887481	南	7	0.375000	0.250000	0.250000	0.004800	1.264368	322
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	7.300509	2.116317	东北	4	0.400000	0.200000	0.200000	0.003476	0.000000	211

5 rows × 28 columns

Building a has-subway flag

In [6]:

train["有地铁"]=(train["地铁站点"]>-1).map(int)
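In the sample rows, listings without a subway appear with 地铁站点 equal to 0.0 and 距离 equal to 2.0, which suggests that missing stations were filled with 0. Under that assumption (it is not confirmed by the article), the test `> -1` is true for every row, and `> 0` may be closer to the intent. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical station ids; 0.0 is assumed to be the fill value for "no station".
stations = pd.Series([40.0, 58.0, 0.0], name="地铁站点")

flag_as_written = (stations > -1).astype(int)   # condition used in the notebook
flag_alternative = (stations > 0).astype(int)   # treats the fill value as "no subway"

print(flag_as_written.tolist(), flag_alternative.tolist())
```

If 0 can also be a genuine station id in the raw data, a separate missing-value indicator built before filling would be the safer choice.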

In [7]:

train.head()

Out[7]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	log_rent	新朝向	房+卫+厅	房/总	卫/总	厅/总	卧室面积	楼层比	户型	有地铁
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	1.887481	东南	3	0.250000	0.250000	0.250000	0.004314	1.617647	111	1
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	2.889145	东	1	0.500000	0.000000	0.000000	0.008523	0.723684	100	1
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	2.302415	东南	5	0.333333	0.333333	0.166667	0.003531	0.000000	212	1
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	1.887481	南	7	0.375000	0.250000	0.250000	0.004800	1.264368	322	1
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	2.116317	东北	4	0.400000	0.200000	0.200000	0.003476	0.000000	211	1

5 rows × 29 columns

In [8]:

train.columns

Out[8]:

Index(['时间', '小区名', '小区房屋出租数量', '楼层', '总楼层', '房屋面积', '房屋朝向', '居住状态', '卧室数量',
       '厅的数量', '卫的数量', '出租方式', '区', '位置', '地铁线路', '地铁站点', '距离', '装修情况', '月租金',
       'log_rent', '新朝向', '房+卫+厅', '房/总', '卫/总', '厅/总', '卧室面积', '楼层比', '户型',
       '有地铁'],
      dtype='object')

Building subway-line-count features

In [9]:

lines_count1=train[['小区名','地铁线路']].drop_duplicates().groupby('小区名').count()

lines_count2=train[['位置','地铁线路']].drop_duplicates().groupby('位置').count()

lines_count2.columns=['位置线路数']

lines_count1.columns=['小区线路数']

In [10]:

train=pd.merge(train,lines_count1,how='left',on=['小区名'])

train=pd.merge(train,lines_count2,how='left',on=['位置'])
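Counting the distinct subway lines per housing estate can also be done in one step with `groupby(...).transform("nunique")`, which avoids the intermediate frames and the merge. This is a sketch on a made-up toy frame, not the author's code:

```python
import pandas as pd

# Hypothetical toy data: estate 1 is served by lines 2 and 4, estate 2 only by line 5.
toy = pd.DataFrame({
    "小区名": [1, 1, 1, 2, 2],
    "地铁线路": [2.0, 2.0, 4.0, 5.0, 5.0],
})

# Number of distinct line values seen for each estate, broadcast back to every row.
toy["小区线路数"] = toy.groupby("小区名")["地铁线路"].transform("nunique")

print(toy["小区线路数"].tolist())
```

Either way, a fill value such as 0 for "no line" is counted as one distinct line, so the feature is never 0.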

In [11]:

train.head()

Out[11]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	房+卫+厅	房/总	卫/总	厅/总	卧室面积	楼层比	户型	有地铁	小区线路数	位置线路数
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	3	0.250000	0.250000	0.250000	0.004314	1.617647	111	1	2	4
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	1	0.500000	0.000000	0.000000	0.008523	0.723684	100	1	1	5
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	5	0.333333	0.333333	0.166667	0.003531	0.000000	212	1	1	3
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	7	0.375000	0.250000	0.250000	0.004800	1.264368	322	1	1	2
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	4	0.400000	0.200000	0.200000	0.003476	0.000000	211	1	1	3

5 rows × 31 columns

Grouping housing estates with few listings (they are mapped to -1)

In [12]:

neighbors=train['小区名'].value_counts()

neighbors.head()

Out[12]:

5512    1406
1085     917
5208     847
6221     815
1532     775
Name: 小区名, dtype: int64

In [13]:

train['新小区名']=train.apply(lambda x: x['小区名'] if neighbors[x['小区名']]>100 else -1,axis=1)

train['小区条数大于100']=train.apply(lambda x: 1 if neighbors[x['小区名']]>100 else 0,axis=1)
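The two row-wise `apply` calls can be replaced by a vectorized lookup of each row's estate frequency. The sketch below uses a made-up series and a hypothetical threshold of 2 instead of the article's 100:

```python
import pandas as pd

# Hypothetical estate ids: 7 appears three times, 8 once, 9 twice.
estates = pd.Series([7, 7, 7, 8, 9, 9], name="小区名")
min_count = 2  # illustrative threshold; the notebook uses 100

# Look up how often each row's estate occurs in the whole column.
counts = estates.map(estates.value_counts())
is_frequent = counts > min_count

new_estate = estates.where(is_frequent, -1)   # rare estates collapse to -1
frequent_flag = is_frequent.astype(int)

print(new_estate.tolist(), frequent_flag.tolist())
```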

In [14]:

train.head()

Out[14]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	卫/总	厅/总	卧室面积	楼层比	户型	有地铁	小区线路数	位置线路数	新小区名	小区条数大于100
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	0.250000	0.250000	0.004314	1.617647	111	1	2	4	3072	1
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	0.000000	0.000000	0.008523	0.723684	100	1	1	5	-1	0
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	0.333333	0.166667	0.003531	0.000000	212	1	1	3	-1	0
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	0.250000	0.250000	0.004800	1.264368	322	1	1	2	3103	1
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	0.200000	0.200000	0.003476	0.000000	211	1	1	3	5182	1

5 rows × 33 columns

Converting types

In [15]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150517
Data columns (total 33 columns):
时间           150518 non-null int64
小区名          150518 non-null int64
小区房屋出租数量     150518 non-null float64
楼层           150518 non-null int64
总楼层          150518 non-null float64
房屋面积         150518 non-null float64
房屋朝向         150518 non-null object
居住状态         150518 non-null float64
卧室数量         150518 non-null int64
厅的数量         150518 non-null int64
卫的数量         150518 non-null int64
出租方式         150518 non-null float64
区            150518 non-null float64
位置           150518 non-null float64
地铁线路         150518 non-null float64
地铁站点         150518 non-null float64
距离           150518 non-null float64
装修情况         150518 non-null float64
月租金          150518 non-null float64
log_rent     150518 non-null float64
新朝向          150518 non-null object
房+卫+厅        150518 non-null int64
房/总          150518 non-null float64
卫/总          150518 non-null float64
厅/总          150518 non-null float64
卧室面积         150518 non-null float64
楼层比          150518 non-null float64
户型           150518 non-null object
有地铁          150518 non-null int64
小区线路数        150518 non-null int64
位置线路数        150518 non-null int64
新小区名         150518 non-null int64
小区条数大于100    150518 non-null int64
dtypes: float64(18), int64(12), object(3)
memory usage: 39.0+ MB

In [16]:

# Convert the discrete features to string type

columns = ['时间', '小区名', '居住状态', '出租方式', '区','位置','地铁线路','地铁站点','装修情况']

for col in columns:

    train[col] = train[col].astype(str)

In [17]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150517
Data columns (total 33 columns):
时间           150518 non-null object
小区名          150518 non-null object
小区房屋出租数量     150518 non-null float64
楼层           150518 non-null int64
总楼层          150518 non-null float64
房屋面积         150518 non-null float64
房屋朝向         150518 non-null object
居住状态         150518 non-null object
卧室数量         150518 non-null int64
厅的数量         150518 non-null int64
卫的数量         150518 non-null int64
出租方式         150518 non-null object
区            150518 non-null object
位置           150518 non-null object
地铁线路         150518 non-null object
地铁站点         150518 non-null object
距离           150518 non-null float64
装修情况         150518 non-null object
月租金          150518 non-null float64
log_rent     150518 non-null float64
新朝向          150518 non-null object
房+卫+厅        150518 non-null int64
房/总          150518 non-null float64
卫/总          150518 non-null float64
厅/总          150518 non-null float64
卧室面积         150518 non-null float64
楼层比          150518 non-null float64
户型           150518 non-null object
有地铁          150518 non-null int64
小区线路数        150518 non-null int64
位置线路数        150518 non-null int64
新小区名         150518 non-null int64
小区条数大于100    150518 non-null int64
dtypes: float64(11), int64(10), object(12)
memory usage: 39.0+ MB

Saving the processed data

In [18]:

# Save the processed data

train.to_csv("./data/onehot_feature.csv")
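Writing the frame without `index=False` stores the row index as an extra unnamed column, which is where the `Unnamed: 0` column in the later `data.info()` output comes from. A small in-memory round trip (toy data, assumed for illustration) shows the difference:

```python
import io

import pandas as pd

frame = pd.DataFrame({"a": [1, 2]})

# Default: the index is written as a first, unnamed column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns), list(without_index.columns))
```

Since `Unnamed: 0` is not listed in `x_columns`, it does not affect the model; dropping it just keeps the files cleaner.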

In [20]:

train.head()

Out[20]:

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	厅的数量	...	卫/总	厅/总	卧室面积	楼层比	户型	有地铁	小区线路数	位置线路数	新小区名	小区条数大于100
0	1	3072	0.128906	2	0.236364	0.008628	东南	1	1	...	0.250000	0.250000	0.004314	1.617647	111	1	2	4	3072	1
1	1	3152	0.132812	1	0.381818	0.017046	东	1	0	...	0.000000	0.000000	0.008523	0.723684	100	1	1	5	-1	0
2	1	5575	0.042969	0	0.290909	0.010593	东南	2	1	...	0.333333	0.166667	0.003531	0.000000	212	1	1	3	-1	0
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	0.250000	0.250000	0.004800	1.264368	322	1	1	2	3103	1
4	1	5182	0.214844	0	0.545455	0.010427	东北	2	1	...	0.200000	0.200000	0.003476	0.000000	211	1	1	3	5182	1

5 rows × 33 columns

Initial modeling

In [144]:

import pandas as pd

import numpy as np

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.metrics import mean_squared_error

In [145]:

# Fit a first model on the data prepared so far, to check that the data stage is OK

Data processing

In [146]:

data=pd.read_csv("data/onehot_feature.csv")

data_test = pd.read_csv("./data/onehot_feature_test.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 34 columns):
Unnamed: 0    150518 non-null int64
时间            150518 non-null int64
小区名           150518 non-null int64
小区房屋出租数量      150518 non-null float64
楼层            150518 non-null int64
总楼层           150518 non-null float64
房屋面积          150518 non-null float64
房屋朝向          150518 non-null object
居住状态          150518 non-null float64
卧室数量          150518 non-null int64
厅的数量          150518 non-null int64
卫的数量          150518 non-null int64
出租方式          150518 non-null float64
区             150518 non-null float64
位置            150518 non-null float64
地铁线路          150518 non-null float64
地铁站点          150518 non-null float64
距离            150518 non-null float64
装修情况          150518 non-null float64
月租金           150518 non-null float64
log_rent      150518 non-null float64
新朝向           150518 non-null object
房+卫+厅         150518 non-null int64
房/总           150518 non-null float64
卫/总           150518 non-null float64
厅/总           150518 non-null float64
卧室面积          150518 non-null float64
楼层比           150518 non-null float64
户型            150518 non-null int64
有地铁           150518 non-null int64
小区线路数         150518 non-null int64
位置线路数         150518 non-null int64
新小区名          150518 non-null int64
小区条数大于100     150518 non-null int64
dtypes: float64(18), int64(14), object(2)
memory usage: 39.0+ MB

In [147]:

data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46000 entries, 0 to 45999
Data columns (total 33 columns):
Unnamed: 0    46000 non-null int64
id            46000 non-null int64
时间            46000 non-null int64
小区名           46000 non-null int64
小区房屋出租数量      46000 non-null float64
楼层            46000 non-null int64
总楼层           46000 non-null float64
房屋面积          46000 non-null float64
房屋朝向          46000 non-null object
居住状态          46000 non-null float64
卧室数量          46000 non-null int64
厅的数量          46000 non-null int64
卫的数量          46000 non-null int64
出租方式          46000 non-null float64
区             46000 non-null float64
位置            46000 non-null float64
地铁线路          46000 non-null float64
地铁站点          46000 non-null float64
距离            46000 non-null float64
装修情况          46000 non-null float64
新朝向           46000 non-null object
房+卫+厅         46000 non-null int64
房/总           46000 non-null float64
卫/总           46000 non-null float64
厅/总           46000 non-null float64
卧室面积          46000 non-null float64
楼层比           46000 non-null float64
户型            46000 non-null int64
有地铁           46000 non-null int64
小区线路数         46000 non-null int64
位置线路数         46000 non-null int64
新小区名          46000 non-null int64
小区条数大于100     46000 non-null int64
dtypes: float64(16), int64(15), object(2)
memory usage: 11.6+ MB

In [148]:

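# Convert the discrete features to string type (in both frames, so DictVectorizer encodes them the same way)

columns = ['时间', '新小区名', '居住状态', '出租方式', '区',

           '位置', '地铁线路', '地铁站点', '装修情况', '户型']

for col in columns:

    data[col] = data[col].astype(str)

    data_test[col] = data_test[col].astype(str)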

In [149]:

np.any(data_test.isna())

# np.any(data.isna())

Out[149]:

Unnamed: 0    False
id            False
时间            False
小区名           False
小区房屋出租数量      False
楼层            False
总楼层           False
房屋面积          False
房屋朝向          False
居住状态          False
卧室数量          False
厅的数量          False
卫的数量          False
出租方式          False
区             False
位置            False
地铁线路          False
地铁站点          False
距离            False
装修情况          False
新朝向           False
房+卫+厅         False
房/总           False
卫/总           False
厅/总           False
卧室面积          False
楼层比           False
户型            False
有地铁           False
小区线路数         False
位置线路数         False
新小区名          False
小区条数大于100     False
dtype: bool

Choosing the features and the target

In [150]:

x_columns=['小区房屋出租数量','新小区名', '楼层', '总楼层', '房屋面积','居住状态', '卧室数量',

       '卫的数量',  '位置',  '地铁站点', '距离', '装修情况',

       '新朝向', '房+卫+厅', '房/总', '卫/总', '厅/总', '卧室面积', '楼层比', '户型','有地铁','小区线路数','位置线路数','小区条数大于100',]

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

X_TEST = data_test[x_columns]

Splitting the dataset

In [151]:

train_x, test_x, train_y, test_y = train_test_split(

    x, y, test_size=0.25, random_state=12)

Feature engineering

In [152]:

# 1. Feature conversion (string-valued columns are one-hot encoded)

vector = DictVectorizer(sparse=True)

x_train = vector.fit_transform(train_x.to_dict(orient='records'))

x_test = vector.transform(test_x.to_dict(orient='records'))

X_TEST = vector.transform(X_TEST.to_dict(orient="records"))
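`DictVectorizer` one-hot encodes string values as `name=value` features and passes numeric values through under the plain feature name. Any feature name not seen during `fit` is silently ignored at `transform` time, so a categorical column left numeric in the submission data contributes nothing. The sketch below demonstrates this on two made-up records (the values are assumptions):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training records: 位置 is a string (categorical), 楼层 is numeric.
train_records = [{"位置": "118.0", "楼层": 2}, {"位置": "100.0", "楼层": 1}]

vec = DictVectorizer(sparse=False)
vec.fit(train_records)
print(vec.feature_names_)

# Same record, once with 位置 as a string and once left as a float.
as_string = vec.transform([{"位置": "118.0", "楼层": 2}]).tolist()
as_number = vec.transform([{"位置": 118.0, "楼层": 2}]).tolist()
print(as_string, as_number)
```

In the numeric case the location columns stay at zero, which is why the string conversion needs to be applied to `data_test` as well as to `data`.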

In [153]:

print(x_train.shape, x_test.shape, X_TEST.shape)

(112888, 826) (37630, 826) (46000, 826)

In [155]:

# 2. Dimensionality reduction

pca=PCA(0.98)

pca_x_train=pca.fit_transform(x_train.toarray())

pca_x_test=pca.transform(x_test.toarray())

PCA_X_TEST = pca.transform(X_TEST.toarray())
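Passing a float between 0 and 1 to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. The following sketch uses synthetic data (an assumption made only for illustration) to show how the resulting component count can be inspected:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 observed columns driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

pca = PCA(n_components=0.98)        # keep enough components for 98% of the variance
reduced = pca.fit_transform(X)

print(reduced.shape, pca.n_components_, pca.explained_variance_ratio_.sum())
```

In the notebook the same mechanism reduces the 826 encoded columns to 361 components.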

In [156]:

print(pca_x_train.shape, pca_x_test.shape, PCA_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

In [157]:

# 3. Feature standardization

trans = StandardScaler()

new_x_train = trans.fit_transform(pca_x_train)

new_x_test = trans.transform(pca_x_test)

NEW_X_TEST = trans.transform(PCA_X_TEST)

In [158]:

print(new_x_train.shape, new_x_test.shape, NEW_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

Defining the evaluation function

In [159]:

def rmse(y_true, y_pred):

    y_pred = np.exp(y_pred)-1  # convert back to the actual rent

    y_true = np.exp(y_true)-1

    return np.sqrt(mean_squared_error(y_true, y_pred))
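The `exp(...) - 1` step implies that the target was built as `log(1 + rent)`; the sample rows agree with this (a rent of 5.602716 gives a `log_rent` of 1.887481). NumPy offers `log1p` and `expm1` for exactly this pair of transforms. The helper name below is hypothetical, and the sketch only restates the metric in those terms:

```python
import numpy as np

# Two (rent, log_rent) pairs taken from the sample rows shown earlier.
rent = np.array([5.602716, 16.977929])
log_rent = np.log1p(rent)            # log(1 + rent)


def rmse_in_rent_units(y_true_log, y_pred_log):
    """RMSE computed after mapping log1p-scaled values back to rent (hypothetical helper)."""
    y_true = np.expm1(np.asarray(y_true_log))
    y_pred = np.expm1(np.asarray(y_pred_log))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print(log_rent)
print(rmse_in_rent_units(np.log1p([1.0, 3.0]), np.log1p([2.0, 3.0])))
```

Models are fitted on the log scale, while the reported RMSE is always in the original rent units.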

Model training

Building the ridge regression model

In [160]:

%%time

# 1. Grid-search the best value of alpha

ridge = Ridge()

params = {

    "alpha": [0.005, 0.01, 1, 5, 10, 20, 50]

}

model1 = GridSearchCV(ridge, param_grid=params, cv=5, n_jobs=-1)

model1.fit(new_x_train, train_y)

model1.best_params_

#{'alpha': 50, 'fit_intercept': True}

CPU times: user 1.54 s, sys: 781 ms, total: 2.32 s
Wall time: 17.5 s
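By default `GridSearchCV` ranks ridge candidates by R² on the log target. A possible variation, sketched here on synthetic data (not the rent data), puts the scaler and the model in one pipeline so that scaling is re-fitted inside every fold, and scores candidates by RMSE instead:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, assumed only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(StandardScaler(), Ridge())
# make_pipeline names the step after the lowercased class name, hence "ridge__alpha".
param_grid = {"ridge__alpha": [0.005, 0.01, 1, 5, 10, 20, 50]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

The score is negated because scikit-learn always maximizes; `-search.best_score_` is the cross-validated RMSE on the scale of `y`.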

In [161]:

# Build the model with the best parameters found by the search

ridge = Ridge(alpha=50)

ridge.fit(new_x_train, train_y)

Out[161]:

Ridge(alpha=50, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [162]:

y_pred_test=ridge.predict(new_x_test)

y_pred_train=ridge.predict(new_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 4.096368900367207
Test-set RMSE: 4.198922171577452

Saving the model

In [163]:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(ridge, "./data/Ridge.kpl")  # dump the fitted instance, not the Ridge class

Out[163]:

['./data/Ridge.kpl']
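A quick way to confirm that the saved file contains the fitted model (and not the class) is to load it back and compare predictions. This sketch uses synthetic data and a temporary file path, both chosen only for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data standing in for the notebook's feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0])

model = Ridge(alpha=50).fit(X, y)

# Hypothetical location; the notebook writes to ./data instead.
path = os.path.join(tempfile.mkdtemp(), "ridge.pkl")
joblib.dump(model, path)            # persist the fitted instance
restored = joblib.load(path)

print(np.allclose(restored.predict(X), model.predict(X)))
```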

Writing the submission output

In [164]:

Y_PRED_TEST = ridge.predict(NEW_X_TEST)

Y_PRED_TEST = np.exp(Y_PRED_TEST)-1

In [165]:

data = range(1, len(Y_PRED_TEST)+1)

In [166]:

Y_PRED = pd.DataFrame(data=Y_PRED_TEST, columns=["月租金"])

In [167]:

Y_PRED["id"] = range(1, Y_PRED.shape[0]+1)

In [168]:

Y_PRED.head()

Out[168]:

	月租金	id
0	5.182775	1
1	4.600273	2
2	8.306692	3
3	7.178559	4
4	5.187525	5

In [171]:

Y_PRED.shape

Out[171]:

(46000, 2)

In [172]:

Y_PRED.to_csv("./data/Y_PRED_RIDGE.csv")
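The notebook numbers the predictions with `range(1, n+1)`, which matches the test file only while its ids are exactly 1..n in order. A slightly more defensive variant, sketched with made-up ids and predictions and an assumed two-column submission format, carries the ids over from the test frame and drops the pandas index:

```python
import numpy as np
import pandas as pd

# Hypothetical ids and log-scale predictions standing in for data_test["id"] and the model output.
test_ids = pd.Series([101, 102, 103], name="id")
pred_log = np.log1p([5.0, 7.5, 9.0])

submission = pd.DataFrame({
    "id": test_ids.to_numpy(),
    "月租金": np.expm1(pred_log),     # back to rent units
})

csv_text = submission.to_csv(index=False)
print(csv_text)
```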

Model ensembling

In [1]:

from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

from sklearn.svm import LinearSVR, SVR

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor

from sklearn.decomposition import PCA

import pandas as pd

import numpy as np

from sklearn.metrics import mean_squared_error

In [2]:

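# The fusion step does not use bagging or boosting

# Stacking: several different models make predictions first; their predicted values are then used as features to train a new model

Loading the data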

In [3]:

data=pd.read_csv("data/onehot_feature.csv")

data_test = pd.read_csv("./data/onehot_feature_test.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 34 columns):
Unnamed: 0    150518 non-null int64
时间            150518 non-null int64
小区名           150518 non-null int64
小区房屋出租数量      150518 non-null float64
楼层            150518 non-null int64
总楼层           150518 non-null float64
房屋面积          150518 non-null float64
房屋朝向          150518 non-null object
居住状态          150518 non-null float64
卧室数量          150518 non-null int64
厅的数量          150518 non-null int64
卫的数量          150518 non-null int64
出租方式          150518 non-null float64
区             150518 non-null float64
位置            150518 non-null float64
地铁线路          150518 non-null float64
地铁站点          150518 non-null float64
距离            150518 non-null float64
装修情况          150518 non-null float64
月租金           150518 non-null float64
log_rent      150518 non-null float64
新朝向           150518 non-null object
房+卫+厅         150518 non-null int64
房/总           150518 non-null float64
卫/总           150518 non-null float64
厅/总           150518 non-null float64
卧室面积          150518 non-null float64
楼层比           150518 non-null float64
户型            150518 non-null int64
有地铁           150518 non-null int64
小区线路数         150518 non-null int64
位置线路数         150518 non-null int64
新小区名          150518 non-null int64
小区条数大于100     150518 non-null int64
dtypes: float64(18), int64(14), object(2)
memory usage: 39.0+ MB

In [4]:

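# Convert the discrete features to string type (in both frames, so DictVectorizer encodes them the same way)

columns = ['时间', '新小区名', '居住状态', '出租方式', '区',

           '位置', '地铁线路', '地铁站点', '装修情况', '户型']

for col in columns:

    data[col] = data[col].astype(str)

    data_test[col] = data_test[col].astype(str)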

In [5]:

x_columns=['小区房屋出租数量','新小区名', '楼层', '总楼层', '房屋面积','居住状态', '卧室数量',

       '卫的数量',  '位置',  '地铁站点', '距离', '装修情况',

       '新朝向', '房+卫+厅', '房/总', '卫/总', '厅/总', '卧室面积', '楼层比', '户型','有地铁','小区线路数','位置线路数','小区条数大于100',]

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

X_TEST = data_test[x_columns]

In [6]:

# Split the dataset

train_x, test_x, train_y, test_y = train_test_split(

    x, y, test_size=0.25, random_state=12)

In [7]:

# 1. Feature conversion (string-valued columns are one-hot encoded)

vector = DictVectorizer(sparse=True)

x_train = vector.fit_transform(train_x.to_dict(orient='records'))

x_test = vector.transform(test_x.to_dict(orient='records'))

X_TEST = vector.transform(X_TEST.to_dict(orient="records"))

In [8]:

print(x_train.shape, x_test.shape, X_TEST.shape)

(112888, 826) (37630, 826) (46000, 826)

In [9]:

# 2. Dimensionality reduction

pca=PCA(0.98)

pca_x_train=pca.fit_transform(x_train.toarray())

pca_x_test=pca.transform(x_test.toarray())

PCA_X_TEST = pca.transform(X_TEST.toarray())

In [10]:

print(pca_x_train.shape, pca_x_test.shape, PCA_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

In [68]:

def rmse(y_true,y_pred):

    y_pred=np.exp(y_pred)-1  # convert back to the actual rent

    y_true=np.exp(y_true)-1

    return np.sqrt(mean_squared_error(y_true,y_pred))

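Building the sub-models

Building the ridge regression model

In [69]:

%%time

# 1. Grid-search the best value of alpha

ridge = Ridge(normalize=True)  # note: the normalize argument was removed in scikit-learn 1.2

params = {

    "alpha": [0.005, 0.01, 1, 5, 10, 20, 50]

}

model1 = GridSearchCV(ridge, param_grid=params, cv=5, n_jobs=-1)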

model1.fit(pca_x_train, train_y)

model1.best_params_

#{'alpha': 50, 'fit_intercept': True}

CPU times: user 1.78 s, sys: 705 ms, total: 2.48 s
Wall time: 21.5 s

In [70]:

# Build the model with the best parameters found by the search

ridge = Ridge(alpha=50, normalize=True)

ridge.fit(pca_x_train, train_y)

Out[70]:

Ridge(alpha=50, copy_X=True, fit_intercept=True, max_iter=None, normalize=True,
      random_state=None, solver='auto', tol=0.001)

In [71]:

y_pred_test=ridge.predict(pca_x_test)

y_pred_train=ridge.predict(pca_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 6.342657781238426
Test-set RMSE: 6.493947602276618

Building the lasso regression model

In [72]:

%%time

# 1. Parameter search

lasso = Lasso(normalize=True)

params = {

    "alpha": [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],

    "fit_intercept": [True, False]

}

model2 = GridSearchCV(lasso, param_grid=params, cv=5, n_jobs=-1)

model2.fit(pca_x_train, train_y)

print(model2.best_params_)

#{'alpha': 0.001, 'fit_intercept': True}

{'alpha': 0.001, 'fit_intercept': True}
CPU times: user 1.68 s, sys: 551 ms, total: 2.23 s
Wall time: 49.6 s

In [73]:

# Build the model with the best parameters found by the search

lasso=Lasso(alpha=0.001, normalize=True)

lasso.fit(pca_x_train,train_y)

Out[73]:

Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=True, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [74]:

%%time

y_pred_test=lasso.predict(pca_x_test)

y_pred_train=lasso.predict(pca_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 6.385065714494761
Test-set RMSE: 6.53676743372339
CPU times: user 393 ms, sys: 47.4 ms, total: 440 ms
Wall time: 87.1 ms

Building the random forest

In [75]:

%%time

# 1. Parameter search

rf = RandomForestRegressor(max_features='sqrt')  # max_features='sqrt' keeps the search from taking too long

params = {

    "n_estimators": [200],  # [200,500,700],

    "max_depth": [50],  # [40, 50, 60]

    "min_samples_split": [20, 50, 100],

    "min_samples_leaf": [10, 20, 30]

}

model3 = GridSearchCV(rf, param_grid=params, cv=5, n_jobs=-1, verbose=2)

model3.fit(pca_x_train, train_y)

print(model3.best_params_)

# {'max_depth': 50,

#  'min_samples_leaf': 10,

#  'min_samples_split': 20,

#  'n_estimators': 200}

Fitting 5 folds for each of 9 candidates, totalling 45 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 55.7min
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed: 81.1min finished

{'max_depth': 50, 'min_samples_leaf': 10, 'min_samples_split': 20, 'n_estimators': 200}
CPU times: user 10min 4s, sys: 8.96 s, total: 10min 13s
Wall time: 1h 31min 30s

In [76]:

%%time

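# Build the model with the best parameters from the search (note: max_features=0.8 here, not the 'sqrt' used during the search)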

rf=RandomForestRegressor(n_estimators=200,

                         max_features=0.8,

                         max_depth=50,

                         min_samples_split=20,

                         min_samples_leaf=10,

                         n_jobs=-1)

rf.fit(pca_x_train,train_y)

CPU times: user 3h 34min 3s, sys: 1min 29s, total: 3h 35min 32s
Wall time: 33min 4s
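Retyping the winning parameters by hand makes it easy for the final model to drift from what was actually searched. One alternative is to reuse the estimator that `GridSearchCV` refits on the full training data (`refit=True` is the default). The sketch below shows this on synthetic data with a small decision tree, chosen only to keep the example fast:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [2, 5, 10]},
    cv=3,
)
search.fit(X, y)

# Already fitted on all of X, y with the best parameter combination.
best_tree = search.best_estimator_
print(search.best_params_, best_tree.predict(X[:3]))
```

When the final model deliberately uses different settings than the search, as with `max_features` above, the gap between the two is worth stating next to the reported scores.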

In [77]:

%%time

y_pred_test=rf.predict(pca_x_test)

y_pred_train=rf.predict(pca_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 2.133144119124377
Test-set RMSE: 2.7950254213867094
CPU times: user 24.4 s, sys: 465 ms, total: 24.9 s
Wall time: 4.53 s

Building the decision tree

In [78]:

%%time

tree=DecisionTreeRegressor()

params={

    "max_depth":[60],  # [40,50,60,70],

    "min_samples_split":[5],  # [5,10,20,30,40,50]

    "min_samples_leaf":[5], # [2,3,5,7,9,11]

}

model4=GridSearchCV(tree,param_grid=params,cv=5,n_jobs=-1)

model4.fit(pca_x_train,train_y)

print(model4.best_params_)

# {'max_depth': 60, 'min_samples_leaf': 2, 'min_samples_split': 5}

{'max_depth': 60, 'min_samples_leaf': 5, 'min_samples_split': 5}
CPU times: user 1min 34s, sys: 2.06 s, total: 1min 36s
Wall time: 3min 26s

In [79]:

%%time

from sklearn.tree import DecisionTreeRegressor

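# Build the model with the best parameters from the search (min_samples_leaf=2 comes from the earlier, wider search recorded in the comment above)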

tree=DecisionTreeRegressor(max_depth=60,min_samples_leaf=2,min_samples_split=5)

tree.fit(pca_x_train,train_y)

CPU times: user 1min 36s, sys: 1.48 s, total: 1min 38s
Wall time: 1min 40s

In [80]:

%%time

y_pred_test=tree.predict(pca_x_test)

y_pred_train=tree.predict(pca_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 0.805142479875888
Test-set RMSE: 2.6702036461919856
CPU times: user 254 ms, sys: 123 ms, total: 377 ms
Wall time: 380 ms

In [81]:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("True value")

plt.ylabel("Predicted value")

plt.show()

Building the support vector machine

In [ ]:

# %%time

# # 1. Parameter search ---- with this much data SVM is too slow, so tuning is practically impossible

# svr=SVR()

# params={

#     "gamma":[0.001,0.01,0.1,0.5,1,5],

#     "C":[0.001,0.1,0.5,1,5]

# }

# model5=GridSearchCV(svr,param_grid=params,cv=5,n_jobs=-1,verbose=10)

# # verbose: log verbosity (int); 0 prints nothing during training, 1 prints occasionally, >1 prints for every sub-model

# model5.fit(pca_x_train,train_y)

# model5.best_params_

In [ ]:

# %%time

# # Pick an arbitrary set of parameters --- takes too long, so this model is dropped

# svr=SVR(gamma=0.1,C=0.5)

# svr.fit(pca_x_train,train_y)

# y_pred=svr.predict(pca_x_test)

# print(rmse(test_y,y_pred))

Building the XGBoost model

In [82]:

%%time

import xgboost as xgb

xgbr = xgb.XGBRegressor(objective='reg:linear', learning_rate=0.1, gamma=0.05, max_depth=45,

                 min_child_weight=0.5, subsample=0.6, reg_alpha=0.5, reg_lambda=0.8, colsample_bytree=0.5, n_jobs=-1)

xgbr.fit(pca_x_train, train_y)

y_pred = xgbr.predict(pca_x_test)

print(rmse(test_y,y_pred))

/Users/sherwin/anaconda3/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \

[12:23:28] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
2.1601162492127104
CPU times: user 28min 30s, sys: 24.2 s, total: 28min 54s
Wall time: 29min 29s

In [83]:

%%time

y_pred_test=xgbr.predict(pca_x_test)

y_pred_train=xgbr.predict(pca_x_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred_test))

Training-set RMSE: 0.9609658477710833
Test-set RMSE: 2.1601162492127104
CPU times: user 10 s, sys: 427 ms, total: 10.4 s
Wall time: 10.6 s

In [84]:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("True value")

plt.ylabel("Predicted value")

plt.show()

Stacking ensemble

Building the data needed by the stacking model

In [86]:

%%time

# Use each sub-model's predictions as features

# Training features

train_features=[]

train_features.append(ridge.predict(pca_x_train))  # keep each model's predictions

train_features.append(lasso.predict(pca_x_train))

# train_features.append(svr.predict(pca_x_train))  # far too slow, dropped

train_features.append(rf.predict(pca_x_train))

train_features.append(tree.predict(pca_x_train))

train_features.append(xgbr.predict(pca_x_train))

# Test features

test_features=[]

test_features.append(ridge.predict(pca_x_test))

test_features.append(lasso.predict(pca_x_test))

# test_features.append(svr.predict(pca_x_test))

test_features.append(rf.predict(pca_x_test))

test_features.append(tree.predict(pca_x_test))

test_features.append(xgbr.predict(pca_x_test))

# Submission features

TEST_FEATURES=[]

TEST_FEATURES.append(ridge.predict(PCA_X_TEST))

TEST_FEATURES.append(lasso.predict(PCA_X_TEST))

# TEST_FEATURES.append(svr.predict(PCA_X_TEST))

TEST_FEATURES.append(rf.predict(PCA_X_TEST))

TEST_FEATURES.append(tree.predict(PCA_X_TEST))

TEST_FEATURES.append(xgbr.predict(PCA_X_TEST))

CPU times: user 42.1 s, sys: 1.49 s, total: 43.6 s
Wall time: 20.3 s

In [87]:

train_features

Out[87]:

[array([2.04715431, 2.05232901, 2.04572967, ..., 2.04659472, 2.04508413,
        2.05562638]),
 array([2.05200758, 2.05200758, 2.05200758, ..., 2.05200758, 2.05200758,
        2.05200758]),
 array([1.67325566, 1.94499122, 1.85460452, ..., 1.92275812, 1.76267895,
        2.22438597]),
 array([1.59023952, 1.84714777, 1.85130219, ..., 1.96150612, 1.77317884,
        2.23207518]),
 array([1.6343094, 1.9145248, 1.8356705, ..., 1.9381661, 1.7626299,
        2.2465973], dtype=float32)]

In [88]:

test_features

Out[88]:

[array([2.04925512, 2.04865288, 2.04878586, ..., 2.07295592, 2.05666692,
        2.0560697 ]),
 array([2.05200758, 2.05200758, 2.05200758, ..., 2.05200758, 2.05200758,
        2.05200758]),
 array([1.93842148, 1.71689679, 1.71233925, ..., 3.7684956 , 2.1988801 ,
        2.15518207]),
 array([1.93762954, 1.71991266, 1.59023952, ..., 3.92681962, 2.1296814 ,
        2.08786427]),
 array([1.9394264, 1.6995616, 1.8815998, ..., 3.7348156, 2.2026072,
        2.1582646], dtype=float32)]

In [89]:

# np.vstack: stacks arrays vertically (row by row) into a new array

mx_train=np.vstack(train_features).T

mx_test=np.vstack(test_features).T

MX_TEST=np.vstack(TEST_FEATURES).T

MX_TEST.shape

Out[89]:

(46000, 5)

Training the stacking model

In [90]:

%%time

stack_model=Ridge(fit_intercept=False)

params={

    "alpha":np.logspace(-2,3,20)

}

model=GridSearchCV(stack_model,param_grid=params,cv=5,n_jobs=-1)

model.fit(mx_train,train_y)

print(model.best_params_)

{'alpha': 0.06158482110660264}
CPU times: user 580 ms, sys: 439 ms, total: 1.02 s
Wall time: 3.47 s

In [91]:

%%time

stack_model=Ridge(alpha=0.379269,fit_intercept=False)  # differs from the alpha printed by the search above (about 0.0616)

stack_model.fit(mx_train,train_y)

y_pred=stack_model.predict(mx_test)

y_pred_train=stack_model.predict(mx_train)

print("Training-set RMSE:",rmse(train_y,y_pred_train))

print("Test-set RMSE:",rmse(test_y,y_pred))

Training-set RMSE: 0.7337935133190991
Test-set RMSE: 2.3272631885188044
CPU times: user 30.8 ms, sys: 9.28 ms, total: 40.1 ms
Wall time: 13.2 ms
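The stacked model scores 2.33 on the hold-out split, worse than XGBoost alone (2.16), while its training RMSE is much lower. A likely reason is that the meta-model is trained on predictions the sub-models made for rows they had already seen, so it learns to trust the most overfitted ones. A common remedy is to build the training meta-features from out-of-fold predictions. The sketch below illustrates that idea with `cross_val_predict` on synthetic data and two small base models; it is an assumption-laden illustration, not the article's method:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for pca_x_train / pca_x_test.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

base_models = [
    Ridge(alpha=1.0),
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
]

# Training meta-features: each row is predicted by a model that did not see it.
meta_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models]
)

# Test meta-features: base models refitted on the full training split.
meta_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models]
)

meta_model = Ridge(alpha=0.1).fit(meta_train, y_train)
stacked_pred = meta_model.predict(meta_test)
print(meta_train.shape, meta_test.shape, stacked_pred[:3])
```

scikit-learn's `StackingRegressor` packages the same out-of-fold procedure behind a single estimator.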

In [92]:

stack_model.coef_

Out[92]:

array([-0.1330147 ,  0.13235901, -0.15773228,  0.6991465 ,  0.45928745])

Writing the submission output

In [96]:

Y_PRED_TEST = stack_model.predict(MX_TEST)

Y_PRED_TEST = np.exp(Y_PRED_TEST)-1

print(Y_PRED_TEST)

data = range(1, len(Y_PRED_TEST)+1)

Y_PRED = pd.DataFrame(data=Y_PRED_TEST, columns=["月租金"])

Y_PRED["id"] = range(1, Y_PRED.shape[0]+1)

Y_PRED.head()

[6.2493489  5.12626054 8.64297508 ... 3.59608672 1.05481017 4.8740706 ]

Out[96]:

	月租金	id
0	6.249349	1
1	5.126261	2
2	8.642975	3
3	8.885262	4
4	4.482541	5

In [97]:

Y_PRED.to_csv("./data/Y_PRED_STACK.csv")

Saving the model

In [98]:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(stack_model, "./data/stack_model.kpl")

Out[98]:

['./data/stack_model.kpl']
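To reuse a saved model later, it can be loaded back with joblib.load. The sketch below round-trips a small stand-in object through a temporary directory rather than the real stack_model, so the file name and the dictionary are placeholders.

```python
import os
import tempfile

import joblib

# Stand-in for a fitted model; any picklable Python object works the same way
stand_in_model = {"alpha": 0.379269, "coef": [0.1, 0.2, 0.7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.pkl")
    saved_files = joblib.dump(stand_in_model, path)  # returns the list of files written
    restored = joblib.load(path)

print(restored == stand_in_model)
```

A file saved this way is generally only safe to load with the same library versions that wrote it, so it helps to record the scikit-learn version next to the file.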

Running on the test set

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')  # suppress warnings

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.metrics import mean_squared_error

Loading the data

In [2]:

test=pd.read_csv("data/test.csv")

test.head()

Out[2]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况
0	1	3	3882	0.035156	1	0.436364	0.013075	东南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	3	6353	0.078125	1	0.436364	0.012248	东南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	4	3	1532	0.414062	1	0.600000	0.019695	东南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	5	3	1251	0.226562	1	0.381818	0.014730	东	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [3]:

space_threshold = 0.3

dist_value_for_fill = 2  # why 2: the maximum distance is 1, and having no subway means the house is far from one

line_value_for_fill = 0

station_value_for_fill = 0

state_value_for_fill = 0  # test["居住状态"].mode().values[0]

decration_value_for_fill = -1  # test["装修情况"].mode().values[0]

rent_value_for_fill = -1  # test["出租方式"].mode().values[0]

# Get the mode of 位置 within each 区

area_value_for_fill = test["区"].mode().values[0]

position_by_area = test.groupby('区').apply(lambda x: x["位置"].mode())

# print(position_by_area)

position_value_for_fill = position_by_area[position_by_area.index ==

                                           area_value_for_fill].values[0][0]

# print(position_value_for_fill)

# Get the mode of 小区房屋出租数量 within each 小区名

ratio_by_neighbor = test.groupby('小区名').apply(lambda x: x["小区房屋出租数量"].mode())

index = [x[0] for x in ratio_by_neighbor.index]

ratio_by_neighbor.index = index

ratio_by_neighbor = ratio_by_neighbor.to_dict()

ratio_mode = test["小区房屋出租数量"].mode().values[0]

In [4]:

test.shape

Out[4]:

(46000, 19)

Data cleaning

In [5]:

# Percentage of missing values per column

def ratio_of_null():

    test_missing = (test.isnull().sum()/len(test))*100

    test_missing = test_missing.drop(test_missing[test_missing==0].index).sort_values(ascending=False)

    return pd.DataFrame({'缺失百分比':test_missing})

ratio_of_null()

Out[5]:

	缺失百分比
装修情况	91.547826
居住状态	90.958696
出租方式	89.882609
距离	53.047826
地铁站点	53.047826
地铁线路	53.047826
小区房屋出租数量	0.071739
位置	0.030435
区	0.030435

In [6]:

test["小区名"].mode().values[0]

Out[6]:

In [7]:

test[test['小区名'] == 3269]

Out[7]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况
72	73	3	3269	0.093750	2	0.581818	0.008937	南	NaN	2	1	1	NaN	NaN	NaN	5.0	27.0	0.113333	NaN
372	373	3	3269	0.066406	0	0.545455	0.013100	西	1.0	2	1	1	1.0	NaN	NaN	5.0	72.0	0.614167	6.0
481	482	3	3269	0.148438	2	0.618182	0.024992	北	NaN	3	1	2	1.0	NaN	NaN	4.0	7.0	0.094167	NaN
1062	1063	3	3269	0.078125	1	0.272727	0.013903	南	NaN	2	2	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3550	3551	3	3269	0.070312	0	0.581818	0.014214	西南	NaN	2	2	1	NaN	NaN	NaN	4.0	15.0	0.578333	NaN
4344	4345	3	3269	0.039062	1	0.181818	0.020689	南	NaN	3	2	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4540	4541	3	3269	0.152344	0	0.527273	0.010427	东南	NaN	1	1	1	NaN	NaN	NaN	3.0	22.0	0.420833	NaN
5622	5623	3	3269	0.207031	0	0.527273	0.010758	东	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6479	6480	3	3269	0.167969	0	0.454545	0.014565	南	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14515	14516	3	3269	0.109375	0	0.545455	0.027143	东南	NaN	3	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23976	23977	3	3269	0.015625	1	0.109091	0.017440	东南	NaN	2	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27098	27099	3	3269	0.328125	0	0.309091	0.007458	东	NaN	1	1	1	NaN	NaN	NaN	1.0	77.0	0.850833	NaN
29168	29169	3	3269	0.035156	0	0.090909	0.002648	东	NaN	1	0	1	NaN	NaN	NaN	1.0	119.0	0.977500	NaN
41927	41928	3	3269	0.148438	1	0.581818	0.013903	东	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [8]:

test["位置"] = test["位置"].fillna(test["位置"].mode()[0])

test["区"] = test["区"].fillna(test["区"].mode()[0])

test["位置"].mode()

Out[8]:

0    52.0
dtype: float64

In [9]:

test.shape

# test[test["位置"].isna()]

test[test['小区名'] == 3269]

Out[9]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况
72	73	3	3269	0.093750	2	0.581818	0.008937	南	NaN	2	1	1	NaN	12.0	52.0	5.0	27.0	0.113333	NaN
372	373	3	3269	0.066406	0	0.545455	0.013100	西	1.0	2	1	1	1.0	12.0	52.0	5.0	72.0	0.614167	6.0
481	482	3	3269	0.148438	2	0.618182	0.024992	北	NaN	3	1	2	1.0	12.0	52.0	4.0	7.0	0.094167	NaN
1062	1063	3	3269	0.078125	1	0.272727	0.013903	南	NaN	2	2	2	NaN	12.0	52.0	NaN	NaN	NaN	NaN
3550	3551	3	3269	0.070312	0	0.581818	0.014214	西南	NaN	2	2	1	NaN	12.0	52.0	4.0	15.0	0.578333	NaN
4344	4345	3	3269	0.039062	1	0.181818	0.020689	南	NaN	3	2	2	NaN	12.0	52.0	NaN	NaN	NaN	NaN
4540	4541	3	3269	0.152344	0	0.527273	0.010427	东南	NaN	1	1	1	NaN	12.0	52.0	3.0	22.0	0.420833	NaN
5622	5623	3	3269	0.207031	0	0.527273	0.010758	东	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
6479	6480	3	3269	0.167969	0	0.454545	0.014565	南	NaN	2	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
14515	14516	3	3269	0.109375	0	0.545455	0.027143	东南	NaN	3	2	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
23976	23977	3	3269	0.015625	1	0.109091	0.017440	东南	NaN	2	2	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
27098	27099	3	3269	0.328125	0	0.309091	0.007458	东	NaN	1	1	1	NaN	12.0	52.0	1.0	77.0	0.850833	NaN
29168	29169	3	3269	0.035156	0	0.090909	0.002648	东	NaN	1	0	1	NaN	12.0	52.0	1.0	119.0	0.977500	NaN
41927	41928	3	3269	0.148438	1	0.581818	0.013903	东	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [222]:

ratio_of_null()

Out[222]:

	缺失百分比
装修情况	91.547826
居住状态	90.958696
出租方式	89.882609
距离	53.047826
地铁站点	53.047826
地铁线路	53.047826
小区房屋出租数量	0.071739

In [223]:

# Group by 小区名 and 位置, then take the mode of 地铁站点 and 距离 within each group

# (double brackets select several columns; x.mode().max() resolves ties column by column)

station_by_nb_pos = test[['小区名', '位置', '地铁站点', '距离']].drop_duplicates().dropna(

).groupby(['小区名', '位置'])[['地铁站点', '距离']].apply(lambda x: x.mode().max())

station_by_nb_pos.head()

station_by_nb = test[['小区名', '地铁站点', '距离']].drop_duplicates().dropna(

).groupby('小区名')[['地铁站点', '距离']].apply(lambda x: x.mode().max())

station_by_nb.head()

# Look up the line for each station (the smallest line number if a station appears on several)

lines_by_station = test[['地铁站点', '地铁线路']].drop_duplicates(

).dropna().groupby('地铁站点')['地铁线路'].min()

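def fill_stations(line, s_by_np, s_by_n, l_by_s):

    """

    Fill in the missing subway fields of one row.

    s_by_np: receives station_by_nb_pos

    s_by_n: receives station_by_nb

    l_by_s: receives lines_by_station

    """

    # First check whether 地铁站点 is missing in this row

    # Note: use pd.isna here; NumPy has no np.isnull

    if not pd.isna(line['地铁站点']):  # not missing: return the row unchanged

        return line

    # If the (小区名, 位置) pair is in the lookup index, use it to fill the row

    # (test against .index: `in` on a DataFrame checks the column labels)

    if (line['小区名'], line['位置']) in s_by_np.index:

        line['地铁站点'] = s_by_np.loc[(line['小区名'], line['位置']), '地铁站点']

        line['距离'] = s_by_np.loc[(line['小区名'], line['位置']), '距离']

        line['地铁线路'] = l_by_s[line['地铁站点']]

    elif line['小区名'] in s_by_n.index:

        line['地铁站点'] = s_by_n.loc[line['小区名'], '地铁站点']  # fill with the neighborhood-level mode

        line['距离'] = s_by_n.loc[line['小区名'], '距离']

        line['地铁线路'] = l_by_s[line['地铁站点']]

    else:  # neighborhood not found either: treat it as its own category, i.e. no subway

        line['地铁站点'] = 0

        line['距离'] = 2  # fill the distance with 2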

        line['地铁线路'] = 0

    return line

test = test.apply(fill_stations, s_by_np=station_by_nb_pos,

                    s_by_n=station_by_nb, l_by_s=lines_by_station, axis=1)

ratio_of_null()

Out[223]:

	缺失百分比
装修情况	91.547826
居住状态	90.958696
出租方式	89.882609
小区房屋出租数量	0.071739
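A detail worth checking in lookups like the ones in fill_stations: on a DataFrame, the `in` operator tests the column labels, whereas on an Index it tests the index labels. The toy frame below (invented values) has the same two-level index shape as station_by_nb_pos.

```python
import pandas as pd

# Toy lookup table indexed by (小区名, 位置); the values are made up
lookup = pd.DataFrame(
    {"地铁站点": [115.0, 27.0], "距离": [0.95, 0.11]},
    index=pd.MultiIndex.from_tuples([(1493, 60.0), (3269, 52.0)],
                                    names=["小区名", "位置"]),
)

key = (3269, 52.0)

in_frame = key in lookup          # checks the column labels
in_index = key in lookup.index    # checks the (小区名, 位置) pairs

print(in_frame, in_index)  # False True

station = lookup.loc[key, "地铁站点"]
print(station)  # 27.0
```

So a membership test written against the frame itself silently falls through to the next branch, even when the pair is present in the index.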

In [224]:

# Get the mode of 小区房屋出租数量 for each 小区名 (the mean of the modes when there is a tie)

ratio_by_neighbor = test[['小区名', '小区房屋出租数量']].dropna().groupby(

    '小区名').apply(lambda x: np.mean(x["小区房屋出租数量"].mode()))

ratio_by_neighbor.head()

# Get the overall mode of 小区房屋出租数量 across all neighborhoods

ratio_mode=test["小区房屋出租数量"].mode().values[0]

ratio_mode

def fill_by_key(x,k,v,values,mode):

    if not pd.isna(x[v]):

        return x

    else:

        if x[k] in values.index:

            x[v]=values[x[k]]

        else:

            x[v]=mode

        return x

# test['小区房屋出租数量']=test['小区房屋出租数量'].map()

test=test.apply(fill_by_key,k="小区名",v="小区房屋出租数量",values=ratio_by_neighbor,mode=ratio_mode,axis=1)

ratio_of_null()

Out[224]:

	缺失百分比
装修情况	91.547826
居住状态	90.958696
出租方式	89.882609

In [225]:

test["出租方式"]=test["出租方式"].fillna(int(-1))

test["装修情况"]=test["装修情况"].fillna(int(-1))

test["居住状态"]=test["居住状态"].fillna(int(0))

ratio_of_null()

Out[225]:

In [226]:

ratio_of_null()

Out[226]:

Feature engineering

In [227]:

test["房屋朝向"].head()

Out[227]:

0    东南
1    东南
2     南
3    东南
4     东
Name: 房屋朝向, dtype: object

In [228]:

def split(text,i):

    items=text.split(" ")

    if i<len(items):

        return items[i]

    else:

        return np.nan

test['新朝向']=test['房屋朝向'].map(lambda x:split(x,0))

In [229]:

test.shape

Out[229]:

(46000, 20)

In [230]:

test["房+卫+厅"] = test["卧室数量"]+test["厅的数量"]+test["卫的数量"]

test["房/总"] = test["卧室数量"]/(test["房+卫+厅"]+1)  # add 1 to the denominator so a zero count cannot give inf

test["卫/总"] = test["卫的数量"]/(test["房+卫+厅"]+1)

test["厅/总"] = test["厅的数量"]/(test["房+卫+厅"]+1)

test['卧室面积'] = test['房屋面积']/(test['卧室数量']+1)

test['楼层比'] = test['楼层']/(test["总楼层"]+1)

test['户型'] = test[['卧室数量', '厅的数量', '卫的数量']].apply(

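    lambda x: str(x['卧室数量'])+str(x['厅的数量'])+str(x['卫的数量']), axis=1

)

# Note: missing stations were filled with 0 during cleaning, so this comparison is True for every row

test["有地铁"]=(test["地铁站点"]>-1).map(int)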

lines_count1=test[['小区名','地铁线路']].drop_duplicates().groupby('小区名').count()

lines_count2=test[['位置','地铁线路']].drop_duplicates().groupby('位置').count()

lines_count2.columns=['位置线路数']

lines_count1.columns=['小区线路数']

test=pd.merge(test,lines_count1,how='left',on=['小区名'])

test=pd.merge(test,lines_count2,how='left',on=['位置'])

neighbors=test['小区名'].value_counts()

test['新小区名']=test.apply(lambda x: x['小区名'] if neighbors[x['小区名']]>100 else -1,axis=1)

test['小区条数大于100']=test.apply(lambda x: 1 if neighbors[x['小区名']]>100 else 0,axis=1)
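The last two features depend only on how often each 小区名 occurs, so they can also be computed without a row-wise apply. A small sketch on invented data, using a threshold of 2 instead of 100 so the toy frame stays short (the is_frequent column name is made up for the example):

```python
import pandas as pd

toy = pd.DataFrame({"小区名": [7, 7, 7, 8, 9, 9, 9]})
threshold = 2  # the notebook uses 100; lowered here for the toy data

# Number of rows per neighborhood, aligned back onto every row
counts = toy["小区名"].map(toy["小区名"].value_counts())
frequent = counts > threshold

toy["新小区名"] = toy["小区名"].where(frequent, -1)   # rare neighborhoods collapse to -1
toy["is_frequent"] = frequent.astype(int)

print(toy["新小区名"].tolist())     # [7, 7, 7, -1, 9, 9, 9]
print(toy["is_frequent"].tolist())  # [1, 1, 1, 0, 1, 1, 1]
```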

In [231]:

# Convert the discrete features to string type

columns = ['时间', '小区名', '居住状态', '出租方式', '区','位置','地铁线路','地铁站点','装修情况']

for col in columns:

    test[col] = test[col].astype(str)

In [232]:

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46000 entries, 0 to 45999
Data columns (total 32 columns):
id           46000 non-null int64
时间           46000 non-null object
小区名          46000 non-null object
小区房屋出租数量     46000 non-null float64
楼层           46000 non-null int64
总楼层          46000 non-null float64
房屋面积         46000 non-null float64
房屋朝向         46000 non-null object
居住状态         46000 non-null object
卧室数量         46000 non-null int64
厅的数量         46000 non-null int64
卫的数量         46000 non-null int64
出租方式         46000 non-null object
区            46000 non-null object
位置           46000 non-null object
地铁线路         46000 non-null object
地铁站点         46000 non-null object
距离           46000 non-null float64
装修情况         46000 non-null object
新朝向          46000 non-null object
房+卫+厅        46000 non-null int64
房/总          46000 non-null float64
卫/总          46000 non-null float64
厅/总          46000 non-null float64
卧室面积         46000 non-null float64
楼层比          46000 non-null float64
户型           46000 non-null object
有地铁          46000 non-null int64
小区线路数        46000 non-null int64
位置线路数        46000 non-null int64
新小区名         46000 non-null int64
小区条数大于100    46000 non-null int64
dtypes: float64(9), int64(11), object(12)
memory usage: 11.6+ MB

In [233]:

test.head()

Out[233]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	卧室数量	...	卫/总	厅/总	卧室面积	楼层比	户型	有地铁	小区线路数	位置线路数	新小区名	小区条数大于100
0	1	3	3882	0.035156	1	0.436364	0.013075	东南	3	...	0.166667	0.166667	0.003269	0.696203	311	1	1	2	-1	0
1	2	3	6353	0.078125	1	0.436364	0.012248	东南	3	...	0.166667	0.166667	0.003062	0.696203	311	1	1	2	-1	0
2	3	3	1493	0.203125	1	0.381818	0.023006	南	4	...	0.222222	0.222222	0.004601	0.723684	422	1	1	2	1493	1
3	4	3	1532	0.414062	1	0.600000	0.019695	东南	3	...	0.250000	0.250000	0.004924	0.625000	322	1	1	2	1532	1
4	5	3	1251	0.226562	1	0.381818	0.014730	东	3	...	0.166667	0.166667	0.003683	0.723684	311	1	1	5	1251	1

5 rows × 32 columns

In [234]:

test.shape

Out[234]:

(46000, 32)

Saving the data

In [237]:

# Save the processed data

test.to_csv("./data/onehot_feature_test.csv")

test_for_each_group

In [43]:

import numpy as np

import pandas as pd

from sklearn.metrics import mean_squared_error

Loading the data

Ground-truth test results

In [44]:

test_r = pd.read_csv("./data/test_result.csv")

In [45]:

test_r.head()

Out[45]:

	id	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3	3882	0.035156	1	0.436364	0.013075	东南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN	6.281834
1	2	3	6353	0.078125	1	0.436364	0.012248	东南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN	6.281834
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN	23.259762
3	4	3	1532	0.414062	1	0.600000	0.019695	东南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN	2.886248
4	5	3	1251	0.226562	1	0.381818	0.014730	东	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN	10.696095

Submissions from each group

In [46]:

students_res = pd.read_csv("./data/第五组_result_11.csv")

# students_res = pd.read_csv("./data/第四组_result_28.csv", encoding="gbk")

a = pd.read_csv("./data/Y_PRED_STACK.csv")

In [47]:

students_res.shape

Out[47]:

(46000, 2)

Evaluation

Evaluating a single model

In [48]:

def rmse(y_true,y_pred):

    return np.sqrt(mean_squared_error(y_true,y_pred))
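For reference, this helper computes RMSE = sqrt(mean((y_true - y_pred)^2)). A tiny worked example with made-up numbers, written with NumPy only, shows the arithmetic (rmse_manual is a name chosen for this sketch):

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error, written out with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

# Errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3; the root is about 1.291
demo_score = rmse_manual(y_true_demo, y_pred_demo)
print(demo_score)
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.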

In [49]:

y_true = test_r["月租金"]

In [50]:

# y_pred = students_res["月租金"]

y_pred = a["月租金"]

In [51]:

rmse(y_true, y_pred)

Out[51]:

6.363011257567193
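The comparison above relies on both files listing the houses in the same row order. If that is not guaranteed, the two tables can be aligned on id first. A hedged sketch with two tiny invented frames (the frame names and column suffixes are assumptions made for the example):

```python
import numpy as np
import pandas as pd

truth = pd.DataFrame({"id": [1, 2, 3], "月租金": [6.0, 5.0, 9.0]})
pred = pd.DataFrame({"id": [3, 1, 2], "月租金": [8.0, 6.5, 5.0]})  # shuffled row order

# Inner join on id keeps only the houses present in both tables
merged = truth.merge(pred, on="id", suffixes=("_true", "_pred"))

errors = merged["月租金_true"] - merged["月租金_pred"]
score = np.sqrt(np.mean(errors ** 2))

print(len(merged), score)
```

Comparing len(merged) with the length of the ground-truth table is a quick way to spot ids that are missing from a submission.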

Evaluating multiple models

In [31]:

for i in range(2, 15):

    path = "./data/第四组/第四组_result_{}.csv".format(i)  # named path to avoid shadowing the built-in str

    c_4 = pd.read_csv(path, encoding="gbk")

    y_pred = c_4["月租金"]

    ret = rmse(y_true, y_pred)

    print("RMSE for result file {}:".format(i), ret)

RMSE for result file 2: 2.0101892375103816
RMSE for result file 3: 1.9881682568747705
RMSE for result file 4: 2.217309210690951
RMSE for result file 5: 2.1021356120093677
RMSE for result file 6: 2.112276196225913
RMSE for result file 7: 2.006692666194838
RMSE for result file 8: 2.038233947555217
RMSE for result file 9: 2.065344244978377
RMSE for result file 10: 2.0763622914485294
RMSE for result file 11: 2.1008100828126306
RMSE for result file 12: 2.3086208012888645
RMSE for result file 13: 2.0620903819477547
RMSE for result file 14: 2.1388232636828235